2019-05-31 06:44:46

by Minchan Kim

Subject: [RFCv2 0/6] introduce memory hinting API for external process

- Background

The Android terminology for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster, so
we try to make hot starts more likely than cold starts.

To increase the likelihood of hot starts, Android userspace manages the
order in which apps should be killed via ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked list
for lmkd (the low memory killer daemon). These apps are likely to be killed
by lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for opportunistic
performance improvements, but those improvements will vary based on the
memory requirements of individual workloads.

- Problem

Naturally, cached apps were the dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidates for swapping. Investigating, we found that swapping out only
begins once the low zone watermark is hit and kswapd wakes up, but the
overall allocation rate in the system can trip lmkd thresholds and cause a
cached process to be killed (we measured the performance of swapping out
vs. zapping the memory by killing a process; unsurprisingly, zapping is 10x
faster even though we use zram, which is much faster than real storage), so
a kill from lmkd will often satisfy the high zone watermark, resulting in
very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they enter the cached state. Additionally,
it gives the platform many more opportunities to use the information it
has to optimize memory efficiency.

To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD, which deactivates active pages, and the other is
MADV_PAGEOUT, which reclaims private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory. MADV_PAGEOUT is similar to MADV_DONTNEED in that
it hints to the kernel that the memory region is not currently needed and
should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in that
it hints to the kernel that the memory region is not currently needed and
should be reclaimed when memory pressure rises.
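
For illustration, here is a minimal sketch of the non-destructive semantics
from a process's own address space (the hint values come from this
patchset's mman-common.h and are defined locally because libc headers don't
carry them yet):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD	5	/* from this patchset */
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT	6	/* from this patchset */
#endif

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, len);		/* fault the pages in */

	/* cold: reclaim later, under memory pressure; data preserved */
	if (madvise(buf, len, MADV_COLD))
		perror("MADV_COLD");

	/* pageout: reclaim (swap out) immediately; data preserved */
	if (madvise(buf, len, MADV_PAGEOUT))
		perror("MADV_PAGEOUT");

	/*
	 * Unlike MADV_DONTNEED, a later read returns the old data
	 * (swapped back in), not zero-filled pages.
	 */
	printf("%d\n", buf[0]);
	return 0;
}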

This approach is similar in spirit to madvise(MADV_DONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To address that, this patchset introduces a new syscall:

struct pr_madvise_param {
	int size;		/* the size of this structure */
	int cookie;		/* reserved to support atomicity */
	int nr_elem;		/* count of the array fields below */
	int __user *hints;	/* hints for each range */
	/* to store the result of each operation */
	const struct iovec __user *results;
	/* input address ranges */
	const struct iovec __user *ranges;
};

int process_madvise(int pidfd, struct pr_madvise_param *u_param,
		unsigned long flags);

The syscall takes a pidfd to give hints to an external process, and it
provides a pair of results/ranges vector arguments so that it can apply
several hints to multiple address ranges all at once. It also has a cookie
variable to support atomicity of the API across address range operations.
IOW, if the target process changes its address space after the monitor
process has parsed the address ranges via map_files or maps, the API can
detect the race and cancel the entire address space operation. This is not
implemented yet; Daniel Colascione suggested an idea (please read the
description in patch [6/6]) and this patchset adds the cookie variable for
that future use.

- Experiment

We did a bunch of testing with several hundred real users, not an
artificial benchmark, on Android. We saw about a 17% reduction in cold
starts without any significant battery or app startup latency issues. And
with an artificial benchmark which launches and switches apps, we saw on
average a 7% app launching improvement, 18% fewer lmkd kills and good
numbers from vmstat.

A is vanilla and B is process_madvise.

A B delta ratio(%)
allocstall_dma 0 0 0 0.00
allocstall_movable 1464 457 -1007 -69.00
allocstall_normal 263210 190763 -72447 -28.00
allocstall_total 264674 191220 -73454 -28.00
compact_daemon_wake 26912 25294 -1618 -7.00
compact_fail 17885 14151 -3734 -21.00
compact_free_scanned 4204766409 3835994922 -368771487 -9.00
compact_isolated 3446484 2967618 -478866 -14.00
compact_migrate_scanned 1621336411 1324695710 -296640701 -19.00
compact_stall 19387 15343 -4044 -21.00
compact_success 1502 1192 -310 -21.00
kswapd_high_wmark_hit_quickly 234 184 -50 -22.00
kswapd_inodesteal 221635 233093 11458 5.00
kswapd_low_wmark_hit_quickly 66065 54009 -12056 -19.00
nr_dirtied 259934 296476 36542 14.00
nr_vmscan_immediate_reclaim 2587 2356 -231 -9.00
nr_vmscan_write 1274232 2661733 1387501 108.00
nr_written 1514060 2937560 1423500 94.00
pageoutrun 67561 55133 -12428 -19.00
pgactivate 2335060 1984882 -350178 -15.00
pgalloc_dma 13743011 14096463 353452 2.00
pgalloc_movable 0 0 0 0.00
pgalloc_normal 18742440 16802065 -1940375 -11.00
pgalloc_total 32485451 30898528 -1586923 -5.00
pgdeactivate 4262210 2930670 -1331540 -32.00
pgfault 30812334 31085065 272731 0.00
pgfree 33553970 31765164 -1788806 -6.00
pginodesteal 33411 15084 -18327 -55.00
pglazyfreed 0 0 0 0.00
pgmajfault 551312 1508299 956987 173.00
pgmigrate_fail 43927 29330 -14597 -34.00
pgmigrate_success 1399851 1203922 -195929 -14.00
pgpgin 24141776 19032156 -5109620 -22.00
pgpgout 959344 1103316 143972 15.00
pgpgoutclean 4639732 3765868 -873864 -19.00
pgrefill 4884560 3006938 -1877622 -39.00
pgrotated 37828 25897 -11931 -32.00
pgscan_direct 1456037 957567 -498470 -35.00
pgscan_direct_throttle 0 0 0 0.00
pgscan_kswapd 6667767 5047360 -1620407 -25.00
pgscan_total 8123804 6004927 -2118877 -27.00
pgskip_dma 0 0 0 0.00
pgskip_movable 0 0 0 0.00
pgskip_normal 14907 25382 10475 70.00
pgskip_total 14907 25382 10475 70.00
pgsteal_direct 1118986 690215 -428771 -39.00
pgsteal_kswapd 4750223 3657107 -1093116 -24.00
pgsteal_total 5869209 4347322 -1521887 -26.00
pswpin 417613 1392647 975034 233.00
pswpout 1274224 2661731 1387507 108.00
slabs_scanned 13686905 10807200 -2879705 -22.00
workingset_activate 668966 569444 -99522 -15.00
workingset_nodereclaim 38957 32621 -6336 -17.00
workingset_refault 2816795 2179782 -637013 -23.00
workingset_restore 294320 168601 -125719 -43.00

pgmajfault increased by 173% because swap-in increased by more than 200%
due to the process_madvise hints. However, a swap read backed by zram is
much cheaper than file IO from a performance point of view, and an app hot
start via swap-in is also cheaper than a cold start from the beginning of
the app, which needs many IOs from storage plus initialization steps.

Brian Geffon on the ChromeOS team ran an experiment with process_madvise(2).
Quote from him:
"What I found is that by using process_madvise after a tab has been back
grounded for more than 45 seconds reduced the average tab switch times by
25%! This is a huge result and very obvious validation that process_madvise
hints works well for the ChromeOS use case."

This patchset is based on next-20190530.

Minchan Kim (6):
[1/6] mm: introduce MADV_COLD
[2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
[3/6] mm: introduce MADV_PAGEOUT
[4/6] mm: factor out madvise's core functionality
[5/6] mm: introduce external memory hinting API
[6/6] mm: extend process_madvise syscall to support vector array

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 +
include/linux/pid.h | 4 +
include/linux/swap.h | 2 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/mman-common.h | 13 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 8 +
kernel/signal.c | 7 +-
kernel/sys_ni.c | 1 +
mm/madvise.c | 586 +++++++++++++++++++++----
mm/swap.c | 43 ++
mm/vmscan.c | 83 +++-
15 files changed, 691 insertions(+), 81 deletions(-)

* from RFCv1
* Dropped the MADV_[ANONYMOUS|FILE]_FILTER option. The option gave a several
hundred millisecond improvement by removing address range parsing
overhead. However, there is another suggestion to create a general API
to provide process information (process_getinfo(2)), which would be
very fast via a binary format and return only the necessary information.

* There was lots of discussion on how to provide *consistency*/*atomicity*
against the target process's address space changing. It needs more
discussion. Let's do that in [6/6] if you still have an argument.

* There was a concern about the vector support because it didn't show
a great performance benefit. However, I still included it because
it could make atomicity of address range operations easier
in the future without introducing another new syscall.

* Naming of process_madvise - there was a request to change the name from
process_madvise to pidfd_madvise for consistency with the existing
pidfd syscalls, but some people including me still want the syscall
name to focus on what it does rather than how it does it.

* I limited the hints the new syscall supports to MADV_COLD|PAGEOUT at
this moment because I'm not sure all hints make sense for an external
process, and there may be lurking bugs which rely on the caller's
context. Please see the description in [5/6].
--
2.22.0.rc1.257.g3120a18244-goog


2019-05-31 06:44:55

by Minchan Kim

Subject: [RFCv2 1/6] mm: introduce MADV_COLD

When a process expects no accesses to a certain memory range, it can give
a hint to the kernel that the pages can be reclaimed when memory pressure
happens but that the data should be preserved for future use. This can
reduce workingset eviction and so improve performance.

This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help the kernel in deciding
which pages to evict early during memory pressure.

Internally, it works by deactivating pages: a private page is moved from
the active list to the head of the inactive list. The inactive list tends
to be full of used-once pages, which are the first candidates for reclaim
(that is also the reason MADV_FREE moves pages to the head of the inactive
LRU list). Therefore, when memory pressure happens, these pages will be
reclaimed earlier than other active pages unless they are accessed again
before then.
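
A quick way to observe the effect (a sketch assuming this patch is applied;
pages still sitting in per-cpu pagevecs or already on the inactive list
will not bump the counter):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 5	/* from this patch; not in libc headers yet */
#endif

int main(void)
{
	size_t len = 64 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);	/* new anon pages start on the active LRU */

	system("grep pgdeactivate /proc/vmstat");
	if (madvise(p, len, MADV_COLD))	/* move them to the inactive list */
		return 1;
	system("grep pgdeactivate /proc/vmstat");	/* should grow by ~64 */
	return 0;
}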

* RFCv1
* renaming from MADV_COOL to MADV_COLD - hannes

* internal review
* use clear_page_young in deactivate_page - joelaf
* Revise the description - surenb
* Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 ++++
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 111 +++++++++++++++++++++++++
mm/swap.c | 43 ++++++++++
6 files changed, 172 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
SetPageYoung(page);
}

+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}

+static inline void clear_page_young(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_COLD 5 /* deactivate these pages */

/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..bff150eab6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_COLD:
case MADV_FREE:
return 0;
default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}

+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ pmdp_test_and_clear_young(vma, addr, pmd);
+ deactivate_page(page);
+huge_unlock:
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ ptep_test_and_clear_young(vma, addr, pte);
+ deactivate_page(page);
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk cool_walk = {
+ .pmd_entry = madvise_cold_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &cool_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)

@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_COLD:
+ return madvise_cold(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_FREE:
+ case MADV_COLD:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}

+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ clear_page_young(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}

static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);

+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
}
}

+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) || PageCompound(page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
/**
* mark_page_lazyfree - make an anon page lazyfree
* @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 06:45:02

by Minchan Kim

Subject: [RFCv2 2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM

The local variable 'references' in shrink_page_list defaults to
PAGEREF_RECLAIM_CLEAN. It exists to prevent reclaiming dirty pages when CMA
tries to migrate pages. Strictly speaking, we don't need it because CMA
already disallows writeout via .may_writepage = 0 in
reclaim_clean_pages_from_list.

Moreover, it has the problem of preventing anonymous pages from being
swapped out even when force_reclaim = true in shrink_page_list in the
upcoming patch. So this patch makes the default value of 'references'
PAGEREF_RECLAIM and renames force_reclaim to ignore_references to make it
clearer.

This is preparatory work for the next patch.

* RFCv1
* use ignore_references as the parameter name - hannes

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum ttu_flags ttu_flags,
struct reclaim_stat *stat,
- bool force_reclaim)
+ bool ignore_references)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- enum page_references references = PAGEREF_RECLAIM_CLEAN;
+ enum page_references references = PAGEREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;

@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!force_reclaim)
+ if (!ignore_references)
references = page_check_references(page, sc);

switch (references) {
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 06:45:16

by Minchan Kim

Subject: [RFCv2 3/6] mm: introduce MADV_PAGEOUT

When a process expects no accesses to a certain memory range
for a long time, it can hint to the kernel that the pages can be
reclaimed instantly but that the data should be preserved for future use.
This can reduce workingset eviction and so improve performance.

This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory
range as not expected to be used for a long time so that the kernel
reclaims the memory instantly. The hint can help the kernel in deciding
which pages to evict proactively.
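
A quick way to observe the effect (a sketch assuming this patch is applied
and swap - e.g., zram - is configured; without swap, dirty anonymous pages
cannot be paged out):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 6	/* from this patch; not in libc headers yet */
#endif

int main(void)
{
	size_t len = 64 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);	/* dirty anon pages: reclaim means swap-out */

	system("grep pswpout /proc/vmstat");
	if (madvise(p, len, MADV_PAGEOUT))	/* reclaim now, not under pressure */
		return 1;
	system("grep pswpout /proc/vmstat");	/* should grow by ~64 */
	/* the data is preserved: this read swaps the first page back in */
	return p[0] == 1 ? 0 : 1;
}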

* RFCv1
* rename from MADV_COLD to MADV_PAGEOUT - hannes
* bail out if process is being killed - Hillf
* fix reclaim_pages bugs - Hillf

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 126 +++++++++++++++++++++++++
mm/vmscan.c | 77 +++++++++++++++
4 files changed, 205 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;

+extern unsigned long reclaim_pages(struct list_head *page_list);
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
#define MADV_COLD 5 /* deactivate these pages */
+#define MADV_PAGEOUT 6 /* reclaim these pages */

/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index bff150eab6da..9d749a1420b4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_COLD:
+ case MADV_PAGEOUT:
case MADV_FREE:
return 0;
default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
}

+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ LIST_HEAD(page_list);
+ struct page *page;
+ int isolated = 0;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ if (isolate_lru_page(page))
+ goto huge_unlock;
+
+ list_add(&page->lru, &page_list);
+huge_unlock:
+ spin_unlock(ptl);
+ reclaim_pages(&page_list);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ if (isolate_lru_page(page))
+ continue;
+
+ isolated++;
+ list_add(&page->lru, &page_list);
+ if (isolated >= SWAP_CLUSTER_MAX) {
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ isolated = 0;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ orig_pte = pte;
+ }
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk warm_walk = {
+ .pmd_entry = madvise_pageout_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &warm_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)

@@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
+ case MADV_PAGEOUT:
+ return madvise_pageout(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
case MADV_DONTNEED:
case MADV_FREE:
case MADV_COLD:
+ case MADV_PAGEOUT:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..280dd808fb91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_deactivate, nr_rotated, sc->priority, file);
}

+unsigned long reclaim_pages(struct list_head *page_list)
+{
+ int nid = -1;
+ unsigned long nr_isolated[2] = {0, };
+ unsigned long nr_reclaimed = 0;
+ LIST_HEAD(node_page_list);
+ struct reclaim_stat dummy_stat;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ };
+
+ while (!list_empty(page_list)) {
+ struct page *page;
+
+ page = lru_to_page(page_list);
+ if (nid == -1) {
+ nid = page_to_nid(page);
+ INIT_LIST_HEAD(&node_page_list);
+ nr_isolated[0] = nr_isolated[1] = 0;
+ }
+
+ if (nid == page_to_nid(page)) {
+ list_move(&page->lru, &node_page_list);
+ nr_isolated[!!page_is_file_cache(page)] +=
+ hpage_nr_pages(page);
+ continue;
+ }
+
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+ nid = -1;
+ }
+
+ if (!list_empty(&node_page_list)) {
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+
+ }
+
+ return nr_reclaimed;
+}
+
/*
* The inactive anon list should be small enough that the VM never has
* to do too much work.
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 06:46:31

by Minchan Kim

Subject: [RFCv2 4/6] mm: factor out madvise's core functionality

This patch factors out madvise's core functionality so that an upcoming
patch can reuse it without duplication. It doesn't change any behavior.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
1 file changed, 101 insertions(+), 87 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9d749a1420b4..466623ea8c36 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
int isolated = 0;
struct vm_area_struct *vma = walk->vma;
+ struct task_struct *task = walk->private;
unsigned long next;

- if (fatal_signal_pending(current))
+ if (fatal_signal_pending(task))
return -EINTR;

next = pmd_addr_end(addr, end);
@@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
}

static void madvise_pageout_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct task_struct *task,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
struct mm_walk warm_walk = {
.pmd_entry = madvise_pageout_pte_range,
.mm = vma->vm_mm,
+ .private = task,
};

tlb_start_vma(tlb, vma);
@@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
}


-static long madvise_pageout(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct task_struct *task,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
@@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,

lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);

return 0;
@@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
return 0;
}

-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
int behavior)
@@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_sem has been dropped, prev is stale */

- down_read(&current->mm->mmap_sem);
- vma = find_vma(current->mm, start);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
@@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_sem was not released by userfaultfd_remove() */
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
return error;
}

@@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
#endif

static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
+ return madvise_remove(mm, vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(task, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(mm, vma, prev, start,
+ end, behavior);
default:
return madvise_behavior(vma, prev, start, end, behavior);
}
@@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
}
}

-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area. The idea is to help the kernel
- * use appropriate read-ahead and caching techniques. The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- * MADV_NORMAL - the default behavior is to read clusters. This
- * results in some read-ahead and read-behind.
- * MADV_RANDOM - the system should read the minimum amount of data
- * on any access, since it is unlikely that the appli-
- * cation will need more than what it asks for.
- * MADV_SEQUENTIAL - pages in the given range will probably be accessed
- * once, so they can be aggressively read ahead, and
- * can be freed soon after they are accessed.
- * MADV_WILLNEED - the application is notifying the system to read
- * some pages ahead.
- * MADV_DONTNEED - the application is finished with the given range,
- * so the kernel can free resources associated with it.
- * MADV_FREE - the application marks pages in the given range as lazy free,
- * where actual purges are postponed until memory pressure happens.
- * MADV_REMOVE - the application wants to free up the given range of
- * pages and associated backing store.
- * MADV_DONTFORK - omit this area from child's address space when forking:
- * typically, to avoid COWing pages pinned by get_user_pages().
- * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- * MADV_WIPEONFORK - present the child process with zero-filled memory in this
- * range after a fork.
- * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- * MADV_HWPOISON - trigger memory error handler as if the given memory range
- * were corrupted by unrecoverable hardware memory failure.
- * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- * this area with pages of identical content from other such areas.
- * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- * MADV_HUGEPAGE - the application wants to back the given range by transparent
- * huge pages in the future. Existing pages might be coalesced and
- * new pages might be allocated as THP.
- * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- * transparent huge pages so the existing pages will not be
- * coalesced into THP and new pages will not be allocated as THP.
- * MADV_DONTDUMP - the application wants to prevent pages in the given range
- * from being included in its core dump.
- * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- * zero - success
- * -EINVAL - start + len < 0, start is not page-aligned,
- * "behavior" is not a valid value, or application
- * is attempting to release locked or shared pages,
- * or the specified address range includes file, Huge TLB,
- * MAP_SHARED or VMPFNMAP range.
- * -ENOMEM - addresses in the specified range are not currently
- * mapped, or are outside the AS of the process.
- * -EIO - an I/O error occurred while paging in data.
- * -EBADF - map exists, but area maps something that isn't a file.
- * -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_core(struct task_struct *task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)

#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
- return madvise_inject_error(behavior, start, start + len_in);
+ return madvise_inject_error(behavior,
+ start, start + len_in);
#endif

write = madvise_need_mmap_write(behavior);
if (write) {
- if (down_write_killable(&current->mm->mmap_sem))
+ if (down_write_killable(&mm->mmap_sem))
return -EINTR;
} else {
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
}

/*
@@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;

@@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
tmp = end;

/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
if (error)
goto out;
start = tmp;
@@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- up_write(&current->mm->mmap_sem);
+ up_write(&mm->mmap_sem);
else
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);

return error;
}
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques. The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ * MADV_NORMAL - the default behavior is to read clusters. This
+ * results in some read-ahead and read-behind.
+ * MADV_RANDOM - the system should read the minimum amount of data
+ * on any access, since it is unlikely that the appli-
+ * cation will need more than what it asks for.
+ * MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ * once, so they can be aggressively read ahead, and
+ * can be freed soon after they are accessed.
+ * MADV_WILLNEED - the application is notifying the system to read
+ * some pages ahead.
+ * MADV_DONTNEED - the application is finished with the given range,
+ * so the kernel can free resources associated with it.
+ * MADV_FREE - the application marks pages in the given range as lazy free,
+ * where actual purges are postponed until memory pressure happens.
+ * MADV_REMOVE - the application wants to free up the given range of
+ * pages and associated backing store.
+ * MADV_DONTFORK - omit this area from child's address space when forking:
+ * typically, to avoid COWing pages pinned by get_user_pages().
+ * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ * MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ * range after a fork.
+ * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ * MADV_HWPOISON - trigger memory error handler as if the given memory range
+ * were corrupted by unrecoverable hardware memory failure.
+ * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ * this area with pages of identical content from other such areas.
+ * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ * MADV_HUGEPAGE - the application wants to back the given range by transparent
+ * huge pages in the future. Existing pages might be coalesced and
+ * new pages might be allocated as THP.
+ * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ * transparent huge pages so the existing pages will not be
+ * coalesced into THP and new pages will not be allocated as THP.
+ * MADV_DONTDUMP - the application wants to prevent pages in the given range
+ * from being included in its core dump.
+ * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ * zero - success
+ * -EINVAL - start + len < 0, start is not page-aligned,
+ * "behavior" is not a valid value, or application
+ * is attempting to release locked or shared pages,
+ * or the specified address range includes file, Huge TLB,
+ * MAP_SHARED or VMPFNMAP range.
+ * -ENOMEM - addresses in the specified range are not currently
+ * mapped, or are outside the AS of the process.
+ * -EIO - an I/O error occurred while paging in data.
+ * -EBADF - map exists, but area maps something that isn't a file.
+ * -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+ return madvise_core(current, current->mm, start, len_in, behavior);
+}
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 06:47:09

by Minchan Kim

Subject: [RFCv2 5/6] mm: introduce external memory hinting API

There are use cases where a centralized userspace daemon wants to give
a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
ActivityManagerService is one of them.

It's similar in spirit to madvise(MADV_DONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon (ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces a new syscall, process_madvise(2).
It can give a hint to an external process identified by a pidfd.

int process_madvise(int pidfd, void *addr, size_t length, int advise,
		unsigned long cookie, unsigned long flag);

Since it can affect another process's address range, only a privileged
process (CAP_SYS_PTRACE) or something else (e.g., being the same UID) that
has the right to ptrace the process can use it successfully.

The syscall has a cookie argument to provide atomicity (i.e., to detect
a change to the target process's address space after the monitor process
has parsed its address ranges, so the operation can fail if a race
happens). Although there is no interface to get a cookie at this moment,
it is worth having as an argument to avoid introducing another new syscall
in the future. It can support *atomicity* for disruptive hints
(e.g., MADV_DONTNEED|FREE).
The flag argument is reserved for future use if we need to extend the API.
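
For illustration, a minimal monitor-side sketch of this form of the API.
The syscall numbers are the x86_64 ones wired up by this series, and the
helper name here is mine:

#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open		434	/* x86_64, from this series */
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise	435	/* x86_64, from this patch */
#endif
#ifndef MADV_COLD
#define MADV_COLD 5			/* from patch 1/6 */
#endif

/*
 * Monitor-side sketch: hint that one range of another process is cold.
 * addr/len would come from parsing the target's /proc/<pid>/maps; the
 * caller must be privileged enough to ptrace the target.
 */
static int hint_range_cold(pid_t pid, unsigned long addr, size_t len)
{
	long ret;
	int pidfd = syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;

	/* cookie == 0 and flags == 0: atomicity is not supported yet */
	ret = syscall(__NR_process_madvise, pidfd, addr, len,
			MADV_COLD, 0UL, 0UL);
	if (ret < 0)
		perror("process_madvise");
	close(pidfd);
	return ret;
}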

I think supporting in process_madvise every hint madvise has supported
(or will support) is rather risky. I'm not sure all hints make sense for
an external process, and the implementation of a hint may rely on the
caller being the current context, like MADV_INJECT_ERROR, so it could be
error-prone if the caller is an external process. Another example is
userfaultfd, because userfaultfd_remove needs to release mmap_sem during
the operation, which would be an obstacle to implementing atomicity later.
So I just limited the hints to MADV_[COLD|PAGEOUT] at this moment. If
someone wants to expose another hint, we need to hear their
workload/scenario and can review carefully based on the design and
implementation of each hint. It's safer for maintenance -
once we expose a buggy syscall and later find it hard to fix, we can
never roll it back.

TODO: once we agree on the direction, I need to define the syscall for
every architecture.

* RFCv1
* not export pidfd_to_pid. Use pidfd_pid - Christian
* use mm struct instead of task_struct for madvise_core - Oleg
* add cookie variable as syscall argument to guarantee atomicity - dancol

* internal review
* use ptrace capability - surenb, dancol

I didn't solve the issue Oleg pointed out (the task we get from
pid_task could be a zombie leader, in which case the syscall will fail
in mm_access even though the process is alive with other threads),
because it's not a problem only for this new syscall but a general
problem for other MM syscalls like process_vm_readv and move_pages,
so it needs more discussion.

Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/pid.h | 4 ++
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 8 +++
kernel/signal.c | 7 ++-
kernel/sys_ni.c | 1 +
mm/madvise.c | 87 ++++++++++++++++++++++++++
9 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 43e4429a5272..5f44a29b7882 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -439,3 +439,4 @@
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
+435 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bee0a77fdd3..35e91f3e9646 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -356,6 +356,7 @@
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
+435 common process_madvise __x64_sys_process_madvise

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index a261712ac3fe..a49ef789c034 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,10 @@ extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
extern int pidfd_create(struct pid *pid);

+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1a4dc53f40d9..6ba081c955f6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,6 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+ size_t len, int behavior,
+ unsigned long cookie, unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e5684a4512c0..082d1f3fe3a2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -846,9 +846,11 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
__SYSCALL(__NR_fspick, sys_fspick)
#define __NR_pidfd_open 434
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
+#define __NR_process_madvise 435
+__SYSCALL(__NR_process_madvise, sys_process_madvise)

#undef __NR_syscalls
-#define __NR_syscalls 435
+#define __NR_syscalls 436

/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f238cdd886e..b76aade51631 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1738,6 +1738,14 @@ const struct file_operations pidfd_fops = {
#endif
};

+struct pid *pidfd_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return ERR_PTR(-EBADF);
+}
+
/**
* pidfd_create() - Create a new pid file descriptor.
*
diff --git a/kernel/signal.c b/kernel/signal.c
index b477e21ecafc..b376870d7565 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3702,8 +3702,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)

static struct pid *pidfd_to_pid(const struct file *file)
{
- if (file->f_op == &pidfd_fops)
- return file->private_data;
+ struct pid *pid;
+
+ pid = pidfd_pid(file);
+ if (!IS_ERR(pid))
+ return pid;

return tgid_pidfd_to_pid(file);
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5277421795ab 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 466623ea8c36..fd205e928a1b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -15,6 +15,7 @@
#include <linux/hugetlb.h>
#include <linux/falloc.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -983,6 +984,31 @@ madvise_behavior_valid(int behavior)
}
}

+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+/*
+ * madvise_core - request behavior hint to address range of the target process
+ *
+ * @task: the task_struct receiving the behavior hint, not the one giving it
+ * @mm: the mm_struct receiving the behavior hint, not the one giving it
+ * @start: base address of the hinted range
+ * @len_in: length of the hinted range
+ * @behavior: requested hint
+ *
+ * @task could be a zombie leader if it has called sys_exit, so accessing
+ * the mm_struct via task->mm is prohibited. Please use @mm instead of
+ * task->mm.
+ */
static int madvise_core(struct task_struct *task, struct mm_struct *mm,
unsigned long start, size_t len_in, int behavior)
{
@@ -1146,3 +1172,64 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return madvise_core(current, current->mm, start, len_in, behavior);
}
+
+SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
+ size_t, len_in, int, behavior, unsigned long, cookie,
+ unsigned long, flags)
+{
+ int ret;
+ struct fd f;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ /*
+ * We don't support a cookie to guarantee address space change
+ * atomicity yet.
+ */
+ if (cookie != 0)
+ return -EINVAL;
+
+ if (!process_madvise_behavior_valid(behavior))
+ return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid)) {
+ ret = PTR_ERR(pid);
+ goto err;
+ }
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ rcu_read_unlock();
+ ret = -ESRCH;
+ goto err;
+ }
+
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (!mm || IS_ERR(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ if (ret == -EACCES)
+ ret = -EPERM;
+ goto release_task;
+ }
+
+ ret = madvise_core(task, mm, start, len_in, behavior);
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+err:
+ fdput(f);
+ return ret;
+}
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 06:47:10

by Minchan Kim

Subject: [RFCv2 6/6] mm: extend process_madvise syscall to support vector array

Currently, the process_madvise syscall works on only one address range, so
the user must call the syscall several times to give hints to multiple
address ranges. That is inefficient from a performance perspective, and it
also makes it hard to support atomicity of address range operations.

This patch extends the process_madvise syscall to support multiple hints,
address ranges and return values so the user can give all the hints at once.

struct pr_madvise_param {
	int size;		/* the size of this structure */
	int cookie;		/* reserved to support atomicity */
	int nr_elem;		/* count of the array fields below */
	int __user *hints;	/* hints for each range */
	/* to store the result of each operation */
	const struct iovec __user *results;
	/* input address ranges */
	const struct iovec __user *ranges;
};

int process_madvise(int pidfd, struct pr_madvise_param *u_param,
		unsigned long flags);

Regarding the cookie, Daniel Colascione suggested an idea[1] to support
atomicity as well as to improve the parsing speed of the target process's
address ranges. A process_getinfo(2) syscall could maintain a vma
configuration sequence number (e.g., increased whenever the target process
takes the mmap_sem exclusively) and return that number along with the
address ranges in binary form. By calling this vector syscall with the
sequence number and the address ranges we got from process_getinfo, we can
detect a race with the target process's address space layout and fail the
syscall if the user wants atomicity. It also speeds up address range
parsing because we don't need to parse human-friendly strings from /proc fs.

[1] https://lore.kernel.org/lkml/[email protected]/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
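
To make the intended flow concrete, here is a hypothetical monitor-side
sketch. Note that process_getinfo(2) does not exist yet and the EAGAIN
convention is assumed, so this is illustration only, not a working program:

	struct pr_madvise_param param = { .size = sizeof(param) };
	long ret;

	for (;;) {
		/*
		 * Hypothetical: returns the target's address ranges in
		 * binary form plus the current vma sequence number.
		 */
		param.cookie = process_getinfo(pidfd, &ranges, &nr_elem);

		/* fill param.hints/results/ranges from the snapshot ... */
		param.nr_elem = nr_elem;

		ret = process_madvise(pidfd, &param, 0);
		if (ret != -1 || errno != EAGAIN)
			break;	/* success, or a real failure */
		/*
		 * EAGAIN (assumed): the target's mmap/munmap raced with
		 * us; take a fresh snapshot and retry.
		 */
	}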

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * NR_ADDR_RANGE matches the two ranges advised below; ALLOC_SIZE is an
 * arbitrary illustrative size. MADV_COLD, MADV_PAGEOUT and the syscall
 * numbers for pidfd_open/process_madvise come from the patched kernel's
 * headers; they do not exist in mainline at the time of writing.
 */
#define NR_ADDR_RANGE	2
#define ALLOC_SIZE	(2 * 1024 * 1024)

struct pr_madvise_param {
	int size;			/* the size of this structure */
	int cookie;			/* reserved to support atomicity */
	int nr_elem;			/* count of the array fields below */
	int *hints;			/* hints for each range */
	/* to store result of each operation */
	const struct iovec *results;
	/* input address ranges */
	const struct iovec *ranges;
};

int main(int argc, char *argv[])
{
	struct pr_madvise_param param;
	int hints[NR_ADDR_RANGE];
	long ret[NR_ADDR_RANGE];	/* the kernel writes a long per range */
	struct iovec ret_vec[NR_ADDR_RANGE];
	struct iovec range_vec[NR_ADDR_RANGE];
	void *addr[NR_ADDR_RANGE];
	pid_t pid;

	addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (addr[0] == MAP_FAILED) {
		printf("Fail to alloc\n");
		return 1;
	}

	addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (addr[1] == MAP_FAILED) {
		printf("Fail to alloc\n");
		return 1;
	}

	ret_vec[0].iov_base = &ret[0];
	ret_vec[0].iov_len = sizeof(long);
	ret_vec[1].iov_base = &ret[1];
	ret_vec[1].iov_len = sizeof(long);
	range_vec[0].iov_base = addr[0];
	range_vec[0].iov_len = ALLOC_SIZE;
	range_vec[1].iov_base = addr[1];
	range_vec[1].iov_len = ALLOC_SIZE;

	hints[0] = MADV_COLD;
	hints[1] = MADV_PAGEOUT;

	param.size = sizeof(struct pr_madvise_param);
	param.cookie = 0;		/* atomicity not supported yet */
	param.nr_elem = NR_ADDR_RANGE;
	param.hints = hints;
	param.results = ret_vec;
	param.ranges = range_vec;

	pid = fork();
	if (!pid) {
		/* child: keep the inherited mappings alive while advised */
		sleep(10);
	} else {
		int pidfd = syscall(__NR_pidfd_open, pid, 0);
		if (pidfd < 0) {
			printf("Fail to open process file descriptor\n");
			return 1;
		}

		/* drop the parent's copies; the child still maps the ranges */
		munmap(addr[0], ALLOC_SIZE);
		munmap(addr[1], ALLOC_SIZE);

		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
		if (syscall(__NR_process_madvise, pidfd, &param, 0) < 0)
			perror("process_madvise");
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
	}

	return 0;
}
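
With swap enabled (zram or disk), the second vmstat read should show the
pswpout and pgdeactivate counters increase if the hints took effect; the
exact counter names and deltas vary with kernel version and configuration.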

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/syscalls.h | 6 +-
include/uapi/asm-generic/mman-common.h | 11 +++
mm/madvise.c | 126 ++++++++++++++++++++++---
3 files changed, 126 insertions(+), 17 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6ba081c955f6..05627718a547 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,9 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
-asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
- size_t len, int behavior,
- unsigned long cookie, unsigned long flags);
+asmlinkage long sys_process_madvise(int pidfd,
+ struct pr_madvise_param __user *u_params,
+ unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 92e347a89ddc..220c2b5eb961 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -75,4 +75,15 @@
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)

+struct pr_madvise_param {
+ int size; /* the size of this structure */
+ int cookie; /* reserved to support atomicity */
+ int nr_elem; /* count of the array fields below */
+ int __user *hints; /* hints for each range */
+ /* to store result of each operation */
+ const struct iovec __user *results;
+ /* input address ranges */
+ const struct iovec __user *ranges;
+};
+
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index fd205e928a1b..94d782097afd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1107,6 +1107,56 @@ static int madvise_core(struct task_struct *task, struct mm_struct *mm,
return error;
}

+static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
+ struct pr_madvise_param *param)
+{
+ u32 size;
+ int ret;
+
+ memset(param, 0, sizeof(*param));
+
+ ret = get_user(size, &u_param->size);
+ if (ret)
+ return ret;
+
+ if (size > PAGE_SIZE)
+ return -E2BIG;
+
+ if (!size || size > sizeof(struct pr_madvise_param))
+ return -EINVAL;
+
+ ret = copy_from_user(param, u_param, size);
+ if (ret)
+ return -EFAULT;
+
+ return ret;
+}
+
+static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
+ int *behaviors,
+ struct iov_iter *iter,
+ const struct iovec *range_vec,
+ unsigned long riovcnt)
+{
+ int i;
+ long err;
+
+ for (i = 0; i < riovcnt && iov_iter_count(iter); i++) {
+ err = -EINVAL;
+ if (process_madvise_behavior_valid(behaviors[i]))
+ err = madvise_core(tsk, mm,
+ (unsigned long)range_vec[i].iov_base,
+ range_vec[i].iov_len, behaviors[i]);
+
+ if (copy_to_iter(&err, sizeof(long), iter) !=
+ sizeof(long)) {
+ return -EFAULT;
+ }
+ }
+
+ return 0;
+}
+
/*
* The madvise(2) system call.
*
@@ -1173,37 +1223,78 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return madvise_core(current, current->mm, start, len_in, behavior);
}

-SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
- size_t, len_in, int, behavior, unsigned long, cookie,
- unsigned long, flags)
+
+SYSCALL_DEFINE3(process_madvise, int, pidfd,
+ struct pr_madvise_param __user *, u_params,
+ unsigned long, flags)
{
int ret;
struct fd f;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
+ struct pr_madvise_param params;
+ const struct iovec __user *result_vec, __user *range_vec;
+ int *behaviors;
+ struct iovec iovstack_result[UIO_FASTIOV];
+ struct iovec iovstack_r[UIO_FASTIOV];
+ struct iovec *iov_l = iovstack_result;
+ struct iovec *iov_r = iovstack_r;
+ struct iov_iter iter;
+ int nr_elem;

if (flags != 0)
return -EINVAL;

+ ret = pr_madvise_copy_param(u_params, &params);
+ if (ret)
+ return ret;
+
/*
- * We don't support cookie to gaurantee address space change
- * atomicity yet.
+ * We don't support cookie to guarantee address space atomicity yet.
+ * Once we implement cookie, process_madvise_core needs to hold mmap_sem
+ * during the entire operation to guarantee atomicity.
*/
- if (cookie != 0)
+ if (params.cookie != 0)
return -EINVAL;

- if (!process_madvise_behavior_valid(behavior))
- return -EINVAL;
+ range_vec = params.ranges;
+ result_vec = params.results;
+ nr_elem = params.nr_elem;
+
+ behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
+ if (!behaviors)
+ return -ENOMEM;
+
+ ret = copy_from_user(behaviors, params.hints, sizeof(int) * nr_elem);
+ if (ret < 0)
+ goto free_behavior_vec;
+
+ ret = import_iovec(READ, result_vec, params.nr_elem, UIO_FASTIOV,
+ &iov_l, &iter);
+ if (ret < 0)
+ goto free_behavior_vec;
+
+ if (!iov_iter_count(&iter)) {
+ ret = -EINVAL;
+ goto free_iovecs;
+ }
+
+ ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
+ UIO_FASTIOV, iovstack_r, &iov_r);
+ if (ret <= 0)
+ goto free_iovecs;

f = fdget(pidfd);
- if (!f.file)
- return -EBADF;
+ if (!f.file) {
+ ret = -EBADF;
+ goto free_iovecs;
+ }

pid = pidfd_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
- goto err;
+ goto put_fd;
}

rcu_read_lock();
@@ -1211,7 +1302,7 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
if (!task) {
rcu_read_unlock();
ret = -ESRCH;
- goto err;
+ goto put_fd;
}

get_task_struct(task);
@@ -1225,11 +1316,18 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
goto release_task;
}

- ret = madvise_core(task, mm, start, len_in, behavior);
+ ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem);
mmput(mm);
release_task:
put_task_struct(task);
-err:
+put_fd:
fdput(f);
+free_iovecs:
+ if (iov_r != iovstack_r)
+ kfree(iov_r);
+ kfree(iov_l);
+free_behavior_vec:
+ kfree(behaviors);
+
return ret;
}
--
2.22.0.rc1.257.g3120a18244-goog

2019-05-31 07:05:53

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> This patch factors out madvise's core functionality so that an upcoming
> patch can reuse it without duplication. It shouldn't change any behavior.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> 1 file changed, 101 insertions(+), 87 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9d749a1420b4..466623ea8c36 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> struct page *page;
> int isolated = 0;
> struct vm_area_struct *vma = walk->vma;
> + struct task_struct *task = walk->private;
> unsigned long next;
>
> - if (fatal_signal_pending(current))
> + if (fatal_signal_pending(task))
> return -EINTR;
>
> next = pmd_addr_end(addr, end);
> @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> }
>
> static void madvise_pageout_page_range(struct mmu_gather *tlb,
> - struct vm_area_struct *vma,
> - unsigned long addr, unsigned long end)
> + struct task_struct *task,
> + struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> {
> struct mm_walk warm_walk = {
> .pmd_entry = madvise_pageout_pte_range,
> .mm = vma->vm_mm,
> + .private = task,
> };
>
> tlb_start_vma(tlb, vma);
> @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> }
>
>
> -static long madvise_pageout(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start_addr, unsigned long end_addr)
> +static long madvise_pageout(struct task_struct *task,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> + unsigned long start_addr, unsigned long end_addr)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct mmu_gather tlb;
> @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> tlb_finish_mmu(&tlb, start_addr, end_addr);
>
> return 0;
> @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> return 0;
> }
>
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> +static long madvise_dontneed_free(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end,
> int behavior)
> @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> if (!userfaultfd_remove(vma, start, end)) {
> *prev = NULL; /* mmap_sem has been dropped, prev is stale */
>
> - down_read(&current->mm->mmap_sem);
> - vma = find_vma(current->mm, start);
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> if (!vma)
> return -ENOMEM;
> if (start < vma->vm_start) {
> @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> * Application wants to free up the pages and associated backing store.
> * This is effectively punching a hole into the middle of a file.
> */
> -static long madvise_remove(struct vm_area_struct *vma,
> +static long madvise_remove(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end)
> {
> @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> get_file(f);
> if (userfaultfd_remove(vma, start, end)) {
> /* mmap_sem was not released by userfaultfd_remove() */
> - up_read(&current->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
> }
> error = vfs_fallocate(f,
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> offset, end - start);
> fput(f);
> - down_read(&current->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> return error;
> }
>
> @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> #endif
>
> static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> unsigned long start, unsigned long end, int behavior)
> {
> switch (behavior) {
> case MADV_REMOVE:
> - return madvise_remove(vma, prev, start, end);
> + return madvise_remove(mm, vma, prev, start, end);
> case MADV_WILLNEED:
> return madvise_willneed(vma, prev, start, end);
> case MADV_COLD:
> return madvise_cold(vma, prev, start, end);
> case MADV_PAGEOUT:
> - return madvise_pageout(vma, prev, start, end);
> + return madvise_pageout(task, vma, prev, start, end);
> case MADV_FREE:
> case MADV_DONTNEED:
> - return madvise_dontneed_free(vma, prev, start, end, behavior);
> + return madvise_dontneed_free(mm, vma, prev, start,
> + end, behavior);
> default:
> return madvise_behavior(vma, prev, start, end, behavior);
> }
> @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> }
> }
>
> -/*
> - * The madvise(2) system call.
> - *
> - * Applications can use madvise() to advise the kernel how it should
> - * handle paging I/O in this VM area. The idea is to help the kernel
> - * use appropriate read-ahead and caching techniques. The information
> - * provided is advisory only, and can be safely disregarded by the
> - * kernel without affecting the correct operation of the application.
> - *
> - * behavior values:
> - * MADV_NORMAL - the default behavior is to read clusters. This
> - * results in some read-ahead and read-behind.
> - * MADV_RANDOM - the system should read the minimum amount of data
> - * on any access, since it is unlikely that the appli-
> - * cation will need more than what it asks for.
> - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> - * once, so they can be aggressively read ahead, and
> - * can be freed soon after they are accessed.
> - * MADV_WILLNEED - the application is notifying the system to read
> - * some pages ahead.
> - * MADV_DONTNEED - the application is finished with the given range,
> - * so the kernel can free resources associated with it.
> - * MADV_FREE - the application marks pages in the given range as lazy free,
> - * where actual purges are postponed until memory pressure happens.
> - * MADV_REMOVE - the application wants to free up the given range of
> - * pages and associated backing store.
> - * MADV_DONTFORK - omit this area from child's address space when forking:
> - * typically, to avoid COWing pages pinned by get_user_pages().
> - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> - * range after a fork.
> - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> - * were corrupted by unrecoverable hardware memory failure.
> - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> - * this area with pages of identical content from other such areas.
> - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> - * huge pages in the future. Existing pages might be coalesced and
> - * new pages might be allocated as THP.
> - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> - * transparent huge pages so the existing pages will not be
> - * coalesced into THP and new pages will not be allocated as THP.
> - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> - * from being included in its core dump.
> - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> - *
> - * return values:
> - * zero - success
> - * -EINVAL - start + len < 0, start is not page-aligned,
> - * "behavior" is not a valid value, or application
> - * is attempting to release locked or shared pages,
> - * or the specified address range includes file, Huge TLB,
> - * MAP_SHARED or VMPFNMAP range.
> - * -ENOMEM - addresses in the specified range are not currently
> - * mapped, or are outside the AS of the process.
> - * -EIO - an I/O error occurred while paging in data.
> - * -EBADF - map exists, but area maps something that isn't a file.
> - * -EAGAIN - a kernel resource was temporarily unavailable.
> - */
> -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> + unsigned long start, size_t len_in, int behavior)

Just a minor nitpick, but can we please have it named madvise_common,
not madvise_core? This would follow the usual naming scheme for factoring
out common functionality (like for mutexes, semaphores etc.), and within
the kernel "core" usually means something completely different.

> {
> unsigned long end, tmp;
> struct vm_area_struct *vma, *prev;
> @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> - return madvise_inject_error(behavior, start, start + len_in);
> + return madvise_inject_error(behavior,
> + start, start + len_in);

Not sure what this change is about except changing the line length.
Note, madvise_inject_error() still operates on "current" through
get_user_pages_fast() and gup_pgd_range(), but that was not changed
here. I know you've filtered out this hint later, so technically this
is not an issue, but maybe this needs some attention too since we've
already spotted it?

> #endif
>
> write = madvise_need_mmap_write(behavior);
> if (write) {
> - if (down_write_killable(&current->mm->mmap_sem))
> + if (down_write_killable(&mm->mmap_sem))
> return -EINTR;

Do you still need that trick with mmget_still_valid() here?
Something like:

if (current->mm != mm && !mmget_still_valid(mm))
goto skip_mm;

and that skip_mm label would be before

if (write)
up_write(&mm->mmap_sem);

below.

(see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
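
Spelled out against the refactored madvise_common() above, the suggested
guard would look roughly like this (an untested sketch):

	write = madvise_need_mmap_write(behavior);
	if (write) {
		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;
		/* bail out if a remote mm is already being torn down */
		if (current->mm != mm && !mmget_still_valid(mm))
			goto skip_mm;
	} else {
		down_read(&mm->mmap_sem);
	}

	/* ... the existing vma walk ... */

skip_mm:
	if (write)
		up_write(&mm->mmap_sem);
	else
		up_read(&mm->mmap_sem);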

> } else {
> - down_read(&current->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> }
>
> /*
> @@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> * ranges, just ignore them, but return -ENOMEM at the end.
> * - different from the way of handling in mlock etc.
> */
> - vma = find_vma_prev(current->mm, start, &prev);
> + vma = find_vma_prev(mm, start, &prev);
> if (vma && start > vma->vm_start)
> prev = vma;
>
> @@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> tmp = end;
>
> /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> - error = madvise_vma(vma, &prev, start, tmp, behavior);
> + error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
> if (error)
> goto out;
> start = tmp;
> @@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> if (prev)
> vma = prev->vm_next;
> else /* madvise_remove dropped mmap_sem */
> - vma = find_vma(current->mm, start);
> + vma = find_vma(mm, start);
> }
> out:
> blk_finish_plug(&plug);

skip_mm:

> if (write)
> - up_write(&current->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
> else
> - up_read(&current->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
>
> return error;
> }
> +
> +/*
> + * The madvise(2) system call.
> + *
> + * Applications can use madvise() to advise the kernel how it should
> + * handle paging I/O in this VM area. The idea is to help the kernel
> + * use appropriate read-ahead and caching techniques. The information
> + * provided is advisory only, and can be safely disregarded by the
> + * kernel without affecting the correct operation of the application.
> + *
> + * behavior values:
> + * MADV_NORMAL - the default behavior is to read clusters. This
> + * results in some read-ahead and read-behind.
> + * MADV_RANDOM - the system should read the minimum amount of data
> + * on any access, since it is unlikely that the appli-
> + * cation will need more than what it asks for.
> + * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> + * once, so they can be aggressively read ahead, and
> + * can be freed soon after they are accessed.
> + * MADV_WILLNEED - the application is notifying the system to read
> + * some pages ahead.
> + * MADV_DONTNEED - the application is finished with the given range,
> + * so the kernel can free resources associated with it.
> + * MADV_FREE - the application marks pages in the given range as lazy free,
> + * where actual purges are postponed until memory pressure happens.
> + * MADV_REMOVE - the application wants to free up the given range of
> + * pages and associated backing store.
> + * MADV_DONTFORK - omit this area from child's address space when forking:
> + * typically, to avoid COWing pages pinned by get_user_pages().
> + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> + * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> + * range after a fork.
> + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> + * MADV_HWPOISON - trigger memory error handler as if the given memory range
> + * were corrupted by unrecoverable hardware memory failure.
> + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> + * this area with pages of identical content from other such areas.
> + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + * MADV_HUGEPAGE - the application wants to back the given range by transparent
> + * huge pages in the future. Existing pages might be coalesced and
> + * new pages might be allocated as THP.
> + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> + * transparent huge pages so the existing pages will not be
> + * coalesced into THP and new pages will not be allocated as THP.
> + * MADV_DONTDUMP - the application wants to prevent pages in the given range
> + * from being included in its core dump.
> + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *
> + * return values:
> + * zero - success
> + * -EINVAL - start + len < 0, start is not page-aligned,
> + * "behavior" is not a valid value, or application
> + * is attempting to release locked or shared pages,
> + * or the specified address range includes file, Huge TLB,
> + * MAP_SHARED or VMPFNMAP range.
> + * -ENOMEM - addresses in the specified range are not currently
> + * mapped, or are outside the AS of the process.
> + * -EIO - an I/O error occurred while paging in data.
> + * -EBADF - map exists, but area maps something that isn't a file.
> + * -EAGAIN - a kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +{
> + return madvise_core(current, current->mm, start, len_in, behavior);
> +}
> --
> 2.22.0.rc1.257.g3120a18244-goog
>

--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer

2019-05-31 08:39:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 5/6] mm: introduce external memory hinting API

On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> There is a usecase where a centralized userspace daemon wants to give
> a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon (ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces a new syscall, process_madvise(2).
> It can give a hint to the external process behind a pidfd.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it can affect another process's address range, only a privileged
> process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
> the target process (e.g., by being the same UID), can use it successfully.
>
> The syscall has a cookie argument to provide atomicity (i.e., to detect
> changes to the target process's address space after the monitor process
> has parsed its address ranges, so the operation can fail if such a race
> happens). Although there is no interface to get a cookie at this moment,
> it is useful to keep it as an argument to avoid introducing another new
> syscall in the future. It could support *atomicity* for disruptive hints
> (e.g., MADV_DONTNEED|FREE). The flag argument is reserved for future use
> if we need to extend the API.

Providing an API that is incomplete will not fly. Really. This really
begs for much more discussion, and it would be good to move on with the
core idea of proactive memory management from userspace as the usecase.
Could you split out the core change so that we can move on and leave the
external API for a later discussion? I believe this would lead to a
smoother integration.
--
Michal Hocko
SUSE Labs

2019-05-31 08:50:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use. This could reduce
> workingset eviction so it ends up increasing performance.
>
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
>
> Internally, it works by deactivating pages, moving them from the active
> list to the head of the inactive list if the page is private, because the
> inactive list can be full of used-once pages that are the first candidates
> for reclaim, which is also the reason MADV_FREE moves pages to the head of
> the inactive LRU list. Therefore, when memory pressure happens, these pages
> will be reclaimed earlier than other active pages unless they are accessed
> again before then.

[I am intentionally not looking at the implementation because below
points should be clear from the changelog - sorry about nagging ;)]

What kind of pages can be deactivated? Anonymous/File backed.
Private/shared? If shared, are there any restrictions?

Are there any restrictions on mappings? E.g. what would be an effect of
this operation on hugetlbfs mapping?

Also you are talking about inactive LRU but what kind of LRU is that? Is
it the anonymous LRU? If yes, don't we have the same problem as with the
early MADV_FREE implementation, where enough page cache caused the
deactivated anonymous memory to not get reclaimed anytime soon, or
worse, never, when there is no swap available?
--
Michal Hocko
SUSE Labs

2019-05-31 08:52:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT

On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint to the kernel that the pages can be
> reclaimed instantly but the data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
>
> This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
> syscall. MADV_PAGEOUT can be used by a process to mark a memory
> range as not expected to be used for a long time so that the kernel
> reclaims the memory instantly. The hint can help the kernel in deciding
> which pages to evict proactively.

Again, are there any restrictions on what kind of memory can be paged out?
Private/shared, anonymous/file backed. Any restrictions on mapping type,
etc. Please make sure all of that is in the changelog.

What are the failure modes? E.g. what if the swap is full, does the call
fail or does it silently ignore the error?

Thanks!
--
Michal Hocko
SUSE Labs

2019-05-31 10:08:25

by Yann Droneaud

[permalink] [raw]
Subject: Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector array

Hi,

Le vendredi 31 mai 2019 à 15:43 +0900, Minchan Kim a écrit :
>
> diff --git a/include/uapi/asm-generic/mman-common.h
> b/include/uapi/asm-generic/mman-common.h
> index 92e347a89ddc..220c2b5eb961 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -75,4 +75,15 @@
> #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> PKEY_DISABLE_WRITE)
>
> +struct pr_madvise_param {
> + int size; /* the size of this structure */
> + int cookie; /* reserved to support atomicity */
> + int nr_elem; /* count of the array fields below */

Those should be unsigned.

There's an implicit ABI hole here on 64-bit, because of pointer alignment.

> + int __user *hints; /* hints for each range */
> + /* to store result of each operation */
> + const struct iovec __user *results;
> + /* input address ranges */
> + const struct iovec __user *ranges;

Using pointer types in a uAPI structure requires a 'compat' version of
the syscall to be provided.

The same applies when using iovec.

> +};
> +
> #endif /* __ASM_GENERIC_MMAN_COMMON_H */
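
(To illustrate the hole: a quick check, assuming the struct from the
patch, prints different layouts for 64-bit and 32-bit builds.)

	#include <stddef.h>
	#include <stdio.h>
	#include <sys/uio.h>

	struct pr_madvise_param {
		int size;
		int cookie;
		int nr_elem;
		int *hints;
		const struct iovec *results;
		const struct iovec *ranges;
	};

	int main(void)
	{
		/* x86-64: offsetof(hints) == 16 (4-byte hole after nr_elem),
		 * sizeof == 40; i386: offsetof(hints) == 12, sizeof == 24 */
		printf("hints at %zu, sizeof %zu\n",
		       offsetof(struct pr_madvise_param, hints),
		       sizeof(struct pr_madvise_param));
		return 0;
	}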

Regards.

--
Yann Droneaud
OPTEYA


2019-05-31 13:14:37

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > This patch factors out madvise's core functionality so that an upcoming
> > patch can reuse it without duplication. It shouldn't change any behavior.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > 1 file changed, 101 insertions(+), 87 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9d749a1420b4..466623ea8c36 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > struct page *page;
> > int isolated = 0;
> > struct vm_area_struct *vma = walk->vma;
> > + struct task_struct *task = walk->private;
> > unsigned long next;
> >
> > - if (fatal_signal_pending(current))
> > + if (fatal_signal_pending(task))
> > return -EINTR;
> >
> > next = pmd_addr_end(addr, end);
> > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > }
> >
> > static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > - struct vm_area_struct *vma,
> > - unsigned long addr, unsigned long end)
> > + struct task_struct *task,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, unsigned long end)
> > {
> > struct mm_walk warm_walk = {
> > .pmd_entry = madvise_pageout_pte_range,
> > .mm = vma->vm_mm,
> > + .private = task,
> > };
> >
> > tlb_start_vma(tlb, vma);
> > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > }
> >
> >
> > -static long madvise_pageout(struct vm_area_struct *vma,
> > - struct vm_area_struct **prev,
> > - unsigned long start_addr, unsigned long end_addr)
> > +static long madvise_pageout(struct task_struct *task,
> > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > + unsigned long start_addr, unsigned long end_addr)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct mmu_gather tlb;
> > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> >
> > lru_add_drain();
> > tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > tlb_finish_mmu(&tlb, start_addr, end_addr);
> >
> > return 0;
> > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > +static long madvise_dontneed_free(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > unsigned long start, unsigned long end,
> > int behavior)
> > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > if (!userfaultfd_remove(vma, start, end)) {
> > *prev = NULL; /* mmap_sem has been dropped, prev is stale */
> >
> > - down_read(&current->mm->mmap_sem);
> > - vma = find_vma(current->mm, start);
> > + down_read(&mm->mmap_sem);
> > + vma = find_vma(mm, start);
> > if (!vma)
> > return -ENOMEM;
> > if (start < vma->vm_start) {
> > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > * Application wants to free up the pages and associated backing store.
> > * This is effectively punching a hole into the middle of a file.
> > */
> > -static long madvise_remove(struct vm_area_struct *vma,
> > +static long madvise_remove(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > unsigned long start, unsigned long end)
> > {
> > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > get_file(f);
> > if (userfaultfd_remove(vma, start, end)) {
> > /* mmap_sem was not released by userfaultfd_remove() */
> > - up_read(&current->mm->mmap_sem);
> > + up_read(&mm->mmap_sem);
> > }
> > error = vfs_fallocate(f,
> > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > offset, end - start);
> > fput(f);
> > - down_read(&current->mm->mmap_sem);
> > + down_read(&mm->mmap_sem);
> > return error;
> > }
> >
> > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > #endif
> >
> > static long
> > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > unsigned long start, unsigned long end, int behavior)
> > {
> > switch (behavior) {
> > case MADV_REMOVE:
> > - return madvise_remove(vma, prev, start, end);
> > + return madvise_remove(mm, vma, prev, start, end);
> > case MADV_WILLNEED:
> > return madvise_willneed(vma, prev, start, end);
> > case MADV_COLD:
> > return madvise_cold(vma, prev, start, end);
> > case MADV_PAGEOUT:
> > - return madvise_pageout(vma, prev, start, end);
> > + return madvise_pageout(task, vma, prev, start, end);
> > case MADV_FREE:
> > case MADV_DONTNEED:
> > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > + return madvise_dontneed_free(mm, vma, prev, start,
> > + end, behavior);
> > default:
> > return madvise_behavior(vma, prev, start, end, behavior);
> > }
> > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > }
> > }
> >
> > -/*
> > - * The madvise(2) system call.
> > - *
> > - * Applications can use madvise() to advise the kernel how it should
> > - * handle paging I/O in this VM area. The idea is to help the kernel
> > - * use appropriate read-ahead and caching techniques. The information
> > - * provided is advisory only, and can be safely disregarded by the
> > - * kernel without affecting the correct operation of the application.
> > - *
> > - * behavior values:
> > - * MADV_NORMAL - the default behavior is to read clusters. This
> > - * results in some read-ahead and read-behind.
> > - * MADV_RANDOM - the system should read the minimum amount of data
> > - * on any access, since it is unlikely that the appli-
> > - * cation will need more than what it asks for.
> > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > - * once, so they can be aggressively read ahead, and
> > - * can be freed soon after they are accessed.
> > - * MADV_WILLNEED - the application is notifying the system to read
> > - * some pages ahead.
> > - * MADV_DONTNEED - the application is finished with the given range,
> > - * so the kernel can free resources associated with it.
> > - * MADV_FREE - the application marks pages in the given range as lazy free,
> > - * where actual purges are postponed until memory pressure happens.
> > - * MADV_REMOVE - the application wants to free up the given range of
> > - * pages and associated backing store.
> > - * MADV_DONTFORK - omit this area from child's address space when forking:
> > - * typically, to avoid COWing pages pinned by get_user_pages().
> > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > - * range after a fork.
> > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> > - * were corrupted by unrecoverable hardware memory failure.
> > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > - * this area with pages of identical content from other such areas.
> > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> > - * huge pages in the future. Existing pages might be coalesced and
> > - * new pages might be allocated as THP.
> > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > - * transparent huge pages so the existing pages will not be
> > - * coalesced into THP and new pages will not be allocated as THP.
> > - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > - * from being included in its core dump.
> > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > - *
> > - * return values:
> > - * zero - success
> > - * -EINVAL - start + len < 0, start is not page-aligned,
> > - * "behavior" is not a valid value, or application
> > - * is attempting to release locked or shared pages,
> > - * or the specified address range includes file, Huge TLB,
> > - * MAP_SHARED or VMPFNMAP range.
> > - * -ENOMEM - addresses in the specified range are not currently
> > - * mapped, or are outside the AS of the process.
> > - * -EIO - an I/O error occurred while paging in data.
> > - * -EBADF - map exists, but area maps something that isn't a file.
> > - * -EAGAIN - a kernel resource was temporarily unavailable.
> > - */
> > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > + unsigned long start, size_t len_in, int behavior)
>
> Just a minor nitpick, but can we please have it named madvise_common,
> not madvise_core? This would follow the usual naming scheme for factoring
> out common functionality (like for mutexes, semaphores etc.), and within
> the kernel "core" usually means something completely different.

Sure.

>
> > {
> > unsigned long end, tmp;
> > struct vm_area_struct *vma, *prev;
> > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >
> > #ifdef CONFIG_MEMORY_FAILURE
> > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > - return madvise_inject_error(behavior, start, start + len_in);
> > + return madvise_inject_error(behavior,
> > + start, start + len_in);
>
> Not sure what this change is about except changing the line length.
> Note, madvise_inject_error() still operates on "current" through
> get_user_pages_fast() and gup_pgd_range(), but that was not changed
> here. I know you've filtered out this hint later, so technically this
> is not an issue, but maybe this needs some attention too since we've
> already spotted it?

It is a leftover from my earlier work. I actually modified it to handle
the remote task but changed my mind, because process_madvise will not
support it at this moment. I'm not sure it's a good idea to change it now
for something that *might be done in the future*, even though we have
spotted it.

>
> > #endif
> >
> > write = madvise_need_mmap_write(behavior);
> > if (write) {
> > - if (down_write_killable(&current->mm->mmap_sem))
> > + if (down_write_killable(&mm->mmap_sem))
> > return -EINTR;
>
> Do you still need that trick with mmget_still_valid() here?
> Something like:

Since MADV_COLD|PAGEOUT doesn't change the address space layout or
vma->vm_flags, technically we don't need it, if I understand
correctly. Right?

>
> if (current->mm != mm && !mmget_still_valid(mm))
> goto skip_mm;
>
> and that skip_mm label would be before
>
> if (write)
> up_write(&mm->mmap_sem);
>
> below.
>
> (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)

2019-05-31 13:20:46

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 5/6] mm: introduce external memory hinting API

On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > There is a usecase where a centralized userspace daemon wants to give
> > a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> > ActivityManagerService is one of them.
> >
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon (ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> >
> > To solve the issue, this patch introduces a new syscall, process_madvise(2).
> > It can give a hint to the external process behind a pidfd.
> >
> > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > unsigned long cookie, unsigned long flag);
> >
> > Since it can affect another process's address range, only a privileged
> > process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
> > the target process (e.g., by being the same UID), can use it successfully.
> >
> > The syscall has a cookie argument to provide atomicity (i.e., to detect
> > changes to the target process's address space after the monitor process
> > has parsed its address ranges, so the operation can fail if such a race
> > happens). Although there is no interface to get a cookie at this moment,
> > it is useful to keep it as an argument to avoid introducing another new
> > syscall in the future. It could support *atomicity* for disruptive hints
> > (e.g., MADV_DONTNEED|FREE). The flag argument is reserved for future use
> > if we need to extend the API.
>
> Providing an API that is incomplete will not fly. Really. This really
> begs for much more discussion, and it would be good to move on with the
> core idea of proactive memory management from userspace as the usecase.
> Could you split out the core change so that we can move on and leave the
> external API for a later discussion? I believe this would lead to a
> smoother integration.

No problem, but I need to understand what you want a little bit more,
because I thought this patchset already proceeds step by step, so if we
reach agreement on part of it, like [1-5/6], that part could be merged first.

Could you say how you want to split the patchset to make forward progress?

Thanks.

2019-05-31 13:41:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use. This could reduce
> > workingset eviction so it ends up increasing performance.
> >
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> >
> > Internally, it works by deactivating pages, moving them from the active
> > list to the head of the inactive list if the page is private, because the
> > inactive list can be full of used-once pages that are the first candidates
> > for reclaim, which is also the reason MADV_FREE moves pages to the head of
> > the inactive LRU list. Therefore, when memory pressure happens, these pages
> > will be reclaimed earlier than other active pages unless they are accessed
> > again before then.
>
> [I am intentionally not looking at the implementation because below
> points should be clear from the changelog - sorry about nagging ;)]
>
> What kind of pages can be deactivated? Anonymous/File backed.
> Private/shared? If shared, are there any restrictions?

Both file and private pages can be deactivated, moved from each active LRU
to the corresponding inactive LRU, if the page's map_count is one. In other words,

if (page_mapcount(page) <= 1)
deactivate_page(page);

>
> Are there any restrictions on mappings? E.g. what would be an effect of
> this operation on hugetlbfs mapping?

VM_LOCKED|VM_HUGETLB|VM_PFNMAP vmas will be skipped, as with MADV_FREE|DONTNEED.

>
> Also you are talking about inactive LRU but what kind of LRU is that? Is
> it the anonymous LRU? If yes, don't we have the same problem as with the

active file page -> inactive file LRU
active anon page -> inactive anon LRU

> early MADV_FREE implementation, where enough page cache caused the
> deactivated anonymous memory to not get reclaimed anytime soon, or
> worse, never, when there is no swap available?

I think MADV_COLD has a slightly different semantic from MADV_FREE.
MADV_FREE means it's okay to discard the page under memory pressure because
the content of the page is *garbage*. Furthermore, freeing such pages has
almost zero overhead since we don't need to swap them out, and a later
access causes only a minor fault. Thus, it makes sense to put those
freeable pages on the inactive file LRU to compete with other used-once
pages.

However, MADV_COLD doesn't mean the content is garbage, and freeing requires
a swap out and a swap in afterward. So, it is better to move such pages to
the inactive anon LRU list, not the file LRU. Furthermore, that avoids
unnecessary scanning of those cold anonymous pages if the system doesn't
have a swap device.
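
In short, for anonymous pages:

	hint        moved to            under memory pressure
	MADV_FREE   inactive file LRU   discarded (content is garbage)
	MADV_COLD   inactive anon LRU   swapped out (content preserved)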

2019-05-31 13:46:37

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT

On Fri, May 31, 2019 at 10:50:44AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint to the kernel that the pages can be
> > reclaimed instantly but the data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> >
> > This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
> > syscall. MADV_PAGEOUT can be used by a process to mark a memory
> > range as not expected to be used for a long time so that the kernel
> > reclaims the memory instantly. The hint can help the kernel in deciding
> > which pages to evict proactively.
>
> Again, are there any restrictions on what kind of memory can be paged out?
> Private/shared, anonymous/file backed. Any restrictions on mapping type,
> etc. Please make sure all of that is in the changelog.

It's the same as with MADV_COLD. Yes, I will include all the details in
the description.

>
> What are the failure modes? E.g. what if the swap is full, does the call
> fail or does it silently ignore the error?

In such a case, the swapout is just ignored. The call returns -EINVAL only
if the vma is one of (VM_LOCKED|VM_HUGETLB|VM_PFNMAP) at this moment.
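
So from userspace the swap-full case is indistinguishable from success by
return value; a caller can only detect the disallowed-vma case (a sketch):

	#include <errno.h>
	#include <sys/mman.h>

	/* MADV_PAGEOUT silently skips pages it cannot swap out; only a
	 * VM_LOCKED/VM_HUGETLB/VM_PFNMAP range yields an error */
	if (madvise(addr, len, MADV_PAGEOUT) == -1 && errno == EINVAL) {
		/* disallowed mapping (or a kernel without MADV_PAGEOUT) */
	}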

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2019-05-31 14:03:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 5/6] mm: introduce external memory hinting API

On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > There is a usecase where a centralized userspace daemon wants to give
> > > a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> > > ActivityManagerService is one of them.
> > >
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon (ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > >
> > > To solve the issue, this patch introduces a new syscall, process_madvise(2).
> > > It can give a hint to the external process behind a pidfd.
> > >
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > unsigned long cookie, unsigned long flag);
> > >
> > > Since it can affect another process's address range, only a privileged
> > > process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
> > > the target process (e.g., by being the same UID), can use it successfully.
> > >
> > > The syscall has a cookie argument to provide atomicity (i.e., to detect
> > > changes to the target process's address space after the monitor process
> > > has parsed its address ranges, so the operation can fail if such a race
> > > happens). Although there is no interface to get a cookie at this moment,
> > > it is useful to keep it as an argument to avoid introducing another new
> > > syscall in the future. It could support *atomicity* for disruptive hints
> > > (e.g., MADV_DONTNEED|FREE). The flag argument is reserved for future use
> > > if we need to extend the API.
> >
> > Providing an API that is incomplete will not fly. Really. This really
> > begs for much more discussion, and it would be good to move on with the
> > core idea of proactive memory management from userspace as the usecase.
> > Could you split out the core change so that we can move on and leave the
> > external API for a later discussion? I believe this would lead to a
> > smoother integration.
>
> No problem, but I need to understand what you want a little bit more,
> because I thought this patchset already proceeds step by step, so if we
> reach agreement on part of it, like [1-5/6], that part could be merged first.
>
> Could you say how you want to split the patchset to make forward progress?

I would start with the new madvise modes, and once they are in a shape to
be merged, we can start on the remote madvise API. I believe that even the
local process reclaim modes are interesting and useful. I haven't heard
anybody objecting to them without having a remote API so far.
--
Michal Hocko
SUSE Labs

2019-05-31 14:07:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it could
> > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > happens but data should be preserved for future use. This could reduce
> > > workingset eviction so it ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help kernel in deciding which
> > > pages to evict early during memory pressure.
> > >
> > > Internally, it works by deactivating pages, moving them from the active
> > > list to the head of the inactive list if the page is private, because the
> > > inactive list can be full of used-once pages that are the first candidates
> > > for reclaim, which is also the reason MADV_FREE moves pages to the head of
> > > the inactive LRU list. Therefore, when memory pressure happens, these pages
> > > will be reclaimed earlier than other active pages unless they are accessed
> > > again before then.
> >
> > [I am intentionally not looking at the implementation because below
> > points should be clear from the changelog - sorry about nagging ;)]
> >
> > What kind of pages can be deactivated? Anonymous/File backed.
> > Private/shared? If shared, are there any restrictions?
>
> Both file and private pages can be deactivated, moved from each active LRU
> to the corresponding inactive LRU, if the page's map_count is one. In other words,
>
> if (page_mapcount(page) <= 1)
> deactivate_page(page);

Why do we restrict this to pages that are single-mapped?

> > Are there any restrictions on mappings? E.g. what would be an effect of
> > this operation on hugetlbfs mapping?
>
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vmas will be skipped, as with MADV_FREE|DONTNEED.

OK, documenting that this is restricted to the same vmas as
MADV_FREE|DONTNEED would be really useful.

>
> >
> > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > it the anonymous LRU? If yes, don't we have the same problem as with the
>
> active file page -> inactive file LRU
> active anon page -> inactive anon LRU
>
> > early MADV_FREE implementation, where enough page cache caused the
> > deactivated anonymous memory to not get reclaimed anytime soon, or
> > worse, never, when there is no swap available?
>
> > I think MADV_COLD has a slightly different semantic from MADV_FREE.
> > MADV_FREE means it's okay to discard the page under memory pressure because
> > the content of the page is *garbage*. Furthermore, freeing such pages has
> > almost zero overhead since we don't need to swap them out, and a later
> > access causes only a minor fault. Thus, it makes sense to put those
> > freeable pages on the inactive file LRU to compete with other used-once
> > pages.
> >
> > However, MADV_COLD doesn't mean the content is garbage, and freeing requires
> > a swap out and a swap in afterward. So, it is better to move such pages to
> > the inactive anon LRU list, not the file LRU. Furthermore, that avoids
> > unnecessary scanning of those cold anonymous pages if the system doesn't
> > have a swap device.

Please document this if this is really the desired semantic, because
then you have the same set of problems as we had with the early
MADV_FREE implementation mentioned above.

--
Michal Hocko
SUSE Labs

2019-05-31 14:15:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 5/6] mm: introduce external memory hinting API

On Fri, May 31, 2019 at 04:00:50PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > > There is a usecase where a centralized userspace daemon wants to give
> > > > a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> > > > ActivityManagerService is one of them.
> > > >
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon (ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > >
> > > > To solve the issue, this patch introduces a new syscall, process_madvise(2).
> > > > It can give a hint to the external process behind a pidfd.
> > > >
> > > > int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > unsigned long cookie, unsigned long flag);
> > > >
> > > > Since it can affect another process's address range, only a privileged
> > > > process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
> > > > the target process (e.g., by being the same UID), can use it successfully.
> > > >
> > > > The syscall has a cookie argument to provide atomicity (i.e., to detect
> > > > changes to the target process's address space after the monitor process
> > > > has parsed its address ranges, so the operation can fail if such a race
> > > > happens). Although there is no interface to get a cookie at this moment,
> > > > it is useful to keep it as an argument to avoid introducing another new
> > > > syscall in the future. It could support *atomicity* for disruptive hints
> > > > (e.g., MADV_DONTNEED|FREE). The flag argument is reserved for future use
> > > > if we need to extend the API.
> > >
> > > Providing an API that is incomplete will not fly. Really. This really
> > > begs for much more discussion, and it would be good to move on with the
> > > core idea of proactive memory management from userspace as the usecase.
> > > Could you split out the core change so that we can move on and leave the
> > > external API for a later discussion? I believe this would lead to a
> > > smoother integration.
> >
> > No problem, but I need to understand what you want a little bit more,
> > because I thought this patchset already proceeds step by step, so if we
> > reach agreement on part of it, like [1-5/6], that part could be merged first.
> >
> > Could you say how you want to split the patchset to make forward progress?
>
> I would start with the new madvise modes, and once they are in a shape to
> be merged, we can start on the remote madvise API. I believe that even the
> local process reclaim modes are interesting and useful. I haven't heard
> anybody objecting to them without having a remote API so far.

Okay, let's focus on [1-3/6] at this moment.

> --
> Michal Hocko
> SUSE Labs

2019-05-31 14:36:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > happens but data should be preserved for future use. This could reduce
> > > > workingset eviction so it ends up increasing performance.
> > > >
> > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help kernel in deciding which
> > > > pages to evict early during memory pressure.
> > > >
> > > > Internally, it works by deactivating pages, moving them from the active
> > > > list to the head of the inactive list if the page is private, because the
> > > > inactive list can be full of used-once pages that are the first candidates
> > > > for reclaim, which is also the reason MADV_FREE moves pages to the head of
> > > > the inactive LRU list. Therefore, when memory pressure happens, these pages
> > > > will be reclaimed earlier than other active pages unless they are accessed
> > > > again before then.
> > >
> > > [I am intentionally not looking at the implementation because below
> > > points should be clear from the changelog - sorry about nagging ;)]
> > >
> > > What kind of pages can be deactivated? Anonymous/File backed.
> > > Private/shared? If shared, are there any restrictions?
> >
> > Both file-backed and private pages can be deactivated, moving from each
> > active LRU to the corresponding inactive LRU, if the page is mapped only
> > once. In other words,
> >
> > if (page_mapcount(page) <= 1)
> > deactivate_page(page);
>
> Why do we restrict to pages that are single mapped?

Because the page table of one of the processes sharing the page would have
the access bit set, so in the end we couldn't reclaim the page. The more
processes share it, the more likely reclaim is to fail.
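
To make the policy concrete, the per-page gate inside the pte walker is
just a mapcount check before deactivation (a minimal sketch only; the
walker itself would follow the madvise_pageout_pte_range pattern in 4/6):

	/*
	 * Sketch only: shared pages are skipped because another
	 * mapping's access bit would keep the page from being
	 * reclaimed anyway.
	 */
	page = vm_normal_page(vma, addr, ptent);
	if (!page)
		continue;
	if (page_mapcount(page) > 1)	/* shared: leave it alone */
		continue;
	deactivate_page(page);		/* active -> head of inactive */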

>
> > > Are there any restrictions on mappings? E.g. what would be an effect of
> > > this operation on hugetlbfs mapping?
> >
> > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
>
> OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
> is really useful to mention.

Sure.

>
> >
> > >
> > > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > > it the anonymous LRU? If yes, don't we have the same problem as with the
> >
> > active file page -> inactive file LRU
> > active anon page -> inactive anon LRU
> >
> > > early MADV_FREE implementation, when enough page cache meant that
> > > deactivated anonymous memory didn't get reclaimed anytime soon. Or
> > > worse, never, when there is no swap available?
> >
> > I think MADV_COLD has slightly different semantics from MADV_FREE.
> > MADV_FREE means it's okay to discard the page under memory pressure,
> > because the content of the page is *garbage*. Furthermore, freeing such
> > pages has almost zero overhead, since we don't need to swap them out,
> > and a later access causes only a minor fault. Thus, it makes sense to
> > put those freeable pages on the inactive file LRU to compete with other
> > used-once pages.
> >
> > However, MADV_COLD doesn't mean the content is garbage, and freeing it
> > requires a swap-out, with a swap-in on a later access. So it would be
> > better to move such pages to the inactive anon LRU list, not the file
> > LRU. Furthermore, this avoids unnecessary scanning of those cold
> > anonymous pages when the system doesn't have a swap device.
>
> Please document this, if this is really a desirable semantic because
> then you have the same set of problems as we've had with the early
> MADV_FREE implementation mentioned above.

IIRC, the problem with MADV_FREE was that we couldn't discard freeable
pages because the VM never scans the anonymous LRU on a swapless system.
However, that is not our case, because we should reclaim these pages,
not discard them.

I will include it in the description.

Thanks.
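
To make the userspace side concrete: the hint is an ordinary madvise(2)
call on a page-aligned range. A minimal sketch (buf and len are
placeholders; MADV_COLD itself comes from this series' patched headers):

	#include <sys/mman.h>

	/* Mark [buf, buf + len) as cold: contents are preserved, but
	 * the pages become early candidates for reclaim. */
	if (madvise(buf, len, MADV_COLD) != 0)
		perror("madvise(MADV_COLD)");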

2019-05-31 14:38:27

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote:
> On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > > This patch factors out madvise's core functionality so that an upcoming
> > > patch can reuse it without duplication. It shouldn't change any behavior.
> > >
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > ---
> > > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > > 1 file changed, 101 insertions(+), 87 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 9d749a1420b4..466623ea8c36 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > struct page *page;
> > > int isolated = 0;
> > > struct vm_area_struct *vma = walk->vma;
> > > + struct task_struct *task = walk->private;
> > > unsigned long next;
> > >
> > > - if (fatal_signal_pending(current))
> > > + if (fatal_signal_pending(task))
> > > return -EINTR;
> > >
> > > next = pmd_addr_end(addr, end);
> > > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > }
> > >
> > > static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > - struct vm_area_struct *vma,
> > > - unsigned long addr, unsigned long end)
> > > + struct task_struct *task,
> > > + struct vm_area_struct *vma,
> > > + unsigned long addr, unsigned long end)
> > > {
> > > struct mm_walk warm_walk = {
> > > .pmd_entry = madvise_pageout_pte_range,
> > > .mm = vma->vm_mm,
> > > + .private = task,
> > > };
> > >
> > > tlb_start_vma(tlb, vma);
> > > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > }
> > >
> > >
> > > -static long madvise_pageout(struct vm_area_struct *vma,
> > > - struct vm_area_struct **prev,
> > > - unsigned long start_addr, unsigned long end_addr)
> > > +static long madvise_pageout(struct task_struct *task,
> > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > + unsigned long start_addr, unsigned long end_addr)
> > > {
> > > struct mm_struct *mm = vma->vm_mm;
> > > struct mmu_gather tlb;
> > > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> > >
> > > lru_add_drain();
> > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > > tlb_finish_mmu(&tlb, start_addr, end_addr);
> > >
> > > return 0;
> > > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > return 0;
> > > }
> > >
> > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > +static long madvise_dontneed_free(struct mm_struct *mm,
> > > + struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end,
> > > int behavior)
> > > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > if (!userfaultfd_remove(vma, start, end)) {
> > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > >
> > > - down_read(&current->mm->mmap_sem);
> > > - vma = find_vma(current->mm, start);
> > > + down_read(&mm->mmap_sem);
> > > + vma = find_vma(mm, start);
> > > if (!vma)
> > > return -ENOMEM;
> > > if (start < vma->vm_start) {
> > > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > * Application wants to free up the pages and associated backing store.
> > > * This is effectively punching a hole into the middle of a file.
> > > */
> > > -static long madvise_remove(struct vm_area_struct *vma,
> > > +static long madvise_remove(struct mm_struct *mm,
> > > + struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end)
> > > {
> > > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > > get_file(f);
> > > if (userfaultfd_remove(vma, start, end)) {
> > > /* mmap_sem was not released by userfaultfd_remove() */
> > > - up_read(&current->mm->mmap_sem);
> > > + up_read(&mm->mmap_sem);
> > > }
> > > error = vfs_fallocate(f,
> > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > offset, end - start);
> > > fput(f);
> > > - down_read(&current->mm->mmap_sem);
> > > + down_read(&mm->mmap_sem);
> > > return error;
> > > }
> > >
> > > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > > #endif
> > >
> > > static long
> > > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end, int behavior)
> > > {
> > > switch (behavior) {
> > > case MADV_REMOVE:
> > > - return madvise_remove(vma, prev, start, end);
> > > + return madvise_remove(mm, vma, prev, start, end);
> > > case MADV_WILLNEED:
> > > return madvise_willneed(vma, prev, start, end);
> > > case MADV_COLD:
> > > return madvise_cold(vma, prev, start, end);
> > > case MADV_PAGEOUT:
> > > - return madvise_pageout(vma, prev, start, end);
> > > + return madvise_pageout(task, vma, prev, start, end);
> > > case MADV_FREE:
> > > case MADV_DONTNEED:
> > > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > > + return madvise_dontneed_free(mm, vma, prev, start,
> > > + end, behavior);
> > > default:
> > > return madvise_behavior(vma, prev, start, end, behavior);
> > > }
> > > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > > }
> > > }
> > >
> > > -/*
> > > - * The madvise(2) system call.
> > > - *
> > > - * Applications can use madvise() to advise the kernel how it should
> > > - * handle paging I/O in this VM area. The idea is to help the kernel
> > > - * use appropriate read-ahead and caching techniques. The information
> > > - * provided is advisory only, and can be safely disregarded by the
> > > - * kernel without affecting the correct operation of the application.
> > > - *
> > > - * behavior values:
> > > - * MADV_NORMAL - the default behavior is to read clusters. This
> > > - * results in some read-ahead and read-behind.
> > > - * MADV_RANDOM - the system should read the minimum amount of data
> > > - * on any access, since it is unlikely that the appli-
> > > - * cation will need more than what it asks for.
> > > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > > - * once, so they can be aggressively read ahead, and
> > > - * can be freed soon after they are accessed.
> > > - * MADV_WILLNEED - the application is notifying the system to read
> > > - * some pages ahead.
> > > - * MADV_DONTNEED - the application is finished with the given range,
> > > - * so the kernel can free resources associated with it.
> > > - * MADV_FREE - the application marks pages in the given range as lazy free,
> > > - * where actual purges are postponed until memory pressure happens.
> > > - * MADV_REMOVE - the application wants to free up the given range of
> > > - * pages and associated backing store.
> > > - * MADV_DONTFORK - omit this area from child's address space when forking:
> > > - * typically, to avoid COWing pages pinned by get_user_pages().
> > > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > > - * range after a fork.
> > > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > > - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> > > - * were corrupted by unrecoverable hardware memory failure.
> > > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > > - * this area with pages of identical content from other such areas.
> > > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > > - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> > > - * huge pages in the future. Existing pages might be coalesced and
> > > - * new pages might be allocated as THP.
> > > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > - * transparent huge pages so the existing pages will not be
> > > - * coalesced into THP and new pages will not be allocated as THP.
> > > - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > - * from being included in its core dump.
> > > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > - *
> > > - * return values:
> > > - * zero - success
> > > - * -EINVAL - start + len < 0, start is not page-aligned,
> > > - * "behavior" is not a valid value, or application
> > > - * is attempting to release locked or shared pages,
> > > - * or the specified address range includes file, Huge TLB,
> > > - * MAP_SHARED or VMPFNMAP range.
> > > - * -ENOMEM - addresses in the specified range are not currently
> > > - * mapped, or are outside the AS of the process.
> > > - * -EIO - an I/O error occurred while paging in data.
> > > - * -EBADF - map exists, but area maps something that isn't a file.
> > > - * -EAGAIN - a kernel resource was temporarily unavailable.
> > > - */
> > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > > + unsigned long start, size_t len_in, int behavior)
> >
> > Just a minor nitpick, but can we please have it named madvise_common,
> > not madvise_core? This would follow a usual naming scheme, when some
> > common functionality is factored out (like, for mutexes, semaphores
> > etc), and within the kernel "core" usually means something completely
> > different.
>
> Sure.
>
> >
> > > {
> > > unsigned long end, tmp;
> > > struct vm_area_struct *vma, *prev;
> > > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >
> > > #ifdef CONFIG_MEMORY_FAILURE
> > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > - return madvise_inject_error(behavior, start, start + len_in);
> > > + return madvise_inject_error(behavior,
> > > + start, start + len_in);
> >
> > Not sure what this change is about except changing the line length.
> > Note, madvise_inject_error() still operates on "current" through
> > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > here. I know you've filtered out this hint later, so technically this
> > is not an issue, but, maybe, this needs some attention too since we've
> > already spotted it?
>
> It is a leftover from what I had done earlier. I actually modified it to
> handle a remote task but changed my mind and didn't fix it, because
> process_madvise will not support it at this moment. I'm not sure it's a
> good idea to change it now for something that *might be done in the
> future*, even though we have spotted it.

I'd expect to have at least some comments in the code on why the other
hints are disabled, so that if we already know some shortcomings, this
information is not lost.

Of course, I don't care much about memory poisoning, but if it can be
addressed now, let's address it now.

>
> >
> > > #endif
> > >
> > > write = madvise_need_mmap_write(behavior);
> > > if (write) {
> > > - if (down_write_killable(&current->mm->mmap_sem))
> > > + if (down_write_killable(&mm->mmap_sem))
> > > return -EINTR;
> >
> > Do you still need that trick with mmget_still_valid() here?
> > Something like:
>
> Since MADV_COLD|PAGEOUT doesn't change address space layout or
> vma->vm_flags, technically, we don't need it if I understand
> correctly. Right?

I'd expect so, yes. But.

Since we want this interface to be universal and to be able to cover
various needs, and since my initial intention with working in this
direction involved KSM, I'd ask you to enable KSM hints too, and once
(and if) that happens, the work there is done under write lock, and
you'll need this trick to be applied.

Of course, I can do that myself later in a subsequent patch series once
(and, again, if) your series is merged, but, maybe, we can cover this
already especially given the fact that KSM hinting is a relatively easy
task in this pile. I did some preliminary tests with it, and so far no
dragons have started to roar.

>
> >
> > if (current->mm != mm && !mmget_still_valid(mm))
> > goto skip_mm;
> >
> > and that skip_mm label would be before
> >
> > if (write)
> > up_write(&mm->mmap_sem);
> >
> > below.
> >
> > (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
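
For clarity, a consolidated sketch of how that guard would sit in
madvise_common (illustrative only; the exact bail-out semantics are up
for discussion):

	write = madvise_need_mmap_write(behavior);
	if (write) {
		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;
		/* Bail out on a remote mm being torn down
		 * (coredump race, see the commit above). */
		if (current->mm != mm && !mmget_still_valid(mm))
			goto skip_mm;
	} else {
		down_read(&mm->mmap_sem);
	}

	/* ... per-vma madvise work ... */

	skip_mm:
	if (write)
		up_write(&mm->mmap_sem);
	else
		up_read(&mm->mmap_sem);
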

--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer

2019-05-31 17:01:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT

Hi Minchan,

this looks pretty straightforward to me, only one kink:

On Fri, May 31, 2019 at 03:43:10PM +0900, Minchan Kim wrote:
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
> nr_deactivate, nr_rotated, sc->priority, file);
> }
>
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> + int nid = -1;
> + unsigned long nr_isolated[2] = {0, };
> + unsigned long nr_reclaimed = 0;
> + LIST_HEAD(node_page_list);
> + struct reclaim_stat dummy_stat;
> + struct scan_control sc = {
> + .gfp_mask = GFP_KERNEL,
> + .priority = DEF_PRIORITY,
> + .may_writepage = 1,
> + .may_unmap = 1,
> + .may_swap = 1,
> + };
> +
> + while (!list_empty(page_list)) {
> + struct page *page;
> +
> + page = lru_to_page(page_list);
> + if (nid == -1) {
> + nid = page_to_nid(page);
> + INIT_LIST_HEAD(&node_page_list);
> + nr_isolated[0] = nr_isolated[1] = 0;
> + }
> +
> + if (nid == page_to_nid(page)) {
> + list_move(&page->lru, &node_page_list);
> + nr_isolated[!!page_is_file_cache(page)] +=
> + hpage_nr_pages(page);
> + continue;
> + }
> +
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + nr_isolated[1]);
> + nr_reclaimed += shrink_page_list(&node_page_list,
> + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> + &dummy_stat, true);
> + while (!list_empty(&node_page_list)) {
> + struct page *page = lru_to_page(&node_page_list);
> +
> + list_del(&page->lru);
> + putback_lru_page(page);
> + }
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + -nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + -nr_isolated[1]);
> + nid = -1;
> + }
> +
> + if (!list_empty(&node_page_list)) {
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + nr_isolated[1]);
> + nr_reclaimed += shrink_page_list(&node_page_list,
> + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> + &dummy_stat, true);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + -nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + -nr_isolated[1]);
> +
> + while (!list_empty(&node_page_list)) {
> + struct page *page = lru_to_page(&node_page_list);
> +
> + list_del(&page->lru);
> + putback_lru_page(page);
> + }
> +
> + }

The NR_ISOLATED accounting, nid parsing etc. is really awkward and
makes it hard to see what the function actually does.

Can you please make those ISOLATED counters part of the isolation API?
Your patch really shows this is an overdue cleanup.

These are fast local percpu counters, we don't need the sprawling
batching we do all over vmscan.c, migrate.c, khugepaged.c,
compaction.c etc. Isolation can increase the counter page by page, and
reclaim or putback can likewise decrease them one by one.

It looks like mlock is the only user of the isolation API that does
not participate in the NR_ISOLATED_* counters protocol, but I don't
see why it wouldn't, or why doing so would hurt.

There also seem to be quite a few callsites that use the atomic
versions of the counter API when they're clearly under the irq-safe
lru_lock. That would be fixed automatically by this work as well.
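
Something along these lines, as a sketch (the helper names are
hypothetical, not an existing API):

	/* Fold NR_ISOLATED_* accounting into isolation/putback so that
	 * callers stop open-coding the batched mod_node_page_state(). */
	static inline void account_page_isolated(struct page *page)
	{
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_cache(page),
				hpage_nr_pages(page));
	}

	static inline void account_page_putback(struct page *page)
	{
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_cache(page),
				-hpage_nr_pages(page));
	}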

2019-05-31 17:36:59

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFCv2 5/6] mm: introduce external memory hinting API

On Thu, May 30, 2019 at 11:43 PM Minchan Kim <[email protected]> wrote:
>
> There are use cases where a centralized userspace daemon wants to give
> a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon (ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces a new syscall, process_madvise(2).
> It can give a hint to an external process identified by a pidfd.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it can affect another process's address range, only a privileged
> process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
> the target (e.g., being the same UID), can use it successfully.
>
> The syscall takes a cookie argument to provide atomicity (i.e., to detect
> that the target's address space changed after the monitor process parsed
> its address ranges, so the operation can fail if such a race happens).
> Although there is no interface to obtain a cookie at this moment, it is
> useful to reserve the argument now to avoid introducing another new
> syscall in the future. It could support *atomicity* for disruptive hints
> (e.g., MADV_DONTNEED|FREE).
> The flag argument is reserved for future use if we need to extend the API.

How about a compromise? Let's allow all madvise hints if the process
is calling process_madvise *on itself* (which will be useful once we
wire up the atomicity cookie) and restrict the cross-process case to
the hints you've mentioned. This way, the restriction on madvise hints
isn't tied to the specific API, but to the relationship between hinter
and hintee.
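
Concretely, something like this (a sketch; the function name and the
exact remote whitelist are placeholders, not part of the posted series):

	static bool process_madvise_behavior_valid(struct task_struct *task,
						   int behavior)
	{
		/* Hinting yourself: any ordinary madvise hint is fair game. */
		if (task == current)
			return madvise_behavior_valid(behavior);

		/* Hinting someone else: non-destructive hints only. */
		switch (behavior) {
		case MADV_COLD:
		case MADV_PAGEOUT:
			return true;
		default:
			return false;
		}
	}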

2019-05-31 23:17:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT

Hey Johannes,

On Fri, May 31, 2019 at 12:59:27PM -0400, Johannes Weiner wrote:
> Hi Minchan,
>
> this looks pretty straightforward to me, only one kink:
>
> On Fri, May 31, 2019 at 03:43:10PM +0900, Minchan Kim wrote:
[...]
>
> The NR_ISOLATED accounting, nid parsing etc. is really awkward and
> makes it hard to see what the function actually does.
>
> Can you please make those ISOLATED counters part of the isolation API?
> Your patch really shows this is an overdue cleanup.

Yeah, that was very painful.

>
> These are fast local percpu counters, we don't need the sprawling
> batching we do all over vmscan.c, migrate.c, khugepaged.c,
> compaction.c etc. Isolation can increase the counter page by page, and
> reclaim or putback can likewise decrease them one by one.
>
> It looks like mlock is the only user of the isolation API that does
> not participate in the NR_ISOLATED_* counters protocol, but I don't
> see why it wouldn't, or why doing so would hurt.
>
> There also seem to be quite a few callsites that use the atomic
> versions of the counter API when they're clearly under the irq-safe
> lru_lock. That would be fixed automatically by this work as well.

I agree on all points, so I will prepare a cleanup patch.

2019-05-31 23:20:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector array

Hi Yann,

On Fri, May 31, 2019 at 12:06:52PM +0200, Yann Droneaud wrote:
> Hi,
>
> On Friday, 31 May 2019 at 15:43 +0900, Minchan Kim wrote:
> >
> > diff --git a/include/uapi/asm-generic/mman-common.h
> > b/include/uapi/asm-generic/mman-common.h
> > index 92e347a89ddc..220c2b5eb961 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -75,4 +75,15 @@
> > #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> > PKEY_DISABLE_WRITE)
> >
> > +struct pr_madvise_param {
> > + int size; /* the size of this structure */
> > + int cookie; /* reserved to support atomicity */
> > + int nr_elem; /* count of the array fields below */
>
> Those should be unsigned.
>
> There's an implicit hole here in the ABI with 64-bit aligned pointers.
>
> > + int __user *hints; /* hints for each range */
> > + /* to store result of each operation */
> > + const struct iovec __user *results;
> > + /* input address ranges */
> > + const struct iovec __user *ranges;
>
> Using pointer types in a uAPI structure requires a 'compat' version of
> the syscall to be provided.
>
> The same applies when using iovec.

I will fix them when I submit the next revision.
Thanks for the review.
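
For reference, one possible compat-safe layout along the lines of the
review comments (illustrative only, not a final ABI):

	struct pr_madvise_param {
		__u32 size;	/* size of this structure */
		__u32 cookie;	/* reserved to support atomicity */
		__u32 nr_elem;	/* count of the array fields below */
		__u32 pad;	/* explicit padding, must be zero */
		__u64 hints;	/* user pointer to int hints[] */
		__u64 results;	/* user pointer to struct iovec results[] */
		__u64 ranges;	/* user pointer to struct iovec ranges[] */
	};

Fixed-width unsigned fields and u64-encoded user pointers avoid the
implicit hole and make the structure identical for 32-bit and 64-bit
userspace, so no separate compat path would be needed.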

2019-05-31 23:32:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

On Fri, May 31, 2019 at 04:35:45PM +0200, Oleksandr Natalenko wrote:
> On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> > > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > > > This patch factors out madvise's core functionality so that an upcoming
> > > > patch can reuse it without duplication. It shouldn't change any behavior.
> > > >
> > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > ---
> > > > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > > > 1 file changed, 101 insertions(+), 87 deletions(-)
> > > >
[...]
> > > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > > > + unsigned long start, size_t len_in, int behavior)
> > >
> > > Just a minor nitpick, but can we please have it named madvise_common,
> > > not madvise_core? This would follow a usual naming scheme, when some
> > > common functionality is factored out (like, for mutexes, semaphores
> > > etc), and within the kernel "core" usually means something completely
> > > different.
> >
> > Sure.
> >
> > >
> > > > {
> > > > unsigned long end, tmp;
> > > > struct vm_area_struct *vma, *prev;
> > > > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > >
> > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > > - return madvise_inject_error(behavior, start, start + len_in);
> > > > + return madvise_inject_error(behavior,
> > > > + start, start + len_in);
> > >
> > > Not sure what this change is about except changing the line length.
> > > Note, madvise_inject_error() still operates on "current" through
> > > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > > here. I know you've filtered out this hint later, so technically this
> > > is not an issue, but, maybe, this needs some attention too since we've
> > > already spotted it?
> >
> > It is a leftover from what I had done earlier. I actually modified it to
> > handle a remote task but changed my mind and didn't fix it, because
> > process_madvise will not support it at this moment. I'm not sure it's a
> > good idea to change it now for something that *might be done in the
> > future*, even though we have spotted it.
>
> I'd expect to have at least some comments in the code on why the other
> hints are disabled, so that if we already know some shortcomings, this
> information is not lost.

Okay, I will add a comment, but I don't want to change that piece of
code until someone wants to expose the poisoning to an external process.

>
> Of course, I don't care much about memory poisoning, but if it can be
> addressed now, let's address it now.
>
> >
> > >
> > > > #endif
> > > >
> > > > write = madvise_need_mmap_write(behavior);
> > > > if (write) {
> > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > + if (down_write_killable(&mm->mmap_sem))
> > > > return -EINTR;
> > >
> > > Do you still need that trick with mmget_still_valid() here?
> > > Something like:
> >
> > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > vma->vm_flags, technically, we don't need it if I understand
> > correctly. Right?
>
> I'd expect so, yes. But.
>
> Since we want this interface to be universal and to be able to cover
> various needs, and since my initial intention with working in this
> direction involved KSM, I'd ask you to enable KSM hints too, and once
> (and if) that happens, the work there is done under write lock, and
> you'll need this trick to be applied.
>
> Of course, I can do that myself later in a subsequent patch series once
> (and, again, if) your series is merged, but, maybe, we can cover this
> already especially given the fact that KSM hinting is a relatively easy
> task in this pile. I did some preliminary tests with it, and so far no
> dragons have started to roar.

Then, do you mind sending a patch based upon this series to expose
MADV_MERGEABLE to process_madvise? It would carry the right description
of why you want such a feature, which I couldn't provide since I don't
have enough material to write up the motivation. The patch could also
include the logic to prevent the coredump race, which fits better there,
since that is where we finally need to hold the mmap_sem write-side
lock. I will pick it up and rebase on top of it from then on.

Thanks.

2019-06-03 09:49:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
[...]
> >
> > Why do we restrict to pages that are single mapped?
>
> Because the page table of one of the processes sharing the page would have
> the access bit set, so in the end we couldn't reclaim the page. The more
> processes share it, the more likely reclaim is to fail.

So what? In other words, why should it be restricted solely based on the
map count? I can see a reason to restrict based on the access
permissions, because we do not want to make all sorts of side channel
attacks easier, but memory reclaim is capable of reclaiming shared
pages, and so far I haven't heard any sound argument why madvise should
skip those. Again, if there are any reasons, then document them in the
changelog.

[...]

> > Please document this, if this is really a desirable semantic because
> > then you have the same set of problems as we've had with the early
> > MADV_FREE implementation mentioned above.
>
> IIRC, the problem with MADV_FREE was that we couldn't discard freeable
> pages because the VM never scans the anonymous LRU on a swapless system.
> However, that is not our case, because we should reclaim these pages,
> not discard them.

Right. But there is still page cache reclaim. Is it expected that
explicitly cold memory doesn't get reclaimed because we have a
sufficient amount of page cache (a very common case) and we never age
anonymous memory because of that?

--
Michal Hocko
SUSE Labs

2019-06-03 15:46:17

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko <[email protected]> wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
[...]
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of one of the processes sharing the page would have
> > the access bit set, so in the end we couldn't reclaim the page. The more
> > processes share it, the more likely reclaim is to fail.
>
> So what? In other words, why should it be restricted solely based on the
> map count? I can see a reason to restrict based on the access
> permissions, because we do not want to make all sorts of side channel
> attacks easier, but memory reclaim is capable of reclaiming shared
> pages, and so far I haven't heard any sound argument why madvise should
> skip those. Again, if there are any reasons, then document them in the
> changelog.

Whether to reclaim shared pages is a policy decision best left to
userland, IMHO.

2019-06-03 17:56:29

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
[...]
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of one of the processes sharing the page would have
> > the access bit set, so in the end we couldn't reclaim the page. The more
> > processes share it, the more likely reclaim is to fail.
>
> So what? In other words, why should it be restricted solely based on the
> map count? I can see a reason to restrict based on the access
> permissions, because we do not want to make all sorts of side channel
> attacks easier, but memory reclaim is capable of reclaiming shared
> pages, and so far I haven't heard any sound argument why madvise should
> skip those. Again, if there are any reasons, then document them in the
> changelog.

I think it makes sense. It could be explained, but it also follows
established madvise semantics, and I'm not sure it's necessarily
Minchan's job to reiterate those.

Sharing isn't exactly transparent to userspace. The kernel does COW,
ksm etc. When you madvise, you can really only speak for your own
reference to that memory - "*I* am not using this."

This is in line with other madvise calls: MADV_DONTNEED clears the
local page table entries and drops the corresponding references, so
shared pages won't get freed. MADV_FREE clears the pte dirty bit and
also has explicit mapcount checks before clearing PG_dirty, so again
shared pages don't get freed.
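
For reference, the MADV_FREE side of that is roughly (a heavily
simplified sketch of the check in mm/madvise.c, not a verbatim excerpt):

	/* Only a solely-owned anonymous page may be marked lazily
	 * freeable; shared or file-backed pages are left alone. */
	if (!PageAnon(page) || page_mapcount(page) != 1)
		goto keep;
	ClearPageDirty(page);	/* reclaim may now discard instead of swap */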

2019-06-03 20:35:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
[...]
>
> I think it makes sense. It could be explained, but it also follows
> established madvise semantics, and I'm not sure it's necessarily
> Minchan's job to reiterate those.
>
> Sharing isn't exactly transparent to userspace. The kernel does COW,
> ksm etc. When you madvise, you can really only speak for your own
> reference to that memory - "*I* am not using this."
>
> This is in line with other madvise calls: MADV_DONTNEED clears the
> local page table entries and drops the corresponding references, so
> shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> also has explicit mapcount checks before clearing PG_dirty, so again
> shared pages don't get freed.

Right, being consistent with other madvise syscalls is certainly a way
to go. And I am not pushing one way or another; I just want this to be
documented with the reasoning behind it. Consistency is certainly an
argument to use.

On the other hand, these non-destructive madvise operations are quite
different, and the shared-page policy might differ as a result as well.
We are aging objects rather than destroying them, after all. Being able
to age the page cache with sufficient privileges sounds like a useful
usecase to me. In other words, you are able to cause the same effect
indirectly without the madvise operation, so it kinda makes sense to
allow it in a more sophisticated way.

That being said, madvise is just a _hint_ and the kernel will always be
free to ignore it, so the implementation might change in the future. We
can start simple and consistent with the existing MADV_$FOO operations
now and extend later on. But let's document the intention in the
changelog and make the decision clear. I am sorry to be so anal about
this, but I have seen so many undocumented ad-hoc policies, and it was
so hard, when revisiting them later on, to make some sense of them.
--
Michal Hocko
SUSE Labs

2019-06-03 21:52:55

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > happens but data should be preserved for future use. This could reduce
> > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > >
> > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > pages to evict early during memory pressure.
> > > > > > > >
> > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > active pages unless there is no access until the time.
> > > > > > >
> > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > >
> > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > >
> > > > > > Both file-backed and private pages can be deactivated, moving from
> > > > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > > > a map count of one. In other words,
> > > > > >
> > > > > > if (page_mapcount(page) <= 1)
> > > > > >         deactivate_page(page);
> > > > >
> > > > > Why do we restrict to pages that are single mapped?
> > > >
> > > > Because the page table of one of the processes sharing the page would
> > > > have the access bit set, we would ultimately fail to reclaim the page.
> > > > The more processes it is shared with, the more likely reclaim is to fail.
> > >
> > > So what? In other words why should it be restricted solely based on the
> > > map count. I can see a reason to restrict based on the access
> > > permissions because we do not want to simplify all sorts of side channel
> > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > far I haven't heard any sound argument why madvise should skip those.
> > > Again if there are any reasons, then document them in the changelog.
> >
> > I think it makes sense. It could be explained, but it also follows
> > established madvise semantics, and I'm not sure it's necessarily
> > Minchan's job to re-iterate those.
> >
> > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > ksm etc. When you madvise, you can really only speak for your own
> > reference to that memory - "*I* am not using this."
> >
> > This is in line with other madvise calls: MADV_DONTNEED clears the
> > local page table entries and drops the corresponding references, so
> > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > also has explicit mapcount checks before clearing PG_dirty, so again
> > shared pages don't get freed.
>
> Right, being consistent with other madvise syscalls is certainly a way
> to go. And I am not pushing one way or another, I just want this to be
> documented with the reasoning behind it. Consistency is certainly an argument
> to use.
>
> On the other hand these non-destructive madvise operations are quite
> different and the shared policy might differ as a result as well. We are
> aging objects rather than destroying them after all. Being able to age
> a pagecache with sufficient privileges sounds like a useful usecase to
> me. In other words you are able to cause the same effect indirectly
> without the madvise operation so it kinda makes sense to allow it in a
> more sophisticated way.

Right, I don't think it's about permission - as you say, you can do
this indirectly. Page reclaim is all about relative page order, so if
we thwarted you from demoting some pages, you could instead promote
other pages to cause a similar end result.

I think it's about intent. You're advising the kernel that *you're*
not using this memory and would like to have it cleared out based on
that knowledge. You could do the same by simply allocating the new
pages and have the kernel sort it out. However, if the kernel sorts it
out, it *will* look at other users of the page, and it might decide
that other pages are actually colder when considering all users.

When you ignore shared state, on the other hand, the pages you advise
out could refault right after. And then, not only did you not free up
the memory, but you also caused IO that may interfere with bringing in
the new data for which you tried to create room in the first place.

So I don't think it ever makes sense to override it.

But it might be better to drop the explicit mapcount check and instead
make the local pte young and call shrink_page_list() without the
TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
to handle references and shared pages exactly the same way it would if
those pages came fresh off the LRU tail, excluding only the reference
from the mapping that we're madvising.

2019-06-03 23:03:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

Hi Johannes,

On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > > > > > pressure happens, but that the data should be preserved for future
> > > > > > > > > use. This can reduce working-set eviction and so ends up improving
> > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not
> > > > > > > > > expected to be used in the near future. The hint can help the kernel
> > > > > > > > > decide which pages to evict early during memory pressure.
> > > > > > > > >
> > > > > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > > > > for reclaim - that is also why MADV_FREE moves pages to the head of
> > > > > > > > > the inactive LRU list. Therefore, when memory pressure happens, these
> > > > > > > > > pages will be reclaimed earlier than other active pages unless they
> > > > > > > > > are accessed in the meantime.
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file-backed and private pages can be deactivated, moving from
> > > > > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > > > > a map count of one. In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because the page table of one of the processes sharing the page would
> > > > > have the access bit set, we would ultimately fail to reclaim the page.
> > > > > The more processes it is shared with, the more likely reclaim is to fail.
> > > >
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > >
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > >
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > >
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> >
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with the reasoning behind it. Consistency is certainly an argument
> > to use.
> >
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
>
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.
>
> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
>
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.
>
> So I don't think it ever makes sense to override it.
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
^
old?

> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

I think the name change confused you. Here, MADV_COLD is deactivating
pages, not paging them out, so shrink_page_list doesn't matter.
And madvise_cold_pte_range already makes the local pte *old* (I guess
"young" was a typo).
I believe that's exactly what Michal wanted: just remove the
page_mapcount check and defer the decision to the normal page reclaim
policy. If I haven't missed your intention, it seems you and Michal
are on the same page. (Please correct me if you meant something else.)
I could drop the page_mapcount check in the next revision.

Thanks for the review!
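
A rough sketch of the per-pte step being converged on here, with the
page_mapcount() check dropped - an assumed shape for illustration, not
the actual patch:

static void madvise_cold_one(struct vm_area_struct *vma, pte_t *pte,
			     unsigned long addr)
{
	struct page *page = vm_normal_page(vma, addr, *pte);

	if (!page)
		return;

	/* drop only *our* reference signal: make the local pte old */
	ptep_test_and_clear_young(vma, addr, pte);

	/*
	 * Age the page - move it toward the head of the inactive LRU -
	 * and leave the actual reclaim decision, including references
	 * from other mappings, to the normal LRU scan under pressure.
	 */
	deactivate_page(page);
}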

2019-06-04 04:29:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > > pressure happens, but that the data should be preserved for future
> > > > > > use. This can reduce working-set eviction and so ends up improving
> > > > > > performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not
> > > > > > expected to be used in the near future. The hint can help the kernel
> > > > > > decide which pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > for reclaim - that is also why MADV_FREE moves pages to the head of
> > > > > > the inactive LRU list. Therefore, when memory pressure happens, these
> > > > > > pages will be reclaimed earlier than other active pages unless they
> > > > > > are accessed in the meantime.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file-backed and private pages can be deactivated, moving from
> > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > a map count of one. In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of one of the processes sharing the page would
> > have the access bit set, we would ultimately fail to reclaim the page.
> > The more processes it is shared with, the more likely reclaim is to fail.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Based on the review, I will go with removing that part, so that the
decision is deferred to VM reclaim.

>
> [...]
>
> > > Please document this, if this is really a desirable semantic because
> > > then you have the same set of problems as we've had with the early
> > > MADV_FREE implementation mentioned above.
> >
> > IIRC, the problem with MADV_FREE was that we couldn't discard freeable
> > pages because the VM never scans the anonymous LRU on a swapless
> > system. However, that's not our case, because we should reclaim those
> > pages, not discard them.
>
> Right. But there is still the page cache reclaim. Is it expected that
> an explicitly cold memory doesn't get reclaimed because we have a
> sufficient amount of page cache (a very common case) and we never age
> anonymous memory because of that?

If there are lots of used-once pages on the file LRU, I think there is
no need to reclaim anonymous pages, because that has a bigger overhead
due to IO. This has been true in the current VM policy for a long time.

The reclaim preference model for the hints, ordered by cost, is as follows:

MADV_DONTNEED >> MADV_PAGEOUT > used-once pages > MADV_FREE >= MADV_COLD

It is desirable for the new hints to be placed in the reclaim preference
order such that a) they do not overlap functionally with existing hints
and b) we keep a balanced ordering of disruptive and non-disruptive
hints.
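
To illustrate the ordering from the userspace side, a hypothetical
policy daemon could map app state to a hint following the cost model
above; the states and hint values here are assumptions for
illustration, not part of this series:

enum app_state { APP_FOREGROUND, APP_CACHED, APP_FROZEN };

#define MADV_COLD	20	/* illustrative values proposed by */
#define MADV_PAGEOUT	21	/* this series, not released uapi   */

static int pick_hint(enum app_state state)
{
	switch (state) {
	case APP_CACHED:	/* likely idle soon: age it cheaply */
		return MADV_COLD;
	case APP_FROZEN:	/* known idle: pay the IO cost up front */
		return MADV_PAGEOUT;
	default:		/* active: leave the working set alone */
		return -1;	/* no hint */
	}
}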

2019-06-04 06:57:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Mon 03-06-19 17:50:59, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > > > > > pressure happens, but that the data should be preserved for future
> > > > > > > > > use. This can reduce working-set eviction and so ends up improving
> > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not
> > > > > > > > > expected to be used in the near future. The hint can help the kernel
> > > > > > > > > decide which pages to evict early during memory pressure.
> > > > > > > > >
> > > > > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > > > > for reclaim - that is also why MADV_FREE moves pages to the head of
> > > > > > > > > the inactive LRU list. Therefore, when memory pressure happens, these
> > > > > > > > > pages will be reclaimed earlier than other active pages unless they
> > > > > > > > > are accessed in the meantime.
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file-backed and private pages can be deactivated, moving from
> > > > > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > > > > a map count of one. In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because the page table of one of the processes sharing the page would
> > > > > have the access bit set, we would ultimately fail to reclaim the page.
> > > > > The more processes it is shared with, the more likely reclaim is to fail.
> > > >
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > >
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > >
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > >
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> >
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with the reasoning behind it. Consistency is certainly an argument
> > to use.
> >
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
>
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.

There is one notable difference. If we allow an _easy_ way to demote a
shared resource, then we have to think about potential side-channel
attacks. Sure, you can generate memory pressure to cause the same
effect, but that is much harder and impractical in many cases.

> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
>
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.

That is a fair argument and I would tend to agree. On the other hand,
we are talking about potential usecases which tend to _know_ what they
are doing, and removing the possibility completely means they will not
be able to exploit the interface to the maximum. But as already
mentioned, starting simpler and more restricted is usually the better
choice when the semantics are not carved in stone and documented that
way from the very beginning.

> So I don't think it ever makes sense to override it.
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

Yeah that makes sense to me.

--
Michal Hocko
SUSE Labs
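
To make the side-channel concern concrete: if an unprivileged process
could evict pages of a *shared* file mapping on demand, it could time a
later read to learn whether another process touched the page in the
meantime. A sketch, with illustrative constants:

#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MADV_PAGEOUT 21		/* illustrative; proposed by this series */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* returns 1 if some other process likely faulted the page back in */
static int probe(volatile char *shared, size_t pagesz)
{
	uint64_t t0;

	madvise((void *)shared, pagesz, MADV_PAGEOUT);	/* evict it */
	sleep(1);					/* let the victim run */
	t0 = now_ns();
	(void)shared[0];				/* time the access */
	return (now_ns() - t0) < 2000;	/* fast => resident again */
}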

2019-06-04 06:58:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Tue 04-06-19 08:02:05, Minchan Kim wrote:
> Hi Johannes,
>
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > > > > > > pressure happens, but that the data should be preserved for future
> > > > > > > > > > use. This can reduce working-set eviction and so ends up improving
> > > > > > > > > > performance.
> > > > > > > > > >
> > > > > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not
> > > > > > > > > > expected to be used in the near future. The hint can help the kernel
> > > > > > > > > > decide which pages to evict early during memory pressure.
> > > > > > > > > >
> > > > > > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > > > > > for reclaim - that is also why MADV_FREE moves pages to the head of
> > > > > > > > > > the inactive LRU list. Therefore, when memory pressure happens, these
> > > > > > > > > > pages will be reclaimed earlier than other active pages unless they
> > > > > > > > > > are accessed in the meantime.
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file-backed and private pages can be deactivated, moving from
> > > > > > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > > > > > a map count of one. In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because the page table of one of the processes sharing the page would
> > > > > > have the access bit set, we would ultimately fail to reclaim the page.
> > > > > > The more processes it is shared with, the more likely reclaim is to fail.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > >
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > >
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > >
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > >
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with the reasoning behind it. Consistency is certainly an argument
> > > to use.
> > >
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > a pagecache with sufficient privileges sounds like a useful usecase to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> >
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly. Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> >
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> >
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
> ^
> old?
>
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> I think the name change confused you. Here, MADV_COLD is deactivating
> pages, not paging them out, so shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> "young" was a typo).
> I believe that's exactly what Michal wanted: just remove the
> page_mapcount check and defer the decision to the normal page reclaim
> policy. If I haven't missed your intention, it seems you and Michal
> are on the same page. (Please correct me if you meant something else.)

Indeed.

> I could drop the page_mapcount check in the next revision.

Yes please.
--
Michal Hocko
SUSE Labs

2019-06-04 07:05:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Tue 04-06-19 13:26:51, Minchan Kim wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
[...]
> > Right. But there is still the page cache reclaim. Is it expected that
> > an explicitly cold memory doesn't get reclaimed because we have a
> > sufficient amount of page cache (a very common case) and we never age
> > anonymous memory because of that?
>
> If there are lots of used-once pages on the file LRU, I think there is
> no need to reclaim anonymous pages, because that has a bigger overhead
> due to IO. This has been true in the current VM policy for a long time.

You are making an assumption which is not universally true. If I _know_
that there is a considerable amount of idle anonymous memory, then I
would really prefer it to go to swap rather than put pressure on
caching. The inactive list is not guaranteed to contain only used-once
pages, right?

Anyway, as already mentioned, we can start with a simpler implementation
for now and explicitly note that pagecache-biased reclaim is a known
potential problem. I am pretty sure somebody will come along sooner or
later, and we can address the problem then, with some good numbers to
back the additional complexity.

--
Michal Hocko
SUSE Labs

2019-06-04 12:09:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD

On Tue, Jun 04, 2019 at 08:02:05AM +0900, Minchan Kim wrote:
> Hi Johannes,
>
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > > > > > > pressure happens, but that the data should be preserved for future
> > > > > > > > > > use. This can reduce working-set eviction and so ends up improving
> > > > > > > > > > performance.
> > > > > > > > > >
> > > > > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not
> > > > > > > > > > expected to be used in the near future. The hint can help the kernel
> > > > > > > > > > decide which pages to evict early during memory pressure.
> > > > > > > > > >
> > > > > > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > > > > > for reclaim - that is also why MADV_FREE moves pages to the head of
> > > > > > > > > > the inactive LRU list. Therefore, when memory pressure happens, these
> > > > > > > > > > pages will be reclaimed earlier than other active pages unless they
> > > > > > > > > > are accessed in the meantime.
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file-backed and private pages can be deactivated, moving from
> > > > > > > > their active LRU to the corresponding inactive LRU, if the page has
> > > > > > > > a map count of one. In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because the page table of one of the processes sharing the page would
> > > > > > have the access bit set, we would ultimately fail to reclaim the page.
> > > > > > The more processes it is shared with, the more likely reclaim is to fail.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > >
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > >
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > >
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > >
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with the reasoning behind it. Consistency is certainly an argument
> > > to use.
> > >
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > a pagecache with sufficient privileges sounds like a useful usecase to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> >
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly. Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> >
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> >
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
> ^
> old?

Ah yes, of course. Clear the reference bit.

> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> I think the name change confused you. Here, MADV_COLD is deactivating
> pages, not paging them out, so shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> "young" was a typo).
> I believe that's exactly what Michal wanted: just remove the
> page_mapcount check and defer the decision to the normal page reclaim
> policy. If I haven't missed your intention, it seems you and Michal
> are on the same page. (Please correct me if you meant something else.)
> I could drop the page_mapcount check in the next revision.

Sorry, I was actually talking about the MADV_PAGEOUT patch in this
case, since this turned into a general discussion about how shared
pages should be handled, which applies to both operations.

My argument was for removing the check in both patches, yes, but to
additionally change the pageout patch to 1) make the advised pte old
and then 2) call shrink_page_list() WITHOUT ignore_access/references
so that it respects references from other mappings, if any.
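
A sketch of that two-step shape (assumed for illustration, not the
actual patch; the exact shrink_page_list() interface differs across
kernel releases):

/* 1) per pte: clear only *our* young bit, then isolate the page */
static void madvise_pageout_one(struct vm_area_struct *vma, pte_t *pte,
				unsigned long addr,
				struct list_head *page_list)
{
	struct page *page = vm_normal_page(vma, addr, *pte);

	if (!page)
		return;

	ptep_test_and_clear_young(vma, addr, pte);	/* make pte old */

	if (!isolate_lru_page(page))
		list_add(&page->lru, page_list);
}

/*
 * 2) then hand page_list to shrink_page_list() WITHOUT
 * TTU_IGNORE_ACCESS/ignore_references, so reclaim still honours
 * references from other mappings and shared pages in active use
 * elsewhere survive, exactly as if they had come off the LRU tail.
 */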

2019-06-05 13:29:01

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

Hi.

On Sat, Jun 01, 2019 at 08:29:59AM +0900, Minchan Kim wrote:
> > > > > /* snip a lot */
> > > > >
> > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > > > - return madvise_inject_error(behavior, start, start + len_in);
> > > > > + return madvise_inject_error(behavior,
> > > > > + start, start + len_in);
> > > >
> > > > Not sure what this change is about except changing the line length.
> > > > Note, madvise_inject_error() still operates on "current" through
> > > > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > > > here. I know you've filtered out this hint later, so technically this
> > > > is not an issue, but, maybe, this needs some attention too since we've
> > > > already spotted it?
> > >
> > > It is a leftover from an earlier change. I actually modified it to
> > > handle a remote task, but changed my mind and didn't fix it because
> > > process_madvise will not support it at this moment. I'm not sure it's
> > > a good idea to change it now for something that *might be done in the
> > > future*, even though we have spotted it.
> >
> > I'd expect to have at least some comments in code on why other hints
> > are disabled, so if we already know some shortcomings, this information
> > would not be lost.
>
> Okay, I will add a comment, but I do not want to change that piece of
> code until someone wants to expose poisoning to an external process.

Fair enough.

> > > > > write = madvise_need_mmap_write(behavior);
> > > > > if (write) {
> > > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > > + if (down_write_killable(&mm->mmap_sem))
> > > > > return -EINTR;
> > > >
> > > > Do you still need that trick with mmget_still_valid() here?
> > > > Something like:
> > >
> > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > vma->vm_flags, technically, we don't need it if I understand
> > > correctly. Right?
> >
> > I'd expect so, yes. But.
> >
> > Since we want this interface to be universal and to be able to cover
> > various needs, and since my initial intention with working in this
> > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > (and if) that happens, the work there is done under write lock, and
> > you'll need this trick to be applied.
> >
> > Of course, I can do that myself later in a subsequent patch series once
> > (and, again, if) your series is merged, but, maybe, we can cover this
> > already especially given the fact that KSM hinting is a relatively easy
> > task in this pile. I did some preliminary tests with it, and so far no
> > dragons have started to roar.
>
> Then, do you mind sending a patch based upon this series to expose
> MADV_MERGEABLE to process_madvise? It would have the right description
> of why you want such a feature, which I couldn't provide since I don't
> have enough material to write the motivation. The patch could also
> include the logic to prevent the coredump race, which is more
> appropriate there since in the end we need to hold the mmap_sem
> write-side lock. I will pick it up and rebase from then on.

Sure, I can. Would you really like to have it based on this exact
revision, or should I wait until you deal with MADV_COLD & Co and
re-iterate this part again?

Thanks.

--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer
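
For context, the mmget_still_valid() trick referenced above guards a
remote mmap_sem write lock against a racing coredump; roughly, as a
sketch based on the mm code of that era rather than this patchset:

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	if (current->mm != mm && !mmget_still_valid(mm)) {
		/* the remote mm is being core-dumped; don't touch it */
		up_write(&mm->mmap_sem);
		return -EINTR;
	}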

2019-06-10 10:21:04

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFCv2 4/6] mm: factor out madvise's core functionality

Hi Oleksandr,

On Wed, Jun 05, 2019 at 03:27:28PM +0200, Oleksandr Natalenko wrote:
< snip >
> > > > > > write = madvise_need_mmap_write(behavior);
> > > > > > if (write) {
> > > > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > > > + if (down_write_killable(&mm->mmap_sem))
> > > > > > return -EINTR;
> > > > >
> > > > > Do you still need that trick with mmget_still_valid() here?
> > > > > Something like:
> > > >
> > > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > > vma->vm_flags, technically, we don't need it if I understand
> > > > correctly. Right?
> > >
> > > I'd expect so, yes. But.
> > >
> > > Since we want this interface to be universal and to be able to cover
> > > various needs, and since my initial intention with working in this
> > > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > > (and if) that happens, the work there is done under write lock, and
> > > you'll need this trick to be applied.
> > >
> > > Of course, I can do that myself later in a subsequent patch series once
> > > (and, again, if) your series is merged, but, maybe, we can cover this
> > > already especially given the fact that KSM hinting is a relatively easy
> > > task in this pile. I did some preliminary tests with it, and so far no
> > > dragons have started to roar.
> >
> > Then, do you mind sending a patch based upon this series to expose
> > MADV_MERGEABLE to process_madvise? It would have the right description
> > of why you want such a feature, which I couldn't provide since I don't
> > have enough material to write the motivation. The patch could also
> > include the logic to prevent the coredump race, which is more
> > appropriate there since in the end we need to hold the mmap_sem
> > write-side lock. I will pick it up and rebase from then on.
>
> Sure, I can. Would you really like to have it based on this exact
> revision, or should I wait until you deal with MADV_COLD & Co and
> re-iterate this part again?

I'm okay with you sending your patch against this revision. I'm happy
to include it when I start a new thread for the process_madvise
discussion after resolving MADV_COLD|PAGEOUT.

Thanks.