Recently, Shaohua reported that MADV_FREE is much slower than
MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
applications stalled in direct reclaim, because kswapd cannot
reclaim pages as fast as the applications allocate them, which
causes lots of stalls and lock contention.
This patch throttles MADV_FREEing so that it takes effect only if
there are enough free pages in the system not to trigger background
or direct reclaim. Otherwise, MADV_FREE falls back to MADV_DONTNEED,
because there is no point in delaying the freeing if we know the
system is under memory pressure.
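For reference, the threshold computed in the patch below converts
min_free_kbytes into pages and then adds 50%, which roughly matches
the high watermark (the kernel derives the per-zone high watermark
as min + min/2). A worked example, assuming 4K pages (PAGE_SHIFT ==
12) and min_free_kbytes == 45056:

    threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
                          /* KB -> pages: 45056 >> 2 == 11264 */
    threshold = threshold + (threshold >> 1);
                          /* + 50%: 11264 + 5632 == 16896 */

So MADV_FREE stays lazy only while nr_free_pages() exceeds 16896
pages (about 66M free); otherwise it falls back to MADV_DONTNEED.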
I tested the patch on my 3G RAM machine with 12 CPUs and 8G of swap.
The test runs 12 processes, each doing:
loop = 5;
mmap(512M);
while (loop--) {
memset(512M);
madvise(MADV_FREE or MADV_DONTNEED);
}
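A minimal, self-contained C version of one test process might look
like the sketch below (error handling is trimmed, and the MADV_FREE
fallback define is an assumption for headers that predate this
series):

    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8             /* assumed value; not in old headers */
    #endif

    #define SIZE (512UL << 20)      /* 512M */

    int main(void)
    {
            int loop = 5;
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            while (loop--) {
                    memset(p, 1, SIZE);          /* touch every page */
                    madvise(p, SIZE, MADV_FREE); /* or MADV_DONTNEED */
            }
            return 0;
    }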
1) dontneed: 6.78user 234.09system 0:48.89elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
It is a clear win.
Reported-by: Shaohua Li <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..81bb26ecf064 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
* XXX: In this implementation, MADV_FREE works like
* MADV_DONTNEED on swapless system or full swap.
*/
- if (get_nr_swap_pages() > 0)
- return madvise_free(vma, prev, start, end);
+ if (get_nr_swap_pages() > 0) {
+ unsigned long threshold;
+ /*
+ * If we have trouble with memory pressure (ie,
+ * under the high watermark), free pages instantly.
+ */
+ threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
+ threshold = threshold + (threshold >> 1);
+ if (nr_free_pages() > threshold)
+ return madvise_free(vma, prev, start, end);
+ }
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
--
1.9.1
deactivate_page() was born for file invalidation, so it contains
logic that is specific to file-backed pages.
Make the function's name file-specific.
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 2 +-
mm/swap.c | 20 ++++++++++----------
mm/truncate.c | 2 +-
3 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca501e2..cee108cbe2d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void deactivate_file_page(struct page *page);
extern void swap_setup(void);
extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..5b2a60578f9c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -42,7 +42,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
/*
* This path almost never happens for VM activity - pages are normally
@@ -743,7 +743,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
* be write it out by flusher threads as this is much more effective
* than the single-page writeout from reclaim.
*/
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
int lru, file;
@@ -811,22 +811,22 @@ void lru_add_drain_cpu(int cpu)
local_irq_restore(flags);
}
- pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
activate_page_drain(cpu);
}
/**
- * deactivate_page - forcefully deactivate a page
+ * deactivate_file_page - forcefully deactivate a file page
* @page: page to deactivate
*
* This function hints the VM that @page is a good reclaim candidate,
* for example if its invalidation fails due to the page being dirty
* or under writeback.
*/
-void deactivate_page(struct page *page)
+void deactivate_file_page(struct page *page)
{
/*
* In a workload with many unevictable page such as mprotect, unevictable
@@ -836,11 +836,11 @@ void deactivate_page(struct page *page)
return;
if (likely(get_page_unless_zero(page))) {
- struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
if (!pagevec_add(pvec, page))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
- put_cpu_var(lru_deactivate_pvecs);
+ pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ put_cpu_var(lru_deactivate_file_pvecs);
}
}
@@ -872,7 +872,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index ddec5a5966d7..a0619c492f17 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -513,7 +513,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
* of interest and try to speed up its reclaim.
*/
if (!ret)
- deactivate_page(page);
+ deactivate_file_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
--
1.9.1
MADV_FREE is a hint that it is okay to discard pages if there is
memory pressure, and we use the reclaimers (ie, kswapd and direct
reclaim) to free them, so there is no point in keeping them on the
active anonymous LRU list. This patch moves them to the inactive
LRU list.
An arguable issue with this approach is whether we should put the
pages at the head or the tail of the inactive list. This patch
selects the head, because the kernel cannot be sure whether a page
is really cold or warm for every use case, but at least we know it
is not hot, so landing at the head of the inactive list is a good
compromise compared to staying on the active LRU.
Also, if we put recently hinted pages at the inactive list's tail,
the VM could discard cache-hot pages, which would be bad.
As a bonus, we don't need to move them back and forth in the
inactive list whenever the MADV_FREE syscall is called.
As a drawback, the VM has to scan more pages on the inactive
anonymous LRU to discard them, but that already happens whenever
pages on the inactive LRU are recently referenced, so I don't think
it is a major drawback.
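To illustrate the usage pattern this targets, a hypothetical
allocator free path might look like the sketch below (an
illustration, not code from any real allocator):

    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Instead of munmap()ing a free chunk, mark it lazily freeable.
     * The kernel may discard the pages under memory pressure; if the
     * chunk is reused before that happens, the pages are still
     * mapped and no page fault + zeroing cost is paid for the pages
     * that were never reclaimed.
     */
    static void chunk_release(void *chunk, size_t len)
    {
            madvise(chunk, len, MADV_FREE);
    }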
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
mm/madvise.c | 2 ++
mm/swap.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 38 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cee108cbe2d5..0428e4c84e1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);
extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 81bb26ecf064..6176039c69e4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -324,6 +324,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
ptent = pte_mkold(ptent);
ptent = pte_mkclean(ptent);
set_pte_at(mm, addr, pte, ptent);
+ if (PageActive(page))
+ deactivate_page(page);
tlb_remove_tlb_entry(tlb, pte, addr);
}
arch_leave_lazy_mmu_mode();
diff --git a/mm/swap.c b/mm/swap.c
index 5b2a60578f9c..393968c33667 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -43,6 +43,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
/*
* This path almost never happens for VM activity - pages are normally
@@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_event(PGDEACTIVATE);
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -815,6 +833,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
activate_page_drain(cpu);
}
@@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
}
}
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
void lru_add_drain(void)
{
lru_add_drain_cpu(get_cpu());
@@ -873,6 +907,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
--
1.9.1
Historically, we have disabled anonymous page reclaim completely on
systems that are swapoff-ed or configured without swap. That made
sense, but the problem for lazy-free pages is that on those systems
we never get a chance to discard hinted pages in the reclaim path.
That's why the current MADV_FREE implementation drops pages
instantly, like MADV_DONTNEED, on a swapless system, so users on
those systems cannot get the benefit of MADV_FREE.
This patch makes the VM try to reclaim anonymous pages on swapless
systems. The basic strategy is to try to reclaim anonymous pages if
there are pages on the *inactive anon* list.
On a system without swap configured, the VM doesn't age or reclaim
the anonymous LRU lists, so the inactive anon LRU list should always
be empty. Therefore, if there are pages on the inactive anon LRU,
they must be MADV_FREE hinted pages, and the VM tries to reclaim
them, discarding them or promoting them onto the active list.
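As an illustration, the size of the inactive anon list is visible in
/proc/meminfo. On a system that never configured swap and has no
MADV_FREE users, it should read (near) zero, though the exact value
varies:

    $ grep 'Inactive(anon)' /proc/meminfo
    Inactive(anon):        0 kB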
On a system with swap configured but not yet swapon-ed, the VM
doesn't age or reclaim the anonymous LRU lists either, so the
inactive anon LRU list would usually be empty, but not always: some
pages could remain on the inactive anon list if the admin had
enabled swap before. So if there are pages on the inactive anon LRU,
they are either MADV_FREE hinted pages or non-hinted pages left over
from before. The VM tries to reclaim them, discarding or promoting
them onto the active list, so after a while only hinted pages would
remain on the inactive anon LRU list.
On a system with swap enabled but full, the VM does aging but does
not try to reclaim in the current implementation. However, with this
patch we do try to reclaim those pages, so reclaim efficiency would
be worse than before. I don't know how many such workloads (ie, jobs
running with a full swap for a long time) we need to care about.
Comments welcome.
On a system with swap off and 3G RAM, 10 processes each run the loop
below:
loop = 12;
mmap(256M);
while (loop--) {
memset(256M);
madvise(MADV_FREE or MADV_DONTNEED);
sleep(1);
}
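A runnable worker would be the same as the C sketch shown with the
first patch, with SIZE set to 256M and the loop body changed to:

    while (loop--) {
            memset(p, 1, SIZE);
            madvise(p, SIZE, MADV_FREE); /* or MADV_DONTNEED */
            sleep(1);                    /* needs <unistd.h> */
    }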
1) dontneed: 5.40user 24.75system 0:15.36elapsed
2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 21 ++++++++-------------
mm/vmscan.c | 32 +++++++++++++++++++++-----------
2 files changed, 29 insertions(+), 24 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6176039c69e4..b3937e8012e6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -515,6 +515,8 @@ static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
+ unsigned long free_threshold;
+
switch (behavior) {
case MADV_REMOVE:
return madvise_remove(vma, prev, start, end);
@@ -522,20 +524,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_willneed(vma, prev, start, end);
case MADV_FREE:
/*
- * XXX: In this implementation, MADV_FREE works like
- * MADV_DONTNEED on swapless system or full swap.
+ * If we have trouble with memory pressure (ie,
+ * under the high watermark), free pages instantly.
*/
- if (get_nr_swap_pages() > 0) {
- unsigned long threshold;
- /*
- * If we have trouble with memory pressure (ie,
- * under the high watermark), free pages instantly.
- */
- threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
- threshold = threshold + (threshold >> 1);
- if (nr_free_pages() > threshold)
- return madvise_free(vma, prev, start, end);
- }
+ free_threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
+ free_threshold = free_threshold + (free_threshold >> 1);
+ if (nr_free_pages() > free_threshold)
+ return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 671e47edb584..1574cd783ab9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
static unsigned long zone_reclaimable_pages(struct zone *zone)
{
- int nr;
+ unsigned long file, anon;
- nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ file = zone_page_state(zone, NR_ACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
- nr += zone_page_state(zone, NR_ACTIVE_ANON) +
- zone_page_state(zone, NR_INACTIVE_ANON);
+ /*
+ * Even if there is no swap space, we should consider
+ * lazy-free pages in the inactive anon LRU list.
+ */
+ if (total_swap_pages > 0) {
+ anon = zone_page_state(zone, NR_ACTIVE_ANON) +
+ zone_page_state(zone, NR_INACTIVE_ANON);
+ } else {
+ anon = zone_page_state(zone, NR_INACTIVE_ANON);
+ }
- return nr;
+ return file + anon;
}
bool zone_reclaimable(struct zone *zone)
@@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
if (!global_reclaim(sc))
force_scan = true;
- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
+ /*
+ * If we have no inactive anon page, do not bother scanning
+ * anon pages.
+ */
+ if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2344,8 +2354,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
*/
pages_for_compaction = (2UL << sc->order);
inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
- inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+ inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
return true;
--
1.9.1
On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> Recently, Shaohua reported that MADV_FREE is much slower than
> MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> applications went to stall with direct reclaim since kswapd's
> reclaim speed isn't fast than applications's allocation speed
> so that it causes lots of stall and lock contention.
I am not sure I understand this correctly. So the issue is that
there is a huge number of MADV_FREE pages on the LRU, and they are
not close to the tail of the list, so reclaim has to do a lot of
work before it starts dropping them?
> This patch throttles MADV_FREEing so it works only if there
> are enough pages in the system which will not trigger backgroud/
> direct reclaim. Otherwise, MADV_FREE falls back to MADV_DONTNEED
> because there is no point to delay freeing if we know system
> is under memory pressure.
Hmm, this still conforms to the documentation, because the kernel is
free to free pages at its convenience. I am not sure this is a good
idea, though. Why should some MADV_FREE calls be treated
differently? Wouldn't that lead to hard-to-predict behavior? E.g.
LIFO-reused blocks would work without long stalls most of the time,
except when there is memory pressure.
Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
two calls is different.
> When I test the patch on my 3G machine + 12 CPU + 8G swap,
> test: 12 processes
>
> loop = 5;
> mmap(512M);
Who is eating the rest of the memory?
> while (loop--) {
> memset(512M);
> madvise(MADV_FREE or MADV_DONTNEED);
> }
>
> 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> 3) madvfree + this ptach: 5.68user 113.42system 0:36.52elapsed
>
> It's clearly win.
>
> Reported-by: Shaohua Li <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
I don't know. This looks like a hack with hard-to-predict
consequences which might trigger pathological corner cases.
> ---
> mm/madvise.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..81bb26ecf064 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> * XXX: In this implementation, MADV_FREE works like
> * MADV_DONTNEED on swapless system or full swap.
> */
> - if (get_nr_swap_pages() > 0)
> - return madvise_free(vma, prev, start, end);
> + if (get_nr_swap_pages() > 0) {
> + unsigned long threshold;
> + /*
> + * If we have trobule with memory pressure(ie,
> + * under high watermark), free pages instantly.
> + */
> + threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> + threshold = threshold + (threshold >> 1);
Why threshold += threshold >> 1 ?
> + if (nr_free_pages() > threshold)
> + return madvise_free(vma, prev, start, end);
> + }
> /* passthrough */
> case MADV_DONTNEED:
> return madvise_dontneed(vma, prev, start, end);
> --
> 1.9.1
>
--
Michal Hocko
SUSE Labs
On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> MADV_FREE is hint that it's okay to discard pages if memory is
> pressure and we uses reclaimers(ie, kswapd and direct reclaim)
s@if memory is pressure@if there is memory pressure@
> to free them so there is no worth to remain them in active
> anonymous LRU list so this patch moves them to inactive LRU list.
Makes sense to me.
> A arguable issue for the approach is whether we should put it
> head or tail in inactive list
Is it really arguable? Why should active MADV_FREE pages appear
before those which were living on the inactive list? This doesn't
make any sense to me.
> and selected it as head because
> kernel cannot make sure it's really cold or warm for every usecase
> but at least we know it's not hot so landing of inactive head
> would be comprimise if it stayed in active LRU.
This is really hard to read. What do you think about the following
wording?
"
The active status of those pages is cleared and they are moved to the
head of the inactive LRU. This means that MADV_FREE-ed pages which
were living on the inactive list are reclaimed first because they
are more likely to be cold rather than recently active pages.
"
> As well, if we put recent hinted pages to inactive's tail,
> VM could discard cache hot pages, which would be bad.
>
> As a bonus, we don't need to move them back and forth in inactive
> list whenever MADV_SYSCALL syscall is called.
>
> As drawback, VM should scan more pages in inactive anonymous LRU
> to discard but it has happened all the time if recent reference
> happens on those pages in inactive LRU list so I don't think
> it's not a main drawback.
Rather than the above paragraphs, I would like to see a description
of why this is needed. Something like the following?
"
This is fixing a suboptimal behavior of MADV_FREE when pages living on
the active list will sit there for a long time even under memory
pressure while the inactive list is reclaimed heavily. This basically
breaks the whole purpose of using MADV_FREE to help the system to free
memory which might not be used.
"
> Signed-off-by: Minchan Kim <[email protected]>
Other than that the patch looks good to me.
Acked-by: Michal Hocko <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/madvise.c | 2 ++
> mm/swap.c | 35 +++++++++++++++++++++++++++++++++++
> 3 files changed, 38 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index cee108cbe2d5..0428e4c84e1d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
> extern void lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 81bb26ecf064..6176039c69e4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -324,6 +324,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> ptent = pte_mkold(ptent);
> ptent = pte_mkclean(ptent);
> set_pte_at(mm, addr, pte, ptent);
> + if (PageActive(page))
> + deactivate_page(page);
> tlb_remove_tlb_entry(tlb, pte, addr);
> }
> arch_leave_lazy_mmu_mode();
> diff --git a/mm/swap.c b/mm/swap.c
> index 5b2a60578f9c..393968c33667 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -43,6 +43,7 @@ int page_cluster;
> static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>
> /*
> * This path almost never happens for VM activity - pages are normally
> @@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> update_page_reclaim_stat(lruvec, file, 0);
> }
>
> +
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> + void *arg)
> +{
> + if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> + int file = page_is_file_cache(page);
> + int lru = page_lru_base_type(page);
> +
> + del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> + ClearPageActive(page);
> + add_page_to_lru_list(page, lruvec, lru);
> +
> + __count_vm_event(PGDEACTIVATE);
> + update_page_reclaim_stat(lruvec, file, 0);
> + }
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -815,6 +833,10 @@ void lru_add_drain_cpu(int cpu)
> if (pagevec_count(pvec))
> pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>
> + pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> + if (pagevec_count(pvec))
> + pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +
> activate_page_drain(cpu);
> }
>
> @@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
> }
> }
>
> +void deactivate_page(struct page *page)
> +{
> + if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> + page_cache_get(page);
> + if (!pagevec_add(pvec, page))
> + pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> + put_cpu_var(lru_deactivate_pvecs);
> + }
> +}
> +
> void lru_add_drain(void)
> {
> lru_add_drain_cpu(get_cpu());
> @@ -873,6 +907,7 @@ void lru_add_drain_all(void)
> if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
> pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
> pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> + pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> need_activate_page_drain(cpu)) {
> INIT_WORK(work, lru_add_drain_per_cpu);
> schedule_work_on(cpu, work);
> --
> 1.9.1
>
--
Michal Hocko
SUSE Labs
On Tue 24-02-15 17:18:17, Minchan Kim wrote:
> Historically, we have disabled reclaiming completely with swapoff
> or non-swap configurable system. It did make sense but problem
> for lazy free pages is that we couldn't get a chance to discard
> hinted pages in reclaim path in those systems.
>
> That's why current MADV_FREE implmentation drop pages instantly
> like MADV_DONTNNED in swapless system so that users on those
> systems couldn't get the benefit of MADV_FREE.
>
> This patch makes VM try to reclaim anonymous pages on swapless
> system. Basic strategy is to try reclaim anonymous pages
> if there are pages in *inactive anon*.
Nice idea.
> In non-swap config system, VM doesn't do aging/reclaiming
> anonymous LRU list so inactive anon LRU list should be always
> empty. So, if there are some pages in inactive anon LRU,
> it means they are MADV_FREE hinted pages so VM try to reclaim
> them and discard or promote them onto active list.
>
> In swap-config-but-not-yet-swapon, VM doesn't do aging/reclaiming
> anonymous LRU list so inactive anon LRU list would be empty but
> it might be not always true because some pages could remain
> inactive anon list if the admin had swapon before. So, if there
> are some pages in inactive anon LRU, it means they are MADV_FREE
> hinted pages or non-hinted pages which have stayed before.
> VM try to reclaim them and discard or promote them onto active list
> so we could have only hinted pages on inactive anon LRU list
> after a while.
>
> In swap-enabled-and-full, VM does aging but not try to reclaim
> in current implementation. However, we try to reclaim them by
> this patch so reclaim efficiency would be worse than old.
> I don't know how many such workloads(ie, doing a job with
> full-swap for a long time) we should take care of are.
Talking about performance when the swap is full is a bit funny. The
system is most probably crawling at that point anyway.
> Hope the comment.
>
> On swapoff system with 3G ram, there are 10 processes with below
>
> loop = 12;
> mmap(256M);
Again, what is the memory pressure that fills up the rest of the
memory?
> while (loop--) {
> memset(256M);
> madvise(MADV_FREE or MADV_DONTNEED);
> sleep(1);
> }
>
> 1) dontneed: 5.40user 24.75system 0:15.36elapsed
> 2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/madvise.c | 21 ++++++++-------------
> mm/vmscan.c | 32 +++++++++++++++++++++-----------
> 2 files changed, 29 insertions(+), 24 deletions(-)
>
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 671e47edb584..1574cd783ab9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
>
> static unsigned long zone_reclaimable_pages(struct zone *zone)
> {
> - int nr;
> + unsigned long file, anon;
>
> - nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> - zone_page_state(zone, NR_INACTIVE_FILE);
> + file = zone_page_state(zone, NR_ACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_FILE);
>
> - if (get_nr_swap_pages() > 0)
> - nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> - zone_page_state(zone, NR_INACTIVE_ANON);
> + /*
> + * Although there is no swap space, we should consider
> + * lazy free pages in inactive anon LRU list.
> + */
> + if (total_swap_pages > 0) {
> + anon = zone_page_state(zone, NR_ACTIVE_ANON) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> + } else {
> + anon = zone_page_state(zone, NR_INACTIVE_ANON);
> + }
Hmm, this doesn't look right to me. The zone would be considered
reclaimable even though, in fact, it is not. Most systems are
configured with swap, and this would break those AFAICS.
> - return nr;
> + return file + anon;
> }
>
> bool zone_reclaimable(struct zone *zone)
> @@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> if (!global_reclaim(sc))
> force_scan = true;
>
> - /* If we have no swap space, do not bother scanning anon pages. */
> - if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
> + /*
> + * If we have no inactive anon page, do not bother scanning
> + * anon pages.
> + */
> + if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
> scan_balance = SCAN_FILE;
> goto out;
> }
but later in the function we are considering active anon pages as well.
Does this work at all?
> @@ -2344,8 +2354,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
> */
> pages_for_compaction = (2UL << sc->order);
> inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> - if (get_nr_swap_pages() > 0)
> - inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> + inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +
> if (sc->nr_reclaimed < pages_for_compaction &&
> inactive_lru_pages > pages_for_compaction)
> return true;
> --
> 1.9.1
>
--
Michal Hocko
SUSE Labs
On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > Recently, Shaohua reported that MADV_FREE is much slower than
> > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > applications went to stall with direct reclaim since kswapd's
> > reclaim speed isn't fast than applications's allocation speed
> > so that it causes lots of stall and lock contention.
>
> I am not sure I understand this correctly. So the issue is that there is
> huge number of MADV_FREE on the LRU and they are not close to the tail
> of the list so the reclaim has to do a lot of work before it starts
> dropping them?
I thought the main reason is the current reclaim strategy. Anonymous
pages are considered hard to reclaim under the current policy; the
VM is biased toward reclaiming file pages (anon pages start on the
active list, and a referenced pte will reactivate anon pages and
increase the rotate count).
> > This patch throttles MADV_FREEing so it works only if there
> > are enough pages in the system which will not trigger backgroud/
> > direct reclaim. Otherwise, MADV_FREE falls back to MADV_DONTNEED
> > because there is no point to delay freeing if we know system
> > is under memory pressure.
>
> Hmm, this is still conforming to the documentation because the kernel is
> free to free pages at its convenience. I am not sure this is a good
> idea, though. Why some MADV_FREE calls should be treated differently?
> Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> would work without long stalls most of the time - except when there is a
> memory pressure.
>
> Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> two calls is different.
>
> > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > test: 12 processes
> >
> > loop = 5;
> > mmap(512M);
>
> Who is eating the rest of the memory?
>
> > while (loop--) {
> > memset(512M);
> > madvise(MADV_FREE or MADV_DONTNEED);
> > }
> >
> > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > 3) madvfree + this ptach: 5.68user 113.42system 0:36.52elapsed
> >
> > It's clearly win.
> >
> > Reported-by: Shaohua Li <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> I don't know. This looks like a hack with hard to predict consequences
> which might trigger pathological corner cases.
This is a big improvement in practice, but as Michal said, it will
introduce unpredictable behavior: pages madvfree-ed before the
memory pressure will only be freed later. Plus, this doesn't change
the situation that madvfree pages are hard to free (even with the
3rd patch). Of course, that is not introduced by the madvfree patch;
the VM is biased toward freeing file pages.
Thanks,
Shaohua
Hi Michal,
On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > Recently, Shaohua reported that MADV_FREE is much slower than
> > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > applications went to stall with direct reclaim since kswapd's
> > reclaim speed isn't fast than applications's allocation speed
> > so that it causes lots of stall and lock contention.
>
> I am not sure I understand this correctly. So the issue is that there is
> huge number of MADV_FREE on the LRU and they are not close to the tail
> of the list so the reclaim has to do a lot of work before it starts
> dropping them?
No, Shaohua already tested deactivating hinted pages to the
head/tail of the inactive anon LRU, and he said it didn't solve his
problem. I thought the main culprit was the scanning/rotating/
throttling in the direct reclaim path.
>
> > This patch throttles MADV_FREEing so it works only if there
> > are enough pages in the system which will not trigger backgroud/
> > direct reclaim. Otherwise, MADV_FREE falls back to MADV_DONTNEED
> > because there is no point to delay freeing if we know system
> > is under memory pressure.
>
> Hmm, this is still conforming to the documentation because the kernel is
> free to free pages at its convenience. I am not sure this is a good
> idea, though. Why some MADV_FREE calls should be treated differently?
It's a hint for the VM to free pages, so I think it's okay to free
them instantly sometimes if that can avoid something more costly,
like a system stall. IOW, madvise is just a hint, not a strict rule.
> Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> would work without long stalls most of the time - except when there is a
> memory pressure.
True.
>
> Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> two calls is different.
I agree it's not an apples-to-apples comparison.
Actually, MADV_FREE moves the cost from the hot path (ie, the system
call path) to the slow path (ie, the reclaim context), so it would
be slower under continuous heavy memory pressure, due to the large
overhead of freeing pages in the reclaim context. So it would be
good if the kernel detected that nicely and prevented the situation.
This patch aims for that.
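Roughly, as a sketch of the contract rather than of any new
semantics:

    /* MADV_DONTNEED: cost is paid up front, in the syscall */
    madvise(p, len, MADV_DONTNEED);
    /* pages are freed now; the next touch faults in a fresh
       zeroed page */

    /* MADV_FREE: cost is deferred to the reclaim context */
    madvise(p, len, MADV_FREE);
    /* pages stay mapped; discarded only if reclaim gets to
       them before they are reused */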
>
> > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > test: 12 processes
> >
> > loop = 5;
> > mmap(512M);
>
> Who is eating the rest of the memory?
As I wrote above, there are 12 processes running the test below.
IOW, 512M * 12 = 6G, but system RAM is just 3G.
>
> > while (loop--) {
> > memset(512M);
> > madvise(MADV_FREE or MADV_DONTNEED);
> > }
> >
> > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > 3) madvfree + this ptach: 5.68user 113.42system 0:36.52elapsed
> >
> > It's clearly win.
> >
> > Reported-by: Shaohua Li <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> I don't know. This looks like a hack with hard to predict consequences
> which might trigger pathological corner cases.
Yeah, it might be. That's why I tagged it RFC; I hope others can
suggest a better idea.
>
> > ---
> > mm/madvise.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8921c2..81bb26ecf064 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > * XXX: In this implementation, MADV_FREE works like
> > * MADV_DONTNEED on swapless system or full swap.
> > */
> > - if (get_nr_swap_pages() > 0)
> > - return madvise_free(vma, prev, start, end);
> > + if (get_nr_swap_pages() > 0) {
> > + unsigned long threshold;
> > + /*
> > + * If we have trobule with memory pressure(ie,
> > + * under high watermark), free pages instantly.
> > + */
> > + threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> > + threshold = threshold + (threshold >> 1);
>
> Why threshold += threshold >> 1 ?
I wanted this fallback to trigger once free pages drop below the
high watermark.
>
> > + if (nr_free_pages() > threshold)
> > + return madvise_free(vma, prev, start, end);
> > + }
> > /* passthrough */
> > case MADV_DONTNEED:
> > return madvise_dontneed(vma, prev, start, end);
> > --
> > 1.9.1
> >
>
> --
> Michal Hocko
> SUSE Labs
--
Kind regards,
Minchan Kim
On Tue, Feb 24, 2015 at 05:14:08PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> > MADV_FREE is hint that it's okay to discard pages if memory is
> > pressure and we uses reclaimers(ie, kswapd and direct reclaim)
>
> s@if memory is pressure@if there is memory pressure@
>
> > to free them so there is no worth to remain them in active
> > anonymous LRU list so this patch moves them to inactive LRU list.
>
> Makes sense to me.
>
> > A arguable issue for the approach is whether we should put it
> > head or tail in inactive list
>
> Is it really arguable? Why should active MADV_FREE pages appear before
> those which were living on the inactive list. This doesn't make any
> sense to me.
It would be better to drop garbage pages (ie, pages already freed
back to the allocator) rather than swap them out, and since anon LRU
aging is currently a sequential model, the inactive list can contain
a lot of the working set, so putting hinted pages at the tail of the
LRU could improve reclaim efficiency.
That's why I said it might be arguable.
>
> > and selected it as head because
> > kernel cannot make sure it's really cold or warm for every usecase
> > but at least we know it's not hot so landing of inactive head
> > would be comprimise if it stayed in active LRU.
>
> This is really hard to read. What do you think about the following
> wording?
> "
> The active status of those pages is cleared and they are moved to the
> head of the inactive LRU. This means that MADV_FREE-ed pages which
> were living on the inactive list are reclaimed first because they
> are more likely to be cold rather than recently active pages.
> "
My wording focuses on why we should put them at the head of the
inactive list, so it's orthogonal to your wording; maybe mine could
serve as a complement to it.
>
> > As well, if we put recent hinted pages to inactive's tail,
> > VM could discard cache hot pages, which would be bad.
> >
> > As a bonus, we don't need to move them back and forth in inactive
> > list whenever MADV_SYSCALL syscall is called.
> >
> > As drawback, VM should scan more pages in inactive anonymous LRU
> > to discard but it has happened all the time if recent reference
> > happens on those pages in inactive LRU list so I don't think
> > it's not a main drawback.
>
> Rather than the above paragraphs I would like to see a description why
> this is needed. Something like the following?
> "
> This is fixing a suboptimal behavior of MADV_FREE when pages living on
> the active list will sit there for a long time even under memory
> pressure while the inactive list is reclaimed heavily. This basically
> breaks the whole purpose of using MADV_FREE to help the system to free
> memory which is might not be used.
> "
Good to me. I will add this. Thanks!
>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Other than that the patch looks good to me.
> Acked-by: Michal Hocko <[email protected]>
Thanks for the review, Michal!
--
Kind regards,
Minchan Kim
On Tue, Feb 24, 2015 at 05:51:46PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:17, Minchan Kim wrote:
> > Historically, we have disabled reclaiming completely with swapoff
> > or non-swap configurable system. It did make sense but problem
> > for lazy free pages is that we couldn't get a chance to discard
> > hinted pages in reclaim path in those systems.
> >
> > That's why current MADV_FREE implmentation drop pages instantly
> > like MADV_DONTNNED in swapless system so that users on those
> > systems couldn't get the benefit of MADV_FREE.
> >
> > This patch makes VM try to reclaim anonymous pages on swapless
> > system. Basic strategy is to try reclaim anonymous pages
> > if there are pages in *inactive anon*.
>
> Nice idea.
>
> > In non-swap config system, VM doesn't do aging/reclaiming
> > anonymous LRU list so inactive anon LRU list should be always
> > empty. So, if there are some pages in inactive anon LRU,
> > it means they are MADV_FREE hinted pages so VM try to reclaim
> > them and discard or promote them onto active list.
> >
> > In swap-config-but-not-yet-swapon, VM doesn't do aging/reclaiming
> > anonymous LRU list so inactive anon LRU list would be empty but
> > it might be not always true because some pages could remain
> > inactive anon list if the admin had swapon before. So, if there
> > are some pages in inactive anon LRU, it means they are MADV_FREE
> > hinted pages or non-hinted pages which have stayed before.
> > VM try to reclaim them and discard or promote them onto active list
> > so we could have only hinted pages on inactive anon LRU list
> > after a while.
> >
> > In swap-enabled-and-full, VM does aging but not try to reclaim
> > in current implementation. However, we try to reclaim them by
> > this patch so reclaim efficiency would be worse than old.
> > I don't know how many such workloads(ie, doing a job with
> > full-swap for a long time) we should take care of are.
>
> Talking about performance when the swap is full is a bit funny. The
> system is crawling at the time most probably.
>
> > Hope the comment.
> >
> > On swapoff system with 3G ram, there are 10 processes with below
> >
> > loop = 12;
> > mmap(256M);
>
> again what is the memory pressure which fills up the rest of the memory.
256M * 10 processes = 2.5G on a 3G RAM machine with swap off.
>
> > while (loop--) {
> > memset(256M);
> > madvise(MADV_FREE or MADV_DONTNEED);
> > sleep(1);
> > }
> >
> > 1) dontneed: 5.40user 24.75system 0:15.36elapsed
> > 2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/madvise.c | 21 ++++++++-------------
> > mm/vmscan.c | 32 +++++++++++++++++++++-----------
> > 2 files changed, 29 insertions(+), 24 deletions(-)
> >
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 671e47edb584..1574cd783ab9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
> >
> > static unsigned long zone_reclaimable_pages(struct zone *zone)
> > {
> > - int nr;
> > + unsigned long file, anon;
> >
> > - nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> > - zone_page_state(zone, NR_INACTIVE_FILE);
> > + file = zone_page_state(zone, NR_ACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_FILE);
> >
> > - if (get_nr_swap_pages() > 0)
> > - nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> > - zone_page_state(zone, NR_INACTIVE_ANON);
> > + /*
> > + * Although there is no swap space, we should consider
> > + * lazy free pages in inactive anon LRU list.
> > + */
> > + if (total_swap_pages > 0) {
> > + anon = zone_page_state(zone, NR_ACTIVE_ANON) +
> > + zone_page_state(zone, NR_INACTIVE_ANON);
> > + } else {
> > + anon = zone_page_state(zone, NR_INACTIVE_ANON);
> > + }
>
> Hmm, this doesn't look right to me. The zone would be considered
> reclaimable even though it is not in fact. Most of the systems are
> configured with swap and this would break those AFAICS.
Old behavior
(af: active file, if: inactive file, aa: active anon, ia: inactive anon):
1) swapoff : af + if
2) swapon + free space on swap : af + if + aa + ia
3) swapon + no free space on swap : af + if
New behavior:
1) swapoff : af + if + ia
Here, ia would be zero most of the time unless there are hinted
pages on the inactive anon LRU, so this is no different from the old
behavior. If there are hinted pages on the inactive anon LRU, they
are potentially reclaimable pages, so we should include them in the
calculation.
2) swapon + free space on swap : af + if + aa + ia
Same as the old behavior.
3) swapon + no free space on swap : af + if + ia
I agree this case differs from the old behavior, so it would have
side effects (ex, more reclaim attempts, delayed OOM). But I don't
know how else we could take care of a fully-swap-used system.
>
> > - return nr;
> > + return file + anon;
> > }
> >
> > bool zone_reclaimable(struct zone *zone)
> > @@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > if (!global_reclaim(sc))
> > force_scan = true;
> >
> > - /* If we have no swap space, do not bother scanning anon pages. */
> > - if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
> > + /*
> > + * If we have no inactive anon page, do not bother scanning
> > + * anon pages.
> > + */
> > + if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
> > scan_balance = SCAN_FILE;
> > goto out;
> > }
>
> but later in the function we are considering active anon pages as well.
> Does this work at all?
I don't understand what your concern is. Could you elaborate?
What I want is just to try reclaiming anonymous pages in the
situation where there is no swap but there are pages on the inactive
anon list.
Thanks.
--
Kind regards,
Minchan Kim
On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> Hi Michal,
>
> On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > applications went to stall with direct reclaim since kswapd's
> > > reclaim speed isn't fast than applications's allocation speed
> > > so that it causes lots of stall and lock contention.
> >
> > I am not sure I understand this correctly. So the issue is that there is
> > huge number of MADV_FREE on the LRU and they are not close to the tail
> > of the list so the reclaim has to do a lot of work before it starts
> > dropping them?
>
> No, Shaohua already tested deactivating of hinted pages to head/tail
> of inactive anon LRU and he said it didn't solve his problem.
> I thought main culprit was scanning/rotating/throttling in
> direct reclaim path.
I investigated my workload and found that most of the slowness came
from swap-in.
1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin
If we find that hinted pages were already swapped out by the time
the syscall is called, it is pointless to keep those entries in the
page table. Instead, free the swap entries, because swap-in is more
expensive than (alloc page + zeroing).
I tested the quick fix below, which reduced swap-ins from 879,585 to
1,878. Elapsed time was:
1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
Although it was not as good as throttling, it's better than before,
and it's orthogonal to throttling, so I'd like to merge this first,
ahead of the arguable throttling. Any comments?
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..d41ae76d3e54 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ swp_entry_t entry;
unsigned long next;
+ int rss = 0;
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
@@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;
- if (!pte_present(ptent))
+ if (pte_none(ptent))
continue;
+ if (!pte_present(ptent)) {
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ continue;
+ rss--;
+ free_swap_and_cache(entry);
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ continue;
+ }
+
page = vm_normal_page(vma, addr, ptent);
if (!page)
continue;
@@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(mm, addr, pte, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
}
+
+ if (rss) {
+ if (current->mm == mm)
+ sync_mm_rss(mm);
+
+ add_mm_counter(mm, MM_SWAPENTS, rss);
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
next:
--
1.9.1
>
> >
> > > This patch throttles MADV_FREEing so it works only if there
> > > are enough pages in the system which will not trigger backgroud/
> > > direct reclaim. Otherwise, MADV_FREE falls back to MADV_DONTNEED
> > > because there is no point to delay freeing if we know system
> > > is under memory pressure.
> >
> > Hmm, this is still conforming to the documentation because the kernel is
> > free to free pages at its convenience. I am not sure this is a good
> > idea, though. Why some MADV_FREE calls should be treated differently?
>
> It's hint for VM to free pages so I think it's okay to free them instantly
> sometime if it can save more important thing like system stall.
> IOW, madvise is just hint, not a strict rule.
>
> > Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> > would work without long stalls most of the time - except when there is a
> > memory pressure.
>
> True.
>
> >
> > Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> > two calls is different.
>
> I agree it's not a apple to apple comparison.
>
> Acutally, MADV_FREE moves the cost from hot path(ie, system call path)
> to slow path(ie, reclaim context) so it would be slower if there are
> much memory pressure continuously due to a lot overhead of freeing pages
> in reclaim context. So, it would be good if kernel detects it nicely
> and prevent the situation. This patch aims for that.
>
> >
> > > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > > test: 12 processes
> > >
> > > loop = 5;
> > > mmap(512M);
> >
> > Who is eating the rest of the memory?
>
> As I wrote down, there are 12 processes with below test.
> IOW, 512M * 12 = 6G but system RAM is just 3G.
>
> >
> > > while (loop--) {
> > > memset(512M);
> > > madvise(MADV_FREE or MADV_DONTNEED);
> > > }
> > >
> > > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > > 3) madvfree + this ptach: 5.68user 113.42system 0:36.52elapsed
> > >
> > > It's clearly win.
> > >
> > > Reported-by: Shaohua Li <[email protected]>
> > > Signed-off-by: Minchan Kim <[email protected]>
> >
> > I don't know. This looks like a hack with hard to predict consequences
> > which might trigger pathological corner cases.
>
> Yeb, it might be. That's why I tagged RFC so hope other guys suggest
> better idea.
>
> >
> > > ---
> > > mm/madvise.c | 13 +++++++++++--
> > > 1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8921c2..81bb26ecf064 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > * XXX: In this implementation, MADV_FREE works like
> > > * MADV_DONTNEED on swapless system or full swap.
> > > */
> > > - if (get_nr_swap_pages() > 0)
> > > - return madvise_free(vma, prev, start, end);
> > > + if (get_nr_swap_pages() > 0) {
> > > + unsigned long threshold;
> > > + /*
> > > + * If we have trobule with memory pressure(ie,
> > > + * under high watermark), free pages instantly.
> > > + */
> > > + threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> > > + threshold = threshold + (threshold >> 1);
> >
> > Why threshold += threshold >> 1 ?
>
> I wanted to trigger this logic if we have free pages under high watermark.
>
> >
> > > + if (nr_free_pages() > threshold)
> > > + return madvise_free(vma, prev, start, end);
> > > + }
> > > /* passthrough */
> > > case MADV_DONTNEED:
> > > return madvise_dontneed(vma, prev, start, end);
> > > --
> > > 1.9.1
> > >
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Kind regards,
> Minchan Kim
--
Kind regards,
Minchan Kim
On Tue 24-02-15 14:54:01, Shaohua Li wrote:
> On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > applications went to stall with direct reclaim since kswapd's
> > > reclaim speed isn't fast than applications's allocation speed
> > > so that it causes lots of stall and lock contention.
> >
> > I am not sure I understand this correctly. So the issue is that there is
> > huge number of MADV_FREE on the LRU and they are not close to the tail
> > of the list so the reclaim has to do a lot of work before it starts
> > dropping them?
>
> I thought the main reason is current reclaim stragety. Anonymous pages are
> considered to be hard to be reclaimed with current policy, VM bias to reclaim
> file pages (anon pages are in active list first, referenced pte will reactivate
> anon pages and increase rotate count)
Makes sense. We really are biased toward page cache reclaim most of
the time.
--
Michal Hocko
SUSE Labs
On Wed 25-02-15 16:11:18, Minchan Kim wrote:
[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..d41ae76d3e54 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> spinlock_t *ptl;
> pte_t *pte, ptent;
> struct page *page;
> + swp_entry_t entry;
> unsigned long next;
> + int rss = 0;
>
> next = pmd_addr_end(addr, end);
> if (pmd_trans_huge(*pmd)) {
> @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> ptent = *pte;
>
> - if (!pte_present(ptent))
> + if (pte_none(ptent))
> continue;
>
> + if (!pte_present(ptent)) {
> + entry = pte_to_swp_entry(ptent);
> + if (non_swap_entry(entry))
> + continue;
> + rss--;
> + free_swap_and_cache(entry);
> + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> + continue;
> + }
Yes, this makes sense. 'rss' is a bit of a confusing name because
those pages are not resident.
> +
> page = vm_normal_page(vma, addr, ptent);
> if (!page)
> continue;
> @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> set_pte_at(mm, addr, pte, ptent);
> tlb_remove_tlb_entry(tlb, pte, addr);
> }
> +
> + if (rss) {
> + if (current->mm == mm)
> + sync_mm_rss(mm);
> +
> + add_mm_counter(mm, MM_SWAPENTS, rss);
> + }
> +
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte - 1, ptl);
> next:
> --
> 1.9.1
--
Michal Hocko
SUSE Labs
On Wed 25-02-15 09:27:28, Minchan Kim wrote:
> On Tue, Feb 24, 2015 at 05:14:08PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> > > MADV_FREE is hint that it's okay to discard pages if memory is
> > > pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> >
> > s@if memory is pressure@if there is memory pressure@
> >
> > > to free them so there is no worth to remain them in active
> > > anonymous LRU list so this patch moves them to inactive LRU list.
> >
> > Makes sense to me.
> >
> > > A arguable issue for the approach is whether we should put it
> > > head or tail in inactive list
> >
> > Is it really arguable? Why should active MADV_FREE pages appear before
> > those which were living on the inactive list. This doesn't make any
> > sense to me.
>
> It would be better to drop garbage pages(ie, freed from allocator)
> rather than swap out and now anon LRU aging is seq model so
> inacitve list can include a lot working set so putting hinted pages
> into tail of LRU could enhance reclaim efficiency.
> That's why I said it might be arguble.
OK, maybe I misunderstood what you were trying to say. Sure, we can
discuss whether to put all MADV_FREE pages at the tail of the
inactive list. But the wording above suggested that _active_
MADV_FREE pages were being discussed, and treating them differently
from the inactive pages simply didn't make sense to me.
[...]
--
Michal Hocko
SUSE Labs
On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > Hi Michal,
> >
> > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > > applications went to stall with direct reclaim since kswapd's
> > > > reclaim speed isn't fast than applications's allocation speed
> > > > so that it causes lots of stall and lock contention.
> > >
> > > I am not sure I understand this correctly. So the issue is that there is
> > > a huge number of MADV_FREE pages on the LRU and they are not close to the
> > > tail of the list, so reclaim has to do a lot of work before it starts
> > > dropping them?
> >
> > No, Shaohua already tested deactivating hinted pages to the head/tail
> > of the inactive anon LRU and he said it didn't solve his problem.
> > I thought the main culprit was the scanning/rotating/throttling in the
> > direct reclaim path.
>
> I investigated my workload and found most of the slowness came from swapin.
>
> 1) dontneed: 1,612 swapin
> 2) madvfree: 879,585 swapin
>
> If we find that hinted pages were already swapped out when the syscall is
> called, it's pointless to keep them in the pte. Instead, free the cold
> pages, because swapin is more expensive than (alloc page + zeroing).
>
> I tested the quick fix below and it reduced swapin from 879,585 to 1,878.
> The elapsed time was
>
> 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
>
> Although it was not as good as throttling, it's better than before, and
> it's orthogonal to throttling, so I hope to merge this first rather than
> the arguable throttling. Any comments?
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..d41ae76d3e54 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> spinlock_t *ptl;
> pte_t *pte, ptent;
> struct page *page;
> + swp_entry_t entry;
> unsigned long next;
> + int rss = 0;
>
> next = pmd_addr_end(addr, end);
> if (pmd_trans_huge(*pmd)) {
> @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> ptent = *pte;
>
> - if (!pte_present(ptent))
> + if (pte_none(ptent))
> continue;
>
> + if (!pte_present(ptent)) {
> + entry = pte_to_swp_entry(ptent);
> + if (non_swap_entry(entry))
> + continue;
> + rss--;
> + free_swap_and_cache(entry);
> + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> + continue;
> + }
> +
> page = vm_normal_page(vma, addr, ptent);
> if (!page)
> continue;
> @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> set_pte_at(mm, addr, pte, ptent);
> tlb_remove_tlb_entry(tlb, pte, addr);
> }
> +
> + if (rss) {
> + if (current->mm == mm)
> + sync_mm_rss(mm);
> +
> + add_mm_counter(mm, MM_SWAPENTS, rss);
> + }
> +
This looks like it makes sense, but I'm wondering why it can help and whether it
can help a real workload. Let me give an example. Say there is 1G of memory and a
workload uses 800M of memory with DONTNEED; there should be no swap. With FREE,
the workload might use more than 1G of memory and trigger swap. I thought the case
where DONTNEED doesn't trigger swap is more suitable for evaluating the patch's performance.
Thanks,
Shaohua
Hello,
On Wed, Feb 25, 2015 at 10:37:48AM -0800, Shaohua Li wrote:
> On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> > On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > > Hi Michal,
> > >
> > > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > > > > applications stalled in direct reclaim, since kswapd's reclaim speed
> > > > > could not keep up with the applications' allocation speed, causing
> > > > > lots of stalls and lock contention.
> > > >
> > > > I am not sure I understand this correctly. So the issue is that there is
> > > > a huge number of MADV_FREE pages on the LRU and they are not close to the
> > > > tail of the list, so reclaim has to do a lot of work before it starts
> > > > dropping them?
> > >
> > > No, Shaohua already tested deactivating hinted pages to the head/tail
> > > of the inactive anon LRU and he said it didn't solve his problem.
> > > I thought the main culprit was the scanning/rotating/throttling in the
> > > direct reclaim path.
> >
> > I investigated my workload and found most of the slowness came from swapin.
> >
> > 1) dontneed: 1,612 swapin
> > 2) madvfree: 879,585 swapin
> >
> > If we find that hinted pages were already swapped out when the syscall is
> > called, it's pointless to keep them in the pte. Instead, free the cold
> > pages, because swapin is more expensive than (alloc page + zeroing).
> >
> > I tested the quick fix below and it reduced swapin from 879,585 to 1,878.
> > The elapsed time was
> >
> > 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> > 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> >
> > Although it was not as good as throttling, it's better than before, and
> > it's orthogonal to throttling, so I hope to merge this first rather than
> > the arguable throttling. Any comments?
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8921c2..d41ae76d3e54 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > spinlock_t *ptl;
> > pte_t *pte, ptent;
> > struct page *page;
> > + swp_entry_t entry;
> > unsigned long next;
> > + int rss = 0;
> >
> > next = pmd_addr_end(addr, end);
> > if (pmd_trans_huge(*pmd)) {
> > @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > for (; addr != end; pte++, addr += PAGE_SIZE) {
> > ptent = *pte;
> >
> > - if (!pte_present(ptent))
> > + if (pte_none(ptent))
> > continue;
> >
> > + if (!pte_present(ptent)) {
> > + entry = pte_to_swp_entry(ptent);
> > + if (non_swap_entry(entry))
> > + continue;
> > + rss--;
> > + free_swap_and_cache(entry);
> > + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > + continue;
> > + }
> > +
> > page = vm_normal_page(vma, addr, ptent);
> > if (!page)
> > continue;
> > @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > set_pte_at(mm, addr, pte, ptent);
> > tlb_remove_tlb_entry(tlb, pte, addr);
> > }
> > +
> > + if (rss) {
> > + if (current->mm == mm)
> > + sync_mm_rss(mm);
> > +
> > + add_mm_counter(mm, MM_SWAPENTS, rss);
> > + }
> > +
>
> This looks like it makes sense, but I'm wondering why it can help and whether it
> can help a real workload. Let me give an example. Say there is 1G of memory and a
void *ptr1 = malloc(len); /* allocator mmaps a new chunk */
touch_iow_dirty(ptr1, len);
..
..
..
.. /* swapout happens */
free(ptr1); /* allocator calls MADV_FREE on the chunk */
void *ptr2 = malloc(len); /* allocator reuses the previous chunk */
touch_iow_dirty(ptr2, len); /* swapin happens only to read garbage which the application then overwrites */
It's a really unnecessary cost.
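For what it's worth, a compilable sketch of that allocator pattern, with
MADV_FREE defined by hand to a hypothetical value (the constant is not in
the uapi headers yet) and touch_iow_dirty() reduced to a memset:

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8			/* hypothetical value for this sketch */
#endif

#define LEN (512UL << 20)

int main(void)
{
	/* allocator mmaps a new chunk */
	char *ptr1 = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr1 == MAP_FAILED)
		return 1;

	memset(ptr1, 'a', LEN);		/* touch, iow dirty */

	/* ... memory pressure: part of the chunk is swapped out ... */

	madvise(ptr1, LEN, MADV_FREE);	/* free(): allocator hints the chunk */

	char *ptr2 = ptr1;		/* malloc(): allocator reuses the chunk */
	memset(ptr2, 'b', LEN);		/* without the fix above, this swapins
					 * garbage only to have the application
					 * overwrite it immediately */
	return 0;
}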
> workload uses 800M of memory with DONTNEED; there should be no swap. With FREE,
> the workload might use more than 1G of memory and trigger swap. I thought the case
> where DONTNEED doesn't trigger swap is more suitable for evaluating the patch's performance.
I think the above example is a really clear and plausible scenario.
Could you give me a more concrete example to test, if you want?
Thanks.
>
> Thanks,
> Shaohua
>
--
Kind regards,
Minchan Kim
On Thu, Feb 26, 2015 at 09:42:06AM +0900, Minchan Kim wrote:
> Hello,
>
> On Wed, Feb 25, 2015 at 10:37:48AM -0800, Shaohua Li wrote:
> > On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> > > On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > > > Hi Michal,
> > > >
> > > > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > > > > > applications stalled in direct reclaim, since kswapd's reclaim speed
> > > > > > could not keep up with the applications' allocation speed, causing
> > > > > > lots of stalls and lock contention.
> > > > >
> > > > > I am not sure I understand this correctly. So the issue is that there is
> > > > > a huge number of MADV_FREE pages on the LRU and they are not close to the
> > > > > tail of the list, so reclaim has to do a lot of work before it starts
> > > > > dropping them?
> > > >
> > > > No, Shaohua already tested deactivating hinted pages to the head/tail
> > > > of the inactive anon LRU and he said it didn't solve his problem.
> > > > I thought the main culprit was the scanning/rotating/throttling in the
> > > > direct reclaim path.
> > >
> > > I investigated my workload and found most of the slowness came from swapin.
> > >
> > > 1) dontneed: 1,612 swapin
> > > 2) madvfree: 879,585 swapin
> > >
> > > If we find that hinted pages were already swapped out when the syscall is
> > > called, it's pointless to keep them in the pte. Instead, free the cold
> > > pages, because swapin is more expensive than (alloc page + zeroing).
> > >
> > > I tested the quick fix below and it reduced swapin from 879,585 to 1,878.
> > > The elapsed time was
> > >
> > > 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> > > 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> > >
> > > Although it was not as good as throttling, it's better than before, and
> > > it's orthogonal to throttling, so I hope to merge this first rather than
> > > the arguable throttling. Any comments?
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8921c2..d41ae76d3e54 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > > spinlock_t *ptl;
> > > pte_t *pte, ptent;
> > > struct page *page;
> > > + swp_entry_t entry;
> > > unsigned long next;
> > > + int rss = 0;
> > >
> > > next = pmd_addr_end(addr, end);
> > > if (pmd_trans_huge(*pmd)) {
> > > @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > > for (; addr != end; pte++, addr += PAGE_SIZE) {
> > > ptent = *pte;
> > >
> > > - if (!pte_present(ptent))
> > > + if (pte_none(ptent))
> > > continue;
> > >
> > > + if (!pte_present(ptent)) {
> > > + entry = pte_to_swp_entry(ptent);
> > > + if (non_swap_entry(entry))
> > > + continue;
> > > + rss--;
> > > + free_swap_and_cache(entry);
> > > + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > + continue;
> > > + }
> > > +
> > > page = vm_normal_page(vma, addr, ptent);
> > > if (!page)
> > > continue;
> > > @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > > set_pte_at(mm, addr, pte, ptent);
> > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > }
> > > +
> > > + if (rss) {
> > > + if (current->mm == mm)
> > > + sync_mm_rss(mm);
> > > +
> > > + add_mm_counter(mm, MM_SWAPENTS, rss);
> > > + }
> > > +
> >
> > This looks like it makes sense, but I'm wondering why it can help and whether it
> > can help a real workload. Let me give an example. Say there is 1G of memory and a
>
> void *ptr1 = malloc(len); /* allocator mmaps a new chunk */
> touch_iow_dirty(ptr1, len);
> ..
> ..
> ..
> .. /* swapout happens */
> free(ptr1); /* allocator calls MADV_FREE on the chunk */
>
> void *ptr2 = malloc(len); /* allocator reuses the previous chunk */
> touch_iow_dirty(ptr2, len); /* swapin happens only to read garbage which the application then overwrites */
>
> It's a really unnecessary cost.
>
>
> > workload uses 800M of memory with DONTNEED; there should be no swap. With FREE,
> > the workload might use more than 1G of memory and trigger swap. I thought the case
> > where DONTNEED doesn't trigger swap is more suitable for evaluating the patch's performance.
>
> I think the above example is a really clear and plausible scenario.
> Could you give me a more concrete example to test, if you want?
Sorry, no, I don't have a concrete example either. My major concern is a workload
which has no swap with DONTNEED but might swap with FREE.
Thanks,
Shaohua
This patch adds ClearPageDirty() to clear the anon page dirty flag.
The anon page's mapcount must be 1, so that the page is only used by
the current process and not shared with another process via fork().
If we do not clear the page dirty flag for such an anon page, the page
will never be treated as freeable.
Signed-off-by: Yalin Wang <[email protected]>
---
mm/madvise.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..257925a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
continue;
page = vm_normal_page(vma, addr, ptent);
- if (!page)
+ if (!page || !PageAnon(page) || !trylock_page(page))
continue;
if (PageSwapCache(page)) {
- if (!trylock_page(page))
+ if (!try_to_free_swap(page))
continue;
-
- if (!try_to_free_swap(page)) {
- unlock_page(page);
- continue;
- }
-
- ClearPageDirty(page);
- unlock_page(page);
}
+ if (page_mapcount(page) == 1)
+ ClearPageDirty(page);
+ unlock_page(page);
/*
* Some of architecture(ex, PPC) don't update TLB
* with set_pte_at and tlb_remove_tlb_entry so for
--
2.2.2
Hello,
On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> This patch adds ClearPageDirty() to clear the anon page dirty flag.
> The anon page's mapcount must be 1, so that the page is only used by
> the current process and not shared with another process via fork().
> If we do not clear the page dirty flag for such an anon page, the page
> will never be treated as freeable.
In the case of an anonymous page, it gets PG_dirty when the VM adds it to
the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
I added ClearPageDirty if we found the page in the swap cache.
What case am I missing? It would be easier to understand if you
described a specific scenario.
Thanks.
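For reference, a rough sketch of the path that sets the flag, abridged from
add_to_swap() in mm/swap_state.c of that era (not verbatim; the huge page
split and sanity checks are omitted):

int add_to_swap(struct page *page, struct list_head *list)
{
	swp_entry_t entry;
	int err;

	entry = get_swap_page();
	if (!entry.val)
		return 0;

	err = add_to_swap_cache(page, entry,
			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
	if (!err) {
		SetPageDirty(page);	/* page must be written to swap */
		return 1;
	}

	swapcache_free(entry);
	return 0;
}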
>
> Signed-off-by: Yalin Wang <[email protected]>
> ---
> mm/madvise.c | 15 +++++----------
> 1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..257925a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> continue;
>
> page = vm_normal_page(vma, addr, ptent);
> - if (!page)
> + if (!page || !PageAnon(page) || !trylock_page(page))
> continue;
>
> if (PageSwapCache(page)) {
> - if (!trylock_page(page))
> + if (!try_to_free_swap(page))
> continue;
> -
> - if (!try_to_free_swap(page)) {
> - unlock_page(page);
> - continue;
> - }
> -
> - ClearPageDirty(page);
> - unlock_page(page);
> }
>
> + if (page_mapcount(page) == 1)
> + ClearPageDirty(page);
> + unlock_page(page);
> /*
> * Some of architecture(ex, PPC) don't update TLB
> * with set_pte_at and tlb_remove_tlb_entry so for
> --
> 2.2.2
--
Kind regards,
Minchan Kim
> -----Original Message-----
> From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan Kim
> Sent: Friday, February 27, 2015 1:28 PM
> To: Wang, Yalin
> Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
>
> Hello,
>
> On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > The anon page's mapcount must be 1, so that the page is only used by
> > the current process and not shared with another process via fork().
> > If we do not clear the page dirty flag for such an anon page, the page
> > will never be treated as freeable.
>
> In the case of an anonymous page, it gets PG_dirty when the VM adds it to
> the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
> I added ClearPageDirty if we found the page in the swap cache.
> What case am I missing? It would be easier to understand if you
> described a specific scenario.
>
> Thanks.
>
> >
> > Signed-off-by: Yalin Wang <[email protected]>
> > ---
> > mm/madvise.c | 15 +++++----------
> > 1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..257925a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> unsigned long addr,
> > continue;
> >
> > page = vm_normal_page(vma, addr, ptent);
> > - if (!page)
> > + if (!page || !PageAnon(page) || !trylock_page(page))
> > continue;
> >
> > if (PageSwapCache(page)) {
> > - if (!trylock_page(page))
> > + if (!try_to_free_swap(page))
> > continue;
> > -
> > - if (!try_to_free_swap(page)) {
> > - unlock_page(page);
> > - continue;
> > - }
> > -
> > - ClearPageDirty(page);
> > - unlock_page(page);
> > }
> >
> > + if (page_mapcount(page) == 1)
> > + ClearPageDirty(page);
> > + unlock_page(page);
> > /*
> > * Some of architecture(ex, PPC) don't update TLB
> > * with set_pte_at and tlb_remove_tlb_entry so for
> > --
Yes, for a page which is in the swap cache, that is correct,
but an anon page which is not in the swap cache is always
PageDirty(), so we should also clear the dirty bit to make it freeable.
Another problem is that an anon page can be shared by more than one process;
this happens after fork(), when the anon page becomes copy-on-write.
In this case, we should not clear the page dirty flag;
otherwise, the other process will get a zero page if the page is freed by the reclaim path,
which is not correct for a process that didn't call the MADV_FREE syscall.
Thanks
On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan Kim
> > Sent: Friday, February 27, 2015 1:28 PM
> > To: Wang, Yalin
> > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> >
> > Hello,
> >
> > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > > The anon page's mapcount must be 1, so that the page is only used by
> > > the current process and not shared with another process via fork().
> > > If we do not clear the page dirty flag for such an anon page, the page
> > > will never be treated as freeable.
> >
> > In the case of an anonymous page, it gets PG_dirty when the VM adds it to
> > the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
> > I added ClearPageDirty if we found the page in the swap cache.
> > What case am I missing? It would be easier to understand if you
> > described a specific scenario.
> >
> > Thanks.
> >
> > >
> > > Signed-off-by: Yalin Wang <[email protected]>
> > > ---
> > > mm/madvise.c | 15 +++++----------
> > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..257925a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned long addr,
> > > continue;
> > >
> > > page = vm_normal_page(vma, addr, ptent);
> > > - if (!page)
> > > + if (!page || !PageAnon(page) || !trylock_page(page))
> > > continue;
> > >
> > > if (PageSwapCache(page)) {
> > > - if (!trylock_page(page))
> > > + if (!try_to_free_swap(page))
> > > continue;
> > > -
> > > - if (!try_to_free_swap(page)) {
> > > - unlock_page(page);
> > > - continue;
> > > - }
> > > -
> > > - ClearPageDirty(page);
> > > - unlock_page(page);
> > > }
> > >
> > > + if (page_mapcount(page) == 1)
> > > + ClearPageDirty(page);
> > > + unlock_page(page);
> > > /*
> > > * Some of architecture(ex, PPC) don't update TLB
> > > * with set_pte_at and tlb_remove_tlb_entry so for
> > > --
> Yes, for a page which is in the swap cache, that is correct,
> but an anon page which is not in the swap cache is always
> PageDirty(), so we should also clear the dirty bit to make it freeable.
No. Every anon page starts from !PageDirty and it gets PG_dirty
only when it's added into the swap cache. If vm_swap_full turns on,
a page in the swap cache can get PG_dirty again via try_to_free_swap.
So, do you have a concern about swapped-out pages when MADV_FREE is
called? If so, please look at my patch.
https://lkml.org/lkml/2015/2/25/43
It will zap the swapped-out pages. So, this is not an issue any more?
>
> Another problem is that an anon page can be shared by more than one process;
> this happens after fork(), when the anon page becomes copy-on-write.
> In this case, we should not clear the page dirty flag,
> which is not correct for a process that didn't call the MADV_FREE syscall.
You mean we shouldn't inherit MADV_FREE attribute?
Why?
parent:
ptr1 = malloc(len);
-> allocator calls mmap(len);
memset(ptr1, 'a', len);
free(ptr1);
-> allocator calls madvise_free(ptr1, len);
..
..
-> VM discards the hinted pages
fork();
child:
ptr2 = malloc(len);
-> allocator reuses the chunk allocated by the parent.
So, the child will see zero pages from ptr2, but since it hasn't written
anything yet, either garbage or a zero page is fine for it.
>
> Thanks
>
>
--
Kind regards,
Minchan Kim
> -----Original Message-----
> From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan Kim
> Sent: Friday, February 27, 2015 2:44 PM
> To: Wang, Yalin
> Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
>
> On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan
> Kim
> > > Sent: Friday, February 27, 2015 1:28 PM
> > > To: Wang, Yalin
> > > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > >
> > > Hello,
> > >
> > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > > > The anon page's mapcount must be 1, so that the page is only used by
> > > > the current process and not shared with another process via fork().
> > > > If we do not clear the page dirty flag for such an anon page, the page
> > > > will never be treated as freeable.
> > >
> > > In the case of an anonymous page, it gets PG_dirty when the VM adds it to
> > > the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
> > > I added ClearPageDirty if we found the page in the swap cache.
> > > What case am I missing? It would be easier to understand if you
> > > described a specific scenario.
> > >
> > > Thanks.
> > >
> > > >
> > > > Signed-off-by: Yalin Wang <[email protected]>
> > > > ---
> > > > mm/madvise.c | 15 +++++----------
> > > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 6d0fcb8..257925a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > unsigned long addr,
> > > > continue;
> > > >
> > > > page = vm_normal_page(vma, addr, ptent);
> > > > - if (!page)
> > > > + if (!page || !PageAnon(page) || !trylock_page(page))
> > > > continue;
> > > >
> > > > if (PageSwapCache(page)) {
> > > > - if (!trylock_page(page))
> > > > + if (!try_to_free_swap(page))
> > > > continue;
> > > > -
> > > > - if (!try_to_free_swap(page)) {
> > > > - unlock_page(page);
> > > > - continue;
> > > > - }
> > > > -
> > > > - ClearPageDirty(page);
> > > > - unlock_page(page);
> > > > }
> > > >
> > > > + if (page_mapcount(page) == 1)
> > > > + ClearPageDirty(page);
> > > > + unlock_page(page);
> > > > /*
> > > > * Some of architecture(ex, PPC) don't update TLB
> > > > * with set_pte_at and tlb_remove_tlb_entry so for
> > > > --
> > Yes, for a page which is in the swap cache, that is correct,
> > but an anon page which is not in the swap cache is always
> > PageDirty(), so we should also clear the dirty bit to make it freeable.
>
> No. Every anon page starts from !PageDirty and it gets PG_dirty
> only when it's added into the swap cache. If vm_swap_full turns on,
> a page in the swap cache can get PG_dirty again via try_to_free_swap.
mmm..
sometimes you can see an anon page that is PageDirty() but not in the swap cache;
for example, via handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
at which point the page has been deleted from the swap cache and is marked PageDirty().
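For reference, a rough sketch of try_to_free_swap() as it looked in
mm/swapfile.c of that era (abridged, not verbatim; a hibernation-related
check is omitted), showing where the dirty flag comes back:

int try_to_free_swap(struct page *page)
{
	VM_BUG_ON_PAGE(!PageLocked(page), page);

	if (!PageSwapCache(page))
		return 0;
	if (PageWriteback(page))
		return 0;
	if (page_swapcount(page))
		return 0;

	delete_from_swap_cache(page);
	SetPageDirty(page);	/* the only copy now lives in memory */
	return 1;
}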
> So, do you have a concern about swapped-out pages when MADV_FREE is
> called? If so, please look at my patch.
>
> https://lkml.org/lkml/2015/2/25/43
>
> It will zap the swapped-out pages. So, this is not an issue any more?
>
> >
> > Another problem is that an anon page can be shared by more than one
> > process; this happens after fork(), when the anon page becomes
> > copy-on-write. In this case, we should not clear the page dirty flag,
> > which is not correct for a process that didn't call the MADV_FREE syscall.
>
> You mean we shouldn't inherit MADV_FREE attribute?
> Why?
Is the behavior correct with code like this:
Parent:
ptr1 = malloc(len);
memset(ptr1, 'a', len);
fork();
if (I am parent)
madvise_free(ptr1, len);
child:
sleep(10);
parse_data(ptr1, len); // the child may see zero, not 'a';
// is that the behavior the programmer wants?
Because the child didn't call madvise_free(), it should see 'a', not a zero page.
Isn't that right?
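A self-contained sketch of that scenario (MADV_FREE is defined by hand to a
hypothetical value since the constant is not in the uapi headers yet, and
parse_data() is reduced to reading one byte):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8			/* hypothetical value for this sketch */
#endif

#define LEN (64UL << 20)

int main(void)
{
	char *ptr1 = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr1 == MAP_FAILED)
		return 1;

	memset(ptr1, 'a', LEN);

	if (fork() > 0) {
		madvise(ptr1, LEN, MADV_FREE);	/* only the parent hints */
	} else {
		sleep(10);			/* wait for possible reclaim */
		/* the child never called MADV_FREE, so it must see 'a' */
		printf("child sees: %c\n", ptr1[0]);
	}
	return 0;
}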
Thanks
On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan Kim
> > Sent: Friday, February 27, 2015 2:44 PM
> > To: Wang, Yalin
> > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> >
> > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > -----Original Message-----
> > > > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan
> > Kim
> > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > To: Wang, Yalin
> > > > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > > > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > >
> > > > Hello,
> > > >
> > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > > > > The anon page's mapcount must be 1, so that the page is only used by
> > > > > the current process and not shared with another process via fork().
> > > > > If we do not clear the page dirty flag for such an anon page, the page
> > > > > will never be treated as freeable.
> > > >
> > > > In the case of an anonymous page, it gets PG_dirty when the VM adds it to
> > > > the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
> > > > I added ClearPageDirty if we found the page in the swap cache.
> > > > What case am I missing? It would be easier to understand if you
> > > > described a specific scenario.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > Signed-off-by: Yalin Wang <[email protected]>
> > > > > ---
> > > > > mm/madvise.c | 15 +++++----------
> > > > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 6d0fcb8..257925a 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > > unsigned long addr,
> > > > > continue;
> > > > >
> > > > > page = vm_normal_page(vma, addr, ptent);
> > > > > - if (!page)
> > > > > + if (!page || !PageAnon(page) || !trylock_page(page))
> > > > > continue;
> > > > >
> > > > > if (PageSwapCache(page)) {
> > > > > - if (!trylock_page(page))
> > > > > + if (!try_to_free_swap(page))
> > > > > continue;
> > > > > -
> > > > > - if (!try_to_free_swap(page)) {
> > > > > - unlock_page(page);
> > > > > - continue;
> > > > > - }
> > > > > -
> > > > > - ClearPageDirty(page);
> > > > > - unlock_page(page);
> > > > > }
> > > > >
> > > > > + if (page_mapcount(page) == 1)
> > > > > + ClearPageDirty(page);
> > > > > + unlock_page(page);
> > > > > /*
> > > > > * Some of architecture(ex, PPC) don't update TLB
> > > > > * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > --
> > > Yes, for a page which is in the swap cache, that is correct,
> > > but an anon page which is not in the swap cache is always
> > > PageDirty(), so we should also clear the dirty bit to make it freeable.
> >
> > No. Every anon page starts from !PageDirty and it gets PG_dirty
> > only when it's added into the swap cache. If vm_swap_full turns on,
> > a page in the swap cache can get PG_dirty again via try_to_free_swap.
>
> mmm..
> sometimes you can see an anon page that is PageDirty() but not in the swap cache;
> for example, via handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> at which point the page has been deleted from the swap cache and is marked PageDirty().
That's what I missed. It's clear, and it would be a simple patch, so
could you send a patch to fix this issue with a detailed description
like the above?
>
>
> > So, do you have a concern about swapped-out pages when MADV_FREE is
> > called? If so, please look at my patch.
> >
> > https://lkml.org/lkml/2015/2/25/43
> >
> > It will zap the swapped-out pages. So, this is not an issue any more?
> >
> > >
> > > Another problem is that an anon page can be shared by more than one
> > > process; this happens after fork(), when the anon page becomes
> > > copy-on-write. In this case, we should not clear the page dirty flag,
> > > which is not correct for a process that didn't call the MADV_FREE syscall.
> >
> > You mean we shouldn't inherit MADV_FREE attribute?
> > Why?
>
> Is the behavior correct with code like this:
>
> Parent:
> ptr1 = malloc(len);
> memset(ptr1, 'a', len);
> fork();
> if (I am parent)
> madvise_free(ptr1, len);
>
> child:
> sleep(10);
> parse_data(ptr1, len); // the child may see zero, not 'a';
> // is that the behavior the programmer wants?
>
> Because the child didn't call madvise_free(), it should see 'a', not a zero page.
> Isn't that right?
You're absolutely right. Thanks.
But I doubt your fix is the best one. Most forks will exec soonish, so
it's not a good idea to make MADV_FREE void just because hinted pages
were shared when the syscall was called.
How about checking whether the page is still shared in the reclaim path?
If it is still shared, we shouldn't discard it.
Thanks.
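Purely as an illustration of that idea (an untested sketch, assuming the
usual labels of shrink_page_list()), the reclaim-side check could look like:

		/*
		 * Illustrative only: a clean anon page is discardable
		 * only if nobody else maps it; a page still shared
		 * after fork() must not be dropped, since the other
		 * processes never called MADV_FREE.
		 */
		if (PageAnon(page) && !PageSwapCache(page) &&
		    page_mapcount(page) > 1)
			goto keep_locked;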
> Thanks
>
>
>
>
>
>
--
Kind regards,
Minchan Kim
On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> This patch adds ClearPageDirty() to clear the anon page dirty flag.
> The anon page's mapcount must be 1, so that the page is only used by
> the current process and not shared with another process via fork().
> If we do not clear the page dirty flag for such an anon page, the page
> will never be treated as freeable.
Very well spotted! I hadn't noticed that during the review.
> Signed-off-by: Yalin Wang <[email protected]>
> ---
> mm/madvise.c | 15 +++++----------
> 1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..257925a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> continue;
>
> page = vm_normal_page(vma, addr, ptent);
> - if (!page)
> + if (!page || !PageAnon(page) || !trylock_page(page))
> continue;
PageAnon check seems to be redundant because we are not allowing
MADV_FREE on any !anon private mappings AFAIR.
>
> if (PageSwapCache(page)) {
> - if (!trylock_page(page))
> + if (!try_to_free_swap(page))
> continue;
You need to unlock the page here.
> -
> - if (!try_to_free_swap(page)) {
> - unlock_page(page);
> - continue;
> - }
> -
> - ClearPageDirty(page);
> - unlock_page(page);
> }
>
> + if (page_mapcount(page) == 1)
> + ClearPageDirty(page);
Please add a comment about why we need ClearPageDirty even for
!PageSwapCache pages. Anon pages are usually not marked dirty, AFAIR. The
reason seems to be a racing try_to_free_swap which sets the page that way
(although I do not seem to remember why we are doing that in the first
place...)
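For illustration, the requested comment could read something like this
(the wording is mine, not from the patch):

	/*
	 * An anon page can be PG_dirty without being in the swap
	 * cache: try_to_free_swap() (e.g. called from do_swap_page())
	 * deletes the page from the swap cache and re-marks it dirty,
	 * because the data then exists only in memory.
	 */
	if (page_mapcount(page) == 1)
		ClearPageDirty(page);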
> + unlock_page(page);
> /*
> * Some of architecture(ex, PPC) don't update TLB
> * with set_pte_at and tlb_remove_tlb_entry so for
> --
> 2.2.2
--
Michal Hocko
SUSE Labs
> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]] On Behalf Of Michal Hocko
> Sent: Saturday, February 28, 2015 5:03 AM
> To: Wang, Yalin
> Cc: 'Minchan Kim'; Andrew Morton; [email protected]; linux-
> [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
>
> On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > The anon page's mapcount must be 1, so that the page is only used by
> > the current process and not shared with another process via fork().
> > If we do not clear the page dirty flag for such an anon page, the page
> > will never be treated as freeable.
>
> Very well spotted! I hadn't noticed that during the review.
>
> > Signed-off-by: Yalin Wang <[email protected]>
> > ---
> > mm/madvise.c | 15 +++++----------
> > 1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..257925a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> unsigned long addr,
> > continue;
> >
> > page = vm_normal_page(vma, addr, ptent);
> > - if (!page)
> > + if (!page || !PageAnon(page) || !trylock_page(page))
> > continue;
>
> PageAnon check seems to be redundant because we are not allowing
> MADV_FREE on any !anon private mappings AFAIR.
I only see this check:
/* MADV_FREE works for only anon vma at the moment */
if (vma->vm_file)
return -EINVAL;
but a file-private mapping can also contain anon pages sometimes. Do we need
to change it to something like this:
if (vma->vm_flags & VM_SHARED)
return -EINVAL;
> >
> > if (PageSwapCache(page)) {
> > - if (!trylock_page(page))
> > + if (!try_to_free_swap(page))
> > continue;
>
> You need to unlock the page here.
Good spot.
> > -
> > - if (!try_to_free_swap(page)) {
> > - unlock_page(page);
> > - continue;
> > - }
> > -
> > - ClearPageDirty(page);
> > - unlock_page(page);
> > }
> >
> > + if (page_mapcount(page) == 1)
> > + ClearPageDirty(page);
>
> Please add a comment about why we need ClearPageDirty even for
> !PageSwapCache pages. Anon pages are usually not marked dirty, AFAIR. The
> reason seems to be a racing try_to_free_swap which sets the page that way
> (although I do not seem to remember why we are doing that in the first
> place...)
>
Using page_mapcount to judge whether a page's dirty flag can be cleared
doesn't seem like a very good solution, because we don't know how many
ptes share this page. I am wondering whether there is a good solution
for shared anon pages.
This patch adds ClearPageDirty() to clear the anon page dirty flag;
if we do not clear the page dirty flag for such an anon page, the page
will never be treated as freeable. We also make sure a shared anon page
is not freeable; we implement this by dirtying the pte of every copied
anon page, which ensures the page will not become freeable unless every
process sharing the page calls the madvise_free syscall.
Another change is that we also handle file-mapped pages:
we just clear the pte young bit for file mappings. This is useful
because it makes the reclaim path move file pages into the inactive
lru list more aggressively.
Signed-off-by: Yalin Wang <[email protected]>
---
mm/madvise.c | 26 +++++++++++++++-----------
mm/memory.c | 12 ++++++++++--
2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..712756b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -299,30 +299,38 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
page = vm_normal_page(vma, addr, ptent);
if (!page)
continue;
+ if (!PageAnon(page))
+ goto set_pte;
+ if (!trylock_page(page))
+ continue;
if (PageSwapCache(page)) {
- if (!trylock_page(page))
- continue;
-
if (!try_to_free_swap(page)) {
unlock_page(page);
continue;
}
-
- ClearPageDirty(page);
- unlock_page(page);
}
/*
+ * We clear the page dirty flag for an anon page whether or not
+ * it is in the swap cache; an anon page not in the swap cache can
+ * also have the dirty flag set, which happens when the page is
+ * removed from the swap cache by try_to_free_swap()
+ */
+ ClearPageDirty(page);
+ unlock_page(page);
+ /*
* Some of architecture(ex, PPC) don't update TLB
* with set_pte_at and tlb_remove_tlb_entry so for
* the portability, remap the pte with old|clean
* after pte clearing.
*/
+set_pte:
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
ptent = pte_mkold(ptent);
- ptent = pte_mkclean(ptent);
+ if (PageAnon(page))
+ ptent = pte_mkclean(ptent);
set_pte_at(mm, addr, pte, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
}
@@ -364,10 +372,6 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;
- /* MADV_FREE works for only anon vma at the moment */
- if (vma->vm_file)
- return -EINVAL;
-
start = max(vma->vm_start, start_addr);
if (start >= vma->vm_end)
return -EINVAL;
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..3d949b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,10 +874,18 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (page) {
get_page(page);
page_dup_rmap(page);
- if (PageAnon(page))
+ if (PageAnon(page)) {
+ /*
+ * We dirty the copied pte for an anon page; this is useful
+ * for madvise_free_pte_range(), since it prevents a shared
+ * anon page from being freed by the madvise_free
+ * syscall.
+ */
+ pte = pte_mkdirty(pte);
rss[MM_ANONPAGES]++;
- else
+ } else {
rss[MM_FILEPAGES]++;
+ }
}
out_set_pte:
--
2.2.2
On Fri, Feb 27, 2015 at 10:37:14PM +0900, Minchan Kim wrote:
> On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan Kim
> > > Sent: Friday, February 27, 2015 2:44 PM
> > > To: Wang, Yalin
> > > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > >
> > > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > > -----Original Message-----
> > > > > From: Minchan Kim [mailto:[email protected]] On Behalf Of Minchan
> > > Kim
> > > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > > To: Wang, Yalin
> > > > > Cc: Michal Hocko; Andrew Morton; [email protected]; linux-
> > > > > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > > > > > The anon page's mapcount must be 1, so that the page is only used by
> > > > > > the current process and not shared with another process via fork().
> > > > > > If we do not clear the page dirty flag for such an anon page, the page
> > > > > > will never be treated as freeable.
> > > > >
> > > > > In the case of an anonymous page, it gets PG_dirty when the VM adds it to
> > > > > the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's why
> > > > > I added ClearPageDirty if we found the page in the swap cache.
> > > > > What case am I missing? It would be easier to understand if you
> > > > > described a specific scenario.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > Signed-off-by: Yalin Wang <[email protected]>
> > > > > > ---
> > > > > > mm/madvise.c | 15 +++++----------
> > > > > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > index 6d0fcb8..257925a 100644
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > > > unsigned long addr,
> > > > > > continue;
> > > > > >
> > > > > > page = vm_normal_page(vma, addr, ptent);
> > > > > > - if (!page)
> > > > > > + if (!page || !PageAnon(page) || !trylock_page(page))
> > > > > > continue;
> > > > > >
> > > > > > if (PageSwapCache(page)) {
> > > > > > - if (!trylock_page(page))
> > > > > > + if (!try_to_free_swap(page))
> > > > > > continue;
> > > > > > -
> > > > > > - if (!try_to_free_swap(page)) {
> > > > > > - unlock_page(page);
> > > > > > - continue;
> > > > > > - }
> > > > > > -
> > > > > > - ClearPageDirty(page);
> > > > > > - unlock_page(page);
> > > > > > }
> > > > > >
> > > > > > + if (page_mapcount(page) == 1)
> > > > > > + ClearPageDirty(page);
> > > > > > + unlock_page(page);
> > > > > > /*
> > > > > > * Some of architecture(ex, PPC) don't update TLB
> > > > > > * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > > --
> > > > Yes, for a page which is in the swap cache, that is correct,
> > > > but an anon page which is not in the swap cache is always
> > > > PageDirty(), so we should also clear the dirty bit to make it freeable.
> > >
> > > No. Every anon page starts from !PageDirty and it gets PG_dirty
> > > only when it's added into the swap cache. If vm_swap_full turns on,
> > > a page in the swap cache can get PG_dirty again via try_to_free_swap.
> >
> > mmm..
> > sometimes you can see an anon page that is PageDirty() but not in the swap cache;
> > for example, via handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> > at which point the page has been deleted from the swap cache and is marked PageDirty().
>
> That's what I missed. It's clear, and it would be a simple patch, so
> could you send a patch to fix this issue with a detailed description
> like the above?
>
> >
> >
> > > So, do you have a concern about swapped-out pages when MADV_FREE is
> > > called? If so, please look at my patch.
> > >
> > > https://lkml.org/lkml/2015/2/25/43
> > >
> > > It will zap the swapped-out pages. So, this is not an issue any more?
> > >
> > > >
> > > > Another problem is that an anon page can be shared by more than one
> > > > process; this happens after fork(), when the anon page becomes
> > > > copy-on-write. In this case, we should not clear the page dirty flag,
> > > > which is not correct for a process that didn't call the MADV_FREE syscall.
> > >
> > > You mean we shouldn't inherit MADV_FREE attribute?
> > > Why?
> >
> > Is the behavior correct with code like this:
> >
> > Parent:
> > ptr1 = malloc(len);
> > memset(ptr1, 'a', len);
> > fork();
> > if (I am parent)
> > madvise_free(ptr1, len);
> >
> > child:
> > sleep(10);
> > parse_data(ptr1, len); // the child may see zero, not 'a';
> > // is that the behavior the programmer wants?
> >
> > Because the child didn't call madvise_free(), it should see 'a', not a zero page.
> > Isn't that right?
>
> You're absolutely right. Thanks.
> But I doubt your fix is the best one. Most forks will exec soonish, so
> it's not a good idea to make MADV_FREE void just because hinted pages
> were shared when the syscall was called.
> How about checking whether the page is still shared in the reclaim path?
> If it is still shared, we shouldn't discard it.
I got confused. Looking at copy_one_pte, it copies from src_pte
and does not clear the dirty bit if it's not a shared mapping.
If so, in your example, the child pte has the dirty bit on while the parent's
is clean thanks to madvise_free, so the VM shouldn't discard the page.
No?
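For reference, the relevant lines of copy_one_pte(), abridged from
mm/memory.c of that era (not verbatim):

	/*
	 * If it's a COW mapping, write protect it both in the parent
	 * and the child: the pte (including its dirty bit) is copied
	 * as-is, only the write permission is dropped.
	 */
	if (is_cow_mapping(vm_flags)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}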
>
> Thanks.
>
> > Thanks
> >
> >
> >
> >
> >
> >
>
> --
> Kind regards,
> Minchan Kim
--
Kind regards,
Minchan Kim
On Sat, Feb 28, 2015 at 10:11:13AM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Michal Hocko [mailto:[email protected]] On Behalf Of Michal Hocko
> > Sent: Saturday, February 28, 2015 5:03 AM
> > To: Wang, Yalin
> > Cc: 'Minchan Kim'; Andrew Morton; [email protected]; linux-
> > [email protected]; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> >
> > On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > > This patch adds ClearPageDirty() to clear the anon page dirty flag.
> > > The anon page's mapcount must be 1, so that the page is only used by
> > > the current process and not shared with another process via fork().
> > > If we do not clear the page dirty flag for such an anon page, the page
> > > will never be treated as freeable.
> >
> > Very well spotted! I hadn't noticed that during the review.
> >
> > > Signed-off-by: Yalin Wang <[email protected]>
> > > ---
> > > mm/madvise.c | 15 +++++----------
> > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..257925a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned long addr,
> > > continue;
> > >
> > > page = vm_normal_page(vma, addr, ptent);
> > > - if (!page)
> > > + if (!page || !PageAnon(page) || !trylock_page(page))
> > > continue;
> >
> > PageAnon check seems to be redundant because we are not allowing
> > MADV_FREE on any !anon private mappings AFAIR.
> I only see this check:
> /* MADV_FREE works for only anon vma at the moment */
> if (vma->vm_file)
> return -EINVAL;
>
> but a file-private mapping can also contain anon pages sometimes. Do we need
> to change it to something like this:
> if (vma->vm_flags & VM_SHARED)
> return -EINVAL;
I couldn't understand your point. At this stage, we intentionally
disabled madvise_free on file-mapped areas (AFAIRC, some guys tried
it a long time ago but it had many issues, so it was dropped).
So, how can a file-private mapping reach this code?
Could you elaborate more on why we need the PageAnon check
here?
> > >
> > > if (PageSwapCache(page)) {
> > > - if (!trylock_page(page))
> > > + if (!try_to_free_swap(page))
> > > continue;
> >
> > You need to unlock the page here.
> Good spot.
>
> > > -
> > > - if (!try_to_free_swap(page)) {
> > > - unlock_page(page);
> > > - continue;
> > > - }
> > > -
> > > - ClearPageDirty(page);
> > > - unlock_page(page);
> > > }
> > >
> > > + if (page_mapcount(page) == 1)
> > > + ClearPageDirty(page);
> >
> > Please add a comment about why we need ClearPageDirty even for
> > !PageSwapCache pages. Anon pages are usually not marked dirty, AFAIR. The
> > reason seems to be a racing try_to_free_swap which sets the page that way
> > (although I do not seem to remember why we are doing that in the first
> > place...)
> >
> Using page_mapcount to judge whether a page's dirty flag can be cleared
> doesn't seem like a very good solution, because we don't know how many
> ptes share this page. I am wondering whether there is a good solution
> for shared anon pages.
>
--
Kind regards,
Minchan Kim