2017-03-17 23:29:12

by Tim Murray

Subject: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hi all,

I've been working to improve Android's memory management and drop lowmemorykiller from the kernel, and I'd like to get some feedback on a small patch with a lot of side effects.

Currently, when an Android device is under memory pressure, kswapd will do one of three things:

1. Compress an anonymous page to ZRAM.
2. Evict a file page.
3. Kill a process via lowmemorykiller.

The first two are cheap and per-page; the third is relatively cheap in the short term, frees many pages, but may incur power and performance penalties later on when the process has to be started again. For lots of reasons, I'd like a better balance between reclamation and killing on Android.

One of the nice things about Android from an optimization POV is that the execution model is more constrained than a generic Linux machine. There are only a limited number of processes that need to execute quickly for the device to appear to have good performance, and a userspace daemon (called ActivityManagerService) knows exactly what those processes are at any given time. We've made use of that in the past via cpusets and schedtune to limit the CPU resources available to background processes, and I think we can apply the same concept to memory.

This patch adds a new tunable to mem cgroups, memory.priority. A mem cgroup with a non-zero priority is not eligible for scanning until the scan_control's priority has dropped to DEF_PRIORITY minus the cgroup's priority. Once the mem cgroup is eligible for scanning, its priority acts as a bias that reduces the number of pages to be scanned.
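
To make the two effects concrete, here is a small userspace sketch (an illustration, not code from the patch; DEF_PRIORITY and the 0-10 range match the patch, everything else is simplified):

#include <stdio.h>
#include <stdbool.h>

#define DEF_PRIORITY 12	/* the kernel's initial scan priority */

/* Mirrors the skip check added to mem_cgroup_iter(): a cgroup is only
 * scanned once the scan priority has dropped far enough for it.
 */
static bool eligible(int memcg_prio, int sc_prio)
{
	return (DEF_PRIORITY - memcg_prio) >= sc_prio;
}

int main(void)
{
	unsigned long lru_size = 1UL << 16;	/* a 64k-page LRU */
	int memcg_prio = 10;			/* a maximum-priority cgroup */

	for (int sc_prio = DEF_PRIORITY; sc_prio >= 0; sc_prio--) {
		if (!eligible(memcg_prio, sc_prio))
			continue;
		/* matches the get_scan_count() change in the patch */
		printf("sc->priority=%2d scan=%lu\n",
		       sc_prio, lru_size >> (sc_prio + memcg_prio));
	}
	return 0;
}

With these numbers, a priority-10 cgroup is untouched until sc->priority reaches 2, and even then only 16 of its 65536 pages are scanned per pass.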

We've seen cases on Android where the global LRU isn't sufficient. For example, notifications in Android are rendered as part of a separate process that runs infrequently. However, when a notification appears and the user slides down the notification tray, we'll often see dropped frames due to page faults if there has been severe memory pressure. There are similar issues with other persistent processes.

The goal on an Android device is to aggressively evict from very low-priority background tasks that are likely to be killed anyway, since this will reduce the likelihood of lowmemorykiller running in the first place. It will still evict some from foreground and persistent processes, but it should help ensure that background processes are effectively reduced to the size of their heaps before evicting from more critical tasks. This should mean fewer background processes end up killed, which should improve performance and power on Android across the board (since it costs significantly less to page things back in than to replay the entirety of application startup).

The follow-on that I'm also experimenting with is how to improve vmpressure such that userspace can have some idea when low-priority memory cgroups are about as small as they can get. The correct time for Android to kill a background process under memory pressure is when there is evidence that a process has to be killed in order to alleviate memory pressure. If the device is below the low memory watermark and we know that there's probably no way to reclaim any more from background processes, then a userspace daemon should kill one or more background processes to fix that. Per-cgroup priority could be the first step toward that information.

I've tested a version of this patch on a Pixel running 3.18 along with an overhauled version of lmkd (the Android userspace lowmemorykiller daemon), and it does seem to work fine. I've ported it forward but have not yet rigorously tested it at TOT, since I don't have an Android test setup running TOT. While I'm getting my tests ported over, I would like some feedback on adding another tunable as well as what the tunable's interface should be--I really don't like the 0-10 priority scheme I have in the patch but I don't have a better idea.

Thanks,
Tim

Tim Murray (1):
mm, memcg: add prioritized reclaim

include/linux/memcontrol.h | 20 +++++++++++++++++++-
mm/memcontrol.c | 33 +++++++++++++++++++++++++++++++++
mm/vmscan.c | 3 ++-
3 files changed, 54 insertions(+), 2 deletions(-)

--
2.12.0.367.g23dc2f6d3c-goog


2017-03-17 23:26:49

by Tim Murray

Subject: [RFC 1/1] mm, memcg: add prioritized reclaim

When a system is under memory pressure, it may be beneficial to prioritize
some memory cgroups to keep their pages resident ahead of other cgroups'
pages. Add a new interface to memory cgroups, memory.priority, that enables
kswapd and direct reclaim to scan more pages in lower-priority cgroups
before looking at higher-priority cgroups.

Signed-off-by: Tim Murray <[email protected]>
---
include/linux/memcontrol.h | 20 +++++++++++++++++++-
mm/memcontrol.c | 33 +++++++++++++++++++++++++++++++++
mm/vmscan.c | 3 ++-
3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5af377303880..0d0f95839a8d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,7 +206,9 @@ struct mem_cgroup {
 	bool oom_lock;
 	int under_oom;
 
-	int swappiness;
+	int swappiness;
+	int priority;
+
 	/* OOM-Killer disable */
 	int oom_kill_disable;
 
@@ -487,6 +489,16 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+static inline int mem_cgroup_priority(struct mem_cgroup *memcg)
+{
+	/* root ? */
+	if (mem_cgroup_disabled() || !memcg->css.parent)
+		return 0;
+
+	return memcg->priority;
+}
+
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -766,6 +778,12 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline int mem_cgroup_priority(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 #endif /* CONFIG_MEMCG */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd7541d7c11..7343ca106a36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -81,6 +81,8 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+#define MEM_CGROUP_PRIORITY_MAX	10
+
 /* Socket memory accounting disabled? */
 static bool cgroup_memory_nosocket;
 
@@ -241,6 +243,7 @@ enum res_type {
 	_OOM_TYPE,
 	_KMEM,
 	_TCP,
+	_PRIO,
 };
 
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
@@ -842,6 +845,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		memcg = mem_cgroup_from_css(css);
 
+		if (reclaim && reclaim->priority &&
+		    (DEF_PRIORITY - memcg->priority) < reclaim->priority)
+			continue;
+
 		if (css == &root->css)
 			break;
 
@@ -2773,6 +2780,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_PRIORITY,
 };
 
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
@@ -2783,6 +2791,7 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 
 	switch (MEMFILE_TYPE(cft->private)) {
 	case _MEM:
+	case _PRIO:
 		counter = &memcg->memory;
 		break;
 	case _MEMSWAP:
@@ -2813,6 +2822,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 		return counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return (u64)memcg->soft_limit * PAGE_SIZE;
+	case RES_PRIORITY:
+		return (u64)memcg->priority;
 	default:
 		BUG();
 	}
@@ -2966,6 +2977,22 @@ static int memcg_update_tcp_limit(struct mem_cgroup *memcg, unsigned long limit)
 	return ret;
 }
 
+static ssize_t mem_cgroup_update_prio(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long long prio;
+
+	buf = strstrip(buf);
+	prio = memparse(buf, NULL);
+
+	if (prio <= MEM_CGROUP_PRIORITY_MAX) {
+		memcg->priority = (int)prio;
+		return nbytes;
+	}
+	return -EINVAL;
+}
+
 /*
  * The user of this function is...
  * RES_LIMIT.
@@ -3940,6 +3967,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "priority",
+		.private = MEMFILE_PRIVATE(_PRIO, RES_PRIORITY),
+		.write = mem_cgroup_update_prio,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "stat",
 		.seq_show = memcg_stat_show,
 	},
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc8031ef994d..c47b21326ab0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2116,6 +2116,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			   unsigned long *lru_pages)
 {
 	int swappiness = mem_cgroup_swappiness(memcg);
+	int priority = mem_cgroup_priority(memcg);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
@@ -2287,7 +2288,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		unsigned long scan;
 
 		size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		scan = size >> sc->priority;
+		scan = size >> (sc->priority + priority);
 
 		if (!scan && pass && force_scan)
 			scan = min(size, SWAP_CLUSTER_MAX);
--
2.12.0.367.g23dc2f6d3c-goog

2017-03-20 05:59:36

by Minchan Kim

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hello,

On Fri, Mar 17, 2017 at 04:16:35PM -0700, Tim Murray wrote:
> Hi all,
>
> I've been working to improve Android's memory management and drop lowmemorykiller from the kernel, and I'd like to get some feedback on a small patch with a lot of side effects.
>
> Currently, when an Android device is under memory pressure, one of three things will happen from kswapd:
>
> 1. Compress an anonymous page to ZRAM.
> 2. Evict a file page.
> 3. Kill a process via lowmemorykiller.
>
> The first two are cheap and per-page, the third is relatively cheap in the short term, frees many pages, and may cause power and performance penalties later on when the process has to be started again. For lots of reasons, I'd like a better balance between reclamation and killing on Android.
>
> One of the nice things about Android from an optimization POV is that the execution model is more constrained than a generic Linux machine. There are only a limited number of processes that need to execute quickly for the device to appear to have good performance, and a userspace daemon (called ActivityManagerService) knows exactly what those processes are at any given time. We've made use of that in the past via cpusets and schedtune to limit the CPU resources available to background processes, and I think we can apply the same concept to memory.
>

AFAIK, many platforms as well as Android have done this. IOW, they know which
processes are important and which are not critical for user responsiveness.

> This patch adds a new tunable to mem cgroups, memory.priority. A mem cgroup with a non-zero priority will not be eligible for scanning until the scan_control's priority is greater than zero. Once the mem cgroup is eligible for scanning, the priority acts as a bias to reduce the number of pages that should be scanned.

First of all, the concept makes sense to me. The problem with the cgroup-per-app
model is that it's really hard to predict how much memory a group needs to keep
the system smooth, even though we know which processes are important.
Because of that, it's hard to tune memcg low/high/max proactively.

So, it would be great if the admin could give more priority to some groups, like
the graphics manager, the launcher, and killer applications like the TV manager,
the dialer, and so on (i.e., when memory pressure happens, reclaim more pages
from the low-priority groups).

However, I'm not sure your approach is good. It seems your approach just
reclaims pages from groups where (DEF_PRIORITY - memcg->priority) >= sc->priority.
IOW, it is based on the *temporary* memory-pressure fluctuation of sc->priority.

Rather than that, I guess the pages to be reclaimed should be distributed by
memcg->priority. Namely, if global memory pressure happens and the VM wants to
reclaim 100 pages, the VM should reclaim 90 pages from memcg-A (priority 10)
and 10 pages from memcg-B (priority 90).
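
(To illustrate that distribution, a minimal sketch where each group's share of the reclaim target is proportional to the sum of all priorities minus its own. The two-group setup and this particular weighting rule are just one way to get the 90/10 split above, not a finished proposal.)

#include <stdio.h>

int main(void)
{
	int prio[] = { 10, 90 };		/* memcg-A, memcg-B */
	int n = 2, sum = 0, target = 100;	/* pages the VM wants to reclaim */

	for (int i = 0; i < n; i++)
		sum += prio[i];
	for (int i = 0; i < n; i++)		/* lower priority => larger share */
		printf("memcg-%c: reclaim %d pages\n", 'A' + i,
		       target * (sum - prio[i]) / sum);
	return 0;
}

This prints 90 pages for memcg-A and 10 pages for memcg-B.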

Anyway, it's a really desirable approach, so memcg maintainers, please have a
look.

Thanks.


2017-03-20 07:07:29

by peter enderborg

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hi Tim. Do you have a link to the new version of lmkd?


2017-03-20 08:19:33

by Kyungmin Park

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Sat, Mar 18, 2017 at 8:16 AM, Tim Murray <[email protected]> wrote:
> Hi all,
>
> I've been working to improve Android's memory management and drop lowmemorykiller from the kernel, and I'd like to get some feedback on a small patch with a lot of side effects.
>
> Currently, when an Android device is under memory pressure, one of three things will happen from kswapd:
>
> 1. Compress an anonymous page to ZRAM.
> 2. Evict a file page.
> 3. Kill a process via lowmemorykiller.
>
> The first two are cheap and per-page, the third is relatively cheap in the short term, frees many pages, and may cause power and performance penalties later on when the process has to be started again. For lots of reasons, I'd like a better balance between reclamation and killing on Android.
>
> One of the nice things about Android from an optimization POV is that the execution model is more constrained than a generic Linux machine. There are only a limited number of processes that need to execute quickly for the device to appear to have good performance, and a userspace daemon (called ActivityManagerService) knows exactly what those processes are at any given time. We've made use of that in the past via cpusets and schedtune to limit the CPU resources available to background processes, and I think we can apply the same concept to memory.
>
> This patch adds a new tunable to mem cgroups, memory.priority. A mem cgroup with a non-zero priority will not be eligible for scanning until the scan_control's priority is greater than zero. Once the mem cgroup is eligible for scanning, the priority acts as a bias to reduce the number of pages that should be scanned.

Here's an old discussion about supporting per-app memcg reclaim:

"[PATCH] memcg: Add force_reclaim to reclaim tasks' memory in memcg."
http://www.spinics.net/lists/cgroups/msg07874.html

Unlike the existing interface, it can reclaim memory while the process is
still in the memcg.

In our case, it's used to reclaim and swap out pages for that app.

Thank you,
Kyungmin Park

2017-03-20 14:10:46

by Vinayak Menon

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup


On Fri, Mar 17, 2017 at 04:16:35PM -0700, Tim Murray wrote:

Hi Tim,
>> Hi all,
>>
>> I've been working to improve Android's memory management and drop lowmemorykiller from the kernel, and I'd like to get some feedback on a small patch with a lot of side effects.
>>
>> Currently, when an Android device is under memory pressure, one of three things will happen from kswapd:
>>
>> 1. Compress an anonymous page to ZRAM.
>> 2. Evict a file page.
>> 3. Kill a process via lowmemorykiller.
>>
>> The first two are cheap and per-page, the third is relatively cheap in the short term, frees many pages, and may cause power and performance penalties later on when the process has to be started again. For lots of reasons, I'd like a better balance between reclamation and killing on Android.
>>
>> One of the nice things about Android from an optimization POV is that the execution model is more constrained than a generic Linux machine. There are only a limited number of processes that need to execute quickly for the device to appear to have good performance, and a userspace daemon (called ActivityManagerService) knows exactly what those processes are at any given time. We've made use of that in the past via cpusets and schedtune to limit the CPU resources available to background processes, and I think we can apply the same concept to memory.
>> This patch adds a new tunable to mem cgroups, memory.priority. A mem cgroup with a non-zero priority will not be eligible for scanning until the scan_control's priority is greater than zero. Once the mem cgroup is eligible for scanning, the priority acts as a bias to reduce the number of pages that should be scanned.
From the discussions @ https://lkml.org/lkml/2017/3/3/752, I assume you are trying
per-app memcgs. We were trying to implement per-app memory cgroups and were
encountering some issues (https://www.spinics.net/lists/linux-mm/msg121665.html).
I am curious whether you have seen similar issues, and I would like to know if the
patch also addresses some of these problems.

The major issues were:
(1) Because of the multiple per-app memcgs, the per-memcg LRU size is so small that
it results in kswapd priority drops. This causes a sudden increase in scanning at lower
priorities, and kswapd ends up consuming around 3 times more time.
(2) With kswapd taking more time to free memory, allocstalls are high, and for
similar reasons the direct reclaim path consumes 2.5 times more time.
(3) Because of the multiple LRUs, the aging of pages is affected, and this results in the
wrong pages being evicted, leading to a higher number of major faults.

Since soft reclaim was not of much help in mitigating the problem, I was trying out
something similar to memcg priority. But what I have seen is that this aggravates the
above-mentioned problems. I think this is because, even though the high-priority
(foreground) tasks have pages which are in use at the moment, they have idle pages
too which could be reclaimed. But due to the high priority of the foreground memcg,
the kswapd priority has to drop a long way before these idle pages are reclaimed.
This results in excessive reclaim from background apps, and thus increased major
faults, pageins, and longer launch latency when those apps are later brought back to
the foreground.

One thing which was found to fix the above problems is to have both a global LRU and
the per-memcg LRUs. Global reclaim can use the global LRU, fixing the above 3 issues.
The memcg LRUs can then be used for soft reclaim, or for proactive reclaim similar to
Minchan's per-process reclaim, on the background or low-priority tasks. I have been
trying this change on a 4.4 kernel (yet to try the per-app reclaim/soft reclaim part).
One downside is the extra list_head in struct page and the memory it consumes.

2017-03-20 14:43:04

by vinayak menon

Subject: Re: [RFC 1/1] mm, memcg: add prioritized reclaim

On Sat, Mar 18, 2017 at 4:46 AM, Tim Murray <[email protected]> wrote:
> When a system is under memory pressure, it may be beneficial to prioritize
> some memory cgroups to keep their pages resident ahead of other cgroups'
> pages. Add a new interface to memory cgroups, memory.priority, that enables
> kswapd and direct reclaim to scan more pages in lower-priority cgroups
> before looking at higher-priority cgroups.
>
> Signed-off-by: Tim Murray <[email protected]>
> ---
> include/linux/memcontrol.h | 20 +++++++++++++++++++-
> mm/memcontrol.c | 33 +++++++++++++++++++++++++++++++++
> mm/vmscan.c | 3 ++-
> 3 files changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5af377303880..0d0f95839a8d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -206,7 +206,9 @@ struct mem_cgroup {
> bool oom_lock;
> int under_oom;
>
> - int swappiness;
> + int swappiness;
> + int priority;
> +
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> @@ -487,6 +489,16 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
>
> bool mem_cgroup_oom_synchronize(bool wait);
>
> +static inline int mem_cgroup_priority(struct mem_cgroup *memcg)
> +{
> + /* root ? */
> + if (mem_cgroup_disabled() || !memcg->css.parent)
> + return 0;
> +
> + return memcg->priority;
> +}
> +
> +
> #ifdef CONFIG_MEMCG_SWAP
> extern int do_swap_account;
> #endif
> @@ -766,6 +778,12 @@ static inline
> void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
> {
> }
> +
> +static inline int mem_cgroup_priority(struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> +
> #endif /* CONFIG_MEMCG */
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2bd7541d7c11..7343ca106a36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -81,6 +81,8 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
>
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> +#define MEM_CGROUP_PRIORITY_MAX 10
> +
> /* Socket memory accounting disabled? */
> static bool cgroup_memory_nosocket;
>
> @@ -241,6 +243,7 @@ enum res_type {
> _OOM_TYPE,
> _KMEM,
> _TCP,
> + _PRIO,
> };
>
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> @@ -842,6 +845,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> */
> memcg = mem_cgroup_from_css(css);
>
> + if (reclaim && reclaim->priority &&
> + (DEF_PRIORITY - memcg->priority) < reclaim->priority)
> + continue;
> +
This, as I understand it, will skip, say, a priority 10 memcg until the scan
priority is less than 3. Consider a case with a foreground task at memcg
priority 0 and a large number of background tasks, each consuming a very
small amount of memory (and thus with tiny LRUs), at priority 10. Also
assume that a large part of memory is occupied by these small apps (which I
think is a valid scenario on Android). Because of the small LRU sizes of the
BG apps, the kswapd priority will drop fast, and we would eventually reach
priority 2. And at 2 or 1, a lot of pages would get scanned from both
foreground and background tasks. The foreground LRU will get excessively
scanned, even though most of the memory is occupied by the large number of
small BG apps. No?


> if (css == &root->css)
> break;
>
> @@ -2773,6 +2780,7 @@ enum {
> RES_MAX_USAGE,
> RES_FAILCNT,
> RES_SOFT_LIMIT,
> + RES_PRIORITY,
> };
>
> static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> @@ -2783,6 +2791,7 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>
> switch (MEMFILE_TYPE(cft->private)) {
> case _MEM:
> + case _PRIO:
> counter = &memcg->memory;
> break;
> case _MEMSWAP:
> @@ -2813,6 +2822,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> return counter->failcnt;
> case RES_SOFT_LIMIT:
> return (u64)memcg->soft_limit * PAGE_SIZE;
> + case RES_PRIORITY:
> + return (u64)memcg->priority;
> default:
> BUG();
> }
> @@ -2966,6 +2977,22 @@ static int memcg_update_tcp_limit(struct mem_cgroup *memcg, unsigned long limit)
> return ret;
> }
>
> +static ssize_t mem_cgroup_update_prio(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long long prio;
> +
> + buf = strstrip(buf);
> + prio = memparse(buf, NULL);
> +
> + if (prio <= MEM_CGROUP_PRIORITY_MAX) {
> + memcg->priority = (int)prio;
> + return nbytes;
> + }
> + return -EINVAL;
> +}
> +
> /*
> * The user of this function is...
> * RES_LIMIT.
> @@ -3940,6 +3967,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
> .read_u64 = mem_cgroup_read_u64,
> },
> {
> + .name = "priority",
> + .private = MEMFILE_PRIVATE(_PRIO, RES_PRIORITY),
> + .write = mem_cgroup_update_prio,
> + .read_u64 = mem_cgroup_read_u64,
> + },
> + {
> .name = "stat",
> .seq_show = memcg_stat_show,
> },
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc8031ef994d..c47b21326ab0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2116,6 +2116,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> unsigned long *lru_pages)
> {
> int swappiness = mem_cgroup_swappiness(memcg);
> + int priority = mem_cgroup_priority(memcg);
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> u64 fraction[2];
> u64 denominator = 0; /* gcc */
> @@ -2287,7 +2288,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> unsigned long scan;
>
> size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> - scan = size >> sc->priority;
> + scan = size >> (sc->priority + priority);
If most of the apps in the background (with memcg priority near 10) are
small ones in terms of LRU size, wouldn't this result in a priority drop
because of the increased shift? And wouldn't this also cause some small
LRUs never to be scanned, i.e., when (size >> MEM_CGROUP_PRIORITY_MAX) is 0
(or when scan is > 0 but SCAN_FRACT makes it 0)?
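
(Putting rough numbers on that concern; the 8MB LRU below is an assumed example, not a measurement:)

#include <stdio.h>

int main(void)
{
	unsigned long size = 2048;	/* an 8MB LRU in 4K pages */
	int memcg_prio = 10;

	for (int sc_prio = 12; sc_prio >= 0; sc_prio--)
		printf("sc->priority=%2d scan=%lu\n",
		       sc_prio, size >> (sc_prio + memcg_prio));
	return 0;
}

scan stays 0 until sc->priority drops to 1 (2048 >> 11 = 1), so kswapd has to fall almost to its lowest priority before such a group contributes anything at all.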

>
> if (!scan && pass && force_scan)
> scan = min(size, SWAP_CLUSTER_MAX);
> --
> 2.12.0.367.g23dc2f6d3c-goog
>

2017-03-20 15:56:04

by Johannes Weiner

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Mon, Mar 20, 2017 at 07:28:53PM +0530, Vinayak Menon wrote:
> From the discussions @ https://lkml.org/lkml/2017/3/3/752, I assume you are trying
> per-app memcg. We were trying to implement per app memory cgroups and were
> encountering some issues (https://www.spinics.net/lists/linux-mm/msg121665.html) .
> I am curious if you have seen similar issues and would like to know if the patch also
> address some of these problems.
>
> The major issues were:
> (1) Because of multiple per-app memcgs, the per memcg LRU size is so small and
> results in kswapd priority drop. This results in sudden increase in scan at lower priorities.
> And kswapd ends up consuming around 3 times more time.

There shouldn't be a connection between those two things.

Yes, priority levels used to dictate aggressiveness of reclaim, and we
did add a bunch of memcg code to avoid priority drops.

But nowadays the priority level should only set the LRU scan window
and we bail out once we have reclaimed enough (see the code in
shrink_node_memcg()).

If kswapd gets stuck on smaller LRUs, we should find out why and then
address that problem.
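
(A hedged paraphrase of the bail-out being referred to; this simplifies the loop in shrink_node_memcg() and is not the verbatim kernel code:)

	/* pseudo-C: the priority level sizes the per-LRU scan windows via
	 * get_scan_count(), but scanning stops once the target is met, so a
	 * deep priority drop does not by itself force whole windows to be
	 * scanned
	 */
	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(lru)
			nr_reclaimed += shrink_list(lru, ...);
		if (nr_reclaimed >= nr_to_reclaim)
			break;	/* reclaimed enough; bail out */
	}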

> (2) Due to kswapd taking more time in freeing up memory, allocstalls are high and for
> similar reasons stated above direct reclaim path consumes 2.5 times more time.
> (3) Because of multiple LRUs, the aging of pages is affected and this results in wrong
> pages being evicted resulting in higher number of major faults.
>
> Since soft reclaim was not of much help in mitigating the problem, I was trying out
> something similar to memcg priority. But what I have seen is that this aggravates the
> above mentioned problems. I think this is because, even though the high priority tasks
> (foreground) are having pages which are used at the moment, there are idle pages too
> which could be reclaimed. But due to the high priority of foreground memcg, it requires
> the kswapd priority to drop down much to reclaim these idle pages. This results in excessive
> reclaim from background apps resulting in increased major faults, pageins and thus increased
> launch latency when these apps are later brought back to foreground.

This is what the soft limit *should* do, but unfortunately its
semantics and implementation in cgroup1 are too broken for this.

Have you tried configuring memory.low for the foreground groups in
cgroup2? That protects those pages from reclaim as long as there are
reclaimable idle pages in the memory.low==0 background groups.
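
(For concreteness, a hedged sketch of that configuration; the cgroup2 mount point and the group name here are made up, and 256MB is an arbitrary amount of protection:)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* pages below memory.low are spared while unprotected
	 * (memory.low == 0) groups still have reclaimable pages
	 */
	int fd = open("/sys/fs/cgroup/foreground/memory.low", O_WRONLY);

	if (fd < 0)
		return 1;
	write(fd, "268435456", 9);	/* 256MB */
	close(fd);
	return 0;
}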

> One thing which is found to fix the above problems is to have both global LRU and the per-memcg LRU.
> Global reclaim can use the global LRU thus fixing the above 3 issues. The memcg LRUs can then be used
> for soft reclaim or a proactive reclaim similar to Minchan's Per process reclaim for the background or
> low priority tasks. I have been trying this change on 4.4 kernel (yet to try the per-app
> reclaim/soft reclaim part). One downside is the extra list_head in struct page and the memory it consumes.

That would be a major step backwards, and I'm not entirely convinced
that the issues you are seeing cannot be fixed by improving the way we
do global round-robin reclaim and/or configuring memory.low.

2017-03-21 17:19:12

by Tim Murray

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Sun, Mar 19, 2017 at 10:59 PM, Minchan Kim <[email protected]> wrote:
> However, I'm not sure your approach is good. It seems your approach just
> reclaims pages from groups where (DEF_PRIORITY - memcg->priority) >= sc->priority.
> IOW, it is based on the *temporary* memory-pressure fluctuation of sc->priority.
>
> Rather than that, I guess the pages to be reclaimed should be distributed by
> memcg->priority. Namely, if global memory pressure happens and the VM wants to
> reclaim 100 pages, the VM should reclaim 90 pages from memcg-A (priority 10)
> and 10 pages from memcg-B (priority 90).

This is what I debated most while writing this patch. If I'm
understanding your concern correctly, I think I'm doing more than
skipping high-priority cgroups:

- If the scan isn't high priority yet, then skip high-priority cgroups.
- When the scan is high priority, scan fewer pages from
higher-priority cgroups (using the priority to modify the shift in
get_scan_count).

This is tightly coupled with the question of what to do with
vmpressure. The right thing to do on an Android device under memory
pressure is probably something like this:

1. Reclaim aggressively from low-priority background processes. The
goal here is to reduce the pages used by background processes to the
size of their heaps (or smaller with ZRAM), with zero file pages.
They're already likely to be killed by userspace and we're keeping
them around opportunistically, so a performance hit if they run and
have to do IO to restore some of that working set is okay.
2. Reclaim a small amount from persistent processes. These often have
a performance-critical subset of pages that we absolutely don't want
paged out, but some reclaim of these processes is fine. They're large,
some of them only run sporadically and don't impact performance, it's
okay to touch these sometimes.
3. If there still aren't enough free pages, notify userspace to kill
any processes it can. If I put my "Android performance engineer
working on userspace" hat on, what I'd want to know in userspace is
that kswapd/direct reclaim probably has to scan foreground processes
in order to reclaim enough free pages to satisfy the watermarks. That's a
signal I could directly act on from userspace.
4. If that still isn't enough, reclaim from foreground processes,
since those processes are performance-critical.

As a result, I like not being fair about which cgroups are scanned
initially. Some cgroups are strictly more important than others. (With
that said, I'm not tied to enforcing unfairness in scanning. Android
would probably use different priority levels for each app level for
fair scanning vs unfair scanning, but my guess is that the actual
reclaiming behavior would look similar in both schemes.)

Mem cgroup priority suggests a useful signal for vmpressure. If
scanning is starting to touch cgroups at a higher priority than
persistent processes, then the userspace lowmemorykiller could kill
one or more background processes (which would be in low-priority
cgroups that have already been scanned aggressively). The current lmk
hand-tuned watermarks would be gone, and tuning the /proc/sys/vm knobs
would be all that's required to make an Android device do the right
thing in terms of memory.


2017-03-22 04:42:11

by Minchan Kim

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hi Tim,

On Tue, Mar 21, 2017 at 10:18:26AM -0700, Tim Murray wrote:
> On Sun, Mar 19, 2017 at 10:59 PM, Minchan Kim <[email protected]> wrote:
> > However, I'm not sure your approach is good. It seems your approach just
> > reclaims pages from groups where (DEF_PRIORITY - memcg->priority) >= sc->priority.
> > IOW, it is based on the *temporary* memory-pressure fluctuation of sc->priority.
> >
> > Rather than that, I guess the pages to be reclaimed should be distributed by
> > memcg->priority. Namely, if global memory pressure happens and the VM wants to
> > reclaim 100 pages, the VM should reclaim 90 pages from memcg-A (priority 10)
> > and 10 pages from memcg-B (priority 90).
>
> This is what I debated most while writing this patch. If I'm
> understanding your concern correctly, I think I'm doing more than
> skipping high-priority cgroups:

Yes, that is my concern. It could put too much pressure on the lower-priority
groups. You already reduced the scanning window for high-priority groups, so
I guess that would be enough to make it work.

The rationale, to my thinking, is that a high-priority group can have cold pages
(for instance, used-once pages, MADV_FREE pages, and so on), so the VM should
age every group to reclaim cold pages, but we can reduce the scanning window for
the high-priority groups to keep more of their working set, as you did. By that, we
already put more pressure on the lower-priority groups than on the high-priority ones.

>
> - If the scan isn't high priority yet, then skip high-priority cgroups.

This part is the one I think is too much ;-)
I think there's no need to skip; just reduce the scanning window by the group's
priority.

> - When the scan is high priority, scan fewer pages from
> higher-priority cgroups (using the priority to modify the shift in
> get_scan_count).

That sounds like a good idea but needs more tuning.

How about this?

get_scan_count for memcg-A:
	..
	size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx) *
	       (priority of memcg-A / sum of all memcg priorities)

get_scan_count for memcg-B:
	..
	size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx) *
	       (priority of memcg-B / sum of all memcg priorities)

By that, couldn't it support the memcg hierarchy as well? I don't know. ;(
I hope the memcg folks can give it more thought.

>
> This is tightly coupled with the question of what to do with
> vmpressure. The right thing to do on an Android device under memory
> pressure is probably something like this:
>
> 1. Reclaim aggressively from low-priority background processes. The
> goal here is to reduce the pages used by background processes to the
> size of their heaps (or smaller with ZRAM) but zero file pages.
> They're already likely to be killed by userspace and we're keeping
> them around opportunistically, so a performance hit if they run and
> have to do IO to restore some of that working set is okay.
> 2. Reclaim a small amount from persistent processes. These often have
> a performance-critical subset of pages that we absolutely don't want
> paged out, but some reclaim of these processes is fine. They're large,
> some of them only run sporadically and don't impact performance, it's
> okay to touch these sometimes.

That's why I wanted to age the LRUs of all memcgs, but more slowly for the
high-priority groups via the reduced scanning window, which gives a high-priority
group's pages more chance to be activated. So it's already a priority boost.

> 3. If there still aren't enough free pages, notify userspace to kill
> any processes it can. If I put my "Android performance engineer
> working on userspace" hat on, what I'd want to know from userspace is
> that kswapd/direct reclaim probably has to scan foreground processes
> in order to reclaim enough free pages to satisfy watermarks. That's a
> signal I could directly act on from userspace.

Hmm, could you tell us how many memcg groups you are thinking of now?

Background and foreground? Just two?

The reason I ask is that if you want to make foreground/background memcgs
and move apps between them back and forth as their status changes, we
need to remember that LRU pages are not moved out of their original memcg,
so it wouldn't work as expected.


> 4. If that still isn't enough, reclaim from foreground processes,
> since those processes are performance-critical.
>
> As a result, I like not being fair about which cgroups are scanned
> initially. Some cgroups are strictly more important than others. (With

Yeah, *initially* is the arguable point. I hope only reducing the scanning
window would work. However, that's just my two cents. If it has a problem,
then yes, we need something more.


> that said, I'm not tied to enforcing unfairness in scanning. Android
> would probably use different priority levels for each app level for
> fair scanning vs unfair scanning, but my guess is that the actual
> reclaiming behavior would look similar in both schemes.)
>
> Mem cgroup priority suggests a useful signal for vmpressure. If
> scanning is starting to touch cgroups at a higher priority than
> persistent processes, then the userspace lowmemorykiller could kill
> one or more background processes (which would be in low-priority
> cgroups that have already been scanned aggressively). The current lmk
> hand-tuned watermarks would be gone, and tuning the /proc/sys/vm knobs
> would be all that's required to make an Android device do the right
> thing in terms of memory.

Yes, it's a better way, I think.

Thanks.

2017-03-22 05:21:00

by Minchan Kim

Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Wed, Mar 22, 2017 at 01:41:17PM +0900, Minchan Kim wrote:
> Hi Tim,
>
> On Tue, Mar 21, 2017 at 10:18:26AM -0700, Tim Murray wrote:
> > On Sun, Mar 19, 2017 at 10:59 PM, Minchan Kim <[email protected]> wrote:
> > > However, I'm not sure your approach is good. It seems your approach just
> > > reclaims pages from groups (DEF_PRIORITY - memcg->priority) >= sc->priority.
> > > IOW, it is based on *temporal* memory pressure fluctuation sc->priority.
> > >
> > > Rather than it, I guess pages to be reclaimed should be distributed by
> > > memcg->priority. Namely, if global memory pressure happens and VM want to
> > > reclaim 100 pages, VM should reclaim 90 pages from memcg-A(priority-10)
> > > and 10 pages from memcg-B(prioirty-90).
> >
> > This is what I debated most while writing this patch. If I'm
> > understanding your concern correctly, I think I'm doing more than
> > skipping high-priority cgroups:
>
> Yes, that is my concern. It could give too much pressure lower-priority
> group. You already reduced scanning window for high-priority group so
> I guess it would be enough for working.
>
> The rationale from my thining is high-priority group can have cold pages(
> for instance, used-once pages, madvise_free pages and so on) so, VM should
> age every groups to reclaim cold pages but we can reduce scanning window
> for high-priority group to keep more workingset as you did. By that, we
> already give more pressure to lower priority group than high-prioirty group.
>
> >
> > - If the scan isn't high priority yet, then skip high-priority cgroups.
>
> This part is the one I think is too much ;-)
> I think there's no need to skip; just reduce the scanning window by
> the group's priority.
>
> > - When the scan is high priority, scan fewer pages from
> > higher-priority cgroups (using the priority to modify the shift in
> > get_scan_count).
>
> That sounds like a good idea but needs more tuning.
>
> How about this?
>
> get_scan_count for memcg-A:
> ..
> size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx) *
> (memcg-A / sum(memcg all priorities))
>
> get_scan_count for memcg-B:
> ..
> size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx) *
> (memcg-B / sum(memcg all priorities))
>

Huh, correction.

size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
scan = size >> sc->priority;
scan = scan * (sum of memcg priorities - memcg A's priority) / (sum of memcg priorities);
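
Spelled out as a self-contained sketch (hypothetical helper and
parameter names, not code from the patch), the corrected proposal
would look something like:

	/*
	 * Derive the normal scan window from sc->priority, then scale
	 * it down by the memcg's share of the total priority, so that
	 * higher-priority groups are scanned proportionally less.
	 */
	static unsigned long biased_scan_count(unsigned long lru_size,
					       int sc_priority,
					       unsigned int memcg_prio,
					       unsigned int total_prio)
	{
		unsigned long scan = lru_size >> sc_priority;

		if (total_prio)
			scan = scan * (total_prio - memcg_prio) / total_prio;
		return scan;
	}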

2017-03-22 12:22:40

by Vinayak Menon

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On 3/20/2017 8:53 PM, Johannes Weiner wrote:
> On Mon, Mar 20, 2017 at 07:28:53PM +0530, Vinayak Menon wrote:
>> From the discussions @ https://lkml.org/lkml/2017/3/3/752, I assume you are trying
>> per-app memcg. We were trying to implement per app memory cgroups and were
>> encountering some issues (https://www.spinics.net/lists/linux-mm/msg121665.html) .
>> I am curious if you have seen similar issues and would like to know if the patch also
>> address some of these problems.
>>
>> The major issues were:
>> (1) Because of multiple per-app memcgs, the per memcg LRU size is so small and
>> results in kswapd priority drop. This results in sudden increase in scan at lower priorities.
>> And kswapd ends up consuming around 3 times more time.
> There shouldn't be a connection between those two things.
>
> Yes, priority levels used to dictate aggressiveness of reclaim, and we
> did add a bunch of memcg code to avoid priority drops.
>
> But nowadays the priority level should only set the LRU scan window
> and we bail out once we have reclaimed enough (see the code in
> shrink_node_memcg()).
>
> If kswapd gets stuck on smaller LRUs, we should find out why and then
> address that problem.
Hi Johannes, thanks for your comments. I will try to explain what I
have observed while debugging this problem.

Consider the case where there are many small LRUs and very few LRUs of
considerable size (by considerable size I mean sizes that can result
in a non-zero scan value in get_scan_count at priorities near
DEF_PRIORITY). Since I am trying this on a 4.4 kernel, there are even
more small LRUs per app (per memcg) because of the further split due
to per-zone LRUs. When most of the apps in the system fall into this
small category, the scan calculated by get_scan_count for these memcg
LRUs at priorities around DEF_PRIORITY becomes zero or very small,
either because size >> sc->priority is 0 or because of SCAN_FRACT. For
these runs around DEF_PRIORITY (say down to DEF_PRIORITY/2),
sc->nr_scanned stays below sc->nr_to_reclaim, so the kswapd priority
drops. Then, at kswapd priorities below DEF_PRIORITY/2, the scan
returned by get_scan_count slowly gets higher for all memcgs. This
causes sudden excessive scanning of most of the memcgs (including
heavy scanning of the memcgs that do have considerable size). As I
understand it, the scan priority in this case results in aggressive
reclaim and does not merely set the scan window, because in the
following check in shrink_node_memcg, nr_to_reclaim (sc->nr_to_reclaim)
is high compared to the memcg LRU size. I have seen that this also
causes either nr_file or nr_anon to go to zero most of the time (after
this check), which as I understand means that proportional scanning
does not happen.

if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;

I had tried making nr_to_reclaim proportional to the LRU size, and
that brings some benefit, but it does not solve the problem: when that
is done, in some cases the number of pages scanned at a given priority
decreases again, resulting in a further priority drop.
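
To make the arithmetic concrete (illustrative numbers, not from the
measurements above): with 4K pages, a small per-app memcg holding
~16MB on one LRU has a size of 4096 pages, so at DEF_PRIORITY (12),
size >> sc->priority = 4096 >> 12 = 1, and anything below 16MB scans 0
pages. Only once sc->priority has dropped to around 6 does such an LRU
contribute a meaningful scan count (4096 >> 6 = 64), and by then the
scan counts of the large memcgs have grown 64-fold as well.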

I have confirmed the priority drop and the excessive scan/reclaim at
lower priorities by keeping per-priority scanned and reclaimed
counters in vmstat. And yes, this results in kswapd staying awake and
running for a longer time.

There was some benefit from prioritizing the memcgs similarly to what
Tim does in his patch, and also from proportionally reclaiming from
the per-task memcgs based on their priority. But the stats are still
far worse compared to having a global LRU. One thing commonly seen in
all these experiments is a multi-fold increase in major faults, which
I think is partly caused by poor aging of pages when they are
distributed among a large number of tiny LRUs and global reclaim tries
to reclaim from all of them.
>> (2) Due to kswapd taking more time in freeing up memory, allocstalls are high and for
>> similar reasons stated above direct reclaim path consumes 2.5 times more time.
>> (3) Because of multiple LRUs, the aging of pages is affected and this results in wrong
>> pages being evicted resulting in higher number of major faults.
>>
>> Since soft reclaim was not of much help in mitigating the problem, I was trying out
>> something similar to memcg priority. But what I have seen is that this aggravates the
>> above mentioned problems. I think this is because, even though the high priority tasks
>> (foreground) are having pages which are used at the moment, there are idle pages too
>> which could be reclaimed. But due to the high priority of foreground memcg, it requires
>> the kswapd priority to drop down much to reclaim these idle pages. This results in excessive
>> reclaim from background apps resulting in increased major faults, pageins and thus increased
>> launch latency when these apps are later brought back to foreground.
> This is what the soft limit *should* do, but unfortunately its
> semantics and implementation in cgroup1 are too broken for this.
>
> Have you tried configuring memory.low for the foreground groups in
> cgroup2? That protects those pages from reclaim as long as there are
> reclaimable idle pages in the memory.low==0 background groups.
I have not yet tried cgroup2. I was trying to understand it some time
back, and IIUC it supports only a single hierarchy and a process can
be part of only one cgroup, which means that if we try per-task memory
cgroups, all the other controllers would have to be configured
per-task as well, no? I would like to try the memory.low approach you
suggest. Let me check whether I have a way to test it without
disturbing the other controllers, or I will try with the memory cgroup
alone.
>> One thing which is found to fix the above problems is to have both global LRU and the per-memcg LRU.
>> Global reclaim can use the global LRU thus fixing the above 3 issues. The memcg LRUs can then be used
>> for soft reclaim or a proactive reclaim similar to Minchan's Per process reclaim for the background or
>> low priority tasks. I have been trying this change on 4.4 kernel (yet to try the per-app
>> reclaim/soft reclaim part). One downside is the extra list_head in struct page and the memory it consumes.
> That would be a major step backwards, and I'm not entirely convinced
> that the issues you are seeing cannot be fixed by improving the way we
> do global round-robin reclaim and/or configuring memory.low.
I understand and agree that it would be better to fix the existing design if it is possible.

Thanks,
Vinayak

2017-03-30 05:59:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

To memcg maintainer,

Could you comment about this topic?

On Fri, Mar 17, 2017 at 04:16:35PM -0700, Tim Murray wrote:
> Hi all,
>
> [...]

2017-03-30 07:10:51

by Tim Murray

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Sorry for the delay on my end as well. I realized that given multiple
equivalent prioritization implementations, my favorite would be the
one that provides the clearest signal to vmpressure. I've been
experimenting with different approaches to using memcg priority in
vmpressure, and I'm cautiously optimistic about my latest attempt. I
like the data I get from vmscan, but I'm still wiring up userspace and
testing different thresholds so I don't yet know that it's a strong
enough signal. I hope to have a new RFC before the weekend.

On Wed, Mar 29, 2017 at 10:59 PM, Minchan Kim <[email protected]> wrote:
> To memcg maintainer,
>
> Could you comment about this topic?
>
> On Fri, Mar 17, 2017 at 04:16:35PM -0700, Tim Murray wrote:
>> [...]

2017-03-30 15:51:34

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hi Tim,

On Fri, Mar 17, 2017 at 04:16:35PM -0700, Tim Murray wrote:
> [...]
>
> The goal on an Android device is to aggressively evict from very low-priority background tasks that are likely to be killed anyway, since this will reduce the likelihood of lowmemorykiller running in the first place. It will still evict some from foreground and persistent processes, but it should help ensure that background processes are effectively reduced to the size of their heaps before evicting from more critical tasks. This should mean fewer background processes end up killed, which should improve performance and power on Android across the board (since it costs significantly less to page things back in than to replay the entirety of application startup).

In cgroup2, we've added a memory.low knob, where groups within their
memory.low setting are not reclaimed.

You can set that knob on foreground groups to the amount of memory
they need to function properly, and set it to 0 on background groups.

Have you tried doing that?

> The follow-on that I'm also experimenting with is how to improve vmpressure such that userspace can have some idea when low-priority memory cgroups are about as small as they can get. The correct time for Android to kill a background process under memory pressure is when there is evidence that a process has to be killed in order to alleviate memory pressure. If the device is below the low memory watermark and we know that there's probably no way to reclaim any more from background processes, then a userspace daemon should kill one or more background processes to fix that. Per-cgroup priority could be the first step toward that information.

Memory pressure is a wider-reaching issue, something I've been working
on for a while.

Both vmpressure and priority levels are based on reclaim efficiency,
which is problematic on solid state storage because page reads have
very low latency. It's rare that pages are still locked from the
read-in by the time reclaim gets to them on the LRU, so efficiency
tends to stay at 100%, until the system is essentially livelocked.

On solid state storage, the bigger problem when you don't have enough
memory is that you can reclaim just fine but wait a significant amount
of time to refault the recently evicted pages, i.e. on thrashing.

A more useful metric for memory pressure at this point is quantifying
that time you spend thrashing: time the job spends in direct reclaim
and on the flipside time the job waits for recently evicted pages to
come back. Combined, that gives you a good measure of overhead from
memory pressure; putting that in relation to a useful baseline of
meaningful work done gives you a portable scale of how effectively
your job is running.
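
Schematically (a sketch of the metric as described above, not an
existing kernel interface):

	stall    = time in direct reclaim + time waiting on refaults
	pressure = stall / (stall + time doing meaningful work)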

I'm working on that right now, hopefully I'll have something useful
soon.

2017-03-30 16:48:58

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

> [...]

Johannes, is the work you are doing only about file pages or will it
equally apply to anon pages as well?

2017-03-30 19:40:36

by Tim Murray

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Thu, Mar 30, 2017 at 8:51 AM, Johannes Weiner <[email protected]> wrote:
> In cgroup2, we've added a memory.low knob, where groups within their
> memory.low setting are not reclaimed.
>
> You can set that knob on foreground groups to the amount of memory
> they need to function properly, and set it to 0 on background groups.
>
> Have you tried doing that?

I have not, but I'm trying to get that working now to evaluate it on Android.

However, based on other experiences, I don't think it will work well.
We've experimented a lot with different limits in different places
(Java heap limits, hard_reclaim, soft_reclaim) at different times in
the process lifecycle, and the problem has always been that there's no
way for us to know what limit is reasonable. memory.low will have the
same problem. If memory.low is higher than the actual working set of a
foreground process, the system wastes memory (eg, file pages loaded
during app startup that are never used again won't be reclaimed under
pressure). If memory.low is less than the actual working set,
foreground processes will still get hit by thrashing.

Another issue is that the working set varies tremendously from app to
app. An email client's working set may be 1/10 or 1/20 of a camera
running a computational photography pipeline with multiple captures in
flight. I can imagine a case where it makes sense for a foreground
application to take 50-75% of a device's physical memory (the camera
case or something similar), but I hope that's an extreme outlier
compared to most apps on the system. However, high-memory apps are
often the most performance-sensitive, so reclaim is more likely to
cause problems.

As a result, I think there's still a need for relative priority
between mem cgroups, not just an absolute limit.

Does that make sense?

> Both vmpressure and priority levels are based on reclaim efficiency,
> which is problematic on solid state storage because page reads have
> very low latency. It's rare that pages are still locked from the
> read-in by the time reclaim gets to them on the LRU, so efficiency
> tends to stay at 100%, until the system is essentially livelocked.
>
> On solid state storage, the bigger problem when you don't have enough
> memory is that you can reclaim just fine but wait a significant amount
> of time to refault the recently evicted pages, i.e. on thrashing.
>
> A more useful metric for memory pressure at this point is quantifying
> that time you spend thrashing: time the job spends in direct reclaim
> and on the flipside time the job waits for recently evicted pages to
> come back. Combined, that gives you a good measure of overhead from
> memory pressure; putting that in relation to a useful baseline of
> meaningful work done gives you a portable scale of how effectively
> your job is running.

This sounds fantastic, and it matches the behavior I've seen around
pagecache thrashing on Android.

On Android, I think there are three different times where userspace
would do something useful for memory:

1. scan priority is creeping up, scanned/reclaim ratio is getting
worse, system is exhibiting signs of approaching severe memory
pressure. userspace should probably kill something if it's got
something it can kill cheaply.
2. direct reclaim is happening, system is thrashing, things are bad.
userspace should aggressively kill non-critical processes because
performance has already gotten worse.
3. something's gone horribly wrong, oom_killer is imminent: userspace
should kill everything it possibly can to keep the system stable.

My vmpressure experiments have focused on #1 because it integrates
nicely with memcg priorities. However, it doesn't seem like a good
approach for #2 or #3. Time spent thrashing sounds ideal for #2. I'm
not sure what to do for #3. The current critical vmpressure event
hasn't been that successful in avoiding oom-killer (on 3.18, at
least)--I've been able to get oom-killer to trigger without a
vmpressure event.

Assuming that memcg priorities are reasonable, would you be open to
using scan priority info as a vmpressure signal for a low amount of
memory pressure?

2017-03-30 21:54:31

by Tim Murray

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Thu, Mar 30, 2017 at 12:40 PM, Tim Murray <[email protected]> wrote:
> The current critical vmpressure event
> hasn't been that successful in avoiding oom-killer (on 3.18, at
> least)--I've been able to get oom-killer to trigger without a
> vmpressure event.

Looked at this some more, and this is almost certainly because
vmpressure relies on workqueues. Scheduling delay from CFS workqueues
would explain vmpressure latency that results in oom-killer running
long before the critical vmpressure notification is received in
userspace, even if userspace is running as FIFO. We regularly see
10ms+ latency on workqueues, even when an Android device isn't heavily
loaded.

2017-04-13 04:30:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Thu, Mar 30, 2017 at 12:40:32PM -0700, Tim Murray wrote:
> [...]
>
> As a result, I think there's still a need for relative priority
> between mem cgroups, not just an absolute limit.
>
> Does that make sense?

I agree with it.

Recently, embedded platforms' workloads for smart devices have become
much more diverse (from games to alarms), so it's hard to set an
absolute limit proactively, and userspace has better hints about which
workloads are more important (i.e., allowed to be greedy) compared to
others, even though favoring them may hurt something less visible to
the user.

From that point of view, I support this idea as a basic approach. And
with the thrashing detector from Johannes, we could fine-tune the LRU
balancing and the vmpressure firing time better.

Johannes,

Do you have any concerns about this memcg priority idea?
Or do you think the patchset you are preparing solves this situation?


2017-04-13 16:01:58

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Thu, Apr 13, 2017 at 01:30:47PM +0900, Minchan Kim wrote:
> On Thu, Mar 30, 2017 at 12:40:32PM -0700, Tim Murray wrote:
> > As a result, I think there's still a need for relative priority
> > between mem cgroups, not just an absolute limit.
> >
> > Does that make sense?
>
> I agree with it.
>
> > [...]
>
> Johannes,
>
> > Do you have any concerns about this memcg priority idea?

While I fully agree that relative priority levels would be easier to
configure, this patch doesn't really do that. It allows you to set a
scan window divider to a fixed amount and, as I already pointed out,
the scan window is no longer representative of memory pressure.

[ Really, sc->priority should probably just be called LRU lookahead
factor or something, there is not much about it being representative
of any kind of urgency anymore. ]

With this patch, if you configure the priorities of two 8G groups to 0
and 4, reclaim will treat them exactly the same*. If you configure the
priorities of two 100G groups to 0 and 7, reclaim will treat them
exactly the same. The bigger the group, the more of the lower end of
the priority range becomes meaningless, because once the divider
produces outcomes larger than SWAP_CLUSTER_MAX (32), it doesn't
actually bias reclaim anymore.
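
To spell out the arithmetic (assuming 4K pages and, per the earlier
discussion, the memcg priority being added to the scan shift in
get_scan_count): an 8G group has roughly 2M pages, so at sc->priority
== 12 a priority-0 group gets a scan window of 2M >> 12 = 512 pages
while a priority-4 group gets 2M >> 16 = 32. Both are at or above
SWAP_CLUSTER_MAX, and since reclaim bails out as soon as nr_to_reclaim
is met, the two configurations end up being treated the same.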

So that's not a portable relative scale of pressure discrimination.

But the bigger problem with this is that, as sc->priority doesn't
represent memory pressure anymore, it is merely a cut-off for which
groups to scan and which groups not to scan *based on their size*.

That is the same as setting memory.low!

* For simplicity, I'm glossing over the fact here that LRUs are split
by type and into inactive/active, so in reality the numbers are a
little different, but you get the point.

> Or do you think the patchset you are preparing solves this situation?

It's certainly a requirement. In order to implement a relative scale
of memory pressure discrimination, we first need to be able to really
quantify memory pressure.

Then we can either allow setting absolute latency/slowdown minimums
for each group, with reclaim skipping groups above those thresholds,
or we can map a relative priority scale against the total slowdown due
to lack of memory in the system, and each group gets a relative share
based on its priority compared to other groups.
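
As a sketch of that second option (hypothetical formula, assuming
pressure can be quantified as stall time, and that a larger priority
value means a group should absorb more of the slowdown):

	allowed_stall(group) = total_slowdown * priority(group) / sum(priorities)

with reclaim preferring the groups whose measured stall is furthest
below their allowed share.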

But there is no way around first having a working measure of memory
pressure before we can meaningfully distribute it among the groups.

Thanks

2017-04-13 16:03:56

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

On Thu, Mar 30, 2017 at 09:48:55AM -0700, Shakeel Butt wrote:
> > [...]
>
> Johannes, is the work you are doing only about file pages or will it
> equally apply to anon pages as well?

It will work on both, with the caveat that *any* swapin is counted as
memory delay, whereas only cache misses of recently evicted entries
count toward it (we don't have timestamped shadow entries for anon).

2017-04-17 04:26:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

Hi Johannes,

On Thu, Apr 13, 2017 at 12:01:47PM -0400, Johannes Weiner wrote:
> On Thu, Apr 13, 2017 at 01:30:47PM +0900, Minchan Kim wrote:
> > [...]
> >
> > Do you have any concerns about this memcg priority idea?
>
> While I fully agree that relative priority levels would be easier to
> configure, this patch doesn't really do that. It allows you to set a
> scan window divider to a fixed amount and, as I already pointed out,
> the scan window is no longer representative of memory pressure.
>
> [ Really, sc->priority should probably just be called LRU lookahead
> factor or something, there is not much about it being representative
> of any kind of urgency anymore. ]

I agree that sc->priority is not an indication of memory pressure.
I should have clarified my intention. Sorry about that.

I'm not saying I like this implementation, as I mentioned in my
previous reply:
http://lkml.kernel.org/r/20170322052013.GE30149@bbox

Just as a general idea: in the global OOM case, breaking proportional
reclaim and preferring to reclaim from low-priority groups would be
good for some workloads, like current embedded platforms. And to
achieve that, controlling the aging velocity by adjusting the scan
window seems reasonable.

>
> With this patch, if you configure the priorities of two 8G groups to 0
> and 4, reclaim will treat them exactly the same*. If you configure the
> priorities of two 100G groups to 0 and 7, reclaim will treat them
> exactly the same. The bigger the group, the more of the lower range of
> the priority range becomes meaningless, because once the divider
> produces outcomes bigger than SWAP_CLUSTER_MAX(32), it doesn't
> actually bias reclaim anymore.

It seems that is the logic of memcg limit reclaim, not global reclaim,
and global reclaim is the major concern for the current problem,
because no limit is set up for each memcg.

>
> So that's not a portable relative scale of pressure discrimination.
>
> But the bigger problem with this is that, as sc->priority doesn't
> represent memory pressure anymore, it is merely a cut-off for which
> groups to scan and which groups not to scan *based on their size*.

Yes, because there is no measurable pressure concept in the current
VM, and you are trying to add that notion, which is really good!

>
> That is the same as setting memory.low!
>
> * For simplicity, I'm glossing over the fact here that LRUs are split
> by type and into inactive/active, so in reality the numbers are a
> little different, but you get the point.
>
> > Or
> > Do you think the patchset you are preparing solve this situation?
>
> It's certainly a requirement. In order to implement a relative scale
> of memory pressure discrimination, we first need to be able to really
> quantify memory pressure.

Yes. If we can get that, it would be better than unconditionally
discriminated aging by static priority, which would leave
non-workingset pages in a high-priority group while the workingset of
a low-priority group gets evicted.

Instead, we would age every group's LRU fairly, and if a high-priority
group comes under memory pressure beyond its threshold, the VM should
feed that back so that low-priority groups are reclaimed more. That
would be better.

>
> Then we can either allow setting absolute latency/slowdown minimums
> for each group, with reclaim skipping groups above those thresholds,
> or we can map a relative priority scale against the total slowdown due
> to lack of memory in the system, and each group gets a relative share
> based on its priority compared to other groups.

Fully agreed.

>
> But there is no way around first having a working measure of memory
> pressure before we can meaningfully distribute it among the groups.

Yes. I'm looking forward to seeing it.

Thanks for the thoughtful comment, Johannes!