Hi,
Currently, workingset detection logic is not memcg aware - inactive_age
is maintained per zone. As a result, if memory cgroups are used,
refaulted file pages are activated randomly. This patch set makes
inactive_age per lruvec so that workingset detection will work correctly
for memory cgroup reclaim.
Thanks,
Vladimir Davydov (3):
mm: move workingset_activation under lru_lock
mm: make workingset detection logic memcg aware
mm: workingset: make shadow node shrinker memcg aware
include/linux/list_lru.h | 1 -
include/linux/mmzone.h | 7 ++++---
include/linux/swap.h | 2 +-
mm/filemap.c | 2 +-
mm/internal.h | 1 +
mm/swap.c | 5 +++--
mm/vmscan.c | 2 +-
mm/workingset.c | 34 +++++++++++++++++++++++-----------
8 files changed, 34 insertions(+), 20 deletions(-)
--
2.1.4
The following patch will move inactive_age from zone to lruvec in order
to make workingset detection logic memcg aware. To achieve that we need
to be able to call mem_cgroup_page_lruvec() from all the workingset
detection related functions. Currently, workingset_eviction() and
workingset_refault() meet this requirement, because both of them are
always called with the page isolated and locked, which prevents the page
from being migrated to another cgroup. However, workingset_activation(),
which is called from mark_page_accessed(), does not. To make it safe for
this function to call mem_cgroup_page_lruvec(), this patch moves its
invocation to __activate_page(), which runs under the lru_lock.
Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index db43c9b4891d..f3569c8280be 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -514,6 +514,9 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
__count_vm_event(PGACTIVATE);
update_page_reclaim_stat(lruvec, file, 1);
+
+ if (file)
+ workingset_activation(page);
}
}
@@ -618,8 +621,6 @@ void mark_page_accessed(struct page *page)
else
__lru_cache_activate_page(page);
ClearPageReferenced(page);
- if (page_is_file_cache(page))
- workingset_activation(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
--
2.1.4
Currently, inactive_age is maintained per zone, which makes file page
activations pretty random when memory cgroups are used. This patch
moves inactive_age to the lruvec and makes all workingset detection
related functions use mem_cgroup_page_lruvec() to get the actual
inactive_age.
Note that we do not make pack_shadow() store information about the
memory cgroup the evicted page belonged to in the shadow entry.
Instead, on refault, we simply use the memory cgroup the refaulted page
belongs to. Since page migration between memory cgroups is a rather
rare event, this should be acceptable.
Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/mmzone.h | 7 ++++---
include/linux/swap.h | 2 +-
mm/filemap.c | 2 +-
mm/internal.h | 1 +
mm/vmscan.c | 2 +-
mm/workingset.c | 25 ++++++++++++++++---------
6 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac00e2050943..cc7ec7546371 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -216,6 +216,10 @@ struct zone_reclaim_stat {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
+
+ /* Evictions & activations on the inactive file list */
+ atomic_long_t inactive_age;
+
#ifdef CONFIG_MEMCG
struct zone *zone;
#endif
@@ -491,9 +495,6 @@ struct zone {
spinlock_t lru_lock;
struct lruvec lruvec;
- /* Evictions & activations on the inactive file list */
- atomic_long_t inactive_age;
-
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9c7c4b418498..b7070f49aff2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,7 +250,7 @@ struct swap_info_struct {
/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+bool workingset_refault(void *shadow, struct page *page);
void workingset_activation(struct page *page);
extern struct list_lru workingset_shadow_nodes;
diff --git a/mm/filemap.c b/mm/filemap.c
index 204fd1c7c813..f72ee2e4ec0d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -652,7 +652,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
* recently, in which case it should be activated like
* any other repeatedly accessed page.
*/
- if (shadow && workingset_refault(shadow)) {
+ if (shadow && workingset_refault(shadow, page)) {
SetPageActive(page);
workingset_activation(page);
} else
diff --git a/mm/internal.h b/mm/internal.h
index 1195dd2d6a2b..ec3863d1c62a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -99,6 +99,7 @@ extern unsigned long highest_memmap_pfn;
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
extern bool zone_reclaimable(struct zone *zone);
+extern unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru);
/*
* in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b405277d2fc..5221e19e98f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,7 +213,7 @@ bool zone_reclaimable(struct zone *zone)
zone_reclaimable_pages(zone) * 6;
}
-static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
if (!mem_cgroup_disabled())
return mem_cgroup_get_lru_size(lruvec, lru);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa017133744b..76bf9b6ee88c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -12,6 +12,7 @@
#include <linux/swap.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include "internal.h"
/*
* Double CLOCK lists
@@ -142,7 +143,7 @@
* Implementation
*
* For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * and activations is maintained (lruvec->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
* identify the zone) is stored in the now empty page cache radix tree
@@ -161,8 +162,8 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
-static void unpack_shadow(void *shadow,
- struct zone **zone,
+static void unpack_shadow(void *shadow, struct page *page,
+ struct zone **zone, struct lruvec **lruvec,
unsigned long *distance)
{
unsigned long entry = (unsigned long)shadow;
@@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
eviction = entry;
*zone = NODE_DATA(nid)->node_zones + zid;
+ *lruvec = mem_cgroup_page_lruvec(page, *zone);
- refault = atomic_long_read(&(*zone)->inactive_age);
+ refault = atomic_long_read(&(*lruvec)->inactive_age);
mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
RADIX_TREE_EXCEPTIONAL_SHIFT);
/*
@@ -213,30 +215,33 @@ static void unpack_shadow(void *shadow,
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
struct zone *zone = page_zone(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, zone);
unsigned long eviction;
- eviction = atomic_long_inc_return(&zone->inactive_age);
+ eviction = atomic_long_inc_return(&lruvec->inactive_age);
return pack_shadow(eviction, zone);
}
/**
* workingset_refault - evaluate the refault of a previously evicted page
* @shadow: shadow entry of the evicted page
+ * @page: the refaulted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the zone it was allocated in.
*
* Returns %true if the page should be activated, %false otherwise.
*/
-bool workingset_refault(void *shadow)
+bool workingset_refault(void *shadow, struct page *page)
{
unsigned long refault_distance;
struct zone *zone;
+ struct lruvec *lruvec;
- unpack_shadow(shadow, &zone, &refault_distance);
+ unpack_shadow(shadow, page, &zone, &lruvec, &refault_distance);
inc_zone_state(zone, WORKINGSET_REFAULT);
- if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+ if (refault_distance <= get_lru_size(lruvec, LRU_ACTIVE_FILE)) {
inc_zone_state(zone, WORKINGSET_ACTIVATE);
return true;
}
@@ -249,7 +254,9 @@ bool workingset_refault(void *shadow)
*/
void workingset_activation(struct page *page)
{
- atomic_long_inc(&page_zone(page)->inactive_age);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+
+ atomic_long_inc(&lruvec->inactive_age);
}
/*
--
2.1.4
Shadow nodes are accounted to memcg/kmem, so they must be reclaimed per
memcg, otherwise they can eat all memory available to a memcg.
Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/list_lru.h | 1 -
mm/workingset.c | 9 +++++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 2a6b9947aaa3..132d86f031ff 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -58,7 +58,6 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
struct lock_class_key *key);
#define list_lru_init(lru) __list_lru_init((lru), false, NULL)
-#define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key))
#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)
int memcg_update_all_list_lrus(int num_memcgs);
diff --git a/mm/workingset.c b/mm/workingset.c
index 76bf9b6ee88c..424fdf5d0a80 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -286,6 +286,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
local_irq_enable();
pages = node_present_pages(sc->nid);
+#ifdef CONFIG_MEMCG
+ if (sc->memcg)
+ pages = min(pages, sc->memcg->memory.limit);
+#endif
/*
* Active cache pages are limited to 50% of memory, and shadow
* entries that represent a refault distance bigger than that
@@ -394,7 +398,7 @@ static struct shrinker workingset_shadow_shrinker = {
.count_objects = count_shadow_nodes,
.scan_objects = scan_shadow_nodes,
.seeks = DEFAULT_SEEKS,
- .flags = SHRINKER_NUMA_AWARE,
+ .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};
/*
@@ -407,7 +411,8 @@ static int __init workingset_init(void)
{
int ret;
- ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
+ ret = __list_lru_init(&workingset_shadow_nodes, true,
+ &shadow_nodes_key);
if (ret)
goto err;
ret = register_shrinker(&workingset_shadow_shrinker);
--
2.1.4
On Mon, Aug 03, 2015 at 03:04:22PM +0300, Vladimir Davydov wrote:
> @@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
> eviction = entry;
>
> *zone = NODE_DATA(nid)->node_zones + zid;
> + *lruvec = mem_cgroup_page_lruvec(page, *zone);
>
> - refault = atomic_long_read(&(*zone)->inactive_age);
> + refault = atomic_long_read(&(*lruvec)->inactive_age);
> mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> RADIX_TREE_EXCEPTIONAL_SHIFT);
> /*
You can not compare an eviction shadow entry from one lruvec with the
inactive age of another lruvec. The inactive ages are not related and
might differ significantly: memcgs are created ad hoc, memory hotplug,
page allocator fairness drift. In those cases the result will be pure
noise.
As much as I would like to see a simpler way, I am pessimistic that
there is a way around storing memcg ids in the shadow entries.
On Mon, Aug 03, 2015 at 09:23:58AM -0400, Johannes Weiner wrote:
> On Mon, Aug 03, 2015 at 03:04:22PM +0300, Vladimir Davydov wrote:
> > @@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
> > eviction = entry;
> >
> > *zone = NODE_DATA(nid)->node_zones + zid;
> > + *lruvec = mem_cgroup_page_lruvec(page, *zone);
> >
> > - refault = atomic_long_read(&(*zone)->inactive_age);
> > + refault = atomic_long_read(&(*lruvec)->inactive_age);
> > mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> > RADIX_TREE_EXCEPTIONAL_SHIFT);
> > /*
>
> You can not compare an eviction shadow entry from one lruvec with the
> inactive age of another lruvec. The inactive ages are not related and
> might differ significantly: memcgs are created ad hoc, memory hotplug,
> page allocator fairness drift. In those cases the result will be pure
> noise.
That's true. If a page is evicted in one cgroup and then refaulted in
another, the activation will be random. However, is it a frequent event
when a page used by and evicted from one cgroup is refaulted in another?
If there is no active file sharing (is it common?), this should only
happen to code pages, but those will most likely end up in the cgroup
that has the greatest limit, so they shouldn't be evicted and refaulted
frequently. So the question is can we tolerate some noise here?
>
> As much as I would like to see a simpler way, I am pessimistic that
> there is a way around storing memcg ids in the shadow entries.
On 32 bit there is too little space for storing memcg id. We can shift
the distance so that it would fit and still contain something meaningful
though, but that would take much more code, so I'm trying the
simplest way first.
Thanks,
Vladimir
On Mon, Aug 03, 2015 at 04:52:29PM +0300, Vladimir Davydov wrote:
> On Mon, Aug 03, 2015 at 09:23:58AM -0400, Johannes Weiner wrote:
> > On Mon, Aug 03, 2015 at 03:04:22PM +0300, Vladimir Davydov wrote:
> > > @@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
> > > eviction = entry;
> > >
> > > *zone = NODE_DATA(nid)->node_zones + zid;
> > > + *lruvec = mem_cgroup_page_lruvec(page, *zone);
> > >
> > > - refault = atomic_long_read(&(*zone)->inactive_age);
> > > + refault = atomic_long_read(&(*lruvec)->inactive_age);
> > > mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> > > RADIX_TREE_EXCEPTIONAL_SHIFT);
> > > /*
> >
> > You can not compare an eviction shadow entry from one lruvec with the
> > inactive age of another lruvec. The inactive ages are not related and
> > might differ significantly: memcgs are created ad hoc, memory hotplug,
> > page allocator fairness drift. In those cases the result will be pure
> > noise.
>
> That's true. If a page is evicted in one cgroup and then refaulted in
> another, the activation will be random. However, is it a frequent event
> when a page used by and evicted from one cgroup is refaulted in another?
> If there is no active file sharing (is it common?), this should only
> happen to code pages, but those will most likely end up in the cgroup
> that has the greatest limit, so they shouldn't be evicted and refaulted
> frequently. So the question is can we tolerate some noise here?
It's not just the memcg, it's also the difference between zones
themselves.
> > As much as I would like to see a simpler way, I am pessimistic that
> > there is a way around storing memcg ids in the shadow entries.
>
> On 32 bit there is too little space for storing memcg id. We can shift
> the distance so that it would fit and still contain something meaningful
> > though, but that would take much more code, so I'm trying the
> simplest way first.
It should be easy to trim quite a few bits from the timestamp, both in
terms of available memory as well as in terms of distance granularity.
We probably don't care if the refault distance is only accurate to say
2MB, and how many pages do we have to represent on 32-bit in the first
place? Once we trim that, we should be able to fit a CSS ID.
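For illustration, such a packing might look roughly like this (a sketch
only - all names and shift values below are made up for the example,
not an actual implementation):

	#define MEMCG_ID_BITS		16	/* assumes the current 16-bit css ids */
	#define EVICTION_BUCKET_SHIFT	9	/* ~2MB granularity with 4k pages */

	static void *pack_shadow_memcg(unsigned long eviction, unsigned short memcgid,
				       struct zone *zone)
	{
		/* trade eviction-counter granularity for bits to hold the memcg id */
		eviction >>= EVICTION_BUCKET_SHIFT;
		eviction = (eviction << MEMCG_ID_BITS) | memcgid;
		eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
		eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
		eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

		return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

unpack_shadow() would then shift the stored value back left by
EVICTION_BUCKET_SHIFT before computing the distance, so only the low
bits of precision are lost.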
On Mon, Aug 03, 2015 at 04:55:32PM -0400, Johannes Weiner wrote:
> On Mon, Aug 03, 2015 at 04:52:29PM +0300, Vladimir Davydov wrote:
> > On Mon, Aug 03, 2015 at 09:23:58AM -0400, Johannes Weiner wrote:
> > > On Mon, Aug 03, 2015 at 03:04:22PM +0300, Vladimir Davydov wrote:
> > > > @@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
> > > > eviction = entry;
> > > >
> > > > *zone = NODE_DATA(nid)->node_zones + zid;
> > > > + *lruvec = mem_cgroup_page_lruvec(page, *zone);
> > > >
> > > > - refault = atomic_long_read(&(*zone)->inactive_age);
> > > > + refault = atomic_long_read(&(*lruvec)->inactive_age);
> > > > mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> > > > RADIX_TREE_EXCEPTIONAL_SHIFT);
> > > > /*
> > >
> > > You can not compare an eviction shadow entry from one lruvec with the
> > > inactive age of another lruvec. The inactive ages are not related and
> > > might differ significantly: memcgs are created ad hoc, memory hotplug,
> > > page allocator fairness drift. In those cases the result will be pure
> > > noise.
> >
> > That's true. If a page is evicted in one cgroup and then refaulted in
> > another, the activation will be random. However, is it a frequent event
> > when a page used by and evicted from one cgroup is refaulted in another?
> > If there is no active file sharing (is it common?), this should only
> > happen to code pages, but those will most likely end up in the cgroup
> > that has the greatest limit, so they shouldn't be evicted and refaulted
> > frequently. So the question is can we tolerate some noise here?
>
> It's not just the memcg, it's also the difference between zones
> themselves.
But I do take the difference between zones into account in this patch -
zone and node ids are still stored in the shadow entry; I only neglect
the memcg id. So if a page is refaulted in another zone within the same
cgroup, its refault distance will be calculated correctly. We only get
noise when a page evicted in one cgroup is refaulted in another.
>
> > > As much as I would like to see a simpler way, I am pessimistic that
> > > there is a way around storing memcg ids in the shadow entries.
> >
> > On 32 bit there is too little space for storing memcg id. We can shift
> > the distance so that it would fit and still contain something meaningful
> > though, but that would take much more code, so I'm trying the
> > simplest way first.
>
> It should be easy to trim quite a few bits from the timestamp, both in
> terms of available memory as well as in terms of distance granularity.
> We probably don't care if the refault distance is only accurate to say
> 2MB, and how many pages do we have to represent on 32-bit in the first
> place? Once we trim that, we should be able to fit a CSS ID.
NODES_SHIFT <= 10, ZONES_SHIFT == 2, RADIX_TREE_EXCEPTIONAL_SHIFT == 2,
and we need 16 bits for storing the memcg id, so on 32 bit there are
only 2 bits left for the eviction counter. Even with 2MB accuracy, that
gives us a maximal refault distance of only 6MB :-(
However, I doubt there is a 32 bit host with 1024 NUMA nodes. Can we
possibly limit this config option on 32 bit architectures?
Or maybe we can limit the number of cgroups to, say, 1024 if running on
32 bit? This would allow us to win 6 more bits, so that the maximal
refault distance would be 512MB with 2MB accuracy. But can we be sure
this won't break anyone's setup, especially considering that cgroups
can hang around as zombies for a while after rmdir?
Thanks,
Vladimir
On 2015/08/03 21:04, Vladimir Davydov wrote:
> Hi,
>
> Currently, workingset detection logic is not memcg aware - inactive_age
> is maintained per zone. As a result, if memory cgroups are used,
> refaulted file pages are activated randomly. This patch set makes
> inactive_age per lruvec so that workingset detection will work correctly
> for memory cgroup reclaim.
>
> Thanks,
>
Reading the discussion, I feel that storing more data is difficult, too.
I wonder whether, rather than collecting more data, a rough calculation could help the situation.
For example,
(refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list
If either the per-zone or the per-memcg calculation returns true, the refault should be treated as true.
memcg_reclaim_ratio is the percentage of scanning done on behalf of the memcg relative to the whole zone.
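To spell the idea out, a rough sketch of such a check might look like
this (illustrative only; memcg_reclaim_ratio is a hypothetical per-memcg
statistic that would have to be maintained somewhere):

	static bool refault_should_activate(struct zone *zone, struct lruvec *lruvec,
					    unsigned long refault_distance,
					    unsigned int memcg_reclaim_ratio)
	{
		/* per-zone calculation, as workingset_refault() does today */
		if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
			return true;

		/*
		 * per-memcg calculation: scale the zone-wide distance by the
		 * memcg's share of reclaim and compare it with the size of
		 * the memcg's active file list
		 */
		return refault_distance * memcg_reclaim_ratio / 100 <=
			get_lru_size(lruvec, LRU_ACTIVE_FILE);
	}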
Thanks,
-Kame
On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:
> Reading the discussion, I feel that storing more data is difficult, too.
Yep, even with the current 16-bit memcg id. Things would get even worse
if we wanted to extend it one day (will we?)
>
> I wonder whether, rather than collecting more data, a rough calculation could help the situation.
> For example,
>
> (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list
>
> If either the per-zone or the per-memcg calculation returns true, the refault should be treated as true.
>
> memcg_reclaim_ratio is the percentage of scanning done on behalf of the memcg relative to the whole zone.
This particular formula wouldn't work I'm afraid. If there are two
isolated cgroups issuing local reclaim on the same zone, the refault
distance needed for activation would be reduced by half for no apparent
reason.
The thing is that there is no need to invent anything if refaults
from different cgroups are infrequent - it is enough to store only
zone/node ids in shadow entries then, as this patch set does. The
question remains: can we disregard them? Sometimes we need to sacrifice
accuracy for the sake of performance and/or code simplicity. E.g.
inter-cgroup concurrent file writes are not supported by the blkio
writeback accounting implementation AFAIK. Maybe we could neglect
inter-cgroup refaults too? My point is that even if two cgroups are
actively sharing the same file, its pages will end up in the cgroup
which experiences less memory pressure (most likely, the one with the
greater limit), so inter-cgroup refaults should be rare. Am I wrong?
Anyway, workingset detection is broken for local reclaim (activations
are random) and needs to be fixed. What is worse, shadow entries are
accounted per memcg, but reclaimed only on global memory pressure, so
they can eat all RAM available to a container without giving it a
chance to reclaim them. That said, even this patch set is a huge step
forward, because it makes activations much more deterministic and fixes
per-memcg shadow node reclaim.
Thanks,
Vladimir
On 2015/08/06 17:59, Vladimir Davydov wrote:
> On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:
>
>> Reading the discussion, I feel that storing more data is difficult, too.
>
> Yep, even with the current 16-bit memcg id. Things would get even worse
> if we wanted to extend it one day (will we?)
>
>>
>> I wonder whether, rather than collecting more data, a rough calculation could help the situation.
>> For example,
>>
>> (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list
>>
>> If either the per-zone or the per-memcg calculation returns true, the refault should be treated as true.
>>
>> memcg_reclaim_ratio is the percentage of scanning done on behalf of the memcg relative to the whole zone.
>
> This particular formula wouldn't work I'm afraid. If there are two
> isolated cgroups issuing local reclaim on the same zone, the refault
> distance needed for activation would be reduced by half for no apparent
> reason.
Hmm, you mean activation in memcg means activation in global LRU, and it's not a
valid reason. Current implementation does have the same issue, right ?
i.e. when a container has been hitting its limit for a while, and then, a file cache is
pushed out but came back soon, it can be easily activated.
I'd like to confirm what you want to do.
1) avoid activating a file cache when it was kicked out because of memcg's local limit.
2) maintain active/inactive ratio in memcg properly as global LRU does.
3) reclaim shadow entry at proper timing.
All ? hmm. It seems that mixture of record of global memory pressure and of local memory
pressure is just wrong.
Now, the record is
eviction | node | zone | 2bit.
How about changing this as
0 |eviction | node | zone | 2bit
1 |eviction | memcgid | 2bit
Assume each memcg has an eviction counter, which ignores node/zone.
i.e. memcg local reclaim happens against memcg not against zone.
At page-in,
if (the 1st bit is 0)
compare eviction counter with zone's counter and activate the page if needed.
else if (the 1st bit is 1)
compare eviction counter with the memcg (if exists)
if (current memcg == recorded memcg && eviction distance is okay)
activate page.
else
inactivate
At page-out
if (global memory pressure)
record eviction id with using zone's counter.
else if (memcg local memory pressure)
record eviction id with memcg's counter.
By this,
1) locally reclaimed pages cannot be activated unless they are refaulted in the same memcg.
In this case, activating in the memcg has some meaning.
2) Under global memory pressure, the distance is properly calculated based on global system status.
Global memory pressure can ignore memcg's behavior.
As for shadow entries, kmemcg should take care of them....
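For illustration, the two record formats might be packed along these
lines (a very rough sketch with made-up names; the per-memcg eviction
counter does not exist today):

	/* Illustrative only: one tag bit selects the record format. */
	#define SHADOW_MEMCG_TAG	(1UL << (BITS_PER_LONG - 1))

	/* global reclaim: eviction | node | zone, as today */
	static void *pack_shadow_global(unsigned long eviction, struct zone *zone)
	{
		eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
		eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
		eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
		return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

	/* memcg-local reclaim: eviction | memcgid, node/zone ignored */
	static void *pack_shadow_local(unsigned long eviction, unsigned short memcgid)
	{
		eviction = (eviction << 16) | memcgid;	/* 16-bit memcg ids assumed */
		eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
		return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY |
				SHADOW_MEMCG_TAG);
	}

On refault, the tag bit would tell whether to compare the stored value
with the zone's counter or, if the current memcg matches the recorded
id, with that memcg's counter.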
Thanks,
-Kame
On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:
> On 2015/08/06 17:59, Vladimir Davydov wrote:
> >On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:
> >>I wonder whether, rather than collecting more data, a rough calculation could help the situation.
> >>For example,
> >>
> >> (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list
> >>
> >>If either the per-zone or the per-memcg calculation returns true, the refault should be treated as true.
> >>
> >>memcg_reclaim_ratio is the percentage of scanning done on behalf of the memcg relative to the whole zone.
> >
> >This particular formula wouldn't work I'm afraid. If there are two
> >isolated cgroups issuing local reclaim on the same zone, the refault
> >distance needed for activation would be reduced by half for no apparent
> >reason.
>
> Hmm, you mean activation in memcg means activation in global LRU, and it's not a
> valid reason. Current implementation does have the same issue, right ?
>
> i.e. when a container has been hitting its limit for a while, and then, a file cache is
> pushed out but came back soon, it can be easily activated.
>
> I'd like to confirm what you want to do.
>
> 1) avoid activating a file cache when it was kicked out because of memcg's local limit.
No, that's not what I want. I want pages of the workingset to get
activated on refault no matter if they were evicted on global memory
pressure or due to hitting a memory cgroup limit.
>> 2) maintain active/inactive ratio in memcg properly as global LRU does.
> 3) reclaim shadow entry at proper timing.
>
> All ? hmm. It seems that mixture of record of global memory pressure and of local memory
> pressure is just wrong.
What makes you think so? An example of misbehavior caused by this would
be nice to have.
>
> Now, the record is
>
> eviction | node | zone | 2bit.
>
> How about changing this as
>
> 0 |eviction | node | zone | 2bit
> 1 |eviction | memcgid | 2bit
>
> Assume each memcg has an eviction counter, which ignores node/zone.
> i.e. memcg local reclaim happens against memcg not against zone.
>
> At page-in,
> if (the 1st bit is 0)
> compare eviction counter with zone's counter and activate the page if needed.
> else if (the 1st bit is 1)
> compare eviction counter with the memcg (if exists)
Having a single counter per memcg won't scale with the number of NUMA
nodes.
> if (current memcg == recorded memcg && eviction distance is okay)
> activate page.
> else
> inactivate
> At page-out
> if (global memory pressure)
> record eviction id with using zone's counter.
> else if (memcg local memory pressure)
> record eviction id with memcg's counter.
>
I don't understand how this is supposed to work when a memory cgroup
experiences both local and global pressure simultaneously.
Also, what if a memory cgroup is protected by memory.low? Such a cgroup
may have all its pages in the active list, because it is never scanned.
This will affect the refault distance of other cgroups, making
activations unpredictable.
Thanks,
Vladimir
On 2015/08/08 22:05, Vladimir Davydov wrote:
> On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:
>> On 2015/08/06 17:59, Vladimir Davydov wrote:
>>> On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:
>>>> I wonder whether, rather than collecting more data, a rough calculation could help the situation.
>>>> For example,
>>>>
>>>> (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list
>>>>
>>>> If either the per-zone or the per-memcg calculation returns true, the refault should be treated as true.
>>>>
>>>> memcg_reclaim_ratio is the percentage of scanning done on behalf of the memcg relative to the whole zone.
>>>
>>> This particular formula wouldn't work I'm afraid. If there are two
>>> isolated cgroups issuing local reclaim on the same zone, the refault
>>> distance needed for activation would be reduced by half for no apparent
>>> reason.
>>
>> Hmm, you mean activation in memcg means activation in global LRU, and it's not a
>> valid reason. Current implementation does have the same issue, right ?
>>
>> i.e. when a container has been hitting its limit for a while, and then, a file cache is
>> pushed out but came back soon, it can be easily activated.
>>
>> I'd like to confirm what you want to do.
>>
>> 1) avoid activating a file cache when it was kicked out because of memcg's local limit.
>
> No, that's not what I want. I want pages of the workingset to get
> activated on refault no matter if they were evicted on global memory
> pressure or due to hitting a memory cgroup limit.
>
Sure.
>> 2) maintain active/inactive ratio in memcg properly as global LRU does.
>> 3) reclaim shadow entry at proper timing.
>>
>> All ? hmm. It seems that mixture of record of global memory pressure and of local memory
>> pressure is just wrong.
>
> What makes you think so? An example of misbehavior caused by this would
> be nice to have.
>
By design, memcg's LRU aging logic is independent of global memory allocation/pressure.
Assume there are 4 containers (using much page cache) with a 1GB limit each on a 4GB server:
# container A workingset=600M limit=1G (sleepy)
# container B workingset=300M limit=1G (work often)
# container C workingset=500M limit=1G (work slowly)
# container D workingset=1.2G limit=1G (work hard)
Container D can drive the zone's distance counter because of local memory reclaim.
If active/inactive = 1:1, container D's pages can be activated.
When kswapd (global reclaim) runs, all containers' LRUs will rotate.
The possibility of refault in A, B, and C is reduced by container D's counter updates.
But yes, some _real_ tests are required.
>>
>> Now, the record is
>>
>> eviction | node | zone | 2bit.
>>
>> How about changing this as
>>
>> 0 |eviction | node | zone | 2bit
>> 1 |eviction | memcgid | 2bit
>>
>> Assume each memcg has an eviction counter, which ignores node/zone.
>> i.e. memcg local reclaim happens against memcg not against zone.
>>
>> At page-in,
>> if (the 1st bit is 0)
>> compare eviction counter with zone's counter and activate the page if needed.
>> else if (the 1st bit is 1)
>> compare eviction counter with the memcg (if exists)
>
> Having a single counter per memcg won't scale with the number of NUMA
> nodes.
>
It doesn't matter - we can use a lazy counter like a percpu counter because it doesn't need to be very accurate.
>> if (current memcg == recorded memcg && eviction distance is okay)
>> activate page.
>> else
>> inactivate
>> At page-out
>> if (global memory pressure)
>> record eviction id with using zone's counter.
>> else if (memcg local memory pressure)
>> record eviction id with memcg's counter.
>>
>
> I don't understand how this is supposed to work when a memory cgroup
> experiences both local and global pressure simultaneously.
>
I think updating global distance counter by local reclaim may update counter too much.
Above is to avoid updating zone's counter and keep memcg's LRU active/inactive balanced.
> Also, what if a memory cgroup is protected by memory.low? Such a cgroup
> may have all its pages in the active list, because it is never scanned.
If the LRU is never scanned, all file caches tend to stay INACTIVE... they never refault.
> This will affect the refault distance of other cgroups, making
> activations unpredictable.
>
Thanks,
-Kame
On Sun, Aug 09, 2015 at 11:12:25PM +0900, Kamezawa Hiroyuki wrote:
> On 2015/08/08 22:05, Vladimir Davydov wrote:
> >On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:
...
> >>All ? hmm. It seems that mixture of record of global memory pressure and of local memory
> >>pressure is just wrong.
> >
> >What makes you think so? An example of misbehavior caused by this would
> >be nice to have.
> >
>
> By design, memcg's LRU aging logic is independent of global memory allocation/pressure.
>
>
> Assume there are 4 containers (using much page cache) with a 1GB limit each on a 4GB server:
> # container A workingset=600M limit=1G (sleepy)
> # container B workingset=300M limit=1G (work often)
> # container C workingset=500M limit=1G (work slowly)
> # container D workingset=1.2G limit=1G (work hard)
> Container D can drive the zone's distance counter because of local memory reclaim.
> If active/inactive = 1:1, container D's pages can be activated.
> When kswapd (global reclaim) runs, all containers' LRUs will rotate.
>
> The possibility of refault in A, B, and C is reduced by container D's counter updates.
This does not necessarily mean we have to use different inactive_age
counter for global and local memory pressure. In your example, having
inactive_age per lruvec and using it for evictions on both global and
local memory pressure would work just fine.
>
> But yes, some _real_ tests are required.
>
> >>
> >>Now, the record is
> >>
> >> eviction | node | zone | 2bit.
> >>
> >>How about changing this as
> >>
> >> 0 |eviction | node | zone | 2bit
> >> 1 |eviction | memcgid | 2bit
> >>
> >>Assume each memcg has an eviction counter, which ignores node/zone.
> >>i.e. memcg local reclaim happens against memcg not against zone.
> >>
> >>At page-in,
> >> if (the 1st bit is 0)
> >> compare eviction counter with zone's counter and activate the page if needed.
> >> else if (the 1st bit is 1)
> >> compare eviction counter with the memcg (if exists)
> >
> >Having a single counter per memcg won't scale with the number of NUMA
> >nodes.
> >
> It doesn't matter - we can use a lazy counter like a percpu counter because it doesn't need to be very accurate.
Fair enough.
>
>
> >> if (current memcg == recorded memcg && eviction distance is okay)
> >> activate page.
> >> else
> >> inactivate
> >>At page-out
> >> if (global memory pressure)
> >> record eviction id with using zone's counter.
> >> else if (memcg local memory pressure)
> >> record eviction id with memcg's counter.
> >>
> >
> >I don't understand how this is supposed to work when a memory cgroup
> >experiences both local and global pressure simultaneously.
> >
>
> I think updating global distance counter by local reclaim may update counter too much.
But if the inactive_age counter was per lruvec, then we wouldn't need to
bother about it.
> Above is to avoid updating zone's counter and keep memcg's LRU active/inactive balanced.
>
> >Also, what if a memory cgroup is protected by memory.low? Such a cgroup
> >may have all its pages in the active list, because it is never scanned.
>
> If the LRU is never scanned, all file caches tend to stay INACTIVE... they never refault.
This is not true - there still may be activations from
mark_page_accessed.
Thanks,
Vladimir
On 2015/08/10 17:14, Vladimir Davydov wrote:
> On Sun, Aug 09, 2015 at 11:12:25PM +0900, Kamezawa Hiroyuki wrote:
>> On 2015/08/08 22:05, Vladimir Davydov wrote:
>>> On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:
> ...
>>>> All ? hmm. It seems that mixture of record of global memory pressure and of local memory
>>>> pressure is just wrong.
>>>
>>> What makes you think so? An example of misbehavior caused by this would
>>> be nice to have.
>>>
>>
>> By design, memcg's LRU aging logic is independent of global memory allocation/pressure.
>>
>>
>> Assume there are 4 containers (using much page cache) with a 1GB limit each on a 4GB server:
>> # container A workingset=600M limit=1G (sleepy)
>> # container B workingset=300M limit=1G (work often)
>> # container C workingset=500M limit=1G (work slowly)
>> # container D workingset=1.2G limit=1G (work hard)
>> Container D can drive the zone's distance counter because of local memory reclaim.
>> If active/inactive = 1:1, container D's pages can be activated.
>> When kswapd (global reclaim) runs, all containers' LRUs will rotate.
>>
>> The possibility of refault in A, B, and C is reduced by container D's counter updates.
>
> This does not necessarily mean we have to use different inactive_age
> counter for global and local memory pressure. In your example, having
> inactive_age per lruvec and using it for evictions on both global and
> local memory pressure would work just fine.
>
you're right.
>>
>>
>>>> if (current memcg == recorded memcg && eviction distance is okay)
>>>> activate page.
>>>> else
>>>> inactivate
>>>> At page-out
>>>> if (global memory pressure)
>>>> record eviction id with using zone's counter.
>>>> else if (memcg local memory pressure)
>>>> record eviction id with memcg's counter.
>>>>
>>>
>>> I don't understand how this is supposed to work when a memory cgroup
>>> experiences both local and global pressure simultaneously.
>>>
>>
>> I think updating global distance counter by local reclaim may update counter too much.
>
> But if the inactive_age counter was per lruvec, then we wouldn't need to
> bother about it.
>
yes.
Anyway, what I understand now is that we need to reduce the influence of one memcg's behavior
on other memcgs. Your way divides the counter completely; my idea was to implement a
different counter. Doing it by calculation would be good because we can't have enough record
space.
Thanks,
-Kame