Hi,
This is v2 of the patch set introducing swap accounting to cgroup2. For
a detailed description and rationale, please see patches 1 and 7.
v1 can be found here: https://lwn.net/Articles/667472/
v2 mostly addresses comments by Johannes. For the detailed changelog,
see individual patches.
Thanks,
Vladimir Davydov (7):
mm: memcontrol: charge swap to cgroup2
mm: vmscan: pass memcg to get_scan_count()
mm: memcontrol: replace mem_cgroup_lruvec_online with
mem_cgroup_online
swap.h: move memcg related stuff to the end of the file
mm: vmscan: do not scan anon pages if memcg swap limit is hit
mm: free swap cache aggressively if memcg swap is full
Documentation: cgroup: add memory.swap.{current,max} description
Documentation/cgroup.txt | 33 ++++++++++
include/linux/memcontrol.h | 28 ++++-----
include/linux/swap.h | 76 ++++++++++++++--------
mm/memcontrol.c | 154 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 3 +-
mm/shmem.c | 4 ++
mm/swap_state.c | 5 ++
mm/swapfile.c | 6 +-
mm/vmscan.c | 26 ++++----
9 files changed, 265 insertions(+), 70 deletions(-)
--
2.1.4
In the legacy hierarchy we charge memsw, which is dubious, because:
- memsw.limit must be >= memory.limit, so it is impossible to limit
  swap usage to less than memory usage. Taking into account the fact
  that the primary limiting mechanism in the unified hierarchy is
  memory.high while memory.limit is either left unset or set to a very
  large value, moving the memsw.limit knob to the unified hierarchy
  would effectively make it impossible to limit swap usage according
  to the user's preference.
- memsw.usage != memory.usage + swap.usage, because a page occupying
  both a swap entry and a swap cache page is charged only once to the
  memsw counter. As a result, it is possible to effectively eat up to
  memory.limit of memory pages *and* memsw.limit of swap entries,
  which looks unexpected.
Therefore, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds a mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is charged only in the
unified hierarchy; the legacy hierarchy's memsw logic is left intact.
The swap usage can be monitored using the new memory.swap.current file
and limited using memory.swap.max.
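As a usage illustration (not part of the patch), here is a minimal
userspace sketch that drives the new knobs; the cgroup2 mount point and
the group name below are assumptions:

        #include <stdio.h>

        /* Hypothetical path: assumes cgroup2 is mounted at /sys/fs/cgroup
         * and a child group "job" already exists. */
        #define GROUP "/sys/fs/cgroup/job"

        int main(void)
        {
                unsigned long long cur;
                FILE *f;

                /* Cap the group's swap usage at 512M. */
                f = fopen(GROUP "/memory.swap.max", "w");
                if (!f) { perror("memory.swap.max"); return 1; }
                fprintf(f, "%llu\n", 512ULL << 20);
                fclose(f);

                /* Read back the current swap usage in bytes. */
                f = fopen(GROUP "/memory.swap.current", "r");
                if (!f) { perror("memory.swap.current"); return 1; }
                if (fscanf(f, "%llu", &cur) == 1)
                        printf("swap usage: %llu bytes\n", cur);
                fclose(f);
                return 0;
        }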
Note that to charge the swap resource properly in the unified
hierarchy, we have to make swap_entry_free() uncharge swap only when
->usage reaches zero, not just ->count, i.e. when all references to a
swap entry, including the one taken by the swap cache, are gone. This
is necessary because otherwise swap-in could result in uncharging swap
even though the page is still in the swap cache and hence still
occupies a swap entry. At the same time, this shouldn't break the
memsw counter logic, where a page is never charged twice for using
both memory and swap, because in the legacy hierarchy we uncharge swap
on commit (see mem_cgroup_commit_charge()).
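To make that concrete, below is a simplified userspace model of a
single swap_map slot (illustrative names, not the kernel code), showing
why the charge must be tied to usage, i.e. count | SWAP_HAS_CACHE,
rather than to count alone:

        #include <assert.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define SWAP_HAS_CACHE 0x40     /* swap cache flag, as in swap_map */

        static unsigned char swap_map;  /* one slot: count | SWAP_HAS_CACHE */
        static bool charged;            /* models the memcg swap charge */

        /*
         * Model of swap_entry_free() after this patch: the swap charge is
         * dropped only when *all* references are gone, including the swap
         * cache one, i.e. when usage reaches zero, not merely count.
         */
        static void entry_free(bool drop_cache_ref)
        {
                unsigned char count = swap_map & ~SWAP_HAS_CACHE;
                unsigned char has_cache = swap_map & SWAP_HAS_CACHE;

                if (drop_cache_ref)
                        has_cache = 0;
                else
                        count--;

                swap_map = count | has_cache;
                if (!swap_map)
                        charged = false; /* mem_cgroup_uncharge_swap() point */
        }

        int main(void)
        {
                swap_map = 1 | SWAP_HAS_CACHE;  /* one pte ref + swap cache ref */
                charged = true;

                entry_free(false);      /* swap-in drops the pte reference */
                assert(charged);        /* page still in swap cache: keep charge */

                entry_free(true);       /* swap cache reference goes away */
                assert(!charged);       /* only now is the swap entry uncharged */

                printf("uncharge happens when usage, not count, hits zero\n");
                return 0;
        }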
Signed-off-by: Vladimir Davydov <[email protected]>
---
Changes in v2:
- Rename mem_cgroup_charge_swap -> mem_cgroup_try_charge_swap to
conform to other memcg charging methods.
- Remove unnecessary READ_ONCE from mem_cgroup_get_limit.
- Uncharge swap only when the last reference is gone, including the one
taken by swap cache. In other words, make swap_entry_free call
mem_cgroup_uncharge_swap only when ->usage reaches zero, not just
->count.
- Note: I left the legacy and default swap charge paths separate for
now, because unifying them (i.e. moving swap_cgroup_record to
mem_cgroup_try_charge_swap) would change the behavior of the
MEM_CGROUP_STAT_SWAP counter in the legacy hierarchy.
include/linux/memcontrol.h | 1 +
include/linux/swap.h | 6 +++
mm/memcontrol.c | 118 ++++++++++++++++++++++++++++++++++++++++++---
mm/shmem.c | 4 ++
mm/swap_state.c | 5 ++
mm/swapfile.c | 4 +-
6 files changed, 128 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 27123e597eca..6e0126230878 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -163,6 +163,7 @@ struct mem_cgroup {
/* Accounted resources */
struct page_counter memory;
+ struct page_counter swap;
/* Legacy consumer-oriented counters */
struct page_counter memsw;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..478e7dd038c7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,11 +368,17 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
#endif
#ifdef CONFIG_MEMCG_SWAP
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
#else
static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
{
}
+static inline int mem_cgroup_try_charge_swap(struct page *page,
+ swp_entry_t entry)
+{
+ return 0;
+}
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9d951faebbd4..567b56da2c23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1222,7 +1222,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
pr_cont(":");
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
- if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
continue;
pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
K(mem_cgroup_read_stat(iter, i)));
@@ -1261,9 +1261,12 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
limit = memcg->memory.limit;
if (mem_cgroup_swappiness(memcg)) {
unsigned long memsw_limit;
+ unsigned long swap_limit;
memsw_limit = memcg->memsw.limit;
- limit = min(limit + total_swap_pages, memsw_limit);
+ swap_limit = memcg->swap.limit;
+ swap_limit = min(swap_limit, (unsigned long)total_swap_pages);
+ limit = min(limit + swap_limit, memsw_limit);
}
return limit;
}
@@ -4198,11 +4201,13 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
page_counter_init(&memcg->memory, &parent->memory);
+ page_counter_init(&memcg->swap, &parent->swap);
page_counter_init(&memcg->memsw, &parent->memsw);
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->swap, NULL);
page_counter_init(&memcg->memsw, NULL);
page_counter_init(&memcg->kmem, NULL);
page_counter_init(&memcg->tcpmem, NULL);
@@ -5212,7 +5217,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
if (page->mem_cgroup)
goto out;
- if (do_memsw_account()) {
+ if (do_swap_account) {
swp_entry_t ent = { .val = page_private(page), };
unsigned short id = lookup_swap_cgroup_id(ent);
@@ -5665,26 +5670,66 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
memcg_check_events(memcg, page);
}
+/**
+ * mem_cgroup_try_charge_swap - try charging a swap entry
+ * @page: page being added to swap
+ * @entry: swap entry to charge
+ *
+ * Try to charge @entry to the memcg that @page belongs to.
+ *
+ * Returns 0 on success, -ENOMEM on failure.
+ */
+int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
+{
+ struct mem_cgroup *memcg;
+ struct page_counter *counter;
+ unsigned short oldid;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
+ return 0;
+
+ memcg = page->mem_cgroup;
+
+ /* Readahead page, never charged */
+ if (!memcg)
+ return 0;
+
+ if (!mem_cgroup_is_root(memcg) &&
+ !page_counter_try_charge(&memcg->swap, 1, &counter))
+ return -ENOMEM;
+
+ oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+ VM_BUG_ON_PAGE(oldid, page);
+ mem_cgroup_swap_statistics(memcg, true);
+
+ css_get(&memcg->css);
+ return 0;
+}
+
/**
* mem_cgroup_uncharge_swap - uncharge a swap entry
* @entry: swap entry to uncharge
*
- * Drop the memsw charge associated with @entry.
+ * Drop the swap charge associated with @entry.
*/
void mem_cgroup_uncharge_swap(swp_entry_t entry)
{
struct mem_cgroup *memcg;
unsigned short id;
- if (!do_memsw_account())
+ if (!do_swap_account)
return;
id = swap_cgroup_record(entry, 0);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
- if (!mem_cgroup_is_root(memcg))
- page_counter_uncharge(&memcg->memsw, 1);
+ if (!mem_cgroup_is_root(memcg)) {
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ page_counter_uncharge(&memcg->swap, 1);
+ else
+ page_counter_uncharge(&memcg->memsw, 1);
+ }
mem_cgroup_swap_statistics(memcg, false);
css_put(&memcg->css);
}
@@ -5708,6 +5753,63 @@ static int __init enable_swap_account(char *s)
}
__setup("swapaccount=", enable_swap_account);
+static u64 swap_current_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
+}
+
+static int swap_max_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long max = READ_ONCE(memcg->swap.limit);
+
+ if (max == PAGE_COUNTER_MAX)
+ seq_puts(m, "max\n");
+ else
+ seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+ return 0;
+}
+
+static ssize_t swap_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long max;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &max);
+ if (err)
+ return err;
+
+ mutex_lock(&memcg_limit_mutex);
+ err = page_counter_limit(&memcg->swap, max);
+ mutex_unlock(&memcg_limit_mutex);
+ if (err)
+ return err;
+
+ return nbytes;
+}
+
+static struct cftype swap_files[] = {
+ {
+ .name = "swap.current",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = swap_current_read,
+ },
+ {
+ .name = "swap.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = swap_max_show,
+ .write = swap_max_write,
+ },
+ { } /* terminate */
+};
+
static struct cftype memsw_cgroup_files[] = {
{
.name = "memsw.usage_in_bytes",
@@ -5739,6 +5841,8 @@ static int __init mem_cgroup_swap_init(void)
{
if (!mem_cgroup_disabled() && really_do_swap_account) {
do_swap_account = 1;
+ WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
+ swap_files));
WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
memsw_cgroup_files));
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 9b051115a100..95e89b26dac3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -912,6 +912,9 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if (!swap.val)
goto redirty;
+ if (mem_cgroup_try_charge_swap(page, swap))
+ goto free_swap;
+
/*
* Add inode to shmem_unuse()'s list of swapped-out inodes,
* if it's not already there. Do it now before the page is
@@ -940,6 +943,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
}
mutex_unlock(&shmem_swaplist_mutex);
+free_swap:
swapcache_free(swap);
redirty:
set_page_dirty(page);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d783872d746c..934b33879af1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,6 +170,11 @@ int add_to_swap(struct page *page, struct list_head *list)
if (!entry.val)
return 0;
+ if (mem_cgroup_try_charge_swap(page, entry)) {
+ swapcache_free(entry);
+ return 0;
+ }
+
if (unlikely(PageTransHuge(page)))
if (unlikely(split_huge_page_to_list(page, list))) {
swapcache_free(entry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7073faecb38f..efa279221302 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -788,14 +788,12 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
count--;
}
- if (!count)
- mem_cgroup_uncharge_swap(entry);
-
usage = count | has_cache;
p->swap_map[offset] = usage;
/* free if no reference */
if (!usage) {
+ mem_cgroup_uncharge_swap(entry);
dec_cluster_info_page(p, p->cluster_info, offset);
if (offset < p->lowest_bit)
p->lowest_bit = offset;
--
2.1.4
memcg will come in handy in get_scan_count(): it can already be used to
look up swappiness directly in get_scan_count() instead of passing it
around, and the following patches will add more memcg-related values to
be used there.
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bb01b04154ad..acc6bff84e26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1957,10 +1957,11 @@ enum scan_balance {
* nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
*/
-static void get_scan_count(struct lruvec *lruvec, int swappiness,
+static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
struct scan_control *sc, unsigned long *nr,
unsigned long *lru_pages)
{
+ int swappiness = mem_cgroup_swappiness(memcg);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
u64 denominator = 0; /* gcc */
@@ -2184,9 +2185,10 @@ static inline void init_tlb_ubc(void)
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *lru_pages)
+static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
+ struct scan_control *sc, unsigned long *lru_pages)
{
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -2196,7 +2198,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
struct blk_plug plug;
bool scan_adjusted;
- get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
+ get_scan_count(lruvec, memcg, sc, nr, lru_pages);
/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2400,8 +2402,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;
- struct lruvec *lruvec;
- int swappiness;
if (mem_cgroup_low(root, memcg)) {
if (!sc->may_thrash)
@@ -2409,12 +2409,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
mem_cgroup_events(memcg, MEMCG_LOW, 1);
}
- lruvec = mem_cgroup_zone_lruvec(zone, memcg);
- swappiness = mem_cgroup_swappiness(memcg);
reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;
- shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ shrink_zone_memcg(zone, memcg, sc, &lru_pages);
zone_lru_pages += lru_pages;
if (memcg && is_classzone)
@@ -2884,8 +2882,6 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !noswap,
};
- struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
- int swappiness = mem_cgroup_swappiness(memcg);
unsigned long lru_pages;
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2902,7 +2898,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);
+ shrink_zone_memcg(zone, memcg, &sc, &lru_pages);
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
--
2.1.4
mem_cgroup_lruvec_online() takes a lruvec, but it only needs the memcg.
Since get_scan_count(), which is the only user of this function, now
has a pointer to the memcg, let's pass the memcg directly instead of
extracting it from the lruvec, and rename the function to
mem_cgroup_online() accordingly.
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 27 ++++++++++-----------------
mm/vmscan.c | 2 +-
2 files changed, 11 insertions(+), 18 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e0126230878..166661708410 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -355,6 +355,13 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
+static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return true;
+ return !!(memcg->css.flags & CSS_ONLINE);
+}
+
/*
* For memory reclaim.
*/
@@ -363,20 +370,6 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
int nr_pages);
-static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
-{
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup *memcg;
-
- if (mem_cgroup_disabled())
- return true;
-
- mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
- memcg = mz->memcg;
-
- return !!(memcg->css.flags & CSS_ONLINE);
-}
-
static inline
unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
@@ -589,13 +582,13 @@ static inline bool mem_cgroup_disabled(void)
return true;
}
-static inline bool
-mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
+static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
{
return true;
}
-static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
+static inline bool
+mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
{
return true;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index acc6bff84e26..b220e6cda25d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1988,7 +1988,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
if (current_is_kswapd()) {
if (!zone_reclaimable(zone))
force_scan = true;
- if (!mem_cgroup_lruvec_online(lruvec))
+ if (!mem_cgroup_online(memcg))
force_scan = true;
}
if (!global_reclaim(sc))
--
2.1.4
The following patches will add more functions to the memcg section of
include/linux/swap.h. Some of them will need values defined below the
current location of the section. So let's move the section to the end of
the file. No functional changes intended.
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/swap.h | 70 ++++++++++++++++++++++++++++------------------------
1 file changed, 38 insertions(+), 32 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 478e7dd038c7..f8fb4e06c4bd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -350,39 +350,7 @@ extern void check_move_unevictable_pages(struct page **, int nr_pages);
extern int kswapd_run(int nid);
extern void kswapd_stop(int nid);
-#ifdef CONFIG_MEMCG
-static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
-{
- /* root ? */
- if (mem_cgroup_disabled() || !memcg->css.parent)
- return vm_swappiness;
-
- return memcg->swappiness;
-}
-#else
-static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
-{
- return vm_swappiness;
-}
-#endif
-#ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
-extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
-#else
-static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
-{
-}
-static inline int mem_cgroup_try_charge_swap(struct page *page,
- swp_entry_t entry)
-{
- return 0;
-}
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
-{
-}
-#endif
#ifdef CONFIG_SWAP
/* linux/mm/page_io.c */
extern int swap_readpage(struct page *);
@@ -561,5 +529,43 @@ static inline swp_entry_t get_swap_page(void)
}
#endif /* CONFIG_SWAP */
+
+#ifdef CONFIG_MEMCG
+static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
+{
+ /* root ? */
+ if (mem_cgroup_disabled() || !memcg->css.parent)
+ return vm_swappiness;
+
+ return memcg->swappiness;
+}
+
+#else
+static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
+{
+ return vm_swappiness;
+}
+#endif
+
+#ifdef CONFIG_MEMCG_SWAP
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+#else
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+
+static inline int mem_cgroup_try_charge_swap(struct page *page,
+ swp_entry_t entry)
+{
+ return 0;
+}
+
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+}
+#endif
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
--
2.1.4
We don't scan anonymous memory if we have run out of swap; neither
should we do so when the memcg swap limit is hit, because swapping out
is impossible anyway.
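To illustrate the headroom logic this patch adds, here is a toy model
(illustrative structures, not the kernel's actual code): the swap
available to a cgroup is the global free swap clamped by limit - usage
at every level of the hierarchy, and get_scan_count() switches to
SCAN_FILE when it reaches zero.

        #include <limits.h>
        #include <stdio.h>

        /* Illustrative structures; the kernel uses page_counter and
         * parent_mem_cgroup() instead. */
        struct memcg {
                struct memcg *parent;
                long swap_limit;        /* pages */
                long swap_usage;        /* pages */
        };

        /* Toy equivalent of mem_cgroup_get_nr_swap_pages(). */
        static long nr_swap_pages_model(struct memcg *memcg, long global_free)
        {
                long nr = global_free;

                for (; memcg; memcg = memcg->parent) {
                        long headroom = memcg->swap_limit - memcg->swap_usage;

                        if (headroom < nr)
                                nr = headroom;
                }
                return nr;
        }

        int main(void)
        {
                struct memcg root = { NULL, LONG_MAX, 0 };
                struct memcg child = { &root, 1024, 1024 };     /* limit hit */

                /* Prints 0: reclaim would set scan_balance = SCAN_FILE. */
                printf("available swap: %ld pages\n",
                       nr_swap_pages_model(&child, 1 << 20));
                return 0;
        }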
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Changes in v2:
- Do not check swap limit on the legacy hierarchy.
include/linux/swap.h | 6 ++++++
mm/memcontrol.c | 13 +++++++++++++
mm/vmscan.c | 2 +-
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f8fb4e06c4bd..c544998dfbe7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -551,6 +551,7 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
#else
static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
{
@@ -565,6 +566,11 @@ static inline int mem_cgroup_try_charge_swap(struct page *page,
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
{
}
+
+static inline long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
+{
+ return get_nr_swap_pages();
+}
#endif
#endif /* __KERNEL__*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 567b56da2c23..e0e498f5ca32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5736,6 +5736,19 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
rcu_read_unlock();
}
+long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
+{
+ long nr_swap_pages = get_nr_swap_pages();
+
+ if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return nr_swap_pages;
+ for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
+ nr_swap_pages = min_t(long, nr_swap_pages,
+ READ_ONCE(memcg->swap.limit) -
+ page_counter_read(&memcg->swap));
+ return nr_swap_pages;
+}
+
/* for remember boot option*/
#ifdef CONFIG_MEMCG_SWAP_ENABLED
static int really_do_swap_account __initdata = 1;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b220e6cda25d..ab52d865d922 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1995,7 +1995,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
force_scan = true;
/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
+ if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
scan_balance = SCAN_FILE;
goto out;
}
--
2.1.4
Swap cache pages are freed aggressively if swap is nearly full (>50%
currently), because otherwise we are likely to stop scanning anonymous
pages when we near the swap limit even if there are plenty of freeable
swap cache pages. We should follow the same trend in the case of a
memory cgroup, which has its own swap limit.
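A toy model of the resulting heuristic (illustrative, not the kernel's
mem_cgroup_swap_full()): swap is treated as full for a page as soon as
any cgroup in its hierarchy has used at least half of its swap limit,
mirroring the global vm_swap_full() threshold.

        #include <limits.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative structures, as in the previous sketch. */
        struct memcg {
                struct memcg *parent;
                long swap_limit;
                long swap_usage;
        };

        /* Toy equivalent of mem_cgroup_swap_full(). */
        static bool swap_full_model(struct memcg *memcg, bool global_full)
        {
                if (global_full)        /* vm_swap_full() */
                        return true;

                for (; memcg; memcg = memcg->parent)
                        if (memcg->swap_usage * 2 >= memcg->swap_limit)
                                return true;
                return false;
        }

        int main(void)
        {
                struct memcg root = { NULL, LONG_MAX, 0 };
                struct memcg group = { &root, 1000, 600 };      /* 60% used */

                /* Prints "full": swap cache pages of this group get freed. */
                printf("%s\n", swap_full_model(&group, false) ?
                       "full" : "not full");
                return 0;
        }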
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Changes in v2:
- Remove unnecessary PageSwapCache check from mem_cgroup_swap_full.
- Do not check swap limit on the legacy hierarchy.
include/linux/swap.h | 6 ++++++
mm/memcontrol.c | 22 ++++++++++++++++++++++
mm/memory.c | 3 ++-
mm/swapfile.c | 2 +-
mm/vmscan.c | 2 +-
5 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c544998dfbe7..5ebdbabc62f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -552,6 +552,7 @@ extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
+extern bool mem_cgroup_swap_full(struct page *page);
#else
static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
{
@@ -571,6 +572,11 @@ static inline long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
return get_nr_swap_pages();
}
+
+static inline bool mem_cgroup_swap_full(struct page *page)
+{
+ return vm_swap_full();
+}
#endif
#endif /* __KERNEL__*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e0e498f5ca32..fc25dc211eaf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5749,6 +5749,28 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
return nr_swap_pages;
}
+bool mem_cgroup_swap_full(struct page *page)
+{
+ struct mem_cgroup *memcg;
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ if (vm_swap_full())
+ return true;
+ if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ memcg = page->mem_cgroup;
+ if (!memcg)
+ return false;
+
+ for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
+ if (page_counter_read(&memcg->swap) * 2 >= memcg->swap.limit)
+ return true;
+
+ return false;
+}
+
/* for remember boot option*/
#ifdef CONFIG_MEMCG_SWAP_ENABLED
static int really_do_swap_account __initdata = 1;
diff --git a/mm/memory.c b/mm/memory.c
index 3b115dcaa26e..2bd6a78c142b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2563,7 +2563,8 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
swap_free(entry);
- if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+ if (mem_cgroup_swap_full(page) ||
+ (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index efa279221302..ab1a8a619676 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1009,7 +1009,7 @@ int free_swap_and_cache(swp_entry_t entry)
* Also recheck PageSwapCache now page is locked (above).
*/
if (PageSwapCache(page) && !PageWriteback(page) &&
- (!page_mapped(page) || vm_swap_full())) {
+ (!page_mapped(page) || mem_cgroup_swap_full(page))) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab52d865d922..1cd88e9b0383 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1206,7 +1206,7 @@ cull_mlocked:
activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
- if (PageSwapCache(page) && vm_swap_full())
+ if (PageSwapCache(page) && mem_cgroup_swap_full(page))
try_to_free_swap(page);
VM_BUG_ON_PAGE(PageActive(page), page);
SetPageActive(page);
--
2.1.4
The rationale for the separate swap counter is given by Johannes Weiner.
Signed-off-by: Vladimir Davydov <[email protected]>
---
Changes in v2:
- Add rationale of separate swap counter provided by Johannes.
Documentation/cgroup.txt | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7bf12a1..f441564023e1 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -819,6 +819,22 @@ PAGE_SIZE multiple when read back.
the cgroup. This may not exactly match the number of
processes killed but should generally be close.
+ memory.swap.current
+
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of swap currently being used by the cgroup
+ and its descendants.
+
+ memory.swap.max
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Swap usage hard limit. If a cgroup's swap usage reaches this
+ limit, anonymous memory of the cgroup will not be swapped out.
+
5-2-2. General Usage
@@ -1291,3 +1307,20 @@ allocation from the slack available in other groups or the rest of the
system than killing the group. Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.
+
+The combined memory+swap accounting and limiting is replaced by real
+control over swap space.
+
+The main argument for a combined memory+swap facility in the original
+cgroup design was that global or parental pressure would always be
+able to swap all anonymous memory of a child group, regardless of the
+child's own (possibly untrusted) configuration. However, untrusted
+groups can sabotage swapping by other means - such as referencing its
+anonymous memory in a tight loop - and an admin can not assume full
+swappability when overcommitting untrusted jobs.
+
+For trusted jobs, on the other hand, a combined counter is not an
+intuitive userspace interface, and it flies in the face of the idea
+that cgroup controllers should account and limit specific physical
+resources. Swap space is a resource like all others in the system,
+and that's why the unified hierarchy allows distributing it separately.
--
2.1.4
On Thu, Dec 17, 2015 at 03:29:54PM +0300, Vladimir Davydov wrote:
...
> Note that to charge the swap resource properly in the unified
> hierarchy, we have to make swap_entry_free() uncharge swap only when
> ->usage reaches zero, not just ->count, i.e. when all references to a
> swap entry, including the one taken by the swap cache, are gone. This
> is necessary because otherwise swap-in could result in uncharging swap
> even though the page is still in the swap cache and hence still
> occupies a swap entry. At the same time, this shouldn't break the
> memsw counter logic, where a page is never charged twice for using
> both memory and swap, because in the legacy hierarchy we uncharge swap
> on commit (see mem_cgroup_commit_charge()).
This was actually an oversight when rewriting swap accounting: swap
should always have been uncharged when the swap slot is released.
> Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Dec 17, 2015 at 03:30:00PM +0300, Vladimir Davydov wrote:
> The rationale for the separate swap counter is given by Johannes Weiner.
>
> Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Fri, Dec 18, 2015 at 11:51:23AM +0900, Kamezawa Hiroyuki wrote:
...
> Could you give a hint here on how to calculate the amount of swap
> cache that is counted in both memory.current and swap.current?
Currently it's impossible, but once memory.stat has settled in the
unified hierarchy, it might be worth adding a stat counter for such
pages (equivalent to SwapCached in the global /proc/meminfo).
Thanks,
Vladimir
On 2015/12/17 21:30, Vladimir Davydov wrote:
> The rationale for the separate swap counter is given by Johannes Weiner.
>
> Signed-off-by: Vladimir Davydov <[email protected]>
...
Could you give a hint here on how to calculate the amount of swap
cache that is counted in both memory.current and swap.current?
Thanks,
-Kame