Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754036AbbLJRAg (ORCPT ); Thu, 10 Dec 2015 12:00:36 -0500 Received: from mx2.parallels.com ([199.115.105.18]:41713 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753665AbbLJRAa (ORCPT ); Thu, 10 Dec 2015 12:00:30 -0500 Date: Thu, 10 Dec 2015 20:00:20 +0300 From: Vladimir Davydov To: Johannes Weiner CC: Andrew Morton , Michal Hocko , , Subject: Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2 Message-ID: <20151210170020.GA5171@esperanza> References: <265d8fe623ed2773d69a26d302eb31e335377c77.1449742560.git.vdavydov@virtuozzo.com> <20151210160027.GA3308@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151210160027.GA3308@cmpxchg.org> X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To US-EXCH.sw.swsoft.com (10.255.249.47) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5013 Lines: 138 On Thu, Dec 10, 2015 at 11:00:27AM -0500, Johannes Weiner wrote: > On Thu, Dec 10, 2015 at 02:39:14PM +0300, Vladimir Davydov wrote: ... > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index c6a5ed2f2744..993c9a26b637 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -169,6 +169,7 @@ struct mem_cgroup { > > > > /* Accounted resources */ > > struct page_counter memory; > > + struct page_counter swap; > > struct page_counter memsw; > > struct page_counter kmem; > > We should probably separate this to differentiate the new counters > from the old ones. Only memory and swap are actual resources, the > memsw and kmem counters are counting consumer-oriented. Yeah, but we'd better do it in a separate patch. > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 457181844b6e..f4b3ccdcba91 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -368,11 +368,16 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) > > #endif > > #ifdef CONFIG_MEMCG_SWAP > > extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry); > > +extern int mem_cgroup_charge_swap(struct page *page, swp_entry_t entry); > > Should this be mem_cgroup_try_swap() to keep in line with the page > counter terminology? So it's clear this is not forcing a charge. Hmm, I thought we only use try_charge name in memcontrol.c if there is commit stage, e.g. we have memcg_kmem_charge, not memcg_kmem_try_charge. This conflicts with page_counter semantics though, so we might want to rename it. > > > @@ -1248,12 +1248,15 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg) > > { > > unsigned long limit; > > > > - limit = memcg->memory.limit; > > + limit = READ_ONCE(memcg->memory.limit); > > if (mem_cgroup_swappiness(memcg)) { > > unsigned long memsw_limit; > > + unsigned long swap_limit; > > > > - memsw_limit = memcg->memsw.limit; > > - limit = min(limit + total_swap_pages, memsw_limit); > > + memsw_limit = READ_ONCE(memcg->memsw.limit); > > + swap_limit = min(READ_ONCE(memcg->swap.limit), > > + (unsigned long)total_swap_pages); > > + limit = min(limit + swap_limit, memsw_limit); > > } > > return limit; > > This is taking a racy snapshot, so we don't rely on 100% accuracy. Can > we do without the READ_ONCE()? Well, I suppose we can, but passing a volatile value to min macro looks a bit scary to me. What if swap_limit is changed from a finite value less than total_swap_pages to inf while we are there? We might use an infinite memory size in OOM which would screw up OOM scores AFAIU. Unlikely, but still. > > > @@ -5754,26 +5760,66 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) > > memcg_check_events(memcg, page); > > } > > > > +/* > > + * mem_cgroup_charge_swap - charge a swap entry > > + * @page: page being added to swap > > + * @entry: swap entry to charge > > + * > > + * Try to charge @entry to the memcg that @page belongs to. > > + * > > + * Returns 0 on success, -ENOMEM on failure. > > + */ > > +int mem_cgroup_charge_swap(struct page *page, swp_entry_t entry) > > +{ > > + struct mem_cgroup *memcg; > > + struct page_counter *counter; > > + unsigned short oldid; > > + > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account) > > + return 0; > > + > > + memcg = page->mem_cgroup; > > + > > + /* Readahead page, never charged */ > > + if (!memcg) > > + return 0; > > + > > + if (!mem_cgroup_is_root(memcg) && > > + !page_counter_try_charge(&memcg->swap, 1, &counter)) > > + return -ENOMEM; > > + > > + oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg)); > > + VM_BUG_ON_PAGE(oldid, page); > > + mem_cgroup_swap_statistics(memcg, true); > > + > > + css_get(&memcg->css); > > I think we don't have to duplicate the swap record code. Both cgroup1 > and cgroup2 could run this function to handle the swapout record and > statistics, and then mem_cgroup_swapout() would simply uncharge memsw. Well, may be. I'm afraid this might make mem_cgroup_charge_swap look a bit messy due to necessity to check cgroup_subsys_on_dfl in the middle of it, but I'll give it a try. > > > @@ -5828,6 +5931,8 @@ static int __init mem_cgroup_swap_init(void) > > { > > if (!mem_cgroup_disabled() && really_do_swap_account) { > > do_swap_account = 1; > > + WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, > > + swap_files)); > > WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, > > memsw_cgroup_files)); > > I guess we could also support cgroup.memory=noswap. > Yeah, that would be cleaner. I wonder if we could drop swapaccount boot param then so as not to clutter the API. Thanks for the review! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/