From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Roman Gushchin,
    Johannes Weiner, Michal Hocko, Andrew Morton, Linus Torvalds
Subject: [PATCH 5.4 080/222] mm: memcg/slab: fix percpu slab vmstats flushing
Date: Wed, 22 Jan 2020 10:27:46 +0100
Message-Id: <20200122092839.440399005@linuxfoundation.org>
In-Reply-To: <20200122092833.339495161@linuxfoundation.org>
References: <20200122092833.339495161@linuxfoundation.org>

From: Roman Gushchin

commit 4a87e2a25dc27131c3cce5e94421622193305638 upstream.

Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts, and propagated up
the cgroup tree.

The second flushing is required by the way recursive vmstats are
implemented: counters are batched in percpu variables at the local
level, and once a percpu value crosses a predefined threshold, it spills
over into the atomic values at the local and each ancestor level.
Without flushing, any numbers still cached in percpu variables would be
dropped on the floor each time a cgroup is destroyed, and with uptime
the error at upper levels might become noticeable.

The first flushing aims to make counters at ancestor levels more
precise. Dying cgroups may remain in the dying state for a long time.
After the kmem_cache reparenting performed during offlining, slab
counters of the dying cgroup have no chance of being updated, because
all slab operations are performed at the parent level. This means the
inaccuracy caused by percpu batching will not decrease until the final
destruction of the cgroup. The original idea was that flushing slab
counters during offlining would minimize the visible inaccuracy of slab
counters at the parent level.

The problem is that percpu counters are not zeroed after the first
flushing, so every cached percpu value is summed twice. This creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
at the parent cgroup level. After creating and destroying thousands of
child cgroups, the slab counter at the parent level can be way off the
real value.

For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work item on each cpu:
reading and zeroing a percpu counter during css offlining can race with
an asynchronous update, which doesn't expect values to be changed
underneath.

With this change, slab counters at the parent level become eventually
consistent. Once all dying children are gone, the values are correct;
until then, the error is capped at 32 * NR_CPUS pages per dying cgroup.

It's not perfect, since slabs are reparented, so any updates after the
reparenting happen at the parent level. This means that if a slab page
was allocated and a counter at the child level was bumped, then the page
was reparented and freed, the positive and negative counter values will
not cancel out until the child cgroup is released. That makes slab
counters different from the others, and it might push us to implement
flushing in a correct form again. But it's also a question of
performance: scheduling a work item on each cpu isn't free, and it's an
open question whether the benefit of more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only the
slab counters.
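To make the batching and double-count concrete, here is a minimal
userspace model. It is a sketch only: NCPU, BATCH, struct grp,
mod_stat() and flush() are invented names for illustration, with BATCH
standing in for the percpu batch threshold; they are not the kernel's
types or API. Deltas below the threshold stay cached per cpu, and a
flush that sums the caches without zeroing them re-adds the same cached
deltas on every flush:

	/*
	 * Toy model of threshold-batched recursive counters.
	 * Not kernel code; all names are invented for this sketch.
	 */
	#include <stdio.h>

	#define NCPU  4
	#define BATCH 32	/* stand-in for the percpu batch threshold, in pages */

	struct grp {
		struct grp *parent;
		long cached[NCPU];	/* per-cpu deltas still below BATCH */
		long total;		/* models the atomic counter at each level */
	};

	/* Batched update: small deltas stay cached; big ones spill upward. */
	static void mod_stat(struct grp *g, int cpu, long delta)
	{
		g->cached[cpu] += delta;
		if (g->cached[cpu] > BATCH || g->cached[cpu] < -BATCH) {
			for (struct grp *p = g; p; p = p->parent)
				p->total += g->cached[cpu];
			g->cached[cpu] = 0;
		}
	}

	/* Flush the caches into this level and every ancestor. */
	static void flush(struct grp *g, int zero)
	{
		for (int cpu = 0; cpu < NCPU; cpu++) {
			for (struct grp *p = g; p; p = p->parent)
				p->total += g->cached[cpu];
			if (zero)
				g->cached[cpu] = 0;	/* the step the old code lacked */
		}
	}

	int main(void)
	{
		struct grp parent = { 0 }, child = { .parent = &parent };

		for (int cpu = 0; cpu < NCPU; cpu++)
			mod_stat(&child, cpu, 10);	/* 10 < BATCH: stays cached */

		flush(&child, 0);	/* flush at offlining, caches not zeroed */
		flush(&child, 0);	/* final flush: same deltas summed again */
		printf("parent sees %ld pages, real value is 40\n", parent.total);
		return 0;
	}

This prints 80 where the true value is 40: each flush re-adds the same
cached 10 pages per cpu. Replacing the two flush(&child, 0) calls with a
single final flush, which is effectively what this patch does, yields
the exact 40.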
So let's fix the main problem now: make the slab counters eventually
consistent, so that at least the error won't grow with uptime (or, more
precisely, with the number of created and destroyed cgroups), and think
about the accuracy of the counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

---
 include/linux/mmzone.h |    5 ++---
 mm/memcontrol.c        |   37 +++++++++----------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,9 +215,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "   */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "   */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "   */
-	NR_SLAB_RECLAIMABLE,	/* Please do not reorder this item */
-	NR_SLAB_UNRECLAIMABLE,	/* and this one without looking at
-				 * memcg_flush_percpu_vmstats() first. */
+	NR_SLAB_RECLAIMABLE,
+	NR_SLAB_UNRECLAIMABLE,
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3404,49 +3404,34 @@ static u64 mem_cgroup_read_u64(struct cg
 	}
 }
 
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
+static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
 {
-	unsigned long stat[MEMCG_NR_STAT];
+	unsigned long stat[MEMCG_NR_STAT] = {0};
 	struct mem_cgroup *mi;
 	int node, cpu, i;
-	int min_idx, max_idx;
-
-	if (slab_only) {
-		min_idx = NR_SLAB_RECLAIMABLE;
-		max_idx = NR_SLAB_UNRECLAIMABLE;
-	} else {
-		min_idx = 0;
-		max_idx = MEMCG_NR_STAT;
-	}
-
-	for (i = min_idx; i < max_idx; i++)
-		stat[i] = 0;
 
 	for_each_online_cpu(cpu)
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < MEMCG_NR_STAT; i++)
 			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
 
 	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < MEMCG_NR_STAT; i++)
 			atomic_long_add(stat[i], &mi->vmstats[i]);
 
-	if (!slab_only)
-		max_idx = NR_VM_NODE_STAT_ITEMS;
-
 	for_each_node(node) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
 		struct mem_cgroup_per_node *pi;
 
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 			stat[i] = 0;
 
 		for_each_online_cpu(cpu)
-			for (i = min_idx; i < max_idx; i++)
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 				stat[i] += per_cpu(
 					pn->lruvec_stat_cpu->count[i], cpu);
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
-			for (i = min_idx; i < max_idx; i++)
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
 	}
 }
@@ -3520,13 +3505,9 @@ static void memcg_offline_kmem(struct me
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches. Then flush percpu
-	 * slab statistics to have precise values at the parent and
-	 * all ancestor levels. It's required to keep slab stats
-	 * accurate after the reparenting of kmem_caches.
+	 * Deactivate and reparent kmem_caches.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
-	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5037,7 +5018,7 @@ static void mem_cgroup_free(struct mem_c
 	 * Flush percpu vmstats and vmevents to guarantee the value correctness
 	 * on parent's and all ancestor levels.
 	 */
-	memcg_flush_percpu_vmstats(memcg, false);
+	memcg_flush_percpu_vmstats(memcg);
 	memcg_flush_percpu_vmevents(memcg);
 	__mem_cgroup_free(memcg);
 }
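
For a rough sense of the residual error the commit message accepts (a
back-of-the-envelope figure: the 32-page percpu batch follows from the
"32 pages per cpu" cap quoted above, while the 4 KiB page size is an
assumption), on a 64-cpu machine each dying cgroup can be off by at most
32 * 64 = 2048 pages, i.e. about 8 MiB, and the error disappears once
the cgroup is finally freed and the remaining flush in
mem_cgroup_free() runs.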