Received: by 2002:a05:6a10:8c0a:0:0:0:0 with SMTP id go10csp1105371pxb; Tue, 9 Feb 2021 23:29:17 -0800 (PST) X-Google-Smtp-Source: ABdhPJx016CV4j45RBwj+ufBYLvC72PNTGiF4cQigHgOU/UnXpylRVTHinywoy1nxsMZ19VTNhG0 X-Received: by 2002:a17:906:7e49:: with SMTP id z9mr1628331ejr.293.1612942157269; Tue, 09 Feb 2021 23:29:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1612942157; cv=none; d=google.com; s=arc-20160816; b=IbTBioopVO2/H3h7kqfk5voWr50H8cWH5amjSIZZMYSpwEshLCPUiv+dl2Uiy+qvJZ vNg1pVSA7uRnapREFgMDzepDX9IzLpFZ5K09IidyEv9LZIa9q1mBahjFovGlnKxhySGP lr9x5WKgxXE5luI0H4QG3imokOKIPvV8Wjpqt8aLtN4N9OHV4wSx2C3dBYSgAmpqnzyD Amh5wt4fzvVdZtzDEQI/uXVfs2lOvBLq5990TSGNbji3h/xtUV4ChLRYwTqg4uBcL/cN ZA867cASjTZdobozg1d51E2Nvq9cZZOnfc2KencA0i/dsomtJ+hAb9Nt169KeyEI+RIC OpDA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=XqwbjVKT92bgGmTwHneB7S6uqWbAlrDany7fE7oGT1Y=; b=g0vDcYu0ilmEB+oDgcJ3uI0U6hgA9/U/EdlIPhkh0JFaZ+XHoU3db0cykimymEPm9K qntBFRW4uoFv1ad70BS5kHBOLhSKvgQrZVWjrUSxavksLB2+9wOo7S4tcgAisqk2G5Wh KvqcyTea/uQwrWUaeoI3HTfqdXDGf1ysORm2NirIrzodLzg5f3vX6YbUZAbuDTail9n4 rRQNbcZVlF856scJ9So6Yl4ISjvgWwS0rcwMLTbKHpI4EXZW6cpLROO/hNmHiS+iYyFq +tTqHk+B0niOXjQPSx2GIJbX0nO88mOrEpdR0bsCENEbCokLMlqP2Ey4/pZO1hDX8r37 ZNEA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=wl5iSzgZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id br15si798316ejb.696.2021.02.09.23.28.53; Tue, 09 Feb 2021 23:29:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=wl5iSzgZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231355AbhBIQef (ORCPT + 99 others); Tue, 9 Feb 2021 11:34:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40100 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232710AbhBIQeJ (ORCPT ); Tue, 9 Feb 2021 11:34:09 -0500 Received: from mail-qv1-xf32.google.com (mail-qv1-xf32.google.com [IPv6:2607:f8b0:4864:20::f32]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9EB78C061574 for ; Tue, 9 Feb 2021 08:33:28 -0800 (PST) Received: by mail-qv1-xf32.google.com with SMTP id 2so8987864qvd.0 for ; Tue, 09 Feb 2021 08:33:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=XqwbjVKT92bgGmTwHneB7S6uqWbAlrDany7fE7oGT1Y=; b=wl5iSzgZjMYgWkqOkI6F1OA6EQOr0VBwWz+1jGJaqi9r0iUCCJMwGfNy2ZgOyJmrST vI9XWIPnkXj3pyoS9MC9dwZeK338IHOw9TivjHI1cpLlskJGdf5qqx53svtAV2vJ1MP+ 4gjuaJl6/GvvI9OK4j1vD+s8KCK/lEJ3wnl16G3olAhs+NxlVEHkzb/mrooKe86FFG86 K8jc53i5n0xi4ZZKuID1hBG8PLcyVmjfJ00zIKKsi8iZzFGEwYc8oDqm+IViUkuIzBHA TPd5DYSVcdaXjzcwLruGKXCJgCuxRWnF/I3rqeoMLw37Dnmv1MGK8blqtFqMuuXyLG9m I1IA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=XqwbjVKT92bgGmTwHneB7S6uqWbAlrDany7fE7oGT1Y=; b=JV5HaVUJlcy+uDgYgRR6TM2krwu8jmtOmyacVQHR0VVHJUyKtHBDK/SVk7S7+5PKIa TCcBACQchhBZmC74TeYBjdW2an39sqSTQYmuSCCb/Jjg2w2PUbQQvtYLb1cDWayMwV6N dtzw1EigGjbtk0kiVGJRa+EKFSg+kLQ53if/Ajaem9wMdZdychsXknlBJD48//6HREtu +qJQAEELHkRMjLX5StabYqkcWkExJR1yaGqQ9AjIXDJl+27sM2m+lbFanA/uAxm3R4AD ERp1vPBQ5Bfe0zxSyLl4ppruwHa4u6oKnSMIQ9svw1UIokihpvKnbZwf91RdEIXirgIl bKNA== X-Gm-Message-State: AOAM530qOi37rAgiHzU5mE3t0imk2lQ6LrpxyTk5hLTgLUzfvj7jzths HT8I7RL4k5Lf9H+awoDZWUCjYw== X-Received: by 2002:a0c:814e:: with SMTP id 72mr12762669qvc.16.1612888407771; Tue, 09 Feb 2021 08:33:27 -0800 (PST) Received: from localhost (70.44.39.90.res-cmts.bus.ptd.net. [70.44.39.90]) by smtp.gmail.com with ESMTPSA id h12sm18296262qko.29.2021.02.09.08.33.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Feb 2021 08:33:26 -0800 (PST) From: Johannes Weiner To: Andrew Morton Cc: Tejun Heo , Michal Hocko , Roman Gushchin , Shakeel Butt , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v3 0/8] mm: memcontrol: switch to rstat Date: Tue, 9 Feb 2021 11:32:56 -0500 Message-Id: <20210209163304.77088-1-hannes@cmpxchg.org> X-Mailer: git-send-email 2.30.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is version 3 of the memcg rstat patches. Updates since v2: - optimize root level in io controller's rstat flush callback (Tejun) - add comments and changelog notes on root level optimization (Tejun) - eliminate mem_cgroup_usage() rstat flush for leaf cgroups (Shakeel) - collected review & ack tags Updates since v1: - added cgroup selftest output (see test section below) (thanks Roman) - patched cgroup selftest so error expectations match kernel implementation - added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel) - collected review & ack tags - added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal) - simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal) --- This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers. The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications. The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times: - Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore. - Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments. - The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree. Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups. --- The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups: ok 1 test_kmem_basic memory.current = 8810496 slab + anon + file + kernel_stack = 17074568 slab = 6101384 anon = 946176 file = 0 kernel_stack = 10027008 not ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches. The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them - as well as reduces the vmstat tolerances in general to only expect page_counter batching. With all patches applied, the (now more stringent) test succeeds: ok 1 test_kmem_basic ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic --- A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings: root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16) `- build-b (defconfig, make -j16) `- idle-1 `- ... `- idle-9 During the builds, kernelbuild/memory.stat is read once a second. A perf diff shows that the changes in cycle distribution is minimal. Top 10 kernel symbols: 0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated 0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0 0.16% -0.04% [kernel.kallsyms] [k] release_pages 0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events 0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0 0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size 0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault --- The on-demand aggregated stats are now fully accurate: $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \ grep -e inactive_file /sys/fs/cgroup/memory.stat vanilla: patched: nr_inactive_file 1574105088 nr_inactive_file 1027801088 inactive_file 1577410560 inactive_file 1027801088 --- block/blk-cgroup.c | 17 +- include/linux/memcontrol.h | 119 ++++------- kernel/cgroup/cgroup.c | 34 +-- kernel/cgroup/rstat.c | 63 +++--- mm/memcontrol.c | 305 +++++++++++++-------------- tools/testing/selftests/cgroup/test_kmem.c | 22 +- 6 files changed, 267 insertions(+), 293 deletions(-) Based on v5.11-rc6-mmotm-2021-02-02-20-19.