From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Greg Thelen, Roman Gushchin,
    Johannes Weiner, Michal Hocko, Vladimir Davydov, Tejun Heo,
    Andrew Morton, Linus Torvalds
Subject: [PATCH 4.19 064/101] mm: writeback: use exact memcg dirty counts
Date: Mon, 15 Apr 2019 20:59:02 +0200
Message-Id: <20190415183743.876384717@linuxfoundation.org>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190415183740.341577907@linuxfoundation.org>
References: <20190415183740.341577907@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Greg Thelen

commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed as:

 1) per-memcg per-cpu values in the range [-32..32]

 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic.
Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu.

Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions, the
errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is when
all per-cpu counters hold positive values offsetting an atomic negative
value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).  balance_dirty_pages() only
consults the atomic and does not consider throttling the next n_cpu*32
dirty pages.  If the file_lru is in the 13..200 MiB range then there's
absolutely no dirty throttling, which leaves vmscan with only
dirty+writeback pages to reclaim and thus resorts to an oom kill.

It could be argued that tiny containers are not supported, but it's more
subtle than that.  It's the amount of space available for the file LRU
that matters.  If a container has memory.max - 200 MiB of non-reclaimable
memory, then it will also suffer such oom kills on a 100-cpu machine.
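To make the worst-case figures above concrete, here is a small standalone
program (illustrative only, not part of the patch; it simply restates the
assumptions already given: 100 cpus, a 32-page per-cpu batch, 4k and 64k
page sizes) that computes the error a reader of the atomic counter alone
can see:

#include <stdio.h>

int main(void)
{
	const double cpus = 100;	/* assumed machine size */
	const double batch = 32;	/* per-cpu error bound, in pages */
	const double mib = 1 << 20;

	/* worst case: every per-cpu delta sits at its +/-32 page limit */
	printf("4k pages:  %.1f MiB error per counter\n",
	       cpus * batch * 4096 / mib);	/* ~12.5 MiB, the "13 MiB" above */
	printf("64k pages: %.1f MiB error per counter\n",
	       cpus * batch * 65536 / mib);	/* 200 MiB */
	return 0;
}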
The following test reliably ooms without this patch.  This patch avoids
oom kills.

$ cat test
mount -t cgroup2 none /dev/cgroup
cd /dev/cgroup
echo +io +memory > cgroup.subtree_control
mkdir test
cd test
echo 10M > memory.max
(echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
(echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

$ cat memcg-writeback-stress.c
/*
 * Dirty pages from all but one cpu.
 * Clean pages from the non dirtying cpu.
 * This is to stress per cpu counter imbalance.
 * On a 100 cpu machine:
 * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
 * - per memcg atomic is -99*32 pages
 * - thus the complete dirty limit: sum of all counters 0
 * - balance_dirty_pages() only sees atomic count -99*32 pages, which
 *   it max()s to 0.
 * - So a workload can dirty -99*32 pages before balance_dirty_pages()
 *   cares.
 */
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>

static char *buf;
static int bufSize;

static void set_affinity(int cpu)
{
	cpu_set_t affinity;

	CPU_ZERO(&affinity);
	CPU_SET(cpu, &affinity);
	if (sched_setaffinity(0, sizeof(affinity), &affinity))
		err(1, "sched_setaffinity");
}

static void dirty_on(int output_fd, int cpu)
{
	int i, wrote;

	set_affinity(cpu);
	for (i = 0; i < 32; i++) {
		for (wrote = 0; wrote < bufSize; ) {
			int ret = write(output_fd, buf+wrote, bufSize-wrote);
			if (ret == -1)
				err(1, "write");
			wrote += ret;
		}
	}
}

int main(int argc, char **argv)
{
	int cpu, flush_cpu = 1, output_fd;
	const char *output;

	if (argc != 2)
		errx(1, "usage: output_file");
	output = argv[1];
	bufSize = getpagesize();
	buf = malloc(getpagesize());
	if (buf == NULL)
		errx(1, "malloc failed");
	output_fd = open(output, O_CREAT|O_RDWR);
	if (output_fd == -1)
		err(1, "open(%s)", output);

	for (cpu = 0; cpu < get_nprocs(); cpu++) {
		if (cpu != flush_cpu)
			dirty_on(output_fd, cpu);
	}

	set_affinity(flush_cpu);
	if (fsync(output_fd))
		err(1, "fsync(%s)", output);
	if (close(output_fd))
		err(1, "close(%s)", output);
	free(buf);
}

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect
exact per-memcg counters.  This avoids the aforementioned oom kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline, so
there is no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters in
memcg oom reports.  But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen
Reviewed-by: Roman Gushchin
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Cc: <stable@vger.kernel.org> [4.16+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
---
 include/linux/memcontrol.h |    5 ++++-
 mm/memcontrol.c            |   20 ++++++++++++++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -559,7 +559,10 @@ struct mem_cgroup *lock_page_memcg(struc
 void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
 
-/* idx can be of type enum memcg_stat_item or node_stat_item */
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
 					     int idx)
 {
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3897,6 +3897,22 @@ struct wb_domain *mem_cgroup_wb_domain(s
 	return &memcg->cgwb_domain;
 }
 
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page().
+ */
+static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
+{
+	long x = atomic_long_read(&memcg->stat[idx]);
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
+	if (x < 0)
+		x = 0;
+	return x;
+}
+
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -3922,10 +3938,10 @@ void mem_cgroup_wb_stats(struct bdi_writ
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
-	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
 	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
 						     (1 << LRU_ACTIVE_FILE));
 	*pheadroom = PAGE_COUNTER_MAX;
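For readers who want to see the counter scheme outside of the kernel, the
sketch below models it in plain userspace C.  It is only an illustration:
the names (mod_counter, read_approx, read_exact) are invented, a single
thread stands in for per-cpu state, and the +/-32 batching corresponds to
the kernel's MEMCG_CHARGE_BATCH.  read_exact() has the same shape as
memcg_exact_page_state() from the hunk above, and the demo reproduces the
changelog's nasty case: the approximate reader reports zero dirty pages
while the exact reader sees the true count.

#include <stdatomic.h>
#include <stdio.h>

#define NCPU  100	/* assumed machine size */
#define BATCH 32	/* per-cpu deltas are folded into the atomic in +/-BATCH steps */

/* Invented stand-ins for memcg->stat[idx] and memcg->stat_cpu->count[idx]. */
static atomic_long total;	/* the per-memcg atomic counter */
static long pcpu[NCPU];		/* per-cpu deltas, each kept in [-BATCH..BATCH] */

/* Writer path: accumulate locally, flush to the atomic only in batches. */
static void mod_counter(int cpu, long delta)
{
	long x = pcpu[cpu] + delta;

	if (x > BATCH || x < -BATCH) {
		atomic_fetch_add(&total, x);
		x = 0;
	}
	pcpu[cpu] = x;
}

/* Cheap reader: atomic only; can be off by up to NCPU * BATCH pages. */
static long read_approx(void)
{
	long x = atomic_load(&total);

	return x < 0 ? 0 : x;
}

/* Exact reader: fold in every cpu's delta before clamping. */
static long read_exact(void)
{
	long x = atomic_load(&total);

	for (int cpu = 0; cpu < NCPU; cpu++)
		x += pcpu[cpu];
	return x < 0 ? 0 : x;
}

int main(void)
{
	/*
	 * The changelog's nasty case: 99 cpus each hold +32 pages locally
	 * while the atomic has absorbed -99*32, so the true count is 0 but
	 * the atomic alone is deeply negative.
	 */
	atomic_store(&total, -99L * BATCH);
	for (int cpu = 1; cpu < NCPU; cpu++)
		pcpu[cpu] = BATCH;

	/* Dirty 1000 more pages from cpu 0. */
	for (int i = 0; i < 1000; i++)
		mod_counter(0, 1);

	printf("approx reader: %ld pages\n", read_approx());	/* 0 */
	printf("exact reader:  %ld pages\n", read_exact());	/* 1000 */
	return 0;
}

The kernel-side story is the same: balance_dirty_pages() previously played
the role of read_approx(), which is why the stress workload can dirty
roughly 99*32 pages without ever being throttled; the patch switches the
writeback paths to the read_exact() shape while memory.stat keeps the
cheap single-atomic read.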