Date: Mon, 13 Jul 2020 12:42:35 +0100
From: Chris Down
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v2 1/2] mm, memcg: reclaim more aggressively before high
 allocator throttling

In Facebook production, we've seen cases where cgroups have been put
into allocator throttling even when they appear to have a lot of slack
file caches which should be trivially reclaimable.
Looking more closely, the problem is that we only try a single cgroup
reclaim walk for each return to usermode before calculating whether or
not we should throttle. This single attempt doesn't produce enough
pressure to shrink for cgroups with a rapidly growing amount of file
caches prior to entering allocator throttling.

As an example, we see that threads in an affected cgroup are stuck in
allocator throttling:

    # for i in $(cat cgroup.threads); do
    >     grep over_high "/proc/$i/stack"
    > done
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150

...however, there is no I/O pressure reported by PSI, despite a lot of
slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

This patch changes the behaviour to retry reclaim either until the
current task goes below the 10ms grace period, or we are making no
reclaim progress at all. In the latter case, we enter reclaim throttling
as before.

To a user, there's no intuitive reason for the reclaim behaviour to
differ from hitting memory.high as part of a new allocation, as opposed
to hitting memory.high because someone lowered its value. As such this
also brings an added benefit: it unifies the reclaim behaviour between
the two.

There's precedent for this behaviour: we already do reclaim retries when
writing to memory.{high,max}, in max reclaim, and in the page allocator
itself.
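To make the retry policy concrete, here is a minimal standalone
userspace sketch of the control flow introduced below (illustrative
only, not part of the patch: reclaim_stub(), over_penalty_threshold()
and the page counts are made-up stand-ins for
try_to_free_mem_cgroup_pages() and the penalty calculation):

    #include <stdbool.h>
    #include <stdio.h>

    #define MEM_CGROUP_RECLAIM_RETRIES 5  /* no-progress retries before throttling */
    #define SWAP_CLUSTER_MAX 32           /* smaller batch used on retries */

    /*
     * Stand-in for try_to_free_mem_cgroup_pages(): pretend reclaim can
     * free at most `budget` pages in total, so later calls free less
     * and less, eventually nothing at all.
     */
    static unsigned long reclaim_stub(unsigned int nr_pages)
    {
            static unsigned long budget = 100;
            unsigned long freed = nr_pages < budget ? nr_pages : budget;

            budget -= freed;
            return freed;
    }

    /* Stand-in for the penalty_jiffies <= HZ / 100 grace-period check. */
    static bool over_penalty_threshold(unsigned long pages_over_high)
    {
            return pages_over_high > 0;
    }

    int main(void)
    {
            unsigned int nr_pages = 64;  /* initial over-high batch */
            int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
            bool in_retry = false;
            unsigned long pages_over_high = 150;
            unsigned long nr_reclaimed;

    retry_reclaim:
            /* The first pass reclaims the full batch; retries only do
             * SWAP_CLUSTER_MAX, what's needed to hold the line. */
            nr_reclaimed = reclaim_stub(in_retry ? SWAP_CLUSTER_MAX : nr_pages);
            pages_over_high -= nr_reclaimed < pages_over_high ?
                               nr_reclaimed : pages_over_high;
            printf("reclaimed %lu pages, %lu still over high\n",
                   nr_reclaimed, pages_over_high);

            if (!over_penalty_threshold(pages_over_high))
                    return 0;  /* below the grace period: no throttling */

            /* Retry while reclaim makes forward progress; only
             * fruitless passes consume the bounded retry budget. */
            if (nr_reclaimed || nr_retries--) {
                    in_retry = true;
                    goto retry_reclaim;
            }

            puts("no reclaim progress: enter allocator throttling");
            return 1;
    }

The property modelled here is that any forward progress keeps the loop
alive, while only passes that reclaim nothing consume the bounded
nr_retries budget before throttling kicks in.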
Signed-off-by: Chris Down
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Tejun Heo
Cc: Michal Hocko
---
 mm/memcontrol.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0145a77aa074..d4b0d8af3747 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 
+/* The number of times we should retry reclaim failures before giving up. */
 #define MEM_CGROUP_RECLAIM_RETRIES 5
 
 /* Socket memory accounting disabled? */
@@ -2365,18 +2366,23 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
-static void reclaim_high(struct mem_cgroup *memcg,
-			 unsigned int nr_pages,
-			 gfp_t gfp_mask)
+static unsigned long reclaim_high(struct mem_cgroup *memcg,
+				  unsigned int nr_pages,
+				  gfp_t gfp_mask)
 {
+	unsigned long nr_reclaimed = 0;
+
 	do {
 		if (page_counter_read(&memcg->memory) <=
 		    READ_ONCE(memcg->memory.high))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+							     gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
+
+	return nr_reclaimed;
 }
 
 static void high_work_func(struct work_struct *work)
@@ -2532,16 +2538,32 @@ void mem_cgroup_handle_over_high(void)
 {
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
+	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *memcg;
+	bool in_retry = false;
 
 	if (likely(!nr_pages))
 		return;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	current->memcg_nr_pages_over_high = 0;
 
+retry_reclaim:
+	/*
+	 * The allocating task should reclaim at least the batch size, but for
+	 * subsequent retries we only want to do what's necessary to prevent oom
+	 * or breaching resource isolation.
+	 *
+	 * This is distinct from memory.max or page allocator behaviour because
+	 * memory.high is currently batched, whereas memory.max and the page
+	 * allocator run every time an allocation is made.
+	 */
+	nr_reclaimed = reclaim_high(memcg,
+				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
+				    GFP_KERNEL);
+
 	/*
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
@@ -2568,6 +2590,16 @@ void mem_cgroup_handle_over_high(void)
 	if (penalty_jiffies <= HZ / 100)
 		goto out;
 
+	/*
+	 * If reclaim is making forward progress but we're still over
+	 * memory.high, we want to encourage that rather than doing allocator
+	 * throttling.
+	 */
+	if (nr_reclaimed || nr_retries--) {
+		in_retry = true;
+		goto retry_reclaim;
+	}
+
 	/*
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
-- 
2.27.0