Received: by 2002:a05:6a10:8c0a:0:0:0:0 with SMTP id go10csp4315995pxb; Tue, 26 Jan 2021 19:23:08 -0800 (PST) X-Google-Smtp-Source: ABdhPJxxYUbUzCm6sYyS/XDi1ffrN6H+cZvScC4QnLI3/e8NFACWW78Hn3eW82jvZgKuTA9AbfQM X-Received: by 2002:a05:6402:70f:: with SMTP id w15mr7115359edx.121.1611717788618; Tue, 26 Jan 2021 19:23:08 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1611717788; cv=none; d=google.com; s=arc-20160816; b=ojBflUgnh6A/Qj+YgWIL84Ju0UxYsEDHCuZ7Oe21HYileXVirgFj/64OOy1oXCvV8x 9v5bZKFwbcSrlBACGhCm8GhSWewTW4kO5QKiyBvu8JCCXspiEroPMY7hnVZpyJXpfjuE hDhHwnVaaeg50BvpNfmdtJxFGUiSlE2IP1E+kIy2xWeTRYR2bZCg/g8uuehnZ8Xnup30 grTaPLORTbhw79FYQzZX1JbiDdqWVOjhGhJP+JJ1+C3ynQEnzbbMQ3RMW6aMjARlHM9i ae3AwecLsWH6NC00uA7ZKqLJznDwqWs2zelxXg8vD7FGNJOC1JhCQG1zv2In22QyJaKa 4omw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=PW+ga58q4T1GO4Kq1Fd0BmEznrnJ+b7jHg2LKuh/ZIk=; b=jrpO/35gzvhW0/luXdq5jPhwa8sIvCoIFhxM/6B/MQz815GkJ6NK2mlu6GBY9CePno CErCJHmGTFSP3BWr2GIaiHzbc+D7F0Kbf4dCmnWNeEVKGr2XFWM6UQxZrY0MP9XtAwdq tMxloWLflreZbZD+AElP4v33QF07DmGRol1gTO6A9LFLlnIfXueD5LzkzHRuq4Bxe2Fl 8JgnI3rZd1GkHddA2u6yrTWgVLlgz4/yUW2+/z1ckDXuBMdaMzbSMnrX4U6GUt0IkXgr TiZqFsdEg1HaKPTMdNJeJZ+WK0p33gk99G/NtZoyefdR1L14zFkNioZG/cpcFFaFiYxe TGZA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=LZzn+FFr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=NONE dis=NONE) header.from=suse.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id m13si296016edi.144.2021.01.26.19.22.15; Tue, 26 Jan 2021 19:23:08 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.com header.s=susede1 header.b=LZzn+FFr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=NONE dis=NONE) header.from=suse.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727901AbhAZEry (ORCPT + 99 others); Mon, 25 Jan 2021 23:47:54 -0500 Received: from mx2.suse.de ([195.135.220.15]:36322 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725994AbhAYJQK (ORCPT ); Mon, 25 Jan 2021 04:16:10 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1611565115; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=PW+ga58q4T1GO4Kq1Fd0BmEznrnJ+b7jHg2LKuh/ZIk=; b=LZzn+FFrwFdD8mekB41YvBeP54Xq1CDTztG27hogvGB8nNrLZq82n6Ra9f+LSpfvXZQvxW Bj461hJ6OsIt+XIMNoRfgq4r6xf93o7vCw3TC8by8FREEMWG9iowwlBXIMziJxrcHNr6Sr jsEDbHZBEXSYM7fzVZE49EaVHk4Zd0A= Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 240ABADD6; Mon, 25 Jan 2021 08:58:35 +0000 (UTC) Date: Mon, 25 Jan 2021 09:58:33 +0100 From: Michal Hocko To: Johannes Weiner Cc: Andrew Morton , Roman Gushchin , Shakeel Butt , Michal =?iso-8859-1?Q?Koutn=FD?= , Tejun Heo , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" Message-ID: <20210125085833.GA827@dhcp22.suse.cz> References: <20210122184341.292461-1-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210122184341.292461-1-hannes@cmpxchg.org> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 22-01-21 13:43:41, Johannes Weiner wrote: > This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it > can cause writers to memory.high to get stuck in the kernel forever, > performing page reclaim and consuming excessive amounts of CPU cycles. > > Before the patch, a write to memory.high would first put the new limit > in place for the workload, and then reclaim the requested delta. After > the patch, the kernel tries to reclaim the delta before putting the > new limit into place, in order to not overwhelm the workload with a > sudden, large excess over the limit. However, if reclaim is actively > racing with new allocations from the uncurbed workload, it can keep > the write() working inside the kernel indefinitely. > > This is causing problems in Facebook production. A privileged > system-level daemon that adjusts memory.high for various workloads > running on a host can get unexpectedly stuck in the kernel and > essentially turn into a sort of involuntary kswapd for one of the > workloads. We've observed that daemon busy-spin in a write() for > minutes at a time, neglecting its other duties on the system, and > expending privileged system resources on behalf of a workload. > > To remedy this, we have first considered changing the reclaim logic to > break out after a couple of loops - whether the workload has converged > to the new limit or not - and bound the write() call this way. > However, the root cause that inspired the sequence change in the first > place has been fixed through other means, and so a revert back to the > proven limit-setting sequence, also used by memory.max, is preferable. > > The sequence was changed to avoid extreme latencies in the workload > when the limit was lowered: the sudden, large excess created by the > limit lowering would erroneously trigger the penalty sleeping code > that is meant to throttle excessive growth from below. Allocating > threads could end up sleeping long after the write() had already > reclaimed the delta for which they were being punished. > > However, erroneous throttling also caused problems in other scenarios > at around the same time. This resulted in commit b3ff92916af3 ("mm, > memcg: reclaim more aggressively before high allocator throttling"), > included in the same release as the offending commit. When allocating > threads now encounter large excess caused by a racing write() to > memory.high, instead of entering punitive sleeps, they will simply be > tasked with helping reclaim down the excess, and will be held no > longer than it takes to accomplish that. This is in line with regular > limit enforcement - i.e. if the workload allocates up against or over > an otherwise unchanged limit from below. > > With the patch breaking userspace, and the root cause addressed by > other means already, revert it again. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: # 5.8+ > Reported-by: Tejun Heo > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Thanks for extending the changelog to describe the scenario in a more detail. > --- > mm/memcontrol.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > Andrew, this is a replacement for > mm-memcontrol-prevent-starvation-when-writing-memoryhigh.patch > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 605f671203ef..a8611a62bafd 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6273,6 +6273,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > if (err) > return err; > > + page_counter_set_high(&memcg->memory, high); > + > for (;;) { > unsigned long nr_pages = page_counter_read(&memcg->memory); > unsigned long reclaimed; > @@ -6296,10 +6298,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > break; > } > > - page_counter_set_high(&memcg->memory, high); > - > memcg_wb_domain_size_changed(memcg); > - > return nbytes; > } > > -- > 2.30.0 > -- Michal Hocko SUSE Labs