Date: Wed, 16 Jan 2019 08:02:18 +0100
From: Michal Hocko
To: Shakeel Butt
Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Cgroups, Linux MM, LKML
Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work
Message-ID: <20190116070218.GF24149@dhcp22.suse.cz>
References: <20190110174432.82064-1-shakeelb@google.com>
 <20190111205948.GA4591@cmpxchg.org>
 <20190113183402.GD1578@dhcp22.suse.cz>
 <20190115072551.GO21345@dhcp22.suse.cz>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 15-01-19 11:38:23, Shakeel Butt wrote:
> On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko wrote:
> >
> > On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko wrote:
> > > >
> > > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > > Hi Johannes,
> > > > >
> > > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner wrote:
> > > > > >
> > > > > > Hi Shakeel,
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > > > > return-to-userland. However it is assumed that the memcg is the current
> > > > > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > > > > page charged to a remote memcg, the current process can trigger reclaim
> > > > > > > on a remote memcg. So, scheduling reclaim on return-to-userland for
> > > > > > > remote memcgs will ignore the high reclaim altogether. So, record the
> > > > > > > memcg needing high reclaim and trigger high reclaim for that memcg on
> > > > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > > > reclaim and the recorded memcg is not a descendant of the memcg needing
> > > > > > > high reclaim, punt the high reclaim to the work queue.
> > > > > >
> > > > > > The idea behind remote charging is that the thread allocating the
> > > > > > memory is not responsible for that memory, but a different cgroup
> > > > > > is. Why would the same thread then have to work off any high excess
> > > > > > this could produce in that unrelated group?
> > > > > >
> > > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > > memory use - now everybody sending notification events from outside
> > > > > > that listener's group would get throttled on a cgroup over which it
> > > > > > has no control. That sounds like a recipe for priority inversions.
> > > > > >
> > > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > > remote charges - to the workqueue.
> > > > >
> > > > > This is what v1 of this patch was doing but Michal suggested to do
> > > > > what this version is doing. Michal's argument was that the current is
> > > > > already charging and maybe reclaiming a remote memcg, so why not do
> > > > > the high excess reclaim as well.
> > > >
> > > > Johannes has a good point about the priority inversion problems which I
> > > > haven't thought about.
> > > >
> > > > > Personally I don't have any strong opinion either way. What I actually
> > > > > wanted was to punt this high reclaim to some process in that remote
> > > > > memcg. However I didn't explore much in that direction, not being sure
> > > > > whether that complexity is worth it. Maybe I should at least explore it,
> > > > > so we can compare the solutions. What do you think?
> > > >
> > > > My question would be whether we really care all that much. Do we know of
> > > > workloads which would generate a large high limit excess?
> > > >
> > > The current semantics of memory.high is that it can be breached under
> > > extreme conditions. However, in any workload where memory.high is used
> > > and a lot of remote memcg charging happens (the inotify/dnotify example
> > > given by Johannes, or swapping in a tmpfs file or shared memory region),
> > > the memory.high breach will become common.
> >
> > This is exactly what I am asking about. Is this something that can
> > happen easily? Remote charges on their own should be rare, no?
> >
>
> At the moment, for kmem we can do remote charging for fanotify,
> inotify and buffer_head, and for anon pages we can do remote charging
> on swap in. Now, based on the workload's cgroup setup, the remote
> charging can be very frequent or rare.
>
> At Google, remote charging is very frequent but since we are still on
> cgroup-v1 and do not use memory.high, the issue this patch is fixing
> is not observed. However, for the adoption of cgroup-v2, this fix is
> needed.

Adding some numbers into the changelog would be really valuable to judge
the urgency and the scale of the problem.

If we are going via a kworker then it is also important to evaluate what
kind of effect on the system this has. How big of an excess can we get?
Why don't those memcgs resolve the excess by themselves on the first
direct charge? Is it possible that kworkers simply swamp the system with
many parallel memcgs with remote charges? In other words, we need a
deeper analysis of the problem and the solution.
-- 
Michal Hocko
SUSE Labs
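
[Editor's note: the sketch below is illustrative only and is not the code from
the patch under discussion. It shows one way the record-or-punt scheme from the
quoted changelog could look. The task_struct field memcg_high_reclaim and the
helper name are assumptions introduced for this example; mem_cgroup_is_descendant(),
css_tryget_online() and the per-memcg high_work/high_work_func() are existing
upstream primitives.]

#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

/*
 * Hypothetical sketch, not the actual patch.  Called when @memcg has been
 * pushed over its memory.high by a (possibly remote) charge.
 */
static void memcg_note_over_high(struct mem_cgroup *memcg)
{
	/* Assumed field: which memcg return-to-userland reclaim should target. */
	struct mem_cgroup *recorded = current->memcg_high_reclaim;

	if (!recorded) {
		/* Nothing recorded yet: reclaim this memcg on return-to-userland. */
		if (css_tryget_online(&memcg->css))
			current->memcg_high_reclaim = memcg;
		return;
	}

	/*
	 * High reclaim walks up the hierarchy from the recorded memcg, so if
	 * the recorded memcg sits below this one, the return-to-userland pass
	 * will relieve this memcg's excess as well.
	 */
	if (mem_cgroup_is_descendant(recorded, memcg))
		return;

	/*
	 * An unrelated memcg is already recorded: punt this one's high
	 * reclaim to the workqueue instead of piling it onto the current
	 * task (which is what raises the priority inversion concern above).
	 */
	schedule_work(&memcg->high_work);
}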