Date: Fri, 4 Jan 2019 09:55:47 +0100
From: Michal Hocko
To: Yang Shi
Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
Message-ID: <20190104085547.GG31793@dhcp22.suse.cz>
References: <20190103101215.GH31793@dhcp22.suse.cz>
 <20190103181329.GW31793@dhcp22.suse.cz>
 <6f43e926-3bb5-20d1-2e39-1d30bf7ad375@linux.alibaba.com>
 <20190103185333.GX31793@dhcp22.suse.cz>
 <20190103192339.GA31793@dhcp22.suse.cz>
 <88b4d986-0b3c-cbf0-65ad-95f3e8ccd870@linux.alibaba.com>
 <20190103200111.GD31793@dhcp22.suse.cz>
 <146af1c6-4405-76c5-b253-c8fba11779bf@linux.alibaba.com>
In-Reply-To: <146af1c6-4405-76c5-b253-c8fba11779bf@linux.alibaba.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Thu 03-01-19 20:15:30, Yang Shi wrote:
> 
> 
> On 1/3/19 12:01 PM, Michal Hocko wrote:
> > On Thu 03-01-19 11:49:32, Yang Shi wrote:
> > > 
> > > On 1/3/19 11:23 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > > > [...]
> > > > > > > > Is there any reason for your scripts to be strictly sequential
> > > > > > > > here? In other words, why can't you offload those expensive
> > > > > > > > operations to a detached context in _userspace_?
> > > > > > > I would say it does not have to be strictly sequential. The above
> > > > > > > script is just an example to illustrate the pattern. But sometimes
> > > > > > > it may hit such a pattern due to the complicated cluster scheduling
> > > > > > > and container scheduling in the production environment; for
> > > > > > > example, the creation process might be scheduled to the same CPU
> > > > > > > which is doing force_empty. I have to say I don't know too much
> > > > > > > about the internals of the container scheduling.
> > > > > > In that case I do not see a strong reason to implement the
> > > > > > offloading into the kernel. It is additional code and semantics to
> > > > > > maintain.
> > > > > Yes, it does introduce some additional code and semantics, but IMHO it
> > > > > is quite simple and very straightforward, isn't it? Just utilize the
> > > > > existing css offline worker. And that couple of lines of code does
> > > > > improve some throughput issues for some real use cases.
> > > > I do not really care that it is a few LOC. It is more important that it
> > > > is conflating force_empty into the offlining logic. There was a good
> > > > reason to remove reparenting/emptying the memcg during the offline.
> > > > Considering that you can offload force_empty from userspace trivially,
> > > > I do not see any reason to implement it in the kernel.
> > > Er, I may not have articulated it well in the earlier email: force_empty
> > > cannot be offloaded from userspace *trivially*. IOW, the container
> > > scheduler may unexpectedly overcommit something due to the stall of the
> > > synchronous force_empty, which can't be figured out by userspace before
> > > it actually happens. The scheduler doesn't know how long force_empty
> > > would take. If force_empty could be offloaded by the kernel, it would
> > > make the scheduler's life much easier. This is not something userspace
> > > could do.
> > What exactly prevents
> > (
> > 	echo 1 > $memcg/force_empty
> > 	rmdir $memcg
> > ) &
> > 
> > so that this sequence doesn't really block anything?
> 
> We have "restarting the same name job" logic in our use case (I'm not
> quite sure why they do so). Basically, it means creating a memcg with the
> exact same name right after the old one is deleted, but possibly with a
> different limit or other settings. The creation has to wait until rmdir
> is done. Even though rmdir is done in the background like the above, the
> stall still exists, since rmdir is simply waiting for force_empty.

OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.
"Restarting the same job" sounds as if the memcg itself could be recycled as well. You are saying that the setting might change but if that is about limits then we should handle that just fine. Or what other kind of setting changes that wouldn't work properly? If the recycling is not possible then I would suggest to not reuse force_empty interface but add wipe_on_destruction or similar new knob which would enforce reclaim on offlining. It seems we have several people asking for something like that already. -- Michal Hocko SUSE Labs