Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
To: Greg Thelen, Michal Hocko
Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
From: Yang Shi <yang.shi@linux.alibaba.com>
Message-ID: <344793c0-f987-85a1-2a75-bc27083f52f4@linux.alibaba.com>
Date: Fri, 4 Jan 2019 13:41:50 -0800
References: <1546459533-36247-1-git-send-email-yang.shi@linux.alibaba.com>
 <20190103101215.GH31793@dhcp22.suse.cz> <20190103181329.GW31793@dhcp22.suse.cz>
 <6f43e926-3bb5-20d1-2e39-1d30bf7ad375@linux.alibaba.com>
 <20190103185333.GX31793@dhcp22.suse.cz> <20190103192339.GA31793@dhcp22.suse.cz>
 <88b4d986-0b3c-cbf0-65ad-95f3e8ccd870@linux.alibaba.com>
List-ID: linux-kernel@vger.kernel.org

On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi wrote:
>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it does not have to be strictly sequential. The above script
>>>>>> is just an example to illustrate the pattern. But sometimes it may hit
>>>>>> such a pattern due to the complicated cluster scheduling and container
>>>>>> scheduling in the production environment; for example, the creation
>>>>>> process might be scheduled to the same CPU which is doing force_empty.
>>>>>> I have to say I don't know too much about the internals of the
>>>>>> container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloading
>>>>> into the kernel. It is additional code and semantics to maintain.
>>>> Yes, it does introduce some additional code and semantics, but IMHO it is
>>>> quite simple and very straightforward, isn't it? Just utilize the
>>>> existing css offline worker. And that couple of lines of code does
>>>> improve throughput for some real use cases.
>>> I do not really care that it is a few LOC. It is more important that it is
>>> conflating force_empty into the offlining logic. There was a good reason
>>> to remove reparenting/emptying the memcg during the offline.
>>> Considering that you can offload force_empty from userspace trivially,
>>> I do not see any reason to implement it in the kernel.
>> Er, I may not have articulated it well in the earlier email: force_empty
>> cannot be offloaded from userspace *trivially*. IOW, the container
>> scheduler may unexpectedly overcommit something due to the stall of a
>> synchronous force_empty, which userspace can't predict before it
>> actually happens. The scheduler doesn't know how long force_empty would
>> take. If force_empty could be offloaded by the kernel, it would make the
>> scheduler's life much easier. This is not something userspace can do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seems like the time to offline could grow. I'm not sure if
> that's important.

Yes, it would grow. I'm not sure either, but it seems fine with our
workloads. The reclaim can be placed at the last step of offline, and it
can be interrupted by signals, e.g. a fatal signal in the current code.

> I assume that if we make force_empty an async side effect of rmdir then
> the user space scheduler would not be able to immediately assume the
> rmdir'd container's memory is available without subjecting a new
> container to direct reclaim. So it seems like user space would use a
> mechanism to wait for reclaim: either the existing sync force_empty or
> polling meminfo/etc waiting for free memory to appear.

Yes, that is the expected side effect; the memory reclaim would happen
shortly afterwards. In this series I keep the synchronous reclaim
behavior of force_empty by checking the written value. Michal suggested
a new knob to do the offline reclaim and keep force_empty intact. I
think which knob to use is at the user's discretion.

Thanks,
Yang

>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer to have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
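
[For illustration, a sketch of the userspace offloading Michal suggests:
run the slow, synchronous force_empty plus rmdir in a detached subshell so
the container scheduler can create the next container immediately. On a
real host CG would be a memcg directory such as
/sys/fs/cgroup/memory/<container>; here a temp dir stands in for it so the
sketch is runnable anywhere. This is a hypothetical script, not the actual
production workflow discussed in the thread.]

```shell
# Mock cgroup directory standing in for a real memcg mount.
CG=$(mktemp -d)
touch "$CG/memory.force_empty"

(
    # On a real memcg this write blocks until reclaim finishes,
    # which is the stall being discussed in the thread.
    echo 1 > "$CG/memory.force_empty"
    rm "$CG/memory.force_empty"
    rmdir "$CG"
) &
TEARDOWN_PID=$!

# ...the scheduler would create the next container here, unblocked...
wait "$TEARDOWN_PID"    # only this demo waits; a real scheduler need not
```

The caveat Yang raises still applies: the scheduler has no way to know how
long the detached reclaim will take before the memory actually comes back.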
>>
>> Thanks,
>> Yang
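
[Greg's alternative above, polling meminfo until enough free memory
appears, could be sketched roughly as follows; the threshold and poll
interval are made up for illustration.]

```shell
# Hypothetical requirement of the next container, in kB.
NEED_KB=1024

# Read the MemFree field (in kB) from /proc/meminfo.
free_kb() {
    awk '/^MemFree:/ { print $2 }' /proc/meminfo
}

# After an async rmdir, wait for background reclaim to free enough
# memory before placing the next container.
until [ "$(free_kb)" -ge "$NEED_KB" ]; do
    sleep 1     # reclaim is still running asynchronously
done
echo "enough memory free; safe to place the next container"
```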