Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
To: Shakeel Butt, Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Tejun Heo, Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team
References: <20200219181219.54356-1-hannes@cmpxchg.org>
From: Yang Shi
Message-ID: <1bfd6ea4-f012-5778-64c6-36731e69b5ba@linux.alibaba.com>
Date: Wed, 26 Feb 2020 15:59:23 -0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2/26/20 12:25 PM, Shakeel Butt wrote:
> On Wed, Feb 19, 2020 at 10:12 AM Johannes Weiner wrote:
>> We have received regression reports from users whose workloads moved
>> into containers and subsequently encountered new latencies. For some
>> users these were a nuisance, but for some it meant missing their SLA
>> response times. We tracked those delays down to cgroup limits, which
>> inject direct reclaim stalls into the workload where previously all
>> reclaim was handled by kswapd.
>>
>> This patch adds asynchronous reclaim to the memory.high cgroup limit
>> while keeping direct reclaim as a fallback. In our testing, this
>> eliminated all direct reclaim from the affected workload.
>>
>> memory.high has a grace buffer of about 4% between when it becomes
>> exceeded and when allocating threads get throttled. We can use the
>> same buffer for the async reclaimer to operate in. If the worker
>> cannot keep up and the grace buffer is exceeded, allocating threads
>> will fall back to direct reclaim before getting throttled.
>>
>> For irq-context, there's already async memory.high enforcement. Re-use
>> that work item for all allocating contexts, but switch it to the
>> unbound workqueue so reclaim work doesn't compete with the workload.
>> The work item is per cgroup, which means the workqueue infrastructure
>> will create at maximum one worker thread per reclaiming cgroup.
>>
>> Signed-off-by: Johannes Weiner
>> ---
>>  mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
>>  mm/vmscan.c     | 10 +++++++--
> This reminds me of the per-memcg kswapd proposal from LSFMM 2018
> (https://lwn.net/Articles/753162/).

Thanks for bringing this up.

>
> If I understand this correctly, the use-case is that the job instead
> of direct reclaiming (potentially in latency sensitive tasks), prefers
> a background non-latency sensitive task to do the reclaim.
> I am wondering if we can use the memory.high notification along with a new
> memcg interface (like memory.try_to_free_pages) to implement a user
> space background reclaimer. That would resolve the cpu accounting
> concerns as the user space background reclaimer can share the cpu cost
> with the task.

Actually, I'm interested in how you would implement the userspace reclaimer.
Via a new syscall or a variant of an existing syscall?

>
> One concern with this approach will be that the memory.high
> notification is too late and the latency sensitive task has faced the
> stall. We can either introduce a threshold notification or another
> notification only limit like memory.near_high which can be set based
> on the job's rate of allocations and when the usage hits this limit
> just notify the user space.

Yes, the sole purpose of the background reclaimer is to avoid direct reclaim
for latency-sensitive workloads. Our in-house implementation has a high
watermark and a low watermark, both of which are lower than the limit or
high. The background reclaimer is triggered once available memory reaches
the low watermark, and it keeps reclaiming until available memory reaches
the high watermark. It is much the same as how the global watermarks work.

>
> Shakeel
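
To make the two-watermark behaviour described above concrete, here is a rough sketch of the hysteresis logic as a small Python simulation. It is not the in-house implementation and touches no kernel interface. The names WatermarkReclaimer, simulate and reclaim_step are made up for illustration, and it assumes "available memory" means the headroom left under the limit, so reclaim starts when that headroom falls to the low watermark and stops once it is back at the high watermark.

```python
from dataclasses import dataclass


@dataclass
class WatermarkReclaimer:
    """Hypothetical hysteresis controller (illustration only).

    Assumption: `available` is headroom under the cgroup limit, so a
    smaller value means more memory pressure.
    """
    low: int             # start reclaiming when available falls to this level
    high: int            # stop reclaiming once available is back at this level
    active: bool = False

    def __post_init__(self):
        if not 0 <= self.low < self.high:
            raise ValueError("need 0 <= low < high")

    def update(self, available: int) -> bool:
        """Return True if background reclaim should run for this sample."""
        if available <= self.low:
            self.active = True
        elif available >= self.high:
            self.active = False
        # Between the watermarks the previous state is kept (hysteresis).
        return self.active


def simulate(reclaimer, available, demands, reclaim_step):
    """Apply per-tick allocation demands; free `reclaim_step` when active.

    Returns a list of (available, reclaiming) pairs, one per tick.
    """
    trace = []
    for demand in demands:
        available -= demand
        reclaiming = reclaimer.update(available)
        if reclaiming:
            available += reclaim_step
        trace.append((available, reclaiming))
    return trace
```

Once started, the reclaimer keeps running between the two watermarks instead of flapping on and off at a single threshold, which is the property borrowed from the global watermarks.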