Received: by 2002:a05:6902:102b:0:0:0:0 with SMTP id x11csp1151422ybt; Tue, 7 Jul 2020 08:53:42 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzl85cNBzeieH5X9/W1X0U72ua81KXJgUy8BTz97/OjtiNUqL/kY+t8tioC8fACDr1MBjnr X-Received: by 2002:a17:907:20ba:: with SMTP id pw26mr46904000ejb.425.1594137222131; Tue, 07 Jul 2020 08:53:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1594137222; cv=none; d=google.com; s=arc-20160816; b=A2/0ulaK5aGTGnCrXvwhy8tAFk7QXatSk0hpqkpufs7d4HNSrCC1hd/i7OoV5T2a+7 GU0ZitoFhOTvUZYSN3poNBwnr0ps2NypvDZ4WiI8yuKkedzSDkyuGn77Hb9dr6gUjTKt RTP9VMFM+9J1Jq4S8JvAAS8hdi/0Jb/XcfFsl2Er90Zn451CVQLmfVExqGeQ7L96zMri pHtNeuUUqf03KaB80RUuxc6AYwpyaLleeNZyFOOcgUdONmaarFw36yXJo9BVfUmN7h9z 8cDzA/y0g6Z4E66qdXItEk86zjcALlq7Z4X/Hc2PMjKOxIpYQZhlVhjXKQythe7eMA1M 4CXA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=y/kOBwO7OLhAG5/zoDioEbhDbIra0ifUwtyOcu+taVE=; b=FVj+Ev+k+Vk2xPODtJ/gJEQkKU70dpw6n3TUWDQGlJx3EDWswXucP0N1P//qvi0F4K CI7wN6O+xjZdWjbvq7Q+ddQjNmj4DnoJXxMsvBtFfZkO0SxTWyBgTezYFFGslzThHX92 L8Z2HQvBUQxlwVea3ETQgvNQUcDSWmR7AtdXnuadvb9lUIHk8L742Xj7ERSKCGCe++Xl QjKuZR5etaa/swsYOvLMRzB6PewTilI6Lu4JvQAEY/2QdwFREZDwKcTMq1hDHzdGoZS+ X4sT5jr2ILKJQf016dWaOjn5E29obA7HWI3jHtvTZknfHEg/OFJSjS2LwOKqPNO6YTKC pxaQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=L3SIjbQ2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id t25si6699830edr.537.2020.07.07.08.53.19; Tue, 07 Jul 2020 08:53:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=L3SIjbQ2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728716AbgGGPvn (ORCPT + 99 others); Tue, 7 Jul 2020 11:51:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37024 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728029AbgGGPvm (ORCPT ); Tue, 7 Jul 2020 11:51:42 -0400 Received: from mail-lj1-x243.google.com (mail-lj1-x243.google.com [IPv6:2a00:1450:4864:20::243]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 70A07C061755 for ; Tue, 7 Jul 2020 08:51:42 -0700 (PDT) Received: by mail-lj1-x243.google.com with SMTP id t25so45891264lji.12 for ; Tue, 07 Jul 2020 08:51:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=y/kOBwO7OLhAG5/zoDioEbhDbIra0ifUwtyOcu+taVE=; b=L3SIjbQ2aGUMKNVOsgYJSH0RyGv1yqE3EspINaDxV4Qp8l0EecCKFOqwnxga/y+h8Y 2X0KH3FZOV8/IlyRauMigZhGEIdAaQWWZD+yMxaZDDJrf0UKEJCKQTztAwNoUxcOcoSr 8MLugsOwfz0qFWdozatId6q9p//k2M4rt6i+b82/4DZD+KliP8EFP5t2gEeqrOJX9hZL 5U+eSXspaw2k59P+sBud1CuBDW7zu5vUC6+vNCHxtZRxVveGPJaXdmkBhF1NZ9SqydwI elgQJT7e7qGxHtzdpoLU8GvAZlD/n7SQevJljznZY23clvyeaem0Qz6xKoorVTHKsPlf +xuA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=y/kOBwO7OLhAG5/zoDioEbhDbIra0ifUwtyOcu+taVE=; b=gG9bs/g8mAFbtixU9fLq2NqHIcgRIJp3K0ON+7le6pcvC/Rw/a2cL52mJ/aqqJo+Qq KTs3gszr7kH//Z4Yiz3e9Einx/8SnZeGco45fp2eEwY3tryX8IxwDo6IHDjqPSKnc5OW e6HI8yiux9LbZjKTVNT4xo8qfkimZ65RPlFiZksJZbZxZlr06mKwmsJ458a8EpLopyxA BZKxBjYxMZHw5/A9wo2+RIRc/QsBM3eZZKqhCNtK4lZa0FRB8HAbaZoBquRlmW7KGVLh ct0L8LI5w5LNjtnZ9PmypwBHYdaOteG09KM50A2mMAk/CiF1o52RlDJjyAGaiP8pEEqq sygg== X-Gm-Message-State: AOAM531+iA1fraUTcZNhhEhtke9BNVYaC+O1KSEbwOuMpjFpkPYv0fw+ pVCKRdaiJdL54DZaGI2mP95gKhEz9s2oIMS5p+o9xg== X-Received: by 2002:a2e:8e36:: with SMTP id r22mr31066177ljk.77.1594137100635; Tue, 07 Jul 2020 08:51:40 -0700 (PDT) MIME-Version: 1.0 References: <20200702152222.2630760-1-shakeelb@google.com> <20200703063548.GM18446@dhcp22.suse.cz> <20200703155021.GB114903@carbon.dhcp.thefacebook.com> <20200706213404.GA152560@carbon.lan> In-Reply-To: <20200706213404.GA152560@carbon.lan> From: Shakeel Butt Date: Tue, 7 Jul 2020 08:51:29 -0700 Message-ID: Subject: Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface To: Roman Gushchin Cc: Michal Hocko , Johannes Weiner , Yang Shi , David Rientjes , Greg Thelen , Andrew Morton , Linux MM , LKML , Cgroups Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 6, 2020 at 2:38 PM Roman Gushchin wrote: > > On Fri, Jul 03, 2020 at 09:27:19AM -0700, Shakeel Butt wrote: > > On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin wrote: > > > > > > On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote: > > > > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko wrote: > > > > > > > > > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote: > > > > > [...] > > > > > > Interface options: > > > > > > ------------------ > > > > > > > > > > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim' > > > > > > > > > > > > + simple > > > > > > + can be extended to target specific type of memory (anon, file, kmem). > > > > > > - most probably restricted to cgroup v2. > > > > > > > > > > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd > > > > > > > > > > > > + more general and applicable to other FSes (actually we are using > > > > > > something similar for tmpfs). > > > > > > + can be extended in future to just age the LRUs instead of reclaim or > > > > > > some new use cases. > > > > > > > > > > Could you explain why memory.high as an interface to trigger pro-active > > > > > memory reclaim is not sufficient. Also memory.low limit to protect > > > > > latency sensitve workloads? > > > > > > I initially liked the proposal, but after some thoughts I've realized > > > that I don't know a good use case where memory.high is less useful. > > > Shakeel, what's the typical use case you thinking of? > > > Who and how will use the new interface? > > > > > > > > > > > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg > > > > but note that it can also introduce stalls in the application running > > > > in that memcg. Let's suppose the memory.current of a memcg is 100MiB > > > > and we want to reclaim 20MiB from it, we can set the memory.high to > > > > 80MiB but any allocation attempt from the application running in that > > > > memcg can get stalled/throttled. I want the functionality of the > > > > reclaim without potential stalls. > > > > > > But reclaiming some pagecache/swapping out anon pages can always > > > generate some stalls caused by pagefaults, no? > > > > > > > Thanks for looking into the proposal. Let me answer both of your > > questions together. I have added the two use-cases but let me explain > > the proactive reclaim a bit more as we actually use that in our > > production. > > > > We have defined tolerable refault rates for the applications based on > > their type (latency sensitive or not). Proactive reclaim is triggered > > in the application based on their current refault rates and usage. If > > the current refault rate exceeds the tolerable refault rate then > > stop/slowdown the proactive reclaim. > > > > For the second question, yes, each individual refault can induce the > > stall as well but we have more control on that stall as compared to > > stalls due to reclaim. For us almost all the reclaimable memory is > > anon and we use compression based swap, so, the cost of each refault > > is fixed and a couple of microseconds. > > > > I think the next question is what about the refaults from disk or > > source with highly variable cost. Usually the latency sensitive > > applications remove such uncertainty by mlocking the pages backed by > > such backends (e.g. mlocking the executable) or at least that is the > > case for us. > > Got it. > > It feels like you're suggesting something similar to memory.high with > something similar to a different gfp flags. In other words, the > difference is only which pages can be reclaimed and which not. I don't > have a definitive answer here, but I wonder if we can somehow > generalize the existing interface? E.g. if the problem is with artificially > induced delays, we can have a config option/sysctl/sysfs knob/something else > which would disable it. Otherwise we risk ending up with many different kinds > of soft memory limits. > It is possible to achieve this functionality with memory.high with some config/sysctls e.t.c as you suggested but it can bloat and complicate the memory.high interface. I understand your concern with different kinds of soft memory limits but I see this as a simple interface (i.e. just trigger reclaim) that gives users the flexibility to define and (soft) enforce their own virtual limits on their jobs. More specifically this interface allows reclaiming from a job to keep the usage below some virtual limit which can correspond to some user defined metric. In my proactive reclaim example, the user defined metric is refault rates. Keep the usage of the job at a level where the refault rates are tolerable.