From: Shakeel Butt
Date: Wed, 26 Feb 2020 15:36:50 -0800
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
To: Johannes Weiner
Cc: Yang Shi, Andrew Morton, Michal Hocko, Tejun Heo, Roman Gushchin,
 Linux MM, Cgroups, LKML, Kernel Team
In-Reply-To: <20200226222642.GB30206@cmpxchg.org>
References: <20200219181219.54356-1-hannes@cmpxchg.org> <20200226222642.GB30206@cmpxchg.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 26, 2020 at 2:26 PM Johannes Weiner wrote:
>
> On Wed, Feb 26, 2020 at 12:25:33PM -0800, Shakeel Butt wrote:
> > On Wed, Feb 19, 2020 at 10:12 AM Johannes Weiner wrote:
> > >
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for some it meant missing their SLA
> > > response times.
> > > We tracked those delays down to cgroup limits, which
> > > inject direct reclaim stalls into the workload where previously all
> > > reclaim was handled by kswapd.
> > >
> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> > >
> > > memory.high has a grace buffer of about 4% between when it becomes
> > > exceeded and when allocating threads get throttled. We can use the
> > > same buffer for the async reclaimer to operate in. If the worker
> > > cannot keep up and the grace buffer is exceeded, allocating threads
> > > will fall back to direct reclaim before getting throttled.
> > >
> > > For irq-context, there's already async memory.high enforcement. Re-use
> > > that work item for all allocating contexts, but switch it to the
> > > unbound workqueue so reclaim work doesn't compete with the workload.
> > > The work item is per cgroup, which means the workqueue infrastructure
> > > will create at maximum one worker thread per reclaiming cgroup.
> > >
> > > Signed-off-by: Johannes Weiner
> > > ---
> > >  mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
> > >  mm/vmscan.c     | 10 +++++++--
> >
> > This reminds me of the per-memcg kswapd proposal from LSFMM 2018
> > (https://lwn.net/Articles/753162/).
>
> Ah yes, I remember those discussions. :)
>
> One thing that has changed since we tried to implement this last was
> the workqueue concurrency code. We don't have to worry about a single
> thread or fixed threads per cgroup, because the workqueue code has
> improved significantly to handle concurrency demands, and having one
> work item per cgroup makes sure we have anywhere between 0 threads and
> one thread per cgroup doing this reclaim work, completely on-demand.
>
> Also, with cgroup2, memory and cpu always have overlapping control
> domains, so the question who to account the work to becomes a much
> easier one to answer.
>
> > If I understand this correctly, the use-case is that the job instead
> > of direct reclaiming (potentially in latency sensitive tasks), prefers
> > a background non-latency sensitive task to do the reclaim. I am
> > wondering if we can use the memory.high notification along with a new
> > memcg interface (like memory.try_to_free_pages) to implement a user
> > space background reclaimer. That would resolve the cpu accounting
> > concerns as the user space background reclaimer can share the cpu cost
> > with the task.
>
> The idea is not necessarily that the background reclaimer is lower
> priority work, but that it can execute in parallel on a separate CPU
> instead of being forced into the execution stream of the main work.
>
> So we should be able to fully resolve this problem inside the kernel,
> without going through userspace, by accounting CPU cycles used by the
> background reclaim worker to the cgroup that is being reclaimed.
>
> > One concern with this approach will be that the memory.high
> > notification is too late and the latency sensitive task has faced the
> > stall. We can either introduce a threshold notification or another
> > notification only limit like memory.near_high which can be set based
> > on the job's rate of allocations and when the usage hits this limit
> > just notify the user space.
>
> Yeah, I think it would be a pretty drastic expansion of the memory
> controller's interface.
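
For context, my reading of the in-kernel mechanism described above is
roughly the work-item pattern below. This is only a simplified sketch,
not the actual patch; the field and function names are approximate.

/* Sketch only: per-memcg work item queued on the unbound workqueue. */
static void memcg_high_work_func(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						high_work);
	unsigned long usage = page_counter_read(&memcg->memory);
	unsigned long high = READ_ONCE(memcg->high);

	if (usage > high)
		try_to_free_mem_cgroup_pages(memcg, usage - high,
					     GFP_KERNEL, true);
}

/*
 * Called from the charge path once usage crosses memory.high; the work
 * item is INIT_WORK()ed when the memcg is allocated. system_unbound_wq
 * keeps the reclaim worker off the workload's CPUs, and one work item
 * per cgroup means at most one concurrent reclaim worker per cgroup.
 */
static void memcg_schedule_async_high_reclaim(struct mem_cgroup *memcg)
{
	queue_work(system_unbound_wq, &memcg->high_work);
}
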

I understand the concern about expanding the interface and resolving the
problem within the kernel, but there are genuine use-cases that these
interfaces would serve. We have a distributed caching service which keeps
its caches in anon pages and tracks their hotness. On near-stall/oom/
memory-pressure it is preferable to drop a cold cache known to the
application, in user space, rather than let the kernel swap it out and
take a stall on fault: the caches are replicated, so other nodes can
serve them. For such workloads kernel reclaim does not help. What would
be your recommendation for such a workload? I can envision memory.high +
PSI notification (roughly the kind of trigger loop sketched below), but
note that these are based on stalls, which the application wants to
avoid.
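
For concreteness, this is the sort of user-space loop I mean. The cgroup
path, the trigger thresholds, and the drop_cold_caches() helper are made
up for illustration, and by the time the trigger fires some stall has
already been incurred:

/* Rough sketch of a PSI-trigger-driven cache dropper (error handling trimmed). */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

extern void drop_cold_caches(void);	/* hypothetical, application-specific */

int main(void)
{
	/* Wake up if tasks stall for >70ms (cumulative) in any 1s window. */
	const char trig[] = "some 70000 1000000";
	int fd = open("/sys/fs/cgroup/job/memory.pressure",
		      O_RDWR | O_NONBLOCK);

	if (fd < 0 || write(fd, trig, sizeof(trig)) < 0) {
		perror("psi trigger");
		return 1;
	}

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };

		if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI)) {
			/*
			 * Drop cold, replicated cache entries (e.g. via
			 * madvise(MADV_DONTNEED) on their anon ranges)
			 * before the kernel starts swapping them out.
			 */
			drop_cold_caches();
		}
	}
}

Shakeel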