Date: Thu, 27 Feb 2020 07:50:11 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yang Shi
Cc: Shakeel Butt, Andrew Morton, Michal Hocko, Tejun Heo, Roman Gushchin,
    Linux MM, Cgroups, LKML, Kernel Team
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
Message-ID: <20200227125011.GB39625@cmpxchg.org>
References: <20200219181219.54356-1-hannes@cmpxchg.org>
    <20200226222642.GB30206@cmpxchg.org>
    <2be6ac8d-e290-0a85-5cfa-084968a7fe36@linux.alibaba.com>
In-Reply-To: <2be6ac8d-e290-0a85-5cfa-084968a7fe36@linux.alibaba.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 26, 2020 at 04:12:23PM -0800, Yang Shi wrote:
> On 2/26/20 2:26 PM, Johannes Weiner wrote:
> > So we should be able to fully resolve this problem inside the kernel,
> > without going through userspace, by accounting CPU cycles used by the
> > background reclaim worker to the cgroup that is being reclaimed.
>
> Actually I'm wondering if we really need to account CPU cycles used by
> the background reclaimer or not.
> For our usecase (this may not be general), the purpose of the
> background reclaimer is to avoid latency-sensitive workloads getting
> into direct reclaim (avoid the stall from direct reclaim). In fact it
> just "steals" CPU cycles from lower-priority or best-effort workloads
> to guarantee latency-sensitive workloads behave well. If the "stolen"
> CPU cycles are accounted, it means the latency-sensitive workloads
> would get throttled from somewhere else later, i.e. by CPU share.

That doesn't sound right. "Not accounting" isn't an option. If we don't
annotate the reclaim work, the cycles will go to the root cgroup. That
means that the latency-sensitive workload can steal cycles from the
low-pri job, yes, but also that the low-pri job can steal from the
high-pri one.

Say your two workloads on the system are a web server and a compile
job, and the CPU shares are allocated 80:20. The compile job will cause
most of the reclaim. If the reclaim cycles can escape to the root
cgroup, the compile job will effectively consume more than 20 shares,
and the high-pri job will get less than 80.

But let's say we executed all background reclaim in the low-pri group,
to allow the high-pri group to steal cycles from the low-pri group, but
not the other way round. Again an 80:20 CPU distribution. Now the
reclaim work competes with the compile job over a very small share of
CPU. The reclaim work that the high-priority job is relying on is
running at low priority. That means that the compile job can cause the
web server to go into direct reclaim. That's a priority inversion.

> We definitely don't want the background reclaimer to eat all CPU
> cycles. So the whole background reclaimer is opt-in stuff. The
> higher-level cluster management and administration components make
> sure the cgroups are set up correctly, i.e. enabled for specific
> cgroups, watermarks set up properly, etc.
>
> Of course, this may not be universal and may be just fine for some
> specific configurations or usecases.
Yes, I suspect it works for you because you set up watermarks on the
high-pri job but not on the background jobs, thus making sure only
high-pri jobs can steal cycles from the rest of the system.

However, we do want low-pri jobs to have background reclaim as well. A
compile job may not be latency-sensitive, but it still benefits from a
throughput POV when the reclaim work runs concurrently. And if there
are idle CPU cycles available that the high-pri work isn't using right
now, it would be wasteful not to make use of them.

So yes, I can see how such an accounting loophole can be handy. By
letting reclaim CPU cycles sneak out of containment, you can kind of
use it for high-pri jobs. Or rather *one* high-pri job, because more
than one becomes unsafe again, where one can steal a large number of
cycles from others at the same priority.

But it's more universally useful to properly account CPU cycles that
are actually consumed by a cgroup, to that cgroup, and then reflect the
additional CPU explicitly in the CPU weight configuration. That way you
can safely have background reclaim on jobs of all priorities.
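To put numbers on the 80:20 example above, here is a toy model of the
accounting question. This is plain arithmetic for illustration, not
kernel code; the function and its parameters are my own invention and
the "comes off the top" treatment of root-cgroup cycles is a
deliberately simplified assumption about scheduler behavior:

```python
def cpu_split(total, w_hi, w_lo, reclaim_lo, charge_reclaim):
    """Return (high-pri, low-pri) CPU consumption.

    reclaim_lo: cycles spent reclaiming on the low-pri job's behalf.
    charge_reclaim: whether those cycles are billed to the low-pri job.
    """
    if charge_reclaim:
        # Reclaim is billed to the low-pri job's own budget, so the
        # configured weights hold exactly.
        hi = total * w_hi / (w_hi + w_lo)
        lo = total * w_lo / (w_hi + w_lo)
    else:
        # Reclaim escapes to the root cgroup: in this simplified model
        # it comes off the top, and only the rest is split by weight.
        rest = total - reclaim_lo
        hi = rest * w_hi / (w_hi + w_lo)
        lo = rest * w_lo / (w_hi + w_lo) + reclaim_lo
    return hi, lo

# 100 CPU units, 80:20 weights, 10 units of reclaim caused by the
# low-pri compile job:
print(cpu_split(100, 80, 20, 10, charge_reclaim=False))  # (72.0, 28.0)
print(cpu_split(100, 80, 20, 10, charge_reclaim=True))   # (80.0, 20.0)
```

With the reclaim cycles unaccounted, the compile job ends up consuming
28 units instead of its configured 20, all of it taken out of the web
server's share.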