Received: by 2002:a25:1506:0:0:0:0:0 with SMTP id 6csp886634ybv; Wed, 19 Feb 2020 11:16:41 -0800 (PST) X-Google-Smtp-Source: APXvYqzJlzezP/FTqGbBlP3gIrwjkiADRr0kx3ufHHmjAjo1ik05UY3ugwylMegYkLP7REdmYvGY X-Received: by 2002:a9d:68c8:: with SMTP id i8mr21325827oto.34.1582139800983; Wed, 19 Feb 2020 11:16:40 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1582139800; cv=none; d=google.com; s=arc-20160816; b=FC2rC1tteYKnUCd0BZlUoiswiNPx8fW7ixt8uzl/W1LNuMHGP79+Ye2TqftcwUbW2P Iv23MPTcDXZd+alCLt26mesDiJ62/TKUNvONVG07qjL1I9QV7JYmTsfU3mHNJ8ncZOIM 9b0ZbrjmNSTxt5WBu+UVcyqT6kiOHqlv9GhrxSTqL5X6NZBvB8b+s45kljg9B7gtpjl/ dtvoqNp62OgfGlu7mk646Y/v6PtmrPsGO+JG7WH925rSpYTwTxYsjbbyZQE/46H1l4/1 qYBGbKp5Mc9PDN8SIroVRhE+J/wRzCbi+fR3TLG99jnX4yvBt/n5eduGwv/1oOeQB8BV /N4g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=mZCRvJUEd0aeZMSM2KiDRl8M85Gpn4sTqFj3MDhSdPs=; b=UkjXvFdCT4G5oVpLKpzLHSnvVTs7YX7P+BYcZOeLNKm/9w2RhVCFaGijlj9x1k12bw fmYkiec1j/eP1w3k1Hd3g8aL1Q3LKaYqp6oHavXGTdlCLsfEbRHtgow8V+TGv/NwMmHX DwAzMHhcOsoNwAcwtziaALAnPQ1tCVGzvvV2kByWbhJObOYHo7Ci8axw3pBTMdFCd1TD c0g7wA/JFAALo56Rxcua8AGCiOq979fQv+CrDsOaaCaWdfYu3QwpuGQ7MlPtq/NB7oXr InflRYOI+8gwyb5OwlR5nqsCLkWLTV6q9rGTi5Qe6P5m934XS4HK2U1ckpZBra84+54t fdVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=ZICvLtBj; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id n24si342312otf.58.2020.02.19.11.16.29; Wed, 19 Feb 2020 11:16:40 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=ZICvLtBj; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726826AbgBSTQV (ORCPT + 99 others); Wed, 19 Feb 2020 14:16:21 -0500 Received: from mail-qk1-f196.google.com ([209.85.222.196]:42936 "EHLO mail-qk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726638AbgBSTQV (ORCPT ); Wed, 19 Feb 2020 14:16:21 -0500 Received: by mail-qk1-f196.google.com with SMTP id o28so1187815qkj.9 for ; Wed, 19 Feb 2020 11:16:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=mZCRvJUEd0aeZMSM2KiDRl8M85Gpn4sTqFj3MDhSdPs=; b=ZICvLtBj1Wsb4nDHlvOkhJNjp3TK0gM17yFa9MWXljGt5qzSyWPJyOsQ2sFarzK7fo BUjh3r6byLkmeK3UhKrBX+McO538JJpggApObSaFa7bl67ekDkGK29iwsMiq41gfPvMX 4UHHiWljqNjxb1OKPRXzIXh8ymNLCJGtes2Zcn12WpgyyO+0SUmRejsSms4mxW7hSzgU nKjNF019p2KE9fxskrQvcbZt5XkL0ZWaIaAcdZtK7/I/2r4+s7VR/gLAVBzZknuwqgPw s7kQMRNLecMpnp/YU05mEVI/pqk3i00TvTx+a7189OODFLRBJcRg3laDJqIQubVaU+LO 924w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=mZCRvJUEd0aeZMSM2KiDRl8M85Gpn4sTqFj3MDhSdPs=; b=trl9JyrTecRq23qnw5RJHn+Nfec5T+OhsdPJLbHUmcknnN1/DwFthdAyaozl4APtVj A4stOu9xdWHVmijyioWR2YNqGqp5J+xiTKn6GavWbUXmeXaYDc9x6YJF9Ss50wgPvGHu QxFMLGdKoKbJKCx2/jpewT0z+sbclUMFNyEpdesRzi7xh9khK52kePDKI092PKdnKymX 1baRQI6uFqO/JNF+Mn8o7ntWfUe7SZvSptv3NOW/+cVPNs6jYnfJY1VHEdEV1AdkZvMe k4AAudGta2/HDNkk8N+gB0ovEzRlXn4qqhqVwWiX836eu9Bmbbm0e5reWX0jU73I0+FT 5nIQ== X-Gm-Message-State: APjAAAVLHBPPVHyHEyWdZEK52vUQR/93leEoKqQgefn5A2SrFwuI59i7 fSvdFvgFUCsE/UTqTHolZUWtouGdSj0= X-Received: by 2002:ae9:e207:: with SMTP id c7mr24017223qkc.128.1582139780194; Wed, 19 Feb 2020 11:16:20 -0800 (PST) Received: from localhost ([2620:10d:c091:500::2:3bde]) by smtp.gmail.com with ESMTPSA id b84sm310728qkg.90.2020.02.19.11.16.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 19 Feb 2020 11:16:19 -0800 (PST) Date: Wed, 19 Feb 2020 14:16:18 -0500 From: Johannes Weiner To: Michal Hocko Cc: Andrew Morton , Tejun Heo , Roman Gushchin , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high Message-ID: <20200219191618.GB54486@cmpxchg.org> References: <20200219181219.54356-1-hannes@cmpxchg.org> <20200219183731.GC11847@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200219183731.GC11847@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote: > On Wed 19-02-20 13:12:19, Johannes Weiner wrote: > > We have received regression reports from users whose workloads moved > > into containers and subsequently encountered new latencies. For some > > users these were a nuisance, but for some it meant missing their SLA > > response times. We tracked those delays down to cgroup limits, which > > inject direct reclaim stalls into the workload where previously all > > reclaim was handled my kswapd. > > I am curious why is this unexpected when the high limit is explicitly > documented as a throttling mechanism. Memory.high is supposed to curb aggressive growth using throttling instead of OOM killing. However, if the workload has plenty of easily reclaimable memory and just needs to recycle a couple of cache pages to permit an allocation, there is no need to throttle the workload - just as there wouldn't be any need to trigger the OOM killer. So it's not unexpected, but it's unnecessarily heavy-handed: since memory.high allows some flexibility around the target size, we can move the routine reclaim activity (cache recycling) out of the main execution stream of the workload, just like we do with kswapd. If that cannot keep up, we can throttle and do direct reclaim. It doesn't change the memory.high semantics, but it allows exploiting the fact that we have SMP systems and can parallize the book keeping. > > This patch adds asynchronous reclaim to the memory.high cgroup limit > > while keeping direct reclaim as a fallback. In our testing, this > > eliminated all direct reclaim from the affected workload. > > Who is accounted for all the work? Unless I am missing something this > just gets hidden in the system activity and that might hurt the > isolation. I do see how moving the work to a different context is > desirable but this work has to be accounted properly when it is going to > become a normal mode of operation (rather than a rare exception like the > existing irq context handling). Yes, the plan is to account it to the cgroup on whose behalf we're doing the work. The problem is that we have a general lack of usable CPU control right now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208. For workloads that are contended on CPU, we cannot enable the CPU controller because the scheduling latencies are too high. And for workloads that aren't CPU contended, well, it doesn't really matter where the reclaim cycles are accounted to. Once we have the CPU controller up to speed, we can add annotations like these to account stretches of execution to specific cgroups. There just isn't much point to do it before we can actually enable CPU control on the real workloads where it would matter. [ This is generally work in process: for example, if you isolate workloads with memory.low, kswapd cpu time isn't accounted to the cgroup that causes it. Swap IO issued by kswapd isn't accounted to the group that is getting swapped. The memory consumed by struct cgroup itself, the percpu allocations for the vmstat arrays etc., which is sizable, are not accounted to the cgroup that creates subgroups, and so forth. ]