Date: Fri, 11 Mar 2022 17:00:51 +0100
From: Michal Koutný
To: Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Ivan Babrou,
    Frank Hofmann, Andrew Morton, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, Daniel Dao,
    stable@vger.kernel.org
Subject: Re: [PATCH] memcg: sync flush only if periodic flush is delayed
Message-ID: <20220311160051.GA24796@blackbody.suse.cz>
In-Reply-To: <20220304184040.1304781-1-shakeelb@google.com>
References: <20220304184040.1304781-1-shakeelb@google.com>
Hello.

TL;DR: rstats are slow but accurate on the reader side. To tackle the
performance regression, doing no synchronous flush at all seems simpler
than this patch.

So, I've made an overview for myself of the relevant changes that came
with the rstat introduction. The amount of work is:

- before
    R: O(1)
    W: O(tree_depth)
- after
    R: O(nr_cpus * nr_cgroups(subtree) * nr_counters)
    W: O(tree_depth)

That doesn't look like a positive change, especially on the reader side.
(In theory, the reader's work would be amortized, but as the original
report here shows, not all workloads dilute the flushes
sufficiently. [1])

The benefit this was traded for is greater accuracy; the possible error
is:

- before
    O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)    (1)
- after
    O(nr_cpus * MEMCG_CHARGE_BATCH)  // sync. flush
    or
    O(flush_period * max_cr)         // periodic flush only  (2)
                                     // max_cr is per-counter
                                     // max change rate

So we could argue that if the pre-rstat kernels did just fine with the
error (1), they would not be worse off with periodic flushing only, if
we can compare (1) and (2).

On Fri, Mar 04, 2022 at 06:40:40PM +0000, Shakeel Butt wrote:
> This patch fixes this regression by making the rstat flushing
> conditional in the performance critical codepaths. More specifically,
> the kernel relies on the async periodic rstat flusher to flush the stats
> and only if the periodic flusher is delayed by more than twice the
> amount of its normal time window then the kernel allows rstat flushing
> from the performance critical codepaths.

I'm not sure whether your patch attempts to solve the problem of (a) the
periodic flush getting stuck or (b) limiting the error on the refault
path. If it's (a), it should be tackled more systematically (a dedicated
wq?). If it's (b), why not rely on the periodic flush alone (self
answer: the comparison of (1) and (2) is workload dependent).

> Now the question: what are the side-effects of this change?
> The worst that can happen is the refault codepath will see 4sec old
> lruvec stats and may cause false (or missed) activations of the
> refaulted page which may under-or-overestimate the workingset size.
> Though that is not very concerning as the kernel can already miss or
> do false activations.

We can't argue about the effect of periodic-only flushing, so this newly
introduced factor would inherit that too. I find it superfluous.

Michal

[1] This is worth looking at in more detail. From the flush condition we
have

    cr * Δt = nr_cpus * MEMCG_CHARGE_BATCH

where Δt is the time between flushes and cr is the global change rate.
cr is composed of all updates together (it corresponds to stats_updates
in memcg_rstat_updated(); max_cr is the change rate per counter):

    cr = Σ cr_i <= nr_counters * max_cr

By combining these two we get the shortest time between flushes:

    cr * Δt <= nr_counters * max_cr * Δt
    nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
    Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)

We are interested in R_amort = flush_work / Δt, which is

    R_amort <= flush_work * nr_counters * max_cr
               / (nr_cpus * MEMCG_CHARGE_BATCH)

    R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters
                * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )

    R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr
                / MEMCG_CHARGE_BATCH )

The square looks interesting given there are already tens of counters.
(As the data from Ivan have shown, we can hardly restore the pre-rstat
performance on the read side even with a mere mod_delayed_work().)

This is what you partially solved with the introduction of
NR_MEMCG_EVENTS, but stats_updates was still the sum of all events, so
the flush might still have been triggered too frequently. Maybe a better
long-term approach would be splitting into accurate and approximate
counters and reflecting that in the error estimator stats_updates. Or
some other optimization of mem_cgroup_css_rstat_flush().
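P.S. For anyone who wants to play with the estimate in [1], here is a
small numeric sketch. All concrete numbers in it are illustrative
assumptions (including the MEMCG_CHARGE_BATCH value), not measurements;
the point is only the nr_counters^2 scaling of the amortized read cost:

```python
# Sanity check of the amortization estimate from [1].
# All parameter values are made-up for illustration, not measurements.

def min_flush_interval(nr_cpus, charge_batch, nr_counters, max_cr):
    """Shortest time between flushes:
    Δt >= nr_cpus * MEMCG_CHARGE_BATCH / (nr_counters * max_cr)."""
    return (nr_cpus * charge_batch) / (nr_counters * max_cr)

def amortized_read_cost(flush_work, dt):
    """R_amort = flush_work / Δt."""
    return flush_work / dt

# Illustrative parameters (assumptions)
nr_cpus = 64
charge_batch = 32          # stand-in for MEMCG_CHARGE_BATCH
nr_counters = 40           # "tens of counters"
max_cr = 1000.0            # per-counter updates per second (assumed)
nr_cgroups_subtree = 100

dt = min_flush_interval(nr_cpus, charge_batch, nr_counters, max_cr)
# flush_work ~ nr_cpus * nr_cgroups(subtree) * nr_counters
# (in units of per-counter merge operations)
flush_work = nr_cpus * nr_cgroups_subtree * nr_counters
r_amort = amortized_read_cost(flush_work, dt)

# Doubling nr_counters both grows flush_work and shrinks Δt, hence the
# quadratic nr_counters^2 dependence noted in [1].
dt2 = min_flush_interval(nr_cpus, charge_batch, 2 * nr_counters, max_cr)
flush_work2 = nr_cpus * nr_cgroups_subtree * 2 * nr_counters
r_amort2 = amortized_read_cost(flush_work2, dt2)
print(r_amort2 / r_amort)  # → 4.0, i.e. quadratic in nr_counters
```

Note that nr_cpus cancels out of R_amort entirely, which matches the
final big-O expression in [1].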