Received: by 2002:a05:6358:11c7:b0:104:8066:f915 with SMTP id i7csp5035102rwl; Tue, 28 Mar 2023 15:21:23 -0700 (PDT) X-Google-Smtp-Source: AKy350a35wjJRaUDoXm4E8+ObsTWgbdgwyTud9hN/TuyM/XVL+FKsCjTipqzpue9gfv+uq7KYCGM X-Received: by 2002:a50:e604:0:b0:4fa:c3eb:d2c9 with SMTP id y4-20020a50e604000000b004fac3ebd2c9mr17533908edm.12.1680042083305; Tue, 28 Mar 2023 15:21:23 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680042083; cv=none; d=google.com; s=arc-20160816; b=SeyzUyeWxIzRwNwbTBWZ6KfUGh2mdeMYolaYusA0uyY0mE8fvy++cpQyeLHaFWZI1E 5jx2+kYSDaLwjqmhbMisL1ic+X3TnjkTR9FjaHrxhZlAN358IfRWJgayfajt22SIDtgi 7f8+NCIuOpkqqik0/GkcgQWhvb8xIfUeir8yoE3k/PgytearDVA5Ad5mNnM693xlBhBr lr+Yi6iGA/0FFcSSQqHXTvkxi7Q0C/Yvl942WfSrba54LicRNibsE/yvgkK7qgegtsFA UfUdEUXwuQyEdhFbkR2VjkwXp18Sl2kF5DMA6Olf+bi7YJ+7BGsqWo/Qbq8XAkintGKO 33JQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=rItEK+KYvsSIiuNXlyB23Cs11TTT5fJ8aFIXbrMiIEw=; b=UAycv//cGiY6R4BlY5AjcdRsxfG3m+mnu+SLth21VsbZNFWEdsRh1/6RHVO6+EDdQi Gw64tZrFyXuE6XtAC3Kc0zJhs+4h0HVCNrk8heHWG937Yt0sM3fhAILAexn0FPGB0ME4 /KnLy74ERI2ig40PD5PRGersev6H2fo7zMM8e61gcvr7LoZeZFdPmnmNRN82C/vW6IL0 plCqHFPQW5kg2myisU+Y5Jt1Y76jGc2ZfazyWx6HtOJdGc9gwDM4DLt9HmIEIrFBDn++ TUtAhIIXBEfZzMRQtLmZM6vec+UFmp/5j7IXA1OZ9MBgURCb0fkNwwd6xP3lUkOasWpd 5LOA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=IxPRAGCq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l9-20020aa7c3c9000000b0050245e7f729si5274391edr.615.2023.03.28.15.20.58; Tue, 28 Mar 2023 15:21:23 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=IxPRAGCq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230161AbjC1WRe (ORCPT + 99 others); Tue, 28 Mar 2023 18:17:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39306 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230129AbjC1WRP (ORCPT ); Tue, 28 Mar 2023 18:17:15 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 46CF530EF for ; Tue, 28 Mar 2023 15:17:02 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id p9-20020a170902e74900b001a1c7b2e7afso8291638plf.0 for ; Tue, 28 Mar 2023 15:17:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1680041822; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=rItEK+KYvsSIiuNXlyB23Cs11TTT5fJ8aFIXbrMiIEw=; b=IxPRAGCqLmO64hebIOFNkgw1YGUDJSVyfgVVBwxUcvdgL59f7l4LN5zMPrXb+jjbv/ 44AuhCK3u7FtNfz2jZoS0YDY1UIo89Iub/aEWQsolPiMTjtNQ86ze7JWsd1EKsY5Y1Aw 07n0xvSJnL2lwlEm697GGmrpvMBX0F+5BNT3XjdQx09Cev4XnMOYZ7hcuMbACVnhAT8t T+Q0avDfvu5GCgeBYFMaudoWaGLkL6vKFSabNVsNzlmPSiAqWuR+oN/II5DxoATas6tB N+Po8vyHrMmHA1KarR8lOaWajBlxLkTT4ktZV7Et/fSUfAvv7rnsRtElYoPsoIvIEGG6 moFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680041822; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=rItEK+KYvsSIiuNXlyB23Cs11TTT5fJ8aFIXbrMiIEw=; b=y3uPC4wQUPpO8IlPUZjHaRG43OlVVEgT9mCnZLtYBbYkxJCGiO86DzEZZSNF3orrcq jmEqMPtXdta+RGouRasx/Uib8OumBK55kN+wQS9DQOLXrBD6atEJIam4xTzWJSeh/N9t nmTL95MCkaRqBhpV0uv3sj7whDkWKfIwbn7funcU2ydF/f5PAc2MtDdOOWuWRu+7C6KI oO67gO93tQ/8A5keL2DZLxujKi3d7UJ/Q1b1UqfqTRYOpopVoq/g5j2rZzVtXF9mXHZw N+Zw5GXGj0ieOG15MLi34dC2+xU7sxRE3opYS6cDia+Ra4wrLaCZicufZVGjHiGPUY9G rwMg== X-Gm-Message-State: AAQBX9dTfU19qyJ/gvkw3Al+WMq04Cf/3IUX8A05aTeghN8/A3+6YuFn 8ru3Q2bHjuvD0RmSwWekBmuM6H4gsgtDVKng X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327]) (user=yosryahmed job=sendgmr) by 2002:a63:ce08:0:b0:50c:6cd:cace with SMTP id y8-20020a63ce08000000b0050c06cdcacemr4745262pgf.2.1680041821773; Tue, 28 Mar 2023 15:17:01 -0700 (PDT) Date: Tue, 28 Mar 2023 22:16:41 +0000 In-Reply-To: <20230328221644.803272-1-yosryahmed@google.com> Mime-Version: 1.0 References: <20230328221644.803272-1-yosryahmed@google.com> X-Mailer: git-send-email 2.40.0.348.gf938b09366-goog Message-ID: <20230328221644.803272-7-yosryahmed@google.com> Subject: [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts From: Yosry Ahmed To: Tejun Heo , Josef Bacik , Jens Axboe , Zefan Li , Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton , "=?UTF-8?q?Michal=20Koutn=C3=BD?=" Cc: Vasily Averin , cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, Yosry Ahmed Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-7.7 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently, all contexts that flush memcg stats do so with sleeping not allowed. Some of these contexts are perfectly safe to sleep in, such as reading cgroup files from userspace or the background periodic flusher. Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka sleepable), and provide a separate atomic version. The atomic version is used in reclaim, refault, writeback, and in mem_cgroup_usage(). All other code paths are left to use the non-atomic version. This includes callbacks for userspace reads and the periodic flusher. Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(), change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and refault code paths are modified to do non-atomic flushing in separate later patches -- so it will eventually be changed back to mem_cgroup_flush_stats_ratelimited(). Signed-off-by: Yosry Ahmed Acked-by: Shakeel Butt --- include/linux/memcontrol.h | 9 ++++++-- mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++-------- mm/vmscan.c | 2 +- mm/workingset.c | 2 +- 4 files changed, 45 insertions(+), 13 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ac3f3b3a45e2..b424ba3ebd09 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1037,7 +1037,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, } void mem_cgroup_flush_stats(void); -void mem_cgroup_flush_stats_ratelimited(void); +void mem_cgroup_flush_stats_atomic(void); +void mem_cgroup_flush_stats_atomic_ratelimited(void); void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); @@ -1535,7 +1536,11 @@ static inline void mem_cgroup_flush_stats(void) { } -static inline void mem_cgroup_flush_stats_ratelimited(void) +static inline void mem_cgroup_flush_stats_atomic(void) +{ +} + +static inline void mem_cgroup_flush_stats_atomic_ratelimited(void) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 65750f8b8259..a2ce3aa10d94 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) } } -static void __mem_cgroup_flush_stats(void) +static void do_flush_stats(bool atomic) { /* * We always flush the entire tree, so concurrent flushers can just @@ -646,26 +646,46 @@ static void __mem_cgroup_flush_stats(void) return; WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); - cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup); + + if (atomic) + cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup); + else + cgroup_rstat_flush(root_mem_cgroup->css.cgroup); + atomic_set(&stats_flush_threshold, 0); atomic_set(&stats_flush_ongoing, 0); } +static bool should_flush_stats(void) +{ + return atomic_read(&stats_flush_threshold) > num_online_cpus(); +} + void mem_cgroup_flush_stats(void) { - if (atomic_read(&stats_flush_threshold) > num_online_cpus()) - __mem_cgroup_flush_stats(); + if (should_flush_stats()) + do_flush_stats(false); } -void mem_cgroup_flush_stats_ratelimited(void) +void mem_cgroup_flush_stats_atomic(void) +{ + if (should_flush_stats()) + do_flush_stats(true); +} + +void mem_cgroup_flush_stats_atomic_ratelimited(void) { if (time_after64(jiffies_64, READ_ONCE(flush_next_time))) - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats_atomic(); } static void flush_memcg_stats_dwork(struct work_struct *w) { - __mem_cgroup_flush_stats(); + /* + * Always flush here so that flushing in latency-sensitive paths is + * as cheap as possible. + */ + do_flush_stats(false); queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); } @@ -3685,9 +3705,12 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) * done from irq context; use stale stats in this case. * Arguably, usage threshold events are not reliable on the root * memcg anyway since its usage is ill-defined. + * + * Additionally, other call paths through memcg_check_events() + * disable irqs, so make sure we are flushing stats atomically. */ if (in_task()) - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats_atomic(); val = memcg_page_state(memcg, NR_FILE_PAGES) + memcg_page_state(memcg, NR_ANON_MAPPED); if (swap) @@ -4610,7 +4633,11 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent; - mem_cgroup_flush_stats(); + /* + * wb_writeback() takes a spinlock and calls + * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep. + */ + mem_cgroup_flush_stats_atomic(); *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); diff --git a/mm/vmscan.c b/mm/vmscan.c index 9c1c5e8b24b8..a9511ccb936f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) * Flush the memory cgroup stats, so that we read accurate per-memcg * lruvec stats for heuristics. */ - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats_atomic(); /* * Determine the scan balance between anon and file LRUs. diff --git a/mm/workingset.c b/mm/workingset.c index af862c6738c3..dab0c362b9e3 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow) mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr); - mem_cgroup_flush_stats_ratelimited(); + mem_cgroup_flush_stats_atomic_ratelimited(); /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if -- 2.40.0.348.gf938b09366-goog