From: Yosry Ahmed
Date: Mon, 21 Aug 2023 13:58:22 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Tejun Heo
Cc: Shakeel Butt, Michal Hocko, Johannes Weiner, Roman Gushchin, Andrew Morton, Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ivan Babrou

On Fri, Aug 18, 2023 at
2:40 PM Yosry Ahmed wrote:
>
> On Wed, Aug 16, 2023 at 3:35 PM Yosry Ahmed wrote:
> >
> > On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > > > These options are not white and black and there can be something in
> > > > between, but let me be very clear on what I don't want and would NACK.
> > >
> > > I'm not a big fan of interfaces with hidden states. What you're proposing
> > > isn't strictly that, but it's still a bit nasty. So, if we can get by
> > > without doing that, that'd be great.
> >
> > Agreed. I will try to send patches soon implementing option (2) above,
> > basically removing unified flushing. I will try to document any
> > potential regressions that may happen and how we may fix them. Ideally
> > we see no regressions.
> >
> > > > I don't want a global sleepable lock which can be taken by potentially
> > > > any application running on the system. We have seen similar global
> > > > locks causing isolation and priority inversion issues in production.
> > > > So, not another lock which needs to be taken under extreme condition
> > > > (reading stats under OOM) by a high priority task (node controller)
> > > > and might be held by a low priority task.
> > >
> > > Yeah, this is a real concern. Those priority inversions do occur and can
> > > be serious, but causing serious problems under memory pressure usually
> > > requires involving memory allocations and IOs. Here, it's just all CPU.
> > > So, at least in OOM conditions, this shouldn't be in the way (the system
> > > wouldn't have anything else to do anyway).
> > >
> > > It is true that this can still lead to priority inversions through CPU
> > > competition, though. However, that problem isn't necessarily solved by
> > > what you're suggesting either, unless you want to restrict explicit
> > > flushing based on permissions, which is another can of worms.
> >
> > Right. Also, in the case of a mutex, if we disable preemption while
> > holding the mutex, this makes sure that whoever is holding the mutex
> > does not starve waiters. Essentially, the difference would be that
> > waiters will sleep on the mutex instead of spinning, but the mutex
> > holder itself wouldn't sleep.
> >
> > I will make this a separate patch, just in case it's too
> > controversial. Switching the spinlock to a mutex should not block
> > removing unified flushing.
> >
> > > My preference is not exposing this in the user interface. This is
> > > mostly arising from internal implementation details and isn't what
> > > users necessarily care about. There are many things we can do on the
> > > kernel side to make trade-offs among overhead, staleness, and priority
> > > inversions. If we make this an explicit userland interface behavior,
> > > we get locked into that semantics, which we'll likely regret in some
> > > future.
> >
> > Yeah, that's what I am trying to do here as well. I will try to follow
> > up on this discussion with patches soon.
> >
> > Thanks everyone!
>
> So, a status update: I tried implementing the removal of unified flushing.
> I can send the patches if it helps visualize things, but essentially
> mem_cgroup_flush_stats{_ratelimited}() takes a memcg argument that it
> passes to cgroup_flush_stats(). There is no skipping if someone else is
> flushing (stats_flush_ongoing) and no threshold at which we start
> flushing (stats_flush_threshold).
>
> I tested this with a synthetic reclaim stress test that I wrote, because
> reclaim is the easiest way to stress in-kernel flushing. I basically
> create tens or hundreds of cgroups and run workers in them that keep
> allocating memory beyond the limit. I also run workers that periodically
> read the stats.
>
> Removing unified flushing makes such a synthetic benchmark 2-3 times
> slower.
> This is not surprising, as all these concurrent reclaimers used to just
> skip flushing, regardless of whether or not they get accurate stats. Now,
> I don't know how realistic such a benchmark is, but if there's a workload
> somewhere that runs into such conditions, it will surely regress.
>
> Sorry if the above is difficult to visualize. I can send the patches and
> the test script if that makes things easier; I just didn't want to
> overwhelm the thread.
>
> I think there are two options going forward:
>
> (a) If we believe removing unified flushing is the best way forward for
> most use cases, then we need to find a better way of testing this
> approach practically. Any suggestions here would be useful.
>
> (b) Decide that it's too expensive to remove unified flushing completely,
> at least for in-kernel flushers that can see a lot of concurrency, and
> remove unified flushing only for userspace reads: only flush the memcg
> being read and never skip (essentially what Ivan tried). There are
> already some flushing contexts that do this (e.g. the zswap memcg
> charging code). I believe that as long as there isn't a lot of
> concurrency in these paths, it should be fine to skip unified flushing
> for them and keep it only for in-kernel flushers.

I sent a short series implementing (b) above:
https://lore.kernel.org/lkml/20230821205458.1764662-1-yosryahmed@google.com/

Basic testing shows that it works well.