X-Mailing-List: linux-kernel@vger.kernel.org
In-Reply-To: <75d837cc-4d33-44f6-bb0c-7558f0488d4e@kernel.org>
From: Yosry Ahmed
Date: Fri, 12 Apr 2024 12:51:09 -0700
Subject: Re: Advice on cgroup rstat lock
To: Jesper Dangaard Brouer
Cc: Waiman Long, Johannes Weiner, Tejun Heo, Jesper Dangaard Brouer,
    "David S. Miller", Sebastian Andrzej Siewior, Shakeel Butt,
    Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, kernel-team,
    cgroups@vger.kernel.org, Linux-MM, Netdev, bpf, LKML, Ivan Babrou

On Fri, Apr 12, 2024 at 12:26 PM Jesper Dangaard Brouer wrote:
>
>
>
> On 11/04/2024 19.22, Yosry Ahmed wrote:
> > [..]
> >>>>>>
> >>>>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
> >>>>>
> >>>>> The cgroup_rstat_lock was originally a mutex. It was converted to a
> >>>>> spinlock in commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with
> >>>>> a spinlock"). Irq was disabled to enable calling from atomic context.
> >>>>> Since commit 0a2dc6ac3329 ("cgroup: remove
> >>>>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
> >>>>> atomic context anymore. Theoretically, we could change it back to a
> >>>>> mutex or not disabling interrupt. That will require that the API cannot
> >>>>> be called from atomic context going forward.
> >>>>>
> >>>> I think we should avoid flushing from atomic contexts going forward
> >>>> anyway tbh. It's just too much work to do with IRQs disabled, and we
> >>>> observed hard lockups before in worst case scenarios.
> >>>>
> >>
> >> Appreciate the historic commits as documentation for how the code
> >> evolved. Sounds like we agree that the IRQ-disable can be lifted,
> >> at-least between the three of us.
> >
> > It can be lifted, but whether it should be or not is a different
> > story. I tried keeping it as a spinlock without disabling IRQs before
> > and Tejun pointed out possible problems, see below.
> >
>
> IMHO it *MUST* be lifted, as disabling IRQs here is hurting other parts
> of the system and actual production systems.
>
> The "offending" IRQ-spin_lock commit (0fa294fb1985) is from 2018, and
> GitHub noticed in 2019 (via blog[1]) and at Red Hat I backported[2]
> patches (which I now understand) only mitigate the issues. Our prod
> systems are on 6.1 and 6.6 where we still clearly see the issue
> occurring. Also Daniel's "rtla timerlat" tool for catching systems
> latency issues have "cgroup_rstat_flush_locked" as the poster child [3][4].

We have been bitten by the IRQ-spinlock before, so I cannot disagree,
although for us removing atomic flushes and allowing the lock to be
dropped between CPU flushes seems to be good enough (for now).

>
>
> [1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1795049
> [3] https://bristot.me/linux-scheduling-latency-debug-and-analysis/
> [4] Documentation/tools/rtla/rtla-timerlat-top.rst
>
> >>
> >>>> I think one problem that was discussed before is that flushing is
> >>>> exercised from multiple contexts and could have very high concurrency
> >>>> (e.g. from reclaim when the system is under memory pressure). With a
> >>>> mutex, the flusher could sleep with the mutex held and block other
> >>>> threads for a while.
> >>>>
> >>
> >> Fair point, so in first iteration we keep the spin_lock but don't do the
> >> IRQ disable.
> >
> > I tried doing that before, and Tejun had some objections:
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/
> >
> > My read of that thread is that Tejun would prefer we look into
> > converting cgroup_rstat_lock into a mutex again, or more aggressively
> > drop the lock on CPU boundaries.
> > Perhaps we can unconditionally drop
> > the lock on each CPU boundary, but I am worried that contending the
> > lock too often may be an issue, which is why I suggested dropping the
> > lock if there are pending IRQs instead -- but I am not sure how to do
> > that :)
> >
>
> Like Tejun, I share the concern that keeping this a spinlock
> can increase the chance of several CPUs contending on this lock (which is
> also a production issue we see). This is why I suggested to "exit" if
> (1) we see the lock have been taken by somebody else, or if (2) stats
> were flushed recently.

When you say "exit", do you mean abort the whole thing, or just don't
spin for the lock but wait for the ongoing flush?

>
> For (2), memcg have a mem_cgroup_flush_stats_ratelimited() system
> combined with memcg_vmstats_needs_flush(), which limits the pressure on
> the global lock (cgroup_rstat_lock).
> *BUT* other users of cgroup_rstat_flush() like when reading io.stat
> (blk-cgroup.c) and cpu.stat, don't have such a system to limit pressure
> on global lock. Further more, userspace can easily trigger this via
> reading those stat files. And normal userspace stats tools (like
> cadvisor, nomad, systemd) spawn threads reading io.stat, cpu.stat and
> memory.stat, likely without realizing that kernel side they share same
> global lock...
>
> I'm working on a code solution/proposal for "ratelimiting" global lock
> access when reading io.stat and cpu.stat.

I personally don't like mem_cgroup_flush_stats_ratelimited() very
much, because it is time-based (unlike memcg_vmstats_needs_flush()),
and a lot of changes can happen in a very short amount of time.
However, it seems like for some workloads it's a necessary evil :/

I briefly looked into a global scheme similar to
memcg_vmstats_needs_flush() in core cgroups code, but I gave up
quickly.
Different subsystems have different incomparable stats, so we
cannot have a simple magnitude of pending updates on a cgroup-level
that represents all subsystems fairly. I tried to have per-subsystem
callbacks to update the pending stats and check if flushing is
required -- but it got complicated quickly and performance was bad.

At some point, having different rstat trees for different subsystems
was brought up. I never looked into actually implementing it, but I
suppose if we do that we could have a generic scheme similar to
memcg_vmstats_needs_flush() that can be customized by each subsystem
in a clean performant way? I am not sure.

[..]
> >>
> >>>> I vaguely recall experimenting locally with changing that lock into a
> >>>> mutex and not liking the results, but I can't remember much more. I
> >>>> could be misremembering though.
> >>>>
> >>>> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> >>>> CPU iterations if rescheduling is needed or the lock is being
> >>>> contended (i.e. spin_needbreak() returns true). I had always wondered
> >>>> if it's possible to introduce a similar primitive for IRQs? We could
> >>>> also drop the lock (and re-enable IRQs) if IRQs are pending then.
> >>>
> >>> I am not sure if there is a way to check if a hardirq is pending, but we
> >>> do have a local_softirq_pending() helper.
> >>
> >> The local_softirq_pending() might work well for me, as this is our prod
> >> problem, that CPU local pending softirq's are getting starved.
> >
> > If my understanding is correct, softirqs are usually scheduled by
> > IRQs, which means that local_softirq_pending() may return false if
> > there are pending IRQs (that will schedule softirqs). Is this correct?
> >
>
> Yes, networking hard IRQ will raise softirq, but software often also
> raise softirq.
> I see where you are going with this...
> the cgroup_rstat_flush_locked()
> loop "play nice" check happens with IRQ lock held, so you speculate that
> IRQ handler will not be able to raise softirq, thus
> local_softirq_pending() will not work inside IRQ lock.

Exactly.

I wonder if it would be okay to just unconditionally drop the lock at
each CPU boundary. Would be interesting to experiment with this. One
disadvantage of the mutex in this case (imo) is that outside of the
percpu spinlock critical section, we don't really need to be holding
the global lock/mutex. So sleeping while holding it is not needed and
only introduces problems. Dropping the spinlock at each boundary seems
like a way to circumvent that.

If the problems you are observing are mainly on CPUs that are holding
the lock and flushing, I suspect this should help greatly. If the
problems are mainly on CPUs spinning for the lock, I suspect it will
still help redistribute the lock (and IRQ disablement) more often, but
not as much.

>
>
> >>
> >> In production another problematic (but rarely occurring issue) is when
> >> several CPUs contend on this lock. Yosry's recent work/patches have
> >> already reduced the chances of this happening (thanks), BUT it still can
> >> and does happen.
> >> A simple solution to this, would be to do a spin_trylock() in
> >> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
> >> know someone else will do the work.
> >
> > I am not sure I understand what you mean specifically with the checks
> > below, but I generally don't like this (as you predicted :) ).
> >
> > On the memcg side, we used to have similar logic when we used to
> > always flush the entire tree. This led to flushing being
> > indeterministic. You would occasionally get stale stats because of the
> > contention, which resulted in some inconsistencies (e.g. performing
> > proactive reclaim successfully then reading the stats that do not
> > reflect that).
> >
> > Now that we dropped the logic to always flush the entire tree, it is
> > even more difficult because concurrent flushes could be in completely
> > irrelevant subtrees.
> >
> > If we were to introduce some smart logic to figure out that the
> > subtree we are trying to flush is already being flushed, I think we
> > would need to wait for that ongoing flush to complete instead of just
> > returning (e.g. using completions). But I think such implementations
> > to find overlapping flushes and wait for them may be too complicated.
> >
>
> We will see if you hate my current code approach ;-)

Just to be clear, if the spinlock was to be converted to a mutex, or
to be dropped at each CPU boundary, do you think such ratelimiting is
still needed to mitigate lock contention -- even if the IRQs latency
problem is fixed?
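
For illustration only, here is a minimal userspace sketch (not kernel
code, and not a proposed patch) of the "unconditionally drop the global
lock at each CPU boundary" idea discussed above. It assumes pthread
mutexes as stand-ins for cgroup_rstat_lock and the percpu lock, and it
does not model IRQ disabling at all. struct rstat_model,
rstat_model_init(), rstat_model_update(), rstat_model_flush() and
NR_MODEL_CPUS are hypothetical names made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_MODEL_CPUS 4 /* hypothetical CPU count for the model */

/* Userspace model only; none of these names exist in the kernel. */
struct rstat_model {
	pthread_mutex_t global_lock;             /* stand-in for cgroup_rstat_lock */
	pthread_mutex_t cpu_lock[NR_MODEL_CPUS]; /* stand-in for the percpu lock */
	long pending[NR_MODEL_CPUS];             /* per-CPU updates not yet flushed */
	long total;                              /* flushed value, under global_lock */
	unsigned long global_acquisitions;       /* times global_lock was taken */
};

static void rstat_model_init(struct rstat_model *m)
{
	pthread_mutex_init(&m->global_lock, NULL);
	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		pthread_mutex_init(&m->cpu_lock[cpu], NULL);
		m->pending[cpu] = 0;
	}
	m->total = 0;
	m->global_acquisitions = 0;
}

/* Updater side: only ever touches the per-CPU lock. */
static void rstat_model_update(struct rstat_model *m, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_MODEL_CPUS);
	pthread_mutex_lock(&m->cpu_lock[cpu]);
	m->pending[cpu] += delta;
	pthread_mutex_unlock(&m->cpu_lock[cpu]);
}

/*
 * Flusher side: the global lock is taken and released once per CPU,
 * so no single hold covers more than one CPU's worth of work.
 */
static long rstat_model_flush(struct rstat_model *m)
{
	long total;

	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		long delta;

		pthread_mutex_lock(&m->global_lock);
		m->global_acquisitions++;

		/* Short per-CPU critical section: grab and reset the delta. */
		pthread_mutex_lock(&m->cpu_lock[cpu]);
		delta = m->pending[cpu];
		m->pending[cpu] = 0;
		pthread_mutex_unlock(&m->cpu_lock[cpu]);

		m->total += delta;

		/* CPU boundary: always drop the global lock here. */
		pthread_mutex_unlock(&m->global_lock);
	}

	/* Read the aggregated value back under the global lock. */
	pthread_mutex_lock(&m->global_lock);
	m->global_acquisitions++;
	total = m->total;
	pthread_mutex_unlock(&m->global_lock);
	return total;
}
```

The trade-off shows up in the counter: one flush in this model takes
the global lock NR_MODEL_CPUS + 1 times instead of once. That is the
extra contention cost mentioned above, paid in exchange for a bounded
amount of work per acquisition.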