Date: Tue, 16 Apr 2024 11:41:15 -0700
From: Shakeel Butt
To: Jesper Dangaard Brouer
Cc: Yosry Ahmed, Waiman Long, Johannes Weiner, Tejun Heo, Jesper Dangaard Brouer, "David S. Miller", Sebastian Andrzej Siewior, Shakeel Butt, Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, kernel-team, cgroups@vger.kernel.org, Linux-MM, Netdev, bpf, LKML, Ivan Babrou
Subject: Re: Advice on cgroup rstat lock
References: <96728c6d-3863-48c7-986b-b0b37689849e@redhat.com> <4fd9106c-40a6-415a-9409-c346d7ab91ce@redhat.com> <75d837cc-4d33-44f6-bb0c-7558f0488d4e@kernel.org> <9f6333ec-f28c-4a91-b7b9-07a028d92225@kernel.org>
In-Reply-To: <9f6333ec-f28c-4a91-b7b9-07a028d92225@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 16, 2024 at 04:22:51PM +0200, Jesper Dangaard Brouer wrote:

Sorry for the late response. I see there are patches posted as well, which I will take a look at, but let me put some things in perspective.

> >
> > I personally don't like mem_cgroup_flush_stats_ratelimited() very
> > much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> > and a lot of changes can happen in a very short amount of time.
> > However, it seems like for some workloads it's a necessary evil :/
> >

Other than obj_cgroup_may_zswap(), there is no other place that really needs very accurate stats. IMO we should actually make the ratelimited version the default for all the places. Stats will always be out of sync for some time window even with a non-ratelimited flush, and I don't see any place where a 2-second-old stat would be an issue.

>
> I like the combination of the two mem_cgroup_flush_stats_ratelimited()
> and memcg_vmstats_needs_flush().
> IMHO the jiffies rate limit 2*FLUSH_TIME is too high, looks like 4 sec?

4 sec is the worst case, and I don't think anyone has seen or reported a 4 sec delayed flush. If it is happening, it seems like no one cares.

> >
> > I briefly looked into a global scheme similar to
> > memcg_vmstats_needs_flush() in core cgroups code, but I gave up
> > quickly. Different subsystems have different incomparable stats, so we
> > cannot have a simple magnitude of pending updates on a cgroup-level
> > that represents all subsystems fairly.
> >
> > I tried to have per-subsystem callbacks to update the pending stats
> > and check if flushing is required -- but it got complicated quickly
> > and performance was bad.
> >
>
> I like the time-based limit because it doesn't require tracking pending
> updates.
>
> I'm looking at using a time-based limit, on how often userspace can take
> the lock, but in the area of 50ms to 100 ms.

Sounds good to me. You might just need to check that obj_cgroup_may_zswap() is not getting delayed or getting stale stats.

>
> With a mutex lock contention will be less obvious, as converting this to
> a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention.
>

I don't like global sleepable locks, as those are a source of priority inversion issues on highly utilized multi-tenant systems, but I still need to see how you are handling that.

> Userspace can easily triggered pressure on the global cgroup_rstat_lock
> via simply reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
> I think we need a system to mitigate lock contention from userspace
> (waiting on code compiling with a proposal). We see normal userspace
> stats tools like cadvisor, nomad (and systemd) trigger this by reading
> all the stat file on the system and even spawning parallel threads
> without realizing that kernel side they share same global lock.
>
> You have done a huge effort to mitigate lock contention from memcg,
> thank you for that.
> It would be sad if userspace reading these stat
> files can block memcg. On production I see shrink_node having a
> congestion point happening on this global lock.

Seems like another instance where we should use the ratelimited version of the flush function.
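
To make the time-based limit discussed above concrete, here is a minimal userspace sketch of the idea. It is not the kernel implementation: the struct, the function names, and the millisecond clock are all hypothetical stand-ins, and FLUSH_TIME_MS is assumed to be about 2 seconds because the thread describes 2*FLUSH_TIME as roughly 4 sec. The only point it shows is that a caller skips the expensive flush (and so never touches the global lock) unless the last flush is older than a minimum interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's flush period, taken as ~2 s
 * because the thread describes 2 * FLUSH_TIME as "4 sec". */
#define FLUSH_TIME_MS 2000u

/* Hypothetical model of one flush domain guarded by a global lock. */
struct stats_flusher {
    uint64_t min_interval_ms; /* minimum gap between two flushes */
    uint64_t last_flush_ms;   /* monotonic time of the last flush */
    uint64_t flush_count;     /* number of expensive flushes that ran */
};

/* Stand-in for the expensive flush that would take the global lock. */
static void flush_stats(struct stats_flusher *f, uint64_t now_ms)
{
    f->last_flush_ms = now_ms;
    f->flush_count++;
}

/*
 * Flush only if the previous flush is older than min_interval_ms.
 * Returns true if a flush ran, false if the caller proceeds with
 * stats that may be up to min_interval_ms stale.
 * now_ms is assumed to come from a monotonic clock.
 */
static bool flush_stats_ratelimited(struct stats_flusher *f, uint64_t now_ms)
{
    assert(now_ms >= f->last_flush_ms);
    if (now_ms - f->last_flush_ms <= f->min_interval_ms)
        return false;
    flush_stats(f, now_ms);
    return true;
}
```

With min_interval_ms set to 2 * FLUSH_TIME_MS this models the 4 sec worst case mentioned above; setting it to something in the 50 to 100 ms range models the proposed limit on how often userspace readers can take the lock. A caller that needs accurate stats, such as the obj_cgroup_may_zswap() case, would call the non-ratelimited flush directly instead.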