Date: Thu, 3 Nov 2022 16:31:47 +0100
From: Michal Hocko
To: Leonardo Brás
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
References: <20221102020243.522358-1-leobras@redhat.com> <07810c49ef326b26c971008fb03adf9dc533a178.camel@redhat.com>
In-Reply-To: <07810c49ef326b26c971008fb03adf9dc533a178.camel@redhat.com>

On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > On Tue 01-11-22 23:02:40,
> > Leonardo Bras wrote:
> > > Patch #1 expands housekeeping_any_cpu() so we can find housekeeping cpus
> > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > >
> > > ### Performance argument that motivated the change:
> > > There could be an argument of why would that be needed, since the current
> > > CPU is probably accessing the current cacheline, and so having a CPU closer
> > > to the current one is always the best choice since the cache invalidation
> > > will take less time. OTOH, there could be cases like this which uses
> > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > cacheline:
> > >
> > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > C3 - Housekeeping CPU: This one will do the work
> > >
> > > Most of the times the cacheline is touched, it should be by C1. Sometimes
> > > a C2 will schedule work to run on C3, since C1 is isolated.
> > >
> > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node
> > > (housekeeping_any_cpu_from(C1)).
> > >
> > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > this happens in C1, when it's working in its data.
> > >
> > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > this happens in C1.
> > >
> > > The thing is: it should be better to wait less when doing kernel work
> > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > a few more cycles.
> > > ###
> > >
> > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > cache draining on patch #3.
> > > Most performance concerns should be pointed out in the commit log.
> > >
> > > Patch #3 implements the remote per-CPU cache drain, making use of both
> > > patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
> > > introduce an extra function call and a single test to check if the CPU is
> > > isolated.
> > >
> > > On scenarios with isolation enabled on boot, it will also introduce an
> > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > there will also be an extra read of the cpumask to look for a
> > > housekeeping CPU.
>
> Hello Michal, thanks for reviewing!
>
> > This is a rather deep dive in the cache line usage but the most
> > important thing is really missing. Why do we want this change? From the
> > context it seems that this is an actual fix for isolcpus= setup when
> > remote (aka non isolated activity) interferes with isolated cpus by
> > scheduling pcp charge caches on those cpus.
> >
> > Is this understanding correct?
>
> That's correct! The idea is to avoid scheduling work to isolated CPUs.
>
> > If yes, how big of a problem is that?
>
> The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> the isolated CPUs will be running a real-time workload. In this scenario,
> getting any work done instead of the real-time workload may cause the system to
> miss a deadline, which can be bad.

OK, I see. But is memcg charging actually an RT-friendly operation in
the first place? Please note that this path can trigger memory reclaim
and that is when any RT expectations are simply going down the drain.

> > If you want a remote draining then
> > you need some sort of locking (currently we rely on local lock). How
> > come this locking is not going to cause a different form of disturbance?
>
> If I did everything right, most of the extra work should be done either in
> non-isolated (housekeeping) CPUs, or during a syscall.
> I mean, the pcp charge caches
> will be happening on a housekeeping CPU, and the locking cost should be paid
> there as we want to avoid doing that in the isolated CPUs.
>
> I understand there will be a locking cost being paid in the isolated CPUs when:
> a) The isolated CPU is requesting the stock drain,
> b) When the isolated CPUs do a syscall and end up using the protected structure
> the first time after a remote drain.

And anytime the charging path (consume_stock resp. refill_stock)
contends with the remote draining, which is out of control of the RT
task. It is true that the RT kernel will turn that spin lock into a
sleeping RT lock and that could help with potential priority inversions,
but it is still quite a costly thing, I would expect.

> Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> should not expect the syscalls to have a predictable time, so it should be
> fine.

Now I am not sure I understand. If you do not consider the charging path
to be RT sensitive then why is this needed in the first place? What else
would be populating the pcp cache on the isolated cpu? IRQs?

-- 
Michal Hocko
SUSE Labs