Date: Mon, 5 Jun 2023 16:44:40 -0300
From: Marcelo Tosatti
To: Michal Hocko
Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker, Andrew Morton,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka
Subject: Re: [PATCH v2 3/3] mm/vmstat: do not refresh stats for nohz_full CPUs
References: <20230602185757.110910188@redhat.com>
 <20230602190115.545766386@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 05, 2023 at 09:12:01PM +0200, Michal Hocko wrote:
> On Mon 05-06-23 15:14:25, Marcelo Tosatti wrote:
> > On Mon, Jun 05, 2023 at 06:10:57PM +0200, Michal Hocko wrote:
> > > On Mon 05-06-23 12:43:24, Marcelo Tosatti wrote:
> > > > On Mon, Jun 05, 2023 at 09:59:57AM +0200, Michal Hocko wrote:
> > > > > On Fri 02-06-23 15:58:00, Marcelo Tosatti wrote:
> > > > > > The interruption caused by queueing work on nohz_full CPUs
> > > > > > is undesirable for certain applications.
> > > > > 
> > > > > This is not a proper changelog. I am not going to write a changelog for
> > > > > you this time. Please explain why this is really needed and why this
> > > > > approach is desired.
> > > > > E.g. why don't you prevent userspace from
> > > > > refreshing stats if interference is not desirable.
> > > > 
> > > > Michal,
> > > > 
> > > > Can you please check if the following looks better, as
> > > > a changelog? Thanks.
> > > > 
> > > > ---
> > > > 
> > > > The schedule_work_on API uses the workqueue mechanism to
> > > > queue a work item on a workqueue. A kernel thread, which
> > > > runs on the target CPU, executes those work items.
> > > > 
> > > > Therefore, when using the schedule_work_on API,
> > > > it is necessary for the kworker kernel thread to
> > > > be scheduled in, for the work function to be executed.
> > > > 
> > > > Time sensitive applications such as SoftPLCs
> > > > (https://tum-esi.github.io/publications-list/PDF/2022-ETFA-How_Real_Time_Are_Virtual_PLCs.pdf)
> > > > have their response times affected by such interruptions.
> > > > 
> > > > The /proc/sys/vm/stat_refresh file was originally introduced by
> > > > 
> > > > commit 52b6f46bc163eef17ecba4cd552beeafe2b24453
> > > > Author: Hugh Dickins
> > > > Date: Thu May 19 17:12:50 2016 -0700
> > > > 
> > > >     mm: /proc/sys/vm/stat_refresh to force vmstat update
> > > > 
> > > >     Provide /proc/sys/vm/stat_refresh to force an immediate update of
> > > >     per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
> > > >     before checking counts when testing. Originally added to work around a
> > > >     bug which left counts stranded indefinitely on a cpu going idle (an
> > > >     inaccuracy magnified when small below-batch numbers represent "huge"
> > > >     amounts of memory), but I believe that bug is now fixed: nonetheless,
> > > >     this is still a useful knob.
> > > 
> > > No need to quote the full changelog.
> > > 
> > > > Other than the potential interruption to a time sensitive application,
> > > > if using SCHED_FIFO or SCHED_RR priority on the isolated CPU, then
> > > > system hangs can occur:
> > > > 
> > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=978688
> > > 
> > > Confused... This report says that accessing the file (i.e. to force the
> > > refresh) can get stalled because high priority tasks will not allow
> > > kworkers to run. No?
> > 
> > Yes.
> > 
> > > There is simply no way around that unless those kworkers inherit the
> > > priority.
> > 
> > stalld is an attempt to work around the situation by allowing the
> > lower priority processes to execute for a small amount of time
> > (for example 20us every 1s). https://github.com/bristot/stalld:
> > 
> > "The stalld program (which stands for 'stall daemon') is a mechanism to
> > prevent the starvation of operating system threads in a Linux system.
> > The premise is to start up on a housekeeping cpu (one that is not used
> > for real-application purposes) and to periodically monitor the state of
> > each thread in the system, looking for a thread that has been on a run
> > queue (i.e. ready to run) for a specified length of time without being
> > run. This condition is usually hit when the thread is on the same cpu
> > as a high-priority cpu-intensive task and therefore is being given no
> > opportunity to run.
> > 
> > When a thread is judged to be starving, stalld changes that thread to
> > use the SCHED_DEADLINE policy and gives the thread a small slice of time
> > for that cpu (specified on the command line). The thread then runs and
> > when that timeslice is used, the thread is then returned to its original
> > scheduling policy and stalld then continues to monitor thread states."
> 
> But your task is not on rq. It is sleeping and waiting for completion so
> I fail to see how this all is related. The problem is that the userspace
> depends on kernel WQ to complete. There are quite some operations that
> will behave like that.

Yes. In more detail, there are two cases with the current kernel code:

Case-1) No stalld daemon running.

A SCHED_FIFO task on an isolated CPU does:

	do {
		pkt = read_network_packet();
		if (pkt) {
			process_pkt(pkt);
		}
	} while (!stop_request);

Someone else runs "echo 1 > /proc/sys/vm/stat_refresh" or "sysctl -a".
flush_work hangs, because the SCHED_FIFO task never yields the processor
for the kworker to finish executing the vmstat_refresh work.

Case-2) stalld daemon running.

A SCHED_FIFO task on an isolated CPU, in the same condition as above.

The stalld daemon detects the kworker "starving" (not being able to
execute) and changes its priority to SCHED_DEADLINE for 20us, then
changes it back to SCHED_OTHER. This causes the packet processing thread
to be interrupted for 20us, which is not what is expected from the
system.

---

Now with this patch integrated:

Case-1B) "echo 1 > /proc/sys/vm/stat_refresh" or "sysctl -a" returns,
no system hang.

Case-2B) No work is scheduled to run on the isolated CPU, packet
processing is not interrupted for 20us, everyone is happy.

---

In both B cases, kernel statistics were not fully synchronized from
per-CPU counters to global counters (but hopefully nobody cares).
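
For reference, the shape of the change is roughly the following. This is
only a sketch, not the actual patch: it mirrors the structure of
schedule_on_each_cpu() in kernel/workqueue.c, the function name is made
up, and tick_nohz_full_cpu() only stands in for whatever isolation check
ends up being used.

/*
 * Sketch only: queue the vmstat refresh work on housekeeping CPUs and
 * skip nohz_full ones, so that flush_work() below never waits on a
 * kworker that a SCHED_FIFO task on an isolated CPU will not let run.
 * Needs <linux/workqueue.h>, <linux/percpu.h>, <linux/cpu.h> and
 * <linux/tick.h>.
 */
static int refresh_stats_on_housekeeping_cpus(work_func_t func)
{
        struct work_struct __percpu *works;
        int cpu;

        works = alloc_percpu(struct work_struct);
        if (!works)
                return -ENOMEM;

        cpus_read_lock();

        for_each_online_cpu(cpu) {
                struct work_struct *work = per_cpu_ptr(works, cpu);

                /* nohz_full CPU: do not queue work, do not wait for it */
                if (tick_nohz_full_cpu(cpu))
                        continue;

                INIT_WORK(work, func);
                schedule_work_on(cpu, work);
        }

        for_each_online_cpu(cpu) {
                if (tick_nohz_full_cpu(cpu))
                        continue;
                flush_work(per_cpu_ptr(works, cpu));
        }

        cpus_read_unlock();
        free_percpu(works);

        return 0;
}

With the isolated CPUs skipped in both loops, flush_work() is never
called for a kworker that cannot run, which is what removes both the
hang of Case-1 and the 20us interruption of Case-2.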
> > Unfortunately, if you allow that, then the latency sensitive
> > application might be interrupted for longer than acceptable
> > (which is the case for a certain class of applications, for example
> > SoftPLC inside a VM).
> 
> I am losing you again. You can either have top priority processes
> running uninterrupted or latency sensitive ones running with your SLAs.
> Even if you apply this patch you cannot protect your sensitive
> application from CPU top prio hogs. You either have your process
> priorities configured properly or not. I really fail to follow your line
> of arguments here.

Hopefully what is written above helps.

> > > It certainly is unfortunate that the call is not killable
> > > but being stuck behind real time busy looping processes is nothing
> > > really uncommon. One has to be really careful when using real time
> > > priorities.
> > 
> > Yes.
> 
> > > > To avoid the problems above, do not schedule the work to synchronize
> > > > per-CPU mm counters on isolated CPUs. Given the possibility for
> > > > breaking existing userspace applications, avoid changing
> > > > behaviour of access to /proc/sys/vm/stat_refresh, such as
> > > > returning errors to userspace.
> > > 
> > > You are changing the behavior. The preexisting behavior was to flush
> > > everything. This is clearly changing that.
> > 
> > I meant that this patch does not cause read/write to the procfs file
> > to return errors.
> > 
> > I believe returning errors has a higher potential for regressions
> > than not flushing per-CPU VM counters of isolated CPUs (which are
> > bounded).
> 
> Silent change of behavior is even worse because you cannot really tell a
> difference.

What do you suggest?

> > > > > Also would it make some sense to reduce flushing to cpumask
> > > > > of the calling process? (certainly a daring thought but have
> > > > > you even considered it?)
> > > > 
> > > > Fail to see the point here?
> > > 
> > > I mean that, if you already want to change the semantic of the call then
>                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > it would likely be safer to change it in a more robust way and only
> > > flush pcp vmstat caches that are in the process effective cpu mask.
> > 
> > That would change behaviour for systems without isolated CPUs.
> 
> yes, see above.
> 
> > > This
> > > way one can control which pcp caches to flush (e.g. those that are not
> > > on isolated CPUs or contrary those that are isolated but you can afford
> > > to flush at the specific moment). See?
> > 
> > Yes, but not sure what to think of this idea.
> 
> If you want to break something, at least make the change kinda more
> generic than magically single purpose (isolcpu/nohz in this case) oriented.

I hope nobody cares about stat_refresh being off by the reclaim threshold
(which is a small percentage).

The alternative was to synchronize the per-CPU counters remotely, but
that was decided to be too complex.

> > > Now I am not saying this is the right way to go because there is still a
> > > slim chance this will break userspace expectations. Therefore I have
> > > asked why you simply do not stop any random application accessing
> > > stat_refresh in the first place.
> > 
> > I think this is what should be done, but not on the current patchset.
> > 
> > https://lkml.iu.edu/hypermail/linux/kernel/2209.1/01263.html
> > 
> > Regarding housekeeping flags, it is usually the case that initialization might
> > require code execution on interference blocked CPUs (for example MTRR
> > initialization, resctrlfs initialization, MSR writes, ...). Therefore
> > tagging the CPUs after system initialization is necessary, which
> > is not possible with the current housekeeping flags infrastructure.
> > 
> > > These highly specialized setups with
> > > isolated resources shouldn't run arbitrary crap, should they?
> > 
> > Problem is that it's hard to control what people run on a system.
> > 
> > > What if I just start allocating memory and get the system close to OOM.
> > 
> > Sure, or "poweroff".
> > 
> > > I am
> > > pretty sure a small latency induced by the vmstat refreshes is the least
> > > problem you will have.
> > 
> > If the OOM codepath sends no IPIs and queues no work on isolated CPUs,
> > then OOM should be fine.
> 
> You are missing the big picture I am afraid. IPIs are the least of a
> problem in that case (just imagine all the indirect dependencies through
> locking - get a lock held by somebody requesting memory).
> 
> > > So please step back and try to think whether this is actually fixing
> > > anything real before trying to change a user visible interface.
> > 
> > It is fixing either a latency violation or a hang on a system where some user or
> > piece of software happens to run "sysctl -a" (or reads vmstat_refresh).
> 
> I believe we have established that the "hang problem" as described above is
> not fixable by this patch. And I still argue that you should simply not
> allow abuse of the interface if you want to have any latency guarantees.
> Same as with any other kernel activity where you can compete for
> resources (directly or indirectly).
> 
> > If one is using CPU isolation, the latency violation has higher
> > priority than vmstat_refresh returning proper counters.
> 
> This is a strong claim without any actual argument other than you would
> like to have it this way.

Well, it's the least bad option that I can see.

Do you think returning an error from the procfs file handler, if an
isolated CPU is encountered, is not a potential source of regressions?

Again, it seemed to me that returning an error only if a different flag
is enabled (a flag which is disabled by default), similar to what is
suggested in the "block interference" patchset, would be more resilient
against regressions.

But I don't have a strong preference.
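
To illustrate that last paragraph: the stat_refresh handler could refuse
the request instead of silently skipping the isolated CPUs. Again only a
sketch, and "block_interference_enabled" is a placeholder for whatever
opt-in interface the "block interference" patchset would introduce, not
an existing sysctl.

/*
 * Sketch of the alternative: fail the stat_refresh write when an
 * isolated CPU would have to be interrupted and the (hypothetical)
 * opt-in knob is enabled, instead of silently skipping that CPU.
 */
static bool block_interference_enabled;         /* placeholder knob */

static int stat_refresh_check_isolation(void)
{
        int cpu;

        if (!block_interference_enabled)
                return 0;

        for_each_online_cpu(cpu) {
                if (tick_nohz_full_cpu(cpu))
                        return -EBUSY;  /* refuse rather than interrupt */
        }

        return 0;
}

Since the knob defaults to off, existing users of
/proc/sys/vm/stat_refresh would keep the current behaviour, which is why
that direction looks more resilient against regressions to me.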