Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Mon, 5 Jun 2023 15:25:19 -0300
From:   Marcelo Tosatti <mtosatti@redhat.com>
To:     Michal Hocko <mhocko@suse.com>
Cc:     Christoph Lameter <cl@linux.com>,
        Aaron Tomlin <atomlin@atomlin.com>,
        Frederic Weisbecker <frederic@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v2 3/3] mm/vmstat: do not refresh stats for nohz_full CPUs
Message-ID: <ZH4ojxxNh3bW3+ez@tpad>
References: <20230602185757.110910188@redhat.com>
 <20230602190115.545766386@redhat.com>
 <ZH2V/QxDrq7aq5fY@dhcp22.suse.cz>
 <ZH4CnJlpBMxEEwPW@tpad>
 <ZH4JEVGdICubP0Du@dhcp22.suse.cz>
 <ZH4mAbucXtK1PUON@tpad>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <ZH4mAbucXtK1PUON@tpad>
Precedence: bulk

On Mon, Jun 05, 2023 at 03:14:25PM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 05, 2023 at 06:10:57PM +0200, Michal Hocko wrote:
> > On Mon 05-06-23 12:43:24, Marcelo Tosatti wrote:
> > > On Mon, Jun 05, 2023 at 09:59:57AM +0200, Michal Hocko wrote:
> > > > On Fri 02-06-23 15:58:00, Marcelo Tosatti wrote:
> > > > > The interruption caused by queueing work on nohz_full CPUs 
> > > > > is undesirable for certain aplications.
> > > > 
> > > > This is not a proper changelog. I am not going to write a changelog for
> > > > you this time. Please explain why this is really needed and why this
> > > > approach is desired. 
> > > > E.g. why don't you prevent userspace from
> > > > refreshing stats if interference is not desirable.
> > > 
> > > Michal,
> > > 
> > > Can you please check if the following looks better, as
> > > a changelog? thanks
> > > 
> > > ---
> > > 
> > > schedule_work_on API uses the workqueue mechanism to
> > > queue a work item on a queue. A kernel thread, which
> > > runs on the target CPU, executes those work items.
> > > 
> > > Therefore, when using the schedule_work_on API,
> > > it is necessary for the kworker kernel thread to
> > > be scheduled in, for the work function to be executed.
> > > 
> > > Time sensitive applications such as SoftPLCs
> > > (https://tum-esi.github.io/publications-list/PDF/2022-ETFA-How_Real_Time_Are_Virtual_PLCs.pdf),
> > > have their response times affected by such interruptions.
> > > 
> > > The /proc/sys/vm/stat_refresh file was originally introduced by
> > > 
> > > commit 52b6f46bc163eef17ecba4cd552beeafe2b24453
> > > Author: Hugh Dickins <hughd@google.com>
> > > Date:   Thu May 19 17:12:50 2016 -0700
> > > 
> > >     mm: /proc/sys/vm/stat_refresh to force vmstat update
> > > 
> > >     Provide /proc/sys/vm/stat_refresh to force an immediate update of
> > >     per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
> > >     before checking counts when testing.  Originally added to work around a
> > >     bug which left counts stranded indefinitely on a cpu going idle (an
> > >     inaccuracy magnified when small below-batch numbers represent "huge"
> > >     amounts of memory), but I believe that bug is now fixed: nonetheless,
> > >     this is still a useful knob.
> > 
> > No need to quote the full changelog.

I think its useful to put things in perspective.

> > > Other than the potential interruption to a time sensitive application,
> > > if using SCHED_FIFO or SCHED_RR priority on the isolated CPU, then
> > > system hangs can occur:
> > > 
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=978688
> > 
> > Confused... This report says that accessing the file (i.e. to force the
> > refresh) can get stalled because high priority tasks will not allow
> > kworkers to run. No?
> 
> Yes.
> 
> > There is simply no way around that unless those kworkers inherit the
> > priority.
> 
> stalld is an attempt to workaround the situation by allowing the 
> lower priority processes to execute for a small amount of time
> (for example 20us every 1s). https://github.com/bristot/stalld:
> 
> "The stalld program (which stands for 'stall daemon') is a mechanism to
> prevent the starvation of operating system threads in a Linux system.
> The premise is to start up on a housekeeping cpu (one that is not used
> for real-application purposes) and to periodically monitor the state of
> each thread in the system, looking for a thread that has been on a run
> queue (i.e. ready to run) for a specifed length of time without being
> run. This condition is usually hit when the thread is on the same cpu
> as a high-priority cpu-intensive task and therefore is being given no
> opportunity to run.
> 
> When a thread is judged to be starving, stalld changes that thread to
> use the SCHED_DEADLINE policy and gives the thread a small slice of time
> for that cpu (specified on the command line). The thread then runs and
> when that timeslice is used, the thread is then returned to its original
> scheduling policy and stalld then continues to monitor thread states."
> 
> Unfortunately, if you allow that, then the latency sensitive
> application might be interrupted for longer than acceptable
> (which is the case for a certain class of applications, for example
> SoftPLC inside a VM).
> 
> > It certainly is unfortunate that the call is not killable
> > but being stuck behind real time busy looping processes is nothing
> > really uncommong. One has to be really careful when using real time
> > priorities.
> 
> Yes.
> 
> > > To avoid the problems above, do not schedule the work to synchronize
> > > per-CPU mm counters on isolated CPUs. Given the possibility for
> > > breaking existing userspace applications, avoid changing
> > > behaviour of access to /proc/sys/vm/stat_refresh, such as
> > > returning errors to userspace.
> > 
> > You are changing the behavior. The preexisting behavior was to flush
> > everything. This is clearly changing that.
> 
> I meant that this patch does not cause read/write to the procfs file 
> to return errors.
> 
> I believe returning errors has a higher potential for regressions
> than not flushing per-CPU VM counters of isolated CPUs (which are
> bounded).
> 
> > > ---
> > > 
> > > > Also would it make some sense to reduce flushing to cpumask 
> > > > of the calling process? (certainly a daring thought but have
> > > > you even considered it?)
> > > 
> > > Fail to see the point here ?
> > 
> > I mean that, if you already want to change the semantic of the call then
> > it would likely be safer to change it in a more robust way and only
> > flush pcp vmstat caches that are in the process effective cpu mask. 
> 
> That would change behaviour for systems without isolated CPUs.
> 
> > This
> > way one can control which pcp caches to flush (e.g. those that are not
> > on isolated CPUs or contrary those that are isolated but you can afford 
> > to flush at the specific moment). See?
> 
> Yes, but not sure what to think of this idea. 
> 
> > Now I am not saying this is the right way to go because there is still a
> > slim chance this will break userspace expectations. Therefore I have
> > asked why you simply do not stop any random application accessing
> > stat_refresh in the first place.
> 
> I think this is what should be done, but not on the current patchset.
> 
> https://lkml.iu.edu/hypermail/linux/kernel/2209.1/01263.html
> 
> Regarding housekeeping flags, it is usually the case that initialization might
> require code execution on interference blocked CPUs (for example MTRR
> initialization, resctrlfs initialization, MSR writes, ...). Therefore
> tagging the CPUs after system initialization is necessary, which
> is not possible with current housekeeping flags infrastructure.
> 
> >  These highly specialized setups with
> > isolated resources shouldn't run arbitrary crap, should that?
> 
> Problem is that its hard to control what people run on a system.
> 
> > What if I just start allocating memory and get the system close to OOM. 
> 
> Sure, or "poweroff".
> 
> > I am
> > pretty sure a small latency induced by the vmstat refreshes is the least
> > problem you will have.
> 
> If OOM codepath sends no IPI or queues work on isolated CPUs, then OOM
> should be fine.
> 
> > So please step back and try to think whether this is actually fixing
> > anything real before trying to change a user visible interface.
> 
> It is fixing either a latency violation or a hang on a system where some user or
> piece of software happens to run "sysctl -a" (or read vmstat_refresh).
> 
> If one is using CPU isolation, the latency violation has higher 
> priority than vmstat_refresh returning proper counters.

OK, so this patch is not going to include the per-CPU vmstat counters
(up to the threshold) in the synchronization step of reading/writing to the
vmstat_refresh file.

This is a tradeoff: one prefers not to have accurate counters 
(for a procfs file whose value is going to be interpreted, and which 
accurate value might or might not be important) than to interrupt
an isolated CPU.