In preparation for switching the vmstat shepherd to flushing
per-CPU counters remotely, use a cmpxchg loop
instead of a pair of read/write instructions.
Signed-off-by: Marcelo Tosatti <[email protected]>
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -885,7 +885,7 @@ static int refresh_cpu_vm_stats(void)
}
/*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
* There cannot be any access by the offline cpu and therefore
* synchronization is simplified.
*/
@@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
if (pzstats->vm_stat_diff[i]) {
int v;
- v = pzstats->vm_stat_diff[i];
- pzstats->vm_stat_diff[i] = 0;
+ do {
+ v = pzstats->vm_stat_diff[i];
+ } while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}
@@ -917,8 +918,9 @@ void cpu_vm_stats_fold(int cpu)
if (pzstats->vm_numa_event[i]) {
unsigned long v;
- v = pzstats->vm_numa_event[i];
- pzstats->vm_numa_event[i] = 0;
+ do {
+ v = pzstats->vm_numa_event[i];
+ } while (!try_cmpxchg(&pzstats->vm_numa_event[i], &v, 0));
zone_numa_event_add(v, zone, i);
}
}
@@ -934,8 +936,9 @@ void cpu_vm_stats_fold(int cpu)
if (p->vm_node_stat_diff[i]) {
int v;
- v = p->vm_node_stat_diff[i];
- p->vm_node_stat_diff[i] = 0;
+ do {
+ v = p->vm_node_stat_diff[i];
+ } while (!try_cmpxchg(&p->vm_node_stat_diff[i], &v, 0));
atomic_long_add(v, &pgdat->vm_stat[i]);
global_node_diff[i] += v;
}
On Thu, Feb 09, 2023 at 12:01:59PM -0300, Marcelo Tosatti wrote:
> /*
> - * Fold the data for an offline cpu into the global array.
> + * Fold the data for a cpu into the global array.
> * There cannot be any access by the offline cpu and therefore
> * synchronization is simplified.
> */
> @@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
> if (pzstats->vm_stat_diff[i]) {
> int v;
>
> - v = pzstats->vm_stat_diff[i];
> - pzstats->vm_stat_diff[i] = 0;
> + do {
> + v = pzstats->vm_stat_diff[i];
> + } while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
IIUC try_cmpxchg will update "v" already, so I'd assume this'll work the
same:
while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
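That is, keeping the initial read before the loop (sketch):

	v = pzstats->vm_stat_diff[i];
	while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0))
		;

since try_cmpxchg() writes the current value back into "v" on failure,
the explicit re-read inside the do/while is redundant.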
Then I figured, maybe it's easier to use xchg()?
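E.g., an untested sketch of what I mean:

	v = xchg(&pzstats->vm_stat_diff[i], 0);

xchg() atomically swaps in zero and returns the old value, so the retry
loop goes away entirely.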
I've no knowledge at all of the cpu offline code, so sorry if this is a
naive question. But from what I understand this should not be touched by
anyone else. Reasons:
(1) cpu_vm_stats_fold() is only called in page_alloc_cpu_dead(), and the
comment says:
/*
* Zero the differential counters of the dead processor
* so that the vm statistics are consistent.
*
* This is only okay since the processor is dead and cannot
* race with what we are doing.
*/
cpu_vm_stats_fold(cpu);
so.. I think that's what it says..
(2) If someone can modify the dead cpu's vm_stat_diff, what guarantees it
won't be e.g. boosted again right after try_cmpxchg() / xchg()
returns? What to do with the left-overs?
Thanks,
--
Peter Xu
On Wed, Mar 01, 2023 at 05:57:08PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:59PM -0300, Marcelo Tosatti wrote:
> > /*
> > - * Fold the data for an offline cpu into the global array.
> > + * Fold the data for a cpu into the global array.
> > * There cannot be any access by the offline cpu and therefore
> > * synchronization is simplified.
> > */
> > @@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
> > if (pzstats->vm_stat_diff[i]) {
> > int v;
> >
> > - v = pzstats->vm_stat_diff[i];
> > - pzstats->vm_stat_diff[i] = 0;
> > + do {
> > + v = pzstats->vm_stat_diff[i];
> > + } while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
>
> IIUC try_cmpxchg will update "v" already, so I'd assume this'll work the
> same:
>
> while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
>
> Then I figured, maybe it's easier to use xchg()?
Yes, fixed.
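Presumably the hunk then becomes something like this (a sketch, not
necessarily the final posted version):

	if (pzstats->vm_stat_diff[i]) {
		int v;

		v = xchg(&pzstats->vm_stat_diff[i], 0);
		atomic_long_add(v, &zone->vm_stat[i]);
		global_zone_diff[i] += v;
	}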
> I've no knowledge at all of the cpu offline code, so sorry if this is a
> naive question. But from what I understand this should not be touched by
> anyone else. Reasons:
>
> (1) cpu_vm_stats_fold() is only called in page_alloc_cpu_dead(), and the
> comment says:
>
> /*
> * Zero the differential counters of the dead processor
> * so that the vm statistics are consistent.
> *
> * This is only okay since the processor is dead and cannot
> * race with what we are doing.
> */
> cpu_vm_stats_fold(cpu);
>
> so.. I think that's what it says..
This refers to the counter updates being performed with this_cpu
operations.
If both the updater and reader use atomic accesses (which is the case after patch 8:
"mm/vmstat: switch counter modification to cmpxchg"), and
CONFIG_HAVE_CMPXCHG_LOCAL is set, then the comment is stale.
Removed it.
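To make that concrete, here is a minimal sketch (simplified to plain int
rather than the real narrow per-CPU diff types) of why the "processor is
dead" assumption is no longer needed once both sides are atomic:

	/* updater side, after "mm/vmstat: switch counter modification
	 * to cmpxchg" (sketch) */
	static void diff_add(int *diff, int delta)
	{
		int old = READ_ONCE(*diff);

		/* try_cmpxchg() reloads 'old' with the current value
		 * on failure */
		while (!try_cmpxchg(diff, &old, old + delta))
			;
	}

	/* fold side (this patch): atomically take the whole delta */
	static int diff_take(int *diff)
	{
		return xchg(diff, 0);
	}

Since both accesses are atomic RMWs on the same word, a fold racing with
an update can never lose a count; it takes the delta either before or
after the update lands.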
> (2) If someone can modify the dead cpu's vm_stat_diff,
The only contexts that can modify the cpu's vm_stat_diff are:
1) The CPU itself (increases the counter).
2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from
x -> 0 only.
So you should not be able to increase the counter after this point.
I suppose this is what this comment refers to.
> what guarantees it
> won't be e.g. boosted again right after try_cmpxchg() / xchg()
> returns? What to do with the left-overs?
If any code runs on the CPU that is being hotunplugged after
cpu_vm_stats_fold() (from page_alloc_cpu_dead()), then there will be
left-overs. But such bugs would exist today as well.
Or, if that bug exists, you could replace "for_each_online_cpu" with
"for_each_cpu" here:
static void vmstat_shepherd(struct work_struct *w)
{
int cpu;
cpus_read_lock();
/* Check processors whose vmstat worker threads have been disabled */
for_each_online_cpu(cpu) {
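i.e., roughly (a sketch; note that for_each_cpu() itself takes a cpumask,
so the all-CPUs form would be for_each_possible_cpu()):

-	for_each_online_cpu(cpu) {
+	for_each_possible_cpu(cpu) {

which would also fold the left-overs of CPUs that have since gone offline.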
On Thu, Mar 02, 2023 at 10:55:09AM -0300, Marcelo Tosatti wrote:
> > (2) If someone can modify the dead cpu's vm_stat_diff,
>
> The only contexts that can modify the cpu's vm_stat_diff are:
>
> 1) The CPU itself (increases the counter).
> 2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from
> x -> 0 only.
I think I didn't continue reading, so when commenting I didn't see that
cpu_vm_stats_fold() will be reused, sorry.
Now with a reworked (and SMP-safe) cpu_vm_stats_fold() and vmstats, I'm
wondering about the possibility of merging it with refresh_cpu_vm_stats(),
since they really look similar.
IIUC the new refresh_cpu_vm_stats() logically doesn't need the small
preempt-disabled sections anymore, if a cpu_id is passed over to
cpu_vm_stats_fold(), which seems to be even a good side effect. But I'm
not sure whether I missed something.
--
Peter Xu
On Thu, Mar 02, 2023 at 04:19:50PM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 10:55:09AM -0300, Marcelo Tosatti wrote:
> > > (2) If someone can modify the dead cpu's vm_stat_diff,
> >
> > The only contexts that can modify the cpu's vm_stat_diff are:
> >
> > 1) The CPU itself (increases the counter).
> > 2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from
> > x -> 0 only.
>
> I think I didn't continue reading, so when commenting I didn't see that
> cpu_vm_stats_fold() will be reused, sorry.
>
> Now with a reworked (and SMP-safe) cpu_vm_stats_fold() and vmstats, I'm
> wondering about the possibility of merging it with refresh_cpu_vm_stats(),
> since they really look similar.
Seems like a possibility. However, that might require replacing

	v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

with

	pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

which would drop the this_cpu optimization described at commit
7340a0b15280c9d902c7dd0608b8e751b5a7c403.
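A unified helper would then presumably look something like this
(hypothetical sketch; the helper name is invented):

	static int fold_zone_stat_diff(struct zone *zone, int cpu, int i)
	{
		struct per_cpu_zonestat *pzstats =
			per_cpu_ptr(zone->per_cpu_zonestats, cpu);

		/* works for the local cpu and remote cpus alike, but
		 * loses the this_cpu_xchg() fast path on the local one */
		return xchg(&pzstats->vm_stat_diff[i], 0);
	}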
Also you would not want the unified function to sync NUMA events
(as it would be called from NOHZ entry and exit).
See commit f19298b9516c1a031b34b4147773457e3efe743b.
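So the merged function would presumably need a switch for that, e.g.
(hypothetical signature, parameter names invented):

	/* fold one cpu's diffs into the global counters; NUMA event
	 * folding is optional so NOHZ entry/exit can skip it */
	static int refresh_cpu_vm_stats(int cpu, bool fold_numa_events);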
> IIUC the new refresh_cpu_vm_stats() logically doesn't need the small
> preempt-disabled sections anymore,
Which preempt-disabled sections do you refer to?
> if a cpu_id is passed over to
> cpu_vm_stats_fold(), which seems to be even a good side effect. But I'm
> not sure whether I missed something.