2023-03-20 18:24:45

by Marcelo Tosatti

[permalink] [raw]
Subject: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold

Large NUMA systems might have significant portions
of system memory to be trapped in pcp queues. The number of pcp is
determined by the number of processors and nodes in a system. A system
with 4 processors and 2 nodes has 8 pcps which is okay. But a system
with 1024 processors and 512 nodes has 512k pcps with a high potential
for large amount of memory being caught in them.

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.

Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -928,7 +928,7 @@ static int refresh_cpu_vm_stats(bool do_
* There cannot be any access by the offline cpu and therefore
* synchronization is simplified.
*/
-void cpu_vm_stats_fold(int cpu)
+void cpu_vm_stats_fold(int cpu, bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
@@ -938,6 +938,9 @@ void cpu_vm_stats_fold(int cpu)

for_each_populated_zone(zone) {
struct per_cpu_zonestat *pzstats;
+#ifdef CONFIG_NUMA
+ struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+#endif

pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

@@ -948,6 +951,11 @@ void cpu_vm_stats_fold(int cpu)
v = xchg(&pzstats->vm_stat_diff[i], 0);
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
+#ifdef CONFIG_NUMA
+ /* 3 seconds idle till flush */
+ if (do_pagesets)
+ pcp->expire = 3;
+#endif
}
}
#ifdef CONFIG_NUMA
@@ -959,6 +967,38 @@ void cpu_vm_stats_fold(int cpu)
zone_numa_event_add(v, zone, i);
}
}
+
+ if (do_pagesets) {
+ cond_resched();
+ /*
+ * Deal with draining the remote pageset of a
+ * processor
+ *
+ * Check if there are pages remaining in this pageset
+ * if not then there is nothing to expire.
+ */
+ if (!pcp->expire || !pcp->count)
+ continue;
+
+ /*
+ * We never drain zones local to this processor.
+ */
+ if (zone_to_nid(zone) == cpu_to_node(cpu)) {
+ pcp->expire = 0;
+ continue;
+ }
+
+ WARN_ON(pcp->expire < 0);
+ /*
+ * pcp->expire is only accessed from vmstat_shepherd context,
+ * therefore no locking is required.
+ */
+ if (--pcp->expire)
+ continue;
+
+ if (pcp->count)
+ drain_zone_pages(zone, pcp);
+ }
#endif
}

@@ -2060,7 +2100,7 @@ static int refresh_all_vm_stats(void)

cpus_read_lock();
for_each_online_cpu(cpu) {
- cpu_vm_stats_fold(cpu);
+ cpu_vm_stats_fold(cpu, true);
cond_resched();
}
cpus_read_unlock();
Index: linux-vmstat-remote/include/linux/vmstat.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/vmstat.h
+++ linux-vmstat-remote/include/linux/vmstat.h
@@ -291,7 +291,7 @@ extern void __dec_zone_state(struct zone
extern void __dec_node_state(struct pglist_data *, enum node_stat_item);

void quiet_vmstat(void);
-void cpu_vm_stats_fold(int cpu);
+void cpu_vm_stats_fold(int cpu, bool do_pagesets);
void refresh_zone_stat_thresholds(void);

struct ctl_table;
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8629,7 +8629,7 @@ static int page_alloc_cpu_dead(unsigned
* Zero the differential counters of the dead processor
* so that the vm statistics are consistent.
*/
- cpu_vm_stats_fold(cpu);
+ cpu_vm_stats_fold(cpu, false);

for_each_populated_zone(zone)
zone_pcp_update(zone, 0);




2023-03-20 20:44:23

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold

On Mon, 2023-03-20 at 15:03 -0300, Marcelo Tosatti wrote:
>
> +
> + if (do_pagesets) {
> + cond_resched();
> + /*
> + * Deal with draining the remote pageset of a
> + * processor
> + *
> + * Check if there are pages remaining in this pageset
> + * if not then there is nothing to expire.
> + */
> + if (!pcp->expire || !pcp->count)
> + continue;refresh_cpu_vm_stats
> +
> + /*
> + * We never drain zones local to this processor.
> + */
> + if (zone_to_nid(zone) == cpu_to_node(cpu)) {
> + pcp->expire = 0;
> + continue;
> + }
> +
> + WARN_ON(pcp->expire < 0);
> + /*
> + * pcp->expire is only accessed from vmstat_shepherd context,
> + * therefore no locking is required.
> + */
> + if (--pcp->expire)
> + continue;
> +
> + if (pcp->count)
> + drain_zone_pages(zone, pcp);
> + }

This logic is the same to that for the do_pagesets portion of code in refresh_cpu_vm_stats().
Is it possible to consolidate to avoid replicating the logic across two functions?

Tim

2023-03-22 01:30:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold

On Mon, Mar 20, 2023 at 01:43:09PM -0700, Tim Chen wrote:
> On Mon, 2023-03-20 at 15:03 -0300, Marcelo Tosatti wrote:
> >
> > +
> > + if (do_pagesets) {
> > + cond_resched();
> > + /*
> > + * Deal with draining the remote pageset of a
> > + * processor
> > + *
> > + * Check if there are pages remaining in this pageset
> > + * if not then there is nothing to expire.
> > + */
> > + if (!pcp->expire || !pcp->count)
> > + continue;refresh_cpu_vm_stats
> > +
> > + /*
> > + * We never drain zones local to this processor.
> > + */
> > + if (zone_to_nid(zone) == cpu_to_node(cpu)) {
> > + pcp->expire = 0;
> > + continue;
> > + }
> > +
> > + WARN_ON(pcp->expire < 0);
> > + /*
> > + * pcp->expire is only accessed from vmstat_shepherd context,
> > + * therefore no locking is required.
> > + */
> > + if (--pcp->expire)
> > + continue;
> > +
> > + if (pcp->count)
> > + drain_zone_pages(zone, pcp);
> > + }
>
> This logic is the same to that for the do_pagesets portion of code in refresh_cpu_vm_stats().
> Is it possible to consolidate to avoid replicating the logic across two functions?
>
> Tim

Tim,

One of the versions is for remote access, the other version is for local
access (thats why merging the two is not so simple).

I suppose the next logical step would be ensure that the current
!CONFIG_HAVE_CMPXCHG_LOCAL also uses the cmpxchg remotely (then
refresh_cpu_vm_stats can be removed).