Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
To: Michal Hocko
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel
References: <1511848824-18709-1-git-send-email-kemi.wang@intel.com> <20171129121740.f6drkbktc43l5ib6@dhcp22.suse.cz> <4b840074-cb5f-3c10-d65b-916bc02fb1ee@intel.com> <20171130085322.tyys6xbzzvui7ogz@dhcp22.suse.cz>
From: kemi
Message-ID: <0f039a89-5500-1bf5-c013-d39ba3bf62bd@intel.com>
Date: Thu, 30 Nov 2017 17:32:08 +0800

On 30 Nov 2017 at 16:53, Michal Hocko wrote:
> On Thu 30-11-17 13:56:13, kemi wrote:
>>
>> On 29 Nov 2017 at 20:17, Michal Hocko wrote:
>>> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>>>> The existing implementation of the NUMA counters is per logical CPU
>>>> along with zone->vm_numa_stat[] separated by zone, plus a global NUMA
>>>> counter array vm_numa_stat[]. However, unlike the other vmstat
>>>> counters, the NUMA stats don't affect the system's decisions and are
>>>> only read from /proc and /sys; that is a slow-path operation which can
>>>> likely tolerate higher overhead. Additionally, nodes usually have only
>>>> a single zone, except for node 0, and there isn't really any use case
>>>> where these hit counts need to be separated by zone.
>>>>
>>>> Therefore, we can migrate the implementation of the NUMA stats from
>>>> per-zone to per-node and get rid of the global NUMA counters. It is
>>>> good enough to keep everything in a per-cpu pointer of type u64 and
>>>> sum the values up when needed, as suggested by Andi Kleen. That helps
>>>> code cleanup and enhancement (e.g. it saves more than 130 lines of
>>>> code).
>>>
>>> I agree. Having these stats per zone is a bit of an overcomplication.
>>> The only consumer is /proc/zoneinfo and I would argue this doesn't
>>> justify the additional complexity. Who really needs to know per-zone
>>> broken-out numbers?
>>>
>>> Anyway, I haven't checked your implementation too deeply, but why don't
>>> you simply define a static percpu array for each NUMA node?
>>
>> To be honest, there are two other ways I can think of, listed below,
>> but I don't think they are simpler than my current implementation.
>> Maybe you have a better idea.
>>
>> static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
>> But that's not correct.
>>
>> Or we can add a u64 percpu array of size NR_VM_NUMA_STAT_ITEMS to
>> struct pglist_data.
>>
>> My current implementation is quite straightforward: by combining all the
>> local counters together, a single percpu array of size
>> num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS is enough.
>
> Well, this is certainly a matter of taste. But let's have a look at what
> we have currently. We have per-zone, per-node and NUMA stats. That looks
> one way too many to me. Why don't we simply move the whole NUMA stat
> thingy into the per-node stats? The code would simplify even more. We are
> going to lose the per-zone data in /proc/zoneinfo, but we are losing it
> without your patch anyway. So I've just scratched the following on top of
> your patch, and the cumulative diff looks even better:
>
>  drivers/base/node.c    |  22 ++---
>  include/linux/mmzone.h |  22 ++---
>  include/linux/vmstat.h |  38 +--------
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        |  20 ++---
>  mm/vmstat.c            | 221 +------------------------------------------------
>  6 files changed, 30 insertions(+), 295 deletions(-)
>
> I haven't tested it at all yet. This is just to show the idea.
> ---
> commit 92f8f58d1b6cb5c54a5a197a42e02126a5f7ea1a
> Author: Michal Hocko
> Date:   Thu Nov 30 09:49:45 2017 +0100
>
>     - move NUMA stats to node stats
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0be5fbdadaac..315156310c99 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -190,17 +190,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>  			     sum_zone_node_page_state(nid, i));
>
> -#ifdef CONFIG_NUMA
> -	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
> -		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -			     node_numa_state_snapshot(nid, i));
> -#endif
> -
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>  		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> -			     NR_VM_NUMA_STAT_ITEMS],
> +			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     node_page_state(pgdat, i));
>
>  	return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b2d264f8c0c6..2c9c8b13c44b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -115,20 +115,6 @@ struct zone_padding {
>  #define ZONE_PADDING(name)
>  #endif
>
> -#ifdef CONFIG_NUMA
> -enum numa_stat_item {
> -	NUMA_HIT,		/* allocated in intended node */
> -	NUMA_MISS,		/* allocated in non intended node */
> -	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> -	NUMA_LOCAL,		/* allocation from local node */
> -	NUMA_OTHER,		/* allocation from other node */
> -	NR_VM_NUMA_STAT_ITEMS
> -};
> -#else
> -#define NR_VM_NUMA_STAT_ITEMS 0
> -#endif
> -
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> @@ -180,6 +166,12 @@ enum node_stat_item {
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
> +	NUMA_HIT,		/* allocated in intended node */
> +	NUMA_MISS,		/* allocated in non intended node */
> +	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> +	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> +	NUMA_LOCAL,		/* allocation from local node */
> +	NUMA_OTHER,		/* allocation from other node */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index c07850f413de..cc1edd95e949 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -187,19 +187,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  #endif
>  	return x;
>  }
> -
>  #ifdef CONFIG_NUMA
> -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
> +extern unsigned long node_page_state(struct pglist_data *pgdat,
> +				     enum node_stat_item item);
>  extern unsigned long sum_zone_node_page_state(int node,
>  					      enum zone_stat_item item);
> -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
> -extern unsigned long node_page_state(struct pglist_data *pgdat,
> -				     enum node_stat_item item);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
>  #endif /* CONFIG_NUMA */
> -
>  #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
>  #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
>  #define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..84e72f2b5748 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1939,7 +1939,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>  		return page;
>  	if (page && page_to_nid(page) == nid) {
>  		preempt_disable();
> -		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> +		inc_node_page_state(page, NUMA_INTERLEAVE_HIT);
>  		preempt_enable();
>  	}
>  	return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 044daba8c11a..c8e34157f7b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2785,25 +2785,25 @@ int __isolate_free_page(struct page *page, unsigned int order)
>   *
>   * Must be called with interrupts disabled.
>   */
> -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
> +static inline void zone_statistics(int preferred_nid, int page_nid)
>  {
>  #ifdef CONFIG_NUMA
> -	enum numa_stat_item local_stat = NUMA_LOCAL;
> +	enum node_stat_item local_stat = NUMA_LOCAL;
>
>  	/* skip numa counters update if numa stats is disabled */
>  	if (!static_branch_likely(&vm_numa_stat_key))
>  		return;
>
> -	if (z->node != numa_node_id())
> +	if (page_nid != numa_node_id())
>  		local_stat = NUMA_OTHER;
>
> -	if (z->node == preferred_zone->node)
> -		__inc_numa_state(z, NUMA_HIT);
> +	if (page_nid == preferred_nid)
> +		inc_node_state(NODE_DATA(page_nid), NUMA_HIT);
>  	else {
> -		__inc_numa_state(z, NUMA_MISS);
> -		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
> +		inc_node_state(NODE_DATA(page_nid), NUMA_MISS);
> +		inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
>  	}

Your patch saves more code than mine because the node stats framework is
reused for the NUMA stats. But it has a performance regression because of
the limited threshold size in inc_node_state() (125 at most; see
calculate_normal_threshold() in vmstat.c). You can check the patch
"1d90ca8 mm: update NUMA counter threshold size" for details. This issue
was originally reported by Jesper Dangaard Brouer.