Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
To: Vlastimil Babka, Greg Kroah-Hartman, Andrew Morton, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior
Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Aubrey Li, Linux MM, Linux Kernel
References: <1511848824-18709-1-git-send-email-kemi.wang@intel.com>
	<9b4d5612-24eb-4bea-7164-49e42dc76f30@suse.cz>
From: kemi
Date: Tue, 28 Nov 2017 16:33:04 +0800
In-Reply-To: <9b4d5612-24eb-4bea-7164-49e42dc76f30@suse.cz>

On 2017-11-28 16:09, Vlastimil Babka wrote:
> On 11/28/2017 07:00 AM, Kemi Wang wrote:
>> The existing implementation of NUMA counters is per logical CPU, along
>> with a zone->vm_numa_stat[] array separated by zone, plus a global NUMA
>> counter array vm_numa_stat[]. However, unlike the other vmstat counters,
>> NUMA stats don't affect the system's decisions and are only read from
>> /proc and /sys; that is a slow-path operation which can tolerate higher
>> overhead. Additionally, nodes usually have only a single zone (except
>> for node 0), and there isn't really any use case that needs these hit
>> counts separated by zone.
>>
>> Therefore, we can migrate the implementation of NUMA stats from per-zone
>> to per-node and get rid of the global NUMA counters. It is good enough
>> to keep everything in a per-CPU pointer of type u64 and sum the values
>> up when needed, as suggested by Andi Kleen. That helps code cleanup and
>> enhancement (e.g. it removes more than 130 lines of code).
>
> OK.
>
>> With this patch, we see a 1.8% (335->329) drop in CPU cycles for
>> concurrent single-page allocation and deallocation with 112 threads,
>> tested on a 2-socket Skylake platform using Jesper's page_bench03
>> benchmark.
>
> To be fair, one can now avoid the overhead completely since 4518085e127d
> ("mm, sysctl: make NUMA stats configurable"). But if we can still
> optimize it, sure.
>

Yes, I did that several months ago. Both Dave Hansen and I thought that
auto-tuning would be better, because people probably never touch this
interface, but Michal had some concerns about that. This patch aims to
clean up the code for NUMA stats, with a small performance improvement.

>> Benchmark provided by Jesper D Brouer (with the loop count increased to
>> 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>>
>> Also, it does not cause an obvious latency increase when reading /proc
>> and /sys on a 2-socket Skylake platform. Latency shown by the time
>> command:
>>                              base            head
>> /proc/vmstat                 sys 0m0.001s    sys 0m0.001s
>>
>> /sys/devices/system/         sys 0m0.001s    sys 0m0.000s
>> node/node*/numastat
>
> Well, here I have to point out that the coarse resolution of the "time"
> command means that single reads cannot be meaningfully compared. You
> would have to e.g. time a loop with enough iterations (which would then
> be all cache-hot, but better than nothing I guess).
>

It is indeed a coarse comparison, meant only to show that the change does
not cause obvious overhead on a slow path. All right, I will time a loop
to get a more accurate value.
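Something along these lines should work (a minimal userspace sketch; the
iteration count is arbitrary and the loop is cache-hot, as you noted):

/*
 * Minimal sketch (not part of the patch): time a loop of /proc/vmstat
 * reads to get well above the resolution of time(1).
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS 100000	/* arbitrary; large enough to be measurable */

int main(void)
{
	char buf[64 * 1024];
	struct timespec t0, t1;
	int i, fd;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		fd = open("/proc/vmstat", O_RDONLY);
		if (fd < 0)
			return 1;
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* drain the file, discard contents */
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.2f us per read\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / ITERS / 1e3);
	return 0;
}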
>> We would not worry about it much, as it is a slow path and will not be
>> read frequently.
>>
>> Suggested-by: Andi Kleen
>> Signed-off-by: Kemi Wang
>
> ...
>
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1779c98..7383d66 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>>   * Zone and node-based page accounting with per cpu differentials.
>>   */
>>  extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>  extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>> -
>> -#ifdef CONFIG_NUMA
>> -static inline void zone_numa_state_add(long x, struct zone *zone,
>> -				enum numa_stat_item item)
>> -{
>> -	atomic_long_add(x, &zone->vm_numa_stat[item]);
>> -	atomic_long_add(x, &vm_numa_stat[item]);
>> -}
>> -
>> -static inline unsigned long global_numa_state(enum numa_stat_item item)
>> -{
>> -	long x = atomic_long_read(&vm_numa_stat[item]);
>> -
>> -	return x;
>> -}
>> -
>> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> -				enum numa_stat_item item)
>> -{
>> -	long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> -	int cpu;
>> -
>> -	for_each_online_cpu(cpu)
>> -		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>> -
>> -	return x;
>> -}
>> -#endif /* CONFIG_NUMA */
>> +extern u64 __percpu *vm_numa_stat;
>>
>>  static inline void zone_page_state_add(long x, struct zone *zone,
>>  				enum zone_stat_item item)
>> @@ -234,10 +206,39 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
>>
>>
>>  #ifdef CONFIG_NUMA
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> +				enum numa_stat_item item)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline unsigned long node_numa_state_snapshot(int node,
>> +				enum numa_stat_item item)
>> +{
>> +	unsigned long x = 0;
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>
> I'm worried about the "for_each_possible..." approach here and elsewhere
> in the patch, as it can be rather excessive compared to the online number
> of CPUs (we've seen BIOSes report large numbers of possible CPUs). IIRC
> the general approach with vmstat is to query just online CPUs / nodes,
> and if they go offline, transfer their accumulated stats to some other
> "victim"?
>

It's a trade-off, I think. Using "for_each_possible_cpu()" avoids having
to fold the local CPU's stats into a global counter (actually, into the
first available CPU in this patch) when a CPU goes offline/dead.
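To make the trade-off concrete, a rough sketch of the two options
(hypothetical counter layout and helper names, not the exact patch code):

/*
 * Option A: the read side sums over *possible* CPUs, so no hotplug
 * callback is needed at all. The node*items indexing here is an
 * illustrative assumption about the per-CPU array layout.
 */
static unsigned long node_numa_state_sum(int node, enum numa_stat_item item)
{
	unsigned long x = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		x += per_cpu_ptr(vm_numa_stat, cpu)
			[node * NR_VM_NUMA_STAT_ITEMS + item];
	return x;
}

/*
 * Option B: the read side sums only *online* CPUs; a dying CPU's
 * counters must then be folded into a surviving "victim" CPU from a
 * hotplug (dead) callback, as you describe for the other vmstat
 * counters.
 */
static void numa_stats_fold_cpu_dead(unsigned int cpu)
{
	unsigned int victim = cpumask_first(cpu_online_mask);
	u64 *src = per_cpu_ptr(vm_numa_stat, cpu);
	u64 *dst = per_cpu_ptr(vm_numa_stat, victim);
	int i;

	for (i = 0; i < nr_node_ids * NR_VM_NUMA_STAT_ITEMS; i++) {
		dst[i] += src[i];
		src[i] = 0;
	}
}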