Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.
From: Jiang Liu
Organization: Intel
To: Tang Chen, Tejun Heo
Cc: mingo@redhat.com, akpm@linux-foundation.org, rjw@rjwysocki.net,
	hpa@zytor.com, laijs@cn.fujitsu.com, yasu.isimatu@gmail.com,
	isimatu.yasuaki@jp.fujitsu.com, kamezawa.hiroyu@jp.fujitsu.com,
	izumi.taku@jp.fujitsu.com, gongzhaogang@inspur.com,
	qiaonuohan@cn.fujitsu.com, x86@kernel.org, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Tue, 4 Aug 2015 16:05:47 +0800
Message-ID: <55C0725B.80201@linux.intel.com>
In-Reply-To: <55C03332.2030808@cn.fujitsu.com>

On 2015/8/4 11:36, Tang Chen wrote:
> Hi TJ,
>
> Sorry for the late reply.
>
> On 07/16/2015 05:48 AM, Tejun Heo wrote:
>>> ......
>>> so in the initialization phase it makes no sense any more. The best
>>> near online node for each cpu should be cached somewhere.
>>
>> I'm not really following.  Is this because the now offline node can
>> later come online and we'd have to break the constant mapping
>> invariant if we update the mapping later?  If so, it'd be nice to
>> spell that out.
>
> Yes. Will document this in the next version.
>
>>> ......
>>> +int get_near_online_node(int node)
>>> +{
>>> +	return per_cpu(x86_cpu_to_near_online_node,
>>> +		       cpumask_first(&node_to_cpuid_mask_map[node]));
>>> +}
>>> +EXPORT_SYMBOL(get_near_online_node);
>>
>> Umm... this function is sitting on a fairly hot path and scanning a
>> cpumask each time.  Why not just build a numa node -> numa node
>> array?
>
> Indeed. Will avoid scanning a cpumask each time.
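> Roughly something like the sketch below, perhaps (untested, and the
> array/helper names here are only illustrative, not the actual patch):
>
> static int node_to_near_online_node[MAX_NUMNODES];
>
> /*
>  * Rebuild the node -> nearest-online-node table. Run at boot and
>  * again on each node online/offline event, so the hot allocation
>  * path only pays for a single array lookup.
>  */
> static void build_near_online_node_map(void)
> {
> 	int nid, onid;
>
> 	for_each_node(nid) {
> 		int best = nid, best_dist = INT_MAX;
>
> 		if (node_online(nid)) {
> 			node_to_near_online_node[nid] = nid;
> 			continue;
> 		}
>
> 		/* Pick the closest online node by SLIT distance. */
> 		for_each_online_node(onid) {
> 			int dist = node_distance(nid, onid);
>
> 			if (dist < best_dist) {
> 				best_dist = dist;
> 				best = onid;
> 			}
> 		}
> 		node_to_near_online_node[nid] = best;
> 	}
> }
>
> int get_near_online_node(int node)
> {
> 	return node_to_near_online_node[node];
> }
>
> Then get_near_online_node() is just an array read on the hot path, and
> only the rare hotplug path pays the O(nodes^2) rebuild cost.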
>>> ......
>>>  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>>>  					unsigned int order)
>>>  {
>>> -	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>> +	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>> +
>>> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
>>> +	if (!node_online(nid))
>>> +		nid = get_near_online_node(nid);
>>> +#endif
>>>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>>  }
>>
>> Ditto.  Also, what are the synchronization rules for NUMA node
>> on/offlining?  If you end up updating the mapping later, how would
>> that be synchronized against the above usages?
>
> I think the near online node map should be updated when node
> online/offline happens. But regarding this, I think the current NUMA
> code has a subtle problem.
>
> As you know, firmware info binds a set of CPUs and memory to a node.
> But at boot time, if the node has no memory (a memory-less node), it
> won't be brought online. The CPUs on that node are still available,
> though, and are bound to the nearest online node (here, I mean
> numa_set_node(cpu, node)).
>
> Why does the kernel do this? I think it is to ensure that memory
> allocations through functions like alloc_pages_node() and
> alloc_pages_exact_node() succeed: with these two functions, any CPU
> must be bound to a node that has memory so that allocation can
> succeed.
>
> That means that for a memory-less node at boot time, the CPUs on the
> node are online, but the node itself is not.
>
> It also means that "the node is online" is equivalent to "the node has
> memory". Actually, a lot of code in the kernel relies on this rule.
>
> But:
> 1) cpu_up() will try to online a node without checking whether the
>    node has memory.
> 2) try_offline_node() offlines the CPUs first, and then the memory.
>
> This behavior looks a little weird, or let's say ambiguous. A NUMA
> node consists of CPUs and memory, so if the CPUs are online, the node
> should be online.

Hi Chen,
	I have posted a patch set to enable memoryless nodes on x86 and
will repost it for review. :) Hope it helps to solve this issue.
Thanks!
Gerry

> And also, the main purpose of this patch set is to make the
> cpuid <-> nodeid mapping persistent. After this patch set,
> alloc_pages_node() and alloc_pages_exact_node() won't depend on the
> cpuid <-> nodeid mapping any more. So the node should be online if the
> CPUs on it are online; otherwise, we cannot set up the interfaces for
> those CPUs under /sys.
>
> Unfortunately, since I don't have a machine with a memory-less node, I
> cannot reproduce the problem right now.
>
> How do you think the node online behavior should be changed?
>
> Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/