From: kemi
Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
To: Michal Hocko
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
    Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
    Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
    Linux MM, Linux Kernel
References: <1511848824-18709-1-git-send-email-kemi.wang@intel.com>
    <20171129121740.f6drkbktc43l5ib6@dhcp22.suse.cz>
    <4b840074-cb5f-3c10-d65b-916bc02fb1ee@intel.com>
    <20171130085322.tyys6xbzzvui7ogz@dhcp22.suse.cz>
    <0f039a89-5500-1bf5-c013-d39ba3bf62bd@intel.com>
    <20171130094523.vvcljyfqjpbloe5e@dhcp22.suse.cz>
Message-ID: <9cd6cc9f-252a-3c6f-2f1f-e39d4ec0457b@intel.com>
In-Reply-To: <20171130094523.vvcljyfqjpbloe5e@dhcp22.suse.cz>
Date: Fri, 8 Dec 2017 16:38:46 +0800

On 2017-11-30 17:45, Michal Hocko wrote:
> On Thu 30-11-17 17:32:08, kemi wrote:
> Do not get me wrong. If we want to make per-node stats more optimal,
> then by all means let's do that.
> But having 3 sets of counters is just
> way too much.
>

Hi Michal,

Apologies for the late reply in this thread.

After thinking about how to optimize our per-node stats more gracefully,
I think we could add u64 vm_numa_stat_diff[] to struct per_cpu_nodestat.
That way we keep everything in per-cpu counters and sum them up only when
the NUMA stats are read from /proc or /sys.
What do you think? Thanks.

The motivation for this modification is listed below:

1) Thanks to the 0-day system, a bug was reported against the V1 patch:

[    0.000000] BUG: unable to handle kernel paging request at 0392b000
[    0.000000] IP: __inc_numa_state+0x2a/0x34
[    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[    0.000000] Oops: 0002 [#1] PREEMPT SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-12996-g81611e2 #1
[    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    0.000000] task: cbf56000 task.stack: cbf4e000
[    0.000000] EIP: __inc_numa_state+0x2a/0x34
[    0.000000] EFLAGS: 00210006 CPU: 0
[    0.000000] EAX: 0392b000 EBX: 00000000 ECX: 00000000 EDX: cbef90ef
[    0.000000] ESI: cffdb320 EDI: 00000004 EBP: cbf4fd80 ESP: cbf4fd7c
[    0.000000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    0.000000] CR0: 80050033 CR2: 0392b000 CR3: 0c0a8000 CR4: 000406b0
[    0.000000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    0.000000] DR6: fffe0ff0 DR7: 00000400
[    0.000000] Call Trace:
[    0.000000]  zone_statistics+0x4d/0x5b
[    0.000000]  get_page_from_freelist+0x257/0x993
[    0.000000]  __alloc_pages_nodemask+0x108/0x8c8
[    0.000000]  ? __bitmap_weight+0x38/0x41
[    0.000000]  ? pcpu_next_md_free_region+0xe/0xab
[    0.000000]  ? pcpu_chunk_refresh_hint+0x8b/0xbc
[    0.000000]  ? pcpu_chunk_slot+0x1e/0x24
[    0.000000]  ? pcpu_chunk_relocate+0x15/0x6d
[    0.000000]  ? find_next_bit+0xa/0xd
[    0.000000]  ? cpumask_next+0x15/0x18
[    0.000000]  ? pcpu_alloc+0x399/0x538
[    0.000000]  cache_grow_begin+0x85/0x31c
[    0.000000]  ____cache_alloc+0x147/0x1e0
[    0.000000]  ? debug_smp_processor_id+0x12/0x14
[    0.000000]  kmem_cache_alloc+0x80/0x145
[    0.000000]  create_kmalloc_cache+0x22/0x64
[    0.000000]  kmem_cache_init+0xf9/0x16c
[    0.000000]  start_kernel+0x1d4/0x3d6
[    0.000000]  i386_start_kernel+0x9a/0x9e
[    0.000000]  startup_32_smp+0x15f/0x170

That is because the u64 percpu pointer vm_numa_stat is used before it is
initialized.

[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> +	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> +	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> +	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif

The pointer is dereferenced in
mm_init->kmem_cache_init->create_kmalloc_cache->...->__alloc_pages()
when CONFIG_SLAB/CONFIG_ZONE_DMA is set in kconfig, but vm_numa_stat is
only initialized in setup_per_cpu_pageset, which runs after mm_init.
The proposal above fixes this by making the NUMA stats counters ready
before mm_init is called (start_kernel->build_all_zonelists() can help
with that).

2) Compared to the V1 patch, this modification makes the semantics of the
per-node NUMA stats clearer for review and maintenance.
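
To make the proposal concrete, here is a rough userspace sketch of the idea, not kernel code: each CPU bumps its own u64 slot on the hot path, and the slots are summed only when the stat is read. Everything with a _SIM/_sim suffix, the two function names, and the shortened item list are made up for illustration. The storage is static here only to mirror the point that the counters must be usable before the first page allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes for the simulation; the kernel derives these at boot. */
#define NR_CPUS_SIM  4
#define NR_NODES_SIM 2

/* Shortened item list for illustration; the kernel enum has more entries. */
enum numa_stat_item_sim {
	NUMA_HIT_SIM,
	NUMA_MISS_SIM,
	NUMA_FOREIGN_SIM,
	NR_VM_NUMA_STAT_ITEMS_SIM
};

/* Stand-in for struct per_cpu_nodestat carrying the proposed u64 array. */
struct per_cpu_nodestat_sim {
	uint64_t vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS_SIM];
};

/*
 * Static storage is zero-initialized and valid from the very first call,
 * so no allocation has to happen before the counters are touched.
 */
static struct per_cpu_nodestat_sim nodestat_sim[NR_NODES_SIM][NR_CPUS_SIM];

/* Hot path: touch only the calling CPU's slot, no shared global counter. */
static void inc_numa_state_sim(int node, int cpu, enum numa_stat_item_sim item)
{
	assert(node >= 0 && node < NR_NODES_SIM);
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	nodestat_sim[node][cpu].vm_numa_stat_diff[item]++;
}

/* Slow path (a read of /proc or /sys in the proposal): sum over all CPUs. */
static uint64_t sum_numa_state_sim(int node, enum numa_stat_item_sim item)
{
	uint64_t sum = 0;

	assert(node >= 0 && node < NR_NODES_SIM);
	for (int cpu = 0; cpu < NR_CPUS_SIM; cpu++)
		sum += nodestat_sim[node][cpu].vm_numa_stat_diff[item];
	return sum;
}
```

A caller would invoke inc_numa_state_sim(node, cpu, NUMA_HIT_SIM) per allocation and sum_numa_state_sim(node, NUMA_HIT_SIM) on read. The trade-off is the usual one for per-cpu counters: the update stays cheap and the cost moves to the reader, which is acceptable here because the stats are read rarely.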