Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753110AbbDTCB2 (ORCPT ); Sun, 19 Apr 2015 22:01:28 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:24002 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751544AbbDTCB0 (ORCPT ); Sun, 19 Apr 2015 22:01:26 -0400 X-IronPort-AV: E=Sophos;i="5.04,848,1406563200"; d="scan'208";a="90912811" Message-ID: <55345979.2020502@cn.fujitsu.com> Date: Mon, 20 Apr 2015 09:42:17 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Xishi Qiu CC: Yasuaki Ishimatsu , Andrew Morton , Yasuaki Ishimatsu , Kamezawa Hiroyuki , , Tang Chen , Xiexiuqi , Mel Gorman , David Rientjes , Linux MM , LKML , Subject: Re: [PATCH 1/2 V2] memory-hotplug: fix BUG_ON in move_freepages() References: <5530E578.9070505@huawei.com> <5531679d.4642ec0a.1beb.3569@mx.google.com> In-Reply-To: <5531679d.4642ec0a.1beb.3569@mx.google.com> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.167.226.100] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11147 Lines: 219 Hi Xishi, On 04/18/2015 04:05 AM, Yasuaki Ishimatsu wrote: > > Your patches will fix your issue. > But, if BIOS reports memory first at node hot add, pgdat can > not be initialized. > > Memory hot add flows are as follows: > > add_memory > ... > -> hotadd_new_pgdat() > ... > -> node_set_online(nid) > > When calling hotadd_new_pgdat() for a hot added node, the node is > offline because node_set_online() is not called yet. So if applying > your patches, the pgdat is not initialized in this case. Ishimtasu's worry is reasonable. And I am afraid the fix here is a bit over-kill. > > Thanks, > Yasuaki Ishimatsu > > On Fri, 17 Apr 2015 18:50:32 +0800 > Xishi Qiu wrote: > >> Hot remove nodeXX, then hot add nodeXX. If BIOS report cpu first, it will call >> hotadd_new_pgdat(nid, 0), this will set pgdat->node_start_pfn to 0. As nodeXX >> exists at boot time, so pgdat->node_spanned_pages is the same as original. Then >> free_area_init_core()->memmap_init() will pass a wrong start and a nonzero size. As your analysis said the root cause here is passing a *0* as the node_start_pfn, then the chaos occurred when init the zones. And this only happens to the re-hotadd node, so how about using the saved *node_start_pfn* (via get_pfn_range_for_nid(nid, &start_pfn, &end_pfn)) instead if we find "pgdat->node_start_pfn == 0 && !node_online(XXX)"? Thanks, Gu >> >> free_area_init_core() >> memmap_init() >> memmap_init_zone() >> early_pfn_in_nid() >> set_page_links() >> >> "if (!early_pfn_in_nid(pfn, nid))" will skip the pfn(memory in section), but it >> will not skip the pfn(hole in section), this will cover and relink the page to >> zone/nid, so page_zone() from memory and hole in the same section are different. >> The following call trace shows the bug. >> >> This patch will set the node size to 0 when hotadd a new node(original or new). >> init_currently_empty_zone() and memmap_init() will be called in add_zone(), so >> need not to change it. >> >> [90476.077469] kernel BUG at mm/page_alloc.c:1042! // move_freepages() -> BUG_ON(page_zone(start_page) != page_zone(end_page)); >> [90476.077469] invalid opcode: 0000 [#1] SMP >> [90476.077469] Modules linked in: iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack fuse btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables cfg80211 rfkill sg iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr igb vfat i2c_algo_bit dca fat sb_edac edac_core i2c_i801 lpc_ich i2c_core mfd_core shpchp acpi_pad ipmi_si ipmi_msghandler uinput nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod crc_t10dif crct10dif_common ahci libahci megaraid_sas tg3 ptp libata pps_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: rasf] >> [90476.157382] CPU: 2 PID: 322803 Comm: updatedb Tainted: GF W O-------------- 3.10.0-229.1.2.5.hulk.rc14.x86_64 #1 >> [90476.157382] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 04/13/2015 >> [90476.157382] task: ffff88006a6d5b00 ti: ffff880068eb8000 task.ti: ffff880068eb8000 >> [90476.157382] RIP: 0010:[] [] move_freepages+0x12f/0x140 >> [90476.157382] RSP: 0018:ffff880068ebb640 EFLAGS: 00010002 >> [90476.157382] RAX: ffff880002316cc0 RBX: ffffea0001bd0000 RCX: 0000000000000001 >> [90476.157382] RDX: ffff880002476e40 RSI: 0000000000000000 RDI: ffff880002316cc0 >> [90476.157382] RBP: ffff880068ebb690 R08: 0000000000100000 R09: ffffea0001bd7fc0 >> [90476.157382] R10: 000000000006f5ff R11: 0000000000000000 R12: 0000000000000001 >> [90476.157382] R13: 0000000000000003 R14: ffff880002316eb8 R15: ffffea0001bd7fc0 >> [90476.157382] FS: 00007f4d3ab95740(0000) GS:ffff880033a00000(0000) knlGS:0000000000000000 >> [90476.157382] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [90476.157382] CR2: 00007f4d3ae1a808 CR3: 000000018907a000 CR4: 00000000001407e0 >> [90476.157382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [90476.157382] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> [90476.157382] Stack: >> [90476.157382] ffff880068ebb698 ffff880002316cc0 ffffa800b5378098 ffff880068ebb698 >> [90476.157382] ffffffff810b11dc ffff880002316cc0 0000000000000001 0000000000000003 >> [90476.157382] ffff880002316eb8 ffffea0001bd6420 ffff880068ebb6a0 ffffffff8115a003 >> [90476.157382] Call Trace: >> [90476.157382] [] ? update_curr+0xcc/0x150 >> [90476.157382] [] move_freepages_block+0x73/0x80 >> [90476.157382] [] __rmqueue+0x26a/0x460 >> [90476.157382] [] ? native_sched_clock+0x13/0x80 >> [90476.157382] [] get_page_from_freelist+0x7f2/0xd30 >> [90476.157382] [] ? __switch_to+0x179/0x4a0 >> [90476.157382] [] ? xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs] >> [90476.157382] [] __alloc_pages_nodemask+0x1c1/0xc90 >> [90476.157382] [] ? _xfs_buf_ioapply+0x31c/0x420 [xfs] >> [90476.157382] [] ? down_trylock+0x2d/0x40 >> [90476.157382] [] ? xfs_buf_trylock+0x1f/0x80 [xfs] >> [90476.157382] [] alloc_pages_current+0xa9/0x170 >> [90476.157382] [] new_slab+0x275/0x300 >> [90476.157382] [] __slab_alloc+0x315/0x48f >> [90476.157382] [] ? kmem_zone_alloc+0x77/0x100 [xfs] >> [90476.157382] [] ? xfs_bmap_search_extents+0x5c/0xc0 [xfs] >> [90476.157382] [] kmem_cache_alloc+0x193/0x1d0 >> [90476.157382] [] ? kmem_zone_alloc+0x77/0x100 [xfs] >> [90476.157382] [] kmem_zone_alloc+0x77/0x100 [xfs] >> [90476.157382] [] xfs_inode_alloc+0x25/0x250 [xfs] >> [90476.157382] [] xfs_iget+0x219/0x680 [xfs] >> [90476.157382] [] xfs_lookup+0xf6/0x120 [xfs] >> [90476.157382] [] xfs_vn_lookup+0x7b/0xd0 [xfs] >> [90476.157382] [] lookup_real+0x1d/0x50 >> [90476.157382] [] __lookup_hash+0x42/0x60 >> [90476.157382] [] lookup_slow+0x42/0xa7 >> [90476.157382] [] path_lookupat+0x76b/0x7a0 >> [90476.157382] [] ? do_last+0x635/0x1260 >> [90476.157382] [] ? kmem_cache_alloc+0x35/0x1d0 >> [90476.157382] [] ? getname_flags+0x4f/0x190 >> [90476.157382] [] filename_lookup+0x2b/0xc0 >> [90476.157382] [] user_path_at_empty+0x67/0xc0 >> [90476.157382] [] ? from_kgid_munged+0x12/0x20 >> [90476.157382] [] ? cp_new_stat+0x14f/0x180 >> [90476.157382] [] user_path_at+0x11/0x20 >> [90476.157382] [] vfs_fstatat+0x63/0xc0 >> [90476.157382] [] SYSC_newlstat+0x31/0x60 >> [90476.157382] [] ? __audit_syscall_exit+0x1f6/0x2a0 >> [90476.157382] [] SyS_newlstat+0xe/0x10 >> [90476.157382] [] system_call_fastpath+0x16/0x1b >> [90476.157382] Code: d0 8b 45 c4 48 c1 e2 06 48 01 d3 01 c8 49 39 df 0f 83 7b ff ff ff 66 0f 1f 44 00 00 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 >> [90476.157382] RIP [] move_freepages+0x12f/0x140 >> [90476.157382] RSP >> [90476.157382] ---[ end trace 58557f791c6d66d4 ]--- >> [90476.157382] Kernel panic - not syncing: Fatal exception >> >> log and analyse: >> ... >> [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x2000ffffffff] >> [ 0.000000] NODE_DATA [mem 0x02312400-0x023393ff] // node0 >> ... >> [ 0.000000] Initmem setup node 9 [mem 0x30000000000-0x3ffffffffff] >> [ 0.000000] NODE_DATA [mem 0x02471400-0x024983ff] // node9 >> ... >> [ 0.000000] node 0: [mem 0x5b880000-0x5baf2fff] >> [ 0.000000] node 0: [mem 0x61382000-0x6148ffff] >> [ 0.000000] node 0: [mem 0x61a90000-0x6f39bfff] // 1562.56 - 1779.61 >> [ 0.000000] node 0: [mem 0x6f51c000-0x6f7c9fff] // 1781.11 - 1783.79 >> [ 0.000000] node 0: [mem 0x6fb1c000-0x6fb1cfff] >> [ 0.000000] node 0: [mem 0x6fba3000-0x6fffffff] >> ... >> >> start_page = ffffea0001bd0000 -> pfn=0x6F400 -> 1780M, in node0, right! >> end_page = ffffea0001bd7fc0 -> pfn=0x6F5FF -> 1782M-4kb, in node0, right! >> page_zone(start_page) = ffff880002476e40, in node9, wrong! >> page_zone(end_page) = ffff880002316cc0, in node0, right! >> >> Signed-off-by: Xishi Qiu >> --- >> mm/page_alloc.c | 14 ++++++++++++++ >> 1 files changed, 14 insertions(+), 0 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index ebffa0e..f74398c 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -4667,6 +4667,10 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid, >> { >> unsigned long zone_start_pfn, zone_end_pfn; >> >> + /* When hotadd a new node, init node's zones as empty zones */ >> + if (!node_online(nid)) >> + return 0; >> + >> /* Get the start and end of the zone */ >> zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type]; >> zone_end_pfn = arch_zone_highest_possible_pfn[zone_type]; >> @@ -4730,6 +4734,10 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid, >> unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type]; >> unsigned long zone_start_pfn, zone_end_pfn; >> >> + /* When hotadd a new node, init node's zones as empty zones */ >> + if (!node_online(nid)) >> + return 0; >> + >> zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); >> zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); >> >> @@ -4746,6 +4754,9 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid, >> unsigned long node_end_pfn, >> unsigned long *zones_size) >> { >> + if (!node_online(nid)) >> + return 0; >> + >> return zones_size[zone_type]; >> } >> >> @@ -4755,6 +4766,9 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid, >> unsigned long node_end_pfn, >> unsigned long *zholes_size) >> { >> + if (!node_online(nid)) >> + return 0; >> + >> if (!zholes_size) >> return 0; >> >> -- >> 1.7.1 >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > . > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/