Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751859AbdFIJvu (ORCPT ); Fri, 9 Jun 2017 05:51:50 -0400 Received: from mail-oi0-f68.google.com ([209.85.218.68]:36521 "EHLO mail-oi0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751677AbdFIJvq (ORCPT ); Fri, 9 Jun 2017 05:51:46 -0400 MIME-Version: 1.0 In-Reply-To: <20170515085827.16474-1-mhocko@kernel.org> References: <20170515085827.16474-1-mhocko@kernel.org> From: Wei Yang Date: Fri, 9 Jun 2017 17:51:24 +0800 Message-ID: Subject: Re: [PATCH -v4 0/14] mm: make movable onlining suck less To: Michal Hocko Cc: Andrew Morton , Linux-MM , Mel Gorman , Vlastimil Babka , Andrea Arcangeli , Jerome Glisse , Reza Arbab , Yasuaki Ishimatsu , qiuxishi@huawei.com, Kani Toshimitsu , slaoub@gmail.com, Joonsoo Kim , Andi Kleen , David Rientjes , Daniel Kiper , Igor Mammedov , Vitaly Kuznetsov , LKML , Balbir Singh , Dan Williams , Heiko Carstens , Martin Schwidefsky , Michal Hocko , Tobias Regnery , Yasuaki Ishimatsu Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12167 Lines: 297 Hi, Michal I am not that familiar with hotplug and trying to catch up the issue and your solution. One potential issue I found is we don't check the physical boundary when add_memory_resource(). For example, on x86-64, only 64T physical memory is supported currently. Looks it is expanded after 5-level pagetable is introduced. While there is still some limitations on this. But we don't check the boundary I think. During the bootup, this is ensured by the max_pfn which is guaranteed to be under MAX_ARCH_PFN. I don't see some limitation on this when doing hotplug. Here is my very simple check on this, while it seems not every arch has MAX_ARCH_PFN. Willing to hear from you. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6fa7208bcd56..27541566f9ac 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1355,7 +1355,12 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online) int ret; start = res->start; + if (start > (MAX_ARCH_PFN << PAGE_SHIFT)) + return -EINVAL; + size = resource_size(res); + if ((start + size) > (MAX_ARCH_PFN << PAGE_SHIFT)) + size = (MAX_ARCH_PFN << PAGE_SHIFT) - start; ret = check_hotplug_memory_range(start, size); if (ret) On Mon, May 15, 2017 at 4:58 PM, Michal Hocko wrote: > Hi, > The last version of this series has been posted here [1]. The timing wasn't > all that great so this is mostly a resubmit. I've added one additional patch > to fix another pfn walker noticed by Joonsoo (this is patch 11) and also > added a clarification for pfn_valid() and offline pages. > > There is still a lot of work on top - namely this implementation doesn't > support reonlining to a different zone on the zones boundaries but I > will do that in a separate series because this one is getting quite > large already and it should work reasonably well now. > > Joonsoo had some worries about pfn_valid and suggested to change its > semantic to return false on offline holes but I would be rally worried > to change a established semantic used by a lot of code and so I have > introuduced pfn_to_online_page helper instead. If this is seen as a > controversial point I would rather drop pfn_to_online_page and related > patches as they are not stictly necessary because the code would be > similarly broken as now wrt. offline holes. > > This is a rebase on top of the current mmotm tree (mmotm-2017-05-12-15-53) > and the full series is in > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git try > attempts/rewrite-mem_hotplug branch. > > Motivation: > Movable onlining is a real hack with many downsides - mainly > reintroduction of lowmem/highmem issues we used to have on 32b systems - > but it is the only way to make the memory hotremove more reliable which > is something that people are asking for. > > The current semantic of memory movable onlinening is really cumbersome, > however. The main reason for this is that the udev driven approach is > basically unusable because udev races with the memory probing while only > the last memory block or the one adjacent to the existing zone_movable > are allowed to be onlined movable. In short the criterion for the > successful online_movable changes under udev's feet. A reliable udev > approach would require a 2 phase approach where the first successful > movable online would have to check all the previous blocks and online > them in descending order. This is hard to be considered sane. > > This patchset aims at making the onlining semantic more usable. First of > all it allows to online memory movable as long as it doesn't clash with > the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE > cannot overlap. Currently I preserve the original ordering semantic so > the zone always precedes the movable zone but I have plans to remove this > restriction in future because it is not really necessary. > > First 3 patches are cleanups which should be ready to be merged right > away (unless I have missed something subtle of course). > > Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path. > > Patch 5 deals with implicit assumptions of register_one_node on pgdat > initialization. > > Patches 6-10 deal with offline holes in the zone for pfn walkers. I > hope I got all of them right but people familiar with compaction should > double check this. > > Patch 11 is the core of the change. In order to make it easier to review > I have tried it to be as minimalistic as possible and the large code > removal is moved to patch 14. > > Patch 12 is a trivial follow up cleanup. Patch 13 fixes sparse warnings > and finally patch 14 removes the unused code. > > I have tested the patches in kvm: > # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ... > > and then probed the additional memory by > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > > Then I have used this simple script to probe the memory block by hand > # cat probe_memblock.sh > #!/bin/sh > > BLOCK_NR=$1 > > # echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe > > # for i in $(seq 10); do sh probe_memblock.sh $i; done > # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > /sys/devices/system/memory/memory34/valid_zones:Normal Movable > /sys/devices/system/memory/memory35/valid_zones:Normal Movable > /sys/devices/system/memory/memory36/valid_zones:Normal Movable > /sys/devices/system/memory/memory37/valid_zones:Normal Movable > /sys/devices/system/memory/memory38/valid_zones:Normal Movable > /sys/devices/system/memory/memory39/valid_zones:Normal Movable > > The main difference to the original implementation is that all new > memblocks can be both online_kernel and online_movable initially > because there is no clash obviously. For the comparison the original > implementation would have > > /sys/devices/system/memory/memory33/valid_zones:Normal > /sys/devices/system/memory/memory34/valid_zones:Normal > /sys/devices/system/memory/memory35/valid_zones:Normal > /sys/devices/system/memory/memory36/valid_zones:Normal > /sys/devices/system/memory/memory37/valid_zones:Normal > /sys/devices/system/memory/memory38/valid_zones:Normal > /sys/devices/system/memory/memory39/valid_zones:Normal Movable > > Now > # echo online_movable > /sys/devices/system/memory/memory34/state > # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > /sys/devices/system/memory/memory34/valid_zones:Movable > /sys/devices/system/memory/memory35/valid_zones:Movable > /sys/devices/system/memory/memory36/valid_zones:Movable > /sys/devices/system/memory/memory37/valid_zones:Movable > /sys/devices/system/memory/memory38/valid_zones:Movable > /sys/devices/system/memory/memory39/valid_zones:Movable > > Block 33 can still be online both kernel and movable while all > the remaining can be only movable. > /proc/zonelist says > Node 0, zone Normal > pages free 0 > min 0 > low 0 > high 0 > spanned 0 > present 0 > -- > Node 0, zone Movable > pages free 32753 > min 85 > low 117 > high 149 > spanned 32768 > present 32768 > > A new memblock at a lower address will result in a new memblock (32) > which will still allow both Normal and Movable. > # sh probe_memblock.sh 0 > # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null > /sys/devices/system/memory/memory32/valid_zones:Normal Movable > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > /sys/devices/system/memory/memory34/valid_zones:Movable > /sys/devices/system/memory/memory35/valid_zones:Movable > > and online_kernel will convert it to the zone normal properly > while 33 can be still onlined both ways. > # echo online_kernel > /sys/devices/system/memory/memory32/state > # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null > /sys/devices/system/memory/memory32/valid_zones:Normal > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > /sys/devices/system/memory/memory34/valid_zones:Movable > /sys/devices/system/memory/memory35/valid_zones:Movable > > /proc/zoneinfo will now tell > Node 0, zone Normal > pages free 65441 > min 165 > low 230 > high 295 > spanned 65536 > present 65536 > -- > Node 0, zone Movable > pages free 32740 > min 82 > low 114 > high 146 > spanned 32768 > present 32768 > > so both zones have one memblock spanned and present. > > Onlining 39 should associate this block to the movable zone > # echo online > /sys/devices/system/memory/memory39/state > > /proc/zoneinfo will now tell > Node 0, zone Normal > pages free 32765 > min 80 > low 112 > high 144 > spanned 32768 > present 32768 > -- > Node 0, zone Movable > pages free 65501 > min 160 > low 225 > high 290 > spanned 196608 > present 65536 > > so we will have a movable zone which spans 6 memblocks, 2 present and 4 > representing a hole. > > Offlining both movable blocks will lead to the zone with no present > pages which is the expected behavior I believe. > # echo offline > /sys/devices/system/memory/memory39/state > # echo offline > /sys/devices/system/memory/memory34/state > # grep -A6 "Movable\|Normal" /proc/zoneinfo > Node 0, zone Normal > pages free 32735 > min 90 > low 122 > high 154 > spanned 32768 > present 32768 > -- > Node 0, zone Movable > pages free 0 > min 0 > low 0 > high 0 > spanned 196608 > present 0 > > Any thoughts, complains, suggestions? > > As a bonus we will get a nice cleanup in the memory hotplug codebase. > arch/ia64/mm/init.c | 11 +- > arch/powerpc/mm/mem.c | 12 +- > arch/s390/mm/init.c | 32 +-- > arch/sh/mm/init.c | 10 +- > arch/x86/mm/init_32.c | 7 +- > arch/x86/mm/init_64.c | 11 +- > drivers/base/memory.c | 79 +++---- > drivers/base/node.c | 58 ++---- > include/linux/memory_hotplug.h | 40 +++- > include/linux/mmzone.h | 57 +++++- > include/linux/node.h | 35 +++- > kernel/memremap.c | 6 +- > mm/compaction.c | 5 +- > mm/memory_hotplug.c | 455 ++++++++++++++--------------------------- > mm/page_alloc.c | 13 +- > mm/page_isolation.c | 26 ++- > mm/sparse.c | 48 ++++- > mm/vmstat.c | 4 +- > 18 files changed, 417 insertions(+), 492 deletions(-) > > Shortlog says: > Michal Hocko (14): > mm: remove return value from init_currently_empty_zone > mm, memory_hotplug: use node instead of zone in can_online_high_movable > mm: drop page_initialized check from get_nid_for_pfn > mm, memory_hotplug: get rid of is_zone_device_section > mm, memory_hotplug: split up register_one_node > mm, memory_hotplug: consider offline memblocks removable > mm: consider zone which is not fully populated to have holes > mm, compaction: skip over holes in __reset_isolation_suitable > mm: __first_valid_page skip over offline pages > mm, vmstat: skip reporting offline pages in pagetypeinfo > mm, memory_hotplug: do not associate hotadded memory to zones until online > mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory > mm, memory_hotplug: fix the section mismatch warning > mm, memory_hotplug: remove unused cruft after memory hotplug rework > > [1] http://lkml.kernel.org/r/20170421120512.23960-1-mhocko@kernel.org > >