Date: Fri, 7 Jun 2019 10:33:58 +0200
From: Oscar Salvador
To: Dan Williams
Cc: akpm@linux-foundation.org, Michal Hocko, Vlastimil Babka,
    Logan Gunthorpe, Pavel Tatashin, linux-mm@kvack.org,
    linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug
Message-ID: <20190607083351.GA5342@linux>
In-Reply-To: <155977192280.2443951.13941265207662462739.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <155977186863.2443951.9036044808311959913.stgit@dwillia2-desk3.amr.corp.intel.com>
 <155977192280.2443951.13941265207662462739.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Wed, Jun 05, 2019 at 02:58:42PM -0700, Dan Williams wrote:
> The
libnvdimm sub-system has suffered a series of hacks and broken
> workarounds for the memory-hotplug implementation's awkward
> section-aligned (128MB) granularity. For example the following backtrace
> is emitted when attempting arch_add_memory() with physical address
> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> within a given section:
>
> WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> [..]
> Call Trace:
>  dump_stack+0x86/0xc3
>  __warn+0xcb/0xf0
>  warn_slowpath_fmt+0x5f/0x80
>  devm_memremap_pages+0x3b5/0x4c0
>  __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
>  pmem_attach_disk+0x19a/0x440 [nd_pmem]
>
> Recently it was discovered that the problem goes beyond RAM vs PMEM
> collisions as some platform produce PMEM vs PMEM collisions within a
> given section. The libnvdimm workaround for that case revealed that the
> libnvdimm section-alignment-padding implementation has been broken for a
> long while. A fix for that long-standing breakage introduces as many
> problems as it solves as it would require a backward-incompatible change
> to the namespace metadata interpretation. Instead of that dubious route
> [1], address the root problem in the memory-hotplug implementation.
>
> [1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
> Cc: Michal Hocko
> Cc: Vlastimil Babka
> Cc: Logan Gunthorpe
> Cc: Oscar Salvador
> Cc: Pavel Tatashin
> Signed-off-by: Dan Williams
> ---
>  include/linux/memory_hotplug.h |    2
>  mm/memory_hotplug.c            |    7 -
>  mm/page_alloc.c                |    2
>  mm/sparse.c                    |  225 +++++++++++++++++++++++++++-------------
>  4 files changed, 155 insertions(+), 81 deletions(-)
>
[...]
> @@ -325,6 +332,15 @@ static void __meminit sparse_init_one_section(struct mem_section *ms,
>  		unsigned long pnum, struct page *mem_map,
>  		struct mem_section_usage *usage)
>  {
> +	/*
> +	 * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug,
> +	 * ->section_mem_map can not be guaranteed to point to a full
> +	 * section's worth of memory.  The field is only valid / used
> +	 * in the SPARSEMEM_VMEMMAP=n case.
> +	 */
> +	if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> +		mem_map = NULL;

Will this be a problem when reading mem_map with the crash tool? I do not
expect it to be, but I am not sure whether crash internally tries to read
ms->section_mem_map and do some sort of translation. And since
ms->section_mem_map has SECTION_HAS_MEM_MAP set, it might be that it
expects a valid mem_map?

> +static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> +		struct vmem_altmap *altmap)
> +{
> +	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> +	DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
> +	struct mem_section *ms = __pfn_to_section(pfn);
> +	bool early_section = is_early_section(ms);
> +	struct page *memmap = NULL;
> +	unsigned long *subsection_map = ms->usage
> +		? &ms->usage->subsection_map[0] : NULL;
> +
> +	subsection_mask_set(map, pfn, nr_pages);
> +	if (subsection_map)
> +		bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
> +
> +	if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION),
> +			"section already deactivated (%#lx + %ld)\n",
> +			pfn, nr_pages))
> +		return;
> +
> +	/*
> +	 * There are 3 cases to handle across two configurations
> +	 * (SPARSEMEM_VMEMMAP={y,n}):
> +	 *
> +	 * 1/ deactivation of a partial hot-added section (only possible
> +	 *    in the SPARSEMEM_VMEMMAP=y case).
> +	 *    a/ section was present at memory init
> +	 *    b/ section was hot-added post memory init
> +	 * 2/ deactivation of a complete hot-added section
> +	 * 3/ deactivation of a complete section from memory init
> +	 *
> +	 * For 1/, when subsection_map does not empty we will not be
> +	 * freeing the usage map, but still need to free the vmemmap
> +	 * range.
> +	 *
> +	 * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified
> +	 */
> +	bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
> +	if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
> +		unsigned long section_nr = pfn_to_section_nr(pfn);
> +
> +		if (!early_section) {
> +			kfree(ms->usage);
> +			ms->usage = NULL;
> +		}
> +		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +		ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr);
> +	}
> +
> +	if (early_section && memmap)
> +		free_map_bootmem(memmap);
> +	else
> +		depopulate_section_memmap(pfn, nr_pages, altmap);
> +}
> +
> +static struct page * __meminit section_activate(int nid, unsigned long pfn,
> +		unsigned long nr_pages, struct vmem_altmap *altmap)
> +{
> +	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> +	struct mem_section *ms = __pfn_to_section(pfn);
> +	struct mem_section_usage *usage = NULL;
> +	unsigned long *subsection_map;
> +	struct page *memmap;
> +	int rc = 0;
> +
> +	subsection_mask_set(map, pfn, nr_pages);
> +
> +	if (!ms->usage) {
> +		usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
> +		if (!usage)
> +			return ERR_PTR(-ENOMEM);
> +		ms->usage = usage;
> +	}
> +	subsection_map = &ms->usage->subsection_map[0];
> +
> +	if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
> +		rc = -EINVAL;
> +	else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
> +		rc = -EEXIST;
> +	else
> +		bitmap_or(subsection_map, map, subsection_map,
> +				SUBSECTIONS_PER_SECTION);
> +
> +	if (rc) {
> +		if (usage)
> +			ms->usage = NULL;
> +		kfree(usage);
> +		return ERR_PTR(rc);
> +	}

We should not
be looking at subsection_map at all when running with
CONFIG_SPARSEMEM_VMEMMAP=n, right? Would it make sense to hide the bitmap
dance behind if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))? Sorry for nagging
here.

> /**
> - * sparse_add_one_section - add a memory section
> + * sparse_add_section - add a memory section, or populate an existing one
>  * @nid: The node to add section on
>  * @start_pfn: start pfn of the memory range
> + * @nr_pages: number of pfns to add in the section
>  * @altmap: device page map
>  *
>  * This is only intended for hotplug.

Below this, the return codes are specified:

---
 * Return:
 * * 0		- On success.
 * * -EEXIST	- Section has been present.
 * * -ENOMEM	- Out of memory.
 */
---

We can get rid of -EEXIST since we do not return that anymore.

-- 
Oscar Salvador
SUSE L3