Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat GmbH
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexander Potapenko, Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He, Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams, Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara, Jérôme Glisse, Joonsoo Kim, Juergen Gross, Kate Stewart, "Kirill A. Shutemov", Matthew Wilcox, Mel Gorman, Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras, Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab, Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka, kexec@lists.infradead.org
Date: Fri, 25 May 2018 17:08:37 +0200
Message-ID: <216ca71b-9880-f013-878b-ae39e865b94b@redhat.com>
In-Reply-To: <20180524120341.GF20441@dhcp22.suse.cz>
References: <20180523151151.6730-1-david@redhat.com> <20180524075327.GU20441@dhcp22.suse.cz> <14d79dad-ad47-f090-2ec0-c5daf87ac529@redhat.com> <20180524093121.GZ20441@dhcp22.suse.cz> <20180524120341.GF20441@dhcp22.suse.cz>

>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes.
>>>> In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumpfile/kdump/uevents and I'll follow up with a detailed
>> description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing, as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
>
> Please make sure to describe what the architecture is then. I have no
> idea what kdump.service is supposed to do, for example.

Here is a high-level description, going into detail where applicable:

Dump tools always generate the dump file from /proc/vmcore inside the
kexec environment. This is a vmcore dump in ELF format, with required
and optional headers and notes.

1. Core collectors

The tool that writes /proc/vmcore into a file is called the "core
collector". "This allows you to specify the command to copy the vmcore.
You could use the dump filtering program makedumpfile, the default one,
to retrieve your core, which on some arches can drastically reduce core
file size." [1]

Under RHEL, for example, the only supported core collector is in fact
makedumpfile [2][3], which is among other things able to exclude
hwpoison pages - reading those could otherwise result in a crash if you
simply copied /proc/vmcore into a file on the hard disk.

2. vmcoreinfo

/proc/vmcore can optionally contain vmcoreinfo data, which exposes some
magic variables necessary to e.g. find and interpret segments, but also
struct pages. It is generated in "kernel/crash_core.c" in the crashed
Linux kernel:

...
VMCOREINFO_SYMBOL_ARRAY(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
...
VMCOREINFO_NUMBER(PG_hwpoison);
...
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
...

If it is not available, tools will e.g. try to extract the relevant
symbols/variables/pointers from vmlinux instead (similar to what e.g.
GDB does).

3. PT_LOAD / memory holes

Each vmcore contains "PT_LOAD" program headers. These define which
physical memory areas are available in the vmcore (and to which virtual
addresses they translate). They are generated e.g. in
"kernel/kexec_file.c" - and in some other places ("git grep
Elf64_Phdr"). This information is generated by the crashed kernel.

arch/x86/kernel/crash.c: walk_system_ram_res() is effectively used to
generate the PT_LOAD segments
arch/s390/kernel/crash_dump.c: for_each_mem_range() is effectively used
to generate the PT_LOAD information

At this point, I don't see how offline sections are treated; I assume
they are simply always included. So PT_LOAD will cover all memory, no
matter if online or offline.

4. Reloading kexec / kdump.service

The important thing is that the vmcore *excluding* the actual memory
has to be prepared by the *old* kernel. The kexec kernel only allows to
- read the prepared vmcore (handed over to the kexec kernel)
- read the old memory

So dump tools only have the vmcore (esp. its PT_LOAD headers) to figure
out which physical memory was available in the *old* system. The kexec
kernel neither reads nor interprets segments/struct pages of the old
kernel (and there would be no way to really do so). All it does is
allow reading old memory as described by the prepared vmcore. If that
memory is not accessible or broken (hwpoison), we crash the system.

So what does this imply? The vmcore (including its PT_LOAD headers) has
to be regenerated every time memory is added to or removed from the
system. Otherwise, the data contained in the prepared vmcore is stale.
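To make this concrete, here is a minimal sketch (illustrative only, not
how makedumpfile is actually implemented; it assumes a 64-bit ELF
vmcore) of what the dump side gets to see: the PT_LOAD ranges of
/proc/vmcore are its *only* knowledge about the old system's physical
memory.

  /* Sketch: list the old physical memory ranges advertised in
   * /proc/vmcore. Run inside the kdump environment. */
  #include <elf.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          FILE *f = fopen("/proc/vmcore", "r");
          Elf64_Ehdr eh;
          Elf64_Phdr ph;
          int i;

          if (!f || fread(&eh, sizeof(eh), 1, f) != 1 ||
              memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
                  perror("/proc/vmcore");
                  return 1;
          }
          for (i = 0; i < eh.e_phnum; i++) {
                  if (fseek(f, (long)(eh.e_phoff + i * sizeof(ph)), SEEK_SET) ||
                      fread(&ph, sizeof(ph), 1, f) != 1)
                          break;
                  if (ph.p_type == PT_LOAD) /* an old physical memory range */
                          printf("PT_LOAD: paddr 0x%llx size 0x%llx\n",
                                 (unsigned long long)ph.p_paddr,
                                 (unsigned long long)ph.p_memsz);
          }
          fclose(f);
          return 0;
  }

Anything outside these ranges simply does not exist as far as the dump
side is concerned.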
As far as I understand, this regeneration cannot be done by the kernel
itself when adding/removing memory; it has to be done by user space.
The same applies e.g. to hot(un)plugging CPUs. It is done by reloading
kexec, resulting in a regeneration of the vmcore.

Udev events are used to restart kdump.service and therefore trigger the
regeneration. These events fire when onlining/offlining a memory block:

...
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump.service"
...

For "online", this is the right thing to do. I am right now not 100%
sure it is the right thing to do for "offline"; I guess we should
actually regenerate after "remove" events, but I didn't follow all the
details. Otherwise it could happen that the vmcore is regenerated
before the actual removal of the memory blocks, so the applicable
memory blocks would still be covered by PT_LOAD headers in the vmcore.
If we then remove the actual DIMM, trying to dump the vmcore will
result in reading invalid memory. But maybe I am missing something
there.

5. Access to vmcore / memory in the kexec environment

fs/proc/vmcore.c contains the code for parsing, in the kexec kernel,
the vmcore prepared by the crashed kernel. The kexec kernel provides
read access to /proc/vmcore on this basis. All PT_LOAD headers are
converted and stored in "vmcore_list". When reading the vmcore, this
list is used to actually provide access to the original crash memory
(__read_vmcore()). So only memory that was originally covered by a
PT_LOAD header of the vmcore is allowed to be read;
read_from_oldmem() performs the actual read. At that point we have no
control over old page flags or segments - just a straight memory read.

There is special handling for e.g. Xen in there: pfn_is_ram() can be
used to prevent reading inflated memory (register_oldmem_pfn_is_ram()).
However, reusing that for virtio-mem, with multiple devices and queues
and such, might not be possible. It is the last resort :)

6. makedumpfile

makedumpfile can exclude free (buddy) pages, hwpoison pages and some
more. It will *not* exclude reserved pages or pages inflated by a
balloon (e.g. virtio-balloon). So it will read inflated pages and, if
they are zero, save a compressed zero page - but either way it will
(read-)access that memory.

makedumpfile was adapted to the new SECTION_IS_ONLINE bit (to mask the
right section address), but offline sections are *not* excluded. So all
memory in offline sections will also be accessed and dumped - unless
pages don't fall into any PT_LOAD range (a "memory hole"), in which
case they are not accessed.

7. Further information

Some more details can be found in "Documentation/kdump/kdump.txt":

"All of the necessary information about the system kernel's core image
is encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter."

-> I am pretty sure this is why the kexec reload from user space is
necessary.

"For s390x there are two kdump modes: If a ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
is done on all other architectures. If no elfcorehdr= kernel parameter
is specified, the s390x kdump kernel dynamically creates the header.
The second mode has the advantage that for CPU and memory hotplug,
kdump has not to be reloaded with kexec_load()."
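Coming back to the register_oldmem_pfn_is_ram() hook mentioned in 5.,
here is a rough sketch of what the "last resort" could look like for a
driver, along the lines of what Xen does today. my_pfn_was_plugged() is
a purely hypothetical helper: a real driver would somehow have to know
which ranges were plugged in the *old* kernel, which is exactly the
hard part.

  /* Sketch of a module in the kdump kernel vetoing reads of
   * individual pfns of the old kernel. */
  #include <linux/module.h>
  #include <linux/crash_dump.h>

  static int my_pfn_is_ram(unsigned long pfn)
  {
          /* 1 = backed by RAM, safe to read;
           * 0 = unplugged, /proc/vmcore returns zeroes instead */
          return my_pfn_was_plugged(pfn); /* hypothetical helper */
  }

  static int __init my_init(void)
  {
          return register_oldmem_pfn_is_ram(&my_pfn_is_ram);
  }
  module_init(my_init);
  MODULE_LICENSE("GPL");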
Any experts, please jump in :)

[1] https://www.systutorials.com/docs/linux/man/5-kdump/
[2] https://sourceforge.net/projects/makedumpfile/
[3] git://git.code.sf.net/p/makedumpfile/code

-- 

Thanks,

David / dhildenb