Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp2586596pxa; Mon, 17 Aug 2020 13:31:52 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx0tDe1Qw7C5/MMjGjGzWL81I1maPD2cGY9+1H2d9migH/H2MnfpTDkxZxXTRjz16Ibjpw+ X-Received: by 2002:aa7:d8d7:: with SMTP id k23mr17294107eds.54.1597696312601; Mon, 17 Aug 2020 13:31:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1597696312; cv=none; d=google.com; s=arc-20160816; b=a0uNREk4G1lGmmISKGxlt0HuuOUhjCynT566GVnE5zknaa8Q6gRt+5MUdiXBr2UYG5 +H10mT/nTiYWAAawaMnJGAIykjIXUbuIRLfOdJZIYLL9+8uxzaNwJW8pqGRCazFfoTkW /EqgkuVP6rHfgIHOyvobBgmWgRnPAzRGhboWznWfcCoQt1BQ+HlO7Cuw9augM04+KmAZ 5cimMmFii8L+0DwBVcRV5h9cSk53+tJJGr4ceY9d3HqN46xi1dn78vv2BBdtwW5r72Wf eEL+/X2C4zFilSZbCUfCN1/cQr1nDfZfYVuGhT5rU4cid+OaTkpLX6zG+Hv3rCiVgDIj NJ8w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=3bWibmB7a7anJ/RW0CABzCpk9teb9IRsBxjmdDXzcm8=; b=NKOWEfVY5lN+vJ2sdkMItzq2vRVzAwftYBVjcLbuaskhNkPU/MKAbijSBP9Nq8otE+ my5GtAtsNcN6odbF3dB5EAm0qBiMIcC31cm5x3Y5oSIWeCJm+dqX7M/XuJtYCms67ONU nsa+xoHz6Emm2/9hcFdjU16ZI07jrXfMhXd0pT9utc3hHq0BMdEajVOYn9Z3FSAT4u9W MlamEAs0IXAiBZMQI3NhM5bn3WX1qZSwBaIxeWonET5zWaD+Uj4WOWjHD6gRXey1MFWN R0uLd27BpTYfMJlDVSo0E1Rp7Eyjjuv8zETgMyTIM5B+AnkCln3eU3kiIrVYzcr5JiCX s0mg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id di8si10450620edb.395.2020.08.17.13.31.28; Mon, 17 Aug 2020 13:31:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728241AbgHQMOd (ORCPT + 99 others); Mon, 17 Aug 2020 08:14:33 -0400 Received: from foss.arm.com ([217.140.110.172]:54236 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726457AbgHQMOa (ORCPT ); Mon, 17 Aug 2020 08:14:30 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7B81430E; Mon, 17 Aug 2020 05:14:29 -0700 (PDT) Received: from [172.16.1.113] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 26AD03F66B; Mon, 17 Aug 2020 05:14:27 -0700 (PDT) Subject: Re: [PATCH 1/2] kexec: Add quick kexec support for kernel To: Pavel Tatashin , Dave Young Cc: Sang Yan , kexec mailing list , "Eric W. Biederman" , LKML , xiexiuqi@huawei.com, guohanjun@huawei.com, luanjianhai@huawei.com, zhuling8@huawei.com, luchunhua@huawei.com References: <20200814055239.47348-1-sangyan@huawei.com> <20200814065845.GA18234@dhcp-128-65.nay.redhat.com> <20200814112413.GA8097@dhcp-128-65.nay.redhat.com> From: James Morse Message-ID: Date: Mon, 17 Aug 2020 13:14:19 +0100 User-Agent: Mozilla/5.0 (X11; Linux aarch64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi guys, On 14/08/2020 20:22, Pavel Tatashin wrote: > On Fri, Aug 14, 2020 at 7:24 AM Dave Young wrote: >> On 08/14/20 at 04:21pm, Sang Yan wrote: >>> On 08/14/20 14:58, Dave Young wrote: >>>> On 08/14/20 at 01:52am, Sang Yan wrote: >>> Yes, it's particularly obvious on arm64. I will add it to the patch log, >>> and test how long it takes on x86 and other arch. Earlier versions of kexec-tools had the in-purgatory checksum enabled unconditionally. More recent versions let you disable it, I think the parameter is called no-checks. This saves some time, but the relocations still have to be done. >>>> About the arm64 problem, I know Pavel Tatashin is working on a patchset >>>> to improve the performance with enabling mmu. >>>> >>>> I added Pavel in cc, can you try his patches? >>>> >>> Thanks for your tips, I will try these patches. @Pavel. >>> Disable mmu after finishing copying pages? >>>>> We introduce quick kexec to save time of copying memory as above, >>>>> just like kdump(kexec on crash), by using reserved memory >>>>> "Quick Kexec". >>>> >>>> This approach may have gain, but it also introduce extra requirements to >>>> pre-reserve a memory region. I wonder how Eric thinks about the idea. >>>> >>>> Anyway the "quick" name sounds not very good, I would suggest do not >>>> introduce a new param, and the code can check if pre-reserved region >>>> exist then use it, if not then fallback to old way. >>>> >>> aha. I agree with it, but I thought it may change the old behaviors of >>> kexec_load. >>> >>> I will update a new patch without introducing new flags and new params. >> >> Frankly I'm still not sure it is worth to introduce a new interface if the >> improvement can be done in arch code like Pavel is doing. Can you try >> that first? > My patches will fix this issue. This is an ARM64 specific problem and > I did not see this to be performance problem on x86 during kexec > relocation. This happens because on ARM64 relocation is performed with > MMU disabled, and when MMU is disabled the caching is disabled as > well. > I have a patch series that fixes this entirely, but James Morse > (+CCed) and I still have not agreed on the final approach. We had an > off-list conversation about it, and we need to continue it in public > ML. > > Here is some history: > > This is the original series that I sent a year ago. It basically > proposes the same thing as this series from Sang Yan: > https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/ > > Once, I realized that with enabling MMU the relocation is issue is > gone completely, I sent a new series, and this is the latest version > of that series: > https://lore.kernel.org/lkml/20200326032420.27220-1-pasha.tatashin@soleen.com/ > > It has been tested in production, and several people from different > companies commented to me that they are using it as well. > > After my patch series was sent out, James created a new branch in his > tree with his approach of enabling MMU without having a new VA space, > but instead re-use what the kernel has now. I have not tested that > branch yet. For context, that is here: http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/kexec%2Bmmu/v0 I think we can maintain this approach, but it doesn't work for Pavel, as he has extra requirements. I stopped looking at it because it became a solution no-one needed. > Here are some comments from James Morse and the off-list discussion we had: > ------- > It sounds like you are depending on write streaming mode to meet your > target performance. > This isn't even CPU specific, its cache and firmware configuration specific! > I don't think we should optimise a general purpose operating system > based on things like this. > .. > I think the best approach is going to be to eliminate the relocations entirely.> ... > I'm afraid I view this kexec-map thing as high-risk duct-tape over the > kexec core code > deliberately scattering the kexec payload. > I'd prefer any approach that causes the payload to be stored in-place > from the beginning > as that benefits other architectures too. > ------- The 'eliminate relocations' comment goes with some of the context you removed. > It appears James is leaning to the approach of not performing > relocation at all and use what is proposed by Sang Yan and me during > my first approach for this problem. The background to that is Pavel's timing requirements: Enabling the MMU isn't enough, from his description he also depends on re-arranging the memory so the CPU only sees increasing virtual addresses. This is what my 'write streaming' comment refers to. Doing this requires rewriting the relocation assembly code. If we enable the MMU during kexec relocation, I expect someone on a memory constrained system to come out of the woodwork screaming 'regression'. Systems with insufficient memory to allocate the page tables will no longer be able to kexec. If we keep the relocation assembly code as it is, its possible for it to handle MMU-on and MMU-off relocations with a very small adjustment. It just won't work for Pavel, as enabling the MMU is not enough. I'm confident we won't get a second copy of the relocation code, that only runs on some platforms, past the arch code maintainer. > However, I have several issues > with this take, which if addressed would be OK for me. > 1. The newer, more secure kexec syscall kexec_file_load(), which > allows to check the IMA integrity of the loaded file does not have a > way to specify the area in memory where to place the kernel. We are > using this syscall in production, and cannot go back to kexec_load() > for security reasons. > 2. Reserving memory means wasting memory during run-time. Our machine > has only 8G of RAM, and reserving even 128M You're loading a 128M kernel!? > for the next kernel is an > expensive proposition. Now we start loading the next kernel after some > non essential processes are stopped, but before essential processes > are stopped for the lowest downtime possible. > 3. Disabling relocation means changes in the common code, which I am > not sure actually helps any other platform beside ARM64, so I am > worried it won't be accepted into upstream. I'm happy to post the MMU-enabled series, it needs to be maintainable and solve someone's problem. To chip away at the rest of Pavel's problem, my suggestions were: * Allocate things in place. If we can allocate any 2MB hugepage, we can place the DTB there and not need to relocate it. * use huge pages more generally in the core code. With the MMU enabled, this might keep the core in write-streaming-mode for longer. (might, because this is very platform specific). * Store the kexec payload in the crashkernel carveout, to eliminate the relocations completely. This is the bit Pavel quoted. To expand on the carveout: The crashkernel carveout is sized to load the payload, and memory to run the kdump kernel entirely from within the carveout. Its obviously bigger than the payload it contains. If you load your kexec kernel into the 'memory' part of the carveout, it won't overwrite the kdump payload, and it wont require relocation, as its already stored in place. arm64's arch code will spot these in-place buffers, and skip the relocation. If you kdump even after doing this, the kdump kernel sees the kexec payload as uninitialised memory, and will overwrite it. If we did this, we wouldn't need to enable the MMU, and it should skip most of the relocation code on all architectures without any changes to the arch code. On arm64 the kernel and initramfs only need relocating because their physical addresses are baked into the DTB. The DTB's physical address is then passed to the new kernel. From memory: 'relocatable kernel' is detectable from the image header, and the support predates arm64's kexec support. James