Date: Tue, 6 Feb 2024 09:17:36 +0100
From: Oleksij Rempel
To: Alexander Graf
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
    linux-mm@kvack.org, devicetree@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, kexec@lists.infradead.org,
    linux-doc@vger.kernel.org, x86@kernel.org, Eric Biederman,
    H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Steven Rostedt,
    Andrew Morton, Mark Rutland, Tom Lendacky, Ashish Kalra, James Gowans,
    Stanislav Kinsburskii, arnd@arndb.de, pbonzini@redhat.com,
    madvenka@linux.microsoft.com, Anthony Yznaga, Usama Arif,
    David Woodhouse, Benjamin Herrenschmidt, Rob Herring,
    Krzysztof Kozlowski
Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers
References: <20240117144704.602-1-graf@amazon.com>
In-Reply-To: <20240117144704.602-1-graf@amazon.com>

Hi Alexander,

Nice work!

On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory.
> See James' and my Linux Plumbers
> Conference 2023 presentation for details:
>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific.
> The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
>
> == How to Use ==
>
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>
> Make sure to fill ftrace with contents that you want to observe after
> kexec. Then, before you invoke file based "kexec -l", activate KHO:
>
>   # echo 1 > /sys/kernel/kho/active
>   # kexec -l Image --initrd=initrd -s
>   # kexec -e
>
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.

Assuming:
- we want to start tracing as early as possible, before the rootfs or
  initrd would be able to configure it.
- traces are stored on a different device, not in RAM. For example, NVMEM.
- the location of the NVMEM differs between board types, but the
  bootloader is able to give the right configuration to the kernel.

What would be the best way, acceptable for mainline, to provide this kind
of configuration?
At least part of this information does not describe devices or
device states, so it would not fit into the devicetree universe. The
amount of possible information would not fit into bootconfig either.

Another, more or less overlapping, use case I have in mind is a
netbootable embedded system with a requirement to boot as fast as
possible. Since the bootloader has already established a link and
obtained all the needed IP configuration, it would be able to hand over
the Ethernet controller and IP configuration states. Would KHO be the
way to go for this use case?

Regards,
Oleksij
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |