Message-ID: <76917285-d9b1-48af-ac5f-49c2d327e729@linux.microsoft.com>
Date: Tue, 17 Oct 2023 13:08:53 -0500
Subject: Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
From: "Madhavan T. Venkataraman"
To: Alexander Graf, gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, jamorris@linux.microsoft.com, rostedt@goodmis.org, kvm
In-Reply-To: <8f9d81a8-1071-43ca-98cd-e9c1eab8e014@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Hey Alex,

Thanks a lot for your comments!

On 10/17/23 03:31, Alexander Graf wrote:
> Hey Madhavan!
>
> This patch set looks super exciting - thanks a lot for putting it together.
> We've been poking at a very similar direction for a while as well and will discuss the fundamental problem of how to persist kernel metadata across kexec at LPC:
>
>   https://lpc.events/event/17/contributions/1485/
>
> It would be great to have you in the room as well then.
>

Yes. I am planning to attend. But I am attending virtually as I am not able to travel.

> Some more comments inline.
>
> On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman"
>>
>> Introduction
>> ============
>>
>> This feature can be used to persist kernel and user data across kexec reboots
>> in RAM for various uses. E.g., persisting:
>>
>>          - cached data. E.g., database caches.
>>          - state. E.g., KVM guest states.
>>          - historical information since the last cold boot. E.g., events, logs
>>            and journals.
>>          - measurements for integrity checks on the next boot.
>>          - driver data.
>>          - IOMMU mappings.
>>          - MMIO config information.
>>
>> This is useful on systems where there is no non-volatile storage or
>> non-volatile storage is too small or too slow.
>
>
> This is useful in more situations. We for example need it to do a kexec while a virtual machine is in suspended state, but has IOMMU mappings intact (Live Update). For that, we need to ensure DMA can still reach the VM memory and that everything gets reassembled identically and without interruptions on the receiving end.
>
>

I see.

>> The following sections describe the implementation.
>>
>> I have enhanced the ram disk block device driver to provide persistent ram
>> disks on which any filesystem can be created. This is for persisting user data.
>> I have also implemented DAX support for the persistent ram disks.
>
>
> This is probably the least interesting of the enablements, right? You can already today reserve RAM on boot as DAX block device and use it for that purpose.
>

Yes. pmem provides that functionality.
There are a few differences though. However, I don't have a good feel for how important these differences are to users. Maybe they are not very significant. E.g.,

- pmem regions need some setup using the ndctl command.

- IIUC, one needs to specify a starting address and a size for a pmem region. Having to specify a starting address may make it somewhat less flexible from a configuration point of view.

- In the case of pmem, the entire range of memory is set aside. In the case of the prmem persistent ram disk, pages are allocated as needed. So, persistent memory is shared among multiple consumers more flexibly.

Also, Greg H. wanted to see a filesystem-based use case presented for persistent memory so we can see how it all comes together. I am working on prmemfs (a special FS tailored for persistence). But that will take some time. So, I wanted to present this ram disk use case as a more flexible alternative to pmem.

But you are right. They are equivalent for all practical purposes.

>
>> I am also working on making ZRAM persistent.
>>
>> I have also briefly discussed the following use cases:
>>
>>          - Persisting IOMMU mappings
>>          - Remembering DMA pages
>>          - Reserving pages that encounter memory errors
>>          - Remembering IMA measurements for integrity checks
>>          - Remembering MMIO config info
>>          - Implementing prmemfs (special filesystem tailored for persistence)
>>
>> Allocate metadata
>> =================
>>
>> Define a metadata structure to store all persistent memory related information.
>> The metadata fits into one page. On a cold boot, allocate and initialize the
>> metadata page.
>>
>> Allocate data
>> =============
>>
>> On a cold boot, allocate some memory for storing persistent data. Call it
>> persistent memory.
>> Specify the size in a command line parameter:
>>
>>          prmem=size[KMG][,max_size[KMG]]
>>
>>          size            Initial amount of memory allocated to prmem during boot
>>          max_size        Maximum amount of memory that can be allocated to prmem
>>
>> When the initial memory is exhausted via allocations, expand prmem dynamically
>> up to max_size. Expansion is done by allocating from the buddy allocator.
>> Record all allocations in the metadata.
>
>
> I don't understand why we need a separate allocator. Why can't we just use normal Linux allocations and serialize their location for handover? We would obviously still need to find a large contiguous piece of memory for the target kernel to bootstrap itself into until it can read which pages it can and can not use, but we can do that allocation in the source environment using CMA, no?
>
> What I'm trying to say is: I think we're better off separating the handover mechanism from the allocation mechanism. If we can implement handover without a new allocator, we can use it for simple things with a slight runtime penalty. To accelerate the handover then, we can later add a compacting allocator that can use the handover mechanism we already built to persist itself.
>
>
>
> I have a WIP branch where I'm toying with such a handover mechanism that uses device tree to serialize/deserialize state. By standardizing the property naming, we can in the receiving kernel mark all persistent allocations as reserved and then slowly either free them again or mark them as in-use one by one:
>
> https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42
>
> I used ftrace as example payload to persist: With the handover mechanism in place, we serialize/deserialize ftrace ring buffer metadata and are thus able to read traces of the previous system after kexec. This way, you can for example profile the kexec exit path.
>
> It's not even in RFC state yet, there are a few things where I would need a couple days to think hard about data structures, layouts and other problems :). But I believe from the patch you get the idea.
>
> One such user of kho could be a new allocator like prmem and each subsystem's serialization code could choose to rely on the prmem subsystem to persist data instead of doing it themselves. That way you get a very non-intrusive enablement path for kexec handover, easily amendable data structures that can change compatibly over time as well as the ability to recreate ephemeral data structure based on persistent information - which will be necessary to persist VFIO containers.
>

OK. I will study your changes and your comments. I will send my feedback as well.

Thanks again!

Madhavan

>
> Alex
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Management: Christian Schlaeger, Jonathan Weiss
> Registered at the Amtsgericht Charlottenburg (district court) under HRB 149173 B
> Registered office: Berlin
> VAT ID: DE 289 237 879
>
>
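
As a rough illustration of the prmem= parameter and the grow-on-demand behaviour described in the quoted cover letter, here is a small userspace model in Python. It is not kernel code and is not taken from the patch set. The names parse_size, parse_prmem and PrmemModel are made up for this sketch. Treating the K/M/G suffixes as binary multiples and treating a missing max_size as "no growth" are both assumptions.

```python
import re

# Assumption for this sketch: K/M/G are binary multiples.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_SIZE_RE = re.compile(r"(\d+)([KMG]?)")


def parse_size(text):
    """Convert a string such as '512M' into a byte count."""
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"bad size: {text!r}")
    number, suffix = match.groups()
    return int(number) * _UNITS[suffix]


def parse_prmem(value):
    """Parse 'size[KMG][,max_size[KMG]]' into (size, max_size) in bytes."""
    parts = value.split(",")
    if len(parts) > 2:
        raise ValueError(f"too many fields: {value!r}")
    size = parse_size(parts[0])
    # Assumption: if max_size is omitted, the region cannot grow.
    max_size = parse_size(parts[1]) if len(parts) == 2 else size
    if max_size < size:
        raise ValueError("max_size is smaller than size")
    return size, max_size


class PrmemModel:
    """Toy model of the persistent region: tracks byte totals, not pages."""

    def __init__(self, size, max_size):
        self.capacity = size      # memory currently set aside
        self.max_size = max_size  # hard upper bound for growth
        self.used = 0
        self.records = []         # stand-in for the metadata page

    def alloc(self, nbytes):
        """Allocate nbytes, growing the region up to max_size if needed."""
        if self.used + nbytes > self.max_size:
            return False
        if self.used + nbytes > self.capacity:
            # Stand-in for expanding via the buddy allocator.
            self.capacity = self.used + nbytes
        self.used += nbytes
        self.records.append(nbytes)  # record every allocation
        return True
```

For example, PrmemModel(*parse_prmem("4K,8K")) starts with 4096 bytes set aside, grows when an allocation no longer fits, and refuses any allocation that would take it past 8192 bytes.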