Date: Mon, 5 Feb 2024 13:42:38 -0400
From: Jason Gunthorpe
To: James Gowans
Cc: linux-kernel@vger.kernel.org, Eric Biederman,
    kexec@lists.infradead.org, Joerg Roedel, Will Deacon,
    iommu@lists.linux.dev, Alexander Viro, Christian Brauner,
    linux-fsdevel@vger.kernel.org, Paolo Bonzini, Sean Christopherson,
    kvm@vger.kernel.org, Andrew Morton, linux-mm@kvack.org,
    Alexander Graf, David Woodhouse, "Jan H . Schoenherr", Usama Arif,
    Anthony Yznaga, Stanislav Kinsburskii, madvenka@linux.microsoft.com,
    steven.sistare@oracle.com, yuleixzhang@tencent.com
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update
Message-ID: <20240205174238.GC31743@ziepe.ca>
References: <20240205120203.60312-1-jgowans@amazon.com>
In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com>

On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:

> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.

It is fine that the pkernfs holds some persistent memory that
guarantees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec.

But once the VMM starts to restore itself we need to get back to the
original configuration:
 - A mmap that points to the VM's physical pages
 - An iommufd IOAS that points to the above mmap
 - An iommufd HWPT that represents that same mapping
 - An iommu_domain programmed into HW that the HWPT

I.e. you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!

For vt-d you need to retain the entire root table and all the required
context entries too. The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.

You can later reconstitute a proper iommu_domain from the VMM and
atomically switch.

So, I'm surprised to see this approach where things just live forever
in the pkernfs; I don't see how "restore" is going to work very well
like this.

I would think that a save/restore mentality would make more sense. For
instance you could make a special iommu_domain that is fixed and lives
in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then replace the iommu HW to the
fixed one.

In the post-kexec world the iommu would recreate that special domain
and point the iommu at it (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.

Then you'd build the normal operating ioas and hwpt (with all the right
page refcounts/etc) then switch to it and free the pkernfs memory.

It seems a lot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.
It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.

Your text spent a lot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and covering
all the functional issues in the HW drivers is going to be really
challenging.

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better off serializing a minimum and trying to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.

There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and iommu
invariants.

Jason
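
The save/restore ordering proposed in this reply can be written down as
a small toy model. This is only a sketch of the sequencing, not kernel
code: the `Phase` names, the `Handover` class and the `pkernfs_in_use`
flag are all hypothetical labels invented for illustration, and nothing
here corresponds to an existing iommufd, VFIO or pkernfs interface.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical phases of the handover; these names are not kernel APIs."""
    LIVE_DOMAIN = 0     # normal iommu_domain is programmed into the HW
    FIXED_DOMAIN = 1    # mappings copied to the fixed domain in pkernfs, HW points at it
    RESTORED = 2        # post-kexec: iommu driver recreated the special domain
    TAKEN_OVER = 3      # iommufd/VFIO took over the special mapping at VMM startup
    NORMAL_REBUILT = 4  # normal IOAS/HWPT built, HW switched, pkernfs memory freed


class Handover:
    """Toy model that only lets the phases run strictly in order."""

    def __init__(self) -> None:
        self.phase = Phase.LIVE_DOMAIN
        self.pkernfs_in_use = False

    def advance(self, target: Phase) -> None:
        # Reject skipped or repeated steps: each phase depends on the previous one.
        if target != self.phase + 1:
            raise ValueError(f"cannot go from {self.phase.name} to {target.name}")
        self.phase = target
        # The pkernfs copy is needed from the save step until the final switch.
        self.pkernfs_in_use = Phase.FIXED_DOMAIN <= target < Phase.NORMAL_REBUILT
```

The point of the model is that the pkernfs-backed domain is a temporary
special case: it exists only between the save step and the final switch
to the rebuilt IOAS/HWPT, after which the normal driver paths are back
in charge.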