Received: by 2002:a05:7412:da14:b0:e2:908c:2ebd with SMTP id fe20csp58172rdb; Thu, 5 Oct 2023 16:49:26 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHkJliVe9ZXzqdtwtWc3BZU+cHsH49Z1mj9hfkB2+b5XrnGzx/b+jvg1R96A1VnabCR8ob8 X-Received: by 2002:a05:6808:3023:b0:3ae:554b:6c57 with SMTP id ay35-20020a056808302300b003ae554b6c57mr7662533oib.11.1696549765785; Thu, 05 Oct 2023 16:49:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1696549765; cv=none; d=google.com; s=arc-20160816; b=lyB78cNdRLxJo1Xq2sT7BhGnAoPeFhoouuXkWw0ks2JOvbI6ruZIh7FlShi+O13FVY b5bP4rzwKM25DSNleGGmSKTVpjf/8oM4Ryyw1i9OFfIBwGW/MhaYbnsL8soPiVfWNdek Z4Fq0lRiIfOVUrxgMkIVBH+EuCRYJFvau4ED2wM5ntx8RnuAjRenGBUrrBaMJZsQZNHj 8HhWGjw3MHp1NGrBboCpKjO5lXbUmFpd6aQ6kFjZfsqII3IUyBIHXMIbSu8Lbrfob3Me /qXeKkVTKQnK9NCbkiMgjt4oAecS0jONriWVz1p+DeN793zH0IiNjFOMB33t7qUf4R9a J47A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=+lEPeKg/GP3Y80L6NB8ajXYC+n4V3di2jIUF82A5N1I=; fh=Mvl226YHeVe1NpNDKNi7y0a9CA4lozQPEbDKEM+2ZQI=; b=pcp1seGQfoPhlIAT1pddMJKOO++o60ZkK2ysGewVkS5buaD/H/FHJ5OAhx1paQYFiT 0iTBX0xn4i5Jr7/P4L5H0Mz/sVGM5HRgvWF/tfjnltUGEPG+x1PA5YzI0Qib/ugJZRd9 P9bAfnG997xWLy1FAiK8nOwUoEruMMqlL6+MW9bmd92emXcIdVXC5jKrbDUAL+9jir7F hC5m0cg0UULX7qm//vFXoV77UJLai3J/8jp38ap5eyaQjubhj1xUZjoFHoTMWeIneQY/ cNS0CAsz6Tl1cyyrXfJj9V3CQ8qUmoAhP1V8dcCBgI+G+/+4aWItcEgROLzwABMqAPkq kHdw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=hymtPkWL; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from morse.vger.email (morse.vger.email. [23.128.96.31]) by mx.google.com with ESMTPS id p24-20020a63f458000000b0056a670ebd73si2496348pgk.498.2023.10.05.16.49.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 05 Oct 2023 16:49:25 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=hymtPkWL; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 4917780758C3; Thu, 5 Oct 2023 16:49:23 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229495AbjJEXs5 (ORCPT + 99 others); Thu, 5 Oct 2023 19:48:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55360 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229459AbjJEXsy (ORCPT ); Thu, 5 Oct 2023 19:48:54 -0400 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95C3FC6 for ; Thu, 5 Oct 2023 16:48:52 -0700 (PDT) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-d918aef0d0dso2261995276.3 for ; Thu, 05 Oct 2023 16:48:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1696549731; x=1697154531; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=+lEPeKg/GP3Y80L6NB8ajXYC+n4V3di2jIUF82A5N1I=; b=hymtPkWLHvjXnwmr7k7cb5XQ+J2RhhbGESHmSvrmkhOlN8FIvTvO8eHhj0DZafvtql 3YQzgW+tDi4HQ+blrzko5aLAxONiEQQVn6MlQnEAiOlnt5FjGtRL7rB2oa63C/0m05Os pBm7XnMsZWjb4HYLkIn8vGeB2YuOxYdrIPKtPToF4v8JqfRjbAT03K438Nb1vPfP13yW YdiwyCy7f3XhLh4+9KigJi6ktsNtapV+GV45TEZRTdv4IQccZ90hVowdWgvdJsZX/eNQ 5JvBN3CaevknfC8hqqz/aDs+y1HDYZVQcU/wOGJcWlNxJP6l5PlSBXpCEiLL0EE+brXc 6BMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696549731; x=1697154531; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+lEPeKg/GP3Y80L6NB8ajXYC+n4V3di2jIUF82A5N1I=; b=cIpFsUWBhBX6/0uCg8eOk2wq2qwPuvsf3ubgnsEMOFYRyYHbMBf6Ha3eSfwVvBnfWk sntUQ5uSg3KSfRX5c2YhceFBB5IEOm/0LEbtC+TdFHkYlvHhnU3zq3DJnCOV6LhD6wrv /+SRi4wKubDY5jpmYsPrVPoBT/+gUcNxbxnbDu+gfEnQdNjv3AtzvTOivylgGEfZMCzF +N36yfU/rZBaBUrmptEFOfh+9ZTkBybL5TwkLJPmJ85kebeT3Togmy/0GEVEQkRCm/62 UCN4F79l2G9EReqtYjfjprEwOm6I+uSKbPrBxKR+VRhlVmiYby/utGUep9/RYVoZ7yb4 lOOg== X-Gm-Message-State: AOJu0YxGW5I+F8c4YSg7N9/TuSpuxrwVqJ81QdK1tvdzNUtqvxgTHd3r kMM7ZVdMLMzpynJRyJJuvpMzbfkFxMI= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a25:1f56:0:b0:d7e:7a8a:2159 with SMTP id f83-20020a251f56000000b00d7e7a8a2159mr100071ybf.5.1696549731485; Thu, 05 Oct 2023 16:48:51 -0700 (PDT) Date: Thu, 5 Oct 2023 16:48:50 -0700 In-Reply-To: <20231005175238.7bb2zut4fb7ebdqc@amd.com> Mime-Version: 1.0 References: <20231005175238.7bb2zut4fb7ebdqc@amd.com> Message-ID: Subject: Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole From: Sean Christopherson To: Michael Roth Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, Paolo Bonzini , erdemaktas@google.com, Sagi Shahar , David Matlack , Kai Huang , Zhi Wang , chen.bo@intel.com, linux-coco@lists.linux.dev, Chao Peng , Ackerley Tng , Vishal Annapurve , Yuan Yao , Jarkko Sakkinen , Xu Yilun , Quentin Perret , wei.w.wang@intel.com, Fuad Tabba Content-Type: text/plain; charset="us-ascii" X-Spam-Status: No, score=-8.4 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Thu, 05 Oct 2023 16:49:23 -0700 (PDT) On Thu, Oct 05, 2023, Michael Roth wrote: > On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote: > > On Thu, Sep 21, 2023, Sean Christopherson wrote: > > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c > > > > index a819367434e9..01fb4ca861d0 100644 > > > > --- a/virt/kvm/guest_mem.c > > > > +++ b/virt/kvm/guest_mem.c > > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, > > > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) > > > > { > > > > struct list_head *gmem_list = &inode->i_mapping->private_list; > > > > + struct address_space *mapping = inode->i_mapping; > > > > pgoff_t start = offset >> PAGE_SHIFT; > > > > pgoff_t end = (offset + len) >> PAGE_SHIFT; > > > > struct kvm_gmem *gmem; > > > > > > > > + /* > > > > + * punch hole may result in zeroing partial area. As pages can be > > > > + * encrypted, prohibit zeroing partial area. > > > > + */ > > > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK) > > > > + return -EINVAL; > > > > > > This should be unnecessary, kvm_gmem_fallocate() does > > > > > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > > > return -EINVAL; > > > > > > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test > > > fails, then that code needs to be fixed. I'll run your test to double-check, > > > but AFAICT this is unnecesary. > > > > I confirmed that the testcase passes without the extra checks. Just to close the > > loop, what prompted adding more checks to kvm_gmem_punch_hole()? > > I don't know if it's the same issue that Isaku ran into, but for SNP we > hit a similar issue with the truncate_inode_pages_range(lstart, lend) call. > > The issue in that case was a bit more subtle: > > - userspace does a hole-punch on a 4K range of its gmem FD, which happens > to be backed by a 2MB folio. > - truncate_inode_pages_range() gets called for that 4K range > - truncate_inode_pages_range() does special handling on the folios at the > start/end of the range in case they are partial and passes these to > truncate_inode_partial_folio(folio, lstart, lend). In this case, there's > just the 1 backing folio. But it *still* gets the special treatment, and > so gets passed to truncate_inode_partial_folio(). > - truncate_inode_partial_folio() will then zero that 4K range, even though > it is page-aligned, based on the following rationale in the comments: > > /* > * We may be zeroing pages we're about to discard, but it avoids > * doing a complex calculation here, and then doing the zeroing > * anyway if the page split fails. > */ > folio_zero_range(folio, offset, length); > > - after that, .invalidate_folio callback is issued, then the folio is split, > and the caller (truncate_inode_pages_range()) does another pass through > the whole range and can free the now-split folio then .free_folio callbacks > are issued. > > Because of that, we can't rely on .invalidate_folio/.free_folio to handle > putting the page back into a normal host-accessible state, because the > zero'ing will happen beforehand. Argh, and that causes an RMP violation #PF. FWIW, I don't *think* zeroing would be problematic for TDX. The page would get poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache. > That's why we ended up needing to do this for SNP patches to make sure > arch-specific invalidation callbacks are issued before the truncation occurs: > > https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233 > > I'd planned to post those as a separate RFC to discuss, but when I came across > this it seemed like it might be relevant to what the TDX folks might ran into > here. > > If not for the zero'ing logic mentioned above, for SNP at least, the > .free_folio() ends up working pretty nicely for both truncation and fput(), > and even plays nicely with live update use-case where the destination gmem > instance shares the inode->i_mapping, since iput() won't trigger the > truncate_inode_pages_final() until the last reference goes away so we don't > have to do anything special in kvm_gmem_release() to determine when we > should/shouldn't issue the arch-invalidations to clean up things like the > RMP table. > > It seems like the above zero'ing logic could be reworked to only zero non-page > aligned ranges (as the comments above truncate_inode_pages_range() claim > should be the case), which would avoid the issue for the gmem use-case. But I > wonder if some explicit "dont-zero-these-pages" flag might be more robust. > > Or maybe there's some other way we should be going about this? Skipping the write seems like the obvious solution. An address_space flag, e.g. AS_INACCESSIBLE, would be the easiest thing to implement. Or maybe even make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking). Or a hook in address_space_operations to zero the folio, which conceptually is better in many ways, but feels like overkill.