To: "Kirill A. Shutemov", Dave Hansen, Andy Lutomirski, Peter Zijlstra,
 Sean Christopherson, Jim Mattson
Cc: David Rientjes, "Edgecombe, Rick P", "Kleen, Andi", "Yamahata, Isaku",
 x86@kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, "Kirill A. Shutemov", Oscar Salvador,
 Naoya Horiguchi
References: <20210402152645.26680-1-kirill.shutemov@linux.intel.com>
 <20210402152645.26680-8-kirill.shutemov@linux.intel.com>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
Message-ID: <5e934d94-414c-90de-c58e-34456e4ab1cf@redhat.com>
Date: Wed, 7 Apr 2021 16:55:54 +0200
In-Reply-To: <20210402152645.26680-8-kirill.shutemov@linux.intel.com>

On 02.04.21 17:26, Kirill A. Shutemov wrote:
> The TDX architecture aims to provide resiliency against confidentiality
> and integrity attacks. Towards this goal, the TDX architecture helps
> enforce memory integrity for all TD-private memory.
>
> The CPU memory controller computes the integrity check value (MAC) for
> the data (cache line) during writes, and it stores the MAC with the
> memory as metadata. A 28-bit MAC is stored in the ECC bits.
>
> Memory integrity is checked during memory reads. If the integrity
> check fails, the CPU poisons the cache line.
>
> On a subsequent consumption (read) of the poisoned data by software,
> there are two possible scenarios:
>
> - The core determines that execution can continue, and it treats the
>   poison with exception semantics, signaled as a #MCE.
>
> - The core determines that execution cannot continue, and it does an
>   unbreakable shutdown.
>
> For more details, see Chapter 14 of the Intel TDX Module EAS [1].
>
> As some integrity check failures may lead to a system shutdown, the
> host kernel must not allow any writes to TD-private memory. This
> requirement clashes with KVM design: KVM expects the guest memory to
> be mapped into host userspace (e.g. QEMU).
>
> This patch aims to start a discussion on how we can approach the issue.
>
> For now I intentionally keep TDX out of the picture here and try to
> find a generic way to unmap KVM guest memory from host userspace.
> Hopefully, it makes the patch more approachable. And anyone can try
> it out.
>
> To the proposal:
>
> Looking into existing codepaths, I've discovered that we already have
> the semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON
> swap entries in page tables:
>
> - If an application touches a page mapped with SWP_HWPOISON, it will
>   get SIGBUS.
>
> - GUP will fail with -EFAULT.
>
> Accessing poisoned memory via the page cache doesn't match the
> required semantics right now, but it shouldn't be too hard to make it
> work: access to poisoned dirty pages should give -EIO or -EHWPOISON.
>
> My idea is that we can mark a page as poisoned when we make it
> TD-private and replace all PTEs that map the page with SWP_HWPOISON.

It looks quite hacky (well, what did I expect from an RFC :) ). You can
no longer distinguish actually poisoned pages from "temporarily
poisoned" pages. FOLL_ALLOW_POISONED sounds especially nasty and
dangerous - "I want to read/write a poisoned page, trust me, I know
what I am doing".
Storing the state for each individual page initially sounded like the
right thing to do, but I wonder if we couldn't handle this on a per-VMA
level. You can just remember the handful of shared ranges internally,
like you do right now, AFAIU.

From what I get, you want a way to

1. Unmap pages from the user space page tables.

2. Disallow re-faulting of the protected pages into the page tables. On
   user space access, you want to deliver some signal (e.g., SIGBUS).

3. Allow selected users to still grab the pages (esp. KVM, to fault them
   into the page tables).

4. Allow access to currently shared specific pages from user space.

Right now, you achieve

1. via try_to_unmap()
2. via TestSetPageHWPoison
3. TBD (e.g., FOLL_ALLOW_POISONED)
4. via ClearPageHWPoison()

If we could bounce all writes to shared pages through the kernel, things
could end up a little easier. Some very rough idea:

We could let user space set up VM memory as mprotect(PROT_READ) (+
PROT_KERNEL_WRITE?), and after activating protected memory (I assume via
a KVM ioctl), make sure the VMAs cannot be set to PROT_WRITE anymore.
This would already properly unmap and deliver a SIGSEGV when trying to
write from user space.

You could then still access the pages, e.g., via FOLL_FORCE or a new
fancy flag that allows writing with VM_MAYWRITE|VM_DENYUSERWRITE. This
would allow an ioctl to write page content and to map the pages into
NPTs.

As an extension, we could think about (re?)mapping some shared pages
read|write. The question is how to synchronize with user space.

I have no idea how expensive bouncing writes (and reads?) through the
kernel would be. Did you ever experiment with that / evaluate that?

-- 
Thanks,

David / dhildenb