Date: Wed, 12 Feb 2020 14:41:00 -0500
From: Andrea Arcangeli
To: Peter Xu
Cc: Jann Horn, Kees Cook, Daniel Colascione, Tim Murray, Nosh Minwalla,
 Nick Kralevich, Lokesh Gidra, kernel list, Linux API, SElinux list,
 Mike Rapoport, linux-security-module
Subject: Re: [PATCH v2 0/6] Harden userfaultfd
Message-ID: <20200212194100.GA29809@redhat.com>
References: <20200211225547.235083-1-dancol@google.com>
 <202002112332.BE71455@keescook> <20200212171416.GD1083891@xz-x1>
In-Reply-To: <20200212171416.GD1083891@xz-x1>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello everyone,

On Wed, Feb 12, 2020 at 12:14:16PM -0500, Peter Xu wrote:
> Right. AFAICT QEMU uses it far more than disk IOs. A guest page can
> be accessed by any kernel component on the destination host during a
> postcopy procedure. It can be as simple as when a vcpu writes to a
> missing guest page which still resides on the source host, then KVM
> will get a page fault and trap into userfaultfd asking for that page.
> The same thing happens to other modules like vhost, etc., as long as a
> missing guest page is touched by a kernel module.

Correct.

How does the Android garbage collection work to make sure there cannot
be kernel faults on the missing memory?

If I understood correctly (I didn't have much time to review, sorry)
what's proposed with regard to limiting uffd events from kernel
faults, the only use case I know of that could deal with it is
UFFD_FEATURE_SIGBUS, but that's not normal userfaultfd. That's also
the only feature required from uffd to implement a pure malloc library
in userland that never takes the mmap sem for writing to implement the
userland mremap/mmap/munmap lib calls (as those will convert to
UFFDIO_ZEROPAGE and MADV_DONTNEED internally to the lib and there will
always be a single vma). We just need to extend UFFDIO_ZEROPAGE to map
the THP zeropage to make this future pure-uffd malloc lib perform
better.

On the other hand, I'm also planning a mremap_vma_merge userland
syscall that will merge fragmented vmas.

Currently, once you have a nice heap, all contiguous but with small
objects, and you free the fragments, you can't build THP anymore even
if you make the memory virtually contiguous again by calling mremap.
That just builds up a ton of vmas, slowing down the app forever and
also preventing THP collapsing ever again.

mremap_vma_merge will require no new kernel feature, but it
fundamentally must be able to handle kernel faults. If databases start
to use that, how can you enable this feature without breaking random
apps then?

So it'd be a feature usable only by one user (Android) perhaps? And
only until you start defragging the vmas of small objects?

Thanks,
Andrea
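
A side note on the single-vma point: the difference between releasing
pages with MADV_DONTNEED and with munmap can be observed from userland
by counting the /proc/self/maps entries that cover a mapping. What
follows is only a minimal sketch, not code from the patchset or from
any existing malloc library. NPAGES, count_vmas() and vmas_after_free()
are names made up for illustration, it assumes Linux with /proc
mounted, and it uses munmap as a stand-in for however an allocator
would free its fragments. It does not touch userfaultfd at all.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 64                     /* hypothetical heap size, in pages */

static char maps_buf[1 << 18];        /* static, so counting never mmaps */

/* Count the /proc/self/maps entries overlapping [start, start + len). */
static int count_vmas(const void *start, size_t len)
{
    uintptr_t lo = (uintptr_t)start, hi = lo + len;
    size_t used = 0;
    ssize_t r;
    int n = 0;
    int fd = open("/proc/self/maps", O_RDONLY);

    if (fd < 0)
        return -1;
    while (used < sizeof(maps_buf) - 1 &&
           (r = read(fd, maps_buf + used, sizeof(maps_buf) - 1 - used)) > 0)
        used += (size_t)r;
    close(fd);
    maps_buf[used] = '\0';

    /* Each line starts with "<start>-<end> " in hex. */
    for (char *p = maps_buf; p && *p; ) {
        char *end;
        unsigned long a = strtoul(p, &end, 16);

        if (*end == '-') {
            unsigned long b = strtoul(end + 1, &end, 16);
            if (a < hi && b > lo)
                n++;
        }
        p = strchr(end, '\n');
        if (p)
            p++;
    }
    return n;
}

/*
 * Map NPAGES anonymous pages, touch them, release every other page with
 * munmap() (use_munmap != 0) or MADV_DONTNEED (use_munmap == 0), and
 * return how many vmas cover the range afterwards (-1 on error).
 */
static int vmas_after_free(int use_munmap)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = NPAGES * psz;
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int n;

    if (base == MAP_FAILED)
        return -1;
    memset(base, 0xaa, len);          /* fault every page in */

    for (int i = 1; i < NPAGES; i += 2) {
        char *page = base + (size_t)i * psz;
        int ret = use_munmap ? munmap(page, psz)
                             : madvise(page, psz, MADV_DONTNEED);
        if (ret) {
            munmap(base, len);
            return -1;
        }
    }
    n = count_vmas(base, len);
    munmap(base, len);                /* unmapping across holes is allowed */
    return n;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("MADV_DONTNEED: %d vma(s)\n", vmas_after_free(0));
    printf("munmap:        %d vma(s)\n", vmas_after_free(1));
    return 0;
}
#endif
```

Built with -DDEMO_MAIN, I'd expect this to report a single vma for the
MADV_DONTNEED case and NPAGES / 2 for the munmap case, since every
surviving page ends up as its own vma.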