Date: Tue, 22 Dec 2020 16:17:53 -0500
From: Andrea Arcangeli
To: Nadav Amit
Cc: Peter Xu, Yu Zhao, Linus Torvalds, linux-mm, lkml, Pavel Emelyanov,
    Mike Kravetz, Mike Rapoport, stable, Minchan Kim, Andy Lutomirski,
    Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
References: <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com>
 <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com>
 <20201221223041.GL6640@xz-x1>

On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the

The current rule is that by the time the page fault finds a pte/hugepmd
in a certain state under the PT lock, the TLB cannot be left stale with
a more permissive state. So if the page fault sees !pte_write under the
PT lock, no other CPU can possibly modify the page anymore.

That must be enforced at all times; no sequence number is required if
you enforce it, and the whole point of the fix is to enforce exactly
that. This is how things have always worked in the page fault code and
it's perfectly fine. For the record, I never suggested changing
anything in the page fault code.
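That rule can be modeled in a few lines of user-space C. This is only an
illustrative sketch, not kernel API: fake_pte, fake_tlb and the helper
names are all stand-ins. The point it demonstrates is that a plain
write-protect must flush the TLB before the PT lock is dropped, so a
fault that samples !pte_write under the lock can conclude no CPU can
still write the page.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_pte { bool write; };     /* stand-in for the pte write bit */
struct fake_tlb { bool writable; };  /* one cached translation */

/* Downgrade under the "PT lock": clear the write bit and flush the
 * cached translation before the lock would be released. */
static void wrprotect_and_flush(struct fake_pte *pte, struct fake_tlb *tlb)
{
    pte->write = false;
    tlb->writable = false;   /* the flush_tlb_page() equivalent */
}

/* The invariant the fault path relies on: !pte_write under the PT lock
 * implies the TLB holds nothing more permissive than the pte. */
static bool invariant_holds(const struct fake_pte *pte,
                            const struct fake_tlb *tlb)
{
    return pte->write || !tlb->writable;
}
```

Skipping the flush before the unlock is exactly what would leave a stale
writable TLB entry behind a !pte_write pte, violating invariant_holds().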
So the only way we can leave stale TLB entries after dropping the PT
lock, with mmap_read_lock held and concurrent page faults running, is
with a marker:

1) take the PT lock
2) leave in the pte/hugepmd a unique, identifiable marker
3) change the pte to downgrade permissions
4) drop the PT lock, leaving stale TLB entries with the upgraded
   permissions

In step 3) the pte is built in a way that is guaranteed to trigger a
deterministic path in the page fault, and in that deterministic path
you put the check for the marker left in 2), to send the fault to a
dead end where the stale TLB is irrelevant and harmless. No change to
the page fault is needed if you only enforce the above.

Even if we took the mmap_write_lock in userfaultfd_writeprotect, you
would still have to deal with the above technique because of
change_prot_numa(). Would you suggest adding a sequence counter and
changing all the pte_same checks to make change_prot_numa safer, or is
it safe enough already with the above technique of checking the marker
in the page fault?

To be safe, NUMA balancing uses the same mechanism above of sending the
page fault into a dead end, in order to call the very same function
(change_protection) with the very same lock (mmap_read_lock) as
userfaultfd_writeprotect. What then happens in do_numa_page is
different from what happens in handle_userfault, but both are dead ends
as far as the page fault code is concerned and they never come back to
it. That's how they avoid breaking the page fault.

The final rule for the above logic to be safe for uffd is that the
marker cannot be cleared until after the deferred TLB flush has
executed (that's where the window for the race existed, and the fix
closed that window). do_numa_page differs in clearing the marker while
the TLB flush is still pending, but it can do that because it puts back
the same exact pte value (with the upgraded permissions) that was there
before it installed the marker.
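The four-step marker protocol and the two marker-clearing disciplines
above can be sketched in user-space C. Again a toy model: the bit
names, helper functions and the single-word "pte" are illustrative
stand-ins, not kernel API (in the kernel the marker is the uffd-wp bit
and the lock is the PT lock).

```c
#include <assert.h>
#include <stdbool.h>

#define PTE_WRITE  0x1u
#define PTE_MARKER 0x2u   /* step 2: the unique, identifiable marker */

/* Steps 1-4: under the PT lock, set the marker and downgrade the
 * permissions. The TLB flush is deferred, so stale writable TLB
 * entries can survive past the unlock. */
static unsigned int wrprotect_with_marker(unsigned int pte)
{
    pte |= PTE_MARKER;    /* 2) leave the marker */
    pte &= ~PTE_WRITE;    /* 3) downgrade permissions */
    return pte;           /* 4) PT lock dropped, flush still pending */
}

/* Fault side: a !pte_write fault that finds the marker is sent to a
 * dead end (handle_userfault or do_numa_page), so the stale writable
 * TLB entry never lets the fault path draw a wrong conclusion. */
static bool fault_is_dead_end(unsigned int pte)
{
    return !(pte & PTE_WRITE) && (pte & PTE_MARKER);
}

/* do_numa_page-style clearing: restore the exact pre-marker pte value,
 * which turns the still-pending TLB flush into a noop. uffd cannot do
 * this and must keep the marker until the deferred flush has run. */
static unsigned int numa_clear_marker(unsigned int saved_pte)
{
    return saved_pte;     /* same value, upgraded permissions restored */
}
```

The asymmetry is visible in the model: numa_clear_marker() hands back a
pte identical to what the stale TLB already caches, so the pending flush
changes nothing, while clearing the uffd marker early would re-expose
the stale writable entry the fix is there to neutralize.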
In turn the pending TLB flush becomes a noop for do_numa_page, and it
doesn't need to wait for the flush before removing the marker.

Does that last difference in how the marker is cleared warrant
considering NUMA balancing so radically different as to forbid
userfaultfd_writeprotect from using the same logic, in the very same
function with the very same lock, which keeps the VM complexity down?
In my view, no.

Thanks,
Andrea