Date: Tue, 22 Dec 2020 14:44:08 -0500
From: Andrea Arcangeli
To: Nadav Amit
Cc: Peter Xu, Yu Zhao, Linus Torvalds, linux-mm, lkml, Pavel Emelyanov,
    Mike Kravetz, Mike Rapoport, stable, Minchan Kim, Andy Lutomirski,
    Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
References: <20201221172711.GE6640@xz-x1>
 <76B4F49B-ED61-47EA-9BE4-7F17A26B610D@gmail.com>
 <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com>
 <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com>
 <20201221223041.GL6640@xz-x1>

On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so

I assume you mean "in" mprotect_fixup, after change_protection.

If you would downgrade the mmap_lock to read there, it'd severely slow
down the non-contention case whenever more than one vma needs
change_protection.
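[For context, a heavily simplified sketch of the vma walk mprotect does
under the mmap_lock, paraphrased from the v5.10-era
do_mprotect_pkey()/mprotect_fixup() path; this is not verbatim kernel
source, and the pkey/flags computation and most error handling are
omitted:]

        /* simplified paraphrase of mm/mprotect.c:do_mprotect_pkey() */
        if (mmap_write_lock_killable(current->mm))
                return -EINTR;

        vma = find_vma(current->mm, start);
        prev = vma->vm_prev;

        for (nstart = start; ; ) {
                tmp = vma->vm_end < end ? vma->vm_end : end;

                /* may split/merge the vma, then calls change_protection() */
                error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
                if (error)
                        break;

                nstart = tmp;
                if (nstart >= end)
                        break;

                /*
                 * The walk relies on prev->vm_next staying valid, which is
                 * only guaranteed while the mmap_lock stays held for
                 * writing across the whole loop.
                 */
                vma = prev->vm_next;
        }

        mmap_write_unlock(current->mm);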
You'd need to throw away the prev->vm_next info and do a new find_vma
after dropping the mmap_lock for reading and re-taking the mmap_lock
for writing at every iteration of the loop. To do less harm to the
non-contention case you could perhaps walk vma->vm_next, check whether
it's outside the mprotect range and only downgrade in that case. So
let's assume we intend to optimize with mmap_write_downgrade only the
last vma.

The problem is that once you had to take the mmap_lock for writing, you
already stalled for I/O and waited for all concurrent page faults, and
you blocked them as well across the vma allocations in split_vma, so
the extra boost in SMP scalability you get is lost in the noise there
at best. And the risk is that, at worst, the extra locked op of
mmap_write_downgrade() will hurt SMP scalability: it would increase the
locked ops of mprotect on the hottest false-shared cacheline by 50%,
and that may outweigh the benefit of unblocking the page faults half a
usec sooner on large systems.

But the ultimate reason why mprotect cannot do mmap_write_downgrade(),
while userfaultfd_writeprotect can do mmap_read_lock and avoid the
mmap_write_lock altogether, is that mprotect leaves no mark in the
pte/hugepmd that would allow detecting when the TLB is stale and
redirecting the page fault into a dead end (handle_userfault() or
do_numa_page) until after the TLB has been flushed, as happens in the 4
cases below:

        /*
         * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
         * can be stale. We cannot allow do_wp_page to proceed or
         * it'll wrongly assume that nobody can still be writing to
         * the page if !pte_write.
         */
        if (userfaultfd_pte_wp(vma, *vmf->pte)) {

        /*
         * STALE_TLB_WARNING: while the uffd_wp bit is set,
         * the TLB can be stale. We cannot allow wp_huge_pmd()
         * to proceed or it'll wrongly assume that nobody can
         * still be writing to the page if !pmd_write.
         */
        if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))

        /*
         * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
         * be stale.
         */
        if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))

        /*
         * STALE_TLB_WARNING: if the pmd is NUMA
         * protnone the TLB can be stale.
         */
        if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))

Thanks,
Andrea
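[For readers outside the thread: userfaultfd_writeprotect() is the
kernel side of the UFFDIO_WRITEPROTECT ioctl. A minimal userspace
sketch of driving that path, assuming a kernel with uffd-wp support
(UFFD_FEATURE_PAGEFAULT_FLAG_WP); error handling is trimmed and the
fault-handling monitor thread is omitted:]

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api)) {
                perror("userfaultfd/UFFDIO_API");
                return 1;
        }

        char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (area == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(area, 0, page);  /* fault the anon page in first */

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)area, .len = page },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
                perror("UFFDIO_REGISTER");
                return 1;
        }

        /* this ioctl is what ends up in userfaultfd_writeprotect() */
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)area, .len = page },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
                perror("UFFDIO_WRITEPROTECT");
                return 1;
        }

        printf("range write-protected via uffd-wp\n");
        return 0;
}

[A later write into the range then raises a userfault message with
UFFD_PAGEFAULT_FLAG_WP set, which a monitor thread would resolve with a
second UFFDIO_WRITEPROTECT call whose mode clears the WP bit.]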