Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp3258934pxu; Sat, 19 Dec 2020 18:53:48 -0800 (PST) X-Google-Smtp-Source: ABdhPJxrs1IQ73fSgISOoP4dn3hthSZ7H4apnq36wMVSMBoBbbpN0mMGzw0WA9279ie51Kgl2Iih X-Received: by 2002:a50:fd88:: with SMTP id o8mr10847763edt.386.1608432828258; Sat, 19 Dec 2020 18:53:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1608432828; cv=none; d=google.com; s=arc-20160816; b=PLpfhKypBZ2kHyFpnvVdlA0WVV9/U9UzhSzt6hs6aI38+EaqrTGIEdsf1MVDMH5r3e eN7dtC6kAYkrL7zz+tGDb8F1+kgHghM1AcmsFjt7um10g40vZjqhBlBMcdPeoLTUl7eN jxWMNRU3yQSkOUZUTyS+qe0iyPatUNUdA4UGdPF4Q4WgNRcVmSfBt7B/A3N0+GZOUgHy C22XFFys+Qfs0Aav2CnY0a0nH9UaCPkKwAdDXd2+INh3plXwi+AQfkUaj1iVUqNBAlkO 6ALV5l77r7liqpZASSLoCAPm8xfshwyepvfTE2b2iofFGOB4FmDDDnoeb5YEDL1JW5mP fdjw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=NoHRpV085pjeRA/k08rN9TlOWSM2SBKc9GvOdVexZRU=; b=zC2F5dGOzWPdd/OdI8CZ8tPSme78r0xsl6aTmgzo1wZ7p5BgNs4Qt79YWY2c710zdH o1iEaYh8RpW1t19UJ1j4J7tW2RRG1mg36X6LOeJkKIKhmI6inKsIarJdV9WOgmP+3VDE x/WYIwQ6Rp1taOb/XDp0DqV9i5Fkq6oQhgz38gd5Rxcj9wiiMzIeeOwrGUUf3MQDsxko eMX3m1H8Qy5eeffVysWVjLOtL2V91MJ5uL+85ECttgsyhmzxqvmdeYlLz06p5wdPFRQK qvLKJylhLLnaR6RNmCdsbGTVpVVi7T1qfy+AnYg7Mi5eeOwRI+5ycpmadbKC6atk7Wzl 4yzA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=PL7HL7DY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id v5si11125833edi.183.2020.12.19.18.53.17; Sat, 19 Dec 2020 18:53:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=PL7HL7DY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726932AbgLTCvX (ORCPT + 99 others); Sat, 19 Dec 2020 21:51:23 -0500 Received: from us-smtp-delivery-124.mimecast.com ([63.128.21.124]:37681 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726788AbgLTCvW (ORCPT ); Sat, 19 Dec 2020 21:51:22 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1608432595; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=NoHRpV085pjeRA/k08rN9TlOWSM2SBKc9GvOdVexZRU=; b=PL7HL7DYXR3kMLnG3ZQ3yVRpYytkHB04z4vd8CkuFLLxdM5/WWf73AlpBW2IyDetbdH3+Y kevc8CWfhRcBMWs/qYMFFUIbYLK6vId6aaaQahH1RbDvVtTV7UkCIeQUH5AqU4wwjgDyzz fucfa1QOssXijuwVLG4TaZZX0M7LBkY= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-114-sbqyOewWMcmhWZBgBKFAcQ-1; Sat, 19 Dec 2020 21:49:52 -0500 X-MC-Unique: sbqyOewWMcmhWZBgBKFAcQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 2A2B21005513; Sun, 20 Dec 2020 02:49:50 +0000 (UTC) Received: from mail (ovpn-119-164.rdu2.redhat.com [10.10.119.164]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 9989F19D9C; Sun, 20 Dec 2020 02:49:45 +0000 (UTC) Date: Sat, 19 Dec 2020 21:49:45 -0500 From: Andrea Arcangeli To: Andy Lutomirski Cc: Nadav Amit , Dave Hansen , linux-mm , Peter Xu , lkml , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , stable , Minchan Kim , Yu Zhao , Will Deacon , Peter Zijlstra Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Message-ID: References: <20201219043006.2206347-1-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/2.0.3 (2020-12-04) X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote: > I missed the beginning of this thread, but it looks to me like > userfaultfd changes PTEs with not locking except mmap_read_lock(). It There's no mmap_read_lock, I assume you mean mmap_lock for reading. The ptes are changed always with the PT lock, in fact there's no problem with the PTE updates. The only difference with mprotect runtime is that the mmap_lock is taken for reading. And the effect contested for this change doesn't affect the PTE, but supposedly the tlb flushing deferral. The change_protection_range is identical to what already happens with zap_page_range. zap_page_range is called with mmap_lock for reading in MADV_DONTNEED, and by munmap with mmap_lock for writing. change_protection_range is called with mmap_lock for writing by mprotect, and mmap_lock for reading by UFFDIO_WRITEPROTECT. > also calls inc_tlb_flush_pending(), which is very explicitly > documented as requiring the pagetable lock. Those docs must be wrong, The comment in inc_tlb_flush_pending() shows the pagetable lock is taken after inc_tlb_flush_pending(): * atomic_inc(&mm->tlb_flush_pending); * spin_lock(&ptl); > because mprotect() uses the mmap_sem write lock, which is just fine, > but ISTM some kind of mutual exclusion with proper acquire/release > ordering is indeed needed. So the userfaultfd code seems bogus. If there's a bug, it'd be nice to fix without taking the mmap_lock for writing. The vma is guaranteed not modified, so I think it'd be pretty bad if we had to give in the mmap_lock for writing just to wait for a tlb flush that is issued deferred in the context of userfaultfd_writeprotect. > I think userfaultfd either needs to take a real lock (probably doesn't > matter which) or the core rules about PTEs need to be rewritten. It's not exactly clear how the do_wp_page could run on cpu1 before the unprotect did an extra flush, I guess the trace needs one more cpu/thread? Anyway to wait the wrprotect to do the deferred flush, before the unprotect can even start, one more mutex in the mm to take in all callers of change_protection_range with the mmap_lock for reading may be enough. Thanks, Andrea