Date: Thu, 7 Jan 2021 17:56:49 -0500
From: Andrea Arcangeli
To: Linus Torvalds
Cc: Jason Gunthorpe, Linux-MM, Linux Kernel Mailing List, Yu Zhao,
    Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz,
    Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra,
    Hugh Dickins, "Kirill A. Shutemov", Matthew Wilcox, Oleg Nesterov,
    Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara,
    Kirill Tkhai
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
References: <20210107200402.31095-1-aarcange@redhat.com> <20210107202525.GD504133@ziepe.ca>

On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote:
> So I think we can agree that even that softdirty case we can just
> handle by "don't do that then".

Absolutely. The question is what happens if somebody was happily
running clear_refs with RDMA attached to the process: by the time they
update and reboot, they'll currently find out the hard way, with
silent mm corruption.

So I was obliged to report this issue, and the fact that there was a
very strong reason why page_count was not used there. It's even
documented explicitly in the source:

	 * [..] however we only use
	 * page_trans_huge_mapcount() in the copy-on-write faults where we
	 * need full accuracy to avoid breaking page pinning, [..]

I hadn't entirely forgotten the comment: in fact I reiterated it in
Message-ID: <20200527212005.GC31990@redhat.com> on May 27 2020, since
I recalled there was a very explicit reason why we had to use
page_mapcount in do_wp_page and deliver full accuracy.

Now I cannot prove there's any such user that will break, but we'll
find those with a delay of a year or more.

Even the tlb flushing deferral that caused clear_refs_write to also
corrupt userland memory, and literally lose userland writes even
without any special secondary MMU hardware attached to the memory,
took 6 months to find.

> if a page is pinned, the dirty bit of it makes no sense, because it
> might be dirtied completely asynchronously by the pinner.
>
> So I think none of the softdirty stuff should touch pinned pages. I
> think it falls solidly under that "don't do it then".
>
> Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that
> hard, does it?

1) How do you know it's not a speculative lookup or an O_DIRECT that
   made them look pinned?

2) If it's a hugepage, a single 4k pin will make soft dirty skip 511
   dirty bits by mistake, including those of pages that weren't pinned
   and that userland is still writing through the transhuge pmd.

   Until v4.x we had a page pin counter for each subpage, so the above
   wouldn't have happened. That is no longer the case since it was
   simplified and improved, and as a result page_count is now pretty
   inefficient there, even if it'd be possible to make it safe.

3) The GUP(write=0) may just be reading from RAM and sending to the
   other end, so clear_refs may currently have been tracking CPU
   writes perfectly well, with no interference whatsoever from the
   secondary MMU working read-only on the memory.

Thanks,
Andrea