Date: Fri, 15 Jan 2021 14:37:21 -0400
From: Jason Gunthorpe
To: David Hildenbrand
Cc: Andrea Arcangeli, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, "Kirill A.
Shutemov", Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai, Nadav Amit, Jens Axboe
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Message-ID: <20210115183721.GG4605@ziepe.ca>
References: <20210110004435.26382-1-aarcange@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 15, 2021 at 09:59:23AM +0100, David Hildenbrand wrote:

> AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
> passthrough of a PCI device) can essentially be corrupted by "echo 4 >
> /proc/[pid]/clear_refs".

I've been told when doing migration with RDMA the VM's memory also
ends up pinned, and then it does the stuff of #4.

So it deliberately does clear_refs(4) on RDMA pinned memory and
requires no COW.

This is now a real world uABI break, unfortunately.

> 7) There is no easy way to detect if a page really was pinned: we might
> have false positives. Further, there is no way to distinguish if it was
> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
> we most probably would need more counters, which we cannot fit into
> struct page. (AFAIU, for huge pages it's easier).

I think this is the real issue. We can only store so much information,
so we have to decide which things work and which things are broken. So
far no one has presented a way to record everything, at least.

> However, AFAIU, even being able to detect if (and how) a page was pinned
> would not completely help to solve the puzzle.

At least for COW reuse, if we assign labels to every page user, and
imagine we can track everything, I think we get this list:

- # of ptes referencing the page (mapcount?)
- # of page * pointer references that don't touch data
  (ie the speculative page cache ref)
- # of DMA/CPU readers
- # of DMA/CPU writers
- # of long term data accesses
- # of other reader/writers (specifically process incoherent
  reader/writers, not "DMA with the CPU" like vmsplice/io_uring)

Maybe there are more? This is what I've understood so far from this
thread?

Today's kernel makes the COW reuse decision as:

   # ptes == 1 && # refs == 0 &&
   # DMA readers == 0 && # DMA writers == 0 &&
   # of longterm == 0 && # other reader/writers == 0

(in essence this is what _refcount == 1 is saying, I think)

From a GUP perspective I think the useful property is "a physical page
under GUP is not indirectly removed from the mm_struct that pinned
it". This is the idea that the process CPU page table and the ongoing
DMA remain synchronized. This is a generalized statement from the
clear_refs(4) and fork() regressions.

Therefore, COW should not copy a page just because it is under GUP;
that breaks the idea directly.

We've also said speculative #refs should not cause COW.

Removing both of those gets us to the COW reuse decision as:

   # ptes == 1 && # other reader/writers == 0

And I think where Linus is coming from is that '# ptes' (eg mapcount)
alone is not right because there are other relevant reader/writers
too. (I'm not sure what these are, has someone pointed at one?)

So, we have 64 bits for _refcount and _mapcount and we currently
encode things as:

- # ptes (_mapcount)
- # page pointers +           (low bits of _refcount)
  # DMA reader + writers +
  # other reader/writers +
  # ptes                      # We incr both _mapcount and _refcount?
- # long term data accesses   (high bits of _refcount)

If we move '# other reader/writers' to _mapcount (maybe with a shift),
does it help?

We also talked about GUP as meaning wrprotect == 0, but we could also
change that to the idea that GUP means COW will always re-use, eg
'# ptes == 1 && # other reader/writers == 0'.
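
As a rough sketch only, the two COW reuse tests above could be written
like this. The struct and function names are made up for illustration;
the kernel does not track these categories separately today and nothing
below is real kernel code.

```c
#include <stdbool.h>

/*
 * Hypothetical per-page user counts, one field per label in the list
 * above. This is an illustration, not a kernel data structure.
 */
struct page_users {
	unsigned int ptes;        /* # of ptes referencing the page */
	unsigned int refs;        /* # of page * refs that don't touch data */
	unsigned int dma_readers; /* # of DMA/CPU readers */
	unsigned int dma_writers; /* # of DMA/CPU writers */
	unsigned int longterm;    /* # of long term data accesses */
	unsigned int others;      /* # of other (process incoherent) reader/writers */
};

/* The test that, in essence, "_refcount == 1" is expressing. */
static bool cow_reuse_today(const struct page_users *u)
{
	return u->ptes == 1 && u->refs == 0 &&
	       u->dma_readers == 0 && u->dma_writers == 0 &&
	       u->longterm == 0 && u->others == 0;
}

/* The same test with the speculative refs and the GUP users removed. */
static bool cow_reuse_proposed(const struct page_users *u)
{
	return u->ptes == 1 && u->others == 0;
}
```

With the second test, a page that is only under GUP (one pte, no other
reader/writers) is always reused in place, whatever the DMA counts are.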
This gives some definition of what mprotect(PROT_READ) means to pages
under DMA (though I still think PROT_READ of pages under DMA write is
weird)

> 8) We have a vmsplice security issue that has to be fixed by touching
> the code in question. A forked child process can read memory content of
> its parent, which was modified by the parent after fork. AFAIU, the fix
> will further lock us in into the direction of the code we are heading.

No, vmsplice is just wrong. vmsplice has to do
FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE for read only access to pages if
userspace controls the duration of the pin.

There are other bad bugs, like permanently locking DAX/CMA/ZONE_MIGRATE
memory if the above pattern is not used.

There was some debate over alternatives, but for a backport security
fix it has to be the above. AFAIK.

Jason