From: Jann Horn
Date: Thu, 6 Oct 2022 17:55:54 +0200
Subject: Re: ptep_get_lockless() on 32-bit x86/mips/sh looks wrong
To: Jason Gunthorpe
Cc: Linux-MM, Peter Zijlstra, Christoph Hellwig, kernel list, David Hildenbrand
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 6, 2022 at 5:44 PM Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> > ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
> >
> > pte_t pte;
> > do {
> >   pte.pte_low = ptep->pte_low;
> >   smp_rmb();
> >   pte.pte_high = ptep->pte_high;
> >   smp_rmb();
> > } while (unlikely(pte.pte_low != ptep->pte_low));
> >
> > It has a comment above it that argues that this is correct because:
> > 1. A present PTE can't become non-present and then become a present
> > PTE pointing to another page without a TLB flush in between.
> > 2. TLB flushes involve IPIs.
> >
> > As far as I can tell, in particular on x86, _both_ of those
> > assumptions are false; perhaps on mips and sh only one of them is?
> >
> > Number 2 is straightforward: X86 can run under hypervisors, and when
> > it runs under hypervisors, the MMU paravirtualization code (including
> > the KVM version) can implement remote TLB flushes without IPIs.
> >
> > Number 1 is gnarlier, because breaking that assumption implies that
> > there can be a situation where different threads see different memory
> > at the same virtual address because their TLBs are incoherent. But as
> > far as I know, it can happen when MADV_DONTNEED races with an
> > anonymous page fault, because zap_pte_range() does not always flush
> > stale TLB entries before dropping the page table lock. I think that's
> > probably fine, since it's a "garbage in, garbage out" kind of
> > situation - but if a concurrent GUP-fast can then theoretically end up
> > returning a completely unrelated page, that's bad.
> >
> >
> > Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> > just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> > done with it...
>
> I think the argument here has nothing to do with IPIs, but is more a
> statement on memory ordering.

The comment above the definition of ptep_get_lockless() claims: "it
will not switch to a completely different present page without a TLB
flush in between; something that we are blocking by holding interrupts
off."

> What we want to get is a non-torn load
> of low/high, under some restricted rules.
>
> PTE writes should be ordered so that the present/not present bit is
> properly:
>
> Zapping a PTE:
>
>    write_low (not present)
>    wmb()
>    write_high (a)
>    wmb()
>
> Reestablish a PTE:
>
>    write_high (b)
>    wmb()
>    write_low (present)
>    wmb()
>
> This ordering is necessary to make the TLB's atomic 64 bit load work
> properly, otherwise the TLB could read a present entry with a bogus
> other half!
>
> For ptep_get_lockless() we define non-torn as meaning the same as for the TLB:
>
>    pre-zap low / high (present)
>    reestablish low / high (b) (present)
>    any low / any high (not present)
>
> Other combinations are forbidden.
>
> The read side has a corresponding list of reads:
>
>    read_low
>    rmb()
>    read_high
>    rmb()
>    read_low
>
> So, it seems plausible this could be OK based only on atomics (I did
> not check that the present bit is properly placed in the right
> low/high). Do you see a way the atomics don't work out?

The race would be something like this, where A is one thread doing
ptep_get_lockless() and B, C and D are other threads:

A: read ptep->pte_low, sees low address half 0x00010000
B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
C: page fault installs a new PTE pointing to address 0x0001000200020000
A: read ptep->pte_high, sees high address half 0x00010002
C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
D: page fault installs a new PTE pointing to address 0x0001000300010000
A: re-read ptep->pte_low, sees low address half 0x00010000 matching
the first one
A: returns physical address 0x0001000200010000, which was never
actually in the PTE

So it's not a problem with the memory ordering, it's just that it's
not possible to atomically read a 64-bit PTE with 32-bit reads when
the PTE can completely change under you - and ptep_get_lockless() was
written under the assumption that this can't happen because of TLB
flush IPIs.
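
For illustration, here is a minimal userspace sketch that replays the
interleaving above in a single thread, so the outcome is deterministic.
This is not kernel code: struct split_pte, set_pte() and
replay_torn_read() are hypothetical names made up for the sketch, the
zap is modelled as writing 0, and the caller has to pick a high half
for the first PTE because the trace above only gives its low half.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit PTE stored as two 32-bit halves. */
struct split_pte {
	uint32_t pte_low;
	uint32_t pte_high;
};

/* Store a full 64-bit value into both halves (writer side, simplified). */
static void set_pte(struct split_pte *p, uint64_t val)
{
	p->pte_low = (uint32_t)val;
	p->pte_high = (uint32_t)(val >> 32);
}

/*
 * Replay the race step by step. Reader A performs the same three loads
 * as the low/high/low loop; the other threads' PTE updates are applied
 * between those loads. Returns 1 if A's final low-half check passes
 * (A accepts the value written to *out), 0 if A would retry.
 */
static int replay_torn_read(uint64_t first, uint64_t second,
			    uint64_t third, uint64_t *out)
{
	struct split_pte pte;
	uint32_t low, high;

	set_pte(&pte, first);
	low = pte.pte_low;	/* A: read low half of the first PTE */
	set_pte(&pte, 0);	/* B: zap, TLB flush still pending */
	set_pte(&pte, second);	/* C: fault installs the second PTE */
	high = pte.pte_high;	/* A: read high half of the second PTE */
	set_pte(&pte, 0);	/* second zap, TLB flush still pending */
	set_pte(&pte, third);	/* D: fault installs the third PTE */

	if (low != pte.pte_low)	/* A: re-read low half and compare */
		return 0;

	*out = ((uint64_t)high << 32) | low;
	return 1;
}
```

Calling replay_torn_read() with an assumed first value of
0x0001000100010000 and the second and third values from the trace
(0x0001000200020000 and 0x0001000300010000) passes the final check and
yields 0x0001000200010000, which matches none of the three values that
were ever stored.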