From: Linus Torvalds
Date: Sun, 30 Oct 2022 22:00:32 -0700
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
To: Nadav Amit
Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	"Kirill A. Shutemov", jroedel@suse.de, ubizjak@gmail.com,
	Alistair Popple

On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit wrote:
>
> I am sorry for not managing to make it reproducible on your system.

Heh, that's very much *not* your fault.

Honestly, I didn't try very much or very hard. I felt like I understood
the cause of the problem sufficiently that I didn't really need a
reproducer, and I much prefer to just think the solution through and
try to make it really robust.

Or, put another way - I'm just lazy.

> Anyhow, I ran the tests with the patches and there are no failures.

Lovely.

> Thanks for addressing this issue.

Well, I'm not sure the issue is "addressed" yet. I think the patch
series is likely the right thing to do, but others may disagree with
the approach.

And regardless of that, this still leaves some questions open:

 (a) there's the issue of s390, which does its own version of
__tlb_remove_page_size().

I *think* s390 basically does the TLB flush synchronously in
zap_pte_range(), and that it would be for that reason trivial to just
add that 'flags' argument to the s390 __tlb_remove_page_size(), and
make it do

	if (flags & TLB_ZAP_RMAP)
		page_zap_pte_rmap(page);

at the top synchronously too (rough sketch after this list). But some
s390 person would need to look at it.

I *think* the issue is literally that straightforward and not a big
deal, but it's probably not even worth bothering the s390 people until
the VM people have decided "yes, this makes sense".

 (b) the issue I mentioned with the currently useless
"page_mapcount(page) < 0" test with that patch.

Again, this is mostly just janitorial stuff associated with that patch
series.

 (c) whether to worry about back-porting.

I don't *think* this is worth backporting, but if it causes other
changes, then maybe..
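To make (a) concrete: assuming s390 really does flush synchronously
like that, I'd expect the end result to look roughly like the sketch
below. TLB_ZAP_RMAP and page_zap_pte_rmap() are the names from this
series; the rest is just the current s390 helper with the new argument
bolted on, untested and unreviewed, so treat it as the shape of the
change rather than the change itself.

	/*
	 * Hand-wavy sketch only, not a real patch. Because s390 flushes
	 * the TLB synchronously in zap_pte_range(), it never batches
	 * pages, so it can also drop the pte rmap synchronously here
	 * instead of deferring it to the batched page freeing.
	 *
	 * The 'flags' argument and TLB_ZAP_RMAP / page_zap_pte_rmap()
	 * are the proposed additions from this series.
	 */
	static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
						  struct page *page,
						  int page_size,
						  unsigned int flags)
	{
		if (flags & TLB_ZAP_RMAP)
			page_zap_pte_rmap(page);	/* TLB already flushed */

		free_page_and_swap_cache(page);
		return false;	/* never asks the caller to force a flush */
	}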
> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for "force_flush"
> and potentially save TLB flushes.

I really liked my dirty patch, but your warning case really made it
obvious that it was just broken.

The thing is, moving the "set_page_dirty()" to later is really nice,
and really makes a *lot* of sense from a conceptual standpoint: only
after that TLB flush do we really have no more people who can dirty it.

BUT.

Even if we just used another bit in the array for "dirty", and did the
set_page_dirty() later (but still before getting rid of the rmap), that
wouldn't actually *work*.

Why? Because the race with folio_mkclean() would just come back. Yes,
now we'd have the rmap data, so mkclean would be forced to serialize
with the page table lock. But if we get rid of the "force_flush" for
the dirty bit, that serialization won't help, simply because we've
*dropped* the page table lock before we actually do the
set_page_dirty() again.

So the mkclean serialization needs *both* the late rmap dropping _and_
the page table lock being kept.

So deferring set_page_dirty() is conceptually the right thing to do
from a pure "just track the dirty bit" standpoint, but it doesn't work
with the way we currently expect mkclean to work.

> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

See above: I still think it would be the right thing in a perfect
world. But with the current folio_mkclean(), we just can't do it. I had
completely forgotten / repressed that horror-show.

So the current ordering rules are basically that we need to do
set_page_dirty() *and* we need to flush the TLBs before dropping the
page table lock. That's what gets us serialized with "mkclean".

The whole "drop rmap" can then happen at any later time; the only
important thing is that it happens at least after the TLB flush. We
could do the rmap drop still inside the page table lock, but honestly,
it just makes more sense to do it as we free the batched pages anyway.

Am I missing something still?

And again, this is about our horrid serialization between
folio_mkclean() and set_page_dirty(). It's related to how GUP +
set_page_dirty() is also fundamentally problematic. So that dirty bit
situation *may* change if the rules for folio_mkclean() change...
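Anyway, spelling that ordering rule out as a sketch - purely
illustrative glue, not the actual zap_pte_range() code, and
page_zap_pte_rmap() is again the name from this series:

	/*
	 * Illustrative only - not the real zap_pte_range(). The
	 * invariant: set_page_dirty() and the TLB flush both happen
	 * before the page table lock is dropped; the rmap drop can then
	 * happen any time after the flush (in practice when the batched
	 * pages are freed).
	 */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	if (pte_dirty(ptent))
		set_page_dirty(page);		/* (1) under the ptl */

	flush_tlb_range(vma, start, end);	/* (2) still under the ptl */
	pte_unmap_unlock(pte, ptl);

	page_zap_pte_rmap(page);		/* (3) late rmap drop: any
						 *     time after (2) */

                Linus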