From: Andy Lutomirski
Date: Tue, 1 Dec 2020 13:50:38 -0800
Subject: Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
To: Will Deacon
Cc: Andy Lutomirski, Catalin Marinas, Heiko Carstens, Vasily Gorbik,
    Christian Borntraeger, Dave Hansen, Nicholas Piggin, LKML, X86 ML,
    Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-arch,
    linuxppc-dev, Linux-MM, Anton Blanchard
In-Reply-To: <20201201212758.GA28300@willie-the-truck>
References: <20201128160141.1003903-1-npiggin@gmail.com> <20201128160141.1003903-7-npiggin@gmail.com> <20201201212758.GA28300@willie-the-truck>

On Tue, Dec 1, 2020 at 1:28 PM Will Deacon wrote:
>
> On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> > other arch folk: there's some background here:
> >
> > https://lkml.kernel.org/r/CALCETrVXUbe8LfNn-Qs+DzrOQaiw+sFUg1J047yByV31SaTOZw@mail.gmail.com
> >
> > On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski wrote:
> > >
> > > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski wrote:
> > > >
> > > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin wrote:
> > > > >
> > > > > On big systems, the mm refcount can become highly contended when doing
> > > > > a lot of context switching with threaded applications (particularly
> > > > > switching between the idle thread and an application thread).
> > > > >
> > > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > > user->idle->user cases, so instead implement a non-refcounted scheme
> > > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > > > any remaining lazy ones.
> > > > >
> > > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > > a big problem with this scheme (the powerpc implementation generated
> > > > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > > > There are a number of strategies that could be employed to reduce IPIs
> > > > > if they turn out to be a problem for some workload.
> > > >
> > > > I'm still wondering whether we can do even better.
> > >
> > > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > > the TLB. On x86, this will shoot down all lazies as long as even a
> > > single pagetable was freed. (Or at least it will if we don't have a
> > > serious bug, but the code seems okay. We'll hit pmd_free_tlb, which
> > > sets tlb->freed_tables, which will trigger the IPI.) So, on
> > > architectures like x86, the shootdown approach should be free. The
> > > only way it ought to have any excess IPIs is if we have CPUs in
> > > mm_cpumask() that don't need IPI to free pagetables, which could
> > > happen on paravirt.
> >
> > Indeed, on x86, we do this:
> >
> > [ 11.558844] flush_tlb_mm_range.cold+0x18/0x1d
> > [ 11.559905] tlb_finish_mmu+0x10e/0x1a0
> > [ 11.561068] exit_mmap+0xc8/0x1a0
> > [ 11.561932] mmput+0x29/0xd0
> > [ 11.562688] do_exit+0x316/0xa90
> > [ 11.563588] do_group_exit+0x34/0xb0
> > [ 11.564476] __x64_sys_exit_group+0xf/0x10
> > [ 11.565512] do_syscall_64+0x34/0x50
> >
> > and we have info->freed_tables set.
> >
> > What are the architectures that have large systems like?
> >
> > x86: we already zap lazies, so it should cost basically nothing to do
> > a little loop at the end of __mmput() to make sure that no lazies are
> > left. If we care about paravirt performance, we could implement one
> > of the optimizations I mentioned above to fix up the refcounts instead
> > of sending an IPI to any remaining lazies.
> >
> > arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> > remote flushes, so any lazy mm references will still exist after
> > exit_mmap(). (arm64 uses lazy TLB, right?) So this is kind of like
> > the x86 paravirt case. Are there large enough arm64 systems that any
> > of this matters?
>
> Yes, there are large arm64 systems where performance of TLB invalidation
> matters, but they're either niche (supercomputers) or not readily available
> (NUMA boxes).
>
> But anyway, we blow away the TLB for everybody in tlb_finish_mmu() after
> freeing the page-tables. We have an optimisation to avoid flushing if
> we're just unmapping leaf entries when the mm is going away, but we don't
> have a choice once we get to actually reclaiming the page-tables.
>
> One thing I probably should mention, though, is that we don't maintain
> mm_cpumask() because we're not able to benefit from it and the atomic
> update is a waste of time.

Do you do anything special for lazy TLB or do you just use the generic
code? (i.e. where do your user pagetables point when you go from a user
task to idle or to a kernel thread?)

Do you end up with all cpus set in mm_cpumask or can you have the mm
loaded on a CPU that isn't in mm_cpumask?

--Andy

>
> Will
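
For readers following the thread, the "shoot lazies" scheme under discussion
boils down to roughly the sketch below, run once the last real user of the mm
is gone and before the mm is freed. This is a minimal illustration only: the
helper names (shoot_lazy_tlbs, shoot_lazy_tlb_ipi) are made up here, and the
actual patch series wires this up differently and per architecture.

/*
 * Minimal sketch of the "shoot lazies" idea, with made-up helper names;
 * not the code from the patch series.
 */
#include <linux/mm_types.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <asm/mmu_context.h>

/* IPI handler: runs on every CPU still set in mm_cpumask(mm). */
static void shoot_lazy_tlb_ipi(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		/* Only lazy users can remain; real users already did mmput(). */
		WARN_ON_ONCE(current->mm);
		/* Kick this CPU off the dying mm by switching it to init_mm. */
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

/*
 * Run after exit_mmap(), before the mm is freed. The IPI is synchronous
 * (wait == true), so once it returns no CPU can still be using @mm as a
 * lazy active_mm, and the mm can be freed without lazy-TLB refcounting.
 */
static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	on_each_cpu_mask(mm_cpumask(mm), shoot_lazy_tlb_ipi, mm, true);
}

On x86 this pass would typically find nothing left to shoot, since, as noted
above, the TLB flush issued when page tables are freed in exit_mmap() has
already forced lazy CPUs off the mm; the open question in the thread is what
it costs on architectures such as arm64 that use broadcast invalidation and
do not maintain mm_cpumask().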