From: Andy Lutomirski
Date: Sat, 28 Nov 2020 19:54:57 -0800
Subject: Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
To: Nicholas Piggin
Cc: LKML, X86 ML, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra,
    linux-arch, linuxppc-dev, Linux-MM, Anton Blanchard
In-Reply-To: <20201128160141.1003903-7-npiggin@gmail.com>
References: <20201128160141.1003903-1-npiggin@gmail.com> <20201128160141.1003903-7-npiggin@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin wrote:
>
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
>
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
>
> Shootdown IPIs are of some concern, but they have not been observed to
> be a big problem with this scheme (the powerpc implementation generated
> 314 additional interrupts on a 144 CPU system during a kernel compile).
> There are a number of strategies that could be employed to reduce IPIs
> if they turn out to be a problem for some workload.

I'm still wondering whether we can do even better.

The IPIs you're doing aren't really necessary -- we don't fundamentally
need to free the pagetables immediately when all non-lazy users are done
with them (and current kernels don't). What we need to do is to
synchronize all the bookkeeping. So, with adequate locking (famous last
words), a couple of alternative schemes ought to be possible.

a) Instead of sending an IPI, increment mm_count on behalf of the remote
CPU and do something to make sure that the remote CPU knows we did this
on its behalf. Then free the mm when mm_count hits zero.

b) Treat mm_cpumask as part of the refcount. Add one to mm_count when an
mm is created. Once mm_users hits zero, whoever clears the last bit in
mm_cpumask is responsible for decrementing a single reference from
mm_count, and whoever drops mm_count to zero frees the mm.

Version (b) seems fairly straightforward to implement -- add RCU
protection and an atomic_t special_ref_cleared (initially 0) to struct
mm_struct itself. After anyone clears a bit in mm_cpumask (which is
already a barrier), they read mm_users. If it's zero, they scan
mm_cpumask and see if it's empty. If it is, they atomically swap
special_ref_cleared to 1. If it was zero before the swap, they do
mmdrop(). I can imagine some tweaks that could make this a bit faster,
at least in the limit of a huge number of CPUs.

Version (a) seems a bit harder to reason about. Maybe it could be done
like this: add a percpu variable mm_with_extra_count. This variable can
be NULL, but it can also be an mm that has an extra reference on behalf
of the cpu in question. __mmput scans mm_cpumask and, for each cpu in
the mask, mmgrabs the mm and cmpxchgs that cpu's mm_with_extra_count
from NULL to mm. If it succeeds, then we win. If it fails, further
thought is required, and maybe we have to send an IPI, although maybe
some other cleverness is possible.
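To make the __mmput side of (a) a little more concrete, here's a rough
userspace model of the handoff using plain C11 atomics; the struct
layout and the free()/shoot_lazy() fallback are just placeholders for
illustration, not the real kernel API:

#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS 8

struct mm {
	atomic_int mm_count;              /* stands in for mm->mm_count */
	atomic_bool cpumask[NR_CPUS];     /* stands in for mm_cpumask(mm) */
};

/* One slot per CPU, standing in for the proposed percpu variable. */
static _Atomic(struct mm *) mm_with_extra_count[NR_CPUS];

static void mmgrab(struct mm *mm)
{
	atomic_fetch_add(&mm->mm_count, 1);
}

static void mmdrop(struct mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
		free(mm);                 /* placeholder for the real teardown */
}

static void shoot_lazy(int cpu, struct mm *mm)
{
	(void)cpu; (void)mm;              /* placeholder for the fallback IPI */
}

/* The __mmput-side scan: try to hand an extra reference to each lazy CPU. */
static void mmput_handoff(struct mm *mm)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct mm *expected = NULL;

		if (!atomic_load(&mm->cpumask[cpu]))
			continue;

		mmgrab(mm);
		if (!atomic_compare_exchange_strong(&mm_with_extra_count[cpu],
						    &expected, mm)) {
			/* Slot already in use: undo the grab and fall back. */
			mmdrop(mm);
			shoot_lazy(cpu, mm);
		}
	}
}

The context-switch side then has to consume that handoff, as described
below.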
Any time a CPU switches mms, it atomically swaps mm_with_extra_count to
NULL and mmdrops whatever the mm was. (Maybe it needs to check that the
mm isn't equal to the new mm, although it would be quite bizarre for
this to happen.) Other than these mmgrab and mmdrop calls, the mm
switching code doesn't mmgrab or mmdrop at all.

Version (a) seems like it could have excellent performance.

*However*, I think we should consider whether we want to do something
even bigger first. Even with any of these changes, we still need to
maintain mm_cpumask(), and that itself can be a scalability problem. I
wonder if we can solve this problem too. Perhaps the switch_mm() paths
could only ever set mm_cpumask bits, and anyone who would send an IPI
because a bit is set in mm_cpumask would first check some percpu
variable (cpu_rq(cpu)->something? an entirely new variable) to see
whether the bit in mm_cpumask is spurious. Or perhaps mm_cpumask could
be split up across multiple cachelines, one per node.

We should keep the recent lessons from Apple in mind, though: x86 is a
dinosaur. The future of atomics is going to look a lot more like ARM's
LSE than x86's rather anemic set. This means that mm_cpumask operations
won't need to be full barriers forever, and we might not want to take
the implied full barriers in set_bit() and clear_bit() for granted.

--Andy
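P.S. To spell out the bookkeeping for (b) a bit more, here's a similar
minimal sketch of the clear-side path; again it's only a userspace model
with made-up helper names, and it leaves out the RCU protection
mentioned above:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS 8

struct mm {
	atomic_int mm_users;
	atomic_int mm_count;              /* initialized to 1 for the cpumask */
	atomic_bool cpumask[NR_CPUS];     /* stands in for mm_cpumask(mm) */
	atomic_int special_ref_cleared;   /* 0 until the cpumask ref is dropped */
};

static void mmdrop(struct mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
		free(mm);                 /* placeholder for the real teardown */
}

static bool cpumask_empty(struct mm *mm)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&mm->cpumask[cpu]))
			return false;
	return true;
}

/* A CPU calls this when it drops out of the mm (lazily or otherwise). */
static void clear_cpu_and_maybe_drop(struct mm *mm, int cpu)
{
	atomic_store(&mm->cpumask[cpu], false);  /* the clear; a barrier in the real code */

	if (atomic_load(&mm->mm_users) != 0)
		return;

	if (!cpumask_empty(mm))
		return;

	/* Whoever flips special_ref_cleared from 0 to 1 drops the extra ref. */
	if (atomic_exchange(&mm->special_ref_cleared, 1) == 0)
		mmdrop(mm);
}

The set side would just set this CPU's bit, and mm creation would
initialize mm_count with the extra reference for the cpumask.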