Date: Thu, 16 Jul 2020 12:35:12 +1000
From: Nicholas Piggin
Subject: Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
To: Andy Lutomirski
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML
References: <1594708054.04iuyxuyb5.astroid@bobo.none> <6D3D1346-DB1E-43EB-812A-184918CCC16A@amacapital.net>
In-Reply-To: <6D3D1346-DB1E-43EB-812A-184918CCC16A@amacapital.net>
Message-Id: <1594866490.ultl891sgk.astroid@bobo.none>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
>
>
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin wrote:
>>
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>>>
>>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin wrote:
>>>>>
>>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin wrote:
>>>>>>>
>>>>>>> On big systems, the mm refcount can become highly contended when doing
>>>>>>> a lot of context switching with threaded applications (particularly
>>>>>>> switching between the idle thread and an application thread).
>>>>>>>
>>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>>>> any remaining lazy ones.
>>>>>>>
>>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>>>
>>>>>>
>>>>>> I read the patch a couple of times, and I have a suggestion that could
>>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>>>> those references too." I'm wondering whether you actually need the
>>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>>>> mm.
>>>>>>
>>>>>> Getting the locking right here could be a bit tricky -- you need to
>>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>>>> exiting lazy TLB.
>>>>>> Perhaps this could be resolved by having mm_count
>>>>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>>>
>>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>>>> only ever take the lock after mm_users goes to zero.
>>>>>
>>>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>>>
>>>>> I haven't seen much problem here that made me too concerned about IPIs
>>>>> yet, so I think the simple patch may be good enough to start with
>>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>>>> unlazying with the exit TLB flush without doing anything fancy with
>>>>> ref counting, but we'll see.
>>>>
>>>> I would be cautious with benchmarking here. I would expect that the
>>>> nasty cases may affect power consumption more than performance -- the
>>>> specific issue is IPIs hitting idle cores, and the main effects are to
>>>> slow down exit() a bit but also to kick the idle core out of idle.
>>>> Although, if the idle core is in a deep sleep, that IPI could be
>>>> *very* slow.
>>>
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of IPI) but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>>
>>>> So I think it's worth at least giving this a try.
>>>
>>> To be clear it's not a complete solution itself. The problem is of
>>> course that mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>>
>> ^^
>>
>> False positives: CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>>
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
>
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :)
>
> Can you share your benchmark?

Just testing the IPI rates (on a smaller 176 CPU system), on a kernel
compile, it causes about 300 shootdown interrupts (not 300 broadcasts
but total interrupts). And very short lived fork;exec;exit things like
typical scripting commands don't typically generate any. So yeah the
really high exit rate things self-limit pretty well.

I documented the concern and added a few of the possible ways to
further reduce IPIs in the comments though.

Thanks,
Nick
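
P.S. For anyone following along who hasn't read the series, the shoot-down
path being discussed is conceptually along these lines. This is only a rough
sketch of the idea, not the actual patch: function names, the exact handoff
to init_mm, header choices, and config gating are simplified or assumed here.

#include <linux/mm_types.h>
#include <linux/mmu_context.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>

/*
 * Sketch: if this CPU is holding the dying mm only as a lazy (active_mm)
 * reference, switch it over to init_mm so the mm can be torn down without
 * keeping a refcount on behalf of lazy users.
 */
static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		/* Only kernel/idle threads should be lazily using this mm */
		WARN_ON_ONCE(current->mm);
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

/*
 * Sketch of the __mmdrop()-time step: IPI the CPUs still set in
 * mm_cpumask() (which can include false positives) to clear any
 * remaining lazy references before the page tables are freed.
 */
static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	/* Run the handler on remote CPUs in the mask, waiting for completion */
	smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, mm, true);
	/* smp_call_function_many() skips the local CPU, so handle it here */
	do_shoot_lazy_tlb(mm);
}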