From: Andy Lutomirski
Subject: Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
Date: Wed, 15 Jul 2020 22:18:20 -0700
To: Nicholas Piggin
Cc: Mathieu Desnoyers, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86
In-Reply-To: <1594868476.6k5kvx8684.astroid@bobo.none>
References: <1594868476.6k5kvx8684.astroid@bobo.none>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin wrote:
>
> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote:
>>
>>> Excerpts from Nicholas Piggin's message
>>> of July 13, 2020 2:45 pm:
>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>> serialize. There are older kernels for which it does not promise to
>>>>> serialize. And I have plans to make it stop serializing in the
>>>>> nearish future.
>>>>
>>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>>> update the membarrier sync code, at least :)
>>>
>>> Oh, I should actually say Mathieu recently clarified a return from
>>> interrupt doesn't fundamentally need to serialize in order to support
>>> membarrier sync core.
>>
>> Clarification to your statement:
>>
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as the kernel serializes before returning to user-space.
>>
>> However, return from interrupt to user-space needs to be context serializing.
>
> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
> in the right places.
>
> A kernel thread does a use_mm, then it blocks and the user process with
> the same mm runs on that CPU, and then it calls into the kernel, blocks,
> the kernel thread runs again, another CPU issues a membarrier which does
> not IPI this one because it's running a kthread, and then the kthread
> switches back to the user process (still without having unused the mm),
> and then the user process returns from syscall without having done a
> core synchronising instruction.
>
> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
> I'm guessing it really only matters as an optimisation in case of idle
> threads. Idle thread is easy (well, easier) because it won't use_mm, so
> you could check for rq->curr == rq->idle in your loop (in a suitable
> sched accessor function).
>
> But...
> I'm not really liking this subtlety in the scheduler for all this
> (the scheduler still needs the barriers when switching out of idle).
>
> Can it be improved somehow? Let me forget the x86 core sync problem for now
> (that _may_ be a bit harder), and step back and look at what we're doing.
> The memory barrier case would actually suffer from the same problem as
> core sync, because in the same situation it has no implicit mmdrop in
> the scheduler switch code either.
>
> So what are we doing with membarrier? We want any activity caused by the
> set of CPUs/threads specified that can be observed by this thread before
> calling membarrier is appropriately fenced from activity that can be
> observed to happen after the call returns.
>
>   CPU0                        CPU1
>                               1. user stuff
>   a. membarrier()             2. enter kernel
>   b. read rq->curr            3. rq->curr switched to kthread
>   c. is kthread, skip IPI     4. switch_to kthread
>   d. return to user           5. rq->curr switched to user thread
>                               6. switch_to user thread
>                               7. exit kernel
>                               8. more user stuff
>
> As far as I can see, the problem is CPU1 might reorder step 5 and step
> 8, so you have mmdrop of the lazy mm be a mb after step 6.
>
> But why? The membarrier call only cares that there is a full barrier
> between 1 and 8, right? Which it will get from the previous context
> switch to the kthread.
>
> I must say the memory barrier comments in membarrier could be improved
> a bit (unless I'm missing where the main comment is). It's fine to know
> what barriers pair with one another, but we need to know which exact
> memory accesses it is ordering.
>
>   /*
>    * Matches memory barriers around rq->curr modification in
>    * scheduler.
>    */
>
> Sure, but it doesn't say what else is being ordered. I think it's just
> the user memory accesses, but it would be nice to make that a bit more
> explicit. If we had such comments then we might know this case is safe.
>
> I think the funny powerpc barrier is a similar case of this.
> If we ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
> CPU has or will have issued a memory barrier between running user
> code.
>
> So AFAIKS all this membarrier stuff in kernel/sched/core.c could
> just go away. Except on x86, because thread switch doesn't imply core
> sync, so CPU1 between 1 and 8 may never issue a core sync instruction,
> whereas a context switch must be a full mb.
>
> Before getting to x86 -- am I right, or way off track here?

I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupidly expensive core sync that is heavier than smp_mb()?

But I'm wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel accesses matter too. Heck, even:

int a[2];

Thread A:
a[0] = 1;
a[1] = 2;

Thread B:
write(fd, a, sizeof(a));

Doesn't do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone.

--Andy

>
> Thanks,
> Nick