Subject: Re: [RFC PATCH v2] membarrier: expedited private command
To: Peter Zijlstra, Nicholas Piggin
Cc: Mathieu Desnoyers, Michael Ellerman, "Paul E. McKenney", linux-kernel, Boqun Feng, Andrew Hunter, maged michael, gromer, Benjamin Herrenschmidt, Palmer Dabbelt, Dave Watson
From: Avi Kivity
Organization: ScyllaDB
Date: Tue, 1 Aug 2017 13:32:43 +0300

On 08/01/2017 01:22 PM, Peter Zijlstra wrote:
>
>> If mm cpumask is used, I think it's okay. You can cause quite similar
>> kinds of iteration over CPUs and lots of IPIs, TLB flushes, etc. using
>> munmap/mprotect/etc., or context-switch IPIs, etc. Are we reaching the
>> stage where we're controlling those kinds of ops in terms of impact
>> on the rest of the system?
>
> So x86 has a tight mm_cpumask(): we only broadcast TLB-invalidate IPIs
> to those CPUs actually running threads of our process (or very
> recently). So while there can be the sporadic stray IPI for a CPU that
> recently ran a thread of the target process, it will not get another
> one until it switches back into the process.
>
> On machines that need manual TLB broadcasts and don't keep a tight
> mask, yes, you can interfere at will, but if they care they can fix it
> by tightening the mask.
>
> In either case, mm_cpumask() will be bounded by the set of CPUs the
> threads are allowed to run on and will not interfere with the rest of
> the system.
>
> As to scheduler IPIs, those are limited to the CPUs the user is
> limited to and are rate-limited by the wakeup latency of the tasks.
> After all, all the time a task is runnable but not running, wakeups
> are no-ops.
>
> Trouble is, of course, that not everybody even sets a single bit in
> mm_cpumask(), and those that never clear bits will end up with a
> fairly wide mask, still interfering with work that isn't
> hard-partitioned.
I hate to propose a way to make this more complicated, but this could be
fixed by having a process first declare its intent to use expedited
process-wide membarrier. If it does, then every context switch updates a
process-wide cpumask indicating which CPUs are currently running threads
of that process:

	if (prev->mm != next->mm) {
		if (prev->mm->running_cpumask)
			cpumask_clear(...);
		if (next->mm->running_cpumask)
			cpumask_set(...);
	}

Now only processes that want expedited process-wide membarrier pay for
it (aside from a few predictable branches). You can even have threads
opt in, so unrelated threads that don't participate in the party don't
cause those bits to be set.
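
A minimal sketch of what I have in mind, assuming a hypothetical
mm->running_cpumask field allocated only when the process opts in (the
hook and function names below are made up, not existing kernel APIs;
only the cpumask_*_cpu()/smp_call_function_many() primitives are real):

	/*
	 * Sketch only: running_cpumask and these helpers are
	 * hypothetical. The mask is allocated at opt-in time, so
	 * everyone else pays just the NULL checks below.
	 */

	/* Called from the context-switch path when prev->mm != next->mm. */
	static inline void update_running_cpumask(struct mm_struct *prev,
						  struct mm_struct *next,
						  int cpu)
	{
		/* Drop our CPU from the old mm's mask, if it opted in. */
		if (prev && prev->running_cpumask)
			cpumask_clear_cpu(cpu, prev->running_cpumask);
		/* Add our CPU to the new mm's mask, if it opted in. */
		if (next && next->running_cpumask)
			cpumask_set_cpu(cpu, next->running_cpumask);
	}

	/* IPI handler: just order memory on the remote CPU. */
	static void ipi_mb(void *info)
	{
		smp_mb();
	}

	/* The expedited membarrier then only IPIs the tracked CPUs.
	 * Caller must disable preemption around the call; the last
	 * argument makes it wait until all handlers have run. */
	static void membarrier_expedited(struct mm_struct *mm)
	{
		smp_call_function_many(mm->running_cpumask, ipi_mb,
				       NULL, 1);
	}

The nice property is that the mask stays bounded by the CPUs the
opted-in threads actually run on, so the IPIs can't leak onto
hard-partitioned CPUs, matching what you describe for a tight
mm_cpumask().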