Subject: Re: [RFC PATCH v2] membarrier: expedited private command
To: Peter Zijlstra, Nicholas Piggin
Cc: Mathieu Desnoyers, Michael Ellerman, "Paul E. McKenney", linux-kernel, Boqun Feng, Andrew Hunter, maged michael, gromer, Benjamin Herrenschmidt, Palmer Dabbelt, Dave Watson
From: Avi Kivity
Organization: ScyllaDB
Date: Tue, 1 Aug 2017 13:32:43 +0300

On 08/01/2017 01:22 PM, Peter Zijlstra wrote:
>
>> If mm cpumask is used, I think it's okay. You can cause quite similar
>> kinds of iteration over CPUs and lots of IPIs, TLB flushes, etc. using
>> munmap/mprotect/etc., or context-switch IPIs, etc. Are we reaching the
>> stage where we're controlling those kinds of ops in terms of impact
>> on the rest of the system?
>
> So x86 has a tight mm_cpumask(): we only broadcast TLB-invalidate IPIs
> to those CPUs actually running threads of our process (or very
> recently). So while there can be the sporadic stray IPI for a CPU that
> recently ran a thread of the target process, it will not get another
> one until it switches back into the process.
>
> On machines that need manual TLB broadcasts and don't keep a tight
> mask, yes, you can interfere at will, but if they care they can fix it
> by tightening the mask.
>
> In either case, mm_cpumask() will be bounded by the set of CPUs the
> threads are allowed to run on and will not interfere with the rest of
> the system.
>
> As to scheduler IPIs, those are limited to the CPUs the user is
> limited to and are rate-limited by the wakeup latency of the tasks.
> After all, all the time a task is runnable but not running, wakeups
> are no-ops.
>
> Trouble is, of course, that not everybody even sets a single bit in
> mm_cpumask(), and those that never clear bits will end up with a
> fairly wide mask, still interfering with work that isn't
> hard-partitioned.
I hate to propose a way to make this more complicated, but this could be
fixed by having a process first declare its intent to use expedited
process-wide membarrier. If it does, then every context switch updates a
process-wide cpumask indicating which CPUs are currently running threads
of that process:

	if (prev->mm != next->mm) {
		if (prev->mm->running_cpumask)
			cpumask_clear(...);
		if (next->mm->running_cpumask)
			cpumask_set(...);
	}

Now only processes that want expedited process-wide membarrier pay for
it (aside from a few predictable branches). You can even have threads
opt in, so unrelated threads that don't participate in the party don't
cause those bits to be set.
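
A minimal sketch of what I have in mind, assuming a hypothetical
mm->running_cpumask field allocated only when the process opts in (the
hook and function names below are made up, not existing kernel APIs;
only the cpumask_*_cpu()/smp_call_function_many() primitives are real):

	/*
	 * Sketch only: running_cpumask and these helpers are
	 * hypothetical. The mask is allocated at opt-in time, so
	 * everyone else pays just the NULL checks below.
	 */

	/* Called from the context-switch path when prev->mm != next->mm. */
	static inline void update_running_cpumask(struct mm_struct *prev,
						  struct mm_struct *next,
						  int cpu)
	{
		/* Drop our CPU from the old mm's mask, if it opted in. */
		if (prev && prev->running_cpumask)
			cpumask_clear_cpu(cpu, prev->running_cpumask);
		/* Add our CPU to the new mm's mask, if it opted in. */
		if (next && next->running_cpumask)
			cpumask_set_cpu(cpu, next->running_cpumask);
	}

	/* IPI handler: just order memory on the remote CPU. */
	static void ipi_mb(void *info)
	{
		smp_mb();
	}

	/* The expedited membarrier then only IPIs the tracked CPUs.
	 * Caller must disable preemption around the call; the last
	 * argument makes it wait until all handlers have run. */
	static void membarrier_expedited(struct mm_struct *mm)
	{
		smp_call_function_many(mm->running_cpumask, ipi_mb,
				       NULL, 1);
	}

The nice property is that the mask stays bounded by the CPUs the
opted-in threads actually run on, so the IPIs can't leak onto
hard-partitioned CPUs, matching what you describe for a tight
mm_cpumask().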