Subject: Re: Updated sys_membarrier() speedup patch, FYI
To: paulmck@linux.vnet.ibm.com, maged.michael@gmail.com, ahh@google.com,
 gromer@google.com
Cc: linux-kernel@vger.kernel.org
References: <20170727181250.GA20183@linux.vnet.ibm.com>
From: Avi Kivity
Message-ID: <5c8c6946-ce3a-6183-76a2-027823a9948a@scylladb.com>
Date: Thu, 27 Jul 2017 22:20:14 +0300
In-Reply-To: <20170727181250.GA20183@linux.vnet.ibm.com>

On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> Hello!
>
> Please see below for a prototype sys_membarrier() speedup patch.
> Please note that there is some controversy on this subject, so the
> final version will probably be quite a bit different from this
> prototype.
>
> But my main question is whether the throttling shown below is
> acceptable for your use cases, namely only one expedited
> sys_membarrier() permitted per scheduling-clock period (1 millisecond
> on many platforms), with any excess being silently converted to
> non-expedited form. The reason for the throttling is concerns about
> DoS attacks based on user code with a tight loop invoking this system
> call.
>
> Thoughts?

Silent throttling would render it useless for me. -EAGAIN is a little
better, but I'd be forced to spin until either I get kicked out of my
loop, or it succeeds.

IPIing only the running threads of my process would be perfect. In fact
I might even be able to make use of "membarrier these threads please"
to reduce IPIs, when I change the topology from fully connected to
something more sparse, on larger machines.

My previous implementations were a signal (but that's horrible on large
machines) and trylock + mprotect (but that doesn't work on ARM).
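For reference, the mprotect() trick looks roughly like the sketch below.
This is a minimal illustration with invented names, not the actual
implementation: it relies on the kernel shooting down stale TLB entries
on every CPU that is currently running a thread of the process when a
page's protection is downgraded, and on the shootdown IPI acting as a
full memory barrier on those CPUs. On ARM the invalidation is broadcast
in hardware without IPIs, hence no barrier, which is why it doesn't work
there; it is presumably also what the commit message below means by
being "vulnerable to reasonable optimizations that might skip some
IPIs".

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

static void *mb_page;
static pthread_mutex_t mb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Call once at startup: map a private dummy page (4 KiB pages assumed)
 * and touch it so it is actually populated. */
static void mb_init(void)
{
	mb_page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mb_page == MAP_FAILED)
		abort();
	*(volatile char *)mb_page = 0;
}

/* Force a memory barrier on all CPUs currently running this process. */
static void mb_all_running_threads(void)
{
	/* A plain mutex for simplicity; a trylock would let concurrent
	 * callers share a single round of IPIs. */
	pthread_mutex_lock(&mb_lock);
	/* Make the page writable and dirty it, then drop write
	 * permission. The downgrade forces the kernel to flush the TLB
	 * on every CPU where this mm is active, which on x86 means an
	 * IPI and therefore a full barrier on that CPU. */
	mprotect(mb_page, 4096, PROT_READ | PROT_WRITE);
	*(volatile char *)mb_page = 0;
	mprotect(mb_page, 4096, PROT_READ);
	pthread_mutex_unlock(&mb_lock);
}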
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd
> Author: Paul E. McKenney
> Date:   Tue Jul 18 13:53:32 2017 -0700
>
>     sys_membarrier: Add expedited option
>
>     The sys_membarrier() system call has proven too slow for some use
>     cases, which has prompted users to instead rely on TLB shootdown.
>     Although TLB shootdown is much faster, it has the slight
>     disadvantage of not working at all on arm and arm64 and also of
>     being vulnerable to reasonable optimizations that might skip some
>     IPIs. However, the Linux kernel does not currently provide a
>     reasonable alternative, so it is hard to criticize these users for
>     doing what works for them on a given piece of hardware at a given
>     time.
>
>     This commit therefore adds an expedited option to the
>     sys_membarrier() system call, thus providing a faster mechanism
>     that is portable and is not subject to death by optimization.
>     Note that if more than one MEMBARRIER_CMD_SHARED_EXPEDITED
>     sys_membarrier() call happens within the same jiffy, all but the
>     first will use synchronize_sched() instead of
>     synchronize_sched_expedited().
>
>     Signed-off-by: Paul E. McKenney
>     [ paulmck: Fix code style issue pointed out by Boqun Feng. ]
>     Tested-by: Avi Kivity
>     Cc: Maged Michael
>     Cc: Andrew Hunter
>     Cc: Geoffrey Romer
>
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..5720386d0904 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,6 +40,16 @@
>   *                          (non-running threads are de facto in such a
>   *                          state). This covers threads from all processes
>   *                          running on the system. This command returns 0.
> + * @MEMBARRIER_CMD_SHARED_EXPEDITED: Execute a memory barrier on all
> + *                          running threads, but in an expedited fashion.
> + *                          Upon return from system call, the caller thread
> + *                          is ensured that all running threads have passed
> + *                          through a state where all memory accesses to
> + *                          user-space addresses match program order between
> + *                          entry to and return from the system call
> + *                          (non-running threads are de facto in such a
> + *                          state). This covers threads from all processes
> + *                          running on the system. This command returns 0.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> @@ -48,6 +58,7 @@
>  enum membarrier_cmd {
>  	MEMBARRIER_CMD_QUERY = 0,
>  	MEMBARRIER_CMD_SHARED = (1 << 0),
> +	MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1),
>  };
>
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> index 9f9284f37f8d..587e3bbfae7e 100644
> --- a/kernel/membarrier.c
> +++ b/kernel/membarrier.c
> @@ -22,7 +22,8 @@
>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>   * except MEMBARRIER_CMD_QUERY.
>   */
> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED | \
> +				 MEMBARRIER_CMD_SHARED_EXPEDITED)
>
>  /**
>   * sys_membarrier - issue memory barriers on a set of threads
> @@ -64,6 +65,20 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>  		if (num_online_cpus() > 1)
>  			synchronize_sched();
>  		return 0;
> +	case MEMBARRIER_CMD_SHARED_EXPEDITED:
> +		if (num_online_cpus() > 1) {
> +			static unsigned long lastexp;
> +			unsigned long j;
> +
> +			j = jiffies;
> +			if (READ_ONCE(lastexp) == j) {
> +				synchronize_sched();
> +			} else {
> +				WRITE_ONCE(lastexp, j);
> +				synchronize_sched_expedited();
> +			}
> +		}
> +		return 0;
>  	default:
>  		return -EINVAL;
>  	}
>
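For what it's worth, the way I'd expect to consume this from userspace
is roughly the sketch below. It assumes the MEMBARRIER_CMD_SHARED_EXPEDITED
definition from the patch above, uses a raw syscall(2) wrapper since
libc provides none, and the helper names are invented:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Issue a system-wide barrier, preferring the expedited command when
 * the running kernel advertises it via MEMBARRIER_CMD_QUERY, and
 * falling back to the slow MEMBARRIER_CMD_SHARED otherwise. */
static void heavy_barrier(void)
{
	static int have_expedited = -1;	/* -1: not probed yet */

	if (have_expedited < 0) {
		int cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);

		have_expedited = cmds > 0 &&
			(cmds & MEMBARRIER_CMD_SHARED_EXPEDITED);
	}
	if (!have_expedited ||
	    membarrier(MEMBARRIER_CMD_SHARED_EXPEDITED, 0) != 0)
		membarrier(MEMBARRIER_CMD_SHARED, 0);
}

Note that with the throttling as posted, the expedited call still
returns 0 even when it has been silently downgraded to
synchronize_sched(), so there is nothing for the caller to detect or
retry; that is exactly the problem for my use case.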