Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
From: Andrew Hunter
To: Andy Lutomirski
Cc: Michael Kerrisk, Mathieu Desnoyers, Paul Turner, Ben Maurer, Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt, "Paul E. McKenney", Josh Triplett, Lai Jiangshan, Linus Torvalds, Andrew Morton, Linux API
Date: Fri, 22 May 2015 15:06:47 -0700
In-Reply-To / References: <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com>

On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski wrote:
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.

We had to deploy the userspace percpu API (percpu sharded locks, {double,}compare-and-swap, atomic increment, etc.) universally across the fleet without waiting for 100% kernel penetration, not to mention wanting to be able to disable the kernel acceleration in case of kernel bugs.
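For concreteness, the scheme Andy describes above can be sketched in userspace C11 roughly as below. This is only an illustration, not anyone's actual implementation: the 64-bit tid+count packing, the struct, and the function names are all hypothetical, and the "signal the holder with kill" step is elided to a comment since delivering and handling the abort signal is the hard part.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical encoding: one 64-bit word per cpu, holding the owner's
 * tid (0 = unlocked) in the high half and that owner's lock-acquisition
 * count in the low half. */
struct percpu_lock { _Atomic uint64_t word; };

static uint64_t pack(uint32_t tid, uint32_t count)
{
    return ((uint64_t)tid << 32) | count;
}

/* Try to take a per-cpu lock.  Returns 1 on success, 0 if the steal
 * raced with the holder making progress (caller would retry). */
static int percpu_lock_acquire(struct percpu_lock *l,
                               uint32_t tid, uint32_t my_count)
{
    uint64_t old = 0;
    if (atomic_compare_exchange_strong(&l->word, &old,
                                       pack(tid, my_count)))
        return 1;               /* lock was free: clean acquire */

    /* Lock was held: here we would signal the holder (kill/tgkill),
     * telling it which acquisition count is being aborted, then steal
     * the lock -- but only if the word (tid + count) is unchanged,
     * i.e. the holder didn't complete and re-acquire in the meantime.
     * After the failed CAS, 'old' holds the observed holder word. */
    return atomic_compare_exchange_strong(&l->word, &old,
                                          pack(tid, my_count));
}
```

The key property is that both the clean acquire and the steal are single compare-and-swaps against the full word, so a holder that finished its critical section (bumping its count or releasing) makes any in-flight steal fail harmlessly.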
(Since this is mostly used in core userspace infrastructure -- malloc, various statistics platforms, etc. -- checking for availability isn't feasible. The primitives have to work 100% of the time, or they would be too complicated for our developers to bother using.)

So we did basically this (without the lock stealing...): we have a single per-cpu spin lock manipulated with atomics, which we take very briefly to implement (e.g.) compare-and-swap. The performance is hugely worse: typical overheads are in the 10x range _without_ any on-cpu contention. Uncontended atomics are much cheaper than they were on pre-Nehalem chips, but they still can't hold a candle to unsynchronized instructions.

As a fallback path for userspace, this is fine -- if 5% of binaries on busted kernels aren't quite as fast, we can work with that in exchange for being able to write a percpu op without worrying about what to do on -ENOSYS. But it's just not fast enough to compete as the intended way to do things.

AHH