Several people have suggested that Linux should provide users with a
lightweight mechanism for fast per-cpu operations. This could be used
to implement free lists or counters without any barriers or atomic
operations, for example.
There are at least three approaches floating around. Paul Turner
proposed a single block of userspace code that aborts if it's
preempted -- within that block, percpu variables can be used safely.
Mathieu Desnoyers proposed a more complex variant. I proposed a much
simpler approach of just offering percpu gs bases on x86, allowing
cmpxchg (as opposed to lock cmpxchg) to access percpu variables.
None of these should be hard to implement, but it would be nice to
hash out whether the kernel should support such a mechanism at all
and, if so, what it would look like.
Jon Corbet unsurprisingly has a nice writeup here:
http://lwn.net/SubscriberLink/650333/f23d07040a58cd46/
--Andy
On Thu, Jul 09, 2015 at 11:32:45AM -0700, Andy Lutomirski wrote:
> Several people have suggested that Linux should provide users with a
> lightweight mechanism for fast per-cpu operations. This could be used
> to implement free lists or counters without any barriers or atomic
> operations, for example.
>
> There are at least three approaches floating around. Paul Turner
> proposed a single block of userspace code that aborts if it's
> preempted -- within that block, percpu variables can be used safely.
> Mathieu Desnoyers proposed a more complex variant. I proposed a much
> simpler approach of just offering percpu gs bases on x86, allowing
> cmpxchg (as opposed to lock cmpxchg) to access percpu variables.
>
> None of these should be hard to implement, but it would be nice to
> hash out whether the kernel should support such a mechanism at all
> and, if so, what it would look like.
[ added Jens and Shaohua ]
We've started experimenting with these to cut overheads in a few
critical places, and while we don't have numbers yet I really hope it
won't take too long.
I think the topic is really interesting and we'll be able to get numbers
from production workloads to help justify and compare different
approaches.
-chris
On Thu, 9 Jul 2015, Chris Mason wrote:
> I think the topic is really interesting and we'll be able to get numbers
> from production workloads to help justify and compare different
> approaches.
OK, that would be important. I also think the approach could be used
in the kernel to reduce the overhead of CONFIG_PREEMPT, to implement
fast versions of this_cpu_ops for non-x86 architectures, and maybe even
to optimize the x86 variants, if interrupts can also detect critical
sections and restart at defined points.
On Fri, Jul 10, 2015 at 12:26:21PM -0500, Christoph Lameter wrote:
> On Thu, 9 Jul 2015, Chris Mason wrote:
>
> > I think the topic is really interesting and we'll be able to get numbers
> > from production workloads to help justify and compare different
> > approaches.
>
> Ok that would be important. I also think that the approach may be used
> in kernel to reduce the overhead of CONFIG_PREEMPT and also to implement
> fast versions of this_cpu_ops for non x86 architectures and maybe even
There is nothing stopping people from trying this in-kernel, in fact
that would be lots easier as we do not have to commit to any one
specific ABI for that.
Also, I don't think we need a schedule check for the in-kernel usage,
pure interrupt should be good enough, nobody should (want to) call
schedule() while inside such a critical section, which leaves us with
involuntary preemption, and those are purely interrupt driven.
Now the 'problem' is finding these special regions fast; the easy
solution is the same as the one proposed for userspace: one big
section. That way the interrupt only has to check whether the IP is
inside this section, which is minimal effort.
The downside is that all percpu ops would then end up being full
function calls. Which on some archs is indeed faster than disabling
interrupts, but not by much, I'm afraid.
> optimize the x86 variants if interrupts also can detect critical sections
> and restart at defined points.
I really don't see how we can beat %GS prefixes with any such scheme.
On Mon, 13 Jul 2015, Peter Zijlstra wrote:
> Now the 'problem' is finding these special regions fast, the easy
> solution is the same as the one proposed for userspace, one big section.
> That way the interrupt only has to check if the IP is inside this
> section which is minimal effort.
>
> The down side is that all percpu ops would then end up being full
> function calls. Which on some archs is indeed faster than disabling
> interrupts, but not by much I'm afraid.
Well, one could move the entire functions that use these ops into the
special sections. That is certainly an area requiring much more thought.
> > optimize the x86 variants if interrupts also can detect critical sections
> > and restart at defined points.
>
> I really don't see how we can beat %GS prefixes with any such scheme.
We may be able to avoid RMW sequences, which would allow the processor
to schedule operations better.
On Mon, Jul 13, 2015 at 7:01 AM, Christoph Lameter <[email protected]> wrote:
> On Mon, 13 Jul 2015, Peter Zijlstra wrote:
>
>> Now the 'problem' is finding these special regions fast, the easy
>> solution is the same as the one proposed for userspace, one big section.
>> That way the interrupt only has to check if the IP is inside this
>> section which is minimal effort.
>>
>> The down side is that all percpu ops would then end up being full
>> function calls. Which on some archs is indeed faster than disabling
>> interrupts, but not by much I'm afraid.
>
> Well one could move the entire functions that are using these ops into the
> special sections. That is certainly an area requiring much more thought.
Hmm.
>
>> > optimize the x86 variants if interrupts also can detect critical sections
>> > and restart at defined points.
>>
>> I really don't see how we can beat %GS prefixes with any such scheme.
>
> We may be able to avoid RMW sequences, which would allow the processor
> to schedule operations better.
True, but cmpxchg is, surprisingly, pretty fast.
Crazy thought: At the risk of proposing something ridiculous, what if
we had per-cpu memory mappings? We could do this at the cost of up to
2kB of memcpy whenever we switch mms. Expensive but maybe not a
showstopper.
--Andy
On Tue, 14 Jul 2015, Andy Lutomirski wrote:
> Crazy thought: At the risk of proposing something ridiculous, what if
> we had per-cpu memory mappings? We could do this at the cost of up to
> 2kB of memcpy whenever we switch mms. Expensive but maybe not a
> showstopper.
This is not crazy and has actually been done before. Itanium has it,
and it's doable there since TLB insertion can be handled in software.
The problem on x86 is that one would need a separate page table for
each processor for each task. To my knowledge there is no way to
handle TLB faults in software.
On Fri, Jul 10, 2015 at 3:09 AM, Chris Mason <[email protected]> wrote:
>
> We've started experimenting with these to cut overheads in a few
> critical places, and while we don't have numbers yet I really hope it
> won't take too long.
>
> I think the topic is really interesting and we'll be able to get numbers
> from production workloads to help justify and compare different
> approaches.
>
I've been interested in the idea since Paul (paulmck) and Mathieu
introduced it to me at Kernel Summit 2013. I didn't expect it to be
re-posted on LKML so late. IMHO, the direction is useful and helpful,
not just fun; I hope we can make some progress on it.
Thanks
Lai
> -chris
> _______________________________________________
> Ksummit-discuss mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
On Mon, Jul 13, 2015 at 5:57 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 10, 2015 at 12:26:21PM -0500, Christoph Lameter wrote:
>> On Thu, 9 Jul 2015, Chris Mason wrote:
>>
>> > I think the topic is really interesting and we'll be able to get numbers
>> > from production workloads to help justify and compare different
>> > approaches.
>>
>> Ok that would be important. I also think that the approach may be used
>> in kernel to reduce the overhead of CONFIG_PREEMPT and also to implement
>> fast versions of this_cpu_ops for non x86 architectures and maybe even
>
>
> Also, I don't think we need a schedule check for the in-kernel usage,
> pure interrupt should be good enough, nobody should (want to) call
> schedule() while inside such a critical section, which leaves us with
> involuntary preemption, and those are purely interrupt driven.
>
> Now the 'problem' is finding these special regions fast, the easy
> solution is the same as the one proposed for userspace, one big section.
> That way the interrupt only has to check if the IP is inside this
> section which is minimal effort.
>
> The down side is that all percpu ops would then end up being full
> function calls. Which on some archs is indeed faster than disabling
> interrupts, but not by much I'm afraid.
Another downside is that percpu ops can't call any function outside
the section. Otherwise we would fail to detect that we are in a special
region, or detection would become much harder.
If we disallow percpu ops from calling any function, I think we can
insert some special instructions into the generated code along with
an entry in a table (like the exception table for copy_to_user()).
The interrupt then only has to check the special instructions
near the IP and confirm them with a table lookup.
>
>> optimize the x86 variants if interrupts also can detect critical sections
>> and restart at defined points.
>
> I really don't see how we can beat %GS prefixes with any such scheme.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Mon, Jul 13, 2015 at 5:57 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 10, 2015 at 12:26:21PM -0500, Christoph Lameter wrote:
>> On Thu, 9 Jul 2015, Chris Mason wrote:
>>
>> > I think the topic is really interesting and we'll be able to get numbers
>> > from production workloads to help justify and compare different
>> > approaches.
>>
>> Ok that would be important. I also think that the approach may be used
>> in kernel to reduce the overhead of CONFIG_PREEMPT and also to implement
>> fast versions of this_cpu_ops for non x86 architectures and maybe even
>
> There is nothing stopping people from trying this in-kernel, in fact
> that would be lots easier as we do not have to commit to any one
> specific ABI for that.
It would also give us a nicer way to cope with NMIs and to modify a
slightly bigger struct irq-safely if we have it in-kernel.
>
> Also, I don't think we need a schedule check for the in-kernel usage,
> pure interrupt should be good enough, nobody should (want to) call
> schedule() while inside such a critical section, which leaves us with
> involuntary preemption, and those are purely interrupt driven.
>
> Now the 'problem' is finding these special regions fast, the easy
> solution is the same as the one proposed for userspace, one big section.
> That way the interrupt only has to check if the IP is inside this
> section which is minimal effort.
>
> The down side is that all percpu ops would then end up being full
> function calls. Which on some archs is indeed faster than disabling
> interrupts, but not by much I'm afraid.
>
>> optimize the x86 variants if interrupts also can detect critical sections
>> and restart at defined points.
>
> I really don't see how we can beat %GS prefixes with any such scheme.