MIME-Version: 1.0
In-Reply-To: <20160407201156.GC3448@twins.programming.kicks-ass.net>
References: <20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com>
 <CALCETrW=3bZyC9d5tUoESEsNt-rc-uhNhZpgEgeSC8W4FAVYkg@mail.gmail.com>
 <20160407120254.GY3448@twins.programming.kicks-ass.net> <CALCETrV0vcYcnBrs0axykJD=_BM28wKWVMG6bMzK8zh8R3m5fg@mail.gmail.com>
 <20160407152432.GZ3448@twins.programming.kicks-ass.net> <CALCETrU5ZL6Jajc=9up-j86vY_Xtt-gTFjdQE0sB0d=d-CJZ6A@mail.gmail.com>
 <20160407155312.GA3448@twins.programming.kicks-ass.net> <CALCETrVGo1Di3qamxx1NAFUSN_o=-HnYRDpeVp7zrQEBwe5u-g@mail.gmail.com>
 <20160407201156.GC3448@twins.programming.kicks-ass.net>
From: Andy Lutomirski <luto@amacapital.net>
Date: Thu, 7 Apr 2016 15:05:26 -0700
Message-ID: <CALCETrXVReuuGGKW6EOV7tFFaK9RbwWxYvKdpUdvU=MpDaOtsQ@mail.gmail.com>
Subject: Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu
 critical sections
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Ingo Molnar <mingo@redhat.com>, Paul Turner <commonly@gmail.com>,
        Andi Kleen <andi@firstfloor.org>, Chris Lameter <cl@linux.com>,
        Dave Watson <davejwatson@fb.com>,
        Josh Triplett <josh@joshtriplett.org>,
        Linux API <linux-api@vger.kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Andrew Hunter <ahh@google.com>,
        Linus Torvalds <torvalds@linux-foundation.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3615
Lines: 113

On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> More concretely, this looks like (using totally arbitrary register
>> assignments -- probably far from ideal, especially given how GCC's
>> constraints work):
>>
>> enter the critical section:
>> 1:
>> movq %[cpu], %%r12
>> movq {address of counter for our cpu}, %%r13
>> movq {some fresh value}, (%%r13)
>> cmpq %[cpu], %%r12
>> jne 1b
>>
>> ... do whatever setup or computation is needed...
>>
>> movq $%l[failed], %%rcx
>> movq $1f, %[commit_instr]
>> cmpq {whatever counter we chose}, (%%r13)
>> jne %l[failed]
>> cmpq %[cpu], %%r12
>> jne %l[failed]
>>
>> <-- a signal in here that conflicts with us would clobber (%%r13), and
>> the kernel would notice and send us to the failed label
>>
>> movq %[to_write], (%[target])
>> 1: movq $0, %[commit_instr]
>
> And the kernel, for every thread that has had the syscall called and a
> thingy registered, needs to (at preempt/signal-setup):
>
>         if (get_user(post_commit_ip, current->post_commit_ip))
>                 return -EFAULT;
>
>         if (likely(!post_commit_ip))
>                 return 0;
>
>         if (regs->ip >= post_commit_ip)
>                 return 0;
>
>         if (get_user(seq, (u32 __user *)regs->r13))
>                 return -EFAULT;
>
>         if (regs->$(which one holds our chosen seq?) == seq) {
>                 /* nothing changed, do not cancel, proceed to commit. */
>                 return 0;

Only return zero if regs->$(which one holds the cpu) == smp_processor_id().

>         }
>
>         if (put_user(0UL, current->post_commit_ip))
>                 return -EFAULT;
>
>         regs->ip = regs->rcx;

I was imagining this happening at (return to userspace or preempt) and
possibly at signal return, but yes, more or less.
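
As a rough sketch of the check with the CPU comparison folded in, here
is a userspace model rather than kernel code.  The struct, its field
names and both functions are made up for illustration; the
get_user/put_user error handling and the clearing of post_commit_ip on
abort are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state the kernel would look at.
 * None of these names come from the patch set. */
struct rseq_snapshot {
	uint64_t ip;             /* interrupted user instruction pointer */
	uint64_t post_commit_ip; /* value read from the registered slot, 0 if idle */
	uint64_t fail_ip;        /* register holding the abort label (rcx above) */
	uint32_t seq_in_reg;     /* register holding the seq value userspace chose */
	uint32_t seq_in_mem;     /* current value of the counter r13 points at */
	int cpu_in_reg;          /* register holding the cpu userspace saw (r12 above) */
	int cpu_now;             /* what smp_processor_id() would return */
};

/* True if the thread has to be redirected to its failure label. */
static bool rseq_must_abort(const struct rseq_snapshot *s)
{
	if (s->post_commit_ip == 0)
		return false;	/* not inside a finish block */

	if (s->ip >= s->post_commit_ip)
		return false;	/* the commit instruction already ran */

	/* Proceed to commit only if both the seq and the cpu still match. */
	if (s->seq_in_reg == s->seq_in_mem && s->cpu_in_reg == s->cpu_now)
		return false;

	return true;
}

/* Where userspace resumes: the failure label on abort, else unchanged. */
static uint64_t rseq_resume_ip(const struct rseq_snapshot *s)
{
	return rseq_must_abort(s) ? s->fail_ip : s->ip;
}
```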

>
>
>> In contrast to Paul's scheme, this has two additional (highly
>> predictable) branches and requires generation of a seqcount in
>> userspace.  In its favor, though, it doesn't need preemption hooks,
>
> Without preemption hooks, how would one thread preempting another at the
> above <-- clobber anything and cause the commit to fail?

It doesn't, which is what I like about my variant.  If the preempting
thread accesses the protected data structure, though, it should bump
the sequence count, which will cause the first thread to abort when it
gets scheduled back in.
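
A single-threaded model of that interaction, purely for illustration:
slot_begin() and slot_commit() are hypothetical names, and in the real
scheme the gap between the final compare and the store is covered by
the kernel fixup, not by anything in this code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cpu slot: the sequence counter plus the data it guards. */
struct percpu_slot {
	uint64_t seq;
	uint64_t value;
};

/* Enter a critical section by publishing a fresh sequence value. */
static uint64_t slot_begin(struct percpu_slot *slot, uint64_t fresh)
{
	slot->seq = fresh;
	return fresh;
}

/* Commit only if nobody else has entered (and bumped seq) since we began. */
static bool slot_commit(struct percpu_slot *slot, uint64_t my_seq,
			uint64_t to_write)
{
	if (slot->seq != my_seq)
		return false;	/* someone touched the slot: abort */

	slot->value = to_write;
	return true;
}
```

If thread A begins with seq 1, is preempted, and thread B then begins
with seq 2 on the same slot, A's later commit with seq 1 fails while
B's commit with seq 2 goes through.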

>
>> it's inherently debuggable,
>
> It is more debuggable, agreed.
>
>> and it allows multiple independent
>> rseq-protected things to coexist without forcing each other to abort.
>
> And the kernel only needs to load the second cacheline if it lands in
> the middle of a finish block, which should be manageable overhead I
> suppose.
>
> But the userspace chunk is lots slower as it needs to always touch
> multiple lines, since the @cpu, @seq and @post_commit_ip all live in
> separate lines (although I suppose @cpu and @post_commit_ip could live
> in the same).
>
> The finish thing needs 3 registers for:
>
>  - fail ip
>  - seq pointer
>  - seq value
>
> Which I suppose is possible even on register constrained architectures
> like i386.

I think this can all be munged into two cachelines:

One cacheline contains the per-thread CPU number and post_commit_ip
(either by doing it over Linus' dead body or by having userspace
allocate it carefully).  The other contains the sequence counter *and*
the percpu data structure that's protected.  So in some sense it's the
same number of cache lines as Paul's version.
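
One way the two lines could be laid out, as a sketch only.  It assumes
64-byte cachelines and userspace allocating both areas itself; the
struct names, field names and payload size are all hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumption: 64-byte lines */

/* Line 1: per-thread, holds the cpu number and post_commit_ip. */
struct rseq_thread_area {
	alignas(CACHELINE) int32_t cpu;
	uint64_t post_commit_ip;
};

/* Line 2: per-cpu, holds the sequence counter and the protected data. */
struct rseq_percpu_area {
	alignas(CACHELINE) uint64_t seq;
	uint64_t data[7];	/* whatever the critical section protects */
};

static_assert(sizeof(struct rseq_thread_area) == CACHELINE,
	      "thread area must fit in one cacheline");
static_assert(sizeof(struct rseq_percpu_area) == CACHELINE,
	      "percpu area must fit in one cacheline");
```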

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC