MIME-Version: 1.0
In-Reply-To: <511681386.5786.1432235282365.JavaMail.zimbra@efficios.com>
References: <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com>
 <20150521183240.GW3644@twins.programming.kicks-ass.net> <511681386.5786.1432235282365.JavaMail.zimbra@efficios.com>
From: Paul Turner <pjt@google.com>
Date: Thu, 21 May 2015 12:31:33 -0700
Message-ID: <CAPM31R+6naYkWw368-6-3pTAz+y7kev9vR9tWJS2DRjfZ+NbQw@mail.gmail.com>
Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>, Andrew Hunter <ahh@google.com>,
        Ben Maurer <bmaurer@fb.com>, LKML <linux-kernel@vger.kernel.org>,
        Ingo Molnar <mingo@redhat.com>, Steven Rostedt <rostedt@goodmis.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Josh Triplett <josh@joshtriplett.org>,
        Lai Jiangshan <laijs@cn.fujitsu.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4872
Lines: 142

On Thu, May 21, 2015 at 12:08 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- Original Message -----
>> On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote:
>>
>> > +struct thread_percpu_user {
>> > +   int32_t nesting;
>> > +   int32_t signal_sent;
>> > +   int32_t signo;
>> > +   int32_t current_cpu;
>> > +};
>>
>> I would require this thing be naturally aligned, such that it does not
>> cross cacheline boundaries.
>
> Good point. Adding a comment into the code to that effect.
>
>>
>> > +
>> > +static void percpu_user_sched_in(struct preempt_notifier *notifier, int
>> > cpu)
>> > +{
>> > +   struct thread_percpu_user __user *tpu_user;
>> > +   struct thread_percpu_user tpu;
>> > +   struct task_struct *t = current;
>> > +
>> > +   tpu_user = t->percpu_user;
>> > +   if (tpu_user == NULL)
>> > +           return;
>> > +   if (unlikely(t->flags & PF_EXITING))
>> > +           return;
>> > +   /*
>> > +    * access_ok() of tpu_user has already been checked by sys_percpu().
>> > +    */
>> > +   if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
>> > +           WARN_ON_ONCE(1);
>> > +           return;
>> > +   }
>>
>> This seems a waste; you already read the number unconditionally, might
>> as well double check and avoid the store.
>>
>> > +   if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
>> > +           WARN_ON_ONCE(1);
>> > +           return;
>> > +   }
>>
>>       if (tpu.current_cpu != smp_processor_id())
>>               __put_user();
>
> Yep, and I could even use the "cpu" parameter received by the
> function rather than smp_processor_id().
>
>>
>>
>>
>> > +   if (!tpu.nesting || tpu.signal_sent)
>> > +           return;
>> > +   if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
>> > +           WARN_ON_ONCE(1);
>> > +           return;
>> > +   }
>> > +   tpu.signal_sent = 1;
>> > +   if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
>> > +           WARN_ON_ONCE(1);
>> > +           return;
>> > +   }
>> > +}
>>
>> Please do not use preempt notifiers for this.
>
> Do you recommend we issue a function call from the scheduler
> finish_task_switch() ?
>
>>
>> Second, this all is done with preemption disabled, this means that all
>> that user access can fail.
>
> OK, this is one part I was worried about.
>
>>
>> You print useless WARNs and misbehave. If you detect a desire to fault,
>> you could delay until return to userspace and try again there. But it
>> all adds complexity.
>
> We could keep a flag, and then call the function again if we detect a
> desire to fault.
>
>>
>> The big advantage pjt's scheme had is that we have the instruction
>> pointer, we do not need to go read userspace memory that might not be
>> there. And it being limited  to a single range, while inconvenient,
>> simplifies the entire kernel side to:
>>
>>       if ((unsigned long)(ip - offset) < size)
>>               do_magic();
>>
>> Which is still simpler than the above.

Yeah, we've thought about this part of the API in some depth.  There
are obvious pros and cons.

The advantage of this approach is that:
  a) All of the demuxing can be done in userspace, the kernel check is
as you say, wholly trivial.
  B) We do not need to do anything complicated, e.g. signal delivery

The largest drawback is that it does require centralization of your
percpu regions.  It's been our experience that most features are best
implemented using a small library of operations rather than
long-form-atomic-sequences.  This is a notable challenge for binaries
with shared linkage.

>
> There is one big aspect of pjt's approach that I still don't grasp
> after all this time that makes me worry. How does it interact with
> the following scenario ?
>
> Userspace thread
>   - within the code region that needs to be restarted
>     - signal handler nested on top
>       - running within the signal handler code
>         - preempted by kernel
>           - checking instruction pointer misses the userspace stack
>             underneath the signal handler.
>
> Given this scenario, is the kernel code really as simple as a pointer check
> on pt_regs, or do we need a stack walk over all signal frames ? Another way
> would be to check for the pt_regs instruction pointer whenever we receive
> a signal, but then it would require per-architectures modifications, and
> suddenly becomes less straightforward.
>

This is actually straightforward to handle:
On signal delivery, as you are setting up the signal stack, you modify
the interrupted state so that the signal handler will return into the
restart section rather than the original code.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/