Date: Sat, 2 May 2015 07:27:33 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Rik van Riel <riel@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, williams@redhat.com,
        Andrew Lutomirski <luto@kernel.org>, fweisbec@redhat.com,
        Peter Zijlstra <peterz@infradead.org>,
        Heiko Carstens <heiko.carstens@de.ibm.com>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        "Paul E. McKenney" <paulmck@us.ibm.com>,
        Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
 & enable from context tracking on syscall entry
Message-ID: <20150502052733.GA9983@gmail.com>
References: <554399D1.6010405@redhat.com>
 <20150501155912.GA451@gmail.com>
 <CALCETrVZf11EYLhKWOfeQSyzq9eq5KB+btcY19JF+sJvs2zMXA@mail.gmail.com>
 <20150501162109.GA1091@gmail.com>
 <5543A94B.3020108@redhat.com>
 <20150501163431.GB1327@gmail.com>
 <5543C05E.9040209@redhat.com>
 <20150501184025.GA2114@gmail.com>
 <5543CFE5.1030509@redhat.com>
 <CALCETrX2Di19atf4+Nx5cCOxbqoFdx+USLJ-WRSBy_Se25RE-g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CALCETrX2Di19atf4+Nx5cCOxbqoFdx+USLJ-WRSBy_Se25RE-g@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4213
Lines: 110


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> >> Or we could do that in the syscall path with a single store of a
> >> constant flag to a location in the task struct. We have a number of
> >> natural flags that get written on syscall entry, such as:
> >>
> >>         pushq_cfi $__USER_DS                    /* pt_regs->ss */
> >>
> >> That goes to a constant location on the kernel stack. On return from
> >> system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

But yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq-lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
    allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
    peak at rq->curr's pt_regs::flag, and if:

     - the flag is 0 then it has observed a quiescent state.

     - the flag is 1, then it would set TIF_NOHZ and wait for a 
       completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems were all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is 
SCHED_FIFO that only rarely context switches.

How does this sound?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/