Date: Fri, 1 May 2015 17:59:12 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, williams@redhat.com,
        luto@kernel.org, fweisbec@redhat.com, peterz@infradead.org,
        heiko.carstens@de.ibm.com, tglx@linutronix.de,
        Ingo Molnar <mingo@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
 & enable from context tracking on syscall entry
Message-ID: <20150501155912.GA451@gmail.com>
References: <1430429035-25563-1-git-send-email-riel@redhat.com>
 <1430429035-25563-4-git-send-email-riel@redhat.com>
 <20150501064044.GA18957@gmail.com>
 <554399D1.6010405@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <554399D1.6010405@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)


* Rik van Riel <riel@redhat.com> wrote:

> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without 
> nohz_full, just a CPU without nohz_full running the same kernel I 
> tested with yesterday:
> 
>  		run time	system time
> vanilla		5.49s		2.08s
> __acct patch	5.21s		1.92s
> both patches	4.88s		1.71s
> CPU w/o nohz	3.12s		1.63s    <-- your numbers, mostly
> 
> What is even more interesting is that the majority of the time 
> difference seems to come from _user_ time, which has gone down from 
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on 
> the CPU without nohz_full enabled...
> 
> At syscall entry time, the nohz_full context tracking code is very 
> straightforward. We check thread_info->flags & 
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which 
> handles USER -> KERNEL context transition.
> 
> Syscall exit time is a convoluted mess. Both do_notify_resume and 
> syscall_trace_leave call user_exit() on entry and user_enter() on 
> exit, leaving the time spent looping around between int_with_check 
> and syscall_return: in entry_64.S accounted as user time.
> 
> I sent an email about this last night; it may be useful to add a 
> third test & function call point to the syscall return code, where 
> we can call user_enter() just ONCE, and remove the other context 
> tracking calls from that loop.
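
[ To make the exit-path problem quoted above concrete, here is a rough 
  user-space model in C. It is only a sketch: struct ctx_model and the 
  model_*() helpers are made-up names, not kernel code, and the model 
  assumes the work loop runs at least once. It counts context-tracking 
  state changes when every work handler brackets itself with 
  user_exit()/user_enter()-style calls, versus a single 
  user_enter()-style call at the very end. ]

```c
/* Toy model, not kernel code: every name below is hypothetical. */
enum ctx_state { CTX_KERNEL, CTX_USER };

struct ctx_model {
	enum ctx_state state;
	unsigned int transitions;	/* number of state changes so far */
};

/* Stand-in for user_enter(): flip to USER if not already there. */
static void model_user_enter(struct ctx_model *m)
{
	if (m->state != CTX_USER) {
		m->state = CTX_USER;
		m->transitions++;
	}
}

/* Stand-in for user_exit(): flip to KERNEL if not already there. */
static void model_user_exit(struct ctx_model *m)
{
	if (m->state != CTX_KERNEL) {
		m->state = CTX_KERNEL;
		m->transitions++;
	}
}

/*
 * Each pass through the exit work loop brackets its handler with
 * exit/enter calls, so the gaps between handlers count as "user".
 */
static unsigned int exit_path_per_handler(unsigned int nr_work)
{
	struct ctx_model m = { CTX_KERNEL, 0 };
	unsigned int i;

	for (i = 0; i < nr_work; i++) {
		model_user_exit(&m);
		/* ... handler body would run here ... */
		model_user_enter(&m);
	}
	return m.transitions;
}

/* Proposed shape: run all handlers in KERNEL state, enter USER once. */
static unsigned int exit_path_single_enter(unsigned int nr_work)
{
	struct ctx_model m = { CTX_KERNEL, 0 };
	unsigned int i;

	for (i = 0; i < nr_work; i++) {
		/* ... handler body would run here ... */
	}
	model_user_enter(&m);
	return m.transitions;
}
```

With three passes through the work loop the first variant flips state 
five times and the second one flips it once. That is just the model's 
arithmetic, not a measurement of the real entry code.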

So what I'm wondering about is the big picture:

 - This is crazy big overhead in something as fundamental as system
   calls!

 - We don't even have the excuse of the syscall auditing code, which
   kind of has to run for every syscall if it wants to do its job!

 - [ The 'precise vtime' stuff that is driven from syscall entry/exit 
     is crazy, and I hope not enabled in any distro. ]

 - So why are we doing this at every syscall at all?

Basically the whole point of user-context tracking is to be able to 
flush pending RCU callbacks. But that's crazy: we can surely defer a 
few kfree()s on this CPU, even indefinitely!

If some other CPU does a synchronize_rcu(), then it can very well 
pluck those callbacks from this super low latency CPU's RCU lists 
(with due care) and go and free stuff itself ... There's no need to 
disturb this CPU for that!
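
[ A minimal user-space sketch of the 'pluck the callbacks remotely' 
  idea, using C11 atomics. Everything here is an assumption made for 
  illustration: the toy_*() names are invented, this is not how the 
  RCU core is structured, and the grace-period handling (the 'due 
  care' part) is left out entirely. It only shows that a remote CPU 
  can detach a whole pending list in one atomic step without 
  interrupting the owner. ]

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy user-space model; all names are hypothetical, none are kernel APIs. */
struct toy_cb {
	struct toy_cb *next;
	void (*func)(struct toy_cb *cb);
};

struct toy_cpu {
	_Atomic(struct toy_cb *) pending;	/* callbacks queued by this CPU */
};

static unsigned int toy_freed;	/* counts callback invocations */

/* Example callback, standing in for a deferred kfree(). */
static void toy_count_cb(struct toy_cb *cb)
{
	(void)cb;
	toy_freed++;
}

/* Owner CPU: push a callback onto its own list without taking a lock. */
static void toy_queue_cb(struct toy_cpu *cpu, struct toy_cb *cb)
{
	struct toy_cb *old = atomic_load(&cpu->pending);

	do {
		cb->next = old;
	} while (!atomic_compare_exchange_weak(&cpu->pending, &old, cb));
}

/*
 * Remote CPU: detach the whole list in one atomic exchange and run the
 * callbacks itself. Returns the number of callbacks invoked.
 *
 * Assumption: the caller has already waited out the grace period.
 * Note that this simple list runs callbacks newest-first.
 */
static unsigned int toy_steal_and_run(struct toy_cpu *victim)
{
	struct toy_cb *cb = atomic_exchange(&victim->pending, NULL);
	unsigned int nr = 0;

	while (cb) {
		struct toy_cb *next = cb->next;	/* func may free cb */

		cb->func(cb);
		cb = next;
		nr++;
	}
	return nr;
}
```

The owner only ever touches its list when it queues something, and 
the remote side never needs the owner to run any code. A second 
toy_steal_and_run() on an already emptied list simply finds nothing.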

If user-space does not do anything kernel-ish then there won't be any 
new RCU callbacks piled up, so it's not like it's a resource leak 
issue either.

So what's the point? Why not remove this big source of overhead 
altogether?

Thanks,

	Ingo