This could be part of the unknown 2% performance regression seen with
the db transaction processing benchmark.

The four functions in the following patch used to be inline. They
have been un-inlined since 2.6.7.

We measured that re-inlining them on 2.6.9 improves performance for
the db transaction processing benchmark by +0.2% (on real hardware :-)

The cost is a larger kernel: 928 bytes of text on x86 and 2728 bytes
on ia64. But that is certainly worth it for enterprise customers,
since it improves performance on an enterprise workload.

# size vmlinux.*
   text    data     bss     dec    hex filename
3261844  717184  262020 4241048 40b698 vmlinux.x86.orig
3262772  717488  262020 4242280 40bb68 vmlinux.x86.inline

   text    data     bss     dec    hex filename
5836933  903828  201940 6942701 69efed vmlinux.ia64.orig
5839661  903460  201940 6945061 69f925 vmlinux.ia64.inline

Would it be possible to re-introduce them?
Signed-off-by: Ken Chen <[email protected]>
--- linux-2.6.11/kernel/sched.c.orig 2005-03-10 15:31:10.000000000 -0800
+++ linux-2.6.11/kernel/sched.c 2005-03-10 15:36:32.000000000 -0800
@@ -164,7 +164,7 @@
#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)
-static unsigned int task_timeslice(task_t *p)
+static inline unsigned int task_timeslice(task_t *p)
{
if (p->static_prio < NICE_TO_PRIO(0))
return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
@@ -302,7 +302,7 @@ static DEFINE_PER_CPU(struct runqueue, r
* interrupts. Note the ordering: we can safely lookup the task_rq without
* explicitly disabling preemption.
*/
-static runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
+static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
__acquires(rq->lock)
{
struct runqueue *rq;
@@ -426,7 +426,7 @@ struct file_operations proc_schedstat_op
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
-static runqueue_t *this_rq_lock(void)
+static inline runqueue_t *this_rq_lock(void)
__acquires(rq->lock)
{
runqueue_t *rq;
@@ -1323,7 +1323,7 @@ void fastcall sched_exit(task_t * p)
* with the lock held can cause deadlocks; see schedule() for
* details.)
*/
-static void finish_task_switch(task_t *prev)
+static inline void finish_task_switch(task_t *prev)
__releases(rq->lock)
{
runqueue_t *rq = this_rq();
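
For context on the call pattern: the typical use of task_rq_lock()/
task_rq_unlock() in the scheduler hot paths looks roughly like the
sketch below. The function name example_touch_task is made up for
illustration only; see the real call sites such as try_to_wake_up()
in kernel/sched.c.

/*
 * Sketch of the usual task_rq_lock()/task_rq_unlock() pattern
 * (illustration only, not a verbatim call site):
 */
static void example_touch_task(task_t *p)
{
	unsigned long flags;
	runqueue_t *rq;

	rq = task_rq_lock(p, &flags);	/* irqs disabled, rq->lock held */
	/* ... operate on p and its runqueue ... */
	task_rq_unlock(rq, &flags);	/* drop rq->lock, restore irqs */
}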
"Chen, Kenneth W" <[email protected]> wrote:
>
> This could be part of the unknown 2% performance regression seen with
> the db transaction processing benchmark.
>
> The four functions in the following patch used to be inline. They
> have been un-inlined since 2.6.7.
>
> We measured that re-inlining them on 2.6.9 improves performance for
> the db transaction processing benchmark by +0.2% (on real hardware :-)
>
> The cost is a larger kernel: 928 bytes of text on x86 and 2728 bytes
> on ia64. But that is certainly worth it for enterprise customers,
> since it improves performance on an enterprise workload.
Less than 1k on x86 versus >2k on ia64. No wonder those things have such
big caches ;)
> ...
> Would it be possible to re-introduce them?
OK by me.
Andrew Morton wrote:
>"Chen, Kenneth W" <[email protected]> wrote:
>
>>This could be part of the unknown 2% performance regression seen with
>>the db transaction processing benchmark.
>>
>>The four functions in the following patch used to be inline. They
>>have been un-inlined since 2.6.7.
>>
>>We measured that re-inlining them on 2.6.9 improves performance for
>>the db transaction processing benchmark by +0.2% (on real hardware :-)
>>
>>
Can you also inline requeue_task? No performance gain expected, but
it is just a simple wrapper around a list function.
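For reference, requeue_task() in kernel/sched.c is roughly the wrapper
below (a sketch from memory of 2.6.11, the exact body may differ
slightly); the suggested change would just add the inline keyword:

static inline void requeue_task(struct task_struct *p, prio_array_t *array)
{
	/* move the task to the tail of the queue for its priority */
	list_move_tail(&p->run_list, array->queue + p->prio);
}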
>>The cost is a larger kernel: 928 bytes of text on x86 and 2728 bytes
>>on ia64. But that is certainly worth it for enterprise customers,
>>since it improves performance on an enterprise workload.
>>
>
>Less than 1k on x86 versus >2k on ia64. No wonder those things have such
>big caches ;)
>
>
>>...
>>Would it be possible to re-introduce them?
>>
>
>OK by me.
>
>
What happens if you leave task_timeslice out of line? It isn't exactly
huge, but it is called from a handful of places.
* Chen, Kenneth W <[email protected]> wrote:
> # size vmlinux.*
>    text    data     bss     dec    hex filename
> 3261844  717184  262020 4241048 40b698 vmlinux.x86.orig
> 3262772  717488  262020 4242280 40bb68 vmlinux.x86.inline
> Would it be possible to re-introduce them?
> -static unsigned int task_timeslice(task_t *p)
> +static inline unsigned int task_timeslice(task_t *p)
the patch looks good except this one - could you try to undo it and
re-measure? task_timeslice() is not used in any true fastpath, if it
makes any difference then the performance difference must be some other
artifact.
Ingo
Ingo Molnar wrote on Friday, March 11, 2005 1:32 AM
> > -static unsigned int task_timeslice(task_t *p)
> > +static inline unsigned int task_timeslice(task_t *p)
>
> the patch looks good except this one - could you try to undo it and
> re-measure? task_timeslice() is not used in any true fastpath, if it
> makes any difference then the performance difference must be some other
> artifact.
OK, I'll re-measure. Yeah, I agree that this function is not in the fastpath.
Ingo Molnar wrote on Friday, March 11, 2005 1:32 AM
> > -static unsigned int task_timeslice(task_t *p)
> > +static inline unsigned int task_timeslice(task_t *p)
>
> the patch looks good except this one - could you try to undo it and
> re-measure? task_timeslice() is not used in any true fastpath, if it
> makes any difference then the performance difference must be some other
> artifact.
Chen, Kenneth W wrote on Friday, March 11, 2005 10:40 AM
> OK, I'll re-measure. Yeah, I agree that this function is not in the fastpath.
Ingo is right. I re-measured on our benchmark setup and did not see any
difference whether task_timeslice is inlined or not. So if people want
to take the inline keyword out of that function, we won't complain :-)
* Chen, Kenneth W <[email protected]> wrote:
> Ingo Molnar wrote on Friday, March 11, 2005 1:32 AM
> > > -static unsigned int task_timeslice(task_t *p)
> > > +static inline unsigned int task_timeslice(task_t *p)
> >
> > the patch looks good except this one - could you try to undo it and
> > re-measure? task_timeslice() is not used in any true fastpath, if it
> > makes any difference then the performance difference must be some other
> > artifact.
>
> Chen, Kenneth W wrote on Friday, March 11, 2005 10:40 AM
> > OK, I'll re-measure. Yeah, I agree that this function is not in the fastpath.
>
> Ingo is right. I re-measured on our benchmark setup and did not see
> any difference whether task_timeslice is inlined or not. So if people
> want to take the inline keyword out of that function, we won't
> complain :-)
uninline task_timeslice() - reduces code footprint noticeably, and it's
slowpath code.

Signed-off-by: Ingo Molnar <[email protected]>
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -166,7 +166,7 @@
#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)
-static inline unsigned int task_timeslice(task_t *p)
+static unsigned int task_timeslice(task_t *p)
{
if (p->static_prio < NICE_TO_PRIO(0))
return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
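
As a footnote for readers following the SCALE_PRIO() math above, here is
a small userspace sketch of the computation. The constants are assumed
2.6.11 defaults for HZ=1000 and are for illustration only; check
kernel/sched.c for the real values.

/*
 * Standalone illustration of the task_timeslice()/SCALE_PRIO() math;
 * the constants below are assumptions for HZ=1000, not authoritative.
 */
#include <stdio.h>

#define MAX_RT_PRIO	100
#define MAX_PRIO	140
#define MAX_USER_PRIO	40
#define NICE_TO_PRIO(n)	(MAX_RT_PRIO + (n) + 20)
#define DEF_TIMESLICE	100	/* ticks, assuming HZ=1000 */
#define MIN_TIMESLICE	5	/* ticks, assuming HZ=1000 */
#define max(a, b)	((a) > (b) ? (a) : (b))

#define SCALE_PRIO(x, prio) \
	max((x) * (MAX_PRIO - (prio)) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

static unsigned int timeslice(int static_prio)
{
	if (static_prio < NICE_TO_PRIO(0))
		return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
	return SCALE_PRIO(DEF_TIMESLICE, static_prio);
}

int main(void)
{
	int nice;

	/* nice -20 -> 800 ticks, nice 0 -> 100 ticks, nice 19 -> 5 ticks */
	for (nice = -20; nice <= 19; nice++)
		printf("nice %3d -> %3u ticks\n", nice,
		       timeslice(NICE_TO_PRIO(nice)));
	return 0;
}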