2010-11-29 08:45:16

by Yong Zhang

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
<[email protected]> wrote:
> Commit-ID:  305e6835e05513406fa12820e40e4a8ecb63743c
> Gitweb:     http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
> Author:     Venkatesh Pallipadi <[email protected]>
> AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
> Committer:  Ingo Molnar <[email protected]>
> CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
>
> sched: Do not account irq time to current task
>
> Scheduler accounts both softirq and interrupt processing times to the
> currently running task. This means, if the interrupt processing was
> for some other task in the system, then the current task ends up being
> penalized as it gets shorter runtime than otherwise.
>
> Change sched task accounting to acoount only actual task time from
> currently running task. Now update_curr(), modifies the delta_exec to
> depend on rq->clock_task.
>
> Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
> extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
> for later.
>
> This change will impact scheduling behavior in interrupt heavy conditions.
>
> Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
> task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
> spending 75%+ of its time in irq processing. CPU 3 spending around 35%
> time running nc task.
>
> Now, if I run another CPU intensive task on CPU 2, without this change
> /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
> change, it rightly shows less than 25% accounted to this task as remaining
> time is actually spent on irq processing.
>
> Signed-off-by: Venkatesh Pallipadi <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
>  kernel/sched.c      |   43 ++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched_fair.c |    6 +++---
>  kernel/sched_rt.c   |    8 ++++----
>  3 files changed, 47 insertions(+), 10 deletions(-)
>

[snip]

> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index ab77aa0..bea7d79 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
>        if (!task_has_rt_policy(curr))
>                return;
>
> -       delta_exec = rq->clock - curr->se.exec_start;
> +       delta_exec = rq->clock_task - curr->se.exec_start;
>        if (unlikely((s64)delta_exec < 0))
>                delta_exec = 0;
>
> @@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
>        curr->se.sum_exec_runtime += delta_exec;
>        account_group_exec_runtime(curr, delta_exec);
>
> -       curr->se.exec_start = rq->clock;
> +       curr->se.exec_start = rq->clock_task;
>        cpuacct_charge(curr, delta_exec);
>
>        sched_rt_avg_update(rq, delta_exec);

Seems the above changes to update_curr_rt() have some false positive
to rt_bandwidth control.
For example:
rt_period=1000000;
rt_runtime=950000;
then if in that period the irq_time is no zero(such as 50000), according to
the throttle mechanism, rt is not throttled. In the end we left no
time to others.
It seems that this break the semantic of throttle.

Maybe we can revert the change to update_curr_rt()?

Thanks,
Yong


2010-11-29 11:59:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Mon, 2010-11-29 at 16:45 +0800, Yong Zhang wrote:
> On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
> <[email protected]> wrote:
> > Commit-ID: 305e6835e05513406fa12820e40e4a8ecb63743c
> > Gitweb: http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
> > Author: Venkatesh Pallipadi <[email protected]>
> > AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
> >
> > sched: Do not account irq time to current task
> >
> > Scheduler accounts both softirq and interrupt processing times to the
> > currently running task. This means, if the interrupt processing was
> > for some other task in the system, then the current task ends up being
> > penalized as it gets shorter runtime than otherwise.
> >
> > Change sched task accounting to acoount only actual task time from
> > currently running task. Now update_curr(), modifies the delta_exec to
> > depend on rq->clock_task.
> >
> > Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
> > extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
> > for later.
> >
> > This change will impact scheduling behavior in interrupt heavy conditions.
> >
> > Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
> > task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
> > spending 75%+ of its time in irq processing. CPU 3 spending around 35%
> > time running nc task.
> >
> > Now, if I run another CPU intensive task on CPU 2, without this change
> > /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
> > change, it rightly shows less than 25% accounted to this task as remaining
> > time is actually spent on irq processing.
> >
> > Signed-off-by: Venkatesh Pallipadi <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > LKML-Reference: <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> > kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
> > kernel/sched_fair.c | 6 +++---
> > kernel/sched_rt.c | 8 ++++----
> > 3 files changed, 47 insertions(+), 10 deletions(-)
> >
>
> [snip]
>
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index ab77aa0..bea7d79 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
> > if (!task_has_rt_policy(curr))
> > return;
> >
> > - delta_exec = rq->clock - curr->se.exec_start;
> > + delta_exec = rq->clock_task - curr->se.exec_start;
> > if (unlikely((s64)delta_exec < 0))
> > delta_exec = 0;
> >
> > @@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
> > curr->se.sum_exec_runtime += delta_exec;
> > account_group_exec_runtime(curr, delta_exec);
> >
> > - curr->se.exec_start = rq->clock;
> > + curr->se.exec_start = rq->clock_task;
> > cpuacct_charge(curr, delta_exec);
> >
> > sched_rt_avg_update(rq, delta_exec);
>
> Seems the above changes to update_curr_rt() have some false positive
> to rt_bandwidth control.
> For example:
> rt_period=1000000;
> rt_runtime=950000;
> then if in that period the irq_time is no zero(such as 50000), according to
> the throttle mechanism, rt is not throttled. In the end we left no
> time to others.
> It seems that this break the semantic of throttle.
>
> Maybe we can revert the change to update_curr_rt()?

No, that's totally correct.

Its the correct and desired behaviour, IRQ time is not time spend
running the RT tasks, hence they should not be accounted for it.

If you still want to throttle RT tasks simply ensure their bandwidth
constraint is lower than the available time.


2010-11-29 14:22:28

by Yong Zhang

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Mon, Nov 29, 2010 at 12:59:50PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-29 at 16:45 +0800, Yong Zhang wrote:
> > On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
> > <[email protected]> wrote:
> > > Commit-ID: 305e6835e05513406fa12820e40e4a8ecb63743c
> > > Gitweb: http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
> > > Author: Venkatesh Pallipadi <[email protected]>
> > > AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
> > > Committer: Ingo Molnar <[email protected]>
> > > CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
> > >
> > > sched: Do not account irq time to current task
> > >
> > > Scheduler accounts both softirq and interrupt processing times to the
> > > currently running task. This means, if the interrupt processing was
> > > for some other task in the system, then the current task ends up being
> > > penalized as it gets shorter runtime than otherwise.
> > >
> > > Change sched task accounting to acoount only actual task time from
> > > currently running task. Now update_curr(), modifies the delta_exec to
> > > depend on rq->clock_task.
> > >
> > > Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
> > > extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
> > > for later.
> > >
> > > This change will impact scheduling behavior in interrupt heavy conditions.
> > >
> > > Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
> > > task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
> > > spending 75%+ of its time in irq processing. CPU 3 spending around 35%
> > > time running nc task.
> > >
> > > Now, if I run another CPU intensive task on CPU 2, without this change
> > > /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
> > > change, it rightly shows less than 25% accounted to this task as remaining
> > > time is actually spent on irq processing.
> > >
> > > Signed-off-by: Venkatesh Pallipadi <[email protected]>
> > > Signed-off-by: Peter Zijlstra <[email protected]>
> > > LKML-Reference: <[email protected]>
> > > Signed-off-by: Ingo Molnar <[email protected]>
> > > ---
> > > kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
> > > kernel/sched_fair.c | 6 +++---
> > > kernel/sched_rt.c | 8 ++++----
> > > 3 files changed, 47 insertions(+), 10 deletions(-)
> > >
> >
> > [snip]
> >
> > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > > index ab77aa0..bea7d79 100644
> > > --- a/kernel/sched_rt.c
> > > +++ b/kernel/sched_rt.c
> > > @@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
> > > if (!task_has_rt_policy(curr))
> > > return;
> > >
> > > - delta_exec = rq->clock - curr->se.exec_start;
> > > + delta_exec = rq->clock_task - curr->se.exec_start;
> > > if (unlikely((s64)delta_exec < 0))
> > > delta_exec = 0;
> > >
> > > @@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
> > > curr->se.sum_exec_runtime += delta_exec;
> > > account_group_exec_runtime(curr, delta_exec);
> > >
> > > - curr->se.exec_start = rq->clock;
> > > + curr->se.exec_start = rq->clock_task;
> > > cpuacct_charge(curr, delta_exec);
> > >
> > > sched_rt_avg_update(rq, delta_exec);
> >
> > Seems the above changes to update_curr_rt() have some false positive
> > to rt_bandwidth control.
> > For example:
> > rt_period=1000000;
> > rt_runtime=950000;
> > then if in that period the irq_time is no zero(such as 50000), according to
> > the throttle mechanism, rt is not throttled. In the end we left no
> > time to others.
> > It seems that this break the semantic of throttle.
> >
> > Maybe we can revert the change to update_curr_rt()?
>
> No, that's totally correct.
>
> Its the correct and desired behaviour, IRQ time is not time spend
> running the RT tasks, hence they should not be accounted for it.

Right.

>
> If you still want to throttle RT tasks simply ensure their bandwidth
> constraint is lower than the available time.

But the available time is harder to calculated than before.
IRQ is random, so as to the irq_time.
But the unthrottle(do_sched_rt_period_timer()) runs in fixed
period which is based on hard clock.

Is that what we want?

Thanks,
Yong

2010-11-29 17:07:02

by Dario Faggioli

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
> > If you still want to throttle RT tasks simply ensure their bandwidth
> > constraint is lower than the available time.
>
> But the available time is harder to calculated than before.
>
Well, it shouldn't... I would say it makes it easier! :-P

> IRQ is random, so as to the irq_time.
>
Well, that's definitely true, as it is true that the time needed to deal
with IRQ is now somehow "lost" or "hidden" (meaning that it is not
accounted to anyone).

> But the unthrottle(do_sched_rt_period_timer()) runs in fixed
> period which is based on hard clock.
>
> Is that what we want?
>
I'm not sure I'm getting what the issue is here... Do you have an
example do discuss?
Because, referring to the one in your first e-mail, if in the last
period (of 1sec) we spent 900ms running -rt tasks and 50ms servicing
interrupts, given a limit was 950ms for sched_rt, why should we throttle
them? :-O

Well, I see that this could mean that all we'd do in that period will be
servicing interrupts and running -rt tasks (for _their_ last 50ms) but,
also means (at least to me) that your system needs some more design
effort.
Then I can also agree on the point that it might make sense to think a
bit on how to take the 50ms from interrupts somehow into account, but
just charging -rt tasks for that time seems quite arbitrary... :-O

Something that I've done recently is trying to figure out, if interrupts
handler have their own threads (as in PREEMPT_RT), what happens if you
try constraining the bandwidth of those thread, using proper scheduling
mechanisms (such as deadline scheduling, but rt-throttling could also be
a "reasonable approximation" :-P)... But it's just preliminary research
results.

BTW, having handlers in threads could actually be the solution per-se
here, since they would then consume -rt bandwidth (if set to -rt) and
contribute to throttling... :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://retis.sssup.it/people/faggioli -- [email protected]


Attachments:
signature.asc (198.00 B)
This is a digitally signed message part

2010-11-30 05:57:38

by Yong Zhang

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Tue, Nov 30, 2010 at 1:06 AM, Raistlin <[email protected]> wrote:
> On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
>> > If you still want to throttle RT tasks simply ensure their bandwidth
>> > constraint is lower than the available time.
>>
>> But the available time is harder to calculated than before.
>>
> Well, it shouldn't... I would say it makes it easier! :-P

It depends on the way how we see it.
>From the side that we can just focus on task time, it's easy
and well understood. :)

Now I think we need a way to export the irq_time to user,
then the user can determine the average time spending
on IRQ.
Maybe Venkatesh Pallipadi's patchset titled "Proper
kernel irq time reporting" is the one.

>
>> IRQ is random, so as to the irq_time.
>>
> Well, that's definitely true, as it is true that the time needed to deal
> with IRQ is now somehow "lost" or "hidden" (meaning that it is not
> accounted to anyone).
>
>> But the unthrottle(do_sched_rt_period_timer()) runs in fixed
>> period which is based on hard clock.
>>
>> Is that what we want?
>>
> I'm not sure I'm getting what the issue is here... Do you have an
> example do discuss?

I was just thinking about pulling out irq_time from rt_period, but
it seems that it's not so easy.

By comparing current unthrottle mechanism and the before one,
there is no difference. IRQ can happen at any time, regardless
who its time will be accounted to. So it's still fair for now.

> Because, referring to the one in your first e-mail, if in the last
> period (of 1sec) we spent 900ms running -rt tasks and 50ms servicing
> interrupts, given a limit was 950ms for sched_rt, why should we throttle
> them? :-O

In the previous version, rt is throttled because irq_time is accounted
to the task.
So in the new version, I should set rt_runtime to 900ms to make it
sensible.

>
> Well, I see that this could mean that all we'd do in that period will be
> servicing interrupts and running -rt tasks (for _their_ last 50ms) but,
> also means (at least to me) that your system needs some more design
> effort.
> Then I can also agree on the point that it might make sense to think a
> bit on how to take the 50ms from interrupts somehow into account, but
> just charging -rt tasks for that time seems quite arbitrary... :-O

Sure.

>
> Something that I've done recently is trying to figure out, if interrupts
> handler have their own threads (as in PREEMPT_RT), what happens if you
> try constraining the bandwidth of those thread, using proper scheduling
> mechanisms (such as deadline scheduling, but rt-throttling could also be
> a "reasonable approximation" :-P)... But it's just preliminary research
> results.

Interesting. Will keep eye on that. :)

>
> BTW, having handlers in threads could actually be the solution per-se
> here, since they would then consume -rt bandwidth (if set to -rt) and
> contribute to throttling... :-)

Agree. This will make time arrangement more accurate.

But as you said above, we must evaluate the side effect to
irq when constraining the bandwidth of irq thread.

Thanks,
Yong

2010-12-01 18:56:07

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Mon, Nov 29, 2010 at 9:57 PM, Yong Zhang <[email protected]> wrote:
> On Tue, Nov 30, 2010 at 1:06 AM, Raistlin <[email protected]> wrote:
>> On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
>>> > If you still want to throttle RT tasks simply ensure their bandwidth
>>> > constraint is lower than the available time.
>>>
>>> But the available time is harder to calculated than before.
>>>
>> Well, it shouldn't... I would say it makes it easier! :-P
>
> It depends on the way how we see it.
> From the side that we can just focus on task time, it's easy
> and well understood. ?:)
>
> Now I think we need a way to export the irq_time to user,
> then the user can determine the average time spending
> on IRQ.
> Maybe Venkatesh Pallipadi's patchset titled "Proper
> kernel irq time reporting" is the one.
>

Yes. That second patchset exports this information through /proc/stat.
Peter: Can you pull that patchset into tip.

Thanks,
Venki

2010-12-01 19:16:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task

On Wed, 2010-12-01 at 10:55 -0800, Venkatesh Pallipadi wrote:
> Peter: Can you pull that patchset into tip.

I'll try and have another look at those patches early next week. Do
remind me if you haven't heard back from me by Wednesday or something.