2009-04-06 03:30:44

Subject: Re: + mm-align-vmstat_works-timer.patch added to -mm tree

(switching to lkml and linux-mm)

Hi Anton,

Do you have any measurement data?

Honestly, I made the same patch a few weeks ago,
but I found two problems.

1)
The work queue tracer (in -tip) reported that the timer isn't properly rounded.

The fact is, schedule_delayed_work(work, round_jiffies_relative()) is
a bit flawed.

It means:
- round_jiffies_relative() calculates (rounded time - jiffies)
- schedule_delayed_work() calculates (argument + jiffies)

This assumes jiffies does not change between those two places. IOW, it
assumes a non-preemptible kernel.

2)
> - schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> + schedule_delayed_work_on(cpu, vmstat_work,
> + __round_jiffies_relative(HZ, cpu));

does not mean the same thing.

vmstat_work is meant to move per-cpu statistics to the global statistics.
So (HZ + cpu) is meant to avoid touching the same global variable at the same time.

Oh well, this patch carries a risk of performance regression on _very_ big servers.
(perhaps only SGI?)

But I agree vmstat_work is one of the heaviest work queue users.
From a power consumption point of view, that isn't proper behavior.

I still think it should be improved another way.

>
> The patch titled
> mm: align vmstat_work's timer
> has been added to the -mm tree. Its filename is
> mm-align-vmstat_works-timer.patch
>
> Before you just go and hit "reply", please:
> a) Consider who else should be cc'ed
> b) Prefer to cc a suitable mailing list as well
> c) Ideally: find the original patch on the mailing list and do a
> reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
> out what to do about this
>
> The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/
>
> ------------------------------------------------------
> Subject: mm: align vmstat_work's timer
> From: Anton Blanchard <[email protected]>
>
> Even though vmstat_work is marked deferrable, there are still benefits to
> aligning it. For certain applications we want to keep OS jitter as low as
> possible and aligning timers and work so they occur together can reduce
> their overall impact.
>
> Signed-off-by: Anton Blanchard <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/vmstat.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff -puN mm/vmstat.c~mm-align-vmstat_works-timer mm/vmstat.c
> --- a/mm/vmstat.c~mm-align-vmstat_works-timer
> +++ a/mm/vmstat.c
> @@ -984,7 +984,7 @@ static void vmstat_update(struct work_st
> {
> refresh_cpu_vm_stats(smp_processor_id());
> schedule_delayed_work(&__get_cpu_var(vmstat_work),
> - sysctl_stat_interval);
> + round_jiffies_relative(sysctl_stat_interval));
> }
>
> static void __cpuinit start_cpu_timer(int cpu)
> @@ -992,7 +992,8 @@ static void __cpuinit start_cpu_timer(in
> struct delayed_work *vmstat_work = &per_cpu(vmstat_work, cpu);
>
> INIT_DELAYED_WORK_DEFERRABLE(vmstat_work, vmstat_update);
> - schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> + schedule_delayed_work_on(cpu, vmstat_work,
> + __round_jiffies_relative(HZ, cpu));
> }
>
> /*
> _
>
> Patches currently in -mm which might be from [email protected] are
>
> origin.patch
> mm-align-vmstat_works-timer.patch
> random-align-rekey_works-timer.patch
> sunrpc-align-cache_clean-works-timer.patch

2009-04-06 03:57:51

by KOSAKI Motohiro

Subject: Re: + mm-align-vmstat_works-timer.patch added to -mm tree

> (switching to lkml and linux-mm)
>
> Hi Anton,
>
> Do you have any measurement data?
>
> Honestly, I made the same patch a few weeks ago,
> but I found two problems.
>
> 1)
> The work queue tracer (in -tip) reported that the timer isn't properly rounded.

Ah, sorry, ignore this sentence.
I used a feature from my local patch queue for the measurement, not -tip.

>
> The fact is, schedule_delayed_work(work, round_jiffies_relative()) is
> a bit flawed.
>
> It means:
> - round_jiffies_relative() calculates (rounded time - jiffies)
> - schedule_delayed_work() calculates (argument + jiffies)
>
> This assumes jiffies does not change between those two places. IOW, it
> assumes a non-preemptible kernel.
>
>
> 2)
> > - schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> > + schedule_delayed_work_on(cpu, vmstat_work,
> > + __round_jiffies_relative(HZ, cpu));
>
> does not mean the same thing.
>
> vmstat_work is meant to move per-cpu statistics to the global statistics.
> So (HZ + cpu) is meant to avoid touching the same global variable at the same time.
>
> Oh well, this patch carries a risk of performance regression on _very_ big servers.
> (perhaps only SGI?)
>
> But I agree vmstat_work is one of the heaviest work queue users.
> From a power consumption point of view, that isn't proper behavior.
>
> I still think it should be improved another way.

2009-04-07 04:06:48

by Anton Blanchard

Subject: Re: + mm-align-vmstat_works-timer.patch added to -mm tree

Hi,

> Do you have any measurement data?

I was using a simple set of kprobes to look at when timers and
workqueues fire.

> The fact is, schedule_delayed_work(work, round_jiffies_relative()) is
> a bit flawed.
>
> It means:
> - round_jiffies_relative() calculates (rounded time - jiffies)
> - schedule_delayed_work() calculates (argument + jiffies)
>
> This assumes jiffies does not change between those two places. IOW, it
> assumes a non-preemptible kernel.

I'm not sure we are any worse off here. Before the patch we could end up
with all threads converging on the same jiffy, and once that happens
they will continue to fire over the top of each other (at least until a
difference in the time it takes vmstat_work to complete causes them to
diverge again).

With the patch we always apply a per cpu offset, so should keep them
separated even if jiffies sometimes changes between
round_jiffies_relative() and schedule_delayed_work().

> 2)
> > - schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> > + schedule_delayed_work_on(cpu, vmstat_work,
> > + __round_jiffies_relative(HZ, cpu));
>
> does not mean the same thing.
>
> vmstat_work is meant to move per-cpu statistics to the global statistics.
> So (HZ + cpu) is meant to avoid touching the same global variable at the same time.

round_jiffies_common still provides per cpu skew doesn't it?

/*
* We don't want all cpus firing their timers at once hitting the
* same lock or cachelines, so we skew each extra cpu with an extra
* 3 jiffies. This 3 jiffies came originally from the mm/ code which
* already did this.
* The skew is done by adding 3*cpunr, then round, then subtract this
* extra offset again.
*/

In fact we are also skewing timer interrupts across half a timer tick in
tick_setup_sched_timer:

/* Get the next period (per cpu) */
hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
offset = ktime_to_ns(tick_period) >> 1;
do_div(offset, num_possible_cpus());
offset *= smp_processor_id();
hrtimer_add_expires_ns(&ts->sched_timer, offset);

I still need to see if I can measure a reduction in jitter by removing
this half jiffy skew and aligning all timer interrupts. Assuming we skew
per cpu work and timers, it seems like we shouldn't need to skew timer
interrupts too.

> But I agree vmstat_work is one of the heaviest work queue users.
> From a power consumption point of view, that isn't proper behavior.
>
> I still think it should be improved another way.

I definitely agree it would be nice to fix vmstat_work :)

Anton

2009-04-09 01:00:10

by KOSAKI Motohiro

Subject: Re: + mm-align-vmstat_works-timer.patch added to -mm tree

Hi

>
> Hi,
>
> > Do you have any measurement data?
>
> I was using a simple set of kprobes to look at when timers and
> workqueues fire.

OK, thanks.

> > The fact is, schedule_delayed_work(work, round_jiffies_relative()) is
> > a bit flawed.
> >
> > It means:
> > - round_jiffies_relative() calculates (rounded time - jiffies)
> > - schedule_delayed_work() calculates (argument + jiffies)
> >
> > This assumes jiffies does not change between those two places. IOW, it
> > assumes a non-preemptible kernel.
>
> I'm not sure we are any worse off here. Before the patch we could end up
> with all threads converging on the same jiffy, and once that happens
> they will continue to fire over the top of each other (at least until a
> difference in the time it takes vmstat_work to complete causes them to
> diverge again).
>
> With the patch we always apply a per cpu offset, so should keep them
> separated even if jiffies sometimes changes between
> round_jiffies_relative() and schedule_delayed_work().

Well, OK. I agree your patch is not a step backward.

I mean, I agree the preemptible kernel vs. round_jiffies_relative() problem is
unrelated to your patch.

> > 2)
> > > - schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> > > + schedule_delayed_work_on(cpu, vmstat_work,
> > > + __round_jiffies_relative(HZ, cpu));
> >
> > does not mean the same thing.
> >
> > vmstat_work is meant to move per-cpu statistics to the global statistics.
> > So (HZ + cpu) is meant to avoid touching the same global variable at the same time.
>
> round_jiffies_common still provides per cpu skew doesn't it?
>
> /*
> * We don't want all cpus firing their timers at once hitting the
> * same lock or cachelines, so we skew each extra cpu with an extra
> * 3 jiffies. This 3 jiffies came originally from the mm/ code which
> * already did this.
> * The skew is done by adding 3*cpunr, then round, then subtract this
> * extra offset again.
> */
>
> In fact we are also skewing timer interrupts across half a timer tick in
> tick_setup_sched_timer:
>
> /* Get the next period (per cpu) */
> hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
> offset = ktime_to_ns(tick_period) >> 1;
> do_div(offset, num_possible_cpus());
> offset *= smp_processor_id();
> hrtimer_add_expires_ns(&ts->sched_timer, offset);
>
> I still need to see if I can measure a reduction in jitter by removing
> this half jiffy skew and aligning all timer interrupts. Assuming we skew
> per cpu work and timers, it seems like we shouldn't need to skew timer
> interrupts too.

Ah, you are perfectly right.
I missed it.

> > But I agree vmstat_work is one of the heaviest work queue users.
> > From a power consumption point of view, that isn't proper behavior.
> >
> > I still think it should be improved another way.
>
> I definitely agree it would be nice to fix vmstat_work :)

Thank you for the kind explanation :)