From: Wanpeng Li
Date: Thu, 30 Mar 2017 14:47:11 +0800
Subject: Re: [BUG nohz]: wrong user and system time accounting
To: Mike Galbraith
Cc: Rik van Riel, Luiz Capitulino, Frederic Weisbecker,
    linux-kernel@vger.kernel.org, Peter Zijlstra, Thomas Gleixner

Cc Peterz, Thomas,

2017-03-30 12:27 GMT+08:00 Mike Galbraith:
> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>
>> In other words, the tick on cpu0 is aligned
>> with the tick on the nohz_full cpus, and
>> jiffies is advanced while the nohz_full cpus
>> with an active tick happen to be in kernel
>> mode?
>
> You really want skew_tick=1, especially on big boxen.
>
>> Frederic, can you think of any reason why
>> the tick on nohz_full CPUs would end up aligned
>> with the tick on cpu0, instead of running at some
>> random offset?
>
> (I OR low rq->clock bits in as crude NOHZ collision avoidance.)
>
>> A random offset, or better yet a somewhat randomized
>> tick length to make sure that simultaneous ticks are
>> fairly rare and the vtime sampling does not end up
>> "in phase" with the jiffies incrementing, could make
>> the accounting work right again.
> That improves jitter, especially on big boxen. I have an 8-socket box
> that thinks it's an extra large PC; there, collision avoidance matters
> hugely. I couldn't reproduce the bean counting woes, no idea if
> collision avoidance will help that.

So I implemented two methods. The first follows Rik's random-offset
proposal and skews the tick; the second follows Frederic's proposal,
which is the same as my original idea: check deltas at nanosecond
granularity, but only perform an actual cputime update when the delta
>= TICK_NSEC. Both methods fix the bug that Luiz reported. Peterz,
Thomas, any ideas?

--------------------------->8-------------------------------------------

Skew the tick:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7fe53be..9981437 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
 	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
 
 	/* Offset the tick to avert jiffies_lock contention. */
+#ifdef CONFIG_NO_HZ_FULL
+	if (sched_skew_tick || tick_nohz_full_running) {
+#else
 	if (sched_skew_tick) {
+#endif
 		u64 offset = ktime_to_ns(tick_period) >> 1;
 		do_div(offset, num_possible_cpus());
 		offset *= smp_processor_id();

-------------------------------------->8--------------------------------

Check deltas at nanosecond granularity, but only perform an actual
cputime update when the delta >= TICK_NSEC:
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..f1ee393 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	u64 now = local_clock();
+	u64 delta;
+
+	delta = now - tsk->vtime_snap;
 
-	if (time_before(now, (unsigned long)tsk->vtime_snap))
+	if (delta < TICK_NSEC)
 		return 0;
 
-	return jiffies_to_nsecs(now - tsk->vtime_snap);
+	return jiffies_to_nsecs(delta / TICK_NSEC);
 }
 
 static u64 get_vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
-	u64 delta, other;
+	u64 delta = vtime_delta(tsk);
+	u64 other;
 
 	/*
 	 * Unlike tick based timing, vtime based timing never has lost
@@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - tsk->vtime_snap);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-	tsk->vtime_snap = now;
+	tsk->vtime_snap += delta;
 
 	return delta - other;
 }
@@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
-	current->vtime_snap = jiffies;
+	current->vtime_snap = sched_clock_cpu(smp_processor_id());
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
-	t->vtime_snap = jiffies;
+	t->vtime_snap = sched_clock_cpu(cpu);
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }

Regards,
Wanpeng Li