DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 49F6375724
Date: Thu, 30 Mar 2017 17:25:46 -0400
From: Luiz Capitulino <lcapitulino@redhat.com>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Wanpeng Li <kernellwp@gmail.com>, Mike Galbraith <efault@gmx.de>,
        Rik van Riel <riel@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Message-ID: <20170330172546.4e8e1a6a@redhat.com>
In-Reply-To: <20170330141816.GE3626@lerouge>
References: <20170323165512.60945ac6@redhat.com>
        <CANRm+CxcgSP2-x+A822DmHLvFLzFmTptS6oYwYtwVdErTpiB=Q@mail.gmail.com>
        <1490636129.8850.76.camel@redhat.com>
        <20170328132406.7d23579c@redhat.com>
        <20170329131656.1d6cb743@redhat.com>
        <1490818125.28917.11.camel@redhat.com>
        <1490848051.4167.57.camel@gmx.de>
        <CANRm+CzfqkU6iV1BFk3PmWvy7MOsYH1mSwejJXMkvWW9C5ngwg@mail.gmail.com>
        <20170330133802.GC3626@lerouge>
        <CANRm+Cw8-7PydicanPCkNcVLi16Yr_0J9Sj6m6xWp4OUxLywjQ@mail.gmail.com>
        <20170330141816.GE3626@lerouge>
Organization: Red Hat
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1980
Lines: 46

On Thu, 30 Mar 2017 16:18:17 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:  
> > > If it works, we may want to take that solution, likely less performance sensitive
> > > than using sched_clock(). In fact sched_clock() is fast, especially as we require it to
> > > be stable for nohz_full, but using it involves costly conversion back and forth to jiffies.  
> > 
> > So both Rik and you agree with the skew tick solution, I will try it
> > tomorrow. Btw, if we should just add random offset to the cpu in the
> > nohz_full mode or add random offset to all cpus like the codes above?  
> 
> Lets just keep it to all CPUs for simplicty.
> Also please add a comment that explains why we need that skew_tick on nohz_full.

I've tried all the test-cases we discussed in this thread with skew_tick=1
and they worked as expected on bare-metal and in KVM guests.

However, I found a test-case that works on bare-metal but shows problems
in KVM guests. It could be something that's KVM specific, or it could be
something that's harder to reproduce on bare-metal.

The reproducer is (not sure all the steps are necessary):

1. Isolate 8 cores in the host with isolcpus= and nohz_full= (and skew_tick=1)

2. Create a KVM guest with 8 vCPUs and pin each vCPU to an isolated
   host core

3. Boot the guest with isolcpus=2,3,4,5,6,7 nohz_full=2,3,4,5,6,7 skew_tick=1

4. Once the guest is booted, run:

# for i in $(seq 2 7); do taskset -c $i hog & done
# taskset -c 2,3,4,5,6,7 \
  cyclictest -m -n -q -p95 -D 1m -h60 -i 200 -t 6 -a 2,3,4,5,6,7

  (where hog is a program taking 100% of the CPU, and cyclictest
   is RT's cyclictest)

5. Run top -d1

A few minutes into this test-case, I see one isolated CPU in the
guest reporting around 95% system time (the expected result is close
to 100% user time, which the other isolated CPUs correctly report).
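
To rule out a display issue in top, the same split can be computed
straight from the per-CPU counters in /proc/stat. The following is only
a minimal sketch, assuming the field order documented in proc(5) (user,
nice, system, idle, iowait, irq, softirq, steal). The cpu_split helper
and the snapshot file names are hypothetical, not something used in this
thread.

```shell
#!/bin/sh
# cpu_split BEFORE AFTER
# Print, for each CPU, the share of user and system time accumulated
# between two saved snapshots of /proc/stat (hypothetical helper).
cpu_split() {
    awk '
        # Match per-CPU lines only ("cpu0", "cpu1", ...), not the
        # aggregate "cpu" line.
        $1 ~ /^cpu[0-9]+$/ {
            # Sum user..steal (fields 2-9); guest time is already
            # included in the user field.
            total = 0
            for (i = 2; i <= 9; i++)
                total += $i

            if (NR == FNR) {
                # First file: remember the baseline counters.
                user[$1] = $2
                sys[$1] = $4
                tot[$1] = total
                next
            }

            # Second file: report the deltas as percentages.
            d = total - tot[$1]
            if (d > 0)
                printf "%s user=%.1f%% system=%.1f%%\n", $1,
                       100 * ($2 - user[$1]) / d,
                       100 * ($4 - sys[$1]) / d
        }
    ' "$1" "$2"
}
```

In the guest, while the test-case is running, it could be used like this:

# cat /proc/stat > /tmp/stat.before; sleep 10; cat /proc/stat > /tmp/stat.after
# cpu_split /tmp/stat.before /tmp/stat.after

If the affected CPU shows the same high system share here, the wrong
numbers are in the kernel's accounting rather than in top.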