Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751457AbdC0RhV (ORCPT ); Mon, 27 Mar 2017 13:37:21 -0400 Received: from mx1.redhat.com ([209.132.183.28]:53382 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751350AbdC0RhM (ORCPT ); Mon, 27 Mar 2017 13:37:12 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 24ABD41A37 Authentication-Results: ext-mx06.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx06.extmail.prod.ext.phx2.redhat.com; spf=pass smtp.mailfrom=riel@redhat.com DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 24ABD41A37 Message-ID: <1490636129.8850.76.camel@redhat.com> Subject: Re: [BUG nohz]: wrong user and system time accounting From: Rik van Riel To: Wanpeng Li , Luiz Capitulino Cc: Frederic Weisbecker , "linux-kernel@vger.kernel.org" , linux-rt-users@vger.kernel.org Date: Mon, 27 Mar 2017 13:35:29 -0400 In-Reply-To: References: <20170323165512.60945ac6@redhat.com> Organization: Red Hat, Inc Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Mon, 27 Mar 2017 17:35:33 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2028 Lines: 47 On Mon, 2017-03-27 at 09:56 +0800, Wanpeng Li wrote: > > Actually after I bisect, the first bad commit is ff9a9b4c4334 > ("sched, > time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug > can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true At the time, we thought it was an "occasionally bad" / "unlucky" kind of bug, not a systemic issue, like your observations seem to suggest. > Let's consider the cpu which has responsibility for the global > timekeeping, as the tracing posted above, the vtime_account_user() is > called before tick_sched_timer() which will update jiffies, so > jiffies > is stale in vtime_account_user() and the run time in userspace is > skipped, the vtime_user_enter() is called after jiffies update, so > both the time in userspace and in  kernel are accumulated to sys > time. > If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is > fine. However, if you give stress to the housekeeping cpu, top will > show 100% sys-time of both the housekeeping cpu and the other cpus > who > have at least two tasks running on and in full_nohz mode. I think it > is because the stress delays the timer interrupt handling in some > degree, then the jiffies is not updated timely before other cpus > access it in vtime_account_user(). > > I think we can keep syscalls/exceptions context tracking still in > jiffies based sampling and utilize local_clock() in vtime_delta() > again for irqs which avoids jiffies stale influence. I can make a > patch if the idea is acceptable or there is any better proposal. :) Making that patch seems worthwhile, but I would like to know what the root cause is of the issue that is being observed. Is the problem due to the nohz_full CPU receiving an interrupt at the same time the timer interrupt fires on the housekeeping CPU? Is it due to a nohz_full CPU updating jiffies all by itself from irq context? In that case, could it be better to always have that be done by the housekeeping CPU? What exactly is going on here?