Received: by 10.192.165.148 with SMTP id m20csp4763446imm; Tue, 24 Apr 2018 08:03:40 -0700 (PDT) X-Google-Smtp-Source: AIpwx4+QTwMdRLBLtjzPIs2XTGniA68uw6w7fQIcAwNYGcYP9QHdJko7vuK71W2Duy1Dn+IjQDsX X-Received: by 10.101.102.3 with SMTP id w3mr20623635pgv.151.1524582220289; Tue, 24 Apr 2018 08:03:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524582220; cv=none; d=google.com; s=arc-20160816; b=LfQt7svcuXmMzfK8W+xhO7tgb4Luqsu3DinRV4cH6Ui0QXklNxy5VzT5L27a3sXJsD ZIlTRFrkeUca6k0rxLwlfxVxsK6wEpRdHOOk8xSUN7i5/WAoT3Ut437ZZyQJWNyERCyA gi4P3VG5imt0EXCfroq8J+IE1v9RRK8wuYFLeoEUfxnv0Wjkzqo041ZTyj8jlbrlP28q mnEoHNczLjW7/JYnjrdN6PjdDynVq+F7VVipzxnCiLs5gDm5DDmG757iOvRSnP4/s6lI XI455Gx3YUJr9T1OuPM+sk3FzTcVbnlY7XUX15oXdcOGWf9coi+X4cavV5lqemLAw8oQ /6Eg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:references :message-id:in-reply-to:subject:cc:to:from:date :arc-authentication-results; bh=QBuz7CSODic4NUqOlvcHZxJ6FAFLPoskvZt4PW7/4Lg=; b=f1iIjFi9gRjKxDrkwPqBWDpBId/nww5ssWu++q+0WAkXqX7lyjzvuYSj29eGCrRp/w mswFtK8SL7ykwkwhoxm42lZgtpxit6Z3ieGgqOmNzMTWKrMs8XdPS6TSCNDdzORtPWin sKHz2lqVKASh3ZwHfUdOGjpuyAgeI2oICfgwukP5AOj7NwL3JBawPpw5xkS5Y2LncCJG opS7QBmdfJwv7iFgZS2SCxG8fhNVsy6Cq/CHG23NDIO33ggWLINM26QxbSctPfq0SsKG DK5KXXQZBp/pLYrUQD5VOiTnhybTibfMVXck/HM2ml0fNU+y5iLL4UeDMlUz790o/Ntm 1i0A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u21si13625336pfl.143.2018.04.24.08.03.17; Tue, 24 Apr 2018 08:03:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750902AbeDXOzE (ORCPT + 99 others); Tue, 24 Apr 2018 10:55:04 -0400 Received: from Galois.linutronix.de ([146.0.238.70]:36195 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750739AbeDXOzD (ORCPT ); Tue, 24 Apr 2018 10:55:03 -0400 Received: from hsi-kbw-5-158-153-52.hsi19.kabel-badenwuerttemberg.de ([5.158.153.52] helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1fAzLO-0004zI-Js; Tue, 24 Apr 2018 16:54:58 +0200 Date: Tue, 24 Apr 2018 16:54:58 +0200 (CEST) From: Thomas Gleixner To: "Wan, Kaike" cc: "Marciniszyn, Mike" , "Dalessandro, Dennis" , "Weiny, Ira" , "Fleck, John" , "linux-kernel@vger.kernel.org" , "linux-rdma@vger.kernel.org" , Peter Zijlstra , Anna-Maria Gleixner , Frederic Weisbecker , Ingo Molnar Subject: RE: hrtimer (rdmavt RNR timer) was lost In-Reply-To: Message-ID: References: <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com> <3F128C9216C9B84BB6ED23EF16290AFB634C9420@CRSMSX101.amr.corp.intel.com> User-Agent: Alpine 2.21 (DEB 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 23 Apr 2018, Thomas Gleixner wrote: > On Mon, 23 Apr 2018, Wan, Kaike wrote: > > > Can you apply the following debug patch and enable the hrtimer_start trace > > > point and send me the full trace or upload it somewhere? > > > > The original trace was about 29GB and I filtered it with > > "0000000066dda1ea" (the offending base) to generate a 1.4GB file that I > > could open and investigate. I am not sure how I can send them to you. Do > > you have somewhere I can upload to? > > > > I can try your debug patch and again I am anticipating a big trace file. > > Well, you can find the spot where the fail happens and then extract the > full thing from 2s before that point to 1s after. That should be reasonably > small and good enough. Let me know when you have it and how big it is > (compressed) and we'll figure something out how to transport it. Thanks for the more complete data which actually let me decode the wreckage. So you have NO_HZ enabled and looking at the trace then this is related: -0 [003] dN.. 3598605307236: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=712171000000 softexpires=712171000000mode=ABS|PINNED base=0000000066dda1ea next=00000000708914d7 -0 [003] dN.. 3598605307601: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=0000000004346740 -0 [002] d.h. 3598605313255: hrtimer_start: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] expires=712172915662 softexpires=712172915662mode=REL base=0000000066dda1ea next=0000000004346740 -0 [002] d.h. 3599538740786: hrtimer_post_add: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] base=0000000066dda1ea next=00000000662d2dd8 -0 [003] dn.. 3599538742242: hrtimer_pre_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 -0 [003] dn.. 3599538743084: hrtimer_post_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 -0 [003] dn.. 3599538743830: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=716767000000 softexpires=716767000000mode=ABS|PINNED base=0000000066dda1ea next=00000000662d2dd8 -0 [003] dn.. 3599538744560: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 And staring at the NOHZ code I'm pretty sure, I know what happens. CPU 3 CPU 2 idle start tick_sched_timer 712171000000 start rdmavt timer 712172915662 lock(base) tick_nohz_stop_tick() tick = 716767000000 timerqueue_add(tmr) hrtimer_set_expires(&ts->sched_timer, tick); <---- FAIL if (tmr->exp < queue->next->exp) hrtimer_start(sched_timer) queue->next = tmr; lock(base) unlock(base) timerqueue_remove() timerqueue_add() .... So ts->sched_timer is still queued and queue->next is pointing to it, but then expires is modified. So the other side sees the new expiry time and makes the rdmavt timer the new queue->next. All f*cked from there. The problem was introduced with: d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") The changelog is not really helpful, but I can't see why the expiry time of ts->sched_timer should be updated before the timer is [re]started in case of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be sufficient. Of course the tick needs to be stored in ts->sched_timer for the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against Linus tree below. It's easy to backport in case you want to run it against the kernel which made the observation simpler. I need to come up with integration of hrtimer_set_expires() into debug objects to catch this kind of horrors. Groan.... Thanks, tglx 8<----------- --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -804,12 +804,12 @@ static void tick_nohz_stop_tick(struct t return; } - hrtimer_set_expires(&ts->sched_timer, tick); - - if (ts->nohz_mode == NOHZ_MODE_HIGHRES) - hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED); - else + if (ts->nohz_mode == NOHZ_MODE_HIGHRES) { + hrtimer_start(&ts->sched_timer, tick, HRTIMER_MODE_ABS_PINNED); + } else { + hrtimer_set_expires(&ts->sched_timer, tick); tick_program_event(tick, 1); + } } static void tick_nohz_retain_tick(struct tick_sched *ts)