Received: by 10.192.165.148 with SMTP id m20csp3390149imm; Mon, 23 Apr 2018 05:54:33 -0700 (PDT) X-Google-Smtp-Source: AIpwx4/8XjaEEtWrii5wS8yKieHVLPbEPnrbng34nbWaLrvw3nhPtmjJqIIDwpjeSuXan+h/CQf9 X-Received: by 10.101.93.138 with SMTP id f10mr17322883pgt.25.1524488073905; Mon, 23 Apr 2018 05:54:33 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524488073; cv=none; d=google.com; s=arc-20160816; b=fXkJlicDEhC/o5W2Y6i7wsGyTgdVetpt57QLGwhHjKkTYAjhIxhuce7Na+kMUM0wq8 AkpmMQACptK2XL1EWTMaxazbJDty7gN7RO3B+imM8CcsFdisP+8enn8YcR6Pj9ORDqwk rH7sXEiM1Tvh8/46agJnzgOUULQmtzSBVlsd6hmlLhD/bmweiI7JS7e04mrsjhQYOJ4W 8f2DEGrMaV3k9qEVV0ShfECn4C4CIabCl1wcQ6mKMToLvVXmCloT/a+J0q7ZmoX5kJAq enDQdofHPrOZpxJm62uJD6WlEvx6rXsgE9Ipn1CdLu6t5YiKHzZs1SQ/uBza55FHIl/w aRSw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:references :message-id:in-reply-to:subject:cc:to:from:date :arc-authentication-results; bh=DTKK+svy+CvjPuA4Xk+ws8Eiy9VPL13p4tRtnZ3rLac=; b=VfhC//r2Y6UIpcHQhEj7U9Hf1mX8jxKW+YehvXEFuLmqgMHPuK8VjsQf4sqGGDdRcM kBCV5ueh5rLL4fTLP1Xz35xM8bDF9exTycJIZGcWL9N4+Myly2vmfljn8LtQvGIvNhhL H+EXvrcTkEkidmHnwJWI3o7OwuT1ISKDP21ny5p5DQnHjEcb0OlXsANXgQiMK3SWQtKo KFJEQ1XeA4rkwi8+KgUUf18l/yd8qlF3YJE7ZXwb3E8m8jzFS17T/grWJ3n2Rl9hvVt4 c7RrxlFSapxdCqYFqsa/0weJ/NHuRlBXYyQpGQMGHF4D38xOSRNLYGoZyPH/dNACFIsY 1Srw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y14si9461871pgo.85.2018.04.23.05.54.19; Mon, 23 Apr 2018 05:54:33 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754910AbeDWMxO (ORCPT + 99 others); Mon, 23 Apr 2018 08:53:14 -0400 Received: from Galois.linutronix.de ([146.0.238.70]:55501 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754786AbeDWMxN (ORCPT ); Mon, 23 Apr 2018 08:53:13 -0400 Received: from hsi-kbw-5-158-153-52.hsi19.kabel-badenwuerttemberg.de ([5.158.153.52] helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1fAaxw-0008Lq-Ol; Mon, 23 Apr 2018 14:53:08 +0200 Date: Mon, 23 Apr 2018 14:53:08 +0200 (CEST) From: Thomas Gleixner To: "Wan, Kaike" cc: "Marciniszyn, Mike" , "Dalessandro, Dennis" , "Weiny, Ira" , "Fleck, John" , "linux-kernel@vger.kernel.org" , "linux-rdma@vger.kernel.org" Subject: Re: hrtimer (rdmavt RNR timer) was lost In-Reply-To: <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com> Message-ID: References: <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com> User-Agent: Alpine 2.21 (DEB 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 20 Apr 2018, Wan, Kaike wrote: Please do not send html mail. It's rejected by the list server. > -0 [001] d.h. 3598473310671: hrtimer_start: hrtimer=00000000708914d7 function=rvt_rc_rnr_retry [rdmavt] expires=712172295802 softexpires=712172295802mode=REL base=0000000066dda1ea next=0000000039cae92b > -0 [001] d.h. 3598473311561: hrtimer_post_add: hrtimer=00000000708914d7 function=rvt_rc_rnr_retry [rdmavt] base=0000000066dda1ea next=0000000039cae92b > > --> The target was first added. The expires was 712172295802. The call > was made from cpu 1 while the timer base was on cpu 3. > -0 [003] dn.. 3598605305357: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=716767000000 softexpires=716767000000mode=ABS|PINNED base=0000000066dda1ea next=00000000708914d7 > -0 [003] dn.. 3598605305781: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000708914d7 > > --> the target timer was the next one to trigger. The call was from cpu 3. > > -0 [003] dN.. 3598605306604: hrtimer_pre_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000708914d7 > -0 [003] dN.. 3598605306802: hrtimer_post_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000708914d7 > > --> The tick_sched_timer was removed > > -0 [003] dN.. 3598605307236: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=712171000000 softexpires=712171000000mode=ABS|PINNED base=0000000066dda1ea next=00000000708914d7 > -0 [003] dN.. 3598605307601: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=0000000004346740 > > --> The tick_sched_timer was added again with a new expires (712171000000 > < 716767000000), which replaced the target (712172295802) as the > head->next. > > -0 [002] d.h. 3598605313255: hrtimer_start: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] expires=712172915662 softexpires=712172915662mode=REL base=0000000066dda1ea next=0000000004346740 > -0 [002] d.h. 3599538740786: hrtimer_post_add: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] base=0000000066dda1ea next=00000000662d2dd8 > > > --> A new hrtimer (00000000662d2dd8) was added (from cpu 2). Its expires > (712172915662) was larger than the target's (712172295802). How could > it become the next hrtimer to trigger??? The next pointer was not set > properly here. I have no idea how that can happen assumed that you did not strip anything from the trace. If you stripped stuff, please send me the full trace or upload it somewhere. Can you apply the following debug patch and enable the hrtimer_start trace point and send me the full trace or upload it somewhere? Thanks, tglx 8<------------------------- --- a/lib/timerqueue.c +++ b/lib/timerqueue.c @@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he rb_link_node(&node->node, parent, p); rb_insert_color(&node->node, &head->head); + trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n", + head, head->next, head->next ? head->next->expires : 0ULL, + node, node->expires); + if (!head->next || node->expires < head->next->expires) { head->next = node; return true; @@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he head->next = rb_entry_safe(rbn, struct timerqueue_node, node); } + + trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n", + head, head->next, head->next ? head->next->expires : 0ULL, + node, node->expires); + rb_erase(&node->node, &head->head); RB_CLEAR_NODE(&node->node); return head->next != NULL;