Received: by 10.192.165.148 with SMTP id m20csp4877957imm; Tue, 24 Apr 2018 09:50:02 -0700 (PDT) X-Google-Smtp-Source: AIpwx4+1t+ByfE8X2hi9JztNeYJlEY6/PFpJIus0CP0990HBgvO6rqM4ZvSA+skMXADoz5Z1GZFT X-Received: by 2002:a17:902:8546:: with SMTP id d6-v6mr26645993plo.106.1524588601982; Tue, 24 Apr 2018 09:50:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524588601; cv=none; d=google.com; s=arc-20160816; b=ByPpca7/iXBC+jLQtN3PvDTdup5ILGCj9yB3n3Let585lgf5Z2CDZUN3Mk4pkGI4d+ eXdOQszMcFI8a11w8bR+bkDI+8r/oXQPsSpiE3Jk9+d8q85VKX+PgbTGK+znL8afaF1K fyPtw/L+0IVtguYBawPcur8edbu9btJPlEEsochSaxIJucFrbgumBvJycfwCh3Saj253 Ng49vkcWslu+Yfiad6CUOGllCxC0ALt3AspPgoVlW3Vd786F3w+u9QtY6M5D6HEBM4WX DOhb6LtBBkU9gwA6FdpVUqlFglS123QyxhoBJxwBtrdVlClPJ/nt3uR+cxYXRb+rr7fv Kwdw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dmarc-filter:arc-authentication-results; bh=F2wnVDO2UhqjrjDxqSUnbkk8aobkOThumuc/DqY9xA8=; b=xTSxWqSrcKSNz3sQGUa5BYYZKsv/D2SzwteVU1/D9F7EDFFgWZZfBPyHI0XuiXGb7T pZpbub6KQgqhgPpX6sB+a2Vu75qMmr0Oqi80dB8roulwHAVBmzl+Dhaw/QKGN3evrgRd XEWKHnd6lGJbTGOv9fYKOePlX2A/Dh0q0i02NVZ9jJILXucL8Xsd9hq6JuZGO4bPOXEh q40tkeTUYmC+LST3hxAIKHTnwQn245taLYytd4JdX2Ew5036uRa7l1qJmWo3bdGCI+Le frl+PiNNIrBuzTKE0+R3ikbaYt9ws/F+le6qNzQls5x94A7Vd+6dalv7OfyKICnHExM2 xttg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u81si12102481pfg.289.2018.04.24.09.49.47; Tue, 24 Apr 2018 09:50:01 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752353AbeDXQs2 (ORCPT + 99 others); Tue, 24 Apr 2018 12:48:28 -0400 Received: from mail.kernel.org ([198.145.29.99]:36144 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750827AbeDXQs0 (ORCPT ); Tue, 24 Apr 2018 12:48:26 -0400 Received: from localhost (LFbn-NCY-1-193-82.w83-194.abo.wanadoo.fr [83.194.41.82]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 1BCCF217CE; Tue, 24 Apr 2018 16:48:24 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1BCCF217CE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=frederic@kernel.org Date: Tue, 24 Apr 2018 18:48:23 +0200 From: Frederic Weisbecker To: Thomas Gleixner Cc: "Wan, Kaike" , "Marciniszyn, Mike" , "Dalessandro, Dennis" , "Weiny, Ira" , "Fleck, John" , "linux-kernel@vger.kernel.org" , "linux-rdma@vger.kernel.org" , Peter Zijlstra , Anna-Maria Gleixner , Frederic Weisbecker , Ingo Molnar Subject: Re: hrtimer (rdmavt RNR timer) was lost Message-ID: <20180424164821.GA31724@lerouge> References: <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com> <3F128C9216C9B84BB6ED23EF16290AFB634C9420@CRSMSX101.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 24, 2018 at 04:54:58PM +0200, Thomas Gleixner wrote: > On Mon, 23 Apr 2018, Thomas Gleixner wrote: > > On Mon, 23 Apr 2018, Wan, Kaike wrote: > > > > Can you apply the following debug patch and enable the hrtimer_start trace > > > > point and send me the full trace or upload it somewhere? > > > > > > The original trace was about 29GB and I filtered it with > > > "0000000066dda1ea" (the offending base) to generate a 1.4GB file that I > > > could open and investigate. I am not sure how I can send them to you. Do > > > you have somewhere I can upload to? > > > > > > I can try your debug patch and again I am anticipating a big trace file. > > > > Well, you can find the spot where the fail happens and then extract the > > full thing from 2s before that point to 1s after. That should be reasonably > > small and good enough. Let me know when you have it and how big it is > > (compressed) and we'll figure something out how to transport it. > > Thanks for the more complete data which actually let me decode the wreckage. > > So you have NO_HZ enabled and looking at the trace then this is related: > > -0 [003] dN.. 3598605307236: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=712171000000 softexpires=712171000000mode=ABS|PINNED base=0000000066dda1ea next=00000000708914d7 > -0 [003] dN.. 3598605307601: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=0000000004346740 > > -0 [002] d.h. 3598605313255: hrtimer_start: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] expires=712172915662 softexpires=712172915662mode=REL base=0000000066dda1ea next=0000000004346740 > -0 [002] d.h. 3599538740786: hrtimer_post_add: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] base=0000000066dda1ea next=00000000662d2dd8 > > -0 [003] dn.. 3599538742242: hrtimer_pre_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > -0 [003] dn.. 3599538743084: hrtimer_post_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > > -0 [003] dn.. 3599538743830: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=716767000000 softexpires=716767000000mode=ABS|PINNED base=0000000066dda1ea next=00000000662d2dd8 > -0 [003] dn.. 3599538744560: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > > And staring at the NOHZ code I'm pretty sure, I know what happens. > > CPU 3 CPU 2 > > idle > start tick_sched_timer 712171000000 > start rdmavt timer 712172915662 > lock(base) > tick_nohz_stop_tick() > tick = 716767000000 timerqueue_add(tmr) > > hrtimer_set_expires(&ts->sched_timer, tick); <---- FAIL > > if (tmr->exp < queue->next->exp) > hrtimer_start(sched_timer) queue->next = tmr; > lock(base) > unlock(base) > timerqueue_remove() > timerqueue_add() > .... > > So ts->sched_timer is still queued and queue->next is pointing to it, but > then expires is modified. So the other side sees the new expiry time and > makes the rdmavt timer the new queue->next. All f*cked from there. > > The problem was introduced with: > > d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") > > The changelog is not really helpful, but I can't see why the expiry time of > ts->sched_timer should be updated before the timer is [re]started in case > of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be > sufficient. Of course the tick needs to be stored in ts->sched_timer for > the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against > Linus tree below. It's easy to backport in case you want to run it against > the kernel which made the observation simpler. > > I need to come up with integration of hrtimer_set_expires() into debug > objects to catch this kind of horrors. Groan.... Sorry for all that, that must have been hairy enough to debug :-( I thought that I could fiddle with the sched_timer because nothing else is supposed to touch it concurrently, but that forgot about the whole queue locked under cpu_base. My fault. The fix looks all good to me, thanks! ACK.