Date: Mon, 30 Jul 2018 09:41:00 -0700
From: Eduardo Valentin
To: Peter Zijlstra
CC: Eduardo Valentin, "Rafael J . Wysocki", Thomas Gleixner, Ingo Molnar,
 "H. Peter Anvin", Dou Liyang, Len Brown, "Rafael J. Wysocki",
 "mike.travis@hpe.com", Rajvi Jingar, Pavel Tatashin, Philippe Ombredanne,
 "Kate Stewart", Greg Kroah-Hartman
Subject: Re: [PATCH RESEND 1/1] x86: tsc: avoid system instability in hibernation
Message-ID: <20180730164100.GD15414@u40b0340c692b58f6553c.ant.amazon.com>
References: <20180726155656.14873-1-eduval@amazon.com> <20180730085354.GA2494@hirez.programming.kicks-ass.net>
In-Reply-To: <20180730085354.GA2494@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Mailing-List: linux-kernel@vger.kernel.org

Hey Peter,

On Mon, Jul 30, 2018 at 10:53:54AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 26, 2018 at 08:56:56AM -0700, Eduardo Valentin wrote:
> > System instability are seen during resume from hibernation when system
> > is under heavy CPU load. This is due to the lack of update of sched
> > clock data
>
> Which would suggest you're already running with unstable sched clock.
> Otherwise nobody would care about the scd stuff.

Yes.

>
> What kind of machine are you running? What does:
>
>   dmesg | grep -i tsc
>
> say?

Here:
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.004005] tsc: Detected 3000.000 MHz processor
[    0.066796] TSC deadline timer enabled
[    3.904269] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2b3e459bf4c, max_idle_ns: 440795289890 ns

>
> > The fix for this situation is to mark the sched clock as unstable
> > as early as possible in the resume path, leaving it unstable
> > for the duration of the resume process. This will force the
> > scheduler to attempt to align the sched clock across CPUs using
> > the delta with time of day, updating sched clock data. In a post
> > hibernation event, we can then mark the sched clock as stable
> > again, avoiding unnecessary syncs with time of day on systems
> > in which TSC is reliable.
>
> None of this makes any sense. Either you were already unstable and it
> should already have worked and them marking it stable is an outright
> bug, or your sched clock was stable but then your initial diagnosis of
> lack of scd updates is complete garbage.
>

I see, or it is just a workaround for the underlying issue. I, for sure,
see no lockups anymore after forcing the scd updates.

The other thing that is not super clear is that this happens during the
unfreezing of tasks. If I get a set of CPU hog tasks while unfreezing,
I see the system triggering workqueue lockup detectors in hibernation
restore.

>

-- 
All the best,
Eduardo Valentin