References: <1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com>
 <1558660583-28561-21-git-send-email-ricardo.neri-calderon@linux.intel.com>
From: Stephane Eranian
Date: Mon, 17 Jun 2019 14:38:14 -0700
Subject: Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
To: Thomas Gleixner
Cc: Ricardo Neri, Ingo Molnar, Borislav Petkov, Ashok Raj, Joerg Roedel,
 Andi Kleen, Peter Zijlstra, Suravee Suthikulpanit, "Ravi V. Shankar",
 Randy Dunlap, x86, LKML, iommu@lists.linux-foundation.org, Ricardo Neri,
 Tony Luck, Jacob Pan, Juergen Gross, Bjorn Helgaas, Wincy Van,
 Kate Stewart, Philippe Ombredanne, "Eric W. Biederman", Baoquan He,
 Jan Kiszka, Lu Baolu
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner wrote:
>
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremap_activate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID.
> > > This function can be called every time the destination ID of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> >
> > Brilliant.
> >
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > +	u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > +	struct intel_ir_data *data;
> > > +
> > > +	data = (struct intel_ir_data *)hdata->intremap_data;
> > > +	data->irte_entry.dest_id = IRTE_DEST(destid);
> > > +	return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> >
> > This calls modify_irte() which does at the very beginning:
> >
> >    raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> >
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> >
> > You cannot call into any of that code from NMI context.
> >
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> >
> > But that's just pure luck and not design.
>
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.
>
> Can you tell me why I am actually reviewing patches and spending time on
> this when the result is ignored anyway?
>
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
>
>    Changes vs. v1:
>
>    * Brought back the round-robin mechanism proposed in v1 (this time not
>      using the interrupt subsystem). This also requires to compute
>      expiration times as in v1 (Andi Kleen, Stephane Eranian).
>
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML.
> There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.
>

Let me add some context to this whole patch series. The pressure on the
core PMU counters is increasing as more people want to use them to
measure ever more events. When the PMU is overcommitted, i.e., there are
more events than counters, there is multiplexing. It comes with an
overhead that is too high for certain applications. One way to avoid
this is to lower the multiplexing frequency (the default interval is
1ms), but that comes with loss of accuracy. Another approach is to
measure only a small number of events at a time and use multiple runs,
but then you lose a consistent view of the events. Another approach is
to push for increasing the number of counters. But getting new hardware
counters takes time.

Short term, we can investigate what it would take to free one
cycle-capable counter which is commandeered by the hard lockup detector
on all X86 processors today. The functionality of the watchdog, being
able to get a crash dump on kernel deadlocks, is important and we cannot
simply disable it. At scale, many bugs are exposed and thus machines
deadlock. Therefore, we want to investigate what it would take to move
the detector to another NMI-capable source, such as the HPET, because
the detector does not need a fine-granularity timer and interrupts only
every 2s.

Furthermore, recent Intel errata, e.g., the TSX issue forcing the TFA
code in perf_events, have increased the pressure even more, with only 3
generic counters left. Thus, it is time to look at alternative ways of
getting a hard lockup detector (NMI watchdog) from an NMI source other
than the PMU. To that end, I have been discussing alternatives. Intel
suggested using the HPET and Ricardo has been working on producing this
patch series.
It is clear from your review that the patches have issues, but I am
hoping that they can be resolved with constructive feedback, knowing
what the end goal is.

As for the round-robin changes, yes, we discussed this as an alternative
to avoid overloading CPU0 with all of the work of broadcasting IPIs to
100+ other CPUs.

Thanks.