From: Michael Chan
Date: Mon, 27 Jul 2020 00:26:22 -0700
Subject: Re: [PATCH v2] tg3: driver sleeps indefinitely when EEH errors exceed eeh_max_freezes
To: David Christensen
Cc: Netdev, Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, open list
References: <20200615190119.382589-1-drc@linux.vnet.ibm.com> <20200617185117.732849-1-drc@linux.vnet.ibm.com> <4bb07889-063f-c12d-28e0-11de9766c774@linux.vnet.ibm.com>
In-Reply-To: <4bb07889-063f-c12d-28e0-11de9766c774@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 24, 2020 at 5:19 PM David Christensen wrote:
> In the working case, tg3_init_hw() returns successfully, resulting in
> every instance of napi_disable() being followed by an instance of
> napi_enable().
>
> In the failing case, tg3_hw_init() returns an error. (This is not
> surprising since the system is now preventing the adapter from accessing
> its MMIO registers. I'm curious why it doesn't always fail.) When
> tg3_hw_init() fails, tg3_netif_start() is not called, and we end up with
> two sequential calls to napi_disable(), resulting in multiple hung task
> messages.
>

If the driver fails to initialize the chip completely, the tg3_flags
should indicate we are in this failed state. We already have
TG3_FLAG_INIT_COMPLETE.
Perhaps we can expand the use of this flag to cover the scenario that
you described above. We can clear TG3_FLAG_INIT_COMPLETE before calling
tg3_halt() and only set it back when tg3_init_hw() completes
successfully. This is the rough idea, but a more detailed analysis of
how this flag is used needs to be done first. Assuming this works, the
EEH handler can check TG3_FLAG_INIT_COMPLETE to see if we should call
tg3_netif_stop().

Another way to fix it is to call dev_close() if tg3_reset_task() fails
to re-initialize the device.
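
To make the flag idea concrete, here is a toy userspace model, not driver code. The struct, the model_* function names, and the depth counter are all made up for illustration. Only the ordering comes from the suggestion above: clear the flag before the halt, set it again only after a successful re-init, and have the EEH handler check it before stopping. A depth greater than 1 stands in for the second napi_disable() call that would hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the tg3 driver. */
struct model_dev {
    bool init_complete;      /* stands in for TG3_FLAG_INIT_COMPLETE */
    bool hw_reachable;       /* false once the platform has frozen MMIO access */
    int  napi_disable_depth; /* a value above 1 models a hung second napi_disable() */
};

/* Models tg3_netif_stop(): one more outstanding napi_disable(). */
static void model_netif_stop(struct model_dev *d)
{
    d->napi_disable_depth++;
}

/* Models tg3_netif_start(): the matching napi_enable(). */
static void model_netif_start(struct model_dev *d)
{
    assert(d->napi_disable_depth > 0);
    d->napi_disable_depth--;
}

/* Models hardware init: fails when the device cannot be reached. */
static int model_init_hw(const struct model_dev *d)
{
    return d->hw_reachable ? 0 : -1;
}

/* Models the reset path: stop, clear the flag, re-init, restart on success. */
static int model_reset_task(struct model_dev *d)
{
    model_netif_stop(d);
    d->init_complete = false;  /* cleared before the (omitted) halt step */

    if (model_init_hw(d) != 0)
        return -1;             /* NAPI stays disabled and the flag stays clear */

    d->init_complete = true;   /* set again only after a successful init */
    model_netif_start(d);
    return 0;
}

/* Models the EEH error handler: skip the stop if init never completed. */
static void model_eeh_error_detected(struct model_dev *d)
{
    if (d->init_complete)
        model_netif_stop(d);
}
```

In this model, a failed model_reset_task() followed by model_eeh_error_detected() leaves napi_disable_depth at 1 rather than 2, which is the property the flag check is meant to provide. Whether the real flag can be reused this way still depends on the analysis of its existing users mentioned above.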