Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932268AbcKUUka (ORCPT ); Mon, 21 Nov 2016 15:40:30 -0500 Received: from smtp.codeaurora.org ([198.145.29.96]:37364 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932199AbcKUUk0 (ORCPT ); Mon, 21 Nov 2016 15:40:26 -0500 DMARC-Filter: OpenDMARC Filter v1.3.1 smtp.codeaurora.org B0688612E8 Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=pass smtp.mailfrom=tbaicar@codeaurora.org Subject: Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN To: "Neftin, Sasha" , jeffrey.t.kirsher@intel.com, intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, okaya@codeaurora.org, timur@codeaurora.org References: <1478727681-16505-1-git-send-email-tbaicar@codeaurora.org> <42e53bc2-0361-e7b7-1093-4e095aa56955@intel.com> <3914adae-91b2-7b74-02bc-579b3e32d143@intel.com> From: "Baicar, Tyler" Message-ID: <20c191fe-e8fe-b2ca-0628-7d0188d1f28e@codeaurora.org> Date: Mon, 21 Nov 2016 13:40:17 -0700 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9087 Lines: 206 On 11/17/2016 6:31 AM, Neftin, Sasha wrote: > On 11/13/2016 10:34 AM, Neftin, Sasha wrote: >> On 11/11/2016 12:35 AM, Baicar, Tyler wrote: >>> Hello Sasha, >>> >>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote: >>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote: >>>>> Move IRQ free code so that it will happen regardless of the >>>>> __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ >>>>> if the __E1000_DOWN bit is cleared. This is not sufficient because >>>>> it is possible for __E1000_DOWN to be set without releasing the IRQ. >>>>> In such a situation, we will hit a kernel bug later in e1000_remove >>>>> because the IRQ still has action since it was never freed. A >>>>> secondary bus reset can cause this case to happen. >>>>> >>>>> Signed-off-by: Tyler Baicar >>>>> --- >>>>> drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++- >>>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c >>>>> b/drivers/net/ethernet/intel/e1000e/netdev.c >>>>> index 7017281..36cfcb0 100644 >>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c >>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c >>>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev) >>>>> if (!test_bit(__E1000_DOWN, &adapter->state)) { >>>>> e1000e_down(adapter, true); >>>>> - e1000_free_irq(adapter); >>>>> /* Link status message must follow this format */ >>>>> pr_info("%s NIC Link is Down\n", adapter->netdev->name); >>>>> } >>>>> + e1000_free_irq(adapter); >>>>> + >>>>> napi_disable(&adapter->napi); >>>>> e1000e_free_tx_resources(adapter->tx_ring); >>>>> >>>> I would like not recommend insert this change. This change related >>>> driver state machine, we afraid from lot of synchronization problem and >>>> issues. >>>> We need keep e1000_free_irq in loop and check for 'test_bit' ready. >>> What do you mean here? There is no loop. If __E1000_DOWN is set then we >>> will never free the IRQ. >>> >>>> Another point, does before execute secondary bus reset your SW back up >>>> pcie configuration space as properly? >>> After a secondary bus reset, the link needs to recover and go back to a >>> working state after 1 second. >>> >>> From the callstack, the issue is happening while removing the endpoint >>> from the system, before applying the secondary bus reset. >>> >>> The order of events is >>> 1. remove the drivers >>> 2. cause a secondary bus reset >>> 3. wait 1 second >> Actually, this is too much, usually link up in less than 100ms.You can >> check Data Link Layer indication. >>> 4. recover the link >>> >>> callstack: >>> free_msi_irqs+0x6c/0x1a8 >>> pci_disable_msi+0xb0/0x148 >>> e1000e_reset_interrupt_capability+0x60/0x78 >>> e1000_remove+0xc8/0x180 >>> pci_device_remove+0x48/0x118 >>> __device_release_driver+0x80/0x108 >>> device_release_driver+0x2c/0x40 >>> pci_stop_bus_device+0xa0/0xb0 >>> pci_stop_bus_device+0x3c/0xb0 >>> pci_stop_root_bus+0x54/0x80 >>> acpi_pci_root_remove+0x28/0x64 >>> acpi_bus_trim+0x6c/0xa4 >>> acpi_device_hotplug+0x19c/0x3f4 >>> acpi_hotplug_work_fn+0x28/0x3c >>> process_one_work+0x150/0x460 >>> worker_thread+0x50/0x4b8 >>> kthread+0xd4/0xe8 >>> ret_from_fork+0x10/0x50 >>> >>> Thanks, >>> Tyler >>> >> Hello Tyler, >> Okay, we need consult more about this suggestion. >> May I ask what is setup you run? Is there NIC or on board LAN? I would >> like try reproduce this issue in our lab's too. >> Also, is same issue observed with same scenario and others NIC's too? >> Sasha >> _______________________________________________ >> Intel-wired-lan mailing list >> Intel-wired-lan@lists.osuosl.org >> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan >> > Hello Tyler, > I see some in consistent implementation of __*_close methods in our > drivers. Do you have any igb NIC to check if same problem persist there? > Thanks, > Sasha Hello Sasha, I couldn't find an igb NIC to test with, but I did find another e1000e card that does not cause the same issue. That card is: 0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection Subsystem: Intel Corporation Gigabit CT Desktop Adapter Physical Slot: 5-1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-