Subject: Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.
To: Michael Ellerman, Tyrel Datwyler, Sachin Sant, linuxppc-dev@ozlabs.org
References: <8760ig983f.fsf@concordia.ellerman.id.au> <89aec36c-e352-e055-5e80-1235449762ce@linux.vnet.ibm.com> <871sszwc87.fsf@concordia.ellerman.id.au>
Cc: Nathan Fontenot, LKML
From: Tyrel Datwyler
Date: Tue, 11 Apr 2017 10:14:49 -0700
In-Reply-To: <871sszwc87.fsf@concordia.ellerman.id.au>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/11/2017 02:00 AM, Michael Ellerman wrote:
> Tyrel Datwyler writes:
>
>> On 04/06/2017 09:04 PM, Michael Ellerman wrote:
>>> Tyrel Datwyler writes:
>>>
>>>> On 04/06/2017 03:27 AM, Sachin Sant wrote:
>>>>> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
>>>>> any I/O adapter results in the following warning
>>>>>
>>>>> This problem has been in the code for some time now. I had first seen this in
>>>>> -next tree.
>>>>>
>>
>>
>>
>>>>> Have attached the dmesg log from the system. Let me know if any additional
>>>>> information is required to help debug this problem.
>>>>
>>>> I remember you mentioning this when the issue was brought up for CPUs. I
>>>> assume the case is the same here where the issue is only seen with
>>>> adapters that were hot-added after boot (ie. hot-remove of adapter
>>>> present at boot doesn't trip the warning)?
>>>
>>> So who's fixing this?
>>
>> I started looking at it when Bharata submitted a patch trying to fix the
>> issue for CPUs, but got side tracked by other things. I suspect that
>> this underflow has actually been an issue for quite some time, and we
>> are just now becoming aware of it thanks to the refcount_t patchset being
>> merged.
>
> Yes I agree. Which means it might be broken in existing distros.

Definitely. I did some profiling last night, and I now understand the
hotplug case. It turns out to be as I suggested in the original thread
about CPUs. When the devicetree code was reworked to move the tree out
of proc and into sysfs, the sysfs detach code added an of_node_put to
drop the original of_init reference. pSeries being the sole original
*dynamic* device tree user, we had always issued an of_node_put in our
dlpar-specific detach function to achieve that end. With both puts in
place, the initial reference on a hot-added node gets dropped twice,
which is the underflow the warning reports. So, this should be a pretty
straightforward, trivial fix.

However, for the case where devices are present at boot it appears we
are leaking a lot of references, resulting in the device nodes never
actually being released/freed after a dlpar remove. In the CPU case
after boot I count 8 more references taken than in the hotplug case,
and the corresponding of_node_put's are not called at dlpar remove time
either. It will take some time to track them down, review and clean up.

-Tyrel

>
>> I'll look into it again this week.
>
> Thanks.
>
> cheers
>
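
For anyone following along, the two cases described above can be
illustrated with a small userspace model. This is only a sketch, not
kernel code: struct toy_node and every toy_* function are hypothetical
stand-ins, the counter is a plain int rather than refcount_t, and the
assumption that a boot-time node sees the same two puts on remove is
mine, made only so the numbers can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device node; NOT the kernel's struct device_node. */
struct toy_node {
    int refcount;    /* models the node's reference count */
    int underflowed; /* set when a put is attempted with no references left */
};

/* Models node initialization: the node starts with one initial reference. */
static void toy_node_init(struct toy_node *n)
{
    assert(n != NULL);
    n->refcount = 1;
    n->underflowed = 0;
}

static void toy_node_get(struct toy_node *n)
{
    assert(n != NULL);
    n->refcount++;
}

/* Dropping a reference that does not exist is flagged instead of going negative. */
static void toy_node_put(struct toy_node *n)
{
    assert(n != NULL);
    if (n->refcount == 0) {
        n->underflowed = 1;
        return;
    }
    n->refcount--;
}

/*
 * Hot-added node: one initial reference. The generic detach path drops it;
 * if the platform detach path also drops it, the second put underflows.
 * Returns 1 on underflow, 0 otherwise.
 */
static int toy_hot_remove(int platform_also_puts)
{
    struct toy_node n;

    toy_node_init(&n);
    toy_node_put(&n);     /* generic detach drops the initial reference */
    if (platform_also_puts)
        toy_node_put(&n); /* legacy platform put: one too many */
    return n.underflowed;
}

/*
 * Boot-time node: extra references were taken and never dropped. Assuming
 * the same two puts happen on remove, returns the references left over;
 * anything above zero means the node is never freed.
 */
static int toy_boot_remove_refs_left(int extra_gets)
{
    struct toy_node n;
    int i;

    toy_node_init(&n);
    for (i = 0; i < extra_gets; i++)
        toy_node_get(&n);
    toy_node_put(&n);
    toy_node_put(&n);
    return n.refcount;
}
```

In this model toy_hot_remove(1) reports the underflow and
toy_hot_remove(0) does not, which is the shape of the trivial fix. With
8 extra references, toy_boot_remove_refs_left(8) leaves the count above
zero, which would be why the boot-time case leaks instead of warning.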