Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751964AbcCXS55 (ORCPT ); Thu, 24 Mar 2016 14:57:57 -0400 Received: from shards.monkeyblade.net ([149.20.54.216]:52453 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751462AbcCXS5s (ORCPT ); Thu, 24 Mar 2016 14:57:48 -0400 Date: Thu, 24 Mar 2016 14:57:45 -0400 (EDT) Message-Id: <20160324.145745.451518711282450835.davem@davemloft.net> To: bhelgaas@google.com Cc: nikolay@cumulusnetworks.com, netdev@vger.kernel.org, nhorman@redhat.com, aduyck@mirantis.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] netpoll: Fix extra refcount release in netpoll_cleanup() From: David Miller In-Reply-To: <20160324161334.31279.26510.stgit@bhelgaas-glaptop2.roam.corp.google.com> References: <20160324161334.31279.26510.stgit@bhelgaas-glaptop2.roam.corp.google.com> X-Mailer: Mew version 6.6 on Emacs 24.5 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Thu, 24 Mar 2016 11:57:47 -0700 (PDT) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 1257 Lines: 32 From: Bjorn Helgaas Date: Thu, 24 Mar 2016 11:13:34 -0500 > netpoll_setup() does a dev_hold() on np->dev, the netpoll device. If it > fails, it correctly does a dev_put() but leaves np->dev set. If we call > netpoll_cleanup() after the failure, np->dev is still set so we do another > dev_put(), which decrements the refcount an extra time. > > It's questionable to call netpoll_cleanup() after netpoll_setup() fails, > but it can be difficult to find the problem, and we can easily avoid it in > this case. The extra decrements can lead to hangs like this: > > unregister_netdevice: waiting for bond0 to become free. Usage count = -3 > > In __netpoll_setup(), don't set np->dev until we know we're going to > succeed. > > Signed-off-by: Bjorn Helgaas The reason this bug exists is because the thing doing the reference counting is separated from the thing that assigns the device pointer. That's how this was allowed to happen. If you instead do the np->dev = dev; assignment where the get is performed, and do a np->dev = NULL; where the error path puts the reference, everything is obvious and this error is unlikely to be reintroduced. So could you please implement your fix like that? Thanks.