Date: Tue, 8 Dec 2020 11:38:20 -0800
From: Jakub Kicinski
To: Jarod Wilson
Cc: linux-kernel@vger.kernel.org, Mahesh Bandewar, Jay Vosburgh,
    Veaceslav Falico, Andy Gospodarek, "David S. Miller", Thomas Davis,
    netdev@vger.kernel.org
Subject: Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread
Message-ID: <20201208113820.179ed5ca@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
In-Reply-To: <20201205234354.1710-1-jarod@redhat.com>
References: <20201205234354.1710-1-jarod@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 5 Dec 2020 18:43:54 -0500 Jarod Wilson wrote:
> I'm seeing a system get stuck unable to bring a downed interface back up
> when it's got an updelay value set, behavior which ceased when logging
> spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
> system over another network connection, and it seems that the act of
> spewing logs at all there increases rtnl lock contention, because
> instrumented code showed bond_mii_monitor() never able to succeed in it's
> attempts to call rtnl_trylock() to actually commit link state changes,
> leaving the downed link stuck in BOND_LINK_DOWN. The system in question
> appears to be fine with the log spew being moved to
> bond_commit_link_state(), which is called after the successful
> rtnl_trylock().

But it's not called under rtnl_lock AFAICT. So something else is also
spewing messages?

While bond_commit_link_state() _is_ called under the lock. So you're
increasing the retry rate, by putting the slow operation under the
lock, is that right?

Also isn't bond_commit_link_state() called from many more places?
So we're adding new prints, effectively?

> I'm actually wondering if perhaps we ultimately need/want
> some bond-specific lock here to prevent racing with bond_close() instead
> of using rtnl, but this shift of the output appears to work. I believe
> this started happening when de77ecd4ef02 ("bonding: improve link-status
> update in mii-monitoring") went in, but I'm not 100% on that.
>
> The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
> separate from the fix for the actual hang, but it eliminates a constant
> "invalid new link 3 on slave" message seen related to this issue, and it's
> not actually an invalid state here, so we shouldn't be reporting it as an
> error.

Let's make it a separate patch, then.
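
For anyone following the locking discussion above, here is a rough userspace sketch of the inspect / trylock / commit pattern being talked about. It is a toy model and not the bonding driver code: every toy_* name is hypothetical, the "lock" is a plain flag with no real concurrency, and the only point is to show that a failed trylock leaves the proposed link state uncommitted until a later pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for rtnl: a single-threaded flag, NOT a real lock. */
struct toy_lock {
	bool held;
};

/* Mimics trylock semantics: fail immediately instead of waiting. */
static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

enum toy_link { TOY_LINK_DOWN, TOY_LINK_UP };

/* Hypothetical, heavily simplified bond state. */
struct toy_bond {
	struct toy_lock rtnl;
	enum toy_link committed;	/* state visible to the rest of the system */
	enum toy_link proposed;		/* state the inspect step wants */
	unsigned int retries;		/* passes that lost the trylock */
};

/* Inspect step: runs without the lock and only records a proposal. */
static bool toy_inspect(struct toy_bond *b, bool carrier_ok)
{
	b->proposed = carrier_ok ? TOY_LINK_UP : TOY_LINK_DOWN;
	return b->proposed != b->committed;
}

/* Commit step: must only run with the lock held. */
static void toy_commit(struct toy_bond *b)
{
	assert(b->rtnl.held);
	b->committed = b->proposed;
}

/*
 * One monitor pass. Returns true when nothing is left pending,
 * false when the trylock failed and the caller has to retry later.
 */
static bool toy_monitor_pass(struct toy_bond *b, bool carrier_ok)
{
	if (!toy_inspect(b, carrier_ok))
		return true;

	if (!toy_trylock(&b->rtnl)) {
		b->retries++;
		return false;
	}

	toy_commit(b);
	toy_unlock(&b->rtnl);
	return true;
}
```

In this model, calling toy_monitor_pass() while something else holds the lock bumps retries and leaves committed unchanged, and the same call after the lock is released commits the change. Any extra work added to the locked section makes the lock busy for longer, which is the concern about putting the slow operation under the lock.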