Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753519AbdLMO2Z (ORCPT ); Wed, 13 Dec 2017 09:28:25 -0500
Received: from mail-ot0-f176.google.com ([74.125.82.176]:45330 "EHLO
        mail-ot0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752840AbdLMO2R (ORCPT );
        Wed, 13 Dec 2017 09:28:17 -0500
X-Google-Smtp-Source: ACJfBosDaVxC0r91/thQLFs/iyO1LH4pLA/Zv1WO4PAVApXOdfEFwOkfiDWXK+0ydaM9VqVww3eEEClIOx/r11m/0Qo=
MIME-Version: 1.0
In-Reply-To:
References: <9f95c2a0-e4fe-270d-790a-beeb6b3e7690@oracle.com>
From: Or Gerlitz
Date: Wed, 13 Dec 2017 16:28:15 +0200
Message-ID:
Subject: Re: Setting large MTU size on slave interfaces may stall the whole system
To: Qing Huang
Cc: Linux Netdev List, Linux Kernel, Jay Vosburgh, Veaceslav Falico,
        Andy Gospodarek, Aviv Heller, Moni Shoua
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 1285
Lines: 35

On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang wrote:
> Hi,
>
> We found an issue with the bonding driver when testing Mellanox devices.
> The following test commands will stall the whole system sometimes, with
> serial console flooded with log messages from the bond_miimon_inspect()
> function. Setting mtu size to be 1500 seems okay but very rarely it may
> hit the same problem too.
>
> ip address flush dev ens3f0
> ip link set dev ens3f0 down
> ip address flush dev ens3f1
> ip link set dev ens3f1 down
> [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500
> [root@ca-hcl629 etc]# ifconfig bond0 up
> [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
> [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up
>
> Serial console output:
>
> ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms

[..]
> It seems that when setting a large mtu size on an RoCE interface, the RTNL
> mutex may be held too long by the slave interface, causing
> bond_mii_monitor() to be called repeatedly at an interval of 1 tick
> (1K HZ kernel configuration) and kernel to become unresponsive.

Have you tried, and managed, to reproduce this with other NIC drivers as well?

Or.
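
For reference, the retry behaviour described in the report can be modelled roughly as below. This is only a toy sketch, not the driver code: it assumes the monitor tries to take the lock without blocking and, on failure, re-arms itself after one tick instead of after the miimon interval. The function name simulate_monitor and its parameters are made up for illustration, and the numbers assume HZ=1000 so that miimon=250 ms is 250 ticks.

```python
def simulate_monitor(lock_held_ticks, miimon_ticks, total_ticks):
    """Count monitor invocations over a window of total_ticks.

    Toy model (assumption, not kernel code): another task holds the
    lock during ticks [0, lock_held_ticks). Each time the monitor runs
    while the lock is held, its trylock fails and it re-arms itself
    one tick later. Otherwise it re-arms after miimon_ticks.
    Returns (total_runs, failed_trylock_runs).
    """
    runs = 0
    retries = 0
    now = 0
    while now < total_ticks:
        runs += 1
        if now < lock_held_ticks:
            # Lock is busy: retry on the very next tick.
            retries += 1
            now += 1
        else:
            # Lock acquired: wait a full monitoring interval.
            now += miimon_ticks
    return runs, retries


if __name__ == "__main__":
    # Lock held for 2 s out of a 3 s window, HZ=1000, miimon=250 ms.
    print(simulate_monitor(2000, 250, 3000))
    # Same window with no contention at all.
    print(simulate_monitor(0, 250, 3000))
```

In this model an uncontended 3 s window gives 12 monitor runs, while a lock held for 2 s gives 2004 runs, 2000 of them failed retries. That is the kind of amplification that would flood the console if each run also logs a link-state message.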