Received: by 2002:a05:6358:7058:b0:131:369:b2a3 with SMTP id 24csp805240rwp; Thu, 13 Jul 2023 00:59:18 -0700 (PDT) X-Google-Smtp-Source: APBJJlEidtysoclbEllch+CpnNoqhwz3hx7VEM0wv82lu4oOubTUuXCYqzOmLQHERcPMS24Cx7AA X-Received: by 2002:a05:6a00:17a1:b0:682:4ef7:9b17 with SMTP id s33-20020a056a0017a100b006824ef79b17mr609212pfg.32.1689235158123; Thu, 13 Jul 2023 00:59:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1689235158; cv=none; d=google.com; s=arc-20160816; b=mjAcNuU08soA5KZezpHPgFfCTjxtcQbCbp/GJEBGj9dbva9HIgcqKowE12xOlwZjnX nkIhBo6Ax3aLCHzXhGXzIkOe//6MEijGWPIu4PmEzzxDAXuRtH27nnPB8+PjTDDMfOFH zh3j+0N6YRTuGjHqz4/lJZQNxYy/aEUf61ByIAqA7uHiwjTkSx7oPUM7hOtaLOahwZr6 Z/JaTGqT/QSrnZTRjz3UcRaQAoZRBssfA6pXtlL0KxJAEUSyZMKAqmWjGY5wWig+o2hA emsYs9kraC0ijWJEEqe3ea/Z+EcyLEaIiziqQmP1ist0y6oaOTjgMG0AzGG2ryxWdJH4 2RFg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:in-reply-to:message-id:date:cc:to :from:user-agent:references:subject:dkim-signature; bh=vot4qMoMLFWBXgyy/o6LSt72N9CRa0tUFzm54MB4zW4=; fh=av3eeW944XrIDarr6lKWFyj7UmGki5RrLUHMHmVBInw=; b=r7/q1s1s8CiOWwhIESQpQOW8bRIseFREQIfOQF+y8PnFRf1czyvcNNu0fIDKOUFN87 AtWDrVSRtgQgFfBGYWVSK++vTE5UJX90IhoyKJ8vSAfORLUDnFVITDh5an/4oagpUJ4s 6RyP+ro1O3wkTs8y63UD494HOiaXioQfcxNsmMeh5LztGIkFZjzJtfzrmWsteT7z9IpU B8xcoDv9xXhVMIC1+3HLNOyRQE5BI1TV3Mfs7CqNzgdJpLOqu3nxuepGW4RC9SMrVHLl 3WbEcDtEPx37SldRFhd8oYKouZ8WIKsw4WK2bEMfTNk3+oTWm1hCkeixFMu+WIJFjCR6 e5BA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b="r5Jp7N/K"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id e24-20020a656798000000b005533cf1fdbfsi4816097pgr.629.2023.07.13.00.59.05; Thu, 13 Jul 2023 00:59:18 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b="r5Jp7N/K"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229455AbjGMHus (ORCPT + 99 others); Thu, 13 Jul 2023 03:50:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33450 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232343AbjGMHur (ORCPT ); Thu, 13 Jul 2023 03:50:47 -0400 Received: from smtp-fw-2101.amazon.com (smtp-fw-2101.amazon.com [72.21.196.25]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 321F3B7; Thu, 13 Jul 2023 00:50:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1689234647; x=1720770647; h=references:from:to:cc:date:message-id:in-reply-to: mime-version:subject; bh=vot4qMoMLFWBXgyy/o6LSt72N9CRa0tUFzm54MB4zW4=; b=r5Jp7N/KjHTHce7RFrFa92deyMkUoKZzn0i9ORmah2Qc0CV8SqjBtCn3 syoihsikxJ9p3T74ABRIIboX20gEvd1RmnSVHROwATdb4/avteDElGZFa vE8UcMhLdrlAeW+nw1imGvfHt1xonulXW8HPAtLwOS0x3VQZneH8B1o4R k=; X-IronPort-AV: E=Sophos;i="6.01,202,1684800000"; d="scan'208";a="339471606" Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-pdx-2a-m6i4x-3ef535ca.us-west-2.amazon.com) ([10.43.8.6]) by smtp-border-fw-2101.iad2.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Jul 2023 07:50:43 +0000 Received: from EX19D010EUA002.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-pdx-2a-m6i4x-3ef535ca.us-west-2.amazon.com (Postfix) with ESMTPS id A414C609A2; Thu, 13 Jul 2023 07:50:41 +0000 (UTC) Received: from EX19D028EUB003.ant.amazon.com (10.252.61.31) by EX19D010EUA002.ant.amazon.com (10.252.50.108) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.30; Thu, 13 Jul 2023 07:50:38 +0000 Received: from u95c7fd9b18a35b.ant.amazon.com.amazon.com (10.1.212.11) by EX19D028EUB003.ant.amazon.com (10.252.61.31) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.30; Thu, 13 Jul 2023 07:50:34 +0000 References: <20230711013621.GE1926@templeofstupid.com> <20230711225210.GA2088@templeofstupid.com> User-agent: mu4e 1.10.3; emacs 28.2 From: Shay Agroskin To: Krister Johansen CC: , , "Arthur Kiyanovski" , David Arinzon , "Noam Dagan" , Saeed Bishara , "David S. Miller" , Eric Dumazet , "Jakub Kicinski" , Paolo Abeni Date: Thu, 13 Jul 2023 10:46:55 +0300 Message-ID: In-Reply-To: <20230711225210.GA2088@templeofstupid.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed X-Originating-IP: [10.1.212.11] X-ClientProxiedBy: EX19D044UWB003.ant.amazon.com (10.13.139.168) To EX19D028EUB003.ant.amazon.com (10.252.61.31) X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Krister Johansen writes: > > On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote: >> >> Krister Johansen writes: >> >> > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c >> > b/drivers/net/ethernet/amazon/ena/ena_com.c >> > index 451c3a1b6255..633b321d7fdd 100644 >> > --- a/drivers/net/ethernet/amazon/ena/ena_com.c >> > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c >> > @@ -35,6 +35,8 @@ >> > #define ENA_REGS_ADMIN_INTR_MASK 1 >> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U >> > + >> > #define ENA_MIN_ADMIN_POLL_US 100 >> > #define ENA_MAX_ADMIN_POLL_US 5000 >> > @@ -536,6 +538,7 @@ static int >> > ena_com_comp_status_to_errno(struct >> > ena_com_admin_queue *admin_queue, >> > static void ena_delay_exponential_backoff_us(u32 exp, u32 >> > delay_us) >> > { >> > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP); >> > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us); >> > delay_us = min_t(u32, delay_us * (1U << exp), >> > ENA_MAX_ADMIN_POLL_US); >> > usleep_range(delay_us, 2 * delay_us); >> >> Hi, thanks for submitting this patch (: > > Absolutely; thanks for the review! > >> Going over the logic here, the driver sleeps for `delay_us` >> micro-seconds in >> each iteration that this function gets called. >> >> For an exp = 14 it'd sleep (I added units notation) >> delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / >> (1000000) s >> = 1.6 s >> >> For an exp = 15 it'd sleep >> (10 * (2 ^ 15)) / (1000000) = 3.2s >> >> To even get close to an overflow value, say exp=29 the driver >> would sleep in >> a single iteration >> 53687 s = 14.9 hours. >> >> The driver should stop trying to get a response from the device >> after a >> timeout period received from the device which is 3 seconds by >> default. >> >> The point being, it seems very unlikely to hit this >> overflow. Did you >> experience it or was the issue discovered by a static analyzer >> ? > > No, no use of fuzzing or static analysis. This was hit on a > production > instance that was having ENA trouble. > > I'm apparently reading the code differently. I thought this > line: > >> > delay_us = min_t(u32, delay_us * (1U << exp), >> > ENA_MAX_ADMIN_POLL_US); > > Was going to cap that delay_us at (delay_us * (1U << exp)) or > 5000us, whichever is smaller. By that measure, if delay_us is > 100 and > ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped > after > exp = 6, correct? By my estimate, that puts it at between 160ms > and > 320ms of sleeping before one could hit this problem. > > I went and pulled the logs out of the archive and have the > following > timeline. This is seconds from boot as reported by dmesg: > > 11244.226583 - ena warns TX not completed on time, 10112000 > usecs since > last napi execution, missing tx timeout val of 5000 msec > > 11245.190453 - netdev watchdog fires > > 11245.190781 - ena records Transmit timeout > 11245.250739 - ena records Trigger reset on > > 11246.812620 - UBSAN message to console > > 11248.590441 - ena reports Reset inidication didn't turn off > 11250.633545 - ena reports failure to reset device > 12013.529338 - last logline before new boot > > While the difference between the panic and the trigger reset is > more > than 320ms, it is definitely on the order of seconds instead of > hours. > Yup you're right. I was so entangled in my exponent calculations that I completely missed the min_t expression there. That's quite an awkward design to be honest, I hope to submit a re-write for it in one of the releases. Thanks again for the work you've put into it