References: <2220d403-e443-4e60-b7c3-d149e402c13e@www.fastmail.com>
From: Josef Bacik
Date: Fri, 12 Aug 2022 13:59:19 -0400
Subject: Re: stalling IO regression since linux 5.12, through 5.18
To: Chris Murphy
Cc: Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy wrote:
>
> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
> > Booted with cgroup_disable=io, and confirmed cat
> > /sys/fs/cgroup/cgroup.controllers does not list io.
>
> The problem still reproduces with the cgroup IO controller disabled.
>
> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem did not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live, while running the workload, to bfq on all eight drives, and within 10 minutes the system cratered: all new commands just hang. Load average goes to triple digits, i/o wait keeps increasing, i/o pressure for the workload tasks goes to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline before I lost responsiveness in that shell and had to issue sysrq+b...
>
> Before that I was able to extract sysrq+w and sysrq+t.
> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>
> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare, because otherwise people would have been falling over this much sooner. But at this point there's a strong correlation suggesting that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0, but there it's being partly masked by other improvements.

This matches observations we've had internally (inside Facebook) as well as my continuous integration performance testing. It should probably be looked into by the BFQ guys, as it was working previously.

Thanks,

Josef
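
For anyone trying to reproduce this, it may help to confirm which scheduler each drive queue is actually using before and after a live switch. Below is a minimal sketch, not something taken from this thread: it assumes the usual sysfs layout, /sys/block/<dev>/queue/scheduler, where the active scheduler is shown in square brackets (for example "mq-deadline kyber [bfq] none"). The function names are hypothetical. It only reads the files; switching is done separately by writing a scheduler name to the same file as root.

```python
from pathlib import Path
from typing import Dict, Optional


def active_scheduler(line: str) -> Optional[str]:
    """Return the bracketed (active) scheduler from a sysfs scheduler line.

    Assumes the "a b [c] d" format; returns None if nothing is bracketed.
    """
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def schedulers_by_device(sys_block: str = "/sys/block") -> Dict[str, Optional[str]]:
    """Map each block device name to its active scheduler (hypothetical helper)."""
    result: Dict[str, Optional[str]] = {}
    for sched_file in sorted(Path(sys_block).glob("*/queue/scheduler")):
        # Path is <sys_block>/<device>/queue/scheduler, so go up two levels.
        device = sched_file.parent.parent.name
        try:
            result[device] = active_scheduler(sched_file.read_text())
        except OSError:
            # Device vanished or file unreadable; record it as unknown.
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(f"{device}: {scheduler}")
```

Running it as a script prints one "device: scheduler" line per block device, which makes it easy to spot a queue that did not get switched.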