From: Kyle Sanderson
Date: Sat, 19 Feb 2022 15:00:51 -0800
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs
To: Dave Chinner
Cc: qat-linux@intel.com, giovanni.cabiddu@intel.com, Linux-Kernal, linux-xfs@vger.kernel.org, linux-crypto@vger.kernel.org, dm-devel@redhat.com, Linus Torvalds, Greg KH, salvatore.benedetto@intel.com, herbert@gondor.apana.org.au, pablo.marcos.oltra@intel.com
In-Reply-To: <20220219210354.GF59715@dread.disaster.area>
X-Mailing-List: linux-crypto@vger.kernel.org

Hi Dave,

> This really sounds like broken hardware, not a kernel problem.

It is indeed an issue tied to this hardware, specifically the in-tree Intel QAT crypto driver; the hardware itself is fine (see below).
The IQAT errata documentation states that if a request is not submitted properly, it can stall the entire device. The remediation guidance from 2020 was "don't do that" and "don't allow unprivileged users access to the device". The in-tree driver is not implemented properly for either this SoC or this board; I suspect it's related to QATE-7495.

https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf

> This implies a dmcrypt level problem - XFS can't make progress if dmcrypt is not completing IOs.

That's the weird part about it. Some bios are completing, others are dropped entirely, and some stall forever. I had to use xfs_repair to get the volumes operational again. I lost a good deal of files and had to recover from backup after re-enabling the device on a production system (silly, I know).

> Where are the XFS corruption reports that the subject implies are occurring?

I think you're right: it's dm-crypt that's broken here, with the crypto driver ultimately causing this corruption. XFS, being the layer the end user interacts with, is taking the brunt of it. There are reports going back to late 2017 of significant issues with this mainlined stable driver.

https://bugzilla.redhat.com/show_bug.cgi?id=1522962
https://serverfault.com/questions/1010108/luks-hangs-on-centos-running-on-atom-c3758-cpu
https://www.phoronix.com/forums/forum/software/distributions/1172231-fedora-33-s-enterprise-linux-next-effort-approved-testbed-for-raising-cpu-requirements-etc?p=1174560#post1174560

Any guidance would be appreciated.

Kyle.

On Sat, Feb 19, 2022 at 1:03 PM Dave Chinner wrote:
>
> On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> > A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> > attempted to be used by xfs (through dm-crypt) the entire kernel
> > thread stalls forever.
> > Multiple users have hit this over the years
> > (through sporadic reporting) - I ended up trying ZFS and encryption
> > wasn't an issue there at all because I guess they don't use this
> > device. Returning to sanity (xfs), I was able to provision a dm-crypt
> > volume no problem on the disk, however running mkfs.xfs on the
> > volume is what triggers the cascading failure (each request kills a
> > kthread).
>
> Can you provide the full stack traces for these errors so we can see
> exactly what this cascading failure looks like, please? In reality,
> the stall messages some time after this are not interesting - it's
> the first errors that cause the stall that need to be investigated.
>
> A good idea would be to provide the full storage stack description
> and hardware in use, as per:
>
> https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> > Disabling IQAT on the south bridge results in a working
> > system, however this is not the default configuration for the
> > distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> > convinced this never worked properly based on the lack of popularity
> > for kernel encryption (crypto), and the embedded nature that
> > SuperMicro has integrated this device in collaboration with intel as
> > it looks like the primary usage is through external accelerator cards.
>
> This really sounds like broken hardware, not a kernel problem.
>
> > Kernels tried were from RHEL8 over a year ago, and this impacts the
> > entirety of the 5.4 series on Ubuntu.
> > Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
>
> [snip stalled kcryptd worker threads]
>
> This implies a dmcrypt level problem - XFS can't make progress if
> dmcrypt is not completing IOs.
>
> Where are the XFS corruption reports that the subject implies are
> occurring?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com