Hi Dave,

> This really sounds like broken hardware, not a kernel problem.

It is indeed hardware-related, but specifically the in-tree Intel QAT
crypto driver; the hardware itself is fine (see below). The IQAT
errata documentation states that if a request is not submitted
properly it can stall the entire device. The remediation guidance from
2020 was "don't do that" and "don't allow unprivileged users access to
the device". The in-tree driver is not implemented properly for either
this SoC or this board; I suspect it's related to QATE-7495.
https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf
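
A minimal sketch of how one could check whether the QAT driver has
registered ciphers with the kernel crypto API, which is the path by
which dm-crypt would end up using the device. It parses /proc/crypto;
the function names are hypothetical, and matching on "qat" in the
driver or module name is an assumption, not something taken from the
errata.

```python
#!/usr/bin/env python3
"""List /proc/crypto entries that appear to be backed by a QAT driver."""


def parse_proc_crypto(text):
    """Split /proc/crypto text into one dict per algorithm entry."""
    entries = []
    # Entries are separated by blank lines; each line is "key : value".
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def qat_entries(entries, needle="qat"):
    """Return entries whose driver or module name mentions QAT.

    Assumption: QAT-backed implementations carry "qat" in the driver
    name (as a prefix) or in the owning module name.
    """
    return [
        e for e in entries
        if e.get("driver", "").startswith(needle)
        or needle in e.get("module", "")
    ]


def main(path="/proc/crypto"):
    try:
        with open(path) as f:
            text = f.read()
    except OSError:
        print("cannot read", path)
        return
    found = qat_entries(parse_proc_crypto(text))
    for e in found:
        print(e.get("name"), e.get("driver"), e.get("module"), e.get("priority"))
    if not found:
        print("no QAT-backed algorithms registered")


if __name__ == "__main__":
    main()
```

If this lists an xts(aes) entry with a higher priority than the
software implementations, a dm-crypt volume using aes-xts would likely
be routed to the device.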

> This implies a dmcrypt level problem - XFS can't make progress is
> dmcrypt is not completing IOs.

That's the weird part. Some bios complete, others are dropped
entirely, and some stall forever. I had to use xfs_repair to get the
volumes operational again. I lost a good number of files and had to
recover from backup after re-enabling the device on a production
system (silly, I know).

> Where are the XFS corruption reports that the subject implies is
> occurring?

I think you're right: it's dm-crypt that's broken here, with the
crypto driver ultimately causing the corruption. XFS, being the layer
closest to the end user, is taking the brunt of it. There are reports
going back to late 2017 of significant issues with this mainlined
stable driver.
https://bugzilla.redhat.com/show_bug.cgi?id=1522962
https://serverfault.com/questions/1010108/luks-hangs-on-centos-running-on-atom-c3758-cpu
https://www.phoronix.com/forums/forum/software/distributions/1172231-fedora-33-s-enterprise-linux-next-effort-approved-testbed-for-raising-cpu-requirements-etc?p=1174560#post1174560

Any guidance would be appreciated.

Kyle.

On Sat, Feb 19, 2022 at 1:03 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> > A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> > attempted to be used by xfs (through dm-crypt) the entire kernel
> > thread stalls forever. Multiple users have hit this over the years
> > (through sporadic reporting) - I ended up trying ZFS and encryption
> > wasn't an issue there at all because I guess they don't use this
> > device. Returning to sanity (xfs), I was able to provision a dm-crypt
> > volume no problem on the disk, however when running mkfs.xfs on the
> > volume is what triggers the cascading failure (each request kills a
> > kthread).
>
> Can you provide the full stack traces for these errors so we can see
> exactly what this cascading failure looks like, please? In reality,
> the stall messages some time after this are not interesting - it's
> the first errors that cause the stall that need to be investigated.
>
> A good idea would be to provide the full storage stack decription
> and hardware in use, as per:
>
> https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> > Disabling IQAT on the south bridge results in a working
> > system, however this is not the default configuration for the
> > distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> > convinced this never worked properly based on the lack of popularity
> > for kernel encryption (crypto), and the embedded nature that
> > SuperMicro has integrated this device in collaboration with intel as
> > it looks like the primary usage is through external accelerator cards.
>
> This really sounds like broken hardware, not a kernel problem.
>
> > Kernels tried were from RHEL8 over a year ago, and this impacts the
> > entirety of the 5.4 series on Ubuntu.
> > Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
>
> [snip stalled kcryptd worker threads]
>
> This implies a dmcrypt level problem - XFS can't make progress is
> dmcrypt is not completing IOs.
>
> Where are the XFS corruption reports that the subject implies is
> occurring?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]