2022-02-19 19:41:33

by Kyle Sanderson

Subject: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

The A2SDi-8C-HLN4F has IQAT enabled by default. When xfs (through
dm-crypt) attempts to use this device, the entire kernel thread
stalls forever. Multiple users have hit this over the years
(through sporadic reporting). I ended up trying ZFS, and encryption
wasn't an issue there at all, I guess because it doesn't use this
device. Returning to sanity (xfs), I was able to provision a dm-crypt
volume on the disk with no problem; however, running mkfs.xfs on the
volume triggers the cascading failure (each request kills a
kthread). Disabling IQAT on the south bridge results in a working
system, but this is not the default configuration for the
distribution of choice (Ubuntu 20.04.3 LTS), nor for the motherboard.
I'm convinced this never worked properly, based on the lack of
popularity of kernel encryption (crypto) and the embedded way in which
SuperMicro has integrated this device in collaboration with Intel, as
it looks like the primary usage is through external accelerator cards.

Kernels tried were from RHEL8 over a year ago, and this impacts the
entirety of the 5.4 series on Ubuntu.
Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.

[ 363.495058] INFO: task kworker/u16:0:8 blocked for more than 120 seconds.
[ 363.495114] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.495155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.495201] kworker/u16:0 D 0 8 2 0x80004000
[ 363.495213] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.495214] Call Trace:
[ 363.495223] __schedule+0x2e3/0x740
[ 363.495226] schedule+0x42/0xb0
[ 363.495228] schedule_timeout+0x10e/0x160
[ 363.495232] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.495233] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.495236] wait_for_completion+0xb1/0x120
[ 363.495239] ? wake_up_q+0x70/0x70
[ 363.495242] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.495245] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.495249] process_one_work+0x1eb/0x3b0
[ 363.495251] worker_thread+0x4d/0x400
[ 363.495254] kthread+0x104/0x140
[ 363.495256] ? process_one_work+0x3b0/0x3b0
[ 363.495257] ? kthread_park+0x90/0x90
[ 363.495260] ret_from_fork+0x1f/0x40
[ 363.495274] INFO: task kworker/u16:1:123 blocked for more than 120 seconds.
[ 363.495317] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.495364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.495410] kworker/u16:1 D 0 123 2 0x80004000
[ 363.495415] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.495416] Call Trace:
[ 363.495419] __schedule+0x2e3/0x740
[ 363.495422] schedule+0x42/0xb0
[ 363.495424] schedule_timeout+0x10e/0x160
[ 363.495426] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.495427] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.495430] wait_for_completion+0xb1/0x120
[ 363.495431] ? wake_up_q+0x70/0x70
[ 363.495434] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.495437] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.495441] process_one_work+0x1eb/0x3b0
[ 363.495443] worker_thread+0x4d/0x400
[ 363.495445] kthread+0x104/0x140
[ 363.495447] ? process_one_work+0x3b0/0x3b0
[ 363.495449] ? kthread_park+0x90/0x90
[ 363.495451] ret_from_fork+0x1f/0x40
[ 363.495457] INFO: task kworker/u16:2:153 blocked for more than 120 seconds.
[ 363.495499] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.495539] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.495584] kworker/u16:2 D 0 153 2 0x80004000
[ 363.495589] Workqueue: kcryptd/253:5 kcryptd_crypt [dm_crypt]
[ 363.495590] Call Trace:
[ 363.495593] __schedule+0x2e3/0x740
[ 363.495595] schedule+0x42/0xb0
[ 363.495597] schedule_timeout+0x10e/0x160
[ 363.495599] ? skcipher_decrypt_ablkcipher+0x61/0x70
[ 363.495601] ? crypto_skcipher_decrypt+0x48/0x60
[ 363.495603] wait_for_completion+0xb1/0x120
[ 363.495605] ? wake_up_q+0x70/0x70
[ 363.495608] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.495611] kcryptd_crypt+0xc6/0x3b0 [dm_crypt]
[ 363.495613] ? __switch_to+0x7f/0x480
[ 363.495615] ? switch_mm_irqs_off+0x19b/0x500
[ 363.495618] process_one_work+0x1eb/0x3b0
[ 363.495621] worker_thread+0x4d/0x400
[ 363.495623] kthread+0x104/0x140
[ 363.495625] ? process_one_work+0x3b0/0x3b0
[ 363.495627] ? kthread_park+0x90/0x90
[ 363.495629] ret_from_fork+0x1f/0x40
[ 363.495636] INFO: task kworker/u16:5:279 blocked for more than 120 seconds.
[ 363.495677] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.495717] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.495762] kworker/u16:5 D 0 279 2 0x80004000
[ 363.495766] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.495767] Call Trace:
[ 363.495771] __schedule+0x2e3/0x740
[ 363.495773] schedule+0x42/0xb0
[ 363.495775] schedule_timeout+0x10e/0x160
[ 363.495777] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.495778] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.495781] wait_for_completion+0xb1/0x120
[ 363.495782] ? wake_up_q+0x70/0x70
[ 363.495785] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.495788] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.495791] process_one_work+0x1eb/0x3b0
[ 363.495794] worker_thread+0x4d/0x400
[ 363.495796] kthread+0x104/0x140
[ 363.495798] ? process_one_work+0x3b0/0x3b0
[ 363.495800] ? kthread_park+0x90/0x90
[ 363.495802] ret_from_fork+0x1f/0x40
[ 363.495808] INFO: task kworker/u16:11:299 blocked for more than 120 seconds.
[ 363.495849] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.495890] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.495935] kworker/u16:11 D 0 299 2 0x80004000
[ 363.495939] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.495940] Call Trace:
[ 363.495943] __schedule+0x2e3/0x740
[ 363.495946] schedule+0x42/0xb0
[ 363.495947] schedule_timeout+0x10e/0x160
[ 363.495949] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.495951] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.495953] wait_for_completion+0xb1/0x120
[ 363.495955] ? wake_up_q+0x70/0x70
[ 363.495958] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.495961] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.495964] process_one_work+0x1eb/0x3b0
[ 363.495966] worker_thread+0x4d/0x400
[ 363.495969] kthread+0x104/0x140
[ 363.495971] ? process_one_work+0x3b0/0x3b0
[ 363.495972] ? kthread_park+0x90/0x90
[ 363.495974] ret_from_fork+0x1f/0x40
[ 363.495977] INFO: task kworker/u16:12:300 blocked for more than 120 seconds.
[ 363.496018] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.496058] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.496108] kworker/u16:12 D 0 300 2 0x80004000
[ 363.496113] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.496114] Call Trace:
[ 363.496117] __schedule+0x2e3/0x740
[ 363.496120] schedule+0x42/0xb0
[ 363.496121] schedule_timeout+0x10e/0x160
[ 363.496123] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.496125] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.496127] wait_for_completion+0xb1/0x120
[ 363.496129] ? wake_up_q+0x70/0x70
[ 363.496132] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.496134] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.496138] process_one_work+0x1eb/0x3b0
[ 363.496140] worker_thread+0x4d/0x400
[ 363.496142] kthread+0x104/0x140
[ 363.496144] ? process_one_work+0x3b0/0x3b0
[ 363.496146] ? kthread_park+0x90/0x90
[ 363.496148] ret_from_fork+0x1f/0x40
[ 363.496151] INFO: task kworker/u16:13:301 blocked for more than 120 seconds.
[ 363.496193] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.496233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.496278] kworker/u16:13 D 0 301 2 0x80004000
[ 363.496282] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.496283] Call Trace:
[ 363.496286] __schedule+0x2e3/0x740
[ 363.496289] schedule+0x42/0xb0
[ 363.496290] schedule_timeout+0x10e/0x160
[ 363.496292] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.496294] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.496296] wait_for_completion+0xb1/0x120
[ 363.496298] ? wake_up_q+0x70/0x70
[ 363.496301] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.496304] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.496307] process_one_work+0x1eb/0x3b0
[ 363.496310] worker_thread+0x4d/0x400
[ 363.496312] kthread+0x104/0x140
[ 363.496314] ? process_one_work+0x3b0/0x3b0
[ 363.496316] ? kthread_park+0x90/0x90
[ 363.496317] ret_from_fork+0x1f/0x40
[ 363.496320] INFO: task kworker/u16:14:302 blocked for more than 120 seconds.
[ 363.496362] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.496402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.496447] kworker/u16:14 D 0 302 2 0x80004000
[ 363.496451] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.496452] Call Trace:
[ 363.496455] __schedule+0x2e3/0x740
[ 363.496458] schedule+0x42/0xb0
[ 363.496459] schedule_timeout+0x10e/0x160
[ 363.496461] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.496463] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.496465] wait_for_completion+0xb1/0x120
[ 363.496467] ? wake_up_q+0x70/0x70
[ 363.496470] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.496473] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.496476] process_one_work+0x1eb/0x3b0
[ 363.496478] worker_thread+0x4d/0x400
[ 363.496481] kthread+0x104/0x140
[ 363.496483] ? process_one_work+0x3b0/0x3b0
[ 363.496484] ? kthread_park+0x90/0x90
[ 363.496486] ret_from_fork+0x1f/0x40
[ 363.496489] INFO: task kworker/u16:15:303 blocked for more than 120 seconds.
[ 363.496531] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.496571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.496616] kworker/u16:15 D 0 303 2 0x80004000
[ 363.496620] Workqueue: kcryptd/253:0 kcryptd_crypt [dm_crypt]
[ 363.496621] Call Trace:
[ 363.496624] __schedule+0x2e3/0x740
[ 363.496627] schedule+0x42/0xb0
[ 363.496629] schedule_timeout+0x10e/0x160
[ 363.496630] ? skcipher_encrypt_ablkcipher+0x61/0x70
[ 363.496632] ? crypto_skcipher_encrypt+0x48/0x60
[ 363.496634] wait_for_completion+0xb1/0x120
[ 363.496636] ? wake_up_q+0x70/0x70
[ 363.496639] crypt_convert+0x144/0x1f0 [dm_crypt]
[ 363.496642] kcryptd_crypt+0x2b9/0x3b0 [dm_crypt]
[ 363.496645] process_one_work+0x1eb/0x3b0
[ 363.496647] worker_thread+0x4d/0x400
[ 363.496650] kthread+0x104/0x140
[ 363.496652] ? process_one_work+0x3b0/0x3b0
[ 363.496654] ? kthread_park+0x90/0x90
[ 363.496655] ret_from_fork+0x1f/0x40
[ 363.496713] INFO: task mergerfs:9760 blocked for more than 120 seconds.
[ 363.496752] Tainted: P O 5.4.0-100-generic #113-Ubuntu
[ 363.496793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 363.496838] mergerfs D 0 9760 1 0x00000000
[ 363.496840] Call Trace:
[ 363.496843] __schedule+0x2e3/0x740
[ 363.496846] schedule+0x42/0xb0
[ 363.496848] schedule_timeout+0x10e/0x160
[ 363.496851] ? blk_finish_plug+0x26/0x40
[ 363.496853] wait_for_completion+0xb1/0x120
[ 363.496855] ? wake_up_q+0x70/0x70
[ 363.496910] ? __xfs_buf_submit+0x138/0x260 [xfs]
[ 363.496950] xfs_buf_iowait+0x26/0xe0 [xfs]
[ 363.496990] __xfs_buf_submit+0x138/0x260 [xfs]
[ 363.497030] _xfs_buf_read+0x27/0x30 [xfs]
[ 363.497070] xfs_buf_read_map+0x132/0x1d0 [xfs]
[ 363.497073] ? new_slab+0x4a/0x70
[ 363.497117] xfs_trans_read_buf_map+0xca/0x350 [xfs]
[ 363.497155] xfs_imap_to_bp+0x66/0xd0 [xfs]
[ 363.497193] xfs_iread+0x83/0x200 [xfs]
[ 363.497234] xfs_iget+0x214/0x9e0 [xfs]
[ 363.497270] ? xfs_da_compname+0x1d/0x30 [xfs]
[ 363.497306] ? xfs_dir2_sf_lookup+0xd0/0x200 [xfs]
[ 363.497348] xfs_lookup+0xe2/0x120 [xfs]
[ 363.497390] xfs_vn_lookup+0x72/0xb0 [xfs]
[ 363.497393] __lookup_slow+0x92/0x160
[ 363.497395] lookup_slow+0x3b/0x60
[ 363.497397] walk_component+0x1da/0x360
[ 363.497399] ? link_path_walk.part.0+0x2a2/0x550
[ 363.497401] path_lookupat.isra.0+0x80/0x230
[ 363.497404] filename_lookup+0xae/0x170
[ 363.497407] ? __check_object_size+0x13f/0x150
[ 363.497409] ? strncpy_from_user+0x4c/0x150
[ 363.497412] user_path_at_empty+0x3a/0x50
[ 363.497414] vfs_statx+0x7d/0xe0
[ 363.497417] __do_sys_newlstat+0x3e/0x80
[ 363.497419] ? vfs_read+0x12e/0x160
[ 363.497420] ? fput+0x13/0x20
[ 363.497422] ? ksys_read+0xce/0xe0
[ 363.497424] __x64_sys_newlstat+0x16/0x20
[ 363.497427] do_syscall_64+0x57/0x190
[ 363.497429] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 363.497432] RIP: 0033:0x7f7f32c656ea
[ 363.497438] Code: Bad RIP value.
[ 363.497439] RSP: 002b:00007f7f31fea0c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000006
[ 363.497441] RAX: ffffffffffffffda RBX: 0000560953796248 RCX: 00007f7f32c656ea
[ 363.497442] RDX: 00007f7f31fea110 RSI: 00007f7f31fea110 RDI: 00007f7f31fea100
[ 363.497443] RBP: 00007f7f31fea120 R08: 0000000000000001 R09: 000000000000000a
[ 363.497445] R10: 00007f7f14000b90 R11: 0000000000000246 R12: 00007f7f31fea220
[ 363.497446] R13: 00007f7f14000b90 R14: 00007f7f31fea100 R15: 00007f7f31fea110


2022-02-21 19:30:24

by Cabiddu, Giovanni

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

Hi Kyle,

The issue is that the implementations of aead and skcipher in the QAT
driver are not properly supporting requests with the
CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
If the HW queue is full, the driver returns -EBUSY [1] but does not
enqueue the request as dm-crypt expects [2]. Dm-crypt ends up waiting
indefinitely for the completion of a request that was never submitted,
hence the stall.
This is not related to QATE-7495 'An incorrectly formatted request to
QAT can hang the entire QAT endpoint' [3], which occurs when a malformed
request is sent to the device.
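
The backlog contract described above can be illustrated with a small
userspace model. This is only a sketch: the names (`device`, `request`,
`device_init`, `submit`, `drain`), the ring depth and the
`honours_backlog` switch are hypothetical and are not taken from the QAT
driver or from dm-crypt. It assumes the convention that a submitter which
allows backlogging treats -EBUSY as "queued, a completion will follow".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS    2  /* hypothetical hardware ring depth */
#define BACKLOG_SLOTS 8  /* hypothetical software backlog depth */

struct request {
    bool may_backlog;  /* stands in for CRYPTO_TFM_REQ_MAY_BACKLOG */
    bool completed;    /* set once the modelled hardware finishes it */
};

struct device {
    bool honours_backlog;  /* false models the behaviour described above */
    struct request *ring[RING_SLOTS];
    size_t ring_used;
    struct request *backlog[BACKLOG_SLOTS];
    size_t backlog_used;
};

void device_init(struct device *dev, bool honours_backlog)
{
    memset(dev, 0, sizeof(*dev));
    dev->honours_backlog = honours_backlog;
}

/* Returns -EINPROGRESS if the request entered the ring, -EBUSY if it was full. */
int submit(struct device *dev, struct request *req)
{
    if (dev->ring_used < RING_SLOTS) {
        dev->ring[dev->ring_used++] = req;
        return -EINPROGRESS;
    }
    if (req->may_backlog && dev->honours_backlog) {
        assert(dev->backlog_used < BACKLOG_SLOTS);
        dev->backlog[dev->backlog_used++] = req;
    }
    /* The caller sees -EBUSY either way; only the backlog state differs. */
    return -EBUSY;
}

/* Complete everything in the ring, refill it from the backlog, repeat. */
void drain(struct device *dev)
{
    while (dev->ring_used > 0) {
        for (size_t i = 0; i < dev->ring_used; i++)
            dev->ring[i]->completed = true;
        dev->ring_used = 0;

        size_t moved = 0;
        while (moved < dev->backlog_used && dev->ring_used < RING_SLOTS)
            dev->ring[dev->ring_used++] = dev->backlog[moved++];

        /* Shift whatever is still waiting to the front of the backlog. */
        size_t remaining = dev->backlog_used - moved;
        for (size_t i = 0; i < remaining; i++)
            dev->backlog[i] = dev->backlog[moved + i];
        dev->backlog_used = remaining;
    }
}
```

With `honours_backlog` set to false, a third request submitted to the
full two-slot ring gets -EBUSY and is never completed by `drain()`, so a
caller waiting on it would wait forever. With it set to true, the same
request is completed once the ring frees up.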

I'm working on a patch that resolves this problem. In the meantime, a
workaround is to blacklist the qat_c3xxx.ko driver.

Regarding avoiding this issue on stable kernels: the usage of QAT with
dm-crypt was already disabled in kernel 5.10 for a different issue
(the driver allocates memory in the datapath).
The following patches implement the change:
7bcb2c99f8ed crypto: algapi - use common mechanism for inheriting flags
2eb27c11937e crypto: algapi - add NEED_FALLBACK to INHERITED_FLAGS
fbb6cda44190 crypto: algapi - introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
b8aa7dc5c753 crypto: drivers - set the flag CRYPTO_ALG_ALLOCATES_MEMORY
cd74693870fb dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
An option would be to send the patches above to stable, another is to wait
for a patch that fixes the problems in the QAT driver and send that to
stable.
@Herbert, what is the preferred approach here?
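
As a rough illustration of how a flag such as
CRYPTO_ALG_ALLOCATES_MEMORY lets a caller opt out of certain drivers, the
sketch below models a lookup that takes a type and a mask and skips any
algorithm whose masked flag bits differ from the requested type. The
names `alg`, `flags_match` and `find_alg`, the driver strings and the bit
value are hypothetical; the real crypto API lookup also weighs priority
and other state that this model leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit standing in for CRYPTO_ALG_ALLOCATES_MEMORY. */
#define FLAG_ALLOCATES_MEMORY 0x1u

struct alg {
    const char *name;    /* algorithm name, e.g. "xts(aes)" */
    const char *driver;  /* implementation providing it */
    unsigned int flags;
};

/* An algorithm matches when every bit selected by mask equals the bit in type. */
int flags_match(unsigned int alg_flags, unsigned int type, unsigned int mask)
{
    return ((alg_flags ^ type) & mask) == 0;
}

/* Return the first registered algorithm with this name that passes the mask. */
const struct alg *find_alg(const struct alg *algs, size_t n, const char *name,
                           unsigned int type, unsigned int mask)
{
    assert(algs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(algs[i].name, name) != 0)
            continue;
        if (flags_match(algs[i].flags, type, mask))
            return &algs[i];
    }
    return NULL;
}
```

Calling `find_alg()` with type 0 and `FLAG_ALLOCATES_MEMORY` as the mask
asks for an implementation that does not carry the flag, so a driver that
sets it is skipped and a software implementation can be chosen instead.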

Thanks,

[1] https://elixir.bootlin.com/linux/latest/source/drivers/crypto/qat/qat_common/qat_algs.c#L1022
[2] https://elixir.bootlin.com/linux/latest/source/drivers/md/dm-crypt.c#L1584
[3] https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf - page 25

--
Giovanni


On Sat, Feb 19, 2022 at 03:00:51PM -0800, Kyle Sanderson wrote:
> hi Dave,
>
> > This really sounds like broken hardware, not a kernel problem.
>
> It is indeed a hardware issue, specifically the intel qat crypto
> driver that's in-tree - the hardware is fine (see below). The IQAT
> eratta documentation states that if a request is not submitted
> properly it can stall the entire device. The remediation guidance from
> 2020 was "don't do that" and "don't allow unprivileged users access to
> the device". The in-tree driver is not implemented properly either for
> this SoC or board - I'm thinking it's related to QATE-7495.
>
> https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf
>
> > This implies a dmcrypt level problem - XFS can't make progress is dmcrypt is not completing IOs.
>
> That's the weird part about it. Some bio's are completing, others are
> completely dropped, with some stalling forever. I had to use
> xfs_repair to get the volumes operational again. I lost a good deal of
> files and had to recover from backup after toggling the device back on
> on a production system (silly, I know).
>
> > Where are the XFS corruption reports that the subject implies is occurring?
>
> I think you're right, it's dm-crypt that's broken here, with
> ultimately the crypto driver causing this corruption. XFS being the
> edge to the end-user is taking the brunt of it. There's reports going
> back to late 2017 of significant issues with this mainlined stable
> driver.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1522962
> https://serverfault.com/questions/1010108/luks-hangs-on-centos-running-on-atom-c3758-cpu
> https://www.phoronix.com/forums/forum/software/distributions/1172231-fedora-33-s-enterprise-linux-next-effort-approved-testbed-for-raising-cpu-requirements-etc?p=1174560#post1174560
>
> Any guidance would be appreciated.
> Kyle.
> On Sat, Feb 19, 2022 at 1:03 PM Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> > > A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> > > attempted to be used by xfs (through dm-crypt) the entire kernel
> > > thread stalls forever. Multiple users have hit this over the years
> > > (through sporadic reporting) - I ended up trying ZFS and encryption
> > > wasn't an issue there at all because I guess they don't use this
> > > device. Returning to sanity (xfs), I was able to provision a dm-crypt
> > > volume no problem on the disk, however when running mkfs.xfs on the
> > > volume is what triggers the cascading failure (each request kills a
> > > kthread).
> >
> > Can you provide the full stack traces for these errors so we can see
> > exactly what this cascading failure looks like, please? In reality,
> > the stall messages some time after this are not interesting - it's
> > the first errors that cause the stall that need to be investigated.
> >
> > A good idea would be to provide the full storage stack decription
> > and hardware in use, as per:
> >
> > https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > > Disabling IQAT on the south bridge results in a working
> > > system, however this is not the default configuration for the
> > > distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> > > convinced this never worked properly based on the lack of popularity
> > > for kernel encryption (crypto), and the embedded nature that
> > > SuperMicro has integrated this device in collaboration with intel as
> > > it looks like the primary usage is through external accelerator cards.
> >
> > This really sounds like broken hardware, not a kernel problem.
> >
> > > Kernels tried were from RHEL8 over a year ago, and this impacts the
> > > entirety of the 5.4 series on Ubuntu.
> > > Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
> >
> > [snip stalled kcryptd worker threads]
> >
> > This implies a dmcrypt level problem - XFS can't make progress is
> > dmcrypt is not completing IOs.
> >
> > Where are the XFS corruption reports that the subject implies is
> > occurring?
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]

2022-02-28 10:46:59

by Kyle Sanderson

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

> The issue is that the implementations of aead and skcipher in the QAT driver are not properly supporting requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.

Thanks Giovanni. Joel (from Intel) reached out to me out of band to
try and sell me further on QAT but wasn't able to follow-up on any
questions (like - how is the device actually used, how can I
personally help, etc).

> If the HW queue is full, the driver returns -EBUSY [1] but does not enqueues the request as dm-crypt expects [2]. Dm-crypt ends up waiting indefinitely for a completion to a request that was never submitted, therefore the stall.

Makes sense - this kernel driver has been destroying users for many
years. I'm disappointed that this critical bricking failure isn't
searchable for others.

> This is not related to QATE-7495 'An incorrectly formatted request to QAT can hang the entire QAT endpoint' [3], which occurs when a malformed request is sent to the device.

That's nice to hear that the device itself isn't dying, but it's been
completely destroying systems for years which itself is a DoS.

> I'm working at patch that resolves this problem. In the meanwhile a workaround is to blacklist the qat_c3xxx.ko driver.

I'm not writing this facetiously, but this driver has caused
incredible harm over the past 5+ years and seems to continue to do so.
As there's no patch proposed yet, I'm looking for the driver to be
completely removed from the tree as it's presently a pure marketing
campaign that's caused significant harm. If the marketing benefits
(like accelerated crypto + hashing) aren't there when the accelerated
instruction set was pulled from these integrated chips - the driver
continues to serve no purpose for consumers beyond damage. Disabling
the core I/O bits in December 2020 to make this barely work continues
to promote this as a side project as it was never resolved in the
driver.

If I can test patches, or assist with the removal of this present
in-tree malware I'm happy to help.

Kyle.


On Mon, Feb 21, 2022 at 3:48 AM Giovanni Cabiddu
<[email protected]> wrote:
>
> Hi Kyle,
>
> The issue is that the implementations of aead and skcipher in the QAT
> driver are not properly supporting requests with the
> CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
> If the HW queue is full, the driver returns -EBUSY [1] but does not
> enqueues the request as dm-crypt expects [2]. Dm-crypt ends up waiting
> indefinitely for a completion to a request that was never submitted,
> therefore the stall.
> This is not related to QATE-7495 'An incorrectly formatted request to
> QAT can hang the entire QAT endpoint' [3], which occurs when a malformed
> request is sent to the device.
>
> I'm working at patch that resolves this problem. In the meanwhile a
> workaround is to blacklist the qat_c3xxx.ko driver.
>
> Regarding avoiding this issue on stable kernels. The usage of QAT with
> dm-crypt was already disabled in kernel 5.10 for a different issue
> (the driver allocates memory in the datapath).
> The following patches implement the change:
> 7bcb2c99f8ed crypto: algapi - use common mechanism for inheriting flags
> 2eb27c11937e crypto: algapi - add NEED_FALLBACK to INHERITED_FLAGS
> fbb6cda44190 crypto: algapi - introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
> b8aa7dc5c753 crypto: drivers - set the flag CRYPTO_ALG_ALLOCATES_MEMORY
> cd74693870fb dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
> An option would be to send the patches above to stable, another is to wait
> for a patch that fixes the problems in the QAT driver and send that to
> stable.
> @Herbert, what is the preferred approach here?
>
> Thanks,
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/crypto/qat/qat_common/qat_algs.c#L1022
> [2] https://elixir.bootlin.com/linux/latest/source/drivers/md/dm-crypt.c#L1584
> [3] https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf - page 25
>
> --
> Giovanni
>
>
> On Sat, Feb 19, 2022 at 03:00:51PM -0800, Kyle Sanderson wrote:
> > hi Dave,
> >
> > > This really sounds like broken hardware, not a kernel problem.
> >
> > It is indeed a hardware issue, specifically the intel qat crypto
> > driver that's in-tree - the hardware is fine (see below). The IQAT
> > eratta documentation states that if a request is not submitted
> > properly it can stall the entire device. The remediation guidance from
> > 2020 was "don't do that" and "don't allow unprivileged users access to
> > the device". The in-tree driver is not implemented properly either for
> > this SoC or board - I'm thinking it's related to QATE-7495.
> >
> > https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf
> >
> > > This implies a dmcrypt level problem - XFS can't make progress is dmcrypt is not completing IOs.
> >
> > That's the weird part about it. Some bio's are completing, others are
> > completely dropped, with some stalling forever. I had to use
> > xfs_repair to get the volumes operational again. I lost a good deal of
> > files and had to recover from backup after toggling the device back on
> > on a production system (silly, I know).
> >
> > > Where are the XFS corruption reports that the subject implies is occurring?
> >
> > I think you're right, it's dm-crypt that's broken here, with
> > ultimately the crypto driver causing this corruption. XFS being the
> > edge to the end-user is taking the brunt of it. There's reports going
> > back to late 2017 of significant issues with this mainlined stable
> > driver.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1522962
> > https://serverfault.com/questions/1010108/luks-hangs-on-centos-running-on-atom-c3758-cpu
> > https://www.phoronix.com/forums/forum/software/distributions/1172231-fedora-33-s-enterprise-linux-next-effort-approved-testbed-for-raising-cpu-requirements-etc?p=1174560#post1174560
> >
> > Any guidance would be appreciated.
> > Kyle.
> > On Sat, Feb 19, 2022 at 1:03 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> > > > A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> > > > attempted to be used by xfs (through dm-crypt) the entire kernel
> > > > thread stalls forever. Multiple users have hit this over the years
> > > > (through sporadic reporting) - I ended up trying ZFS and encryption
> > > > wasn't an issue there at all because I guess they don't use this
> > > > device. Returning to sanity (xfs), I was able to provision a dm-crypt
> > > > volume no problem on the disk, however when running mkfs.xfs on the
> > > > volume is what triggers the cascading failure (each request kills a
> > > > kthread).
> > >
> > > Can you provide the full stack traces for these errors so we can see
> > > exactly what this cascading failure looks like, please? In reality,
> > > the stall messages some time after this are not interesting - it's
> > > the first errors that cause the stall that need to be investigated.
> > >
> > > A good idea would be to provide the full storage stack decription
> > > and hardware in use, as per:
> > >
> > > https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> > >
> > > > Disabling IQAT on the south bridge results in a working
> > > > system, however this is not the default configuration for the
> > > > distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> > > > convinced this never worked properly based on the lack of popularity
> > > > for kernel encryption (crypto), and the embedded nature that
> > > > SuperMicro has integrated this device in collaboration with intel as
> > > > it looks like the primary usage is through external accelerator cards.
> > >
> > > This really sounds like broken hardware, not a kernel problem.
> > >
> > > > Kernels tried were from RHEL8 over a year ago, and this impacts the
> > > > entirety of the 5.4 series on Ubuntu.
> > > > Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
> > >
> > > [snip stalled kcryptd worker threads]
> > >
> > > This implies a dmcrypt level problem - XFS can't make progress is
> > > dmcrypt is not completing IOs.
> > >
> > > Where are the XFS corruption reports that the subject implies is
> > > occurring?
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > [email protected]

2022-02-28 20:42:26

by Cabiddu, Giovanni

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Mon, Feb 28, 2022 at 11:25:49AM -0800, Linus Torvalds wrote:
> On Mon, Feb 28, 2022 at 12:18 AM Kyle Sanderson <[email protected]> wrote:
> >
> > Makes sense - this kernel driver has been destroying users for many
> > years. I'm disappointed that this critical bricking failure isn't
> > searchable for others.
>
> It does sound like we should just disable that driver entirely until
> it is fixed.
>
> Or at least the configuration that can cause problems, if there is
> some particular sub-case.
The dm-crypt + QAT use-case has already been disabled since kernel 5.10
due to a different issue.
Is it an option to port those patches to stable until I provide a fix for
the driver? I have already drafted a few alternatives for the fix and I am
aiming for a final set by the end of the week.

Thanks,

--
Giovanni

2022-02-28 21:01:51

by Greg Kroah-Hartman

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Mon, Feb 28, 2022 at 08:39:11PM +0000, Giovanni Cabiddu wrote:
> On Mon, Feb 28, 2022 at 11:25:49AM -0800, Linus Torvalds wrote:
> > On Mon, Feb 28, 2022 at 12:18 AM Kyle Sanderson <[email protected]> wrote:
> > >
> > > Makes sense - this kernel driver has been destroying users for many
> > > years. I'm disappointed that this critical bricking failure isn't
> > > searchable for others.
> >
> > It does sound like we should just disable that driver entirely until
> > it is fixed.
> >
> > Or at least the configuration that can cause problems, if there is
> > some particular sub-case.
> The dm-crypt + QAT use-case is already disabled since kernel 5.10 due to
> a different issue.
> Is it an option to port those patches to stable till I provide a fix for
> the driver? I drafted already few alternatives for the fix and I am aiming
> for a final set by end of week.

If the existing situation is broken, yes, those patches are fine for
stable releases.

thanks,

greg k-h

2022-03-03 00:31:54

by Herbert Xu

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Wed, Mar 02, 2022 at 03:56:36PM +0100, Greg KH wrote:
>
> > If not, then these are the patches that should be backported:
> > 7bcb2c99f8ed crypto: algapi - use common mechanism for inheriting flags
> > 2eb27c11937e crypto: algapi - add NEED_FALLBACK to INHERITED_FLAGS
> > fbb6cda44190 crypto: algapi - introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
> > b8aa7dc5c753 crypto: drivers - set the flag CRYPTO_ALG_ALLOCATES_MEMORY
> > cd74693870fb dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
> > Herbert, correct me if I'm wrong here.
>
> These need to be manually backported as they do not apply cleanly. Can
> you provide such a set? Or should I just disable a specific driver here
> instead which would be easier overall?

I think the safest thing is to disable qat in stable (possibly only
when DM_CRYPT is enabled/modular). The patches in question while
good may have too wide an effect for the stable kernel series.

Giovanni, could you send Greg a Kconfig patch to do that?

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2022-03-03 00:32:18

by Cabiddu, Giovanni

Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Thu, Mar 03, 2022 at 10:27:47AM +1200, Herbert Xu wrote:
> On Wed, Mar 02, 2022 at 03:56:36PM +0100, Greg KH wrote:
> >
> > > If not, then these are the patches that should be backported:
> > > 7bcb2c99f8ed crypto: algapi - use common mechanism for inheriting flags
> > > 2eb27c11937e crypto: algapi - add NEED_FALLBACK to INHERITED_FLAGS
> > > fbb6cda44190 crypto: algapi - introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
> > > b8aa7dc5c753 crypto: drivers - set the flag CRYPTO_ALG_ALLOCATES_MEMORY
> > > cd74693870fb dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
> > > Herbert, correct me if I'm wrong here.
> >
> > These need to be manually backported as they do not apply cleanly. Can
> > you provide such a set? Or should I just disable a specific driver here
> > instead which would be easier overall?
>
> I think the safest thing is to disable qat in stable (possibly only
> when DM_CRYPT is enabled/modular). The patches in question while
> good may have too wide an effect for the stable kernel series.
>
> Giovanni, could you send Greg a Kconfig patch to do that?
I was thinking, as an alternative, to lower the cra_priority in the QAT
driver for the algorithms used by dm-crypt so they are not used by
default.
Is that a viable option?

Sure, I can provide a patch for either the cra_priority or the Kconfig
option for the stable kernels that don't have the patches above.

--
Giovanni

2022-03-03 14:14:58

by Cabiddu, Giovanni

[permalink] [raw]
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Thu, Mar 03, 2022 at 10:45:48AM +1200, Herbert Xu wrote:
> On Wed, Mar 02, 2022 at 10:42:20PM +0000, Giovanni Cabiddu wrote:
> >
> > I was thinking, as an alternative, to lower the cra_priority in the QAT
> > driver for the algorithms used by dm-crypt so they are not used by
> > default.
> > Is that a viable option?
>
> Yes I think that should work too.
The patch below implements that solution and applies to linux-5.4.y.
If it is ok, I can send it to stable for all kernels <= 5.4 following
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#option-3

---8<---
From: Giovanni Cabiddu <[email protected]>
Date: Thu, 3 Mar 2022 11:54:07 +0000
Subject: [PATCH] crypto: qat - drop priority of algorithms
Organization: Intel Research and Development Ireland Ltd - Co. Reg. #308263 - Collinstown Industrial Park, Leixlip, County Kildare - Ireland

The implementations of aead and skcipher in the QAT driver do not
properly support requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
If the HW queue is full, the driver returns -EBUSY but does not enqueue
the request.
This can result in applications like dm-crypt waiting indefinitely for a
completion of a request that was never submitted to the hardware.

To mitigate this problem, reduce the priority of all skcipher and aead
implementations in the QAT driver so they are not used by default.

This patch deviates from the original upstream solution, which prevents
dm-crypt from using drivers registered with the flag
CRYPTO_ALG_ALLOCATES_MEMORY, since a backport of that set to stable
kernels may have too wide an effect.

commit 7bcb2c99f8ed032cfb3f5596b4dccac6b1f501df upstream
commit 2eb27c11937ee9984c04b75d213a737291c5f58c upstream
commit fbb6cda44190d72aa5199d728797aabc6d2ed816 upstream
commit b8aa7dc5c7535f9abfca4bceb0ade9ee10cf5f54 upstream
commit cd74693870fb748d812867ba49af733d689a3604 upstream

Signed-off-by: Giovanni Cabiddu <[email protected]>
---
 drivers/crypto/qat/qat_common/qat_algs.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/qat_algs.c b/drivers/crypto/qat/qat_common/qat_algs.c
index 6b8ad3d67481..a5c28a08fd8c 100644
--- a/drivers/crypto/qat/qat_common/qat_algs.c
+++ b/drivers/crypto/qat/qat_common/qat_algs.c
@@ -1274,7 +1274,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha1),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha1",
-		.cra_priority = 4001,
+		.cra_priority = 1,
 		.cra_flags = CRYPTO_ALG_ASYNC,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1291,7 +1291,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha256),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha256",
-		.cra_priority = 4001,
+		.cra_priority = 1,
 		.cra_flags = CRYPTO_ALG_ASYNC,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1308,7 +1308,7 @@ static struct aead_alg qat_aeads[] = { {
 	.base = {
 		.cra_name = "authenc(hmac(sha512),cbc(aes))",
 		.cra_driver_name = "qat_aes_cbc_hmac_sha512",
-		.cra_priority = 4001,
+		.cra_priority = 1,
 		.cra_flags = CRYPTO_ALG_ASYNC,
 		.cra_blocksize = AES_BLOCK_SIZE,
 		.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1326,7 +1326,7 @@ static struct aead_alg qat_aeads[] = { {
 static struct crypto_alg qat_algs[] = { {
 	.cra_name = "cbc(aes)",
 	.cra_driver_name = "qat_aes_cbc",
-	.cra_priority = 4001,
+	.cra_priority = 1,
 	.cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
 	.cra_blocksize = AES_BLOCK_SIZE,
 	.cra_ctxsize = sizeof(struct qat_alg_ablkcipher_ctx),
@@ -1348,7 +1348,7 @@ static struct crypto_alg qat_algs[] = { {
 }, {
 	.cra_name = "ctr(aes)",
 	.cra_driver_name = "qat_aes_ctr",
-	.cra_priority = 4001,
+	.cra_priority = 1,
 	.cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
 	.cra_blocksize = 1,
 	.cra_ctxsize = sizeof(struct qat_alg_ablkcipher_ctx),
@@ -1370,7 +1370,7 @@ static struct crypto_alg qat_algs[] = { {
 }, {
 	.cra_name = "xts(aes)",
 	.cra_driver_name = "qat_aes_xts",
-	.cra_priority = 4001,
+	.cra_priority = 1,
 	.cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
 	.cra_blocksize = AES_BLOCK_SIZE,
 	.cra_ctxsize = sizeof(struct qat_alg_ablkcipher_ctx),

base-commit: 866ae42cf4788c8b18de6bda0a522362702861d7
--
2.35.1

2022-03-03 21:51:08

by Cabiddu, Giovanni

[permalink] [raw]
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Thu, Mar 03, 2022 at 07:21:33PM +0000, Eric Biggers wrote:
> If these algorithms have critical bugs, which it appears they do, then IMO it
> would be better to disable them (either stop registering them, or disable the
> whole driver) than to leave them available with low cra_priority. Low
> cra_priority doesn't guarantee that they aren't used.
Thanks for your feedback, Eric.

Here is a patch that disables the registration of the algorithms in the
QAT driver by setting, at config time, the number of HW queues (aka
instances) to zero.

---8<---
From: Giovanni Cabiddu <[email protected]>
Subject: [PATCH] crypto: qat - disable registration of algorithms
Organization: Intel Research and Development Ireland Ltd - Co. Reg. #308263 - Collinstown Industrial Park, Leixlip, County Kildare - Ireland

The implementations of aead and skcipher in the QAT driver do not
properly support requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
If the HW queue is full, the driver returns -EBUSY but does not enqueue
the request.
This can result in applications like dm-crypt waiting indefinitely for a
completion of a request that was never submitted to the hardware.

To avoid this problem, disable the registration of all skcipher and aead
implementations in the QAT driver by setting the number of crypto
instances to 0 at configuration time.

This patch deviates from the original upstream solution, which prevents
dm-crypt from using drivers registered with the flag
CRYPTO_ALG_ALLOCATES_MEMORY, since a backport of that set to stable
kernels may have too wide an effect.

commit 7bcb2c99f8ed032cfb3f5596b4dccac6b1f501df upstream
commit 2eb27c11937ee9984c04b75d213a737291c5f58c upstream
commit fbb6cda44190d72aa5199d728797aabc6d2ed816 upstream
commit b8aa7dc5c7535f9abfca4bceb0ade9ee10cf5f54 upstream
commit cd74693870fb748d812867ba49af733d689a3604 upstream

Signed-off-by: Giovanni Cabiddu <[email protected]>
---
 drivers/crypto/qat/qat_common/qat_crypto.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/qat_crypto.c b/drivers/crypto/qat/qat_common/qat_crypto.c
index 3852d31ce0a4..611d214d5198 100644
--- a/drivers/crypto/qat/qat_common/qat_crypto.c
+++ b/drivers/crypto/qat/qat_common/qat_crypto.c
@@ -159,9 +159,7 @@ struct qat_crypto_instance *qat_crypto_get_instance_node(int node)
  */
 int qat_crypto_dev_config(struct adf_accel_dev *accel_dev)
 {
-	int cpus = num_online_cpus();
-	int banks = GET_MAX_BANKS(accel_dev);
-	int instances = min(cpus, banks);
+	int instances = 0;
 	char key[ADF_CFG_MAX_KEY_LEN_IN_BYTES];
 	int i;
 	unsigned long val;

base-commit: 866ae42cf4788c8b18de6bda0a522362702861d7
--
2.35.1

2022-03-03 22:36:35

by Eric Biggers

[permalink] [raw]
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Thu, Mar 03, 2022 at 09:24:42PM +0000, Giovanni Cabiddu wrote:
> On Thu, Mar 03, 2022 at 07:21:33PM +0000, Eric Biggers wrote:
> > If these algorithms have critical bugs, which it appears they do, then IMO it
> > would be better to disable them (either stop registering them, or disable the
> > whole driver) than to leave them available with low cra_priority. Low
> > cra_priority doesn't guarantee that they aren't used.
> Thanks for your feedback Eric.
>
> Here is a patch that disables the registration of the algorithms in the
> QAT driver by setting, at config time, the number of HW queues (aka
> instances) to zero.
>
> ---8<---
> From: Giovanni Cabiddu <[email protected]>
> Subject: [PATCH] crypto: qat - disable registration of algorithms
> Organization: Intel Research and Development Ireland Ltd - Co. Reg. #308263 - Collinstown Industrial Park, Leixlip, County Kildare - Ireland
>
> The implementations of aead and skcipher in the QAT driver do not
> support properly requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
> If the HW queue is full, the driver returns -EBUSY but does not enqueue
> the request.
> This can result in applications like dm-crypt waiting indefinitely for a
> completion of a request that was never submitted to the hardware.
>
> To avoid this problem, disable the registration of all skcipher and aead
> implementations in the QAT driver by setting the number of crypto
> instances to 0 at configuration time.
>
> This patch deviates from the original upstream solution, that prevents
> dm-crypt to use drivers registered with the flag
> CRYPTO_ALG_ALLOCATES_MEMORY, since a backport of that set to stable
> kernels may have a too wide effect.
>
> commit 7bcb2c99f8ed032cfb3f5596b4dccac6b1f501df upstream
> commit 2eb27c11937ee9984c04b75d213a737291c5f58c upstream
> commit fbb6cda44190d72aa5199d728797aabc6d2ed816 upstream
> commit b8aa7dc5c7535f9abfca4bceb0ade9ee10cf5f54 upstream
> commit cd74693870fb748d812867ba49af733d689a3604 upstream
>
> Signed-off-by: Giovanni Cabiddu <[email protected]>
> ---
> drivers/crypto/qat/qat_common/qat_crypto.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)

Sounds good; is there any reason not to apply this upstream too, though?
You could revert it later as part of the patch series that fixes the driver.

- Eric

2022-03-04 19:22:01

by Cabiddu, Giovanni

[permalink] [raw]
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

On Thu, Mar 03, 2022 at 09:44:53PM +0000, Eric Biggers wrote:
> On Thu, Mar 03, 2022 at 09:24:42PM +0000, Giovanni Cabiddu wrote:
> > On Thu, Mar 03, 2022 at 07:21:33PM +0000, Eric Biggers wrote:
> > > If these algorithms have critical bugs, which it appears they do, then IMO it
> > > would be better to disable them (either stop registering them, or disable the
> > > whole driver) than to leave them available with low cra_priority. Low
> > > cra_priority doesn't guarantee that they aren't used.
> > Thanks for your feedback Eric.
> >
> > Here is a patch that disables the registration of the algorithms in the
> > QAT driver by setting, at config time, the number of HW queues (aka
> > instances) to zero.
> >
> > ---8<---
> > From: Giovanni Cabiddu <[email protected]>
> > Subject: [PATCH] crypto: qat - disable registration of algorithms
> > Organization: Intel Research and Development Ireland Ltd - Co. Reg. #308263 - Collinstown Industrial Park, Leixlip, County Kildare - Ireland
> >
> > The implementations of aead and skcipher in the QAT driver do not
> > support properly requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
> > If the HW queue is full, the driver returns -EBUSY but does not enqueue
> > the request.
> > This can result in applications like dm-crypt waiting indefinitely for a
> > completion of a request that was never submitted to the hardware.
> >
> > To avoid this problem, disable the registration of all skcipher and aead
> > implementations in the QAT driver by setting the number of crypto
> > instances to 0 at configuration time.
> >
> > This patch deviates from the original upstream solution, that prevents
> > dm-crypt to use drivers registered with the flag
> > CRYPTO_ALG_ALLOCATES_MEMORY, since a backport of that set to stable
> > kernels may have a too wide effect.
> >
> > commit 7bcb2c99f8ed032cfb3f5596b4dccac6b1f501df upstream
> > commit 2eb27c11937ee9984c04b75d213a737291c5f58c upstream
> > commit fbb6cda44190d72aa5199d728797aabc6d2ed816 upstream
> > commit b8aa7dc5c7535f9abfca4bceb0ade9ee10cf5f54 upstream
> > commit cd74693870fb748d812867ba49af733d689a3604 upstream
> >
> > Signed-off-by: Giovanni Cabiddu <[email protected]>
> > ---
> > drivers/crypto/qat/qat_common/qat_crypto.c | 4 +---
> > 1 file changed, 1 insertion(+), 3 deletions(-)
>
> Sounds good; is there any reason not to apply this upstream too, though?
> You could revert it later as part of the patch series that fixes the driver.
Makes sense. I'm going to send it upstream and Cc stable as documented
in https://www.kernel.org/doc/html/v4.10/process/stable-kernel-rules.html#option-1
I will then revert this change in the set that fixes the problem.

Thanks,

--
Giovanni

2022-03-17 04:31:20

by Kyle Sanderson

[permalink] [raw]
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs

> Makes sense. I'm going to send it upstream and Cc stable as documented
> in https://www.kernel.org/doc/html/v4.10/process/stable-kernel-rules.html#option-1
> I will then revert this change in the set that fixes the problem.

Did this go anywhere? I'm still not seeing it in any of the stable trees.

Kyle.

On Fri, Mar 4, 2022 at 9:50 AM Giovanni Cabiddu
<[email protected]> wrote:
>
> On Thu, Mar 03, 2022 at 09:44:53PM +0000, Eric Biggers wrote:
> > On Thu, Mar 03, 2022 at 09:24:42PM +0000, Giovanni Cabiddu wrote:
> > > On Thu, Mar 03, 2022 at 07:21:33PM +0000, Eric Biggers wrote:
> > > > If these algorithms have critical bugs, which it appears they do, then IMO it
> > > > would be better to disable them (either stop registering them, or disable the
> > > > whole driver) than to leave them available with low cra_priority. Low
> > > > cra_priority doesn't guarantee that they aren't used.
> > > Thanks for your feedback Eric.
> > >
> > > Here is a patch that disables the registration of the algorithms in the
> > > QAT driver by setting, at config time, the number of HW queues (aka
> > > instances) to zero.
> > >
> > > ---8<---
> > > From: Giovanni Cabiddu <[email protected]>
> > > Subject: [PATCH] crypto: qat - disable registration of algorithms
> > > Organization: Intel Research and Development Ireland Ltd - Co. Reg. #308263 - Collinstown Industrial Park, Leixlip, County Kildare - Ireland
> > >
> > > The implementations of aead and skcipher in the QAT driver do not
> > > support properly requests with the CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
> > > If the HW queue is full, the driver returns -EBUSY but does not enqueue
> > > the request.
> > > This can result in applications like dm-crypt waiting indefinitely for a
> > > completion of a request that was never submitted to the hardware.
> > >
> > > To avoid this problem, disable the registration of all skcipher and aead
> > > implementations in the QAT driver by setting the number of crypto
> > > instances to 0 at configuration time.
> > >
> > > This patch deviates from the original upstream solution, that prevents
> > > dm-crypt to use drivers registered with the flag
> > > CRYPTO_ALG_ALLOCATES_MEMORY, since a backport of that set to stable
> > > kernels may have a too wide effect.
> > >
> > > commit 7bcb2c99f8ed032cfb3f5596b4dccac6b1f501df upstream
> > > commit 2eb27c11937ee9984c04b75d213a737291c5f58c upstream
> > > commit fbb6cda44190d72aa5199d728797aabc6d2ed816 upstream
> > > commit b8aa7dc5c7535f9abfca4bceb0ade9ee10cf5f54 upstream
> > > commit cd74693870fb748d812867ba49af733d689a3604 upstream
> > >
> > > Signed-off-by: Giovanni Cabiddu <[email protected]>
> > > ---
> > > drivers/crypto/qat/qat_common/qat_crypto.c | 4 +---
> > > 1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > Sounds good; is there any reason not to apply this upstream too, though?
> > You could revert it later as part of the patch series that fixes the driver.
> Makes sense. I'm going to send it upstream and Cc stable as documented
> in https://www.kernel.org/doc/html/v4.10/process/stable-kernel-rules.html#option-1
> I will then revert this change in the set that fixes the problem.
>
> Thanks,
>
> --
> Giovanni