2022-08-22 14:40:43

by Fabio M. De Francesco

Subject: [PATCH v3 0/1] Don't kmap() pages which can't come from HIGHMEM

kmap() is being deprecated in favor of kmap_local_page().[1]

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
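
For reference, a minimal sketch of the two APIs at a call site (a generic
example, not taken from this driver):

#include <linux/highmem.h>
#include <linux/string.h>

/* Deprecated: uses the global kmap pool, may block; kunmap() takes the page. */
static void copy_with_kmap(struct page *page, void *dst, size_t len)
{
        void *src = kmap(page);

        memcpy(dst, src, len);
        kunmap(page);
}

/* Preferred: CPU-local and cheap; kunmap_local() takes the address returned
 * by kmap_local_page(), and mappings must be strictly nested. */
static void copy_with_kmap_local(struct page *page, void *dst, size_t len)
{
        void *src = kmap_local_page(page);

        memcpy(dst, src, len);
        kunmap_local(src);
}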

The pages which will be mapped are allocated in nvmet_tcp_map_data(),
using the GFP_KERNEL flag. This ensures that they cannot come from
HIGHMEM. This implies that a straight page_address() can replace the
kmap() of sg_page(sg) in nvmet_tcp_map_pdu_iovec(). As a side effect, we
can also delete the field "nr_mapped" from struct "nvmet_tcp_cmd"
because, after removing the kmap() calls, there is no longer any need
for it.
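
In other words (a minimal, generic illustration, not code from this driver):

#include <linux/gfp.h>
#include <linux/mm.h>

static void *lowmem_buffer_example(void)
{
        /* GFP_KERNEL allocations never come from ZONE_HIGHMEM ... */
        struct page *page = alloc_page(GFP_KERNEL);

        if (!page)
                return NULL;

        /* ... so the page is permanently in the kernel's linear mapping and
         * page_address() returns the same address that kmap() would have. */
        return page_address(page);
}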

Therefore, replace the kmap() of sg_page(sg) with a page_address() and
delete the "nr_mapped" field from "nvmet_tcp_cmd" and instead pass a
local variable to iov_iter_kvec() from the call site in
nvmet_tcp_map_pdu_iovec().

Tested with blktests on a QEMU/KVM x86_32 VM with 6GB of RAM, booting a
kernel with HIGHMEM64GB enabled.

[1] "[PATCH] checkpatch: Add kmap and kmap_atomic to the deprecated
list" https://lore.kernel.org/all/[email protected]/

Cc: Chaitanya Kulkarni <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Suggested-by: Ira Weiny <[email protected]>
Signed-off-by: Fabio M. De Francesco <[email protected]>

v2->v3: Tested with blktests on a QEMU/KVM x86_32 VM with 6GB of RAM,
booting a kernel with HIGHMEM64GB enabled, as requested by Sagi (thanks!).

With or without this patch, test nvme/029 fails; therefore, these changes
don't introduce new bugs. The output from dmesg and the decoded stack
trace are copy-pasted at the end of this cover letter.

Split a couple of long lines (thanks to Chaitanya and Christoph).

v1->v2: Use a local variable as the argument of iov_iter_kvec() instead
of the removed "nr_mapped" field from struct "nvmet_tcp_cmd". Thanks to
Sagi and Keith, who noticed my mistake.

Many thanks to Chaitanya, Keith, and Sagi for the answers and the comments on
the RFC which led to this patch. The RFC is at:
https://lore.kernel.org/all/[email protected]/

Fabio M. De Francesco (1):
nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

drivers/nvme/target/tcp.c | 28 +++++-----------------------
1 file changed, 5 insertions(+), 23 deletions(-)

Output from the dmesg command with the decoded stack trace (as said, with or
without my changes, test nvme/029 fails and hangs the VM):

tweed32:~ # cat /usr/src/git/packages/blktests/6.0.0-rc2-no-patched.dmesg
[ 727.874459] run blktests nvme/029 at 2022-08-22 15:04:57
[ 727.917821] loop0: detected capacity change from 0 to 2097152
[ 727.935531] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 727.952425] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 727.965284] device: 'nvme0': device_add
[ 727.965481] PM: Adding info for No Bus:nvme0
[ 727.966436] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0ce2efcb-2d44-4c39-9064-433de1b4c5d4.
[ 727.967487] device: 'nvme-subsys0': device_add
[ 727.967583] PM: Adding info for No Bus:nvme-subsys0
[ 727.969295] device: 'hwmon0': device_add
[ 727.969362] PM: Adding info for No Bus:hwmon0
[ 727.969739] nvme nvme0: creating 4 I/O queues.
[ 727.970839] nvme nvme0: mapped 4/0/0 default/read/poll queues.
[ 727.972127] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
[ 727.974938] device: 'nvme0c0n1': device_add
[ 727.975143] PM: Adding info for No Bus:nvme0c0n1
[ 727.976215] device: 'nvme0n1': device_add
[ 727.976401] PM: Adding info for No Bus:nvme0n1
[ 727.976721] device: '259:27': device_add
[ 727.976803] PM: Adding info for No Bus:259:27
[ 727.977655] device: 'ng0n1': device_add
[ 727.977731] PM: Adding info for No Bus:ng0n1
[ 729.154663] usercopy: Kernel memory exposure attempt detected from kmap (offset 0, size 12288)!
[ 729.154711] ------------[ cut here ]------------
[ 729.154713] kernel BUG at mm/usercopy.c:101!
[ 729.154723] invalid opcode: 0000 [#1] PREEMPT SMP
[ 729.154727] CPU: 2 PID: 2907 Comm: kworker/2:2H Tainted: G S W 6.0.0-rc2-vanilla-debug+ #31 aa3ee8c1eb49ad6f37c1eba908f84dc5e5ede635
[ 729.154731] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
[ 729.154736] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
[ 729.154743] EIP: usercopy_abort (mm/usercopy.c:101 (discriminator 17))
[ 729.154752] Code: 44 cb bb 9a fe d2 d1 89 4d f0 b9 ea fa d0 d1 0f 45 cb ff 75 0c ff 75 08 57 52 56 50 ff 75 f0 51 68 10 fe d2 d1 e8 52 86 ff ff <0f> 0b b8 70 7f 0b d2 83 c4 24 e8 3f dc 00 00 89 e0 56 31 d2 29 d8
All code
========
0: 44 inc %esp
1: cb lret
2: bb 9a fe d2 d1 mov $0xd1d2fe9a,%ebx
7: 89 4d f0 mov %ecx,-0x10(%ebp)
a: b9 ea fa d0 d1 mov $0xd1d0faea,%ecx
f: 0f 45 cb cmovne %ebx,%ecx
12: ff 75 0c push 0xc(%ebp)
15: ff 75 08 push 0x8(%ebp)
18: 57 push %edi
19: 52 push %edx
1a: 56 push %esi
1b: 50 push %eax
1c: ff 75 f0 push -0x10(%ebp)
1f: 51 push %ecx
20: 68 10 fe d2 d1 push $0xd1d2fe10
25: e8 52 86 ff ff call 0xffff867c
2a:* 0f 0b ud2 <-- trapping instruction
2c: b8 70 7f 0b d2 mov $0xd20b7f70,%eax
31: 83 c4 24 add $0x24,%esp
34: e8 3f dc 00 00 call 0xdc78
39: 89 e0 mov %esp,%eax
3b: 56 push %esi
3c: 31 d2 xor %edx,%edx
3e: 29 d8 sub %ebx,%eax

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: b8 70 7f 0b d2 mov $0xd20b7f70,%eax
7: 83 c4 24 add $0x24,%esp
a: e8 3f dc 00 00 call 0xdc4e
f: 89 e0 mov %esp,%eax
11: 56 push %esi
12: 31 d2 xor %edx,%edx
14: 29 d8 sub %ebx,%eax
[ 729.154754] EAX: 00000053 EBX: d1d2fe9a ECX: f7484e08 EDX: 00000000
[ 729.154755] ESI: d1d487be EDI: d1d487be EBP: d03b5cdc ESP: d03b5ca8
[ 729.154757] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 729.154760] CR0: 80050033 CR2: b7f57080 CR3: 06027c20 CR4: 000006b0
[ 729.154762] Call Trace:
[ 729.154765] __check_object_size.cold (mm/usercopy.c:145 mm/usercopy.c:253)
[ 729.154768] simple_copy_to_iter (include/linux/uio.h:176 net/core/datagram.c:513)
[ 729.154772] __skb_datagram_iter (include/linux/highmem-internal.h:55 net/core/datagram.c:445)
[ 729.154775] skb_copy_datagram_iter (net/core/datagram.c:529)
[ 729.154777] ? skb_copy_and_hash_datagram_iter (net/core/datagram.c:512)
[ 729.154779] tcp_recvmsg_locked (net/ipv4/tcp.c:2593)
[ 729.154784] ? lock_sock_nested (net/core/sock.c:3399)
[ 729.154789] tcp_recvmsg (net/ipv4/tcp.c:2670)
[ 729.154791] ? aa_sk_perm (security/apparmor/include/cred.h:122 security/apparmor/net.c:176)
[ 729.154797] inet_recvmsg (net/ipv4/af_inet.c:852)
[ 729.154800] sock_recvmsg (net/socket.c:995 net/socket.c:1013)
[ 729.154804] nvmet_tcp_io_work (drivers/nvme/target/tcp.c:1137 drivers/nvme/target/tcp.c:1211 drivers/nvme/target/tcp.c:1237 drivers/nvme/target/tcp.c:1287) nvmet_tcp
[ 729.154807] ? lock_acquire (include/trace/events/lock.h:24 kernel/locking/lockdep.c:5637)
[ 729.154814] ? rcu_read_lock_sched_held (kernel/rcu/update.c:104 kernel/rcu/update.c:123)
[ 729.154820] ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:103 include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:202)
[ 729.154830] ? process_one_work (kernel/workqueue.c:2264)
[ 729.154836] ? trace_preempt_on (kernel/trace/trace_preemptirq.c:125)
[ 729.154841] ? process_one_work (kernel/workqueue.c:2264)
[ 729.154845] process_one_work (kernel/workqueue.c:2294)
[ 729.154850] worker_thread (include/linux/list.h:292 kernel/workqueue.c:2437)
[ 729.154853] ? _raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:103 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
[ 729.154857] kthread (kernel/kthread.c:376)
[ 729.154860] ? process_one_work (kernel/workqueue.c:2379)
[ 729.154862] ? kthread_complete_and_exit (kernel/kthread.c:331)
[ 729.154865] ret_from_fork (arch/x86/entry/entry_32.S:770)
[ 729.154871] Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics vfat fat xfs loop nvme_core uinput af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill nfnetlink ip6table_filter ip6_tables iptable_filter qrtr snd_hda_codec_generic ledtrig_audio iTCO_wdt intel_pmc_bxt snd_hda_intel iTCO_vendor_support joydev snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd i2c_i801 pcspkr soundcore lpc_ich i2c_smbus virtio_net virtio_balloon net_failover failover tiny_power_button button intel_agp intel_gtt agpgart configfs ip_tables x_tables hid_generic usbhid ext4 mbcache jbd2 virtio_gpu uhci_hcd virtio_dma_buf ehci_hcd drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
[ 729.154919] usbcore serio_raw virtio_blk sysimgblt virtio_rng virtiofs usb_common fb_sys_fops drm virtio_scsi fuse sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr qemu_fw_cfg [last unloaded: nvmet]
[ 729.154931] ---[ end trace 0000000000000000 ]---
[ 729.154932] EIP: usercopy_abort (mm/usercopy.c:101 (discriminator 17))
[ 729.154934] Code: 44 cb bb 9a fe d2 d1 89 4d f0 b9 ea fa d0 d1 0f 45 cb ff 75 0c ff 75 08 57 52 56 50 ff 75 f0 51 68 10 fe d2 d1 e8 52 86 ff ff <0f> 0b b8 70 7f 0b d2 83 c4 24 e8 3f dc 00 00 89 e0 56 31 d2 29 d8
All code
========
0: 44 inc %esp
1: cb lret
2: bb 9a fe d2 d1 mov $0xd1d2fe9a,%ebx
7: 89 4d f0 mov %ecx,-0x10(%ebp)
a: b9 ea fa d0 d1 mov $0xd1d0faea,%ecx
f: 0f 45 cb cmovne %ebx,%ecx
12: ff 75 0c push 0xc(%ebp)
15: ff 75 08 push 0x8(%ebp)
18: 57 push %edi
19: 52 push %edx
1a: 56 push %esi
1b: 50 push %eax
1c: ff 75 f0 push -0x10(%ebp)
1f: 51 push %ecx
20: 68 10 fe d2 d1 push $0xd1d2fe10
25: e8 52 86 ff ff call 0xffff867c
2a:* 0f 0b ud2 <-- trapping instruction
2c: b8 70 7f 0b d2 mov $0xd20b7f70,%eax
31: 83 c4 24 add $0x24,%esp
34: e8 3f dc 00 00 call 0xdc78
39: 89 e0 mov %esp,%eax
3b: 56 push %esi
3c: 31 d2 xor %edx,%edx
3e: 29 d8 sub %ebx,%eax

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: b8 70 7f 0b d2 mov $0xd20b7f70,%eax
7: 83 c4 24 add $0x24,%esp
a: e8 3f dc 00 00 call 0xdc4e
f: 89 e0 mov %esp,%eax
11: 56 push %esi
12: 31 d2 xor %edx,%edx
14: 29 d8 sub %ebx,%eax
[ 729.154936] EAX: 00000053 EBX: d1d2fe9a ECX: f7484e08 EDX: 00000000
[ 729.154937] ESI: d1d487be EDI: d1d487be EBP: d03b5cdc ESP: d03b5ca8
[ 729.154938] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 729.154941] CR0: 80050033 CR2: b7f57080 CR3: 06027c20 CR4: 000006b0
[ 753.362612] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 753.362619] nvmet: ctrl 1 fatal error occurred!
[ 753.362688] nvme nvme0: starting error recovery
[ 753.362962] nvmet_tcp: queue 1 unhandled state 5
[ 753.363035] nvmet_tcp: queue 2 unhandled state 5
[ 753.363130] nvme0c0n1: Unknown(0x3) @ LBA 18446744073709551615, 513 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
[ 753.392895] nvmet_tcp: queue 0 unhandled state 5
[ 753.392966] nvme nvme0: Reconnecting in 10 seconds...
[ 753.463036] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 753.535011] PM: Removing info for No Bus:nvme0c0n1
[ 753.535293] PM: Removing info for No Bus:ng0n1
[ 753.613096] device: '259:27': device_unregister
[ 753.613207] PM: Removing info for No Bus:259:27
[ 753.613281] device: '259:27': device_create_release
[ 753.613860] PM: Removing info for No Bus:nvme0n1
[ 753.753744] nvme nvme0: Property Set error: 880, offset 0x14
[ 753.753770] nvme nvme0: failed nvme_keep_alive_end_io error=4
[ 753.753972] device: 'hwmon0': device_unregister
[ 753.754061] PM: Removing info for No Bus:hwmon0
[ 753.754647] PM: Removing info for No Bus:nvme0
[ 753.754950] PM: Removing info for No Bus:nvme-subsys0
[ 790.472628] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 56s!
[ 790.472698] Showing busy workqueues and worker pools:
[ 790.472708] workqueue events_highpri: flags=0x10
[ 790.472720] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 790.472729] pending: mix_interrupt_randomness
[ 790.472753] workqueue kblockd: flags=0x18
[ 790.472763] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=3/256 refcnt=4
[ 790.472771] pending: blk_mq_timeout_work, blk_mq_timeout_work, blk_mq_run_work_fn
[ 790.472794] workqueue nvmet-wq: flags=0x8
[ 790.472805] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 790.472812] in-flight: 2191:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 790.472825] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[ 790.472832] in-flight: 2726:nvmet_tcp_release_queue_work [nvmet_tcp], 227:nvmet_fatal_error_handler [nvmet], 2723:nvmet_tcp_release_queue_work [nvmet_tcp], 2722:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 790.472854] workqueue nvmet_tcp_wq: flags=0x18
[ 790.472865] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=5/256 refcnt=6
[ 790.472878] in-flight: 1802201963:nvmet_tcp_io_work [nvmet_tcp]
[ 790.472881] pending: nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp]
[ 790.472894] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=1s workers=7 idle: 2349 4276 6
[ 790.472909] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2500 29 734 2498 4601 2503
[ 790.472926] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=56s workers=2 idle: 213
[ 821.032663] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 87s!
[ 821.032676] Showing busy workqueues and worker pools:
[ 821.032678] workqueue events: flags=0x0
[ 821.032680] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 821.032684] pending: vmstat_shepherd
[ 821.032695] workqueue events_highpri: flags=0x10
[ 821.032698] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 821.032701] pending: mix_interrupt_randomness
[ 821.032706] workqueue mm_percpu_wq: flags=0x8
[ 821.032709] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 821.032711] pending: vmstat_update
[ 821.032713] workqueue writeback: flags=0x4a
[ 821.032715] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256 refcnt=3
[ 821.032717] in-flight: 3832:wb_workfn
[ 821.032722] workqueue kblockd: flags=0x18
[ 821.032724] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=3/256 refcnt=4
[ 821.032726] pending: blk_mq_timeout_work, blk_mq_timeout_work, blk_mq_run_work_fn
[ 821.032738] workqueue nvmet-wq: flags=0x8
[ 821.032740] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 821.032742] in-flight: 2191:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 821.032750] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[ 821.032752] in-flight: 2726:nvmet_tcp_release_queue_work [nvmet_tcp], 227:nvmet_fatal_error_handler [nvmet], 2723:nvmet_tcp_release_queue_work [nvmet_tcp], 2722:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 821.032766] workqueue nvmet_tcp_wq: flags=0x18
[ 821.032770] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=5/256 refcnt=6
[ 821.032771] in-flight: 1802201963:nvmet_tcp_io_work [nvmet_tcp]
[ 821.032773] pending: nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp]
[ 821.032779] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2349 4276 6
[ 821.032782] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2500 29 734 2498 4601 2503
[ 821.032787] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=87s workers=2 idle: 213
[ 821.032789] pool 8: cpus=0-3 flags=0x4 nice=0 hung=0s workers=5 idle: 3865 99 97 98
[ 851.052808] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 117s!
[ 851.052822] Showing busy workqueues and worker pools:
[ 851.052824] workqueue events_highpri: flags=0x10
[ 851.052826] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 851.052830] pending: mix_interrupt_randomness
[ 851.052839] workqueue writeback: flags=0x4a
[ 851.052842] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256 refcnt=3
[ 851.052844] in-flight: 3832:wb_workfn
[ 851.052848] workqueue kblockd: flags=0x18
[ 851.052851] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=3/256 refcnt=4
[ 851.052853] pending: blk_mq_timeout_work, blk_mq_timeout_work, blk_mq_run_work_fn
[ 851.052867] workqueue nvmet-wq: flags=0x8
[ 851.052870] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 851.052871] in-flight: 2191:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 851.052882] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[ 851.052884] in-flight: 2726:nvmet_tcp_release_queue_work [nvmet_tcp], 227:nvmet_fatal_error_handler [nvmet], 2723:nvmet_tcp_release_queue_work [nvmet_tcp], 2722:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 851.052898] workqueue nvmet_tcp_wq: flags=0x18
[ 851.052901] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=5/256 refcnt=6
[ 851.052903] in-flight: 1802201963:nvmet_tcp_io_work [nvmet_tcp]
[ 851.052905] pending: nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp]
[ 851.052910] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2349 4276 6
[ 851.052915] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2500 29 734 2498 4601 2503
[ 851.052920] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=117s workers=2 idle: 213
[ 851.052923] pool 8: cpus=0-3 flags=0x4 nice=0 hung=1s workers=5 idle: 99 3865 97 98
[ 881.262608] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=-20 stuck for 147s!
[ 881.262647] Showing busy workqueues and worker pools:
[ 881.262654] workqueue events_highpri: flags=0x10
[ 881.262661] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 881.262668] pending: mix_interrupt_randomness
[ 881.262681] workqueue writeback: flags=0x4a
[ 881.262687] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256 refcnt=3
[ 881.262693] in-flight: 3832:wb_workfn
[ 881.262701] workqueue kblockd: flags=0x18
[ 881.262708] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=3/256 refcnt=4
[ 881.262713] pending: blk_mq_timeout_work, blk_mq_timeout_work, blk_mq_run_work_fn
[ 881.262728] workqueue nvmet-wq: flags=0x8
[ 881.262735] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 881.262740] in-flight: 2191:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 881.262749] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[ 881.262754] in-flight: 2726:nvmet_tcp_release_queue_work [nvmet_tcp], 227:nvmet_fatal_error_handler [nvmet], 2723:nvmet_tcp_release_queue_work [nvmet_tcp], 2722:nvmet_tcp_release_queue_work [nvmet_tcp]
[ 881.262771] workqueue nvmet_tcp_wq: flags=0x18
[ 881.262778] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=5/256 refcnt=6
[ 881.262782] in-flight: 1802201963:nvmet_tcp_io_work [nvmet_tcp]
[ 881.262785] pending: nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp], nvmet_tcp_io_work [nvmet_tcp]
[ 881.262794] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=7 idle: 2349 4276 6
[ 881.262804] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=3s workers=7 idle: 2500 29 734 2498 4601 2503
[ 881.262816] pool 5: cpus=2 node=0 flags=0x0 nice=-20 hung=147s workers=2 idle: 213
[ 881.262825] pool 8: cpus=0-3 flags=0x4 nice=0 hung=0s workers=5 idle: 99 3865 97 98

--
2.37.2


2022-08-22 14:43:04

by Fabio M. De Francesco

Subject: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

kmap() is being deprecated in favor of kmap_local_page().[1]

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

The pages which will be mapped are allocated in nvmet_tcp_map_data(),
using the GFP_KERNEL flag. This ensures that they cannot come from
HIGHMEM. This implies that a straight page_address() can replace the
kmap() of sg_page(sg) in nvmet_tcp_map_pdu_iovec(). As a side effect, we
can also delete the field "nr_mapped" from struct "nvmet_tcp_cmd"
because, after removing the kmap() calls, there is no longer any need
for it.

Therefore, replace the kmap() of sg_page(sg) with a page_address() and
delete the "nr_mapped" field from "nvmet_tcp_cmd" and instead pass a
local variable to iov_iter_kvec() from the call site in
nvmet_tcp_map_pdu_iovec().

Tested with blktests on a QEMU/KVM x86_32 VM with 6GB of RAM, booting a
kernel with HIGHMEM64GB enabled.

[1] "[PATCH] checkpatch: Add kmap and kmap_atomic to the deprecated
list" https://lore.kernel.org/all/[email protected]/

Cc: Chaitanya Kulkarni <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Suggested-by: Ira Weiny <[email protected]>
Signed-off-by: Fabio M. De Francesco <[email protected]>
---
drivers/nvme/target/tcp.c | 28 +++++-----------------------
1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index dc3b4dc8fe08..2e2a7367b77e 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -77,7 +77,6 @@ struct nvmet_tcp_cmd {
u32 pdu_len;
u32 pdu_recv;
int sg_idx;
- int nr_mapped;
struct msghdr recv_msg;
struct kvec *iov;
u32 flags;
@@ -167,7 +166,6 @@ static const struct nvmet_fabrics_ops nvmet_tcp_ops;
static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd);
static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
-static void nvmet_tcp_unmap_pdu_iovec(struct nvmet_tcp_cmd *cmd);

static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
struct nvmet_tcp_cmd *cmd)
@@ -301,35 +299,21 @@ static int nvmet_tcp_check_ddgst(struct nvmet_tcp_queue *queue, void *pdu)

static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd)
{
- WARN_ON(unlikely(cmd->nr_mapped > 0));
-
kfree(cmd->iov);
sgl_free(cmd->req.sg);
cmd->iov = NULL;
cmd->req.sg = NULL;
}

-static void nvmet_tcp_unmap_pdu_iovec(struct nvmet_tcp_cmd *cmd)
-{
- struct scatterlist *sg;
- int i;
-
- sg = &cmd->req.sg[cmd->sg_idx];
-
- for (i = 0; i < cmd->nr_mapped; i++)
- kunmap(sg_page(&sg[i]));
-
- cmd->nr_mapped = 0;
-}
-
static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
{
struct kvec *iov = cmd->iov;
struct scatterlist *sg;
u32 length, offset, sg_offset;
+ int nr_mapped;

length = cmd->pdu_len;
- cmd->nr_mapped = DIV_ROUND_UP(length, PAGE_SIZE);
+ nr_mapped = DIV_ROUND_UP(length, PAGE_SIZE);
offset = cmd->rbytes_done;
cmd->sg_idx = offset / PAGE_SIZE;
sg_offset = offset % PAGE_SIZE;
@@ -338,7 +322,8 @@ static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
while (length) {
u32 iov_len = min_t(u32, length, sg->length - sg_offset);

- iov->iov_base = kmap(sg_page(sg)) + sg->offset + sg_offset;
+ iov->iov_base = page_address(sg_page(sg)) +
+ sg->offset + sg_offset;
iov->iov_len = iov_len;

length -= iov_len;
@@ -348,7 +333,7 @@ static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
}

iov_iter_kvec(&cmd->recv_msg.msg_iter, READ, cmd->iov,
- cmd->nr_mapped, cmd->pdu_len);
+ nr_mapped, cmd->pdu_len);
}

static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
@@ -1141,7 +1126,6 @@ static int nvmet_tcp_try_recv_data(struct nvmet_tcp_queue *queue)
cmd->rbytes_done += ret;
}

- nvmet_tcp_unmap_pdu_iovec(cmd);
if (queue->data_digest) {
nvmet_tcp_prep_recv_ddgst(cmd);
return 0;
@@ -1411,7 +1395,6 @@ static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd)
{
nvmet_req_uninit(&cmd->req);
- nvmet_tcp_unmap_pdu_iovec(cmd);
nvmet_tcp_free_cmd_buffers(cmd);
}

@@ -1424,7 +1407,6 @@ static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue)
if (nvmet_tcp_need_data_in(cmd))
nvmet_req_uninit(&cmd->req);

- nvmet_tcp_unmap_pdu_iovec(cmd);
nvmet_tcp_free_cmd_buffers(cmd);
}

--
2.37.2

2022-08-23 23:46:04

by Al Viro

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Mon, Aug 22, 2022 at 04:24:38PM +0200, Fabio M. De Francesco wrote:

> Therefore, replace the kmap() of sg_page(sg) with a page_address() and
> delete the "nr_mapped" field from "nvmet_tcp_cmd" and instead pass a
> local variable to iov_iter_kvec() from the call site in
> nvmet_tcp_map_pdu_iovec().

I'd suggest looking into switching to ITER_BVEC...

2022-08-26 18:34:28

by Fabio M. De Francesco

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Wednesday 24 August 2022 01:32:53 CEST Al Viro wrote:
> On Mon, Aug 22, 2022 at 04:24:38PM +0200, Fabio M. De Francesco wrote:
> > Therefore, replace the kmap() of sg_page(sg) with a page_address() and
> > delete the "nr_mapped" field from "nvmet_tcp_cmd" and instead pass a
> > local variable to iov_iter_kvec() from the call site in
> > nvmet_tcp_map_pdu_iovec().
>
> I'd suggest looking into switching to ITER_BVEC...

Hello Al,

It's interesting that, for the second time in a few days, I've been advised
to switch to BVEC. Christoph asked for it first.[1]

I responded to Christoph and, in the same email, I also asked Sagi
(the author of nvmet_tcp_map_pdu_iovec()) if he wanted to add something to
that discussion.

As you may have already read, I'm so new to kernel development that I still
know very little about many subsystems and drivers. I am not currently
able to tell the difference between BVEC and KVEC. I could probably try to
switch from one to the other (after learning from other code), however I won't
be able to explain in the commit message why users should better use BVEC in
this case.

My task is to remove as many kmap() call sites as I can from the entire
kernel, and this is all I did in nvmet-tcp. After doing many tens of call
site removals or conversions to kmap_local_page(), I learned not to go deep
into those details which aren't needed to accomplish the task I've been given.

However, since you and Christoph suggested doing more (I suppose in a separate
patch), I did some research which seems to confirm that iov_iter_kvec() may be
good for what it is supposed to do.

Since you are talking about BVEC again, soon after Christoph did the same,
I'd like to understand better which improvements we expect from BVEC, and
why and when developers should prefer it over KVEC.

Can you please provide further information or links to documentation?

Thanks for your time,

Fabio

[1] https://lore.kernel.org/all/8107517.T7Z3S40VBb@opensuse/



2022-08-26 19:11:55

by Keith Busch

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Fri, Aug 26, 2022 at 08:16:59PM +0200, Fabio M. De Francesco wrote:
> On Wednesday 24 August 2022 01:32:53 CEST Al Viro wrote:
> > On Mon, Aug 22, 2022 at 04:24:38PM +0200, Fabio M. De Francesco wrote:
> > > Therefore, replace the kmap() of sg_page(sg) with a page_address() and
> > > delete the "nr_mapped" field from "nvmet_tcp_cmd" and instead pass a
> > > local variable to iov_iter_kvec() from the call site in
> > > nvmet_tcp_map_pdu_iovec().
> >
> > I'd suggest looking into switching to ITER_BVEC...
>
> Hello Al,
>
> It's interesting that, for the second time in a few days, I've been advised
> to switch to BVEC. Christoph asked for it first.[1]
>
> I responded to Christoph and, in the same email, I also asked Sagi
> (the author of nvmet_tcp_map_pdu_iovec()) if he wanted to add something to
> that discussion.
>
> As you may have already read, I'm so new to kernel development that I still
> know very little about many subsystems and drivers. I am not currently
> able to tell the difference between BVEC and KVEC. I could probably try to
> switch from one to the other (after learning from other code), however I won't
> be able to explain in the commit message why users should better use BVEC in
> this case.
>
> My task is to remove as many kmap() call sites as I can from the entire
> kernel, and this is all I did in nvmet-tcp. After doing many tens of call
> site removals or conversions to kmap_local_page(), I learned not to go deep
> into those details which aren't needed to accomplish the task I've been given.
>
> However, since you and Christoph suggested doing more (I suppose in a separate
> patch), I did some research which seems to confirm that iov_iter_kvec() may be
> good for what it is supposed to do.
>
> Since you are talking about BVEC again, soon after Christoph did the same,
> I'd like to understand better which improvements we expect from BVEC, and
> why and when developers should prefer it over KVEC.
>
> Can you please provide further information or links to documentation?

This kvec/bvec talk is a digression from your original goal: reduce unnecessary
kmap() usage. If anyone wants to take on the bvec optimization in this code
path, then by all means let's do it in a separate patch.

2022-08-26 20:05:09

by Al Viro

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Fri, Aug 26, 2022 at 08:16:59PM +0200, Fabio M. De Francesco wrote:

> As you may have already read, I'm so new to kernel development that I still
> know very little about many subsystems and drivers. I am not currently
> able to tell the difference between BVEC and KVEC. I could probably try to
> switch from one to the other (after learning from other code), however I won't
> be able to explain in the commit message why users should better use BVEC in
> this case.

struct kvec: pairs of form <kernel address, length>
struct bio_vec: triples of form <page, offset, length>

Either is a way to refer to a chunk of memory; the former obviously has it
already mapped (you don't get kernel addresses otherwise), the latter doesn't
need to.

iov_iter instances might be backed by different things, including
arrays of kvec (iov_iter_kvec() constructs such) and arrays of
bio_vec (iov_iter_bvec() is the constructor for those).

iov_iter primitives (copy_to_iter/copy_from_iter/copy_page_to_iter/etc.)
work with either variant - they look at the flavour and act accordingly.

ITER_BVEC ones tend to do that kmap_local_page() + copy + kunmap_local().
ITER_KVEC obviously use memcpy() for copying and that's it.

If you need e.g. to send some subranges of some pages you could kmap them,
form kvec array, make msg.msg_iter a KVEC-backed iterator over those and
feed it to sendmsg(). Or you could take a bio_vec array instead, make
msg.msg_iter a BVEC-backed iterator over that and feed to sendmsg().

The difference is, in the latter case kmap_local() will be done on demand
*inside* ->sendmsg() instance, when it gets around to copying some data
from the source and calls something like csum_and_copy_from_iter() or
whichever primitive it chooses to use.

Why bother with mapping the damn thing in the caller and having it pinned
all along whatever ->sendmsg() you end up calling? Just give it
page/offset/length instead of address/length and let lib/iov_iter.c
do the right thing...
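
To make that concrete, here is a minimal sketch of the two ways of building
such an iterator (page/offset/len/msg are hypothetical variables; the
iov_iter_kvec()/iov_iter_bvec() calls follow the same 2022-era signature
used elsewhere in this thread):

/* KVEC: the caller must already hold a kernel address, i.e. the page has
 * to be kmap()ed up front and stays pinned until kunmap(). */
struct kvec kv = {
        .iov_base = kmap(page) + offset,
        .iov_len  = len,
};
iov_iter_kvec(&msg.msg_iter, READ, &kv, 1, len);

/* BVEC: pass page/offset/length; kmap_local_page() + copy + kunmap_local()
 * happen on demand inside the copy primitives (copy_page_to_iter() and
 * friends). */
struct bio_vec bv = {
        .bv_page   = page,
        .bv_offset = offset,
        .bv_len    = len,
};
iov_iter_bvec(&msg.msg_iter, READ, &bv, 1, len);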

2022-08-26 20:24:13

by Al Viro

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Fri, Aug 26, 2022 at 12:35:09PM -0600, Keith Busch wrote:

> > Can you please provide further information or links to documentation?
>
> This kvec/bvec talk is a digression from your original goal: reduce unnecessary
> kmap() usage. If anyone wants to take on the bvec optimization in this code
> path, then by all means let's do it in a separate patch.

Oh? And why is a switch to iov_iter_bvec() *not* a reduction of kmap() usage?
ITER_BVEC methods use kmap_local() for data transfers, TYVM...

2022-08-28 14:37:37

by Sagi Grimberg

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM


>> As you may have already read, I'm so new to kernel development that I still
>> know very little about many subsystems and drivers. I am not currently
>> able to tell the difference between BVEC and KVEC. I could probably try to
>> switch from one to the other (after learning from other code), however I won't
>> be able to explain in the commit message why users should better use BVEC in
>> this case.
>
> struct kvec: pairs of form <kernel address, length>
> struct bio_vec: triples of form <page, offset, length>
>
> Either is a way to refer to a chunk of memory; the former obviously has it
> already mapped (you don't get kernel addresses otherwise), the latter doesn't
> need to.
>
> iov_iter instances might be backed by different things, including
> arrays of kvec (iov_iter_kvec() constructs such) and arrays of
> bio_vec (iov_iter_bvec() is the constructor for those).
>
> iov_iter primitives (copy_to_iter/copy_from_iter/copy_page_to_iter/etc.)
> work with either variant - they look at the flavour and act accordingly.
>
> ITER_BVEC ones tend to do that kmap_local_page() + copy + kunmap_local().
> ITER_KVEC obviously use memcpy() for copying and that's it.
>
> If you need e.g. to send some subranges of some pages you could kmap them,
> form kvec array, make msg.msg_iter a KVEC-backed iterator over those and
> feed it to sendmsg(). Or you could take a bio_vec array instead, make
> msg.msg_iter a BVEC-backed iterator over that and feed to sendmsg().
>
> The difference is, in the latter case kmap_local() will be done on demand
> *inside* ->sendmsg() instance, when it gets around to copying some data
> from the source and calls something like csum_and_copy_from_iter() or
> whichever primitive it chooses to use.
>
> Why bother with mapping the damn thing in the caller and having it pinned
> all along whatever ->sendmsg() you end up calling? Just give it
> page/offset/length instead of address/length and let lib/iov_iter.c
> do the right thing...

Thanks for the info, Al. IIRC, the original goal of this code was to
avoid the kmap/kunmap overhead on every copy, as the buffers' lifetime
matches the request itself. However, as noted in this thread, the buffers
are never in highmem, and hence the mappings don't have an associated
overhead.

I agree that this can be converted to use bio_vec; I'll give it a try.
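
For completeness, a rough, hypothetical sketch of what such a conversion
might look like (this is not an actual patch; it assumes cmd->iov is turned
into a struct bio_vec array):

static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
{
        struct bio_vec *iov = cmd->iov; /* assumed: now a bio_vec array */
        struct scatterlist *sg;
        u32 length, offset, sg_offset;
        int nr_segs;

        length = cmd->pdu_len;
        nr_segs = DIV_ROUND_UP(length, PAGE_SIZE);
        offset = cmd->rbytes_done;
        cmd->sg_idx = offset / PAGE_SIZE;
        sg_offset = offset % PAGE_SIZE;
        sg = &cmd->req.sg[cmd->sg_idx];

        while (length) {
                u32 iov_len = min_t(u32, length, sg->length - sg_offset);

                /* No mapping here: just record page/offset/length. */
                iov->bv_page = sg_page(sg);
                iov->bv_offset = sg->offset + sg_offset;
                iov->bv_len = iov_len;

                length -= iov_len;
                sg = sg_next(sg);
                iov++;
                sg_offset = 0;
        }

        iov_iter_bvec(&cmd->recv_msg.msg_iter, READ, cmd->iov,
                      nr_segs, cmd->pdu_len);
}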

2022-08-30 20:32:09

by Fabio M. De Francesco

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Friday 26 August 2022 21:32:44 CEST Al Viro wrote:
> On Fri, Aug 26, 2022 at 08:16:59PM +0200, Fabio M. De Francesco wrote:
> > As you may have already read, I'm so new to kernel development that I still
> > know very little about many subsystems and drivers. I am not currently
> > able to tell the difference between BVEC and KVEC. I could probably try to
> > switch from one to the other (after learning from other code), however I
> > won't be able to explain in the commit message why users should better use
> > BVEC in this case.
>
> struct kvec: pairs of form <kernel address, length>
> struct bio_vec: triples of form <page, offset, length>
>
> Either is a way to refer to a chunk of memory; the former obviously has it
> already mapped (you don't get kernel addresses otherwise), the latter doesn't
> need to.
>
> iov_iter instances might be backed by different things, including
> arrays of kvec (iov_iter_kvec() constructs such) and arrays of
> bio_vec (iov_iter_bvec() is the constructor for those).
>
> iov_iter primitives (copy_to_iter/copy_from_iter/copy_page_to_iter/etc.)
> work with either variant - they look at the flavour and act accordingly.
>
> ITER_BVEC ones tend to do that kmap_local_page() + copy + kunmap_local().
> ITER_KVEC obviously use memcpy() for copying and that's it.
>
> If you need e.g. to send some subranges of some pages you could kmap them,
> form kvec array, make msg.msg_iter a KVEC-backed iterator over those and
> feed it to sendmsg(). Or you could take a bio_vec array instead, make
> msg.msg_iter a BVEC-backed iterator over that and feed to sendmsg().
>
> The difference is, in the latter case kmap_local() will be done on demand
> *inside* ->sendmsg() instance, when it gets around to copying some data
> from the source and calls something like csum_and_copy_from_iter() or
> whichever primitive it chooses to use.
>
> Why bother with mapping the damn thing in the caller and having it pinned
> all along whatever ->sendmsg() you end up calling? Just give it
> page/offset/length instead of address/length and let lib/iov_iter.c
> do the right thing...

Hi Al,

Thanks so much for the time you spent writing this detailed explanation. I
really appreciate that you spend time teaching newcomers with kindness
and patience.

I apologize for this late response, but for the past few days I've been out
of town with no PC to email the lists (the lists would have rejected
messages sent from my phone with the Gmail app).

Yesterday I saw that Sagi sent a reworked version of my patch according to
your suggestions. This evening I will send it again as patch v4, with the
cover letter it had in the previous version.

You will be acknowledged with a "Suggested-by" tag, while Sagi will get a
"Co-developed-by" and, of course, a "Signed-off-by" immediately after it.
There will be more information after the three dashes.

Again thanks,

Fabio



2022-09-01 00:05:59

by Al Viro

Subject: Re: [PATCH v3 1/1] nvmet-tcp: Don't kmap() pages which can't come from HIGHMEM

On Tue, Aug 30, 2022 at 10:06:54PM +0200, Fabio M. De Francesco wrote:

> Thanks so much for the time you spent writing this detailed explanation. I
> really appreciate that you spend time teaching newcomers with kindness
> and patience.

*blink*

You are welcome, but TBH answering your question took about 5 minutes.
The question had been fairly simple and reasonable, and the lack of
in-tree documentation covering that is my own fault, so...