2024-01-03 18:03:31

by Jeffrey Layton

[permalink] [raw]
Subject: hangs during fstests testing with TLS

I'm seeing some hangs when testing with TLS in v6.7-rc8. This kernel has
the patch I sent this morning to get rid of nfsd_put, but otherwise is
stock v6.7-rc8:

[ 2125.174937] run fstests generic/126 at 2024-01-03 09:46:39
[ 2129.793577] run fstests generic/127 at 2024-01-03 09:46:44
[ 3199.661565] run fstests generic/128 at 2024-01-03 10:04:33
[ 3204.502354] run fstests generic/129 at 2024-01-03 10:04:38
[ 3208.111189] RPC: Could not send backchannel reply error: -110
[ 3384.487762] INFO: task looptest:92512 blocked for more than 120 seconds.
[ 3384.491396] Not tainted 6.7.0-rc8-g6c5361baaf84 #74
[ 3384.494103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3384.497215] task:looptest state:D stack:0 pid:92512 tgid:92512 ppid:92333 flags:0x00004002
[ 3384.499471] Call Trace:
[ 3384.500129] <TASK>
[ 3384.500717] __schedule+0x3c4/0xad0
[ 3384.501652] schedule+0x31/0xd0
[ 3384.502455] io_schedule+0x42/0x70
[ 3384.503344] folio_wait_bit_common+0x121/0x330
[ 3384.504442] ? __pfx_wake_page_function+0x10/0x10
[ 3384.505624] folio_wait_writeback+0x27/0x80
[ 3384.506639] __filemap_fdatawait_range+0x79/0xe0
[ 3384.507343] filemap_write_and_wait_range+0x81/0xb0
[ 3384.508043] nfs_wb_all+0x21/0x120 [nfs]
[ 3384.508739] nfs4_file_flush+0x6e/0xb0 [nfsv4]
[ 3384.509499] filp_flush+0x30/0x70
[ 3384.510045] filp_close+0xf/0x30
[ 3384.510639] put_files_struct+0x78/0xd0
[ 3384.511222] do_exit+0x345/0xb10
[ 3384.511764] ? handle_mm_fault+0x9e/0x360
[ 3384.512364] ? preempt_count_add+0x47/0xa0
[ 3384.513012] do_group_exit+0x2d/0x80
[ 3384.513592] __x64_sys_exit_group+0x14/0x20
[ 3384.514252] do_syscall_64+0x3f/0xf0
[ 3384.514809] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3384.515581] RIP: 0033:0x7f564911595d
[ 3384.516186] RSP: 002b:00007ffecd703338 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
[ 3384.517306] RAX: ffffffffffffffda RBX: 00007f5649211fa8 RCX: 00007f564911595d
[ 3384.518368] RDX: 00000000000000e7 RSI: ffffffffffffff28 RDI: 0000000000000000
[ 3384.519473] RBP: 00007ffecd703390 R08: 00007ffecd7032d8 R09: 00007ffecd70323f
[ 3384.520543] R10: 00007ffecd703190 R11: 0000000000000202 R12: 0000000000000001
[ 3384.521597] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f5649211fc0
[ 3384.522647] </TASK>

[root@kdevops-nfs-tls ~]# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
6 5285 0 0x0 0x0 0 nfs41_sequence_ops [nfsv4] nfsv4 SEQUENCE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1460 c805 -11 0x5 0x2a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1461 c805 -11 0x5 0x2b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1462 c805 -11 0x5 0x2c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1463 c805 -11 0x5 0x2d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1464 c805 -11 0x5 0x302113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1465 c805 -11 0x5 0x2e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1466 c805 -11 0x5 0x2f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1467 c805 -11 0x5 0x312113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1468 c805 -11 0x5 0x322113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1469 c805 -11 0x5 0x332113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1470 c805 -11 0x5 0x342113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1471 c805 -11 0x5 0x352113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1472 c805 -11 0x5 0x362113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1473 c805 -11 0x5 0x372113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1474 c805 -11 0x5 0x382113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1475 c805 -11 0x5 0x392113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1476 c805 -11 0x5 0x3a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1477 c805 -11 0x5 0x3b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1478 c805 -11 0x5 0x3c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1479 c805 -11 0x5 0x3d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1480 c805 -11 0x5 0x3e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1481 c805 -11 0x5 0x3f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1482 c805 -11 0x5 0x402113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1483 c805 -11 0x5 0x412113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1484 c805 -11 0x5 0x422113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
1485 c005 -11 0x5 0x432113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
1486 c005 -11 0x5 0x442113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
1487 c005 -11 0x5 0x452113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
1488 c005 -11 0x5 0x462113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
1489 c005 -11 0x5 0x472113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
1490 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1491 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1492 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1493 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1494 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1495 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1496 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1497 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1498 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1499 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1500 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1501 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1502 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1503 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1504 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1505 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1506 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1507 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1508 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1509 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1510 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1511 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1512 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1513 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1514 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1515 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1516 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1517 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1518 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1519 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1520 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1521 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1522 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1523 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1524 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1525 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1526 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1527 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1528 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1529 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1530 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1531 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1532 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1533 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1534 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1535 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1536 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1537 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1538 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1539 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1540 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1541 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1542 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1543 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1544 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1545 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1546 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1547 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1548 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1549 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1550 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1551 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1552 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1553 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1554 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1555 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1556 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1557 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1558 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1559 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1560 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1561 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1562 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1563 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1564 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1565 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1566 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1567 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1568 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1569 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1570 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1571 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1572 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1573 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1574 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1575 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1576 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1577 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1578 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1579 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1580 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1581 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1582 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1583 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1584 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1585 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1586 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1587 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1588 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1589 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1590 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1591 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1592 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1593 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1594 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1595 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1596 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1597 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
1598 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table

In the server, I see this in the log around the same time:

[ 2344.590202] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
[ 2349.182087] rpc-srv/tcp: nfsd: got error -74 when sending 112 bytes - shutting down socket
[ 2352.685424] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
[ 3583.897575] rpc-srv/tcp: nfsd: got error -104 when sending 112 bytes - shutting down socket

/proc/fs/nfsd/clients is empty however, so the client has been timed out
at this point.

On the wire, it looks like the client is just not communicating at all
with the server. Turning up tracepoints and rpcdebug info doesn't show
any activity whatsoever. It's like the client has just stopped trying to
do anything.

I was going to start digging more into the client to figure out what's
going on, but figured I'd send this in case others had seen something
similar. Anyone?
--
Jeff Layton <[email protected]>


2024-01-03 18:47:21

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On 3 Jan 2024, at 13:03, Jeff Layton wrote:

> I'm seeing some hangs when testing with TLS in v6.7-rc8. This kernel has
> the patch I sent this morning to get rid of nfsd_put, but otherwise is
> stock v6.7-rc8:
>
> [ 2125.174937] run fstests generic/126 at 2024-01-03 09:46:39
> [ 2129.793577] run fstests generic/127 at 2024-01-03 09:46:44
> [ 3199.661565] run fstests generic/128 at 2024-01-03 10:04:33
> [ 3204.502354] run fstests generic/129 at 2024-01-03 10:04:38
> [ 3208.111189] RPC: Could not send backchannel reply error: -110
> [ 3384.487762] INFO: task looptest:92512 blocked for more than 120 seconds.
> [ 3384.491396] Not tainted 6.7.0-rc8-g6c5361baaf84 #74
> [ 3384.494103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3384.497215] task:looptest state:D stack:0 pid:92512 tgid:92512 ppid:92333 flags:0x00004002
> [ 3384.499471] Call Trace:
> [ 3384.500129] <TASK>
> [ 3384.500717] __schedule+0x3c4/0xad0
> [ 3384.501652] schedule+0x31/0xd0
> [ 3384.502455] io_schedule+0x42/0x70
> [ 3384.503344] folio_wait_bit_common+0x121/0x330
> [ 3384.504442] ? __pfx_wake_page_function+0x10/0x10
> [ 3384.505624] folio_wait_writeback+0x27/0x80
> [ 3384.506639] __filemap_fdatawait_range+0x79/0xe0
> [ 3384.507343] filemap_write_and_wait_range+0x81/0xb0
> [ 3384.508043] nfs_wb_all+0x21/0x120 [nfs]
> [ 3384.508739] nfs4_file_flush+0x6e/0xb0 [nfsv4]
> [ 3384.509499] filp_flush+0x30/0x70
> [ 3384.510045] filp_close+0xf/0x30
> [ 3384.510639] put_files_struct+0x78/0xd0
> [ 3384.511222] do_exit+0x345/0xb10
> [ 3384.511764] ? handle_mm_fault+0x9e/0x360
> [ 3384.512364] ? preempt_count_add+0x47/0xa0
> [ 3384.513012] do_group_exit+0x2d/0x80
> [ 3384.513592] __x64_sys_exit_group+0x14/0x20
> [ 3384.514252] do_syscall_64+0x3f/0xf0
> [ 3384.514809] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [ 3384.515581] RIP: 0033:0x7f564911595d
> [ 3384.516186] RSP: 002b:00007ffecd703338 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
> [ 3384.517306] RAX: ffffffffffffffda RBX: 00007f5649211fa8 RCX: 00007f564911595d
> [ 3384.518368] RDX: 00000000000000e7 RSI: ffffffffffffff28 RDI: 0000000000000000
> [ 3384.519473] RBP: 00007ffecd703390 R08: 00007ffecd7032d8 R09: 00007ffecd70323f
> [ 3384.520543] R10: 00007ffecd703190 R11: 0000000000000202 R12: 0000000000000001
> [ 3384.521597] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f5649211fc0
> [ 3384.522647] </TASK>
>
> [root@kdevops-nfs-tls ~]# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
> 6 5285 0 0x0 0x0 0 nfs41_sequence_ops [nfsv4] nfsv4 SEQUENCE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1460 c805 -11 0x5 0x2a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1461 c805 -11 0x5 0x2b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1462 c805 -11 0x5 0x2c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1463 c805 -11 0x5 0x2d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1464 c805 -11 0x5 0x302113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1465 c805 -11 0x5 0x2e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1466 c805 -11 0x5 0x2f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1467 c805 -11 0x5 0x312113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1468 c805 -11 0x5 0x322113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1469 c805 -11 0x5 0x332113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1470 c805 -11 0x5 0x342113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1471 c805 -11 0x5 0x352113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1472 c805 -11 0x5 0x362113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1473 c805 -11 0x5 0x372113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1474 c805 -11 0x5 0x382113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1475 c805 -11 0x5 0x392113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1476 c805 -11 0x5 0x3a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1477 c805 -11 0x5 0x3b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1478 c805 -11 0x5 0x3c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1479 c805 -11 0x5 0x3d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1480 c805 -11 0x5 0x3e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1481 c805 -11 0x5 0x3f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1482 c805 -11 0x5 0x402113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1483 c805 -11 0x5 0x412113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1484 c805 -11 0x5 0x422113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> 1485 c005 -11 0x5 0x432113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> 1486 c005 -11 0x5 0x442113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> 1487 c005 -11 0x5 0x452113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> 1488 c005 -11 0x5 0x462113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> 1489 c005 -11 0x5 0x472113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> 1490 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1491 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1492 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1493 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1494 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1495 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1496 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1497 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1498 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1499 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1500 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1501 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1502 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1503 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1504 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1505 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1506 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1507 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1508 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1509 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1510 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1511 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1512 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1513 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1514 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1515 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1516 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1517 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1518 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1519 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1520 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1521 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1522 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1523 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1524 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1525 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1526 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1527 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1528 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1529 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1530 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1531 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1532 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1533 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1534 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1535 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1536 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1537 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1538 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1539 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1540 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1541 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1542 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1543 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1544 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1545 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1546 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1547 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1548 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1549 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1550 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1551 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1552 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1553 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1554 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1555 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1556 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1557 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1558 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1559 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1560 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1561 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1562 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1563 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1564 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1565 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1566 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1567 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1568 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1569 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1570 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1571 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1572 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1573 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1574 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1575 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1576 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1577 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1578 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1579 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1580 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1581 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1582 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1583 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1584 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1585 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1586 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1587 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1588 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1589 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1590 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1591 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1592 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1593 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1594 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1595 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1596 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1597 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> 1598 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
>
> In the server, I see this in the log around the same time:
>
> [ 2344.590202] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
> [ 2349.182087] rpc-srv/tcp: nfsd: got error -74 when sending 112 bytes - shutting down socket
> [ 2352.685424] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
> [ 3583.897575] rpc-srv/tcp: nfsd: got error -104 when sending 112 bytes - shutting down socket

This looks like it started out as the problem I've been sending patches to
fix on 6.7, latest here:
https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/

.. however whenever I encounter the issue, the client reconnects the
transport again - so I think there might be an additional problem here. It
would be interesting to turn up some tracepoints to see what the sunrpc
scheduler is doing once the problem occurs. The -EAGAIN tasks should be
timing out and reconnecting the transport, I think.

Ben


2024-01-03 18:52:21

by Jeffrey Layton

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On Wed, 2024-01-03 at 13:47 -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 13:03, Jeff Layton wrote:
>
> > I'm seeing some hangs when testing with TLS in v6.7-rc8. This kernel has
> > the patch I sent this morning to get rid of nfsd_put, but otherwise is
> > stock v6.7-rc8:
> >
> > [ 2125.174937] run fstests generic/126 at 2024-01-03 09:46:39
> > [ 2129.793577] run fstests generic/127 at 2024-01-03 09:46:44
> > [ 3199.661565] run fstests generic/128 at 2024-01-03 10:04:33
> > [ 3204.502354] run fstests generic/129 at 2024-01-03 10:04:38
> > [ 3208.111189] RPC: Could not send backchannel reply error: -110
> > [ 3384.487762] INFO: task looptest:92512 blocked for more than 120 seconds.
> > [ 3384.491396] Not tainted 6.7.0-rc8-g6c5361baaf84 #74
> > [ 3384.494103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3384.497215] task:looptest state:D stack:0 pid:92512 tgid:92512 ppid:92333 flags:0x00004002
> > [ 3384.499471] Call Trace:
> > [ 3384.500129] <TASK>
> > [ 3384.500717] __schedule+0x3c4/0xad0
> > [ 3384.501652] schedule+0x31/0xd0
> > [ 3384.502455] io_schedule+0x42/0x70
> > [ 3384.503344] folio_wait_bit_common+0x121/0x330
> > [ 3384.504442] ? __pfx_wake_page_function+0x10/0x10
> > [ 3384.505624] folio_wait_writeback+0x27/0x80
> > [ 3384.506639] __filemap_fdatawait_range+0x79/0xe0
> > [ 3384.507343] filemap_write_and_wait_range+0x81/0xb0
> > [ 3384.508043] nfs_wb_all+0x21/0x120 [nfs]
> > [ 3384.508739] nfs4_file_flush+0x6e/0xb0 [nfsv4]
> > [ 3384.509499] filp_flush+0x30/0x70
> > [ 3384.510045] filp_close+0xf/0x30
> > [ 3384.510639] put_files_struct+0x78/0xd0
> > [ 3384.511222] do_exit+0x345/0xb10
> > [ 3384.511764] ? handle_mm_fault+0x9e/0x360
> > [ 3384.512364] ? preempt_count_add+0x47/0xa0
> > [ 3384.513012] do_group_exit+0x2d/0x80
> > [ 3384.513592] __x64_sys_exit_group+0x14/0x20
> > [ 3384.514252] do_syscall_64+0x3f/0xf0
> > [ 3384.514809] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > [ 3384.515581] RIP: 0033:0x7f564911595d
> > [ 3384.516186] RSP: 002b:00007ffecd703338 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
> > [ 3384.517306] RAX: ffffffffffffffda RBX: 00007f5649211fa8 RCX: 00007f564911595d
> > [ 3384.518368] RDX: 00000000000000e7 RSI: ffffffffffffff28 RDI: 0000000000000000
> > [ 3384.519473] RBP: 00007ffecd703390 R08: 00007ffecd7032d8 R09: 00007ffecd70323f
> > [ 3384.520543] R10: 00007ffecd703190 R11: 0000000000000202 R12: 0000000000000001
> > [ 3384.521597] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f5649211fc0
> > [ 3384.522647] </TASK>
> >
> > [root@kdevops-nfs-tls ~]# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
> > 6 5285 0 0x0 0x0 0 nfs41_sequence_ops [nfsv4] nfsv4 SEQUENCE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1460 c805 -11 0x5 0x2a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1461 c805 -11 0x5 0x2b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1462 c805 -11 0x5 0x2c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1463 c805 -11 0x5 0x2d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1464 c805 -11 0x5 0x302113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1465 c805 -11 0x5 0x2e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1466 c805 -11 0x5 0x2f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1467 c805 -11 0x5 0x312113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1468 c805 -11 0x5 0x322113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1469 c805 -11 0x5 0x332113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1470 c805 -11 0x5 0x342113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1471 c805 -11 0x5 0x352113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1472 c805 -11 0x5 0x362113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1473 c805 -11 0x5 0x372113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1474 c805 -11 0x5 0x382113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1475 c805 -11 0x5 0x392113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1476 c805 -11 0x5 0x3a2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1477 c805 -11 0x5 0x3b2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1478 c805 -11 0x5 0x3c2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1479 c805 -11 0x5 0x3d2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1480 c805 -11 0x5 0x3e2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1481 c805 -11 0x5 0x3f2113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1482 c805 -11 0x5 0x402113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1483 c805 -11 0x5 0x412113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1484 c805 -11 0x5 0x422113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_connect_status [sunrpc] q:xprt_sending
> > 1485 c005 -11 0x5 0x432113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> > 1486 c005 -11 0x5 0x442113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> > 1487 c005 -11 0x5 0x452113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> > 1488 c005 -11 0x5 0x462113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> > 1489 c005 -11 0x5 0x472113be 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_transmit_status [sunrpc] q:xprt_sending
> > 1490 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1491 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1492 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1493 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1494 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1495 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1496 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1497 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1498 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1499 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1500 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1501 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1502 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1503 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1504 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1505 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1506 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1507 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1508 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1509 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1510 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1511 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1512 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1513 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1514 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1515 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1516 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1517 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1518 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1519 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1520 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1521 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1522 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1523 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1524 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1525 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1526 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1527 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1528 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1529 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1530 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1531 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1532 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1533 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1534 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1535 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1536 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1537 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1538 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1539 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1540 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1541 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1542 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1543 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1544 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1545 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1546 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1547 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1548 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1549 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1550 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1551 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1552 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1553 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1554 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1555 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1556 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1557 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1558 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1559 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1560 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1561 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1562 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1563 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1564 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1565 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1566 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1567 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1568 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1569 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1570 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1571 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1572 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1573 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1574 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1575 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1576 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1577 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1578 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1579 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1580 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1581 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1582 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1583 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1584 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1585 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1586 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1587 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1588 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1589 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1590 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1591 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1592 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1593 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1594 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1595 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1596 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1597 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> > 1598 c005 0 0x5 0x0 0 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:rpc_prepare_task [sunrpc] q:ForeChannel Slot table
> >
> > In the server, I see this in the log around the same time:
> >
> > [ 2344.590202] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
> > [ 2349.182087] rpc-srv/tcp: nfsd: got error -74 when sending 112 bytes - shutting down socket
> > [ 2352.685424] rpc-srv/tcp: nfsd: got error -90 when sending 112 bytes - shutting down socket
> > [ 3583.897575] rpc-srv/tcp: nfsd: got error -104 when sending 112 bytes - shutting down socket
>
> This looks like it started out as the problem I've been sending patches to
> fix on 6.7, latest here:
> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>
> .. however whenever I encounter the issue, the client reconnects the
> transport again - so I think there might be an additional problem here. It
> would be interesting to turn up some tracepoints to see what the sunrpc
> scheduler is doing once the problem occurs. The -EAGAIN tasks should be
> timing out and reconnecting the transport, I think.
>

I did turn up all of the sunrpc and NFS client tracepoints, but saw no
output whatsoever. I also tried turning up all of the dprintk's but that
also showed nothing.

So far, this seems to be pretty reproducible by running fstests with TLS
enabled. I'm going to try testing a kernel with your v3 patchset. Stay
tuned...

--
Jeff Layton <[email protected]>

2024-01-03 19:25:59

by Chuck Lever III

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS



> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
>
> This looks like it started out as the problem I've been sending patches to
> fix on 6.7, latest here:
> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>
> .. however whenever I encounter the issue, the client reconnects the
> transport again - so I think there might be an additional problem here.

I'm looking at the same problem as you, Ben. It doesn't seem to be
similar to what Jeff reports.

But I'm wondering if gerry-rigging the timeouts is the right answer
for backchannel replies. The problem, fundamentally, is that when a
forechannel RPC task holds the transport lock, the backchannel's reply
transmit path thinks that means the transport connection is down and
triggers a transport disconnect.

The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
clear.

NFSD's backchannel client has its own set of quirks that make this
situation worse. For example, a reply transmit failure on the client
will screw up the one backchannel session slot, because the client will
have advanced the slot sequence number, but the server will never see
the reply to tell it to do the same.


Jeff says:
> I did turn up all of the sunrpc and NFS client tracepoints, but saw no
> output whatsoever. I also tried turning up all of the dprintk's but that
> also showed nothing.

Try boosting tlshd debugging for the whole test.

Also, the server might be doing something weird, so turn up debugging
there too.


--
Chuck Lever


2024-01-03 20:19:40

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On 3 Jan 2024, at 14:12, Chuck Lever III wrote:

>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
>>
>> This looks like it started out as the problem I've been sending patches to
>> fix on 6.7, latest here:
>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>>
>> .. however whenever I encounter the issue, the client reconnects the
>> transport again - so I think there might be an additional problem here.
>
> I'm looking at the same problem as you, Ben. It doesn't seem to be
> similar to what Jeff reports.
>
> But I'm wondering if gerry-rigging the timeouts is the right answer
> for backchannel replies. The problem, fundamentally, is that when a
> forechannel RPC task holds the transport lock, the backchannel's reply
> transmit path thinks that means the transport connection is down and
> triggers a transport disconnect.

Why shouldn't backchannel replies have normal timeout values?

> The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
> clear.

Seems like it should mean that the reply couldn't be sent within (what
should be) the timeout values for the client's state management transport.

I'm glad you're seeing this problem too. I was worried that something was
seriously different about my test setup.

Ben


2024-01-03 21:46:21

by Chuck Lever III

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS



> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
>
>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> This looks like it started out as the problem I've been sending patches to
>>> fix on 6.7, latest here:
>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>>>
>>> .. however whenever I encounter the issue, the client reconnects the
>>> transport again - so I think there might be an additional problem here.
>>
>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>> similar to what Jeff reports.
>>
>> But I'm wondering if gerry-rigging the timeouts is the right answer
>> for backchannel replies. The problem, fundamentally, is that when a
>> forechannel RPC task holds the transport lock, the backchannel's reply
>> transmit path thinks that means the transport connection is down and
>> triggers a transport disconnect.
>
> Why shouldn't backchannel replies have normal timeout values?

RPC Replies are "send and forget". The server forechannel sends
its Replies without a timeout. There is no such thing as a
retransmitted RPC Reply (though a reliable transport might
retransmit portions of it, the RPC server itself is not aware of
that).

And I don't see anything in the client's backchannel path that
makes me think there's a different protocol-level requirement
in the backchannel.


>> The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
>> clear.
>
> Seems like it should mean that the reply couldn't be sent within (what
> should be) the timeout values for the client's state management transport.

It looks to me like backchannel replies are HARD, not SOFT.
That suggests call_bc_transmit_status() is not expecting
ETIMEDOUT to mean "request timed out", which is the usual
meaning of ETIMEDOUT -- and that's only for SOFT RPCs.

See also 762e4e67b356 ("SUNRPC: Refactor RPC call encoding"),
which added RPC_TASK_NO_RETRANS_TIMEOUT to rpc_run_bc_task().

So my "read" of the code is that there currently is no
timeout set, and ETIMEDOUT isn't expected to mean the send
timed out. This code does not rely on a timeout to sent
replies at all. It seems to be reusing the ETIMEDOUT error
that was set as a side-effect of something else.


> I'm glad you're seeing this problem too. I was worried that something was
> seriously different about my test setup.

For the public record...

Testing with NFSv4.2, I see the client doing async COPY
operations, but then not replying to the server's CB_OFFLOAD
notifications.

The client is not actually dropping the CB_OFFLOADs, though;
it processes them but then never sends the reply because
xprt_prepare_transmit() fails to lock the transport when
sending the reply. The sleep_on_timeout in xprt_reserve_xprt
is falling through because, as you observed, the rqst timeout
settings are all zero.

So the client has processed the CB_OFFLOAD and advanced the
slot sequence number, but because the server doesn't see a
reply it leaves its copy of the slot sequence number alone.

The server sends the next CB_OFFLOAD and the client replies
with NFS4ERR_RETRY_UNCACHED_REP because the slot sequence
numbers no longer match. The client tosses that CB_OFFLOAD
notification.

The server doesn't retransmit it after the new session is
created because DESTROY_SESSION kills the backchannel
rpc_clnt and all its tasks. Or maybe there's some other bug
in there, but I think the server should retransmit and it
doesn't.

So the client copy operation is stuck waiting for a
CB_OFFLOAD notification that never arrives. The client could
send a COPY_NOTIFY, to check on the pending COPY, but it
doesn't.

Certainly the client expects a retransmit. See 3832591e6fa5
("SUNRPC: Handle connection issues correctly on the back
channel"). But I'm not sure NFSD ever did this -- I will
look into addressing that.

But also, we don't want the client spuriously disconnecting
and recycling its sessions just because the write lock is
held by another task right at that moment. That's an enormous
bubble in the throughput pipeline. The problem seems to be
worse when the forechannel is under load. I think
transmitting backchannel replies needs to be more reliable.


--
Chuck Lever


2024-01-04 12:22:59

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On 3 Jan 2024, at 16:46, Chuck Lever III wrote:

>> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
>>
>>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
>>>>
>>>> This looks like it started out as the problem I've been sending patches to
>>>> fix on 6.7, latest here:
>>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>>>>
>>>> .. however whenever I encounter the issue, the client reconnects the
>>>> transport again - so I think there might be an additional problem here.
>>>
>>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>>> similar to what Jeff reports.
>>>
>>> But I'm wondering if gerry-rigging the timeouts is the right answer
>>> for backchannel replies. The problem, fundamentally, is that when a
>>> forechannel RPC task holds the transport lock, the backchannel's reply
>>> transmit path thinks that means the transport connection is down and
>>> triggers a transport disconnect.
>>
>> Why shouldn't backchannel replies have normal timeout values?
>
> RPC Replies are "send and forget". The server forechannel sends
> its Replies without a timeout. There is no such thing as a
> retransmitted RPC Reply (though a reliable transport might
> retransmit portions of it, the RPC server itself is not aware of
> that).
>
> And I don't see anything in the client's backchannel path that
> makes me think there's a different protocol-level requirement
> in the backchannel.

Its not strictly a protocol thing, the timeouts are used to decide what to
do with a req or flag the transport state even if the request doesn't make
it to the wire. That's why the zero timeout values for this req improperly
resets the transport.

Ben


2024-01-04 13:42:43

by Jeffrey Layton

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On Thu, 2024-01-04 at 07:22 -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
>
> > > On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <[email protected]> wrote:
> > >
> > > On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> > >
> > > > > On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
> > > > >
> > > > > This looks like it started out as the problem I've been sending patches to
> > > > > fix on 6.7, latest here:
> > > > > https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
> > > > >
> > > > > .. however whenever I encounter the issue, the client reconnects the
> > > > > transport again - so I think there might be an additional problem here.
> > > >
> > > > I'm looking at the same problem as you, Ben. It doesn't seem to be
> > > > similar to what Jeff reports.
> > > >
> > > > But I'm wondering if gerry-rigging the timeouts is the right answer
> > > > for backchannel replies. The problem, fundamentally, is that when a
> > > > forechannel RPC task holds the transport lock, the backchannel's reply
> > > > transmit path thinks that means the transport connection is down and
> > > > triggers a transport disconnect.
> > >
> > > Why shouldn't backchannel replies have normal timeout values?
> >
> > RPC Replies are "send and forget". The server forechannel sends
> > its Replies without a timeout. There is no such thing as a
> > retransmitted RPC Reply (though a reliable transport might
> > retransmit portions of it, the RPC server itself is not aware of
> > that).
> >
> > And I don't see anything in the client's backchannel path that
> > makes me think there's a different protocol-level requirement
> > in the backchannel.
>
> Its not strictly a protocol thing, the timeouts are used to decide what to
> do with a req or flag the transport state even if the request doesn't make
> it to the wire. That's why the zero timeout values for this req improperly
> resets the transport.
>

FWIW, Ben's v3 patchset seems to fix the problem for me, and I was able
to run 3 loops of fstests with them in place, whereas before I couldn't
even make it through one full run without it hanging.

I'm happy to test v4 when you're ready.
--
Jeff Layton <[email protected]>

2024-01-04 14:20:47

by Chuck Lever III

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On Thu, Jan 04, 2024 at 07:22:18AM -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
>
> >> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <[email protected]> wrote:
> >>
> >> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> >>
> >>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
> >>>>
> >>>> This looks like it started out as the problem I've been sending patches to
> >>>> fix on 6.7, latest here:
> >>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
> >>>>
> >>>> .. however whenever I encounter the issue, the client reconnects the
> >>>> transport again - so I think there might be an additional problem here.
> >>>
> >>> I'm looking at the same problem as you, Ben. It doesn't seem to be
> >>> similar to what Jeff reports.
> >>>
> >>> But I'm wondering if gerry-rigging the timeouts is the right answer
> >>> for backchannel replies. The problem, fundamentally, is that when a
> >>> forechannel RPC task holds the transport lock, the backchannel's reply
> >>> transmit path thinks that means the transport connection is down and
> >>> triggers a transport disconnect.
> >>
> >> Why shouldn't backchannel replies have normal timeout values?
> >
> > RPC Replies are "send and forget". The server forechannel sends
> > its Replies without a timeout. There is no such thing as a
> > retransmitted RPC Reply (though a reliable transport might
> > retransmit portions of it, the RPC server itself is not aware of
> > that).
> >
> > And I don't see anything in the client's backchannel path that
> > makes me think there's a different protocol-level requirement
> > in the backchannel.
>
> Its not strictly a protocol thing, the timeouts are used to decide what to
> do with a req or flag the transport state even if the request doesn't make
> it to the wire. That's why the zero timeout values for this req improperly
> resets the transport.

I guess I'm harping on this a bit because forechannel v. backchannel
is already confusing enough. The use of timeouts for RPC Replies is
just heaping on to that confusion.

If we're going to keep an explicit timeout when sending the Reply,
it should have a little documentation. I suggest adding this to
xprt_init_bc_request() before the new call to
xprt_init_majortimeo():

/*
* Backchannel Replies are sent with !RPC_TASK_SOFT and
* RPC_TASK_NO_RETRANS_TIMEOUT. The major timeout setting
* affects only how long each Reply waits to be sent when
* a transport connection cannot be established.
*/
xprt_init_majortimeo(task, req, ...

HTH

--
Chuck Lever

2024-01-04 14:37:34

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hangs during fstests testing with TLS

On 4 Jan 2024, at 9:16, Chuck Lever wrote:

> On Thu, Jan 04, 2024 at 07:22:18AM -0500, Benjamin Coddington wrote:
>> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
>>
>>>> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <[email protected]> wrote:
>>>>
>>>> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
>>>>
>>>>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <[email protected]> wrote:
>>>>>>
>>>>>> This looks like it started out as the problem I've been sending patches to
>>>>>> fix on 6.7, latest here:
>>>>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@redhat.com/
>>>>>>
>>>>>> .. however whenever I encounter the issue, the client reconnects the
>>>>>> transport again - so I think there might be an additional problem here.
>>>>>
>>>>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>>>>> similar to what Jeff reports.
>>>>>
>>>>> But I'm wondering if gerry-rigging the timeouts is the right answer
>>>>> for backchannel replies. The problem, fundamentally, is that when a
>>>>> forechannel RPC task holds the transport lock, the backchannel's reply
>>>>> transmit path thinks that means the transport connection is down and
>>>>> triggers a transport disconnect.
>>>>
>>>> Why shouldn't backchannel replies have normal timeout values?
>>>
>>> RPC Replies are "send and forget". The server forechannel sends
>>> its Replies without a timeout. There is no such thing as a
>>> retransmitted RPC Reply (though a reliable transport might
>>> retransmit portions of it, the RPC server itself is not aware of
>>> that).
>>>
>>> And I don't see anything in the client's backchannel path that
>>> makes me think there's a different protocol-level requirement
>>> in the backchannel.
>>
>> Its not strictly a protocol thing, the timeouts are used to decide what to
>> do with a req or flag the transport state even if the request doesn't make
>> it to the wire. That's why the zero timeout values for this req improperly
>> resets the transport.
>
> I guess I'm harping on this a bit because forechannel v. backchannel
> is already confusing enough. The use of timeouts for RPC Replies is
> just heaping on to that confusion.

Yeah, its super confusing for me too. Add to this the fact that we do it
completely different for 4.0 and 4.1, but all the variable names are
similar and/or reversed.

> If we're going to keep an explicit timeout when sending the Reply,
> it should have a little documentation. I suggest adding this to
> xprt_init_bc_request() before the new call to
> xprt_init_majortimeo():
>
> /*
> * Backchannel Replies are sent with !RPC_TASK_SOFT and
> * RPC_TASK_NO_RETRANS_TIMEOUT. The major timeout setting
> * affects only how long each Reply waits to be sent when
> * a transport connection cannot be established.
> */
> xprt_init_majortimeo(task, req, ...

Ok, I will add that on 2/2 v4 which is getting some basic testing ATM.

Ben