From: Timo Rothenpieler
To: Linux NFS Mailing List, linux-rdma
Subject: Spurious instability with NFSoRDMA under moderate load
Date: Sun, 16 May 2021 19:29:50 +0200
Message-ID: <4da3b074-a6be-d83f-ccd4-b151557066aa@rothenpieler.org>
X-Mailing-List: linux-nfs@vger.kernel.org

This has happened 3 times so far over the last couple of months, and I do not have a clear way to reproduce it.

It happens under moderate load, when lots of nodes read from and write to the server, though not in any super intense way. Just normal program execution, writing of light logs, and other standard tasks.

The issues on the clients manifest in a multitude of ways. Most of the time, random I/O operations just fail; more rarely, they hang indefinitely and make the process unkillable. Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"

Once a client is in that state, the only way to get it back into order is a reboot.

On the server side, a single error cqe is dumped each time this problem happens. So far, I have always rebooted the server as well, to make sure everything is back in order. I am not sure if that is strictly necessary.

> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2

> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3

> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

These all happened under different versions of the 5.10 kernel, the last one under 5.10.32 today.

Switching all clients to TCP seems to make NFS work perfectly reliably.

I'm not sure how to read those error dumps, so help there would be appreciated.

Could this be similar to the spurious issues you get with UDP, where dropped packets cause havoc? Though I would not expect heavy load on IB to cause an error cqe to be logged.

Thanks,
Timo
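
On the question of how to read the error cqe dumps in the message: only the last row (offset 0x30) of each dump is non-zero, and all three share the bytes `88 13` at offsets 0x36 and 0x37. The following is a rough sketch of how that row could be decoded. It assumes the 64-byte layout of `struct mlx5_err_cqe` as found in the Linux kernel's `include/linux/mlx5/device.h` (vendor syndrome, syndrome, big-endian WQE opcode/QPN, WQE counter, signature, opcode/owner byte). The function name `decode_err_cqe_tail`, the field offsets, and the small name tables are assumptions for illustration and are not confirmed anywhere in the message; they should be checked against the source of the kernel actually running.

```python
import struct

# Assumed subset of mlx5 completion syndromes (verify against the kernel's
# include/linux/mlx5/device.h for the running kernel version).
SYNDROME_NAMES = {
    0x05: "WR_FLUSH_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
}

# Assumed subset of mlx5 send WQE opcodes.
WQE_OPCODE_NAMES = {
    0x08: "RDMA_WRITE",
    0x0A: "SEND",
    0x10: "RDMA_READ",
}


def decode_err_cqe_tail(hex_row: str) -> dict:
    """Decode bytes 0x30..0x3f of a 64-byte mlx5 error CQE dump row.

    hex_row is the 16 space-separated hex bytes printed after '00000030:'.
    """
    raw = bytes.fromhex(hex_row)  # whitespace between byte pairs is skipped
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")

    # Assumed layout of this row: 6 reserved bytes, vendor_err_synd (u8),
    # syndrome (u8), s_wqe_opcode_qpn (be32), wqe_counter (be16),
    # signature (u8), op_own (u8).
    vendor, syndrome, opcode_qpn, wqe_counter, signature, op_own = struct.unpack(
        ">6xBBIHBB", raw
    )
    wqe_opcode = opcode_qpn >> 24
    return {
        "vendor_err_synd": vendor,
        "syndrome": syndrome,
        "syndrome_name": SYNDROME_NAMES.get(syndrome, "unknown"),
        "wqe_opcode": wqe_opcode,
        "wqe_opcode_name": WQE_OPCODE_NAMES.get(wqe_opcode, "unknown"),
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,  # high nibble of the last byte
        "owner": op_own & 1,
    }


if __name__ == "__main__":
    # Last row of the first dump quoted in the message.
    row = "00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2"
    for key, value in decode_err_cqe_tail(row).items():
        print(f"{key}: {value}")
```

If the assumed layout holds, all three dumps would decode to syndrome 0x13 with vendor syndrome 0x88 on a work request with opcode 0x08, which under the tables above would read as a remote access error reported for an RDMA write. That reading rests entirely on the assumed header definitions and is only a starting point for further debugging.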