Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759583Ab2HIUWi (ORCPT ); Thu, 9 Aug 2012 16:22:38 -0400 Received: from keetweej.vanheusden.com ([80.101.105.103]:33330 "EHLO keetweej.vanheusden.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752085Ab2HIUWg (ORCPT ); Thu, 9 Aug 2012 16:22:36 -0400 Date: Thu, 9 Aug 2012 22:22:31 +0200 From: folkert To: nbd-general@lists.sourceforge.net, linux-kernel@vger.kernel.org Subject: nbd randomly disconnecting Message-ID: <20120809202229.GF14886@belle.intranet.vanheusden.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Organization: www.unixexpert.nl X-Chameleon-Return-To: folkert@vanheusden.com X-Xfmail-Return-To: folkert@vanheusden.com X-Phonenumber: +31-6-41278122 X-URL: http://www.vanheusden.com/ X-PGP-KeyID: 1F28D8AE X-GPG-fingerprint: AC89 09CE 41F2 00B4 FCF2 B174 3019 0E8C 1F28 D8AE X-Key: http://pgp.surfnet.nl:11371/pks/lookup?op=get&search=0x1F28D8AE Read-Receipt-To: Reply-By: Sun Aug 5 16:09:01 CEST 2012 X-Message-Flag: www.unixexpert.nl User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6131 Lines: 120 Hi, I'm investigating some strange behaviour with NBD. Randomly my nbd-client disconnects: 21:04:20.514159 IP 172.16.86.1.12341 > 172.16.86.129.39633: Flags [.], seq 3776733737:3776735185, ack 2604128070, win 11181, options [nop,nop,TS val 1145422093 ecr 141526], length 1448 21:04:20.514164 IP 172.16.86.1.12341 > 172.16.86.129.39633: Flags [.], seq 3776735185:3776736633, ack 2604128070, win 11181, options [nop,nop,TS val 1145422093 ecr 141526], length 1448 21:04:20.514170 IP 172.16.86.1.12341 > 172.16.86.129.39633: Flags [.], seq 3776736633:3776738081, ack 2604128070, win 11181, options [nop,nop,TS val 1145422093 ecr 141526], length 1448 21:04:20.514175 IP 172.16.86.1.12341 > 172.16.86.129.39633: Flags [P.], seq 3776738081:3776739081, ack 2604128070, win 11181, options [nop,nop,TS val 1145422093 ecr 141526], length 1000 21:04:20.514287 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 21:04:20.514379 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 21:04:20.514435 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 21:04:20.514486 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 21:04:20.514549 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 21:04:21.183393 IP 172.16.86.129.39633 > 172.16.86.1.12341: Flags [R], seq 315327654, win 0, length 0 The 'R' flag means that .129 "resets the connection" which means terminates the connection. .1 = server, .129 = client this happens with both kernel 3.0 and 3.5. With nbd-client 1:3.2-1. I ran nbd-client without forking and it then shows no error message at all on the console. The kernel log shows only: Aug 9 21:04:21 mdd1 kernel: [ 866.104603] block nbd0: shutting down socket Aug 9 21:04:21 mdd1 kernel: [ 866.104748] end_request: I/O error, dev nbd0, sector 3395936 Aug 9 21:04:21 mdd1 kernel: [ 866.105132] end_request: I/O error, dev nbd0, sector 3396184 Aug 9 21:04:21 mdd1 kernel: [ 866.105501] block nbd0: queue cleared and then a panic (which is expected as the block-layer gets upset when its requests are not handled). So no reason for this problem is shown by the nbd driver. Tried it again with full debugging of the nbd kernel module: ... [ 728.431017] nbd0: request ffff88000d943ae8: got 4096 bytes data [ 728.431021] nbd0: request ffff88000d943ae8: got 4096 bytes data [ 728.431024] nbd0: request ffff88000d943ae8: got 4096 bytes data [ 728.431029] nbd0: request ffff88000d943ae8: got 4096 bytes data [ 728.431030] nbd0: request ffff88000d943ae8: done [ 728.431036] nbd0: request ffff88000df75bd8: got reply [ 728.431040] nbd0: request ffff88000df75bd8: got 4096 bytes data [ 728.431046] nbd0: request ffff88000df75bd8: got 4096 bytes data [ 728.431047] nbd0: request ffff88000df75bd8: done [ 728.431169] nbd0: request ffff88001497fdc8: dequeued (flags=1) [ 728.431175] nbd0: request ffff880016395158: dequeued (flags=1) [ 728.431177] nbd0: request ffff88000d943808: dequeued (flags=1) [ 728.431182] nbd0: request ffff88001497fdc8: sending control (read@2452893696,126976B) [ 728.432807] nbd0: request ffff880016395158: sending control (read@2453020672,126976B) [ 728.433981] nbd0: request ffff88000d943808: sending control (read@2453147648,8192B) [ 728.434443] block nbd0: shutting down socket [ 728.435348] nbd0: request ffff88000d943808: failed [ 728.435351] end_request: I/O error, dev nbd0, sector 4791304 [ 728.435807] nbd0: request ffff880016395158: failed [ 728.435809] end_request: I/O error, dev nbd0, sector 4791056 [ 728.436196] nbd0: request ffff88001497fdc8: failed [ 728.436198] end_request: I/O error, dev nbd0, sector 4790808 So for some reason the nbd subsystem shuts down the socket without any error message. This shutting down socket message is emitted by sock_shutdown(). There are 2 places were this function is called: when a signal is retrieved (which gives a kernel warning error which is not logged in this situation). The other situation is where nbd_do_it returns, either with an error (which would be send to nbd-client but that program does not show any errors so this is not the problem here) so it must be nbd_read_stat (which is repeatingly invoked by nbd_do_it) or the nbd_thread. nbd_do_it returns without a message when nbd_find_request does not return an -ENOENT. All other paths (also nbd_thread) seem to nicely log a message when there's something wrong. nbd_find_request returns an error different from -ENOENT when wait_event_interruptible returns an error. That function returns -ERESTARTSYS when the wait is interrupted. So what I think should be done (but I'm not a kernel developer so not entirely sure) is wrap that wait_event_interruptible in some loop that when it returns -ERESTARTSYS, that it just tries again. Also enhance nbd_read_stat so that in this code: req = nbd_find_request(lo, *(struct request **)reply.handle); if (IS_ERR(req)) { result = PTR_ERR(req); if (result != -ENOENT) goto harderror; that then an error is shown in the kernel log, e.g.: if (result != -ENOENT) { dev_err(disk_to_dev(lo->disk), "Search for %p interrupted: %d\n", reply.handle, result); goto harderror; } OR In nbd_do_it check if lo -> harderror != 0 and in such case, do not return 0 (like it does now) but return the lo->harderror value. Please note that this system is actually a VM under vmware workstation 7.1.4 with the vmware tools modules not installed because I have not gotten around to upgrade them to a 3.5 kernel. It is a 64 bit system with 8GB ram. Folkert van Heusden -- ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/