Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:56773 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751013AbcGRLl3 convert rfc822-to-8bit (ORCPT ); Mon, 18 Jul 2016 07:41:29 -0400 From: "Benjamin Coddington" To: "Trond Myklebust" Cc: "Alexey Dobriyan" , "linux-nfs@vger.kernel.org" , "Schumaker Anna" , "Weston Andros Adamson" Subject: Re: RHEL7 hang (and probably mainline) while copying big files because of subrequests merging(?) Date: Mon, 18 Jul 2016 07:43:00 -0400 Message-ID: <37A2D73A-A493-4E29-B590-E5B272390359@redhat.com> In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Sender: linux-nfs-owner@vger.kernel.org List-ID: On 11 Jul 2016, at 9:28, Trond Myklebust wrote: >> On Jul 11, 2016, at 08:59, Alexey Dobriyan >> wrote: >> >> We have a customer who was able to reliably reproduce the following >> hang: >> (hang itself is rare but there are many machines, so it is not rare) >> >> INFO: task ascp:66692 blocked for more than 120 seconds. >> >>> bt 66692 >> PID: 66692 TASK: ffff883f124ba280 CPU: 5 COMMAND: "ascp" >> >> __schedule >> schedule >> schedule_timeout >> io_schedule_timeout >> io_schedule >> nfs_wait_bit_uninterruptible >> __wait_on_bit >> out_of_line_wait_on_bit >> nfs_wait_on_request >> nfs_try_to_update_request >> nfs_setup_write_request >> nfs_writepage_setup >> nfs_updatepage >> nfs_write_end >> generic_file_buffered_write >> __generic_file_aio_write >> generic_file_aio_write >> nfs_file_write >> do_sync_write >> vfs_write >> sys_write >> system_call_fastpath >> >> ascp is Aspera secure copy program. It is multithreaded. >> When the hang happens, 5 threads are in S-state and 1 on D-state. >> >> The workload is copying ~100+ GB files over NFS3 with 10GiB ethernet >> with the following mount options: >> rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=X.X.X.X,mountvers=3,mountport=300,mountproto=udp,local_lock=none,addr=X.X.X.X >> >> We got coredump (2 actually) and observations are below: >> >> 1) one of the inode's pagecache pages has page->private available: >> struct nfs_page { >> wb_list = { >> next = 0xffff885623ac4f80, >> prev = 0xffff885623ac4f80 >> }, >> wb_page = 0xffffea01218c2600, >> wb_context = 0xffff887f2265de00, >> wb_lock_context = 0xffff887f2265de00, >> wb_index = 2649328, >> wb_offset = 0, >> wb_pgbase = 0, >> wb_bytes = 0, >> wb_kref = { >> refcount = { >> counter = 3 >> } >> }, >> wb_flags = 19, >> wb_verf = { >> data = "\000\000\000\000\000\000\000" >> }, >> wb_this_page = 0xffff885623ac4f80, >> wb_head = 0xffff885623ac4f80 >> } >> >> ->wb_list is always empty, >> ->wb_bytes is always 0 (!). >> ->wb_kref is always 3. >> ->wb_flags = PG_BUSY|PG_MAPPED|PG_INODE_REF, >> >> page->flags = 0x2869 = >> PG_writeback|PG_private|PG_active|PF_lru|PG_uptodate|PG_locked >> >> The zero-length request is created at nfs_create_request(): >> WARN_ON_ONCE(req->wb_bytes == 0) >> nfs_create_request >> nfs_setup_write_request >> nfs_updatepage >> nfs_write_end >> generic_perform_write >> generic_file_buffered_write >> __generic_file_aio_write >> generic_file_aio_write >> nfs_file_write >> do_sync_write >> vfs_write >> SyS_write >> >> with count=0 coming from generic_perform_write (where else?). >> >> Customer is saying that RHEL6 was totally OK, and there are several >> reports >> of other people hitting same bug with RHEL7: >> https://gist.github.com/Millnert/ecc10d8cc79c81b55d7f >> https://bugs.centos.org/view.php?id=9284 >> > > Why is this being reported here and not to Red Hat? Is the bug > reproducible on the upstream kernel? There's a RHEL BZ open for it: https://bugzilla.redhat.com/show_bug.cgi?id=132463 Upstream has the problem, too. The pgio layer doesn't expect zero-length requests, so an nfs_page can get locked but never added to a pageio descriptor. To create such a problem, writeback has to happen just after nfs_write_end() has created a nfs_page with wb_bytes = 0. This can happen if iov_iter_copy_from_user_atomic() fails to copy pages in generic_perform_write(). Normally, we'd just go around the loop and try the copy again, but if writeback sneaks in then the request gets locked but never added to a pgio descriptor due to the way we use conditions on zero bytes to move requests around. I see two potential ways to fix: - Just skip the creation of the request if the length is zero. We're bound to just come around and created it again. - Allow zero-length requests, but fix up __nfs_pageio_add_request, nfs_generic_pg_test, and others that use conditions on zero. I think the first option to handle the case early is the simpler fix, and best approach, since: what do we do with a zero-length request? Not creating them helps us be conservative in what we send out to the nfs_pageio_ops. I don't understand what conditions can cause iov_iter_copy_from_user_atomic to fail since pagefaults are disabled. Can anyone enlighten me? To reproduce this upstream, I've simulated iov_iter_copy_from_user_atomic() returning zero once, then inserted a small delay after nfs_write_end to allow me to sneak in a sync. Any advice on how to proceed? I'll send a patch to fix the first way otherwise. If anyone has trouble with that RHEL BZ, send me your bugzilla email and I'll give you access. Ben