From: Ilya Dryomov
Date: Wed, 29 Mar 2017 11:21:44 +0200
Subject: Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable@vger.kernel.org,
 Sergey Jerusalimov, Jeff Layton
In-Reply-To: <20170328133040.GJ18241@dhcp22.suse.cz>

On Tue, Mar 28, 2017 at 3:30 PM, Michal Hocko wrote:
> On Tue 28-03-17 15:23:58, Ilya Dryomov wrote:
>> On Tue, Mar 28, 2017 at 2:43 PM, Michal Hocko wrote:
>> > On Tue 28-03-17 14:30:45, Greg KH wrote:
>> >> 4.4-stable review patch.  If anyone has any objections, please let me know.
>> >
>> > I haven't seen the original patch, but the changelog makes me worried.
>> > How exactly is this a problem?  Where do we lock up?  Does rbd/libceph
>> > take any XFS locks?
>>
>> No, it doesn't.  This is just another instance of "using GFP_KERNEL on
>> the writeback path may lead to a deadlock", with nothing extra to it.
>>
>> XFS is writing out data, the libceph messenger worker tries to open
>> a socket and recurses back into XFS because the sockfs inode is
>> allocated with GFP_KERNEL.  The message carrying some of that data
>> never goes out, and eventually we get a deadlock.
>>
>> I've only included the offending stack trace.
>> I guess I should have stressed that the ceph-msgr workqueue is used
>> for memory reclaim.
>
> Could you be more specific about the lockup scenario?  I still do not
> get how this would lead to a deadlock.

This is a set of stack traces from http://tracker.ceph.com/issues/19309
(linked in the changelog):

Workqueue: ceph-msgr con_work [libceph]
 ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x1bd/0x200
 [] ? ttwu_do_wakeup+0x2c/0x120
 [] ? ttwu_do_activate.constprop.135+0x66/0x70
 [] wait_for_completion+0xbf/0x180
 [] ? try_to_wake_up+0x390/0x390
 [] flush_work+0x165/0x250
 [] ? worker_detach_from_pool+0xd0/0xd0
 [] xlog_cil_force_lsn+0x81/0x200 [xfs]
 [] ? __slab_free+0xee/0x234
 [] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
 [] ? lookup_page_cgroup_used+0xe/0x30
 [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
 [] xfs_log_force_lsn+0x3f/0xf0 [xfs]
 [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
 [] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
 [] ? wake_atomic_t_function+0x40/0x40
 [] xfs_reclaim_inode+0xa3/0x330 [xfs]
 [] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
 [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 [] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
 [] super_cache_scan+0x178/0x180
 [] shrink_slab_node+0x14e/0x340
 [] ? mem_cgroup_iter+0x16b/0x450
 [] shrink_slab+0x100/0x140
 [] do_try_to_free_pages+0x335/0x490
 [] try_to_free_pages+0xb9/0x1f0
 [] ? __alloc_pages_direct_compact+0x69/0x1be
 [] __alloc_pages_nodemask+0x69a/0xb40
 [] alloc_pages_current+0x9e/0x110
 [] new_slab+0x2c5/0x390
 [] __slab_alloc+0x33b/0x459
 [] ? sock_alloc_inode+0x2d/0xd0
 [] ? inet_sendmsg+0x71/0xc0
 [] ? sock_alloc_inode+0x2d/0xd0
 [] kmem_cache_alloc+0x1a2/0x1b0
 [] sock_alloc_inode+0x2d/0xd0
 [] alloc_inode+0x26/0xa0
 [] new_inode_pseudo+0x1a/0x70
 [] sock_alloc+0x1e/0x80
 [] __sock_create+0x95/0x220
 [] sock_create_kern+0x24/0x30
 [] con_work+0xef9/0x2050 [libceph]
 [] ? rbd_img_request_submit+0x4c/0x60 [rbd]
 [] process_one_work+0x159/0x4f0
 [] worker_thread+0x11b/0x530
 [] ? create_worker+0x1d0/0x1d0
 [] kthread+0xc9/0xe0
 [] ? flush_kthread_worker+0x90/0x90
 [] ret_from_fork+0x58/0x90
 [] ? flush_kthread_worker+0x90/0x90

We are writing out data on ceph_connection X:

  ceph_con_workfn
    mutex_lock(&con->mutex)  # ceph_connection::mutex
    try_write
      ceph_tcp_connect
        sock_create_kern
          GFP_KERNEL allocation
            allocator recurses into XFS, more I/O is issued

Workqueue: rbd rbd_request_workfn [rbd]
 ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
 ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0x16/0x20
 [] __mutex_lock_slowpath+0xa5/0x110
 [] ? ceph_str_hash+0x26/0x80 [libceph]
 [] mutex_lock+0x36/0x4a
 [] ceph_con_send+0x4d/0x130 [libceph]
 [] __send_queued+0x120/0x150 [libceph]
 [] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
 [] ceph_osdc_start_request+0x51/0x80 [libceph]
 [] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
 [] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
 [] rbd_img_request_submit+0x4c/0x60 [rbd]
 [] rbd_request_workfn+0x305/0x410 [rbd]
 [] process_one_work+0x159/0x4f0
 [] worker_thread+0x11b/0x530
 [] ? create_worker+0x1d0/0x1d0
 [] kthread+0xc9/0xe0
 [] ? flush_kthread_worker+0x90/0x90
 [] ret_from_fork+0x58/0x90
 [] ? flush_kthread_worker+0x90/0x90

Here is that I/O.
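(For reference, the patch under review breaks this recursion by scoping the
allocation context: ceph_tcp_connect() brackets sock_create_kern() with
memalloc_noio_save()/memalloc_noio_restore(), so the sockfs inode allocation
is implicitly degraded to GFP_NOIO and cannot re-enter filesystem reclaim.
A rough sketch, not the verbatim diff:

```c
/* net/ceph/messenger.c -- sketch of the change under review */
static int ceph_tcp_connect(struct ceph_connection *con)
{
	struct socket *sock;
	unsigned int noio_flag;
	int ret;

	/*
	 * sock_create_kern() allocates the sockfs inode with GFP_KERNEL.
	 * Marking the task with memalloc_noio_save() makes that allocation
	 * behave as GFP_NOIO, so it cannot recurse into XFS reclaim while
	 * con->mutex is held.
	 */
	noio_flag = memalloc_noio_save();
	ret = sock_create_kern(read_pnet(&con->msgr->net),
			       con->peer_addr.in_addr.ss_family,
			       SOCK_STREAM, IPPROTO_TCP, &sock);
	memalloc_noio_restore(noio_flag);
	if (ret)
		return ret;
	/* ... rest of ceph_tcp_connect() unchanged ... */
}
```
)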
We grab ceph_osd_client::request_mutex, but ceph_connection::mutex is
being held by the worker that recursed into XFS:

  rbd_queue_workfn
    ceph_osdc_start_request
      mutex_lock(&osdc->request_mutex)
      ceph_con_send
        mutex_lock(&con->mutex)  # deadlock

Workqueue: ceph-msgr con_work [libceph]
 ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
 ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0x16/0x20
 [] __mutex_lock_slowpath+0xa5/0x110
 [] mutex_lock+0x36/0x4a
 [] alloc_msg+0xcf/0x210 [libceph]
 [] con_work+0x1675/0x2050 [libceph]
 [] ? ttwu_do_wakeup+0x2c/0x120
 [] ? ttwu_do_activate.constprop.135+0x66/0x70
 [] process_one_work+0x159/0x4f0
 [] worker_thread+0x11b/0x530
 [] ? create_worker+0x1d0/0x1d0
 [] kthread+0xc9/0xe0
 [] ? flush_kthread_worker+0x90/0x90
 [] ret_from_fork+0x58/0x90
 [] ? flush_kthread_worker+0x90/0x90

Workqueue: ceph-msgr con_work [libceph]
 ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
 ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0x16/0x20
 [] __mutex_lock_slowpath+0xa5/0x110
 [] mutex_lock+0x36/0x4a
 [] alloc_msg+0xcf/0x210 [libceph]
 [] con_work+0x1675/0x2050 [libceph]
 [] ? put_prev_entity+0x3c/0x2e0
 [] ? sched_clock_cpu+0x95/0xd0
 [] process_one_work+0x159/0x4f0
 [] worker_thread+0x11b/0x530
 [] ? create_worker+0x1d0/0x1d0
 [] kthread+0xc9/0xe0
 [] ? flush_kthread_worker+0x90/0x90
 [] ret_from_fork+0x58/0x90
 [] ? flush_kthread_worker+0x90/0x90

These two are replies on ceph_connections Y and Z, which need
ceph_osd_client::request_mutex to figure out which requests can be
completed:

  alloc_msg
    get_reply
      mutex_lock(&osdc->request_mutex)

Eventually everything else blocks on ceph_osd_client::request_mutex as
well, since it is used for both submitting requests and handling replies.

This really is a straightforward "using GFP_KERNEL on the writeback path
isn't allowed" case.  I'm not sure what made you worried here.

Thanks,

                Ilya