Date: Wed, 29 Mar 2017 07:05:16 -0400
From: Brian Foster
To: Michal Hocko
Cc: Ilya Dryomov, Greg Kroah-Hartman, "linux-kernel@vger.kernel.org",
	stable@vger.kernel.org, Sergey Jerusalimov, Jeff Layton,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
Message-ID: <20170329110515.GA24681@bfoster.bfoster>
References: <20170328122559.966310440@linuxfoundation.org>
	<20170328122601.905696872@linuxfoundation.org>
	<20170328124312.GE18241@dhcp22.suse.cz>
	<20170328133040.GJ18241@dhcp22.suse.cz>
	<20170329104126.GF27994@dhcp22.suse.cz>
In-Reply-To: <20170329104126.GF27994@dhcp22.suse.cz>
User-Agent: Mutt/1.8.0 (2017-02-23)
List-ID: <linux-kernel.vger.kernel.org>

On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> [CC xfs guys]
>
> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> [...]
> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> > (linked in the changelog):
> >
> > Workqueue: ceph-msgr con_work [libceph]
> >  ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> >  0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> >  ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> > Call Trace:
> >  [] schedule+0x29/0x70
> >  [] schedule_timeout+0x1bd/0x200
> >  [] ? ttwu_do_wakeup+0x2c/0x120
> >  [] ? ttwu_do_activate.constprop.135+0x66/0x70
> >  [] wait_for_completion+0xbf/0x180
> >  [] ? try_to_wake_up+0x390/0x390
> >  [] flush_work+0x165/0x250
>
> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work),
> right? I kind of got lost where this waits on an IO.
>

Yep. That means a CIL push is already in progress. We wait on that to
complete here. After that, the resulting task queues execution of
xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
submit I/O to the log. I don't see any reference to xlog_cil_push()
anywhere in the traces here or in the bug referenced above, however..?

Brian

> >  [] ? worker_detach_from_pool+0xd0/0xd0
> >  [] xlog_cil_force_lsn+0x81/0x200 [xfs]
> >  [] ? __slab_free+0xee/0x234
> >  [] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> >  [] ? lookup_page_cgroup_used+0xe/0x30
> >  [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> >  [] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> >  [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> >  [] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> >  [] ? wake_atomic_t_function+0x40/0x40
> >  [] xfs_reclaim_inode+0xa3/0x330 [xfs]
> [...]
> >  [] sock_alloc+0x1e/0x80
> >  [] __sock_create+0x95/0x220
> >  [] sock_create_kern+0x24/0x30
> >  [] con_work+0xef9/0x2050 [libceph]
> >  [] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> >  [] process_one_work+0x159/0x4f0
> >  [] worker_thread+0x11b/0x530
> >  [] ? create_worker+0x1d0/0x1d0
> >  [] kthread+0xc9/0xe0
> >  [] ? flush_kthread_worker+0x90/0x90
> >  [] ret_from_fork+0x58/0x90
> >  [] ? flush_kthread_worker+0x90/0x90
> >
> > We are writing out data on ceph_connection X:
> >
> > ceph_con_workfn
> >   mutex_lock(&con->mutex)  # ceph_connection::mutex
> >   try_write
> >     ceph_tcp_connect
> >       sock_create_kern
> >         GFP_KERNEL allocation
> >           allocator recurses into XFS, more I/O is issued
>
> I am not sure this is true actually. XFS tends to do IO from separate
> kworkers rather than from the direct reclaim context.
>
> > Workqueue: rbd rbd_request_workfn [rbd]
> >  ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
> >  0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
> >  ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
> > Call Trace:
> >  [] schedule+0x29/0x70
> >  [] schedule_preempt_disabled+0x16/0x20
> >  [] __mutex_lock_slowpath+0xa5/0x110
> >  [] ? ceph_str_hash+0x26/0x80 [libceph]
> >  [] mutex_lock+0x36/0x4a
> >  [] ceph_con_send+0x4d/0x130 [libceph]
> >  [] __send_queued+0x120/0x150 [libceph]
> >  [] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
> >  [] ceph_osdc_start_request+0x51/0x80 [libceph]
> >  [] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
> >  [] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
> >  [] rbd_img_request_submit+0x4c/0x60 [rbd]
> >  [] rbd_request_workfn+0x305/0x410 [rbd]
> >  [] process_one_work+0x159/0x4f0
> >  [] worker_thread+0x11b/0x530
> >  [] ? create_worker+0x1d0/0x1d0
> >  [] kthread+0xc9/0xe0
> >  [] ? flush_kthread_worker+0x90/0x90
> >  [] ret_from_fork+0x58/0x90
> >  [] ? flush_kthread_worker+0x90/0x90
> >
> > Here is that I/O.
> > We grab ceph_osd_client::request_mutex, but
> > ceph_connection::mutex is being held by the worker that recursed into
> > XFS:
> >
> > rbd_queue_workfn
> >   ceph_osdc_start_request
> >     mutex_lock(&osdc->request_mutex);
> >     ceph_con_send
> >       mutex_lock(&con->mutex)  # deadlock
> >
> >
> > Workqueue: ceph-msgr con_work [libceph]
> >  ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
> >  0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
> >  ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
> > Call Trace:
> >  [] schedule+0x29/0x70
> >  [] schedule_preempt_disabled+0x16/0x20
> >  [] __mutex_lock_slowpath+0xa5/0x110
> >  [] mutex_lock+0x36/0x4a
> >  [] alloc_msg+0xcf/0x210 [libceph]
> >  [] con_work+0x1675/0x2050 [libceph]
> >  [] ? ttwu_do_wakeup+0x2c/0x120
> >  [] ? ttwu_do_activate.constprop.135+0x66/0x70
> >  [] process_one_work+0x159/0x4f0
> >  [] worker_thread+0x11b/0x530
> >  [] ? create_worker+0x1d0/0x1d0
> >  [] kthread+0xc9/0xe0
> >  [] ? flush_kthread_worker+0x90/0x90
> >  [] ret_from_fork+0x58/0x90
> >  [] ? flush_kthread_worker+0x90/0x90
> >
> > Workqueue: ceph-msgr con_work [libceph]
> >  ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
> >  0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
> >  ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
> > Call Trace:
> >  [] schedule+0x29/0x70
> >  [] schedule_preempt_disabled+0x16/0x20
> >  [] __mutex_lock_slowpath+0xa5/0x110
> >  [] mutex_lock+0x36/0x4a
> >  [] alloc_msg+0xcf/0x210 [libceph]
> >  [] con_work+0x1675/0x2050 [libceph]
> >  [] ? put_prev_entity+0x3c/0x2e0
> >  [] ? sched_clock_cpu+0x95/0xd0
> >  [] process_one_work+0x159/0x4f0
> >  [] worker_thread+0x11b/0x530
> >  [] ? create_worker+0x1d0/0x1d0
> >  [] kthread+0xc9/0xe0
> >  [] ? flush_kthread_worker+0x90/0x90
> >  [] ret_from_fork+0x58/0x90
> >  [] ? flush_kthread_worker+0x90/0x90
> >
> > These two are replies on ceph_connections Y and Z, which need
> > ceph_osd_client::request_mutex to figure out which requests can be
> > completed:
> >
> > alloc_msg
> >   get_reply
> >     mutex_lock(&osdc->request_mutex);
> >
> > Eventually everything else blocks on ceph_osd_client::request_mutex,
> > since it's used for both submitting requests and handling replies.
> >
> > This really is a straightforward "using GFP_KERNEL on the writeback
> > path isn't allowed" case. I'm not sure what made you worried here.
>
> I am still not sure the dependency is there. But if anything, if
> con->mutex is the lock which is dangerous to recurse back into the FS
> under, then please wrap the whole scope which takes the lock with
> memalloc_noio_save (or memalloc_nofs_save, currently sitting in the
> mmotm tree, if you can wait until that API gets merged), with a big fat
> comment explaining why that is needed. Sticking the scope protection
> down the path is just hard to understand later on. And as already
> mentioned, NOFS/NOIO contexts are (ab)used way too much without a
> clear/good reason.
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html