Date: Mon, 2 Dec 2019 14:08:44 +1100
From: Dave Chinner
To: Hillf Danton
Cc: Ming Lei, linux-block, linux-fs, linux-xfs, linux-kernel,
 Christoph Hellwig, Jens Axboe, Peter Zijlstra, Vincent Guittot,
 Rong Chen, Tejun Heo
Subject: Re: single aio thread is migrated crazily by scheduler
Message-ID: <20191202030844.GD2695@dread.disaster.area>
In-Reply-To: <20191128094003.752-1-hdanton@sina.com>
References: <20191114113153.GB4213@ming.t460p>
 <20191114235415.GL4614@dread.disaster.area>
 <20191115010824.GC4847@ming.t460p>
 <20191115045634.GN4614@dread.disaster.area>
 <20191115070843.GA24246@ming.t460p>
 <20191128094003.752-1-hdanton@sina.com>

On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote:
> On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote:
> > Yeah, the fio task averages 13.4ms on any given CPU before being
> > switched to another CPU. Mind you, the stddev is 12ms, so the range
> > of how long it spends on any one CPU is pretty wide (330us to
> > 330ms).
> >
> Hey Dave
>
> > IOWs, this doesn't look like a workqueue problem at all - this looks
>
> Surprised to see you're so sure it has little to do with wq,

Because I understand how the workqueue is used here. Essentially,
the workqueue is not necessary for a -pure- overwrite where no
metadata updates or end-of-io filesystem work is required. However,
change the workload just slightly - such as allocating the space,
writing into preallocated space (unwritten extents), using AIO
writes to extend the file, or using O_DSYNC - and we *must* use a
workqueue, because we have to take blocking locks and/or run
transactions.

These may still be very short (e.g. updating the inode size) and in
most cases they will not block, but if they do block and we haven't
moved the work out of the block layer completion context (i.e. the
softirq running the block bh), then we risk deadlocks. Not to
mention that none of the filesystem inode locks are irq safe.

IOWs, we can remove the workqueue for this -one specific instance-,
but that does not remove the requirement for using a workqueue for
all the other types of write IO that pass through this code.
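To make that concrete, here is a rough sketch of the pattern being
described - the example_* names are made up for illustration, this
is not the actual iomap code:

/*
 * Rough sketch of the IO completion punt, not the actual iomap code;
 * the example_* names are made up.
 */
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_dio {
	struct work_struct	work;
	/* iocb, size, error, ... */
};

/*
 * Runs in process context from the workqueue: safe to take the inode
 * lock, run a transaction, convert unwritten extents, update the
 * inode size, etc.
 */
static void example_dio_complete_work(struct work_struct *work)
{
	struct example_dio *dio = container_of(work, struct example_dio, work);

	/* ... blocking filesystem completion work goes here ... */
	kfree(dio);
}

/*
 * Runs from bio completion, typically softirq context: must not
 * sleep, so any write that needs filesystem work at IO completion
 * has to be punted to a workqueue.
 */
static void example_dio_bio_end_io(struct bio *bio)
{
	struct example_dio *dio = bio->bi_private;

	INIT_WORK(&dio->work, example_dio_complete_work);
	/*
	 * Normally a dedicated wq rather than system_wq; either way a
	 * bound workqueue runs the work on the CPU it was queued from.
	 */
	queue_work(system_wq, &dio->work);
	bio_put(bio);
}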
> > like the scheduler is repeatedly making the wrong load balancing
> > decisions when mixing a very short runtime task (queued work) with a
> > long runtime task on the same CPU....
>
> and it helps more to know what is driving lb to make decisions like
> this.

I know exactly what is driving it, from both observation and an
understanding of the code, and I've explained it elsewhere in this
thread.

> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct
>  		WRITE_ONCE(dio->submit.waiter, NULL);
>  		blk_wake_io_task(waiter);
>  	} else if (dio->flags & IOMAP_DIO_WRITE) {
> -		struct inode *inode = file_inode(dio->iocb->ki_filp);
> -
>  		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +		schedule_work(&dio->aio.work);

This does nothing but change the workqueue from a per-sb wq to the
system wq. The work is still bound to the same CPU it is queued on,
so nothing will change.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com