From: Vincent Guittot
Date: Mon, 2 Dec 2019 14:45:42 +0100
Subject: Re: single aio thread is migrated crazily by scheduler
To: Dave Chinner
Cc: Ming Lei, Hillf Danton, linux-block, linux-fs, linux-xfs, linux-kernel,
    Christoph Hellwig, Jens Axboe, Peter Zijlstra, Rong Chen, Tejun Heo
In-Reply-To: <20191202040256.GE2695@dread.disaster.area>
References: <20191114113153.GB4213@ming.t460p>
 <20191114235415.GL4614@dread.disaster.area>
 <20191115010824.GC4847@ming.t460p>
 <20191115045634.GN4614@dread.disaster.area>
 <20191115070843.GA24246@ming.t460p>
 <20191128094003.752-1-hdanton@sina.com>
 <20191202024625.GD24512@ming.t460p>
 <20191202040256.GE2695@dread.disaster.area>

On Mon, 2 Dec 2019 at 05:02, Dave Chinner wrote:
>
> On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote:
> > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote:
> > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton wrote:
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct bio *bio)
> > > >  			WRITE_ONCE(dio->submit.waiter, NULL);
> > > >  			blk_wake_io_task(waiter);
> > > >  		} else if (dio->flags & IOMAP_DIO_WRITE) {
> > > > -			struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > > -
> > > >  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > > > -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > > > +			schedule_work(&dio->aio.work);
> > >
> > > I'm not sure that this will make a real difference because it ends up
> > > calling queue_work(system_wq, ...) and system_wq is bound as well, so
> > > the work will still be pinned to a CPU.
> > > Using system_unbound_wq should make a difference because it doesn't
> > > pin the work to a CPU:
> > >
> > > +			queue_work(system_unbound_wq, &dio->aio.work);
> >
> > Indeed. I just ran a quick test on my KVM guest, and it looks like the
> > following patch makes a difference:
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 9329ced91f1d..2f4488b0ecec 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb)
> >  {
> >  	struct workqueue_struct *old;
> >  	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
> > -						      WQ_MEM_RECLAIM, 0,
> > +						      WQ_MEM_RECLAIM |
> > +						      WQ_UNBOUND, 0,
> >  						      sb->s_id);
>
> That's not an answer to the user task migration issue.
>
> That is, all this patch does is trade user task migration when the
> CPU is busy for migrating all the queued work off the CPU so the
> user task does not get migrated. IOWs, this forces all the queued
> work to be migrated rather than the user task. IOWs, it does not
> address the issue we've exposed in the scheduler between tasks with
> competing CPU affinity scheduling requirements - it just hides the
> symptom.
>
> Maintaining CPU affinity across dispatch and completion work has
> been proven to be a significant performance win. Right throughout
> the IO stack we try to keep this submitter/completion affinity,
> and that's the whole point of using a bound wq in the first place:
> efficient delayed batch processing of work on the local CPU.

Do you really want to target the same CPU? It looks like what you
really want to target is the same cache instead.
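To make that concrete, here is a rough, untested sketch of what
"target the same cache rather than the exact same CPU" could look like
at the queueing site. The helper name queue_work_same_cache() is made
up, and I'm only assuming cpus_share_cache() and queue_work_on() as
building blocks; this is an illustration of the idea, not a proposed
patch:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

/* Hypothetical helper: queue the completion work on a CPU that shares
 * the LLC with the CPU handling the completion, instead of forcing
 * that exact CPU. */
static void queue_work_same_cache(struct workqueue_struct *wq,
				  struct work_struct *work)
{
	int this_cpu = raw_smp_processor_id();
	int cpu;

	/* Prefer another online CPU that shares the last-level cache. */
	for_each_online_cpu(cpu) {
		if (cpu != this_cpu && cpus_share_cache(this_cpu, cpu)) {
			queue_work_on(cpu, wq, work);
			return;
		}
	}

	/* Nothing else shares the cache: fall back to the local CPU. */
	queue_work_on(this_cpu, wq, work);
}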
> Spewing deferred completion work across every idle CPU in the
> machine because the local CPU is temporarily busy is a bad choice,
> both from a performance perspective (dirty cacheline bouncing) and
> from a power efficiency point of view, as it causes CPUs to be taken
> out of idle state much more frequently [*].
>
> The fact that the scheduler migrates the user task when we use
> workqueues for deferred work as they were intended doesn't make this
> a workqueue problem. If the answer to this problem is "make all IO
> workqueues WQ_UNBOUND" then we are effectively saying "the scheduler
> has unfixable problems when mixing bound and unbound work on the
> same run queue".
>
> And, besides, what happens when every other CPU is also completely
> busy and can't run the work in a timely fashion? We've just moved
> the work to some random CPU where we wait to be scheduled instead of
> just sitting on the local CPU and waiting....
>
> So, yes, we can work around the -symptoms- we see (frequent user
> task migration) by changing the work queue configuration or
> bypassing the workqueue for this specific workload. But these only
> address the visible symptom and don't take into account the wider
> goals of retaining CPU affinity in the IO stack, and they will have
> variable scheduling latency and performance as the overall system
> load changes.
>
> So, we can fiddle with workqueues, but it doesn't address the
> underlying issue that the scheduler appears to be migrating
> non-bound tasks off a busy CPU too easily....

The root cause of the problem is that sched_wakeup_granularity_ns is
in the same range as, or higher than, the load balance period. As Peter
explained, this makes the kworker wait for the CPU for several load
balance periods, and a transient unbalanced state becomes a stable one
that the scheduler then tries to fix. With the default value, the
scheduler doesn't try to migrate any task.

Then, I agree that having the ack close to the request makes sense, but
forcing it onto the exact same CPU is too restrictive IMO. Being able to
use another CPU on the same core should not harm performance and may
even improve it. And that may still be the case as long as the CPUs
share their cache.

> -Dave.
>
> [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work
> queue: it's designed for interrupt routines that defer work via work
> queues to avoid doing work on otherwise idle CPUs. It does this by
> turning the per-CPU wq into an unbound wq so that work gets
> scheduled on non-idle CPUs in preference to the local idle CPU,
> which can then remain in low power states.
>
> That's the exact opposite of what using WQ_UNBOUND ends up doing in
> this IO completion context: it pushes the work out over idle CPUs
> rather than keeping it confined to the already busy CPUs where CPU
> affinity allows the work to be done quickly. So while WQ_UNBOUND
> avoids the user task being migrated frequently, it results in the
> work being spread around many more CPUs and we burn more power to do
> the same work.
>
> --
> Dave Chinner
> david@fromorbit.com
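For anyone following along, here is a minimal, purely illustrative
sketch of the three workqueue configurations being compared in this
thread (bound, WQ_UNBOUND and WQ_POWER_EFFICIENT). The queue names and
the demo function are made up; it only shows how the flags are passed
to alloc_workqueue(), it is not a proposed change:

#include <linux/workqueue.h>

/* Sketch only: contrast of the flag combinations discussed above. */
static void wq_flags_demo(void)
{
	struct workqueue_struct *bound, *unbound, *pe;

	/* Per-CPU wq: work runs on the CPU it was queued from.
	 * This is what sb_init_dio_done_wq() sets up today. */
	bound = alloc_workqueue("demo-bound", WQ_MEM_RECLAIM, 0);

	/* Unbound wq: work may run on any CPU allowed by the wq cpumask.
	 * This is what the WQ_UNBOUND test patch above switches to. */
	unbound = alloc_workqueue("demo-unbound",
				  WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

	/* Power-efficient wq: per-CPU by default, turned into an unbound
	 * wq when workqueue.power_efficient is enabled, so that otherwise
	 * idle CPUs can stay in low power states. */
	pe = alloc_workqueue("demo-pe",
			     WQ_MEM_RECLAIM | WQ_POWER_EFFICIENT, 0);

	if (bound)
		destroy_workqueue(bound);
	if (unbound)
		destroy_workqueue(unbound);
	if (pe)
		destroy_workqueue(pe);
}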