Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp40968yba; Mon, 20 May 2019 04:39:13 -0700 (PDT) X-Google-Smtp-Source: APXvYqzYREec9K1OWM05n7+cqYbjVC4ZlC7FVtyzE3rbQYfA74Y2Y6BS7aDspk5+wF1oJiBPUKyT X-Received: by 2002:a62:1993:: with SMTP id 141mr79358660pfz.97.1558352353122; Mon, 20 May 2019 04:39:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558352353; cv=none; d=google.com; s=arc-20160816; b=NwUZUv1LcOZ5v9CxqSbQUb1wu7Ft2bm6JgxdK+FxK9uuYbgBeu1Lcl3Ltor/uOZSXx /SgTMuzanMzhBobYj3JeZ+SHL/GfYJUyLJwf3ifvPhbPpJNjhScQoZ1vagq8b2OVdCH5 lIbXdK+q5E4GwQ5u49TtVULCRV1+5aIsNEspYDtq7HKbakk99nqmlu1TxHKt3ezkbzr5 d06Z+hd53hf8pxXI449vz0t/NOuDRwPmuIo4vbsGoeOrlHlXZW2HI7xXwAxr1375wIko piRHdXUJbmMY4qrxMglcuVy4iNXVV8C+f39NgNGzF/Dngwt3JPrtFDYVi8/6zJN5NJGk Sdeg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:to:cc:in-reply-to:date:subject :mime-version:message-id:from:dkim-signature; bh=NwlnnO8ESeSxRw5I4cSTLQD2eWjIGGdF4E5YUL4sVio=; b=CtjCldTnP/wLdzgFZRyoZCqN5cGlJGWm81btcNO3uyjR7frhMVGO95zf/Ga1rV0+W9 hDJtJJj1ZKdhBZN3I9pVP95DeroIrz2tbo238sNyc+5lb1XDKGmdcZthUSYUu+i0VQo4 I3CPa2rgbu8TtKsb8q2jKhIK/qvUNfn+0p8T4nhMBol+9yJWcPM/fIu4Dtb8qOIOVXF2 WuvmfXSK/m3yvlhFJNhMgM1C7f1p6A73sOX7VTnxbguQ0Wiu/IjxYitUTPbVeGWbmt5b lbWKVNEK7VbmkOcEtKCsLmSSbhcTAWHs/xV7nS5OxGxNQRzrzIJsCy4DRzn5pfaQBAuV uTKg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=fAeOqnI2; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d21si16436398pgv.353.2019.05.20.04.38.45; Mon, 20 May 2019 04:39:13 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=fAeOqnI2; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729912AbfETKUA (ORCPT + 99 others); Mon, 20 May 2019 06:20:00 -0400 Received: from mail-wr1-f67.google.com ([209.85.221.67]:37760 "EHLO mail-wr1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729692AbfETKT5 (ORCPT ); Mon, 20 May 2019 06:19:57 -0400 Received: by mail-wr1-f67.google.com with SMTP id e15so13939981wrs.4 for ; Mon, 20 May 2019 03:19:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=NwlnnO8ESeSxRw5I4cSTLQD2eWjIGGdF4E5YUL4sVio=; b=fAeOqnI2tkRWtCPWlJF7KOL2rZ6fIISdVjNIqB3Lv7Bxh9XfXkfPmpxvPqEkR4X54I aFuP6tqomqaxs9D49vDgbyzY7OpGuYSdY0CGjUoCDGqsw0p+TRLAPtIgZyehV+DwB5tI Bdj3K4ZYQuz+wThYUbsCAurjUcCpUIQfG+4KhbwkriHMYfVbJVZjpRcHQftlaejPYI+8 WZFnR+zW4uwmMal5Dl2kI3OUy2OVPKH9Y60RU2o6htHTKnwvOPHx/dulutbwccFAPS9T 9ct51ALpTxHBpEogLDVOc+PY4PajGm90K1Fd/aHl+pGbH0TL1aGLvyztnZ2EJUntDwt7 BaKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=NwlnnO8ESeSxRw5I4cSTLQD2eWjIGGdF4E5YUL4sVio=; b=PoIgjBksgysSAa95xfMhXQBPEA/k5qUk5wJjxkJ6kHJEMqyqCOOev7E1OCJmq5r6Za r/lM3mt522AWVg+75bfJy3T+ZzHOCShmeFFkfAlKKO8kZusOms/WG/azvomw7dTf01Oc kxxlQtm3cm6DOfJ/fuoG5xRlfX4suCrRE7vpOnsTp8e6T1r24aY9KjhU68HpmFJGcTnP sTV/3ndGunkCTnw8knV0ZkdRvDT6a6EURvn6vxKxjY9YVkRc8ObTkUXudJeGNfZNKof5 goqxe1btg3NMqLd0FJ/b5oi1h89Dd2vPwWt3CNdGyfgiZBw6ZeuR48GQdGLft83qel3z 2qmg== X-Gm-Message-State: APjAAAWv7rPQg02W/xnGT67ATxnGhZwwNSJnhibeZFiSdlnnBRwUuWLn AQeAdUTa3HB4Ht3jajSnibjUsw== X-Received: by 2002:adf:9c0a:: with SMTP id f10mr4504745wrc.248.1558347594134; Mon, 20 May 2019 03:19:54 -0700 (PDT) Received: from [192.168.0.100] ([88.147.73.106]) by smtp.gmail.com with ESMTPSA id s11sm31793721wrb.71.2019.05.20.03.19.51 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 May 2019 03:19:52 -0700 (PDT) From: Paolo Valente Message-Id: <07D11833-8285-49C2-943D-E4C1D23E8859@linaro.org> Content-Type: multipart/signed; boundary="Apple-Mail=_0E3633DD-4248-4655-9843-6C90BDFC002D"; protocol="application/pgp-signature"; micalg=pgp-sha256 Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\)) Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller Date: Mon, 20 May 2019 12:19:50 +0200 In-Reply-To: <46c6a4be-f567-3621-2e16-0e341762b828@csail.mit.edu> Cc: linux-fsdevel@vger.kernel.org, linux-block , linux-ext4@vger.kernel.org, cgroups@vger.kernel.org, kernel list , Jens Axboe , Jan Kara , jmoyer@redhat.com, tytso@mit.edu, amakhalov@vmware.com, anishs@vmware.com, srivatsab@vmware.com To: "Srivatsa S. Bhat" References: <8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu> <1812E450-14EF-4D5A-8F31-668499E13652@linaro.org> <46c6a4be-f567-3621-2e16-0e341762b828@csail.mit.edu> X-Mailer: Apple Mail (2.3445.104.8) Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org --Apple-Mail=_0E3633DD-4248-4655-9843-6C90BDFC002D Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii > Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat = ha scritto: >=20 > On 5/18/19 11:39 AM, Paolo Valente wrote: >> I've addressed these issues in my last batch of improvements for BFQ, >> which landed in the upcoming 5.2. If you give it a try, and still see >> the problem, then I'll be glad to reproduce it, and hopefully fix it >> for you. >>=20 >=20 > Hi Paolo, >=20 > Thank you for looking into this! >=20 > I just tried current mainline at commit 72cf0b07, but unfortunately > didn't see any improvement: >=20 > dd if=3D/dev/zero of=3D/root/test.img bs=3D512 count=3D10000 = oflag=3Ddsync >=20 > With mq-deadline, I get: >=20 > 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s >=20 > With bfq, I get: > 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s >=20 Hi Srivatsa, thanks for reproducing this on mainline. I seem to have reproduced a bonsai-tree version of this issue. Before digging into the block trace, I'd like to ask you for some feedback. First, in my test, the total throughput of the disk happens to be about 20 times as high as that enjoyed by dd, regardless of the I/O scheduler. I guess this massive overhead is normal with dsync, but I'd like know whether it is about the same on your side. This will help me understand whether I'll actually be analyzing about the same problem as yours. Second, the commands I used follow. Do they implement your test case correctly? [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp [root@localhost tmp]# echo $BASHPID > = /sys/fs/cgroup/blkio/testgrp/cgroup.procs [root@localhost tmp]# cat /sys/block/sda/queue/scheduler [mq-deadline] bfq none [root@localhost tmp]# dd if=3D/dev/zero of=3D/root/test.img bs=3D512 = count=3D10000 oflag=3Ddsync 10000+0 record dentro 10000+0 record fuori 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler [root@localhost tmp]# dd if=3D/dev/zero of=3D/root/test.img bs=3D512 = count=3D10000 oflag=3Ddsync 10000+0 record dentro 10000+0 record fuori 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s Thanks, Paolo > Please let me know if any more info about my setup might be helpful. >=20 > Thank you! >=20 > Regards, > Srivatsa > VMware Photon OS >=20 >>=20 >>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat = ha scritto: >>>=20 >>>=20 >>> Hi, >>>=20 >>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput >>> running the following command, with the CFQ I/O scheduler: >>>=20 >>> dd if=3D/dev/zero of=3D/root/test.img bs=3D512 count=3D10000 = oflags=3Ddsync >>>=20 >>> Throughput with CFQ: 60 KB/s >>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s >>>=20 >>> I spent some time looking into it and found that this is caused by = the >>> undesirable interaction between 4 different components: >>>=20 >>> - blkio cgroup controller enabled >>> - ext4 with the jbd2 kthread running in the root blkio cgroup >>> - dd running on ext4, in any other blkio cgroup than that of jbd2 >>> - CFQ I/O scheduler with defaults for slice_idle and group_idle >>>=20 >>>=20 >>> When docker is enabled, systemd creates a blkio cgroup called >>> system.slice to run system services (and docker) under it, and a >>> separate blkio cgroup called user.slice for user processes. So, when >>> dd is invoked, it runs under user.slice. >>>=20 >>> The dd command above includes the dsync flag, which performs an >>> fdatasync after every write to the output file. Since dd is writing = to >>> a file on ext4, jbd2 will be active, committing transactions >>> corresponding to those fdatasync requests from dd. (In other words, = dd >>> depends on jdb2, in order to make forward progress). But jdb2 being = a >>> kernel thread, runs in the root blkio cgroup, as opposed to dd, = which >>> runs under user.slice. >>>=20 >>> Now, if the I/O scheduler in use for the underlying block device is >>> CFQ, then its inter-queue/inter-group idling takes effect (via the >>> slice_idle and group_idle parameters, both of which default to 8ms). >>> Therefore, everytime CFQ switches between processing requests from = dd >>> vs jbd2, this 8ms idle time is injected, which slows down the = overall >>> throughput tremendously! >>>=20 >>> To verify this theory, I tried various experiments, and in all = cases, >>> the 4 pre-conditions mentioned above were necessary to reproduce = this >>> performance drop. For example, if I used an XFS filesystem (which >>> doesn't use a separate kthread like jbd2 for journaling), or if I = dd'ed >>> directly to a block device, I couldn't reproduce the performance >>> issue. Similarly, running dd in the root blkio cgroup (where jbd2 >>> runs) also gets full performance; as does using the noop or deadline >>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle = set >>> to zero. >>>=20 >>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi, >>> both with virtualized storage as well as with disk pass-through, >>> backed by a rotational hard disk in both cases. The same problem was >>> also seen with the BFQ I/O scheduler in kernel v5.1. >>>=20 >>> Searching for any earlier discussions of this problem, I found an = old >>> thread on LKML that encountered this behavior [1], as well as a = docker >>> github issue [2] with similar symptoms (mentioned later in the >>> thread). >>>=20 >>> So, I'm curious to know if this is a well-understood problem and if >>> anybody has any thoughts on how to fix it. >>>=20 >>> Thank you very much! >>>=20 >>>=20 >>> [1]. https://lkml.org/lkml/2015/11/19/359 >>>=20 >>> [2]. https://github.com/moby/moby/issues/21485 >>> https://github.com/moby/moby/issues/21485#issuecomment-222941103 >>>=20 >>> Regards, >>> Srivatsa >>=20 >=20 --Apple-Mail=_0E3633DD-4248-4655-9843-6C90BDFC002D Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEpYoduex+OneZyvO8OAkCLQGo9oMFAlzif0YACgkQOAkCLQGo 9oM5xQ//dvsiY41qM7W/y79e3ViCl5oabT79tXKDHAGnXMLPWIScoMuI0CpONuBA hRvnLCO8Wvwl8aXbJzfVnv0J5XkBCiRlIEBf2XXvhgXWjwSOIRtLXoKfnyZ26h1Z GGrC8q4pY+s/2d8wUanLHlOo+ExsJoU+5xWzL9Wx45XafzaG88aOl/yCb8mGgw2S /kkSWIVGMn930yRgGqr4b+dNqivNg2euqSRpjfhRNm0hCkM7q2r1En2Xyef5eEeP 3d1SO+qQo6kIylYEtu8Y6foZZfqlBQLWUV/oCdaQ0Os77Z1cqfjIxUOLkkEU+QKi gHSXwCoM/S1pOL74o7TkDbWoL4MpyK9AaatNSULRxE1yWDYF2dd/31Tc/ZLcutYE VYG0SWUXx2diM5ccCWVedHEHhs9VVxUP4ftipt6tAzdsyT0RCxTwQPapn4NToMUj 3DlFpM8DUdYNuiVrQhpr3/+gB0OJ64XC3bgQM80rzFO2AWuGmydlUc6Tle8AqNNG qAHUZ25+VOBwPrOptJOH249yWYrSgT0FHqWEppNmCi0JyVLjbOJuLT06/IT7q2AW Ob5N/G2PxFiXuCDLeJlC7UIl8Ua7Sg5v+MmJVDgL0qDUctZvJHNZZdMbeedk1ZHH XH0SX/3JDNrmO2XdjIrzAWSdQX3Ev1Yv2jOkHnMAs6Na+1H+vf4= =lPeU -----END PGP SIGNATURE----- --Apple-Mail=_0E3633DD-4248-4655-9843-6C90BDFC002D--