To: linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
    linux-ext4@vger.kernel.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org
Cc: axboe@kernel.dk, paolo.valente@linaro.org, jack@suse.cz,
    jmoyer@redhat.com, tytso@mit.edu, amakhalov@vmware.com,
    anishs@vmware.com, srivatsab@vmware.com, "Srivatsa S. Bhat"
Bhat" Subject: CFQ idling kills I/O performance on ext4 with blkio cgroup controller Message-ID: <8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu> Date: Fri, 17 May 2019 15:16:01 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, One of my colleagues noticed upto 10x - 30x drop in I/O throughput running the following command, with the CFQ I/O scheduler: dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync Throughput with CFQ: 60 KB/s Throughput with noop or deadline: 1.5 MB/s - 2 MB/s I spent some time looking into it and found that this is caused by the undesirable interaction between 4 different components: - blkio cgroup controller enabled - ext4 with the jbd2 kthread running in the root blkio cgroup - dd running on ext4, in any other blkio cgroup than that of jbd2 - CFQ I/O scheduler with defaults for slice_idle and group_idle When docker is enabled, systemd creates a blkio cgroup called system.slice to run system services (and docker) under it, and a separate blkio cgroup called user.slice for user processes. So, when dd is invoked, it runs under user.slice. The dd command above includes the dsync flag, which performs an fdatasync after every write to the output file. Since dd is writing to a file on ext4, jbd2 will be active, committing transactions corresponding to those fdatasync requests from dd. (In other words, dd depends on jdb2, in order to make forward progress). But jdb2 being a kernel thread, runs in the root blkio cgroup, as opposed to dd, which runs under user.slice. Now, if the I/O scheduler in use for the underlying block device is CFQ, then its inter-queue/inter-group idling takes effect (via the slice_idle and group_idle parameters, both of which default to 8ms). Therefore, everytime CFQ switches between processing requests from dd vs jbd2, this 8ms idle time is injected, which slows down the overall throughput tremendously! To verify this theory, I tried various experiments, and in all cases, the 4 pre-conditions mentioned above were necessary to reproduce this performance drop. For example, if I used an XFS filesystem (which doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed directly to a block device, I couldn't reproduce the performance issue. Similarly, running dd in the root blkio cgroup (where jbd2 runs) also gets full performance; as does using the noop or deadline I/O schedulers; or even CFQ itself, with slice_idle and group_idle set to zero. These results were reproduced on a Linux VM (kernel v4.19) on ESXi, both with virtualized storage as well as with disk pass-through, backed by a rotational hard disk in both cases. The same problem was also seen with the BFQ I/O scheduler in kernel v5.1. Searching for any earlier discussions of this problem, I found an old thread on LKML that encountered this behavior [1], as well as a docker github issue [2] with similar symptoms (mentioned later in the thread). So, I'm curious to know if this is a well-understood problem and if anybody has any thoughts on how to fix it. Thank you very much! [1]. https://lkml.org/lkml/2015/11/19/359 [2]. https://github.com/moby/moby/issues/21485 https://github.com/moby/moby/issues/21485#issuecomment-222941103 Regards, Srivatsa