Date: Sat, 18 May 2019 15:28:47 -0400
From: "Theodore Ts'o"
To: Paolo Valente
Cc: "Srivatsa S. Bhat", linux-fsdevel@vger.kernel.org, linux-block,
	linux-ext4@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, axboe@kernel.dk, jack@suse.cz,
	jmoyer@redhat.com, amakhalov@vmware.com, anishs@vmware.com,
	srivatsab@vmware.com
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
Message-ID: <20190518192847.GB14277@mit.edu>
Bhat" , linux-fsdevel@vger.kernel.org, linux-block , linux-ext4@vger.kernel.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, axboe@kernel.dk, jack@suse.cz, jmoyer@redhat.com, amakhalov@vmware.com, anishs@vmware.com, srivatsab@vmware.com References: <8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu> <1812E450-14EF-4D5A-8F31-668499E13652@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1812E450-14EF-4D5A-8F31-668499E13652@linaro.org> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote: > I've addressed these issues in my last batch of improvements for > BFQ, which landed in the upcoming 5.2. If you give it a try, and > still see the problem, then I'll be glad to reproduce it, and > hopefully fix it for you. Hi Paolo, I'm curious if you could give a quick summary about what you changed in BFQ? I was considering adding support so that if userspace calls fsync(2) or fdatasync(2), to attach the process's CSS to the transaction, and then charge all of the journal metadata writes the process's CSS. If there are multiple fsync's batched into the transaction, the first process which forced the early transaction commit would get charged the entire journal write. OTOH, journal writes are sequential I/O, so the amount of disk time for writing the journal is going to be relatively small, and especially, the fact that work from other cgroups is going to be minimal, especially if hadn't issued an fsync(). In the case where you have three cgroups all issuing fsync(2) and they all landed in the same jbd2 transaction thanks to commit batching, in the ideal world we would split up the disk time usage equally across those three cgroups. But it's probably not worth doing that... That being said, we probably do need some BFQ support, since in the case where we have multiple processes doing buffered writes w/o fsync, we do charnge the data=ordered writeback to each block cgroup. Worse, the commit can't complete until the all of the data integrity writebacks have completed. And if there are N cgroups with dirty inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth of idle time tacked onto the commit time. If we charge the journal I/O to the cgroup, and there's only one process doing the dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync then we don't need to worry about this failure mode, since both the journal I/O and the data writeback will be hitting the same cgroup. But that's arguably an artificial use case, and much more commonly there will be multiple cgroups all trying to at least some file system I/O. - Ted