Date: Tue, 21 May 2019 09:38:51 +0200
From: Andrea Righi
To: Paolo Valente
Cc: Theodore Ts'o, "Srivatsa S. Bhat", linux-fsdevel@vger.kernel.org,
	linux-block, linux-ext4@vger.kernel.org, cgroups@vger.kernel.org,
	kernel list, Jens Axboe, Jan Kara, jmoyer@redhat.com,
	amakhalov@vmware.com, anishs@vmware.com, srivatsab@vmware.com,
	Josef Bacik, Tejun Heo
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
Message-ID: <20190521073851.GA15262@xps-13>
References: <8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu>
	<1812E450-14EF-4D5A-8F31-668499E13652@linaro.org>
	<20190518192847.GB14277@mit.edu>
	<98612748-8454-43E8-9915-BAEBA19A6FD7@linaro.org>
In-Reply-To: <98612748-8454-43E8-9915-BAEBA19A6FD7@linaro.org>

On Mon, May 20, 2019 at 12:38:32PM +0200, Paolo Valente wrote:
...
> > I was considering adding support so that if userspace calls fsync(2)
> > or fdatasync(2), to attach the process's CSS to the transaction, and
> > then charge all of the journal metadata writes to the process's CSS.
> > If there are multiple fsync's batched into the transaction, the
> > first process which forced the early transaction commit would get
> > charged the entire journal write.  OTOH, journal writes are
> > sequential I/O, so the amount of disk time for writing the journal
> > is going to be relatively small, and especially, the work from other
> > cgroups is going to be minimal, especially if they hadn't issued an
> > fsync().
>
> Yeah, that's a longstanding and difficult instance of the general
> too-short-blanket problem.  Jan has already highlighted one of the
> main issues in his reply.  I'll add a design issue (from my point of
> view): I'd find it a little odd that explicit sync transactions have
> an owner to charge, while generic buffered writes have not.
>
> I think Andrea Righi addressed related issues in his recent patch
> proposal [1], so I've CCed him too.
>
> [1] https://lkml.org/lkml/2019/3/9/220

If journal metadata writes are submitted using a process's CSS, the
commit may be throttled, and that can also indirectly throttle other
"high-priority" blkio cgroups, so I think that logic alone isn't
enough.

We have discussed this priority-inversion problem with Josef and Tejun
(adding both of them in cc); the idea that seemed most reasonable was
to temporarily boost the priority of blkio cgroups when there are
multiple sync(2) waiters in the system.

More exactly, when I/O is going to be throttled for a specific blkio
cgroup, if there's any other blkio cgroup waiting for writeback I/O,
no throttling is applied (this logic can be refined by saving a list
of blkio sync(2) waiters and taking the highest I/O rate among them).
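A minimal sketch of that check (purely illustrative: the sync_waiters
counter and the blkcg_should_throttle() hook are hypothetical, not the
real blk-throttle internals):

  #include <linux/atomic.h>
  #include <linux/bio.h>

  /* Hypothetical global: bumped on entry to sync(2)/syncfs(2) and
   * dropped on exit, so it counts tasks blocked waiting for
   * writeback to complete. */
  static atomic_t sync_waiters = ATOMIC_INIT(0);

  static bool blkcg_should_throttle(struct bio *bio)
  {
          /*
           * If some other cgroup is waiting in sync(2), delaying
           * this bio can delay the journal commit that waiter
           * depends on (priority inversion), so dispatch
           * everything unthrottled until the waiters are gone.
           */
          if (atomic_read(&sync_waiters) > 0)
                  return false;

          return true;    /* normal throttling policy applies */
  }

The refinement mentioned above would replace the single counter with a
list of the actual waiters, so that the highest configured I/O rate
among them could be applied instead of disabling throttling entirely.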
In addition to that, Tejun mentioned that he would like to see better
sync(2) isolation done at the fs namespace level. This last part still
needs to be defined and addressed.

However, even the simple logic above ("no throttling if there's any
other sync(2) waiter") can already prevent big system lockups (see for
example the simple test case that I suggested here:
https://lkml.org/lkml/2019/), so I think having this change alone
would be a nice improvement already:
https://lkml.org/lkml/2019/3/9/220

Thanks,
-Andrea

> >
> > In the case where you have three cgroups all issuing fsync(2) and
> > they all landed in the same jbd2 transaction thanks to commit
> > batching, in the ideal world we would split up the disk time usage
> > equally across those three cgroups.  But it's probably not worth
> > doing that...
> >
> > That being said, we probably do need some BFQ support, since in
> > the case where we have multiple processes doing buffered writes
> > w/o fsync, we do charge the data=ordered writeback to each block
> > cgroup.  Worse, the commit can't complete until all of the data
> > integrity writebacks have completed.  And if there are N cgroups
> > with dirty inodes, and slice_idle set to 8ms, there is going to be
> > 8*N ms worth of idle time tacked onto the commit time.
>
> Jan already wrote part of what I wanted to reply here, so I'll
> continue from his reply.
>
> Thanks,
> Paolo
>
> > If we charge the journal I/O to the cgroup, and there's only one
> > process doing the
> >
> >   dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> >
> > then we don't need to worry about this failure mode, since both
> > the journal I/O and the data writeback will be hitting the same
> > cgroup.  But that's arguably an artificial use case, and much more
> > commonly there will be multiple cgroups all trying to do at least
> > some file system I/O.
> >
> > - Ted
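As a footnote on the journal-charging idea quoted above, here is a
rough sketch of what attaching the fsync(2) caller's CSS to the commit
could look like. kthread_associate_blkcg() and task_get_css() are
existing kernel helpers, but the j_commit_css field and both hook
points below are hypothetical, assumed purely for illustration:

  #include <linux/cgroup.h>
  #include <linux/jbd2.h>
  #include <linux/kthread.h>
  #include <linux/sched.h>

  /* fsync(2) path: when a task forces a commit, remember its blkio
   * css in the journal (j_commit_css is a hypothetical new field in
   * journal_t; the first fsync(2) waiter wins, matching the "first
   * process gets charged the entire journal write" idea above). */
  static void jbd2_record_commit_owner(journal_t *journal)
  {
          if (!journal->j_commit_css)
                  journal->j_commit_css = task_get_css(current, io_cgrp_id);
  }

  /* kjournald2: bracket the commit so that all I/O submitted by the
   * journal thread in between is attributed to the saved css instead
   * of the root cgroup. */
  static void jbd2_commit_charged(journal_t *journal)
  {
          struct cgroup_subsys_state *css = journal->j_commit_css;

          kthread_associate_blkcg(css);
          jbd2_journal_commit_transaction(journal);
          kthread_associate_blkcg(NULL);

          if (css)
                  css_put(css);
          journal->j_commit_css = NULL;
  }

Note that this is exactly the pattern that would be subject to the
priority inversion discussed above: if the chosen css is throttled,
every cgroup waiting on that commit is throttled with it.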