Received: by 2002:a05:7412:b995:b0:f9:9502:5bb8 with SMTP id it21csp6918229rdb; Tue, 2 Jan 2024 20:03:10 -0800 (PST) X-Google-Smtp-Source: AGHT+IEIOYAhGhT3Z0jOufe4REBBw4ti4gT26wqvnWQ4ToTITaVP0i/9RAmufcFkrV3x+6C/Uoio X-Received: by 2002:a05:6358:6f11:b0:175:2bc1:2eca with SMTP id r17-20020a0563586f1100b001752bc12ecamr3829630rwn.64.1704254589770; Tue, 02 Jan 2024 20:03:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1704254589; cv=none; d=google.com; s=arc-20160816; b=i/StbZISibzCo4EcDi74XKVbL/qwoL9aYpcFMhnaTrcYmrYjROUTriE/nBk3vqcK4B 4xuh656QAxUr44OAFWSwLRyVymZ7X0D8DDQxVFmBWaRAdS+CDQsEZ9YHK4urgh6lo11p XN1YJ0aejQZwTzyACuWsH9PtR1Lz1Z5j+AyNqzdUL6CpmJpwyl1oQHPaCUVZI2BkOadI 65lccetOsb7wOWaMYCYx5vkoJ2cZ8GO+QYX4Teh9ghGJUv4jSOuac2vBt1T+Srh5gkse dAOpX/aKkRxC5U89ETStZghBoahD3tu1CJ21WoHvGFJtVFpNaQ/MQZSDCMxk5XS/UIEl Rx8g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:message-id:subject:cc :to:from:date:dkim-signature; bh=yNiVBcSgnS7YDhIhezexv4UQ/o2XjO1B9pVBfER0O5A=; fh=Xw8ZooKZRB6PxfL5xnggJ3TIjLuCDZktDVtxtHuzwzI=; b=H95OxUOtSX2tXyHpvJfEnu+I/Cw7ni2JzHaUrZ7wWXj3TmhADtu95b1bhJ5GYx3VNU LRT4de2IF8WKxBjcxmT2Puypa62fHWRJtn0CwRl42JVynojeDQtJ6vSw5t7JRKkzm8ZN 55tfY5gIMK+KLXp9FLxn0fMm9sRiAonpFKcJVV/uML2YBIl+uYm3f7kT1sOt05r+ZQp9 4Wd1P639Uux6yqmWMqh2UxUoe0W7g4eg2BnUzfZzZsZN+TQ9Ae0wJgJ0PhsMoa7s/yl8 Dr810En2ZXyglw9VdYWZI4TrL0tRun/XJ6eyz/iNbtVqZpKcdGHDzRhJYUqnykLLDuce HHaA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TtGrPA1O; spf=pass (google.com: domain of linux-kernel+bounces-15143-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-15143-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id ld3-20020a056a004f8300b006da3ed8279asi7948218pfb.227.2024.01.02.20.03.09 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 Jan 2024 20:03:09 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-15143-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TtGrPA1O; spf=pass (google.com: domain of linux-kernel+bounces-15143-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-15143-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 0FFA5283FD4 for ; Wed, 3 Jan 2024 04:03:09 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 27061171CA; Wed, 3 Jan 2024 04:03:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="TtGrPA1O" X-Original-To: linux-kernel@vger.kernel.org Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B85E414F6E for ; Wed, 3 Jan 2024 04:02:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1704254577; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=yNiVBcSgnS7YDhIhezexv4UQ/o2XjO1B9pVBfER0O5A=; b=TtGrPA1OCgMRM1KoDM/gKw2vMlyWN+S3Lwuo2zeUJOPihJ6Z2abqMiOAVd9RxjZeHL3B3L Ta/bZaEjYeHHILwBJwwIlzG07s5LD72QvWSLM6QSo+sw0MlBgMVZ9dxd9Ti4CGPFbUQrK+ YpkpjTI8IvQ4eYO5HU13VxO2TJXUDhs= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-345-s2T-O5voO0qFpYZX1FTeFA-1; Tue, 02 Jan 2024 23:02:54 -0500 X-MC-Unique: s2T-O5voO0qFpYZX1FTeFA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 801381C05134; Wed, 3 Jan 2024 04:02:53 +0000 (UTC) Received: from fedora (unknown [10.72.116.42]) by smtp.corp.redhat.com (Postfix) with ESMTPS id D4E8351E3; Wed, 3 Jan 2024 04:02:47 +0000 (UTC) Date: Wed, 3 Jan 2024 12:02:43 +0800 From: Ming Lei To: Yu Kuai Cc: bvanassche@acm.org, hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yi.zhang@huawei.com, yangerkun@huawei.com, ming.lei@redhat.com Subject: Re: [PATCH -next RFC] block: support to account io_ticks precisely Message-ID: References: <20231205093743.1823351-1-yukuai1@huaweicloud.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20231205093743.1823351-1-yukuai1@huaweicloud.com> X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 On Tue, Dec 05, 2023 at 05:37:43PM +0800, Yu Kuai wrote: > From: Yu Kuai > > Currently, io_ticks is accounted based on sampling, specifically > update_io_ticks() will always account io_ticks by 1 jiffies from > bdev_start_io_acct()/blk_account_io_start(), and the result can be > inaccurate, for example(HZ is 250): > > Test script: > fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms > > Test result: util is about 90%, while the disk is really idle. > > In order to account io_ticks precisely, update_io_ticks() must know if > there are IO inflight already, and this requires overhead slightly, > hence precise io accounting is disabled by default, and user can enable > it through sysfs entry. Yeah, the trouble is from commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"), and real reason is that IO inflight info is too expensive to maintain in fast path, and RH have got several customer complaint in this area too. > > Noted that for rq-based devcie, part_stat_local_inc/dec() and > part_in_flight() is used to track inflight instead of iterating tags, > which is not supposed to be used in fast path because 'tags->lock' is > grabbed in blk_mq_find_and_get_req(). You can iterate over static requests via BT_TAG_ITER_STATIC_RQS, then tags->lock can be bypassed, but new helper is needed. But given it is only run once for each tick, I guess percpu counting might be fine too even in case of big machine. > > Signed-off-by: Yu Kuai > --- > Documentation/ABI/stable/sysfs-block | 8 ++++-- > block/blk-core.c | 17 ++++++++---- > block/blk-mq.c | 18 ++++++++++--- > block/blk-sysfs.c | 40 ++++++++++++++++++++++++++-- > block/blk.h | 4 ++- > block/genhd.c | 6 ++--- > include/linux/blk-mq.h | 1 + > include/linux/blkdev.h | 3 +++ > 8 files changed, 80 insertions(+), 17 deletions(-) > > diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block > index 1fe9a553c37b..e5fedecf7bdf 100644 > --- a/Documentation/ABI/stable/sysfs-block > +++ b/Documentation/ABI/stable/sysfs-block > @@ -358,8 +358,12 @@ What: /sys/block//queue/iostats > Date: January 2009 > Contact: linux-block@vger.kernel.org > Description: > - [RW] This file is used to control (on/off) the iostats > - accounting of the disk. > + [RW] This file is used to control the iostats accounting of the > + disk. If this value is 0, iostats accounting is disabled; If > + this value is 1, iostats accounting is enabled, but io_ticks is > + accounted by sampling and the result is not accurate; If this > + value is 2, iostats accounting is enabled and io_ticks is > + accounted precisely, but there will be slightly overhead. IMO, this approach looks fine. > > > What: /sys/block//queue/logical_block_size > diff --git a/block/blk-core.c b/block/blk-core.c > index fdf25b8d6e78..405883d606cd 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -935,14 +935,20 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob, > } > EXPORT_SYMBOL_GPL(iocb_bio_iopoll); > > -void update_io_ticks(struct block_device *part, unsigned long now, bool end) > +void update_io_ticks(struct block_device *part, unsigned long now, bool end, > + bool precise) > { > unsigned long stamp; > again: > stamp = READ_ONCE(part->bd_stamp); > - if (unlikely(time_after(now, stamp))) { > - if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) > + if (unlikely(time_after(now, stamp)) && > + likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) { > + if (precise) { > + if (end || part_in_flight(part)) > + __part_stat_add(part, io_ticks, now - stamp); Strictly speaking, `end` isn't need any more, but it can be thought as one optimization, given part_in_flight() is supposed to be non-zero in case of account_done. > + } else { > __part_stat_add(part, io_ticks, end ? now - stamp : 1); > + } > } > if (part->bd_partno) { > part = bdev_whole(part); > @@ -954,7 +960,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op, > unsigned long start_time) > { > part_stat_lock(); > - update_io_ticks(bdev, start_time, false); > + update_io_ticks(bdev, start_time, false, > + blk_queue_precise_io_stat(bdev->bd_queue)); blk_queue_precise_io_stat() can be moved into update_io_ticks() directly, and it should be fine given it is just done once in each tick. Thanks, Ming