Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751642Ab0K3FVo (ORCPT ); Tue, 30 Nov 2010 00:21:44 -0500
Received: from mx1.redhat.com ([209.132.183.28]:63811 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751188Ab0K3FVm (ORCPT ); Tue, 30 Nov 2010 00:21:42 -0500
Date: Tue, 30 Nov 2010 00:21:09 -0500
From: Mike Snitzer
To: "Darrick J. Wong"
Cc: Jens Axboe, "Theodore Ts'o", Neil Brown, Andreas Dilger,
	Alasdair G Kergon, Jan Kara, linux-kernel,
	linux-raid@vger.kernel.org, Keith Mannthey, dm-devel@redhat.com,
	Mingming Cao, Tejun Heo, linux-ext4@vger.kernel.org, Ric Wheeler,
	Christoph Hellwig, Josef Bacik
Subject: Re: [PATCH 3/4] dm: Compute average flush time from component devices
Message-ID: <20101130052108.GA2107@redhat.com>
References: <20101129220536.12401.16581.stgit@elm3b57.beaverton.ibm.com>
	<20101129220558.12401.95229.stgit@elm3b57.beaverton.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20101129220558.12401.95229.stgit@elm3b57.beaverton.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3405
Lines: 94

On Mon, Nov 29 2010 at 5:05pm -0500,
Darrick J. Wong wrote:

> For dm devices which are composed of other block devices, a flush is mapped out
> to those other block devices. Therefore, the average flush time can be
> computed as the average flush time of whichever device flushes most slowly.

I share Neil's concern about having to track such fine-grained
additional state in order to make the FS behave somewhat better. What
are the _real_ fsync-happy workloads which warrant this optimization?

That concern aside, my comments on your proposed DM changes are inlined
below.

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7cb1352..62aeeb9 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -846,12 +846,38 @@ static void start_queue(struct request_queue *q)
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>
> +static void measure_flushes(struct mapped_device *md)
> +{
> +	struct dm_table *t;
> +	struct dm_dev_internal *dd;
> +	struct list_head *devices;
> +	u64 max = 0, samples = 0;
> +
> +	t = dm_get_live_table(md);
> +	devices = dm_table_get_devices(t);
> +	list_for_each_entry(dd, devices, list) {
> +		if (dd->dm_dev.bdev->bd_disk->avg_flush_time_ns <= max)
> +			continue;
> +		max = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
> +		samples = dd->dm_dev.bdev->bd_disk->flush_samples;
> +	}
> +	dm_table_put(t);
> +
> +	spin_lock(&md->disk->flush_time_lock);
> +	md->disk->avg_flush_time_ns = max;
> +	md->disk->flush_samples = samples;
> +	spin_unlock(&md->disk->flush_time_lock);
> +}
> +

You're checking all devices in a table rather than all devices that
will receive a flush. Which devices will receive a flush is left for
each target to determine (the target exposes num_flush_requests).

I'd prefer to see a more controlled .iterate_devices() based iteration
of devices in each target.

dm-table.c:dm_calculate_queue_limits() shows how iterate_devices can be
used to combine device-specific data using a common callback and a data
pointer -- for that data pointer we'd need a local temporary structure
with your 'max' and 'samples' members.

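To make the callback-plus-data-pointer shape concrete, here is a rough
userspace sketch. It is not kernel code and not the real DM interface:
every mock_* name, struct flush_stats, device_flush_stats() and
measure_flushes_sketch() are hypothetical stand-ins, and skipping
targets whose num_flush_requests is zero is an assumption about how a
target would opt out, not something the patch does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-disk flush statistics in the patch. */
struct mock_dev {
	uint64_t avg_flush_time_ns;
	uint64_t flush_samples;
};

/* Local temporary structure handed to the callback via the data pointer. */
struct flush_stats {
	uint64_t max;
	uint64_t samples;
};

/* Callback type: visit one device, return non-zero to stop iterating. */
typedef int (*mock_iterate_fn)(struct mock_dev *dev, void *data);

/* Hypothetical target that owns its devices and decides which get visited. */
struct mock_target {
	struct mock_dev *devs;
	size_t ndevs;
	unsigned int num_flush_requests;
};

/* Mock of a per-target iterate_devices method (assumed behavior). */
static int mock_iterate_devices(struct mock_target *ti,
				mock_iterate_fn fn, void *data)
{
	size_t i;
	int r;

	/* Assumption: a target that issues no flushes reports no devices. */
	if (!ti->num_flush_requests)
		return 0;

	for (i = 0; i < ti->ndevs; i++) {
		r = fn(&ti->devs[i], data);
		if (r)
			return r;
	}
	return 0;
}

/* Common callback: remember the slowest device and its sample count. */
static int device_flush_stats(struct mock_dev *dev, void *data)
{
	struct flush_stats *fs = data;

	if (dev->avg_flush_time_ns > fs->max) {
		fs->max = dev->avg_flush_time_ns;
		fs->samples = dev->flush_samples;
	}
	return 0;
}

/* Walk every target and combine the results in one accumulator. */
static struct flush_stats measure_flushes_sketch(struct mock_target *targets,
						 size_t ntargets)
{
	struct flush_stats fs = { 0, 0 };
	size_t i;

	for (i = 0; i < ntargets; i++)
		mock_iterate_devices(&targets[i], device_flush_stats, &fs);

	return fs;
}
```

With one flushing target holding devices that average 100 ns and 300 ns,
and a second target that takes no flushes holding a 900 ns device, the
accumulator ends up with the 300 ns device's values; the 900 ns device is
never consulted. The point of the shape is that the choice of devices
stays with each target rather than with the table-wide device list.
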
>  static void dm_done(struct request *clone, int error, bool mapped)
>  {
>  	int r = error;
>  	struct dm_rq_target_io *tio = clone->end_io_data;
>  	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>
> +	if (clone->cmd_flags & REQ_FLUSH)
> +		measure_flushes(tio->md);
> +
>  	if (mapped && rq_end_io)
>  		r = rq_end_io(tio->ti, clone, error, &tio->info);
>
> @@ -2310,6 +2336,8 @@ static void dm_wq_work(struct work_struct *work)
>  		if (dm_request_based(md))
>  			generic_make_request(c);
>  		else
> +			if (c->bi_rw & REQ_FLUSH)
> +				measure_flushes(md);
>  			__split_and_process_bio(md, c);
>
>  		down_read(&md->io_lock);
>

You're missing important curly braces for the else in your dm_wq_work()
change... As written, the else covers only the new 'if', so
__split_and_process_bio() would run for request-based devices too.

But the bio-based call to measure_flushes() (dm_wq_work's call) should
be pushed into __split_and_process_bio() -- and maybe measure_flushes()
could grow a 'struct dm_table *table' argument that, if not NULL, avoids
getting the reference that __split_and_process_bio() already has on the
live table.

Mike