Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753411Ab0LWPjw (ORCPT ); Thu, 23 Dec 2010 10:39:52 -0500 Received: from mx1.redhat.com ([209.132.183.28]:52530 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751704Ab0LWPjv (ORCPT ); Thu, 23 Dec 2010 10:39:51 -0500 Date: Thu, 23 Dec 2010 10:39:15 -0500 From: Vivek Goyal To: Jerome Marchand Cc: Jens Axboe , Satoru Takeuchi , Linus Torvalds , Yasuaki Ishimatsu , "linux-kernel@vger.kernel.org" Subject: Re: [PATCH] block: fix accounting bug on cross partition merges Message-ID: <20101223153915.GE9502@redhat.com> References: <4CFF3AD6.6010904@jp.fujitsu.com> <4CFF3C86.2070504@fusionio.com> <4CFF3DA4.5060705@jp.fujitsu.com> <4CFF9A2C.1070401@fusionio.com> <4D025154.8030400@redhat.com> <20101210165553.GE31737@redhat.com> <4D07D2AC.6000500@fusionio.com> <4D0B68AF.80804@redhat.com> <4D0BB4A1.8080305@fusionio.com> <4D13664C.3020500@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4D13664C.3020500@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9536 Lines: 295 On Thu, Dec 23, 2010 at 04:10:04PM +0100, Jerome Marchand wrote: > On 12/17/2010 08:06 PM, Jens Axboe wrote: > > On 2010-12-17 14:42, Jerome Marchand wrote: > >> > >> /proc/diskstats would display a strange output as follows. > > > > [snip] > > > > This looks a lot better! One comment: > > > >> diff --git a/block/blk-core.c b/block/blk-core.c > >> index 4ce953f..064921d 100644 > >> --- a/block/blk-core.c > >> +++ b/block/blk-core.c > >> @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io) > >> return; > >> > >> cpu = part_stat_lock(); > >> - part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); > >> > >> - if (!new_io) > >> + if (!new_io) { > >> + part = rq->part; > >> part_stat_inc(cpu, part, merges[rw]); > >> - else { > >> + } else { > >> + part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); > >> part_round_stats(cpu, part); > >> part_inc_in_flight(part, rw); > >> + kref_get(&part->ref); > >> + rq->part = part; > >> } > >> > >> part_stat_unlock(); > > > > I don't think this is completely safe. The rcu lock is held due to the > > part_stat_lock(), but that only prevents the __delete_partition() > > callback from happening. Lets say you have this: > > > > CPU0 CPU1 > > part = disk_map_sector_rcu() > > kref_put(part); <- now 0 > > A kref_get(part) happens here, or am I missing something? It might happen that kref_get(part) is called after kref_put() has been called and reference has reached 0 and and delete_parition() call has been scheduled as soon as rcu grace period is over. So if you do kref_get() after that, it is not going to help. > If we use an atomic kref_test_and_get() function, which only increment the > refcount if it's not zero, we would be safe. If kref_test_and_get() fails, > we can cache rq->rq_disk->part0 instead. See my drafty patch below. Conceptually it kind of makes sense to me. So if even if we get the pointer to partition under rcu_read_lock(), we will not account the IO to partition if it is going away. > > > part_stat_unlock() > > __delete_partition(); > > ... > > delete_partition_rcu_cb(); > > merge, or endio, boom > > > > Now rq has ->part pointing to freed memory, later merges or end > > accounting will touch freed memory. > > > > I think we can fix this by just having delete_partition_rcu_rb() check > > the reference count and return if non-zero. Since someone holds a > > reference to the table, they will drop it and we'll re-schedule the rcu > > callback. > > > > > > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index 4ce953f..72d12d2 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io) > return; > > cpu = part_stat_lock(); > - part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); > > - if (!new_io) > + if (!new_io) { > + part = rq->part; > part_stat_inc(cpu, part, merges[rw]); > - else { > + } else { > + part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); > + if(part->partno && !kref_test_and_get(&part->ref)) > + /* Do we have to check this part->partno both while taking and releasing reference. Can't we take one extra reference for disk->part0, at alloc_disk_node() time so that it is never freed and only freed when disk is going away and gendisk is being freed. That way, you don't have to differentiate between disk->part0 and rest of the partitions while taking or dropping references. Thanks Vivek > + * The partition is already being removed, > + * the request will be accounted on the disk only > + */ > + part = &rq->rq_disk->part0; > part_round_stats(cpu, part); > part_inc_in_flight(part, rw); > + rq->part = part; > } > > part_stat_unlock(); > @@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq) > rq->ref_count = 1; > rq->start_time = jiffies; > set_start_time_ns(rq); > + rq->part = NULL; > } > EXPORT_SYMBOL(blk_rq_init); > > @@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes) > int cpu; > > cpu = part_stat_lock(); > - part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req)); > + part = req->part; > part_stat_add(cpu, part, sectors[rw], bytes >> 9); > part_stat_unlock(); > } > @@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req) > int cpu; > > cpu = part_stat_lock(); > - part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req)); > + part = req->part; > > part_stat_inc(cpu, part, ios[rw]); > part_stat_add(cpu, part, ticks[rw], duration); > part_round_stats(cpu, part); > part_dec_in_flight(part, rw); > > + if (part->partno) > + kref_put(&part->ref, __delete_partition); > part_stat_unlock(); > } > } > diff --git a/block/blk-merge.c b/block/blk-merge.c > index 77b7c26..3334578 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -351,11 +351,13 @@ static void blk_account_io_merge(struct request *req) > int cpu; > > cpu = part_stat_lock(); > - part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req)); > + part = req->part; > > part_round_stats(cpu, part); > part_dec_in_flight(part, rq_data_dir(req)); > > + if (part->partno) > + kref_put(&part->ref, __delete_partition); > part_stat_unlock(); > } > } > diff --git a/block/genhd.c b/block/genhd.c > index 5fa2b44..0c55eae 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id) > return NULL; > } > disk->part_tbl->part[0] = &disk->part0; > + kref_init(&disk->part0.ref); > > disk->minors = minors; > rand_initialize_disk(disk); > diff --git a/fs/partitions/check.c b/fs/partitions/check.c > index 0a8b0ad..0123717 100644 > --- a/fs/partitions/check.c > +++ b/fs/partitions/check.c > @@ -372,6 +372,13 @@ static void delete_partition_rcu_cb(struct rcu_head *head) > put_device(part_to_dev(part)); > } > > +void __delete_partition(struct kref *ref) > +{ > + struct hd_struct *part = container_of(ref, struct hd_struct, ref); > + > + call_rcu(&part->rcu_head, delete_partition_rcu_cb); > +} > + > void delete_partition(struct gendisk *disk, int partno) > { > struct disk_part_tbl *ptbl = disk->part_tbl; > @@ -390,7 +397,7 @@ void delete_partition(struct gendisk *disk, int partno) > kobject_put(part->holder_dir); > device_del(part_to_dev(part)); > > - call_rcu(&part->rcu_head, delete_partition_rcu_cb); > + kref_put(&part->ref, __delete_partition); > } > > static ssize_t whole_disk_show(struct device *dev, > @@ -489,6 +496,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno, > if (!dev_get_uevent_suppress(ddev)) > kobject_uevent(&pdev->kobj, KOBJ_ADD); > > + kref_init(&p->ref); > return p; > > out_free_info: > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index aae86fd..482a7fd 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -115,6 +115,7 @@ struct request { > void *elevator_private3; > > struct gendisk *rq_disk; > + struct hd_struct *part; > unsigned long start_time; > #ifdef CONFIG_BLK_CGROUP > unsigned long long start_time_ns; > diff --git a/include/linux/genhd.h b/include/linux/genhd.h > index 7a7b9c1..2ba2792 100644 > --- a/include/linux/genhd.h > +++ b/include/linux/genhd.h > @@ -116,6 +116,7 @@ struct hd_struct { > struct disk_stats dkstats; > #endif > struct rcu_head rcu_head; > + struct kref ref; > }; > > #define GENHD_FL_REMOVABLE 1 > @@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk, > sector_t len, int flags, > struct partition_meta_info > *info); > +extern void __delete_partition(struct kref *ref); > extern void delete_partition(struct gendisk *, int); > extern void printk_all_partitions(void); > > diff --git a/include/linux/kref.h b/include/linux/kref.h > index 6cc38fc..90b9e44 100644 > --- a/include/linux/kref.h > +++ b/include/linux/kref.h > @@ -23,6 +23,7 @@ struct kref { > > void kref_init(struct kref *kref); > void kref_get(struct kref *kref); > +int kref_test_and_get(struct kref *kref); > int kref_put(struct kref *kref, void (*release) (struct kref *kref)); > > #endif /* _KREF_H_ */ > diff --git a/lib/kref.c b/lib/kref.c > index d3d227a..5f663b9 100644 > --- a/lib/kref.c > +++ b/lib/kref.c > @@ -37,6 +37,22 @@ void kref_get(struct kref *kref) > } > > /** > + * kref_test_and_get - increment refcount for object only if refcount is not > + * zero. > + * @kref: object. > + * > + * Return non-zero if the refcount was incremented, 0 otherwise > + */ > +int kref_test_and_get(struct kref *kref) > +{ > + int ret; > + smp_mb__before_atomic_inc(); > + ret = atomic_inc_not_zero(&kref->refcount); > + smp_mb__after_atomic_inc(); > + return ret; > +} > + > +/** > * kref_put - decrement refcount for object. > * @kref: object. > * @release: pointer to the function that will clean up the object when the -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/