2009-04-16 07:27:18

by Nikanth Karthikesan

[permalink] [raw]
Subject: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

The disk statistics exported to userspace through proc and sysfs are not
protected by locks to avoid performance overhead. Since most of the statistics
are maintained in the per_cpu struct disk_stats, the chances of them getting
corrupted is negligible. But the in_flight counter, that records the no of
requests currently in progress is not per-cpu. This increases the chance of it
getting corrupted. And corruption of this value would result in visibly
distorted statistics such as negative in_flight. This can be avoided by making
this field atomic.

Signed-off-by: Nikanth Karthikesan <[email protected]>

---

diff --git a/block/blk-core.c b/block/blk-core.c
index 07ab754..c295deb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1013,12 +1013,15 @@ static inline void add_request(struct request_queue *q, struct request *req)
static void part_round_stats_single(int cpu, struct hd_struct *part,
unsigned long now)
{
+ int in_flight;
+
if (now == part->stamp)
return;

- if (part->in_flight) {
+ in_flight = atomic_read(&part->in_flight);
+ if (in_flight) {
__part_stat_add(cpu, part, time_in_queue,
- part->in_flight * (now - part->stamp));
+ in_flight * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
}
part->stamp = now;
diff --git a/block/genhd.c b/block/genhd.c
index a9ec910..9436991 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1028,7 +1028,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
part_stat_read(hd, merges[1]),
(unsigned long long)part_stat_read(hd, sectors[1]),
jiffies_to_msecs(part_stat_read(hd, ticks[1])),
- hd->in_flight,
+ atomic_read(&hd->in_flight),
jiffies_to_msecs(part_stat_read(hd, io_ticks)),
jiffies_to_msecs(part_stat_read(hd, time_in_queue))
);
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 99e33ef..ae1f55a 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -241,7 +241,7 @@ ssize_t part_stat_show(struct device *dev,
part_stat_read(p, merges[WRITE]),
(unsigned long long)part_stat_read(p, sectors[WRITE]),
jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
- p->in_flight,
+ atomic_read(&p->in_flight),
jiffies_to_msecs(part_stat_read(p, io_ticks)),
jiffies_to_msecs(part_stat_read(p, time_in_queue)));
}
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 634c530..5921400 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -97,7 +97,7 @@ struct hd_struct {
int make_it_fail;
#endif
unsigned long stamp;
- int in_flight;
+ atomic_t in_flight;
#ifdef CONFIG_SMP
struct disk_stats *dkstats;
#else
@@ -321,16 +321,16 @@ static inline void free_part_stats(struct hd_struct *part)

static inline void part_inc_in_flight(struct hd_struct *part)
{
- part->in_flight++;
+ atomic_inc(&part->in_flight);
if (part->partno)
- part_to_disk(part)->part0.in_flight++;
+ atomic_inc(&part_to_disk(part)->part0.in_flight);
}

static inline void part_dec_in_flight(struct hd_struct *part)
{
- part->in_flight--;
+ atomic_dec(&part->in_flight);
if (part->partno)
- part_to_disk(part)->part0.in_flight--;
+ atomic_dec(&part_to_disk(part)->part0.in_flight);
}

/* block/blk-core.c */


2009-04-16 07:36:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

On Thu, Apr 16 2009, Nikanth Karthikesan wrote:
> The disk statistics exported to userspace through proc and sysfs are
> not protected by locks to avoid performance overhead. Since most of
> the statistics are maintained in the per_cpu struct disk_stats, the
> chances of them getting corrupted is negligible. But the in_flight
> counter, that records the no of requests currently in progress is not
> per-cpu. This increases the chance of it getting corrupted. And
> corruption of this value would result in visibly distorted statistics
> such as negative in_flight. This can be avoided by making this field
> atomic.

Hmm. Did you observe this behaviour? A quick glance at the code reveals
that the callers of part_inc_in_flight() and part_dec_in_flight() in the
block layer are always done under the queue lock. Ditto
part_round_stats(), which calls part_round_stats_single() and also needs
protection for in_flight.

That basically just leaves the code reading this out and reporting, and
driver calls to part_round_stats(). I'd suggest looking there instead,
we're not going to make ->in_flight an atomic just because of some
silliness there that could be fixed.

--
Jens Axboe

2009-04-16 09:17:35

by Nikanth Karthikesan

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

On Thursday 16 April 2009 13:05:57 Jens Axboe wrote:
> On Thu, Apr 16 2009, Nikanth Karthikesan wrote:
> > The disk statistics exported to userspace through proc and sysfs are
> > not protected by locks to avoid performance overhead. Since most of
> > the statistics are maintained in the per_cpu struct disk_stats, the
> > chances of them getting corrupted is negligible. But the in_flight
> > counter, that records the no of requests currently in progress is not
> > per-cpu. This increases the chance of it getting corrupted. And
> > corruption of this value would result in visibly distorted statistics
> > such as negative in_flight. This can be avoided by making this field
> > atomic.
>
> Hmm. Did you observe this behaviour?

Sorry, not on current kernels. But on a very old 2.6.5 kernel.

Reading Documentation/iostats.txt and the changelog of commit
e71bf0d0ee89e51b92776391c5634938236977d5 made me assume that this could be a
problem even today.

> A quick glance at the code reveals
> that the callers of part_inc_in_flight() and part_dec_in_flight() in the
> block layer are always done under the queue lock. Ditto
> part_round_stats(), which calls part_round_stats_single() and also needs
> protection for in_flight.
>
> That basically just leaves the code reading this out and reporting, and
> driver calls to part_round_stats(). I'd suggest looking there instead,
> we're not going to make ->in_flight an atomic just because of some
> silliness there that could be fixed.

Isn't this also true for the stats protected by the part_stat_lock()? Only
places where we are only reading seems to be called without the queue lock.

Thanks
Nikanth

2009-04-16 14:41:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

Hello, Nikanth, Jens.

Nikanth Karthikesan wrote:
>> Hmm. Did you observe this behaviour?
>
> Sorry, not on current kernels. But on a very old 2.6.5 kernel.
>
> Reading Documentation/iostats.txt and the changelog of commit
> e71bf0d0ee89e51b92776391c5634938236977d5 made me assume that this could be a
> problem even today.

The only problem we can run into there is if a request doesn't get
attributed to a partition on issue but gets attributed to a partition
on completion, which seems to be possible if a new partition is added
while IO on the whole device which fell into the new partition area is
already in progress, which, on the first glance, seems possible if the
admin tries really hard. I think we can get around the problem by
doing part->in_flight = min(max(new_val, part0->in_flight), 0) in
dec_in_flight(). This is pretty extreme corner case tho.

>> A quick glance at the code reveals
>> that the callers of part_inc_in_flight() and part_dec_in_flight() in the
>> block layer are always done under the queue lock. Ditto
>> part_round_stats(), which calls part_round_stats_single() and also needs
>> protection for in_flight.
>>
>> That basically just leaves the code reading this out and reporting, and
>> driver calls to part_round_stats(). I'd suggest looking there instead,
>> we're not going to make ->in_flight an atomic just because of some
>> silliness there that could be fixed.
>
> Isn't this also true for the stats protected by the
> part_stat_lock()? Only places where we are only reading seems to be
> called without the queue lock.

part_stat_lock() doesn't protect against simultaneous access. I don't
think we have any place where in_flight is updated without queuelock
and the counters being equal to or smaller then ulong, reading
shouldn't be a problem.

I don't think the bug you saw in 2.6.5 kernel applies to upstream
kernel. The minus in_flight value was seen on the diskstats of the
whole device which can't be affected by partition coming up while IOs
are in progress.

Thanks.

--
tejun

2009-04-16 16:33:12

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

On Thu, Apr 16 2009, Tejun Heo wrote:
> Hello, Nikanth, Jens.
>
> Nikanth Karthikesan wrote:
> >> Hmm. Did you observe this behaviour?
> >
> > Sorry, not on current kernels. But on a very old 2.6.5 kernel.
> >
> > Reading Documentation/iostats.txt and the changelog of commit
> > e71bf0d0ee89e51b92776391c5634938236977d5 made me assume that this could be a
> > problem even today.
>
> The only problem we can run into there is if a request doesn't get
> attributed to a partition on issue but gets attributed to a partition
> on completion, which seems to be possible if a new partition is added
> while IO on the whole device which fell into the new partition area is
> already in progress, which, on the first glance, seems possible if the
> admin tries really hard. I think we can get around the problem by
> doing part->in_flight = min(max(new_val, part0->in_flight), 0) in
> dec_in_flight(). This is pretty extreme corner case tho.

Heh, that is pretty extreme. I'd prefer just quiescing the queue,
perhaps we should do that for partition map swaps.

> >> A quick glance at the code reveals
> >> that the callers of part_inc_in_flight() and part_dec_in_flight() in the
> >> block layer are always done under the queue lock. Ditto
> >> part_round_stats(), which calls part_round_stats_single() and also needs
> >> protection for in_flight.
> >>
> >> That basically just leaves the code reading this out and reporting, and
> >> driver calls to part_round_stats(). I'd suggest looking there instead,
> >> we're not going to make ->in_flight an atomic just because of some
> >> silliness there that could be fixed.
> >
> > Isn't this also true for the stats protected by the
> > part_stat_lock()? Only places where we are only reading seems to be
> > called without the queue lock.
>
> part_stat_lock() doesn't protect against simultaneous access. I don't
> think we have any place where in_flight is updated without queuelock
> and the counters being equal to or smaller then ulong, reading
> shouldn't be a problem.
>
> I don't think the bug you saw in 2.6.5 kernel applies to upstream
> kernel. The minus in_flight value was seen on the diskstats of the
> whole device which can't be affected by partition coming up while IOs
> are in progress.

Plus at least early versions of the SLES9 kernel had a missing lock
around io stat updates for SCSI. But that has been plugged for a long
time, so probably unrelated to this case as well.

--
Jens Axboe

2009-04-19 08:52:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

Hello,

Jens Axboe wrote:
> On Thu, Apr 16 2009, Tejun Heo wrote:
>> Hello, Nikanth, Jens.
>>
>> Nikanth Karthikesan wrote:
>>>> Hmm. Did you observe this behaviour?
>>> Sorry, not on current kernels. But on a very old 2.6.5 kernel.
>>>
>>> Reading Documentation/iostats.txt and the changelog of commit
>>> e71bf0d0ee89e51b92776391c5634938236977d5 made me assume that this could be a
>>> problem even today.
>> The only problem we can run into there is if a request doesn't get
>> attributed to a partition on issue but gets attributed to a partition
>> on completion, which seems to be possible if a new partition is added
>> while IO on the whole device which fell into the new partition area is
>> already in progress, which, on the first glance, seems possible if the
>> admin tries really hard. I think we can get around the problem by
>> doing part->in_flight = min(max(new_val, part0->in_flight), 0) in
>> dec_in_flight(). This is pretty extreme corner case tho.
>
> Heh, that is pretty extreme. I'd prefer just quiescing the queue,
> perhaps we should do that for partition map swaps.

Yeah, I think that would be the better approach for swapping ptbl.
RCU isn't really necessary there.

Thanks.

--
tejun

2009-04-21 07:32:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] [RFC] make hd_struct->in_flight atomic to avoid diskstat corruption

On Sun, Apr 19 2009, Tejun Heo wrote:
> Hello,
>
> Jens Axboe wrote:
> > On Thu, Apr 16 2009, Tejun Heo wrote:
> >> Hello, Nikanth, Jens.
> >>
> >> Nikanth Karthikesan wrote:
> >>>> Hmm. Did you observe this behaviour?
> >>> Sorry, not on current kernels. But on a very old 2.6.5 kernel.
> >>>
> >>> Reading Documentation/iostats.txt and the changelog of commit
> >>> e71bf0d0ee89e51b92776391c5634938236977d5 made me assume that this could be a
> >>> problem even today.
> >> The only problem we can run into there is if a request doesn't get
> >> attributed to a partition on issue but gets attributed to a partition
> >> on completion, which seems to be possible if a new partition is added
> >> while IO on the whole device which fell into the new partition area is
> >> already in progress, which, on the first glance, seems possible if the
> >> admin tries really hard. I think we can get around the problem by
> >> doing part->in_flight = min(max(new_val, part0->in_flight), 0) in
> >> dec_in_flight(). This is pretty extreme corner case tho.
> >
> > Heh, that is pretty extreme. I'd prefer just quiescing the queue,
> > perhaps we should do that for partition map swaps.
>
> Yeah, I think that would be the better approach for swapping ptbl.
> RCU isn't really necessary there.

I think so, it definitely makes more sense to make the swap operation
heavier if we can make the read side free. Doesn't make much sense to
optimize for swapping partition tables ;)

--
Jens Axboe