2020-06-09 04:09:50

by Josh Snyder

Subject: [RFC 0/2] Increase accuracy and precision of sampled io_ticks

Commit 5b18b5a73760 ("block: delete part_round_stats and switch to less
precise counting") introduces a sampling technique for calculating
io_ticks. The sampling algorithm introduces bias in the calculation of
I/O utilization. In my production system, this bias means that a
workload which previously reported 10% I/O utilization now reports 80%.
Patch 1 of this series eliminates the bias.

The sampling technique is also subject to statistical noise. Because it
infers io_ticks based on only 100 samples per second, io_ticks becomes
imprecise, and subject to swings when measuring both random and
deterministic workloads. Patch 2 of this series provides increased
precision by raising the sampling rate.



2020-06-09 04:10:07

by Josh Snyder

Subject: [RFC 1/2] Eliminate over- and under-counting of io_ticks

Previously, io_ticks could be under-counted. Consider these I/Os along
the time axis (in jiffies):

t        012345678
io1      |----|
io2          |---|

Under the old approach, io_ticks would count up to 6, like so:

t        012345678
io1      |----|
io2          |---|
stamp    0   45  8
io_ticks 1   23  6

With this change, io_ticks instead counts to 8, eliminating the
under-counting:

t        012345678
io1      |----|
io2          |---|
stamp    0    5  8
io_ticks 0    5  8

It was also possible for io_ticks to be over-counted. Consider a
workload that issues I/Os deterministically at intervals of 8ms (125Hz).
If each I/O takes 1ms, then the true utilization is 12.5%. The previous
implementation increments io_ticks once for each jiffy in which an
I/O ends. Since the workload reliably completes an I/O in every jiffy, the
reported utilization becomes 100%. This commit changes the approach such
that only I/Os which cross a boundary between jiffies are counted. With
this change, the given workload would count an I/O tick on every eighth
jiffy, resulting in a (correct) calculated utilization of 12.5%.
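
To make the difference concrete, the following standalone userspace sketch
(illustration only, not kernel code) replays the io1/io2 timeline from the
diagrams above under both rules: the old rule adds 1 at submission and
now - stamp at completion, while the new rule adds only the non-overlapping
time at completion. It prints 6 for the old rule and 8 for the new one.

/*
 * Illustrative userspace replay of the example above:
 *   io1: starts at t=0, ends at t=5
 *   io2: starts at t=4, ends at t=8
 */
#include <stdio.h>

struct ev { unsigned long t, start; int is_end; };

int main(void)
{
    /* events in time order: io1 start, io2 start, io1 end, io2 end */
    struct ev evs[] = {
        { 0, 0, 0 }, { 4, 4, 0 }, { 5, 0, 1 }, { 8, 4, 1 },
    };
    unsigned long stamp, ticks;
    int i;

    /* old rule: +1 at submission, +(now - stamp) at completion */
    stamp = ~0UL;
    ticks = 0;
    for (i = 0; i < 4; i++) {
        unsigned long now = evs[i].t;
        if (stamp != now) {
            ticks += evs[i].is_end ? now - stamp : 1;
            stamp = now;
        }
    }
    printf("old rule: io_ticks = %lu (true busy time is 8)\n", ticks);

    /* new rule: account only at completion, by the non-overlapping time */
    stamp = 0;
    ticks = 0;
    for (i = 0; i < 4; i++) {
        unsigned long now = evs[i].t, start = evs[i].start;
        if (!evs[i].is_end)
            continue;
        if (stamp != now) {
            ticks += now - (start < stamp ? stamp : start);
            stamp = now;
        }
    }
    printf("new rule: io_ticks = %lu\n", ticks);
    return 0;
}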

Signed-off-by: Josh Snyder <[email protected]>
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
block/blk-core.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d1b79dfe9540..a0bbd9e099b9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
}
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);

-static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
{
unsigned long stamp;
+ unsigned long elapsed;
again:
stamp = READ_ONCE(part->stamp);
if (unlikely(stamp != now)) {
- if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
- __part_stat_add(part, io_ticks, end ? now - stamp : 1);
+ if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+ // stamp denotes the last IO to finish
+ // If this IO started before stamp, then there was overlap between this IO
+ // and that one. We increment only by the non-overlap time.
+ // If not, there was no overlap and we increment by our own time,
+ // disregarding stamp.
+ elapsed = now - (start < stamp ? stamp : start);
+ __part_stat_add(part, io_ticks, elapsed);
+ }
}
if (part->partno) {
part = &part_to_disk(part)->part0;
@@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
part_stat_lock();
part = req->part;

- update_io_ticks(part, jiffies, true);
+ update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
part_stat_unlock();
@@ -1456,7 +1464,6 @@ void blk_account_io_start(struct request *rq)
rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

part_stat_lock();
- update_io_ticks(rq->part, jiffies, false);
part_stat_unlock();
}

@@ -1468,7 +1475,6 @@ unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
unsigned long now = READ_ONCE(jiffies);

part_stat_lock();
- update_io_ticks(part, now, false);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, sectors[sgrp], sectors);
part_stat_local_inc(part, in_flight[op_is_write(op)]);
@@ -1487,7 +1493,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
unsigned long duration = now - start_time;

part_stat_lock();
- update_io_ticks(part, now, true);
+ update_io_ticks(part, now, start_time);
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_local_dec(part, in_flight[op_is_write(op)]);
part_stat_unlock();
--
2.25.1

2020-06-09 04:11:00

by Josh Snyder

Subject: [RFC 2/2] Track io_ticks at microsecond granularity.

Previously, we performed truncation of I/O issue/completion times during
calculation of io_ticks, counting only I/Os which cross a jiffy
boundary. The effect is a sampling of I/Os: at every boundary between
jiffies we ask "is there an outstanding I/O" and increment a counter if
the answer is yes. This produces results that are accurate (they don't
systematically over- or under-count), but not precise (there is high
variance associated with only taking 100 samples per second).

This change modifies the sampling rate from 100Hz to 976562.5Hz (1
sample per 1024 nanoseconds). I chose this sampling rate by simulating a
workload in which I/Os are issued randomly (by a Poisson process), and
processed in constant time: an M/D/∞ system (Kendall's notation). My
goal was to produce a sampled utilization fraction which was correct to
one part-per-thousand given one second of samples.
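
As a rough sanity check of that goal (an independent-samples approximation
only; adjacent samples are correlated whenever an I/O spans several sample
intervals, which the simulation below does capture), the standard error of a
Bernoulli estimate of utilization u from n samples is sqrt(u * (1 - u) / n).
The short sketch below evaluates that bound at u = 1%, the utilization of the
example workload, for the candidate sampling rates:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* u: true utilization of the example workload (avgqu_sz = 0.01) */
    const double u = 0.01;
    /* candidate sampling rates, in samples per second */
    const double rates[] = { 100, 1000, 1e9 / 65536, 1e9 / 16384,
                             1e9 / 1024, 1e9 / 64 };
    size_t i;

    for (i = 0; i < sizeof(rates) / sizeof(rates[0]); i++)
        printf("%12.1f samples/s -> stderr ~ %.4f%% util\n",
               rates[i], 100.0 * sqrt(u * (1.0 - u) / rates[i]));
    return 0;
}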

The tradeoff of the higher sampling rate is increased synchronization
overhead caused by more frequent compare-and-swap operations. The
technique of commit 5b18b5a73760 ("block: delete part_round_stats and
switch to less precise counting") is to allow multiple I/Os to complete
while performing only one synchronized operation. As we are increasing
the sample rate by a factor of 10000, we will less frequently be able to
exercise the synchronization-free code path.
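
The mechanism behind the higher rate is a simple mask: start and completion
times in nanoseconds are rounded down to a multiple of
2^IO_TICKS_COARSENESS = 1024 ns, and completions whose rounded times are
equal still hit the stamp == now fast path. Below is a minimal userspace
sketch of that quantization (written with 1ULL so it stands alone;
illustration only):

#include <stdio.h>

#define IO_TICKS_COARSENESS 10

/* round a nanosecond timestamp down to a 1024 ns boundary */
static unsigned long long coarsen(unsigned long long ns)
{
    return ns & ~((1ULL << IO_TICKS_COARSENESS) - 1);
}

int main(void)
{
    unsigned long long a = 123456789ULL;  /* first completion           */
    unsigned long long b = 123457000ULL;  /* 211 ns later: same window  */
    unsigned long long c = 123458000ULL;  /* 1211 ns later: next window */

    printf("coarsen(a) = %llu\n", coarsen(a));
    printf("coarsen(b) = %llu (same window as a: %s)\n",
           coarsen(b), coarsen(a) == coarsen(b) ? "yes" : "no");
    printf("coarsen(c) = %llu (same window as a: %s)\n",
           coarsen(c), coarsen(a) == coarsen(c) ? "yes" : "no");
    return 0;
}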

Included below is the Python script I used to perform the simulation. It
estimates the correct (calculated without sampling) value of %util, and
then reports the root-mean-squared error of the as-sampled estimates.
The parameters `io_rate`, `sample_rates`, and `avgqu_sz` are meant to be
tweaked to fit characteristics of a given workload. I have chosen to
simulate against a difficult workload: 1000 I/Os per second with an
average queue size of 0.01, implying that each I/O takes 10
microseconds. This I/O latency is on par with some of the fastest
production block devices available today, and an order of magnitude
faster than a typical datacenter-grade SSD. With this change, an
estimate of disk %util no longer fluctuates when displayed by iostat to
four decimal places at a refresh rate of 1 Hz.

#!/usr/bin/env python3
from math import log
from math import sqrt
from random import random

GIGA = 1_000_000_000
SECOND = GIGA

def times(interval, avgqu_sz, sample_rates):
    time = 0
    correct = 0

    est_counters = [0] * len(sample_rates)

    while time < SECOND:
        gap = -log(random()) * interval
        busy = svctm if gap > svctm else gap
        finish_time = time + busy

        correct += busy
        for i, rate in enumerate(sample_rates):
            est_counters[i] += (
                float(int(finish_time * rate)) - int(time * rate)
            )

        time += gap

    return correct, [
        correct - (counter / rate)
        for counter, rate in zip(est_counters, sample_rates)
    ]

# How many I/Os per second?
io_rate = 1000
# How frequently are we sampling? (GHz)
sample_rates = [
    100 / GIGA,   # 100 Hz
    1000 / GIGA,  # 1000 Hz
    1 / 65536,    # 15259 Hz
    1 / 16384,    # 61035 Hz
    1 / 1024,     # 976563 Hz
    1 / 64,       # 15625000 Hz
]
avgqu_sz = 0.01

interval = SECOND / io_rate
svctm = interval * avgqu_sz
total = 0
total_errors = [0] * len(sample_rates)
count = 0
while True:
    correct, errors = times(interval, svctm, sample_rates)
    for i, error in enumerate(errors):
        total_errors[i] += error * error
    total += correct / SECOND
    count += 1

    # prints [{RMS error} for rate in sample_rates]
    to_print = [
        "{:05.2f}".format(100 * sqrt(error / count) / SECOND)
        for error in total_errors
    ]
    print(' '.join(to_print))

Signed-off-by: Josh Snyder <[email protected]>
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
block/blk-core.c | 16 +++++++++++-----
block/genhd.c | 4 ++--
include/linux/genhd.h | 2 +-
include/linux/part_stat.h | 2 +-
4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a0bbd9e099b9..2749c52d649c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -62,6 +62,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);

DEFINE_IDA(blk_queue_ida);

+#define IO_TICKS_COARSENESS 10
+
/*
* For queue allocation
*/
@@ -1396,10 +1398,14 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
}
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);

-static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
+static void update_io_ticks(struct hd_struct *part, u64 now, u64 start)
{
- unsigned long stamp;
- unsigned long elapsed;
+ u64 stamp;
+ u64 elapsed;
+
+ start &= ~((1<<IO_TICKS_COARSENESS) - 1);
+ now &= ~((1<<IO_TICKS_COARSENESS) - 1);
+
again:
stamp = READ_ONCE(part->stamp);
if (unlikely(stamp != now)) {
@@ -1447,7 +1453,7 @@ void blk_account_io_done(struct request *req, u64 now)
part_stat_lock();
part = req->part;

- update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));
+ update_io_ticks(part, now, req->start_time_ns);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
part_stat_unlock();
@@ -1493,7 +1499,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
unsigned long duration = now - start_time;

part_stat_lock();
- update_io_ticks(part, now, start_time);
+ update_io_ticks(part, jiffies_to_nsecs(now), jiffies_to_nsecs(start_time));
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_local_dec(part, in_flight[op_is_write(op)]);
part_stat_unlock();
diff --git a/block/genhd.c b/block/genhd.c
index 1a7659327664..045cc9cd7a2c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1296,7 +1296,7 @@ ssize_t part_stat_show(struct device *dev,
(unsigned long long)stat.sectors[STAT_WRITE],
(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
inflight,
- jiffies_to_msecs(stat.io_ticks),
+ (unsigned int)div_u64(stat.io_ticks, NSEC_PER_MSEC),
(unsigned int)div_u64(stat.nsecs[STAT_READ] +
stat.nsecs[STAT_WRITE] +
stat.nsecs[STAT_DISCARD] +
@@ -1601,7 +1601,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
(unsigned int)div_u64(stat.nsecs[STAT_WRITE],
NSEC_PER_MSEC),
inflight,
- jiffies_to_msecs(stat.io_ticks),
+ (unsigned int)div_u64(stat.io_ticks, NSEC_PER_MSEC),
(unsigned int)div_u64(stat.nsecs[STAT_READ] +
stat.nsecs[STAT_WRITE] +
stat.nsecs[STAT_DISCARD] +
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 392aad5e29a2..ce13f47a4674 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -62,7 +62,7 @@ struct hd_struct {
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
seqcount_t nr_sects_seq;
#endif
- unsigned long stamp;
+ u64 stamp;
struct disk_stats __percpu *dkstats;
struct percpu_ref ref;

diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h
index 24125778ef3e..208904b2447d 100644
--- a/include/linux/part_stat.h
+++ b/include/linux/part_stat.h
@@ -9,7 +9,7 @@ struct disk_stats {
unsigned long sectors[NR_STAT_GROUPS];
unsigned long ios[NR_STAT_GROUPS];
unsigned long merges[NR_STAT_GROUPS];
- unsigned long io_ticks;
+ u64 io_ticks;
local_t in_flight[2];
};

--
2.25.1

2020-06-09 08:11:04

by Ming Lei

Subject: Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks

On Mon, Jun 08, 2020 at 09:07:23PM -0700, Josh Snyder wrote:
> Previously, io_ticks could be under-counted. Consider these I/Os along
> the time axis (in jiffies):
>
> t        012345678
> io1      |----|
> io2          |---|

With the current approach, when io2 is done, io_ticks should be 5, since only
1 tick is added for the two I/O starts.

>
> Under the old approach, io_ticks would count up to 6, like so:
>
> t        012345678
> io1      |----|
> io2          |---|
> stamp    0   45  8
> io_ticks 1   23  6

Before commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"),
io_ticks was calculated accurately, which is basically:

(4 - 0) + (5 - 4) + (8 - 5) = 8

>
> With this change, io_ticks instead counts to 8, eliminating the
> under-counting:
>
> t        012345678
> io1      |----|
> io2          |---|
> stamp    0    5  8
> io_ticks 0    5  8
>
> It was also possible for io_ticks to be over-counted. Consider a
> workload that issues I/Os deterministically at intervals of 8ms (125Hz).
> If each I/O takes 1ms, then the true utilization is 12.5%. The previous
> implementation will increment io_ticks once for each jiffy in which an
> I/O ends. Since the workload issues an I/O reliably for each jiffy, the
> reported utilization will be 100%. This commit changes the approach such
> that only I/Os which cross a boundary between jiffies are counted. With
> this change, the given workload would count an I/O tick on every eighth
> jiffy, resulting in a (correct) calculated utilization of 12.5%.
>
> Signed-off-by: Josh Snyder <[email protected]>
> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
> ---
> block/blk-core.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d1b79dfe9540..a0bbd9e099b9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
> }
> EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
>
> -static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
> {
> unsigned long stamp;
> + unsigned long elapsed;
> again:
> stamp = READ_ONCE(part->stamp);
> if (unlikely(stamp != now)) {
> - if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
> - __part_stat_add(part, io_ticks, end ? now - stamp : 1);
> + if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> + // stamp denotes the last IO to finish
> + // If this IO started before stamp, then there was overlap between this IO
> + // and that one. We increment only by the non-overlap time.
> + // If not, there was no overlap and we increment by our own time,
> + // disregarding stamp.

The Linux kernel's comment style is '/* */'.

> + elapsed = now - (start < stamp ? stamp : start);
> + __part_stat_add(part, io_ticks, elapsed);

This way of sampling only at I/O completion looks smart; io_ticks becomes much
more accurate than before.

> + }
> }
> if (part->partno) {
> part = &part_to_disk(part)->part0;
> @@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
> part_stat_lock();
> part = req->part;
>
> - update_io_ticks(part, jiffies, true);
> + update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));

jiffies and req->start_time_ns may be from different clock sources, so
I'd suggest merging the two patches into one.


Thanks,
Ming

2020-06-09 08:53:17

by Ming Lei

Subject: Re: [RFC 2/2] Track io_ticks at microsecond granularity.

On Mon, Jun 08, 2020 at 09:07:24PM -0700, Josh Snyder wrote:
> Previously, we performed truncation of I/O issue/completion times during
> calculation of io_ticks, counting only I/Os which cross a jiffy
> boundary. The effect is a sampling of I/Os: at every boundary between
> jiffies we ask "is there an outstanding I/O" and increment a counter if
> the answer is yes. This produces results that are accurate (they don't
> systematically over- or under-count), but not precise (there is high
> variance associated with only taking 100 samples per second).
>
> This change modifies the sampling rate from 100Hz to 976562.5Hz (1
> sample per 1024 nanoseconds). I chose this sampling rate by simulating a
> workload in which I/Os are issued randomly (by a Poisson process), and
> processed in constant time: an M/D/∞ system (Kendall's notation). My
> goal was to produce a sampled utilization fraction which was correct to
> one part-per-thousand given one second of samples.
>
> The tradeoff of the higher sampling rate is increased synchronization
> overhead caused by more frequent compare-and-swap operations. The
> technique of commit 5b18b5a73760 ("block: delete part_round_stats and
> switch to less precise counting") is to allow multiple I/Os to complete
> while performing only one synchronized operation. As we are increasing
> the sample rate by a factor of 10000, we will less frequently be able to
> exercise the synchronization-free code path.

I am not sure we need %util to be so precise, and a ~1M/sec sampling rate may
cause cmpxchg() to run 1M times per second for each partition, whose overhead
might be observable.
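
As a very rough way to put a number on that concern, a single-threaded harness
like the sketch below (illustration only; it times an uncontended
compare-and-swap on one cache line and ignores cross-CPU contention, which is
where the real cost would appear) gives a baseline cost per operation:

#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned long stamp = 0;
    unsigned long i, iters = 10 * 1000 * 1000;
    struct timespec t0, t1;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        unsigned long old = stamp;
        /* uncontended compare-and-swap, always succeeds here */
        __atomic_compare_exchange_n(&stamp, &old, old + 1, 0,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per uncontended cmpxchg\n", ns / iters);
    return 0;
}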

Thanks,
Ming

2020-06-10 01:43:46

by Hou Tao

Subject: Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks

Hi,

On 2020/6/9 12:07, Josh Snyder wrote:
> Previously, io_ticks could be under-counted. Consider these I/Os along
> the time axis (in jiffies):
>
> t        012345678
> io1      |----|
> io2          |---|
>
> Under the old approach, io_ticks would count up to 6, like so:
>
> t        012345678
> io1      |----|
> io2          |---|
> stamp    0   45  8
> io_ticks 1   23  6
>
> With this change, io_ticks instead counts to 8, eliminating the
> under-counting:
>
> t        012345678
> io1      |----|
> io2          |---|
> stamp    0    5  8
> io_ticks 0    5  8
>
For the following case, under-counting is still possible if io2 wins the cmpxchg():

t        0123456
io1      |-----|
io2         |--|
stamp    0     6
io_ticks 0     3

However, considering that patch 2 improves the sampling interval to about 1 us, the problem will be gone.
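
A minimal userspace replay of this race (illustration only): if io2's
completion is processed first, stamp already equals io1's completion jiffy
when io1 finishes, so io1 adds nothing and the interval before io2 started is
lost.

#include <stdio.h>

int main(void)
{
    unsigned long stamp = 0, ticks = 0;
    unsigned long now, start;

    /* io2 completes first and wins: now = 6, start = 3 */
    now = 6;
    start = 3;
    if (stamp != now) {
        ticks += now - (start < stamp ? stamp : start);  /* adds 3 */
        stamp = now;
    }

    /* io1 completes next: now = 6, start = 0 -> stamp == now, nothing added */
    now = 6;
    start = 0;
    if (stamp != now) {
        ticks += now - (start < stamp ? stamp : start);
        stamp = now;
    }

    printf("io_ticks = %lu (true busy time is 6)\n", ticks);
    return 0;
}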

> It was also possible for io_ticks to be over-counted. Consider a
> workload that issues I/Os deterministically at intervals of 8ms (125Hz).
> If each I/O takes 1ms, then the true utilization is 12.5%. The previous
> implementation will increment io_ticks once for each jiffy in which an
> I/O ends. Since the workload issues an I/O reliably for each jiffy, the
> reported utilization will be 100%. This commit changes the approach such
> that only I/Os which cross a boundary between jiffies are counted. With
> this change, the given workload would count an I/O tick on every eighth
> jiffy, resulting in a (correct) calculated utilization of 12.5%.
>
> Signed-off-by: Josh Snyder <[email protected]>
> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
> ---
> block/blk-core.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d1b79dfe9540..a0bbd9e099b9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
> }
> EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
>
> -static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
> {
> unsigned long stamp;
> + unsigned long elapsed;
> again:
> stamp = READ_ONCE(part->stamp);
> if (unlikely(stamp != now)) {
> - if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
> - __part_stat_add(part, io_ticks, end ? now - stamp : 1);
> + if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> + // stamp denotes the last IO to finish
> + // If this IO started before stamp, then there was overlap between this IO
> + // and that one. We increment only by the non-overlap time.
> + // If not, there was no overlap and we increment by our own time,
> + // disregarding stamp.
> + elapsed = now - (start < stamp ? stamp : start);
> + __part_stat_add(part, io_ticks, elapsed);
> + }
> }
> if (part->partno) {
> part = &part_to_disk(part)->part0;
> @@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
> part_stat_lock();
> part = req->part;
>
> - update_io_ticks(part, jiffies, true);
> + update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));
> part_stat_inc(part, ios[sgrp]);
> part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
> part_stat_unlock();
> @@ -1456,7 +1464,6 @@ void blk_account_io_start(struct request *rq)
> rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>
> part_stat_lock();
> - update_io_ticks(rq->part, jiffies, false);
> part_stat_unlock();
> }
>
> @@ -1468,7 +1475,6 @@ unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
> unsigned long now = READ_ONCE(jiffies);
>
> part_stat_lock();
> - update_io_ticks(part, now, false);
> part_stat_inc(part, ios[sgrp]);
> part_stat_add(part, sectors[sgrp], sectors);
> part_stat_local_inc(part, in_flight[op_is_write(op)]);
> @@ -1487,7 +1493,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
> unsigned long duration = now - start_time;
>
> part_stat_lock();
> - update_io_ticks(part, now, true);
> + update_io_ticks(part, now, start_time);
> part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
> part_stat_local_dec(part, in_flight[op_is_write(op)]);
> part_stat_unlock();
>

2020-06-10 07:30:33

by Josh Snyder

Subject: Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks

Hello!

On 6/9/20 6:41 PM, Hou Tao wrote:
> Hi,
>
> For the following case, the under-counting is still possible if io2 wins cmpxchg():
>
> t        0123456
> io1      |-----|
> io2         |--|
> stamp    0     6
> io_ticks 0     3

I hadn't noticed that bug. It looks like it can produce an unbounded amount of undercounting.

> However considering patch 2 tries to improve sampling rate to 1 us, the problem will gone.

Now that you mention it, the below case is also poorly handled, and will be incorrect
regardless of sampling frequency. It experiences issues both under this patch (labeled
io_ticks) and the current implementation (labeled io_ticks~):

t 0123456
io1 |-----|
io2 |-|
stamp 0 56
io_ticks 28

stamp~ 0 3 56
io_ticks~ 1 34

I am beginning to doubt whether it is even possible to produce an algorithm that is
simultaneously unbiased and synchronization-lite. At the same time, Ming's comment on
patch 2 was leading me to wonder about the value of being synchronization-lite in the
first place. At the proposed sampling rate of 1M/s, it is unlikely that we'd ever exercise
the synchronization-free code path (which, as your case shows, is incorrect anyway). And for every
block device that I'm aware of (even the ones that return in 10us), the cost of a disk
access still completely dominates the cost of a locked CPU operation by three orders of
magnitude.
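
For comparison, a fully synchronized scheme is not complicated. The sketch
below (userspace, hypothetical names, not a proposed patch) takes a lock on
every start and completion, tracks the in-flight count, and accumulates busy
time exactly:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct io_acct {
    pthread_mutex_t lock;
    uint64_t in_flight;
    uint64_t busy_ns;     /* exact busy time, in nanoseconds */
    uint64_t busy_since;  /* valid while in_flight > 0 */
};

static void io_start(struct io_acct *a, uint64_t now_ns)
{
    pthread_mutex_lock(&a->lock);
    if (a->in_flight++ == 0)
        a->busy_since = now_ns;
    pthread_mutex_unlock(&a->lock);
}

static void io_done(struct io_acct *a, uint64_t now_ns)
{
    pthread_mutex_lock(&a->lock);
    if (--a->in_flight == 0)
        a->busy_ns += now_ns - a->busy_since;
    pthread_mutex_unlock(&a->lock);
}

int main(void)
{
    struct io_acct a = { .in_flight = 0, .busy_ns = 0, .busy_since = 0 };

    pthread_mutex_init(&a.lock, NULL);

    /* replay the io1/io2 example from the first patch (times as "ns") */
    io_start(&a, 0);  /* io1 */
    io_start(&a, 4);  /* io2 */
    io_done(&a, 5);   /* io1 */
    io_done(&a, 8);   /* io2 */
    printf("busy time = %llu\n", (unsigned long long)a.busy_ns);
    return 0;
}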

Josh