On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> Hi,
>
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
>
> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>
> The tools output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
>
> So, the average read time is less than one ms (SSD, and about 50% cached
> workload). But once another backend does the fsync(), read latency
> skyrockets.
>
> A concurrent iostat shows the problem pretty clearly:
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 1.00 0.00 6322.00 337.00 51.73 4.38 17.26 2.09 0.32 0.19 2.59 0.14 90.00
> sda 0.00 0.00 6016.00 303.00 47.18 3.95 16.57 2.30 0.36 0.23 3.12 0.15 94.40
> sda 0.00 0.00 6236.00 1059.00 49.52 12.88 17.52 5.91 0.64 0.20 3.23 0.12 88.40
> sda 0.00 0.00 105.00 26173.00 0.89 311.39 24.34 142.37 5.42 27.73 5.33 0.04 100.00
> sda 0.00 0.00 78.00 27199.00 0.87 324.06 24.40 142.30 5.25 11.08 5.23 0.04 100.00
> sda 0.00 0.00 10.00 33488.00 0.11 399.05 24.40 136.41 4.07 100.40 4.04 0.03 100.00
> sda 0.00 0.00 3819.00 10096.00 31.14 120.47 22.31 42.80 3.10 0.32 4.15 0.07 96.00
> sda 0.00 0.00 6482.00 346.00 52.98 4.53 17.25 1.93 0.28 0.20 1.80 0.14 93.20
>
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the disk
> is increased by two orders of magnitude. Unsurprisingly with disastrous
> consequences for read() performance. We really want a way to pace the
> writes issued to the disk more regularly.
Hi Andreas,
I've finally dug myself out from under the backlog from LSFMM far
enough to start testing this on my local IO performance test rig.
tl;dr: I can't reproduce this peaky behaviour on my test rig.
I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
an unmodified benchmark on a current 3.15-linus tree. All storage
(guest and host) is XFS based, guest VMs use virtio and direct IO to
the backing storage. The host is using noop IO scheduling.
The first IO setup I ran was a 100TB XFS filesystem in the guest.
The backing file is a sparse file on an XFS filesystem on a pair of
240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are
exported as JBOD from a RAID controller which has 1GB of FBWC. The
guest is capable of sustaining around 65,000 random read IOPS and
40,000 write IOPS through this filesystem depending on the tests
being run.
The iostat output looks like this:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 0.00 1817.00 315.40 18.80 6.93 24.71 0.80 0.38 0.38 0.37 0.31 66.24
vdc 0.00 0.00 2094.20 323.20 21.82 7.10 24.50 0.81 0.33 0.34 0.27 0.28 68.48
vdc 0.00 0.00 1989.00 4500.20 20.50 56.64 24.34 24.82 3.82 0.43 5.32 0.12 80.16
vdc 0.00 0.00 2019.80 320.80 20.83 7.05 24.39 0.83 0.35 0.36 0.32 0.29 69.04
vdc 0.00 0.00 2206.60 323.20 22.57 7.10 24.02 0.87 0.34 0.34 0.33 0.28 71.92
vdc 0.00 0.00 2437.20 329.60 25.79 7.24 24.45 0.83 0.30 0.30 0.27 0.26 71.76
vdc 0.00 0.00 1224.40 11263.80 12.88 136.38 24.48 64.90 5.20 0.69 5.69 0.07 84.96
vdc 0.00 0.00 2074.60 319.40 21.03 7.01 23.99 0.84 0.35 0.36 0.30 0.29 68.96
vdc 0.00 0.00 1999.00 881.60 20.81 13.61 24.47 4.23 1.43 0.35 3.88 0.23 67.36
vdc 0.00 0.00 1193.40 2273.00 12.42 29.68 24.87 11.70 3.40 0.53 4.91 0.18 60.96
vdc 0.00 0.00 1724.00 314.40 17.80 6.91 24.83 0.74 0.36 0.36 0.35 0.30 61.52
vdc 0.00 0.00 2605.20 325.60 24.67 7.15 22.24 0.90 0.31 0.31 0.25 0.25 72.72
vdc 0.00 0.00 2340.60 324.40 23.85 7.12 23.80 0.83 0.31 0.31 0.29 0.26 68.96
vdc 0.00 0.00 2749.60 329.40 28.51 7.24 23.78 0.90 0.29 0.30 0.25 0.24 75.04
vdc 0.00 0.00 2619.60 324.60 27.72 7.13 24.24 0.88 0.30 0.30 0.28 0.24 71.76
vdc 0.00 0.00 1608.60 4532.40 17.17 56.40 24.54 25.78 4.20 0.50 5.51 0.12 75.36
vdc 0.00 0.00 2361.00 326.40 23.62 7.17 23.47 0.87 0.33 0.33 0.30 0.26 69.92
vdc 0.00 0.00 2460.00 326.00 25.89 7.16 24.30 0.88 0.32 0.32 0.26 0.26 72.72
vdc 0.00 0.00 2519.00 325.40 25.96 7.14 23.83 0.90 0.32 0.32 0.30 0.26 74.32
vdc 0.00 0.00 2709.80 326.20 28.91 7.17 24.34 0.94 0.31 0.30 0.36 0.25 75.52
vdc 0.00 0.00 2676.65 329.74 28.82 7.24 24.56 0.86 0.28 0.29 0.25 0.24 71.22
vdc 0.00 0.00 1788.40 4506.80 18.77 56.66 24.54 23.22 3.69 0.45 4.97 0.12 74.88
vdc 0.00 0.00 1850.40 319.60 19.76 7.02 25.28 0.80 0.37 0.37 0.34 0.30 64.80
Its obvious where the fsyncs are hitting, but they are making
almost no impact on the read performance. The benchmark is
simple not generating enough dirty data to overload the IO
subsystem, and hence there's no latency spikes to speak of.
Benchmark output across fsyncs also demonstrates that:
....
read[12494]: avg: 0.4 msec; max: 9.8 msec
read[12499]: avg: 0.4 msec; max: 9.8 msec
read[12495]: avg: 0.4 msec; max: 9.8 msec
commit[12491]: avg: 0.0 msec; max: 6.3 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
starting fsync() of files
finished fsync() of files
wal[12491]: avg: 0.0 msec; max: 0.7 msec
commit[12491]: avg: 0.1 msec; max: 15.8 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
read[12492]: avg: 0.6 msec; max: 10.0 msec
read[12496]: avg: 0.6 msec; max: 10.0 msec
read[12507]: avg: 0.6 msec; max: 10.0 msec
read[12505]: avg: 0.6 msec; max: 10.0 msec
....
So, I though switching to spinning disks might show the problem.
Same VM, this time using a 17TB linearly preallocated image file on
an 18TB XFS filesystem on the host (virtio in the guest again) on a
12 disk RAID-0 (DM again) using a 12x2TB SAS drives exported as JBOD
from a RAID controller with 512MB of BBWC. This is capable of about
2,000 random read IOPs, and write IOPS is dependent on the BBWC
flushing behaviour (peaks at about 15,000, sustains 1500).
Again, I don't see any bad behaviour:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdd 0.00 0.00 128.00 154.20 1.46 6.02 54.24 1.02 3.62 7.22 0.62 3.17 89.44
vdd 0.00 0.00 133.80 154.40 1.43 6.03 53.04 1.08 3.76 7.42 0.59 3.05 87.84
vdd 0.00 0.00 132.40 751.60 1.36 12.75 32.70 7.26 8.21 7.90 8.27 1.06 93.76
vdd 0.00 0.00 138.60 288.60 1.47 7.62 43.57 2.69 6.30 7.24 5.85 2.15 91.68
vdd 0.00 0.00 141.20 157.00 1.51 6.12 52.36 1.12 3.75 7.26 0.59 3.11 92.64
vdd 0.00 0.00 136.80 154.00 1.52 6.02 53.09 1.04 3.57 6.92 0.60 3.10 90.08
vdd 0.00 0.00 135.00 154.20 1.46 6.02 52.99 1.04 3.60 7.00 0.62 3.03 87.68
vdd 0.00 0.00 147.20 155.80 1.50 6.09 51.26 1.19 3.92 7.45 0.59 3.06 92.72
vdd 0.00 0.00 139.60 154.00 1.50 6.02 52.43 1.03 3.53 6.79 0.58 2.97 87.28
vdd 0.00 0.00 126.80 913.80 1.27 14.81 31.63 7.87 7.56 9.02 7.36 0.89 92.16
vdd 0.00 0.00 148.20 156.80 1.51 6.11 51.15 1.17 3.83 7.25 0.59 3.03 92.32
vdd 0.00 0.00 142.60 155.80 1.49 6.09 52.03 1.12 3.76 7.23 0.60 3.05 90.88
vdd 0.00 0.00 146.00 156.60 1.49 6.12 51.48 1.16 3.84 7.32 0.60 3.03 91.76
vdd 0.00 0.00 148.80 157.20 1.56 6.14 51.54 1.10 3.58 6.78 0.54 2.92 89.36
vdd 0.00 0.00 149.80 156.80 1.52 6.12 51.05 1.14 3.73 7.10 0.52 2.96 90.88
vdd 0.00 0.00 127.20 910.60 1.29 14.82 31.80 8.17 7.87 8.62 7.77 0.88 91.44
vdd 0.00 0.00 150.80 157.80 1.58 6.15 51.27 1.09 3.53 6.67 0.52 2.91 89.92
vdd 0.00 0.00 153.00 156.60 1.53 6.12 50.56 1.17 3.80 7.07 0.60 2.95 91.20
vdd 0.00 0.00 161.20 157.00 1.54 6.13 49.36 1.27 4.00 7.36 0.55 2.97 94.48
vdd 0.00 0.00 152.10 157.09 1.55 6.14 50.91 1.15 3.73 7.06 0.50 2.98 92.22
vdd 0.00 0.00 154.80 157.80 1.50 6.16 50.18 1.19 3.82 7.14 0.57 2.88 90.00
vdd 0.00 0.00 122.20 922.80 1.24 14.93 31.70 8.30 7.94 8.81 7.83 0.90 93.68
vdd 0.00 0.00 139.80 153.60 1.46 5.98 51.98 1.09 3.74 7.14 0.65 3.02 88.48
vdd 0.00 0.00 134.60 153.60 1.44 6.00 52.87 1.09 3.78 7.38 0.63 3.15 90.88
vdd 0.00 0.00 141.00 153.40 1.55 5.99 52.49 1.05 3.57 6.79 0.60 3.04 89.44
vdd 0.00 0.00 130.20 154.20 1.44 6.02 53.76 1.04 3.65 7.28 0.59 3.14 89.44
vdd 0.00 0.00 136.60 154.20 1.30 6.02 51.52 1.19 4.07 8.02 0.58 3.03 88.24
vdd 0.00 0.00 119.00 872.20 1.29 14.29 32.20 7.98 8.05 8.59 7.98 0.93 92.64
vdd 0.00 0.00 139.00 154.00 1.58 6.00 53.01 1.06 3.62 6.95 0.61 3.12 91.52
vdd 0.00 0.00 146.00 155.40 1.52 6.07 51.60 1.12 3.71 7.05 0.57 3.01 90.72
vdd 0.00 0.00 145.20 156.60 1.55 6.12 52.01 1.13 3.73 7.10 0.60 3.07 92.80
vdd 0.00 0.00 147.80 156.80 1.51 6.12 51.36 1.13 3.72 7.07 0.56 2.97 90.56
vdd 0.00 0.00 142.80 157.00 1.48 6.12 51.98 1.11 3.70 7.20 0.52 3.02 90.56
vdd 0.00 0.00 125.20 922.00 1.36 14.92 31.85 8.34 7.96 8.47 7.90 0.88 92.56
Again, you can see exactly where the fsyncs are hitting, and again
they are not massive spikes of write IO. And read latencies are showing:
read[12711]: avg: 6.1 msec; max: 182.2 msec
...
read[12711]: avg: 6.0 msec; max: 198.8 msec
....
read[12711]: avg: 5.9 msec; max: 158.8 msec
....
read[12711]: avg: 6.1 msec; max: 127.5 msec
....
read[12711]: avg: 5.9 msec; max: 355.6 msec
....
read[12711]: avg: 6.0 msec; max: 262.8 msec
....
No significant outliers. The same for commits and wal:
wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 1.0 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
starting fsync() of files
finished fsync() of files
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.1 msec; max: 129.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 0.9 msec
wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 23.1 msec
I'm not sure how you were generating the behaviour you reported, but
the test program as it stands does not appear to be causing any
problems at all on the sort of storage I'd expect large databases to
be hosted on....
I've tried a few tweaks to the test program, but I haven't been able
to make it misbehave. What do I need to tweak in the test program or
my test VM to make the kernel misbehave like you reported?
Cheers,
Dave.
--
Dave Chinner
[email protected]`
Hi Dave,
On 2014-04-09 19:20:09 +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> >
> > A concurrent iostat shows the problem pretty clearly:
> > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sda 1.00 0.00 6322.00 337.00 51.73 4.38 17.26 2.09 0.32 0.19 2.59 0.14 90.00
> > sda 0.00 0.00 6016.00 303.00 47.18 3.95 16.57 2.30 0.36 0.23 3.12 0.15 94.40
> > sda 0.00 0.00 6236.00 1059.00 49.52 12.88 17.52 5.91 0.64 0.20 3.23 0.12 88.40
> > sda 0.00 0.00 105.00 26173.00 0.89 311.39 24.34 142.37 5.42 27.73 5.33 0.04 100.00
> > sda 0.00 0.00 78.00 27199.00 0.87 324.06 24.40 142.30 5.25 11.08 5.23 0.04 100.00
> > sda 0.00 0.00 10.00 33488.00 0.11 399.05 24.40 136.41 4.07 100.40 4.04 0.03 100.00
> > sda 0.00 0.00 3819.00 10096.00 31.14 120.47 22.31 42.80 3.10 0.32 4.15 0.07 96.00
> > sda 0.00 0.00 6482.00 346.00 52.98 4.53 17.25 1.93 0.28 0.20 1.80 0.14 93.20
> >
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the disk
> > is increased by two orders of magnitude. Unsurprisingly with disastrous
> > consequences for read() performance. We really want a way to pace the
> > writes issued to the disk more regularly.
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage. The host is using noop IO scheduling.
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC. The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
I think the 1GB FBWC explains the behaviour - IIRC the test as written
flushes about 400-500MB during fsync(). If the writebach cache can just
take that and continue as if nothing happened you'll see no problem.
> I'm not sure how you were generating the behaviour you reported, but
> the test program as it stands does not appear to be causing any
> problems at all on the sort of storage I'd expect large databases to
> be hosted on....
Since I had developed it while at LSF/MM I had little choice but to run
it only on my laptop. You might remember the speed of the conference
network ;)
> I've tried a few tweaks to the test program, but I haven't been able
> to make it misbehave. What do I need to tweak in the test program or
> my test VM to make the kernel misbehave like you reported?
I think there's two tweaks that would be worthwile to try to reproduce
the problem there:
* replace: "++writes % 100000" by something like "++writes %
500000". That should create more than 1GB of dirty memory to be
flushed out at the later fsync() which should then hit with your
amount of WC cache.
* replace the: "nsleep(200000);" by something smaller. I guess 70000 or
so might also trigger the problem alone.
Unfortunately right now I don't have any free rig with decent storage
available...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
ping?
On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > Hi,
> >
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > top#1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together, that seems to show the
> > problem nicely.
> >
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> >
> > The tools output looks like this:
> > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> > ...
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > commit[12155]: avg: 0.2 msec; max: 15.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > read[12157]: avg: 0.2 msec; max: 9.4 msec
> > ...
> > read[12165]: avg: 0.2 msec; max: 9.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > starting fsync() of files
> > finished fsync() of files
> > read[12162]: avg: 0.6 msec; max: 2765.5 msec
> >
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> >
> > A concurrent iostat shows the problem pretty clearly:
> > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sda 1.00 0.00 6322.00 337.00 51.73 4.38 17.26 2.09 0.32 0.19 2.59 0.14 90.00
> > sda 0.00 0.00 6016.00 303.00 47.18 3.95 16.57 2.30 0.36 0.23 3.12 0.15 94.40
> > sda 0.00 0.00 6236.00 1059.00 49.52 12.88 17.52 5.91 0.64 0.20 3.23 0.12 88.40
> > sda 0.00 0.00 105.00 26173.00 0.89 311.39 24.34 142.37 5.42 27.73 5.33 0.04 100.00
> > sda 0.00 0.00 78.00 27199.00 0.87 324.06 24.40 142.30 5.25 11.08 5.23 0.04 100.00
> > sda 0.00 0.00 10.00 33488.00 0.11 399.05 24.40 136.41 4.07 100.40 4.04 0.03 100.00
> > sda 0.00 0.00 3819.00 10096.00 31.14 120.47 22.31 42.80 3.10 0.32 4.15 0.07 96.00
> > sda 0.00 0.00 6482.00 346.00 52.98 4.53 17.25 1.93 0.28 0.20 1.80 0.14 93.20
> >
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the disk
> > is increased by two orders of magnitude. Unsurprisingly with disastrous
> > consequences for read() performance. We really want a way to pace the
> > writes issued to the disk more regularly.
>
> Hi Andreas,
>
> I've finally dug myself out from under the backlog from LSFMM far
> enough to start testing this on my local IO performance test rig.
>
> tl;dr: I can't reproduce this peaky behaviour on my test rig.
>
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage. The host is using noop IO scheduling.
>
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC. The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
>
> The iostat output looks like this:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> vdc 0.00 0.00 1817.00 315.40 18.80 6.93 24.71 0.80 0.38 0.38 0.37 0.31 66.24
> vdc 0.00 0.00 2094.20 323.20 21.82 7.10 24.50 0.81 0.33 0.34 0.27 0.28 68.48
> vdc 0.00 0.00 1989.00 4500.20 20.50 56.64 24.34 24.82 3.82 0.43 5.32 0.12 80.16
> vdc 0.00 0.00 2019.80 320.80 20.83 7.05 24.39 0.83 0.35 0.36 0.32 0.29 69.04
> vdc 0.00 0.00 2206.60 323.20 22.57 7.10 24.02 0.87 0.34 0.34 0.33 0.28 71.92
> vdc 0.00 0.00 2437.20 329.60 25.79 7.24 24.45 0.83 0.30 0.30 0.27 0.26 71.76
> vdc 0.00 0.00 1224.40 11263.80 12.88 136.38 24.48 64.90 5.20 0.69 5.69 0.07 84.96
> vdc 0.00 0.00 2074.60 319.40 21.03 7.01 23.99 0.84 0.35 0.36 0.30 0.29 68.96
> vdc 0.00 0.00 1999.00 881.60 20.81 13.61 24.47 4.23 1.43 0.35 3.88 0.23 67.36
> vdc 0.00 0.00 1193.40 2273.00 12.42 29.68 24.87 11.70 3.40 0.53 4.91 0.18 60.96
> vdc 0.00 0.00 1724.00 314.40 17.80 6.91 24.83 0.74 0.36 0.36 0.35 0.30 61.52
> vdc 0.00 0.00 2605.20 325.60 24.67 7.15 22.24 0.90 0.31 0.31 0.25 0.25 72.72
> vdc 0.00 0.00 2340.60 324.40 23.85 7.12 23.80 0.83 0.31 0.31 0.29 0.26 68.96
> vdc 0.00 0.00 2749.60 329.40 28.51 7.24 23.78 0.90 0.29 0.30 0.25 0.24 75.04
> vdc 0.00 0.00 2619.60 324.60 27.72 7.13 24.24 0.88 0.30 0.30 0.28 0.24 71.76
> vdc 0.00 0.00 1608.60 4532.40 17.17 56.40 24.54 25.78 4.20 0.50 5.51 0.12 75.36
> vdc 0.00 0.00 2361.00 326.40 23.62 7.17 23.47 0.87 0.33 0.33 0.30 0.26 69.92
> vdc 0.00 0.00 2460.00 326.00 25.89 7.16 24.30 0.88 0.32 0.32 0.26 0.26 72.72
> vdc 0.00 0.00 2519.00 325.40 25.96 7.14 23.83 0.90 0.32 0.32 0.30 0.26 74.32
> vdc 0.00 0.00 2709.80 326.20 28.91 7.17 24.34 0.94 0.31 0.30 0.36 0.25 75.52
> vdc 0.00 0.00 2676.65 329.74 28.82 7.24 24.56 0.86 0.28 0.29 0.25 0.24 71.22
> vdc 0.00 0.00 1788.40 4506.80 18.77 56.66 24.54 23.22 3.69 0.45 4.97 0.12 74.88
> vdc 0.00 0.00 1850.40 319.60 19.76 7.02 25.28 0.80 0.37 0.37 0.34 0.30 64.80
>
> Its obvious where the fsyncs are hitting, but they are making
> almost no impact on the read performance. The benchmark is
> simple not generating enough dirty data to overload the IO
> subsystem, and hence there's no latency spikes to speak of.
>
> Benchmark output across fsyncs also demonstrates that:
>
> ....
> read[12494]: avg: 0.4 msec; max: 9.8 msec
> read[12499]: avg: 0.4 msec; max: 9.8 msec
> read[12495]: avg: 0.4 msec; max: 9.8 msec
> commit[12491]: avg: 0.0 msec; max: 6.3 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> wal[12491]: avg: 0.0 msec; max: 0.7 msec
> commit[12491]: avg: 0.1 msec; max: 15.8 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> read[12492]: avg: 0.6 msec; max: 10.0 msec
> read[12496]: avg: 0.6 msec; max: 10.0 msec
> read[12507]: avg: 0.6 msec; max: 10.0 msec
> read[12505]: avg: 0.6 msec; max: 10.0 msec
> ....
>
> So, I though switching to spinning disks might show the problem.
> Same VM, this time using a 17TB linearly preallocated image file on
> an 18TB XFS filesystem on the host (virtio in the guest again) on a
> 12 disk RAID-0 (DM again) using a 12x2TB SAS drives exported as JBOD
> from a RAID controller with 512MB of BBWC. This is capable of about
> 2,000 random read IOPs, and write IOPS is dependent on the BBWC
> flushing behaviour (peaks at about 15,000, sustains 1500).
>
> Again, I don't see any bad behaviour:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> vdd 0.00 0.00 128.00 154.20 1.46 6.02 54.24 1.02 3.62 7.22 0.62 3.17 89.44
> vdd 0.00 0.00 133.80 154.40 1.43 6.03 53.04 1.08 3.76 7.42 0.59 3.05 87.84
> vdd 0.00 0.00 132.40 751.60 1.36 12.75 32.70 7.26 8.21 7.90 8.27 1.06 93.76
> vdd 0.00 0.00 138.60 288.60 1.47 7.62 43.57 2.69 6.30 7.24 5.85 2.15 91.68
> vdd 0.00 0.00 141.20 157.00 1.51 6.12 52.36 1.12 3.75 7.26 0.59 3.11 92.64
> vdd 0.00 0.00 136.80 154.00 1.52 6.02 53.09 1.04 3.57 6.92 0.60 3.10 90.08
> vdd 0.00 0.00 135.00 154.20 1.46 6.02 52.99 1.04 3.60 7.00 0.62 3.03 87.68
> vdd 0.00 0.00 147.20 155.80 1.50 6.09 51.26 1.19 3.92 7.45 0.59 3.06 92.72
> vdd 0.00 0.00 139.60 154.00 1.50 6.02 52.43 1.03 3.53 6.79 0.58 2.97 87.28
> vdd 0.00 0.00 126.80 913.80 1.27 14.81 31.63 7.87 7.56 9.02 7.36 0.89 92.16
> vdd 0.00 0.00 148.20 156.80 1.51 6.11 51.15 1.17 3.83 7.25 0.59 3.03 92.32
> vdd 0.00 0.00 142.60 155.80 1.49 6.09 52.03 1.12 3.76 7.23 0.60 3.05 90.88
> vdd 0.00 0.00 146.00 156.60 1.49 6.12 51.48 1.16 3.84 7.32 0.60 3.03 91.76
> vdd 0.00 0.00 148.80 157.20 1.56 6.14 51.54 1.10 3.58 6.78 0.54 2.92 89.36
> vdd 0.00 0.00 149.80 156.80 1.52 6.12 51.05 1.14 3.73 7.10 0.52 2.96 90.88
> vdd 0.00 0.00 127.20 910.60 1.29 14.82 31.80 8.17 7.87 8.62 7.77 0.88 91.44
> vdd 0.00 0.00 150.80 157.80 1.58 6.15 51.27 1.09 3.53 6.67 0.52 2.91 89.92
> vdd 0.00 0.00 153.00 156.60 1.53 6.12 50.56 1.17 3.80 7.07 0.60 2.95 91.20
> vdd 0.00 0.00 161.20 157.00 1.54 6.13 49.36 1.27 4.00 7.36 0.55 2.97 94.48
> vdd 0.00 0.00 152.10 157.09 1.55 6.14 50.91 1.15 3.73 7.06 0.50 2.98 92.22
> vdd 0.00 0.00 154.80 157.80 1.50 6.16 50.18 1.19 3.82 7.14 0.57 2.88 90.00
> vdd 0.00 0.00 122.20 922.80 1.24 14.93 31.70 8.30 7.94 8.81 7.83 0.90 93.68
> vdd 0.00 0.00 139.80 153.60 1.46 5.98 51.98 1.09 3.74 7.14 0.65 3.02 88.48
> vdd 0.00 0.00 134.60 153.60 1.44 6.00 52.87 1.09 3.78 7.38 0.63 3.15 90.88
> vdd 0.00 0.00 141.00 153.40 1.55 5.99 52.49 1.05 3.57 6.79 0.60 3.04 89.44
> vdd 0.00 0.00 130.20 154.20 1.44 6.02 53.76 1.04 3.65 7.28 0.59 3.14 89.44
> vdd 0.00 0.00 136.60 154.20 1.30 6.02 51.52 1.19 4.07 8.02 0.58 3.03 88.24
> vdd 0.00 0.00 119.00 872.20 1.29 14.29 32.20 7.98 8.05 8.59 7.98 0.93 92.64
> vdd 0.00 0.00 139.00 154.00 1.58 6.00 53.01 1.06 3.62 6.95 0.61 3.12 91.52
> vdd 0.00 0.00 146.00 155.40 1.52 6.07 51.60 1.12 3.71 7.05 0.57 3.01 90.72
> vdd 0.00 0.00 145.20 156.60 1.55 6.12 52.01 1.13 3.73 7.10 0.60 3.07 92.80
> vdd 0.00 0.00 147.80 156.80 1.51 6.12 51.36 1.13 3.72 7.07 0.56 2.97 90.56
> vdd 0.00 0.00 142.80 157.00 1.48 6.12 51.98 1.11 3.70 7.20 0.52 3.02 90.56
> vdd 0.00 0.00 125.20 922.00 1.36 14.92 31.85 8.34 7.96 8.47 7.90 0.88 92.56
>
> Again, you can see exactly where the fsyncs are hitting, and again
> they are not massive spikes of write IO. And read latencies are showing:
>
> read[12711]: avg: 6.1 msec; max: 182.2 msec
> ...
> read[12711]: avg: 6.0 msec; max: 198.8 msec
> ....
> read[12711]: avg: 5.9 msec; max: 158.8 msec
> ....
> read[12711]: avg: 6.1 msec; max: 127.5 msec
> ....
> read[12711]: avg: 5.9 msec; max: 355.6 msec
> ....
> read[12711]: avg: 6.0 msec; max: 262.8 msec
> ....
>
> No significant outliers. The same for commits and wal:
>
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 1.0 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> starting fsync() of files
> finished fsync() of files
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.1 msec; max: 129.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 0.9 msec
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 23.1 msec
>
> I'm not sure how you were generating the behaviour you reported, but
> the test program as it stands does not appear to be causing any
> problems at all on the sort of storage I'd expect large databases to
> be hosted on....
>
> I've tried a few tweaks to the test program, but I haven't been able
> to make it misbehave. What do I need to tweak in the test program or
> my test VM to make the kernel misbehave like you reported?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]`
> _______________________________________________
> Lsf mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/lsf
>
--
Dave Chinner
[email protected]
Hi Dave,
On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> ping?
I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2
As an additional note:
> On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > I'm not sure how you were generating the behaviour you reported, but
> > the test program as it stands does not appear to be causing any
> > problems at all on the sort of storage I'd expect large databases to
> > be hosted on....
A really really large number of database aren't stored on big enterprise
rigs...
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services