Anyone know why my Xen xvda devices would be doing (apparently)
unaligned writes to my SAN, causing horrible performance, massive
seeking, and lots of reading for page-cache backfill? BUT writing to
the device in the dom0 is very fast and causes no extra reads?
I am running the 2.6.18-164.11.1.el5xen Xen kernel which came with
CentOS 5.4.
After spending a lot of time banging my head on this I seem to have
finally tracked it down to a difference between domU and dom0. I
never would have thought it would be this, but it is extremely
reproducible. We're talking a difference of 4-5x in write speed.
Reads are equally fast everywhere.
I am using the AoE v72 kernel module (initiator) on Dell R610s to talk
to vblade-19 (target) on Dell R710s, all running CentOS 5.4. I have
striped two 7200 RPM SATA disks and exported the md with AoE (although
I have done these tests with individual disks also). Read performance
is excellent:
# dd of=/dev/null if=/dev/xvdg1 bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 106.749 seconds, 115 MB/s
I dropped the cache with:
echo 1 > /proc/sys/vm/drop_caches
on both target and initiator before starting the test. This is great
for just a single gig-e link. This suggests that the network is fine.
However, write performance is odious: typically around 20MB/s. It
should be more like 70MB/s per disk or better (7200rpm SATA) and max
out my gig-e with write performance similar to the above read
performance. I say these are unaligned writes because, when running
iostat on the target machine, I can see lots of reads happening, which
are surely causing seeks and killing performance. Typically I see
something like 8MB/s of reads while doing 16MB/s of writes.
HOWEVER, if I do the writes from the dom0 the performance is
excellent:
# dd if=/dev/zero of=/dev/etherd/e6.2 bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 104.679 seconds, 117 MB/s
And I see no reads happening on the disks being written to in
iostat. Purely streaming writes at high speeds.
I have had AoE working very well with Xen previously, although not with
this particular hardware/Xen/AoE version. Also, it occurs to me that in
the past when I have done this I network booted the domUs and they
got root over AoE using a complicated initrd that I cooked up. In the
last year or so I decided that it was too complicated and went to
booting my dom0s from compact flash with the AoE driver in the dom0
instead of the domU. I now hand the domU xvds out from the AoE driver
in dom0. I strongly suspect that this is why things worked great
before but stink now. Unfortunately I don't have a working network
boot initrd setup like I used to, and although I still have all of the
code etc. it would take a while to set up. I don't want to run that
setup in production anymore anyway if I can help it.
I have tried manually aligning the disk by setting the beginning of
data on the partition from 63 to 64 (although this is usually done for
RAID alignment), and I have tried changing the disk geometry to account
for the extra partition table, which causes a half-block page-cache
misalignment as described by the ever-insightful Kelsey Hudson in his
writeup on the issue here:
http://copilotco.com/Virtualization/wiki/aoe-caching-alignment.pdf/at_download/file
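(For anyone who wants to reproduce the layout: with the old sfdisk a
single partition starting at sector 64 can be created with something
roughly like the following; the exact invocation I used may have
differed.)
echo '64,,83' | /sbin/sfdisk -uS /dev/xvdg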
All to no avail. What am I missing here? Why is domU apparently
fudging my writes?
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 11:49:55AM +0300, Pasi Kärkkäinen wrote:
> Please paste your domU partition table:
> sfdisk -d /dev/xvda
>
> Are you using filesystems on normal partitions, or LVM in the domU?
> I'm pretty sure this is a domU partitioning problem.
>
Also, it's easy to verify: add another disk (xvdb) to the domU
and use dd to write directly to the non-partitioned disk!
dd if=/dev/zero of=/dev/xvdb bs=something count=whatever
This shouldn't cause any unaligned writes.
Also make sure you try different block sizes. 4k might be OK for testing max IOPS,
but 64k or even 1024k is better for measuring max throughput.
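For example, something like this (the counts are just ballpark figures
for roughly the same 12 GB as your earlier tests):
dd if=/dev/zero of=/dev/xvdb bs=1024k count=12000
dd if=/dev/zero of=/dev/xvdb bs=64k count=190000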
-- Pasi
On Tue, Apr 20, 2010 at 01:09:58AM -0700, Tracy Reed wrote:
Please paste your domU partition table:
sfdisk -d /dev/xvda
Are you using filesystems on normal partitions, or LVM in the domU?
I'm pretty sure this is a domU partitioning problem.
-- Pasi
On Tue, Apr 20, 2010 at 11:49:55AM +0300, Pasi Kärkkäinen spake thusly:
> Please paste your domU partition table:
> sfdisk -d /dev/xvda
I have tried many different things including dd straight to the raw
unpartitioned device. That should not be affected by
partitioning/lvm/filesystem problems right?
> Are you using filesystems on normal partitions, or LVM in the domU?
> I'm pretty sure this is a domU partitioning problem.
I have done all of the above. Here I have an xvdg device in my domU
which I am doing a dd directly to, no partitioning or anything:
# dd if=/dev/zero of=/dev/xvdg bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 449.109 seconds, 27.4 MB/s
# /sbin/sfdisk -d /dev/xvdg
sfdisk: ERROR: sector 0 does not have an msdos signature
/dev/xvdg: unrecognized partition table type
No partitions found
and running iostat on the target shows the following:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 3474.60 1070.60 46.40 4311.20 13680.00 32.21 2.08 1.83 0.49 54.32
sdb 0.00 3376.00 1060.20 45.60 4289.60 13686.40 32.51 2.46 2.23 0.53 58.12
Or I can partition it with a geometry of 248 heads and 56 sectors,
which is a multiple of 8 and should avoid the misalignment due to
the extra partition table. (There is a partition on the physical disk
on the target already; then I create a logical volume to export to the
initiator, which then puts its own partition in it, which causes the
misalignment.)
dd if=/dev/zero of=/dev/xvdg1 bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 445.338 seconds, 27.6 MB/s
# /sbin/sfdisk -d /dev/xvdg
# partition table of /dev/xvdg
unit: sectors
/dev/xvdg1 : start= 56, size=566227592, Id=8e
/dev/xvdg2 : start= 0, size= 0, Id= 0
/dev/xvdg3 : start= 0, size= 0, Id= 0
/dev/xvdg4 : start= 0, size= 0, Id= 0
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 3472.20 1188.20 51.00 4805.60 14097.60 30.51 2.71 2.13 0.52 64.02
sdb 0.00 3472.40 1187.00 52.00 4784.00 14092.80 30.47 2.82 2.22 0.56 68.80
Or I can take a standard partition geometry and set it to start at 64
instead of 63 like so many RAID alignment pages talk about:
It is taking even longer this time and I am tired of waiting for dd
before sending off this email but suffice it to say it is painfully
slow.
# /sbin/sfdisk -d /dev/xvdg
# partition table of /dev/xvdg
unit: sectors
/dev/xvdg1 : start= 64, size=566226926, Id=83
/dev/xvdg2 : start= 0, size= 0, Id= 0
/dev/xvdg3 : start= 0, size= 0, Id= 0
/dev/xvdg4 : start= 0, size= 0, Id= 0
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1832.73 1234.73 30.94 4991.62 7864.27 20.31 1.52 1.23 0.47 59.82
sdb 0.00 1835.13 1219.76 30.54 4916.57 7839.52 20.40 1.27 1.04 0.45 56.67
I would not be at all surprised if you are right about it being a domU
partitioning problem. But every scheme I have tried has failed to work
properly. Appreciate any pointers.
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 12:39:52PM -0700, Tracy Reed wrote:
> On Tue, Apr 20, 2010 at 11:49:55AM +0300, Pasi Kärkkäinen spake thusly:
> > Please paste your domU partition table:
> > sfdisk -d /dev/xvda
>
> I have tried many different things including dd straight to the raw
> unpartitioned device. That should not be affected by
> partitioning/lvm/filesystem problems right?
>
Yeah, partitioning doesn't matter when you use the straight/raw disk device.
> > Are you using filesystems on normal partitions, or LVM in the domU?
> > I'm pretty sure this is a domU partitioning problem.
>
> I have done all of the above. Here I have an xvdg device in my domU
> which I am doing a dd directly to, no partitioning or anything:
>
> # dd if=/dev/zero of=/dev/xvdg bs=4096 count=3000000
> 3000000+0 records in
> 3000000+0 records out
> 12288000000 bytes (12 GB) copied, 449.109 seconds, 27.4 MB/s
>
Please try with "bs=1024k" and maybe with "bs=64k" as well.
A 4k blocksize transfer will always be slower in domU than in dom0,
since the virtual disk abstraction adds some overhead, which is more
visible with small block sizes.
> # /sbin/sfdisk -d /dev/xvdg
>
> sfdisk: ERROR: sector 0 does not have an msdos signature
> /dev/xvdg: unrecognized partition table type
> No partitions found
>
> and running iostat on the target shows the following:
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 3474.60 1070.60 46.40 4311.20 13680.00 32.21 2.08 1.83 0.49 54.32
> sdb 0.00 3376.00 1060.20 45.60 4289.60 13686.40 32.51 2.46 2.23 0.53 58.12
>
> Or I can partition it with a geometry of 248 heads and 56 sectors,
> which is a multiple of 8 and should avoid the misalignment due to
> the extra partition table. (There is a partition on the physical disk
> on the target already; then I create a logical volume to export to the
> initiator, which then puts its own partition in it, which causes the
> misalignment.)
>
> dd if=/dev/zero of=/dev/xvdg1 bs=4096 count=3000000
> 3000000+0 records in
> 3000000+0 records out
> 12288000000 bytes (12 GB) copied, 445.338 seconds, 27.6 MB/s
>
So the speed to the partitioned disk is the same as to the raw disk?
What disk backend are you using in dom0? phy:? tap:aio:?
> # /sbin/sfdisk -d /dev/xvdg
> # partition table of /dev/xvdg
> unit: sectors
>
> /dev/xvdg1 : start= 56, size=566227592, Id=8e
> /dev/xvdg2 : start= 0, size= 0, Id= 0
> /dev/xvdg3 : start= 0, size= 0, Id= 0
> /dev/xvdg4 : start= 0, size= 0, Id= 0
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 3472.20 1188.20 51.00 4805.60 14097.60 30.51 2.71 2.13 0.52 64.02
> sdb 0.00 3472.40 1187.00 52.00 4784.00 14092.80 30.47 2.82 2.22 0.56 68.80
>
> Or I can take a standard partition geometry and set it to start at 64
> instead of 63 like so many RAID alignment pages talk about:
>
> It is taking even longer this time and I am tired of waiting for dd
> before sending off this email but suffice it to say it is painfully
> slow.
>
You can cancel dd and it'll print the stats so far.
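(If I remember correctly, GNU dd will also report its stats without
stopping if you send it SIGUSR1, something like:)
kill -USR1 $(pidof dd)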
-- Pasi
On Tue, Apr 20, 2010 at 11:49:55AM +0300, Pasi Kärkkäinen spake thusly:
> Are you using filesystems on normal partitions, or LVM in the domU?
> I'm pretty sure this is a domU partitioning problem.
Also: What changes in the view of the partitioning between domU and
dom0? Wouldn't a partitioning error manifest itself in tests in the
dom0 as well as in the domU?
BTW: The dd from the last time in my last email finally finished:
# dd if=/dev/zero of=/dev/xvdg1 bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 734.714 seconds, 16.7 MB/s
If I run that very same dd as above (the last test in my previous
email) with the same partition setup again but this time from the
dom0:
# dd if=/dev/zero of=/dev/etherd/e6.1 bs=4096 count=3000000
3000000+0 records in
3000000+0 records out
12288000000 bytes (12 GB) copied, 107.352 seconds, 114 MB/s
# /sbin/sfdisk -d /dev/etherd/e6.1
# partition table of /dev/etherd/e6.1
unit: sectors
/dev/etherd/e6.1p1 : start= 64, size=566226926, Id=83
/dev/etherd/e6.1p2 : start= 0, size= 0, Id= 0
/dev/etherd/e6.1p3 : start= 0, size= 0, Id= 0
/dev/etherd/e6.1p4 : start= 0, size= 0, Id= 0
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda  0.00  17350.80  0.60  275.60  22.40  72540.00  525.43  97.94  344.01  3.62  100.02
sdb  0.00  17374.80  1.20  256.00  28.00  74848.00  582.24  136.20  527.72  3.89  100.02
72MB/s and 74MB/s per disk in the stripe. Nice. Wish I could get that in a domU!
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 10:54:42PM +0300, Pasi Kärkkäinen spake thusly:
> Please try with "bs=1024k" and maybe with "bs=64k" as well.
>
> A 4k blocksize transfer will always be slower in domU than in dom0,
> since the virtual disk abstraction adds some overhead, which is more
> visible with small block sizes.
But overhead in domU wouldn't be causing all of these reads. I am
doing a test with bs=64k now:
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda  0.00  3258.20  964.00  54.00  3903.20  13652.80  34.49  2.75  2.71  0.68  68.88
sdb  0.00  3270.20  974.80  54.00  3940.80  13710.40  34.31  2.42  2.38  0.55  56.22
> So the speed to the partitioned disk is the same as to the raw disk?
> What disk backend are you using in dom0? phy:? tap:aio:?
Yes.
I am using phy: disk backend. Should I be using something else?
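For reference, the disk line in my domU config is along these lines
(device names approximate, not copied verbatim):
disk = [ 'phy:/dev/etherd/e6.2,xvdg,w' ]
As I understand it, a tap:aio: backend would instead point at a file,
e.g. 'tap:aio:/path/to/image.img,xvdg,w'.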
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 01:00:04PM -0700, Tracy Reed wrote:
> On Tue, Apr 20, 2010 at 11:49:55AM +0300, Pasi Kärkkäinen spake thusly:
> > Are you using filesystems on normal partitions, or LVM in the domU?
> > I'm pretty sure this is a domU partitioning problem.
>
> Also: What changes in the view of the partitioning between domU and
> dom0? Wouldn't a partitioning error manifest itself in tests in the
> dom0 as well as in the domU?
>
> BTW: The dd from the last time in my last email finally finished:
>
> # dd if=/dev/zero of=/dev/xvdg1 bs=4096 count=3000000
> 3000000+0 records in
> 3000000+0 records out
> 12288000000 bytes (12 GB) copied, 734.714 seconds, 16.7 MB/s
>
> If I run that very same dd as above (the last test in my previous
The DomU disk, from the Dom0 perspective, is using 'phy', which means
there is no caching in Dom0 for that disk (but there is in DomU).
Caching should be done in DomU in that case - which begs the question:
how much memory do you have in your DomU? What happens if you
give both Dom0 and DomU the same amount of memory?
> email) with the same partition setup again but this time from the
> dom0:
>
> # dd if=/dev/zero of=/dev/etherd/e6.1 bs=4096 count=3000000
> 3000000+0 records in
> 3000000+0 records out
> 12288000000 bytes (12 GB) copied, 107.352 seconds, 114 MB/s
OK. That is possibly caused by the fact that you are caching the data.
Look at your buffer cache (and drop the cache before this) and see
how it grows.
>
> # /sbin/sfdisk -d /dev/etherd/e6.1
> # partition table of /dev/etherd/e6.1
> unit: sectors
>
> /dev/etherd/e6.1p1 : start= 64, size=566226926, Id=83
How do you know this is a mis-aligned sectors issue? Is this what your
AOE vendor is telling you ?
I was thinking of first eliminating caching from the picture and seeing
the speeds you get when you do direct IO to the spindles. You can do this using
a tool called 'fio' or 'dd' with the oflag=direct. Try doing that from
both Dom0 and DomU and see what the speeds are.
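Off the top of my head (untested, adjust device and size to taste),
the direct-I/O tests would look something like:
dd if=/dev/zero of=/dev/xvdg oflag=direct bs=1024k count=4000
fio --name=directwrite --filename=/dev/xvdg --rw=write --bs=1024k --direct=1 --size=4g --ioengine=libaio --iodepth=16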
On Tuesday, 20 April 2010 at 13:00, Tracy Reed wrote:
You could also be limited by the size of the block request ring (I
believe the ring is normally only one page) -- the ring needs to be
large enough to handle the bandwidth delay product, and AoE means the
delay is probably higher than normal. Do you get better performance
against a local partition?
On Tue, Apr 20, 2010 at 04:25:19PM -0400, Konrad Rzeszutek Wilk spake thusly:
> The DomU disk from the Dom0 perspective is using 'phy' which means
> there is no caching in Dom0 for that disk (but it is in DomU).
That is fine. I don't particularly want caching in dom0.
> Caching should be done in DomU in that case - which begs the question -
> how much memory do you have in your DomU? What happens if you
> give to both Dom0 and DomU the same amount of memory?
4G in domU and 1G in dom0.
> OK. That is possibly caused by the fact that you are caching the data.
> Look at your buffers cache (and drop the cache before this) and see
> how it grows.
I try to use large amounts of data so cache is less of a factor, but I
also drop the cache before each test using:
echo 1 > /proc/sys/vm/drop_caches
I had to start doing this not only to ensure accurate results but also
because the caching of reads was really confusing: I would see a test
start out apparently fine, writing at good speed according to iostat,
and then suddenly start hitting the disk with reads when it ran into
data which it had not already read into cache.
> How do you know this is a mis-aligned sectors issue? Is this what your
> AOE vendor is telling you ?
No AoE vendor involved. I am using the free stuff. I think it is a
misalignment issue because during a purely write test it is doing
massive amounts of reading according to iostat.
Also note that there are several different kinds of misalignment which
can occur:
- Disk sector misalignment
- RAID chunk size misalignment
- Page cache misalignment
Would the first two necessarily show up in iostat? I'm not sure if
disk sector misalignment is dealt with automatically in the hardware
or if the kernel aligns it for us. RAID chunk size misalignment seems
like it would be dealt with in the RAID card if using hardware
RAID. But I am not. So the software RAID implementation might cause
reads to show up in iostat.
The Linux page cache size is 4k, which is why I am using a 4k block
size in my dd tests.
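(One sanity check I can do, assuming I am reading sfdisk right: list
the partition starts in sectors; a start sector divisible by 8 should
mean the partition is 4k-aligned.)
/sbin/sfdisk -uS -l /dev/xvdg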
> I was thinking of first eliminating caching from the picture and seeing
> the speeds you get when you do direct IO to the spindles. You can do this using
> a tool called 'fio' or 'dd' with the oflag=direct. Try doing that from
> both Dom0 and DomU and see what the speeds are.
I have never been quite clear on the purpose of oflag=direct. I have
read in the dd man page that it is supposed to bypass cache. But
whenever I use it, performance is horrible beyond merely just not
caching. I am doing the above dd with oflag=direct now as you
suggested, and I see around 30 seconds of nothing hitting the disks and
then two or three seconds of writing in iostat on the target. I just
ctrl-c'd the dd and it shows:
#dd if=/dev/zero of=/dev/etherd/e6.1 oflag=direct bs=4096
count=3000000
1764883+0 records in
1764883+0 records out
7228960768 bytes (7.2 GB) copied, 402.852 seconds, 17.9 MB/s
But even on my local directly attached SATA workstation disk when
doing that same dd on an otherwise idle machine I see performance
like:
$ dd if=/dev/zero of=foo.test bs=4096 count=4000000
C755202+0 records in
755202+0 records out
3093307392 bytes (3.1 GB) copied, 128.552 s, 24.1 MB/s
which again suggests that oflag=direct isn't doing quite what I expect.
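(My guess is that with oflag=direct every 4k write becomes a
synchronous request, so the test is latency-bound rather than
bandwidth-bound; a fairer direct test would probably use a much larger
block size, something like:)
dd if=/dev/zero of=/dev/etherd/e6.1 oflag=direct bs=1024k count=12000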
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 01:41:51PM -0700, Brendan Cully spake thusly:
> You could also be limited by the size of the block request ring (I
> believe the ring is normally only one page) -- the ring needs to be
> large enough to handle the bandwidth delay product, and AoE means the
> delay is probably higher than normal.
Interesting. Any easy way to increase this as a test?
> Do you get better performance against a local partition?
You mean a local partition on local disk in the dom0 given to a domU
as xvd? Let's see...
I just created a 20G logical volume on the dom0:
# /usr/sbin/lvcreate -n test -L20G sysvg
Added it to the domain config file to be /dev/xvdi and rebooted.
"phy:/dev/sysvg/test,xvdi,w"
I know you can attach block devices on the fly but this has not been
entirely reliable for me in the past so I reboot now.
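(For completeness, the on-the-fly attach would be something like the
following, with the actual domU name substituted for mydomU:)
xm block-attach mydomU phy:/dev/sysvg/test xvdi w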
In domU against xvdi which is /dev/sysvg/test in dom0:
# dd if=/dev/zero of=/dev/xvdi bs=4096 count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 99.3749 seconds, 41.2 MB/s
And iostat on dom0 shows:
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda  0.00  824.00  4.00  101.00  16.00  39936.00  760.99  3.19  31.31  9.52  100.00
In dom0 against the local disk to demonstrate native performance:
# dd if=/dev/zero of=/dev/sysvg/test bs=4096 count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 84.9047 seconds, 48.2 MB/s
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda  7.00  11104.00  5.00  96.00  48.00  48144.00  954.30  133.92  1172.40  9.94  100.40
Virtually no reads happening. This disk seems a bit slow (older 80G
sata disk) but otherwise normal. I don't see anything indicating
alignment issues.
--
Tracy Reed
http://tracyreed.org
On Tue, Apr 20, 2010 at 02:19:13PM -0700, Tracy Reed wrote:
> > How do you know this is a mis-aligned sectors issue? Is this what your
> > AOE vendor is telling you ?
>
> No AoE vendor involved. I am using the free stuff. I think it is a
> misalignment issue because during a purely write test it is doing
> massive amounts of reading according to iostat.
How about actually verifying that by e.g. using wireshark and comparing
the I/O patterns in the fast and slow cases? The differences in the
patterns may give clues where to look further.
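A capture filtered on the AoE ethertype (0x88A2) from the initiator
should be enough to compare the two cases; something along these lines,
with the interface name being just an example:
tcpdump -i eth1 -s 0 -w aoe-slow.pcap ether proto 0x88a2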
> #dd if=/dev/zero of=/dev/etherd/e6.1 oflag=direct bs=4096
> count=3000000
> 1764883+0 records in
> 1764883+0 records out
> 7228960768 bytes (7.2 GB) copied, 402.852 seconds, 17.9 MB/s
>
> But even on my local directly attached SATA workstation disk when
> doing that same dd on an otherwise idle machine I see performance
> like:
>
> $ dd if=/dev/zero of=foo.test bs=4096 count=4000000
> C755202+0 records in
> 755202+0 records out
> 3093307392 bytes (3.1 GB) copied, 128.552 s, 24.1 MB/s
>
> which again suggests that oflag=direct isn't doing quite what I expect.
oflag=direct turns off caching on the host dd is running on, i.e. the
initiator. The target still caches writes of course, unless you tell it
not to by passing the "-d" flag to vblade.
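(That is, running the export with -d on the target; from memory the
invocation is something like the following, with the shelf/slot,
interface, and exported device adjusted to match your setup:)
vblade -d 6 1 eth1 /dev/yourvg/yourlv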
Gabor