2012-05-09 16:20:07

by Felipe Contreras

Subject: Big I/O latencies, except when iotop is hooked

Hi,

I've been noticing big I/O latencies when doing operations with
notmuch (xapian database). The operations are quite simple and
probably should not need a lot of I/O; however, they seem to take a
long time, sometimes even seconds. But when I hook iotop (-p pid) to
the process, the latency goes away, and every operation is reliably
(basically) instantaneous.

Do you have any ideas about what might be causing this delay, and
why iotop makes it go away?

BTW, this is an encrypted ext4 partition on a USB stick. I tried
different mount options without any noticeable change. I also tried
copying the data to my SSD drive and doing the same operations; while
it was much faster, it still seemed to hit some randomly triggered
delays. This is with v3.3.5.

Cheers.

--
Felipe Contreras


2012-05-09 17:56:19

by Arnd Bergmann

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wednesday 09 May 2012, Felipe Contreras wrote:
> I've been noticing big I/O latencies when doing operations with
> notmuch (xapian database). The operations are quite simple and
> probably should not need a lot of I/O; however, they seem to take a
> long time, sometimes even seconds. But when I hook iotop (-p pid) to
> the process, the latency goes away, and every operation is reliably
> (basically) instantaneous.
>
> Do you have any ideas about what might be causing this delay, and
> why iotop makes it go away?
>
> BTW, this is an encrypted ext4 partition on a USB stick. I tried
> different mount options without any noticeable change. I also tried
> copying the data to my SSD drive and doing the same operations; while
> it was much faster, it still seemed to hit some randomly triggered
> delays. This is with v3.3.5.

USB sticks like most other cheap flash media tend to have long latencies
because of the effects I describe on https://lwn.net/Articles/428584/.

I don't know if you have the chance to run flashbench[1] on it (which
will destroy the data on the partition you are testing), but that
should at least tell you whether it's a problem with the drive.
You can also combine the information from that with a blktrace log
and flashsim[2] to see where the actual latencies are expected to
happen.
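
If you don't have a trace yet, blktrace can capture one while you
reproduce the problem, roughly like this (a sketch, adjust the device
name):

$ sudo blktrace -d /dev/sdX -o mytrace

You may need to postprocess the output into the iolog format that
flashsim expects.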

The first thing you should check is whether the partitions are
properly aligned, using 'fdisk -l -u /dev/sdX'.
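
For instance, assuming a 4 MB erase block (8192 sectors of 512 bytes),
the start sector from that output should be a multiple of 8192. A
quick check, with a hypothetical start sector of 2048:

$ echo $(( 2048 % 8192 ))
2048

A nonzero result like this means the partition starts on a 1 MB
boundary but not on a 4 MB erase block boundary.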

Of course none of this will tell you why iotop makes any difference.

Arnd

[1] http://git.linaro.org/gitweb?p=people/arnd/flashbench.git
[2] http://yxit.co.uk/public/flash-performance/flashsim/

2012-05-09 20:16:16

by Felipe Contreras

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wed, May 9, 2012 at 7:56 PM, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 09 May 2012, Felipe Contreras wrote:
>> I've been noticing big I/O latencies when doing operations with
>> notmuch (xapian database). The operations are quite simple and
>> probably should not need a lot of I/O; however, they seem to take a
>> long time, sometimes even seconds. But when I hook iotop (-p pid) to
>> the process, the latency goes away, and every operation is reliably
>> (basically) instantaneous.
>>
>> Do you have any ideas about what might be causing this delay, and
>> why iotop makes it go away?
>>
>> BTW, this is an encrypted ext4 partition on a USB stick. I tried
>> different mount options without any noticeable change. I also tried
>> copying the data to my SSD drive and doing the same operations; while
>> it was much faster, it still seemed to hit some randomly triggered
>> delays. This is with v3.3.5.
>
> USB sticks like most other cheap flash media tend to have long latencies
> because of the effects I describe on https://lwn.net/Articles/428584/.
>
> I don't know if you have the chance to run flashbench[1] on it (which
> will destroy the data on the partition you are testing), but that
> should at least tell you whether it's a problem with the drive.
> You can also combine the information from that with a blktrace log
> and flashsim[2] to see where the actual latencies are expected to
> happen.
>
> The first thing you should check is whether the partitions are
> properly aligned, using 'fdisk -l -u /dev/sdX'.
>
> Of course none of this will tell you why iotop makes any difference.

And it doesn't tell me why I see the issue on my SSD as well, albeit
with much less delay.

I'll try what you suggest; maybe I will get better performance by
aligning the partitions, but the real issue seems to lie elsewhere,
right?

Cheers.

--
Felipe Contreras

2012-05-09 20:43:16

by Arnd Bergmann

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wednesday 09 May 2012, Felipe Contreras wrote:
> And it doesn't tell me why I see the issue on my SSD as well, albeit
> with much less delay.
>
> I'll try what you suggest; maybe I will get better performance by
> aligning the partitions, but the real issue seems to lie elsewhere,
> right?

For anything but the absolute worst devices, misaligning the partition
should cause a linear increase in latency (maybe a factor of 2 in the
worst case), so it doesn't explain the entire thing. However, you can
use flashbench to find out the erase block size. If it's larger than
4 MB, you definitely have a problem even with ext4 or btrfs. The
latencies you get from garbage collection for a single write are
roughly the time it takes to write one erase block, and that can be up
to a second in the worst case (slow media with large erase blocks).
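
As a back-of-the-envelope example (illustrative numbers, not measured
from your stick): with a 4 MB erase block on media that sustains
4 MB/s, one garbage collection costs about a second:

$ echo "scale=2; 4 / 4" | bc    # erase block (MB) / write speed (MB/s), in seconds
1.00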

Arnd

2012-05-09 22:19:25

by Felipe Contreras

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wed, May 9, 2012 at 7:56 PM, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 09 May 2012, Felipe Contreras wrote:
>> I've been noticing big I/O latencies when doing operations with
>> notmuch (xapian database). The operations are quite simple and
>> probably should not need a lot of I/O; however, they seem to take a
>> long time, sometimes even seconds. But when I hook iotop (-p pid) to
>> the process, the latency goes away, and every operation is reliably
>> (basically) instantaneous.
>>
>> Do you have any ideas about what might be causing this delay, and
>> why iotop makes it go away?
>>
>> BTW, this is an encrypted ext4 partition on a USB stick. I tried
>> different mount options without any noticeable change. I also tried
>> copying the data to my SSD drive and doing the same operations; while
>> it was much faster, it still seemed to hit some randomly triggered
>> delays. This is with v3.3.5.
>
> USB sticks like most other cheap flash media tend to have long latencies
> because of the effects I describe on https://lwn.net/Articles/428584/.
>
> I don't know if you have the chance to run flashbench[1] on it (which
> will destroy the data on the partition you are testing), but that
> should at least tell you whether it's a problem with the drive.

I'm having trouble figuring this out, but I guess the erase block size is 2 MiB:

% sudo ./flashbench -a /dev/sdd --blocksize=1024 --count=10
align 2147483648 pre 932µs on 1.08ms post 978µs diff 124µs
align 1073741824 pre 641µs on 676µs post 639µs diff 36.2µs
align 536870912 pre 778µs on 854µs post 801µs diff 64.9µs
align 268435456 pre 826µs on 908µs post 880µs diff 54.6µs
align 134217728 pre 646µs on 660µs post 655µs diff 9.26µs
align 67108864 pre 783µs on 834µs post 783µs diff 50.7µs
align 33554432 pre 825µs on 942µs post 864µs diff 97.1µs
align 16777216 pre 802µs on 995µs post 835µs diff 176µs
align 8388608 pre 905µs on 1.09ms post 921µs diff 175µs
align 4194304 pre 990µs on 1.14ms post 998µs diff 145µs
align 2097152 pre 1.06ms on 1.21ms post 1.08ms diff 144µs
align 1048576 pre 1.06ms on 1.22ms post 1.06ms diff 162µs
align 524288 pre 1.04ms on 1.15ms post 996µs diff 136µs
align 262144 pre 1.06ms on 1.22ms post 1.06ms diff 155µs
align 131072 pre 1.01ms on 1.14ms post 999µs diff 138µs
align 65536 pre 1.06ms on 1.23ms post 1.05ms diff 179µs
align 32768 pre 1.05ms on 1.22ms post 1.1ms diff 146µs
align 16384 pre 1.03ms on 1.2ms post 1.05ms diff 157µs
align 8192 pre 1.03ms on 1.14ms post 1.02ms diff 115µs
align 4096 pre 1.03ms on 1.12ms post 1.04ms diff 88.5µs
align 2048 pre 1.01ms on 1.03ms post 1.02ms diff 12.9µs

http://usbflashspeed.com/19425

But sometimes I get weird results:

% sudo ./flashbench -O --erasesize=$[2 * 1024 * 1024] --blocksize=$[16 * 1024] /dev/sdd --open-au-nr=2 --count=10
2MiB 1.88M/s
1MiB 21.5M/s
512KiB 21.9M/s
256KiB 20.8M/s
128KiB 17.7M/s
64KiB 22.3M/s
32KiB 20.8M/s
16KiB 17.1M/s
% sudo ./flashbench -O --erasesize=$[2 * 1024 * 1024] --blocksize=$[16 * 1024] /dev/sdd --open-au-nr=2 --count=10
2MiB 21.3M/s
1MiB 21.9M/s
512KiB 21.7M/s
256KiB 21.3M/s
128KiB 17.6M/s
64KiB 23.6M/s
32KiB 20.3M/s
16KiB 16M/s

> You can also combine the information from that with a blktrace log
> and flashsim[2] to see where the actual latencies are expected to
> happen.

While doing the relevant operation I got:

% ./flashsim -i iolog -s 2M -b 16k -m 5 -a la
176.87

I guess that's bad?

> The first thing you should check is whether the partitions are
> properly aligned, using 'fdisk -l -u /dev/sdX'.

That doesn't seem to be the problem, but it still doesn't explain the results I'm getting.

I think it's pretty clear there's some kind of I/O scheduling problem.

% sudo gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.4

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 15638480 sectors, 7.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 115D90F0-6973-4C82-B778-F1434BDA38DE
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 15638446
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size      Code  Name
   1              2048        15638446  7.5 GiB   8300  Linux filesystem

Cheers.

--
Felipe Contreras

2012-05-10 08:24:33

by Felipe Contreras

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wed, May 9, 2012 at 7:56 PM, Arnd Bergmann <[email protected]> wrote:

> The first thing you should check is whether the partitions are
> properly aligned, using 'fdisk -l -u /dev/sdX'.

I aligned it to 4 MiB; it doesn't make much of a difference.

> Of course none of this will tell you why iotop makes any difference.

After trying a bit more, I can still see the problem even with iotop
running. I could swear it happens much less often, but I can't really
be sure now.

I tried different I/O schedulers, and cfq seems to do better. I also
tried different priorities, but the issue was still there.
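
(For reference, I was switching schedulers with something like

% echo cfq | sudo tee /sys/block/sdd/queue/scheduler

and setting the priorities with ionice.)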

I also tried setting up the crypto partition with different
align-payload values, to no avail, and even getting rid of crypto
altogether. I also tried multiple ext4 format options, but they seemed
to make matters worse.

I finally tried on a normal disk drive, and it works perfectly there
(even with encryption); so the issue happens only on the SSD/USB key.

Cheers.

--
Felipe Contreras

2012-05-10 10:50:10

by Arnd Bergmann

Subject: Re: Big I/O latencies, except when iotop is hooked

On Wednesday 09 May 2012, Felipe Contreras wrote:
> On Wed, May 9, 2012 at 7:56 PM, Arnd Bergmann <[email protected]> wrote:
> > On Wednesday 09 May 2012, Felipe Contreras wrote:
> >> I've been noticing big I/O latencies when doing operations with
> >> notmuch (xapian database). The operations are quite simple and
> >> probably should not need a lot of I/O; however, they seem to take a
> >> long time, sometimes even seconds. But when I hook iotop (-p pid) to
> >> the process, the latency goes away, and every operation is reliably
> >> (basically) instantaneous.
> >>
> >> Do you have any ideas about what might be causing this delay, and
> >> why iotop makes it go away?
> >>
> >> BTW, this is an encrypted ext4 partition on a USB stick. I tried
> >> different mount options without any noticeable change. I also tried
> >> copying the data to my SSD drive and doing the same operations; while
> >> it was much faster, it still seemed to hit some randomly triggered
> >> delays. This is with v3.3.5.
> >
> > USB sticks like most other cheap flash media tend to have long latencies
> > because of the effects I describe on https://lwn.net/Articles/428584/.
> >
> > I don't know if you have the chance to run flashbench[1] on it (which
> > will destroy the data on the partition you are testing), but that
> > should at least tell you whether it's a problem with the drive.
>
> I'm having trouble figuring this out, but I guess the erase block size is 2 MiB:
>
> % sudo ./flashbench -a /dev/sdd --blocksize=1024 --count=10
> align 2147483648 pre 932µs on 1.08ms post 978µs diff 124µs
> align 1073741824 pre 641µs on 676µs post 639µs diff 36.2µs
> align 536870912 pre 778µs on 854µs post 801µs diff 64.9µs
> align 268435456 pre 826µs on 908µs post 880µs diff 54.6µs
> align 134217728 pre 646µs on 660µs post 655µs diff 9.26µs
> align 67108864 pre 783µs on 834µs post 783µs diff 50.7µs
> align 33554432 pre 825µs on 942µs post 864µs diff 97.1µs
> align 16777216 pre 802µs on 995µs post 835µs diff 176µs
> align 8388608 pre 905µs on 1.09ms post 921µs diff 175µs
> align 4194304 pre 990µs on 1.14ms post 998µs diff 145µs
> align 2097152 pre 1.06ms on 1.21ms post 1.08ms diff 144µs
> align 1048576 pre 1.06ms on 1.22ms post 1.06ms diff 162µs

It's not really clear from this data. 2 MB sounds wrong because the
numbers are almost the same for the 1 MB and 2 MB lines. 4 MB is
possible, and so is 32 MB according to those measurements, but that
would be too large for an 8 GB drive.

What I find very unusual is the reported size of 15638480 sectors, which is not
a multiple of any of the usual erase block sizes. This means that the erase block
size could be an odd value.

The prime factors of the drive size are:
$ factor 15638480
15638480: 2 2 2 2 5 11 13 1367

This is 24 KiB (48 sectors) short of 7636 MB, which is a multiple of
4 MB but not of 8 MB. Hence I would guess that the erase block size is
4 MB, with 24 KiB cut off from either the first or the last block
(I've seen both before, but usually only on CF cards, which round down
to a multiple of 63*255 sectors).

Let's hope that the last block is cut off, not the first one; otherwise
you'd have to move the start of the partition ahead by 24 KiB. You can
test that by giving an '--offset=4169728' argument to flashbench (that
is 4 MiB minus 24 KiB), which should be either much faster or much
slower than not giving the offset argument.
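
For reference, the arithmetic behind those numbers:

$ echo $(( 7636 * 2048 - 15638480 ))    # sectors short of 7636 MiB
48
$ echo $(( 48 * 512 ))                  # bytes cut off
24576
$ echo $(( 4 * 1024 * 1024 - 24576 ))   # offset to pass to flashbench
4169728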

> http://usbflashspeed.com/19425
>
> But sometimes I get weird results:
>
> % sudo ./flashbench -O --erasesize=$[2 * 1024 * 1024] --blocksize=$[16 * 1024] /dev/sdd --open-au-nr=2 --count=10
> 2MiB 1.88M/s
> 1MiB 21.5M/s
> 512KiB 21.9M/s
> 256KiB 20.8M/s
> 128KiB 17.7M/s
> 64KiB 22.3M/s
> 32KiB 20.8M/s
> 16KiB 17.1M/s
> % sudo ./flashbench -O --erasesize=$[2 * 1024 * 1024] --blocksize=$[16 * 1024] /dev/sdd --open-au-nr=2 --count=10
> 2MiB 21.3M/s
> 1MiB 21.9M/s
> 512KiB 21.7M/s
> 256KiB 21.3M/s
> 128KiB 17.6M/s
> 64KiB 23.6M/s
> 32KiB 20.3M/s
> 16KiB 16M/s

This does not look that unusual. The first 2MB number is from when the
test hit a garbage collection, which incurred an overhead of about one
second (!) for this device (it tried to write 2 MB and measured an
average throughput of 2 MB/s for that). The drive is optimized for
64 KB clusters, which are faster than other sizes, and writing more
than 64 KB at once causes some overhead that gets smaller at larger
sizes. My guess is that it can do random access for 64 KB chunks but
not larger than that -- you can test that by passing '--random' as an
extra argument.
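
I.e. something like this (a sketch, using the 4 MB erase size guess
from above):

$ sudo ./flashbench -O --erasesize=$[4 * 1024 * 1024] --blocksize=$[16 * 1024] --open-au-nr=2 --random /dev/sdd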

Note that '--count' is ignored for the '-O' test.

The important measurement (once you have found the erase block size) is
how many erase blocks can be held open at the same time, so I'd try
passing --open-au-nr=8 and see what happens. If the numbers are in the
same ballpark as above, increase it further until you hit the point
where the throughput drops off sharply.
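
An untested sketch of such a sweep:

$ for n in 4 8 16 32; do sudo ./flashbench -O --erasesize=$[4 * 1024 * 1024] --blocksize=$[16 * 1024] --open-au-nr=$n /dev/sdd; done

The largest value that still gives full throughput is roughly the
number of erase blocks the controller can keep open at once.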

> > You can also combine the information from that with a blktrace log
> > and flashsim[2] to see where the actual latencies are expected to
> > happen.
>
> While doing the relevant operation I got:
>
> % ./flashsim -i iolog -s 2M -b 16k -m 5 -a la
> 176.87
>
> I guess that's bad?

176 is extremely bad, yes. A good value would be between 1 and 3.
The algorithm is probably not 'la' but 'dl', and the page size (-b)
could be smaller -- you have to test by passing '--blocksize=1024' to
the --open-au test and see at which size the throughput starts
dropping drastically (if 4KB gives you about half the throughput of
8KB, 8KB is the page size). Those two can make the result better.
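
I.e. first something like this (a sketch, using the 4 MB erase size
guess):

$ sudo ./flashbench -O --erasesize=$[4 * 1024 * 1024] --blocksize=1024 --open-au-nr=2 /dev/sdd

to find the page size, and then rerun the simulation with the values
you find, e.g.:

$ ./flashsim -i iolog -s 4M -b 8k -m 5 -a dl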

As I said, the erase block size is more likely to be 4MB, which
will make the flashsim result worse.

Does flashsim give a better number for a trace taken with iotop
running?

Can you send me or upload that iolog file?

> > The first thing you should check is whether the partitions are
> > properly aligned, using 'fdisk -l -u /dev/sdX'.
>
> That doesn't seem to be the problem, but it still doesn't explain the results I'm getting.
>
> I think it's pretty clear there's some kind of I/O scheduling problem.
>
> % sudo gdisk -l /dev/sdd
> GPT fdisk (gdisk) version 0.8.4
>
> Partition table scan:
> MBR: protective
> BSD: not present
> APM: not present
> GPT: present
>
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdd: 15638480 sectors, 7.5 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 115D90F0-6973-4C82-B778-F1434BDA38DE
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 15638446
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)    End (sector)  Size      Code  Name
>    1              2048        15638446  7.5 GiB   8300  Linux filesystem

It's aligned to 1MB, which is less than the erase block size, but as
long as the file system doesn't know about erase blocks, that's good enough.

Arnd

2012-05-14 08:59:57

by Arnd Bergmann

Subject: Re: Big I/O latencies, except when iotop is hooked

On Thursday 10 May 2012, Arnd Bergmann wrote:
> 176 is extremely bad, yes. A good value would be between 1 and 3.
> The algorithm is probably not 'la' but 'dl', and the page size (-b)
> could be smaller -- you have to test by passing '--blocksize=1024' to
> the --open-au test and see at which size the throughput starts
> dropping drastically (if 4KB gives you about half the throughput of
> 8KB, 8KB is the page size). Those two can make the result better.
>
> As I said, the erase block size is more likely to be 4MB, which
> will make the flashsim result worse.
>
> Does flashsim give a better number for a trace taken with iotop
> running?
>
> Can you send me or upload that iolog file?

Hi Felipe,

Any update? I'd really be interested in the trace file so that we
can look at data from a real-world case that hurts. I've discussed
your problem in the Linaro storage team meeting, and the question
came up whether this only happens with encryption enabled.

Which encryption method do you actually use? Depending on how the
encryption is implemented, two things could possibly go wrong that
would not happen without it:

* A superblock is added to the partition that contains some metadata
relating to the encryption, so all data is offset by a few sectors,
turning a perfectly aligned partition into a misaligned one.

* All write accesses are sent to the device as 4 KB operations rather
than being at least flash-page-sized (e.g. 16 KB) I/O most of the
time.
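
If this is LUKS/dm-crypt, the first point is easy to check (a sketch,
assuming the partition from earlier in the thread):

$ sudo cryptsetup luksDump /dev/sdd1 | grep -i 'payload offset'

The payload offset is printed in 512-byte sectors, so for a 4 MB erase
block it should be a multiple of 8192.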

Arnd

2012-05-14 09:08:26

by Felipe Contreras

Subject: Re: Big I/O latencies, except when iotop is hooked

Hi,

On Mon, May 14, 2012 at 10:59 AM, Arnd Bergmann <[email protected]> wrote:
> On Thursday 10 May 2012, Arnd Bergmann wrote:
>> 176 is extremely bad, yes. A good value would be between 1 and 3.
>> The algorithm is probably not 'la' but 'dl', and the page size (-b)
>> could be smaller -- you have to test by passing '--blocksize=1024' to
>> the --open-au test and see at which size the throughput starts
>> dropping drastically (if 4KB gives you about half the throughput of
>> 8KB, 8KB is the page size). Those two can make the result better.
>>
>> As I said, the erase block size is more likely to be 4MB, which
>> will make the flashsim result worse.
>>
>> Does flashsim give a better number for a trace taken with iotop
>> running?
>>
>> Can you send me or upload that iolog file?
>
> Hi Felipe,
>
> Any update?

Nope, sorry, my laptop got fried. I have a new one and I might be able
to test later today; I don't see why I would have any problems
reproducing this there, but let's see.

> I'd really be interested in the trace file so that we
> can look at data from a real-world case that hurts. I've discussed
> your problem in the Linaro storage team meeting, and the question
> came up whether this only happens with encryption enabled.

As I already said, this also happens without encryption.

> Which encryption method do you actually use? Depending on how the
> encryption is implemented, two things could possibly go wrong that
> would not happen without it:

I just do 'cryptsetup luksFormat'; I'm not sure what its default is.
According to the manpage it's aes-cbc-essiv:sha256. I tried different
--align-payload values, but none actually improved the situation.
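
(I.e. invocations along the lines of

% sudo cryptsetup luksFormat --align-payload=8192 /dev/sdd1

where --align-payload is in 512-byte sectors, so 8192 would put the
payload on a 4 MiB boundary.)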

Anyway, the problem is visible even with plain ext4.

Cheers.

--
Felipe Contreras