2009-04-04 18:03:50

by Trenton D. Adams

[permalink] [raw]
Subject: IO latency - a special case

Hi Guys,

I've been reading a few threads related to IO, such as the recent ext3
fixes from Ted and such. I didn't want to cloud that thread, so I'm
starting a new one.

Here's something I haven't reported yet, because I haven't been able
to reproduce or identify in any reasonable way, but may be applicable
to this thread.

I am seeing instances where my work load is nil, and I run a process,
which normally does not do a lot of IO. I get load averages of
30-30-28, with a basic lockup for 10 minutes. The only thing I can
see that particular app doing is lots of quick IO, mostly reading,
etc. But, there was no other major workload at the time. Also, one
fix I have employed to reduce my latencies if I'm under heavy load, is
to use "sync" mount option, or "dirty_bytes". But, in this instance,
they had absolutely NO AFFECT. In addition, if I reboot the problem
goes away, for awhile. Swapping is not occurring when I check swap
after my computer comes back. So, it seems to me like this problem is
somewhere primarily outside of the FS layer, or at least outside the
FS layer TOO.

FYI: dirty_bytes setting has a good affect for me "usually", but not
in this case.

If the problem was with primarily ext3, why did I not see it in my
2.6.17 kernel on my i686 gentoo Linux box? Unless there were major
changes to ext3 since then which caused it. And believe me, I
understand that this latency issue is soooo difficult to find. Partly
because I'm an idiot and didn't report it when I saw it two years ago.
If I had reported it then, then you guys would probably be in the
right frame of mind, knowing what changes had just occurred, etc, etc.

If you want, I can give you an strace on the app I ran. I'm pretty
sure it was the one I ran when the problem was occuring. It's 47K
though. Hoever, it doesn't appear that any of the system calls took
any significant amount of time, which seems odd to me, seeing the
massive lockup. And, as far as I know, an app can't cause that kind
of load average of 30 lockup, only the kernel can. Well, also perhaps
a reniced and ioniced realtime process could. Am I right in that?

p.s.
Right now, I switched to data=writeback mode. I'm waiting to see if
this particular problem comes back. I know that overall latencies do
decrease when using data=writeback. And, being on a notebook, with a
battery, that option is okay for me.


2009-04-04 18:32:10

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

Oh, what do you know, it has started again. Check out this vmstat
output. I'm getting a huge number of processes in uninterruptible
sleep. That is bad, right? I haven't seen numbers like that, even on
a heavy load.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 20 30204 13128 392 67836 0 0 10420 56 2126 2542 1 1 50 48
0 21 30204 13316 380 67648 0 0 4800 12 1142 1321 2 1 50 46
3 20 30204 13208 372 67932 0 0 24668 72 2298 2666 2 1 50 48
0 13 30204 13268 372 68196 0 0 33216 88 11021 11494 1 2 43 54
0 17 30204 12908 336 68272 0 0 4932 52 1546 1542 1 2 50 47
0 13 30224 14656 348 67784 0 20 5012 116 1338 1433 1 1 50 49
0 17 30224 13404 384 67856 0 0 4992 72 1372 1525 1 1 50 47
0 17 30224 13444 348 67908 0 0 5516 0 1355 1400 0 1 50 49
0 23 30224 13032 332 67972 0 0 12492 32 1346 1409 1 1 45 54
0 17 30224 13960 372 66896 0 0 26068 228 4266 4962 1 1 35 64
0 18 30224 13384 352 67624 0 0 10096 64 952 1141 2 1 50 47
0 24 30224 13048 312 67836 0 0 49276 184 2407 3106 1 1 51 48
1 31 30224 13864 328 68220 0 0 17664 88 5610 6673 1 1 24 75
1 27 30224 12992 356 68704 0 0 4828 16 1051 1193 0 1 0 99
0 27 30224 13072 336 68440 0 0 5324 4 985 1145 1 0 8 91
0 33 30224 13040 372 68560 0 0 5788 52 916 1171 0 1 29 70
0 23 30224 14616 392 66928 0 0 3628 68 1047 1187 1 1 17 82
0 16 30224 13216 384 68332 0 0 5152 8 979 1252 0 0 50 49
2 18 30224 12872 384 68032 0 0 14832 160 2768 3390 1 1 50 48
0 16 30224 13632 332 68080 0 0 6192 4 1134 1439 0 1 24 74
0 22 30224 13392 348 67928 0 0 5048 92 873 1086 1 1 9 90

OUCH, here's an strace snippet (futex takes 216 seconds)...
mprotect(0x7f99ef00d000, 4096, PROT_READ) = 0 <0.000019>
munmap(0x7f99f0273000, 214104) = 0 <0.000032>
mmap(NULL, 1052672, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x416c0000 <0.000019>
mprotect(0x416c0000, 4096, PROT_NONE) = 0 <0.000017>
clone(child_stack=0x417c0250,

flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x417c09e0, tls=0x417c0950, child_tidptr=0x417c09e0) =
27046 <0.000035>
futex(0x417c09e0, FUTEX_WAIT, 27046, NULL) = ? ERESTARTSYS (To be
restarted) <216.418292>
--- SIGINT (Interrupt) @ 0 (0) ---

Oh, now I get it. Last time i did the strace, I didn't see high
numbers because I didn't kill the process, like I did this time, I
instead did the skinny elephant dance.

Tell me what you want me to do, and I will.

On Sat, Apr 4, 2009 at 12:03 PM, Trenton D. Adams
<[email protected]> wrote:
> Hi Guys,
>
> I've been reading a few threads related to IO, such as the recent ext3
> fixes from Ted and such. ?I didn't want to cloud that thread, so I'm
> starting a new one.
>
> Here's something I haven't reported yet, because I haven't been able
> to reproduce or identify in any reasonable way, but may be applicable
> to this thread.
>
> I am seeing instances where my work load is nil, and I run a process,
> which normally does not do a lot of IO. ?I get load averages of
> 30-30-28, with a basic lockup for 10 minutes. ?The only thing I can
> see that particular app doing is lots of quick IO, mostly reading,
> etc. ?But, there was no other major workload at the time. ?Also, one
> fix I have employed to reduce my latencies if I'm under heavy load, is
> to use "sync" mount option, or "dirty_bytes". ?But, in this instance,
> they had absolutely NO AFFECT. ?In addition, if I reboot the problem
> goes away, for awhile. ?Swapping is not occurring when I check swap
> after my computer comes back. ?So, it seems to me like this problem is
> somewhere primarily outside of the FS layer, or at least outside the
> FS layer TOO.
>
> FYI: dirty_bytes setting has a good affect for me "usually", but not
> in this case.
>
> If the problem was with primarily ext3, why did I not see it in my
> 2.6.17 kernel on my i686 gentoo Linux box? ?Unless there were major
> changes to ext3 since then which caused it. ?And believe me, I
> understand that this latency issue is soooo difficult to find. ?Partly
> because I'm an idiot and didn't report it when I saw it two years ago.
> ?If I had reported it then, then you guys would probably be in the
> right frame of mind, knowing what changes had just occurred, etc, etc.
>
> If you want, I can give you an strace on the app I ran. ?I'm pretty
> sure it was the one I ran when the problem was occuring. It's 47K
> though. ?Hoever, it doesn't appear that any of the system calls took
> any significant amount of time, which seems odd to me, seeing the
> massive lockup. ?And, as far as I know, an app can't cause that kind
> of load average of 30 lockup, only the kernel can. ?Well, also perhaps
> a reniced and ioniced realtime process could. ?Am I right in that?
>
> p.s.
> Right now, I switched to data=writeback mode. ?I'm waiting to see if
> this particular problem comes back. ?I know that overall latencies do
> decrease when using data=writeback. ?And, being on a notebook, with a
> battery, that option is okay for me.
>

2009-04-05 04:17:48

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

On Sat, Apr 4, 2009 at 7:13 PM, Theodore Tso <[email protected]> wrote:
> Trenton. ?Couple of things to try. ?First of all, it looks like your
> application is multi-threaded. ?That's why it can drive up the load so
> high, even though "ps" and "top" is only showing one process. ?Try
> using the -f flag to strace so you can follow all of the processes and
> threads fork()'ed or cloned()'ed from the initial process.
>
> In addition, here's a rather brute-force script that I've used when
> trying to collect data when debugging performance or long-term
> stability problems at customer sites. ?Very often it was used on
> production machines where they don't allow random people to poke
> around on it, so this was designed to be given to a sysadmin, who
> would approve running it on their system, and some hours later, we
> would get the tarball, and then try to figure out what the heck was
> going on.
>
> It doesn't have to run out of cron, BTW; it also can be run from a
> command-line, and some of the polling intervals adjusted smaller if
> you need finer-grained resolution, or it can run as a stand-alone
> daemon as well.
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted

Hi Ted,

I would imagine it is multi threaded, though I am not positive. I am
asking on IRC right now.

I will look into trying out your tests, and the -f flag. I will
report back once I have more data. The good thing is that once the
problem starts happening, it continues to happen. I think it happens
until reboot, but I'm not positive. I'm about to try again, so we'll
see, as I have not rebooted since last time.

Thanks.

2009-04-05 04:25:27

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

One more quick thing. is it okay if I filter on "trace=futex" only,
or do you want something more?

2009-04-05 06:13:55

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

Hi Ted,

I have no idea what "sar" is, and a google search doesn't appear to
reveal anything useful. Where can I get this "sar" tool?

I also do not have "iostat" on my system, and again, google does not
appear to reveal an official page for that, and gentoo doesn't appear
to have a package for it.

I have output without those tools, if you want that. I hit a 1 minute
load average of 46 this time, and 46 uninterruptible processes.

Thanks.

2009-04-05 06:26:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: IO latency - a special case

On Sun, Apr 05, 2009 at 12:13:44AM -0600, Trenton D. Adams wrote:
> Hi Ted,
>
> I have no idea what "sar" is, and a google search doesn't appear to
> reveal anything useful. Where can I get this "sar" tool?
>
> I also do not have "iostat" on my system, and again, google does not
> appear to reveal an official page for that, and gentoo doesn't appear
> to have a package for it.

Both "sar" and "iostat" are part of the sysstat package.

Package: sysstat
Priority: optional
Section: admin
Installed-Size: 808
Maintainer: Ubuntu MOTU Developers <[email protected]>
Original-Maintainer: Robert Luberda <[email protected]>
Architecture: i386
Version: 8.1.2-2ubuntu1
Depends: libc6 (>= 2.4), debconf (>= 0.5) | debconf-2.0, ucf (>= 2.003), lsb-base (>= 3.0-6), bzip2
Recommends: cron
Suggests: isag
Conflicts: atsar (<< 1.5-3)
Filename: pool/main/s/sysstat/sysstat_8.1.2-2ubuntu1_i386.deb
Size: 182222
MD5sum: e59153e0121ce437082ab7e03fe647e0
SHA1: 65383dec2ca6ea4fa5fed41afdde46639d2d5e98
SHA256: eb5530307e52ef69b9d1d3c0e43e7a02e805888e183adcbef272bf9283829851
Description: sar, iostat and mpstat - system performance tools for Linux
The sysstat package contains the following system performance tools:
* sar - collects and reports system activity information;
* iostat - reports CPU utilization and I/O statistics for disks;
* mpstat - reports global and per-processor statistics;
* pidstat - reports statistics for Linux tasks (processes);
* sadf - displays data collected by sar in various formats.
.
The statistics reported by sar concern I/O transfer rates,
paging activity, process-related activities, interrupts,
network activity, memory and swap space utilization, CPU
utilization, kernel activities and TTY statistics, among
others. Both UP and SMP machines are fully supported.
Homepage: http://pagesperso-orange.fr/sebastien.godard/
Bugs: mailto:[email protected]
Origin: Ubuntu

- Ted

2009-04-05 22:42:25

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

Hi Ted,

I haven't seen the problem arise for awhile now. I think it may be
related to editing a file in one of my editors, and not saving it.
The editor might be putting some sort of lock on the file. I'm
waiting for it to occur again so that I can pin point it.

For now, what about that patch thread sent to LKML called "[tip PATCH
v7 0/9] RFC: futex: requeue pi implementation"??? Do you think that
could have fixed my issue? I don't have time to try it right now.
And besides, I should try and make it so I can reproduce the problem,
otherwise I might not know for awhile that it's been fixed, after
applying the patch.

I'll let you know more, when I have more data.

Thanks.

2009-06-07 03:14:52

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

On Sun, Apr 5, 2009 at 4:42 PM, Trenton D.
Adams<[email protected]> wrote:
> Hi Ted,
>
> I haven't seen the problem arise for awhile now. ?I think it may be
> related to editing a file in one of my editors, and not saving it.
> The editor might be putting some sort of lock on the file. ?I'm
> waiting for it to occur again so that I can pin point it.
>
> For now, what about that patch thread sent to LKML called "[tip PATCH
> v7 0/9] RFC: futex: requeue pi implementation"??? ?Do you think that
> could have fixed my issue? ?I don't have time to try it right now.
> And besides, I should try and make it so I can reproduce the problem,
> otherwise I might not know for awhile that it's been fixed, after
> applying the patch.
>
> I'll let you know more, when I have more data.

I'm back into doing some Java development, and it's happening again.
I'm running 2.6.30-rc7, and I did a SysReq+T, I hope this is helpful.
I've attached a txt.gz file, because it's 367K to my
/var/log/messages.

The important stuff, I believe, is the Java related stuff. As a quick
refresher to this problem, it appears to be related to how my Java IDE
saves the file, and perhaps something doesn't release the futex right
away?

I've also seen these symptoms previously...

1. Problem arises, I Ctrl-C the process when I notice the system
getting really sluggish.
2. Start process again, and it locks up the machine for about an hour

Now, when i try the things below, it seems to be resolvable.

1. Problem arises, I Ctrl-C the process when I notice the system
getting really sluggish.
2. I go to my IDE, press Ctrl-S (save), or exit the IDE
3. Start process again, and it does not lock up

I have a feeling it may be a bug in the Java VM that is causing some
sort of problem with the kernel, but I have no idea how to diagnose
such a thing, given the very much a lack of how the kernel works. :P


Attachments:
trace.txt.gz (26.27 kB)

2009-06-11 19:11:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: IO latency - a special case

On Sat, Jun 06, 2009 at 09:14:43PM -0600, Trenton D. Adams wrote:
> I'm back into doing some Java development, and it's happening again.
> I'm running 2.6.30-rc7, and I did a SysReq+T, I hope this is helpful.
> I've attached a txt.gz file, because it's 367K to my
> /var/log/messages.
>
> The important stuff, I believe, is the Java related stuff. As a quick
> refresher to this problem, it appears to be related to how my Java IDE
> saves the file, and perhaps something doesn't release the futex right
> away?

Have you checked your memory usage via SysReq-m, or /proc/meminfo or
simply using "free". I'm seeing a lot of processes in do_page_fault,
which is interesting, especially in the non-Java processes.

Back in April, your vmstat output didn't show much swapping activity,
but maybe there was more paging activity that didn't show up there?

> I've also seen these symptoms previously...
>
> 1. Problem arises, I Ctrl-C the process when I notice the system
> getting really sluggish.
> 2. Start process again, and it locks up the machine for about an hour

What sort of process where you typicalling trying to kill using ^C?

- Ted

2009-06-13 03:48:47

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

Sorry I didn't reply earlier. I got busy with some server troubles and such.

On Thu, Jun 11, 2009 at 1:11 PM, Theodore Tso<[email protected]> wrote:
> On Sat, Jun 06, 2009 at 09:14:43PM -0600, Trenton D. Adams wrote:
>> I'm back into doing some Java development, and it's happening again.
>> I'm running 2.6.30-rc7, and I did a SysReq+T, I hope this is helpful.
>> I've attached a txt.gz file, because it's 367K to my
>> /var/log/messages.
>>
>> The important stuff, I believe, is the Java related stuff. ?As a quick
>> refresher to this problem, it appears to be related to how my Java IDE
>> saves the file, and perhaps something doesn't release the futex right
>> away?
>
> Have you checked your memory usage via SysReq-m, or /proc/meminfo or
> simply using "free". ?I'm seeing a lot of processes in do_page_fault,
> which is interesting, especially in the non-Java processes.

I wasn't even aware of a SysReq-m. I'll try that next time. <rant>I
do wish the Macbook had a hard drive activity light *rolls
eyes*.</rant>

>
> Back in April, your vmstat output didn't show much swapping activity,
> but maybe there was more paging activity that didn't show up there?

Not sure I understand. Paging without swapping? I thought they were
one in the same, please explain; it's been a long time since I studied
kernel related theory.

>
>> I've also seen these symptoms previously...
>>
>> 1. Problem arises, I Ctrl-C the process when I notice the system
>> getting really sluggish.
>> 2. Start process again, and it locks up the machine for about an hour
>
> What sort of process where you typicalling trying to kill using ^C?

It's maven, which spawns javac (java compiler), as well as does
various other tasks such as making connections to the Internet, local
directories, etc.

2009-06-13 04:23:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: IO latency - a special case

On Fri, 12 Jun 2009 21:48:41 -0600
"Trenton D. Adams" <[email protected]> wrote:

> Sorry I didn't reply earlier. I got busy with some server troubles
> and such.
> > What sort of process where you typicalling trying to kill using ^C?
>
> It's maven, which spawns javac (java compiler), as well as does
> various other tasks such as making connections to the Internet, local
> directories, etc.

I assume you know latencytop and how it might be useful to diagnose
such delays ?


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-06-13 05:51:26

by Trenton D. Adams

[permalink] [raw]
Subject: Re: IO latency - a special case

On Fri, Jun 12, 2009 at 10:24 PM, Arjan van de Ven<[email protected]> wrote:
> On Fri, 12 Jun 2009 21:48:41 -0600
> "Trenton D. Adams" <[email protected]> wrote:
>
>> Sorry I didn't reply earlier. ?I got busy with some server troubles
>> and such.
>> > What sort of process where you typicalling trying to kill using ^C?
>>
>> It's maven, which spawns javac (java compiler), as well as does
>> various other tasks such as making connections to the Internet, local
>> directories, etc.
>
> I assume you know latencytop and how it might be useful to diagnose
> such delays ?

Your assumption would be incorrect. ;) Never heard of it before now.
But, I did have it installed for some reason.

However, when this starts happening, it's dang near impossible to do
anything on the machine. I wouldn't be able to run it if I tried,
though I will try :D. I barely have enough time to Ctrl-C before the
machines locks for about 3600 seconds. And during that time (around
10-30 seconds), I can't really switch applications, as the latency is
so high.

Right now, just during normal system usage, it's showing...
Cause Maximum Percentage
poll_schedule_timeout do_sys_poll sys_poll sysente 5.0 msec 58.0 %
futex_wait do_futex compat_sys_futex sysenter_disp 5.0 msec 40.3 %
blk_execute_rq scsi_execute scsi_execute_req scsi_ 3.9 msec 0.2 %
blk_execute_rq scsi_execute scsi_execute_req sd_re 3.5 msec 0.2 %
poll_schedule_timeout do_select core_sys_select sy 3.4 msec 0.3 %
blk_execute_rq scsi_execute scsi_execute_req sr_te 2.6 msec 0.5 %
blk_execute_rq scsi_execute scsi_execute_req sr_te 2.2 msec 0.5 %
blk_execute_rq scsi_execute sr_do_ioctl sr_packet 0.9 msec 0.1 %
blk_execute_rq scsi_execute scsi_execute_req ioctl 0.6 msec 0.0 %
release __blkdev_put blkdev_put blkdev_close __fput

2009-06-13 06:16:08

by Arjan van de Ven

[permalink] [raw]
Subject: Re: IO latency - a special case

On Fri, 12 Jun 2009 23:51:19 -0600
"Trenton D. Adams" <[email protected]> wrote:
>
> Right now, just during normal system usage, it's showing...
> Cause Maximum
> Percentage poll_schedule_timeout do_sys_poll sys_poll sysente 5.0
> msec 58.0 % futex_wait do_futex compat_sys_futex
> sysenter_disp 5.0 msec 40.3 % blk_execute_rq scsi_execute
> scsi_execute_req scsi_ 3.9 msec 0.2 % blk_execute_rq
> scsi_execute scsi_execute_req sd_re 3.5 msec 0.2 %
> poll_schedule_timeout do_select core_sys_select sy 3.4 msec
> 0.3 % blk_execute_rq scsi_execute scsi_execute_req sr_te 2.6
> msec 0.5 % blk_execute_rq scsi_execute scsi_execute_req
> sr_te 2.2 msec 0.5 % blk_execute_rq scsi_execute
> sr_do_ioctl sr_packet 0.9 msec 0.1 % blk_execute_rq
> scsi_execute scsi_execute_req ioctl 0.6 msec 0.0 % release
> __blkdev_put blkdev_put blkdev_close __fput

this shows it's not correctly installed....

(and it's likely an older version)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org