2003-05-27 03:29:02

by manish

Subject: 2.4.20: Process stuck in __lock_page ...

Hello !

I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
I am running bonnie across four drives in parallel:

bonnie -s 1000 -d /<dir-name>

bdflush settings on this system:

[root@dyn-10-123-130-235 vm]# cat bdflush
2 50 32 100 50 300 1 0 0

All the bonnie processes and any other processes (like df, ps -ef etc.) are
hung in __lock_page. Breaking into kdb, I observe the following for one
such bonnie process:

schedule(..)
__lock_page(..)
lock_page(..)
do_generic_file_read(..)
generic_file_read(..)

After this, the processes never exit the hang. At times, a couple of
bonnie processes complete but the hang still occurs with the remaining
processes and with the other processes.

I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
the hang does not occur. If I run two bonnie processes, they never get
stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.

Any clues where this could be happening?

Thanks
-Manish


2003-05-27 03:52:52

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Mon, 26 May 2003, manish wrote:

> Hello !
>
> I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
> I am running bonnie across four drives in parallel:
>
> bonnie -s 1000 -d /<dir-name>
>
> bdflush settings on this system:
>
> [root@dyn-10-123-130-235 vm]# cat bdflush
> 2 50 32 100 50 300 1 0 0
>
> All the bonnie processes and any other processes (like df, ps -ef etc.) are
> hung in __lock_page. Breaking into kdb, I observe the following for one
> such bonnie process:
>
> schedule(..)
> __lock_page(..)
> lock_page(..)
> do_generic_file_read(..)
> generic_file_read(..)
>
> After this, the processes never exit the hang. At times, a couple of
> bonnie processes complete but the hang still occurs with the remaining
> processes and with the other processes.
>
> I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
> the hang does not occur. If I run two bonnie processes, they never get
> stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
>
> Any clues where this could be happening?

Hi,

Are you sure there is no disk activity ?

Run vmstat and check that, please.

2003-05-27 04:12:18

by manish

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Marcelo Tosatti wrote:

>Hi,
>
>Are you sure there is no disk activity ?
>
>Run vmstat and check that, please.
>
Hello !

Thanks for the response.

The light on the controller does not blink at all. Initially, it does
blink. However, after this hang, it does not at all.

vmstat after the hang

 1  1  0   780  2056892   5784  1415324   0   0    0    4  102    7  49   1   50
 1  1  0   780  2056892   5784  1415324   0   0    0    4  102    9  49   1   50
 1  1  0   780  2056892   5784  1415324   0   0    0    5  104   10  29  21   50
 0  1  0   780  2056708   5784  1415324   0   0    0    1  104   12   0  13   86
 1  1  0   780  2222904   5784  1249396   0   0    0  172  126   25   0   4   96
 0  1  0   780  3081052   5784   391324   0   0    0  403  161   43   0  12   88
   procs                      memory    swap        io    system         cpu
 r  b  w  swpd     free   buff    cache  si  so   bi   bo   in   cs  us  sy   id
 0  1  0   780  3080952   5788   391408   0   0   29    9  120   72   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    0  111   19   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    1  103    9   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    0  101    9   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    0  101    7   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    0  101    9   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    0  102    9   0   0  100
 0  1  0   780  3080952   5788   391408   0   0    0    1  101    8   0   0  100
 0  1  0   780  3081308   5788   391420   0   0    0  231  150   92   3   0   97
 0  1  0   780  3081308   5788   391420   0   0    0    0  102    7   0   0  100
 0  1  0   780  3081308   5788   391420   0   0    0    0  102    7   0   0  100
 0  1  0   780  3081304   5788   391420   0   0    0    0  101    9   0   0  100
 0  1  0   780  3081304   5788   391420   0   0    0    0  102    8   0   0  100
 0  1  0   780  3081300   5788   391420   0   0    0    0  101    8   0   0  100
 0  1  0   780  3081300   5788   391420   0   0    0    0  101    9   0   0  100
 0  1  0   780  3081296   5788   391420   0   0    0    0  101    7   0   0  100
 0  1  0   780  3081296   5788   391420   0   0    0    0  101    9   0   0  100
 0  1  0   780  3081292   5788   391420   0   0    0    0  102    9   0   0  100
 0  1  0   780  3081292   5788   391420   0   0    0    0  101    8   0   0  100
 0  1  0   780  3081288   5788   391420   0   0    0    0  102    9   0   0  100
 0  1  0   780  3081288   5788   391420   0   0    0    0  102    7   0   0  100
 0  1  0   780  3081284   5788   391420   0   0    0    0  102    9   0   0  100
 0  1  0   780  3081284   5788   391420   0   0    0    0  102    8   0   0  100
 0  1  0   780  3081280   5788   391420   0   0    0    0  101    8   0   0  100
 0  1  0   780  3081276   5788   391420   0   0    0    0  102    9   0   0  100
 0  1  0   780  3081260   5788   391420   0   0    0    0  235   30   0   0  100
 0  1  0   780  3081260   5788   391420   0   0    0    0  101    9   0   0  100
 0  1  0   780  3081256   5788   391420   0   0    0    0  101    7   0   0  100
 0  1  0   780  3081248   5788   391424   0   0    0  169  137   54   3   1   97
 0  1  0   780  3081248   5788   391424   0   0    0    0  101    9   0   0  100
 0  1  0   780  3081248   5788   391424   0   0    0    0  101    8   0   0  100
 0  1  0   780  3081248   5788   391424   0   0    0    0  101    9   0   0  100

One bonnie process is hung.
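
One way to check output like this mechanically is to read the bi/bo columns of each sample and see whether block I/O has really stopped. The following is only a minimal sketch, assuming the 16-column layout of the vmstat header in this thread (r b w swpd free buff cache si so bi bo in cs us sy id) with each sample on a single line. The function names and the threshold are illustrative, not part of any tool.

```python
# Column order as printed by the vmstat header in this thread.
COLUMNS = ["r", "b", "w", "swpd", "free", "buff", "cache",
           "si", "so", "bi", "bo", "in", "cs", "us", "sy", "id"]


def parse_vmstat(text):
    """Return one dict per sample line; header and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only lines that consist of exactly 16 integers.
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue
        samples.append(dict(zip(COLUMNS, map(int, fields))))
    return samples


def io_is_idle(samples, threshold=1):
    """True if every sample shows at most `threshold` blocks in and out."""
    return all(s["bi"] <= threshold and s["bo"] <= threshold for s in samples)


# Three rows copied from the output above, one sample per line.
SAMPLE = """\
 r b w swpd free buff cache si so bi bo in cs us sy id
 0 1 0 780 3080952 5788 391408 0 0 0 0 111 19 0 0 100
 0 1 0 780 3080952 5788 391408 0 0 0 1 103 9 0 0 100
 0 1 0 780 3081308 5788 391420 0 0 0 231 150 92 3 0 97
"""

if __name__ == "__main__":
    rows = parse_vmstat(SAMPLE)
    print(len(rows), "samples; I/O idle throughout:", io_is_idle(rows))
    print("blocked tasks per sample:", [s["b"] for s in rows])
```

The "b" column staying at 1 while bi and bo sit near zero is what distinguishes a task stuck in uninterruptible sleep from a merely slow disk.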







2003-05-27 04:18:36

by manish

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Marcelo Tosatti wrote:

>Hi,
>
>Are you sure there is no disk activity ?
>
>Run vmstat and check that, please.
>
Hello !

My bad. This is one of the kernels that had modified the IO subsystem to
replace the io_request_lock with a finer grained host_lock and queue_lock.

I also noticed that the hang occurs when the settings of bdflush are the
following:

[root@dyn-10-123-130-235 vm]# cat bdflush
30 50 32 100 50 300 60 0 0

Thanks
-Manish





2003-05-27 04:48:34

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Mon, 26 May 2003, manish wrote:

> Marcelo Tosatti wrote:
>
> >Hi,
> >
> >Are you sure there is no disk activity ?
> >
> >Run vmstat and check that, please.
> >
> Hello !
>
> Thanks for the response.
>
> The light on the controller does not blink at all. Initially, it does
> blink. However, after this hang, it does not at all.

Ok, and does it happen with the stock kernel?

2003-05-27 14:01:51

by Carl-Daniel Hailfinger

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Christian,

this looks suspiciously like the problem you are experiencing since
2.4.19-pre. Maybe we can fix this for good.

Marcelo Tosatti wrote:
>
> On Mon, 26 May 2003, manish wrote:
>
>
>>Hello !
>>
>>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
>>I am running bonnie across four drives in parallel:
>>
>>bonnie -s 1000 -d /<dir-name>
>>
>>bdflush settings on this system:
>>
>>[root@dyn-10-123-130-235 vm]# cat bdflush
>>2 50 32 100 50 300 1 0 0
>>
>>All the bonnie processes and any other processes (like df, ps -ef etc.) are
>>hung in __lock_page. Breaking into kdb, I observe the following for one

Following is SysRq-T output for stuck processes during such a pause from
Christian Klose. Only processes in D state are listed for brevity.
Especially the last two call traces are interesting.

kjournald D C15C7240 4 122 1 123 120 (L-TLB)
Call Trace: [__get_request_wait+197/208] [__make_request+392/1472]
[generic_make_request+226/304] [submit_bh+80/112] [ll_rw_block+263/432]
[journal_commit_transaction+4017/4416] [kjournald+277/464]
[commit_timeout+0/16] [kernel_thread+46/64] [kjournald+0/464]
kmail D D73E9360 2656 1960 1 1978 (NOTLB)
Call Trace: [sleep_on+56/96] [log_wait_commit+56/80]
[journal_stop+345/480] [journal_force_commit+60/64]
[ext3_force_commit+35/48] [ext3_sync_file+132/176]
[ext3_writepage+0/672] [sys_fsync+151/208] [system_call+51/56]
mc D C016B338 0 2177 2152 2179 (NOTLB)
Call Trace: [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
[file_read_actor+0/160] [generic_file_read+178/368]
[file_read_actor+0/160] [sys_read+163/320] [system_call+51/56]
kmail D 00200282 2656 1960 1 1978 (NOTLB)
Call Trace: [sleep_on+56/96] [log_wait_commit+56/80]
[journal_stop+345/480] [journal_force_commit+60/64]
[ext3_force_commit+35/48] [ext3_sync_file+132/176]
[ext3_writepage+0/672] [sys_fsync+151/208] [system_call+51/56]
mc D C016B338 0 2177 2152 2179 (NOTLB)
Call Trace: [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
[file_read_actor+0/160] [generic_file_read+178/368]
[file_read_actor+0/160] [sys_read+163/320] [system_call+51/56]
grep D DFD7E120 0 3243 1470 3244 (NOTLB)
Call Trace: [__wait_on_buffer+93/144] [bread+123/144]
[ext3_get_branch+106/240] [ext3_get_block_handle+120/688]
[create_buffers+107/224] [ext3_get_block+74/144]
[block_read_full_page+541/624] [__alloc_pages+75/400]
[page_cache_read+173/208] [ext3_get_block+0/144]
[read_cluster_nonblocking+57/80] [filemap_nopage+285/560]
[do_no_page+137/480] [do_page_fault+376/1246] [handle_mm_fault+119/256]
[do_page_fault+376/1246] [rb_insert_color+210/240]
[do_page_fault+0/1246] [error_code+52/60] [clear_user+51/80]
[do_page_fault+0/1246] [error_code+52/60] [clear_user+51/80]
[padzero+40/48] [load_elf_binary+1179/2848] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
grep D C02508D4 0 3244 1470 3245 3243 (NOTLB)
Call Trace: [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480] [do_page_fault+376/1246]
[handle_mm_fault+119/256] [do_page_fault+376/1246]
[rb_insert_color+210/240] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [padzero+40/48] [load_elf_binary+1179/2848]
[__lock_page+175/192] [file_read_actor+0/160] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
grep D C02508D4 0 3245 1470 3244 (NOTLB)
Call Trace: [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480] [do_page_fault+376/1246]
[handle_mm_fault+119/256] [do_page_fault+376/1246]
[rb_insert_color+210/240] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [padzero+40/48] [load_elf_binary+1179/2848]
[__lock_page+175/192] [file_read_actor+0/160] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
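
To see at a glance which tasks in a dump like this are waiting on a page lock, the listing can be split into tasks and call-trace frames. This is a rough sketch, assuming the layout shown above: a task line with name, state, PC, free stack, PID and further numbers ending in (L-TLB) or (NOTLB), followed by bracketed function+offset/size entries. Reading the second number as the PID is an assumption based on that layout, and the sample is abridged from the traces above.

```python
import re

# Task line as in the listing above, e.g.
# "mc D C016B338 0 2177 2152 2179 (NOTLB)".
# Assumption: the second number after the PC value is the PID.
TASK_RE = re.compile(
    r"^(\S+)\s+([A-Z])\s+[0-9A-Fa-f]{8}\s+\d+\s+(\d+)\s.*\((?:L-TLB|NOTLB)\)\s*$")
# One call-trace entry, e.g. "[__lock_page+149/192]".
FRAME_RE = re.compile(r"\[(\w+)\+\d+/\d+\]")


def parse_sysrq_t(text):
    """Return a list of {'comm', 'state', 'pid', 'frames'} dicts."""
    tasks = []
    for line in text.splitlines():
        match = TASK_RE.match(line)
        if match:
            tasks.append({"comm": match.group(1), "state": match.group(2),
                          "pid": int(match.group(3)), "frames": []})
        elif tasks:
            # Trace lines belong to the most recent task line.
            tasks[-1]["frames"].extend(FRAME_RE.findall(line))
    return tasks


def waiting_in(tasks, function):
    """Return (comm, pid) for every task whose trace mentions `function`."""
    return [(t["comm"], t["pid"]) for t in tasks if function in t["frames"]]


# Abridged from the traces above.
SAMPLE = """\
kjournald D C15C7240 4 122 1 123 120 (L-TLB)
Call Trace: [__get_request_wait+197/208] [__make_request+392/1472]
[generic_make_request+226/304] [submit_bh+80/112] [ll_rw_block+263/432]
mc D C016B338 0 2177 2152 2179 (NOTLB)
Call Trace: [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
grep D C02508D4 0 3244 1470 3245 3243 (NOTLB)
Call Trace: [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480]
"""

if __name__ == "__main__":
    tasks = parse_sysrq_t(SAMPLE)
    print("in __lock_page:", waiting_in(tasks, "__lock_page"))
    print("in __get_request_wait:", waiting_in(tasks, "__get_request_wait"))
```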


Regards,
Carl-Daniel

2003-05-27 14:15:44

by William Lee Irwin III

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tue, May 27, 2003 at 04:14:51PM +0200, Carl-Daniel Hailfinger wrote:
> Christian,
> this looks suspiciously like the problem you are experiencing since
> 2.4.19-pre. Maybe we can fix this for good.

The most I know of this is that someone made it go away by backing out
some ll_rw_blk.c cset.


-- wli

2003-05-27 15:16:18

by manish

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Marcelo Tosatti wrote:

>Ok, and does it happen with the stock kernel?
>
Yes, with the stock kernel too but after long hrs of runtime ..

Thanks
-Manish



2003-05-27 16:47:51

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Tue, 27 May 2003, manish wrote:

> >Ok, and does it happen with the stock kernel?
> Yes, with the stock kernel too but after long hrs of runtime ..

Could you try Alt+SysRq+T and send us the output on the locked STOCK
kernel please?

2003-05-27 17:15:47

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:

> Christian,
>
> this looks suspiciously like the problem you are experiencing since
> 2.4.19-pre. Maybe we can fix this for good.
>
> Following is SysRq-T output for stuck processes during such a pause from
> Christian Klose. Only processes in D state are listed for brevity.
> Especially the last two call traces are interesting.

A "pause" is perfectly fine (to some extent, of course), now a hang is
not. Is this backtrace from a hanged, unusable kernel or ?

2003-05-27 17:25:51

by Carl-Daniel Hailfinger

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Marcelo Tosatti wrote:
>
> On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
>
>>Marcelo Tosatti wrote:
>>
>>>On Mon, 26 May 2003, manish wrote:
>>>>All the bonnie processes and any other processes (like df, ps -ef etc.) are
>>>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>
>>Following is SysRq-T output for stuck processes during such a pause from
>>Christian Klose. Only processes in D state are listed for brevity.
>>Especially the last two call traces are interesting.
>
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?

AFAIK, the kernel is not unusable, but a 20 second pause with no disk
access at all is not nice either.


Regards,
Carl-Daniel

2003-05-27 17:24:16

by William Lee Irwin III

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
>> Following is SysRq-T output for stuck processes during such a pause from
>> Christian Klose. Only processes in D state are listed for brevity.
>> Especially the last two call traces are interesting.

On Tue, May 27, 2003 at 02:27:00PM -0300, Marcelo Tosatti wrote:
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?

This sounds like deadlocked processes, but not a whole system hang.
Sounds like a correctness issue, not a performance issue.


-- wli

2003-05-27 17:24:55

by Marc-Christian Petersen

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:

Hi Marcelo,

> > Following is SysRq-T output for stuck processes during such a pause from
> > Christian Klose. Only processes in D state are listed for brevity.
> > Especially the last two call traces are interesting.
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?
A pause is _not_ perfectly fine, even not to some extent. That pause we are
discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
Mouse stops, keyboard stops, everything stops, who knows wtf.

That behaviour is absolutely bullshit for desktop users. For serverusage you
may not notice it in this dimension (mostly no X so no mouse), but also for a
server environment this may be very bad.

ciao, Marc

2003-05-27 17:36:11

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...


On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > Following is SysRq-T output for stuck processes during such a pause from
> > > Christian Klose. Only processes in D state are listed for brevity.
> > > Especially the last two call traces are interesting.
> > A "pause" is perfectly fine (to some extent, of course), now a hang is
> > not. Is this backtrace from a hanged, unusable kernel or ?
> A pause is _not_ perfectly fine, even not to some extent. That pause we are
> discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
> Mouse stops, keyboard stops, everything stops, who knows wtf.

Do you also notice them?


> That behaviour is absolutely bullshit for desktop users. For serverusage you
> may not notice it in this dimension (mostly no X so no mouse), but also for a
> server environment this may be very bad.

Agreed.

2003-05-27 17:37:52

by manish

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Carl-Daniel Hailfinger wrote:

>Marcelo Tosatti wrote:
>
>>On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
>>
>>>Marcelo Tosatti wrote:
>>>
>>>>On Mon, 26 May 2003, manish wrote:
>>>>
>>>>>All the bonnie processes and any other processes (like df, ps -ef etc.) are
>>>>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>>>>
>>>Following is SysRq-T output for stuck processes during such a pause from
>>>Christian Klose. Only processes in D state are listed for brevity.
>>>Especially the last two call traces are interesting.
>>>
>>A "pause" is perfectly fine (to some extent, of course), now a hang is
>>not. Is this backtrace from a hanged, unusable kernel or ?
>>
>
>AFAIK, the kernel is not unusable, but a 20 second pause with no disk
>access at all is not nice either.
>
>
>Regards,
>Carl-Daniel
>
Hello !

It is not a system hang but the processes hang showing the same stack
trace. This is certainly not a pause since the bonnie processes that
were hung (or deadlocked) never completed after several hrs. The stack
trace was the same.

Thanks
Manish




2003-05-27 17:42:21

by Marc-Christian Petersen

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:

Hi Marcelo,

> > A pause is _not_ perfectly fine, even not to some extent. That pause we
> > are discussing about is a pause of the _whole_ machine, not just disk i/o
> > pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
> Do you also notice them?
I do, people I know do also, numbers of those people only _I_ know are about
~30. I've reported this problem over a year ago while 2.4.19-pre time.

> > That behaviour is absolutely bullshit for desktop users. For serverusage
> > you may not notice it in this dimension (mostly no X so no mouse), but
> > also for a server environment this may be very bad.
> Agreed.
thanks =)

ciao, Marc

2003-05-27 17:42:22

by manish

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Marcelo Tosatti wrote:

>On Tue, 27 May 2003, Marc-Christian Petersen wrote:
>
>>On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:
>>
>>Hi Marcelo,
>>
>>>>Following is SysRq-T output for stuck processes during such a pause from
>>>>Christian Klose. Only processes in D state are listed for brevity.
>>>>Especially the last two call traces are interesting.
>>>>
>>>A "pause" is perfectly fine (to some extent, of course), now a hang is
>>>not. Is this backtrace from a hanged, unusable kernel or ?
>>>
>>A pause is _not_ perfectly fine, even not to some extent. That pause we are
>>discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
>>Mouse stops, keyboard stops, everything stops, who knows wtf.
>>
>
>Do you also notice them?
>
>
>>That behaviour is absolutely bullshit for desktop users. For serverusage you
>>may not notice it in this dimension (mostly no X so no mouse), but also for a
>>server environment this may be very bad.
>>
>
>Agreed.
>
Hi Marc,

With respect to the hangs that you noticed, did the processes complete
after a "pause" or did they stay hung (deadlocked)?

Thanks
Manish



2003-05-27 17:49:14

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > A pause is _not_ perfectly fine, even not to some extent. That pause we
> > > are discussing about is a pause of the _whole_ machine, not just disk i/o
> > > pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
> > Do you also notice them?
> I do, people I know do also, numbers of those people only _I_ know are about
> ~30. I've reported this problem over a year ago while 2.4.19-pre time.

Can you please try to reproduce it with -aa?

> > > That behaviour is absolutely bullshit for desktop users. For serverusage
> > > you may not notice it in this dimension (mostly no X so no mouse), but
> > > also for a server environment this may be very bad.
> > Agreed.

2003-05-27 17:56:23

by Marc-Christian Petersen

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tuesday 27 May 2003 19:50, manish wrote:

Hi Manish,

> It is not a system hang but the processes hang showing the same stack
> trace. This is certainly not a pause since the bonnie processes that
> were hung (or deadlocked) never completed after several hrs. The stack
> trace was the same.
then you are hitting a different bug or a bug related to the issues Christian
Klose and me and $tons of others were complaining about.

The bug you are hitting might be the problem with "process stuck in D state"
Andrea Arcangeli fixed, let me guess, over half a year ago or so.

In case you have a good mind to try to address your issue, you might want to
try out the patch you can find here:

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
speak _NOW_ please, doesn't matter who you are!

I've added Andrea into CC.

ciao, Marc


2003-05-27 18:02:24

by Matthias Mueller

Subject: Re: 2.4.20: Process stuck in __lock_page ...

Hi,

On Tue, May 27, 2003 at 02:47:24PM -0300, Marcelo Tosatti wrote:
> On Tue, 27 May 2003, Marc-Christian Petersen wrote:
> > > A "pause" is perfectly fine (to some extent, of course), now a hang is
> > > not. Is this backtrace from a hanged, unusable kernel or ?
> > A pause is _not_ perfectly fine, even not to some extent. That pause we are
> > discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
> > Mouse stops, keyboard stops, everything stops, who knows wtf.
>
> Do you also notice them?

Since 2.4.19 I notice a lot of pauses with interactive work (desktop
usage). If I copy a big file over network or on local disk, some of my
desktop machines simply don't respond anymore to user requests (e.g. I
start copying a large file over nfs to local disk and start mozilla,
mozilla won't start until the copy is finished).
My current testcase is: dd if=/dev/zero of=blubber bs=4096 count=65000 and
moving the mouse during this operation. With 2.4.18 everything is ok, the
mouse runs smooth the whole time. 2.4.19 and later: I get mouse hangs, it
won't move for a second, sometimes longer. wolk reduces this problem, but
doesn't solve it.
On my servers (mostly IBM xseries 345 and 335) it's ok with a
vanilla-kernel, but there is no interactive work, mostly routing or
network monitoring.
I hope, I can run a vanilla 2.4 kernel again on my machines, at the moment
that isn't possible.
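
The dd testcase above writes 4096 * 65000 = 266,240,000 bytes (about 254 MiB) of zeros. A rough way to put a number on the stalls, instead of watching the mouse, is to run the same kind of write while a probe thread measures how late its own short sleeps wake up. This is only a sketch under that assumption: the probe sees scheduling delays of this one process, which is a proxy for, not the same thing as, a frozen mouse or keyboard. All function names are made up for the example.

```python
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def write_zeros(path, block_size=4096, count=65000):
    """Write `count` zero-filled blocks, roughly like
    dd if=/dev/zero of=path bs=block_size count=count; return bytes written."""
    block = bytes(block_size)
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(block)
    return written


def worst_stall(stop, tick=0.01):
    """Sleep in `tick`-second steps until `stop` is set.

    Returns the largest amount (in seconds) by which a sleep overran."""
    worst = 0.0
    while not stop.is_set():
        start = time.monotonic()
        time.sleep(tick)
        worst = max(worst, time.monotonic() - start - tick)
    return worst


def run(path, block_size=4096, count=65000):
    """Run the write with the probe alongside; return (bytes, worst stall)."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        probe = pool.submit(worst_stall, stop)
        try:
            written = write_zeros(path, block_size, count)
        finally:
            stop.set()  # always let the probe thread finish
        return written, probe.result()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(prefix="blubber")
    os.close(fd)
    try:
        written, stall = run(path)
        print("wrote %d bytes, worst stall %.3f s" % (written, stall))
    finally:
        os.remove(path)
```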

Bye,
Matthias
--
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-27 17:59:18

by Marc-Christian Petersen

Subject: Re: 2.4.20: Process stuck in __lock_page ...

On Tuesday 27 May 2003 19:53, manish wrote:

Hi Manish,

> With respect to the hangs that you noticed, did the processes complete
> after a "pause" or did they stay hung (deadlocked)?
Yes, they complete. No process ever gets deadlocked, nor anything else in
this area. The whole system just does _nothing_ for an amount of time (1-15
seconds, depends). _Sometimes_ (not always) even a ping is stopped for the
amount of time the machine does nothing but pausing.

Also not a hardware problem. I made this clear before reporting this bug.
Tested tons of different hardware, different drivers for the network card
etc.

I repeat this now for the $high_number'th time ;):
- 2.4.18 worked perfect
- 2.4.19-pre not

ciao, Marc


2003-05-27 18:08:01

by Marcelo Tosatti

Subject: Re: 2.4.20: Process stuck in __lock_page ...



On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:53, manish wrote:
>
> Hi Manish,
>
> > With respect to the hangs that you noticed, did the processes complete
> > after a "pause" or did they stay hung (deadlocked)?
> Yes, they complete. No process ever gets deadlocked, nor anything else in
> this area. The whole system just does _nothing_ for an amount of time (1-15
> seconds, depends). _Sometimes_ (not always) even a ping is stopped for the
> amount of time the machine does nothing but pausing.
>
> Also not a hardware problem. I made this clear before reporting this bug.
> Tested tons of different hardware, different drivers for the network card
> etc.
>
> I repeat this now for the $high_number'th time ;):
> - 2.4.18 worked perfect
> - 2.4.19-pre not

That's very useful information. Can you track down which -pre introduced
the hangs?

Thanks!

2003-05-27 17:59:18

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, Marc-Christian Petersen wrote:
>
>>On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:
>>
>>Hi Marcelo,
>>
>>>>A pause is _not_ perfectly fine, even not to some extent. That pause we
>>>>are discussing about is a pause of the _whole_ machine, not just disk i/o
>>>>pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
>>>>
>>>Do you also notice them?
>>>
>>I do, people I know do also, numbers of those people only _I_ know are about
>>~30. I've reported this problem over a year ago while 2.4.19-pre time.
>>
>
>Can you please try to reproduce it with -aa?
>
>>>>That behaviour is absolutely bullshit for desktop users. For serverusage
>>>>you may not notice it in this dimension (mostly no X so no mouse), but
>>>>also for a server environment this may be very bad.
>>>>
>>>Agreed.
>>>
Hello !

After several tests, I have noticed that I can produce this problem
easily when my bdflush settings are:

30 50 32 100 50 300 60 0 0

and it occurs much less frequently when my settings are:

2 50 32 100 50 300 1 0 0
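
For readers comparing the two settings, a small sketch that pairs up the fields may help. The field names are an assumption, not something stated in this mail: on 2.4-era kernels the first value is usually described as nfract (percentage of dirty buffers that wakes bdflush) and the seventh as nfract_sync (percentage at which writers start flushing synchronously). All other positions are left unnamed rather than guessed.

```python
# Sketch only: compare the two bdflush settings quoted above.
# ASSUMPTION: position 0 is nfract and position 6 is nfract_sync
# (2.4-era layout); every other position is deliberately left unnamed.
ASSUMED_NAMES = {0: "nfract", 6: "nfract_sync"}


def parse_bdflush(line):
    """Split a /proc/sys/vm/bdflush line into a list of integers."""
    return [int(tok) for tok in line.split()]


def diff_settings(a, b):
    """Return {position: (label, value_a, value_b)} for fields that differ."""
    changed = {}
    for pos, (x, y) in enumerate(zip(a, b)):
        if x != y:
            changed[pos] = (ASSUMED_NAMES.get(pos, "field%d" % pos), x, y)
    return changed


easy_to_hang = parse_bdflush("30 50 32 100 50 300 60 0 0")
rarely_hangs = parse_bdflush("2 50 32 100 50 300 1 0 0")

for pos, (name, x, y) in sorted(diff_settings(easy_to_hang, rarely_hangs).items()):
    print("%-12s %3d -> %3d" % (name, x, y))
```

Only the first and seventh values differ between the two runs. If the assumed field meanings hold, the setting that hangs less often is the one that starts writing out dirty buffers much earlier.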


Right now, I noticed the following stack trace for one such stuck process:

sys_read
generic_file_read
do_generic_file_read
page_cache_read
__alloc_pages
balance_classzone
try_to_free_pages
shrink_caches
shrink_cache
try_to_release_page
try_to_free_buffer
sync_page_buffers
wait_on_buffer
__wait_on_buffer
schedule

Thanks
-Manish

2003-05-27 18:06:13

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:

Hi Marcelo,

> > I do, people I know do also, numbers of those people only _I_ know are
> > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > time.
> Can you please try to reproduce it with -aa?
not again ;)

I've tried almost all known kernel tree's around, every kernel has the same
effect. I even tried SuSE and Redhat Kernels.

I've 'wasted' tons of time just to find a solution for it.

Andrea introduced, to address _exactly_ this problem (pauses, stops, mouse is
dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,
and the "pauses/stops" are still there. Much less but not gone.

The _only_ workaround yet (known to the public) is to change nr_requests in
drivers/block/ll_rw_blk.c from 128 to 4 which gives a performance hit of
about 40% (not acceptable in any way).

.oO( I am quite sure I've mailed you all this stuff privately in response to
your private mail to me ;) )Oo.

ciao, Marc

2003-05-27 18:13:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > I do, people I know do also, numbers of those people only _I_ know are
> > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > time.
> > Can you please try to reproduce it with -aa?
> not again ;)
>
> I've tried almost all known kernel tree's around, every kernel has the same
> effect. I even tried SuSE and Redhat Kernels.
>
> I've 'wasted' tons of time just find a solution for it.
>
> Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is
> dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,

not exactly decreases I/O throughput, the latest I/O benchmarks I've seen
from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
included the lowlatency elevator patch. So it may not help latency but
it doesn't hurt in the numbers, at least not in the high end (that in
theory is the one that needs the overkill length in the I/O queue most).

However it definitely helps latency for me and I had a number of
positive reports.

Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
may affect the latency so you can try with plain ext2 to be sure it's
not a fs issue.

the lowlatency elevator patch may not be perfect but it definitely seems
to work better here. especially since there's no apparent throughput
loss, it makes lots of sense to keep it applied, or it would waste lots
of ram for apparently no gain.

> and the "pauses/stops" are still there. Much less but not gone.
>
> The _only_ workaround yet (known to the public) is to change nr_requests in
> drivers/block/ll_rw_blk.c from 128 to 4 which gives a performance hit of
> about 40% (not acceptable in any way).
>
> .oO( I am quite sure I've mailed you all this stuff privately in response to
> your private mail to me ;) )Oo.
>
> ciao, Marc
>


Andrea

2003-05-27 18:15:27

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 20:16, Marcelo Tosatti wrote:

Hi Marcelo,

> > I repeat this now for the $high_number'th time ;):
> > - 2.4.18 worked perfect
> > - 2.4.19-pre not
> Thats very useful information. Can you track down which -pre introduced
> the hangs?
If I am not on drugs and my last test was not under drugs, the causing patch
is this one:

http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/[email protected]?nav=index.html|ChangeSet@-2y|[email protected]|hist/drivers/block/ll_rw_blk.c

ciao, Marc

2003-05-27 18:25:26

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 20:25, Andrea Arcangeli wrote:

Hi Andrea,

> not exactly decreases I/O throughput, the latest I/O benchmarks I seen
it decreases performance. I've seen this, Con also saw this (well it's better
than the 'nr_requests = 4' change ;) but mouse stops are still there.

> from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> included the lowlatency elevator patch. So it may not help latency but
> it doesn't hurt in the numbers, at least not in the high end (that in
> theory is the one that needs the overkill length in the I/O queue most).
I agree with the last sentence, in theory, but practice showed something
different (about 10% to 15% performance decrease)

But I am quite sure that this depends on your machine/hardware. Using IDE
instead of SCSI for example.

> However it definitely helps latency for me and I had a number of
> positive reports.
It helps but it's not as good as 2.4.18 stock.

> Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
I also tried that.

> may affect the latency so you can try with plain ext2 to be sure it's
> not a fs issue.
Sure, I did this too. FS independent, though ReiserFS is still the best for
this scenario, with fewer pauses than any other FS (ext2, ext3, ...)

But for desktop usage: not acceptable! No way, No go!

> the lowlatency elevator patch may not be perfect but it definitely seems
> to work better here. especially since there's no apparent throughput
> loss, it makes lots of sense to keep it applied, or it would waste lots
> of ram for apparently no gain.
hehe, well wasting RAM for no gain is my next part on my todo ;) (cache
everything even if there is no RAM for example, well but this is not the
point in this thread)

ciao, Marc

2003-05-27 18:22:08

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, 27 May 2003, Andrea Arcangeli wrote:

> On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> > On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
> >
> > Hi Marcelo,
> >
> > > > I do, people I know do also, numbers of those people only _I_ know are
> > > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > > time.
> > > Can you please try to reproduce it with -aa?
> > not again ;)
> >
> > I've tried almost all known kernel tree's around, every kernel has the same
> > effect. I even tried SuSE and Redhat Kernels.
> >
> > I've 'wasted' tons of time just find a solution for it.
> >
> > Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is
> > dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,
>
> not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> included the lowlatency elevator patch. So it may not help latency but
> it doesn't hurt in the numbers, at least not in the high end (that in
> theory is the one that needs the overkill length in the I/O queue most).
>
> However it definitely helps latency for me and I had a number of
> positive reports.
>
> Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> may affect the latency so you can try with plain ext2 to be sure it's
> not a fs issue.
>
> the lowlatency elevator patch may not be perfect but it definitely seems
> to work better here. especially since there's no apparent throughput
> loss, it makes lots of sense to keep it applied, or it would waste lots
> of ram for apparently no gain.

Andrea,

It seems your "fix-pausing" patch is fixing a potential wakeup
miss, right? (I looked quickly through it). Could you explain to me the
problem it's trying to fix and how?

It's too late to fix that in 2.4.21 (rc5 is going out in hours).

2003-05-27 18:27:15

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:

Hi Marcelo,

> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the
> problem its trying to fix and how?
Please have also a look here:

http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html

ciao, Marc

2003-05-27 18:47:58

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marc-Christian Petersen wrote:

>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>
>Hi Marcelo,
>
>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>miss, right? (I looked quickly throught it). Could you explain me the
>>problem its trying to fix and how?
>>
>Please have also a look here:
>
>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>
>ciao, Marc
>
Hello !

I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time,
the stack trace is:

sys_write
generic_file_write
ext2_get_group_desc
bread
__wait_on_buffer
schedule




2003-05-27 18:50:38

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...



On Tue, 27 May 2003, manish wrote:

> Marc-Christian Petersen wrote:
>
> >On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >
> >Hi Marcelo,
> >
> >>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>miss, right? (I looked quickly throught it). Could you explain me the
> >>problem its trying to fix and how?
> >>
> >Please have also a look here:
> >
> >http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >
> >ciao, Marc
> >
> Hello !
>
> I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> the stack trace:
>
> sys_write
> generic_file_write
> ext2_get_group_desc
> bread
> __wait_on_buffer
> schedule

Huh? You mean bonnie still deadlocks or ?

2003-05-27 18:57:14

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>
>>>Hi Marcelo,
>>>
>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>problem its trying to fix and how?
>>>>
>>>Please have also a look here:
>>>
>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>
>>>ciao, Marc
>>>
>>Hello !
>>
>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>the stack trace:
>>
>>sys_write
>>generic_file_write
>>ext2_get_group_desc
>>bread
>>__wait_on_buffer
>>schedule
>>
>
>Huh? You mean bonnie still deadlocks or ?
>
Well, this is on the kernel that has the io_request_lock removed. The
stock kernel (with the fix-pausing-2 patch) has been running fine up to now.
However, we will have to give it a few hours of runtime.

Thanks
Manish



2003-05-27 18:59:49

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>
>>>Hi Marcelo,
>>>
>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>problem its trying to fix and how?
>>>>
>>>Please have also a look here:
>>>
>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>
>>>ciao, Marc
>>>
>>Hello !
>>
>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>the stack trace:
>>
>>sys_write
>>generic_file_write
>>ext2_get_group_desc
>>bread
>>__wait_on_buffer
>>schedule
>>
>
>Huh? You mean bonnie still deadlocks or ?
>
At the time the processes get stuck:


[root@dyn-10-123-130-235 vm]# more /proc/meminfo
           total:      used:      free: shared: buffers:    cached:
Mem:   3709870080 3699126272   10743808       0 18313216 3531255808
Swap:  1077501952          0 1077501952
MemTotal:      3622920 kB
MemFree:         10492 kB
MemShared:           0 kB
Buffers:         17884 kB
Cached:        3448492 kB
SwapCached:          0 kB
Active:          25252 kB
Inactive:      3445344 kB
HighTotal:     2752512 kB
HighFree:         2120 kB
LowTotal:       870408 kB
LowFree:          8372 kB
SwapTotal:     1052248 kB
SwapFree:      1052248 kB
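
To put the snapshot above in proportion, here is a minimal sketch that parses the "Name: value kB" lines and prints a few ratios. The helper names are made up for illustration; the numbers are copied from the output above.

```python
# Sketch only: summarise the /proc/meminfo snapshot quoted above.
SNAPSHOT = """\
MemTotal: 3622920 kB
MemFree: 10492 kB
Cached: 3448492 kB
HighTotal: 2752512 kB
HighFree: 2120 kB
LowTotal: 870408 kB
LowFree: 8372 kB
"""


def parse_meminfo(text):
    """Return {field: kilobytes} for lines of the form 'Name: value kB'."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) == 2 and fields[1] == "kB":
            values[name.strip()] = int(fields[0])
    return values


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


info = parse_meminfo(SNAPSHOT)
print("cached   : %.1f%% of MemTotal" % percent(info["Cached"], info["MemTotal"]))
print("low free : %.2f%% of LowTotal" % percent(info["LowFree"], info["LowTotal"]))
print("high free: %.2f%% of HighTotal" % percent(info["HighFree"], info["HighTotal"]))
```

Nearly all of RAM is page cache and very little of either zone is free, which would fit the earlier trace where a reader ended up in try_to_free_pages, though the snapshot alone does not prove that.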




2003-05-27 19:17:33

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...



On Tue, 27 May 2003, manish wrote:

> Marcelo Tosatti wrote:
>
> >
> >On Tue, 27 May 2003, manish wrote:
> >
> >>Marc-Christian Petersen wrote:
> >>
> >>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >>>
> >>>Hi Marcelo,
> >>>
> >>>>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>>>miss, right? (I looked quickly throught it). Could you explain me the
> >>>>problem its trying to fix and how?
> >>>>
> >>>Please have also a look here:
> >>>
> >>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >>>
> >>>ciao, Marc
> >>>
> >>Hello !
> >>
> >>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> >>the stack trace:
> >>
> >>sys_write
> >>generic_file_write
> >>ext2_get_group_desc
> >>bread
> >>__wait_on_buffer
> >>schedule
> >>
> >
> >Huh? You mean bonnie still deadlocks or ?
> >
> At the time the processes get stuck:
>
>
> [root@dyn-10-123-130-235 vm]# more /proc/meminfo
> total: used: free: shared: buffers: cached:
> Mem: 3709870080 3699126272 10743808 0 18313216 3531255808
> Swap: 1077501952 0 1077501952
> MemTotal: 3622920 kB
> MemFree: 10492 kB
> MemShared: 0 kB
> Buffers: 17884 kB
> Cached: 3448492 kB
> SwapCached: 0 kB
> Active: 25252 kB
> Inactive: 3445344 kB
> HighTotal: 2752512 kB
> HighFree: 2120 kB
> LowTotal: 870408 kB
> LowFree: 8372 kB
> SwapTotal: 1052248 kB
> SwapFree: 1052248 kB
>

Ok, so just to confirm: You're still getting pauses with Andrea's patches
but no hangs anymore?

Correct?

2003-05-27 19:22:02

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marcelo Tosatti wrote:
>>
>>>On Tue, 27 May 2003, manish wrote:
>>>
>>>>Marc-Christian Petersen wrote:
>>>>
>>>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>>>
>>>>>Hi Marcelo,
>>>>>
>>>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>>>problem its trying to fix and how?
>>>>>>
>>>>>Please have also a look here:
>>>>>
>>>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>>>
>>>>>ciao, Marc
>>>>>
>>>>Hello !
>>>>
>>>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>>>the stack trace:
>>>>
>>>>sys_write
>>>>generic_file_write
>>>>ext2_get_group_desc
>>>>bread
>>>>__wait_on_buffer
>>>>schedule
>>>>
>>>Huh? You mean bonnie still deadlocks or ?
>>>
>>At the time the processes get stuck:
>>
>>
>>[root@dyn-10-123-130-235 vm]# more /proc/meminfo
>> total: used: free: shared: buffers: cached:
>>Mem: 3709870080 3699126272 10743808 0 18313216 3531255808
>>Swap: 1077501952 0 1077501952
>>MemTotal: 3622920 kB
>>MemFree: 10492 kB
>>MemShared: 0 kB
>>Buffers: 17884 kB
>>Cached: 3448492 kB
>>SwapCached: 0 kB
>>Active: 25252 kB
>>Inactive: 3445344 kB
>>HighTotal: 2752512 kB
>>HighFree: 2120 kB
>>LowTotal: 870408 kB
>>LowFree: 8372 kB
>>SwapTotal: 1052248 kB
>>SwapFree: 1052248 kB
>>
>
>Ok, so just to confirm: You're still getting pauses with Andrea's patches
>but no hangs anymore?
>
>Correct?
>
Hi Marcelo,

I have applied Andrea's patch to two kernels:

1. Stock 2.4.20
2. 2.4.20 with the io_request_lock removed.

The tests on the first one are still going. The tests on the second one
showed processes getting stuck for long times (> 5 minutes) and not
paused ...

Thanks
Manish



2003-05-27 19:51:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 03:33:14PM -0300, Marcelo Tosatti wrote:
>
> On Tue, 27 May 2003, Andrea Arcangeli wrote:
>
> > On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> > > On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
> > >
> > > Hi Marcelo,
> > >
> > > > > I do, people I know do also, numbers of those people only _I_ know are
> > > > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > > > time.
> > > > Can you please try to reproduce it with -aa?
> > > not again ;)
> > >
> > > I've tried almost all known kernel tree's around, every kernel has the same
> > > effect. I even tried SuSE and Redhat Kernels.
> > >
> > > I've 'wasted' tons of time just find a solution for it.
> > >
> > > Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is
> > > dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,
> >
> > not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> > from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> > included the lowlatency elevator patch. So it may not help latency but
> > it doesn't hurt in the numbers, at least not in the high end (that in
> > theory is the one that needs the overkill length in the I/O queue most).
> >
> > However it definitely helps latency for me and I had a number of
> > positive reports.
> >
> > Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> > may affect the latency so you can try with plain ext2 to be sure it's
> > not a fs issue.
> >
> > the lowlatency elevator patch may not be perfect but it definitely seems
> > to work better here. especially since there's no apparent throughput
> > loss, it makes lots of sense to keep it applied, or it would waste lots
> > of ram for apparently no gain.
>
> Andrea,
>
> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the

yes, not just one but multiple of them, all similar. lots of boxes were
hanging in a weird manner until I found and fixed this glitch.

> problem its trying to fix and how?

I'm attaching the old email, it should have all the explanations.

but don't use that old patch (that was the first revision and it missed
one last race in wait_for_request noticed by Chris or Andrew [or
both?]), use this one instead (seems just the second revision, should be
that one plus that last race fix):

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

thanks,

>
> Its too late to fix that in 2.4.21 (rc5 is going out in hours).


Andrea


Attachments:
(No filename) (2.60 kB)
(No filename) (18.42 kB)

2003-05-27 19:56:25

by Chris Mason

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, 2003-05-27 at 14:33, Marcelo Tosatti wrote:

> Andrea,
>
> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the
> problem its trying to fix and how?
>
> Its too late to fix that in 2.4.21 (rc5 is going out in hours).

The bug report seems to be on ext2, and on a box with 3.5GB of ram and
4G of dirty data. So, I don't think he is hitting the fix-pausing bug,
which needs just the right set of conditions to miss unplugs:

1) bdflush can't be awake, so the percentage of dirty buffers has to be
somewhat low. Otherwise bdflush will trigger unplugs.

2) kupdate needs to be stuck waiting on the super lock, otherwise
kupdate would be triggering unplugs

2a) Some process needs to be calling wait_on_buffer() with the super
lock held. This makes it pretty much impossible to trigger on ext2
without using O_SYNC mode.

3) You've got to race in __wait_on_buffer (cut n' paste from an old mail
from Andrea)

	CPU0                    CPU1
	-----------------       ------------------------
	reiserfs_writepage
	lock_buffer()
	                        fsync_buffers_list() under lock_super()
	                        wait_on_buffer()
	                        run_task_queue(&tq_disk) -> noop
	                        schedule() <- hang with lock_super acquired
	submit_bh()
	/* don't unplug here */

With ext3, you can trigger with two procs, it gets much easier if you
toss a schedule() into submit_bh(), right before generic_make_request.
reiserfs + the data logging patches is easier to trigger and produces
longer pauses.

For ext3:
A: while(1) sync
B: while(1) write(fd, 8k); fsync(fd); ftruncate(fd, 0);

The idea behind proc B is to increase the chances the
sync and the fsync are trying to write and wait on the same buffer.
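
A bounded version of the two loops above might look like the following sketch. The iteration counts, the temporary file and the lseek() back to offset 0 are assumptions added for illustration; the recipe above only gives the write/fsync/ftruncate sequence and runs both loops forever.

```python
# Sketch only: bounded versions of procs A and B described above.
# ASSUMPTIONS: the temp file, the iteration counts and the lseek() back to
# offset 0 are illustrative choices, not part of the original recipe.
import os
import tempfile

CHUNK = b"\0" * 8192  # the 8k write from proc B


def proc_a(iterations):
    """Proc A: call sync() repeatedly."""
    for _ in range(iterations):
        os.sync()


def proc_b(path, iterations):
    """Proc B: write 8k, fsync, truncate to zero; return the final file size."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same blocks each time
            os.write(fd, CHUNK)
            os.fsync(fd)
            os.ftruncate(fd, 0)
    finally:
        os.close(fd)
    return os.path.getsize(path)


def run_once(iterations=3):
    """Run proc B against a scratch file and clean up afterwards."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "victim")
    try:
        return proc_b(path, iterations)
    finally:
        os.remove(path)
        os.rmdir(directory)
```

To approximate the test described here, proc_a() and proc_b() would be run in separate processes with large iteration counts, with proc_b() pointed at a file on the filesystem under test.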

ext3 is hung on a metadata block, while it tries to get write access to
the block before logging it. This ends up calling wait_on_buffer with
the super held while in proc B, while proc A is in sync flushing the
metadata block.

I triggered the hang in ext3 during block allocation, so the ftruncate
makes sure ext3 is constantly allocating blocks (and always dirtying the
same bitmap/direct block).

It isn't a perfect reproduction of the hang, because in ext3 kjournald
wakes up every once in a while (~30 seconds or more) and kicks the
transaction. But, with more procs running, someone could be waiting
with the journal lock held, which would keep kjournald from fixing
things.



2003-05-27 19:59:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Hi,

On Tue, May 27, 2003 at 08:35:33PM +0200, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 20:25, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> it decreases performance. I've seen this, Con also saw this (well it's better
> than the 'nr_requests = 4' change ;) but mouse stops are still there.
>
> > from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> > included the lowlatency elevator patch. So it may not help latency but
> > it doesn't hurt in the numbers, at least not in the high end (that in
> > theory is the one that needs the overkill length in the I/O queue most).
> I agree with the last sentence, in theory, but practice showed something
> different (about 10% to 15% performance decrease)
>
> But I am quite sure that this depends on your machine/hardware. Using IDE
> instead of SCSI for example.

10/15 performance drop doesn't sound good, no matter what hardware ;).

However in contest I recall there was quite an improvement in latency at
least (I mean, it had some positive effect too)

Getting the best throughput and latency at the same time is normally not
possible, however evaluating if it's losing excessive throughput given a
certain latency improvement is difficult.


>
> > However it definitely helps latency for me and I had a number of
> > positive reports.
> It helps but it's not as good as 2.4.18 stock.

I'll try to find what's the precise reason of the interactivity drop
with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
into it once but there was no definitive answer, or anyways going back
to the 2.4.18 code didn't appeal or make much sense.

However I suspect this responsiveness issue could be storage hardware
dependent.

The sentence by Linus in the last few days while talking with Jens,
about storage that reorders stuff and starve requests at the two ends of
the platter was very scary, maybe you're really bitten by something like
that. Linux does the right thing but your hardware keeps posting stuff
under the os and mine doesn't.


>
> > Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> I also tried that.
>
> > may affect the latency so you can try with plain ext2 to be sure it's
> > not a fs issue.
> Sure, I did this too. FS independent, where ReiserFS is still the best for
> this scenario with the most few pauses than any other FS (ext2, ext3, ...)
>
> But for desktop usage: not acceptable! No way, No go!
>
> > the lowlatency elevator patch may not be perfect but it definitely seems
> > to work better here. especially since there's no apparent throughput
> > loss, it makes lots of sense to keep it applied, or it would waste lots
> > of ram for apparently no gain.
> hehe, well wasting RAM for no gain is my next part on my todo ;) (cache
> everything even if there is no RAM for example, well but this is not the
> point in this thread)
>
> ciao, Marc
>


Andrea

2003-05-27 19:57:48

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...



On Tue, 27 May 2003, Andrea Arcangeli wrote:

> > It seems your "fix-pausing" patch is fixing a potential wakeup
> > miss, right? (I looked quickly throught it). Could you explain me the
>
> yes, not just one but multiple of them, all similar. lots of boxes were
> hanging in a weird manner until I found and fixed this glitch.
>
> > problem its trying to fix and how?
>
> I'm attaching the old email, it should have all the explanataions.
>
> but don't use that old patch (that was the first revision and it missed
> one last race in wait_for_request noticed by Chris or Andrew [or
> both?]), use this one instead (seems just the second revision, should be
> that one plus that last race fix):
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

I wonder if the additional wakeups result in performance degradation (not
that it matters much in case there is no other way to fix the problem).

But anyway I would like to have some numbers with/without the patch.

Do you have them ?

2003-05-27 20:07:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 12:34:38PM -0700, manish wrote:
> Marcelo Tosatti wrote:
>
> >
> >On Tue, 27 May 2003, manish wrote:
> >
> >>Marcelo Tosatti wrote:
> >>
> >>>On Tue, 27 May 2003, manish wrote:
> >>>
> >>>>Marc-Christian Petersen wrote:
> >>>>
> >>>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >>>>>
> >>>>>Hi Marcelo,
> >>>>>
> >>>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>>>>>miss, right? (I looked quickly throught it). Could you explain me the
> >>>>>>problem its trying to fix and how?
> >>>>>>
> >>>>>Please have also a look here:
> >>>>>
> >>>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >>>>>
> >>>>>ciao, Marc
> >>>>>
> >>>>Hello !
> >>>>
> >>>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> >>>>the stack trace:
> >>>>
> >>>>sys_write
> >>>>generic_file_write
> >>>>ext2_get_group_desc
> >>>>bread
> >>>>__wait_on_buffer
> >>>>schedule
> >>>>
> >>>Huh? You mean bonnie still deadlocks or ?
> >>>
> >>At the time the processes get stuck:
> >>
> >>
> >>[root@dyn-10-123-130-235 vm]# more /proc/meminfo
> >> total: used: free: shared: buffers: cached:
> >>Mem: 3709870080 3699126272 10743808 0 18313216 3531255808
> >>Swap: 1077501952 0 1077501952
> >>MemTotal: 3622920 kB
> >>MemFree: 10492 kB
> >>MemShared: 0 kB
> >>Buffers: 17884 kB
> >>Cached: 3448492 kB
> >>SwapCached: 0 kB
> >>Active: 25252 kB
> >>Inactive: 3445344 kB
> >>HighTotal: 2752512 kB
> >>HighFree: 2120 kB
> >>LowTotal: 870408 kB
> >>LowFree: 8372 kB
> >>SwapTotal: 1052248 kB
> >>SwapFree: 1052248 kB
> >>
> >
> >Ok, so just to confirm: You're still getting pauses with Andrea's patches
> >but no hangs anymore?
> >
> >Correct?
> >
> Hi Marcelo,
>
> I have applied Andrea's patch to two kernels:
>
> 1. Stock 2.4.20
> 2. 2.4.20 with the io_request_lock removed.
>
> The tests on the first one are still going. The tests on the second one
> showed processes getting stuck for long times (> 5 minutes) and not
> paused ...

sorry if it's a dumb question but what is the "io_request_lock removed"
thing? Hope you didn't delete any io_request_lock, if you did you can
get worse things than crashes (i.e. mm/fs corruption). the pausing bug
was a genuine race (quite innocent, if you could trigger a disk unplug
you could recover from it)

Andrea

2003-05-27 20:12:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 05:08:38PM -0300, Marcelo Tosatti wrote:
>
>
> On Tue, 27 May 2003, Andrea Arcangeli wrote:
>
> > > It seems your "fix-pausing" patch is fixing a potential wakeup
> > > miss, right? (I looked quickly throught it). Could you explain me the
> >
> > yes, not just one but multiple of them, all similar. lots of boxes were
> > hanging in a weird manner until I found and fixed this glitch.
> >
> > > problem its trying to fix and how?
> >
> > I'm attaching the old email, it should have all the explanataions.
> >
> > but don't use that old patch (that was the first revision and it missed
> > one last race in wait_for_request noticed by Chris or Andrew [or
> > both?]), use this one instead (seems just the second revision, should be
> > that one plus that last race fix):
> >
> > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
>
> I wonder if the additional wakeups result in performance degradation (not
> that it matters much in case there is no other way to fix the problem).

in theory yes.

>
> But anyway I would like to have some numbers with/without the patch.
>
> Do you have them ?

Hmm, in bigbox.html we should find the difference of the timings
before/after, and I recall it wasn't measurable. I can search for it on
Thu if you want the exact numbers.

However the last numbers from Randy showed my tree going faster than 2.5
with bonnie and tiotest so I think we don't need to worry and I would
probably not fix it in a different way in 2.4 even if it would mean a 1%
degradation. When it was shipped there was no time to measure any
degradation but the problem it fixes is so severe that we never had any
doubt if to include it or not ;).

Andrea

2003-05-27 20:15:07

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 22:10, Andrea Arcangeli wrote:

Hi Andrea,

> 10/15 performance drop doesn't sound good, no matter what hardware ;).
lol, well. YES ;)

> However in contest I recall there was quite an improvement in latency at
> least (I mean, it had some positive effect too)
Yeah, but latency != throughput ;)

> Getting the best throughput and latency at the same time is normally not
> possible, however evaluating if it's losing excessive throughput given a
> certain latency improvement is difficult.
It is possible. I use 2.5 (preferably -mm tree) now more than any 2.4*.
I use the AS (Anticipatory IO Scheduler) which AKPM included in his tree.
This scheduler is kicking ass. Everything is rock fast, I can trash my HD to
whatever I want, I still get no mouse stops, keyboard stops or anything like
that. Even starting up multiple programs is possible while trashing the HD.
Sure, it takes longer but it works :)

I try to backport BIO and then AS for quite over 2 weeks now, but it seems, at
least for me, that it's an impossible mission ;(


> I'll try to find what's the precise reason of the interactivity drop
cool. Thanks.

> with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
> into it once but there was no definitive answer, or anyways going back
> to the 2.4.18 code didn't appeal or make much sense.
Yeah, that's not an option. The throughput has been increased in 2.4.19
compared to 2.4.18.

> However I suspect this responsiveness issue could be storage hardware
> dependent.
Hmm, I am quite sure that it isn't. I have tons of mostly totally different
hardware in my company, also test machines for WOLK at freenet.de (the
biggest I had was a QUAD Xeon 1GHz with 16GB memory and hardware RAID, a Compaq
ML570 to be exact, f*cking nice machine btw. ;) and I even hit it on that
machine. Friends of mine with different hardware than mine are also hitting
that bug. _If_ it's a case of storage hardware, then a lot of storage hardware
is affected ;)

> The sentence by Linus in the last few days while talking with Jens,
> about storage that reorders stuff and starve requests at the two ends of
> the platter was very scary, maybe you're really bitten by something like
> that. Linux does the right thing but your hardware keeps posting stuff
> under the os and mine doesn't.
Oh, did I miss something at lkml or was it privately?

ciao, Marc

2003-05-27 20:15:01

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:

Hi Andrea,


> > 1. Stock 2.4.20
> > 2. 2.4.20 with the io_request_lock removed.
> > The tests on the first one are still going. The tests on the second one
> > showed processes getting stuck for long times (> 5 minutes) and not
> > paused ...
> sorry if it's a dumb question but what is the "io_request_lock removed"
> thing? Hope you didn't delete any io_request_lock, if you did you can
> get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> was a genuine race (quite innocent, if you could trigger a disk unplug
> you could recover from it)
>
> Andrea
funny. I asked him the same ;)

see his response:

-----------------------------------------------------------------------
>what is this io_request_lock patch you are talking about?
>
>ciao, Marc
>
We made some changes to the 2.4.20 kernel to remove the io_request_lock
and replace with queue_lock and host_lock.
-----------------------------------------------------------------------

ciao, Marc

2003-05-27 20:30:17

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marc-Christian Petersen wrote:

>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
>
>Hi Andrea,
>
>
>>>1. Stock 2.4.20
>>>2. 2.4.20 with the io_request_lock removed.
>>>The tests on the first one are still going. The tests on the second one
>>>showed processes getting stuck for long times (> 5 minutes) and not
>>>paused ...
>>>
>>sorry if it's a dumb question but what is the "io_request_lock removed"
>>thing? Hope you didn't delete any io_request_lock, if you did you can
>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
>>was a genuine race (quite innocent, if you could trigger a disk unplug
>>you could recover from it)
>>
>>Andrea
>>
>funny. I asked him the same ;)
>
>see his response:
>
>-----------------------------------------------------------------------
>
>>what is this io_request_lock patch you are talking about?
>>
>>ciao, Marc
>>
>We made some changes to the 2.4.20 kernel to remove the io_request_lock
>and replace with queue_lock and host_lock.
>-----------------------------------------------------------------------
>
>ciao, Marc
>
We made a change in the 2.4.20 kernel to remove the io_request_lock and
replace with the host_lock and the queue_lock. Probably, not a right
thing to do

Thanks
Manish



2003-05-27 20:36:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
> Marc-Christian Petersen wrote:
>
> >On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
> >
> >Hi Andrea,
> >
> >
> >>>1. Stock 2.4.20
> >>>2. 2.4.20 with the io_request_lock removed.
> >>>The tests on the first one are still going. The tests on the second one
> >>>showed processes getting stuck for long times (> 5 minutes) and not
> >>>paused ...
> >>>
> >>sorry if it's a dumb question but what is the "io_request_lock removed"
> >>thing? Hope you didn't delete any io_request_lock, if you did you can
> >>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> >>was a genuine race (quite innocent, if you could trigger a disk unplug
> >>you could recover from it)
> >>
> >>Andrea
> >>
> >funny. I asked him the same ;)
> >
> >see his response:
> >
> >-----------------------------------------------------------------------
> >
> >>what is this io_request_lock patch you are talking about?
> >>
> >>ciao, Marc
> >>
> >We made some changes to the 2.4.20 kernel to remove the io_request_lock
> >and replace with queue_lock and host_lock.
> >-----------------------------------------------------------------------
> >
> >ciao, Marc
> >
> We made a change in the 2.4.20 kernel to remove the io_request_lock and
> replace with the host_lock and the queue_lock. Probably, not a right
> thing to do

right you are, but never mind, only remember to e2fsck the fs before
booting the box so you don't risk fs corruption later with the solid
kernels.

Andrea

2003-05-27 20:34:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 10:24:22PM +0200, Marc-Christian Petersen wrote:
> I try to backport BIO and then AS for quite over 2 weeks now, but it seems, at
> least for me, that it's an impossible mission ;(

bio breaks all drivers, not a good idea to backport ;)

note that the anticipatory scheduler generates very bad results with the
winmark. it certainly has merits but it has large downsides too.

I would be also curious if you could compare anticipatory with CFQ. The
CFQ was designed to provide the highest possible degree of fairness.

> > I'll try to find what's the precise reason of the interactivity drop
> cool. Thanks.
>
> > with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
> > into it once but there was no definitive answer, or anyways going back
> > to the 2.4.18 code didn't appeal or make much sense.
> Yeah, that's not an option. The throughput has been increased in 2.4.19
> compared to 2.4.18.

agreed.

>
> > However I suspect this responsiveness issue could be storage hardware
> > dependent.
> Hmm, I am quite sure that it isn't. I have tons of mostly totally different
> hardware in my company, also test machines for WOLK at freenet.de (the
> biggest I had was a QUAD Xeon 1GHz with 16GB memory and hardware RAID, a Compaq
> ML570 to be exact, f*cking nice machine btw. ;) and I even hit it on that
> machine. Friends of mine with different hardware than mine are also hitting
> that bug. _If_ it's a case of storage hardware, then a lot of storage hardware
> is affected ;)

;)

> > The sentence by Linus in the last few days while talking with Jens,
> > about storage that reorders stuff and starve requests at the two ends of
> > the platter was very scary, maybe you're really bitten by something like
> > that. Linux does the right thing but your hardware keeps posting stuff
> > under the os and mine doesn't.
> Oh, did I miss something at lkml or was it privately?

I read it on l-k yesterday a few days ago, search emails from Linus with
Jens somewhere in CC and you should find it.

Andrea

2003-05-27 20:39:27

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Andrea Arcangeli wrote:

>On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
>>>
>>>Hi Andrea,
>>>
>>>
>>>>>1. Stock 2.4.20
>>>>>2. 2.4.20 with the io_request_lock removed.
>>>>>The tests on the first one are still going. The tests on the second one
>>>>>showed processes getting stuck for long times (> 5 minutes) and not
>>>>>paused ...
>>>>>
>>>>sorry if it's a dumb question but what is the "io_request_lock removed"
>>>>thing? Hope you didn't delete any io_request_lock, if you did you can
>>>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
>>>>was a genuine race (quite innocent, if you could trigger a disk unplug
>>>>you could recover from it)
>>>>
>>>>Andrea
>>>>
>>>funny. I asked him the same ;)
>>>
>>>see his response:
>>>
>>>-----------------------------------------------------------------------
>>>
>>>>what is this io_request_lock patch you are talking about?
>>>>
>>>>ciao, Marc
>>>>
>>>We made some changes to the 2.4.20 kernel to remove the io_request_lock
>>>and replace with queue_lock and host_lock.
>>>-----------------------------------------------------------------------
>>>
>>>ciao, Marc
>>>
>>We made a change in the 2.4.20 kernel to remove the io_request_lock and
>>replace with the host_lock and the queue_lock. Probably, not a right
>>thing to do
>>
>
>right you are, but never mind, only remember to e2fsck the fs before
>booting the box so you don't risk fs corruption later with the solid
>kernels.
>
>Andrea
>
So, does it imply that we cannot remove the io_request_lock in 2.4 at all?

Thanks
Manish



2003-05-27 20:41:44

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 22:45, Andrea Arcangeli wrote:

Hi Andrea,

> > I try to backport BIO and then AS for quite over 2 weeks now, but it
> > seems, at least for me, that it's an impossible mission ;(
> bio breaks all drivers, not a good idea to backport ;)
HAHAHAH. Another wasted 2 weeks in my life ;-)

But why does it break all drivers? Could you please elaborate a bit?

> note that the anticipatory scheduler generates very bad results with the
> winmark. it certainly has merits but it has large downsides too.
hmm, I am not aware of it, or even I _was_ not aware of it till now.

> I would be also curious if you could compare anticipatory with CFQ. The
> CFQ was designed to provide the highest possible degree of fairness.
I can bench it, sure. I used CFQ before I switched to AS because I was
curious about AS and as I didn't see a real difference in latency but AS gave
me more throughput, I use AS from now on.

> I read it on l-k yesterday a few days ago, search emails from Linus with
> Jens somewhere in CC and you should find it.
Already found it :) thank you.

ciao, Marc

2003-05-27 20:44:40

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> I try to backport BIO and then AS for quite over 2 weeks now, but it
> seems, at least for me, that it's an impossible mission ;(

You're nuts, that's not only incredibly silly it's not even needed for
what you want.

What you want is the proper io scheduler abstraction interface. With
that in place, you can port the 2.5 io schedulers without too much
trouble. They have very little dependencies on bio itself ('bio' has
become one of the most abused terms in 2.5. I use it only to describe the
io structure).

You basically need to pin down users that directly manipulate the queue
to extract/insert requests. So step one is doing elv_add_request(),
elv_next_request(), and elv_remove_request(). That is a 1:1 mapping to
what 2.4 has right now, so you should be able to accomplish this change
without changing how the code works.

But still, why on earth waste your time with something like this now
when we are so close to 2.6? 2.4 is a stable code base, it should stay
that way. I'm really not interested in more esoteric 2.4 backports, the
vendor kernels are bad enough as it is.

--
Jens Axboe
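
Editorial sketch: the three wrappers Jens names can be modelled in a few lines of userspace C. This is only a toy under stated assumptions, not 2.4 or 2.5 kernel code. The toy_request and toy_queue types, the function signatures, and the toy_drain() helper are all hypothetical; only the names elv_add_request, elv_next_request and elv_remove_request come from the mail. It illustrates the idea that a caller which touches the queue only through these calls does not care how the list behind them is managed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code: types and signatures are hypothetical. */
struct toy_request {
    unsigned long sector;
    struct toy_request *prev, *next;
};

struct toy_queue {
    struct toy_request head;    /* sentinel of a circular doubly linked list */
};

void toy_queue_init(struct toy_queue *q)
{
    q->head.prev = q->head.next = &q->head;
}

/* Append a request at the tail (plain FIFO, no sorting or merging). */
void elv_add_request(struct toy_queue *q, struct toy_request *rq)
{
    rq->prev = q->head.prev;
    rq->next = &q->head;
    q->head.prev->next = rq;
    q->head.prev = rq;
}

/* Peek at the request that should be serviced next, or NULL if empty. */
struct toy_request *elv_next_request(struct toy_queue *q)
{
    return q->head.next == &q->head ? NULL : q->head.next;
}

/* Unlink a request that is currently on the queue. */
void elv_remove_request(struct toy_queue *q, struct toy_request *rq)
{
    (void)q;
    assert(rq->prev != NULL && rq->next != NULL);   /* must be linked */
    rq->prev->next = rq->next;
    rq->next->prev = rq->prev;
    rq->prev = rq->next = NULL;
}

/* A "driver" that never touches the list directly, only the wrappers.
 * Returns how many requests it dispatched. */
int toy_drain(struct toy_queue *q)
{
    struct toy_request *rq;
    int done = 0;

    while ((rq = elv_next_request(q)) != NULL) {
        elv_remove_request(q, rq);
        done++;
    }
    return done;
}
```

With a plain FIFO behind the wrappers, as here, behaviour is the same as manipulating the list directly, which is presumably what the "1:1 mapping" remark refers to. A different scheduler would change only the bodies of the three functions, not toy_drain().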

2003-05-27 20:47:04

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 22:45, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > > I try to backport BIO and then AS for quite over 2 weeks now, but it
> > > seems, at least for me, that it's an impossible mission ;(
> > bio breaks all drivers, not a good idea to backport ;)
> HAHAHAH. Another wasted 2 weeks in my life ;-)
>
> But why does it break all drivers? Could you please elaborate a bit?

Are you serious? Please tell me you haven't spent two weeks on the
project not realising this?

I think the problem here is that you are saying 'bio' when you really
mean something else. bio is the 2.5 io structure. What _exactly_ do you
mean with 'backporting bio'? I don't think you have the slightest idea
of the nastiness involved with doing something like that.

--
Jens Axboe

2003-05-27 20:52:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 01:50:55PM -0700, manish wrote:
> Andrea Arcangeli wrote:
>
> >On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
> >
> >>Marc-Christian Petersen wrote:
> >>
> >>>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
> >>>
> >>>Hi Andrea,
> >>>
> >>>
> >>>>>1. Stock 2.4.20
> >>>>>2. 2.4.20 with the io_request_lock removed.
> >>>>>The tests on the first one are still going. The tests on the second one
> >>>>>showed processes getting stuck for long times (> 5 minutes) and not
> >>>>>paused ...
> >>>>>
> >>>>sorry if it's a dumb question but what is the "io_request_lock removed"
> >>>>thing? Hope you didn't delete any io_request_lock, if you did you can
> >>>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> >>>>was a genuine race (quite innocent, if you could trigger a disk unplug
> >>>>you could recover from it)
> >>>>
> >>>>Andrea
> >>>>
> >>>funny. I asked him the same ;)
> >>>
> >>>see his response:
> >>>
> >>>-----------------------------------------------------------------------
> >>>
> >>>>what is this io_request_lock patch you are talking about?
> >>>>
> >>>>ciao, Marc
> >>>>
> >>>We made some changes to the 2.4.20 kernel to remove the io_request_lock
> >>>and replace with queue_lock and host_lock.
> >>>-----------------------------------------------------------------------
> >>>
> >>>ciao, Marc
> >>>
> >>We made a change in the 2.4.20 kernel to remove the io_request_lock and
> >>replace with the host_lock and the queue_lock. Probably, not a right
> >>thing to do
> >>
> >
> >right you are, but never mind, only remember to e2fsck the fs before
> >booting the box so you don't risk fs corruption later with the solid
> >kernels.
> >
> >Andrea
> >
> So, does it imply that we cannot remove the io_request_lock in 2.4 at all?

io_request_lock can at most be made per-device in 2.4, which is already the
case in my tree, for instance. Locks are there for a reason: unless you
redesign the code to scale better, you can't just drop them and
expect stuff to work. But the io_request_lock has nothing to do with
either the hangs or the delays, it only hurts scalability if you've lots
of devices and lots of cpus.

Andrea
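
Editorial sketch of the scalability point above, as a toy counting model in C rather than real locking code. The function name serialized_pairs and its parameters are made up for illustration. Given which device each concurrent submitter targets, it counts the pairs of submitters that would have to wait on each other under one global lock versus one lock per device.

```c
#include <assert.h>

/* Toy model, not kernel code. dev[i] is the device that submitter i wants
 * to queue I/O on. Returns the number of submitter pairs that are forced
 * to serialize on a lock:
 *   per_device_lock == 0: one global lock, every pair contends;
 *   per_device_lock != 0: only pairs targeting the same device contend. */
int serialized_pairs(const int *dev, int n, int per_device_lock)
{
    int pairs = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (!per_device_lock || dev[i] == dev[j])
                pairs++;
    return pairs;
}
```

For four submitters on four different devices the global lock serializes all 6 pairs and the per-device scheme none. With all four on a single device both schemes give 6, which matches the remark that the lock only matters with many devices and many CPUs.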

2003-05-27 20:52:39

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 10:55:16PM +0200, Jens Axboe wrote:
> But still, why on earth waste your time with something like this now
> when we are so close to 2.6? 2.4 is a stable code base, it should stay
> that way. I'm really not interested in more esoteric 2.4 backports, the
> vendor kernels are bad enough as it is.

They've backported everything else, so I guess it stood to reason it'd
happen eventually.

I, for one, got a good laugh out of it. =) Makes me wonder if the 2.4
distro backport trees' diffs are bigger than 2.4 itself yet.


-- wli

2003-05-27 20:58:39

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tuesday 27 May 2003 23:00, Jens Axboe wrote:

Hi Jens,

> Are you serious? Please tell me you haven't spent two weeks on the
> project not realising this?
Well, 2 weeks means in hours not more than 5 or 6 just delayed over many days.

And it was further just to go deeper into the code, not a real attempt to
backport it. NM.

ciao, Marc

2003-05-27 21:06:56

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27 2003, William Lee Irwin III wrote:
> I, for one, got a good laugh out of it. =) Makes me wonder if the 2.4
> distro backport trees' diffs are bigger than 2.4 itself yet.

Heh, well they're open for inspection, it's probably not far off :)

--
Jens Axboe

2003-05-27 21:06:55

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 23:00, Jens Axboe wrote:
>
> Hi Jens,
>
> > Are you serious? Please tell me you haven't spent two weeks on the
> > project not realising this?
> Well, 2 weeks means in hours not more than 5 or 6 just delayed over many days.
>
> And it was further just to go deeper into the code, not a real attempt to
> backport it. NM.

A bigger analysis of the problem before starting mindless (and useless)
porting would have brought you a lot farther :)

If you're just looking to port some io schedulers, the explanation I
left you in the previous mail should be plenty to get you started.

--
Jens Axboe

2003-05-27 21:20:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 02:05:18PM -0700, William Lee Irwin III wrote:
> They've backported everything else, so I guess it stood to reason it'd
> happen eventually.

you probably forgot we have varyio in 2.4 due to the lack of bio ;)

Andrea

2003-05-27 22:07:35

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Andrea Arcangeli wrote:
>
> However the last numbers from Randy showed my tree going faster than 2.5
> with bonnie and tiotest so I think we don't need to worry and I would
> probably not fix it in a different way in 2.4 even if it would mean a 1%
> degradation.

That could be because -aa quadruples the size of the VM readahead window.

Changes such as that should be removed when assessing the performance
impact of this particular patch.


2003-05-27 22:24:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > However the last numbers from Randy showed my tree going faster than 2.5
> > with bonnie and tiotest so I think we don't need to worry and I would
> > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > degradation.
>
> That could be because -aa quadruples the size of the VM readahead window.
>
> Changes such as that should be removed when assessing the performance
> impact of this particular patch.

I understand that was a generic benchmark against 2.5, not meant to
evaluate the effect of the fixed readahead (see the name of the patch
"readahead-got-broken-somehwere"). I don't see any good reason why
Randy should cripple my tree before benchmarking against 2.5. If
anything, it's ok to apply some of my patches to 2.5, that's great, the
other way around not IMHO.

Andrea

2003-05-27 22:29:55

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Andrea Arcangeli wrote:
>
> On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > However the last numbers from Randy showed my tree going faster than 2.5
> > > with bonnie and tiotest so I think we don't need to worry and I would
> > > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > > degradation.
> >
> > That could be because -aa quadruples the size of the VM readahead window.
> >
> > Changes such as that should be removed when assessing the performance
> > impact of this particular patch.
>
> I understand that was a generic benchmark against 2.5, not meant to
> evaluate the effect of the fixed readahead (see the name of the patch
> "readahead-got-broken-somehwere"). I don't see any good reason why
> should Randy cripple down my tree before benchmarking against 2.5? if
> something it's ok to apply some of my patches to 2.5, that's great, the
> other way around not IMHO.
>

No.

What I am saying is that evaluation of the effect of an IO scheduler change
cannot be performed when there is a 4:1 change in the readahead window present
in the same tree.

ie: we cannot conclude anything about the effect of the IO scheduler change
from Randy's numbers. Too many variables.


2003-05-27 22:44:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 03:40:49PM -0700, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli wrote:
> > > >
> > > > However the last numbers from Randy showed my tree going faster than 2.5
> > > > with bonnie and tiotest so I think we don't need to worry and I would
> > > > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > > > degradation.
> > >
> > > That could be because -aa quadruples the size of the VM readahead window.
> > >
> > > Changes such as that should be removed when assessing the performance
> > > impact of this particular patch.
> >
> > I understand that was a generic benchmark against 2.5, not meant to
> > evaluate the effect of the fixed readahead (see the name of the patch
> > "readahead-got-broken-somehwere"). I don't see any good reason why
> > should Randy cripple down my tree before benchmarking against 2.5? if
> > something it's ok to apply some of my patches to 2.5, that's great, the
> > other way around not IMHO.
> >
>
> No.
>
> What I am saying is that evaluation of the effect of an IO scheduler change
> cannot be performed when there is a 4:1 change in the readahead window present
> in the same tree.
>
> ie: we cannot conclude anything about the effect of the IO scheduler change
> from Randy's numbers. Too many variables.

an accurate evaluation can't be made from such comparison, but I never
claimed that to be an accurate evaluation, I just said we don't need to
worry, == "can't be too bad".

I just said it can't be too bad. and this is true, you even admit that a
readahead change for sure has more impact than whatever change the
fix-pausing generated. That's all I meant. Can't be too bad. the fact
mainline doesn't do readahead properly is much worse thing than whatever
slowdown can be generated by the fix pausing.

Furthermore I said we can deduce the accurate numbers from bigbox.html,
with very minor changes (not 2.4 vs 2.5) that as well shows the fix for
the deadlock not measurable as far as I can tell.

Andrea

2003-05-27 22:53:23

by Georg Nikodym

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, 27 May 2003 20:04:49 +0200
Marc-Christian Petersen wrote:

> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard
> is dead/:
> speak _NOW_ please, doesn't matter who you are!

Uh, ok. These pauses have kept me from using anything newer than riel's
2.4.19-rmap15a

-g



2003-05-27 23:11:11

by Christopher S. Aker

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard
> is dead/: speak _NOW_ please, doesn't matter who you are!

I've been able to reproduce the pauses on two different machines/mb/processor,
although each machine has >= 2.5GB ram. I can reproduce this in 2.4.19, 2.4.20,
and the 2.4.21-rc1/rc2/rc3.

After the machine un-pauses, everything completes/returns to normal. I don't
experience deadlocked processes.

Both my machines are IDE, using UDMA, hdparm stuff is maxed; messing with
bdflush, elvtune doesn't make any difference. Limiting the ram on the machines
didn't help.

Pauses have lasted anywhere from a few seconds to a few minutes. Anything later
than 2.4.18 is unusable for me because of this.

-Chris


2003-05-28 05:18:52

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 19:50, manish wrote:
>
> Hi Manish,
>
> > It is not a system hang but the processes hang showing the same stack
> > trace. This is certainly not a pause since the bonnie processes that
> > were hung (or deadlocked) never completed after several hrs. The stack
> > trace was the same.
>
> then you are hitting a different bug or a bug related to the issues
> Christian Klose and me and $tons of others were complaining.
>
> The bug you are hitting might be the problem with "process stuck in D
> state" Andrea Arcangeli fixed, let me guess, over half a year ago or so.
>
> In case you have a good mind to try to address your issue, you might want
> to try out the patch you can find here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
>
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> dead/: speak _NOW_ please, doesn't matter who you are!

Yo!

I'll throw my babushka into the ring too. I think it's obvious from MCP's
comments that I've been involved in testing this problem. I've spent hours,
possibly days trying to find a way to fix the pauses introduced since
2.4.19pre1. I agree with what MCP describes that the machine can come to a
standstill under any sort of disk i/o and is unusable for a variable length
of time. I've been playing with all sorts of numbers in my patchset to try
and limit it with only mild success. The best results I've had without a
major decrease in throughput was using akpm's read latency 2 patch while
significantly reducing nr_requests. It was while changing the number of
requests that I discovered that dropping them to 4 fixed the problem but
destroyed write throughput. I was pleased to see AA give the problem recognition after
my contest results on his kernel but disappointed that the problem only was
reduced, not fixed.

I have seen it on every piece of hardware I have used a 2.4.19+ kernel on
using the desktop. I have no idea what the real problem is, but I firmly
believe with MCP that it is the biggest flaw in 2.4 on the desktop (no idea
what it does to servers). We've tried over and over again fiddling with the
numbers and patches and only going to less than 2.4.19 fixes it completely.

Con

2003-05-28 05:51:43

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > On Tuesday 27 May 2003 19:50, manish wrote:
> >
> > Hi Manish,
> >
> > > It is not a system hang but the processes hang showing the same stack
> > > trace. This is certainly not a pause since the bonnie processes that
> > > were hung (or deadlocked) never completed after several hrs. The stack
> > > trace was the same.
> >
> > then you are hitting a different bug or a bug related to the issues
> > Christian Klose and me and $tons of others were complaining.
> >
> > The bug you are hitting might be the problem with "process stuck in D
> > state" Andrea Arcangeli fixed, let me guess, over half a year ago or so.
> >
> > In case you have a good mind to try to address your issue, you might want
> > to try out the patch you can find here:
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> >
> > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > dead/: speak _NOW_ please, doesn't matter who you are!
>
> Yo!
>
> I'll throw my babushka into the ring too. I think it's obvious from MCP's
> comments that I've been involved in testing this problem. I've spent hours,
> possibly days trying to find a way to fix the pauses introduced since
> 2.4.19pre1. I agree with what MCP describes that the machine can come to a
> standstill under any sort of disk i/o and is unusable for a variable length
> of time. I've been playing with all sorts of numbers in my patchset to try
> and limit it with only mild success. The best results I've had without a
> major decrease in throughput was using akpm's read latency 2 patch but by
> significantly reducing the nr_requests. It was changing the number of
> requests that I discovered dropping them to 4 fixed the problem but destroyed
> write throughput. I was pleased to see AA give the problem recognition after
> my contest results on his kernel but disappointed that the problem only was
> reduced, not fixed.

Does the problem change at all if you force batch_requests to 0?

--
Jens Axboe

2003-05-28 06:59:01

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 16:04, Jens Axboe wrote:
> On Wed, May 28 2003, Con Kolivas wrote:
> > On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > > On Tuesday 27 May 2003 19:50, manish wrote:
> > >
> > > Hi Manish,
> > >
> > > > It is not a system hang but the processes hang showing the same stack
> > > > trace. This is certainly not a pause since the bonnie processes that
> > > > were hung (or deadlocked) never completed after several hrs. The
> > > > stack trace was the same.
> > >
> > > then you are hitting a different bug or a bug related to the issues
> > > Christian Klose and me and $tons of others were complaining.
> > >
> > > The bug you are hitting might be the problem with "process stuck in D
> > > state" Andrea Arcangeli fixed, let me guess, over half a year ago or
> > > so.
> > >
> > > In case you have a good mind to try to address your issue, you might
> > > want to try out the patch you can find here:
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> > >
> > > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > > dead/: speak _NOW_ please, doesn't matter who you are!
> >
> > Yo!
> >
> > I'll throw my babushka into the ring too. I think it's obvious from MCP's
> > comments that I've been involved in testing this problem. I've spent
> > hours, possibly days trying to find a way to fix the pauses introduced
> > since 2.4.19pre1. I agree with what MCP describes that the machine can
> > come to a standstill under any sort of disk i/o and is unusable for a
> > variable length of time. I've been playing with all sorts of numbers in
> > my patchset to try and limit it with only mild success. The best results
> > I've had without a major decrease in throughput was using akpm's read
> > latency 2 patch but by significantly reducing the nr_requests. It was
> > changing the number of requests that I discovered dropping them to 4
> > fixed the problem but destroyed write throughput. I was pleased to see AA
> > give the problem recognition after my contest results on his kernel but
> > disappointed that the problem only was reduced, not fixed.
>
> Does the problem change at all if you force batch_requests to 0?

I've tried setting batch_requests to 1 by itself (without changing nr_requests)
and that didn't fix it, but I recall that dropping nr_requests to 2 (which would
make batch_requests == 0) made the machine fail to boot, so I haven't tried
batch_requests 0 by itself. Should it boot with it == 0?

Con

2003-05-28 07:00:57

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 16:04, Jens Axboe wrote:
> > On Wed, May 28 2003, Con Kolivas wrote:
> > > On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > > > On Tuesday 27 May 2003 19:50, manish wrote:
> > > >
> > > > Hi Manish,
> > > >
> > > > > It is not a system hang but the processes hang showing the same stack
> > > > > trace. This is certainly not a pause since the bonnie processes that
> > > > > were hung (or deadlocked) never completed after several hrs. The
> > > > > stack trace was the same.
> > > >
> > > > then you are hitting a different bug or a bug related to the issues
> > > > Christian Klose and me and $tons of others were complaining.
> > > >
> > > > The bug you are hitting might be the problem with "process stuck in D
> > > > state" Andrea Arcangeli fixed, let me guess, over half a year ago or
> > > > so.
> > > >
> > > > In case you have a good mind to try to address your issue, you might
> > > > want to try out the patch you can find here:
> > > >
> > > > > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> > > >
> > > > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > > > dead/: speak _NOW_ please, doesn't matter who you are!
> > >
> > > Yo!
> > >
> > > I'll throw my babushka into the ring too. I think it's obvious from MCP's
> > > comments that I've been involved in testing this problem. I've spent
> > > hours, possibly days trying to find a way to fix the pauses introduced
> > > since 2.4.19pre1. I agree with what MCP describes that the machine can
> > > come to a standstill under any sort of disk i/o and is unusable for a
> > > variable length of time. I've been playing with all sorts of numbers in
> > > my patchset to try and limit it with only mild success. The best results
> > > I've had without a major decrease in throughput was using akpm's read
> > > latency 2 patch but by significantly reducing the nr_requests. It was
> > > changing the number of requests that I discovered dropping them to 4
> > > fixed the problem but destroyed write throughput. I was pleased to see AA
> > > give the problem recognition after my contest results on his kernel but
> > > disappointed that the problem only was reduced, not fixed.
> >
> > Does the problem change at all if you force batch_requests to 0?
>
> I've tried batch_requests to 1 by itself (without changing the
> nr_request) and that didn't fix it, but recall dropping nr_requests to
> 2 (which would make batch requests==0) made the machine fail to boot
> so I haven't tried batch requests 0 by itself. Should it boot with it
> == 0?

If you leave nr_requests as it is, I don't see why it should not boot
with batch_requests == 0.

I can't see in all of these mails whether backing out akpm's starvation
patch makes the problem go away. Does it?

--
Jens Axboe

2003-05-28 07:02:57

by Marc Wilson

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
> speak _NOW_ please, doesn't matter who you are!

Ok, add my box to the list. Variety of post 2.4.18 kernels, -ac's, -rc's,
etc... all demonstrate it to one degree or another.

Lately it's gotten REALLY bad.

Currently I'm using 21-rc2-ac2 and it freezes for upwards of 15 sec
regularly when I'm exercising the HD (three simultaneous brag threads
downloading from various newsgroups). The mouse moves, but other than
that, X is entirely unresponsive. An xterm with continually scrolling
text, for example, will appear to stop scrolling until the kernel comes
back.

The HD light is on solid the whole time.

21-rc2 does it too. I haven't tried anything later than that yet. Well, I
tried 20-ck7 and it ate my RAID0 due to a DMA-ism and I've not tested
anything else since. :(

--
Marc Wilson | Nothing in life is to be feared. It is only to
[email protected] | be understood.

2003-05-28 07:20:37

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 09:13, Jens Axboe wrote:

Hi Jens,

> If you leave nr_requests as it is, I don't see why it should not boot
> with batch_requests == 0.
> I can't see in all of these mails whether backing out akpm's starvation
> patch makes the problem go away. Does it?
If you mean
"http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/[email protected]?nav=index.html|ChangeSet@-2y|[email protected]|hist/drivers/block/ll_rw_blk.c"

that one, the answer is YES.

ciao, Marc


2003-05-28 07:23:03

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 09:13, Jens Axboe wrote:
>
> Hi Jens,
>
> > If you leave nr_requests as it is, I don't see why it should not boot
> > with batch_requests == 0.
> > I can't see in all of these mails whether backing out akpm's starvation
> > patch makes the problem go away. Does it?
> If you mean
> "http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/[email protected]?nav=index.html|ChangeSet@-2y|[email protected]|hist/drivers/block/ll_rw_blk.c"
>
> that one, the answer is YES.

That's the one, yes. Andrew, looks like your patch brought out some
really bad behaviour.

--
Jens Axboe

2003-05-28 07:38:33

by Andrew Morton

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Jens Axboe <[email protected]> wrote:
>
> > that one, the answer is YES.
>
> That's the one, yes. Andrew, looks like your patch brought out some
> really bad behaviour.

Yes, but why?

It'd be interesting if any of these changes make a difference.


drivers/block/ll_rw_blk.c | 7
fs/buffer.c | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 3033 insertions(+), 4 deletions(-)

diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~a 2003-05-28 00:48:09.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 00:50:02.000000000 -0700
@@ -590,10 +590,10 @@ static struct request *__get_request_wai
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

- generic_unplug_device(q);
- add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+ add_wait_queue(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
+ generic_unplug_device(q);
if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);
@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests &&
- waitqueue_active(&q->wait_for_requests[rw]))
+ if (++q->rq[rw].count >= q->batch_requests)
wake_up(&q->wait_for_requests[rw]);
}
}

_

2003-05-28 08:17:52

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > > that one, the answer is YES.
> >
> > That's the one, yes. Andrew, looks like your patch brought out some
> > really bad behaviour.
>
> Yes, but why?
>
> It'd be interesting if any of these changes make a difference.
>
>
> drivers/block/ll_rw_blk.c | 7
> fs/buffer.c | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 3033 insertions(+), 4 deletions(-)
>
> diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~a 2003-05-28 00:48:09.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 00:50:02.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - generic_unplug_device(q);
> - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> + add_wait_queue(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> + generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= q->batch_requests &&
> - waitqueue_active(&q->wait_for_requests[rw]))
> + if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }

The unplug() move could be the key, in theory we could end up having to
unplug the queue again.

Question to the ones seeing the stalls - does a sysrq-s make things go
again?

--
Jens Axboe

2003-05-28 08:35:18

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 10:30, Jens Axboe wrote:

Hi Jens,

> The unplug() move could be the key, in theory we could end up having to
> unplug the queue again.
Hmm, afaik fix-pausing-2 patch does it similar, moving unplug_device() to the
same place.

> Question to the ones seeing the stalls - does a sysrq-s make things go
> again?
no (at least not for me)

ciao, Marc


2003-05-28 08:35:16

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 09:51, Andrew Morton wrote:

Hi Andrew,

> Yes, but why?
I don't know :(

> It'd be interesting if any of these changes make a difference.
I'll check it this evening! Many thanks.

ciao, Marc


2003-05-28 09:24:36

by Ragnar Hojland Espinosa

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
>
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
> speak _NOW_ please, doesn't matter who you are!

FWIW, me too.

Actually it just happens in the fixing stage when burning prebuilt iso
images from the hard disk (same IDE channel as the burner, 2.4.20)
Having a completely frozen machine under X was quite panic inducing ;)

A friend told me they also get regular "pauses" when quitting from
vmware.
--
Ragnar Hojland - Project Manager
Linalco "Especialistas Linux y en Software Libre"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083

2003-05-28 09:32:36

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Ragnar Hojland Espinosa wrote:
> On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
> >
> > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
> > speak _NOW_ please, doesn't matter who you are!
>
> FWIW, me too.
>
> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)
>
> A friend told me they also get regular "pauses" when quitting from
> vmware.

Lemme guess, hard drive on the same channel as the burner? There's
nothing we can do about that, hardware limitation. The reason you see it
during fixation is because that's one long single command, and we cannot
preempt the channel and service requests while that is going on.

--
Jens Axboe

2003-05-28 09:40:42

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 11:36, Ragnar Hojland Espinosa wrote:

Hi Ragnar,

> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)
That's a problem of IDE itself. I still say IDE is broken by design ;-)

> A friend told me they also get regular "pauses" when quitting from
> vmware.
Yep, occurs also with my machines.

ciao, Marc

2003-05-28 09:48:32

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 11:36, Ragnar Hojland Espinosa wrote:
>
> Hi Ragnar,
>
> > Actually it just happens in the fixing stage when burning prebuilt iso
> > images from the hard disk (same IDE channel as the burner, 2.4.20)
> > Having a completely frozen machine under X was quite panic inducing ;)
> That's a problem of IDE itself. I still say IDE is broken by design ;-)

It is actually possible to use the IMMED bit of the CLOSE_TRACK command
to get around this. In that case the cd-r will return the command as
completed and the drive on the same channel can service requests.
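
To make the IMMED idea above a little more concrete, here is a minimal user-space sketch that only fills in a 10-byte command block for CLOSE TRACK/SESSION with the immediate bit set. The opcode value (0x5B), the position of IMMED (bit 0 of byte 1) and the close-function field in byte 2 are assumptions based on the MMC command set and should be checked against the specification. The helper name build_close_track_cdb is hypothetical, and nothing here sends the command to a drive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values from the MMC command set; verify before relying on them. */
#define CLOSE_TRACK_OPCODE 0x5B  /* CLOSE TRACK/SESSION */
#define CLOSE_IMMED_BIT    0x01  /* bit 0 of byte 1: return before completion */
#define CLOSE_FUNC_TRACK   0x01  /* close function: close track */
#define CLOSE_FUNC_SESSION 0x02  /* close function: close session */

/*
 * Hypothetical helper: fill a 10-byte command block.
 * With immed != 0 the drive is asked to report the command as done
 * right away, so the channel is free while fixation continues.
 */
static void build_close_track_cdb(uint8_t cdb[10], int immed,
                                  uint8_t close_func, uint16_t track)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = CLOSE_TRACK_OPCODE;
    if (immed)
        cdb[1] |= CLOSE_IMMED_BIT;
    cdb[2] = close_func & 0x07;        /* 3-bit close function field */
    cdb[4] = (uint8_t)(track >> 8);    /* track number, big-endian */
    cdb[5] = (uint8_t)(track & 0xff);
}
```

A caller would build the block with immed set to 1 and then poll the drive for completion some other way, since the command itself no longer blocks until fixation is finished.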

--
Jens Axboe

2003-05-28 10:03:45

by Matthias Mueller

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 12:51:56AM -0700, Andrew Morton wrote:
> It'd be interesting if any of these changes make a difference.
>
>
> drivers/block/ll_rw_blk.c | 7
> fs/buffer.c | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 3033 insertions(+), 4 deletions(-)
>
> diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~a 2003-05-28 00:48:09.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 00:50:02.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - generic_unplug_device(q);
> - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> + add_wait_queue(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> + generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= q->batch_requests &&
> - waitqueue_active(&q->wait_for_requests[rw]))
> + if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }
>

Works fine on my notebook. Good throughput and no mouse hangs anymore.

Thanks,
Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-28 10:05:30

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Matthias Mueller wrote:
> On Wed, May 28, 2003 at 12:51:56AM -0700, Andrew Morton wrote:
> > It'd be interesting if any of these changes make a difference.
> >
> >
> > drivers/block/ll_rw_blk.c | 7
> > fs/buffer.c | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 3033 insertions(+), 4 deletions(-)
> >
> > diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~a 2003-05-28 00:48:09.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 00:50:02.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> > register struct request *rq;
> > DECLARE_WAITQUEUE(wait, current);
> >
> > - generic_unplug_device(q);
> > - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > + add_wait_queue(&q->wait_for_requests[rw], &wait);
> > do {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > + generic_unplug_device(q);
> > if (q->rq[rw].count == 0)
> > schedule();
> > spin_lock_irq(&io_request_lock);
> > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> > */
> > if (q) {
> > list_add(&req->queue, &q->rq[rw].free);
> > - if (++q->rq[rw].count >= q->batch_requests &&
> > - waitqueue_active(&q->wait_for_requests[rw]))
> > + if (++q->rq[rw].count >= q->batch_requests)
> > wake_up(&q->wait_for_requests[rw]);
> > }
> > }
> >
>
> Works fine on my notebook. Good throughput and no mouse hangs anymore.

Could you possibly try just the last hunk of the patch, then? Ie just
remove the waitqueue_active(&q->wait_for_requests[rw]) check, leave the
rest as-is.

--
Jens Axboe

2003-05-28 10:09:49

by Andrew Morton

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Matthias Mueller <[email protected]> wrote:
>
> Works fine on my notebook. Good throughput and no mouse hangs anymore.

Interesting.

Could you please work out which change caused it? Go back to stock 2.4 and
then apply this:


diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
@@ -590,10 +590,10 @@ static struct request *__get_request_wai
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

- generic_unplug_device(q);
add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
+ generic_unplug_device(q);
if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);



then this:

diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
@@ -590,7 +590,7 @@ static struct request *__get_request_wai
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

- add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+ add_wait_queue(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
generic_unplug_device(q);


Then this (totally unlikely, don't bother):

diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests &&
- waitqueue_active(&q->wait_for_requests[rw]))
+ if (++q->rq[rw].count >= q->batch_requests)
wake_up(&q->wait_for_requests[rw]);
}
}

_

2003-05-28 10:12:27

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 12:13, Matthias Mueller wrote:

Hi Matthias, Andrew,

> > It'd be interesting if any of these changes make a difference.
> Works fine on my notebook. Good throughput and no mouse hangs anymore.
damn, I *KNEW* Andrew is able to fix this. I knew that for over a year!! ;)

ciao, Marc

2003-05-28 10:12:37

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Andrew Morton wrote:
> Matthias Mueller <[email protected]> wrote:
> >
> > Works fine on my notebook. Good throughput and no mouse hangs anymore.
>
> Interesting.
>
> Could you please work out which change caused it? Go back to stock 2.4 and
> then apply this:
>
>
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - generic_unplug_device(q);
> add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> + generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);

I think it was already established that this wasn't the reason. Was my
first suspect too, though...

> then this:
>
> diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
> @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> + add_wait_queue(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> generic_unplug_device(q);

Since we do a general wake_up(), only the order of wakeups matter here
right (lifo vs fifo). Given that, the _exclusive() should be more fair
possibly at the cost of a bit of throughput.
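
The lifo vs fifo point can be sketched with a toy model in plain C. It assumes that add_wait_queue() puts a new waiter at the head of the list, that add_wait_queue_exclusive() puts it at the tail, and that a wakeup walks the list from the head. The struct and function names below are made up for illustration and are not the kernel's, and the model ignores the exclusive flag itself and looks only at ordering.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Toy wait list: ids[0] is the head, ids[len - 1] is the tail. */
struct wait_list {
    int ids[MAX_WAITERS];
    size_t len;
};

/* Assumed add_wait_queue() behaviour: newest waiter goes to the head. */
static void add_at_head(struct wait_list *wl, int id)
{
    assert(wl->len < MAX_WAITERS);
    for (size_t i = wl->len; i > 0; i--)
        wl->ids[i] = wl->ids[i - 1];
    wl->ids[0] = id;
    wl->len++;
}

/* Assumed add_wait_queue_exclusive() behaviour: newest waiter goes to the tail. */
static void add_at_tail(struct wait_list *wl, int id)
{
    assert(wl->len < MAX_WAITERS);
    wl->ids[wl->len++] = id;
}

/* A wakeup pass visits waiters from the head; record the visiting order. */
static size_t wake_order(const struct wait_list *wl, int out[MAX_WAITERS])
{
    for (size_t i = 0; i < wl->len; i++)
        out[i] = wl->ids[i];
    return wl->len;
}
```

With waiters 1, 2, 3 arriving in that order, head insertion visits them as 3, 2, 1 (the most recent sleeper first), while tail insertion visits them as 1, 2, 3, which is the fairer ordering.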

> Then this (totally unlikely, don't bother):
>
> diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= q->batch_requests &&
> - waitqueue_active(&q->wait_for_requests[rw]))
> + if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }

Well it's the only one left :). But you are right, try one of them at
the time, establishing the effect of each of them.

--
Jens Axboe

2003-05-28 10:14:50

by Con Kolivas

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> Matthias Mueller <[email protected]> wrote:
> > Works fine on my notebook. Good throughput and no mouse hangs anymore.
>
> Interesting.
>
> Could you please work out which change caused it? Go back to stock 2.4 and
> then apply this:
>
>
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - generic_unplug_device(q);
> add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> + generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);

It's not this because this is the layout in my -ck* and it still exhibits the
pauses.


2003-05-28 10:16:59

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 12:29, Con Kolivas wrote:

Hi Con, AKPM, Jens,

> > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> > register struct request *rq;
> > DECLARE_WAITQUEUE(wait, current);
> >
> > - generic_unplug_device(q);
> > add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > do {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > + generic_unplug_device(q);
> > if (q->rq[rw].count == 0)
> > schedule();
> > spin_lock_irq(&io_request_lock);
> It's not this because this is the layout in my -ck* and it still exhibits
> the pauses.
Same for -WOLK*

ciao, Marc

2003-05-28 10:34:39

by Con Kolivas

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 20:25, Jens Axboe wrote:
> On Wed, May 28 2003, Andrew Morton wrote:
> > Matthias Mueller <[email protected]> wrote:
> > > Works fine on my notebook. Good throughput and no mouse hangs anymore.
> >
> > Interesting.
> >
> > Could you please work out which change caused it? Go back to stock 2.4
> > and then apply this:
> >
> >
> > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> > register struct request *rq;
> > DECLARE_WAITQUEUE(wait, current);
> >
> > - generic_unplug_device(q);
> > add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > do {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > + generic_unplug_device(q);
> > if (q->rq[rw].count == 0)
> > schedule();
> > spin_lock_irq(&io_request_lock);
>
> I think it was already established that this wasn't the reason. Was my
> first suspect too, though...
>
> > then this:
> >
> > diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
> > @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> > register struct request *rq;
> > DECLARE_WAITQUEUE(wait, current);
> >
> > - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > + add_wait_queue(&q->wait_for_requests[rw], &wait);
> > do {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > generic_unplug_device(q);
>
> Since we do a general wake_up(), only the order of wakeups matter here
> right (lifo vs fifo). Given that, the _exclusive() should be more fair
> possibly at the cost of a bit of throughput.
>
> > Then this (totally unlikely, don't bother):
> >
> > diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
> > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> > */
> > if (q) {
> > list_add(&req->queue, &q->rq[rw].free);
> > - if (++q->rq[rw].count >= q->batch_requests &&
> > - waitqueue_active(&q->wait_for_requests[rw]))
> > + if (++q->rq[rw].count >= q->batch_requests)
> > wake_up(&q->wait_for_requests[rw]);
> > }
> > }
>
> Well it's the only one left :). But you are right, try one of them at
> the time, establishing the effect of each of them.

THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read
midstream.

Con

2003-05-28 10:37:35

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 20:25, Jens Axboe wrote:
> > On Wed, May 28 2003, Andrew Morton wrote:
> > > Matthias Mueller <[email protected]> wrote:
> > > > Works fine on my notebook. Good throughput and no mouse hangs anymore.
> > >
> > > Interesting.
> > >
> > > Could you please work out which change caused it? Go back to stock 2.4
> > > and then apply this:
> > >
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> > > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> > > register struct request *rq;
> > > DECLARE_WAITQUEUE(wait, current);
> > >
> > > - generic_unplug_device(q);
> > > add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > > do {
> > > set_current_state(TASK_UNINTERRUPTIBLE);
> > > + generic_unplug_device(q);
> > > if (q->rq[rw].count == 0)
> > > schedule();
> > > spin_lock_irq(&io_request_lock);
> >
> > I think it was already established that this wasn't the reason. Was my
> > first suspect too, though...
> >
> > > then this:
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
> > > @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> > > register struct request *rq;
> > > DECLARE_WAITQUEUE(wait, current);
> > >
> > > - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > > + add_wait_queue(&q->wait_for_requests[rw], &wait);
> > > do {
> > > set_current_state(TASK_UNINTERRUPTIBLE);
> > > generic_unplug_device(q);
> >
> > Since we do a general wake_up(), only the order of wakeups matter here
> > right (lifo vs fifo). Given that, the _exclusive() should be more fair
> > possibly at the cost of a bit of throughput.
> >
> > > Then this (totally unlikely, don't bother):
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
> > > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> > > */
> > > if (q) {
> > > list_add(&req->queue, &q->rq[rw].free);
> > > - if (++q->rq[rw].count >= q->batch_requests &&
> > > - waitqueue_active(&q->wait_for_requests[rw]))
> > > + if (++q->rq[rw].count >= q->batch_requests)
> > > wake_up(&q->wait_for_requests[rw]);
> > > }
> > > }
> >
> > Well it's the only one left :). But you are right, try one of them at
> > the time, establishing the effect of each of them.
>
> THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read
> midstream.

Cool, especially since we can easily apply this to -rc5 without any
worries. Marcelo, if you please...?

===== drivers/block/ll_rw_blk.c 1.44 vs edited =====
--- 1.44/drivers/block/ll_rw_blk.c Mon Apr 14 12:53:03 2003
+++ edited/drivers/block/ll_rw_blk.c Wed May 28 12:49:30 2003
@@ -829,8 +829,7 @@
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests &&
- waitqueue_active(&q->wait_for_requests[rw]))
+ if (++q->rq[rw].count >= q->batch_requests)
wake_up(&q->wait_for_requests[rw]);
}
}

--
Jens Axboe

2003-05-28 10:46:12

by Andrew Morton

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Jens Axboe <[email protected]> wrote:
>
> > THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read
> > midstream.
>
> Cool, especially since we can easily apply this to -rc5 without any
> worries. Marcelo, if you please...?
>
> ===== drivers/block/ll_rw_blk.c 1.44 vs edited =====
> --- 1.44/drivers/block/ll_rw_blk.c Mon Apr 14 12:53:03 2003
> +++ edited/drivers/block/ll_rw_blk.c Wed May 28 12:49:30 2003
> @@ -829,8 +829,7 @@
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= q->batch_requests &&
> - waitqueue_active(&q->wait_for_requests[rw]))
> + if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }

umm, I'd like confirmation of that.

The waitqueue_active() test is wrong because of a missing barrier, but only
on SMP. And if it does make a mistake it will surely correct itself when the
next request is put back. (That's why I left it there...)

More testing, please.
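
The self-correcting argument above can be walked through with a small single-threaded model. This is a sketch under stated assumptions, not the kernel code: struct rq_model and release_request() are hypothetical names, and the observed_waiter parameter stands in for whatever the releasing CPU happens to see when it tests waitqueue_active(), which without a barrier may be stale on SMP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one request free list. */
struct rq_model {
    int count;           /* free requests on the list */
    int batch_requests;  /* threshold before waiters are woken */
    int wakeups;         /* number of wake_up() calls issued so far */
};

/*
 * Model of putting one request back.
 * check_waiters   true  = keep the waitqueue_active() test
 *                 false = patched behaviour, always wake past the threshold
 * observed_waiter what the releasing CPU believes about the wait queue
 * Returns true if a wakeup was issued.
 */
static bool release_request(struct rq_model *q, bool check_waiters,
                            bool observed_waiter)
{
    assert(q != NULL);
    q->count++;                       /* always incremented, as in the diff */
    if (q->count < q->batch_requests)
        return false;
    if (check_waiters && !observed_waiter)
        return false;                 /* wakeup skipped on a stale view */
    q->wakeups++;
    return true;
}
```

In this model a release that sees a stale "no waiters" view skips the wakeup, and the next release that sees the waiter issues it, so the sleeper is delayed by one request completion rather than lost. With the test removed the first release already wakes. The model says nothing about why the real pauses are so much longer than that.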

2003-05-28 10:51:57

by Nick Piggin

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...



Con Kolivas wrote:

>On Wed, 28 May 2003 20:25, Jens Axboe wrote:
>
>>On Wed, May 28 2003, Andrew Morton wrote:
>>
>>>Then this (totally unlikely, don't bother):
>>>
>>>diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
>>>--- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
>>>+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
>>>@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>>> */
>>> if (q) {
>>> list_add(&req->queue, &q->rq[rw].free);
>>>- if (++q->rq[rw].count >= q->batch_requests &&
>>>- waitqueue_active(&q->wait_for_requests[rw]))
>>>+ if (++q->rq[rw].count >= q->batch_requests)
>>> wake_up(&q->wait_for_requests[rw]);
>>> }
>>> }
>>>
>>Well it's the only one left :). But you are right, try one of them at
>>the time, establishing the effect of each of them.
>>
>
>THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read
>midstream.
>
>
OK, I can't see how this would make a difference, but there
is similar (batch_requests) code in the mm tree, so it would
be nice if someone would work out what is going on.


2003-05-28 11:05:20

by Marc-Christian Petersen

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 12:59, Andrew Morton wrote:

Hi Andrew,

> umm, I'd like confirmation of that.
>
> The waitqueue_active() test is wrong because of a missing barrier, but only
> on SMP. And if it does make a mistake it will surely correct itself when
> the next request is put back. (That's why I left it there...)
> More testing, please.
Does the attached one make sense?

ciao, Marc



Attachments:
(No filename) (396.00 B)
llrwblk.patch (478.00 B)

2003-05-28 11:13:33

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Marc-Christian Petersen <[email protected]> wrote:
>
> Does the attached one make sense?

Nope.

Guys, you're the ones who can reproduce this. Please spend more time
working out which chunk (or combination thereof) actually fixes the
problem. If indeed any of them do.

I'm suspecting that Con's fingers slipped.


2003-05-28 11:19:03

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 13:27, Andrew Morton wrote:

Hi Akpm,

> > Does the attached one make sense?
> Nope.
nm.

> Guys, you're the ones who can reproduce this. Please spend more time
> working out which chunk (or combination thereof) actually fixes the
> problem. If indeed any of them do.
As I said, I will test it this evening. ATM I don't have time to recompile and
reboot. This evening I will test extensively, even on SMP, SCSI, IDE and so
on.

ciao, Marc

2003-05-28 11:27:27

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 21:27, Andrew Morton wrote:
> Marc-Christian Petersen <[email protected]> wrote:
> > Does the attached one make sense?
>
> Nope.
>
> Guys, you're the ones who can reproduce this. Please spend more time
> working out which chunk (or combination thereof) actually fixes the
> problem. If indeed any of them do.
>
> I'm suspecting that Con's fingers slipped.

I've been known to be email trigger happy in the past but a serious thrashing
with just this one change made massive improvements.

However -
One test case does not a fix give.

Others please test this. It's extremely important.

If you're interested the best test for me is:
dd if=/dev/zero of=dump bs=4096 count=512000
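
For scale, the amount of data this command writes is just bs times count. A minimal sketch of the arithmetic follows; the helper names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <limits.h>

/* Total bytes written by `dd bs=<bs> count=<count>`. */
unsigned long long dd_total_bytes(unsigned long long bs,
                                  unsigned long long count)
{
    /* Guard against overflow of the product. */
    assert(count == 0 || bs <= ULLONG_MAX / count);
    return bs * count;
}

/* Whole mebibytes (2^20 bytes) contained in `bytes`, rounded down. */
unsigned long long bytes_to_mib(unsigned long long bytes)
{
    return bytes / (1024ULL * 1024ULL);
}
```

dd_total_bytes(4096, 512000) is 2,097,152,000 bytes, i.e. exactly 2000 MiB, which matches the roughly 2 GB file size mentioned earlier in the thread.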

Con

2003-05-28 11:44:07

by Alan

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Mer, 2003-05-28 at 10:36, Ragnar Hojland Espinosa wrote:
> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)

If you have a disk and the burner on the same channel this is quite
normal. The fixate is a single ATAPI command and like all ATA commands
locks the bus to both master/slave for its duration of execution.

It's an IDE limitation

2003-05-28 11:57:58

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 03:23:15AM -0700, Andrew Morton wrote:
> Could you please work out which change caused it? Go back to stock 2.4 and
> then apply this:
>
>
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - generic_unplug_device(q);
> add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> + generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);
>
>
>
> then this:
>
> diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
> @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
> - add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> + add_wait_queue(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> generic_unplug_device(q);
>
>
> Then this (totally unlikely, don't bother):
>
> diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= q->batch_requests &&
> - waitqueue_active(&q->wait_for_requests[rw]))
> + if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }
>
> _

Tested all of them and some combinations:
patch 1 alone: still mouse hangs
patch 2 alone: still mouse hangs
patch 3 alone: no hangs, but I get some zombie process (starting a lot of
xterms results in zombie xterms, not noticed with vanilla
and the other patches)
patch 1+2: no mouse hangs
patch 1+2+3: no mouse hangs, no zombies

Bye,
Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-28 12:01:57

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> Tested all of them and some combinations:
> patch 1 alone: still mouse hangs
> patch 2 alone: still mouse hangs
> patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> xterms results in zombie xterms, not noticed with vanilla
> and the other patches)
> patch 1+2: no mouse hangs
> patch 1+2+3: no mouse hangs, no zombies

Forgot to mention: no zombies with patch 1 or 2

Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-28 12:07:53

by Carl-Daniel Hailfinger

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Matthias Mueller wrote:
> On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
>
>>Tested all of them and some combinations:
>>patch 1 alone: still mouse hangs
>>patch 2 alone: still mouse hangs
>>patch 3 alone: no hangs, but I get some zombie process (starting a lot of
>> xterms results in zombie xterms, not noticed with vanilla
>> and the other patches)
>>patch 1+2: no mouse hangs
>>patch 1+2+3: no mouse hangs, no zombies
>
>
> Forgot to mention: no zombies with patch 1 or 2

So 1+2 gives you zombies?


Carl-Daniel

2003-05-28 12:10:20

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
> Matthias Mueller wrote:
> > On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> >
> >>Tested all of them and some combinations:
> >>patch 1 alone: still mouse hangs
> >>patch 2 alone: still mouse hangs
> >>patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> >> xterms results in zombie xterms, not noticed with vanilla
> >> and the other patches)
> >>patch 1+2: no mouse hangs
> >>patch 1+2+3: no mouse hangs, no zombies
> >
> >
> > Forgot to mention: no zombies with patch 1 or 2
>
> So 1+2 gives you zombies?

No, 1+2 works OK, I just forgot to mention that, too. I think I should go to
sleep...

Matthias

--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-28 12:14:53

by Carl-Daniel Hailfinger

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Matthias Mueller wrote:
> On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
>
>>Matthias Mueller wrote:
>>
>>>On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
>>>
>>>
>>>>Tested all of them and some combinations:
>>>>patch 1 alone: hangs, no zombies
>>>>patch 2 alone: hangs, no zombies
>>>>patch 3 alone: no hangs, zombies
>>>>patch 1+2: no hangs, no zombies
>>>>patch 1+2+3: no hangs, no zombies

Right?

2003-05-28 12:25:56

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:28:10PM +0200, Carl-Daniel Hailfinger wrote:
> Matthias Mueller wrote:
> > On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
> >
> >>Matthias Mueller wrote:
> >>
> >>>On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> >>>
> >>>
> >>>>Tested all of them and some combinations:
> >>>>patch 1 alone: hangs, no zombies
> >>>>patch 2 alone: hangs, no zombies
> >>>>patch 3 alone: no hangs, zombies
> >>>>patch 1+2: no hangs, no zombies
> >>>>patch 1+2+3: no hangs, no zombies
>
> Right?
Yes.

2003-05-28 12:40:26

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
>
> Hi Akpm,
>
> > > Does the attached one make sense?
> > Nope.
> nm.
>
> > Guys, you're the ones who can reproduce this. Please spend more time
> > working out which chunk (or combination thereof) actually fixes the
> > problem. If indeed any of them do.
> As I said, I will test it this evening. ATM I don't have time to
> recompile and reboot. This evening I will test extensively, even on
> SMP, SCSI, IDE and so on.

May I ask how you are reproducing the bad results? I'm trying in vain
here...

--
Jens Axboe

2003-05-28 12:54:22

by Carl-Daniel Hailfinger

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Jens Axboe wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
>
>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
>>
>>>Guys, you're the ones who can reproduce this. Please spend more time
>>>working out which chunk (or combination thereof) actually fixes the
>>>problem. If indeed any of them do.
>>
>>As I said, I will test it this evening. ATM I don't have time to
>>recompile and reboot. This evening I will test extensively, even on
>>SMP, SCSI, IDE and so on.
>
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

Quoting Con Kolivas:

dd if=/dev/zero of=dump bs=4096 count=512000


HTH,
Carl-Daniel

2003-05-28 12:55:53

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> Jens Axboe wrote:
> > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> >
> >>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> >>
> >>>Guys, you're the ones who can reproduce this. Please spend more time
> >>>working out which chunk (or combination thereof) actually fixes the
> >>>problem. If indeed any of them do.
> >>
> >>As I said, I will test it this evening. ATM I don't have time to
> >>recompile and reboot. This evening I will test extensively, even on
> >>SMP, SCSI, IDE and so on.
> >
> > May I ask how you are reproducing the bad results? I'm trying in vain
> > here...
>
> Quoting Con Kolivas:
>
> dd if=/dev/zero of=dump bs=4096 count=512000

already tried that, no go. on ide/scsi? what filesystem? how much ram?
anything else running? smp/up?

--
Jens Axboe

2003-05-28 12:55:25

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

I can reproduce it with dd if=/dev/zero of=trash bs=4096 count=65000 on my
notebook (probably a slower harddisk makes it easier to see the mouse
hangs).

Matthias

2003-05-28 13:04:06

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 03:08:39PM +0200, Jens Axboe wrote:
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > Quoting Con Kolivas:
> >
> > dd if=/dev/zero of=dump bs=4096 count=512000
>
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

ide-notebook-harddrive, tested with ext2 and ext3. 256MB Ram, X11 started,
idle bind9 and idle postgresql. Tested directly after a reboot, ~85MB Ram
used without buffers/cache.

Matthias

2003-05-28 13:07:45

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 23:08, Jens Axboe wrote:
> On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> > Jens Axboe wrote:
> > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > >>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > >>>Guys, you're the ones who can reproduce this. Please spend more time
> > >>>working out which chunk (or combination thereof) actually fixes the
> > >>>problem. If indeed any of them do.
> > >>
> > >>As I said, I will test it this evening. ATM I don't have time to
> > >>recompile and reboot. This evening I will test extensively, even on
> > >>SMP, SCSI, IDE and so on.
> > >
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > Quoting Con Kolivas:
> >
> > dd if=/dev/zero of=dump bs=4096 count=512000
>
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

I'm using UP on IDE. I reproduce it easily on a P3 256Mb laptop with 5400rpm
drive, and less easily but still occurs on a P4 2.53 512Mb pc with 2x7200rpm
software raid 0 IDE drives. Even if the only thing you try to do is move the
mouse, the mouse will freeze for up to 30secs. When you first start the write
no disk activity happens for up to a few seconds, then it will start writing
madly and the machine will come to a standstill for a variable length of
time. Then it will come back to life for a few seconds only to die again for
a few seconds and so on till the write is complete.

Still testing combinations to see which is the best, but 1+2 seems better than
3 alone as doing reads midstream in the write don't cause hangs. I haven't
seen zombie processes ever.

Con

2003-05-28 13:16:57

by Stefan Foerster

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

* Jens Axboe <[email protected]> wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
>>> Guys, you're the ones who can reproduce this. Please spend more time
>>> working out which chunk (or combination thereof) actually fixes the
>>> problem. If indeed any of them do.
>> As I said, I will test it this evening. ATM I don't have time to
>> recompile and reboot. This evening I will test extensively, even on
>> SMP, SCSI, IDE and so on.
>
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

It is easily reproducible by using dd with an appropriate blocksize
reading from /dev/zero.

With chunk #3 from Andrew, I do not get pauses, but I noticed text
scrolling in an xterm stopping for like a second.

I did not get any zombie processes.

Ciao
Stefan

2003-05-28 13:16:58

by Stefan Foerster

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

* Jens Axboe <[email protected]> wrote:
> On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
>> dd if=/dev/zero of=dump bs=4096 count=512000
>
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

Doesn't matter if IDE or SCSI, to be honest, SCSI with the old aic7xxx
from vanilla 2.4.20 is even worse than IDE.

My box is UP, had only my window manager with some open xterms
running, nothing which should create any load.


Ciao
Stefan
--
Stefan Förster Public Key: 0xBBE2A9E9
FdI #122: Updateritis - Softwarebulemie (Frank Klemm)

2003-05-28 13:18:04

by Carl-Daniel Hailfinger

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Con Kolivas wrote:
> On Wed, 28 May 2003 23:08, Jens Axboe wrote:
>
>>On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
>>
>>>Jens Axboe wrote:
>>>
>>>>On Wed, May 28 2003, Marc-Christian Petersen wrote:
>>>>
>>>>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
>>>>>
>>>>>>Guys, you're the ones who can reproduce this. Please spend more time
>>>>>>working out which chunk (or combination thereof) actually fixes the
>>>>>>problem. If indeed any of them do.
>>>>>
>>>>>As I said, I will test it this evening. ATM I don't have time to
>>>>>recompile and reboot. This evening I will test extensively, even on
>>>>>SMP, SCSI, IDE and so on.
>>>>
>>>>May I ask how you are reproducing the bad results? I'm trying in vain
>>>>here...
>>>
>>>Quoting Con Kolivas:
>>>
>>>dd if=/dev/zero of=dump bs=4096 count=512000
>>
>>already tried that, no go. on ide/scsi? what filesystem? how much ram?
>>anything else running? smp/up?
>
>
> I'm using UP on IDE. I reproduce it easily on a P3 256Mb laptop with 5400rpm
> drive, and less easily but still occurs on a P4 2.53 512Mb pc with 2x7200rpm
> software raid 0 IDE drives. Even if the only thing you try to do is move the
> mouse, the mouse will freeze for up to 30secs. When you first start the write
> no disk activity happens for up to a few seconds, then it will start writing
> madly and the machine will come to a standstill for a variable length of
> time. Then it will come back to life for a few seconds only to die again for
> a few seconds and so on till the write is complete.
>
> Still testing combinations to see which is the best, but 1+2 seems better than
> 3 alone as doing reads midstream in the write don't cause hangs. I haven't
> seen zombie processes ever.

Just curious - which compiler did you use?


Carl-Daniel

2003-05-28 13:19:37

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 23:30, Carl-Daniel Hailfinger wrote:
> Con Kolivas wrote:
> > On Wed, 28 May 2003 23:08, Jens Axboe wrote:
> >>On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> >>>Jens Axboe wrote:
> >>>>On Wed, May 28 2003, Marc-Christian Petersen wrote:
> >>>>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> >>>>>>Guys, you're the ones who can reproduce this. Please spend more time
> >>>>>>working out which chunk (or combination thereof) actually fixes the
> >>>>>>problem. If indeed any of them do.
> >>>>>
> >>>>>As I said, I will test it this evening. ATM I don't have time to
> >>>>>recompile and reboot. This evening I will test extensively, even on
> >>>>>SMP, SCSI, IDE and so on.
> >>>>
> >>>>May I ask how you are reproducing the bad results? I'm trying in vain
> >>>>here...
> >>>
> >>>Quoting Con Kolivas:
> >>>
> >>>dd if=/dev/zero of=dump bs=4096 count=512000
> >>
> >>already tried that, no go. on ide/scsi? what filesystem? how much ram?
> >>anything else running? smp/up?
> >
> > I'm using UP on IDE. I reproduce it easily on a P3 256Mb laptop with
> > 5400rpm drive, and less easily but still occurs on a P4 2.53 512Mb pc
> > with 2x7200rpm software raid 0 IDE drives. Even if the only thing you try
> > to do is move the mouse, the mouse will freeze for up to 30secs. When you
> > first start the write no disk activity happens for up to a few seconds,
> > then it will start writing madly and the machine will come to a
> > standstill for a variable length of time. Then it will come back to life
> > for a few seconds only to die again for a few seconds and so on till the
> > write is complete.
> >
> > Still testing combinations to see which is the best, but 1+2 seems better
> > than 3 alone as doing reads midstream in the write don't cause hangs. I
> > haven't seen zombie processes ever.
>
> Just curious - which compiler did you use?

For this latest testing gcc 3.2.2

The hangs predate this; I was already getting them back when I was using
2.95.3.

Con

2003-05-28 13:24:52

by Stefan Foerster

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

* Stefan Foerster <[email protected]> wrote:
[...]
> Doesn't matter if IDE or SCSI, to be honest, SCSI with the old aic7xxx
> from vanilla 2.4.20 is even worse than IDE.
>
> My box is UP, had only my window manager with some open xterms
> running, nothing which should create any load.

Oh silly me, forgot to include that info: I have 512MB of RAM, an Athlon XP.

Filesystems didn't seem to matter much in my tests, got hangs with
ext2, ext3 and XFS.


Ciao
Stefan
--
Stefan Förster Public Key: 0xBBE2A9E9
FdI #44: Verdeckter Fehler - Siemens hat mitentwickelt. (Jörg Pechau)

2003-05-28 13:46:05

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> Could you please work out which change caused it? Go back to stock 2.4 and
> then apply this:
>
[snip] 1

> then this:
[snip] 2

> Then this (totally unlikely, don't bother):
[snip] 3

Ok patch combination final score for me is as follows in the presence of a
large continuous write:
1 No change
2 No change
3 improvement++; minor hangs with reads
1+2 improvement+++; minor pauses with switching applications
1+2+3 improvement++++; no pauses

Applications may start up slowly; that's fine. The mouse cursor keeps spinning
and responding at all times though with 1+2+3 which it hasn't done in 2.4 for
a year or so.

Con

2003-05-28 14:20:42

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Chris Mason wrote:
> On Wed, 2003-05-28 at 09:08, Jens Axboe wrote:
> >
> > > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > > here...
> > >
> > > Quoting Con Kolivas:
> > >
> > > dd if=/dev/zero of=dump bs=4096 count=512000
> >
> > already tried that, no go. on ide/scsi? what filesystem? how much ram?
> > anything else running? smp/up?
>
> I think we've got a few different problems. On SMP boxes, you need to
> have the fix-pausing patch from andrea applied to catch all the corner
> cases.

Agree

>
> On UP boxes it's possible the requests are starving in the drive, SCSI
> users should try with the max tags set down to something sensible,
> between 8 and 32.
>
> > IDE people can try lowering the max_kb_per_request parameter in
> /proc/ide/<drive>/settings, but this should only affect starvation with
> the writeback cache on.
>
> I made a patch a while ago that timed how long people spent waiting in
> __get_request_wait, it might help us figure out where the starvation is
> really happening.

But this seems totally unrelated to the reported problems, we are
talking about complete stalls of the mouse. No amount of io starvation
should provoke something like that.

--
Jens Axboe

2003-05-28 14:46:35

by Chris Mason

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 2003-05-28 at 10:33, Jens Axboe wrote:

> > On UP boxes it's possible the requests are starving in the drive, SCSI
> > users should try with the max tags set down to something sensible,
> > between 8 and 32.
> >
> > IDE people can try lowering the max_kb_per_request parameter in
> > /proc/ide/<drive>/settings, but this should only affect starvation with
> > the writeback cache on.
> >
> > I made a patch a while ago that timed how long people spent waiting in
> > __get_request_wait, it might help us figure out where the starvation is
> > really happening.
>
> But this seems totally unrelated to the reported problems, we are
> talking about complete stalls of the mouse. No amount of io starvation
> should provoke something like that.

Well, if it wasn't io related starvation, andrew's batch requests patch
wouldn't change things. I'm hoping the stats patch will get us some
numbers to go along with the perceived stalls, almost done merging.

-chris


2003-05-28 15:05:53

by Chris Mason

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 2003-05-28 at 09:08, Jens Axboe wrote:
>
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > Quoting Con Kolivas:
> >
> > dd if=/dev/zero of=dump bs=4096 count=512000
>
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

I think we've got a few different problems. On SMP boxes, you need to
have the fix-pausing patch from andrea applied to catch all the corner
cases.

On UP boxes it's possible the requests are starving in the drive, SCSI
users should try with the max tags set down to something sensible,
between 8 and 32.

IDE people can try lowering the max_kb_per_request parameter in
/proc/ide/<drive>/settings, but this should only affect starvation with
the writeback cache on.

I made a patch a while ago that timed how long people spent waiting in
__get_request_wait, it might help us figure out where the starvation is
really happening.

-chris


2003-05-28 15:26:43

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Chris Mason wrote:
> On Wed, 2003-05-28 at 10:33, Jens Axboe wrote:
>
> > > On UP boxes it's possible the requests are starving in the drive, SCSI
> > > users should try with the max tags set down to something sensible,
> > > between 8 and 32.
> > >
> > > IDE people can try lowering the max_kb_per_request parameter in
> > > /proc/ide/<drive>/settings, but this should only affect starvation with
> > > the writeback cache on.
> > >
> > > I made a patch a while ago that timed how long people spent waiting in
> > > __get_request_wait, it might help us figure out where the starvation is
> > > really happening.
> >
> > But this seems totally unrelated to the reported problems, we are
> > talking about complete stalls of the mouse. No amount of io starvation
> > should provoke something like that.
>
> Well, if it wasn't io related starvation, andrew's batch requests patch
> wouldn't change things. I'm hoping the stats patch will get us some
> numbers to go along with the perceived stalls, almost done merging.

Correction then, it doesn't appear to be starvation in the usual sense.
But you are right, pulling some stats out of the situation would be
nice. I still can't reproduce here.

--
Jens Axboe

2003-05-28 18:19:07

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003, Jens Axboe wrote:

> > > Guys, you're the ones who can reproduce this. Please spend more time
> > > working out which chunk (or combination thereof) actually fixes the
> > > problem. If indeed any of them do.
> > As I said, I will test it this evening. ATM I don't have time to
> > recompile and reboot. This evening I will test extensively, even on
> > SMP, SCSI, IDE and so on.
>
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

I can reproduce across spindles with cvs import'ing a kernel tree,
make sure you're running X11 and try and do things in it, e.g. scrolling
windows, dragging etc.

Zwane
--
function.linuxpower.ca

2003-05-28 18:30:20

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 28 May 2003, Zwane Mwaikambo wrote:

> I can reproduce across spindles with cvs import'ing a kernel tree,
> make sure you're running X11 and try and do things in it, e.g. scrolling
> windows, dragging etc.

Forgot to mention, 2x 400MHz/512MB RAM, read is from UW2/7200 write to
UDMA33/5400 (w/ 2MB cache).

Zwane
--
function.linuxpower.ca

2003-05-28 18:36:12

by Elladan

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> >
> > Hi Akpm,
> >
> > > > Does the attached one make sense?
> > > Nope.
> > nm.
> >
> > > Guys, you're the ones who can reproduce this. Please spend more time
> > > working out which chunk (or combination thereof) actually fixes the
> > > problem. If indeed any of them do.
> > As I said, I will test it this evening. ATM I don't have time to
> > recompile and reboot. This evening I will test extensively, even on
> > SMP, SCSI, IDE and so on.
>
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

It might be useful to check what video hardware and X servers people are
using here. If the behavior is just mouse freeze-ups, the "silken mouse"
feature of XFree might have some effect, since it involves XFree binding
a signal to mouse device events.

-J

2003-05-28 18:42:05

by Thomas Tonino

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Jens Axboe wrote:

> Lemme guess, hard drive on the same channel as the burner? There's
> nothing we can do about that, hardware limitation.

hmmm... most drives these days have a command to read free buffer capacity, so
there is no need to send more than the drive can swallow - and no need to tie up
the channel.

> The reason you see it
> during fixation is because that's one long single command, and we cannot
> preempt the channel and service requests while that is going on.

But this may be the exception that breaks the rule. Bah.


Thomas

2003-05-28 19:40:17

by David Ford

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Hmm, odd. I see similar dead time in 2.5.x, it is annoying but I
haven't had any time to track it down. I'm currently on .69 and
planning on putting .70 on this evening.

David

Marc Wilson wrote:

>On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
>
>
>>ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
>> speak _NOW_ please, doesn't matter who you are!
>>
>>
>
>Ok, add my box to the list. Variety of post 2.4.18 kernels, -ac's, -rc's,
>etc... all demonstrate it to one degree or another.
>
>Lately it's gotten REALLY bad.
>
>Currently I'm using 21-rc2-ac2 and it freezes for upwards of 15 sec
>regularly when I'm exercising the HD (three simultaneous brag threads
>downloading from various newsgroups). The mouse moves, but other than
>that, X is entirely unresponsive. An xterm with continually scrolling
>text, for example, will appear to stop scrolling until the kernel comes
>back.
>
>The HD light is on solid the whole time.
>
>21-rc2 does it too. I haven't tried anything later than that yet. Well, I
>tried 20-ck7 and it ate my RAID0 due to a DMA-ism and I've not tested
>anything else since. :(
>


2003-05-28 22:49:45

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, 29 May 2003 04:47, Elladan wrote:
> On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > >
> > > Hi Akpm,
> > >
> > > > > Does the attached one make sense?
> > > >
> > > > Nope.
> > >
> > > nm.
> > >
> > > > Guys, you're the ones who can reproduce this. Please spend more time
> > > > working out which chunk (or combination thereof) actually fixes the
> > > > problem. If indeed any of them do.
> > >
> > > As I said, I will test it this evening. ATM I don't have time to
> > > recompile and reboot. This evening I will test extensively, even on
> > > SMP, SCSI, IDE and so on.
> >
> > May I ask how you are reproducing the bad results? I'm trying in vain
> > here...
>
> It might be useful to check what video hardware and X servers people are
> using here. If the behavior is just mouse freeze-ups, the "silken mouse"
> feature of XFree might have some effect, since it involves XFree binding
> a signal to mouse device events.

XFree86 3.3.6, 4.2, 4.3
Drivers nvidia, nv, sis, sisfb, vesa, vesafb

are the drivers on the machines where I've seen it happen so far - ie without
discrimination.

Con

2003-05-28 23:26:35

by Chris Mason

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, 2003-05-28 at 11:39, Jens Axboe wrote:

> Correction then, it doesn't appear to be starvation in the usual sense.
> But you are right, pulling some stats out of the situation would be
> nice. I still can't reproduce here.

Well, it's not pretty but it gets some numbers out there. This patch
only calculates the time spent waiting in __get_request_wait; it isn't
interested in any other metrics. Stats are per-queue and are reset when
you mount the FS. You get a printout either when you unmount the FS or
when you run elvtune /dev/xxx (no other args, just enough to trigger the
read ioctl).

The output looks like this (after a dbench 50 run on 2.4.21-rc6):

device 03:04: num_req 12248, total jiffies waited 26729
417 forced to wait
1 min wait, 432 max wait
64 average wait
314 < 100, 62 < 200, 20 < 300, 20 < 400, 1 < 500
0 waits longer than 500 jiffies

It tells us there were 12248 total requests (merges don't count), and
that we spent 26,729 jiffies waiting in __get_request_wait. We had to
wait 417 times, the minimum was 1 and the max was 432 jiffies. The line
with the < signs is a simple way to get the deviations. 314 requests
waited < 100 jiffies, 62 requests waited less than 200 jiffies, etc.
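
As a rough illustration of the bookkeeping described above (this is not the
attached lat-stat-3.diff, which is not reproduced here), a userspace sketch in
C might look like the following. The struct and function names are
hypothetical; only the 100-jiffy bucket width and the 500-jiffy cutoff are
taken from the sample output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUCKET_WIDTH 100 /* jiffies per bucket, as in the sample output */
#define NUM_BUCKETS  5   /* < 100, < 200, < 300, < 400, < 500 */

/* Hypothetical mirror of the per-queue counters; names are illustrative. */
struct wait_stats {
    unsigned long num_wait;             /* how many times we had to wait */
    unsigned long total_wait;           /* total jiffies waited */
    unsigned long min_wait;             /* shortest single wait */
    unsigned long max_wait;             /* longest single wait */
    unsigned long buckets[NUM_BUCKETS]; /* waits below 100, 200, ... 500 */
    unsigned long over;                 /* waits of 500 jiffies or more */
};

/* Reset all counters (the patch is said to do this at mount time). */
static void wait_stats_init(struct wait_stats *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}

/* Account for one wait of `waited` jiffies. */
static void wait_stats_record(struct wait_stats *s, unsigned long waited)
{
    size_t idx = waited / BUCKET_WIDTH;

    if (s->num_wait == 0 || waited < s->min_wait)
        s->min_wait = waited;
    if (waited > s->max_wait)
        s->max_wait = waited;
    s->num_wait++;
    s->total_wait += waited;
    if (idx < NUM_BUCKETS)
        s->buckets[idx]++;
    else
        s->over++;
}

/* Integer average, rounded down; 0 if nothing was recorded. */
static unsigned long wait_stats_average(const struct wait_stats *s)
{
    return s->num_wait ? s->total_wait / s->num_wait : 0;
}
```

The sample output is consistent with this arithmetic: 26729 / 417 rounds down
to 64, and 314 + 62 + 20 + 20 + 1 = 417 waits in total.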

People who see stalls on UP machines and have seen improvements by
playing with code in drivers/block/ll_rw_blk.c are encouraged to try
getting numbers with this patch applied. It will make it easier to
figure things out.

I haven't tried Andrea's fix-pausing on top of this yet, any rejects
should be minor.

-chris


Attachments:
lat-stat-3.diff (4.66 kB)

2003-05-29 01:20:07

by manish

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Andrew Morton wrote:

>Matthias Mueller <[email protected]> wrote:
>
>>Works fine on my notebook. Good throughput and no mouse hangs anymore.
>>
>
>Interesting.
>
>Could you please work out which change caused it? Go back to stock 2.4 and
>then apply this:
>
>
>diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~1 2003-05-28 03:20:42.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:20:57.000000000 -0700
>@@ -590,10 +590,10 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
>- generic_unplug_device(q);
> add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
>+ generic_unplug_device(q);
> if (q->rq[rw].count == 0)
> schedule();
> spin_lock_irq(&io_request_lock);
>
>
>
>then this:
>
>diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~2 2003-05-28 03:21:03.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:09.000000000 -0700
>@@ -590,7 +590,7 @@ static struct request *__get_request_wai
> register struct request *rq;
> DECLARE_WAITQUEUE(wait, current);
>
>- add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
>+ add_wait_queue(&q->wait_for_requests[rw], &wait);
> do {
> set_current_state(TASK_UNINTERRUPTIBLE);
> generic_unplug_device(q);
>
>
>Then this (totally unlikely, don't bother):
>
>diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~3 2003-05-28 03:21:15.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c 2003-05-28 03:21:39.000000000 -0700
>@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
>- if (++q->rq[rw].count >= q->batch_requests &&
>- waitqueue_active(&q->wait_for_requests[rw]))
>+ if (++q->rq[rw].count >= q->batch_requests)
> wake_up(&q->wait_for_requests[rw]);
> }
> }
>
>_
>
Hello !

I have applied patches 1+2+3 and they seem to have solved the
stalls/pauses that I was seeing with the stock kernel after long hours of
testing with bonnie.

Thanks much
Manish




2003-05-29 08:23:25

by Ragnar Hojland Espinosa

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 11:58:43AM +0100, Alan Cox wrote:
> On Mer, 2003-05-28 at 10:36, Ragnar Hojland Espinosa wrote:
> > Actually it just happens in the fixing stage when burning prebuilt iso
> > images from the hard disk (same IDE channel as the burner, 2.4.20)
> > Having a completely frozen machine under X was quite panic inducing ;)
>
> If you have a disk and the burner ont he same channel this is quite
> normal. The fixate is a single ATAPI command and like all ATA commands
> locks the bus to both master/slave for its duration of execution.
>
> Its an IDE limitation

That's what you get for cheap hardware ;) Anyway, I do have two
questions regarding pauses when fixating, in case someone knows:

- Why doesn't the freeze always happen? (I think it doesn't.)
- Why doesn't the complete computer freeze always happen?

--
Ragnar Hojland - Project Manager
Linalco "Especialistas Linux y en Software Libre"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083

2003-05-29 12:39:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 01:17:59PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 12:59, Andrew Morton wrote:
>
> Hi Andrew,
>
> > umm, I'd like confirmation of that.
> >
> > The waitqueue_active() test is wrong because of a missing barrier, but only
> > on SMP. And if it does make a mistake it will surely correct itself when
> > the next request is put back. (That's why I left it there...)
> > More testing, please.
> Does the attached one make sense?

btw, I already fixed this race in my tree:

void blkdev_release_request(struct request *req)
{
	request_queue_t *q = req->q;

	req->rq_status = RQ_INACTIVE;
	req->q = NULL;

	/*
	 * Request may not have originated from ll_rw_blk. if not,
	 * assume it has free buffers and check waiters
	 */
	if (q) {
		list_add(&req->queue, &q->rq.free);
		if (++q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
			smp_mb();
			if (waitqueue_active(&q->wait_for_requests))
				wake_up(&q->wait_for_requests);


so if this race were the cause, my tree wouldn't exhibit the problem (and it
would trigger on SMP only).
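
The ordering that the smp_mb() is meant to enforce can be sketched in
userspace with C11 atomics. This is only an analogy under stated assumptions,
not kernel code: struct fake_queue, release_request() and waiter_must_sleep()
are hypothetical names, the sequentially consistent fence stands in for
smp_mb(), and a plain flag stands in for waitqueue_active(). The point is that
each side publishes its own state, issues a full barrier, and only then reads
the other side's state, so the two cannot both miss each other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request queue; not a kernel structure. */
struct fake_queue {
    atomic_int  free_count;  /* plays the role of q->rq.count */
    atomic_bool has_waiter;  /* plays the role of waitqueue_active() */
    atomic_int  wakeups;     /* counts simulated wake_up() calls */
    int         batch_requests;
};

static void queue_init(struct fake_queue *q, int batch_requests)
{
    assert(batch_requests > 0);
    atomic_init(&q->free_count, 0);
    atomic_init(&q->has_waiter, false);
    atomic_init(&q->wakeups, 0);
    q->batch_requests = batch_requests;
}

/* Releaser: publish the freed request, full barrier, then look for waiters. */
static bool release_request(struct fake_queue *q)
{
    int count = atomic_fetch_add_explicit(&q->free_count, 1,
                                          memory_order_relaxed) + 1;
    if (count < q->batch_requests)
        return false;
    atomic_thread_fence(memory_order_seq_cst); /* analogue of smp_mb() */
    if (atomic_load_explicit(&q->has_waiter, memory_order_relaxed)) {
        atomic_fetch_add_explicit(&q->wakeups, 1, memory_order_relaxed);
        return true; /* a wake-up was issued */
    }
    return false;
}

/* Waiter: announce ourselves, full barrier, then re-check the free count. */
static bool waiter_must_sleep(struct fake_queue *q)
{
    atomic_store_explicit(&q->has_waiter, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&q->free_count, memory_order_relaxed) == 0;
}
```

Without the fence on the releaser side, the check for waiters could be
satisfied before the incremented count is visible to another CPU, which is the
SMP-only window discussed above.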

>
> ciao, Marc
>
>




Andrea

2003-05-29 12:55:53

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 09:03:42AM +1000, Con Kolivas wrote:
> On Thu, 29 May 2003 04:47, Elladan wrote:
> > On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > > >
> > > > Hi Akpm,
> > > >
> > > > > > Does the attached one make sense?
> > > > >
> > > > > Nope.
> > > >
> > > > nm.
> > > >
> > > > > Guys, you're the ones who can reproduce this. Please spend more time
> > > > > working out which chunk (or combination thereof) actually fixes the
> > > > > problem. If indeed any of them do.
> > > >
> > > > As I said, I will test it this evening. ATM I don't have time to
> > > > recompile and reboot. This evening I will test extensively, even on
> > > > SMP, SCSI, IDE and so on.
> > >
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > It might be useful to check what video hardware and X servers people are
> > using here. If the behavior is just mouse freezups, the "silken mouse"
> > feature of XFree might have some effect, since it involves XFree binding
> > a signal to mouse device events.
>
> Xfree 3.3.6, 4.2,4.3
> Drivers nvidia, nv, sis, sisfb, vesa, vesafb
>
> are the drivers on the machines where I've seen it happen so far - ie without
> discrimination.

What about the window manager? Do you use focus-follows-mouse? Just
trying to find a pattern. For the record, KDE 3.1 + focus-follows-mouse
and X 4.3.0 here; I guess Jens uses the same software combination. The
mouse for me is always perfectly fluid no matter how fast and how long I
write, no matter if I don't touch the mouse for minutes; ALT+TAB as
well. I definitely can't reproduce the mouse stalls in any way (I'm
using cp /dev/zero . on an ext3 fs in ordered mode). Hardware is 1G of
RAM, SMP, IDE single spindle primary master, Matrox GS450. I would hardly
notice the background write flood at all if I only increased the
xmms buffer (in fact I thought it had stopped writing for a dozen seconds,
out of space, but it was still writing). (Kernel is 2.4.21rc4aa1 of
course.)

Andrea

2003-05-29 13:05:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> Tested all of them and some combinations:
> patch 1 alone: still mouse hangs
> patch 2 alone: still mouse hangs
> patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> xterms results in zombie xterms, not noticed with vanilla
> and the other patches)
> patch 1+2: no mouse hangs
> patch 1+2+3: no mouse hangs, no zombies

I can't make sense of the zombie thing: how can you generate zombies at
all from xterms? That sounds like your userspace is terribly broken and
it may have race conditions or whatever. In no way can those patches
generate or not generate zombies from xterms. I have never seen a zombie
xterm in my whole Linux experience.

Either that, or the GUI is intentionally trying to reduce
the number of wait4 syscalls to the minimum by coalescing them, but
that would be a very bad design of the GUI software since you're not going
to start an xterm (or any other window) every millisecond, so it
would be very pointless and confusing; I certainly wouldn't like it.
(I don't love the wait4 coalescing even in servers, where it might
be accepted as a micro-optimization.)

It's impossible to trust the rest of the report while hearing about such
a fundamental breakage in the core of your GUI. The mouse hangs could
just be a userspace bug that triggers when some timing changes in the
presence of writes, or whatever. So please install a userspace that never
generates zombie xterms, and see if you can still reproduce.

Andrea

2003-05-29 13:10:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 12:00:11AM +1000, Con Kolivas wrote:
> On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> > Could you please work out which change caused it? Go back to stock 2.4 and
> > then apply this:
> >
> [snip] 1
>
> > then this:
> [snip] 2
>
> > Then this (totally unlikely, don't bother):
> [snip] 3
>
> Ok patch combination final score for me is as follows in the presence of a
> large continuous write:
> 1 No change
> 2 No change
> 3 improvement++; minor hangs with reads
> 1+2 improvement+++; minor pauses with switching applications
> 1+2+3 improvement++++; no pauses

Then please try 1+2 alone too (i.e. w/o 3), because it's not obvious to me
that you're really hitting the race in 3 with a single write (I spotted and
fixed such a race in my tree some months ago, but thought it was a
theoretical one only, I mean on x86).

The improvement++ might be just an emotional feeling if you didn't
generate numbers to measure it (I know from my own experience that when you
try a new patch, everything seems faster until you really measure
it ;).

> Applications may start up slowly that's fine. The mouse cursor keeps spinning
> and responding at all times though with 1+2+3 which it hasn't done in 2.4 for

The mouse cursor always worked and still works fine for me (and I was
just running with 3 applied, just to get the theoretical bit correct).

> a year or so.
>
> Con


Andrea

2003-05-29 13:43:31

by Willy Tarreau

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

Hello !

I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB, scsi).
It's the *FIRST* time I have ever seen my mouse cursor hang (just a little bit
however, and totally acceptable) ! Usually, my kernels include the -aa VM and
lowlat patches, and I've never encountered this behaviour on this machine with
such a configuration. However, with the stock kernel, I admit that during the
2 minutes it takes to write the 2G file, I see the mouse stick two or three
times for about 1 second, which is quite acceptable IMHO. Opening an xterm may
take 10s to get to the prompt (more annoying). Same to launch 'ps'.

I use a fairly simple window manager (ctwm), which doesn't access the disk once
it's launched. It never gets stuck during all the operation if I disable the
swap. If I enable the swap, it sometimes takes one or two seconds to draw a
menu. The swap is used up to about 4 MB.

I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
disabled. The hangs happen more often, but are about the same durations. So I
confirm that -rc6 is better here than -rc5.

I retried with rc4aa1, and everything went very smooth again ; it takes at most
1 second to get an xterm with the prompt ready, and ps responds immediately. So
I think that there are two things here:
- those who experience very long hangs may use a heavy window manager
which does continuous disk accesses (I mean it accesses the disk for any
simple operation).
- a hungry WM may also be swapped during such operations, rendering it
totally unusable, particularly if the swap is on the same physical disk
as the file being written to.

So, could the people who report long hangs retry with swap disabled ?
Can we limit the amount of memory consumed by the cache during such a write ?

Regards,
Willy

2003-05-29 13:59:12

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 03:19:37PM +0200, Andrea Arcangeli wrote:
> On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> > Tested all of them and some combinations:
> > patch 1 alone: still mouse hangs
> > patch 2 alone: still mouse hangs
> > patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> > xterms results in zombie xterms, not noticed with vanilla
> > and the other patches)
> > patch 1+2: no mouse hangs
> > patch 1+2+3: no mouse hangs, no zombies
>
> I can't find a sense in the zombie thing, how can you generate zombie at
> all from xterms? That sounds like your userspace is terribly broken and
> it may have race conditions or whatever. In no way those patches can
> generate or not-generate zombies from xterms. I never ever seen a zombie
> xterm in my whole linux experience.

I rechecked everything and noticed that it wasn't an xterm, but a wrapper
script that executed rxvt. I changed that to plain xterm and the zombies
were gone. So I think there was probably a bug in rxvt triggered there.
After that I redid the tests, with the same result (and no zombies).
I can feel no difference between 1+2 and 1+2+3.

Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-29 13:55:22

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, 29 May 2003 23:55, Willy Tarreau wrote:
> Hello !
>
> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB,
> scsi). It's the *FIRST* time I have ever seen my mouse cursor hang (just a
> little bit however, and totally acceptable) ! Usually, my kernel include
> -aa VM and lowlat patches, and I've never encountered this behaviour on
> this machine with such a configuration. However, with stock kernel, I admit
> that during the 2 minutes it takes to write the 2G file, I see the mouse
> stick two or three times during about 1 second, which is quite acceptable
> IMHO. Opening an xterm may take 10s to get to the prompt (more annoying).
> Same to launch 'ps'.
>
> I use a fairly simple window manager (ctwm), which doesn't access the disk
> once it's launched. It never gets stuck during all the operation if I
> disable the swap. If I enable the swap, it sometimes takes one or two
> seconds to draw a menu. The swap is used up to about 4 MB.
>
> I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
> disabled. The hangs happen more often, but are about the same durations. So
> I confirm that -rc6 is better here than -rc5.
>
> I retried with rc4aa1, and everything went very smooth again ; it takes at
> most 1 second to get an xterm with the prompt ready, and ps responds
> immediately. So I think that there are two things here:
> - those who experience very long hangs may use a heavy window manager
> which does continuous disk accesses (I mean it accesses the disk for
> any simple operation).
> - a hungry WM may also be swapped during such operations, rendering it
> totally unusable, particularly if the swap is on the same physical disk
> as the file being written to.
>
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a
> write ?

I still get hangs with rc6 with massive writeouts to swap. The problem was
that I was getting hangs without writeouts to swap with 2.4.19pre1
->2.4.21pre5. I didn't expect the patch backout to suddenly make writing to
swap occur for free (although that would be nice).

Con

2003-05-29 14:25:54

by Matthias Mueller

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 03:55:08PM +0200, Willy Tarreau wrote:
> Hello !
>
> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB, scsi).
> It's the *FIRST* time I have ever seen my mouse cursor hang (just a little bit
> however, and totally acceptable) ! Usually, my kernel include -aa VM and lowlat
> patches, and I've never encountered this behaviour on this machine with such a
> configuration. However, with stock kernel, I admit that during the 2 minutes it
> takes to write the 2G file, I see the mouse stick two or three times during
> about 1 second, which is quite acceptable IMHO. Opening an xterm may take 10s
> to get to the prompt (more annoying). Same to launch 'ps'.
>
> I use a fairly simple window manager (ctwm), which doesn't access the disk once
> it's launched. It never gets stuck during all the operation if I disable the
> swap. If I enable the swap, it sometimes takes one or two seconds to draw a
> menu. The swap is used up to about 4 MB.
>
> I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
> disabled. The hangs happen more often, but are about the same durations. So I
> confirm that -rc6 is better here than -rc5.
>
> I retried with rc4aa1, and everything went very smooth again ; it takes at most
> 1 second to get an xterm with the prompt ready, and ps responds immediately. So
> I think that there are two things here:
> - those who experience very long hangs may use a heavy window manager
> which does continuous disk accesses (I mean it accesses the disk for any
> simple operation).
> - a hungry WM may also be swapped during such operations, rendering it
> totally unusable, particularly if the swap is on the same physical disk
> as the file being written to.
>
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a write ?

I run fluxbox, not a very heavy window manager, but I installed ctwm and
tried again with vanilla 2.4.20. If I disable swap the short hangs (1s) are
gone, but the long mouse hangs (10s) are still there.

Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

2003-05-29 14:32:45

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thursday 29 May 2003 15:55, Willy Tarreau wrote:

Hi Willy,

> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB,
> scsi). It's the *FIRST* time I have ever seen my mouse cursor hang (just a
> little bit however, and totally acceptable) ! Usually, my kernel include -aa
> VM and lowlat patches, and I've never encountered this behaviour on this
> machine with such a configuration. However, with stock kernel, I admit that
> during the 2 minutes it takes to write the 2G file, I see the mouse stick
> two or three times during about 1 second, which is quite acceptable IMHO.
WRONG. A mouse stick is not acceptable in _any_ way. Other OS' can handle this
pretty well, and if Linux has problems with mouse sticks, this has to be
fixed! Either in kernel space or in userspace (XFree86).

> Opening an xterm may take 10s to get to the prompt (more annoying). Same to
> launch 'ps'.
ACK!

> I retried with rc4aa1, and everything went very smooth again ; it takes at
> most 1 second to get an xterm with the prompt ready, and ps responds
> immediately. So I think that there are two things here:
> - those who experience very long hangs may use a heavy window manager
> which does continuous disk accesses (I mean it accesses the disk for
> any simple operation).
> - a hungry WM may also be swapped during such operations, rendering it
> totally unusable, particularly if the swap is on the same physical disk
> as the file being written to.
Well, sorry, but: no!

The pauses/stops occur no matter which window manager is used (KDE2/3,
WindowMaker, fvwm, gnome, etc.). The reason you are not seeing such things
with -aa is its low-latency elevator, lowlatency fixes and some important
fixes which are not in the stock kernel yet.

I reproduced mouse sticks and the keyboard not accepting any input for
$seconds with _every_ kernel which is based on 2.4.19/2.4.20/2.4.21*. This
also includes -AA (well, not as braindead bad as mainline was before the
fix), but this is because of the lowlat elevator from Andrea. And as I said
yesterday (or 2 days ago? dunno), the lowlat elevator drops throughput
(Andrea, it _does_ ;).

It's not only mouse hangs (as I've reported tons of times): the keyboard
also does not accept any input (delay varies between 1 and 15 seconds), and
this also applies if you don't run X at all.

Another fine example is:

- Start a screen session, not running X at all.
- Trash your HD with tons of writes.
- Press Ctrl-A-C for a new screen session.

You will see that it takes as long as what you described above for starting up
an xterm or calling ps. It does _not_ happen with 2.4.18!

> So, could the people who report long hangs retry with swap disabled ?
It's somewhat better but not acceptable.

> Can we limit the amount of memory consummed by the cache during such a
> write ?
I have been asking for such a feature for years ;)

Well, my summary: the bug has been there for over 15 months (I won't mention
again that I reported the bug 15 months ago ;-) ... It _may_ take some very
obscure hardware problem to be able to reproduce this bug, but as this thread
shows, there are tons of people who can reproduce this with different
hardware, starting with 2.4.19-pre1.

ciao, Marc

2003-05-29 14:51:02

by Con Kolivas

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, 29 May 2003 23:09, Andrea Arcangeli wrote:
> On Thu, May 29, 2003 at 09:03:42AM +1000, Con Kolivas wrote:
> > On Thu, 29 May 2003 04:47, Elladan wrote:
> > > On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > > > >
> > > > > Hi Akpm,
> > > > >
> > > > > > > Does the attached one make sense?
> > > > > >
> > > > > > Nope.
> > > > >
> > > > > nm.
> > > > >
> > > > > > Guys, you're the ones who can reproduce this. Please spend more
> > > > > > time working out which chunk (or combination thereof) actually
> > > > > > fixes the problem. If indeed any of them do.
> > > > >
> > > > > As I said, I will test it this evening. ATM I don't have time to
> > > > > recompile and reboot. This evening I will test extensively, even on
> > > > > SMP, SCSI, IDE and so on.
> > > >
> > > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > > here...
> > >
> > > It might be useful to check what video hardware and X servers people
> > > are using here. If the behavior is just mouse freezups, the "silken
> > > mouse" feature of XFree might have some effect, since it involves XFree
> > > binding a signal to mouse device events.
> >
> > Xfree 3.3.6, 4.2,4.3
> > Drivers nvidia, nv, sis, sisfb, vesa, vesafb
> >
> > are the drivers on the machines where I've seen it happen so far - ie
> > without discrimination.
>
> what about the window manager? do you use focus follow mouse? Just
> trying to find a pattern. For the record KDE 3.1 + focus follow mouse
> and X 4.3.0 here, I guess Jens uses the same software combination. the
> mouse for me is always perfectly fluid no matter how fast and how long I
> write, no matter if I don't touch the mouse for minutes, ALT+TAB as
> well. I definitely can't reproduce in any way the mouse stalls (I'm
> using cp /dev/zero . on a ext3 fs in ordered mode). hardware is 1G of
> ram smp IDE single spindle primary master matrox GS450. I almost
> couldn't notice the background write flood if I only would increase the
> xmms buffer (infact I thought it stopped writing for a dozen seconds out
> of space, and instead it was still writing). (kernel is 2.4.21rc4aa1 of
> course)

Why should it matter what wm I use if the pauses were there before and not
there now?

Con

2003-05-29 15:53:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 04:45:26PM +0200, Marc-Christian Petersen wrote:
> > machine with such a configuration. However, with stock kernel, I admit that
> > during the 2 minutes it takes to write the 2G file, I see the mouse stick
> > two or three times during about 1 second, which is quite acceptable IMHO.
> WRONG. A mouse stick is not acceptable in _any_ way. Other OS' can handle this
Excuse me, Marc, I didn't mean it was normally acceptable, but quite acceptable
compared to what other people report.

> > Opening an xterm may take 10s to get to the prompt (more annoying). Same to
> > launch 'ps'.
> ACK!

The problem is specifically due to the cache, and only related to I/O but not
to other subsystems : if I start 50 xterms during that write, they take the
same time to respond as when there's only one. And they all respond
simultaneously, showing that they were all waiting for the files to be read
from the disk. But I cannot hang anything which doesn't need disk access.
Perhaps some people have their X server swap !

> The pauses/stops occurs no matter of what WindowManager (KDE2/3, WindowMaker,
> fvwm, gnome etc. foobar). The point why you are not seeing such things with
> -aa is his Lowlatency Elevator and lowlatency-fixes and some important fixes
> which are not in stock kernel yet.

Do you agree that if the WM does no disk access and the mouse/keyboard freezes,
it means that X and/or the WM swap ? And if it's not the case, then it's related
to something else, and I don't see how playing with elevators can help!

> I reproduced mouse sticks and keyboard does not accept anything problems for
> $seconds with _every_ kernel which is based on 2.4.19/2.4.20/2.4.21*. This
> also includes -AA (well, not that braindead bad like mainline did before the
> fix) but this is because of lowlat elevator from Andrea. And as I told
> yesterday (or 2 days ago? dunno) lowlat elevator drops throughput (Andrea, it
> _does_ ;).

I also confirm it does ; it takes 122 seconds to write this file in -rc6, and
142 seconds in -aa. But I don't think that desktop people would notice anyway.
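
To put the two timings on a common scale, here is a small sketch of the
arithmetic in C. The 122 s and 142 s figures come from the measurement above;
treating the "2G file" as 2048 MB is an assumption made only for this
illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Relative slowdown of `other_s` against the baseline `base_s`, in percent. */
static double slowdown_percent(double base_s, double other_s)
{
    assert(base_s > 0.0);
    return (other_s - base_s) / base_s * 100.0;
}

/* Average throughput in MB/s for `mbytes` written in `seconds`. */
static double throughput_mb_s(double mbytes, double seconds)
{
    assert(seconds > 0.0);
    return mbytes / seconds;
}
```

With these inputs, slowdown_percent(122, 142) is about 16.4, and, under the
2048 MB assumption, the throughput drops from roughly 16.8 MB/s to roughly
14.4 MB/s.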

> It's not just only mouse hangs (as I've reported tons of times) but also
> keyboard does not accept any input (delay varies between 1 to 15 seconds) and
> this also applies if you don't run X at all.

in fact, we don't know if the keyboard doesn't accept inputs or if the process
bound to the TTY is stuck ! If Alt-SysRq replies immediately, the problem is on
the user process side.

> - Start a screen session, not running X at all.
> - Trash your HD with tons of writes.
> - Press Ctrl-A-C for a new screen session.
>
> You will see, it takes as long as, you wrote above, with starting up an Xterm
> or calling ps. It does _not_ happen with 2.4.18!

I think that for this, screen will need to allocate some memory, which may take
some time under these conditions. I don't have screen right here, so I won't
try it, but I suspect that a program which uses pre-allocated memory will have
no problem at all.

> > So, could the people who report long hangs retry with swap disabled ?
> It's somewhat better but not acceptable.

OK

> > Can we limit the amount of memory consummed by the cache during such a
> > write ?
> I ask for such a feature since years ;)

another solution would be to be able to specify that a process could use
pre-allocated memory.

Cheers,
Willy

2003-05-29 15:58:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 04:38:28PM +0200, Matthias Mueller wrote:

> I run fluxbox, not a very heavy window manager, but I installed ctwm and
> tried again with vanilla 2.4.20. If I disabled swap the short hangs (1s) are
> gone, but the long mouse hangs (10s) are still there.

Thanks for the test, but I find it really amazing that the mouse hangs while
it has nothing to do with any block device at all !

Cheers,
Willy

2003-05-29 16:05:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 03:55:08PM +0200, Willy Tarreau wrote:
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a write ?

the vm should be (i.e. is supposed to be) smart enough not to unmap
anything significant just because of large writes. I'm sure it's not
swapping anything on my desktop during write flood (and certainly not
the mouse pointer) but checking with swapoff is certainly a good hint to
be sure.

Andrea

2003-05-29 16:09:10

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 04:10:34PM +0200, Matthias Mueller wrote:
> On Thu, May 29, 2003 at 03:19:37PM +0200, Andrea Arcangeli wrote:
> > On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> > > Tested all of them and some combinations:
> > > patch 1 alone: still mouse hangs
> > > patch 2 alone: still mouse hangs
> > > patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> > > xterms results in zombie xterms, not noticed with vanilla
> > > and the other patches)
> > > patch 1+2: no mouse hangs
> > > patch 1+2+3: no mouse hangs, no zombies
> >
> > I can't find a sense in the zombie thing, how can you generate zombie at
> > all from xterms? That sounds like your userspace is terribly broken and
> > it may have race conditions or whatever. In no way those patches can
> > generate or not-generate zombies from xterms. I never ever seen a zombie
> > xterm in my whole linux experience.
>
> I rechecked everything an noticed, that it wasn't a xterm, but a wrapper
> script, that executed rxvt. I changed that to plain xterm and the zombies
> were gone. So I think there was probably a bug in rxvt triggered there.
> After that I redid the tests, with the same result (and no zombies).
> I can feel no difference between 1+2 or 1+2+3.

this sounds very sane now thanks for fixing the issues with the zombies!

it also makes sense to me that 1+2 is the same as 1+2+3, because I'd be
very surprised if the (purely smp) race condition in 3 made a whole lot
of difference for interactivity of a large write.

Andrea

2003-05-29 16:11:00

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wednesday 28 May 2003 13:31, Marc-Christian Petersen wrote:

Hi Andrew,

> > Guys, you're the ones who can reproduce this. Please spend more time
> > working out which chunk (or combination thereof) actually fixes the
> > problem. If indeed any of them do.
> As I said, I will test it this evening. ATM I don't have time to recompile
> and reboot. This evening I will test extensively, even on SMP, SCSI, IDE
> and so on.
Sorry, haven't had any time yesterday.

So my 10? comment for the patches (like the ones in -rc6).

1. Braindead pausings are GONE (mouse is not sticky as w/o the patch).
2. Mouse sticks are still there rarely (short ones, max. 1 second)
(If one can say 1 second is short ...).
3. all three patches are needed.

No side effects yet tho. Works with SCSI, IDE and SMP.

ciao, Marc

2003-05-29 16:36:02

by Andrea Arcangeli

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 06:06:04PM +0200, Willy TARREAU wrote:
> I also confirm it does ; it takes 122 seconds to write this file in -rc6, and
> 142 seconds in -aa. But I don't think that desktop people would notice anyway.

btw, were you running parallel reads or writes at the same time? (i.e.
launching xterms or ps etc. in parallel?) I ask because if xterm
starts up quickly, it is because the write workload is getting more seeks
in its way.

I'd be very interested if you can measure a bonnie performance change in
contiguous reads and writes on an otherwise completely idle machine. The
size of the queue has to be big enough to keep the I/O pipeline full
during contiguous writes at full speed. That said, the throughput decrease
alone is not enough to evaluate the reason for this drop.

you can also try with:

echo 20 500 0 0 500 3000 30 10 >/proc/sys/vm/bdflush

just in case.
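
When trying a setting like the one above, it helps to keep the previous values so they can be put back afterwards. A minimal sketch, assuming the tunable is a one-line text file; the helper name swap_tunable is hypothetical, and the demo runs on a scratch file rather than on /proc/sys/vm/bdflush itself (which needs root and a 2.4 kernel):

```python
import os
import tempfile


def swap_tunable(path, new_value):
    """Write new_value to a one-line tunable file; return the old contents.

    Hypothetical helper: 'path' would be /proc/sys/vm/bdflush on a 2.4
    kernel, but any one-line text file behaves the same way.
    """
    with open(path) as f:
        old_value = f.read().strip()
    with open(path, "w") as f:
        f.write(new_value + "\n")
    return old_value


if __name__ == "__main__":
    # A scratch file stands in for the real /proc entry (assumption).
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    with open(scratch, "w") as f:
        f.write("2 50 32 100 50 300 1 0 0\n")

    saved = swap_tunable(scratch, "20 500 0 0 500 3000 30 10")
    print("previous setting:", saved)

    # ... run the benchmark here, then restore the saved values ...
    swap_tunable(scratch, saved)
    os.remove(scratch)
```

Restoring the saved string after each run keeps successive measurements comparable.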

Andrea

2003-05-29 17:33:43

by Willy Tarreau

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Thu, May 29, 2003 at 06:49:40PM +0200, Andrea Arcangeli wrote:
> btw, were you running parallel reads or writes at the same time? (i.e.
> launching xterms or ps etc. in parallel?) I ask because if xterm
> starts up quickly, it is because the write workload is getting more seeks
> in its way.

Well, you're right, I was starting some xterms, but not that many, perhaps
about ten during the whole test.

> I'd be very interested if you can measure a bonnie performance change in
> contiguous reads and writes on an otherwise completely idle machine. The
> size of the queue has to be big enough to keep the I/O pipeline full
> during contiguous writes at full speed.

for this I'll have to install bonnie, I won't do it right now.

> you can also try with:
>
> echo 20 500 0 0 500 3000 30 10 >/proc/sys/vm/bdflush

interestingly, it seems that the lower the last 2 values, the longer it takes.
I retried without opening any xterm, and it took 130 seconds. With the above
changes to bdflush, 135 s. With '80 50', 118s.

vmstat also shows me that the test begins at a sustained 16-19 MB/s write
throughput during about the first minute. Then it starts to show regular drops
to 5-7 MB/s for 6-7s, and goes back to full speed. Since this is on reiserfs,
I wonder if this activity is not related to the journal.

Moreover, the disk still writes during about 10s after the end of the dd, so
I don't think that measuring the time dd takes to complete is a good indicator
of anything (or I should try with a final sync).
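
To illustrate the point about a final sync, here is a rough sketch of timing a write so that the clock only stops once the data has been pushed to disk. The function name timed_write, the chunk size and the demo size are assumptions chosen for illustration; it is not the dd test used above.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, chunk_size=1 << 20, sync=True):
    """Write total_bytes of zeros to path and return the elapsed seconds.

    With sync=True the clock stops only after fsync() returns, so data
    still sitting in the page cache is included in the measurement.
    With sync=False the clock stops once the file has been closed.
    """
    chunk = b"\0" * chunk_size
    start = time.monotonic()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        if sync:
            f.flush()              # empty Python's userspace buffer
            os.fsync(f.fileno())   # ask the kernel to write it out
    return time.monotonic() - start


if __name__ == "__main__":
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    size = 8 * 1024 * 1024  # small demo size; a real test would be far larger
    print("without final sync: %.3f s" % timed_write(scratch, size, sync=False))
    print("with final sync:    %.3f s" % timed_write(scratch, size, sync=True))
    os.remove(scratch)
```

The gap between the two figures is roughly the write-back that is still pending when the writer exits, which is the part a plain "time dd" does not see.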

If I write simultaneously to two 1G files, wait a little while and then read
from them while still writing, I begin to wait a few seconds for xterm to give
me the prompt. But when writes finish and there are only concurrent reads,
everything gets smooth again, even though the disk emits a terrible seek sound!

Cheers,
Willy

2003-06-02 10:29:53

by Jens Axboe

Subject: Re: 2.4.20: Proccess stuck in __lock_page ...

On Wed, May 28 2003, Thomas Tonino wrote:
> Jens Axboe wrote:
>
> >Lemme guess, hard drive on the same channel as the burner? There's
> >nothing we can do about that, hardware limitation.
>
> hmmm... most drives these days have a command to read free buffer capacity,
> so there is no need to send more than the drive can swallow - and no need
> to tie up the channel.

As we cannot do more than 128kb in a single request (cdrecord uses 63kb
for writing), there's no problem there. I think you are misunderstanding
me. This is not a problem with ide layer starving the hard drive by
continually sending writes to the cd-r, it's a problem with not being
able to preempt service for a single command duration.

> >The reason you see it
> >during fixation is because that's one long single command, and we cannot
> >preempt the channel and service requests while that is going on.
>
> But this may be the exception that breaks the rule. Bah.

No, that is the entire problem.

--
Jens Axboe