2003-07-05 20:18:04

by Andrew Morton

[permalink] [raw]
Subject: anticipatory scheduler merged


Linus has merged the anticipator scheduler. A few things to note on this:

- For some workloads it is around 10% slower than deadline. Most notably
database workloads which seek all over the disk performing reads and
synchronous writes.

Sorry, we just weren't able to get the last little bit back.

But for other workloads (reading lots of filesystem objects when
there's a lot of writeout around, when there's a streaming read, etc) it
is up to 1000% faster. I believe it is a better all-round IO scheduler.

Unless Nick pulls a big rabbit out of his hat, database people will
need to boot their kernels with `elevator=deadline' to get the last bit
of performance back.

- These changes have been well tested, but it is five months work and
over 100 patches. There's probably a bug or two. If you suspect that
something has gone wrong at the block layer (lots of tasks stuck in D
state) then please retest with `elevator=deadline'.

Thanks.


2003-07-07 02:07:46

by Brandon Low

[permalink] [raw]
Subject: Re: anticipatory scheduler merged

On Sat, 07/05/03 at 13:33:34 -0700, Andrew Morton wrote:
> - These changes have been well tested, but it is five months work and
> over 100 patches. There's probably a bug or two. If you suspect that
> something has gone wrong at the block layer (lots of tasks stuck in D
> state) then please retest with `elevator=deadline'.
>
> Thanks.

I am seeing these D tasks when running 2.5.74-mm2 under a heavy seeking
load (compiling application, untarring kernel, and filesharing
simultaneously) on a slow (laptop 4200RPM) hdd. I find that after about
10 uptime when I start throwing on the seeking loads one or all of them
go to D state and any new disk IO is either blocked or very slow.

I have tested with elevator=deadline and have been unable to reproduce.

Any further testing or debugging you need me to do I can probably do
(but I'm not terribly knowledgable so I'll need step by step for said
testing). Thanks!

Brandon Low

2003-07-07 02:23:39

by Nick Piggin

[permalink] [raw]
Subject: Re: anticipatory scheduler merged



Brandon Low wrote:

>On Sat, 07/05/03 at 13:33:34 -0700, Andrew Morton wrote:
>
>>- These changes have been well tested, but it is five months work and
>> over 100 patches. There's probably a bug or two. If you suspect that
>> something has gone wrong at the block layer (lots of tasks stuck in D
>> state) then please retest with `elevator=deadline'.
>>
>>Thanks.
>>
>
>I am seeing these D tasks when running 2.5.74-mm2 under a heavy seeking
>load (compiling application, untarring kernel, and filesharing
>simultaneously) on a slow (laptop 4200RPM) hdd. I find that after about
>10 uptime when I start throwing on the seeking loads one or all of them
>go to D state and any new disk IO is either blocked or very slow.
>
>I have tested with elevator=deadline and have been unable to reproduce.
>
>Any further testing or debugging you need me to do I can probably do
>(but I'm not terribly knowledgable so I'll need step by step for said
>testing). Thanks!
>
>

OK, so the disk is IDE, and you aren't using ide-scsi? Please compile the
kernel with "Magic SysRq key" config option enabled. Its in the Kernel
hacking submenu.

Then repeat the system freeze, and press alt+sys rq+t. Run dmesg | less to
get the task trace. You'll get a lot of lines like this:
bash S 00000001 627 625 626 (NOTLB)
followed by their call traces. Copy the call trace of one or two tasks that
have a D following their name. Then post it here.

Oh and you'll probably have to run dmesg and less to get them into cache
before the system freezes!

Thanks,
Nick

2003-07-07 08:10:02

by James Cloos

[permalink] [raw]
Subject: Re: anticipatory scheduler merged

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> - These changes have been well tested, but it is five months
Andrew> work and over 100 patches. There's probably a bug or two. If
Andrew> you suspect that something has gone wrong at the block layer
Andrew> (lots of tasks stuck in D state) then please retest with
Andrew> `elevator=deadline'.

Looks like I got hit by such a bug.¹ It left a strip(1) in __down,
and a subsequent rm(1) recursing to that file also is in __down.

I’ll give a try after sleep w/ deadline....

The dumps I still have² are:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
00000000
*pde = 00000000
Oops: 0000 [#2]
CPU: 0
EIP: 0060:[<00000000>] Not tainted
EFLAGS: 00010286
EIP is at 0x0
eax: c04b0d20 ebx: fffffff4 ecx: d87bcd3c edx: d87bcd3c
esi: ca6466c4 edi: d0f55e00 ebp: c9b51f70 esp: c9b51f08
ds: 007b es: 007b ss: 0068
Process strip (pid: 18461, threadinfo=c9b50000 task=c40a32a0)
Stack: c01675f6 ca6466c4 d0f55e00 c9b51f70 ffffffd8 d87bcd20 00000242 c9b51f70
c0167f24 c9b51f78 d87bcd20 c9b51f70 c9b50000 c9b51f78 00000001 00000002
cad39d60 00000241 00000002 c520e000 c9b50000 c015734b c520e000 00000242
Call Trace:
[<c01675f6>] __lookup_hash+0xc6/0xe0
[<c0167f24>] open_namei+0x334/0x460
[<c015734b>] filp_open+0x3b/0x70
[<c015786b>] sys_open+0x5b/0xa0
[<c010b379>] sysenter_past_esp+0x52/0x71

Code: Bad EIP value.

ksymoops adds this bit:

>>EIP; 00000000 Before first symbol

>>eax; c04b0d20 <ext3_file_inode_operations+0/60>
>>ebx; fffffff4 <TSS_ESP0_OFFSET+1f0/????>
>>ecx; d87bcd3c <_end+181db9b8/3fa1cc7c>
>>edx; d87bcd3c <_end+181db9b8/3fa1cc7c>
>>esi; ca6466c4 <_end+a065340/3fa1cc7c>
>>edi; d0f55e00 <_end+10974a7c/3fa1cc7c>
>>ebp; c9b51f70 <_end+9570bec/3fa1cc7c>
>>esp; c9b51f08 <_end+9570b84/3fa1cc7c>

and the next oops is:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0142ef0
*pde = 00000000
Oops: 0000 [#3]
CPU: 0
EIP: 0060:[<c0142ef0>] Not tainted
EFLAGS: 00010006
EIP is at kfree+0x30/0x70
eax: 00140000 ebx: ce0afaa0 ecx: dd7127b0 edx: 00000000
esi: 00000100 edi: 00000206 ebp: dd7127b0 esp: cad1bf48
ds: 007b es: 007b ss: 0068
Process lsof (pid: 18589, threadinfo=cad1a000 task=c40a26a0)
Stack: 00000000 ce0afab8 ce0afaa0 c8df17a0 dd7127b0 c0178635 00000100 00000000
c8df17a0 c0178610 dffd61e0 c015951a dd7127b0 c8df17a0 d1389d40 c8df17a0
d3e403e0 00000000 cad1a000 c015792d c8df17a0 d3e403e0 d3e403e0 c8df17a0
Call Trace:
[<c0178635>] seq_release_private+0x25/0x48
[<c0178610>] seq_release_private+0x0/0x48
[<c015951a>] __fput+0x12a/0x170
[<c015792d>] filp_close+0x4d/0x90
[<c01579cb>] sys_close+0x5b/0x90
[<c010b379>] sysenter_past_esp+0x52/0x71

Code: 8b 1a 8b 03 3b 43 04 73 18 89 74 83 10 ff 03 57 9d 8b 5c 24

ksymoops adds:

>>EIP; c0142ef0 <kfree+30/70> <=====

>>ebx; ce0afaa0 <_end+dace71c/3fa1cc7c>
>>ecx; dd7127b0 <_end+1d13142c/3fa1cc7c>
>>ebp; dd7127b0 <_end+1d13142c/3fa1cc7c>
>>esp; cad1bf48 <_end+a73abc4/3fa1cc7c>

Trace; c0178635 <seq_release_private+25/48>
Trace; c0178610 <seq_release_private+0/48>
Trace; c015951a <__fput+12a/170>
Trace; c015792d <filp_close+4d/90>
Trace; c01579cb <sys_close+5b/90>
Trace; c010b379 <sysenter_past_esp+52/71>

Code; c0142ef0 <kfree+30/70>
00000000 <_EIP>:
Code; c0142ef0 <kfree+30/70> <=====
0: 8b 1a mov (%edx),%ebx <=====
Code; c0142ef2 <kfree+32/70>
2: 8b 03 mov (%ebx),%eax
Code; c0142ef4 <kfree+34/70>
4: 3b 43 04 cmp 0x4(%ebx),%eax
Code; c0142ef7 <kfree+37/70>
7: 73 18 jae 21 <_EIP+0x21>
Code; c0142ef9 <kfree+39/70>
9: 89 74 83 10 mov %esi,0x10(%ebx,%eax,4)
Code; c0142efd <kfree+3d/70>
d: ff 03 incl (%ebx)
Code; c0142eff <kfree+3f/70>
f: 57 push %edi
Code; c0142f00 <kfree+40/70>
10: 9d popf
Code; c0142f01 <kfree+41/70>
11: 8b 5c 24 00 mov 0x0(%esp,1),%ebx


I presume there was another Oops: 0000, but it is lost².

-JimC

¹ And I didn’t even know I was running the new scheduler; the
bk-commits-head email lagged enough behind the push to
linux.bkbits.net that I received it after booting
the new kernel.... I guess that answers the
question of which comes first. ;-/

² Turns out msyslogd(8)’s im_linux(8)’ is not too
happy w/ 2.5’s lack of /proc/ksyms. [SIGH]

2003-07-07 08:19:23

by Nick Piggin

[permalink] [raw]
Subject: Re: anticipatory scheduler merged

It looks like you are hitting the same bug as reported in
"[OOPS] 2.5.74-bk4 in __lookup_hash". I think wli would like
to see your .config, lspci, dmesg, and what the box was doing
before crashing.

At this stage you probably haven't hit an AS bug.


James H. Cloos Jr. wrote:

>>>>>>"Andrew" == Andrew Morton <[email protected]> writes:
>>>>>>
>
>Andrew> - These changes have been well tested, but it is five months
>Andrew> work and over 100 patches. There's probably a bug or two. If
>Andrew> you suspect that something has gone wrong at the block layer
>Andrew> (lots of tasks stuck in D state) then please retest with
>Andrew> `elevator=deadline'.
>
>Looks like I got hit by such a bug.¹ It left a strip(1) in __down,
>and a subsequent rm(1) recursing to that file also is in __down.
>
>I’ll give a try after sleep w/ deadline....
>
>The dumps I still have² are:
>
>Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
>00000000
>*pde = 00000000
>Oops: 0000 [#2]
>CPU: 0
>EIP: 0060:[<00000000>] Not tainted
>EFLAGS: 00010286
>EIP is at 0x0
>eax: c04b0d20 ebx: fffffff4 ecx: d87bcd3c edx: d87bcd3c
>esi: ca6466c4 edi: d0f55e00 ebp: c9b51f70 esp: c9b51f08
>ds: 007b es: 007b ss: 0068
>Process strip (pid: 18461, threadinfo=c9b50000 task=c40a32a0)
>Stack: c01675f6 ca6466c4 d0f55e00 c9b51f70 ffffffd8 d87bcd20 00000242 c9b51f70
> c0167f24 c9b51f78 d87bcd20 c9b51f70 c9b50000 c9b51f78 00000001 00000002
> cad39d60 00000241 00000002 c520e000 c9b50000 c015734b c520e000 00000242
>Call Trace:
> [<c01675f6>] __lookup_hash+0xc6/0xe0
> [<c0167f24>] open_namei+0x334/0x460
> [<c015734b>] filp_open+0x3b/0x70
> [<c015786b>] sys_open+0x5b/0xa0
> [<c010b379>] sysenter_past_esp+0x52/0x71
>
>Code: Bad EIP value.
>
>ksymoops adds this bit:
>
>
>>>EIP; 00000000 Before first symbol
>>>
>
>>>eax; c04b0d20 <ext3_file_inode_operations+0/60>
>>>ebx; fffffff4 <TSS_ESP0_OFFSET+1f0/????>
>>>ecx; d87bcd3c <_end+181db9b8/3fa1cc7c>
>>>edx; d87bcd3c <_end+181db9b8/3fa1cc7c>
>>>esi; ca6466c4 <_end+a065340/3fa1cc7c>
>>>edi; d0f55e00 <_end+10974a7c/3fa1cc7c>
>>>ebp; c9b51f70 <_end+9570bec/3fa1cc7c>
>>>esp; c9b51f08 <_end+9570b84/3fa1cc7c>
>>>
>
>and the next oops is:
>
>Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
>c0142ef0
>*pde = 00000000
>Oops: 0000 [#3]
>CPU: 0
>EIP: 0060:[<c0142ef0>] Not tainted
>EFLAGS: 00010006
>EIP is at kfree+0x30/0x70
>eax: 00140000 ebx: ce0afaa0 ecx: dd7127b0 edx: 00000000
>esi: 00000100 edi: 00000206 ebp: dd7127b0 esp: cad1bf48
>ds: 007b es: 007b ss: 0068
>Process lsof (pid: 18589, threadinfo=cad1a000 task=c40a26a0)
>Stack: 00000000 ce0afab8 ce0afaa0 c8df17a0 dd7127b0 c0178635 00000100 00000000
> c8df17a0 c0178610 dffd61e0 c015951a dd7127b0 c8df17a0 d1389d40 c8df17a0
> d3e403e0 00000000 cad1a000 c015792d c8df17a0 d3e403e0 d3e403e0 c8df17a0
>Call Trace:
> [<c0178635>] seq_release_private+0x25/0x48
> [<c0178610>] seq_release_private+0x0/0x48
> [<c015951a>] __fput+0x12a/0x170
> [<c015792d>] filp_close+0x4d/0x90
> [<c01579cb>] sys_close+0x5b/0x90
> [<c010b379>] sysenter_past_esp+0x52/0x71
>
>Code: 8b 1a 8b 03 3b 43 04 73 18 89 74 83 10 ff 03 57 9d 8b 5c 24
>
>ksymoops adds:
>
>
>>>EIP; c0142ef0 <kfree+30/70> <=====
>>>
>
>>>ebx; ce0afaa0 <_end+dace71c/3fa1cc7c>
>>>ecx; dd7127b0 <_end+1d13142c/3fa1cc7c>
>>>ebp; dd7127b0 <_end+1d13142c/3fa1cc7c>
>>>esp; cad1bf48 <_end+a73abc4/3fa1cc7c>
>>>
>
>Trace; c0178635 <seq_release_private+25/48>
>Trace; c0178610 <seq_release_private+0/48>
>Trace; c015951a <__fput+12a/170>
>Trace; c015792d <filp_close+4d/90>
>Trace; c01579cb <sys_close+5b/90>
>Trace; c010b379 <sysenter_past_esp+52/71>
>
>Code; c0142ef0 <kfree+30/70>
>00000000 <_EIP>:
>Code; c0142ef0 <kfree+30/70> <=====
> 0: 8b 1a mov (%edx),%ebx <=====
>Code; c0142ef2 <kfree+32/70>
> 2: 8b 03 mov (%ebx),%eax
>Code; c0142ef4 <kfree+34/70>
> 4: 3b 43 04 cmp 0x4(%ebx),%eax
>Code; c0142ef7 <kfree+37/70>
> 7: 73 18 jae 21 <_EIP+0x21>
>Code; c0142ef9 <kfree+39/70>
> 9: 89 74 83 10 mov %esi,0x10(%ebx,%eax,4)
>Code; c0142efd <kfree+3d/70>
> d: ff 03 incl (%ebx)
>Code; c0142eff <kfree+3f/70>
> f: 57 push %edi
>Code; c0142f00 <kfree+40/70>
> 10: 9d popf
>Code; c0142f01 <kfree+41/70>
> 11: 8b 5c 24 00 mov 0x0(%esp,1),%ebx
>
>
>I presume there was another Oops: 0000, but it is lost².
>
>-JimC
>
>¹ And I didn’t even know I was running the new scheduler; the
> bk-commits-head email lagged enough behind the push to
> linux.bkbits.net that I received it after booting
> the new kernel.... I guess that answers the
> question of which comes first. ;-/
>
>² Turns out msyslogd(8)’s im_linux(8)’ is not too
> happy w/ 2.5’s lack of /proc/ksyms. [SIGH]
>
>
>
>

2003-07-07 08:22:32

by Andrew Morton

[permalink] [raw]
Subject: Re: anticipatory scheduler merged

"James H. Cloos Jr." <[email protected]> wrote:
>
> ...
> Looks like I got hit by such a bug.

No, these aren't due to the IO scheduler.

> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> 00000000
> *pde = 00000000
> Oops: 0000 [#2]
> CPU: 0
> EIP: 0060:[<00000000>] Not tainted
> EFLAGS: 00010286
> EIP is at 0x0
> eax: c04b0d20 ebx: fffffff4 ecx: d87bcd3c edx: d87bcd3c
> esi: ca6466c4 edi: d0f55e00 ebp: c9b51f70 esp: c9b51f08
> ds: 007b es: 007b ss: 0068
> Process strip (pid: 18461, threadinfo=c9b50000 task=c40a32a0)
> Stack: c01675f6 ca6466c4 d0f55e00 c9b51f70 ffffffd8 d87bcd20 00000242 c9b51f70
> c0167f24 c9b51f78 d87bcd20 c9b51f70 c9b50000 c9b51f78 00000001 00000002
> cad39d60 00000241 00000002 c520e000 c9b50000 c015734b c520e000 00000242
> Call Trace:
> [<c01675f6>] __lookup_hash+0xc6/0xe0
> [<c0167f24>] open_namei+0x334/0x460
> [<c015734b>] filp_open+0x3b/0x70
> [<c015786b>] sys_open+0x5b/0xa0
> [<c010b379>] sysenter_past_esp+0x52/0x71

This one is a null inode->i_op->lookup() pointer. Several people have
reported this.

What filesystem was in use at the time?


> >>eax; c04b0d20 <ext3_file_inode_operations+0/60>

eh? We did a lookup in a regular file?

> Call Trace:
> [<c0178635>] seq_release_private+0x25/0x48
> [<c0178610>] seq_release_private+0x0/0x48
> [<c015951a>] __fput+0x12a/0x170
> [<c015792d>] filp_close+0x4d/0x90
> [<c01579cb>] sys_close+0x5b/0x90
> [<c010b379>] sysenter_past_esp+0x52/0x71

This one's fixed. Here's the patch:



diff -puN net/ipv4/igmp.c~netstat-oops-fix net/ipv4/igmp.c
--- 25/net/ipv4/igmp.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv4/igmp.c Thu Jul 3 12:31:19 2003
@@ -2099,7 +2099,7 @@ struct igmp_mc_iter_state {
struct in_device *in_dev;
};

-#define igmp_mc_seq_private(seq) ((struct igmp_mc_iter_state *)&seq->private)
+#define igmp_mc_seq_private(seq) ((struct igmp_mc_iter_state *)(seq)->private)

static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
{
@@ -2254,7 +2254,7 @@ struct igmp_mcf_iter_state {
struct ip_mc_list *im;
};

-#define igmp_mcf_seq_private(seq) ((struct igmp_mcf_iter_state *)&seq->private)
+#define igmp_mcf_seq_private(seq) ((struct igmp_mcf_iter_state *)(seq)->private)

static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
{
diff -puN net/ipv4/raw.c~netstat-oops-fix net/ipv4/raw.c
--- 25/net/ipv4/raw.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv4/raw.c Thu Jul 3 12:31:19 2003
@@ -687,7 +687,7 @@ struct raw_iter_state {
int bucket;
};

-#define raw_seq_private(seq) ((struct raw_iter_state *)&seq->private)
+#define raw_seq_private(seq) ((struct raw_iter_state *)(seq)->private)

static struct sock *raw_get_first(struct seq_file *seq)
{
diff -puN net/ipv6/anycast.c~netstat-oops-fix net/ipv6/anycast.c
--- 25/net/ipv6/anycast.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv6/anycast.c Thu Jul 3 12:31:19 2003
@@ -441,7 +441,7 @@ struct ac6_iter_state {
struct inet6_dev *idev;
};

-#define ac6_seq_private(seq) ((struct ac6_iter_state *)&seq->private)
+#define ac6_seq_private(seq) ((struct ac6_iter_state *)(seq)->private)

static inline struct ifacaddr6 *ac6_get_first(struct seq_file *seq)
{
diff -puN net/ipv6/ip6_flowlabel.c~netstat-oops-fix net/ipv6/ip6_flowlabel.c
--- 25/net/ipv6/ip6_flowlabel.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv6/ip6_flowlabel.c Thu Jul 3 12:31:19 2003
@@ -559,7 +559,7 @@ struct ip6fl_iter_state {
int bucket;
};

-#define ip6fl_seq_private(seq) ((struct ip6fl_iter_state *)&(seq)->private)
+#define ip6fl_seq_private(seq) ((struct ip6fl_iter_state *)(seq)->private)

static struct ip6_flowlabel *ip6fl_get_first(struct seq_file *seq)
{
diff -puN net/ipv6/mcast.c~netstat-oops-fix net/ipv6/mcast.c
--- 25/net/ipv6/mcast.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv6/mcast.c Thu Jul 3 12:31:19 2003
@@ -2045,7 +2045,7 @@ struct igmp6_mc_iter_state {
struct inet6_dev *idev;
};

-#define igmp6_mc_seq_private(seq) ((struct igmp6_mc_iter_state *)&seq->private)
+#define igmp6_mc_seq_private(seq) ((struct igmp6_mc_iter_state *)(seq)->private)

static inline struct ifmcaddr6 *igmp6_mc_get_first(struct seq_file *seq)
{
@@ -2185,7 +2185,7 @@ struct igmp6_mcf_iter_state {
struct ifmcaddr6 *im;
};

-#define igmp6_mcf_seq_private(seq) ((struct igmp6_mcf_iter_state *)&seq->private)
+#define igmp6_mcf_seq_private(seq) ((struct igmp6_mcf_iter_state *)(seq)->private)

static inline struct ip6_sf_list *igmp6_mcf_get_first(struct seq_file *seq)
{
diff -puN net/ipv6/raw.c~netstat-oops-fix net/ipv6/raw.c
--- 25/net/ipv6/raw.c~netstat-oops-fix Thu Jul 3 12:31:19 2003
+++ 25-akpm/net/ipv6/raw.c Thu Jul 3 12:31:19 2003
@@ -913,7 +913,7 @@ struct raw6_iter_state {
int bucket;
};

-#define raw6_seq_private(seq) ((struct raw6_iter_state *)&seq->private)
+#define raw6_seq_private(seq) ((struct raw6_iter_state *)(seq)->private)

static struct sock *raw6_get_first(struct seq_file *seq)
{

_