2003-07-18 12:15:37

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)


CCed lkml for obvious reasons

On Fri, 18 Jul 2003, Stephan von Krawczynski wrote:

> On Wed, 16 Jul 2003 08:37:51 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> >
> > Stephan, can you reproduce it easily?
>
> Hello,
>
> there is definitely something about it. pre6 froze after 2 days of
> testing. I guess I was unlucky this time with logfiles, no messages
> there. There is something severe. You may call it reproducible, but not
> easily.

Stephan,

What is your workload?

I'll try to reproduce it.


2003-07-18 12:36:10

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Fri, 18 Jul 2003 09:23:10 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

>
> CCed lkml for obvious reasons
>
> On Fri, 18 Jul 2003, Stephan von Krawczynski wrote:
>
> > On Wed, 16 Jul 2003 08:37:51 -0300 (BRT)
> > Marcelo Tosatti <[email protected]> wrote:
> >
> > >
> > > Stephan, can you reproduce it easily?
> >
> > Hello,
> >
> > there is definitely something about it. pre6 froze after 2 days of
> > testing. I guess I was unlucky this time with logfiles, no messages
> > there. There is something severe. You may call it reproducible, but not
> > easily.
>
> Stephan,
>
> What is your workload?
>
> I'll try to reproduce it.

You need heavy NFS action and I/O load. It's the same box I use for
server-scenario tests. 3 GB RAM, SMP, 320 GB RAID5 (3ware), SDLT tape drive, 2
x 1000 TX. In detail:

00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 23)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: ServerWorks: Unknown device 0006 (rev 01)
00:00.3 Host bridge: ServerWorks: Unknown device 0006 (rev 01)
00:02.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
00:03.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
00:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV200 QW [Radeon 7500]
00:05.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:05.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07)
00:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: ServerWorks GCLE Host Bridge
01:02.0 RAID bus controller: 3ware Inc 3ware 7000-series ATA-RAID (rev 01)
01:03.0 Network controller: AVM Audiovisuelles MKTG & Computer System GmbH Fritz!PCI v2.0 ISDN (rev 01)
01:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
02:03.0 SCSI storage controller: Adaptec AIC-7899P U160/m (rev 01)
02:03.1 SCSI storage controller: Adaptec AIC-7899P U160/m (rev 01)

Take several NFS clients and write some GBs to this box (all at the same time),
then copy these files around on the box or tar them. You should see collapses
ranging from the BUG I posted lately up to a complete freeze.
I have continuous CPU load above 2.0, up to about 8.0.

Regards,
Stephan
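
The reproduction recipe above (several clients writing at the same time, then
copying and tarring the files on the server) can be approximated with a small
script. This is only a sketch under assumptions, not the setup used in this
thread: the function names, the writer count and the file sizes are
hypothetical, threads stand in for the NFS clients, and nothing here mounts
NFS, so target_dir would have to sit on the filesystem under test to resemble
the real workload.

```python
import os
import shutil
import tarfile
import tempfile
import threading


def write_file(path, size_bytes, chunk=1 << 20):
    """Write size_bytes of zero-filled data to path in chunk-sized pieces."""
    block = b"\0" * chunk
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])
            remaining -= n


def run_workload(target_dir, writers=4, size_bytes=4 << 20):
    """Concurrent writes, then a copy pass, then a tar pass.

    target_dir must exist and must not already contain "incoming" or
    "copied". Returns the path of the tar archive that was created.
    """
    incoming = os.path.join(target_dir, "incoming")
    copied = os.path.join(target_dir, "copied")
    os.makedirs(incoming)

    # Phase 1: all writers start together (stand-in for the NFS clients).
    threads = []
    for i in range(writers):
        path = os.path.join(incoming, "client%d.dat" % i)
        t = threading.Thread(target=write_file, args=(path, size_bytes))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    # Phase 2: copy the files around on the box.
    shutil.copytree(incoming, copied)

    # Phase 3: tar the copy (to a file here, not to a tape device).
    archive = os.path.join(target_dir, "copied.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(copied, arcname="copied")
    return archive


if __name__ == "__main__":
    # Small default sizes so the sketch finishes quickly in a scratch directory.
    print(run_workload(tempfile.mkdtemp()))
```

The defaults are deliberately tiny. To get anywhere near the load described
above, writers and size_bytes would have to be raised to several GB in total
and the run repeated for many hours.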

2003-07-18 15:06:45

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

>
> I have just started stress testing an 8-way OSDL box to see if I can
> reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
>
> I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> DAC960 (ext3). Let's see what happens.
>
> After lunch I'll keep looking at the oopses. During the morning I only had
> time to set up the OSDL box and start the tests.

On my box it takes about 48 hours before the problem shows. But that may
heavily depend on the box I guess.

Regards,
Stephan

2003-07-18 16:38:03

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)


I have just started stress testing an 8-way OSDL box to see if I can
reproduce the problem. I'm using pre6+axboe's BH_Sync patch.

I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
DAC960 (ext3). Let's see what happens.

After lunch I'll keep looking at the oopses. During the morning I only had
time to set up the OSDL box and start the tests.

On Fri, 18 Jul 2003, Stephan von Krawczynski wrote:

> On Fri, 18 Jul 2003 09:23:10 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> >
> > CCed lkml for obvious reasons
> >
> > On Fri, 18 Jul 2003, Stephan von Krawczynski wrote:
> >
> > > On Wed, 16 Jul 2003 08:37:51 -0300 (BRT)
> > > Marcelo Tosatti <[email protected]> wrote:
> > >
> > > >
> > > > Stephan, can you reproduce it easily?
> > >
> > > Hello,
> > >
> > > there is definitely something about it. pre6 froze after 2 days of
> > > testing. I guess I was unlucky this time with logfiles, no messages
> > > there. There is something severe. You may call it reproducible, but not
> > > easily.
> >
> > Stephan,
> >
> > What is your workload?
> >
> > I'll try to reproduce it.

2003-07-18 17:03:16

by Andrea Arcangeli

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Fri, Jul 18, 2003 at 02:50:33PM +0200, Stephan von Krawczynski wrote:
> You need heavy NFS action and I/O load. It's the same box I use for

I wonder if it can be related to the nfs changes. I also had those nfs
changes in my tree previously, but most of them rejected (i.e. a -R
wouldn't clean it up) so there must be further or slightly different
changes in mainline pre6 compared to 21rc8aa1. It could be only an
editing thing though.

It would be very interesting if you could still reproduce w/o nfs (for
example replacing the nfs transfers temporarily with an rsync, that
would reduce the scope of the problem a lot).

Andrea

2003-07-21 08:34:08

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

>
> I have just started stress testing an 8-way OSDL box to see if I can
> reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
>
> I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> DAC960 (ext3). Let's see what happens.
>
> After lunch I'll keep looking at the oopses. During the morning I only had
> time to set up the OSDL box and start the tests.

Hello Marcelo,

have you seen anything in your tests? My box just froze again after 3 days
during NFS action. This was with pre6, I am switching over to pre7.

Regards,
Stephan


2003-07-21 11:41:30

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)



On Mon, 21 Jul 2003, Stephan von Krawczynski wrote:

> On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> >
> > I have just started stress testing an 8-way OSDL box to see if I can
> > reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
> >
> > I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> > DAC960 (ext3). Let's see what happens.
> >
> > After lunch I'll keep looking at the oopses. During the morning I only had
> > time to set up the OSDL box and start the tests.
>
> Hello Marcelo,
>
> have you seen anything in your tests? My box just froze again after 3 days
> during NFS action. This was with pre6, I am switching over to pre7.

No. I just checked it and the 8-way is alive and well:

bash-2.05a$ uptime
4:53am up 2 days, 18:04, 2 users, load average: 100.57, 96.27, 95.22


Could you try to reproduce the tests with something else other than NFS?
(local disk, SMB, ...) as Andrea suggested?

2003-07-21 14:50:24

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Mon, 21 Jul 2003 10:49:06 +0200
Stephan von Krawczynski <[email protected]> wrote:

> On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> >
> > I have just started stress testing an 8-way OSDL box to see if I can
> > reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
> >
> > I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> > DAC960 (ext3). Let's see what happens.
> >
> > After lunch I'll keep looking at the oopses. During the morning I only had
> > time to set up the OSDL box and start the tests.
>
> Hello Marcelo,
>
> have you seen anything in your tests? My box just froze again after 3 days
> during NFS action. This was with pre6, I am switching over to pre7.

I managed to freeze the pre7 box within these few hours. There was no nfs
involved, only tar-to-tape.
I switched back to 2.4.21 to see if it is still stable.
Is there a possibility that the i/o-scheduler has another flaw somewhere (just
like during mount previously) ...


Regards,
Stephan

2003-07-21 18:02:08

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)



On Mon, 21 Jul 2003, Stephan von Krawczynski wrote:

> On Mon, 21 Jul 2003 10:49:06 +0200
> Stephan von Krawczynski <[email protected]> wrote:
>
> > On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
> > Marcelo Tosatti <[email protected]> wrote:
> >
> > >
> > > I have just started stress testing an 8-way OSDL box to see if I can
> > > reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
> > >
> > > I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> > > DAC960 (ext3). Let's see what happens.
> > >
> > > After lunch I'll keep looking at the oopses. During the morning I only had
> > > time to set up the OSDL box and start the tests.
> >
> > Hello Marcelo,
> >
> > have you seen anything in your tests? My box just froze again after 3 days
> > during NFS action. This was with pre6, I am switching over to pre7.
>
> I managed to freeze the pre7 box within these few hours. There was no nfs
> involved, only tar-to-tape.

You had NMI on, correct? SysRq doesn't work, correct?

> I switched back to 2.4.21 to see if it is still stable. Is there a
> possibility that the i/o-scheduler has another flaw somewhere (just like
> during mount previously) ...

It might be a problem in the IO scheduler, yes.

Let's isolate the problems: if 2.4.21 doesn't lock up, try 2.4.22-pre7
without the drivers/block/ll_rw_blk{.c,.h} changes.

2003-07-21 18:33:16

by Andrea Arcangeli

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Mon, Jul 21, 2003 at 05:05:17PM +0200, Stephan von Krawczynski wrote:
> On Mon, 21 Jul 2003 10:49:06 +0200
> Stephan von Krawczynski <[email protected]> wrote:
>
> > On Fri, 18 Jul 2003 11:14:15 -0300 (BRT)
> > Marcelo Tosatti <[email protected]> wrote:
> >
> > >
> > > I have just started stress testing an 8-way OSDL box to see if I can
> > > reproduce the problem. I'm using pre6+axboe's BH_Sync patch.
> > >
> > > I'm running 50 dbench clients on aic7xxx (ext2) and 50 dbench clients on
> > > DAC960 (ext3). Let's see what happens.
> > >
> > > After lunch I'll keep looking at the oopses. During the morning I only had
> > > time to set up the OSDL box and start the tests.
> >
> > Hello Marcelo,
> >
> > have you seen anything in your tests? My box just froze again after 3 days
> > during NFS action. This was with pre6, I am switching over to pre7.
>
> I managed to freeze the pre7 box within these few hours. There was no nfs
> involved, only tar-to-tape.
> I switched back to 2.4.21 to see if it is still stable.
> Is there a possibility that the i/o-scheduler has another flaw somewhere (just
> like during mount previously) ...

Is it a SCSI tape? Is the tape always involved? There are st.c updates
between 2.4.21 and 22pre7. You can try to back them out.

If only the BKCVS would provide the tags in all files and not only in
the file ChangeSets it would be very easy again to extract all the st.c
updates. What happened to the BKCVS, why aren't the tags present in all
the files anymore? Is it a mistake or intentional?

You should also provide a SYSRQ+P/T of the hang or we can't debug it at
all.

Andrea

2003-07-21 18:54:56

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Mon, 21 Jul 2003 14:23:53 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

> > > Hello Marcelo,
> > >
> > > have you seen anything in your tests? My box just froze again after 3
> > > days during NFS action. This was with pre6, I am switching over to pre7.
> >
> > I managed to freeze the pre7 box within these few hours. There was no nfs
> > involved, only tar-to-tape.
>
> You had NMI on, correct? SysRq doesn't work, correct?

Yes, that's right.

> > I switched back to 2.4.21 to see if it is still stable. Is there a
> > possibility that the i/o-scheduler has another flaw somewhere (just like
> > during mount previously) ...
>
> It might be a problem in the IO scheduler, yes.
>
> Let's isolate the problems: if 2.4.21 doesn't lock up, try 2.4.22-pre7
> without the drivers/block/ll_rw_blk{.c,.h} changes.

I am pretty confident that 2.4.21 does not lock up. I tested it a long time ago
and as far as I remember it had no problems. Anyway, I will re-check to make sure
the box is still OK.

Can you send me the patches to reverse from -pre7 off-list? Just to make sure we
are talking about the same stuff...

Regards,
Stephan

2003-07-21 19:09:57

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Mon, 21 Jul 2003 12:20:33 -0400
Andrea Arcangeli <[email protected]> wrote:

> > I managed to freeze the pre7 box within these few hours. There was no nfs
> > involved, only tar-to-tape.
> > I switched back to 2.4.21 to see if it is still stable.
> > Is there a possibility that the i/o-scheduler has another flaw somewhere
> > (just like during mount previously) ...
>
> is it a scsi tape?

yes.

> Is the tape always involved?

No, I have seen freezes both during NFS-only action and during
tar-to-SCSI-tape.
My feeling is that the freeze (at least in the NFS case) does not happen
during high load but rather when load seems relatively light. Handwaving, one
could say it looks more like an I/O scheduler starvation issue than a breakdown
during high load. Similar to the last issue.

> There are st.c updates
> between 2.4.21 and 22pre7. You can try to back them out.

Hm, which?

> [...]
> You should also provide a SYSRQ+P/T of the hang or we can't debug it at
> all.

Well, I really tried hard to produce something, but have failed so far. If I had
more time I would try a serial console, hoping that it survives long enough to
show at least _something_.
The only thing I could ever see was the BUG in page_alloc from the
beginning of this thread.

Regards,
Stephan

2003-07-21 19:29:12

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)



On Mon, 21 Jul 2003, Stephan von Krawczynski wrote:

> On Mon, 21 Jul 2003 12:20:33 -0400
> Andrea Arcangeli <[email protected]> wrote:
>
> > > I managed to freeze the pre7 box within these few hours. There was no nfs
> > > involved, only tar-to-tape.
> > > I switched back to 2.4.21 to see if it is still stable.
> > > Is there a possibility that the i/o-scheduler has another flaw somewhere
> > > (just like during mount previously) ...
> >
> > is it a scsi tape?
>
> yes.
>
> > Is the tape always involved?
>
> No, I have seen freezes both during NFS-only action and during
> tar-to-SCSI-tape.
> My feeling is that the freeze (at least in the NFS case) does not happen
> during high load but rather when load seems relatively light. Handwaving, one
> could say it looks more like an I/O scheduler starvation issue than a breakdown
> during high load. Similar to the last issue.
>
> > There are st.c updates
> > between 2.4.21 and 22pre7. You can try to back them out.
>
> Hm, which?
>
> > [...]
> > You should also provide a SYSRQ+P/T of the hang or we can't debug it at
> > all.
>
> Well, I really tried hard to produce something, but have failed so far. If I had
> more time I would try a serial console, hoping that it survives long enough to
> show at least _something_.
> The only thing I could ever see was the BUG in page_alloc from the
> beginning of this thread.

Stephan,

I'm sending you the scsi tape driver changes in 2.4.22-pre so you can
revert them (in private in a few minutes).

If that doesn't help us spot the problem, can you PLEASE find out in which
-pre the problem starts?

Thank you

2003-07-21 19:57:45

by Stephan von Krawczynski

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)

On Mon, 21 Jul 2003 16:40:27 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

> If that doesn't help us spot the problem, can you PLEASE find out in which
> -pre the problem starts?

Right away I can tell you there was no problem up to the pre that did not boot
on my box. I think it was pre3, right? Meaning pre1 and pre2 work.

pre5 was the first one that booted again, and the first that I can tell has the
problem.

I can "port" the mini-patch from Chris back to pre3 and try that one as the next
step...

Regards,
Stephan
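
The narrowing-down requested above can be treated as a bisection over the
-pre releases in which releases that cannot be tested (for example, ones that
do not boot) are stepped around. A minimal sketch follows. The function name
narrow and the "good"/"bad"/"skip" labels are hypothetical. The statuses in
the example follow the reports in this thread, with two assumptions: pre4 is
taken not to boot because pre5 is described as the first one that booted
again, and 2.4.21 is taken as good although it was still being re-checked.

```python
def narrow(versions, test):
    """Bisect versions (ordered oldest to newest) with untestable entries.

    test(version) must return "good", "bad" or "skip".
    Assumes versions[0] is good and versions[-1] is bad.
    Returns (last_known_good, first_known_bad); if only "skip" entries
    remain between the two, the gap cannot be narrowed any further.
    """
    lo, hi = 0, len(versions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Try the midpoint first, then its neighbours, nearest first.
        candidates = sorted(range(lo + 1, hi), key=lambda i: abs(i - mid))
        for i in candidates:
            result = test(versions[i])
            if result != "skip":
                break
        else:
            break  # everything between lo and hi is untestable
        if result == "good":
            lo = i
        else:
            hi = i
    return versions[lo], versions[hi]


# Statuses as reported in the thread (pre4 and 2.4.21 are assumptions).
reported = {
    "2.4.21": "good",
    "pre1": "good",
    "pre2": "good",
    "pre3": "skip",
    "pre4": "skip",
    "pre5": "bad",
    "pre6": "bad",
    "pre7": "bad",
}
versions = list(reported)
print(narrow(versions, reported.get))  # ('pre2', 'pre5')
```

With these inputs the search cannot get past the pre2..pre5 gap on its own,
which is why back-porting the boot fix to pre3 is the natural next step.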

2003-07-21 20:54:34

by Marcelo Tosatti

Subject: Re: Bug Report: 2.4.22-pre5: BUG in page_alloc (fwd)


Just FYI, the 8-way box has been running for three days with LOTS of IO and
memory pressure:

hostname: dev8-005 (dev8-005.pdx.osdl.net) running linux

bash-2.05a$ uptime
2:03pm up 3 days, 3:14, 2 users, load average: 82.48, 91.67, 94.29
bash-2.05a$ vmstat 2
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1 77  2   3436   8232  77288 885880   0   0     3    12   13    16   4   9   8
 0 78  3   3436   7300  77448 886596   0   0   108 12184  619   448   0   9  90
 0 78  2   3436  11472  77760 880692   0   0   400 22922  836  2497   2  33  65
 0 77  2   3428   7292  78176 884640   6   0   414  7858  761   511   0  11  88
 0 77  3   3428   7392  78348 884776   0   0   238  9942  687   449   0   9  91
....


Interactivity under these extreme circumstances is impressive. Very good.

Great work Andrea, Mason and Jens. Thanks.


On Mon, 21 Jul 2003, Stephan von Krawczynski wrote:

> On Mon, 21 Jul 2003 12:20:33 -0400
> Andrea Arcangeli <[email protected]> wrote:
>
> > > I managed to freeze the pre7 box within these few hours. There was no nfs
> > > involved, only tar-to-tape.
> > > I switched back to 2.4.21 to see if it is still stable.
> > > Is there a possibility that the i/o-scheduler has another flaw somewhere
> > > (just like during mount previously) ...
> >
> > is it a scsi tape?
>
> yes.
>
> > Is the tape always involved?
>
> No, I have seen freezes both during NFS-only action and during
> tar-to-SCSI-tape.
> My feeling is that the freeze (at least in the NFS case) does not happen
> during high load but rather when load seems relatively light. Handwaving, one
> could say it looks more like an I/O scheduler starvation issue than a breakdown
> during high load. Similar to the last issue.
>
> > There are st.c updates
> > between 2.4.21 and 22pre7. You can try to back them out.
>
> Hm, which?
>
> > [...]
> > You should also provide a SYSRQ+P/T of the hang or we can't debug it at
> > all.
>
> Well, I really tried hard to produce something, but have failed so far. If I had
> more time I would try a serial console, hoping that it survives long enough to
> show at least _something_.
> The only thing I could ever see was the BUG in page_alloc from the
> beginning of this thread.
>
> Regards,
> Stephan
>