Good morning Ben,
I just tried your test program with 2.4.13, 2 Gig, and it ran without
problems. Could you try that over there and see if you get the same result?
If it does run, the next move would be to check with 3.5 Gig.
Regards,
Daniel
On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> I just tried your test program with 2.4.13, 2 Gig, and it ran without
> problems. Could you try that over there and see if you get the same result?
> If it does run, the next move would be to check with 3.5 Gig.
Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
have 2 Gig here I can't reproduce that (yet).
--
Daniel
On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > I just tried your test program with 2.4.13, 2 Gig, and it ran without
> > problems. Could you try that over there and see if you get the same result?
> > If it does run, the next move would be to check with 3.5 Gig.
>
> Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
> that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
> have 2 Gig here I can't reproduce that (yet).
are you sure it isn't an oom condition? can you reproduce on
2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
much mlocked memory.
Andrea
On Wed, 31 Oct 2001, Daniel Phillips wrote:
> On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > I just tried your test program with 2.4.13, 2 Gig, and it ran without
> > problems. Could you try that over there and see if you get the same result?
> > If it does run, the next move would be to check with 3.5 Gig.
>
> Ben reports that his test with 2 Gig memory runs fine, as it does for
> me, but that it locks up tight with 3.5 Gig, requiring power cycle.
> Since I only have 2 Gig here I can't reproduce that (yet).
Does it lock up if your low memory is reduced to 512 MB ?
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
On October 31, 2001 09:48 pm, Rik van Riel wrote:
> On Wed, 31 Oct 2001, Daniel Phillips wrote:
> > On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > > I just tried your test program with 2.4.13, 2 Gig, and it ran without
> > > problems. Could you try that over there and see if you get the same result?
> > > If it does run, the next move would be to check with 3.5 Gig.
> >
> > Ben reports that his test with 2 Gig memory runs fine, as it does for
> > me, but that it locks up tight with 3.5 Gig, requiring power cycle.
> > Since I only have 2 Gig here I can't reproduce that (yet).
>
> Does it lock up if your low memory is reduced to 512 MB ?
Ben?
--
Daniel
On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> > On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > > I just tried your test program with 2.4.13, 2 Gig, and it ran without
> > > problems. Could you try that over there and see if you get the same result?
> > > If it does run, the next move would be to check with 3.5 Gig.
> >
> > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
> > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
> > have 2 Gig here I can't reproduce that (yet).
>
> are you sure it isn't an oom condition. can you reproduce on
> 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
> much mlocked memory.
I don't know, I can't reproduce it here, I don't have enough memory. Ben?
--
Daniel
On Wed, 31 Oct 2001, Daniel Phillips wrote:
> On October 31, 2001 09:48 pm, Rik van Riel wrote:
> > On Wed, 31 Oct 2001, Daniel Phillips wrote:
> > > Ben reports that his test with 2 Gig memory runs fine, as it does for
> > > me, but that it locks up tight with 3.5 Gig, requiring power cycle.
> > > Since I only have 2 Gig here I can't reproduce that (yet).
> >
> > Does it lock up if your low memory is reduced to 512 MB ?
>
> Ben?
Nonono, I mean that if _you_ reduce low memory to 512MB
on your 2GB machine, maybe you can reproduce the problem
more easily.
If the Google people try this with larger machines, it'll
almost certainly make triggering the bug even easier ;)
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
On Oct 31, 2001 22:03 +0100, Daniel Phillips wrote:
> Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
> that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
> have 2 Gig here I can't reproduce that (yet).
Sadly, I bought some memory yesterday, and it was only U$30 for 256MB
DIMMs, so $120/GB if you have enough slots. Not that I'm suggesting
you go out and buy more memory, Daniel, as you probably have your slots
filled with 2GB, and larger sticks are still a bit more expensive.
The only thing that bugs me about the low memory price is that Windows
XP recommends at least 128MB for a workable system. A year or two ago
that would have been considered a bloated pig, but now they are giving
away 128MB DIMMs with a purchase of XP. Sad, really. Maybe M$ is
subsidizing the chipmakers to make RAM cheap so XP can run on people's
computers ;-)? What else would you do with U$50 billion in cash (or
whatever) that M$ has?
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
> On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
>
>>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
>>
>>>On October 31, 2001 07:06 pm, Daniel Phillips wrote:
>>>
>>>>I just tried your test program with 2.4.13, 2 Gig, and it ran
>>>>without problems. Could you try that over there and see if you
>>>>get the same result? If it does run, the next move would be to
>>>>check with 3.5 Gig.
>>>>
>>>Ben reports that his test with 2 Gig memory runs fine, as it does
>>>for me, but that it locks up tight with 3.5 Gig, requiring power
>>>cycle. Since I only have 2 Gig here I can't reproduce that (yet).
>>>
>>are you sure it isn't an oom condition. can you reproduce on
>>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with
>>too much mlocked memory.
>>
>
> I don't know, I can't reproduce it here, I don't have enough memory.
> Ben?
My test application gets killed (I believe by the oom handler). dmesg
complains about a lot of 0-order allocation failures. For this test,
I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
- Ben
Ben Smith
Google, Inc
On Thu, Nov 01, 2001 at 01:19:15AM +0100, Daniel Phillips wrote:
> If it does turn out to be oom, it's still a bug, right?
The testcase I checked a few weeks ago looked correct, so whatever it
is, it should be a kernel bug.
Andrea
On Wed, Oct 31, 2001 at 02:12:00PM -0800, Ben Smith wrote:
> My test application gets killed (I believe by the oom handler). dmesg
> complains about a lot of 0-order allocation failures. For this test,
> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
Interesting, now we need to find out if the problem is the allocator in
2.4.14pre5aa1 that fails too early by mistake, or if this is a true oom
condition. I tend to think it's a true oom condition since mainline
deadlocked under the same workload where -aa correctly killed the task.
Can you provide also a 'vmstat 1' trace of the last 20/30 seconds before
the task gets killed?
A true oom condition could be caused by a memleak in mlock or something
like that (or of course it could be a bug in the userspace testcase, but
I checked the testcase a few weeks ago and I didn't find anything wrong
in it).
Andrea
On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> > On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > > I just tried your test program with 2.4.13, 2 Gig, and it ran without
> > > problems. Could you try that over there and see if you get the same result?
> > > If it does run, the next move would be to check with 3.5 Gig.
> >
> > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
> > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
> > have 2 Gig here I can't reproduce that (yet).
>
> are you sure it isn't an oom condition.
The way the test code works is that it keeps mlocking more blocks of memory
until one of the mlocks fails, and then it does the rest of its work with that
many blocks of memory. It's hard to see how we could get a legitimate oom with
that strategy.
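For reference, here is a minimal userspace sketch of that probing strategy (my own
reconstruction, not Ben's actual test code; the block size, block count and the use
of anonymous memory are placeholders):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BLOCK_SIZE (400UL << 20)	/* placeholder block size */
#define MAX_BLOCKS 16

int main(void)
{
	void *block[MAX_BLOCKS];
	int n = 0;

	/* Keep grabbing and mlocking blocks until an mlock fails... */
	while (n < MAX_BLOCKS) {
		void *p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
		if (mlock(p, BLOCK_SIZE) != 0) {
			/* ...then back off and work with the blocks we got. */
			munmap(p, BLOCK_SIZE);
			break;
		}
		block[n++] = p;
	}
	printf("locked %d blocks\n", n);
	/* the real test would now do its work against these blocks */
	return 0;
}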
> can you reproduce on
> 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
> much mlocked memory.
OK, he tried it with pre5aa1:
ben> My test application gets killed (I believe by the oom handler). dmesg
ben> complains about a lot of 0-order allocation failures. For this test,
ben> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
*Just in case* it's oom-related I've asked Ben to try it with one less than
the maximum number of memory blocks he can allocate.
If it does turn out to be oom, it's still a bug, right?
--
Daniel
> *Just in case* it's oom-related I've asked Ben to try it with one less than
> the maximum number of memory blocks he can allocate.
I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
and it has the same behavior (my app gets killed, 0-order allocation
failures, and the system stays up).
- Ben
Ben Smith
Google, Inc
On Wed, 31 Oct 2001, Ben Smith wrote:
> > *Just in case* it's oom-related I've asked Ben to try it with one less than
> > the maximum number of memory blocks he can allocate.
>
> I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
> and it has the same behavior (my app gets killed, 0-order allocation
> failures, and the system stays up.
If you still have swap free at the point where the process
gets killed, or if the memory is file-backed, then we are
positive it's a kernel bug.
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
>>>*Just in case* it's oom-related I've asked Ben to try it with one less than
>>>the maximum number of memory blocks he can allocate.
>>>
>>I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
>>and it has the same behavior (my app gets killed, 0-order allocation
>>failures, and the system stays up.
>>
>
> If you still have swap free at the point where the process
> gets killed, or if the memory is file-backed, then we are
> positive it's a kernel bug.
This machine is configured without a swap file. The memory is file-backed,
though (a read-only mmap, followed by an mlock).
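A minimal sketch of that mapping mode (my reconstruction, not Google's code; the
path handling and the MAP_SHARED flag are assumptions):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a data file read-only and pin it with mlock().  The pages are
 * file-backed and clean, which is why Rik's point applies: even with no
 * swap configured they should be reclaimable once unlocked. */
static void *map_and_lock(const char *path, size_t *len)
{
	struct stat st;
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return NULL;
	}
	*len = st.st_size;
	p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close */
	if (p == MAP_FAILED)
		return NULL;
	if (mlock(p, *len) != 0) {
		munmap(p, *len);
		return NULL;
	}
	return p;
}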
- Ben
Ben Smith
Google, Inc
On Wed, Oct 31, 2001 at 05:55:25PM -0800, Ben Smith wrote:
> >>>*Just in case* it's oom-related I've asked Ben to try it with one less than
> >>>the maximum number of memory blocks he can allocate.
> >>>
> >>I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
> >>and it has the same behavior (my app gets killed, 0-order allocation
> >>failures, and the system stays up.
> >>
> >
> > If you still have swap free at the point where the process
> > gets killed, or if the memory is file-backed, then we are
> > positive it's a kernel bug.
>
> This machine is configured without a swap file. The memory is file backed,
ok, fine on this side. so again, what's happening is the equivalent of
mlock leaving those mappings locked. It seems the previous mlock is
preventing the cache from being released. Otherwise I don't see why the
kernel wouldn't release the cache correctly. So it could be an mlock
bug in the kernel.
Andrea
On October 31, 2001 10:53 pm, Andreas Dilger wrote:
> On Oct 31, 2001 22:03 +0100, Daniel Phillips wrote:
> > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but
> > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only
> > have 2 Gig here I can't reproduce that (yet).
>
> Sadly, I bought some memory yesterday, and it was only U$30 for 256MB
> DIMMs, so $120/GB if you have enough slots. Not that I'm suggesting
> you go out and but more memory Daniel, as you probably have your slots
> filled with 2GB, and larger sticks are still bit more expesive.
You're not kidding. Just FYI, four 1GB sticks for this machine will set you
back a kilobuck. (PC/133 Registered SDRAM 72-bit ECC, 168-pin gold-plated
DIMM)
--
Daniel
Hi,
We have a couple of Dells with 4G of memory. We have been
experiencing the same google problems. My boss has asked me to roll
one of them back to a 2.2 kernel; we will live with 2G of memory for
now while I help out with testing mmap problems. I am, however,
having problems compiling the 2.2.19 kernel with the
linux-2.2.19-reiserfs-3.5.34-patch.bz2 patch. I get the following
error:
ld -m elf_i386 -T /home/sven/linux-build/linux-2.2.19-reiserfs/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \
--start-group \
arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \
fs/filesystems.a \
net/network.a \
drivers/block/block.a drivers/char/char.o drivers/misc/misc.a drivers/net/net.a drivers/scsi/scsi.a drivers/pci/pci.a drivers/video/video.a \
/home/sven/linux-build/linux-2.2.19-reiserfs/arch/i386/lib/lib.a /home/sven/linux-build/linux-2.2.19-reiserfs/lib/lib.a /home/sven/linux-build/linux-2.2.19-reiserfs/arch/i386/lib/lib.a \
--end-group \
-o vmlinux
fs/filesystems.a(reiserfs.o): In function `ip_check_balance':
reiserfs.o(.text+0x9cc2): undefined reference to `memset'
drivers/scsi/scsi.a(aic7xxx.o): In function `aic7xxx_load_seeprom':
aic7xxx.o(.text+0x117ff): undefined reference to `memcpy'
make: *** [vmlinux] Error 1
I tried the 2.2.20-pre12 patch as well but got the same (or similar)
results.
After I get this 2.2.19 system stable, my boss says fixing the google
bug is my "top priority". Unfortunately this will only likely be the
second time I dig into linux source code, so I expect it will be mostly
me testing other people's patches. But I will try my best.
Here is info from my 2.2.19 system as asked for in the REPORTING-BUGS
file:
This is a Red Hat 7.1 system.
$ source scripts/ver_linux
Linux ps1.web.nj.nec.com 2.4.9-marcelo-patch #10 SMP Wed Aug 22 15:13:48 EDT 2001 i686 unknown
Gnu C 2.96
Gnu make 3.79.1
binutils 2.10.91.0.2
util-linux 2.11b
modutils 2.4.2
e2fsprogs 1.19
reiserfsprogs 3.x.0f
Linux C Library 2.2.2
Dynamic linker (ldd) 2.2.2
Procps 2.0.7
Net-tools 1.57
Console-tools 0.3.3
Sh-utils 2.0
Modules Loaded autofs eepro100 md
The Marcelo patch is from the last time I stuck my nose into the kernel
with a highmem patch. Checking my diff against the 2.4.13 kernel, the
stuff is nearly the same.
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 993.400
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1979.18
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 993.400
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1985.74
$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: SEAGATE Model: ST173404LC Rev: 0004
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 06 Lun: 00
Vendor: DELL Model: 1x6 U2W SCSI BP Rev: 5.35
Type: Processor ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 05 Lun: 00
Vendor: NEC Model: CD-ROM DRIVE:466 Rev: 1.06
Type: CD-ROM ANSI SCSI revision: 02
bash-2.04$ lsmod
Module Size Used by
autofs 11920 1 (autoclean)
eepro100 17184 1 (autoclean)
md 43616 0 (unused)
On Thu, 1 Nov 2001 11:56:04 -0500 (EST),
Sven Heinicke <[email protected]> wrote:
>fs/filesystems.a(reiserfs.o): In function `ip_check_balance':
>reiserfs.o(.text+0x9cc2): undefined reference to `memset'
>drivers/scsi/scsi.a(aic7xxx.o): In function `aic7xxx_load_seeprom':
>aic7xxx.o(.text+0x117ff): undefined reference to `memcpy'
The aic7xxx reference to memcpy is a gcc feature. If you do an
assignment of a complete structure then gcc may convert that into a
call to memcpy(). Alas gcc does the conversion using the "standard"
version of memcpy, not the "optimized by cpp" version that the kernel
uses. Try this patch
Index: 19.1/drivers/scsi/aic7xxx.c
--- 19.1/drivers/scsi/aic7xxx.c Tue, 13 Feb 2001 08:26:08 +1100 kaos (linux-2.2/d/b/43_aic7xxx.c 1.1.1.3.2.1.3.1.1.3 644)
+++ 19.1(w)/drivers/scsi/aic7xxx.c Fri, 02 Nov 2001 09:36:49 +1100 kaos (linux-2.2/d/b/43_aic7xxx.c 1.1.1.3.2.1.3.1.1.3 644)
@@ -9190,7 +9190,7 @@ aic7xxx_load_seeprom(struct aic7xxx_host
p->flags |= AHC_TERM_ENB_SE_LOW | AHC_TERM_ENB_SE_HIGH;
}
}
- p->sc = *sc;
+ memcpy(&(p->sc), sc, sizeof(p->sc));
}
p->discenable = 0;
Cannot help with the reiserfs problem, the code is not in the pristine
2.2.19 tree.
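For illustration, here is a minimal example of the kind of whole-struct assignment
gcc may lower into a memcpy() call; the struct here is just a stand-in, not the
real aic7xxx seeprom type:

struct sc_blob {
	unsigned char bytes[252];	/* stand-in for the real seeprom data */
};

void copy_blob(struct sc_blob *dst, const struct sc_blob *src)
{
	/* For a struct this large, gcc may emit a call to an out-of-line
	 * memcpy() for the assignment below instead of copying inline.
	 * That call then fails to link against a 2.2 kernel that only
	 * provides its own memcpy.  Spelling out
	 * memcpy(dst, src, sizeof(*dst)), as the patch above does for
	 * p->sc, picks up the kernel's version instead. */
	*dst = *src;
}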
Ben Smith writes:
> > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> >
> >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> >>
> >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> >>>
> >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran
> >>>>without problems. Could you try that over there and see if you
> >>>>get the same result? If it does run, the next move would be to
> >>>>check with 3.5 Gig.
> >>>>
> >>>Ben reports that his test with 2 Gig memory runs fine, as it does
> >>>for me, but that it locks up tight with 3.5 Gig, requiring power
> >>>cycle. Since I only have 2 Gig here I can't reproduce that (yet).
> >>>
> >>are you sure it isn't an oom condition. can you reproduce on
> >>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with
> >>too much mlocked memory.
> >>
> >
> > I don't know, I can't reproduce it here, I don't have enough memory.
> > Ben?
>
> My test application gets killed (I believe by the oom handler). dmesg
> complains about a lot of 0-order allocation failures. For this test,
> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
> - Ben
>
> Ben Smith
> Google, Inc
>
This is a system with 4G of memory and regular swap, with two Pentium
III 1GHz processors.
On 2.4.14-pre6aa1 it happily runs until:
munmap'ed 7317d000
Loading data at 7317d000 for slot 2
Load (/mnt/sdb/sven/chunk10) succeeded!
mlocking slot 2, 7317d000
mlocking at 7317d000 of size 1048576
Connection to hera closed by remote host.
Connection to hera closed.
This is where it kills my ssh and other programs, and fills my /var/log/messages
with:
Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov 2 11:29:07 ps2 syslogd: select: Cannot allocate memory
Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
Nov 2 11:29:07 ps2 last message repeated 2 times
a bunch of times. Then it doesn't free the mmapped memory until the file
system is unmounted. It never starts going into swap.
2.4.14-pre5aa1 does about the same.
Sven
On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote:
> a bunch of times. Then doesn't free the mmaped memory until file
> system is unmounted. It never starts going into swap.
thanks for testing. This matches the idea that those pages don't want
to be unmapped for whatever reason (and because there's an mlock in our
way at the moment I'd tend to point my finger in that direction rather
than at the VM). I'll look more closely into this testcase
shortly.
Andrea
On November 2, 2001 06:51 pm, Sven Heinicke wrote:
> Ben Smith writes:
> > > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> > >
> > >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> > >>
> > >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > >>>
> > >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran
> > >>>>without problems. Could you try that over there and see if you
> > >>>>get the same result? If it does run, the next move would be to
> > >>>>check with 3.5 Gig.
> > >>>>
> > >>>Ben reports that his test with 2 Gig memory runs fine, as it does
> > >>>for me, but that it locks up tight with 3.5 Gig, requiring power
> > >>>cycle. Since I only have 2 Gig here I can't reproduce that (yet).
> > >>>
> > >>are you sure it isn't an oom condition. can you reproduce on
> > >>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with
> > >>too much mlocked memory.
> > >>
> > >
> > > I don't know, I can't reproduce it here, I don't have enough memory.
> > > Ben?
> >
> > My test application gets killed (I believe by the oom handler). dmesg
> > complains about a lot of 0-order allocation failures. For this test,
> > I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
> > - Ben
> >
> > Ben Smith
> > Google, Inc
> >
>
>
> This is a System with 4G of memory and regular swap. With 2 Pentium
> III 1Ghz processors.
>
> On 2.4.14-pre6aa1 it happily runs until:
>
> munmap'ed 7317d000
> Loading data at 7317d000 for slot 2
> Load (/mnt/sdb/sven/chunk10) succeeded!
> mlocking slot 2, 7317d000
> mlocking at 7317d000 of size 1048576
> Connection to hera closed by remote host.
> Connection to hera closed.
>
> Where is kills my ssh and other programs. fills my /var/log/messages
> with:
>
> Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> Nov 2 11:29:07 ps2 syslogd: select: Cannot allocate memory
> Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
> Nov 2 11:29:07 ps2 last message repeated 2 times
>
> a bunch of times. Then doesn't free the mmaped memory until file
> system is unmounted.
Not freeing the memory is expected and normal. The previously-mlocked file
data remains cached in that memory, and even though it's not free, it's
'easily freeable', so there's no smoking gun there. The reason the memory is
freed on umount is that there's no possibility the file data can be
referenced again, so it makes sense to free it up immediately.
On the other hand, the 0-order failures and oom-kills indicate a genuine bug.
> It never starts going into swap.
>
> 2.4.14-pre5aa1 does about the same.
--
Daniel
On November 2, 2001 07:00 pm, Andrea Arcangeli wrote:
> On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote:
> > a bunch of times. Then doesn't free the mmaped memory until file
> > system is unmounted. It never starts going into swap.
>
> thanks for testing. This matches the idea that those pages doesn't want
> to be unmapped for whatever reason (and because there's an mlock in our
> way at the moment I'd tend to point my finger in that direction rather
> than into the vm direction). I'll look more closely into this testcase
> shortly.
The mlock handling looks dead simple:
vmscan.c
227 if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
228 return count;
It's hard to see how that could be wrong. Plus, this test program does run
under 2.4.9, it just uses way too much CPU on that kernel. So I'd say mm
bug.
--
Daniel
On November 2, 2001 07:48 pm, Sven Heinicke wrote:
> > Not freeing the memory is expected and normal. The previously-mlocked file
> > data remains cached in that memory, and even though it's not free, it's
> > 'easily freeable' so there's no smoking gun there. The reason the memory is
> > freed on umount is, there's no possibility that that file data can be
> > referenced again and it makes sense to free it up immediately.
>
> That's cool and all, but how do I free up the memory w/o unmounting the
> partition?
You don't, that's the mm's job. It tries to do it at the last minute, when
it's sure the memory is needed for something more important.
> Also, I just tried 2.4.14-pre7. It acted the same way as 2.4.13 does,
> requiring the reset key to continue.
--
Daniel
> Not freeing the memory is expected and normal. The previously-mlocked file
> data remains cached in that memory, and even though it's not free, it's
> 'easily freeable' so there's no smoking gun there. The reason the memory is
> freed on umount is, there's no possibility that that file data can be
> referenced again and it makes sense to free it up immediately.
That's cool and all, but how do I free up the memory w/o unmounting the
partition?
Also, I just tried 2.4.14-pre7. It acted the same way as 2.4.13 does,
requiring the reset key to continue.
Sven
In article <[email protected]>,
Daniel Phillips <[email protected]> wrote:
>
>It's hard to see how that could be wrong. Plus, this test program does run
>under 2.4.9, it just uses way too much CPU on that kernel. So I'd say mm
>bug.
So how much memory is mlocked?
The locked memory will stay in the inactive list (it won't even ever be
activated, because we don't bother even scanning the mapped locked
regions), and the inactive list fills up with pages that are completely
worthless.
And the kernel will decide that because most of the unfreeable pages are
mapped, it needs to do VM scanning, which obviously doesn't help.
Why _does_ this thing do mlock, anyway? What's the point? And how much
does it try to lock?
If root wants to shoot himself in the head by mlocking all of memory,
that's not a VM problem, that's a stupid administrator problem.
Linus
On November 2, 2001 09:27 pm, Linus Torvalds wrote:
> In article <[email protected]>,
> Daniel Phillips <[email protected]> wrote:
> >
> >It's hard to see how that could be wrong. Plus, this test program does
> >run under 2.4.9, it just uses way too much CPU on that kernel. So I'd say
> >mm bug.
>
> So how much memory is mlocked?
I'm not sure exactly, I didn't run the test. I *think* it's just over 50% of
physical memory.
> The locked memory will stay in the inactive list (it won't even ever be
> activated, because we don't bother even scanning the mapped locked
> regions), and the inactive list fills up with pages that are completely
> worthless.
Yes, it does various things on various VMs. On 2.4.9 it stays on the
inactive list until free memory gets down to rock bottom, then most of it
moves to the active list and the system reaches a steady state where it can
operate, though with kswapd grabbing 99% CPU (two-processor system), and the
test does complete. On the current kernel it dies.
> And the kernel will decide that because most of the unfreeable pages are
> mapped, it needs to do VM scanning, which obviously doesn't help.
>
> Why _does_ this thing do mlock, anyway? What's the point? And how much
> does it try to lock?
It's how the Google database engine works; it keeps latency down by mapping
big database files into memory. I didn't get more information than that
about the application.
> If root wants to shoot himself in the head by mlocking all of memory,
> that's not a VM problem, that's a stupid administrator problem.
In the tests I did, it was about 1 gig out of 2. I'm not sure how much
memory is mlocked in the 3.5 Gig test, the one that's failing, but it's
certainly not anything like all of memory. Really, we should be able to
mlock 90%+ of memory without falling over.
--
Daniel
> So how much memory is mlocked?
In the 3.5G case, we lock 4 blocks (4 * 427683520 bytes, or about 1631 MB).
There is code in the kernel that prevents more than 1/2 of all physical
pages from being mlocked:
mlock.c:215-218: (in do_mlock)
	/* we may lock at most half of physical memory... */
	/* (this check is pretty bogus, but doesn't hurt) */
	if (locked > num_physpages/2)
		goto out;
For 2.2 we have a patch that increases this to 90% or 60M, but we
don't use this patch on 2.4 yet.
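(A hypothetical sketch of the sort of one-line change such a patch would make,
purely for illustration; this is not the actual Google patch, and the 90%
expression is my own:)

	/* hypothetical relaxed limit: allow roughly 90% of physical pages
	 * to be locked instead of half (illustration only) */
	if (locked > (num_physpages / 10) * 9)
		goto out;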
> Why _does_ this thing do mlock, anyway? What's the point? And how much
> does it try to lock?
Latency. We know exactly what data should remain in memory, so we're
trying to prevent the vm from paging out the wrong data. It makes a huge
difference in performance.
- Ben
Ben Smith
Google, Inc.
On Fri, 2 Nov 2001, Linus Torvalds wrote:
> If root wants to shoot himself in the head by mlocking all of memory,
> that's not a VM problem, that's a stupid administrator problem.
The kernel limits the amount of mlock()d memory to
50% of RAM, so we _should_ be ok.
(yes, this limit is per process, but daniel only
has one process running anyway)
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
[ Slightly updated version of earlier private email ]
On Fri, 2 Nov 2001, Daniel Phillips wrote:
>
> Yes, it does various things on various vms. On 2.4.9 it stays on the
> inactive list until free memory gets down to rock bottom, then most of it
> moves to the active list and the system reaches a steady state where it can
> operate, though with kswapd grabbing 99% CPU (two processor system), but the
> test does complete. On the current kernel it dies.
On the 2.4.9 kernel, the "active" list is completely and utterly misnamed.
We move random pages to the active list, for random reasons. One of the
random reasons we have is "this page is mapped". Which has nothing to do
with activeness. The "active" list might as well have been called
"random_list_two".
In the new VM, only _active_ pages get moved to the active list. So the
mlocked pages will stay on the inactive list until somebody says they are
active. And right now nobody will ever say that they are active, because
we don't even scan the locked areas.
And the advantage of the non-random approach is that in the new VM, we can
_use_ the knowledge that the inactive list has filled up with mapped pages
to make a _useful_ decision: we decide that we need to start scanning the
VM tree and try to remove pages from the mappings.
Notice? No more "random decisions". We have a well-defined point where we
can say "Ok, our inactive list seems to be mostly mapped, so let's try to
unmap something".
In short, 2.4.9 handles the test because it does everything else wrong.
While 2.4.13 doesn't handle the test well, because the VM says "there's a
_lot_ of inactive mapped pages, I need to _do_ something about it". And
then vmscanning doesn't actually do anything.
Suggested patch appended.
> In the tests I did, it was about 1 gig out of 2. I'm not sure how much
> memory is mlocked in the 3.5 Gig test the one that's failing, but it's
> certainly not anything like all of memory. Really, we should be able to
> mlock 90%+ of memory without falling over.
Not a way in hell, for many reasons, and none of them have anything to do
with this particular problem.
The most _trivial_ reason is that if you lock more than 900MB of memory,
that locked area may well be all of the lowmem pages, and you're now
screwed forever. Dead, dead, dead.
(And I can come up with loads that do exactly the above: it's easy enough
to try to first allocate up all of highmem, and then do a mlock and try to
allocate up all of lowmem locked. It's even easier if you use loopback or
something that only wants to allocate lowmem in the first place).
In short, we MUST NOT mlock more than maybe 500MB _tops_ on intel. If we
ever do, our survival is pretty random, regardless of other VM issues.
The appended patch should fix the unintentional problem, though.
Linus
----
diff -u --recursive --new-file penguin/linux/mm/vmscan.c linux/mm/vmscan.c
--- penguin/linux/mm/vmscan.c Thu Nov 1 17:59:12 2001
+++ linux/mm/vmscan.c Fri Nov 2 13:10:58 2001
@@ -49,7 +49,7 @@
swp_entry_t entry;
/* Don't look at this pte if it's been accessed recently. */
- if (ptep_test_and_clear_young(page_table)) {
+ if ((vma->vm_flags & VM_LOCKED) || ptep_test_and_clear_young(page_table)) {
mark_page_accessed(page);
return 0;
}
@@ -220,8 +220,8 @@
pgd_t *pgdir;
unsigned long end;
- /* Don't swap out areas which are locked down */
- if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
+ /* Don't swap out areas which are reserved */
+ if (vma->vm_flags & VM_RESERVED)
return count;
pgdir = pgd_offset(mm, address);
In article <[email protected]>, Ben Smith <[email protected]> wrote:
>
>For 2.2 we have a patch that increases this to 90% or 60M, but we
>don't use this patch on 2.4 yet.
Well, you'll also deadlock your machine if you happen to lock down the
lowmemory area on x86. Sounds like a _bad_ idea.
Anyway, I posted a suggested patch that should fix the behaviour, but it
doesn't fix the fundamental problem with locking the wrong kinds of
pages (ie you're definitely on your own if you happen to lock down most
of the low 1GB of an intel machine).
>Latency. We know exactly what data should remain in memory, so we're
>trying to prevent the vm from paging out the wrong data. It makes a huge
>difference in performance.
It would be interesting to hear whether that is equally true in the new
VM that doesn't necessarily page stuff out unless it can show that the
memory pressure is actually from VM mappings.
How big is your mlock area during real load? Still the "max the kernel
will allow"? Or is that just a benchmark/test kind of thing?
Linus
On Fri, 2 Nov 2001 13:13:10 -0800 (PST) Linus Torvalds <[email protected]> wrote:
> - /* Don't swap out areas which are locked down */
> - if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
> + /* Don't swap out areas which are reserved */
> + if (vma->vm_flags & VM_RESERVED)
> return count;
Although I agree with what you said about the differences between the old and
new VM, I believe the above was not really what Ben intended to achieve by
mlocking. I mean, you would swap them out right now, or not?
Regards,
Stephan
> Anyway, I posted a suggested patch that should fix the behaviour, but it
> doesn't fix the fundamental problem with locking the wrong kinds of
> pages (ie you're definitely on your own if you happen to lock down most
> of the low 1GB of an intel machine).
I've tried the patch you sent and it doesn't help. I applied the patch
to 2.4.13-pre7 and it hung the machine in the same way (ctrl-alt-del
didn't work). The last few lines of vmstat before the machine hung look
like this:
0 1 0 0 133444 5132 3367312 0 0 31196 0 1121 2123 0 6 94
0 1 0 0 63036 5216 3435920 0 0 34338 14 1219 2272 0 5 95
2 0 1 0 6156 1828 3494904 0 0 31268 0 1130 2198 0 23 77
1 0 1 0 3596 864 3498488 0 0 2720 16 1640 1068 0 88 12
> It would be interesting to hear whether that is equally true in the new
> VM that doesn't necessarily page stuff out unless it can show that the
> memory pressure is actually from VM mappings.
>
> How big is your mlock area during real load? Still the "max the kernel
> will allow"? Or is that just a benchmark/test kind of thing?
I haven't had a chance to try my real app yet, but my test application
is a good simulation of what the real program does, minus any accesses
to the data that it maps. Since it's the only application
running, and for performance reasons we'd need all of our data in
memory, we map the "max the kernel will allow".
As another note, I've re-written my test application to use madvise
instead of mlock, on a suggestion from Andrea. It also doesn't work. For
2.4.13, after running for a while, my test app hangs, using one CPU, and
kswapd consumes the other CPU. I was eventually able to kill my test app.
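(For reference, a sketch of the substitution, assuming the advice in question was
MADV_WILLNEED; the thread doesn't say which flag Andrea suggested, so treat that
as my assumption:)

#include <stddef.h>
#include <sys/mman.h>

/* Instead of pinning the mapping with mlock(), hint to the VM that the
 * whole range will be wanted soon.  MADV_WILLNEED is an assumption here,
 * not something stated in the thread. */
static int hint_resident(void *addr, size_t len)
{
	return madvise(addr, len, MADV_WILLNEED);
}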
I've also re-written my test app to use anonymous mmap, followed by a
mlock and read()'s. This actually does work without problems, but
doesn't really do what we want for other reasons.
- Ben
Ben Smith
Google, Inc.
On November 2, 2001 11:42 pm, Ben Smith wrote:
> As another note, I've re-written my test application to use madvise
> instead of mlock, on a suggestion from Andrea. It also doesn't work. For
> 2.4.13, after running for a while, my test app hangs, using one CPU, and
> kswapd consumes the other CPU. I was eventually able to kill my test app.
OK, while there may be room for debate over whether the mlock problem is a
bug, there's no question with madvise. The program still doesn't work if you
replace the mlocks with madvises (except for the mlock that's used to
estimate memory size).
--
Daniel
In article <[email protected]>,
Stephan von Krawczynski <[email protected]> wrote:
>
>> - /* Don't swap out areas which are locked down */
>> - if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
>> + /* Don't swap out areas which are reserved */
>> + if (vma->vm_flags & VM_RESERVED)
>> return count;
>
>Although I agree what you said about differences of old and new VM, I believe
>the above was not really what Ben intended to do by mlocking. I mean, you swap
>them out right now, or not?
Not. See where I added the VM_LOCKED test - deep down in the page-out
code it will decide that a VM_LOCKED page is always accessed, and will
move it to the active list instead of swapping it out.
Linus
Hello Justin, hello Gerard
I am currently looking for the reasons for the bad behaviour of the aic7xxx
driver in a shared interrupt setup, and for its generally unfriendly
behaviour in a multi-tasking environment.
Here is what I found in the code:
/*
 * SCSI controller interrupt handler.
 */
void
ahc_linux_isr(int irq, void *dev_id, struct pt_regs * regs)
{
	struct ahc_softc *ahc;
	struct ahc_cmd *acmd;
	u_long flags;

	ahc = (struct ahc_softc *) dev_id;
	ahc_lock(ahc, &flags);
	ahc_intr(ahc);
	/*
	 * It would be nice to run the device queues from a
	 * bottom half handler, but as there is no way to
	 * dynamically register one, we'll have to postpone
	 * that until we get integrated into the kernel.
	 */
	ahc_linux_run_device_queues(ahc);
	acmd = TAILQ_FIRST(&ahc->platform_data->completeq);
	TAILQ_INIT(&ahc->platform_data->completeq);
	ahc_unlock(ahc, &flags);
	if (acmd != NULL)
		ahc_linux_run_complete_queue(ahc, acmd);
}
This is nice. I cannot read the complete code around it (it is derived
from aic7xxx_linux.c), but if I understand the naming and comments
correctly, some workload is done inside the hardware interrupt (which it
shouldn't be), which would very much match my tests showing bad overall
performance behaviour. Obviously this code is old (read the comment)
and needs reworking.
Comments?
Regards,
Stephan
In article <[email protected]> you wrote:
> Hello Justin, hello Gerard
>
> I am looking currently for reasons for bad behaviour of aic7xxx driver
> in an shared interrupt setup and general not-nice behaviour of the
> driver regarding multi-tasking environment.
> Here is what I found in the code:
> * It would be nice to run the device queues from a
> * bottom half handler, but as there is no way to
> * dynamically register one, we'll have to postpone
> * that until we get integrated into the kernel.
> */
sounds like a good tasklet candidate......
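A rough sketch of what that could look like with the 2.4 tasklet API follows; the
isr_tasklet field and its initialisation are made up for illustration, and this is
an untested outline, not a proposed driver change:

#include <linux/interrupt.h>

/* Deferred half: run the device queues and completions outside of
 * hard-interrupt context. */
static void ahc_linux_do_deferred(unsigned long data)
{
	struct ahc_softc *ahc = (struct ahc_softc *)data;
	struct ahc_cmd *acmd;
	u_long flags;

	ahc_lock(ahc, &flags);
	ahc_linux_run_device_queues(ahc);
	acmd = TAILQ_FIRST(&ahc->platform_data->completeq);
	TAILQ_INIT(&ahc->platform_data->completeq);
	ahc_unlock(ahc, &flags);
	if (acmd != NULL)
		ahc_linux_run_complete_queue(ahc, acmd);
}

/* A per-controller tasklet (hypothetical isr_tasklet field) would be set
 * up once at attach time, e.g.:
 *   tasklet_init(&ahc->platform_data->isr_tasklet,
 *                ahc_linux_do_deferred, (unsigned long)ahc);
 */

void
ahc_linux_isr(int irq, void *dev_id, struct pt_regs *regs)
{
	struct ahc_softc *ahc = (struct ahc_softc *)dev_id;
	u_long flags;

	ahc_lock(ahc, &flags);
	ahc_intr(ahc);				/* service the chip */
	ahc_unlock(ahc, &flags);
	/* Defer the queue processing instead of doing it here. */
	tasklet_schedule(&ahc->platform_data->isr_tasklet);
}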
> >Hello Justin, hello Gerard
> >
> >I am looking currently for reasons for bad behaviour of aic7xxx driver
> >in an shared interrupt setup and general not-nice behaviour of the
> >driver regarding multi-tasking environment.
>
> Can you be more specific?
Yes, of course :-)
What I am seeing over here is that aic7xxx is _significantly_ slower
than symbios _in the exact same context_. I refused to use the "new"
driver as long as possible because I had (right from the first test)
the "feeling" that it hurts the machine's overall performance in some
way, meaning the box seems _slow_ and less responsive than it was with
the old aic driver. When I directly compared it with symbios (LSI
Logic hardware sold by Tekram) I additionally found out that it
seems to hurt the interrupt performance of a network card sharing its
interrupt with the aic, which again does not happen with symbios. I
have seen such behaviour before: in nearly every driver I
formerly wrote for shared-interrupt systems I had to fill in code that
_prevents_ lockout of other interrupt users due to indefinitely
walking through its own code in high-load situations.
But, of course, you _know_ this. Nobody writes a driver like the new
aic7xxx _and_ doesn't know :-)
My guess is that this knowledge made you add the comment I ripped
from your code about using a bottom half handler instead of dealing with
the workload in a hardware interrupt. Again, I have by no means read your
code completely or anything like that. I simply tried to find the hardware
interrupt routine and see if it does significant eli (EverLasting
Interrupt ;-) stuff - and I found your comment.
Can you re-comment from today's point of view?
> >This is nice. I cannot read the complete code around it (it is derived
> >from aic7xxx_linux.c) but if I understand the naming and comments
> >correct, some workload is done inside the hardware interrupt (which
> >shouldn't), which would very much match my tests showing bad overall
> >performance behaviour. Obviously this code is old (read the comment)
> >and needs reworking.
> >Comments?
>
> I won't comment on whether deferring this work until outside of
> an interrupt context would help your "problem" until I understand
> what you are complaining about. 8-)
In a nutshell:
a) long-lasting interrupt workloads prevent normal process activity
(creating latency and sticky behaviour)
b) long-lasting interrupt workloads play badly with other interrupt users
(e.g. on the same shared interrupt)
I can see _both_ when comparing aic with symbios.
Regards,
Stephan
>Can you re-comment from todays point of view?
I believe that if you were to set the tag depth in the new aic7xxx
driver to a level similar to either the symbios or the old aic7xxx
driver, the problem you describe would go away. The driver
will only perform internal queuing if a device cannot handle the
original queue depth exported to the SCSI mid-layer. Since the
mid-layer provides no mechanism for proper, dynamic throttling,
queuing at the driver level will always be required when the driver
determines that a target cannot accept additional commands. The default
used by the older driver, 8, seems to work for most drives. So, no
internal queuing is required. If you are really concerned about
interrupt latency, this will also be a win as you will reduce your
transaction throughput and thus the frequency of interrupts seen
by the controller.
>> I won't comment on whether deferring this work until outside of
>> an interrupt context would help your "problem" until I understand
>> what you are complaining about. 8-)
>
>In a nutshell:
>a) long lasting interrupt workloads prevent normal process activity
>(creating latency and sticky behaviour)
Deferring the work to outside of interrupt context will not, in
general, allow non-kernel processes to run any sooner. Only interrupt
handlers that don't block on the io-request lock (may it die a horrible
death) would be allowed to pre-empt this activity. Even in this case,
there will be times, albeit much shorter, that this interrupt
will be blocked by the per-controller spin-lock used to protect
driver data structures and access to the card's registers.
If your processes are really feeling sluggish, you are probably doing
*a lot* of I/O. The only thing that might help is some interrupt
coalescing algorithm in the aic7xxx driver's firmware. Since these
chips do not have an easily utilized timer facility, any such algorithm
would be tricky to implement. I've thought about it, but not enough
to implement it yet.
>b) long lasting interrupt workloads play bad on other interrupt users
>(e.g. on the same shared interrupt)
Sure. As the comment suggests, the driver should use a bottom half
handler or whatever new deferral mechanism is currently the rage
in Linux. When I first ported the driver, it was targeted to be a
module, suitable for a driver diskette, to replace the old driver.
Things have changed since then, and this area should be revisited.
Internal queuing was not required in the original FreeBSD driver and
this is something the mid-layer should do on a driver's behalf, but
I've already ranted enough about that.
>I can see _both_ comparing aic with symbios.
I'm not sure that you would see much of a difference if you set the
symbios driver to use 253 commands per-device. I haven't looked at
the sym driver for some time, but last I remember it does not use
a bottom half handler and handles queue throttling internally. It
may perform less work at interrupt time than the aic7xxx driver if
locally queued I/O is compiled into a format suitable for controller
consumption rather than queuing the ScsiCmnd structure provided by
the mid-layer. The aic7xxx driver has to convert a ScsiCmnd into a
controller data structure to service an internal queue and this can
take a bit of time.
It would be interesting to see if there is a disparity in the TPS numbers
and tag depths in your comparisons. Higher tag depth usually means
higher TPS which may also mean less interactive response from the
system. All things being equal, I would expect the sym and aic7xxx
drivers to perform about the same.
--
Justin
On Sat, 3 Nov 2001, Justin T. Gibbs wrote:
[...]
> >I can see _both_ comparing aic with symbios.
>
> I'm not sure that you would see much of a difference if you set the
> symbios driver to use 253 commands per-device. I haven't looked at
This is discouraged. :)
Better, IMO, to compare behaviours with realistic queue depths. As you
know, more than 64 for hard disks does not make sense (yet).
Personally, I use 64 under FreeBSD and 16 under Linux. Guess why? :-)
> the sym driver for some time, but last I remember it does not use
> a bottom half handler and handles queue throttling internally. It
There is no BH in the driver. The stock sym53c8xx even uses scsi_obsolete,
which requires more work in interrupt context for command completion.
SYM-2, which comes back from FreeBSD, uses the threaded EH stuff that just
queues to a BH on completion. Stephan may want to give SYM-2 a try, IMO.
> may perform less work at interrupt time than the aic7xxx driver if
> locally queued I/O is compiled into a format suitable for controller
> consumption rather than queue the ScsiCmnd structure provided by
> the mid-layer. The aic7xxx driver has to convert a ScsiCmnd into a
> controller data structure to service an internal queue and this can
> take a bit of time.
The sym* drivers also use an internal data structure to handle I/Os. The
SCSI script does not know about any O/S specific data structure.
> I would be interresting if there is a disparity in the TPS numbers
> and tag depths in your comparisons. Higher tag depth usually means
> higher TPS which may also mean less interactive response from the
> system. All things being equal, I would expect the sym and aic7xxx
> drivers to perform about the same.
Agreed.
Gérard.
On Sat, 03 Nov 2001 22:47:39 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> >Can you re-comment from todays point of view?
>
> I believe that if you were to set the tag depth in the new aic7xxx
> driver to a level similar to either the symbios or the old aic7xxx
> driver, that the problem you describe would go away.
Nope.
I know the stuff :-) I already took tcq down to 8 (as in the old driver) back
when I compared the old and new drivers. Indeed I found out that everything is a
lot worse when using tcq 256 (which doesn't work anyway and gets down to 128 in
real life using my IBM harddrive). After using depth 8 the comparison to
symbios is just as described. Though I must admit that the symbios driver
takes tcq down from 8 to 4 according to its boot-up message. Do you think it
will make a noticeable difference if I hardcode the depth to 4 in the aic7xxx
driver?
> The driver
> will only perform internal queuing if a device cannot handle the
> original queue depth exported to the SCSI mid-layer. Since the
> mid-layer provides no mechanism for proper, dynamic, throttling,
> queuing at the driver level will always be required when the driver
> determines that a target cannot accept additional commands. The default
> used by the older driver, 8, seems to work for most drives. So, no
> internal queuing is required. If you are really concerned about
> interrupt latency, this will also be a win as you will reduce your
> transaction throughput and thus the frequency of interrupts seen
> by the controller.
Hm, this is not really true in my experience. Since a harddrive operates on a
completely different time scale than pure software, it may well be that
building up internal data not directly inside the hardware interrupt, but at
some higher layer, is no noticeable performance loss, _if_ it is done
right. "Right" here obviously means there must not be a synchronous linkage
between this higher layer and the hardware interrupt, in the sense that the
higher layer has to wait for the hardware interrupt's completion. But this is
all pretty "down to earth" stuff you know anyway.
> >> I won't comment on whether deferring this work until outside of
> >> an interrupt context would help your "problem" until I understand
> >> what you are complaining about. 8-)
> >
> >In a nutshell:
> >a) long lasting interrupt workloads prevent normal process activity
> >(creating latency and sticky behaviour)
>
> Deferring the work to outside of interrupt context will not, in
> general, allow non-kernel processes to run any sooner.
Kernel processes would be completely sufficient. If you hit allocation
routines (e.g.) the whole system enters a hiccup state :-).
> Only interrupt
> handlers that don't block on the io-request lock (may it die a horrible
> death) would be allowed to pre-empt this activity. Even in this case,
> there will be times, albeit much shorter, that this interrupt
> will be blocked by the per-controller spin-lock used to protect
> driver data structures and access to the card's registers.
Well, this is a natural thing. You always have to protect exclusively
accessed things like controller registers, but doubtless things turn out
better the less exclusiveness you have (what can be more exclusive than a
hardware interrupt?).
> If your processes are really feeling sluggish, you are probably doing
> *a lot* of I/O.
Yes, of course. I wouldn't have complained in the first place if I didn't
know that symbios does it better.
> The only thing that might help is some interrupt
> coalessing algorithm in the aic7xxx driver's firmware. Since these
> chips do not have an easily utilized timer facility any such algorithm
> would be tricky to implement. I've thought about it, but not enough
> to implement it yet.
I cannot comment on that, I don't know what Gerard really does here.
> >b) long lasting interrupt workloads play bad on other interrupt users
> >(e.g. on the same shared interrupt)
>
> Sure. As the comment suggests, the driver should use a bottom half
> handler or whatever new deferral mechanism is currently the rage
> in Linux.
Do you think this would be complex to implement?
> [...]
> >I can see _both_ comparing aic with symbios.
>
> I'm not sure that you would see much of a difference if you set the
> symbios driver to use 253 commands per-device.
As stated earlier I took both drivers to comparable values (8).
> I would be interresting if there is a disparity in the TPS numbers
> and tag depths in your comparisons. Higher tag depth usually means
> higher TPS which may also mean less interactive response from the
> system. All things being equal, I would expect the sym and aic7xxx
> drivers to perform about the same.
I can confirm that. 253 is a bad joke in terms of interactive responsiveness
during high load.
Probably the configured default value should be lowered considerably.
253 feels like old IDE.
Yes, I know this comment hurt you badly ;-)
In my eyes the changes required in your driver are _not_ that big. The gain
would be noticeable. I'm not saying it's a bad driver, really not; I would only
suggest some refinement. I know _you_ can do a bit better, prove me right ;-)
Regards,
Stephan
>On Sat, 03 Nov 2001 22:47:39 -0700 "Justin T. Gibbs" <[email protected]> wrote:
>
>> >Can you re-comment from todays point of view?
>>
>> I believe that if you were to set the tag depth in the new aic7xxx
>> driver to a level similar to either the symbios or the old aic7xxx
>> driver, that the problem you describe would go away.
>
>Nope.
>I know the stuff :-) I already took tcq down to 8 (as in old driver) back at
>the times I compared old an new driver.
Then you will have to find some other reason for the difference in
performance. Internal queuing is not a factor with any reasonable
modern drive when the depth is set at 8.
>Indeed I found out that everything is a lot worse if using tcq 256 (which
>doesn't work anyway and gets down to 128 in real life using my IBM harddrive).
The driver cannot know if you are using an external RAID controller or
an IBM drive or a Quantum Fireball. It is my belief that in a true
multi-tasking workload, giving the device as much work to chew on
as it can handle is always best. Your sequential bandwidth may
be a bit less, but sequential I/O is not that interesting in my opinion.
>After using depth 8 the comparison to
>symbios is just as described. Though I must admit, that the symbios driver
>takes down tcq from 8 to 4 according to his boot-up message. Do you think it
>will make a noticeable difference if I hardcode the depth to 4 in the aic7xxx
>driver?
As mentioned above, I would not expect any difference.
>> The driver
>> will only perform internal queuing if a device cannot handle the
>> original queue depth exported to the SCSI mid-layer. Since the
>> mid-layer provides no mechanism for proper, dynamic, throttling,
>> queuing at the driver level will always be required when the driver
>> determines that a target cannot accept additional commands. The default
>> used by the older driver, 8, seems to work for most drives. So, no
>> internal queuing is required. If you are really concerned about
>> interrupt latency, this will also be a win as you will reduce your
>> transaction throughput and thus the frequency of interrupts seen
>> by the controller.
>
>Hm, this is not really true in my experience. Since a harddrive is in a
>completely other time-framing than pure software issues it may well be, that
>building up internal data not directly inside the hardware interrupt, but on a
>somewhere higher layer, is no noticeable performance loss, _if_ it is done
>right. "Right" here means obviously there must not be a synchronous linkage
>between this higher layer and the hardware interrupt in this sense that the
>higher layer has to wait on hardware interrupts' completion. But this is all
>pretty "down to earth" stuff you know anyways.
I don't understand how your comments relate to mine. In a perfect world,
and with a "real" SCSI layer in Linux, the driver would never have any
queued data above and beyond what it can directly send to the device.
Since Linux lets you set the queue depth only at startup, before you can
dynamically determine a useful value, the driver has little choice. To
say it more directly, internal queuing is not something I wanted in the
design - in fact it makes it more complicated and less efficient.
>> Deferring the work to outside of interrupt context will not, in
>> general, allow non-kernel processes to run any sooner.
>
>Kernel processes would be completely sufficient. If you hit allocation
>routines (e.g.) the whole system enters a hiccup state :-).
But even those kernel processes will not run unless they have a higher
priority than the bottom half handler. I can't stress this enough...
interactive performance will not change if this is done because kernel
tasks take priority over user tasks.
>> If your processes are really feeling sluggish, you are probably doing
>> *a lot* of I/O.
>
>Yes, of course. I wouldn't have complained in the first place _not_ knowing
>that symbios does it better.
I wish you could be a bit more quantitative in your analysis. It seems
clear to me that the area you're pointing to is not the cause of your
complaint. Without a quantitative analysis, I can't help you figure
this out.
>> Sure. As the comment suggests, the driver should use a bottom half
>> handler or whatever new deferral mechanism is currently the rage
>> in Linux.
>
>Do you think this is complex in implementation?
No, but doing anything like this requires some research to find a solution
that works for all kernel versions the driver supports. I hope I don't need
three different implementations to make this work. Regardless, this change
will not make any difference in your problem.
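If and when it happens, the 2.4 shape of it would probably be a tasklet; a
minimal sketch of the pattern (all example_* names are illustrative, not the
driver's own, and 2.2/2.0 kernels would still need a task-queue or bottom-half
variant):

/*
 * Minimal sketch of deferring completion work out of the hard interrupt
 * with a 2.4 tasklet.  Nothing here is the aic7xxx driver's own code.
 */
#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct example_softc {
        spinlock_t lock;        /* protects the completion list below */
        /* ... list of commands completed by the hardware ... */
};

static void example_complete_bh(unsigned long data);

static struct example_softc example_sc;
static DECLARE_TASKLET(example_tasklet, example_complete_bh,
                       (unsigned long)&example_sc);

static void example_attach(void)
{
        spin_lock_init(&example_sc.lock);
}

/* Hard interrupt path: pull completions off the card, then defer. */
static void example_intr_bottom(struct example_softc *sc)
{
        /* ... move finished commands onto sc's completion list,
         *     holding sc->lock ... */
        tasklet_schedule(&example_tasklet);
}

/* Tasklet: hand finished commands back to the mid-layer outside the ISR. */
static void example_complete_bh(unsigned long data)
{
        struct example_softc *sc = (struct example_softc *)data;
        unsigned long flags;

        spin_lock_irqsave(&sc->lock, flags);
        /* ... dequeue each finished command and call its scsi_done() ... */
        spin_unlock_irqrestore(&sc->lock, flags);
}

Again, that changes where the completion work runs, not how much of it there
is, which is why I don't expect it to touch the problem you are reporting.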
>> It would be interesting if there is a disparity in the TPS numbers
>> and tag depths in your comparisons. Higher tag depth usually means
>> higher TPS which may also mean less interactive response from the
>> system. All things being equal, I would expect the sym and aic7xxx
>> drivers to perform about the same.
>
>I can confirm that. 253 is a bad joke in terms of interactive responsiveness
>during high load.
It's there for throughput, not interactive performance. I'm sure if you
were doing things like news expirations, you'd appreciate the higher number
(up to the 128 tags your drives support).
>Probably the configured default value should be reduced considerably.
>253 feels like old IDE.
>Yes, I know this comment hurt you badly ;-)
Not really. Each to their own. You can tune your system however you
see fit.
>In my eyes the changes required in your driver are _not_ that big. The gain
>would be noticeable. I am not saying it's a bad driver, really not; I would only
>suggest some refinement. I know _you_ can do a bit better, prove me right ;-)
Show me where the real problem is, and I'll fix it. I'll add the bottom
half handler too eventually, but I don't see it as a pressing item. I'm
much more interested in why you are seeing the behavior you are and exactly
what, quantitatively, that behavior is.
--
Justin
On Sun, 04 Nov 2001 11:10:26 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> >On Sat, 03 Nov 2001 22:47:39 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> Show me where the real problem is, and I'll fix it. I'll add the bottom
> half handler too eventually, but I don't see it as a pressing item. I'm
> much more interested in why you are seeing the behavior you are and exactly
> what, quantitatively, that behavior is.
Hm, what more specific can I tell you, than:
Take my box with
Host: scsi1 Channel: 00 Id: 03 Lun: 00
Vendor: TEAC Model: CD-ROM CD-532S Rev: 1.0A
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 08 Lun: 00
Vendor: IBM Model: DDYS-T36950N Rev: S96H
Type: Direct-Access ANSI SCSI revision: 03
and an aic7xxx driver. Start xcdroast and read a cd image. You get something
between 2968,4 and 3168,2 kB/s throughput measured from xcdroast.
Now redo this with a Tekram controller (which is sym53c1010) and you get
throughput of 3611,1 to 3620,2 kB/s.
No special stuff or background processes or anything else involved. I wonder
how much simpler a test could be.
Give me values to compare from _your_ setup.
If you redo this test with nfs-load (copy files from some client to your
test-box acting as nfs-server) you will end up at 1926 - 2631 kB/s throughput
with aic, but 3395 - 3605 kB/s with symbios.
If you need more on that picture, then redo the last and start _some_
application in the background during the test (like mozilla). Time how long it
takes until the application is up and running.
If you are really unlucky you have your mail-client open during test and let it
get mails via pop3 in a MH folder (lots of small files). You have a high chance
that your mail-client is unusable until xcdroast is finished with cd reading -
but not with symbios.
??
Regards,
Stephan
On Sun, 04 Nov 2001 11:10:26 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> >Nope.
> >I know the stuff :-) I already took tcq down to 8 (as in old driver) back at
> >the times I compared the old and new driver.
>
> Then you will have to find some other reason for the difference in
> performance. Internal queuing is not a factor with any reasonable
> modern drive when the depth is set at 8.
Hm, obviously we could start right from the beginning and ask people with aic
controllers and symbios controllers for some comparison figures. Hopefully some
people are interested.
Here we go:
Hello out there :-)
we need your help. If you own a SCSI controller from Adaptec or one with an
NCR/Symbios chipset, can you please do the following:
reboot your box, start xcdroast and read in a data CD. Tell us: the brand of
your CD-ROM, how much RAM you have, your processor type, and the throughput as
measured by xcdroast. It would be nice if you could try several times.
We are not really interested in the exact figures, but want to extract some
"global" tendency.
Thank you for your cooperation,
Stephan
PS: my values are (I obviously have both controllers):
Adaptec:
Drive TEAC-CD-532S (30x), 1 GB RAM, 2 x PIII 1GHz
test 1: 2998,9 kB/s
test 2: 2968,4 kB/s
test 3: 3168,2 kB/s
Tekram (symbios):
Drive TEAC-CD-532S (30x), 1 GB RAM, 2 x PIII 1GHz
test 1: 3619,3 kB/s
test 2: 3611,1 kB/s
test 3: 3620,2 kB/s
>Hm, what more specific can I tell you, than:
Well, the numbers paint a different picture than your previous
comments. You never mentioned a performance disparity, only a
loss in interactive performance.
>Take my box with
>
>Host: scsi1 Channel: 00 Id: 03 Lun: 00
> Vendor: TEAC Model: CD-ROM CD-532S Rev: 1.0A
> Type: CD-ROM ANSI SCSI revision: 02
>Host: scsi0 Channel: 00 Id: 08 Lun: 00
> Vendor: IBM Model: DDYS-T36950N Rev: S96H
> Type: Direct-Access ANSI SCSI revision: 03
>
>and an aic7xxx driver.
A full dmesg would be better. Right now I have no idea what kind
of aic7xxx controller you are using, the speed and type of CPU,
the chipset in the machine, etc. etc. In general, I'd rather see
the raw data than a version edited down based on the conclusions
you've already drawn or on what you feel is important.
>Start xcdroast an read a cd image. You get something
>between 2968,4 and 3168,2 kB/s throughput measured from xcdroast.
>
>Now redo this with a Tekram controller (which is sym53c1010) and you get
>throughput of 3611,1 to 3620,2 kB/s.
Were both tests performed from cold boot to a new file in the same
directory with similar amounts of that filesystem in use?
>No special stuff or background processes or anything else involved. I wonder
>how much simpler a test could be.
It doesn't matter how simple it is if you've never mentioned it before.
Your tone is somewhat indignant. Do you not understand why this
data is important to understanding and correcting the problem?
>Give me values to compare from _your_ setup.
Send me a c1010. 8-)
>If you redo this test with nfs-load (copy files from some client to your
>test-box acting as nfs-server) you will end up at 1926 - 2631 kB/s throughput
>with aic, but 3395 - 3605 kB/s with symbios.
What is the interrupt load during these tests? Have you verified that
disconnection is enabled for all devices on the aic7xxx controller?
>If you need more on that picture, then redo the last and start _some_
>application in the background during the test (like mozilla). Time how long it
>takes until the application is up and running.
Since you are experiencing the problem, can't you time it? There is
little guarantee that I will be able to reproduce the exact scenario
you are describing. As I mentioned before, I don't have a c1010,
so I cannot perform the comparison you feel is so telling.
This does not look like an interrupt latency problem.
--
Justin
Hi Stephan,
The difference in performance for your CD (a slow device) between aic7xxx
and sym53c8xx using equally capable HBAs (notably Ultra-160) cannot for a
single second be believed to be due to a design flaw in the aic7xxx driver.
Instead of trying to prove Justin wrong about his driver, you should look
into your system configuration and/or provide Justin with accurate
information and/or run different tests in order to get some clue about
the real cause.
You may have triggered a software/hardware bug somewhere, but I am
convinced that it cannot be a driver design bug.
In order to help Justin work on your problem, you should for example
report:
- The device configuration you set up in the controller EEPROM/NVRAM.
- The kernel boot-up messages.
- Your kernel configuration.
- Etc...
You might for example have unintentionally configured some devices in the
HBA set-up so that disconnection is not granted. Such a configuration MISTAKE
is likely to kill SCSI performance a LOT.
Gérard.
PS: If you are interested in Justin's ability to design software for SCSI,
then you may want to have a look into all FreeBSD IO-related stuff owned
by Justin.
On Sun, 4 Nov 2001, Stephan von Krawczynski wrote:
> On Sun, 04 Nov 2001 11:10:26 -0700 "Justin T. Gibbs" <[email protected]> wrote:
>
> > >On Sat, 03 Nov 2001 22:47:39 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> > Show me where the real problem is, and I'll fix it. I'll add the bottom
> > half handler too eventually, but I don't see it as a pressing item. I'm
> > much more interested in why you are seeing the behavior you are and exactly
> > what, quantitatively, that behavior is.
>
> Hm, what more specific can I tell you, than:
>
> Take my box with
>
> Host: scsi1 Channel: 00 Id: 03 Lun: 00
> Vendor: TEAC Model: CD-ROM CD-532S Rev: 1.0A
> Type: CD-ROM ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 08 Lun: 00
> Vendor: IBM Model: DDYS-T36950N Rev: S96H
> Type: Direct-Access ANSI SCSI revision: 03
>
> and an aic7xxx driver. Start xcdroast an read a cd image. You get something
> between 2968,4 and 3168,2 kB/s throughput measured from xcdroast.
>
> Now redo this with a Tekram controller (which is sym53c1010) and you get
> throughput of 3611,1 to 3620,2 kB/s.
> No special stuff or background processes or anything else involved. I wonder
> how much simpler a test could be.
> Give me values to compare from _your_ setup.
>
> If you redo this test with nfs-load (copy files from some client to your
> test-box acting as nfs-server) you will end up at 1926 - 2631 kB/s throughput
> with aic, but 3395 - 3605 kB/s with symbios.
>
> If you need more on that picture, then redo the last and start _some_
> application in the background during the test (like mozilla). Time how long it
> takes until the application is up and running.
> If you are really unlucky you have your mail-client open during test and let it
> get mails via pop3 in a MH folder (lots of small files). You have a high chance
> that your mail-client is unusable until xcdroast is finished with cd reading -
> but not with symbios.
>
> ??
>
> Regards,
> Stephan
On Sun, 04 Nov 2001 12:13:20 -0700 "Justin T. Gibbs" <[email protected]> wrote:
> >Hm, what more specific can I tell you, than:
>
> Well, the numbers paint a different picture than your previous
> comments. You never mentioned a performance disparity, only a
> loss in interactive performance.
See:
Date: Wed, 31 Oct 2001 16:45:39 +0100
From: Stephan von Krawczynski <[email protected]>
To: linux-kernel <[email protected]>
Subject: The good, the bad & the ugly (or VM, block devices, and SCSI :-)
Message-Id: <[email protected]>
> A full dmesg would be better. Right now I have no idea what kind
> of aic7xxx controller you are using,
Adaptec A29160 (see above mail). Remarkably, I have a 32-bit PCI bus, not a
64-bit one. This is an Asus CUV4X-D board.
> the speed and type of CPU,
2 x PIII 1GHz
> the chipset in the machine,
00:00.0 Host bridge: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] (rev c4)
00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]
00:04.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
00:04.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:04.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:04.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:04.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:09.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03)
00:0a.0 Network controller: Elsa AG QuickStep 1000 (rev 01)
00:0b.0 SCSI storage controller: Symbios Logic Inc. (formerly NCR) 53c1010 Ultra3 SCSI Adapter (rev 01)
00:0b.1 SCSI storage controller: Symbios Logic Inc. (formerly NCR) 53c1010 Ultra3 SCSI Adapter (rev 01)
00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10000 (rev 07)
00:0d.1 Input device controller: Creative Labs SB Live! (rev 07)
01:00.0 VGA compatible controller: nVidia Corporation NV11 (rev b2)
02:04.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41)
02:05.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41)
02:06.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41)
02:07.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41)
> Were both tests performed from cold boot
I rechecked that several times; it made no difference.
> to a new file in the same
> directory with similar amounts of that filesystem in use?
Yes. There is no difference whether the file is a) new or b) overwritten.
Anyway, both test cases use the same filesystems; I really exchanged only the
controllers, everything else is completely the same.
I just did another test run with symbios, _after_ heavy NFS and I/O activity
on the box and with about 145 MB currently in swap. Result: 3620,1 kB/s.
_Very_ stable behaviour from symbios.
> >No special stuff or background processes or anything else involved. I wonder
> >how much simpler a test could be.
>
> It doesn't matter how simple it is if you've never mentioned it before.
Sorry, but nothing was left out on my side; see above.
> Your tone is somewhat indignant. Do you not understand why this
> data is important to understanding and correcting the problem?
Sorry for that, it is unintentional. Though my written English may look nice,
keep in mind that I am not a native English speaker, so some things may come
across a bit rougher than intended.
> >Give me values to compare from _your_ setup.
>
> Send me a c1010. 8-)
Sorry, misunderstanding. What I meant was: how fast can you read data from your
cd-rom attached to some adaptec controller?
> >If you redo this test with nfs-load (copy files from some client to your
> >test-box acting as nfs-server) you will end up at 1926 - 2631 kB/s throughput
> >with aic, but 3395 - 3605 kB/s with symbios.
>
> What is the interrupt load during these tests?
How can I present you an exact figure on this?
> Have you verified that
> disconnection is enabled for all devices on the aic7xxx controller?
yes.
> This does not look like an interrupt latency problem.
Based on which thoughts?
Regards,
Stephan
>See:
>
>Date: Wed, 31 Oct 2001 16:45:39 +0100
>From: Stephan von Krawczynski <[email protected]>
>To: linux-kernel <[email protected]>
>Subject: The good, the bad & the ugly (or VM, block devices, and SCSI :-)
>Message-Id: <[email protected]>
<Sigh> I don't read all of the LK list and the mail was not cc'd
to me, so I did not see this thread.
>> A full dmesg would be better. Right now I have no idea what kind
>> of aic7xxx controller you are using,
>
>Adaptec A29160 (see above mail). Remarkably, I have a 32-bit PCI bus, not a
>64-bit one. This is an Asus CUV4X-D board.
*Please stop editing things*. I need the actual boot messages from
the detection of the aic7xxx card. It would also be nice to see
the output of /proc/scsi/aic7xxx/<card #>
>> the speed and type of CPU,
>
>2 x PIII 1GHz
Dmesg please.
>Sorry, misunderstanding. What I meant was: how fast can you read data
>from your cd-rom attached to some adaptec controller?
I'll run some tests tomorrow at work. I'm sure the results will
be dependent on the cdrom in question but they may show something.
>> >If you redo this test with nfs-load (copy files from some client to your
>> >test-box acting as nfs-server) you will end up at 1926 - 2631 kB/s throughput
>> >with aic, but 3395 - 3605 kB/s with symbios.
>>
>> What is the interrupt load during these tests?
>
>How can I present you an exact figure on this?
Isn't there a systat or vmstat equivalent under Linux that gives you
interrupt rates? I'll poke around tomorrow when I'm in front of a Linux
box.
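In the meantime, a rough number could come from something like the following
(an illustrative sketch, not a polished tool): sample /proc/interrupts twice,
one second apart, and print the per-IRQ delta. vmstat 1 should report the same
total in its "in" column.

/*
 * Illustrative sketch: sample /proc/interrupts twice, one second apart,
 * and print the per-IRQ interrupt rate.  Counters from all CPUs on a
 * line are summed, so it also works on SMP boxes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_LINES 64
#define MAX_LEN   256

static int snapshot(char lines[MAX_LINES][MAX_LEN])
{
        FILE *f = fopen("/proc/interrupts", "r");
        int n = 0;

        if (f == NULL) {
                perror("/proc/interrupts");
                exit(1);
        }
        while (n < MAX_LINES && fgets(lines[n], MAX_LEN, f) != NULL)
                n++;
        fclose(f);
        return n;
}

/* Sum the per-CPU counters on one line; returns -1 for non-IRQ lines. */
static long line_total(const char *line, int *irq)
{
        char *end, *next;
        long total = 0, v;
        long irqno = strtol(line, &end, 10);

        if (end == line || *end != ':')
                return -1;
        *irq = (int)irqno;
        end++;
        for (;;) {
                v = strtol(end, &next, 10);
                if (next == end)        /* reached the handler name */
                        break;
                total += v;
                end = next;
        }
        return total;
}

int main(void)
{
        static char before[MAX_LINES][MAX_LEN], after[MAX_LINES][MAX_LEN];
        int nb, na, i, irq1, irq2;

        nb = snapshot(before);
        sleep(1);
        na = snapshot(after);

        for (i = 0; i < nb && i < na; i++) {
                long t1 = line_total(before[i], &irq1);
                long t2 = line_total(after[i], &irq2);

                if (t1 >= 0 && t2 >= 0 && irq1 == irq2)
                        printf("IRQ %3d: %6ld interrupts/s\n", irq1, t2 - t1);
        }
        return 0;
}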
>> Have you verified that
>> disconnection is enabled for all devices on the aic7xxx controller?
>
>yes.
The driver may not be seeing the same things as SCSI-Select for some
strange reason. Again, just email me a full dmesg after a successful
boot along with the /proc/scsi/aic7xxx/ output.
>> This does not look like an interrupt latency problem.
>
>Based on which thoughts?
It really looks like a bug in the driver's round-robin code or perhaps
a difference in how many transactions we allow to be queued in the
untagged case.
Can you re-run your tests with the output directed to /dev/null for cdrom
reads and also perform some benchmarks against your disk? The benchmarks
should operate on only one device at a time, with as little I/O as possible
to any other device during the test.
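Something along these lines would do as a crude cross-check; it is roughly
what "dd if=/dev/scd0 of=/dev/null bs=64k" measures. The /dev/scd0 default,
the 64 kB block size and the 64 MB total are assumptions you may need to
adjust:

/*
 * Illustrative raw-read throughput test.  The default device path,
 * block size and total size are assumptions - adjust them for your box.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE (64 * 1024)
#define TOTAL_MB   64

int main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/scd0";
        static char buf[BLOCK_SIZE];
        long long done = 0, total = (long long)TOTAL_MB * 1024 * 1024;
        struct timeval t0, t1;
        double secs;
        ssize_t n;
        int fd;

        fd = open(dev, O_RDONLY);
        if (fd < 0) {
                perror(dev);
                return 1;
        }
        gettimeofday(&t0, NULL);
        while (done < total) {
                n = read(fd, buf, sizeof(buf));
                if (n <= 0)             /* end of medium or read error */
                        break;
                done += n;
        }
        gettimeofday(&t1, NULL);
        close(fd);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        if (secs <= 0.0)
                secs = 1e-6;            /* avoid dividing by zero */
        printf("%s: %lld kB in %.1f s -> %.1f kB/s\n",
               dev, done / 1024, secs, done / 1024.0 / secs);
        return 0;
}

Running the same binary against the disk (or a large file on it) with nothing
else touching the bus would let us compare controller against controller rather
than xcdroast against xcdroast.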
--
Justin
On Sun, 04 Nov 2001, Justin T. Gibbs wrote:
> Isn't there a systat or vmstat equivalent under Linux that gives you
> interrupt rates? I'll poke around tomorrow when I'm in front of a Linux
> box.
vmstat is usually available; systat/iostat and the like are not
ubiquitous, however.
--
Matthias Andree