2000-12-07 20:31:34

by Alan

Subject: Linux 2.2.18pre25


OK, we believe the VM crash that loops printing error messages is now fixed.
Marcelo finally figured it out, and my 8MB 486 has been running 2.2.18pre
stably with that fix[1].

So I figure this is it for 2.2.18, subject to evidence to the contrary

Alan


2.2.18pre25
o Fix tight loop spinning reporting out of free pages (Marcelo Tosatti)
o Back out ppa changes causing problems for a few users (Tim Waugh)
o Set master enable on UHCI USB controllers (Erik Mouw)
o RIO DCD fixes (Patrick van de Lageweg)
o 3c59x.c support for 3c556B (Andrew Morton)
o S390 cleanups for loopsperjiffy etc (Kurt Roeckx)
o Fix acceleport 4 SMP hangs (Al Borchers)
o Fix drivers/char/Makefile buglet (Chip Salzenberg)
o PPC syscall table fix (Chip Salzenberg)
o Move HID sysctl to avoid clash in 2.4 case (Tom Rini)
o Small symbios check condition fix (Gérard Roudier)
o Fix Makefile module version check (Eric Lammerts)
o Fix DRM build on Sparc (Dave Miller)
o Work around Dallas D4201 PCM8 audio bug (Thomas Sailer)
o Fix USB memory leak (Dan Streetman)
o Fix ioremap fencepost error (Chip Salzenberg)


2.2.18pre24
o Expose put_unused_fd for modules (Andi Kleen)
o Fix the ps/2 mouse probe I hope (me)
o Fix crash in cosa driver (Jan Kasprzak)
o Fix procfs negative seek offset error reporting (HJ Lu)
o Fix ext2 file limit constraints (Andrea Arcangeli)
o Fix lockf corner cases (Andi Kleen, me)
o Fix NCPfs date limits (Igor Zhbanov)
o Update DRM (Chip Salzenberg)
o Fix missing Alpha includes (Matt Wilson)
o Fix missing symbols on alpha (Matt Wilson)

2.2.18pre23
o Fix alpha compile problem (Herbert Xu)
o Scan DMI bios data to find broken laptops (me)
o Fix megaraid module symbols (Michael Marxmeier)
o Fix visor/OHCI problem (Greg Kroah-Hartman)
o Fix sysctl_jiffies compile bug (Tomasz Kłoczko)
o Init mic input low to avoid feedback (Pete Zaitcev)
o Fix typo in acenic headers (Val Henson)
o David Woodhouse has moved (David Woodhouse)
o Compaq raid driver update (Charles White)
o Fix aha1542 scribbles on errors (Phil Stracchino)
o Update Advansys driver to v3.3D (Bob Frey)
o Fix maestro ioctl locking (Zach Brown)
o Formatting cleanup for setup.c (Dave Jones)
o Fix FAT32 bugs on Alpha (Bill Nottingham)

2.2.18pre22
o Fix HZ assumption in USB hub driver (Oleg Drokin)
o Fix ndisc range check on ipv6 (Dave Miller)
o Clear other fields in qcam VIDIOCGWIN return (Damion de Soto)
o Fix sparc64 includes for socket.h (Solar Designer)
o ELF platform was misset for Pentium IV (Mikael Pettersson)
o ADMTek 985 ident was wrong (Lee Bradshaw)
o Fix filemark status test on scsi tape (Robin Miller)
o Fix file/block when spacing to tape beginning (Kai Makisara)
o Small ISDN documentation fixes (Kai Germaschewski)
o Resync icn driver with core isdn tree (Kai Germaschewski)
o Fix isdn loopback driver (Kai Germaschewski)
o Fix small leaks in lockd (Trond Myklebust)
o Add Pentium IV rep nop, ident etc (Various folks, notably HPA and Linda Wang)
o Update sparc default config (Dave Miller)
o Hopefully properly fix the megaraid problem (Willy Tarreau, AMI and others)
o Resync tcp bits with Dave (Dave Miller)
o Make cpqarray provide randomness (Nigel Metheringham)
o Fix wavefront symbols bug (Carlos E. Gorges)
o Fix acenic jumbo handling when flushing ring (Val Henson)
o Fix ace_set_mac_addr for littleendian hosts (Stephen Hack)
o Fix assorted typos in the kernel (Andries Brouwer)
o EEPro100 fixes (Dragan Stancevic)
o Fix hisax _setup crash case (David Woodhouse)
o Fix small cdrom driver bugs (Jens Axboe)
o Fix remaining vmalloc corner cases (Ben LaHaise)
o Update USB maintainers (Greg Kroah-Hartman)
o Fix matroxfb doc bug (Pavel Rabel)
o Fix setscheduler lock inversion (Andrew Morton)
o Fix scsi unload/sg ioctl oops (Paul Clements)

2.2.18pre21
o Environment controller update for sparc (Eric Brower)
o No italian translation for config.help (Andrea Ferraris)
o Fix type error in buz driver (Pete Zaitcev)
o Resynchronize Apple PowerMac codebase (Paul Mackerras & co)
o Merge powermac tree fixes into usb
o Powermac input device handling changes
o Fix console switch fonts
o S/390 merge (IBM S/390 folks)
(Merge grunt work done by Kurt Roeckx)
o Make knfsd TCP an option (me)
o Drop cisco info packets (0x2000) (Ivan Passos)
o Add belkin USB serial cable (William Greathouse)

2.2.18pre20
o Fix ide-probe SMP build error (Ian Morgan)
o Fix appletalk physical layer ioctl handling (Andi Kleen)
o Sparc update (Dave Miller)
o Update Stephen Tweedie's contact info (Stephen Tweedie)
o Fix typo in esp and scsi_obsolete code (Dave Miller)
o Bonding ioctl check fix (Willy Tarreau)
o Fix ipv6 procfs bug (Al Viro)
o Report PIV in proc as family 15 and uname as model 6 as discussed (me)
o Redo Intel cache decodes as code not tables and add new ones (me, based on updates by Asit Mallick & Andrew Ip)
o Fix CMOS locking in machine_power_off paths (me)
o Create build tree symlinks only if insmod is new enough not to be confused by it (Keith Owens)
o Fix cmsg handling (Philippe Troin)
o Tiny xpds driver changes (Dan Hollis)
o Fix vmalloc sign bug (Ben LaHaise)
o SMBFS fixes/changes for find_next problems and to avoid truncate bug in netapps (Urban Widmark)
o Fix ntfs translation bug (Anton Altaparmakov)
o Fix sparc problem with some soundcards and the _IOC magic (Jeff Garzik)
o Update ppa driver to v2.05 (Tim Waugh)


2.2.18pre19
o Fix transproxy socket lookup (Val Henson)
o Add ICS1893 PHY to the SiS900 driver (Lei-Chun Chang)
o Fix documentation error in matroxfb (Vsevolod Sipakov)
o Update IDE floppy maintainer (Paul Bristow)
o Fix remaining cmos locking (Paul Gortmaker)
o Fix sparc bitfield/compiler bits on sound (Dave Miller)
o Update Pegasus USB driver (Petko Manolov)
o Networking updates - move divert header (Andi Kleen)
o Add ETH_P_ATM* defines (Matti Aarnio)
o Fix one more missing GFP_KERNEL/sk->allocation (Dave Miller)
o Fix ISDN multilink handler bug (Kai Germaschewski)
o Fix ymfpci unload cases (Kai Germaschewski)

2.2.18pre18
o Fix off by one in net/ipv4/proc (Dave Miller)
o Move the fpu emu patch that got away (Dave Miller)
o K6 update for MTRR ability (Dave Jones)
o Fix raid1/vm deadlock (Marcelo Tosatti)
o Fix usb mouse userspace memory accesses (David Woodhouse)
o Fix xpdsl if compiled in (typo) (Arjan van de Ven)
o Rio fixes for modem handling. Fix a small generic serial bug (Patrick van de Lageweg)
o IBMtr driver fixes for cable pulls, pcmcia behaviour etc (Burt Silverman, Mike Sullivan)
o Tidy up /dev/microcode messages (Daniel Roesen)
o Add arpfilter (Andi Kleen)
o IDE floppy updates for clik support, cleanups (Paul Bristow)
o Fix irongate handling on Alpha (Soohoon Lee)
o Fix HZ=100 assumption in aha152x.c (me)
o Fix power management handling in i810 audio (me)
(From an ALSA fix by Godmar Back)
o Put the NFS block default back to 4K (Trond Myklebust)
o Fix misleading comment in printk code (Riley Williams)
o Fix fbcon scroll back/paste bug (Herbert Xu)
o Fix rtc_lock for ide-probe, and hd.c (Richard Johnson)
o Backport of 2.4 PR_GET/SET_KEEPCAPS (Brian Brunswick)
(from Chris Evans 2.4 code)
o LRU list corruption fix (Andrea Arcangeli)
o Initial gcc 2.96+ support for kernel building (H J Lu)
| Not a recommended compiler for production kernels...
o ALI silence clearing fix (Ching-Ling Lee)
o Fix remaining old-style use of copy_strings (Solar Designer)
o Better pci_resource_start macro for 2.2 (Jeff Garzik)
o Fix nbd deadlock (Marcelo Tosatti)

2.2.18pre17
o Move a few escaped m68k headers into the right directory (me)
o Backport 2.4 AF_UNIX garbage collect speedups (Dave Miller)
o TCP fixes for NFS (Saadia Khan)
o Fix USB audio hangs (David Woodhouse)
o Sparc64 dcache and exec fixes (Dave Miller)
o Fix typing crap in divert.h (Jeff Garzik)
o Use pkt_type in diverter, add maintainer info (Dave Miller)
o Fix obscure NAT problem in FIB code (Dave Miller)
o Fix sk->allocation in TCP sendmsg (Marcelo Tosatti)
o Elevator fixes (Andrea Arcangeli)
o Allow broken_suid on NFS root (Trond Myklebust)
o Fix net/ipv6/proc off by one bug (Dave Miller)
o Fix AGP oops on Alpha (Michal Jaegermann)
o MSR/CPUID init call fixes (Arjan van de Ven)
o CS4281 sound hang fixes (Thomas Woller)
o AX.25 comment updates, Joerg has moved email (Joerg Reuter)

2.2.18pre16
o Finally get the m68k tree merged (Andrew McPherson and a cast of many)
o Bring the sparc back in line, make it build (Anton Blanchard)
o USB Bluetooth fixes/docs (Greg Kroah-Hartman)
o Fix auth_null credentials bug (Hai-Pao Fan)
o Update cpu flag names (Dave Jones)
o Console 'quiet' boot option as in 2.4 (Rusty Russell)
o Make the sx serial driver work again (Patrick van de Lageweg)
o Fix negotiation on the SYM53C1010 (Gerard Roudier)
o Fix alpha loops per jiffy (Jay Estabrook)
o Fix pegasus to work with 2.2 kernels (Greg Kroah-Hartman)
o Update plusb driver for 2.2.x (Eric Ayers, Deti Fliegl)
o Fix ohci to use __init (Greg Kroah-Hartman)
o /sbin/hotplug support for USB as in 2.4 (Greg Kroah-Hartman)
o Update ksymoops url (Keith Owens)
o Update the changes doc about gcc (Petri Kaukasoina)
o Fix AMD flag naming (Ulrich Windl)
o Restore old block size on devices after a partition scan (needed for powermac for one) (Michael Schmitz)
o Fix GPL naming in SubmittingDrivers (Mike Harris)
o NFSv3 server patches merge (Dave Higgen)
o CS46xx changes (Nils Faerber)
o Fix sys_nanosleep for >4GHz CPU changes (me)
(Spotted by Ben Herrenschmidt)
o Fix pas rev D mixer (??)
o Fix multiple spelling errors (André Dahlqvist)
o ISDN updates (Kai Germaschewski)
o XSpeed DSL driver (Timothy Lee, Dan Hollis)
o IDE multi-lun/single-lun handling (Jens Axboe)
o Fix alpha generic trident sound support (Rich Payne)
o Fix PPC for loops per jiffy (Cort Dougan)

2.2.18pre15
o Default msdos behaviour to old (small) letters (me)
| An option 'big' goes with 'small'
o Fix define collision in cpqfc (Arjan van de Ven)
o Fix case where scripts/kwhich isnt executable (me)
o Alpha FPU divide fix (Richard Henderson)
o Add ADMtek985 to the tulip list (J Katz)
o Lose excess ymfpci debugging (Rob Landley)
o Fix i2c bus id clash (Russell King)
o Update the ARM vidc driver (Russell King)
o Update the ARM am79c961a driver (Russell King)
o Fix parport_pc build with no PCI (Russell King)
o Fix ARM memzero (Russell King)
o Update ARM for __init and __setup (Russell King)
o Update ARM to loops_per_jiffy (Russell King)
o Remove arm ecard debug messages (Russell King)
o Fix ARM makefiles (Russell King)
o Fix iph5526 driver to use mdelay (Arjan van de Ven)
o Fix epca, dtlk, aha152x loops_per_sec bits (Philipp Rumpf)
o Fix smp tlb invalidate and bogomip printing (Philipp Rumpf)
o Fix NLS warnings (Arjan van de Ven)
o Fix wavefront conversion to loops_per_jiffies (me)
o Fix an audio problem and a sanyo changer problem (Jens Axboe)
o Fix include bug with divert (me)
| Alternate fix to Willy Tarreau's
o Fix Alpha for loops_per_jiffy (Willy Tarreau)

2.2.18pre14
o Reorder attributes in drm to work with gcc272 (me)
o GNU cross compilers are foo-bar-gcc (Russell King)
o Add extra strange pcnet32 ident (Willy Tarreau)
o Since no vendor can get which right.. use a shell script instead (Miquel van Smoorenburg)
| Please nobody tell me this fails in some bash version!
o Should be using bash not bash2 (escaped debug) (Petri Kaukasoina)
o spin_unlock_irq wrong debug mode printk (Willy Tarreau)
o Fix pcxx for the loops changes (Arjan van de Ven)
o Fix ov511/via-rhine name clash (Arjan van de Ven)
o Fix bridge compile with loops_per_sec change (Mitch Adair)
o 8139too driver added (Jeff Garzik)

2.2.18pre13
o Change udelay to use loops_per tick (Philipp Rumpf)
| Otherwise we bomb out at 2GHz which isnt far enough
| away with 1.4/1.6GHz stuff due out RSN
o Fix drivers using big delays to use mdelay (me)
o Fix drivers that used loops_per_sec (Philipp Rumpf, me)
o Fix yamaha PCI sound SMP bug (Arjan van de Ven)
o Change to preferred USB init fix (David Rees)
o Fix rio fix (Arjan van de Ven)
o Catch the VT but no mouse case in init/main.c (Arjan van de Ven)
o Fix the 'which' compiler stuff (Horst von Brand, Peter Samuelson)
| Can someone verify for me this works on Slackware and
| on Caldera ?
o Add devfs include. Devfs wont be going into 2.2 but this again makes it easier to do 2.2/2.4 drivers. (Richard Gooch)

2.2.18pre12
o Fix cyrix MTRR handling bug (IIZUKA Daisuke)
o Fix ymfpci poll (me, Arjan)
o Update radio-maestro, add Configure.help (Adam Tlalka)
o Fix rio/generic serial build bug (Marcelo Tosatti)
o USB build bug fix (Arjan van de Ven)
o Fix missing ac97_codec.c return value (Arjan van de Ven)
o Fix several warnings (Arjan van de Ven)
o Made the PS/2 reconnect behaviour optional (me)
| Its now 'psaux-reconnect' on the boot line
o Allow for newer Hauppauge with 4 ports (Krischan Jodies)
o Switch sound drivers from library to object (Arjan van de Ven)
o Kill the not working ac97 lock on the 810 (me)
o Automatically select older compilers for kernel builds on Debian and RH (Arjan van de Ven)
o Start volumes higher on ac97, teach the driver about 5bit and 6bit codec precision and use the mute bit. (Rui Sousa)

2.2.18pre11
o Kill bogus codec_id assignment (Linus Torvalds)
o Update codec init code to handle id right (me)
o Fix dead/clashing define for NFS (Trond Myklebust)
o Remove the find_vga crap from bttv (me)
o Fix return on probe failure for cadet (Arjan van de Ven)
o Add missing configure.help stuff from 2.4test (Alan Ford)
o Fix inia100/megaraid define clash (Arjan van de Ven)
o __xchg marked as taking volatiles (Arjan van de Ven)
o Fix vwsnd warning in sound core (Arjan van de Ven)
o wdt_pci driver should return -EIO on error (Arjan van de Ven)
o Fix init_adfs_fs warning (Arjan van de Ven)
o Fix the joystick driver option parsing (Arjan van de Ven)
o Update mkdep to handle // commenting (Mike Klar)
o Thunderlan driver typo fixes (Torben Mathiasen)
o Add KX133/KT133 stuff to the AGP/DRM (Jeff Nguyen)
o Fix multiple card bug in eepro driver (Aristeu Filho)
o Initial YMF PCI native driver (Pete Zaitcev)
| Based on Jaroslav's ALSA driver and I've tweaked it
| a bit and maybe broken it 8)
o Fix procfs unlink bugs (Willy Tarreau)
o X.25 bugfix backport (Henner Eisen)
o Fix incorrect free_dma on DMAless boxes (Boria)
o Fix via audio driver merge (Nick Lamb)
o Update plusb driver to 2.4 one (Greg Kroah-Hartman)
o Put description info in wacom driver (Greg Kroah-Hartman)
o Update both UHCI drivers to match 2.4test (Greg Kroah-Hartman)
o Masquerade cleanup/warning fixes (Horst von Brand)

2.2.18pre10
o Add printk level to partition printk messages (me)
o Fix bluesmoke address report/serialize (Andrea Arcangeli)
o Add 2.4pre CPUID/MSR docs to 2.2.18pre (Adrian Bunk)
o Update to the 2.4pre via audio driver (Jeff Garzik)
o Fix small SMP race in set_current_state (Andrea Arcangeli)
o Fix __KERNEL__ checks in sparc headers (Dave Miller)
o Fix ADFS root directory bug added in pre9 (Russell King)
o Trap incorrect swap partition sizes (Andries Brouwer)
o Fix nfsroot bootp/dhcp on sparc64 (Dave Miller)
o Tidy up tcp opt parsing (Dave Miller)
o Check range on port range sysctl (Dave Miller)
o Back out erroneous i2c.h change (Arjan van de Ven)
o Fix trident hangs due to over zealous addition of midi support (Eric Brombaugh)
o Fix big endian/macro bug in ext2fs (Andi Kleen)
o Bring dabusb driver into line with 2.4 (Greg Kroah-Hartman)
o Bring event drivers into line with 2.4 (Franz Sirl, Greg Kroah-Hartman)
o Fix usb help texts (Greg Kroah-Hartman)
o Generic frame diverter (Benoit Locher)
o Bring USB serial back into line with 2.4 (Greg Kroah-Hartman)
o Fix DVD driver rpc state bug (Jens Axboe)
o Fix extra sunrpc printk (Tim Mann)
o USB init tidy up (Greg Kroah-Hartman)
o Allow PlanB video on generic PPC (Michel Lanners)
o Doc fixes/trim cvs logs on isdn drivers (Kai Germaschewski)
o USB hid, hub, ibmcam, dsbr100 devices updates (Greg Kroah-Hartman)
o Return EAFNOSUPPORT for out of range families
o Fix SMP locking on floppy driver (Jonathan Corbet)
o Add module author info to acm.c (Greg Kroah-Hartman)
o Update CREDITS to reflect all the USB guys (Greg Kroah-Hartman)

o ipfw wrong allocation flag fix (Rusty Russell)
o Implement Sun style lockf/nfs cache barriers (Trond Myklebust)
o Updated ISI serial driver (Multitech)
| You may well need their newer firmware set/loader for the
| later cards too

2.2.18pre9
o Fix usb module load oops (Thomas Sailer)
o Bring USB boot drivers in line with 2.4t8 (Greg Kroah-Hartman)
o And USB print drivers (Greg Kroah-Hartman)
o And USB Rio driver (Greg Kroah-Hartman)
o And USB dc2xx driver (Greg Kroah-Hartman)
o And USB mdc800 driver (Greg Kroah-Hartman)
o NFSv3 support and NFS updates (Trond Myklebust and co)
o Compaq 64bit/66Mhz PCI Fibrechannel driver (Amy Vanzant-Hodge)
o Disable microtouch driver (doesnt work in 2.2 currently) (Greg Kroah-Hartman)
o Update ADFS support (Russell King)
o Update ARM arch specific code and includes (Russell King)
o Update ARM specific drivers (Russell King)
o Use both fast and slow A20 gating on boot (Kira Brown)
| if your box doesnt boot I want to know about it...
| Needed for stuff like the AMD Elan

2.2.18pre8
o Fix mtrr compile bug (Peter Blomgren)
o Alpha PCI boot up fix (Michal Jaegermann)
o Fix vt/keyboard dependency in USB config (Arjan van de Ven)
o Fix sound hangs on cs4281 (Tom Woller)
o Fix Alpha vmlinuz.lds (Andrea Arcangeli)
o Fix CDROMPLAYTRKIND bug, allow root to open the cd door whenever. (Jens Axboe)
o Update ov511 to match 2.4 (Greg Kroah-Hartman)
o Further devio.c fix (Greg Kroah-Hartman)
o Update NR_TASKS comment (Jarkko Kovala)
o Further sparc64 ioctl translator fixes (Andi Kleen)

2.2.18pre7
o Fix the AGP compile in bug (Arjan van de Ven)
o Revert old incorrect syncppp state change (Ivan Passos)
o Fix i810 rng to actually get built in (Arjan van de Ven)
o Megaraid compile fix, joystick, mkiss fixes (Arjan van de Ven)
o Kawasaki USB ethernet depends on net (Arjan van de Ven)
o Compaq cpqarray update (Charles White)
o Fix usb problem with no USB unit found (Oleg Drokin)
o Driver for the radio on some maestro cards (Adam Tlalka)
o Additional shared map support needed for sparc64 (Dave Miller)
o Fix wdt_pci when compiled in (me, Arjan van de Ven)
o Fix usb missing symbol when non modular (Arjan van de Ven)
o Identify chip and also handle MTRR for the Cyrix III (me)
o Allow binding to all ports multicast (Andi Kleen)
o Bring USB docs up to date (Greg Kroah-Hartman)
o Bring USB devio up to date (Greg Kroah-Hartman)
o pci_resource_len null function for non PCI case (Arjan van de Ven)
o Fix synchronous write off end of disk bug (Jari Ruusu)

2.2.18pre6
o Fix the IDE PCI not compiling bug (Dag Wieers)
o Kill an escaped reference to vger.rutgers (Dave Miller)
o Small rtl8139 fixups (Jeff Garzik)
o Add USB bluetooth driver (Greg Kroah-Hartman)
o Fix oops in visor driver (Greg Kroah-Hartman)
o Remove some unneeded ext2 includes, fix a bug in the UFS code (Andreas Dilger)
o Fix rtc race between timer and rtc irq (Andrea Arcangeli)
o Fix slow gettimeofday SMP race (Andrea Arcangeli)
o Check lost_ticks in settimeofday to be more precise (Andrea Arcangeli)

2.2.18pre5
o Added older VIA ide chipsets to the not to be autotuned list (me)
o Fix crash on boot problem with __setup stuff (me)
o Small acenic fix (Matt Domsch)
o Fix hfc_pci isdn driver (Jens David)
o Fix smbfs configuration problem (Urban Widmark)
o Emu10K wrapper/build fixes (Rui Sousa)
o Small cleanups (Arjan van de Ven)
o Fix sparc32 build bug (Horst von Brand)
o Fix quota oops (Martin Diehl)
o Add i810 random number driver (Jeff Garzik)
o Clear suid bits on ext2 truncate as per SuS (Andi Kleen)
o Fix illegal use of section attributes (Arjan van de Ven)
o Documentation for nmi watchdog (Marcelo Tosatti)
o Fix uninitialised variable warnings (Arjan van de Ven)
o Save DR6 condition into the TSS (Ryan Wallach)
o Add additional __init's to the kernel (Andrzej M. Krzysztofowicz)
o Backport 2.4 wdt_pci driver (JP Nollman, me)
o AGP i810 fixes (Chip Salzenberg)
o UDMA support for ALI1543 & 1543C IDE devices (ALI)
o 2.4 MSR/CPUID driver backport (Dave Jones, H Peter Anvin)
o Fix incorrect use of kernel v user ptr in NCPfs (Petr Vandrovec)
o Updated scsi tape driver (Kai Makisara)

2.2.18pre4
o Remove the aacraid driver again, having looked at what is needed to make it acceptable and debug it - I'm dumping it back on Adaptec (me)
o DAC960 update (Leonard Zubkoff)
o Add setup vmlinuz.lds changes for Sparc (Arjan van de Ven)
o Sparc updates for drm, ioctl and other (Dave Miller)
o Megaraid driver update (Peter Jarrett)
o Add cd volume 0 to the amp power off on the crystal cs46xx (Bill Nottingham)
o Fix IPV6 fragment and kfree bugs (Alexey Kuznetsov)
o Fix emu10k build bug (me)
o Emu10K driver upgrade. Adds emu-aps support (Rui Sousa)
o Updated IBM serveraid driver to 4.20 (IBM)
o Ext2 block handling cleanup from 2.4 (Al Viro)
o Make the ATI128 driver modular (Marcelo Tosatti)
o Fix megaraid build bug with gcc 2.7.2 (Arjan van de Ven)
o Fix some of the dquot races (Jan Kara)
o x86 setup code cleanup (Dave Jones)
o Implement 2.4 compatible __setup and __initcall (Arjan van de Ven)
o Tidy up smp_call_function stuff (Keitaro Yosimura)
o Remove 2.4 compat glue from cs4281 driver (Marcelo Tosatti)
o Fix minor bugs in bluesmoke now someone actually has a faulty CPU and logs (me)
o Fix definition of IPV6_TLV_ROUTERALERT (Dave Miller)
o Fix in6_addr, ip_decrease_ttl, other minor bits (Dave Miller)
o cp932 fixes (Kazuki Yasumatsu)
o Updated gdth driver (Andreas Koepf)
o Acenic update (Jes Sorensen)
o Update USB serial drivers (Greg Kroah-Hartman)
o Move pci_resource_len into pci compat (Marcelo Tosatti)

2.2.18pre3 (versus 2.2.17pre20)
o Clean up most of the compatibility macros that various people use. I've systematically moved the 100% correct ones to the headers used in 2.4 (me)
o Fix newly introduced bug in kmem_cache_shrink (Daniel Roesen)
o Further updates to symbios drivers (Gérard Roudier)
o Remove emu10K warning and mtrr warning (Daniel Roesen)
o Fix symbol clash between cs4281 and esssolo1 (Arjan van de Ven)
o Fix acenic non modular/module build issues (Arjan van de Ven)
o Fix bug in alpha csum_partial_copy that could cause spurious EFAULTs (Herbert Xu)
o Yet another eepro100 variant sighted (Torben Mathiasen)
o Minor microcode.c final tweak (Daniel Roesen)
o Document that ATIFB is now modular (Marcelo Tosatti)
o Parport update (Tim Waugh)
o First set of ext2 updates/fixes (Al Viro)
o Bring smbfs back into line with 2.2 (Urban Widmark)
| This should make OS/2 work again
o Fix S/390 _stext (still doesnt build dasd) (Kurt Roeckx)
o Remove unused vars in arch/i386/kernel/bios32.c (Daniel Roesen)
o Update the DHCP initrd support (Chip Salzenberg)
o Allow opening empty scsi removables like IDE with O_NONBLOCK (needed for some ioctls) (Chip Salzenberg)
o Back out vibra mixer change
o Fix error returns in sbni driver (Dawson Engler)
o Initial merge of the aacraid driver (Adaptec)
| Much deuglification left to be done here
o Report megaraid: on obscure megaraid error strings (Daniel Deimert)
o Add another CS4299 id string (Mulder Tjeerd)

2.2.18pre2 (versus 2.2.17pre20)

o Fix the compile problems with microcode.c (Dave Jones, Daniel Roesen)
o GDTH driver update (Achim Leubner)
o Fix mathsemu misuse of casting with asm (??)
o Make msnd_pinnacle driver build on Alpha
o Acenic 0.45 fixes (Chip Salzenberg)
o Compaq CISS driver (SA 5300) (Charles White, me)
+ cleanups
+ gcc 2.95 fixup
o Modularise pm2fb and atyfb
o Upgrade AMI Megaraid driver to 1.09 (AMI)
o Add DEC HSG80 and COMPAQ 'logical volume' to scsi multilun list
o SK PCI FDDI driver support (Schneider & Koch)
o Linux 2.2 USB backport (Vojtech Pavlik)
backport 3 + further fixes from the USB list
+ mm/slab.c fix for cache destroy
o AGP driver backport, DRM driver backport (XFree86, Precision Insight, XiG, HJ Lu, VA Linux, and others)

2.2.18pre1 (versus 2.2.17pre20)

o Update symbios/ncr driver to 1.7.0/3.4.0 (Gérard Roudier)
o Updated ATP870U driver (ACard)
o Avoid running tq_scheduler stuff sometimes with interrupts off (Andrea Arcangeli)
o Further cpu setup updates (me)
o IBM MCA scsi driver updates (Michael Lang)
o Fix incorrect out of memory handling in bttv (Dawson Engler)
o Fix incorrect out of memory handling in buz (Dawson Engler)
o Fix incorrect out of memory handling in qpmouse (Dawson Engler)
o Fix error handling memory leak in ipddp (Dawson Engler)
o Fix error handling memory leak in sdla (Dawson Engler)
o Fix error handling memory leak in softoss (Dawson Engler)
o Fix error handling memory leak in ixj (Dawson Engler)
o Fix error handling memory leak in ax25 (Dawson Engler)
o Merge the microcode driver from 2.4 into 2.2 (Tigran Aivazian)
o Fix skbuff handling bug in the smc9194 driver (Arnaldo Melo)
o Make vfat use the same generation rules as in windows 9x (H. Kawaguchi, Chip Salzenberg)
o Fix oops in the CPQ array driver (Arnaldo Melo)
o Fix ac97 codec not setting the id field (Bill Nottingham)
o Further work on the cs46xx/CD power bits (me)
o Synclink updates (Paul Fulghum)
o Synclink init bug fix (Arnaldo Melo)
o Handle odd interrupts from toshiba floppies (Alain Knaff)
o Fix trident driver build on nautilus Alpha (Peter Petrakis)
o Add later sb16 imix support to the sb driver (Massimo Dal Zotto)
o Ignore luns that report can be connected, but not currently (Matt Domsch)
o Fix dereference after kfree in uart401.c (Dawson Engler)
o Return correct SuS error code for an unknown socket family (Herbert Xu)
o Add sub window clipping to the bttv driver (Thomas Jacob)
o Fix nfs cache locked messages (Trond Myklebust)
o Fix the modutils misdocumentation (Martin Douda)
o Remove bogus biosparm code from seagate.c (Andries Brouwer)
o Return correct error code on failed fasync set (Chip Salzenberg)
o Handle dcc resume with newer irc clients when doing an irc masq (Scottie Shore)

--
Alan Cox <[email protected]>
Red Hat Kernel Hacker
& Linux 2.2 Maintainer Brainbench MVP for TCP/IP
http://www.linux.org.uk/diary http://www.brainbench.com


[1] It does have the page aging patch too, but I want to merge that in
2.2.19pre so we can study any surprises it causes.


2000-12-07 23:54:15

by Miquel van Smoorenburg

Subject: Re: Linux 2.2.18pre25

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>So I figure this is it for 2.2.18, subject to evidence to the contrary

Megaraid still needs fixing. I sent you the patch twice, so have
other people, but it still isn't fixed. The

megaBase &= PCI_BASE_ADDRESS_MEM_MASK;

...

megaBase &= PCI_BASE_ADDRESS_IO_MASK;

is removed by the 2.2.18 version (read the patch) and that breaks
older megaraid cards.
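
For readers unfamiliar with the two masks: a raw PCI base address register keeps flag bits in its lowest bits, and the masks clear them to leave the base address. A minimal sketch, assuming a raw 32-bit BAR value; `bar_to_base` is a hypothetical helper and not megaraid driver code, and the constants are redefined locally with the same bit patterns as the kernel's PCI header. It only shows what the masks do and takes no position on whether the driver needs them.

```c
#include <stdint.h>

/* Local copies of the kernel's PCI constants (same bit patterns). */
#define PCI_BASE_ADDRESS_SPACE     0x01u     /* bit 0: 0 = memory, 1 = I/O */
#define PCI_BASE_ADDRESS_SPACE_IO  0x01u
#define PCI_BASE_ADDRESS_MEM_MASK  (~0x0fu)  /* memory BARs: low 4 bits are flags */
#define PCI_BASE_ADDRESS_IO_MASK   (~0x03u)  /* I/O BARs: low 2 bits are flags */

/*
 * Hypothetical helper: strip the flag bits from a raw 32-bit BAR value
 * and return the usable base address.
 */
uint32_t bar_to_base(uint32_t raw_bar)
{
    if ((raw_bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO)
        return raw_bar & PCI_BASE_ADDRESS_IO_MASK;
    return raw_bar & PCI_BASE_ADDRESS_MEM_MASK;
}
```

With these definitions, an I/O BAR read back as 0xe401 yields a port base of 0xe400; without the mask, the set flag bit would be treated as part of the address.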

Existing megaraid systems running 2.2.x kernels WILL break with 2.2.18.

Mike.

2000-12-08 00:10:22

by Alan

Subject: Re: Linux 2.2.18pre25

> Megaraid still needs fixing. I sent you the patch twice, so have
> other people, but it still isn't fixed. The

I asked people to explain why it was needed. I am still waiting. It is a
patch that does nothing. I will not put random deep magic into the kernel.

I have no reason to believe the current driver in 2.2.18pre24 does not work.
Have you tried that specific kernel?


2000-12-08 00:51:36

by Andrea Arcangeli

Subject: Re: Linux 2.2.18pre25

On Thu, Dec 07, 2000 at 08:03:00PM +0000, Alan Cox wrote:
>
> OK, we believe the VM crash that loops printing error messages is now fixed.

Such a bug can't generate crashes. Did you ever reproduce crashes on your 8MB
486 with 2.2.18pre24?

> Marcelo finally figured it out, and my 8MB 486 has been running 2.2.18pre
> stably with that fix[1].

diff -urN 2.2.18pre24/mm/filemap.c 2.2.18pre25/mm/filemap.c
--- 2.2.18pre24/mm/filemap.c Wed Nov 29 19:28:29 2000
+++ 2.2.18pre25/mm/filemap.c Fri Dec 8 00:41:45 2000
@@ -220,8 +220,10 @@
* throttling.
*/

- if (!try_to_free_buffers(page, wait))
+ if (!try_to_free_buffers(page, wait)) {
+ if(--count < 0) break;
goto refresh_clock;
+ }
return 1;
}

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.18pre24aa1/00_account-failed-buffer-tries-1

--- 2.2.17pre19/mm/filemap.c Tue Aug 22 14:54:13 2000
+++ /tmp/filemap.c Thu Aug 24 01:05:50 2000
@@ -179,6 +179,8 @@
if ((gfp_mask & __GFP_DMA) && !PageDMA(page))
continue;

+ count--;
+
/*
* Is it a page swap page? If so, we want to
* drop it if it is no longer used, even if it
@@ -224,7 +226,7 @@
return 1;
}

- } while (--count > 0);
+ } while (count > 0);
return 0;
}

lftp> pwd
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.17pre19
^^^^^^^^^^^
lftp> ls -l account-failed-buffer-tries-1
-rw-r--r-- 1 korg korg 407 Sep 5 22:43 account-failed-buffer-tries-1
^^^^^^
lftp>

The only difference is that pre25 keeps decreasing `count' for locked, mapped
and out-of-zone pages, which means it will still fail to shrink the cache when
it looks at an unlucky part of physical memory, while
account-failed-buffer-tries-1 intentionally doesn't decrease `count' in those
cases, precisely to avoid failing there.
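
The difference between the two accounting policies can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's shrink_mmap: the page states, function names and single-pass scan are all invented for illustration, and the off-by-one details of the real loop are not reproduced.

```c
#include <stddef.h>

/* Toy page states; illustrative names, not kernel identifiers. */
enum page_state {
    PAGE_SKIPPED,       /* locked, mapped, or outside the requested zone */
    PAGE_BUSY_BUFFERS,  /* eligible, but its buffers cannot be freed yet */
    PAGE_FREEABLE       /* eligible and can be reclaimed */
};

/*
 * Policy resembling pre25 as described above: every page that is looked
 * at and not reclaimed uses up one unit of the scan budget `count`.
 * Returns 1 if a page was reclaimed, 0 if the budget ran out first.
 */
int scan_charge_all(const enum page_state *pages, size_t npages, int count)
{
    for (size_t i = 0; i < npages && count > 0; i++) {
        if (pages[i] == PAGE_FREEABLE)
            return 1;
        count--;        /* skipped and busy pages both cost budget */
    }
    return 0;
}

/*
 * Policy resembling account-failed-buffer-tries-1 as described above:
 * skipped pages are free of charge, only failed buffer frees cost budget.
 */
int scan_charge_eligible(const enum page_state *pages, size_t npages, int count)
{
    for (size_t i = 0; i < npages && count > 0; i++) {
        if (pages[i] == PAGE_SKIPPED)
            continue;   /* not charged */
        if (pages[i] == PAGE_FREEABLE)
            return 1;
        count--;        /* failed buffer try is charged */
    }
    return 0;
}
```

With a budget of 2 and three skipped pages ahead of one freeable page, the first policy gives up before reaching the freeable page while the second reclaims it. When the pages ahead all have busy buffers, both policies give up, which is what bounds the loop.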

account-failed-buffer-tries-1 is included in VM-global-7 and it was
described in the 2.2.18pre21aa2 email to l-k (CC'ed you) dated Fri, 17 Nov
2000 18:54:43 +0100:

[..]
00_account-failed-buffer-tries-1

Account also the failed buffer tries during shrink_mmap. (me)
(this is included in the VM-global that I maintain against vanilla
2.2.x btw)
[..]

Andrea

2000-12-08 00:56:16

by Alan

Subject: Re: Linux 2.2.18pre25

> Such a bug can't generate crashes. Did you ever reproduce crashes on your 8MB
> 486 with 2.2.18pre24?

Yes. Every 20 minutes or so, quite reliably. With that change it has yet to
crash (it's actually running that + page aging + another minor tweak so it
doesn't return success on page aging until we have a clump of free pages).

With just the page aging patch it performed way better but still hung.

> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.18pre24aa1/00_account-failed-buffer-tries-1
>

Oh well ;)

> account-failed-buffer-tries-1 is included in VM-global-7 and it was
> described in the 2.2.18pre21aa2 email to l-k (CC'ed you) dated Fri, 17 Nov
> 2000 18:54:43 +0100:

The problem is it's hard to know which of your patches depend on what, and
the complete set is large to say the least.

Alan

2000-12-08 01:12:28

by Andrea Arcangeli

Subject: Re: Linux 2.2.18pre25

On Fri, Dec 08, 2000 at 12:27:58AM +0000, Alan Cox wrote:
> The problem is it's hard to know which of your patches depend on what, and
> the complete set is large to say the least.

That's why I use a `proposed' directory that only contains patches that can be
applied to your tree, in this case it was:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/proposed/v2.2/2.2.18pre2/VM-global-2.2.18pre2-6.bz2

(note: the above is outdated, so it is no longer suggested for inclusion, of
course)

I submitted most of the not-feature-oriented stuff at pre2 time and I plan to
re-submit after 2.2.18 is released.

Andrea

2000-12-08 01:16:19

by Alan

Subject: Re: Linux 2.2.18pre25

> (note: the above is outdated, so it is no longer suggested for inclusion, of
> course)
>
> I submitted most of the not-feature-oriented stuff at pre2 time and I plan to
> re-submit after 2.2.18 is released.

Excellent. I've been trying to avoid VM fixes for 2.2.18 to stop stuff getting
muddled together and hard to debug. Running with page aging convinces me that
for 2.2.19 we badly need to sort out some of the VM issues, and make it faster
than 2.4test 8)

2000-12-08 01:16:29

by Rainer Mager

Subject: Signal 11

Hi all,

I've searched around for an answer to this with no real luck yet. If anyone
has some ideas I'd be very grateful.

I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
Anyway, about once every 2-3 days X will spontaneously die and the only info
I get back is that it was because of signal 11.
I've heard that signal 11 can be related to bad hardware, most often
memory, but I've done a good bit of testing on this and the system seems ok.
What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
errors. Actually this only worked when running from the console. When
running from X the machine locked up (although no signal 11).
The only info I've gotten back from the XFree86 mailing lists so far is
that there are known and wide spread problems with SMP and these types of
problems. Can anyone comment on this? Are there known SMP problems? What is
the current resolution plan?


Thanks,

--Rainer

2000-12-08 01:40:07

by Michel Lespinasse

[permalink] [raw]
Subject: Re: Signal 11

On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:

> I've heard that signal 11 can be related to bad hardware, most
> often memory, but I've done a good bit of testing on this and the
> system seems ok. What I did was to run the VA Linux Cerberos(sp?)
> test for 15 hours+ with no errors. Actually this only worked when
> running from the console. When running from X the machine locked up
> (although no signal 11).

Don't be so quick to dismiss the "bad hardware" possibility. It is
really quite common these days. And, some cases of bad hardware are
not detected using simple tests like memtest86. (I'm not sure exactly
what cerberos does, do you have a link for it ?).

My recommendation would be to take a big source tree (say, a bit
bigger than the amount of RAM you have), and run repetitive
tar+detar+diff -ru runs on it for 48 hours or so. If your hardware
runs OK, diff should not report any inconsistencies. I found this test
to be quite reliable to detect hardware problems. If you have several
disk controllers, run one instance of the test on each of
them. Additionally you could run a background task to keep the CPU at
100% - a simple while 1 loop would do.

--
Michel "Walken" LESPINASSE
Of course I think I'm right. If I thought I was wrong, I'd change my mind.

2000-12-08 01:40:55

by Jeff Merkey

[permalink] [raw]
Subject: Re: Signal 11


I have previously reported this error (about three months ago) on 2.4
with XFree86 3.3.6. If you are running RedHat 6.2, then you are running
this X Server. It also shows up on Caldera's 2.4 eDesktop. It appears
to be something with glibc versions <= 2.1 on 2.4. I also see it with
secure shell 1.2.27 on 2.4. I've seen it on RH 7.0 with 2.4 kernels
as well, but only with SSH.

Jeff

Rainer Mager wrote:
>
> Hi all,
>
> I've searched around for an answer to this with no real luck yet. If anyone
> has some ideas I'd be very grateful.
>
> I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> Anyway, about once every 2-3 days X will spontaneously die and the only info
> I get back is that it was because of signal 11.
> I've heard that signal 11 can be related to bad hardware, most often
> memory, but I've done a good bit of testing on this and the system seems ok.
> What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> errors. Actually this only worked when running from the console. When
> running from X the machine locked up (although no signal 11).
> The only info I've gotten back from the XFree86 mailing lists so far is
> that there are known and wide spread problems with SMP and these types of
> problems. Can anyone comment on this? Are there known SMP problems? What is
> the current resolution plan?
>
> Thanks,
>
> --Rainer
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-12-08 01:51:45

by Andi Kleen

[permalink] [raw]
Subject: Re: Signal 11

On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:
> I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> Anyway, about once every 2-3 days X will spontaneously die and the only info
> I get back is that it was because of signal 11.
> I've heard that signal 11 can be related to bad hardware, most often
> memory, but I've done a good bit of testing on this and the system seems ok.
> What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> errors. Actually this only worked when running from the console. When
> running from X the machine locked up (although no signal 11).
> The only info I've gotten back from the XFree86 mailing lists so far is
> that there are known and wide spread problems with SMP and these types of
> problems. Can anyone comment on this? Are there known SMP problems? What is
> the current resolution plan?

signal 11 just means that the program crashed with a segmentation fault.

Sounds like a X Server bug. You should probably contact XFree86, not
linux-kernel


-Andi

2000-12-08 01:58:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>Excellent. I've been trying to avoid VM fixes for 2.2.18 to stop stuff getting
>muddled together and hard to debug. Running with page aging convinces me that
>2.2.19 we need to sort some of the vm issues out badly, and make it faster than
>2.4test 8)

Ahh.. The challenge is out!

You and me. Mano a mano.

Linus

2000-12-08 02:00:05

by Jeff Merkey

[permalink] [raw]
Subject: Re: Signal 11


Andi,

It's related to some change in 2.4 vs. 2.2. There are other programs
affected other than X; SSH also gets spurious signal 11s now and again
with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

Jeff

Andi Kleen wrote:
>
> On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:
> > I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> > a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> > Anyway, about once every 2-3 days X will spontaneously die and the only info
> > I get back is that it was because of signal 11.
> > I've heard that signal 11 can be related to bad hardware, most often
> > memory, but I've done a good bit of testing on this and the system seems ok.
> > What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> > errors. Actually this only worked when running from the console. When
> > running from X the machine locked up (although no signal 11).
> > The only info I've gotten back from the XFree86 mailing lists so far is
> > that there are known and wide spread problems with SMP and these types of
> > problems. Can anyone comment on this? Are there known SMP problems? What is
> > the current resolution plan?
>
> signal 11 just means that the program crashed with a segmentation fault.
>
> Sounds like a X Server bug. You should probably contact XFree86, not
> linux-kernel
>
> -Andi

2000-12-08 02:11:08

by Andi Kleen

[permalink] [raw]
Subject: Re: Signal 11

On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
>
> Andi,
>
> It's related to some change in 2.4 vs. 2.2. There are other programs
> affected other than X; SSH also gets spurious signal 11s now and again
> with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

So have you enabled core dumps and actually looked at the core dumps
of the programs using gdb to see where they crashed ?



-Andi

2000-12-08 02:18:31

by Jeff Merkey

[permalink] [raw]
Subject: Re: Signal 11



Andi Kleen wrote:
>
> On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
> >
> > Andi,
> >
> > It's related to some change in 2.4 vs. 2.2. There are other programs
> > affected other than X; SSH also gets spurious signal 11s now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
>
> So have you enabled core dumps and actually looked at the core dumps
> of the programs using gdb to see where they crashed ?

Yes. I can only get the SSH crash when I am running remotely from the
house over the internet, and it only shows then when running a build in
jobserver mode (parallel build). The X problem seems related as well,
since it's related to (usually) Netscape spawning off a forked process.
I will attempt to recreate tonight, and post the core dump file.

Jeff





>
> -Andi

2000-12-08 02:28:51

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Signal 11

On Fri, 8 Dec 2000, Rainer Mager wrote:

> Hi all,
>
> I've searched around for an answer to this with no real luck yet. If anyone
> has some ideas I'd be very grateful.

Signal 11 just means that you "seg-faulted". This is usually caused
by a coding error. However, if you have tools (like the C compiler)
that have been running fine but start to seg-fault, this points to
the very real possibility of a hardware error.

Modern RAM (with no error correction), running outside of its
timing specifications, is often the culprit. Even power supplies can
cause this problem. All you need is a single-bit error in a pointer's
value and -- signal 11.

Also, a bad opcode fetched from RAM with an error traps in much the
same way (as an illegal instruction rather than a seg-fault).

Do:

char main[]={0xff,0xff,0xff,0xff};


Compile and run this (it will compile!). You will see what
bad opcodes will do.



Cheers,
Dick Johnson

Penguin : Linux version 2.4.0 on an i686 machine (799.54 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-12-08 02:35:32

by Peter Samuelson

[permalink] [raw]
Subject: Re: Signal 11


[Dick Johnson]
> Do:
>
> char main[]={0xff,0xff,0xff,0xff};

Oh come on, at least pick an *interesting* invalid opcode:

char main[]={0xf0,0x0f,0xc0,0xc8}; /* try also on NT (: */

Peter

2000-12-08 02:46:05

by Rainer Mager

[permalink] [raw]
Subject: RE: Signal 11

Hi all,

Thanks for all the input so far. Regarding this...

> (I'm not sure exactly what cerberos does, do you have a link for it ?).

The official name is "Cerberus Test Control System" aka CTCS. I don't know
the official site but a search for this should reveal something. Anyway it
is a pretty comprehensive test that includes multiple kernel compiles,
memory tests, disk test, etc, etc. Like I said, I ran this for more than 15
hours with no problems.

Well, actually, I did notice that if I run CTCS from within X then it
freezes up after a few minutes. This appears to happen when/because of
extreme swapping.


Aside from the above I've also run repeated kernel compiles (more than 50
times) with 'make -j bzImage' and had no problems; all outputs were
identical.

So given these tests, I'm reasonably confident the core hardware is ok. I
suppose it is possible there are some iffy bits in the G400's VRAM (but
wouldn't that just result in screen artifacts?). I will admit that I haven't
yet tried swapping RAM or any other system components.


Any other ideas?

2000-12-08 03:01:17

by Dave Jones

[permalink] [raw]
Subject: Re: Signal 11

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

> It's related to some change in 2.4 vs. 2.2. There are other programs
> affected other than X; SSH also gets spurious signal 11s now and again
> with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

<AOL>

I've begun to get a bit paranoid about my K6-2 500 box.

Various processes have been getting random signals after heavy CPU usage.
Playing an MPEG movie, kernel compile, or even just some small apps
compiling sometimes. Just for the record, this isn't an OOM situation,
I've watched this box with half its memory free or in buffers left
unattended, and suddenly a compile will just die.

I replaced the CPU with a brand new K6-2. Problem remained.
Next suspect was faulty RAM. Despite having passed a memtest, I
swapped out the DIMMs for some known good ones.
Suspecting cooling problems, I added some case fans.
Next came a bigger power supply. Still the problems.
The latest last ditch attempt to make this box stable has been
to attach the biggest fan I could find that would fit a socket 7 CPU.

And still the problems are there.
The only remaining suspect would be a flaky motherboard.
But then comes the real killer : This box is rock solid under 2.2

*boggle*

I'm not sure exactly when this started, but I think I first noticed
it around test5 or so, but didn't suspect the kernel at the time.

I've tried kernels compiled with everything from 2.91.66 when this
was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
debian on it. If this is a compiler bug, it's one that no compiler
I've tried seems to be immune from.

regards,

Davej.

--
| Dave Jones <[email protected]> http://www.suse.de/~davej
| SuSE Labs

2000-12-08 03:48:27

by Jeff Merkey

[permalink] [raw]
Subject: Re: Signal 11


Dave,

I think there may be a case when a process forks, that the MMU or some
other subsystem is either not setting the page bits correctly, or
mapping in a bad page. It's a LEVEL I bug in 2.4 if this is the case,
BTW. In core dumps (I've looked at 2 of them from SSH) it barfs right
after executing fork() or one of the exec functions and at some places
in the code where there's not any obvious coding bugs. Looks like some
type of mapping problem. I reported it three months ago, but it was
pretty much ignored.

Linus needs to add this one to the pre-12 list -- looks like some type
of mapping bug.

Jeff

[email protected] wrote:
>
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
>
> > It's related to some change in 2.4 vs. 2.2. There are other programs
> > affected other than X; SSH also gets spurious signal 11s now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
>
> <AOL>
>
> I've begun to get a bit paranoid about my K6-2 500 box.
>
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
>
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
>
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
>
> *boggle*
>
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
>
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it. If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
>
> regards,
>
> Davej.
>
> --
> | Dave Jones <[email protected]> http://www.suse.de/~davej
> | SuSE Labs

2000-12-08 03:57:59

by Dave Jones

[permalink] [raw]
Subject: Re: Signal 11

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

> I think there may be a case when a process forks, that the MMU or some
> other subsystem is either not setting the page bits correctly, or
> mapping in a bad page. It's a LEVEL I bug in 2.4 if this is the case,
> BTW. In core dumps (I've looked at 2 of them from SSH) it barfs right
> after executing fork() or one of the exec functions and at some places
> in the code where there's not any obvious coding bugs. Looks like some
> type of mapping problem. I reported it three months ago, but it was
> pretty much ignored.
>
> Linus needs to add this one to the pre-12 list -- looks like some type
> of mapping bug.

Now that you mention it, every app that has bombed has been the type
that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
the CPU load up quite a lot, which was why I initially suspected
overheating. I don't see it on my other 2.4 boxes though which is
suspicious. But they don't get as much of a beating as this, which was
up until last week my main workstation.

regards,

Dave.

--
| Dave Jones <[email protected]> http://www.suse.de/~davej
| SuSE Labs

2000-12-08 04:11:00

by Jeff Merkey

[permalink] [raw]
Subject: Re: Signal 11



"Jeff V. Merkey" wrote:
>
> > So have you enabled core dumps and actually looked at the core dumps
> > of the programs using gdb to see where they crashed ?
>
> Yes. I can only get the SSH crash when I am running remotely from the
> house over the internet, and it only shows then when running a build in
> jobserver mode (parallel build). The X problem seems related as well,
> since it's related to (usually) Netscape spawning off a forked process.
> I will attempt to recreate tonight, and post the core dump file.

BTW, were I to wager a guess, I would guess it's related to the paging
problems in 2.4 when a process gets cloned, since every time I have seen
it, it happens when a child process gets forked and then accesses the cloned
data from the parent. In the previous core dumps, it always puked right
after a call to fork() when the child process attempted to WRITE (not
read) data in the program.

Jeff

2000-12-08 10:17:17

by David Woodhouse

[permalink] [raw]
Subject: Re: Signal 11


[email protected] said:
> Sounds like a X Server bug. You should probably contact XFree86, not
> linux-kernel

I quote from the X devel list, which perhaps I shouldn't do but this is hardly
NDA'd stuff:

On Mon 20 Nov 2000, [email protected] said:
> I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> GX boards (Intel). XFree86 core dumps indicate that it happens in
> random places, in old as dirt software rendering code that has nothing
> wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> would say that this is definitely a kernel problem.

XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
kernels - even on my BP6¹. The random crashes started to happen when I
upgraded my distribution² - and are only seen by people using 2.4. So I
suspect that it's the combination of glibc and kernel which is triggering
it.

--
dwmw2

¹ And the BP6 still falls over less frequently than the dual P3 I use at
work.
² RH7. Don't start.


2000-12-08 10:18:37

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> I asked people to explain why it was needed. I am still waiting. It is a
> patch that does nothing. I will not put random deep magic into the
> kernel.

Alan, I replied to you a few weeks ago (pre20 times) when you asked me why
I was sending you this patch. (perhaps you didn't receive my email). What I
observed was that my netraid card had a 0xXXXX8 base address and the patch
aligned that address to 16 bytes :

|Bus 0, device 2, function 1:
| Unknown class: Intel OEM MegaRAID Controller (rev 5).
| Medium devsel. Fast back-to-back capable. BIST capable. IRQ 10. Master
Capable. Latency=64.
| Prefetchable 32 bit memory at 0xf0000000 [0xf0000008].

as you see, the board is found at 0xf0000008, but used aligned to 0xf0000000.

my server currently works with that patch, but I'm sure it won't boot anymore
if I apply this 2.2.18pre25 alone.

just in case, here it is again.

Cheers,
Willy

--- 18pre/drivers/scsi/megaraid.c Wed Nov 8 16:02:45 2000
+++ 18pre+megaraid/drivers/scsi/megaraid.c Fri Nov 10 12:03:05 2000
@@ -1920,10 +1920,14 @@

pciIdx++;

- if (flag & BOARD_QUARTZ)
+ if (flag & BOARD_QUARTZ) {
+ megaBase &= PCI_BASE_ADDRESS_IO_MASK;
megaBase = (long) ioremap (megaBase, 128);
- else
+ }
+ else {
+ megaBase &= PCI_BASE_ADDRESS_MEM_MASK;
megaBase += 0x10;
+ }

/* Initialize SCSI Host structure */
host = scsi_register (pHostTmpl, sizeof (mega_host_config));

2000-12-08 14:22:21

by Alan

[permalink] [raw]
Subject: Re: Signal 11

> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.

This is consistent with page cache corruption in memory. We definitely had
that in older 2.4test kernels. I saw this building stuff on Linux parisc
and it was because some page of gcc had randomly decided to become something
different. Since that was test6 I didn't figure it important 8)

2000-12-08 14:37:22

by Alan

[permalink] [raw]
Subject: Re: Signal 11

> > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan

2000-12-08 14:38:21

by Alan

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> my server currently works with that patch, but I'm sure it won't boot anymore
> if I apply this 2.2.18pre25 alone.

Some days I don't know why I bother

> just in case, here it is again.

It doesn't even apply


2000-12-08 15:47:27

by willy tarreau

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> It doesn't even apply

sorry Alan, I think it's because I had to copy/paste it with my mouse
under X into my browser (I don't have smtp access here at work), and it
applies here with a -12 lines offset...

Here it is attached for 2.2.18pre25. Since the raid server is running
now (under 2.2.18pre20+patch), I won't be able to test it till next week,
but I'm fairly confident since it does the same as the one which
currently allows this server to boot.

as soon as I can reboot it, I promise I will test the kernel with and
without the patch to be really sure. but before that, if people who have
problems with megaraid/netraid could give it a try, that would be cool.
Also, it would be nice if people for whom the normal megaraid driver
works would check that this doesn't break anything.

Regards,
Willy




Attachments:
patch-megaraid-fix (674.00 B)

2000-12-08 16:38:57

by Miquel van Smoorenburg

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

According to Alan Cox:
> > my server currently works with that patch, but I'm sure it won't boot anymore
> > if I apply this 2.2.18pre25 alone.
>
> Some days I don't know why I bother

Bad day, Alan? ;)

> > just in case, here it is again.
> It doesn't even apply

Hmm, it did apply for me. Do newer versions of patch have the -l option
on by default?

Anyway. I just threw together a testmachine with a megaraid card.
With 2.2.18pre18, it doesn't boot. With 2.2.18pre18 + Willy's patch,
it does boot.

And with 2.2.18pre25 without any extra patches, it magically works.

So I took the plunge and compiled 2.2.18pre25 on the production
machine with the megaraid. And well, it's coming up as I write this.

I see that another patch _has_ been applied between pre18 and pre25
that took out some forward/backwards-compat logic with LINUX_VERSION_CODE
magic (beneath /* Read the base port and IRQ from PCI */). And
reading the patch, it makes sense. It probably does about the same
as Willy's patch, but the "right" way by using pci_resource_start()
which the one in pre18 only did for kernels > 2.3.0

So, it looks like pre25 has a working megaraid driver. Thanks Alan.

Mike.

2000-12-08 17:07:42

by Matthew Vanecek

[permalink] [raw]
Subject: Re: Signal 11

Peter Samuelson wrote:
>
> [Dick Johnson]
> > Do:
> >
> > char main[]={0xff,0xff,0xff,0xff};
>
> Oh come on, at least pick an *interesting* invalid opcode:
>
> char main[]={0xf0,0x0f,0xc0,0xc8}; /* try also on NT (: */
>

me2v@reliant DRFDecoder $ ./op
Illegal instruction (core dumped)

Is that the expected behavior?

--
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

2000-12-08 17:15:23

by Matthew Vanecek

[permalink] [raw]
Subject: Re: Signal 11

[email protected] wrote:
>
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
>
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page. It's a LEVEL I bug in 2.4 if this is the case,
> > BTW. In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs. Looks like some
> > type of mapping problem. I reported it three months ago, but it was
> > pretty much ignored.
> >
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
>
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.
>
> regards,
>
> Dave.
>

I've noticed the same problem, and it occasionally happens with XFree86
4.0.1, as well. Hopefully we've established that this is not the
hardware issue which gcc people are so fond of blaming sig 11s on (even
in the face of overwhelming evidence to the contrary). It would be good
to have this put on a current to-do list and looked into.

--
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

2000-12-08 17:20:44

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Signal 11

On Fri, 8 Dec 2000, Matthew Vanecek wrote:

> Peter Samuelson wrote:
> >
> > [Dick Johnson]
> > > Do:
> > >
> > > char main[]={0xff,0xff,0xff,0xff};
> >
> > Oh come on, at least pick an *interesting* invalid opcode:
> >
> > char main[]={0xf0,0x0f,0xc0,0xc8}; /* try also on NT (: */
> >
>
> me2v@reliant DRFDecoder $ ./op
> Illegal instruction (core dumped)
>
> Is that the expected behavior?

Yep. And on early Pentiums, the ones with the "f00f" bug, it
would lock the machine tighter than a witches crotch. Ooops,
not politically correct.... It would allow user-mode code
to halt the machine.

Here is code that just quietly returns to the runtime code
that called it:

char main[]={0x90, 0x90, 0xc3};

FYI, if the .data section was not executable, you couldn't do
this. You would have to use some __asm__ stuff to put it in
the .text section. But, this is an interesting example of
how you can create code that the compiler refuses to generate.

It's easier to use assembly, though.....

Cheers,
Dick Johnson

Penguin : Linux version 2.4.0 on an i686 machine (799.54 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-12-08 17:33:44

by Martin Kacer

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

On Thu, 7 Dec 2000, Alan Cox wrote:

# Ok we believe the VM crash looping printing error messages is now fixed.
# Marcelo finally figured it out and my 8Mb 486 has been running 2.2.18pre
# with that fix and stably[1].

Unfortunately, I don't think it is fixed. We maintain a heavily loaded
FTP/Samba server here (120+ active connections with very long data
transfers in rush hours) and it has had the "VM: do_try_to_free_pages failed"
problem since 2.2.17 was first installed (there was FreeBSD before that).

We applied the 2.2.18pre25 patch yesterday hoping it would solve it. The
only difference is that the server reached several hours of uptime instead of
40 minutes (with pre24). After two hours of load between 6.00 and 15.00
the console was flooded with those unpopular messages ("VM: ..."). The
system was taken down by the generation of these messages so quickly that
none of the messages even appeared in syslog! No response to Ctrl-Alt-Del,
of course... :-( Just thrashing...


On Fri, 8 Dec 2000, Andrea Arcangeli wrote:

# > Ok we believe the VM crash looping printing error messages is now fixed.
# Such a bug can't generate crashes. Did you ever reproduce crashes on your 8Mb
# 486 with 2.2.18pre24?

Our bug can generate them. :-( Maybe it's a different one? ;-)


Is there any chance to get rid of these VMM failures?

Sorry if I've missed something important recently mentioned here. I have
not had enough time to follow the lk list carefully. Is there any reliable
solution?

It seems we need to go back to 2.2.13 for some time. :-(
Martin.

2000-12-08 17:35:44

by Alan

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> as soon as I can reboot it, I promise I will test the
> kernel with and without the patch to be really sure.
> but before that, if people who have problems with
> megaraid/netraid could give it a try, that would be
> cool. Also, it would be nice if people for which the
> normal megaraid driver works would accept to check
> this doesn't break anything.

Your patch changes the mask on both IO and memory ports to be MEM mask, which
is obviously incorrect. It won't actually bite you because all the masking
has already been done by pci_resource_start(), so you are masking
already-zero bits.

2000-12-08 17:37:44

by Alan

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> > Some days I don't know why I bother
> Bad day, Alan? ;)

Umm no, but having people _keep_ sending you do-nothing patches gets
annoying after a while ;)

> reading the patch, it makes sense. It probably does about the same
> as Willy's patch, but the "right" way by using pci_resource_start()
> which the one in pre18 only did for kernels > 2.3.0

Looking over the change set, I suspect what actually happened is that someone
fixed pci_resource_start(), and that fixed the megaraid driver

2000-12-08 17:49:57

by Alan

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

> We applied the 2.2.18pre25 patch yesterday hoping it would solve it. The
> only difference is that the server reached several hours of uptime instead of
> 40 minutes (with pre24). After two hours of load between 6.00 and 15.00
> the console was flooded with those unpopular messages ("VM: ..."). The
> system was taken down by the generation of these messages so quickly that
> none of the messages even appeared in syslog! No response to Ctrl-Alt-Del,
> of course... :-( Just thrashing...
>
> Our bug can generate them. :-( Maybe it's a different one? ;-)

Quite possibly.

> Is there any chance to get rid of these VMM failures?

By finding them. Are you confident you are not running out of memory?
Presumably since 2.2.13 works you are 8)

2000-12-08 18:07:50

by Martin Kacer

[permalink] [raw]
Subject: Re: Linux 2.2.18pre25

On Fri, 8 Dec 2000, Alan Cox wrote:

# > Is there any chance to get rid of these VMM failures?
# By finding them.

:-) I am not so familiar with MM in Linux. :^(
And do not have enough time for intensive study...
Although I would probably like that work...

# Are you confident you are not running out of memory.

Well, almost sure. This is the log with load records:

(according to /proc/meminfo)
Timestamp                    FTPusers  SMBusr   load  free mem  free swap
Fri Dec 8 14:35:05 CET 2000        61      35   6.17   3068 kB  128932 kB
Fri Dec 8 14:40:04 CET 2000        59      36   5.05   2280 kB  130320 kB
Fri Dec 8 14:45:03 CET 2000        59      36   5.97   2896 kB  131448 kB
Fri Dec 8 14:50:03 CET 2000        59      35   6.59   2908 kB  133140 kB
Fri Dec 8 14:55:04 CET 2000        53      36   8.82   2380 kB  133952 kB
Fri Dec 8 15:00:03 CET 2000        53      40   6.42   2728 kB  135064 kB
Fri Dec 8 15:05:03 CET 2000        48      39   5.47   2264 kB  135684 kB
Fri Dec 8 15:10:03 CET 2000        48      41   3.90   3204 kB  135928 kB
Fri Dec 8 15:15:03 CET 2000        51      41   5.93   2628 kB  135700 kB
Fri Dec 8 15:20:03 CET 2000        50      45   6.50   2124 kB  135828 kB
Fri Dec 8 15:25:03 CET 2000        56      44   7.92   2192 kB  136080 kB
Fri Dec 8 15:30:03 CET 2000        49      45  10.89   2072 kB  136176 kB
Fri Dec 8 15:35:03 CET 2000        51      42   6.32   2960 kB  136156 kB
Fri Dec 8 15:40:04 CET 2000        54      44   6.92   2364 kB  136220 kB
Fri Dec 8 15:45:03 CET 2000        54      44   6.63   2852 kB  136348 kB
Fri Dec 8 15:50:04 CET 2000        53      46   3.63   2248 kB  136420 kB
Fri Dec 8 15:55:03 CET 2000        59      48   6.51   3060 kB  136312 kB
(crashed during the next 5 minutes)

Doesn't seem to have consumed all of swap space.
I will try to determine more info the next time - I promise...

# Presumably since 2.2.13 works you are 8)

I didn't say it worked. It had worked a long time ago.
It is still not tested now. Unfortunately, due to the absence of raid0
module the bootup process destroyed our 140GB partition. It will take some
time to make the system running again. :-(

Thanks for your answer anyway...
Martin.
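
Free-memory and free-swap figures like the ones logged above can be pulled out of /proc/meminfo-style text with a few lines of C. This is only a sketch: `meminfo_kb` is a hypothetical helper, and it assumes the `Key:   value kB` line layout (for example `MemFree:` and `SwapFree:`), which may differ between kernel versions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the number that follows "key:" on a line of
 * /proc/meminfo-style text, or -1 if no line starts with that key. */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* A match needs the key at the start of the line, then a colon. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;    /* step past the newline to the next line */
    }
    return -1;
}
```

Logging both values at a fixed interval, as in the table, is enough to tell whether free swap is shrinking toward zero before a crash.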

2000-12-08 18:13:20

by Peter Samuelson

Subject: Re: Signal 11


[Dick Johnson]
> > > char main[]={0xf0,0x0f,0xc0,0xc8}; /* try also on NT (: */
> > me2v@reliant DRFDecoder $ ./op
> > Illegal instruction (core dumped)
>
> Yep. And on early Pentiums, the ones with the "f00f" bug, it would
> lock the machine tighter than a witches crotch. Ooops, not
> politically correct.... It would allow user-mode code to halt the
> machine.

...Until Linux 2.0.34 or so (can't remember the exact version number)
which had the workaround for this bug, about a week after the bug was
discovered.

And I was reminded in private mail that the correct lockup sequence is
actually

char main[]={0xf0,0x0f,0xc7,0xc8};

where the 0xc8 can be anything from 0xc8 to 0xcf.

Peter
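
The byte pattern described above (F0 0F C7 followed by a byte from 0xc8 to 0xcf) can be checked without ever executing it. The sketch below only inspects a buffer; `is_f00f_sequence` is a hypothetical name, not something from the kernel's workaround.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the first four bytes at p match the pattern from the message
 * above: F0 0F C7 and then a byte in the range C8..CF. The bytes are only
 * compared here, never executed. */
bool is_f00f_sequence(const unsigned char *p, size_t len)
{
    if (len < 4)
        return false;
    return p[0] == 0xf0 && p[1] == 0x0f && p[2] == 0xc7 &&
           p[3] >= 0xc8 && p[3] <= 0xcf;
}
```

Under this check the earlier `0xf0,0x0f,0xc0,0xc8` example does not match, which is consistent with it producing only an illegal-instruction signal.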

2000-12-08 18:39:23

by Andrea Arcangeli

Subject: Re: Linux 2.2.18pre25

On Fri, Dec 08, 2000 at 06:02:57PM +0100, Martin Kacer wrote:
> Is there any chance to get rid of these VMM failures?

You should apply this patch on top of 2.2.18pre25:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre25/VM-global-2.2.18pre25-7.bz2

> It seems we need to return back to 2.2.13 for some time. :-(

Definitely no, you only need to apply the above collection of bugfixes.

Andrea

2000-12-08 18:43:53

by Philipp Rumpf

Subject: Re: Linux 2.2.18pre25

On Fri, Dec 08, 2000 at 10:47:46AM +0100, Willy Tarreau wrote:
> |Bus 0, device 2, function 1:
> | Unknown class: Intel OEM MegaRAID Controller (rev 5).
> | Medium devsel. Fast back-to-back capable. BIST capable. IRQ 10. Master
> Capable. Latency=64.
> | Prefetchable 32 bit memory at 0xf0000000 [0xf0000008].
>
> as you see, the board is found at 0xf0000008, but used aligned to 0xf0000000.

No. It's found at 0xf0000000, and has 8 bytes of MMIO space.

> my server currently works with that patch, but I'm sure it won't boot anymore
> if I apply this 2.2.18pre25 alone.

"I'm sure" meaning "I didn't test it" ?

2000-12-08 19:01:15

by Martin Kacer

Subject: Re: Linux 2.2.18pre25

On Fri, 8 Dec 2000, Andrea Arcangeli wrote:

# On Fri, Dec 08, 2000 at 06:02:57PM +0100, Martin Kacer wrote:
# > Is there any chance to get rid of these VMM failures?
# You should apply this patch on top of 2.2.18pre25:
# ftp://.../VM-global-2.2.18pre25-7.bz2

Well, I've found that VM-global patch before, of course. Until now, the
last version was against pre18. Since I do not know the exact rules for
including new things into Alan's tree, I thought that VM-global patch was
already included in pre24. Sorry for my lack of experience. ;-)) I should
have checked it.
As I wrote before, I had no time recently to follow the mailing list
carefully and I didn't know exactly what VM-global patch is.

# > It seems we need to return back to 2.2.13 for some time. :-(
# Definitely no, you only need to apply the above collection of bugfixes.

Ok, I can try it, at least.
I will let you know about results.

Martin.

2000-12-08 19:51:37

by Dr. Kelsey Hudson

Subject: Re: Signal 11

Don't post the core file... It's system-dependent and really won't do
anyone but yourself a shred of good.

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

>
>
> Andi Kleen wrote:
> >
> > On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
> > >
> > > Andi,
> > >
> > > It's related to some change in 2.4 vs. 2.2. There are other programs
> > > affected other than X, SSH also gets spurious signal 11's now and again
> > > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> >
> > So have you enabled core dumps and actually looked at the core dumps
> > of the programs using gdb to see where they crashed ?
>
> Yes. I can only get the SSH crash when I am running remotely from the
> house over the internet, and it only shows then when running a build in
> jobserver mode (parallel build). The X problem seems related as well,
> since it's related to (usually) NetScape spawning off a forked process.
> I will attempt to recreate tonight, and post the core dump file.
>
> Jeff
>
>
>
>
>
> >
> > -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------

2000-12-08 20:05:55

by Mark Vojkovich

Subject: Re: Signal 11



On Fri, 8 Dec 2000, David Woodhouse wrote:

>
> [email protected] said:
> > Sounds like a X Server bug. You should probably contact XFree86, not
> > linux-kernel
>
> I quote from the X devel list, which perhaps I shouldn't do but this is hardly
> NDA'd stuff:
>
> On Mon 20 Nov 2000, [email protected] said:
> > I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> > GX boards (Intel). XFree86 core dumps indicate that it happens in
> > random places, in old as dirt software rendering code that has nothing
> > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Some additional data points. It goes away on UP 2.4 kernels.
Also, I can't recall seeing this problem on IA64. Maybe it's still
there on IA64 and I just haven't been trying hard enough to crash
it, but my current impression is that the problem doesn't exist on IA64.

Hmmm... IA64 is a static server. I don't hear of people having
problems on 3.3.6 servers either. I'm wondering if a non-loader
4.0 server would have problems on IA32 with a 2.4 kernel. That's
something for people to try.


Mark.

>
> --
> dwmw2
>
> ¹ And the BP6 still falls over less frequently than the dual P3 I use at
> work.
> ² RH7. Don't start.
>
>
>

2000-12-08 20:08:05

by Dr. Kelsey Hudson

Subject: Re: Signal 11

On Thu, 7 Dec 2000, Peter Samuelson wrote:

>
> [Dick Johnson]
> > Do:
> >
> > char main[]={0xff,0xff,0xff,0xff};
>
> Oh come on, at least pick an *interesting* invalid opcode:
>
> char main[]={0xf0,0x0f,0xc0,0xc8}; /* try also on NT (: */

What's funny, is that this actually executes on SPARC hardware, but
immediately segfaults. On Intel hardware though, you get a message similar
to:

zsh: illegal hardware instruction (core dumped) a.out

I wrote roughly the same program in college. It exploits the F0 0F bug
found in early Pentium hardware.

Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------

2000-12-08 20:14:57

by Dr. Kelsey Hudson

Subject: Re: Signal 11

On Fri, 8 Dec 2000 [email protected] wrote:

> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
>
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page. It's a LEVEL I bug in 2.4 if this is the case,
> > BTW. In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs. Looks like some
> > type of mapping problem. I reported it three months ago, but it was
> > pretty much ignored.
> >
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
>
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.

Just to add some input and insight on here, I loaded the system down with
some FFT algorithms, and then ran an 8-way kernel compile. The machine in
question is a dual P3/600 with 512MB RAM, 2.4.0-test11. The load
skyrocketed to a mere 13.6. xmms was still running, didn't skip even
once. The FFT algorithms didn't bitch at all. Neither did the kernel
compile. In fact, it compiled without a hitch...

I dunno what to say about these boxes that segfault all the
time... Probably just bad hardware somewhere along the lines.

Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------

2000-12-08 20:34:46

by willy tarreau

Subject: Re: Linux 2.2.18pre25

> > Bad day, Alan? ;)
> Umm no but having people _keep_ sending you do
> nothing patches gets annoying after a while ;)

Please accept all my apologies, Alan. When I quickly
sent you the last patch, I didn't notice that some
other broken code had been removed, which I discovered
later back home after comparing 2.2.18pre2[15]
(which Miquel noticed too).

Next time, I'll spend a little more of my time on
carefully reading the patch before resending an old
useless one.

Cheers,
Willy



2000-12-08 20:43:09

by willy tarreau

Subject: Re: Linux 2.2.18pre25

> "I'm sure" meaning "I didn't test it" ?

Absolutely, I believed that the driver was *exactly*
the same as the previous release which didn't boot and
needed the fix, but another fix has been applied and
corrected it. Now I think it will work with a clean
2.2.18pre25. Anyway, I left a kernel compile behind me
this evening, so I'll confirm this on Monday as soon
as I can reboot the server on a pre25.

Cheers,
Willy



2000-12-08 22:52:09

by Jeff V. Merkey

Subject: Re: Signal 11

On Fri, Dec 08, 2000 at 11:34:51AM -0800, Mark Vojkovich wrote:
>
>
> On Fri, 8 Dec 2000, David Woodhouse wrote:
>
> Some additional data points. It goes away on UP 2.4 kernels.
> Also, I can't recall seeing this problem on IA64. Maybe it's still
> there on IA64 and I just haven't been trying hard enough to crash
> it, but my current impression is that the problem doesn't exist on IA64.
>
> Hmmm... IA64 is a static server. I don't hear of people having
> problems on 3.3.6 servers either. I'm wondering if a non-loader
> 4.0 server would have problems on IA32 with a 2.4 kernel. That's
> something for people to try.
>
>
> Mark.


I have not seen it on UP systems either. I only see it on SMP systems.
After trying very hard last night, I was able to get my 4 x PPro system to
do it with 2.4.0-12. It seems related to loading in some way. If you
have more than two processors, the loading is less since there's more
processors, and for whatever reason, it makes it harder to produce
whatever race condition is causing it. I can get it to happen
pretty easily on a 2 x PII system.

:-)

Jeff



>
> >
> > --
> > dwmw2
> >
> > ¹ And the BP6 still falls over less frequently than the dual P3 I use at
> > work.
> > ² RH7. Don't start.
> >
> >
> >
>

2000-12-08 22:59:30

by David Woodhouse

Subject: Re: Signal 11

On Fri, 8 Dec 2000, Jeff V. Merkey wrote:

> I have not seen it on UP systems either. I only see it on SMP systems.
> After trying very hard last night, I was able to get my 4 x PPro system to
> do it with 2.4.0-12. It seems related to loading in some way. If you
> have more than two processors, the loading is less since there's more
> processors, and for whatever reason, it makes it harder to produce
> whatever race condition is causing it. I can get it to happen
> pretty easily on a 2 x PII system.

Can you reproduce it with bcrl's patch below:

Index: mm/memory.c
===================================================================
RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
retrieving revision 1.2.2.40
diff -u -r1.2.2.40 memory.c
--- mm/memory.c 2000/12/05 13:33:39 1.2.2.40
+++ mm/memory.c 2000/12/08 22:24:09
@@ -860,6 +860,7 @@
/*
* Ok, we need to copy. Oh, well..
*/
+ set_pte(page_table, pte);
spin_unlock(&mm->page_table_lock);
new_page = page_cache_alloc();
if (!new_page)
@@ -870,6 +871,12 @@
* Re-check the pte - we dropped the lock
*/
if (pte_same(*page_table, pte)) {
+ /* We are changing the pte, so get rid of the old
+ * one to avoid races with the hardware, this really
+ * only affects the accessed bit here.
+ */
+ pte = ptep_get_and_clear(page_table);
+
if (PageReserved(old_page))
++mm->rss;
break_cow(vma, old_page, new_page, address, page_table);
@@ -1216,12 +1223,14 @@
return do_swap_page(mm, vma, address, pte,
pte_to_swp_entry(entry), write_access);
}

+ entry = ptep_get_and_clear(pte);
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, entry);

entry = pte_mkdirty(entry);
}
+
entry = pte_mkyoung(entry);
establish_pte(vma, address, pte, entry);
spin_unlock(&mm->page_table_lock);


--
dwmw2
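
The comment in the patch says the old entry is cleared "to avoid races with the hardware". One way to picture that is an atomic exchange: the old value is fetched and the slot is emptied in a single step, so a flag set concurrently cannot be lost between a separate read and write. The following is a user-space model using C11 atomics, not the kernel's implementation; `model_pte_t`, `model_get_and_clear` and `model_mark_dirty` are hypothetical names, and the bit values assume the 32-bit x86 page table layout.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PTE_ACCESSED 0x020u   /* x86 accessed bit (bit 5) */
#define PTE_DIRTY    0x040u   /* x86 dirty bit (bit 6) */

/* User-space stand-in for a page table entry (hypothetical type). */
typedef _Atomic uint32_t model_pte_t;

/* Fetch the old entry and leave the slot empty in one atomic step, so no
 * other writer can update the entry between the read and the clear. */
uint32_t model_get_and_clear(model_pte_t *ptep)
{
    return atomic_exchange(ptep, 0);
}

/* Mark an entry young and dirty: take the entry out atomically, modify
 * the private copy, then publish the result. */
void model_mark_dirty(model_pte_t *ptep)
{
    uint32_t entry = model_get_and_clear(ptep);

    entry |= PTE_ACCESSED | PTE_DIRTY;
    atomic_store(ptep, entry);
}
```

A plain load followed by a plain store would leave a window in which another agent's update to the entry is silently overwritten, which is the kind of race the patch comment describes.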


2000-12-08 23:40:58

by Horst von Brand

Subject: Re: Signal 11

David Woodhouse <[email protected]> said:

[...]

> I quote from the X devel list, which perhaps I shouldn't do but this is
> hardly NDA'd stuff:

> On Mon 20 Nov 2000, [email protected] said:
> > I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> > GX boards (Intel). XFree86 core dumps indicate that it happens in
> > random places, in old as dirt software rendering code that has nothing
> > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > would say that this is definitely a kernel problem.

> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

I get regular segfaults and random lockups trying to build CVS GCCs and
kernels since I updated RH 7 to glibc-2.2-5. P3, sr440bx mobo (UP),
2.2.18preX kernels; previously rock solid. Might be that the mains voltage
here tends to be out of whack, but I doubt it.
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616

2000-12-09 00:25:24

by Alan

Subject: Re: Linux 2.2.18pre25

> Well, I've found that VM-global patch before, of course. Until now, the
> last version was against pre18. Since I do not know the exact rules for
> including new things into Alan's tree, I thought that VM-global patch was
> already included in pre24. Sorry for my lack of experience. ;-)) I should
> have checked it.
> As I wrote before, I had no time recently to follow the mailing list
> carefully and I didn't know exactly what VM-global patch is.
>
> # > It seems we need to return back to 2.2.13 for some time. :-(
> # Definitely no, you only need to apply the above collection of bugfixes.
>
> Ok, I can try it, at least.
> I will let you know about results.

VM-global is currently on my 2.2.19pre pile of stuff. I'm monitoring a few
cases with interest before I commit to that decision, however.

2000-12-09 00:32:04

by Jeff V. Merkey

Subject: Re: Signal 11



I'll try.

Jeff


On Fri, Dec 08, 2000 at 10:24:55PM +0000, David Woodhouse wrote:
> On Fri, 8 Dec 2000, Jeff V. Merkey wrote:
>
> > I have not seen it on UP systems either. I only see it on SMP systems.
> > After trying very hard last night, I was able to get my 4 x PPro system to
> > do it with 2.4.0-12. It seems related to loading in some way. If you
> > have more than two processors, the loading is less since there's more
> > processors, and for whatever reason, it makes it harder to produce
> > whatever race condition is causing it. I can get it to happen
> > pretty easily on a 2 x PII system.
>
> Can you reproduce it with bcrl's patch below:
>
> Index: mm/memory.c
> ===================================================================
> RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
> retrieving revision 1.2.2.40
> diff -u -r1.2.2.40 memory.c
> --- mm/memory.c 2000/12/05 13:33:39 1.2.2.40
> +++ mm/memory.c 2000/12/08 22:24:09
> @@ -860,6 +860,7 @@
> /*
> * Ok, we need to copy. Oh, well..
> */
> + set_pte(page_table, pte);
> spin_unlock(&mm->page_table_lock);
> new_page = page_cache_alloc();
> if (!new_page)
> @@ -870,6 +871,12 @@
> * Re-check the pte - we dropped the lock
> */
> if (pte_same(*page_table, pte)) {
> + /* We are changing the pte, so get rid of the old
> + * one to avoid races with the hardware, this really
> + * only affects the accessed bit here.
> + */
> + pte = ptep_get_and_clear(page_table);
> +
> if (PageReserved(old_page))
> ++mm->rss;
> break_cow(vma, old_page, new_page, address, page_table);
> @@ -1216,12 +1223,14 @@
> return do_swap_page(mm, vma, address, pte,
> pte_to_swp_entry(entry), write_access);
> }
>
> + entry = ptep_get_and_clear(pte);
> if (write_access) {
> if (!pte_write(entry))
> return do_wp_page(mm, vma, address, pte, entry);
>
> entry = pte_mkdirty(entry);
> }
> +
> entry = pte_mkyoung(entry);
> establish_pte(vma, address, pte, entry);
> spin_unlock(&mm->page_table_lock);
>
>
> --
> dwmw2
>

2000-12-09 06:03:58

by Dave Jones

Subject: Re: Signal 11


David Woodhouse ([email protected]) wrote...

> Can you reproduce it with bcrl's patch below:

Did nothing for me. gcc still got a sig11 after a while.
Took three runs of 'make bzImage' before it completed.

I wondered if I'd been unlucky enough to have been sent a
replacement K6-2 which was also screwed, but as I mentioned
earlier, this box runs fine under 2.2

btw, I was unsubscribed from all lists at vger yesterday,
for reasons currently unknown to me. Did this happen to anyone
else, or did my mail setup break something?

regards,

Davej.

--
| Dave Jones <[email protected]> http://www.suse.de/~davej
| SuSE Labs

2000-12-09 19:32:38

by Matthew Vanecek

Subject: Re: Signal 11

Alan Cox wrote:
>
> > > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > > would say that this is definitely a kernel problem.
> >
> > XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> > kernels - even on my BP6¹. The random crashes started to happen when I
> > upgraded my distribution² - and are only seen by people using 2.4. So I
> > suspect that it's the combination of glibc and kernel which is triggering
> > it.
>
> Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> table updating race help ?
>
> Alan

Where are his fixes at? I don't seem to see any of his posts in the
archives.
--
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

2000-12-09 19:51:03

by Dave Jones

Subject: Re: Signal 11

On Sat, 9 Dec 2000, Matthew Vanecek wrote:

> > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > table updating race help ?
> > Alan
>
> Where are his fixes at? I don't seem to see any of his posts in the
> archives.

dwmw2 posted one such patch earlier this week :-

http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html

regards,

Davej.

--
| Dave Jones <[email protected]> http://www.suse.de/~davej
| SuSE Labs

2000-12-10 00:02:10

by Matthew Vanecek

Subject: Re: Signal 11

[email protected] wrote:
>
> On Sat, 9 Dec 2000, Matthew Vanecek wrote:
>
> > > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > > table updating race help ?
> > > Alan
> >
> > Where are his fixes at? I don't seem to see any of his posts in the
> > archives.
>
> dwmw2 posted one such patch earlier this week :-
>
> http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html
>
> regards,
>

I saw that. I thought it was a patch to try to "reproduce it", as
opposed to fixing it. Is it truly a fix, and is it applicable for UP
kernels?
--
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

2000-12-11 01:29:57

by Rainer Mager

Subject: RE: Signal 11

I just applied the said patch and will report my results. Note that I have
never been able to reliably, on-demand reproduce this so give me a few days
to see what happens.

--Rainer


-----Original Message-----
From: Alan Cox [mailto:[email protected]]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; [email protected]; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan

2000-12-11 02:03:09

by Rainer Mager

Subject: OOPS when using 4GB memory setting

Hi all,

About 1 month back I reported a problem with getting OOPs when running with
a kernel compiled with the 4GB memory setting. Since then I've finally
managed to get the ksymoops results. Where should I post them?

To review:

My machine has 1GB RAM. If I build a 2.4.0test11 (or 8, 9, or 10; I haven't
tried earlier) kernel and choose the 1GB memory setting then only 900504 K is
detected (but everything runs stably). If I choose the 4GB memory setting
then the full 1 GB is detected but I get oopses. I can reliably force an oops
by mounting a samba drive and then accessing it (via ls for example).


--Rainer
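
A possible reading of the 900504 K figure, offered as arithmetic rather than a diagnosis: on 32-bit x86 without high-memory support, the kernel can only map directly what fits in its own part of the address space minus a reserve for vmalloc. The sketch below assumes a 3GB/1GB user/kernel split (kernel base 0xC0000000) and a 128MB vmalloc reserve; both numbers and the name `max_lowmem_kb` are assumptions for illustration, not values taken from the report.

```c
/* Assumed layout for illustration: kernel mappings start at 3GB and
 * 128MB of kernel address space is held back for vmalloc. */
#define ASSUMED_PAGE_OFFSET      0xC0000000ul
#define ASSUMED_VMALLOC_RESERVE  (128ul << 20)

/* Largest amount of RAM, in kB, that fits in the kernel's direct mapping
 * under the given layout on a 32-bit address space. */
unsigned long max_lowmem_kb(unsigned long page_offset,
                            unsigned long vmalloc_reserve)
{
    /* Kernel address space runs from page_offset up to 4GB. */
    unsigned long long kernel_space = (1ull << 32) - page_offset;

    return (unsigned long)((kernel_space - vmalloc_reserve) >> 10);
}
```

With the assumed values this gives 917504 kB (896MB), so a machine with 1GB of RAM would be expected to report somewhat less than that once the kernel's own code and reserved pages are subtracted.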

2000-12-11 09:36:23

by Rainer Mager

Subject: RE: Signal 11

Well, I just had a Signal 11 even with the patch. What can I do to help
figure this out?


Thanks,

--Rainer

-----Original Message-----
From: Alan Cox [mailto:[email protected]]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; [email protected]; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it. I've only seen this under 2.3.x/2.4 SMP kernels. I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan

2000-12-11 14:03:58

by Mike Galbraith

Subject: RE: Signal 11

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

Is init permanently running after you see a couple of these?

-Mike

2000-12-11 14:48:16

by Dave Jones

Subject: RE: Signal 11

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

My troublesome box finally seems to be stable. It's been up for the
last two days whilst under quite heavy loads without problems.
Previously, it would be lucky to last an hour.
The change? I disabled DRM & AGPGart.
With them both disabled, I get no problems at all. No Sig11's,
No Sig4's, No lockups.

This box has a Voodoo3 3000 AGP..

01:00.0 VGA compatible controller: 3Dfx Interactive, Inc. Voodoo 3 (rev 01)

And is running on an MVP3 chipset....

00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]

This box does display the same problem with IRQ routing that I've
got on my Athlon box...

PCI: Using IRQ router VIA [1106/0586] at 00:07.0
PCI: Assigned IRQ 11 for device 00:08.0
PCI: The same IRQ used for device 01:00.0
IRQ routing conflict in pirq table! Try 'pci=autoirq'

(00:08.0 is an SBLive)

A related problem ?
As I mentioned in an earlier mail `autoirq' is an unknown option.

The Athlon box has similar messages, but it happens with even
more devices..

They both do the same with the various PCI options 'nobios' etc,
and changing PnP OS in the BIOS makes no difference either.

regards,

Davej.

--
| Dave Jones <[email protected]> http://www.suse.de/~davej
| SuSE Labs

2000-12-11 23:55:29

by Rainer Mager

Subject: RE: Signal 11

(This message contains a number of related replies.)

> From: Mike Galbraith [mailto:[email protected]]
> Is init permanently running after you see a couple of these?

No, that is, after 23 hours up time it has used only 6 seconds CPU time
(according to top).

That reminds me that I should repeat that my signal 11 problem has (so far)
only caused X to die. The OS remains up and stable.


> From: [email protected] [mailto:[email protected]]
> My troublesome box finally seems to be stable.[...]I disabled DRM
> & AGPGart. With them both disabled, I get no problems at all.
> No Sig11's, No Sig4's, No lockups.
>
> This box has a Voodoo3 3000 AGP..

I suppose I can try this too. My box has a Matrox G400. BTW, what is DRM?
Direct Rendering something?


> From: CMA [mailto:[email protected]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

I'll look into this as well. Anyone have any pointers on how to do this? I
have a Tyan Tiger 133 with Award BIOS if this helps/matters.

Even if this setting does make a difference, what does this tell me/us? I
don't consider running the box with disabled cache(s) a viable solution.



Thanks all and keep those suggestions coming.

--Rainer

2000-12-13 00:54:09

by Rainer Mager

Subject: RE: Signal 11 - the continuing saga

Hi again,

Ok, I just upgraded to 2.4.0test12 (although I don't think there was any
work in 12 that directly addresses this signal 11 problem). When compiling
the new kernel I chose to disable AGPGart and DRM as suggested by
[email protected]. I will report later if this makes any difference.

On another, possibly related note, I'm getting some really weird behavior
with a Java program. The only reason I mention it here is because it dies
with our old friend Signal 11. Anyway, please bear with the description
below.
I have a tiny bash script that launches a Java swing app. If I run my
script from an xterm (or gnome-terminal or whatever) then it starts up fine.
If, however, I try to launch it from my gnome taskbar's menu then it dies
with signal 11 (the Java log is available upon request). This seems to be
100% consistent, since I noticed it yesterday, even across reboots.
Interestingly, the same behavior occurs if I try to run the program from
within JBuilder 4.
So, is this related to the larger signal 11 problems?


What else can I do regarding these issues to help fix it? Would a core dump
help anyone? I'd really like to contribute somehow but I need some
direction.


--Rainer

> From: CMA [mailto:[email protected]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

Anyone know how to do this?

2000-12-13 01:52:30

by Jeff V. Merkey

Subject: Re: Signal 11 - the continuing saga

On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> Hi again,
>
> Ok, I just upgraded to 2.4.0test12 (although I don't think there was any
> work in 12 that directly addresses this signal 11 problem). When compiling
> the new kernel I chose to disable AGPGart and DRM as suggested by
> [email protected]. I will report later if this makes any difference.
>
> On another, possibly related note, I'm getting some really weird behavior
> with a Java program. The only reason I mention it here is because it dies
> with our old friend Signal 11. Anyway, please bear with the description
> below.
> I have a tiny bash script that launches a Java swing app. If I run my
> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> If, however, I try to launch it from my gnome taskbar's menu then it dies
> with signal 11 (the Java log is available upon request). This seems to be
> 100% consistent, since I noticed it yesterday, even across reboots.
> Interestingly, the same behavior occurs if I try to run the program from
> within JBuilder 4.
> So, is this related to the larger signal 11 problems?

There's a corruption bug in the page cache somewhere, and it's 100%
reproducible. Finding it will be tough....

>
>
> What else can I do regarding these issues to help fix it? Would a core dump
> help anyone? I'd really like to contribute somehow but I need some
> direction.
>
>
> --Rainer
>
> > From: CMA [mailto:[email protected]]
> > Did you already try to selectively disable L1 and L2 caches (if
> > your box has both) and see what happens?
>
> Anyone know how to do this?

Usually this is performed in the BIOS setup. You can also disable L1
with a sequence of instructions that write to the CR0 register on Intel
and flip a bit, but in doing this you have to execute a WBINVD (write
back and invalidate) instruction to flush out the cache. BIOS setup is
probably simpler. Disabling L1 will make the machine slower
than molasses, BTW, and if this bug is race related (they always
are) it won't help much in running it down.

Jeff

>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-12-13 02:17:06

by Rainer Mager

Subject: RE: Signal 11 - the continuing saga

Thanks for the info...

> [mailto:[email protected]]On Behalf Of Jeff V. Merkey
> > So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible. Finding it will be tough....

Ok, granted this will be tough but is anyone even actively working on it?
What can I do to help?



> > Anyone know how to do [disable L1 and L2 caches]?
>
> Usually this is done in the BIOS setup. You can also disable L1
> with a sequence of instructions that write to the CR0 register on Intel
> and set the cache-disable (CD) bit, but in doing this you have to execute
> a WBINVD (write back and invalidate) instruction to flush out the cache.
> BIOS setup is probably simpler. Disabling Level 1 will make the machine
> slower than molasses, BTW, and if this bug is race related (they always
> are) it won't help much in running it down.

Aha, just as I suspected. My BIOS doesn't appear to support this. You seem
to be saying that doing so won't really contribute anything anyway so I will
hold off for now.



--Rainer

2000-12-13 03:48:34

by Linus Torvalds

Subject: Re: Signal 11 - the continuing saga

In article <[email protected]>,
Jeff V. Merkey <[email protected]> wrote:
>On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
>> I have a tiny bash script that launches a Java swing app. If I run my
>> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
>> If, however, I try to launch it from my gnome taskbar's menu then it dies
>> with signal 11 (the Java log is available upon request). This seems to be
>> 100% consistent, since I noticed it yesterday, even across reboots.
>> Interestingly, the same behavior occurs if I try to run the program from
>> within JBuilder 4.
>> So, is this related to the larger signal 11 problems?
>
>There's a corruption bug in the page cache somewhere, and it's 100%
>reproducible. Finding it will be tough....

Unlikely. If the actual program data was corrupted, it would SIGSEGV
regardless of how it's executed.

I'd guess that the program has a bug, and depending on the arguments and
environment (especially the latter will be different), it shows up or
not. Things like not having a LOCALE set in either case or similar.

Linus

2000-12-13 05:00:32

by Mike Galbraith

Subject: RE: Signal 11 - the continuing saga

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Thanks for the info...
>
> > [mailto:[email protected]]On Behalf Of Jeff V. Merkey
> > > So, is this related to the larger signal 11 problems?
> >
> > There's a corruption bug in the page cache somewhere, and it's 100%
> > reproducible. Finding it will be tough....
>
> Ok, granted this will be tough but is anyone even actively working on it?
> What can I do to help?

If you want, I can extract IKD.. which happens to have a trap in place
for this (because I have a 100% reproducible swap related SIGSEGV that
I'm trying to figure out).

If you're interested, let me know and I'll extract it (quite large) and
send it along with instructions on how to do the trap.

-Mike

2000-12-13 10:06:38

by Rainer Mager

Subject: RE: Signal 11 - the continuing saga

Mike et al,

I have no idea what IKD is and I don't know what to do with any results I
might find BUT I'd be happy to do this if it will help. Please pass on the
info with the instructions. Who should I report the results to?



--Rainer

> [mailto:[email protected]]On Behalf Of Mike Galbraith
> If you want, I can extract IKD.. which happens to have a trap in place
> for this (because I have a 100% reproducible swap related SIGSEGV that
> I'm trying to figure out).
>
> If you're interested, let me know and I'll extract it (quite large) and
> send it along with instructions on how to do the trap.
>
> -Mike

2000-12-13 10:06:48

by Rainer Mager

Subject: RE: Signal 11 - the continuing saga

Give that man a cigar....it was an env var (not LOCALE but LANG). I'd
actually checked this but I didn't think that made a difference in my case.

Thanks Linus, now can you fix the larger signal 11 problem?

--Rainer


> [mailto:[email protected]]On Behalf Of Linus Torvalds
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
> Linus

2000-12-13 12:40:21

by CMA

Subject: R: Signal 11 - the continuing saga

>> From: CMA [mailto:[email protected]]
>> Did you already try to selectively disable L1 and L2 caches (if
>> your box has both) and see what happens?
>
>Anyone know how to do this?

If you own a P6-class machine (sorry, but I didn't find your hw specs in
previous messages), you should be able to enter setup and disable L1 and/or
L2, usually in "advanced setup".
If you disable L1, the machine will be *much* slower.
If you disable L2, you will notice it under heavy load.
Most of the time sig 11 is due to L1 cache overheating (on chip). Just
checking whether the CPU cooling fan is properly seated and spinning solves
the problem.
Regards.
Dr. Eng. Mauro Tassinari
http://www.c-m-a.it

2000-12-13 16:12:03

by Mike Galbraith

Subject: RE: Signal 11 - the continuing saga

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Mike et al,
>
> I have no idea what IKD is and I don't know what to do with any results I
> might find BUT I'd be happy to do this if it will help. Please pass on the
> info with the instructions. Who should I report the results to?

IKD is a debugging toolkit. The trap I have set up freezes the kernel
trace buffer at SIGSEGV time. From there you have to read it backward
looking for problems. (which isn't particularly easy). I was thinking
you wanted to roll your shirt sleeves up and maybe this would help ;-)

If you want it, and do a trace, I'd be very interested in the last
couple of schedules to compare to my traces. It's not something you
can just run and report though.

-Mike

2000-12-13 17:18:10

by Jeff V. Merkey

Subject: Re: Signal 11 - the continuing saga

On Tue, Dec 12, 2000 at 07:17:41PM -0800, Linus Torvalds wrote:
> In article <[email protected]>,
> Jeff V. Merkey <[email protected]> wrote:
> >On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >> I have a tiny bash script that launches a Java swing app. If I run my
> >> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> >> If, however, I try to launch it from my gnome taskbar's menu then it dies
> >> with signal 11 (the Java log is available upon request). This seems to be
> >> 100% consistent, since I noticed it yesterday, even across reboots.
> >> Interestingly, the same behavior occurs if I try to run the program from
> >> within JBuilder 4.
> >> So, is this related to the larger signal 11 problems?
> >
> >There's a corruption bug in the page cache somewhere, and it's 100%
> >reproducible. Finding it will be tough....
>
> Unlikely. If the actual program data was corrupted, it would SIGSEGV
> regardless of how it's executed.
>
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
> Linus

Linus,

I agree that there may be some problem in the code above -- the question is
what has changed to make this behavior emerge? I see it with a host of
programs (ssh, make, netscape) -- true, all are userspace. Time permitting,
I may attempt to track this down in ssh and make in jobserver mode. It
may be related to some interaction that changed underneath.

Jeff



2000-12-14 13:13:02

by Clayton Weaver

Subject: Re: Signal 11

This is unrelated to the signal 11 problem, but something to consider
for "random crashes and segfaults", i.e. whether you are using this compiler
and glibc version combination.

There has been a thread on the teTeX mailing list the last few days
about a (RedHat, but probably more general than just their rpms)
gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

unsigned varname; /* "unsigned int varname;" is ok */

(no problem at -O or no optimization at all, and doesn't happen if teTeX
is compiled with kgcc).

Showed up in the kpathsea library (which began to split paths on
'-' as well as '/' after a user upgraded compiler and libc and
recompiled teTeX).

Regards,

Clayton Weaver
<mailto:[email protected]>
(Seattle)

"Everybody's ignorant, just in different subjects." Will Rogers



2000-12-14 19:42:29

by Linus Torvalds

Subject: Re: Signal 11

In article <[email protected]>,
Clayton Weaver <[email protected]> wrote:
>
>There has been a thread on the teTeX mailing list the last few days
>about a (RedHat, but probably more general than just their rpms)
>gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

Quite frankly, anybody who uses RedHat 7.0 and their broken compiler for
_anything_ is going to have trouble.

I don't know why RH decided to do their idiotic gcc-2.96 release (it
certainly wasn't approved by any technical gcc people - the gcc people
were upset about it too), and I find it even more surprising that they
apparently KNEW that the compiler they were using was completely broken.
They included another (non-broken) compiler, and called it "kgcc".

"kgcc" stands for "kernel gcc", apparently because (a) they realised
that a miscompiled kernel is even worse than miscompiling some random
user applications and (b) gcc-2.96 is so broken that it requires special
libraries for C++ vtable chunks handling that is different, so the
_working_ gcc can only be used with programs that do not need such
library support. Namely the kernel.

In case it wasn't obvious yet, I consider RedHat-7.0 to be basically
unusable as a development platform, and I hope RH downgrades their
compiler to something that works better RSN. It apparently has problems
compiling stuff like the CVS snapshots of X etc too (and obviously,
anything you compile under gcc-2.96 is not likely to work anywhere else
except with the broken libraries).

Linus

2000-12-14 23:04:12

by Alan

Subject: Re: Signal 11

> I don't know why RH decided to do their idiotic gcc-2.96 release (it
> certainly wasn't approved by any technical gcc people - the gcc people

Every single patch in that release barring I believe 2 was accepted into
the main tree. So they liked the code. The naming did upset people and was
unfortunate, but done talking to the compiler folks at Red Hat with the
best of intentions behind it. If we had called it 'Red Hat cc' I think people
would have been even more offended at the way they had been discredited.

I do understand why they got peeved, I do understand why they feel no urge
to support the 296 codebase (nor would I want them to). I hit 'd' when I
see 'I have 2.2.18 patched with [reiserfs|ext3|bigmem|lfs]' for the same
reasons.

> They included another (non-broken) compiler, and called it "kgcc".
> "kgcc" stands for "kernel gcc", apparently because (a) they realised

kgcc is a convention invented a long time ago by Conectiva. Debian also used
to have gcc272. It is done because

gcc272 is useless at C++, has lots of bugs
egcs is no better at C++ and has lots of bugs
gcc295 is a little better at C++ and is _Crawling_ with bugs
gcc296(redhat) is a lot better at C++ and doesn't appear to be any buggier.

In fact gcc296 is the first compiler that can compile 2.2.16 correctly. All
the previous compilers miscompile the strstr() inline in some cases. That's
why I had to hack the 2.2 kernel tree to make it work. (And the cases where
you got compile time errors gcc was right to moan about - like using (...)
in traditional

> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the

Wrong - the C++ vtable format change is part of the intended progression of the
compiler and needed to meet standards compliance. gcc 295 also changed the
internal formats. Unfortunately the gcc295 and 296 formats are both probably
not the final format. The compiler folks are not willing to guarantee anything
until gcc 3.0, which may actually be out by the time 2.4 is stable.

> unusable as a development platform, and I hope RH downgrades their
> compiler to something that works better RSN. It apparently has problems

Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
buggier than before, but that the bugs are in different places. egcs and gcc295
both caused X compile problems too.

I still advise people: Use egcs-1.1.2 for Linux 2.2.x. You can build 2.2.18 with
gcc 2.96 but I personally wouldn't be running production systems on a kernel
built that way - but NOT because gcc296 is buggier but because the bugs are
going to be in different places and I firmly believe production system people
should let the loons find them ;)

Alan

2000-12-14 23:17:03

by Linus Torvalds

Subject: Re: Signal 11



On Thu, 14 Dec 2000, Alan Cox wrote:
>
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
>
> Wrong - the C++ vtable format change is part of the intended progression of the
> compiler and needed to meet standards compliance. gcc 295 also changed the
> internal formats. Unfortunately the gcc295 and 296 formats are both probably
> not the final format. The compiler folks are not willing to guarantee anything
> until gcc 3.0, which may actually be out by the time 2.4 is stable.

If you ask any gcc folks, the main reason they think this was a really
stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
with the 2.95.x release _and_ the upcoming 3.0 release.

Nobody asked the people who knew this, apparently.

> > unusable as a development platform, and I hope RH downgrades their
> > compiler to something that works better RSN. It apparently has problems
>
> Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> buggier than before, but that the bugs are in different places. egcs and gcc295
> both caused X compile problems too.

gcc-2.95.2 is at least a real release, from a branch that is actively
maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
many problems as possible _without_ being incompatible like the snapshots
are.

Or just stay at 2.91.66 (egcs).

As to X compile problems - neither egcs nor 2.95.2 appears to have any
trouble with the CVS tree. Possibly because they got fixed, because, after
all, at least those were real releases.

I'd applaud RedHat for making snapshots available, but they should be
marked as SNAPSHOTS, and not as the main compiler with no way to fix the
damn problems it causes.

As it is, anybody doing development is probably better off at RH-6.2.
That is doubly true if they intend to release binaries.

Linus

2000-12-14 23:17:13

by Jakub Jelinek

Subject: Re: Signal 11

On Thu, Dec 14, 2000 at 04:42:03AM -0800, Clayton Weaver wrote:
> There has been a thread on the teTeX mailing list the last few days
> about a (RedHat, but probably more general than just their rpms)
> gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile
>
> unsigned varname; /* "unsigned int varname;" is ok */
>
> (no problem at -O or no optimization at all, and doesn't happen if teTeX
> is compiled with kgcc).

That one is fixed already for some time, it was a bug in loop unrolling
(that patch is still pending review for the mainline CVS though).

Jakub

2000-12-14 23:29:34

by Bernhard Rosenkraenzer

Subject: Re: Signal 11

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

The same thing is true of *any* gcc release.
For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
_and_ the upcoming 3.0 release.

> > Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> > buggier than before, but that the bugs are in different places. egcs and gcc295
> > both caused X compile problems too.
>
> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained

Not very actively.
Please take the time to compare the activity in gcc_2_95_branch with the
patches in the current "2.96" version in rawhide.

> - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

It will be incompatible with any non-2.95.x-version, and I don't think
2.96-68 is any more buggy than the current 2.95 branch.
The initial 2.96 "release" did have some odd bugs; all the known ones have
been fixed.

> Or just stay at 2.91.66 (egcs).

This may be good for the kernel, but it's not acceptable for C++.
Also, there's no support for some of the platforms we have to work with,
such as ia64 and S/390 - using different compilers for different
architectures isn't a real solution either.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree.

Neither does 2.96-68.

LLaP
bero


2000-12-14 23:42:51

by Linus Torvalds

Subject: Re: Signal 11



On Thu, 14 Dec 2000, Bernhard Rosenkraenzer wrote:
> >
> > gcc-2.95.2 is at least a real release, from a branch that is actively
> > maintained
>
> Not very actively.
> Please take the time to compare the activity in gcc_2_95_branch with the
> patches in the current "2.96" version in rawhide.

Take a look at the differences in linux-2.2.x and linux-2.3.x.

linux-2.3.x was a h*ll of a lot more "actively maintained".

But nobody really considers that to be an argument for RedHat (or anybody
else) to install a 2.3.x kernel by default. Sure, most distributions have a
"hacker kernel", but it's NOT installed by default, and it is clearly
marked as experimental.

Your arguments make no sense.

The compiler is often _more_ important to system stability than the
kernel. A "real release" implies that it at least had testing, and that
people know what the problem spots tend to be.

Note that the "know what the problem spots tend to be" is important.

> > As to X compile problems - neither egcs nor 2.95.2 appears to have any
> > trouble with the CVS tree.
>
> Neither does 2.96-68.

Good. Maybe you'd make it clearer to everybody who installed from your
CD's that they had better upgrade. Pronto.

Linus

2000-12-14 23:53:02

by Alan

Subject: Re: Signal 11

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So
egcs is a different format to all others
2.95 is a different format to all others
2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it we sell packages and support based around
that and our preparedness to support it.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at, Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. That's the same
guidelines the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan

2000-12-15 00:06:45

by Jakub Jelinek

Subject: Re: Signal 11

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not have C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN. It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries).

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware about X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.

Jakub

2000-12-15 00:22:58

by Linus Torvalds

Subject: Re: Signal 11



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
>
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not have C++ binary compatible with
> other distributions).

Yes.

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release.

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN. It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries).
>
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user that couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues.

> So far I was aware about X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

Linus

2000-12-15 00:41:38

by miquels

Subject: Re: Signal 11

In article <[email protected]>,
Bernhard Rosenkraenzer <[email protected]> wrote:
>The same thing is true of *any* gcc release.
>For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
>_and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.

2000-12-15 00:42:08

by lamont

Subject: Re: Signal 11


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs. There were multiple different symptoms,
including just sig11s on compiles, corrupted input (leading to syntax
error) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it). Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks-months later.

I saw a bizarre problem once in a Tyan dual proc PIII/500 box with
2x256MB ECC RAM that one of the ECC RAM sticks was bad and that repeated
kernel compiles would hang after about 24 hours. Strange problem, but
found that in troubleshooting it, the problem followed this stick of RAM
around to different machines. Blamed the RAM but don't understand what
the underlying problem was...

On Fri, 8 Dec 2000 [email protected] wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
>
> > It's related to some change in 2.4 vs. 2.2. There are other programs
> > affected other than X, SSH also gets spurious signal 11's now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
>
> <AOL>
>
> I've begun to get a bit paranoid about my K6-2 500 box.
>
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
>
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
>
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
>
> *boggle*
>
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
>
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it. If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
>
> regards,
>
> Davej.
>
>

2000-12-15 01:00:18

by Alan

Subject: Re: Signal 11

> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o Someone else moved to 2.95, not RH. In fact some of us felt 2.95 wasn't
fit to ship at the time.

o We tell vendors to build RPMv3 , glibc 2.1.x

o Vendors not being stupid understand that they have a bigger market
share if they do that.

Alan

2000-12-15 01:12:40

by miquels

Subject: Re: Signal 11

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
>> And since redhat is _the_ distro that commercial entities use to
>> release software for, this was very arguably a bad move.
>
>Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with a comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.

2000-12-15 01:28:43

by Michael Peddemors

Subject: Re: Signal 11

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o We tell vendors to build RPMv3 , glibc 2.1.x

Curious HOW do you tell vendors??

> o Vendors not being stupid understand that they have a bigger market
> share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

--
--------------------------------------------------------
Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604) 589-0037 Beautiful British Columbia, Canada
--------------------------------------------------------

2000-12-15 01:37:33

by Alan

Subject: Re: Signal 11

> > o We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals but it's a start and following it does
give you applications that with a bit of care run on everything.

> > o Vendors not being stupid understand that they have a bigger market
> > share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan

2000-12-15 16:43:35

by Theodore Ts'o

Subject: Re: Signal 11

Date: Fri, 15 Dec 2000 01:09:29 +0000 (GMT)
From: Alan Cox <[email protected]>

> > o We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully, Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals, but it's a start, and following it does
give you applications that, with a bit of care, run on everything.

In the interests of making sure everyone understands the history:

The Linux Development Platform Specification (LDPS) was started as a
result of an informal evening post-LSB-meeting gathering in June --- to
which by the way Red Hat didn't send any representatives(*) --- the
discussion at the restaurant started along the lines of "Oh, my *GOD*
RedHat is about to do something stupid --- they're releasing Red Hat 7.0
with beta/snapshots of just about every single critical system component
except the kernel --- and software from vendors who fall into the trap of
developing against Red Hat 7.0 won't work with any other distribution. This is
going to be *bad* for Linux."

So yes, the reason why LDPS was formed was to recommend to vendors what
they should build and use --- but while Alan gave comments about the
LDPS once it was announced that a group of people were working on the
LDPS, there is no way that the LDPS could even vaguely be considered a
Red Hat initiative. (The LDPS is a separate work group which is part of
the FSG, so it is a sister group to the LSB effort.)

- Ted

(*) Ever since Jim Kingdon left Red Hat (he was at VA Linux for a while,
and is now at SGI), as far as I know no one at Red Hat is actively
participating in the LSB activities --- they haven't sent anyone to the
physical LSB meetings, or participated in the bi-weekly phone
conferences, or taken work items to help finish the LSB. Alan does
participate on the mailing lists, and makes quite helpful comments, but
as far as I know that's about the limit of Red Hat's participation in
either the LSB or the LDPS specification work. Speaking as someone who
has been contributing time and effort to the LSB, it would be great if
Red Hat were to become more fully involved in the LSB; I (and I'm sure
all the other LSB volunteers) would welcome a greater level of
participation by Red Hat.

2000-12-16 01:52:55

by Dan Egli

Subject: Re: Signal 11

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> Yes.
>
> And I realize that somebody inside RedHat really wanted to use a snapshot
> in order to get some C++ code to compile right.
>
> But it at the same time threw C stability out the window, by using a
> not-very-widely-tested snapshot for a major new release.
>
> Are you seriously saying that you think it was a good trade-off? Or are
> you just ashamed of admitting that RH did something stupid?
>
Pardon the poking in here, but I must say I agree here. RH did a VERY dumb
thing.

> I have a report from a Sony VAIO user that couldn't compile the CVS X at
> all on his picturebook (and you need to compile the CVS tree in order to
> get required fixes for the ATI Rage Mobility in that machine). I don't
> know the details, but they were apparently due to RH 7 issues.

It's not in the X tree or anything, but here's a personal example.
Machine: Dual P3 550
HDD: Dual Ultra2Wide Seagate 18GB Hdd
OS: RedHat 7
Compile Target: Linux Kernel 2.2.17
Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
arch tree)
Result with compat-egcs-62: Success on the first try.


2000-12-16 01:57:15

by Alan

Subject: Re: Signal 11

> It's not in the X tree or anything, but here's a personal example.
> Machine: Dual P3 550
> HDD: Dual Ultra2Wide Seagate 18GB Hdd
> OS: RedHat 7
> Compile Target: Linux Kernel 2.2.17
> Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
> arch tree)
> Result with compat-egcs-62: Success on the first try.

It isn't a bug in the compiler. It's a bug in the kernel tree. It's a bug in
the old compiler that it didn't flag the error before.