2007-05-19 05:18:00

by Linus Torvalds

Subject: Linux 2.6.22-rc2


It's out there, both patches/tarballs and git trees are updated (although
mirroring might still be ongoing)

Various random fixes all over - the shortlog (appended) is fairly
readable. The most notable ones are probably more SLUB fixes, and the
epoll optimizations and cleanups.

But there's stuff in architectures (ia64, SH, AVR32, POWER), libata,
network drivers, sound.. Give it a try.

I've been telling some people off on merging stuff, and I'll get even more
hard-nosed about it after -rc2, so please don't even try to send anything
but real fixes.

I think the current situation looks reasonably good for 2.6.22, but I hope
everybody will take a good look at the regression lists (whether they
_think_ they are affected or not), and spend some time wondering "was that
anything I did, or is it something I can look at". Ok?

Linus

---

Aaron Durbin (1):
acpi: fix potential call to a freed memory section.

Al Viro (11):
fix deadlock in loop.c
missing mm.h in fw-ohci
missing dependencies for USB drivers in input
missing includes in mlx4
em28xx and ivtv should depend on PCI
rpadlpar breakage - fallout of struct subsystem removal
m32r: __xchg() should be always_inline
audit_match_signal() and friends are used only if CONFIG_AUDITSYSCALL is set
fix uml-x86_64
arm: walk_stacktrace() needs to be exported
pata_scc had been missed by ata_std_prereset() switch

Alan Cox (1):
sl82c105: Switch to ref counting API

Andrew Morton (2):
parport_pc needs dma-mapping.h
slub: fix handling of oversized slabs

Arthur Jones (1):
IB/ipath: Shadow the gpio_mask register

Auke Kok (2):
ixgb: don't print error if pci_enable_msi() fails, cleanup minor leak
e1000: Fix msi enable leak on error, don't print error message, cleanup

Bartlomiej Zolnierkiewicz (13):
pdc202xx_old: rewrite mode programming code (v2)
serverworks: PIO mode setup fixes
sis5513: PIO mode setup fixes
alim15x3: use ide_tune_dma()
pdc202xx_new: use ide_tune_dma()
ide: always disable DMA before tuning it
cs5530/sc1200: add ->udma_filter methods
ide: use ide_tune_dma() part #2
cs5530/sc1200: DMA support cleanup
cs5530/sc1200: add ->speedproc support
ide: remove ide_dma_enable()
ide: add missing validity checks for identify words 62 and 63
ide: remove ide_use_dma()

Becky Bruce (1):
[POWERPC] Change include protections to ASM_POWERPC

Benjamin Herrenschmidt (4):
[POWERPC] Add spinlock to request_phb_iospace()
[POWERPC] Fix IO space on PCI buses created from of_platform
[POWERPC] Make sure device node type/name is not NULL on hot-added nodes
Make __vunmap static

Bernhard Walle (1):
i386/x86-64: fix section mismatch

Christian Krafft (1):
[POWERPC] cell_defconfig: Disable cpufreq and pmi

Christoph Hellwig (7):
[AVR32] optimize pagefault path
SUNRPC: remove dead variable 'rpciod_running'
[IA64] optimize pagefaults a little
[POWERPC] viopath: Use completion
[POWERPC] viopath: Use a completion in some more places
small netdevices.txt fix
spidernet: node-aware skbuff allocation

Christoph Lameter (15):
SLUB: CONFIG_LARGE_ALLOCS must consider MAX_ORDER limit
SLUB: It is legit to allocate a slab of the maximum permitted size
Fix: find_or_create_page skips cpuset memory spreading.
Slab allocators: Drop support for destructors
SLUB: Remove depends on EXPERIMENTAL and !ARCH_USES_SLAB_PAGE_STRUCT
SLAB: Move two remaining SLAB specific definitions to slab_def.h
SLUB: Define functions for cpu slab handling instead of using PageActive
slab: warn on zero-length allocations
SLUB: slabinfo fixes
SLUB: Do our own flags based on PG_active and PG_error
Remove SLAB_CTOR_CONSTRUCTOR
SLUB: Simplify debug code
Slab allocators: define common size limitations
Fix page allocation flags in grow_dev_page()
slub: another slabinfo fix

Corey Mutter (1):
[IPV6]: Reverse sense of promisc tests in ip6_mc_input

Dan Aloni (1):
make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size

Daniel Drake (2):
[CPUFREQ] powernow-k7: fix MHz rounding issue with perflib
[ALSA] usb-audio: another Logitech QuickCam ID

Daniel T Chen (1):
[ALSA] Include quirks from Ubuntu Dapper/Edgy/Feisty

Dave Jiang (2):
[POWERPC] Fix comment in booke_wdt
[POWERPC] 85xx: Add device nodes for error reporting devices used by EDAC

Dave Jones (4):
[CPUFREQ] Support rev H AMD64s in powernow-k8
MAINTAINERS update.
[CPUFREQ] Correct revision mask for powernow-k8
[IPV4]: Correct rp_filter help text.

David Brownell (3):
gpio interface loosens call restrictions
rtc-omap build fix
rtc kconfig clarification

David Gibson (4):
[POWERPC] Remove fixup_bigphys_addr() for arch/powerpc to avoid link error
[POWERPC] Fix bug adding properties with flatdevtree.c's ft_set_prop()
[POWERPC] Fix make rules for treeImage.initrd
[POWERPC] Small fixes for the Ebony device tree

David Howells (2):
AFS: write back dirty data on unmount
AFS: Fix afs_prepare_write()

David S. Miller (16):
[SERIAL] SUNHV: Add an ID string.
[SPARC64]: Be more resiliant with PCI I/O space regs.
[SPARC] SBUS: bbc_i2c.c needs asm/io.h
[SPARC] SBUS: display7seg.c needs asm/io.h
[SCSI]: Add help text for SCSI_ESP_CORE.
[SPARC64]: Add missing cpus_empty() check in hypervisor xcall handling.
[SPARC64]: Accept ebus_bus_type for generic DMA ops.
[SPARC32]: Update defconfig.
[SPARC32]: asm/system.h needs asm/smp.h
[VIDEO]: XVR-500 and XVR-2500 need FB=y.
[SPARC64]: Update defconfig.
[SPARC64]: Add hypervisor API negotiation and fix console bugs.
[NET]: Fix BMSR_100{HALF,FULL}2 defines in linux/mii.h
[TCP]: TCP_CONG_YEAH requires TCP_CONG_VEGAS
[SPARC64]: Fix sched_clock() et al.
[IPV4]: Remove IPVS icmp hack from route.c for now.

David Woodhouse (2):
NS16550A: Restore HS settings in EXCR2 on resume
Fix incorrect prototype for ipxrtr_route_packet()

Davide Libenzi (6):
fix epoll single pass code and add wait-exclusive flag
epoll locks changes and cleanups
epoll: fix some comments
epoll: move kfree inside ep_free
eventfd use waitqueue lock ...
timerfd use waitqueue lock ...

Domen Puncer (1):
spi: fix spidev for >sizeof(long)/32 devices

Eugene Surovegin (3):
ibm_emac: fix section mismatch warnings
ibm_emac: improved PHY support
ibm_emac: fix link speed detection change

Gabriel Mansi (1):
[AGPGART] Fix wrong ID in via-agp.c

Geert Uytterhoeven (1):
m68k: implement __clear_user()

Gerald Britton (1):
cciss: Fix pci_driver.shutdown while device is still active

Gerd Hoffmann (1):
Refine SCREEN_INFO sanity check for vgacon initialization

H. Peter Anvin (1):
Further update of the i386 boot documentation

Haavard Skinnemoen (3):
[AVR32] Remove bogus comment in arch/avr32/kernel/irq.c
[AVR32] Wire up signalfd, timerfd and eventfd
[AVR32] Implement platform hooks for atmel_lcdfb driver

Heiko Carstens (2):
simplify compat_sys_timerfd
Let smp_call_function_single return -EBUSY on UP

Herbert Xu (4):
[IPSEC]: Check validity of direction in xfrm_policy_byid
[IPSEC]: Don't warn if high-order hash resize fails
[CRYPTO] padlock: Make CRYPTO_DEV_PADLOCK a tristate again
[CRYPTO] tcrypt: Add missing error check

Hoang-Nam Nguyen (1):
IB/ehca: Fix AQP0/1 QP number

Hugh Dickins (2):
slub: don't confuse ctor and dtor
i386: don't check_pgt_cache in flush_tlb_mm

Jack Morgenstein (1):
IB/mlx4: Fix uninitialized spinlock for 32-bit archs

Jamal Hadi Salim (1):
[NET_SCHED]: prio qdisc boundary condition

James.Yang (1):
[POWERPC] Remove CPU_FTR_NEED_COHERENT for 7448.

Jan Engelhardt (1):
Use menuconfig objects: IDE

Jan Kara (1):
circular locking dependency found in QUOTA OFF

Jarek Poplawski (1):
[NET]: lockdep classes in register_netdevice

Jaroslav Kysela (1):
[ALSA] version 1.0.14rc4

Jay Lan (1):
[IA64] kdump on INIT needs multi-nodes sync-up (v.2)

Jens Axboe (1):
ll_rw_blk: fix gcc 4.2 warning on current_io_context()

Jeremy Fitzhardinge (2):
i386: move common parts of smp into their own file
i386: fix voyager build

Joachim Fenkes (4):
IB/ehca: Correctly set GRH mask bit in ehca_modify_qp()
IB/ehca: Remove _irqsave, move #ifdef
IB/ehca: Beautify sysfs attribute code and fix compiler warnings
IB/ehca: Disable scaling code by default, bump version number

Jon Tollefson (1):
[POWERPC] Correct #endif comment

Josh Boyer (1):
[POWERPC] Pass per-file CFLAGs for platform specific op codes

Kailang Yang (1):
[ALSA] hda-codec - Fix ALC882/861VD codec support on some laptops

Kim Phillips (1):
ucc_geth: eliminate max-speed, change interface-type to phy-connection-type

Kumar Gala (3):
[POWERPC] Fix COMMON symbol warnings
[POWERPC] 85xx: Add device nodes for error reporting devices used by EDAC
[POWERPC] Removed hardcoded phandles from dts

Liam Girdwood (2):
[ALSA] ASoC AC97 static GPL symbol fix
[ALSA] ASoC AC97 device reg bugfix

Linus Torvalds (5):
Revert "ipmi: add new IPMI nmi watchdog handling"
x86: Fix discontigmem + non-HIGHMEM compile
Fix ACPI suspend / device suspend ordering problem
Revert "[PATCH] x86: Drop cc-options call for all options supported in gcc 3.2+"
Linux v2.6.22-rc2

Martin Michlmayr (1):
[IA64] Fix section conflict of ia64_mlogbuf_finish

Michael S. Tsirkin (3):
IB/mthca: Fix posting >255 recv WRs for Tavor
IB/mthca: Set cleaned CQEs back to HW ownership when cleaning CQ
IPoIB/cm: Optimize stale connection detection

Milind Arun Choudhary (1):
sh64: ROUND_UP macro cleanup in arch/sh64/kernel/pci_sh5.c

Mithlesh Thukral (1):
NetXen: Fix NetXen driver ping on system-p

Mitsuru Chinen (1):
[IPV4] SNMP: Display new statistics at /proc/net/netstat

Morten Banzon (1):
[PPC] MCC2 missing in MPC826x device_list

Nate Diller (2):
NFS: use zero_user_page
ecryptfs: use zero_user_page

Nick Piggin (2):
slob: implement RCU freeing
mm: more rmap checking

Nicolas Pitre (1):
pxamci: fix PXA27x MMC workaround for bad CRC with 136 bit response

Oleg Nesterov (3):
NLM: don't use CLONE_SIGHAND in nlmclnt_recovery
make freezeable workqueues singlethread
revert "cancel_delayed_work: use del_timer() instead of del_timer_sync()"

Olof Johansson (4):
pasemi_mac: Interrupt ack fixes
[POWERPC] pasemi: CONFIG_GENERIC_TBSYNC no longer needed
[POWERPC] Update pasemi_defconfig
[POWERPC] Remove warning in mpic.c

Paul Mundt (10):
net: Trivial MLX4_DEBUG dependency fix.
sh64: Wire up many new syscalls.
sh64: Fixups for the irq_regs changes.
sh64: dma-mapping updates.
sh64: ppoll/pselect6() and restartable syscalls.
sh64: Fixup sh-sci build.
sh64: Update cayman defconfig.
sh64: generic quicklist support.
sh64: Add .gitignore entry for syscalltab.
nommu: add ioremap_page_range()

Peer Chen (1):
drivers/ata: remove the wildcard from sata_nv driver

Pierre Ossman (2):
sdhci: handle dma boundary interrupts
mmc: use assigned major for block device

Prarit Bhargava (1):
Remove cpu hotplug defines for __INIT & __INITDATA

Rafael J. Wysocki (1):
swsusp: fix sysfs interface

Randy Dunlap (2):
parport: mailing list is subscribers-only
docbook: make kernel-locking table readable

Rene Herman (1):
[ALSA] Fix probe of non-PnP ISA devices

Robert Reif (1):
[SPARC32]: Fix sparc32 kdebug changes.

Roland Dreier (1):
mlx4_core: Remove unused doorbell_lock

Rolf Eike Beer (1):
Fix roundup_pow_of_two(1)

Satyam Sharma (1):
[BLUETOOTH]: Fix locking in hci_sock_dev_event().

Scott Wood (1):
gianfar: Add I/O barriers when touching buffer descriptor ownership.

Sean Hefty (3):
RDMA/cma: Simplify device removal handling code
RDMA/cma: Fix synchronization with device removal in cma_iw_handler
RDMA/cma: Add check to validate that cm_id is bound to a device

Segher Boessenkool (3):
[POWERPC] Specify GNUTARGET on $(AR) invocations
[POWERPC] Fix sed command lines for zlib source construction
[POWERPC] Fix ppc_rtas_progress_show()

Sergei Shtylyov (1):
sl82c105: add speedproc() method and MWDMA0/1 support

Simon Arlott (2):
[IA64] spelling fixes: arch/ia64/
spelling fixes: arch/sh64/

Simon Horman (1):
alpha: fix hard_smp_processor_id compile error

Stefan Roscher (1):
IB/ehca: Serialize hypervisor calls in ehca_register_mr()

Stephen Hemminger (7):
[TCP] slow start: Make comments and code logic clearer.
sky2: remove Gigabyte 88e8056 restriction
sky2: PHY register settings
sky2: keep track of receive alloc failures
sky2: MIB counter overflow handling
sky2: remove dual port workaround
sky2: memory barriers change

Stephen Rothwell (6):
Declare another couple of compat syscalls.
Revert "MAINTAINERS: remove invalid list address for TPM"
[POWERPC] Wire up some more syscalls
[POWERPC] Update iseries_defconfig
[POWERPC] Fix warning on UP
[POWERPC] Remove build warnings in windfarm_core

Takashi Iwai (1):
[ALSA] hda-codec - Make the mixer capability check more robust

Tejun Heo (8):
libata: separate out ata_dev_reread_id()
libata: during revalidation, check n_sectors after device is configured
libata-acpi: add ATA_FLAG_ACPI_SATA port flag
libata: fix shutdown warning message printing
libata: track spindown status and skip spindown_compat if possible
sata_nv: fix fallout of devres conversion
libata: remove libata.spindown_compat
sata_via: pcim_iomap_regions() conversion missed BAR5

Thomas Gleixner (2):
timekeeping fix patch got mis-applied
clocksource: fix lock order in the resume path

Thomas Reitmayr (1):
[ALSA] usbaudio - Coping with short replies in usbmixer

Timur Tabi (1):
[POWERPC] Fix alignment problem in rh_alloc_align() with exact-sized blocks

Tony Breeds (1):
[POWERPC] Fix Kconfig undefined symbol 'IBM_NEW_EMAC_ZMII'

Tony Luck (2):
[IA64] wire up {signal,timer,event}fd syscalls
[IA64] s/scalibility/scalability/

Trond Myklebust (6):
NLM: Fix locking client timeouts...
NFS4: Fix incorrect use of sizeof() in fs/nfs/nfs4xdr.c
NFS: Fix some 'sparse' warnings...
NFS: Fix more sparse warnings
NLM: Fix sparse warnings
SUNRPC: Fix sparse warnings

Vitaly Wool (1):
smc911x: fix compilation breakage

Yoichi Yuasa (1):
mmc: au1xmmc command types check from data flags

Yoshinori Sato (1):
h8300 atomic.h update

[email protected] (3):
pasemi_mac: Fix register defines
pasemi_mac: Terminate PCI ID list
pasemi_mac: Fix local-mac-address parsing

wendy xiong (1):
icom: add new sub-device-id to support new adapter


2007-05-19 07:00:35

by Andrey Borzenkov

Subject: Re: Linux 2.6.22-rc2

Linus Torvalds wrote:

>
> It's out there, both patches/tarballs and git trees are updated (although
> mirroring might still be ongoing)
>

trivia

make: Entering directory `/home/bor/src/linux-git'
GEN /home/bor/build/linux-2.6.22/Makefile
scripts/kconfig/conf -s arch/i386/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config
symbol 'PMAC_APM_EMU' refers to undefined
symbol 'SYS_SUPPORTS_APM_EMULATION'
drivers/net/Kconfig:2283:warning: 'select' used by config symbol 'UCC_GETH'
refers to undefined symbol 'UCC_FAST'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config
symbol 'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config
symbol 'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'



2007-05-19 14:28:25

by Indan Zupancic

Subject: [BUG: 2.6.22-rc2] SLAB doesn't like usb_get_configuration()

Hello,

I had two SLAB related bugs, both with usb_get_configuration()
near the end of the backtrace. First one was with git between
rc1 and rc2, but very close to rc2, second one was with rc2,
both at bootup.

Oh, almost forgot: First one uses SLAB, second one SLUB.

[ 85.574686] usb 2-1: new full speed USB device using ohci_hcd and address 2
[ 85.709684] BUG: at /home/indan/src/git/linux-2.6/mm/slab.c:777 __find_general_cachep()
[ 85.709693] [<b014be2f>] __kmalloc+0x3f/0xa0
[ 85.709709] [<b013d271>] __kzalloc+0xd/0x34
[ 85.709720] [<c09a9d78>] usb_get_configuration+0x93e/0xd26 [usbcore]
[ 85.709761] [<c09a7459>] usb_control_msg+0xbe/0xc8 [usbcore]
[ 85.709784] [<c09a856a>] usb_get_device_descriptor+0x72/0x7c [usbcore]
[ 85.709804] [<c09a3259>] hub_port_init+0x55d/0x567 [usbcore]
[ 85.709823] [<c09a3914>] usb_new_device+0x17/0xdd [usbcore]
[ 85.709841] [<c09a435f>] hub_thread+0x6cc/0xa4a [usbcore]
[ 85.709862] [<b0125dff>] autoremove_wake_function+0x0/0x35
[ 85.709871] [<c09a3c93>] hub_thread+0x0/0xa4a [usbcore]
[ 85.709888] [<b0125d47>] kthread+0x36/0x5b
[ 85.709893] [<b0125d11>] kthread+0x0/0x5b
[ 85.709898] [<b010476b>] kernel_thread_helper+0x7/0x10
[ 85.709907] =======================

and

[ 30.420891] usb 2-1: new full speed USB device using ohci_hcd and address 2
[ 30.555891] BUG: at /home/indan/src/git/linux-2.6/include/linux/slub_def.h:77 kmalloc_index()
[ 30.555901] [<b014d02c>] get_slab+0x43/0x214
[ 30.555913] [<c09acc18>] usb_get_configuration+0x923/0xd07 [usbcore]
[ 30.555951] [<b014e16a>] __kmalloc_track_caller+0xf/0x56
[ 30.555959] [<b013d2b1>] __kzalloc+0x11/0x38
[ 30.555971] [<c09acc18>] usb_get_configuration+0x923/0xd07 [usbcore]
[ 30.555991] [<c09aa34d>] usb_control_msg+0xbe/0xc8 [usbcore]
[ 30.556014] [<c09a6234>] hub_port_init+0x559/0x563 [usbcore]
[ 30.556033] [<c09a68d4>] usb_new_device+0x17/0xdd [usbcore]
[ 30.556052] [<c09a72da>] hub_thread+0x68f/0x9f8 [usbcore]
[ 30.556071] [<b027252f>] __sched_text_start+0x497/0x539
[ 30.556080] [<b0125e27>] autoremove_wake_function+0x0/0x35
[ 30.556089] [<c09a6c4b>] hub_thread+0x0/0x9f8 [usbcore]
[ 30.556107] [<b0125d6f>] kthread+0x36/0x5b
[ 30.556112] [<b0125d39>] kthread+0x0/0x5b
[ 30.556118] [<b010476b>] kernel_thread_helper+0x7/0x10
[ 30.556125] =======================

Both are triggered by WARN_ON_ONCE(size == 0).
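
As an illustration of that kind of check, here is a minimal userspace sketch of a warn-once test on a zero-length allocation. It is not the kernel's WARN_ON_ONCE() implementation: checked_alloc() and warn_count are hypothetical names, malloc() stands in for the slab allocators, and the printed message only imitates the format seen in the traces above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter so callers can see how many warnings were printed. */
static int warn_count;

/*
 * Hypothetical allocator wrapper: complains about a zero-length request,
 * but only the first time it happens, then carries on with the allocation.
 */
static void *checked_alloc(size_t size)
{
	static bool warned;	/* remembers whether we already warned */

	if (size == 0 && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr, "BUG: at %s:%d %s()\n",
			__FILE__, __LINE__, __func__);
	}

	/* malloc(0) may return NULL or a unique pointer; both can be freed. */
	return malloc(size);
}
```

Calling checked_alloc(0) repeatedly prints a single message, which matches the reports above: one trace per boot even if the zero-length request happens more than once.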

Full dmesg outputs and config for second bug attached.

Greetings,

Indan


Attachments:
SLABBUG (17.47 kB)
SLABBUG2 (17.94 kB)
.config (32.96 kB)

2007-05-19 18:20:52

by Christoph Lameter

Subject: Re: [BUG: 2.6.22-rc2] SLAB doesn't like usb_get_configuration()

On Sat, 19 May 2007, Indan Zupancic wrote:

> I had two SLAB related bugs, both with usb_get_configuration()
> near the end of the backtrace. First one was with git between
> rc1 and rc2, but very close to rc2, second one was with rc2,
> both at bootup.

Well usb_get_configuration seems to do a kmalloc(0) which is a bit
strange and this is why we flagged the allocation in the slab allocators.
Is there some way to avoid allocating an object of zero length?

2007-05-19 19:32:00

by Greg KH

Subject: Re: [BUG: 2.6.22-rc2] SLAB doesn't like usb_get_configuration()

On Sat, May 19, 2007 at 11:20:44AM -0700, Christoph Lameter wrote:
> On Sat, 19 May 2007, Indan Zupancic wrote:
>
> > I had two SLAB related bugs, both with usb_get_configuration()
> > near the end of the backtrace. First one was with git between
> > rc1 and rc2, but very close to rc2, second one was with rc2,
> > both at bootup.
>
> Well usb_get_configuration seems to do a kmalloc(0) which is a bit
> strange and this is why we flagged the allocation in the slab allocators.
> Is there some way to avoid allocating an object of zero length?

Can you try the patch below and let me know if it fixes the issue for
you or not?

thanks,

greg k-h


From: Alan Stern <[email protected]>
Subject: [PATCH] USB: don't try to kzalloc 0 bytes

This patch (as907) prevents us from trying to allocate 0 bytes
when an interface has no endpoint descriptors.

Signed-off-by: Alan Stern <[email protected]>

--- usb-2.6.orig/drivers/usb/core/config.c
+++ usb-2.6/drivers/usb/core/config.c
@@ -185,10 +185,12 @@ static int usb_parse_interface(struct de
 		num_ep = USB_MAXENDPOINTS;
 	}
 
-	len = sizeof(struct usb_host_endpoint) * num_ep;
-	alt->endpoint = kzalloc(len, GFP_KERNEL);
-	if (!alt->endpoint)
-		return -ENOMEM;
+	if (num_ep > 0) {	/* Can't allocate 0 bytes */
+		len = sizeof(struct usb_host_endpoint) * num_ep;
+		alt->endpoint = kzalloc(len, GFP_KERNEL);
+		if (!alt->endpoint)
+			return -ENOMEM;
+	}
 
 	/* Parse all the endpoint descriptors */
 	n = 0;
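
The guard in the patch above can also be tried outside the kernel. The following is only a sketch under stated assumptions: struct endpoint_sketch, struct altsetting_sketch and alloc_endpoints() are hypothetical stand-ins for the USB core structures and usb_parse_interface(), and calloc() replaces kzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct usb_host_endpoint. */
struct endpoint_sketch {
	int address;
	int attributes;
};

/* Hypothetical stand-in for the altsetting that owns the endpoint array. */
struct altsetting_sketch {
	struct endpoint_sketch *endpoint;
	unsigned int num_ep;
};

/*
 * Allocate the endpoint array only when there is something to store.
 * With num_ep == 0 the pointer stays NULL and no zero-length request
 * ever reaches the allocator.
 */
static int alloc_endpoints(struct altsetting_sketch *alt, unsigned int num_ep)
{
	assert(alt != NULL);

	alt->endpoint = NULL;
	alt->num_ep = num_ep;

	if (num_ep > 0) {
		/* calloc() zeroes the memory, as kzalloc() does */
		alt->endpoint = calloc(num_ep, sizeof(*alt->endpoint));
		if (!alt->endpoint)
			return -ENOMEM;
	}
	return 0;
}
```

Code that later walks the array has to be driven by num_ep, so that a NULL endpoint pointer is never dereferenced for an interface without endpoints.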

2007-05-19 22:10:59

by Indan Zupancic

Subject: Re: [BUG: 2.6.22-rc2] SLAB doesn't like usb_get_configuration()

On Sat, May 19, 2007 21:33, Greg KH wrote:
> On Sat, May 19, 2007 at 11:20:44AM -0700, Christoph Lameter wrote:
>> On Sat, 19 May 2007, Indan Zupancic wrote:
>>
>> > I had two SLAB related bugs, both with usb_get_configuration()
>> > near the end of the backtrace. First one was with git between
>> > rc1 and rc2, but very close to rc2, second one was with rc2,
>> > both at bootup.
>>
>> Well usb_get_configuration seems to do a kmalloc(0) which is a bit
>> strange and this is why we flagged the allocation in the slab allocators.
>> Is there some way to avoid allocating an object of zero length?
>
> Can you try the patch below and let me know if it fixes the issue for
> you or not?
>
> thanks,
>
> greg k-h
>
>
> From: Alan Stern <[email protected]>
> Subject: [PATCH] USB: don't try to kzalloc 0 bytes
>
> This patch (as907) prevents us from trying to allocate 0 bytes
> when an interface has no endpoint descriptors.
>
> Signed-off-by: Alan Stern <[email protected]>
>
> --- usb-2.6.orig/drivers/usb/core/config.c
> +++ usb-2.6/drivers/usb/core/config.c
> @@ -185,10 +185,12 @@ static int usb_parse_interface(struct de
> num_ep = USB_MAXENDPOINTS;
> }
>
> - len = sizeof(struct usb_host_endpoint) * num_ep;
> - alt->endpoint = kzalloc(len, GFP_KERNEL);
> - if (!alt->endpoint)
> - return -ENOMEM;
> + if (num_ep > 0) { /* Can't allocate 0 bytes */
> + len = sizeof(struct usb_host_endpoint) * num_ep;
> + alt->endpoint = kzalloc(len, GFP_KERNEL);
> + if (!alt->endpoint)
> + return -ENOMEM;
> + }
>
> /* Parse all the endpoint descriptors */
> n = 0;
>

Thanks, this one seems to fix it, I don't get the BUG anymore.

Greetings,

Indan


2007-05-20 12:52:28

by Rafael J. Wysocki

Subject: Re: Linux 2.6.22-rc2: make -j makes it unresponsive

Hi,

On Saturday, 19 May 2007 07:17, Linus Torvalds wrote:
>
> It's out there, both patches/tarballs and git trees are updated (although
> mirroring might still be ongoing)
>
> Various random fixes all over - the shortlog (appended) is fairly
> readable. The most notable ones are probably more SLUB fixes, and the
> epoll optimizations and cleanups.
>
> But there's stuff in architectures (ia64, SH, AVR32, POWER), libata,
> network drivers, sound.. Give it a try.
>
> I've been telling some people off on merging stuff, and I'll get even more
> hard-nosed about it after -rc2, so please don't even try to send anything
> but real fixes.
>
> I think the current situation looks reasonably good for 2.6.22, but I hope
> everybody will take a good look at the regression lists (whether they
> _think_ they are affected or not), and spend some time wondering "was that
> anything I did, or is it something I can look at". Ok?

Running 'make -j' kernel compilation on my test box (Athlon64 X2, 2 SATA drives
with 6 software RAID1 ext3 and reiserfs partitions, 2 GB of RAM) makes it
completely unresponsive. I can't even move the mouse pointer when it's
running, I can't log to the box from the network etc.

The anticipatory IO scheduler is used.

Greetings,
Rafael

2007-05-20 13:01:32

by Krzysztof Halasa

Subject: Re: Linux 2.6.22-rc2: make -j makes it unresponsive

"Rafael J. Wysocki" <[email protected]> writes:

> Running 'make -j' kernel compilation on my test box (Athlon64 X2, 2 SATA
> drives
> with 6 software RAID1 ext3 and reiserfs partitions, 2 GB of RAM) makes it
> completely unresponsive. I can't even move the mouse pointer when it's
> running, I can't log to the box from the network etc.

How many processes does it spawn? Try some sane limit.
--
Krzysztof Halasa
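
For context: GNU make treats '-j' without a number as "no limit on parallel jobs". A small sketch of how a bounded job count could be derived from the number of online CPUs is given below. The "CPUs plus one" rule is only an assumed rule of thumb, the function names are hypothetical, and _SC_NPROCESSORS_ONLN is a common extension (available on Linux/glibc) rather than something POSIX guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Number of CPUs currently online, falling back to 1 if unknown. */
static long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? n : 1;
}

/* Assumed heuristic: one job per CPU plus one to cover I/O waits. */
static long bounded_jobs(long ncpus)
{
	assert(ncpus > 0);
	return ncpus + 1;
}

/* Write a command such as "make -j3" into buf; returns snprintf()'s result. */
static int format_make_cmd(char *buf, size_t len, long ncpus)
{
	return snprintf(buf, len, "make -j%ld", bounded_jobs(ncpus));
}
```

On a dual-core box like the one described above, format_make_cmd(buf, sizeof(buf), online_cpus()) would give a command with three jobs instead of an unbounded number of compiler processes.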

2007-05-20 13:18:50

by Rafael J. Wysocki

Subject: Re: Linux 2.6.22-rc2: make -j makes it unresponsive

On Sunday, 20 May 2007 15:01, Krzysztof Halasa wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > Running 'make -j' kernel compilation on my test box (Athlon64 X2, 2 SATA
> > drives
> > with 6 software RAID1 ext3 and reiserfs partitions, 2 GB of RAM) makes it
> > completely unresponsive. I can't even move the mouse pointer when it's
> > running, I can't log to the box from the network etc.
>
> How many processes does it spawn? Try some sane limit.

Do you think it works as a fork bomb? Well, it didn't work like that before,
AFAIR, but then 2.6.21 also does it with the same settings, so sorry for the
noise.

Greetings,
Rafael

2007-05-20 22:08:51

by Mike Houston

Subject: Re: Linux 2.6.22-rc2

On Fri, 18 May 2007 22:17:14 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:


> Stephen Hemminger (7):
> [TCP] slow start: Make comments and code logic clearer.
> *** sky2: remove Gigabyte 88e8056 restriction ***
> sky2: PHY register settings
> sky2: keep track of receive alloc failures
> sky2: MIB counter overflow handling
> sky2: remove dual port workaround
> sky2: memory barriers change
>

I tested this and it's still horribly broken for me with Gigabyte
88E8056 onboard LAN. Same symptom as before, it works for several
seconds and then dies.

Relevant portion of logs:

May 20 15:57:48 cramit kernel: sky2 0000:04:00.0: v1.14 addr
0xf8000000 irq 16 Yukon-EC Ultra (0xb4) rev 2
May 20 15:57:48 cramit kernel: sky2 eth0: addr 00:16:e6:da:f3:b5

May 20 15:57:48 cramit kernel: sky2 eth0: enabling interface
May 20 15:57:48 cramit kernel: sky2 eth0: ram buffer 0K
May 20 15:57:48 cramit kernel: ACPI: PCI Interrupt 0000:00:1b.0[A] ->
GSI 22 (level, low) -> IRQ 18
May 20 15:57:48 cramit kernel: PCI: Setting latency timer of device
0000:00:1b.0 to 64
May 20 15:57:50 cramit kernel: sky2 eth0: Link is up at 100 Mbps,
full duplex, flow control both

Attempt to ftp a file to another box on LAN and about 1.5
megabytes into the transfer:

May 20 16:01:43 cramit kernel: sky2 eth0: hw error interrupt status
0x8
May 20 16:01:43 cramit kernel: sky2 eth0: MAC parity error
May 20 16:01:43 cramit kernel: sky2 0000:04:00.0: error interrupt
status=0x80000000
May 20 16:01:43 cramit kernel: sky2 eth0: hw error interrupt status
0x8
May 20 16:01:43 cramit kernel: sky2 eth0: MAC parity error

Transfer stalls and that's all she wrote.

If interested in seeing kernel config:
http://www.mikeserv.org/files/config-2.6.22-rc2

Oh well, back to trusty rtl8139 based PCI card for now.

Thanks for working on this stuff,

Mike Houston

2007-05-21 15:46:17

by Stephen Hemminger

Subject: Re: Linux 2.6.22-rc2

On Sun, 20 May 2007 17:05:06 -0400
Mike Houston <[email protected]> wrote:

> On Fri, 18 May 2007 22:17:14 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
>
> > Stephen Hemminger (7):
> > [TCP] slow start: Make comments and code logic clearer.
> > *** sky2: remove Gigabyte 88e8056 restriction ***
> > sky2: PHY register settings
> > sky2: keep track of receive alloc failures
> > sky2: MIB counter overflow handling
> > sky2: remove dual port workaround
> > sky2: memory barriers change
> >
>
> I tested this and it's still horribly broken for me with Gigabyte
> 88E8056 onboard LAN. Same symptom as before, it works for several
> seconds and then dies.

It's almost certainly a problem with the BIOS and hardware (not a sky2)
driver issue. Since there are many similar boards and configurations, I made
the decision not to enforce restrictions in the driver.

--
Stephen Hemminger <[email protected]>

2007-05-21 17:12:31

by Mike Houston

Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 08:45:49 -0700
Stephen Hemminger <[email protected]> wrote:

> It's almost certainly a problem with the BIOS and hardware (not a
> sky2) driver issue. Since there are many similar boards and
> configurations, I made the decision not to enforce restrictions in
> the driver.

>> May 20 15:57:48 cramit kernel: sky2 0000:04:00.0: v1.14 addr
>> 0xf8000000 irq 16 Yukon-EC Ultra (0xb4) rev 2

Thank you for your answer. I was half wondering if that was the case
after staring at those log messages several more times. I don't
understand hardware at the low level but got thinking maybe interrupt
routing issue. There's an Nvidia PCI Express card in there that gets
IRQ 16, though it was not initialized by a driver at the time. (plain
old VGA console after fresh cold boot... no framebuffer, no X, no
nvidia module). I guess some things don't share well.

It works well in that other OS that came with the hardware, but
that's beside the point.

Mike Houston

2007-05-21 17:38:20

by Stephen Hemminger

Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 13:10:55 -0400
Mike Houston <[email protected]> wrote:

> On Mon, 21 May 2007 08:45:49 -0700
> Stephen Hemminger <[email protected]> wrote:
>
> > It's almost certainly a problem with the BIOS and hardware (not a
> > sky2) driver issue. Since there are many similar boards and
> > configurations, I made the decision not to enforce restrictions in
> > the driver.
>
> >> May 20 15:57:48 cramit kernel: sky2 0000:04:00.0: v1.14 addr
> >> 0xf8000000 irq 16 Yukon-EC Ultra (0xb4) rev 2
>
> Thank you for your answer. I was half wondering if that was the case
> after staring at those log messages several more times. I don't
> understand hardware at the low level but got thinking maybe interrupt
> routing issue. There's an Nvidia PCI Express card in there that gets
> IRQ 16, though it was not initialized by a driver at the time. (plain
> old VGA console after fresh cold boot... no framebuffer, no X, no
> nvidia module). I guess some things don't share well.
>
> It works well in that other OS that came with the hardware, but
> that's beside the point.

It is some low level PCI Express related stuff, try latest BIOS (F9)
and if that doesn't help there is a EEPROM update from Gigabyte
for the Marvell hardware that might help.

--
Stephen Hemminger <[email protected]>

2007-05-22 02:59:40

by Mike Houston

Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 10:37:55 -0700
Stephen Hemminger <[email protected]> wrote:

> On Mon, 21 May 2007 13:10:55 -0400
> Mike Houston <[email protected]> wrote:
>
> > On Mon, 21 May 2007 08:45:49 -0700
> > Stephen Hemminger <[email protected]> wrote:
> >
> > > It's almost certainly a problem with the BIOS and hardware (not
> > > a sky2) driver issue. Since there are many similar boards and
> > > configurations, I made the decision not to enforce restrictions
> > > in the driver.
> >
> > >> May 20 15:57:48 cramit kernel: sky2 0000:04:00.0: v1.14 addr
> > >> 0xf8000000 irq 16 Yukon-EC Ultra (0xb4) rev 2
> >
> > Thank you for your answer. I was half wondering if that was the
> > case after staring at those log messages several more times. I
> > don't understand hardware at the low level but got thinking maybe
> > interrupt routing issue. There's an Nvidia PCI Express card in
> > there that gets IRQ 16, though it was not initialized by a driver
> > at the time. (plain old VGA console after fresh cold boot... no
> > framebuffer, no X, no nvidia module). I guess some things don't
> > share well.
> >
> > It works well in that other OS that came with the hardware, but
> > that's beside the point.
>
> It is some low level PCI Express related stuff, try latest BIOS (F9)
> and if that doesn't help there is a EEPROM update from Gigabyte
> for the Marvell hardware that might help.

Thanks for your suggestions, I followed through on them. It may still
be interesting/useful to hear from me that it didn't help. The
problem is the same.

My motherboard is a newer revision (Gigabyte GA-965P-DS3 Rev 3.3) and
already had the "F10" bios version, but I flashed to the latest F11
version anyways. I also flashed with the EEPROM update from Gigabyte,
from a FAQ entry for my motherboard revision.
(faq_marvell_eeprom.zip). Both operations were successful. I cleared
the CMOS and reconfigured after the bios flash too.

Incidentally, it was showing IRQ 16 in that early initialization
message, but actually getting an MSI interrupt (IRQ 219, PCI-MSI-edge).

I've disabled the onboard yukon2 adapter in bios and gone
back to the PCI card now. I think we can consider the matter closed,
since it's not a problem with the driver, but just so you know, I'm
always willing to help test when it's hardware that I have.

Mike Houston

2007-05-22 04:32:15

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 22:58:06 -0400
Mike Houston <[email protected]> wrote:

> On Mon, 21 May 2007 10:37:55 -0700
> Stephen Hemminger <[email protected]> wrote:
>
> > On Mon, 21 May 2007 13:10:55 -0400
> > Mike Houston <[email protected]> wrote:
> >
> > > On Mon, 21 May 2007 08:45:49 -0700
> > > Stephen Hemminger <[email protected]> wrote:
> > >
> > > > It's almost certainly a problem with the BIOS and hardware (not
> > > > a sky2) driver issue. Since there are many similar boards and
> > > > configurations, I made the decision not to enforce restrictions
> > > > in the driver.
> > >
> > > >> May 20 15:57:48 cramit kernel: sky2 0000:04:00.0: v1.14 addr
> > > >> 0xf8000000 irq 16 Yukon-EC Ultra (0xb4) rev 2
> > >
> > > Thank you for your answer. I was half wondering if that was the
> > > case after staring at those log messages several more times. I
> > > don't understand hardware at the low level but got thinking maybe
> > > interrupt routing issue. There's an Nvidia PCI Express card in
> > > there that gets IRQ 16, though it was not initialized by a driver
> > > at the time. (plain old VGA console after fresh cold boot... no
> > > framebuffer, no X, no nvidia module). I guess some things don't
> > > share well.
> > >
> > > It works well in that other OS that came with the hardware, but
> > > that's beside the point.
> >
> > It is some low-level PCI Express related stuff. Try the latest BIOS (F9),
> > and if that doesn't help there is an EEPROM update from Gigabyte
> > for the Marvell hardware that might help.
>
> Thanks for your suggestions, I followed through on them. It may still
> be interesting/useful to hear from me that it didn't help. The
> problem is the same.
>
> My motherboard is a newer revision (Gigabyte GA-965P-DS3 Rev 3.3) and
> already had the "F10" bios version, but I flashed to the latest F11
> version anyways. I also flashed with the EEPROM update from Gigabyte,
> from a FAQ entry for my motherboard revision.
> (faq_marvell_eeprom.zip). Both operations were successful. I cleared
> the CMOS and reconfigured after the bios flash too.
>
> Incidentally, it was showing IRQ 16 in that early initialization
> message, but actually getting an MSI interrupt (IRQ 219, PCI-MSI-edge).
>
> I've disabled the onboard yukon2 adapter in bios and gone
> back to the PCI card now. I think we can consider the matter closed,
> since it's not a problem with the driver, but just so you know, I'm
> always willing to help test when it's hardware that I have.
>
> Mike Houston

There may be some hardware-level interaction with the SATA controller.
I saw no failures running an i386 kernel off a PATA drive, but quickly
see errors with SATA/AHCI and x86_64.

--
Stephen Hemminger <[email protected]>

2007-05-22 04:36:27

by Jeff Garzik

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

Stephen Hemminger wrote:
> There may be some hardware-level interaction with the SATA controller.
> I saw no failures running an i386 kernel off a PATA drive, but quickly
> see errors with SATA/AHCI and x86_64.


I presume AHCI is the only other device in the system using PCI MSI,
when you see problems?

Jeff


2007-05-22 04:43:20

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Tue, 22 May 2007 00:36:15 -0400
Jeff Garzik <[email protected]> wrote:

> Stephen Hemminger wrote:
> > There may be some hardware-level interaction with the SATA controller.
> > I saw no failures running an i386 kernel off a PATA drive, but quickly
> > see errors with SATA/AHCI and x86_64.
>
>
> I presume AHCI is the only other device in the system using PCI MSI,
> when you see problems?
>
> Jeff
>
>
AHCI on this motherboard doesn't seem to use MSI. The problems occur
even if I boot with nomsi.

--
Stephen Hemminger <[email protected]>

2007-05-22 05:06:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2



On Mon, 21 May 2007, Stephen Hemminger wrote:
>
> AHCI on this motherboard doesn't seem to use MSI. The problems occur
> even if I boot with nomsi.

Have you tried playing with PCI latency counters etc?

Maybe the SATA/AHCI thing is better at saturating the bus, and the sky2
hardware gets upset if it has overlong DMA access latencies due to some
other controller keeping the bus busy with a long burst access?

I can't really see that being a real problem in this day and age of PCI-X
etc, but it _used_ to be a possible issue a decade ago. Maybe you've found
a case where it matters even on modern hardware? We occasionally used to
set the PCI latency timer to make people happy.

(Not that I'm convinced it even has any semantic meaning on a modern PCI
system..)

Linus

2007-05-22 17:19:56

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 22:04:26 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 21 May 2007, Stephen Hemminger wrote:
> >
> > AHCI on this motherboard doesn't seem to use MSI. The problems occur
> > even if I boot with nomsi.
>
> Have you tried playing with PCI latency counters etc?
>
> Maybe the SATA/AHCI thing is better at saturating the bus, and the sky2
> hardware gets upset if it has overlong DMA access latencies due to some
> other controller keeping the bus busy with a long burst access?
>
> I can't really see that being a real problem in this day and age of PCI-X
> etc, but it _used_ to be a possible issue a decade ago. Maybe you've found
> a case where it matters even on modern hardware? We occasionally used to
> set the PCI latency timer to make people happy.
>
> (Not that I'm convinced it even has any semantic meaning on a modern PCI
> system..)
>
> Linus

The device in question is PCI Express, and the latency timer has no
meaning there (at least according to the vendor spec).

--
Stephen Hemminger <[email protected]>

2007-05-22 17:55:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

Linus Torvalds wrote:
>
> I can't really see that being a real problem in this day and age of PCI-X
> etc, but it _used_ to be a possible issue a decade ago. Maybe you've found
> a case where it matters even on modern hardware? We occasionally used to
> set the PCI latency timer to make people happy.
>
> (Not that I'm convinced it even has any semantic meaning on a modern PCI
> system..)
>

The PCI latency counters matter as long as you're talking about a PCI or
PCI-X bus. They matter not one iota on anything that pretends to be a PCI
bus but isn't, i.e. PCI Express, HyperTransport, etc.
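
One way to tell whether the latency timer can matter for a given function
is to look for a PCI Express capability in its configuration space. The
following is a minimal sketch in C, assuming the first 256 bytes of config
space have already been read into a buffer (for example from sysfs). The
function name is hypothetical; the offsets are the standard ones (Status at
0x06, capability pointer at 0x34, capability ID 0x10 for PCI Express).

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_STATUS          0x06  /* Status register (low byte) */
#define CFG_STATUS_CAP_LIST 0x10  /* bit 4: capability list present */
#define CFG_CAP_POINTER     0x34  /* offset of first capability */
#define CAP_ID_PCI_EXPRESS  0x10  /* capability ID for PCI Express */

/* Hypothetical helper: walk the capability list of a 256-byte config
 * space image and report whether a PCI Express capability is present. */
bool has_pcie_capability(const uint8_t cfg[256])
{
    if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST))
        return false;

    uint8_t ptr = cfg[CFG_CAP_POINTER] & 0xfc;

    /* Capabilities live at 0x40 and above; the guard stops a corrupt,
     * looping list from spinning forever. */
    for (int guard = 0; ptr >= 0x40 && guard < 48; guard++) {
        if (cfg[ptr] == CAP_ID_PCI_EXPRESS)
            return true;
        ptr = cfg[ptr + 1] & 0xfc;  /* next-capability pointer */
    }
    return false;
}
```

If this returns true for a device, tuning its latency timer is unlikely to
change anything.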

-hpa

2007-05-22 21:16:30

by Mike Houston

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Mon, 21 May 2007 21:31:46 -0700
Stephen Hemminger <[email protected]> wrote:

> There may be some hardware-level interaction with the SATA controller.
> I saw no failures running an i386 kernel off a PATA drive, but quickly
> see errors with SATA/AHCI and x86_64.

I'm running AHCI SATA on i386 here, but I'm not sure that has anything
to do with the problem, given what follows below.

I did another test here today. I disconnected my SATA hard disks and
installed a regular PATA drive. The only PATA port I have though, is
on the jmicron 363 controller. So I enabled that controller in the
bios (I keep it disabled because I have no use for it) and installed a
distro on the drive. PCLinuxOS TR4, which probably isn't the best test
system to use (and is not for me), but it's the only one I had on hand
that recognized IDE disks on the jmicron 363 controller with the
distro kernel.

After the install was done, I disconnected the SATA CD drive so there
would be no SATA devices. Nothing was on the ICH8 controller, which
I had put in IDE mode. (no setting to disable it entirely in bios)

I compiled 2.6.22-rc2 without libata/SATA support and only enabled the
old IDE subsystem with the jmicron 36x driver.

2.6.22-rc2 kernel was working well, and I brought up the sky2 eth0
interface alright, and as is the case most of the time (but not
always), I was able to do light stuff with it for a short time (e.g.
ssh in to another box, transfer a small text file etc.) but as soon
as I start trying to move any serious data the same or similar
problem occurs.

The only device using MSI at the time was the sky2, if that's
relevant. There were no other ethernet cards installed at the
time either.

In this case I actually had the kernel crash. First time for me ever
having a kernel oops! System locked up with keyboard LEDs blinking.

Not sure if anyone wants to see all of it (maybe some screwy
userland stuff involved), so I won't include that mess in the
message. It's here:
http://www.mikeserv.org/files/kernelcrash.txt

But in there we get this, a somewhat similar message:

May 22 16:16:45 localhost kernel: sky2 0000:04:00.0: error interrupt status=0x1
May 22 16:16:45 localhost kernel: sky2 eth0: descriptor error q=0x280 get=285 [800042375e2e5e] put=285

I hard booted and tried again a second time, and this time the kernel
didn't oops but I got this:

May 22 16:34:09 testinstall kernel: sky2 0000:04:00.0: error interrupt status=0x1
May 22 16:34:09 testinstall kernel: sky2 eth0: descriptor error q=0x280 get=497 [800042367dde5e] put=497
May 22 16:34:09 testinstall kernel: sky2 0000:04:00.0: error interrupt status=0x80000000
May 22 16:34:09 testinstall kernel: sky2 eth0: hw error interrupt status 0x8
May 22 16:34:09 testinstall kernel: sky2 eth0: MAC parity error

So it's the same problem. On halting, I quickly saw what looked like
a kernel oops but nothing was logged at that stage.

Third try was the kernel oops again on attempting to transfer a file.

By the way, last night I did grab the dmesg output from the last
attempt to use sky2 on my normal (from scratch) system in case it
would be useful. This is not to be confused with the PATA experiment
above:
http://www.mikeserv.org/files/dmesg-2.6.22-rc2.txt

Mike Houston

2007-05-23 00:00:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2



On Tue, 22 May 2007, Mike Houston wrote:
>
> In this case I actually had the kernel crash. First time for me ever
> having a kernel oops! System locked up with keyboard LED's blinking.
>
> Not sure if anyone wants to see all of it (maybe some screwy
> userland stuff involved), so I won't include that mess in the
> message. It's here:
> http://www.mikeserv.org/files/kernelcrash.txt

I think you have major memory corruption. That first oops disassembles to

	mov    0x10(%eax),%esi
	mov    $0xfffffdfd,%eax
	test   %esi,%esi
	je     after_call
	mov    %edx,%ecx
	mov    %edi,%eax
	mov    %ebx,%edx
	call   *%esi
    after_call:

which is (from net/ipv4/af_inet.c, inet_ioctl()):

	default:
		if (sk->sk_prot->ioctl)
			err = sk->sk_prot->ioctl(sk, cmd, arg);
		else
			err = -ENOIOCTLCMD;
		break;

and the load off "sk->sk_prot->ioctl" oopses, because "sk->sk_prot" is
corrupt and contains 0x8e3cad42, which is not a valid kernel pointer.
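
The two constants in that analysis can be sanity-checked with a small
sketch. It assumes the default i386 3G/1G split, where kernel mappings
start at 0xC0000000 (the reporter's actual config is not shown in the
thread), and it uses the kernel-internal value 515 for ENOIOCTLCMD. The
helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: default i386 3G/1G split (PAGE_OFFSET = 0xC0000000). */
#define ASSUMED_PAGE_OFFSET 0xC0000000u

/* Kernel-internal errno from include/linux/errno.h. */
#define DEMO_ENOIOCTLCMD 515

/* True if a 32-bit value could be a kernel virtual address under the
 * assumed split. This is a range check only; it says nothing about
 * whether the address is actually mapped. */
bool looks_like_kernel_pointer(uint32_t addr)
{
    return addr >= ASSUMED_PAGE_OFFSET;
}

/* Reinterpret a 32-bit immediate as a signed two's-complement value
 * without relying on implementation-defined conversion. */
int32_t imm_as_signed(uint32_t imm)
{
    if (imm & 0x80000000u)
        return -(int32_t)(~imm) - 1;
    return (int32_t)imm;
}
```

Under those assumptions, 0x8e3cad42 fails the range check, and the
immediate 0xfffffdfd loaded into %eax decodes to -515, which is the
-ENOIOCTLCMD of the else branch.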

The other oops is even worse.

I also think it meshes with

sky2 eth0: descriptor error q=0x280 get=285 [800042375e2e5e] put=285

and I suspect your memory got corrupted by sky2 reading the wrong
descriptors, and overwriting kernel memory.

So it's almost certainly some DMA problem. Now, _why_ you have DMA
problems, I have no idea. But can you try:
 - disable CONFIG_PREEMPT
 - disable CONFIG_HIGHMEM if you have it on
 - just in general see if you can disable any kernel config options that
   might be unnecessary.
to see if it changes the situation at all..

Linus

2007-05-23 00:29:50

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

Linus Torvalds wrote:
> On Tue, 22 May 2007, Mike Houston wrote:
>
>> In this case I actually had the kernel crash. First time for me ever
>> having a kernel oops! System locked up with keyboard LED's blinking.
>>
>> Not sure if anyone wants to see all of it (maybe some screwy
>> userland stuff involved), so I won't include that mess in the
>> message. It's here:
>> http://www.mikeserv.org/files/kernelcrash.txt
>>
>
> I think you have major memory corruption. That first oops disassembles to
>
> mov 0x10(%eax),%esi
> mov $0xfffffdfd,%eax
> test %esi,%esi
> je after_call
> mov %edx,%ecx
> mov %edi,%eax
> mov %ebx,%edx
> call *%esi
> after_call:
>
> which is (from net/ipv4/af_inet.c, inet_ioctl()):
>
> default:
> if (sk->sk_prot->ioctl)
> err = sk->sk_prot->ioctl(sk, cmd, arg);
> else
> err = -ENOIOCTLCMD;
> break;
>
> and the load off "sk->sk_prot->ioctl" oopses, because "sk->sk_prot" is
> corrupt and contains 0x8e3cad42, which is not a valid kernel pointer.
>
> The other oops is even worse.
>
> I also think it meshes with
>
> sky2 eth0: descriptor error q=0x280 get=285 [800042375e2e5e] put=285
>
>
A descriptor error means the driver told the chip to do something but
the OWNER bit wasn't set.
I have only ever seen this on the Gigabyte motherboard.
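
As a rough illustration of what an ownership bit does in a descriptor ring,
here is a minimal sketch in C. The structure layout, the bit value, and all
names are hypothetical and chosen for illustration; they are not taken from
the sky2 driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ownership flag: set by the driver when it hands a
 * descriptor to the hardware. */
#define DEMO_HW_OWNER 0x80u

/* Hypothetical list element shared between driver and device via DMA. */
struct demo_le {
    uint32_t addr;    /* DMA address of the buffer */
    uint16_t length;  /* buffer length in bytes */
    uint8_t  ctrl;    /* control flags */
    uint8_t  opcode;  /* operation, plus the ownership bit */
};

/* Driver side: fill in the operation and pass ownership to the device. */
void demo_give_to_hw(struct demo_le *le, uint8_t opcode)
{
    le->opcode = (uint8_t)(opcode | DEMO_HW_OWNER);
}

/* Device side (modelled in software): a descriptor may be processed
 * only if the driver has marked it as owned by the hardware. */
bool demo_hw_may_process(const struct demo_le *le)
{
    return (le->opcode & DEMO_HW_OWNER) != 0;
}
```

In this model, a device that fetches a stale or wrong element sees the
ownership bit clear and would flag an error even though the driver wrote
the descriptor correctly.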

It looks like the chip reads the wrong memory sometimes. The problem
happens only on the on-board NICs and only on this kind of motherboard.
For testing, I put code in to check that the receive data had actually
arrived before the IRQ, and it triggered on my Gigabyte 925 motherboard.
It appears that DMA access is messed up. This board has lots of
"overclocker" friendly stuff; maybe the BIOS never really sets up the PCI
bridges and clocks properly.

It doesn't seem like a software or driver problem. I have tried tweaking
PCI registers, but nothing worked in this case.

2007-05-23 01:53:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2



On Tue, 22 May 2007, Stephen Hemminger wrote:
>
> It looks like the chip reads the wrong memory sometimes. The problem happens
> only on the on-board NIC's and only on this kind of motherboard.

Do you know if it happens for particular addresses? (Ie, can you tell what
the physical address of the descriptor is for the errors?)

> For testing, I have put code in to check that the receive data actually
> arrived before the IRQ, it triggered on my Gigabyte 925 motherboard. It
> appears that DMA access is messed up.

Yes, that certainly would also explain memory corruption. Either because
writes went to the wrong address, or because writes went to the right
address, but because an earlier IO descriptor read had gotten corrupted,
the "right address" was in fact the wrong one ;)

The reason I ask whether you have some way of telling the pattern for the
physical address is that one traditional cause of DMA errors is due to
broken RAM remapping setup.

As an example of that - imagine that you have 1GB of RAM in the machine,
and realize that the memory behind the 640kB -> 1MB area isn't accessible,
because it's taken up by the legacy ISA region.

You have two possible outcomes: either (a) the memory is just "gone", and
you lost it, or (b) there is some RAM remapping in the core chipset that
makes the lost 384kB show up _above_ the 1GB mark instead.

The same "legacy ISA" hole situation happens for the "legacy PCI" hole,
which is why if you have 4GB of RAM in the machine, usually you'll see
3GB at addresses 0-3GB (roughly), and then you'll see the rest at above
the 4GB mark, in order to have a nice PCI hole in the 32-bit access range.
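
The remapping arithmetic can be sketched in C. This assumes, purely for
illustration, a PCI hole covering exactly 3GB to 4GB; real chipsets place
and size the hole differently, and the names here are hypothetical.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Assumed hole: physical addresses 3GB..4GB are reserved for PCI. */
#define DEMO_HOLE_START (3 * GiB)
#define DEMO_HOLE_END   (4 * GiB)

/* Map an offset into installed RAM to the physical address at which the
 * chipset exposes it. RAM that would have landed in the hole is
 * remapped to start at the end of the hole instead. */
uint64_t demo_ram_offset_to_phys(uint64_t ram_offset)
{
    if (ram_offset < DEMO_HOLE_START)
        return ram_offset;
    return DEMO_HOLE_END + (ram_offset - DEMO_HOLE_START);
}
```

With 4GB installed, the last gigabyte of RAM appears at physical addresses
4GB to 5GB under this model. If the CPU path applies the translation but a
DMA path does not, the two sides disagree about which RAM a given address
refers to.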

There's also the "legacy 286" hole at the 15-16MB mark (which nobody uses
any more, but chipsets still inexplicably support), and the SMM remapping.

Anyway, core chipsets generally do CPU memory accesses _differently_ from
DMA accesses from the PCI bus (at a minimum, SMM is something that only
the CPU can do), so I could see a situation where the remapping was set up
correctly for the CPU (and perhaps for "core chipset" devices like the
integrated southbridge), but devices that do DMA from the outside get
screwed over.

But it might not happen for all addresses. Non-remapped stuff might work
well, so if there is some way of figuring out what the bad DMA address was
for an erroneous access, that might offer some clues.

> This board has lots of "overclocker" friendly stuff; maybe the BIOS
> never really sets up the PCI bridges and clocks properly.

It's hard to set up a normal PCI-PCI bridge subtly incorrectly. But
special RAM timing or remapping stuff for the host bridge - sure.

> It doesn't seem like a software or driver problem. I have tried tweaking PCI
> registers but nothing worked in this case.

Yeah, the PCI registers that would affect things like this tend to be in
the host bridge, not on the normal device.

That said, Intel doesn't generally do the really insane things. And a lot
of the old remapping stuff is simply not done any more. For example, I
doubt that the 925 chipset even supports remapping the 640k-1M range any
more: 384kB just isn't worth it when people talk about gigs of RAM, the
way it was when 16MB was considered a lot.

And looking quickly at the Intel 925X MCH (memory controller hub)
registers, nothing jumps out as a good candidate for some obvious bug.

Linus

2007-05-23 14:58:39

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Tue, 22 May 2007 18:53:33 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Tue, 22 May 2007, Stephen Hemminger wrote:
> >
> > It looks like the chip reads the wrong memory sometimes. The problem happens
> > only on the on-board NIC's and only on this kind of motherboard.
>
> Do you know if it happens for particular addresses? (Ie, can you tell what
> the physical address of the descriptor is for the errors?)

I'll look but there didn't seem to be an obvious pattern when I last looked.


>
> > For testing, I have put code in to check that the receive data actually
> > arrived before the IRQ, it triggered on my Gigabyte 925 motherboard. It
> > appears that DMA access is messed up.
>
> Yes, that certainly would also explain memory corruption. Either because
> writes went to the wrong address, or because writes went to the right
> address, but because an earlier IO descriptor read had gotten corrupted,
> the "right address" was in fact the wrong one ;)
>
> The reason I ask whether you have some way of telling the pattern for the
> physical address is that one traditional cause of DMA errors is due to
> broken RAM remapping setup.
>
> As an example of that - imagine that you have 1GB of RAM in the machine,
> and realize that the memory behind the 640kB -> 1MB area isn't accessible,
> because it's taken up by the legacy ISA region.
>
> You have two possible outcomes: either (a) the memory is just "gone", and
> you lost it, or (b) there is some RAM remapping in the core chipset that
> makes the lost 384kB show up _above_ the 1GB mark instead.
>
> The same "legacy ISA" hole situation happens for the "legacy PCI" hole,
> which is why if you have 4GB of RAM in the machine, usually you'll see
> 3GB at addresses 0-3GB (roughly), and then you'll see the rest at above
> the 4GB mark, in order to have a nice PCI hole in the 32-bit access range.
>
> There's also the "legacy 286" hole at the 15-16MB mark (which nobody uses
> any more, but chipsets still inexplicably support), and the SMM remapping.
>
> Anyway, core chipsets generally do CPU memory accesses _differently_ from
> DMA accesses from the PCI bus (at a minimum, SMM is something that only
> the CPU can do), so I could see a situation where the remapping was set up
> correctly for the CPU (and perhaps for "core chipset" devices like the
> integrated southbridge), but devices that do DMA from the outside get
> screwed over.
>

This board doesn't have any onboard video so that helps. I am running
with 2GB of memory.

I can put a card with similar chip in an X1 slot, and there are no
problems. Same driver, but different bridges, and slightly different
Marvell chip.

> But it might not happen for all addresses. Non-remapped stuff might work
> well, so if there is some way of figuring out what the bad DMA address was
> for an erroneous access, that might offer some clues.
>
> > This board has lots of "overclocker" friendly stuff; maybe the BIOS
> > never really sets up the PCI bridges and clocks properly.
>
> It's hard to set up a normal PCI-PCI bridge subtly incorrectly. But
> special RAM timing or remapping stuff for the host bridge - sure.
>
> > It doesn't seem like a software or driver problem. I have tried tweaking PCI
> > registers but nothing worked in this case.
>
> Yeah, the PCI registers that would affect things like this tend to be in
> the host bridge, not on the normal device.
>
> That said, Intel doesn't generally do the really insane things. And a lot
> of the old remapping stuff is simply not done any more. For example, I
> doubt that the 925 chipset even supports remapping the 640k-1M range any
> more: 384kB just isn't worth it when people talk about gigs of RAM, the
> way it was when 16MB was considered a lot.
>
> And looking quickly at the Intel 925X MCH (memory controller hub)
> registers, nothing jumps out as a good candidate for some obvious bug.
>
> Linus

Here is the PCI controller chain to the device:

00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 00005000-00005fff
Memory behind bridge: fff00000-000fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 1
Link: Latency L0s <1us, L1 <4us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x0
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
Slot: Number 16, PowerLimit 10.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Unknown, PwrInd Unknown, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4169
Capabilities: [90] Subsystem: Giga-byte Technology Unknown device 5001
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel
Capabilities: [180] Unknown (5)

00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
I/O behind bridge: 0000a000-0000afff
Memory behind bridge: f8000000-f9ffffff
Prefetchable memory behind bridge: 0000000080100000-00000000801fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
Capabilities: [40] Express Root Port (Slot+) IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 5
Link: Latency L0s <256ns, L1 <4us
Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
Slot: Number 20, PowerLimit 10.000000
Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq-
Slot: AttnInd Unknown, PwrInd Unknown, Power-
Root: Correctable- Non-Fatal- Fatal- PME-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4181
Capabilities: [90] Subsystem: Giga-byte Technology Unknown device 5001
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel
Capabilities: [180] Unknown (5)

05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 14)
Subsystem: Giga-byte Technology Unknown device e000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 14
Region 0: Memory at f9000000 (64-bit, non-prefetchable) [size=16K]
Region 2: I/O ports at a000 [size=256]
[virtual] Expansion ROM at 80100000 [disabled] [size=128K]
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express Legacy Endpoint IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 0
Link: Latency L0s <256ns, L1 unlimited
Link: ASPM Disabled RCB 128 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities: [100] Advanced Error Reporting


--
Stephen Hemminger <[email protected]>

2007-05-23 17:39:30

by Mike Houston

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Tue, 22 May 2007 17:00:18 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> and the load off "sk->sk_prot->ioctl" oopses, because "sk->sk_prot"
> is corrupt and contains 0x8e3cad42, which is not a valid kernel
> pointer.
>
> The other oops is even worse.
>
> I also think it meshes with
>
> sky2 eth0: descriptor error q=0x280 get=285
> [800042375e2e5e] put=285
>
> and I suspect your memory got corrupted by sky2 reading the wrong
> descriptors, and overwriting kernel memory.
>
> So it's almost certainly some DMA problem. Now, _why_ you have DMA
> problems, I have no idea. But can you try:
> - disable CONFIG_PREEMPT
> - disable CONFIG_HIGHMEM if you have it on
> - just in general see if you can disable any kernel config options
> that might be unnecessary.
> to see if it changes the situation at all..

Thanks for looking at this. After further posts in the discussion I
wasn't sure if you still wanted me to try this, but I thought it
might be useful to see if (particularly) highmem support might change
the behaviour, or the messages in any way that might lead to a clue.
There was no change to the behaviour.

I have a Core 2 Duo and 2 GB of RAM, but I built a uniprocessor
kernel (with apic), without highmem support, with no PREEMPT and
without other unnecessary stuff. If by chance I got it working, my
plan was to enable things one at a time.

I won't get that oops on this setup though (never have, anyways...
it was just the PCLinux install on that other hard disk which has
now been returned to use elsewhere), but the messages on trying to
transfer data are the same:

First try (instant failure on trying to ssh):

May 23 12:51:14 cramit kernel: sky2 eth0: enabling interface
May 23 12:51:14 cramit kernel: sky2 eth0: ram buffer 0K
May 23 12:51:16 cramit kernel: sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
May 23 12:51:34 cramit kernel: sky2 0000:04:00.0: error interrupt status=0x1
May 23 12:51:34 cramit kernel: sky2 eth0: descriptor error q=0x280 get=7 [0] put=7

Second try after cold boot (failure on trying to transfer file):

May 23 12:52:59 cramit kernel: sky2 eth0: enabling interface
May 23 12:52:59 cramit kernel: sky2 eth0: ram buffer 0K
May 23 12:53:01 cramit kernel: sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
May 23 12:55:40 cramit kernel: sky2 0000:04:00.0: error interrupt status=0x80000000
May 23 12:55:40 cramit kernel: sky2 eth0: hw error interrupt status 0x8
May 23 12:55:40 cramit kernel: sky2 eth0: MAC parity error

This is exactly the behaviour I've been seeing.

I still happen to have a Windows Vista install kicking around, so to
make sure we're not flogging a dead horse I booted that and let it
set up the yukon2 chip and I tested it. (more to make sure that
eeprom update didn't break it). I used it for a bit and successfully
transferred some large files from a box running Samba. MS must be using
some specific workaround or something.

Mike Houston

2007-05-23 17:46:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2



On Wed, 23 May 2007, Mike Houston wrote:
>
> I still happen to have a Windows Vista install kicking around, so to
> make sure we're not flogging a dead horse I booted that and let it
> set up the yukon2 chip and I tested it. (more to make sure that
> eeprom update didn't break it). I used it for a bit and successfully
> transferred some large files from box running Samba. MS must be using
> some specific workaround or something.

I think there is some lspci-like thing for windows too.

Can you do the equivalent of "lspci -vvxxx" on that box under both Linux
and Windows? _If_ it's some PCI config space thing (which is not at all
guaranteed - it could be about setup in random MMIO ranges) it might give
us some clues.
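
Once both dumps are in hand, the comparison itself is mechanical. Below is
a minimal sketch in C, assuming each dump has already been parsed into a
256-byte array; the parsing step and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PCI_CFG_SIZE 256  /* conventional PCI config space size */

/* Compare two config space images byte by byte. Each differing offset
 * is printed to 'out' (pass NULL to stay silent), and the number of
 * differing bytes is returned. */
size_t diff_config_space(const uint8_t *a, const uint8_t *b, FILE *out)
{
    size_t ndiff = 0;

    for (size_t off = 0; off < PCI_CFG_SIZE; off++) {
        if (a[off] == b[off])
            continue;
        if (out)
            fprintf(out, "%02zx: %02x -> %02x\n", off,
                    (unsigned)a[off], (unsigned)b[off]);
        ndiff++;
    }
    return ndiff;
}
```

Some differences are expected and harmless, such as the Interrupt Line
byte at offset 0x3c or the MSI enable bit, so the output still needs to be
read with the register layout in mind.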

Linus

2007-05-23 18:05:13

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Wed, 23 May 2007 10:46:05 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Wed, 23 May 2007, Mike Houston wrote:
> >
> > I still happen to have a Windows Vista install kicking around, so to
> > make sure we're not flogging a dead horse I booted that and let it
> > set up the yukon2 chip and I tested it. (more to make sure that
> > eeprom update didn't break it). I used it for a bit and successfully
> > transferred some large files from box running Samba. MS must be using
> > some specific workaround or something.
>
> I think there is some lspci-like thing for windows too.
>
> Can you do the equivalent of "lspci -vvxxx" on that box under both Linux
> and Windows? _If_ it's some PCI config space thing (which is not at all
> guaranteed - it could be about setup in random MMIO ranges) it might give
> us some clues.
>
> Linus

lspci will work in Windows; it is probably part of Cygwin.

--
Stephen Hemminger <[email protected]>

2007-05-24 18:27:16

by Mike Houston

[permalink] [raw]
Subject: Re: Linux 2.6.22-rc2

On Wed, 23 May 2007 10:46:05 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Wed, 23 May 2007, Mike Houston wrote:
> >
> > I still happen to have a Windows Vista install kicking around, so
> > to make sure we're not flogging a dead horse I booted that and
> > let it set up the yukon2 chip and I tested it. (more to make sure
> > that eeprom update didn't break it). I used it for a bit and
> > successfully transferred some large files from box running Samba.
> > MS must be using some specific workaround or something.
>
> I think there is some lspci-like thing for windows too.
>
> Can you do the equivalent of "lspci -vvxxx" on that box under both
> Linux and Windows? _If_ it's some PCI config space thing (which is
> not at all guaranteed - it could be about setup in random MMIO
> ranges) it might give us some clues.
>

This is the sky2 issue with Gigabyte 88E8056 onboard LAN.

I've had no luck getting pciutils compiled for win32, but I found a
utility that gives similar output called Craig Hart's PCI bus sniffer
(pci32.exe).

Here is the output of pci32 with hex dump from within Windows Vista:
http://www.mikeserv.org/files/pci32_info.txt

Here is the output of lspci -vvxxx from within Linux:
http://www.mikeserv.org/files/lspci.txt

I hope this is helpful,

Mike Houston

2007-05-24 22:12:24

by Stephen Hemminger

Subject: sky2/pci issues on Gigabyte

On Thu, 24 May 2007 14:26:44 -0400
Mike Houston wrote:

> On Wed, 23 May 2007 10:46:05 -0700 (PDT)
> Linus Torvalds wrote:
>
> >
> >
> > On Wed, 23 May 2007, Mike Houston wrote:
> > >
> > > I still happen to have a Windows Vista install kicking around, so
> > > to make sure we're not flogging a dead horse I booted that and
> > > let it set up the yukon2 chip and I tested it. (more to make sure
> > > that eeprom update didn't break it). I used it for a bit and
> > > successfully transferred some large files from box running Samba.
> > > MS must be using some specific workaround or something.
> >
> > I think there is some lspci-like thing for windows too.
> >
> > Can you do the equivalent of "lspci -vvxxx" on that box under both
> > Linux and Windows? _If_ it's some PCI config space thing (which is
> > not at all guaranteed - it could be about setup in random MMIO
> > ranges) it might give us some clues.
> >
>
> This is the sky2 issue with Gigabyte 88E8056 onboard LAN.
>
> I've had no luck getting pciutils compiled for win32, but I found a
> utility that gives similar output called Craig Hart's PCI bus sniffer
> (pci32.exe).
>
> Here is the output of pci32 with hex dump from within Windows Vista:
> http://www.mikeserv.org/files/pci32_info.txt
>
> Here is the output of lspci -vvxxx from within Linux:
> http://www.mikeserv.org/files/lspci.txt
>
> I hope this is helpful,
>
> Mike Houston


Looking at the 88e8056 PCI config values:

Differences:
1. Linux uses MSI (no INTx), Vista does not.

2. Vista uses IRQ 16, Linux uses 219 (because of MSI)

3. Vista sets Device Control(E8) 4000h = 2K,
Linux uses 2000h = 1k
This would cause larger max read requests.

4. Vista status (EA) 0010h
Linux is 0019h
Driver doesn't bother clearing the correctable error status on boot.

You can get the same settings on Linux without changing driver by doing:

modprobe sky2 disable_msi=1
setpci -s 04:00 e8.w=4000,19

No luck, I tried it, but it still dies..
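
For reference, the size in item 3 comes from the Max Read Request Size field of the PCI Express Device Control register. A minimal sketch in Python, assuming the standard encoding (bits 14:12, with the size being 128 bytes shifted left by the field value); the function name is made up for illustration:

```python
def max_read_request_bytes(devctl):
    """Decode Max Read Request Size from a 16-bit Device Control value.

    Assumption: bits 14:12 hold n, and the size is 128 << n bytes,
    with n = 6 and n = 7 reserved.
    """
    field = (devctl >> 12) & 0x7
    if field > 5:
        raise ValueError("reserved Max Read Request Size encoding")
    return 128 << field


if __name__ == "__main__":
    for value in (0x4000, 0x2000):
        print(f"{value:04x}h -> {max_read_request_bytes(value)} bytes")
```

Under that encoding 4000h comes out as 2048 bytes and 2000h as 512 bytes, so the 1k figure above may be worth re-checking against the raw dump.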

--
Stephen Hemminger

2007-05-24 22:48:37

by Linus Torvalds

Subject: Re: sky2/pci issues on Gigabyte



On Thu, 24 May 2007, Stephen Hemminger wrote:
>
> Looking at the 88e8056 PCI config values:

I think you're looking at the wrong device.

The ones that matter are likely the PCI-X bridge, not the device. The
device cannot reasonably screw up DMA (unless it's really scrogged, but
then it wouldn't work under Vista either).

So it's much more likely to be about device 00:1c.4, which is the bridge
to PCI bus #4:

00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5
Bus: primary=00, secondary=04, subordinate=04, sec-latency=0

So I'd look at its config space instead ("-" is Vista, "+" is Linux):

-00: 86 80 47 28 07 00 10 00 02 00 04 06 08 00 81 00
+00: 86 80 47 28 07 04 10 00 02 00 04 06 08 00 81 00

10: 00 00 00 00 00 00 00 00 00 04 04 00 b0 b0 00 00

-20: 00 f7 f0 f8 f1 ff 01 00 00 00 00 00 00 00 00 00
+20: 00 f7 f0 f8 01 80 01 80 00 00 00 00 00 00 00 00

-30: 00 00 00 00 40 00 00 00 00 00 00 00 10 01 04 00
+30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 04 00

-40: 10 80 41 01 c0 8f 00 00 00 00 10 00 11 24 11 05
+40: 10 80 41 01 c0 8f 00 00 0f 00 11 00 11 24 11 05

50: 40 00 11 30 60 05 a0 00 00 00 48 01 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

-80: 05 90 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+80: 05 90 01 00 0c 10 e0 fe d1 41 00 00 00 00 00 00

90: 0d a0 00 00 58 14 01 50 00 00 00 00 00 00 00 00
a0: 01 00 02 c8 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Which I _think_ is (I tried to be careful, but..):

        Vista         Linux

04:     0x00100007    0x00100407
24:     0x0001fff1    0x80018001
3c:     0x00040110    0x0004010b
48:     0x00100000    0x0011000f
80:     0x00009005    0x00019005
84:     0x00000000    0xfee0100c
88:     0x00000000    0x000041d1

but I have not looked at what the _meaning_ of those register
differences are.
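
The dword table above can be regenerated mechanically rather than by eye. A rough sketch in Python, assuming the dumps are "lspci -x" style rows of 16 bytes and that config-space dwords are little-endian; the helper names are hypothetical, and only the rows that differ between the two dumps are included:

```python
VISTA = """
00: 86 80 47 28 07 00 10 00 02 00 04 06 08 00 81 00
20: 00 f7 f0 f8 f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 10 01 04 00
40: 10 80 41 01 c0 8f 00 00 00 00 10 00 11 24 11 05
80: 05 90 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

LINUX = """
00: 86 80 47 28 07 04 10 00 02 00 04 06 08 00 81 00
20: 00 f7 f0 f8 01 80 01 80 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 04 00
40: 10 80 41 01 c0 8f 00 00 0f 00 11 00 11 24 11 05
80: 05 90 01 00 0c 10 e0 fe d1 41 00 00 00 00 00 00
"""


def parse_dump(text):
    """Turn 'off: b0 b1 ...' rows into a dict of byte offset -> value."""
    data = {}
    for line in text.strip().splitlines():
        off, _, rest = line.partition(":")
        base = int(off, 16)
        for i, tok in enumerate(rest.split()):
            data[base + i] = int(tok, 16)
    return data


def dword(data, off):
    """Assemble the little-endian 32-bit value starting at off."""
    return int.from_bytes(bytes(data[off + i] for i in range(4)), "little")


def diff_dwords(a, b):
    """Return (offset, value_a, value_b) for every aligned dword that differs."""
    diffs = []
    for off in sorted(a):
        if off % 4 or off not in b:
            continue
        da, db = dword(a, off), dword(b, off)
        if da != db:
            diffs.append((off, da, db))
    return diffs


if __name__ == "__main__":
    for off, v, l in diff_dwords(parse_dump(VISTA), parse_dump(LINUX)):
        print(f"{off:02x}: 0x{v:08x} 0x{l:08x}")
```

The same helpers would work on the 00:01.0 root port dumps, given both sets of rows.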

The host bridge itself could be the problem, but that one is identical
in the PCI config space. I guess it could also be this one:

00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0

but I don't know how "port 5" (which is the bus that the ethernet
controller is behind) is related to that "root port" (which is reported
to bridge only subordinate bus 01). The "root port" thing makes me
suspect that device 00:01.0 is somehow related to 00:1c.4 despite the
apparent lack of relationship in the bus topology itself (and the root
port does _not_ decode the IO/MEM resources that lead to the ethernet
chip).

There _are_ differences in that root port device too, but I haven't done
the diff of them yet.

Linus

2007-05-24 23:08:55

by Stephen Hemminger

Subject: Re: sky2/pci issues on Gigabyte

On Thu, 24 May 2007 15:48:23 -0700 (PDT)
Linus Torvalds wrote:

>
>
> On Thu, 24 May 2007, Stephen Hemminger wrote:
> >
> > Looking at the 88e8056 PCI config values:
>
> I think you're looking at the wrong device.

I didn't expect it to work; I was just going for the easy-to-hit difference first.

>
> The ones that matter are likely the PCI-X bridge, not the device. The
> device cannot reasonably screw up DMA (unless it's really scrogged, but
> then it wouldn't work under Vista either).

PCI-E

>
> So it's much more likely to be about device 00:1c.4, which is the bridge
> to PCI bus #4:
>
> 00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5
> Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
>
>
>
> So I'd look at its config space instead ("-" is Vista, "+" is Linux):


> -00: 86 80 47 28 07 00 10 00 02 00 04 06 08 00 81 00
> +00: 86 80 47 28 07 04 10 00 02 00 04 06 08 00 81 00
^--- INTX disable bit
Vista isn't enabling MSI, Linux is.
Try "nomsi"?
>
> 10: 00 00 00 00 00 00 00 00 00 04 04 00 b0 b0 00 00
>
> -20: 00 f7 f0 f8 f1 ff 01 00 00 00 00 00 00 00 00 00
> +20: 00 f7 f0 f8 01 80 01 80 00 00 00 00 00 00 00 00
24: BAR5 difference?
>
> -30: 00 00 00 00 40 00 00 00 00 00 00 00 10 01 04 00
> +30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 04 00
3c: Assigned IRQ value

> -40: 10 80 41 01 c0 8f 00 00 00 00 10 00 11 24 11 05
> +40: 10 80 41 01 c0 8f 00 00 0f 00 11 00 11 24 11 05
48: PCI Express device control
Vista: 0000
Linux: 000f = advanced error reports enabled
4a: PCI Express device status
Vista: 0010
Linux: 0011 = correctable error detected
Driver doesn't clear the error during boot; you can do it with
setpci, but it doesn't fix the problem. (I do have a fix for that
bug, but it is not important for this discussion.)

> 50: 40 00 11 30 60 05 a0 00 00 00 48 01 00 00 00 00
> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> -80: 05 90 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> +80: 05 90 01 00 0c 10 e0 fe d1 41 00 00 00 00 00 00
These are the MSI setup registers which Vista isn't using.

> 90: 0d a0 00 00 58 14 01 50 00 00 00 00 00 00 00 00
> a0: 01 00 02 c8 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

So the only differences I see are MSI and the advanced error
reporting bits.
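
The bit-level reading above can be cross-checked with a small decoder. A sketch in Python, assuming the standard PCI command and PCI Express device control/status bit layouts; the tables list only the bits relevant here, and the helper names are invented for illustration:

```python
# Assumed bit positions (standard PCI / PCI Express layouts).
COMMAND_BITS = {
    0: "I/O space",
    1: "memory space",
    2: "bus master",
    10: "INTx disable",
}
DEVCTL_BITS = {
    0: "correctable error reporting",
    1: "non-fatal error reporting",
    2: "fatal error reporting",
    3: "unsupported request reporting",
}
DEVSTA_BITS = {
    0: "correctable error detected",
    1: "non-fatal error detected",
    2: "fatal error detected",
    3: "unsupported request detected",
    4: "AUX power detected",
    5: "transactions pending",
}


def set_bits(value, names):
    """Name every set bit of a 16-bit register value."""
    return [names.get(b, f"bit {b}") for b in range(16) if (value >> b) & 1]


def changed(vista, linux, names):
    """Name the bits that differ between the two register values."""
    return set_bits(vista ^ linux, names)


if __name__ == "__main__":
    print("command (04):", changed(0x0007, 0x0407, COMMAND_BITS))
    print("devctl  (48):", changed(0x0000, 0x000F, DEVCTL_BITS))
    print("devsta  (4a):", changed(0x0010, 0x0011, DEVSTA_BITS))
```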



--
Stephen Hemminger

2007-05-25 00:02:09

by Mike Houston

Subject: Re: sky2/pci issues on Gigabyte

On Thu, 24 May 2007 16:04:40 -0700
Stephen Hemminger wrote:
> > -00: 86 80 47 28 07 00 10 00 02 00 04 06 08 00 81 00
> > +00: 86 80 47 28 07 04 10 00 02 00 04 06 08 00 81 00
> ^--- INTX disable bit
> Vista isn't enabling MSI, Linux is.
> Try "nomsi"?

I had noticed that Vista wasn't using MSI for any devices and I
tried booting with pci=nomsi in addition to building the kernel
without MSI enabled (just in case there might somehow be a
difference). I see I should have mentioned it, but it had no effect
on the problem. The device gets IRQ 16 in Linux without MSI and it
still croaks with the same messages.

Mike Houston