LinuxLists.cc - Processes stuck on D state on Dual Opteron

2005-04-05 02:21:26

Subject: Processes stuck on D state on Dual Opteron

Hi,

While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D
state after some time.
This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other
node has no RAM modules plugged in, since this board works only with pairs).

I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the
following command line:

stress -v -c 20 -i 12 -m 10 -d 20

This causes a constant load avg. of around 70, makes the machine go into
swap a little, and writes up to about 20GB of random data to disk while
eating up all CPU. After about half an hour, random processes like top, df,
etc. get stuck in D state. Half of the 60 or so stress processes are also in D
state. The machine stays responsive for maybe another 15 minutes, but then
the shells just hang and sshd stops responding to connections, though the
machine still replies to pings (I don't have console access till tomorrow).
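
As a quick sanity check on the "60 or so" figure, here is a small Python
sketch that adds up the workers requested on the command line above. The
flag meanings (-c cpu, -i io, -m vm, -d hdd) follow the stress documentation;
the helper name count_stress_workers is made up for illustration. If I
remember the stress defaults correctly (1GB written per hdd worker, which is
an assumption here, not something I re-checked), 20 hdd workers would also
account for the roughly 20GB written to disk.

```python
import shlex

# Flag-to-worker-type mapping as described in the stress documentation.
WORKER_FLAGS = {"-c": "cpu", "-i": "io", "-m": "vm", "-d": "hdd"}


def count_stress_workers(cmdline):
    """Return a dict of worker counts per type for a stress command line.

    Hypothetical helper for illustration; it only understands the four
    short flags listed in WORKER_FLAGS and ignores everything else.
    """
    args = shlex.split(cmdline)
    counts = {}
    for i, arg in enumerate(args):
        if arg in WORKER_FLAGS and i + 1 < len(args):
            counts[WORKER_FLAGS[arg]] = int(args[i + 1])
    return counts


workers = count_stress_workers("stress -v -c 20 -i 12 -m 10 -d 20")
total = sum(workers.values())
print(workers, total)
```

The total comes to 62 worker processes, not counting the stress parent.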

The system is using ext3 with md software RAID1.

I'm interested in knowing if anyone out there with dual Opterons can
reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I will
try to find out if this is AMD64 specific as soon as possible. Please let me
know if you want me to run some other tests or give some more info to help
solve this one.
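
For anyone trying to reproduce this, a minimal Python sketch for listing the
tasks currently in D state, assuming a Linux /proc where each
/proc/<pid>/stat line has the form "pid (comm) state ...". The function names
are made up for illustration, and ps or top show the same information.

```python
import os


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc="/proc"):
    """Return (pid, comm) for every task whose state is D."""
    tasks = []
    if not os.path.isdir(proc):
        return tasks
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were reading it
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in sorted(d_state_tasks()):
        print(pid, comm)
```

Running it in a loop while stress is active should show when the number of
D state tasks starts to grow.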

Kernel config follows (compiled with gcc-3.4.4 on Debian)...

Best regards, thanks

Claudio Martins

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.12-rc2
# Tue Apr 5 00:15:41 2005
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_CLEAN_COMPILE=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
# CONFIG_IKCONFIG is not set
# CONFIG_CPUSETS is not set
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Processor type and features
#
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
# CONFIG_SCHED_SMT is not set
CONFIG_K8_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_DISCONTIGMEM=y
CONFIG_NUMA=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_NR_CPUS=2
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_GART_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_SECCOMP=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y

#
# Power management options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
# CONFIG_SOFTWARE_SUSPEND is not set

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_BOOT=y
CONFIG_ACPI_INTERPRETER=y
# CONFIG_ACPI_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_BUS=y
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_PCI=y
CONFIG_ACPI_SYSTEM=y
# CONFIG_ACPI_CONTAINER is not set

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_UNORDERED_IO is not set
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCI_MSI=y
# CONFIG_PCI_LEGACY_PROC is not set
# CONFIG_PCI_NAMES is not set
# CONFIG_PCI_DEBUG is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=y
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_UID16=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=m
# CONFIG_DEBUG_DRIVER is not set

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
CONFIG_BLK_CPQ_CISS_DA=m
# CONFIG_CISS_SCSI_TAPE is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_LBD is not set
# CONFIG_CDROM_PKTCDVD is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_ATA_OVER_ETH is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
# CONFIG_IDEPCI_SHARE_IRQ is not set
# CONFIG_BLK_DEV_OFFBOARD is not set
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
CONFIG_BLK_DEV_AMD74XX=y
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_SC1200 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_IDEDMA_IVB=y
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
CONFIG_SCSI=y
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set

#
# SCSI Transport Attributes
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set

#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_SCSI_SATA is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_FC is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA2XXX=y
# CONFIG_SCSI_QLA21XX is not set
# CONFIG_SCSI_QLA22XX is not set
# CONFIG_SCSI_QLA2300 is not set
# CONFIG_SCSI_QLA2322 is not set
# CONFIG_SCSI_QLA6312 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set

#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID5=y
CONFIG_MD_RAID6=y
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
CONFIG_BLK_DEV_DM=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
# CONFIG_DM_MULTIPATH is not set

#
# Fusion MPT device support
#
CONFIG_FUSION=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m

#
# IEEE 1394 (FireWire) support
#
# CONFIG_IEEE1394 is not set

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Networking support
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_FWMARK=y
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_TUNNEL=m
CONFIG_IP_TCPDIAG=y
CONFIG_IP_TCPDIAG_IPV6=y

#
# IP: Virtual Server Configuration
#
CONFIG_IP_VS=m
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS application helper
#
CONFIG_IP_VS_FTP=m
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_INET6_TUNNEL=m
CONFIG_IPV6_TUNNEL=m
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_BRIDGE_NETFILTER=y

#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_CT_ACCT=y
CONFIG_IP_NF_CONNTRACK_MARK=y
CONFIG_IP_NF_CT_PROTO_SCTP=m
CONFIG_IP_NF_FTP=m
CONFIG_IP_NF_IRC=m
CONFIG_IP_NF_TFTP=m
CONFIG_IP_NF_AMANDA=m
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_LIMIT=m
CONFIG_IP_NF_MATCH_IPRANGE=m
CONFIG_IP_NF_MATCH_MAC=m
CONFIG_IP_NF_MATCH_PKTTYPE=m
CONFIG_IP_NF_MATCH_MARK=m
CONFIG_IP_NF_MATCH_MULTIPORT=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_DSCP=m
CONFIG_IP_NF_MATCH_AH_ESP=m
CONFIG_IP_NF_MATCH_LENGTH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_TCPMSS=m
CONFIG_IP_NF_MATCH_HELPER=m
CONFIG_IP_NF_MATCH_STATE=m
CONFIG_IP_NF_MATCH_CONNTRACK=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_MATCH_PHYSDEV=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_REALM=m
CONFIG_IP_NF_MATCH_SCTP=m
CONFIG_IP_NF_MATCH_COMMENT=m
CONFIG_IP_NF_MATCH_CONNMARK=m
CONFIG_IP_NF_MATCH_HASHLIMIT=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_TARGET_TCPMSS=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_SAME=m
CONFIG_IP_NF_NAT_SNMP_BASIC=m
CONFIG_IP_NF_NAT_IRC=m
CONFIG_IP_NF_NAT_FTP=m
CONFIG_IP_NF_NAT_TFTP=m
CONFIG_IP_NF_NAT_AMANDA=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_DSCP=m
CONFIG_IP_NF_TARGET_MARK=m
CONFIG_IP_NF_TARGET_CLASSIFY=m
CONFIG_IP_NF_TARGET_CONNMARK=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_TARGET_NOTRACK=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_LIMIT=m
CONFIG_IP6_NF_MATCH_MAC=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_MULTIPORT=m
CONFIG_IP6_NF_MATCH_OWNER=m
CONFIG_IP6_NF_MATCH_MARK=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_AHESP=m
CONFIG_IP6_NF_MATCH_LENGTH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_PHYSDEV=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_MARK=m
CONFIG_IP6_NF_RAW=m

#
# Bridge: Netfilter Configuration
#
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
CONFIG_XFRM=y
CONFIG_XFRM_USER=m

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
# CONFIG_ATM is not set
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
CONFIG_NET_CLS_ROUTE=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# Ethernet (10 or 100Mbit)
#
# CONFIG_NET_ETHERNET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SK98LIN is not set
CONFIG_TIGON3=m

#
# Ethernet (10000 Mbit)
#
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPPOE=m
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=m

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=y
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set
CONFIG_SOUND_GAMEPORT=y

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
# CONFIG_SERIAL_8250_ACPI is not set
CONFIG_SERIAL_8250_NR_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# Ftape, the floppy tape device driver
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_DRM is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set

#
# I2C support
#
CONFIG_I2C=m
CONFIG_I2C_CHARDEV=m

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
# CONFIG_I2C_AMD756_S4882 is not set
CONFIG_I2C_AMD8111=m
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_PIIX4 is not set
CONFIG_I2C_ISA=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set

#
# Hardware Sensors Chip support
#
CONFIG_I2C_SENSOR=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m

#
# Other I2C Chip support
#
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_RTC8564=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set

#
# Graphics support
#
# CONFIG_FB is not set
CONFIG_VIDEO_SELECT=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y

#
# Sound
#
# CONFIG_SOUND is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB=m
CONFIG_USB_DEBUG=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_BLUETOOTH_TTY is not set
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' may also be needed;
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
# CONFIG_USB_STORAGE_USBAT is not set
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y

#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_USB_HIDINPUT=y
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_MTOUCH is not set
# CONFIG_USB_EGALAX is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Multimedia devices
#
# CONFIG_USB_DABUSB is not set

#
# Video4Linux support is needed for USB Multimedia device support
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGETKIT is not set
# CONFIG_USB_PHIDGETSERVO is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_TEST is not set

#
# USB ATM/DSL drivers
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
# CONFIG_EXT2_FS_SECURITY is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
# CONFIG_EXT3_FS_SECURITY is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
# CONFIG_REISERFS_FS_SECURITY is not set
CONFIG_JFS_FS=m
# CONFIG_JFS_POSIX_ACL is not set
# CONFIG_JFS_SECURITY is not set
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_FS_POSIX_ACL=y

#
# XFS support
#
CONFIG_XFS_FS=m
CONFIG_XFS_EXPORT=y
# CONFIG_XFS_RT is not set
# CONFIG_XFS_QUOTA is not set
# CONFIG_XFS_SECURITY is not set
CONFIG_XFS_POSIX_ACL=y
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_SYSFS=y
# CONFIG_DEVFS_FS is not set
# CONFIG_DEVPTS_FS_XATTR is not set
CONFIG_TMPFS=y
# CONFIG_TMPFS_XATTR is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_RAMFS=y

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
CONFIG_AFS_FS=m
CONFIG_RXRPC=m

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-15"
CONFIG_NLS_CODEPAGE_437=m
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
CONFIG_NLS_CODEPAGE_860=m
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=m
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m

#
# Profiling support
#
# CONFIG_PROFILING is not set

#
# Kernel hacking
#
# CONFIG_PRINTK_TIME is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_INIT_DEBUG is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_KPROBES is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_TEST=m

#
# Hardware crypto devices
#

#
# Library routines
#
CONFIG_CRC_CCITT=m
CONFIG_CRC32=y
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
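
Since the config is long, here is a small Python sketch for pulling out the
options that seem most relevant to this report (SMP, preempt, NUMA, md RAID1,
ext3, SysRq). The sample lines are copied from the config above; the function
name parse_kconfig and the choice of options are only for illustration.

```python
def parse_kconfig(text):
    """Parse kernel .config text into a dict of option name -> value.

    Lines of the form "# CONFIG_FOO is not set" are recorded as "n";
    other comment lines and blank lines are ignored.
    """
    suffix = " is not set"
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value.strip('"')
    return opts


# A few lines copied from the config above.
sample = """\
CONFIG_SMP=y
CONFIG_PREEMPT=y
CONFIG_NUMA=y
CONFIG_NR_CPUS=2
CONFIG_MD_RAID1=y
CONFIG_EXT3_FS=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_DEBUG_SPINLOCK is not set
"""

opts = parse_kconfig(sample)
for name in sorted(opts):
    print(name, opts[name])
```

To check the whole file, pass the contents of the .config to parse_kconfig
instead of the sample string.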

2005-04-05 03:13:01

by Andrew Morton

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins <[email protected]> wrote:
>
> While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D
> state after some time.
> This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other
> node has no RAM modules plugged in, since this board works only with pairs).
>
> I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the
> following command line:
>
> stress -v -c 20 -i 12 -m 10 -d 20
>
> This causes a constant load avg. of around 70, makes the machine go into
> swap a little, and writes up to about 20GB of random data to disk while
> eating up all CPU. After about half an hour random processes like top, df,
> etc. get stuck in D state. Half of the 60 or so stress processes are also in D
> state. The machine keeps being responsive for maybe some 15 minutes but then
> the shells just hang and sshd stops responding to connections, though the
> machine replies to pings (I don't have console access till tomorrow).
>
> The system is using ext3 with md software Raid1.
>
> I'm interested in knowing if anyone out there with dual Opterons can
> reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I will
> try to find out if this is AMD64 specific as soon as possible. Please let me
> know if you want me to run some other tests or give some more info to help
> solve this one.

Can you capture the output from alt-sysrq-T?
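
If there is no keyboard on the box, the same task dump can usually be
requested by writing "t" to /proc/sysrq-trigger as root, with the output
going to the kernel log (and to the serial console, if one is set up). A
minimal Python sketch, assuming the kernel has CONFIG_MAGIC_SYSRQ and
provides /proc/sysrq-trigger; the function name and its path parameter are
made up for illustration.

```python
def trigger_sysrq(key="t", path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file.

    "t" asks the kernel to dump the state of all tasks. Needs root when
    pointed at the real /proc/sysrq-trigger; the path parameter exists
    so the function can be tried against an ordinary file first.
    """
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    with open(path, "w") as f:
        f.write(key)
```

Calling trigger_sysrq() as root while the processes are stuck should leave
the dump in dmesg or on the console. CONFIG_LOG_BUF_SHIFT=18 in the config
above gives a 256KB log buffer, so a dump of many tasks may still wrap;
capturing it over a serial console is safer.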

2005-04-10 02:31:47

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Tuesday 05 April 2005 03:12, Andrew Morton wrote:
> Claudio Martins <[email protected]> wrote:
> > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck
> > in D state after some time.
> > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the
> > other node has no RAM modules plugged in, since this board works only
> > with pairs).
> >
> > I was using stress (http://weather.ou.edu/~apw/projects/stress/) with
> > the following command line:
> >
> > stress -v -c 20 -i 12 -m 10 -d 20
> >
> > This causes a constant load avg. of around 70, makes the machine go
> > into swap a little, and writes up to about 20GB of random data to disk
> > while eating up all CPU. After about half an hour random processes like
> > top, df, etc. get stuck in D state. Half of the 60 or so stress processes
> > are also in D state. The machine keeps being responsive for maybe some 15
> > minutes but then the shells just hang and sshd stops responding to
> > connections, though the machine replies to pings (I don't have console
> > access till tomorrow).
> >
> > The system is using ext3 with md software Raid1.
> >
> > I'm interested in knowing if anyone out there with dual Opterons can
> > reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I
> > will try to find out if this is AMD64 specific as soon as possible.
> > Please let me know if you want me to run some other tests or give some
> > more info to help solve this one.
>
> Can you capture the output from alt-sysrq-T?

Hi Andrew,

Due to other tasks, only now was I able to repeat the tests and capture the
output from alt-sysrq-T. I booted with a serial console, put stress to work,
and when the processes started to hang in D state I managed to capture
the following:

SysRq : Show State

sibling
task PC pid father child younger older
init D ffff81007fcfe0d8 0 1 0 2
(NOTLB)
ffff810003253768 0000000000000082 ffff81007fd19170 0000007d00000000
ffff81007fd19170 ffff810003251470 000000000000271b ffff810074468e70
ffff810003251680 ffffffff8027a79a
Call Trace:<ffffffff8027a79a>{__make_request+1274}
<ffffffff8037ab68>{__down+152}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80158de4>{mempool_alloc+164}
<ffffffff8037c649>{__down_failed+53}
<ffffffff802ed53d>{.text.lock.md+155}
<ffffffff802d8204>{make_request+868}
<ffffffff8015db7d>{cache_alloc_refill+413}
<ffffffff8027abd1>{generic_make_request+545}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8027accf>{submit_bio+223}
<ffffffff8015c39b>{test_set_page_writeback+203}
<ffffffff8016e9d8>{swap_writepage+184}
<ffffffff80161bc6>{shrink_zone+2678}
<ffffffff8037b3e0>{thread_return+0}
<ffffffff8037b438>{thread_return+88}
<ffffffff80162187>{try_to_free_pages+311}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8015a685>{__alloc_pages+533}
<ffffffff8015a88e>{__get_free_pages+14}
<ffffffff8018c72a>{__pollwait+74}
<ffffffff80185c72>{pipe_poll+66} <ffffffff8018caa5>{do_select+725}
<ffffffff8018c6e0>{__pollwait+0} <ffffffff8018ceef>{sys_select+735}
<ffffffff8010db06>{system_call+126}
migration/0 S ffff810002c12720 0 2 1 3
(L-TLB)
ffff81007ff0fea8 0000000000000046 ffff810074806ef0 0000007500000001
ffff81007ff0fe58 ffff8100032506f0 0000000000000129 ffff810075281230
ffff810003250900 ffff810072ffde88
Call Trace:<ffffffff80130a24>{migration_thread+532}
<ffffffff80130810>{migration_thread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

ksoftirqd/0 S 0000000000000000 0 3 1 4 2
(L-TLB)
ffff81007ff11f08 0000000000000046 ffff810072e00430 0000007d00000000
ffff810002c194e0 ffff810003250030 00000000000000d1 ffff810072f3a030
ffff810003250240 0000000000000000
Call Trace:<ffffffff801393e1>{__do_softirq+113}
<ffffffff801399c0>{ksoftirqd+0}
<ffffffff801399c0>{ksoftirqd+0} <ffffffff80139a23>{ksoftirqd+99}
<ffffffff801399c0>{ksoftirqd+0} <ffffffff80149c09>{kthread+217}
<ffffffff8010e6ef>{child_rip+8} <ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
migration/1 S ffff810002c1a720 0 4 1 5 3
(L-TLB)
ffff81007ff15ea8 0000000000000046 ffff810072d1cff0 0000007300000001
ffff810079fe7e98 ffff81007ff134b0 00000000000000a3 ffff810075281230
ffff81007ff136c0 ffff81003381de88
Call Trace:<ffffffff80130a24>{migration_thread+532}
<ffffffff80130810>{migration_thread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

ksoftirqd/1 S 0000000000000001 0 5 1 6 4
(L-TLB)
ffff81007ff19f08 0000000000000046 ffff810075376db0 00000077802b8e7e
ffff810002c114e0 ffff81007ff12df0 00000000000001b4 ffff810074125130
ffff81007ff13000 0000000000000000
Call Trace:<ffffffff801393e1>{__do_softirq+113}
<ffffffff801399c0>{ksoftirqd+0}
<ffffffff801399c0>{ksoftirqd+0} <ffffffff80139a23>{ksoftirqd+99}
<ffffffff801399c0>{ksoftirqd+0} <ffffffff80149c09>{kthread+217}
<ffffffff8010e6ef>{child_rip+8} <ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
events/0 S 0000094f2f7a804e 0 6 1 7 5
(L-TLB)
ffff81007ff3be58 0000000000000046 0000000000000246 ffffffff8013d00d
000000007ffe0c00 ffff81007ff12730 0000000000000c80 ffffffff803f40c0
ffff81007ff12940 0000000000000000
Call Trace:<ffffffff8013d00d>{__mod_timer+317}
<ffffffff8015f470>{cache_reap+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

events/1 S 0000094ef3e03d58 0 7 1 8 6
(L-TLB)
ffff81007ff3de58 0000000000000046 ffff810003250db0 0000000000000246
0000000000000246 ffff81007ff12070 00000000000000a4 ffff810003250db0
ffff81007ff12280 0000000000000000
Call Trace:<ffffffff80252610>{flush_to_ldisc+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80145200>{worker_thread+0} <ffffffff80149c09>{kthread+217}
<ffffffff8010e6ef>{child_rip+8} <ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
khelper S ffff810074815b18 0 8 1 13 7
(L-TLB)
ffff81007ff43e58 0000000000000046 ffff810074815bc8 0000006f00000001
ffff810074815bc8 ffff81007ff414f0 000000000000006c ffff810074292f70
ffff81007ff41700 0000000000000001
Call Trace:<ffffffff80144d50>{__call_usermodehelper+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80145200>{worker_thread+0} <ffffffff80149c09>{kthread+217}
<ffffffff8010e6ef>{child_rip+8}
<ffffffff8011b0b0>{flat_send_IPI_mask+0}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

kthread S ffff81002a48bd18 0 13 1 24 169 8
(L-TLB)
ffff81007ff55e58 0000000000000046 ffffffff8012f4f0 0000006f00000000
0000000000000000 ffff81007ff40e30 00000000000000ac ffff8100745941b0
ffff81007ff41040 0000000000000001
Call Trace:<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

kacpid S 000000000c378373 0 24 13 105
(L-TLB)
ffff81000334be58 0000000000000046 0000000000000000 0000000000000000
ffff810002c114e0 ffff810003349530 0000000000000209 ffff810003250db0
ffff810003349740 0000000000000000
Call Trace:<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
kblockd/0 S ffff81007fd19830 0 105 13 106 24
(L-TLB)
ffff8100033a1e58 0000000000000046 0000000000000001 0000007600000000
ffff810019992230 ffff810003348e70 0000000000000d97 ffff810074125130
ffff810003349080 0000000000000001
Call Trace:<ffffffff80278f30>{blk_unplug_work+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149c09>{kthread+217}
<ffffffff8010e6ef>{child_rip+8}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149b30>{kthread+0} <ffffffff8010e6e7>{child_rip+0}

kblockd/1 S 000009309d720cf6 0 106 13 170 105
(L-TLB)
ffff8100033a3e58 0000000000000046 ffff81007fcf8e00 ffffffff8027f2a6
ffff81007fcf6a00 ffff8100033487b0 0000000000000ae1 ffff810003250db0
ffff8100033489c0 0000000000000000
Call Trace:<ffffffff8027f2a6>{as_move_to_dispatch+342}
<ffffffff80280530>{as_work_handler+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
kswapd0 D ffff81007fcfe0d8 0 169 1 758 13
(L-TLB)
ffff81007fc0d8e8 0000000000000046 ffff8100133b5900 0000007600000001
ffff81007fd19170 ffff81007ff400b0 0000000000003643 ffff810074193170
ffff81007ff402c0 ffffffff8027abd1
Call Trace:<ffffffff8027abd1>{generic_make_request+545}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8037ab68>{__down+152}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80158de4>{mempool_alloc+164}
<ffffffff8037c649>{__down_failed+53}
<ffffffff802ed53d>{.text.lock.md+155}
<ffffffff802d8204>{make_request+868}
<ffffffff8027abd1>{generic_make_request+545}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8027accf>{submit_bio+223}
<ffffffff8015c39b>{test_set_page_writeback+203}
<ffffffff8016e9d8>{swap_writepage+184}
<ffffffff80161bc6>{shrink_zone+2678}
<ffffffff8037b3e0>{thread_return+0}
<ffffffff8037b438>{thread_return+88}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff801624e9>{balance_pgdat+601} <ffffffff801627a7>{kswapd+327}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8012df70>{schedule_tail+64} <ffffffff8010e6ef>{child_rip+8}
<ffffffff8011b0b0>{flat_send_IPI_mask+0} <ffffffff80162660>{kswapd+0}
<ffffffff8010e6e7>{child_rip+0}
aio/0 S ffff81000337d000 0 170 13 171 106
(L-TLB)
ffff81007fc1fe58 0000000000000046 0000000000000000 0000007500000000
0000000000000000 ffff81007fc08eb0 000000000000011f ffff810003251470
ffff81007fc090c0 0000000000000000
Call Trace:<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
aio/1 S 0000000048dcc53f 0 171 13 2425 170
(L-TLB)
ffff81007fc21e58 0000000000000046 0000000000000000 0000000000000000
ffff810002c114e0 ffff81007fc087f0 000000000000011e ffff810003250db0
ffff81007fc08a00 0000000000000000
Call Trace:<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80145331>{worker_thread+305}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80145200>{worker_thread+0}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149c09>{kthread+217} <ffffffff8010e6ef>{child_rip+8}
<ffffffff80149c50>{keventd_create_kthread+0}
<ffffffff80149b30>{kthread+0}
<ffffffff8010e6e7>{child_rip+0}
kseriod S 00000007606f165c 0 758 1 825 169
(L-TLB)
ffff81007fd05eb8 0000000000000046 0000000000000000 ffffffff801b1df9
0000000000000246 ffff81007ff40770 00000000000001f6 ffff810003250db0
ffff81007ff40980 0000000000000000
Call Trace:<ffffffff801b1df9>{sysfs_make_dirent+41}
<ffffffff8027288d>{driver_create_file+61}
<ffffffff80267b21>{serio_thread+689}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8012df70>{schedule_tail+64}
<ffffffff8010e6ef>{child_rip+8} <ffffffff80267870>{serio_thread+0}
<ffffffff8010e6e7>{child_rip+0}
scsi_eh_0 S ffff81007fd59ef8 0 825 1 826 758
(L-TLB)
ffff81007fd59df8 0000000000000046 ffffffff80145f9f 00000075801461ba
ffff81007fc08130 ffff81007fc08130 00000000000003b1 ffff810003251470
ffff81007fc08340 0000000000000202
Call Trace:<ffffffff80145f9f>{attach_pid+47}
<ffffffff8012d5c3>{recalc_task_prio+323}
<ffffffff8037acad>{__down_interruptible+205}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8037c683>{__down_failed_interruptible+53}
<ffffffff802a0be4>{.text.lock.scsi_error+45}
<ffffffff8010e6ef>{child_rip+8}
<ffffffff802a0150>{scsi_error_handler+0}
<ffffffff8010e6e7>{child_rip+0}

ahc_dv_0 S 000000061ef1cc1c 0 826 1 845 825
(L-TLB)
ffff81007fd5de08 0000000000000046 ffff81000327b400 000000867ff0da40
0000000000000000 ffff81007fd5b5b0 000000000000029f ffff81007ff12df0
ffff81007fd5b7c0 ffff81007fc6ac00
Call Trace:<ffffffff80276ab5>{elv_next_request+261}
<ffffffff8037acad>{__down_interruptible+205}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff8037c683>{__down_failed_interruptible+53}
<ffffffff80212e90>{kobject_release+0}
<ffffffff802be9eb>{.text.lock.aic7xxx_osm+85}
<ffffffff8010e6ef>{child_rip+8}
<ffffffff802bd340>{ahc_linux_dv_thread+0}
<ffffffff8010e6e7>{child_rip+0}
md3_raid1 S ffff81007fdb6b00 0 845 1 847 826
(L-TLB)
ffff81007fddfeb8 0000000000000046 ffff810074534ef0 0000007d00000001
0000000002c114e0 ffff81007fd5a170 000000000000009f ffff810074534ef0
ffff81007fd5a380 0000000000000000
Call Trace:<ffffffff802ea015>{md_thread+277}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8012df70>{schedule_tail+64}
<ffffffff802d8920>{raid1d+0} <ffffffff8010e6ef>{child_rip+8}
<ffffffff802d8920>{raid1d+0} <ffffffff802e9f00>{md_thread+0}
<ffffffff8010e6e7>{child_rip+0}
md2_raid1 D ffff81007fcfe0d8 0 847 1 849 845
(L-TLB)
ffff81007fdf1558 0000000000000046 ffff81000b4d9000 0000007d8015d9ad
ffff81007ffef4f8 ffff81007fd5a830 0000000000001a9e ffff810074ffa2f0
ffff81007fd5aa40 ffff81007ffef480
Call Trace:<ffffffff8015db7d>{cache_alloc_refill+413}
<ffffffff8037ab68>{__down+152}
<ffffffff8012f4f0>{default_wake_function+0}
<ffffffff80158de4>{mempool_alloc+164}
<ffffffff8037c649>{__down_failed+53}
<ffffffff802ed53d>{.text.lock.md+155}
<ffffffff802d8204>{make_request+868}
<ffffffff8015db7d>{cache_alloc_refill+413}
<ffffffff8027abd1>{generic_make_request+545}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8027accf>{submit_bio+223}
<ffffffff8015c39b>{test_set_page_writeback+203}
<ffffffff8016e9d8>{swap_writepage+184}
<ffffffff80161bc6>{shrink_zone+2678}
<ffffffff80162187>{try_to_free_pages+311}
<ffffffff8014a230>{autoremove_wake_function+0}
<ffffffff8015a685>{__alloc_pages+533}
<ffffffff80172633>{alloc_page_interleave+67}
<ffffffff8015d74e>{cache_grow+270}
<ffffffff8015db95>{cache_alloc_refill+437}
<ffffffff8015d636>{kmem_cache_alloc+54}
<ffffffff80158e1c>{mempoolNMI Watchdog detected LOCKUP on CPU1CPU 1
Modules linked in: tg3 i2c_amd756 i2c_core ohci_hcd usbcore dm_mod
Pid: 0, comm: swapper Not tainted 2.6.12-rc2
RIP: 0010:[<ffffffff8026cfe7>] <ffffffff8026cfe7>{serial_in+87}
RSP: 0018:ffff81000325faf0 EFLAGS: 00000002
RAX: 00000000ffffff20 RBX: 0000000000000020 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804f5120
RBP: 0000000000002463 R08: 000000000000006c R09: 0000000000000002
R10: 00000000ffffffff R11: 0000000000000000 R12: ffffffff804f5120
R13: ffffffff804acc52 R14: 000000000000001a R15: 0000000000000025
FS: 00002aaaab3a34a0(0000) GS:ffffffff80510bc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaadc55c0 CR3: 0000000073456000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810003256000, task ffff810003250db0)
Stack: ffffffff8026f42d 000000050325fd35 ffffffff8041a720 0000000000008378
0000000000000025 ffffffffffffbeca 0000000000000025 0000000000000046
ffffffff801342ac 000000000000839d
Call Trace:<IRQ> <ffffffff8026f42d>{serial8250_console_write+125}
<ffffffff801342ac>{__call_console_drivers+76}
<ffffffff801345aa>{release_console_sem+330}
<ffffffff801348d0>{vprintk+656}
<ffffffff8026f54f>{serial8250_console_write+415}
<ffffffff80158e1c>{mempool_alloc+220}
<ffffffff8013498d>{printk+141} <ffffffff80158e1c>{mempool_alloc+220}
<ffffffff801348d0>{vprintk+656} <ffffffff801517d8>{kallsyms_lookup+200}
<ffffffff8015d636>{kmem_cache_alloc+54}
<ffffffff80158e1c>{mempool_alloc+220}
<ffffffff8010ed2c>{printk_address+140}
<ffffffff80158e1c>{mempool_alloc+220}
<ffffffff8010ef2a>{show_trace+410} <ffffffff8010f07e>{show_stack+270}
<ffffffff80130732>{show_state+498}
<ffffffff802611b0>{__handle_sysrq+144}
<ffffffff8026d658>{receive_chars+360}
<ffffffff8026d9e7>{serial8250_interrupt+119}
<ffffffff8015461c>{handle_IRQ_event+44}
<ffffffff80154749>{__do_IRQ+249}
<ffffffff80110a52>{do_IRQ+66} <ffffffff8010e0ad>{ret_from_intr+0}
<EOI> <ffffffff8010e1de>{retint_kernel+38}
<ffffffff8010bb90>{default_idle+0}
<ffffffff8010bbb0>{default_idle+32} <ffffffff8010be1a>{cpu_idle+74}
<ffffffff8052291c>{start_secondary+476}

Code: 0f b6 c0 c3 66 66 90 66 90 0f b6 4f 22 0f b6 47 23 41 89 d0
console shuts up ...
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!

------------------------------------

Unfortunately the system oopsed in the middle of dumping the tasks, but from
what I can see I'm tempted to think that this might be related to the MD
code. md2_raid1 is blocked in D state and, although not shown in the dump, I
know from ps that md0_raid1 (the swap partition) was also in D state
(along with top, df, and the stress processes that are responsible for
hogging memory). There were only about 200MB swapped out, while the swap
partition size is 1GB.
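
For anyone sifting through a long alt-sysrq-T dump by hand, the blocked tasks
can be picked out mechanically. What follows is only a rough sketch in C,
assuming the header layout seen in the dump above (task name, one-letter
state, then a 16-digit hex value). The names parse_task_header and
list_blocked_tasks are made up for illustration and are not part of any
kernel tool, and a task name containing spaces would defeat the heuristic.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 and fills name/state if `line` looks like a task header of the
 * form "<name> <state> <16 hex digits> ...", otherwise returns 0.
 * This is a heuristic based on the dump layout, not a documented format.
 */
int parse_task_header(const char *line, char *name, size_t name_len, char *state)
{
    char comm[64], st[8], pc[32];

    if (sscanf(line, "%63s %7s %31s", comm, st, pc) != 3)
        return 0;
    /* The state column is a single letter such as R, S, D, T or Z. */
    if (strlen(st) != 1 || strchr("RSDTZ", st[0]) == NULL)
        return 0;
    /* The third column is a 64-bit value printed as 16 hex digits. */
    if (strlen(pc) != 16)
        return 0;
    for (size_t i = 0; i < 16; i++)
        if (!isxdigit((unsigned char)pc[i]))
            return 0;

    snprintf(name, name_len, "%s", comm);
    *state = st[0];
    return 1;
}

/* Prints the name of every D-state task read from `in`; returns the count. */
int list_blocked_tasks(FILE *in, FILE *out)
{
    char line[512], name[64], state;
    int blocked = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (parse_task_header(line, name, sizeof name, &state) && state == 'D') {
            fprintf(out, "%s\n", name);
            blocked++;
        }
    }
    return blocked;
}
```

Calling list_blocked_tasks(stdin, stdout) from a small main() and feeding it
the captured console log should list init, kswapd0 and md2_raid1 for the dump
above, since those are the headers shown with state D.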

I repeated the test to try to get more output from alt-sysrq-T, but it
oopsed again with even less output.
By the way, I have also tested 2.6.11.6 and I get stuck processes in the
same way. With 2.6.9 I get a hard lockup with no working alt-sysrq, after
about 30 to 60 minutes of stress.

This is with preempt enabled (as well as BKL preempt). I want to test also
without preempt and also without using MD Raid1, but I'll have to reach the
machine and hit the power button, so not possible until tomorrow :-(

The original message in this thread containing the details of the
setup and a .config is at:

http://marc.theaimsgroup.com/?l=linux-kernel&m=111266784320156&w=2

I am happy to test any patches and also wonder if enabling any of the
options in the kernel debugging section could help in trying to find where
the deadlock is.

Thanks

Claudio

2005-04-10 02:48:05

by Andrew Morton

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins <[email protected]> wrote:
>
> I repeated the test to try to get more output from alt-sysreq-T, but it
> oopsed again with even less output.
> By the way, I have also tested 2.6.11.6 and I get stuck processes in the
> same way. With 2.6.9 I get a hard lockup with no working alt-sysrq, after
> about 30 to 60mins of stress.

It could be an md deadlock, or it could be an out-of-memory deadlock: md
trying to allocate memory on the swapout path.

> This is with preempt enabled (as well as BKL preempt). I want to test also
> without preempt and also without using MD Raid1, but I'll have to reach the
> machine and hit the power button, so not possible until tomorrow :-(
>
> The original original message in this thread containing the details of the
> setup and a .config is at:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111266784320156&w=2
>
> I am happy to test any patches and also wonder if enabling any of the
> options in the kernel debugging section could help in trying to find where
> the deadlock is.

Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from
cutting in during long sysrq traces.

Also, capture the `sysrq-m' output so we can see if the thing is out of
memory.
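
On a box where a shell is still responsive, the same dumps can also be
requested without a keyboard or serial break by writing the command letter
to /proc/sysrq-trigger. Below is a minimal sketch in C, assuming the kernel
was built with CONFIG_MAGIC_SYSRQ and the caller is root. The helper names
sysrq_write and dump_tasks_and_memory are hypothetical, and the output still
goes to the kernel log or console, not to the calling program.

```c
#include <fcntl.h>
#include <unistd.h>

/* Writes one SysRq command character to `path`; 0 on success, -1 on error. */
int sysrq_write(const char *path, char cmd)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, &cmd, 1);
    int rc = close(fd);

    return (n == 1 && rc == 0) ? 0 : -1;
}

/*
 * Requests the task dump ('t') followed by the memory summary ('m').
 * `trigger_path` would normally be "/proc/sysrq-trigger"; it is a parameter
 * here only so the helper can be exercised against an ordinary file.
 */
int dump_tasks_and_memory(const char *trigger_path)
{
    if (sysrq_write(trigger_path, 't') != 0)
        return -1;
    return sysrq_write(trigger_path, 'm');
}
```

This does not help once every shell has hung, which is the situation
described in this thread, so the serial console remains the more dependable
route.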

2005-04-10 02:53:23

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2005-03-30 10:39:51.000000000 +1000
+++ linux-2.6/mm/mempool.c 2005-03-30 10:41:29.000000000 +1000
@@ -198,7 +198,10 @@ void * mempool_alloc(mempool_t *pool, in
void *element;
unsigned long flags;
DEFINE_WAIT(wait);
- int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+ int gfp_nowait;
+
+ gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
+ gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

might_sleep_if(gfp_mask & __GFP_WAIT);
repeat_alloc:

Attachments:

mempool-can-fail.patch (605.00 B)
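
To make the effect of the two new assignments concrete, here is a small
user-space sketch of the mask arithmetic. The DEMO_* constants stand in for
the kernel's __GFP_* bits. Their numeric values are assumptions for
illustration only and should be checked against include/linux/gfp.h in the
tree being patched. The struct and function names are likewise made up.

```c
/* Stand-ins for __GFP_WAIT, __GFP_IO, __GFP_FS and __GFP_NORETRY.
 * The bit values are illustrative assumptions, not taken from the patch. */
#define DEMO_GFP_WAIT    0x10u
#define DEMO_GFP_IO      0x40u
#define DEMO_GFP_FS      0x80u
#define DEMO_GFP_NORETRY 0x1000u

struct demo_masks {
    unsigned int gfp_mask;   /* mask used when the caller may sleep */
    unsigned int gfp_nowait; /* mask used for the first, non-sleeping attempt */
};

/* Mirrors the two assignments the patch adds to mempool_alloc(). */
struct demo_masks demo_mempool_masks(unsigned int gfp_mask)
{
    struct demo_masks m;

    gfp_mask |= DEMO_GFP_NORETRY; /* don't loop in __alloc_pages */
    m.gfp_mask = gfp_mask;
    m.gfp_nowait = gfp_mask & ~(DEMO_GFP_WAIT | DEMO_GFP_IO);
    return m;
}
```

The point of ordering the statements this way is that the no-retry bit ends
up in both masks, so neither the first attempt nor any later one can loop
inside the page allocator.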

2005-04-10 03:21:59

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

2005-04-10 03:23:24

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Sunday 10 April 2005 03:53, Nick Piggin wrote:
>
> Looks like you may possibly have a memory allocation deadlock
> (although I can't explain the NMI oops).
>
> I would be interested to see if the following patch is of any
> help to you.
>

Hi Nick,

I'll build a kernel with your patch and report on the results as soon as
possible.

Thanks

Claudio

2005-04-11 02:16:31

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Sunday 10 April 2005 03:47, Andrew Morton wrote:
>
> Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from
> cutting in during long sysrq traces.
>
> Also, capture the `sysrq-m' output so we can see if the thing is out of
> memory.

Hi Andrew,

Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full
sysrq-t as well as a sysrq-m. Since it might be a little too big for the
list, I've put it on a text file at:

http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt

I also made a run with the mempool-can-fail patch from Nick Piggin. With this
I got some nice memory allocation errors from the md threads when the trouble
started. The dump (with sysrq-t and sysrq-m included) is at:

http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt

Let me know if you find it more convenient to send the dumps by mail or
something. Hope this helps.

Thanks,

Claudio

2005-04-11 06:36:57

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

2005-04-11 09:55:55

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins wrote:
> On Sunday 10 April 2005 03:47, Andrew Morton wrote:
>
>>Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from
>>cutting in during long sysrq traces.
>>
>>Also, capture the `sysrq-m' output so we can see if the thing is out of
>>memory.
>
>
> Hi Andrew,
>
> Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full
> sysrq-t as well as a sysrq-m. Since it might be a little too big for the
> list, I've put it on a text file at:
>
> http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt
>

OK, you _may_ be out of memory here (depending on what the lower zone
protection for DMA ends up as), however you are well above all the
"emergency watermarks" in ZONE_NORMAL. Also:

> I also made a run with the mempool-can-fail patch from Nick Piggin. With this
> I got some nice memory allocation errors from the md threads when the trouble
> started. The dump (with sysrq-t and sysrq-m included) is at:
>
> http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt
>

This one shows plenty of memory. The allocation failure messages are
actually a good thing, and show that my patch is sort of working. I
have reworked it a bit so they won't show up though.

So probably not your common or garden memory deadlock.

The common theme seems to be: try_to_free_pages, swap_writepage,
mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect
md/raid1 - maybe some deadlock in an uncommon memory allocation
failure path?
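
A quick way to check which traces in a dump share that theme is to test each
one for the relevant frames. The sketch below is a hypothetical C helper, not
an existing tool. It assumes frames are printed as "{symbol+offset}", as in
the sysrq-T output earlier in this thread, and that the text of one task's
call trace has already been gathered into a single string.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `trace` contains a frame "{sym+...}" for exactly this symbol. */
int trace_has_symbol(const char *trace, const char *sym)
{
    char needle[128];
    int n = snprintf(needle, sizeof needle, "{%s+", sym);

    if (n < 0 || (size_t)n >= sizeof needle)
        return 0;
    return strstr(trace, needle) != NULL;
}

/* Returns 1 only if every frame named in the theme appears in `trace`. */
int trace_matches_theme(const char *trace)
{
    static const char *const theme[] = {
        "try_to_free_pages", "swap_writepage", "mempool_alloc", "__down_failed",
    };

    for (size_t i = 0; i < sizeof theme / sizeof theme[0]; i++)
        if (!trace_has_symbol(trace, theme[i]))
            return 0;
    return 1;
}
```

Note that kswapd reaches swap_writepage through balance_pgdat rather than
try_to_free_pages, so its trace would need a slightly different frame list.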

I'll see if I can reproduce it here.

--
SUSE Labs, Novell Inc.

2005-04-11 12:47:44

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-11 22:18:49.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-11 22:38:10.000000000 +1000
@@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug);
*/
void __generic_unplug_device(request_queue_t *q)
{
- if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))
+ if (unlikely(test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)))
return;

if (!blk_remove_plug(q))
@@ -1828,7 +1828,6 @@ static void __freed_request(request_queu
clear_queue_congested(q, rw);

if (rl->count[rw] + 1 <= q->nr_requests) {
- smp_mb();
if (waitqueue_active(&rl->wait[rw]))
wake_up(&rl->wait[rw]);

@@ -1860,18 +1859,20 @@ static void freed_request(request_queue_

#define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
/*
- * Get a free request, queue_lock must not be held
+ * Get a free request, queue_lock must be held.
+ * Returns NULL on failure, with queue_lock held.
+ * Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
+ int batching;
struct request *rq = NULL;
struct request_list *rl = &q->rq;
- struct io_context *ioc = get_io_context(gfp_mask);
+ struct io_context *ioc = get_io_context(GFP_ATOMIC);

if (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags)))
goto out;

- spin_lock_irq(q->queue_lock);
if (rl->count[rw]+1 >= q->nr_requests) {
/*
* The queue will fill after this allocation, so set it as
@@ -1884,6 +1885,8 @@ static struct request *get_request(reque
blk_set_queue_full(q, rw);
}
}
+
+ batching = ioc_batching(q, ioc);

switch (elv_may_queue(q, rw)) {
case ELV_MQUEUE_NO:
@@ -1894,12 +1897,11 @@ static struct request *get_request(reque
goto get_rq;
}

- if (blk_queue_full(q, rw) && !ioc_batching(q, ioc)) {
+ if (blk_queue_full(q, rw) && !batching) {
/*
* The queue is full and the allocating process is not a
* "batcher", and not exempted by the IO scheduler
*/
- spin_unlock_irq(q->queue_lock);
goto out;
}

@@ -1933,11 +1935,10 @@ rq_starved:
if (unlikely(rl->count[rw] == 0))
rl->starved[rw] = 1;

- spin_unlock_irq(q->queue_lock);
goto out;
}

- if (ioc_batching(q, ioc))
+ if (batching)
ioc->nr_batch_requests--;

rq_init(q, rq);
@@ -1950,13 +1951,14 @@ out:
/*
* No available requests for this queue, unplug the device and wait for some
* requests to become available.
+ *
+ * Called with q->queue_lock held, and returns with it unlocked.
*/
static struct request *get_request_wait(request_queue_t *q, int rw)
{
DEFINE_WAIT(wait);
struct request *rq;

- generic_unplug_device(q);
do {
struct request_list *rl = &q->rq;

@@ -1968,6 +1970,8 @@ static struct request *get_request_wait(
if (!rq) {
struct io_context *ioc;

+ __generic_unplug_device(q);
+ spin_unlock_irq(q->queue_lock);
io_schedule();

/*
@@ -1979,6 +1983,8 @@ static struct request *get_request_wait(
ioc = get_io_context(GFP_NOIO);
ioc_set_batching(q, ioc);
put_io_context(ioc);
+
+ spin_lock_irq(q->queue_lock);
}
finish_wait(&rl->wait[rw], &wait);
} while (!rq);
@@ -1992,10 +1998,15 @@ struct request *blk_get_request(request_

BUG_ON(rw != READ && rw != WRITE);

+ spin_lock_irq(q->queue_lock);
if (gfp_mask & __GFP_WAIT)
rq = get_request_wait(q, rw);
- else
+ else {
rq = get_request(q, rw, gfp_mask);
+ if (!rq)
+ spin_unlock_irq(q->queue_lock);
+ }
+ /* q->queue_lock is unlocked at this point */

return rq;
}
@@ -2558,7 +2569,7 @@ EXPORT_SYMBOL(__blk_attempt_remerge);

static int __make_request(request_queue_t *q, struct bio *bio)
{
- struct request *req, *freereq = NULL;
+ struct request *req;
int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, err;
sector_t sector;

@@ -2578,19 +2589,14 @@ static int __make_request(request_queue_
spin_lock_prefetch(q->queue_lock);

barrier = bio_barrier(bio);
- if (barrier && (q->ordered == QUEUE_ORDERED_NONE)) {
+ if (unlikely(barrier) && (q->ordered == QUEUE_ORDERED_NONE)) {
err = -EOPNOTSUPP;
goto end_io;
}

-again:
spin_lock_irq(q->queue_lock);

- if (elv_queue_empty(q)) {
- blk_plug_device(q);
- goto get_rq;
- }
- if (barrier)
+ if (elv_queue_empty(q) || unlikely(barrier))
goto get_rq;

el_ret = elv_merge(q, &req, bio);
@@ -2633,40 +2639,24 @@ again:
elv_merged_request(q, req);
goto out;

- /*
- * elevator says don't/can't merge. get new request
- */
- case ELEVATOR_NO_MERGE:
- break;
-
+ /* ELV_NO_MERGE: elevator says don't/can't merge. */
default:
- printk("elevator returned crap (%d)\n", el_ret);
- BUG();
+ ;
}

+get_rq:
/*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
+ * Grab a free request. This might sleep but cannot fail.
+ * Returns with the queue unlocked.
+ */
+ req = get_request_wait(q, rw);
+
+ /*
+ * After dropping the lock and possibly sleeping here, our request
+ * may now be mergeable after it had proven unmergeable (above).
+ * We don't worry about that case for efficiency. It won't happen
+ * often, and the elevators are able to handle it.
*/
-get_rq:
- if (freereq) {
- req = freereq;
- freereq = NULL;
- } else {
- spin_unlock_irq(q->queue_lock);
- if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
- /*
- * READA bit set
- */
- err = -EWOULDBLOCK;
- if (bio_rw_ahead(bio))
- goto end_io;
-
- freereq = get_request_wait(q, rw);
- }
- goto again;
- }

req->flags |= REQ_CMD;

@@ -2679,7 +2669,7 @@ get_rq:
/*
* REQ_BARRIER implies no merging, but lets make it explicit
*/
- if (barrier)
+ if (unlikely(barrier))
req->flags |= (REQ_HARDBARRIER | REQ_NOMERGE);

req->errors = 0;
@@ -2694,10 +2684,11 @@ get_rq:
req->rq_disk = bio->bi_bdev->bd_disk;
req->start_time = jiffies;

+ spin_lock_irq(q->queue_lock);
+ if (elv_queue_empty(q))
+ blk_plug_device(q);
add_request(q, req);
out:
- if (freereq)
- __blk_put_request(q, freereq);
if (bio_sync(bio))
__generic_unplug_device(q);

@@ -2803,7 +2794,7 @@ static inline void block_wait_queue_runn
{
DEFINE_WAIT(wait);

- while (test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags)) {
+ while (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags))) {
struct request_list *rl = &q->rq;

prepare_to_wait_exclusive(&rl->drain, &wait,
@@ -2912,7 +2903,7 @@ end_io:
goto end_io;
}

- if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))
+ if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
goto end_io;

block_wait_queue_running(q);
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2005-04-11 22:18:50.000000000 +1000
+++ linux-2.6/drivers/md/dm-crypt.c 2005-04-11 22:38:08.000000000 +1000
@@ -331,25 +331,14 @@ crypt_alloc_buffer(struct crypt_config *
struct bio *bio;
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
int gfp_mask = GFP_NOIO | __GFP_HIGHMEM;
- unsigned long flags = current->flags;
unsigned int i;

- /*
- * Tell VM to act less aggressively and fail earlier.
- * This is not necessary but increases throughput.
- * FIXME: Is this really intelligent?
- */
- current->flags &= ~PF_MEMALLOC;
-
if (base_bio)
bio = bio_clone(base_bio, GFP_NOIO);
else
bio = bio_alloc(GFP_NOIO, nr_iovecs);
- if (!bio) {
- if (flags & PF_MEMALLOC)
- current->flags |= PF_MEMALLOC;
+ if (!bio)
return NULL;
- }

/* if the last bio was not complete, continue where that one ended */
bio->bi_idx = *bio_vec_idx;
@@ -386,9 +375,6 @@ crypt_alloc_buffer(struct crypt_config *
size -= bv->bv_len;
}

- if (flags & PF_MEMALLOC)
- current->flags |= PF_MEMALLOC;
-
if (!bio->bi_size) {
bio_put(bio);
return NULL;
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c 2005-04-11 22:18:50.000000000 +1000
+++ linux-2.6/fs/mpage.c 2005-04-11 22:38:08.000000000 +1000
@@ -105,11 +105,6 @@ mpage_alloc(struct block_device *bdev,

bio = bio_alloc(gfp_flags, nr_vecs);

- if (bio == NULL && (current->flags & PF_MEMALLOC)) {
- while (!bio && (nr_vecs /= 2))
- bio = bio_alloc(gfp_flags, nr_vecs);
- }
-
if (bio) {
bio->bi_bdev = bdev;
bio->bi_sector = first_sector;
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2005-04-11 22:18:51.000000000 +1000
+++ linux-2.6/include/linux/gfp.h 2005-04-11 22:38:07.000000000 +1000
@@ -38,14 +38,16 @@ struct vm_area_struct;
 #define __GFP_NO_GROW 0x2000u /* Slab internal usage */
 #define __GFP_COMP 0x4000u /* Add compound page metadata */
 #define __GFP_ZERO 0x8000u /* Return zeroed page on success */
+#define __GFP_MEMPOOL 0x10000u/* Mempool allocation */

-#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 17 /* Room for 17 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
-			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
-			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
+			__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO| \
+			__GFP_MEMPOOL)

 #define GFP_ATOMIC (__GFP_HIGH)
 #define GFP_NOIO (__GFP_WAIT)
Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2005-04-11 22:18:51.000000000 +1000
+++ linux-2.6/mm/mempool.c 2005-04-11 22:38:08.000000000 +1000
@@ -198,31 +198,21 @@ void * mempool_alloc(mempool_t *pool, un
 	void *element;
 	unsigned long flags;
 	DEFINE_WAIT(wait);
-	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
-
+	int gfp_temp;
+
 	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	gfp_mask |= __GFP_MEMPOOL;
+	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
+
+	gfp_temp = gfp_mask & ~__GFP_WAIT;
+
 repeat_alloc:
-	element = pool->alloc(gfp_nowait|__GFP_NOWARN, pool->pool_data);
+
+	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
 		return element;

-	/*
-	 * If the pool is less than 50% full and we can perform effective
-	 * page reclaim then try harder to allocate an element.
-	 */
-	mb();
-	if ((gfp_mask & __GFP_FS) && (gfp_mask != gfp_nowait) &&
-				(pool->curr_nr <= pool->min_nr/2)) {
-		element = pool->alloc(gfp_mask, pool->pool_data);
-		if (likely(element != NULL))
-			return element;
-	}
-
-	/*
-	 * Kick the VM at this point.
-	 */
-	wakeup_bdflush(0);
-
 	spin_lock_irqsave(&pool->lock, flags);
 	if (likely(pool->curr_nr)) {
 		element = remove_element(pool);
@@ -235,6 +225,8 @@ repeat_alloc:
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;

+	/* Now start performing page reclaim */
+	gfp_temp = gfp_mask;
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
 	mb();
 	if (!pool->curr_nr)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2005-04-11 22:18:51.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2005-04-11 22:38:07.000000000 +1000
@@ -799,14 +799,18 @@ __alloc_pages(unsigned int __nocast gfp_
 	}

 	/* This allocation should allow future memory freeing. */
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) {
-		/* go through the zonelist yet again, ignoring mins */
-		for (i = 0; (z = zones[i]) != NULL; i++) {
-			if (!cpuset_zone_allowed(z))
-				continue;
-			page = buffered_rmqueue(z, order, gfp_mask);
-			if (page)
-				goto got_pg;
+
+	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) {
+		if (!(gfp_mask & __GFP_MEMPOOL)) {
+			/* go through the zonelist yet again, ignoring mins */
+			for (i = 0; (z = zones[i]) != NULL; i++) {
+				if (!cpuset_zone_allowed(z))
+					continue;
+				page = buffered_rmqueue(z, order, gfp_mask);
+				if (page)
+					goto got_pg;
+			}
 		}
 		goto nopage;
 	}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c 2005-04-11 22:18:50.000000000 +1000
+++ linux-2.6/mm/swap_state.c 2005-04-11 22:38:08.000000000 +1000
@@ -143,7 +143,6 @@ void __delete_from_swap_cache(struct pag
 int add_to_swap(struct page * page)
 {
 	swp_entry_t entry;
-	int pf_flags;
 	int err;

 	if (!PageLocked(page))
@@ -154,30 +153,11 @@ int add_to_swap(struct page * page)
 		if (!entry.val)
 			return 0;

-		/* Radix-tree node allocations are performing
-		 * GFP_ATOMIC allocations under PF_MEMALLOC.
-		 * They can completely exhaust the page allocator.
-		 *
-		 * So PF_MEMALLOC is dropped here. This causes the slab
-		 * allocations to fail earlier, so radix-tree nodes will
-		 * then be allocated from the mempool reserves.
-		 *
-		 * We're still using __GFP_HIGH for radix-tree node
-		 * allocations, so some of the emergency pools are available,
-		 * just not all of them.
-		 */
-
-		pf_flags = current->flags;
-		current->flags &= ~PF_MEMALLOC;
-
 		/*
 		 * Add it to the swap cache and mark it dirty
 		 */
 		err = __add_to_swap_cache(page, entry, GFP_ATOMIC|__GFP_NOWARN);

-		if (pf_flags & PF_MEMALLOC)
-			current->flags |= PF_MEMALLOC;
-
 		switch (err) {
 		case 0: /* Success */
 			SetPageUptodate(page);

Attachments:

diff.patch (13.40 kB)
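
To make the reworked mempool_alloc() in the patch above easier to follow, here is a minimal userspace sketch of the ordering it appears to implement: try the backing allocator without __GFP_WAIT first, then fall back to the pool's reserved elements, and only allow blocking reclaim on later passes. All `toy_*` names, the fake backing allocator and the fixed-size reserve are hypothetical stand-ins; where the kernel would sleep on pool->wait, this sketch simply gives up.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_GFP_WAIT    0x10u     /* caller may sleep and reclaim */
#define TOY_GFP_MEMPOOL 0x10000u  /* value the patch assigns to __GFP_MEMPOOL */
#define TOY_SLOTS       4

struct toy_pool {
    int curr_nr;             /* reserved elements still in the pool */
    int backing_budget;      /* blocking allocations that will still succeed */
    int reserve[TOY_SLOTS];  /* storage for the reserved elements */
    int backing[TOY_SLOTS];  /* storage handed out by the fake allocator */
};

/* Hypothetical backing allocator: only succeeds when allowed to wait. */
void *toy_backing_alloc(struct toy_pool *p, unsigned int gfp)
{
    if (!(gfp & TOY_GFP_WAIT) || p->backing_budget == 0)
        return NULL;
    return &p->backing[--p->backing_budget];
}

void *toy_mempool_alloc(struct toy_pool *p, unsigned int gfp_mask)
{
    unsigned int gfp_temp;
    void *element;

    assert(p->curr_nr <= TOY_SLOTS && p->backing_budget <= TOY_SLOTS);

    gfp_mask |= TOY_GFP_MEMPOOL;
    gfp_temp = gfp_mask & ~TOY_GFP_WAIT;   /* first pass: no reclaim */

    for (;;) {
        element = toy_backing_alloc(p, gfp_temp);
        if (element != NULL)
            return element;
        if (p->curr_nr > 0)                /* use the reserve before blocking */
            return &p->reserve[--p->curr_nr];
        if (!(gfp_mask & TOY_GFP_WAIT))
            return NULL;
        if (gfp_temp == gfp_mask)
            return NULL;                   /* toy only: the kernel would sleep here */
        gfp_temp = gfp_mask;               /* now start performing reclaim */
    }
}
```

The point of the ordering is that a caller only recurses into page reclaim once both the cheap allocation and the reserve have failed.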

2005-04-11 16:47:03

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Monday 11 April 2005 13:45, Nick Piggin wrote:
>
> No luck yet (on SMP i386). How many disks are you using in each
> raid1 array? You are using one array for swap, and one mounted as
> ext3 for the working area of the `stress` program, right?
>

Right. I'm using two Seagate ATA133 disks (the IDE controller is an AMD-8111), each
with 4 partitions, so I get 4 md RAID1 devices. The first one, md0, is for
swap. The rest are:

~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 4.6G 1.9G 2.6G 42% /
tmpfs 1005M 0 1005M 0% /dev/shm
/dev/md3 32G 107M 30G 1% /home
/dev/md2 31G 149M 29G 1% /var

In these tests, /home on md3 is the working area for stress.

The I/O scheduler in use is the anticipatory one.

> Neil, have you had a look at the traces? Do they mean much to you?
>
> Claudio - I have attached another patch you could try. It has a more
> complete set of mempool and related memory allocation fixes, as well
> as some other recent patches I had which reduces atomic memory usage
> by the block layer. Could you try if you get time? Thanks.

OK, I'll try them in a few minutes and report back.

I'm curious as to whether increasing the vm.min_free_kbytes sysctl value would
help in this case. But I guess it wouldn't, since there is already some
free memory and the alloc failures are order 0, right?

Thanks

Claudio

2005-04-11 22:59:23

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins wrote:

> Right. I'm using two Seagate ATA133 disks (ide controler is AMD-8111) each
> with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for
> swap. The rest are
>
> ~$ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/md1 4.6G 1.9G 2.6G 42% /
> tmpfs 1005M 0 1005M 0% /dev/shm
> /dev/md3 32G 107M 30G 1% /home
> /dev/md2 31G 149M 29G 1% /var
>
> In these tests, /home on md3 is the working area for stress.
>
> The io scheduler used is the anticipatory.
>

OK.

>
> OK, I'll try them in a few minutes and report back.
>

I'm not overly hopeful. If they fix the problem, then it's likely
that the real bug is hidden.

> I'm curious as whether increasing the vm.min_free_kbytes sysctl value would
> help or not in this case. But I guess it wouldn't since there is already some
> free memory and also the alloc failures are order 0, right?
>

Yes. And the failures you were seeing with my first patch were coming
from the mempool code anyway. We want those to fail early so they don't
eat into the min_free_kbytes memory.

You could try raising min_free_kbytes though. If that fixes it, then it
indicates there might be some problem in a memory allocation failure
path in software raid somewhere.

Thanks

--
SUSE Labs, Novell Inc.
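
For anyone wanting to follow the suggestion about min_free_kbytes, the current value can be read from /proc/sys/vm/min_free_kbytes, which is the procfs view of the vm.min_free_kbytes sysctl. The snippet below is a minimal sketch; the function names are made up for illustration, and raising the value is left to the usual root-only route (for example `sysctl -w vm.min_free_kbytes=<value>`).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the decimal kB value as it appears in the procfs file.
 * Returns 0 on success, -1 on malformed or negative input. */
int parse_kbytes(const char *text, long *out)
{
    char *end;
    long v;

    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || end == text || v < 0)
        return -1;
    *out = v;
    return 0;
}

/* Read the current watermark base from procfs. Returns 0 on success. */
int read_min_free_kbytes(long *out)
{
    char buf[64];
    int ret = -1;
    FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        ret = parse_kbytes(buf, out);
    fclose(f);
    return ret;
}
```

Comparing the value before and after a change is a quick way to confirm that the new setting took effect before rerunning stress.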

2005-04-11 23:47:04

by NeilBrown

Subject: Re: Processes stuck on D state on Dual Opteron

On Monday April 11, [email protected] wrote:
>
> Neil, have you had a look at the traces? Do they mean much to you?
>

Just looked.
bio_alloc_bioset seems implicated, as does sync_page_io.

sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe
changed it to use bio_alloc (don't know why..) and I should have
checked the change better.

sync_page_io can be called on the write out path, so it should use
GFP_NOIO rather than GFP_KERNEL.

See if this helps.... Actually this patch is against 2.6.12-rc2-mm1
which uses md_super_write instead of sync_page_io (which is now only
used for read). So if you are using a non-mm kernel (which seems to
be the case) you'll need to apply the patch by hand.

Thanks,
NeilBrown

---
Avoid deadlock in sync_page_io by using GFP_NOIO

..as sync_page_io can be called on the write-out path.
Ditto for md_super_write

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./drivers/md/md.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2005-04-08 11:25:26.000000000 +1000
+++ ./drivers/md/md.c 2005-04-12 09:42:29.000000000 +1000
@@ -351,7 +351,7 @@ void md_super_write(mddev_t *mddev, mdk_
 	 * if zero is reached.
 	 * If an error occurred, call md_error
 	 */
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);

 	bio->bi_bdev = rdev->bdev;
 	bio->bi_sector = sector;
@@ -374,7 +374,7 @@ static int bi_complete(struct bio *bio,
 int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;
int ret;

2005-04-12 00:28:47

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Monday 11 April 2005 23:59, Nick Piggin wrote:
>
> > OK, I'll try them in a few minutes and report back.
>
> I'm not overly hopeful. If they fix the problem, then it's likely
> that the real bug is hidden.
>

Well, the thing is, they do fix the problem. Or at least they hide it very
well ;-)

It has been running for more than 5 hours now with stress with no problems
and no stuck processes.

I think I'm going to give Neil's patch a try, but I'll have to apply some
patches from -mm.

Thanks

Claudio

2005-04-12 00:31:17

by Cláudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Tuesday 12 April 2005 00:46, Neil Brown wrote:
> On Monday April 11, [email protected] wrote:
> > Neil, have you had a look at the traces? Do they mean much to you?
>
> Just looked.
> bio_alloc_bioset seems implicated, as does sync_page_io.
>
> sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe
> change it to use bio_alloc (don't know why..) and I should have
> checked the change better.
>
> sync_page_io can be called on the write out path, so it should use
> GFP_NOIO rather than GFP_KERNEL.
>
> See if this helps.... Actually this patch is against 2.6.12-rc2-mm1
> which uses md_super_write instead of sync_page_io (which is now only
> used for read). So if you are using a non-mm kernel (which seems to
> be the case) you'll need to apply the patch by hand.
>

Hi Neil,

I'll test this patch, but I'm wondering if I have to apply all the
md-related patches from broken out directory of 2.6.12-rc2-mm1 or only some
specific ones?
Anyway I'm happy to test all those md updates, if you think they might
help.

Thanks

Claudio

2005-04-12 00:47:12

by Andrew Morton

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins <[email protected]> wrote:
>
> I think I'm going to give a try to Neil's patch, but I'll have to apply some
> patches from -mm.

Just this one if you're using 2.6.12-rc2:

--- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005
+++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005
@@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;

_

2005-04-12 01:20:00

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote:
> On Monday 11 April 2005 23:59, Nick Piggin wrote:
> >
> > > OK, I'll try them in a few minutes and report back.
> >
> > I'm not overly hopeful. If they fix the problem, then it's likely
> > that the real bug is hidden.
> >
>
> Well, the thing is, they do fix the problem. Or at least they hide it very
> well ;-)
>
> It has been running for more than 5 hours now with stress with no problems
> and no stuck processes.
>

Well, that is good... I guess ;)

Actually the patches I have sent you do fix real bugs, but they also
make the block layer less likely to recurse into page reclaim, so it
may be eg. hiding the problem that Neil's patch fixes.

It may be that your fundamental problem is solved by my patches, but
we need to be sure.

> I think I'm going to give a try to Neil's patch, but I'll have to apply some
> patches from -mm.
>

Yep that would be good. Please test -rc2 with Andrew's patch, and
obviously my patches backed out. Thanks for sticking with it.

Nick

2005-04-12 07:17:36

by Jens Axboe

Subject: Re: Processes stuck on D state on Dual Opteron

On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be eg. hiding the problem that Neil's patch fixes.

Can you push those to Andrew? I'm quite happy with the way they turned
out. It would be nice if Ken would bench 2.6.12-rc2 with and without
those patches.

--
Jens Axboe

2005-04-12 08:04:56

by Chen, Kenneth W

Subject: RE: Processes stuck on D state on Dual Opteron

On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be eg. hiding the problem that Neil's patch fixes.

Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
> Can you push those to Andrew? I'm quite happy with the way they turned
> out. It would be nice if Ken would bench 2.6.12-rc2 with and without
> those patches.

I like the patch a lot and already benchmarked it on our db setup. However,
I'm seeing a regression compared to a very very crappy patch (see
attached, you can laugh at me for doing things like that :-).

My first reaction is that the overhead is in wait queue setup and teardown
in the get_request_wait function. Throwing the following patch on top does improve
things a bit, but we are still in negative territory. I can't explain why.
Everything is supposed to be faster. So I'm staring at the execution profile at
the moment.

diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00
+++ b/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00
@@ -1740,10 +1740,35 @@
  */
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
-	DEFINE_WAIT(wait);
 	struct request *rq;
+	struct request_list *rl = &q->rq;
+	int gfp_flag = GFP_ATOMIC;
+
+	if (rl->count[rw] < queue_congestion_off_threshold(q)) {
+		rq = kmem_cache_alloc(request_cachep, gfp_flag);
+		if (rq) {
+			if (!elv_set_request(q, rq, gfp_flag)) {
+
+				rl->count[rw]++;
+				INIT_LIST_HEAD(&rq->queuelist);
+				rq->flags = rw;
+				rq->rq_status = RQ_ACTIVE;
+				rq->ref_count = 1;
+				rq->q = q;
+				rq->rl = rl;
+				rq->special = NULL;
+				rq->data_len = 0;
+				rq->data = NULL;
+				rq->sense = NULL;
+
+				return rq;
+			}
+			kmem_cache_free(request_cachep, rq);
+		}
+	}

 	do {
+		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;

 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,

begin 666 old_freereq.patch
M9&EF9B M3G)U(&$O9')I=F5R<R]B;&]C:R]L;%]R=U]B;&LN8R!B+V1R:79E
M<G,O8FQO8VLO;&Q?<G=?8FQK+F,*+2TM(&$O9')I=F5R<R]B;&]C:R]L;%]R
M=U]B;&LN8PDR,# U+3 T+3 T(# P.C4X.C4U("TP-SHP, HK*RL@8B]D<FEV
M97)S+V)L;V-K+VQL7W)W7V)L:RYC"3(P,#4M,#0M,#0@,# Z-3@Z-34@+3 W
M.C P"D! ("TQ.3DV+#$U("LQ.3DV+#$T($! "B *( ER82 ](&)I;RT^8FE?
M<G<@)B H,2 \/"!"24]?4E=?04A%040I.PH@"BL)+RH@1W)A8B!A(&9R964@
M<F5Q=65S="!F<F]M('1H92!F<F5E;&ES=" J+PHK"69R965R97$@/2!G971?
M<F5Q=65S="AQ+"!R=RP@1T907T%43TU)0RD["BL*(&%G86EN.@H@"7-P:6Y?
M;&]C:U]I<G$H<2T^<75E=65?;&]C:RD["B *+0EI9B H96QV7W%U975E7V5M
M<'1Y*'$I*2!["BL):68@*&5L=E]Q=65U95]E;7!T>2AQ*2D*( D)8FQK7W!L
M=6=?9&5V:6-E*'$I.PHM"0EG;W1O(&=E=%]R<3L*+0E]"BT):68@*&)A<G)I
M97(I"BT)"6=O=&\@9V5T7W)Q.PH@"B )96Q?<F5T(#T@96QV7VUE<F=E*'$L
A("9R97$L(&)I;RD["B )<W=I=&-H("AE;%]R970I('L*
`
end
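
The idea behind the inline patch above, taking an uncontended fast path before paying for wait-queue setup, can be sketched in userspace as follows. This is only an illustration under stated assumptions: `toy_queue`, its fields and both functions are hypothetical, the threshold stands in for queue_congestion_off_threshold(), and the slow path merely counts how often it would have prepared a wait entry instead of actually sleeping.

```c
#include <assert.h>
#include <stddef.h>

struct toy_queue {
    int count;        /* requests currently handed out */
    int threshold;    /* stand-in for queue_congestion_off_threshold() */
    int limit;        /* hard limit on outstanding requests */
    int wait_setups;  /* times the slow path prepared a wait entry */
};

/* Fast path: only succeeds while the queue is clearly uncongested. */
int toy_try_get_request(struct toy_queue *q)
{
    assert(q != NULL);
    if (q->count >= q->threshold)
        return 0;
    q->count++;
    return 1;
}

/* Returns 1 when a request slot was obtained, 0 when the caller would
 * have to sleep (the real get_request_wait() loops until it succeeds). */
int toy_get_request_wait(struct toy_queue *q)
{
    if (toy_try_get_request(q))
        return 1;           /* no wait-queue bookkeeping at all */

    q->wait_setups++;       /* real code: DEFINE_WAIT + prepare_to_wait_exclusive */
    if (q->count < q->limit) {
        q->count++;
        return 1;
    }
    return 0;
}
```

As long as the queue stays below the threshold, the wait-setup counter never moves, which is the saving the benchmark was trying to measure.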

2005-04-12 11:51:21

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Attachments:

blk-efficient2.patch (770.00 B)

2005-04-12 12:07:45

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Nick Piggin wrote:
> Nick Piggin wrote:
>
>> Chen, Kenneth W wrote:
>
>
>>> I like the patch a lot and already did bench it on our db setup.
>>> However,
>>> I'm seeing a negative regression compare to a very very crappy patch
>>> (see
>>> attached, you can laugh at me for doing things like that :-).
>>>
>>
>> OK - if we go that way, perhaps the following patch may be the
>> way to do it.
>>
>
> Here.
>

Actually yes this is good I think.

What I was worried about is that you could lose some fairness due
to not being put on the queue before allocation.

This is probably a silly thing to worry about, because up until
that point things aren't really deterministic anyway (and before this
patchset it would try doing a GFP_ATOMIC allocation first anyway).

However after the subsequent locking rework, both these get_request()
calls will be performed under the same lock - giving you the same
fairness. So it is nothing to worry about anyway!

It is a bit subtle: get_request may only drop the lock and return NULL
(after retaking the lock), if we fail on a memory allocation. If we
just fail due to unavailable queue slots, then the lock is never
dropped. And the mem allocation can't fail because it is a mempool
alloc with GFP_NOIO.

Nick

--
SUSE Labs, Novell Inc.

2005-04-12 14:21:37

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Chen, Kenneth W wrote:
> On Tue, Apr 12 2005, Nick Piggin wrote:
>
>>Actually the patches I have sent you do fix real bugs, but they also
>>make the block layer less likely to recurse into page reclaim, so it
>>may be eg. hiding the problem that Neil's patch fixes.
>
>
> Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
>
>>Can you push those to Andrew? I'm quite happy with the way they turned
>>out. It would be nice if Ken would bench 2.6.12-rc2 with and without
>>those patches.
>
>
>
> I like the patch a lot and already did bench it on our db setup. However,
> I'm seeing a negative regression compare to a very very crappy patch (see
> attached, you can laugh at me for doing things like that :-).
>

OK - if we go that way, perhaps the following patch may be the
way to do it.

> My first reaction is that the overhead is in wait queue setup and tear down
> in get_request_wait function. Throwing the following patch on top does improve
> things a bit, but we are still in the negative territory. I can't explain why.
> Everything suppose to be faster. So I'm staring at the execution profile at
> the moment.
>

Hmm, that's a bit disappointing. Like you said though, I'm sure we
should be able to get better performance out of this.

I'll look at it and see if we can rework it.

--
SUSE Labs, Novell Inc.

2005-04-12 17:10:00

by Thomas Davis

Subject: Re: Processes stuck on D state on Dual Opteron

Nick Piggin wrote:
>
> It is a bit subtle: get_request may only drop the lock and return NULL
> (after retaking the lock), if we fail on a memory allocation. If we
> just fail due to unavailable queue slots, then the lock is never
> dropped. And the mem allocation can't fail because it is a mempool
> alloc with GFP_NOIO.
>

I'm jumping in here because we have seen this problem on an x86-64 system with 4GB of RAM running SLES9 (2.6.5-7.141).

You can drive the node into this state:

Mem-info:
Node 1 DMA per-cpu: empty
Node 1 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 1 HighMem per-cpu: empty
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty

Free pages: 10360kB (0kB HighMem)
Active:485853 inactive:421820 dirty:0 writeback:0 unstable:0 free:2590 slab:10816 mapped:903444 pagetables:2097
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 1664 1664
Node 1 Normal free:2464kB min:2468kB low:4936kB high:7404kB active:918440kB inactive:710360kB present:1703936kB
lowmem_reserve[]: 0 0 0
Node 1 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 0 DMA free:4928kB min:20kB low:40kB high:60kB active:0kB inactive:0kB present:16384kB
lowmem_reserve[]: 0 2031 2031
Node 0 Normal free:2968kB min:3016kB low:6032kB high:9048kB active:1024968kB inactive:976924kB present:2080764kB
lowmem_reserve[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 1 DMA: empty
Node 1 Normal: 46*4kB 19*8kB 9*16kB 4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2464kB
Node 1 HighMem: empty
Node 0 DMA: 4*4kB 4*8kB 1*16kB 2*32kB 3*64kB 4*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4928kB
Node 0 Normal: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 2968kB
Node 0 HighMem: empty
Swap cache: add 1009224, delete 106245, find 179674/181478, race 0+2
Free swap: 4739812kB
950271 pages of RAM
17513 reserved pages
2788 pages shared
902980 pages swap cached

with processes doing this:

SysRq : Show State

sibling
task PC pid father child younger older
init D 000001000000e810 0 1 0 2 (NOTLB)
000001007ff81be8 0000000000000006 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000010002c1d6e0
Call Trace:<ffffffff8017338b>{try_to_free_pages+283} <ffffffff80147d0d>{schedule_timeout+173}
<ffffffff80147c50>{process_timeout+0} <ffffffff8013a292>{io_schedule_timeout+82}
<ffffffff80280efd>{blk_congestion_wait+141} <ffffffff8013c530>{autoremove_wake_function+0}
<ffffffff8013c530>{autoremove_wake_function+0} <ffffffff8016ab68>{__alloc_pages+776}
<ffffffff8018573f>{read_swap_cache_async+63} <ffffffff801781b1>{swapin_readahead+97}
<ffffffff8017834e>{do_swap_page+142} <ffffffff801796a1>{handle_mm_fault+337}
<ffffffff80123ebb>{do_page_fault+411} <ffffffff801a3259>{sys_select+1097}
<ffffffff801a332f>{sys_select+1311} <ffffffff801122a9>{error_exit+0}

mg.C.2 D 000001000000e810 0 1971 1955 1972 (NOTLB)
00000100e236bc68 0000000000000006 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000100000000 00000100816ed360
Call Trace:<ffffffff8017338b>{try_to_free_pages+283} <ffffffff80147d0d>{schedule_timeout+173}
<ffffffff80147c50>{process_timeout+0} <ffffffff8013a292>{io_schedule_timeout+82}
<ffffffff80280efd>{blk_congestion_wait+141} <ffffffff8013c530>{autoremove_wake_function+0}
<ffffffff8013c530>{autoremove_wake_function+0} <ffffffff8016ab68>{__alloc_pages+776}
<ffffffff801778ad>{do_wp_page+285} <ffffffff801796c5>{handle_mm_fault+373}
<ffffffff80123ebb>{do_page_fault+411} <ffffffff801122a9>{error_exit+0}
mg.C.2 S 000001007b0a06a0 0 1972 1971 1974 (NOTLB)
00000100bc1c1ca0 0000000000000006 0000000000000010 0000000000010246
000000000004c7c0 00000100816ec280 0000007680000780 0000010081f23390
0000000180000780 00000100816ed360
Call Trace:<ffffffff8016abb4>{__alloc_pages+852} <ffffffff80110ac8>{__down_interruptible+216}
<ffffffff80139280>{default_wake_function+0} <ffffffff8013531c>{recalc_task_prio+940}
<ffffffff80230d91>{__down_failed_interruptible+53}
<ffffffffa01cc47e>{:mosal:.text.lock.mosal_sync+5}
<ffffffffa0291daf>{:mod_vipkl:VIPKL_EQ_poll+607} <ffffffffa029bb01>{:mod_vipkl:VIPKL_EQ_poll_stat+529}
<ffffffffa029e658>{:mod_vipkl:VIPKL_ioctl+5144} <ffffffffa0294e21>{:mod_vipkl:vipkl_wrap_kernel_ioctl+417}
<ffffffff8018c00e>{filp_close+126} <ffffffff801a1fb4>{sys_ioctl+612}
<ffffffff801118d4>{system_call+124}
mg.C.2 S 000001007b0a18c0 0 1974 1971 1972 (NOTLB)
00000100a3955ca0 0000000000000006 00000001e7d422e8 000001002c9ca550
000000000005f138 00000100816ec280 0000007680000780 0000010081f23390
0000000180000780 00000100816ed360
Call Trace:<ffffffff8016abb4>{__alloc_pages+852} <ffffffff80110ac8>{__down_interruptible+216}
<ffffffff80139280>{default_wake_function+0} <ffffffff8013531c>{recalc_task_prio+940}
<ffffffff80230d91>{__down_failed_interruptible+53}
<ffffffffa01cc47e>{:mosal:.text.lock.mosal_sync+5}
<ffffffffa0291daf>{:mod_vipkl:VIPKL_EQ_poll+607} <ffffffff8011db9d>{smp_send_reschedule+29}
<ffffffffa029bb01>{:mod_vipkl:VIPKL_EQ_poll_stat+529}
<ffffffffa029e658>{:mod_vipkl:VIPKL_ioctl+5144} <ffffffffa0294e21>{:mod_vipkl:vipkl_wrap_kernel_ioctl+417}
<ffffffff8018c00e>{filp_close+126} <ffffffff801a1fb4>{sys_ioctl+612}
<ffffffff801118d4>{system_call+124}

and it will never, ever recover from it.

Note - this is a cluster of AMD x86_64 machines running IB, each with 4GB of RAM. We have limited the amount of memory that IB can pin down, and limited process size to 1.5GB (on a 4GB machine!) just to maintain stability.

We do not use md; it's a compute node with only a single local drive.

We have been told that the 2.6 memory allocator goes into an infinite loop and never recovers from it.

thomas
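
The Mem-info dump in the message above can be cross-checked with a little arithmetic: each "N*SIZEkB" list is one buddy order, and the per-zone sums should match the reported free figures, which in turn sit just under the min watermarks for both Normal zones. A minimal sketch follows, assuming 4kB pages and 11 orders as printed in the dump; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER 11  /* orders 0..10, i.e. 4kB .. 4096kB blocks */

/* Sum one zone's buddy free lists; counts[order] is the number of free
 * blocks of 4kB << order. Returns the total in kB. */
unsigned long toy_zone_free_kb(const unsigned int counts[TOY_MAX_ORDER])
{
    unsigned long kb = 0;
    size_t order;

    assert(counts != NULL);
    for (order = 0; order < TOY_MAX_ORDER; order++)
        kb += (unsigned long)counts[order] * (4ul << order);
    return kb;
}

/* A zone below its min watermark can only serve allocations that are
 * allowed to dip into the reserves. */
int toy_below_min(unsigned long free_kb, unsigned long min_kb)
{
    return free_kb < min_kb;
}
```

Fed with the counts from the dump, this gives 2464kB for Node 1 Normal (min 2468kB) and 2968kB for Node 0 Normal (min 3016kB); together with the 4928kB in Node 0 DMA that adds up to the reported 10360kB of free pages.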

2005-04-12 20:31:49

by Chen, Kenneth W

Subject: RE: Processes stuck on D state on Dual Opteron

Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
> Chen, Kenneth W wrote:
> > I like the patch a lot and already did bench it on our db setup. However,
> > I'm seeing a negative regression compare to a very very crappy patch (see
> > attached, you can laugh at me for doing things like that :-).
>
> OK - if we go that way, perhaps the following patch may be the
> way to do it.

OK, if you are going to do it that way, then the ioc_batching code in get_request
has to be reworked. We never push the queue so hard that it kicks itself into
batching mode. However, the calls to get_io_context and put_io_context are unconditional
in that function. The execution profile shows that these two little functions actually
consume a lot of CPU cycles.

AFAICS, ioc_*batching() is trying to push more requests onto a queue that is full
(or nearly full) and give high priority to the process that hits the last req slot.
Why do we need to go all the way to tsk->io_context to keep track of that state?
As a cleanup bonus, I think the tracking can be moved into the queue structure.

> > My first reaction is that the overhead is in wait queue setup and tear down
> > in get_request_wait function. Throwing the following patch on top does improve
> > things a bit, but we are still in the negative territory. I can't explain why.
> > Everything suppose to be faster. So I'm staring at the execution profile at
> > the moment.
> >
>
> Hmm, that's a bit disappointing. Like you said though, I'm sure we
> should be able to get better performance out of this.

Absolutely. I'm disappointed too; this is totally unexpected. There
must be some other factors.

2005-04-13 00:34:34

by Claudio Martins

Subject: Re: Processes stuck on D state on Dual Opteron

On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
> Claudio Martins <[email protected]> wrote:
> > I think I'm going to give a try to Neil's patch, but I'll have to apply
> > some patches from -mm.
>
> Just this one if you're using 2.6.12-rc2:
>
> --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon
> Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005
> @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
> static int sync_page_io(struct block_device *bdev, sector_t sector, int
> size, struct page *page, int rw)
> {
> - struct bio *bio = bio_alloc(GFP_KERNEL, 1);
> + struct bio *bio = bio_alloc(GFP_NOIO, 1);
> struct completion event;
> int ret;
>
> _

Hi Andrew, all,

Sorry for the delay in reporting. This patch does indeed fix the problem.
The machine ran stress for almost 15h straight with no problems at all.

As for Nick's patch, I too think it would be nice to have it included (once the
performance problems are sorted out), since it seemed to make the block layer
more robust and well behaved (at least with stress), although I didn't run
performance tests to measure regressions.

Thanks Nick, Neil, Andrew and all the others for your great help with this
issue. I'll have to put the machine into production now with the patch applied,
but let me know if I can be of any further help with these issues.

Thanks

Claudio

2005-04-13 01:49:54

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
>
>>Chen, Kenneth W wrote:
>>
>>>I like the patch a lot and already did bench it on our db setup. However,
>>>I'm seeing a negative regression compare to a very very crappy patch (see
>>>attached, you can laugh at me for doing things like that :-).
>>
>>OK - if we go that way, perhaps the following patch may be the
>>way to do it.
>
>
> OK, if you are going to do it that way, then the ioc_batching code in get_request
> has to be reworked. We never push the queue so hard that it kicks itself into the
> batching mode. However, calls to get_io_context and put_io_context are unconditional
> in that function. Execution profile shows that these two little functions actually
> consumed lots of cpu cycles.
>
> AFAICS, ioc_*batching() is trying to push more requests onto the queue that is full
> (or near full) and give high priority to the process that hits the last req slot.
> Why do we need to go all the way to tsk->io_context to keep track of that state?
> For a clean up bonus, I think the tracking can be moved into the queue structure.
>

OK - well, it is no different from what you had before these patches, so
any future work would probably be in separate patches.

get_io_context can probably be reworked. For example, it is only called
with the current thread, so it probably doesn't need to increment the
refcount, as most users are only using it in process context... all users
in ll_rw_blk.c, anyway.

--
SUSE Labs, Novell Inc.
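
The point about get_io_context can be illustrated with a small refcounting sketch. Everything here is hypothetical (`toy_io_context`, `toy_task` and the helper names are not kernel APIs): it only contrasts taking a counted reference with borrowing a pointer that the current task already keeps alive for the duration of the call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_io_context {
    atomic_int refcount;
    int nr_batch_requests;
};

struct toy_task {
    struct toy_io_context *io_context;  /* owned by the task itself */
};

/* Counted reference: needed when the pointer may outlive its owner. */
struct toy_io_context *toy_get_io_context(struct toy_task *t)
{
    assert(t != NULL && t->io_context != NULL);
    atomic_fetch_add(&t->io_context->refcount, 1);
    return t->io_context;
}

void toy_put_io_context(struct toy_io_context *ioc)
{
    atomic_fetch_sub(&ioc->refcount, 1);
}

/* Borrowed reference: valid only while running in the owning task's
 * context, so no atomic operations are needed. */
struct toy_io_context *toy_current_io_context(struct toy_task *current_task)
{
    return current_task->io_context;
}
```

The borrowed variant avoids two atomic operations per request, which is the kind of saving that would show up in a profile dominated by get/put pairs.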

2005-04-13 03:00:05

by Nick Piggin

Subject: Re: Processes stuck on D state on Dual Opteron

Claudio Martins wrote:
> On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
>
>>Claudio Martins <[email protected]> wrote:
>>
>>> I think I'm going to give a try to Neil's patch, but I'll have to apply
>>>some patches from -mm.
>>
>>Just this one if you're using 2.6.12-rc2:
>>
>>--- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon
>>Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005
>>@@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
>> static int sync_page_io(struct block_device *bdev, sector_t sector, int
>>size, struct page *page, int rw)
>> {
>>- struct bio *bio = bio_alloc(GFP_KERNEL, 1);
>>+ struct bio *bio = bio_alloc(GFP_NOIO, 1);
>> struct completion event;
>> int ret;
>>
>>_
>
>
>
> Hi Andrew, all,
>
> Sorry for the delay in reporting. This patch does indeed fix the problem.
> The machine ran stress for almost 15h straight with no problems at all.
>
> As for Nick's patch I, too, think it would be nice to be included (once the
> performance problems are sorted out), since it seemed to make the block layer
> more robust and well behaved (at least with stress), although I didn't run
> performance tests to measure regressions.
>
> Thanks Nick, Neil, Andrew and all others for your great help with this
> issue. I'll have to put the machine on production now with the patch applied,
> but let me know if I can be of any further help with these issues.
>

Thanks for reporting and testing - what we need is more people
like you contributing to Linux ;)

--
SUSE Labs, Novell Inc.