2007-11-21 21:32:10

by Jie Chen

Subject: Possible bug from kernel 2.6.22 and above

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_SYSCTL=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
CONFIG_RT_MUTEXES=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_X86_PC=y
CONFIG_MK8=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_DISCONTIGMEM_MANUAL=y
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
CONFIG_HZ_100=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_PM=y
CONFIG_SUSPEND_SMP_POSSIBLE=y
CONFIG_HIBERNATION_SMP_POSSIBLE=y
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=m
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_HT_IRQ=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
CONFIG_HOTPLUG_PCI_SHPC=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_FIB_HASH=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_NETWORK_SECMARK=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_LLC=m
CONFIG_NET_PKTGEN=m
CONFIG_IRDA=m
CONFIG_IRLAN=m
CONFIG_IRCOMM=m
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
CONFIG_IRTTY_SIR=m
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIVHCI=m
CONFIG_WIRELESS_EXT=y
CONFIG_IEEE80211=m
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_SOFTMAC=m
CONFIG_IEEE80211_SOFTMAC_DEBUG=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_MISC_DEVICES=y
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=m
CONFIG_BLK_DEV_IDEFLOPPY=y
CONFIG_BLK_DEV_IDESCSI=m
CONFIG_IDE_TASK_IOCTL=y
CONFIG_IDE_PROC_FS=y
CONFIG_IDE_GENERIC=y
CONFIG_BLK_DEV_IDEPNP=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_IDEPCI_PCIBUS_ORDER=y
CONFIG_BLK_DEV_GENERIC=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_AEC62XX=y
CONFIG_BLK_DEV_ALI15X3=y
CONFIG_BLK_DEV_AMD74XX=y
CONFIG_BLK_DEV_ATIIXP=y
CONFIG_BLK_DEV_CMD64X=y
CONFIG_BLK_DEV_HPT34X=y
CONFIG_BLK_DEV_HPT366=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_BLK_DEV_IT821X=y
CONFIG_BLK_DEV_PDC202XX_OLD=y
CONFIG_BLK_DEV_PDC202XX_NEW=y
CONFIG_BLK_DEV_SVWKS=y
CONFIG_BLK_DEV_SIIMAGE=y
CONFIG_BLK_DEV_SIS5513=y
CONFIG_BLK_DEV_VIA82CXXX=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y
CONFIG_BLK_DEV_SD=m
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_WAIT_SCAN=m
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_MMIO=y
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
CONFIG_ATA=m
CONFIG_ATA_ACPI=y
CONFIG_SATA_SVW=m
CONFIG_ATA_PIIX=m
CONFIG_SATA_NV=m
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SX4=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIL24=m
CONFIG_SATA_SIS=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
CONFIG_PATA_AMD=m
CONFIG_PATA_ATIIXP=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_EFAR=m
CONFIG_ATA_GENERIC=m
CONFIG_PATA_IT821X=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=m
CONFIG_PATA_NETCELL=m
CONFIG_PATA_RZ1000=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_EMC=m
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_CTL=m
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_PHYLIB=m
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_FIXED_PHY=m
CONFIG_FIXED_MII_10_FDX=y
CONFIG_FIXED_MII_100_FDX=y
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_AMD8111E_NAPI=y
CONFIG_B44=m
CONFIG_FORCEDETH=m
CONFIG_E100=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_VIA_RHINE_NAPI=y
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
CONFIG_DL2K=m
CONFIG_E1000=m
CONFIG_E1000_NAPI=y
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
CONFIG_R8169_NAPI=y
CONFIG_R8169_VLAN=y
CONFIG_SIS190=m
CONFIG_SKGE=m
CONFIG_SKY2=m
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=m
CONFIG_BNX2=m
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_NAPI=y
CONFIG_IXGB=m
CONFIG_IXGB_NAPI=y
CONFIG_S2IO=m
CONFIG_S2IO_NAPI=y
CONFIG_MYRI10GE=m
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_UINPUT=m
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_CYCLADES=m
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_TIPAR=m
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=m
CONFIG_SC520_WDT=m
CONFIG_I6300ESB_WDT=m
CONFIG_W83627HF_WDT=m
CONFIG_W83877F_WDT=m
CONFIG_W83977F_WDT=m
CONFIG_MACHZ_WDT=m
CONFIG_PCIPCWATCHDOG=m
CONFIG_WDTPCI=m
CONFIG_WDT_501_PCI=y
CONFIG_USBPCWATCHDOG=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=y
CONFIG_RTC=y
CONFIG_R3964=m
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_MWAVE=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
CONFIG_HPET=y
CONFIG_HANGCHECK_TIMER=m
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_I810=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_OCORES=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
CONFIG_I2C_PROSAVAGE=m
CONFIG_I2C_SAVAGE4=m
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=m
CONFIG_I2C_STUB=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m
CONFIG_I2C_VOODOO3=m
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_DS1374=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_HDAPS=m
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_CAPTURE_DRIVERS=y
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_V4L_USB_DRIVERS=y
CONFIG_RADIO_ADAPTERS=y
CONFIG_DAB=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FONT_8x16=y
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_USB_HID=y
CONFIG_HID_FF=y
CONFIG_HID_PID=y
CONFIG_LOGITECH_FF=y
CONFIG_THRUSTMASTER_FF=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_ISP116X_HCD=m
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
CONFIG_USB_SL811_HCD=m
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
CONFIG_USB_LIBUSUAL=y
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m
CONFIG_USB_MON=y
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRPRIME=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_EZUSB=y
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_AUERSWALD=m
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
CONFIG_USB_IDMOUSE=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TEST=m
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPATH=m
CONFIG_INFINIBAND_AMSO1100=m
CONFIG_MLX4_INFINIBAND=m
CONFIG_INFINIBAND_IPOIB=m
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_INFINIBAND_SRP=m
CONFIG_INFINIBAND_ISER=m
CONFIG_DMA_ENGINE=y
CONFIG_NET_DMA=y
CONFIG_INTEL_IOATDMA=m
CONFIG_VIRTUALIZATION=y
CONFIG_EDD=m
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_FS_XIP=y
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_JBD=m
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
CONFIG_FS_POSIX_ACL=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_SECURITY=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_OCFS2_FS=m
CONFIG_ROMFS_FS=m
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_DNOTIFY=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_NTFS_FS=m
CONFIG_NTFS_RW=y
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_RAMFS=y
CONFIG_CONFIGFS_FS=m
CONFIG_UFS_FS=m
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_CIFS=m
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CODA_FS=m
CONFIG_PARTITION_ADVANCED=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_SGI_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_UTF8=m
CONFIG_PROFILING=y
CONFIG_OPROFILE=m
CONFIG_KPROBES=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_CAPABILITIES=y
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_HW=y
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC32=y
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


Attachments:
kernel-2.6.23.8-config (19.50 kB)

2007-11-21 22:14:21

by Eric Dumazet

Subject: Re: Possible bug from kernel 2.6.22 and above

Jie Chen a écrit :
> Hi, there:
>
> We have a simple pthread program that measures the synchronization
> overheads for various synchronization mechanisms such as spin locks,
> barriers (the barrier is implemented using queue-based barrier
> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
> distribution. Before we moved to this kernel, we had kernel 2.6.21.
> These two kernels are configured identical and compiled with the same
> gcc 4.1.2 compiler. Under the old kernel, we observed that the
> performance of these overheads increases as the number of threads
> increases from 2 to 8. The following are the values of total time and
> overhead for all threads acquiring a pthread spin lock and all threads
> executing a barrier synchronization call.

Could you post the source of your test program ?

Spinlocks are ... spinning and should not call the Linux scheduler, so I have no
idea why a kernel change could modify your results.

Also I suspect you'll have better results with Fedora Core 8 (since glibc was
updated to use private futexes in v 2.7), at least for the barrier ops.


>
> Kernel 2.6.21
> Number of Threads 2 4 6 8
> SpinLock (Time micro second) 10.5618 10.58538 10.5915 10.643
> (Overhead) 0.073 0.05746 0.102805 0.154563
> Barrier (Time micro second) 11.020410 11.678125 11.9889 12.38002
> (Overhead) 0.531660 1.1502 1.500112 1.891617
>
> Each thread is bound to a particular core using pthread_setaffinity_np.
>
> Kernel 2.6.23.8
> Number of Threads 2 4 6 8
> SpinLock (Time micro second) 14.849915 17.117603 14.4496 10.5990
> (Overhead) 4.345417 6.617207 3.949435 0.110985
> Barrier (Time micro second) 19.462255 20.285117 16.19395 12.37662
> (Overhead) 8.957755 9.784722 5.699590 1.869518
>
> It is clearly that the synchronization overhead increases as the number
> of threads increases in the kernel 2.6.21. But the synchronization
> overhead actually decreases as the number of threads increases in the
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
> well). This certainly is not a correct behavior. The kernels are
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> configuration file is in the attachment of this e-mail.
>
> From what we have read, there was a new scheduler (CFS) appeared from
> 2.6.22. We are not sure whether the above behavior is caused by the new
> scheduler.
>
> Finally, our machine cpu information is listed in the following:
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 2
> model name : Quad-Core AMD Opteron(tm) Processor 2347
> stepping : 10
> cpu MHz : 1909.801
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp
> lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm
> cmp_legacy svm
> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> bogomips : 3822.95
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate
>
> In addition, we have schedstat and sched_debug files in the /proc
> directory.
>
> Thank you for all your help to solve this puzzle. If you need more
> information, please let us know.
>
>
> P.S. I like to be cc'ed on the discussions related to this problem.
>
>
> ###############################################
> Jie Chen
> Scientific Computing Group
> Thomas Jefferson National Accelerator Facility
> 12000, Jefferson Ave.
> Newport News, VA 23606
>
> (757)269-5046 (office) (757)269-6248 (fax)
> [email protected]
> ###############################################
>

2007-11-22 01:52:43

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above

Eric Dumazet wrote:
> Jie Chen a écrit :
>> Hi, there:
>>
>> We have a simple pthread program that measures the synchronization
>> overheads for various synchronization mechanisms such as spin locks,
>> barriers (the barrier is implemented using queue-based barrier
>> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
>> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
>> distribution. Before we moved to this kernel, we had kernel 2.6.21.
>> These two kernels are configured identical and compiled with the same
>> gcc 4.1.2 compiler. Under the old kernel, we observed that the
>> performance of these overheads increases as the number of threads
>> increases from 2 to 8. The following are the values of total time and
>> overhead for all threads acquiring a pthread spin lock and all threads
>> executing a barrier synchronization call.
>
> Could you post the source of your test program ?
>


Hi, Eric:

Thank you for the quick response. You can get the source code containing
the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a
data-parallel threading package for physics calculations. The test code
is pthread_sync in the src directory once you unpack the gz file.
Configuring and building this package is very simple: configure and make. The
test program is built by make check. The number of threads is
controlled by QMT_NUM_THREADS. The package uses pthread spin locks,
but the barrier is implemented with the queue-based barrier algorithm
proposed by J. B. Carter of the University of Utah (2005).
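
In case it is useful to see the shape of the measurement without unpacking
the tarball, the test is essentially of the form sketched below. This is
only an illustration, not the actual qmt code: it uses the standard
pthread_barrier_t in place of our queue-based barrier, and the names and
iteration count here are made up.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ITERS 100000    /* barrier crossings measured per thread */

static pthread_barrier_t barrier;

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

static void *worker(void *arg)
{
    double *per_crossing = malloc(sizeof(double));
    double t0;
    int i;

    (void)arg;
    t0 = now_usec();
    for (i = 0; i < ITERS; i++)
        pthread_barrier_wait(&barrier);
    /* average time per barrier crossing, in microseconds */
    *per_crossing = (now_usec() - t0) / ITERS;
    return per_crossing;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 2;
    pthread_t tid[64];
    int i;

    if (nthreads < 1 || nthreads > 64)
        nthreads = 2;
    pthread_barrier_init(&barrier, NULL, nthreads);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < nthreads; i++) {
        void *res;
        pthread_join(tid[i], &res);
        printf("thread %d: %f usec per barrier\n", i, *(double *)res);
        free(res);
    }
    pthread_barrier_destroy(&barrier);
    return 0;
}

The real test also subtracts a reference loop that does the same bookkeeping
without any synchronization, which is where the reported "overhead" numbers
come from.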





> spinlock are ... spining and should not call linux scheduler, so I have
> no idea why a kernel change could modify your results.
>
> Also I suspect you'll have better results with Fedora Core 8 (since
> glibc was updated to use private futexes in v 2.7), at least for the
> barrier ops.
>
>

I am not sure what the biggest change between kernel 2.6.21 and 2.6.22
(23) is. Is the scheduler the biggest change between these versions? Can
the kernel scheduler somehow affect the performance? I know the
scheduler tries to do load balancing and so on. Can the scheduler move
threads to different cores according to the load-balancing algorithm even
though the threads are bound to cores using the pthread_setaffinity_np call,
when the number of threads is fewer than the number of cores? I am
wondering about this because the performance of our test code is roughly
the same for both kernels when the number of threads equals the
number of cores.
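
For reference, each thread pins itself roughly along these lines. This is a
simplified sketch of what our code does with the glibc extension, not the
exact qmt source; the function name is made up:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core (glibc extension). */
static int bind_self_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* returns 0 on success; afterwards the scheduler should only run
       this thread on 'core' */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (bind_self_to_core(0) != 0)
        fprintf(stderr, "pthread_setaffinity_np failed\n");
    return 0;
}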

>>
>> Kernel 2.6.21
>> Number of Threads 2 4 6 8
>> SpinLock (Time micro second) 10.5618 10.58538 10.5915 10.643
>> (Overhead) 0.073 0.05746 0.102805 0.154563
>> Barrier (Time micro second) 11.020410 11.678125 11.9889 12.38002
>> (Overhead) 0.531660 1.1502 1.500112 1.891617
>>
>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>
>> Kernel 2.6.23.8
>> Number of Threads 2 4 6 8
>> SpinLock (Time micro second) 14.849915 17.117603 14.4496 10.5990
>> (Overhead) 4.345417 6.617207 3.949435 0.110985
>> Barrier (Time micro second) 19.462255 20.285117 16.19395 12.37662
>> (Overhead) 8.957755 9.784722 5.699590 1.869518
>>
>> It is clearly that the synchronization overhead increases as the
>> number of threads increases in the kernel 2.6.21. But the
>> synchronization overhead actually decreases as the number of threads
>> increases in the kernel 2.6.23.8 (We observed the same behavior on
>> kernel 2.6.22 as well). This certainly is not a correct behavior. The
>> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
>> configuration file is in the attachment of this e-mail.
>>
>> From what we have read, there was a new scheduler (CFS) appeared from
>> 2.6.22. We are not sure whether the above behavior is caused by the
>> new scheduler.
>>
>> Finally, our machine cpu information is listed in the following:
>>
>> processor : 0
>> vendor_id : AuthenticAMD
>> cpu family : 16
>> model : 2
>> model name : Quad-Core AMD Opteron(tm) Processor 2347
>> stepping : 10
>> cpu MHz : 1909.801
>> cache size : 512 KB
>> physical id : 0
>> siblings : 4
>> core id : 0
>> cpu cores : 4
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 5
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov
>> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
>> pdpe1gb rdtscp
>> lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm
>> cmp_legacy svm
>> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
>> bogomips : 3822.95
>> TLB size : 1024 4K pages
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 48 bits physical, 48 bits virtual
>> power management: ts ttp tm stc 100mhzsteps hwpstate
>>
>> In addition, we have schedstat and sched_debug files in the /proc
>> directory.
>>
>> Thank you for all your help to solve this puzzle. If you need more
>> information, please let us know.
>>
>>
>> P.S. I like to be cc'ed on the discussions related to this problem.
>>

Thank you for your help and happy thanksgiving !

--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [email protected]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

2007-11-22 02:26:54

by Simon Holm Thøgersen

Subject: Re: Possible bug from kernel 2.6.22 and above


ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
> Eric Dumazet wrote:
> > Jie Chen a écrit :
> >> Hi, there:
> >>
> >> We have a simple pthread program that measures the synchronization
> >> overheads for various synchronization mechanisms such as spin locks,
> >> barriers (the barrier is implemented using queue-based barrier
> >> algorithm) and so on. We have dual quad-core AMD opterons (barcelona)
> >> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7
> >> distribution. Before we moved to this kernel, we had kernel 2.6.21.
> >> These two kernels are configured identical and compiled with the same
> >> gcc 4.1.2 compiler. Under the old kernel, we observed that the
> >> performance of these overheads increases as the number of threads
> >> increases from 2 to 8. The following are the values of total time and
> >> overhead for all threads acquiring a pthread spin lock and all threads
> >> executing a barrier synchronization call.
> >
> > Could you post the source of your test program ?
> >
>
>
> Hi, Eric:
>
> Thank you for the quick response. You can get the source code containing
> the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a
> data parallel threading package for physics calculation. The test code
> is pthread_sync in the src directory once you unpack the gz file. To
> configure and build this package is very simple: configure and make. The
> test program is built by make check. The number of threads is
> controlled by QMT_NUM_THREADS. The package is using pthread spin lock,
> but the barrier is implemented using a queue-based barrier algorithm
> proposed by J. B. Carter of University of Utah (2005).
>
>
>
>
>
> > spinlock are ... spining and should not call linux scheduler, so I have
> > no idea why a kernel change could modify your results.
> >
> > Also I suspect you'll have better results with Fedora Core 8 (since
> > glibc was updated to use private futexes in v 2.7), at least for the
> > barrier ops.
> >
> >
>
> I am not sure what the biggest change between kernel 2.6.21 and 2.6.22
> (23) is? Is the scheduler the biggest change between these versions? Can
> the scheduler of kernel somehow effect the performance? I know the
> scheduler is trying to do load balance and so on. Can the scheduler move
> threads to different cores according to the load balance algorithm even
> though the threads are bound to cores using pthread_setaffinity_np call
> when the number of threads is fewer than the number of cores? I am
> thinking about this because the performance of our test code is roughly
> the same for both kernels when the number of threads equals to the
> number of cores.
>
There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

> >>
> >> Kernel 2.6.21
> >> Number of Threads 2 4 6 8
> >> SpinLock (Time micro second) 10.5618 10.58538 10.5915 10.643
> >> (Overhead) 0.073 0.05746 0.102805 0.154563
> >> Barrier (Time micro second) 11.020410 11.678125 11.9889 12.38002
> >> (Overhead) 0.531660 1.1502 1.500112 1.891617
> >>
> >> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>
> >> Kernel 2.6.23.8
> >> Number of Threads 2 4 6 8
> >> SpinLock (Time micro second) 14.849915 17.117603 14.4496 10.5990
> >> (Overhead) 4.345417 6.617207 3.949435 0.110985
> >> Barrier (Time micro second) 19.462255 20.285117 16.19395 12.37662
> >> (Overhead) 8.957755 9.784722 5.699590 1.869518
> >>
> >> It is clearly that the synchronization overhead increases as the
> >> number of threads increases in the kernel 2.6.21. But the
> >> synchronization overhead actually decreases as the number of threads
> >> increases in the kernel 2.6.23.8 (We observed the same behavior on
> >> kernel 2.6.22 as well). This certainly is not a correct behavior. The
> >> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> >> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> >> configuration file is in the attachment of this e-mail.
> >>
> >> From what we have read, there was a new scheduler (CFS) appeared from
> >> 2.6.22. We are not sure whether the above behavior is caused by the
> >> new scheduler.
> >>
> >> Finally, our machine cpu information is listed in the following:
> >>
> >> processor : 0
> >> vendor_id : AuthenticAMD
> >> cpu family : 16
> >> model : 2
> >> model name : Quad-Core AMD Opteron(tm) Processor 2347
> >> stepping : 10
> >> cpu MHz : 1909.801
> >> cache size : 512 KB
> >> physical id : 0
> >> siblings : 4
> >> core id : 0
> >> cpu cores : 4
> >> fpu : yes
> >> fpu_exception : yes
> >> cpuid level : 5
> >> wp : yes
> >> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> >> mca cmov
> >> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> >> pdpe1gb rdtscp
> >> lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm
> >> cmp_legacy svm
> >> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> >> bogomips : 3822.95
> >> TLB size : 1024 4K pages
> >> clflush size : 64
> >> cache_alignment : 64
> >> address sizes : 48 bits physical, 48 bits virtual
> >> power management: ts ttp tm stc 100mhzsteps hwpstate
> >>
> >> In addition, we have schedstat and sched_debug files in the /proc
> >> directory.
> >>
> >> Thank you for all your help to solve this puzzle. If you need more
> >> information, please let us know.
> >>
> >>
> >> P.S. I like to be cc'ed on the discussions related to this problem.
> >>
>
> Thank you for your help and happy thanksgiving !
>


Simon Holm Thøgersen

2007-11-22 02:58:27

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above

Simon Holm Thøgersen wrote:
> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:

> There is a backport of the CFS scheduler to 2.6.21, see
> http://lkml.org/lkml/2007/11/19/127
>
Hi, Simon:

I will try that after the Thanksgiving holiday to find out whether the
odd behavior shows up using 2.6.21 with the backported CFS.

>>>> Kernel 2.6.21
>>>> Number of Threads 2 4 6 8
>>>> SpinLock (Time micro second) 10.5618 10.58538 10.5915 10.643
>>>> (Overhead) 0.073 0.05746 0.102805 0.154563
>>>> Barrier (Time micro second) 11.020410 11.678125 11.9889 12.38002
>>>> (Overhead) 0.531660 1.1502 1.500112 1.891617
>>>>
>>>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>>>
>>>> Kernel 2.6.23.8
>>>> Number of Threads 2 4 6 8
>>>> SpinLock (Time micro second) 14.849915 17.117603 14.4496 10.5990
>>>> (Overhead) 4.345417 6.617207 3.949435 0.110985
>>>> Barrier (Time micro second) 19.462255 20.285117 16.19395 12.37662
>>>> (Overhead) 8.957755 9.784722 5.699590 1.869518
>>>>

>
>
> Simon Holm Thøgersen
>
>
I just ran a simple test to show that the problem may be related to the
scheduler's load balancing. I first started 6 processes using
"taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7
donothing". These 6 processes run on cores 2 to 7. Then I started my
test program with two threads bound to cores 0 and 1. Here is the result:

Two threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.558255
(Overhead) 0.068965
Barrier (Time micro second) 10.865520
(Overhead) 0.376230

Similarly, I started 4 donothing processes on cores 4, 5, 6 and 7, and
ran the test program, with the following result:

Four threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.579413
(Overhead) 0.090023
Barrier (Time micro second) 11.363193
(Overhead) 0.873803

Finally, here is the result for 6 threads with two donothing processes
running on cores 6 and 7:

Six threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.590030
(Overhead) 0.100940
Barrier (Time micro second) 11.977548
(Overhead) 1.488458

The above results are very similar to the results obtained for
kernel 2.6.21. I hope this helps you in some way. Thank you.
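
For completeness, the "donothing" program used above is nothing more than a
busy loop pinned with taskset; a minimal stand-in would be something like:

/* Minimal stand-in for the "donothing" processes used above:
   a pure busy loop that keeps one core 100% busy until killed. */
int main(void)
{
    volatile unsigned long x = 0;

    for (;;)
        x++;
    return 0;
}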

--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [email protected]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

2007-11-22 20:21:32

by Matt Mackall

Subject: Re: Possible bug from kernel 2.6.22 and above

On Wed, Nov 21, 2007 at 09:58:10PM -0500, Jie Chen wrote:
> Simon Holm Thøgersen wrote:
> >ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>
> >There is a backport of the CFS scheduler to 2.6.21, see
> >http://lkml.org/lkml/2007/11/19/127
> >
> Hi, Simon:
>
> I will try that after the thanksgiving holiday to find out whether the
> odd behavior will show up using 2.6.21 with back ported CFS.
>
> >>>>Kernel 2.6.21
> >>>>Number of Threads 2 4 6 8
> >>>>SpinLock (Time micro second) 10.5618 10.58538 10.5915 10.643
> >>>> (Overhead) 0.073 0.05746 0.102805 0.154563
> >>>>Barrier (Time micro second) 11.020410 11.678125 11.9889 12.38002
> >>>> (Overhead) 0.531660 1.1502 1.500112 1.891617
> >>>>
> >>>>Each thread is bound to a particular core using pthread_setaffinity_np.
> >>>>
> >>>>Kernel 2.6.23.8
> >>>>Number of Threads 2 4 6 8
> >>>>SpinLock (Time micro second) 14.849915 17.117603 14.4496 10.5990
> >>>> (Overhead) 4.345417 6.617207 3.949435 0.110985
> >>>>Barrier (Time micro second) 19.462255 20.285117 16.19395 12.37662
> >>>> (Overhead) 8.957755 9.784722 5.699590 1.869518
> >>>>
>
> >
> >
> >Simon Holm Thøgersen
> >
> >
> I just ran a simple test to prove that the problem may be related to
> load balance of the scheduler. I first started 6 processes using
> "taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7
> donothing". These 6 processes will run on core 2 to 7. Then I started my
> test program using two threads bound to core 0 and 1. Here is the result:
>
> Two threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.558255
> (Overhead) 0.068965
> Barrier (Time micro second) 10.865520
> (Overhead) 0.376230
>
> Similarly, I started 4 donothing processes on core 4, 5, 6 and 7, and
> ran the test program. I have the following result:
>
> Four threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.579413
> (Overhead) 0.090023
> Barrier (Time micro second) 11.363193
> (Overhead) 0.873803
>
> Finally, here is the result for 6 threads with two donothing processes
> running on core 6 and 7:
>
> Six threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.590030
> (Overhead) 0.100940
> Barrier (Time micro second) 11.977548
> (Overhead) 1.488458
>
> Now the above results are very much similar to the results obtained for
> the kernel 2.6.21. I hope this helps you guys in some ways. Thank you.

Yes, this really does look like a scheduling regression. I've added
Ingo to the cc: list. Next time you should pick a more descriptive
subject line - we've got lots of email about possible bugs.

--
Mathematics is the supreme nostalgia of our time.

2007-12-04 13:17:40

by Ingo Molnar

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> Simon Holm Thøgersen wrote:
>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>
>> There is a backport of the CFS scheduler to 2.6.21, see
>> http://lkml.org/lkml/2007/11/19/127
>>
> Hi, Simon:
>
> I will try that after the thanksgiving holiday to find out whether the
> odd behavior will show up using 2.6.21 with back ported CFS.

would be also nice to test this with 2.6.24-rc4.

Ingo

2007-12-04 15:41:31

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> Simon Holm Thøgersen wrote:
>>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with back ported CFS.
>
> would be also nice to test this with 2.6.24-rc4.
>
> Ingo
Hi, Ingo:

I will test 2.6.24-rc4 this week and let you know the result. Thanks.

--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [email protected]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

2007-12-05 15:29:44

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> Simon Holm Thøgersen wrote:
>>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with back ported CFS.
>
> would be also nice to test this with 2.6.24-rc4.
>
> Ingo
Hi, Ingo:

I just ran the same test on two 2.6.24-rc4 kernels: one with
CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
off. The odd behavior I described in my previous e-mails was still
there for both kernels. Let me know if I can be of any more help. Thank you.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 15:40:47

by Ingo Molnar

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> I just ran the same test on two 2.6.24-rc4 kernels: one with
> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
> off. The odd behavior I described in my previous e-mails were still
> there for both kernels. Let me know If I can be any more help. Thank
> you.

ok, i had a look at your data, and i think this is the result of the
scheduler balancing out to idle CPUs more aggressively than before. Doing
that is almost always a good idea though - but indeed it can result in
"bad" numbers if all you do is measure the ping-pong "performance"
between two threads. (with no real work done by either of them).

the moment you saturate the system a bit more, the numbers should
improve even with such a ping-pong test.

do you have test code (or a modification of your testcase source code)
that simulates a real-life situation where 2.6.24-rc4 does not perform as
well as you'd like? (or if qmt.tar.gz already contains that,
then please point me towards that portion of the test and how i should
run it - thanks!)

Ingo

2007-12-05 16:17:24

by Eric Dumazet

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int blockthemall = 1;   /* gate: worker threads spin here until main releases them */

static inline void cpupause(void)
{
#if defined(i386)
    asm volatile("rep;nop":::"memory");
#else
    asm volatile("":::"memory");
#endif
}

/*
 * Determines number of cpus
 * Can be overridden by the NR_CPUS environment variable
 */
int number_of_cpus(void)
{
    char line[1024], *p;
    int cnt = 0;
    FILE *F;

    p = getenv("NR_CPUS");
    if (p)
        return atoi(p);
    F = fopen("/proc/cpuinfo", "r");
    if (F == NULL) {
        perror("/proc/cpuinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), F) != NULL) {
        if (memcmp(line, "processor", 9) == 0)
            cnt++;
    }
    fclose(F);
    return cnt;
}

void compute_elapsed(struct timeval *delta, const struct timeval *t0)
{
    struct timeval t1;

    gettimeofday(&t1, NULL);
    delta->tv_sec = t1.tv_sec - t0->tv_sec;
    delta->tv_usec = t1.tv_usec - t0->tv_usec;
    if (delta->tv_usec < 0) {
        delta->tv_usec += 1000000;
        delta->tv_sec--;
    }
}

int nr_loops = 20 * 1000000;
double incr = 0.3456;

/* The "unit of work": a pure CPU-bound loop, no synchronization at all */
void perform_work(void)
{
    int i;
    double t = 0.0;

    for (i = 0; i < nr_loops; i++)
        t += incr;
    if (t < 0.0)
        printf("well... should not happen\n");
}

void set_affinity(int cpu)
{
    long cpu_mask;
    int res;

    cpu_mask = 1L << cpu;
    res = sched_setaffinity(0, sizeof(cpu_mask), (cpu_set_t *)&cpu_mask);
    if (res)
        perror("sched_setaffinity");
}

void *thread_work(void *arg)
{
    int cpu = (int)(long)arg;

    set_affinity(cpu);
    while (blockthemall)
        cpupause();
    perform_work();
    return (void *)0;
}

int main(int argc, char *argv[])
{
    struct timeval t0, delta;
    int nr_cpus, i;
    pthread_t *tids;

    /* Reference: one thread doing the unit of work alone */
    gettimeofday(&t0, NULL);
    perform_work();
    compute_elapsed(&delta, &t0);
    printf("Time to perform the unit of work on one thread is %ld.%06ld s\n",
           (long)delta.tv_sec, (long)delta.tv_usec);

    nr_cpus = number_of_cpus();
    if (nr_cpus <= 1)
        return 0;
    tids = malloc(nr_cpus * sizeof(pthread_t));
    for (i = 1; i < nr_cpus; i++)
        pthread_create(tids + i, NULL, thread_work, (void *)(long)i);

    /* Release all workers and do the same unit of work on every cpu in parallel */
    set_affinity(0);
    gettimeofday(&t0, NULL);
    blockthemall = 0;
    perform_work();
    for (i = 1; i < nr_cpus; i++)
        pthread_join(tids[i], NULL);
    compute_elapsed(&delta, &t0);
    printf("Time to perform the unit of work on %d threads is %ld.%06ld s\n",
           nr_cpus, (long)delta.tv_sec, (long)delta.tv_usec);
    return 0;
}


Attachments:
burner.c (2.25 kB)

2007-12-05 16:22:59

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> I just ran the same test on two 2.6.24-rc4 kernels: one with
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
>> off. The odd behavior I described in my previous e-mails were still
>> there for both kernels. Let me know If I can be any more help. Thank
>> you.
>
> ok, i had a look at your data, and i think this is the result of the
> scheduler balancing out to idle CPUs more agressively than before. Doing
> that is almost always a good idea though - but indeed it can result in
> "bad" numbers if all you do is to measure the ping-pong "performance"
> between two threads. (with no real work done by any of them).
>

My test code does not do much work; it measures the overhead of various
synchronization mechanisms such as barriers and locks. I am trying to see
the scalability of different implementations/algorithms on multi-core
machines.

> the moment you saturate the system a bit more, the numbers should
> improve even with such a ping-pong test.
>
You are right. If I manually do the load balancing (bind unrelated processes
to the other cores), my test code performs as well as it did in
kernel 2.6.21.
> do you have testcode (or a modification of your testcase sourcecode)
> that simulates a real-life situation where 2.6.24-rc4 performs not as
> well as you'd like it to see? (or if qmt.tar.gz already contains that
> then please point me towards that portion of the test and how i should
> run it - thanks!)

The qmt.tar.gz code contains a simple test program called pthread_sync
under the src directory. You can change the number of threads by setting the
QMT_NUM_THREADS environment variable. You can build qmt by doing
configure --enable-public-release. I do not have Intel quad-core
machines, so I am not sure whether the behavior will show up on the Intel
platform. Our cluster is dual quad-core Opteron, which has its own
hardware problem :-).
http://hardware.slashdot.org/article.pl?sid=07/12/04/237248&from=rss

>
> Ingo

Hi, Ingo:

My test code qmt can be found at ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz.
There is a minor performance issue in qmt pointed out by Eric, whose fix I
have not put into the tarball yet. If I can be of any help, please let me
know. Thank you very much.



--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 16:25:45

by Ingo Molnar

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Eric Dumazet <[email protected]> wrote:

> $ gcc -O2 -o burner burner.c
> $ ./burner
> Time to perform the unit of work on one thread is 0.040328 s
> Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this,
correct?

Ingo

2007-12-05 16:30:19

by Eric Dumazet

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar a écrit :
> * Eric Dumazet <[email protected]> wrote:
>
>> $ gcc -O2 -o burner burner.c
>> $ ./burner
>> Time to perform the unit of work on one thread is 0.040328 s
>> Time to perform the unit of work on 2 threads is 0.040221 s
>
> ok, but this actually suggests that scheduling is fine for this,
> correct?
>
> Ingo
>
>

Yes, but this machine runs an old kernel. I was just showing you how to run it :)

2007-12-05 16:47:54

by Ingo Molnar

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do load balance (bind unrelated processes
> on the other cores), my test code perform as well as it did in the
> kernel 2.6.21.

so right now the results don't seem to be too bad to me - the higher
overhead comes from two threads running on two different cores and
incurring the overhead of cross-core communication. In a true
spread-out workload that synchronizes occasionally you'd get the same
kind of overhead, so in fact this behavior is more informative of the
real overhead i guess. In 2.6.21 the two threads would stick to the same
core and produce artificially low latency - which would only be true in
a real spread-out workload if all tasks ran on the same core. (which is
hardly the thing you want with openmp)

In any case, if i misinterpreted your numbers or if you just disagree,
or if you have a workload/test that shows worse performance than it
could/should, let me know.

Ingo

2007-12-05 17:48:16

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>>> the moment you saturate the system a bit more, the numbers should
>>> improve even with such a ping-pong test.
>> You are right. If I manually do load balance (bind unrelated processes
>> on the other cores), my test code perform as well as it did in the
>> kernel 2.6.21.
>
> so right now the results dont seem to be too bad to me - the higher
> overhead comes from two threads running on two different cores and
> incurring the overhead of cross-core communications. In a true
> spread-out workloads that synchronize occasionally you'd get the same
> kind of overhead so in fact this behavior is more informative of the
> real overhead i guess. In 2.6.21 the two threads would stick on the same
> core and produce artificially low latency - which would only be true in
> a real spread-out workload if all tasks ran on the same core. (which is
> hardly the thing you want on openmp)
>

I use the pthread_setaffinity_np call to bind one thread to one core. Unless
kernel 2.6.21 does not honor the affinity, I do not see how
running two threads on two cores should differ between the new kernel and
the old kernel. My test code does not do any numerical calculation, but
it does spin waiting on shared/non-shared flags. The reason I am using
the affinity is to test synchronization overheads among different cores.
In both the new and the old kernel, I see 200% CPU usage when I run
my test code with two threads. Does this mean two threads are running on
two cores? Also, I verify that a thread is indeed bound to a core by using
pthread_getaffinity_np.
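That verification is along the lines of the sketch below (an illustration
only, not the exact qmt code; the function name is made up):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Read back the affinity mask of the calling thread and report whether
   the expected core is set (glibc extension). */
static void check_binding(int expected_core)
{
    cpu_set_t set;
    int err = pthread_getaffinity_np(pthread_self(), sizeof(set), &set);

    if (err) {
        fprintf(stderr, "pthread_getaffinity_np: error %d\n", err);
        return;
    }
    printf("core %d in mask: %s, cpus in mask: %d\n",
           expected_core,
           CPU_ISSET(expected_core, &set) ? "yes" : "no",
           CPU_COUNT(&set));
}

int main(void)
{
    check_binding(0);
    return 0;
}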

> In any case, if i misinterpreted your numbers or if you just disagree,
> or if have a workload/test that shows worse performance that it
> could/should, let me know.
>
> Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core,
the synchronization overhead should increase as the number of
cores/threads increases. But what we observed in the new kernel is the
opposite. The barrier overhead for two threads is 8.93 microseconds vs
1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86). This
will confuse most people who study synchronization/communication
scalability. I know my test code is not a real-world computation, which
would usually use up all cores. I hope I have explained myself clearly. Thank
you very much.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 20:04:19

by Ingo Molnar

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> Since I am using affinity flag to bind each thread to a different
> core, the synchronization overhead should increases as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite. The barrier overhead of two threads is 8.93 micro seconds vs
> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This
> will confuse most of people who study the
> synchronization/communication scalability. I know my test code is not
> real-world computation which usually use up all cores. I hope I have
> explained myself clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler
manage the spreading of tasks? It generally has a better knowledge about
how tasks interrelate.

Ingo

2007-12-05 20:23:18

by Jie Chen

Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> Since I am using affinity flag to bind each thread to a different
>> core, the synchronization overhead should increases as the number of
>> cores/threads increases. But what we observed in the new kernel is the
>> opposite. The barrier overhead of two threads is 8.93 micro seconds vs
>> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This
>> will confuse most of people who study the
>> synchronization/communication scalability. I know my test code is not
>> real-world computation which usually use up all cores. I hope I have
>> explained myself clearly. Thank you very much.
>
> btw., could you try to not use the affinity mask and let the scheduler
> manage the spreading of tasks? It generally has a better knowledge about
> how tasks interrelate.
>
> Ingo
Hi, Ingo:

I just disabled the affinity mask and reran the test. There were no
significant changes for two threads (the barrier overhead is around 9
microseconds). As for 8 threads, the barrier overhead actually drops a
little, which is good. Let me know whether I can be of any help. Thank you
very much.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 20:47:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> I just disabled the affinity mask and reran the test. There were no
> significant changes for two threads (barrier overhead is around 9
> microseconds). As for 8 threads, the barrier overhead actually drops a
> little, which is good. Let me know whether I can be any help. Thank
> you very much.

sorry to be dense, but could you give me instructions how i could remove
the affinity mask and test the "barrier overhead" myself? I have built
"pthread_sync" and it outputs numbers for me - which one would be the
barrier overhead: Reference_time_1 ?

Ingo

2007-12-05 20:52:31

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> I just disabled the affinity mask and reran the test. There were no
>> significant changes for two threads (barrier overhead is around 9
>> microseconds). As for 8 threads, the barrier overhead actually drops a
>> little, which is good. Let me know whether I can be any help. Thank
>> you very much.
>
> sorry to be dense, but could you give me instructions how i could remove
> the affinity mask and test the "barrier overhead" myself? I have built
> "pthread_sync" and it outputs numbers for me - which one would be the
> barrier overhead: Reference_time_1 ?
>
> Ingo
Hi, Ingo:

To disable affinity, do configure --enable-public-release
--disable-thread_affinity. You should see barrier overhead like the
following:
Computing BARRIER time

Sample_size Average Min Max S.D. Outliers
20 19.486162 19.482250 19.491400 0.002740 0

BARRIER time = 19.486162 microseconds +/- 0.005371
BARRIER overhead = 8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread running the
simple delay loop without any synchronization. Thank you.
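
Schematically, the two measurements relate as in the sketch below; this
is only an illustration, not the actual qmt source, and it assumes the
getclock(), delay() and innerreps pieces discussed in this thread plus a
pthread barrier shared by all worker threads:

#include <pthread.h>

extern double getclock(void);          /* seconds, gettimeofday-based  */
extern void delay(void *, int);        /* the do-nothing work unit     */
extern pthread_barrier_t barrier;      /* shared by all worker threads */
extern int innerreps;                  /* inner iteration count        */

double barrier_overhead_usec(void)
{
        double start, ref_time, bar_time;
        int j;

        /* reference: un-synchronized delay loop */
        start = getclock();
        for (j = 0; j < innerreps; j++)
                delay((void *)0, 0);
        ref_time = (getclock() - start) * 1.0e6 / (double) innerreps;

        /* synchronized: same loop, every thread meets at a barrier */
        start = getclock();
        for (j = 0; j < innerreps; j++) {
                delay((void *)0, 0);
                pthread_barrier_wait(&barrier);
        }
        bar_time = (getclock() - start) * 1.0e6 / (double) innerreps;

        /* the "BARRIER overhead" reported above */
        return bar_time - ref_time;
}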

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 20:54:17

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above

Peter Zijlstra wrote:
> On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:
>
>> It is clearly that the synchronization overhead increases as the number
>> of threads increases in the kernel 2.6.21. But the synchronization
>> overhead actually decreases as the number of threads increases in the
>> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
>> well). This certainly is not a correct behavior. The kernels are
>> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
>> configuration file is in the attachment of this e-mail.
>>
>> From what we have read, there was a new scheduler (CFS) appeared from
>> 2.6.22. We are not sure whether the above behavior is caused by the new
>> scheduler.
>
> If I read this correctly, you say that: .22 is the first bad one right?
>
> The new scheduler (CFS) was introduced in .23, so it seems another
> change would be responsible for this.
>
>
>
Hi, Peter:

Yes. We did observe this in 2.6.22. Thank you.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-05 21:02:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

>> sorry to be dense, but could you give me instructions how i could
>> remove the affinity mask and test the "barrier overhead" myself? I
>> have built "pthread_sync" and it outputs numbers for me - which one
>> would be the barrier overhead: Reference_time_1 ?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead like the
> following: Computing BARRIER time
>
> Sample_size Average Min Max S.D. Outliers
> 20 19.486162 19.482250 19.491400 0.002740 0
>
> BARRIER time = 19.486162 microseconds +/- 0.005371
> BARRIER overhead = 8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got
src/pthread_sync which i can run. The only thing i'm missing, if i run
src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about
how to build and run a barrier test. Either i missed it or it's so
obvious to you that you didnt mention it :-)

Ingo

2007-12-05 21:45:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above


On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:

> It is clearly that the synchronization overhead increases as the number
> of threads increases in the kernel 2.6.21. But the synchronization
> overhead actually decreases as the number of threads increases in the
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
> well). This certainly is not a correct behavior. The kernels are
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> configuration file is in the attachment of this e-mail.
>
> From what we have read, there was a new scheduler (CFS) appeared from
> 2.6.22. We are not sure whether the above behavior is caused by the new
> scheduler.

If I read this correctly, you say that: .22 is the first bad one right?

The new scheduler (CFS) was introduced in .23, so it seems another
change would be responsible for this.


2007-12-05 22:16:53

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>>> sorry to be dense, but could you give me instructions how i could
>>> remove the affinity mask and test the "barrier overhead" myself? I
>>> have built "pthread_sync" and it outputs numbers for me - which one
>>> would be the barrier overhead: Reference_time_1 ?
>> To disable affinity, do configure --enable-public-release
>> --disable-thread_affinity. You should see barrier overhead like the
>> following: Computing BARRIER time
>>
>> Sample_size Average Min Max S.D. Outliers
>> 20 19.486162 19.482250 19.491400 0.002740 0
>>
>> BARRIER time = 19.486162 microseconds +/- 0.005371
>> BARRIER overhead = 8.996257 microseconds +/- 0.006575
>
> ok, i did that and rebuilt. I also did "make check" and got
> src/pthread_sync which i can run. The only thing i'm missing, if i run
> src/pthread_sync, it outputs "PARALLEL time":
>
> PARALLEL time = 22.486103 microseconds +/- 3.944821
> PARALLEL overhead = 10.638658 microseconds +/- 10.854154
>
> not "BARRIER time". I've re-read the discussion and found no hint about
> how to build and run a barrier test. Either i missed it or it's so
> obvious to you that you didnt mention it :-)
>
> Ingo

Hi, Ingo:

Did you run configure with --enable-public-release? My qmt is for QCD
calculations (a type of physics code). Without the above flag one can
only test the PARALLEL overhead. The PARALLEL benchmark shows the same
behavior as the BARRIER one, though. Thanks.


###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-06 10:44:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

>> not "BARRIER time". I've re-read the discussion and found no hint
>> about how to build and run a barrier test. Either i missed it or it's
>> so obvious to you that you didnt mention it :-)
>>
>> Ingo
>
> Hi, Ingo:
>
> Did you do configure --enable-public-release? My qmt is for qcd
> calculation (one type of physics code) [...]

yes, i did exactly as instructed.

> [...]. Without the above flag one can only test PARALLEL overhead.
> Actually the PARALLEL benchmark has the same behavior as the BARRIER.
> Thanks.

hm, but PARALLEL does not seem to do that much context switching. So
basically you create the threads and do a few short runs to establish
overhead? Threads do not get fork-balanced at the moment - but turning
it on would be easy. Could you try the patch below - how does it impact
your results? (and please keep affinity setting off)

Ingo

----------->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <[email protected]>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/topology.h | 3 +++
1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
.forkexec_idx = 0, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_NEWIDLE \
+ | SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
@@ -134,6 +135,7 @@
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_NEWIDLE \
+ | SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
@@ -165,6 +167,7 @@
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_NEWIDLE \
+ | SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| BALANCE_FOR_PKG_POWER,\

2007-12-06 16:29:41

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>>> not "BARRIER time". I've re-read the discussion and found no hint
>>> about how to build and run a barrier test. Either i missed it or it's
>>> so obvious to you that you didnt mention it :-)
>>>
>>> Ingo
>> Hi, Ingo:
>>
>> Did you do configure --enable-public-release? My qmt is for qcd
>> calculation (one type of physics code) [...]
>
> yes, i did exactly as instructed.
>
>> [...]. Without the above flag one can only test PARALLEL overhead.
>> Actually the PARALLEL benchmark has the same behavior as the BARRIER.
>> Thanks.
>
> hm, but PARALLEL does not seem to do that much context switching. So
> basically you create the threads and do a few short runs to establish
> overhead? Threads do not get fork-balanced at the moment - but turning
> it on would be easy. Could you try the patch below - how does it impact
> your results? (and please keep affinity setting off)
>
> Ingo
>
> ----------->
> Subject: sched: reactivate fork balancing
> From: Ingo Molnar <[email protected]>
>
> reactivate fork balancing.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> include/linux/topology.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux/include/linux/topology.h
> ===================================================================
> --- linux.orig/include/linux/topology.h
> +++ linux/include/linux/topology.h
> @@ -103,6 +103,7 @@
> .forkexec_idx = 0, \
> .flags = SD_LOAD_BALANCE \
> | SD_BALANCE_NEWIDLE \
> + | SD_BALANCE_FORK \
> | SD_BALANCE_EXEC \
> | SD_WAKE_AFFINE \
> | SD_WAKE_IDLE \
> @@ -134,6 +135,7 @@
> .forkexec_idx = 1, \
> .flags = SD_LOAD_BALANCE \
> | SD_BALANCE_NEWIDLE \
> + | SD_BALANCE_FORK \
> | SD_BALANCE_EXEC \
> | SD_WAKE_AFFINE \
> | SD_WAKE_IDLE \
> @@ -165,6 +167,7 @@
> .forkexec_idx = 1, \
> .flags = SD_LOAD_BALANCE \
> | SD_BALANCE_NEWIDLE \
> + | SD_BALANCE_FORK \
> | SD_BALANCE_EXEC \
> | SD_WAKE_AFFINE \
> | BALANCE_FOR_PKG_POWER,\
Hi, Ingo:

I patched the header file and recompiled the kernel. I observed no
difference (the two-thread overhead is still just as high). Thank you.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-10 11:00:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> I did patch the header file and recompiled the kernel. I observed no
> difference (two threads overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c
test-code:

double get_time_of_day_()
{
...
err = gettimeofday(&ts, NULL);
...
}

and then you use this in the measurement loop:

for (k=0; k<=OUTERREPS; k++){
start = getclock();
for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
delay((void *)0, 0);
#else
delay(0, 0, 0, (void *)0);
#endif
}
times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
}

the problem is, this does not take the overhead of gettimeofday into
account - which overhead can easily reach 10 usecs (the observed
regression). Could you try to eliminate the gettimeofday overhead from
your measurement?
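
One way to quantify that part of the measurement is to time the clock
call itself, as in the rough sketch below (getclock() is the
gettimeofday-based helper from the test; the function here is
illustrative):

extern double getclock(void);   /* seconds, gettimeofday-based */

/* Average cost of one getclock() call, in microseconds. */
static double clock_call_cost_usec(void)
{
        double t0, t1;
        int i, n = 1000;

        t0 = getclock();
        for (i = 0; i < n; i++)
                (void) getclock();
        t1 = getclock();
        return (t1 - t0) * 1.0e6 / (double) n;
}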

gettimeofday overhead is something that might have changed from .21 to
.22 on your box.

Ingo

2007-12-10 20:04:26

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> I did patch the header file and recompiled the kernel. I observed no
>> difference (two threads overhead stays too high). Thank you.
>
> ok, i think i found it. You do this in your qmt/pthread_sync.c
> test-code:
>
> double get_time_of_day_()
> {
> ...
> err = gettimeofday(&ts, NULL);
> ...
> }
>
> and then you use this in the measurement loop:
>
> for (k=0; k<=OUTERREPS; k++){
> start = getclock();
> for (j=0; j<innerreps; j++){
> #ifdef _QMT_PUBLIC
> delay((void *)0, 0);
> #else
> delay(0, 0, 0, (void *)0);
> #endif
> }
> times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
> }
>
> the problem is, this does not take the overhead of gettimeofday into
> account - which overhead can easily reach 10 usecs (the observed
> regression). Could you try to eliminate the gettimeofday overhead from
> your measurement?
>
> gettimeofday overhead is something that might have changed from .21 to
> .22 on your box.
>
> Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which
establishes the elapsed time (reference time) for the non-synchronized
delay() using gettimeofday. Each synchronization overhead value is then
obtained by subtracting the reference time from the elapsed time measured
with the synchronization in place. The effect of gettimeofday() should be
minimal if the time difference (the overhead value) is what we are
interested in, unless gettimeofday behaves differently when running 8
threads vs. running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my
test code. Thank you very much.
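
One such lightweight replacement is sketched below; whether
clock_gettime() is actually cheaper than gettimeofday() depends on the
kernel and glibc, so this is only an option to try, not a guaranteed win:

#include <time.h>

/* Drop-in alternative to a gettimeofday-based getclock(): returns
 * seconds as a double using CLOCK_MONOTONIC (link with -lrt on older
 * glibc). */
static double getclock_monotonic(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
}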

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-11 10:52:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

>> and then you use this in the measurement loop:
>>
>> for (k=0; k<=OUTERREPS; k++){
>> start = getclock();
>> for (j=0; j<innerreps; j++){
>> #ifdef _QMT_PUBLIC
>> delay((void *)0, 0);
>> #else
>> delay(0, 0, 0, (void *)0);
>> #endif
>> }
>> times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>> }
>>
>> the problem is, this does not take the overhead of gettimeofday into
>> account - which overhead can easily reach 10 usecs (the observed
>> regression). Could you try to eliminate the gettimeofday overhead from
>> your measurement?
>>
>> gettimeofday overhead is something that might have changed from .21 to .22
>> on your box.
>>
>> Ingo
>
> Hi, Ingo:
>
> In my pthread_sync code, I first call refer () subroutine which
> actually establishes the elapsed time (reference time) for
> non-synchronized delay() using the gettimeofday. Then each
> synchronization overhead value is obtained by subtracting the
> reference time from the elapsed time with introduced synchronization.
> The effect of gettimeofday() should be minimal if the time difference
> (overhead value) is the interest here. Unless the gettimeofday behaves
> differently in the case of running 8 threads .vs. running 2 threads.
>
> I will try to replace gettimeofday with a lightweight timer call in my
> test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

2740 1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
2740 1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>

and that's the only thing that is going on when computing the reference
time - and i see a similar syscall pattern in the PARALLEL and BARRIER
calculations as well (with no real scheduling going on).
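
For reference, a trace like the above can be reproduced with an
invocation along these lines (the exact flag set is an assumption, not
something stated in this thread):

  strace -f -tt -T -e trace=gettimeofday ./src/pthread_sync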

Ingo

2007-12-11 15:28:17

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>>> and then you use this in the measurement loop:
>>>
>>> for (k=0; k<=OUTERREPS; k++){
>>> start = getclock();
>>> for (j=0; j<innerreps; j++){
>>> #ifdef _QMT_PUBLIC
>>> delay((void *)0, 0);
>>> #else
>>> delay(0, 0, 0, (void *)0);
>>> #endif
>>> }
>>> times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>> }
>>>
>>> the problem is, this does not take the overhead of gettimeofday into
>>> account - which overhead can easily reach 10 usecs (the observed
>>> regression). Could you try to eliminate the gettimeofday overhead from
>>> your measurement?
>>>
>>> gettimeofday overhead is something that might have changed from .21 to .22
>>> on your box.
>>>
>>> Ingo
>> Hi, Ingo:
>>
>> In my pthread_sync code, I first call refer () subroutine which
>> actually establishes the elapsed time (reference time) for
>> non-synchronized delay() using the gettimeofday. Then each
>> synchronization overhead value is obtained by subtracting the
>> reference time from the elapsed time with introduced synchronization.
>> The effect of gettimeofday() should be minimal if the time difference
>> (overhead value) is the interest here. Unless the gettimeofday behaves
>> differently in the case of running 8 threads .vs. running 2 threads.
>>
>> I will try to replace gettimeofday with a lightweight timer call in my
>> test code. Thank you very much.
>
> gettimeofday overhead is around 10 usecs here:
>
> 2740 1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
> 2740 1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>
>
> and that's the only thing that is going on when computing the reference
> time - and i see a similar syscall pattern in the PARALLEL and BARRIER
> calculations as well (with no real scheduling going on).
>
> Ingo

Hi, Ingo:

I guess it is good news. I patched the 2.6.21.7 kernel with your CFS
patch, and the pthread_sync results are the same as on the non-patched
2.6.21 kernel. This means the performance difference is not related to
the scheduler. As for the overhead of gettimeofday, there is no
difference between 2.6.21 and 2.6.24-rc4: the reference time is around
10.5 us for both kernels.

So what changed between 2.6.21 and 2.6.22? Any hints? :-) Thank you very
much for all your help.

--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-11 15:52:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> Hi, Ingo:
>
> I guess it is a good news. I did patch 2.6.21.7 kernel using your cfs
> patch. The results of pthread_sync is the same as the non-patched
> 2.6.21 kernel. This means the performance of is not related to the
> scheduler. As for overhead of the gettimeofday, there is no difference
> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us
> for both kernel.

could you please paste again the relevant portion of the output you get
on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

> So what is changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
> very much for all your help.

we'll figure it out i'm sure :)

Ingo

2007-12-11 16:39:26

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> Hi, Ingo:
>>
>> I guess it is a good news. I did patch 2.6.21.7 kernel using your cfs
>> patch. The results of pthread_sync is the same as the non-patched
>> 2.6.21 kernel. This means the performance of is not related to the
>> scheduler. As for overhead of the gettimeofday, there is no difference
>> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us
>> for both kernel.
>
> could you please paste again the relevant portion of the output you get
> on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?
>
>> So what is changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
>> very much for all your help.
>
> we'll figure it out i'm sure :)
>
> Ingo

Hi, Ingo:

The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:

Computing reference time 1

Sample_size Average Min Max S.D. Outliers
20 10.489085 10.488800 10.491100 0.000539 1

Reference_time_1 = 10.489085 microseconds +/- 0.001057

Computing PARALLEL time

Sample_size Average Min Max S.D. Outliers
20 11.106580 11.105650 11.109700 0.001255 0

PARALLEL time = 11.106580 microseconds +/- 0.002460
PARALLEL overhead = 0.617590 microseconds +/- 0.003409

8 threads:
Computing reference time 1

Sample_size Average Min Max S.D. Outliers
20 10.488735 10.488500 10.490700 0.000484 1

Reference_time_1 = 10.488735 microseconds +/- 0.000948

Computing PARALLEL time

Sample_size Average Min Max S.D. Outliers
20 13.000647 12.991050 13.052700 0.012592 1

PARALLEL time = 13.000647 microseconds +/- 0.024680
PARALLEL overhead = 2.511907 microseconds +/- 0.025594


Output for Kernel 2.6.24-rc4 #1 SMP

2 threads:
Computing reference time 1

Sample_size Average Min Max S.D. Outliers
20 10.510535 10.508600 10.518200 0.002237 1

Reference_time_1 = 10.510535 microseconds +/- 0.004384

Computing PARALLEL time

Sample_size Average Min Max S.D. Outliers
20 19.668450 19.650200 19.679650 0.008052 0

PARALLEL time = 19.668450 microseconds +/- 0.015782
PARALLEL overhead = 9.157945 microseconds +/- 0.018217

8 threads:
Computing reference time 1

Sample_size Average Min Max S.D. Outliers
20 10.491285 10.490100 10.494900 0.001085 1

Reference_time_1 = 10.491285 microseconds +/- 0.002127

Computing PARALLEL time

Sample_size Average Min Max S.D. Outliers
20 13.090080 13.079150 13.131450 0.010995 1

PARALLEL time = 13.090080 microseconds +/- 0.021550
PARALLEL overhead = 2.598590 microseconds +/- 0.024534

For 8 threads, both kernels show similar performance numbers. But for 2
threads, 2.6.21 is much better than 2.6.24-rc4. Thank you.


--
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
[email protected]
###############################################

2007-12-11 21:24:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


* Jie Chen <[email protected]> wrote:

> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> kernel.

> 2 threads:

> PARALLEL time = 11.106580 microseconds +/- 0.002460
> PARALLEL overhead = 0.617590 microseconds +/- 0.003409

> Output for Kernel 2.6.24-rc4 #1 SMP

> PARALLEL time = 19.668450 microseconds +/- 0.015782
> PARALLEL overhead = 9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs
overhead, right? I dont see this myself on a Core2 CPU:

PARALLEL time = 10.446933 microseconds +/- 0.078849
PARALLEL overhead = 0.751732 microseconds +/- 0.177446

Ingo

2007-12-11 22:11:32

by Jie Chen

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4

Ingo Molnar wrote:
> * Jie Chen <[email protected]> wrote:
>
>> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
>> kernel.
>
>> 2 threads:
>
>> PARALLEL time = 11.106580 microseconds +/- 0.002460
>> PARALLEL overhead = 0.617590 microseconds +/- 0.003409
>
>> Output for Kernel 2.6.24-rc4 #1 SMP
>
>> PARALLEL time = 19.668450 microseconds +/- 0.015782
>> PARALLEL overhead = 9.157945 microseconds +/- 0.018217
>
> ok, so the problem is that this PARALLEL time has an additional +9 usecs
> overhead, right? I dont see this myself on a Core2 CPU:
>
> PARALLEL time = 10.446933 microseconds +/- 0.078849
> PARALLEL overhead = 0.751732 microseconds +/- 0.177446
>
> Ingo
Hi, Ingo:

Yes, the extra 9 usecs of overhead shows up when running two threads on
the 2.6.24 kernel on a machine with a total of 8 cores (2 quad-core
Opterons). How many cores in total does your machine have? I do not have
machines with dual quad-core Xeons here for a direct comparison. Thank
you.

--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [email protected]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

2007-12-12 12:49:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4


On Tue, 2007-12-11 at 17:11 -0500, Jie Chen wrote:
> Ingo Molnar wrote:
> > * Jie Chen <[email protected]> wrote:
> >
> >> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> >> kernel.
> >
> >> 2 threads:
> >
> >> PARALLEL time = 11.106580 microseconds +/- 0.002460
> >> PARALLEL overhead = 0.617590 microseconds +/- 0.003409
> >
> >> Output for Kernel 2.6.24-rc4 #1 SMP
> >
> >> PARALLEL time = 19.668450 microseconds +/- 0.015782
> >> PARALLEL overhead = 9.157945 microseconds +/- 0.018217
> >
> > ok, so the problem is that this PARALLEL time has an additional +9 usecs
> > overhead, right? I dont see this myself on a Core2 CPU:
> >
> > PARALLEL time = 10.446933 microseconds +/- 0.078849
> > PARALLEL overhead = 0.751732 microseconds +/- 0.177446
> >

On my dual socket AMD Athlon MP

2.6.20-13-generic

PARALLEL time = 22.751875 microseconds +/- 21.370942
PARALLEL overhead = 7.046595 microseconds +/- 24.370040

2.6.24-rc5

PARALLEL time = 17.365543 microseconds +/- 3.295133
PARALLEL overhead = 2.213722 microseconds +/- 4.797886


