On Wed, 6 Feb 2008 16:33:20 -0800 (PST)
[email protected] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9906
>
> Summary: Weird hang with NPTL and SIGPROF.
> Product: Process Management
> Version: 2.5
> KernelVersion: 2.6.24-rc4
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Scheduler
> AssignedTo: [email protected]
> ReportedBy: [email protected]
>
>
> Latest working kernel version: None
> Earliest failing kernel version: 2.6.18
> Distribution: Ubuntu
> Hardware Environment: Any
> Problem Description:
> I have a testcase that demonstrates a strange hang of the latest kernel
> (as well as previous ones). In the process of investigating the NPTL,
> we wrote a test that just creates a bunch of threads, then does a
> barrier wait to synchronize them all, after which everybody exits.
> That's all it does.
>
> This works fine under most circumstances. Unfortunately, we also want
> to do profiling, so we catch SIGPROF and turn on ITIMER_PROF. In this
> case, at somewhere between 4000 and 4500 threads, and using the NPTL,
> the system hangs. It's not a hard hang, interrupts are still working
> and clocks are ticking, but nothing is making progress. It becomes
> noticeable when the softlockup_tick() warning goes off after the
> watchdog has been starved long enough.
>
> Sometimes the system recovers and gets going again. Other times it
> doesn't. I've examined the state of things several times with kdb and
> there's certainly nothing obvious going on. Something, perhaps having
> to do with the scheduler, is certainly getting into a bad state, but I
> haven't yet been able to figure out what that is. I've even run it with
> KFT and have seen nothing obvious there, either, except for the fact
> that when it hangs it becomes obvious that it stops making progress and
> it begins to fill up with smp_apic_timer_interrupt() and do_softirq()
> entries. I've also seen smp_apic_timer_interrupt() appear twice or more
> on the stack, as if the previous run(s) didn't finish before the next
> tick happened.
>
> Steps to reproduce:
>
> I'll attach a testcase shortly.
>
It's probably better to handle this one via email, so please send that
testcase via reply-to-all to this email, thanks.
On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> It's probably better to handle this one via email, so please send that
> testcase via reply-to-all to this email, thanks.
Testcase attached.
Build with
gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
gcc -lpthread -o hangc-2 hangc-2.o
Run with
hangc-2 4500 4500
--
Frank Mayhar <[email protected]>
Google, Inc.
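For reference, a minimal sketch consistent with the description in the
report (a bunch of threads created, a barrier wait to synchronize them,
a SIGPROF handler installed and ITIMER_PROF enabled) might look like
the following. The single thread-count argument, the 10ms timer period,
and the small thread stacks are assumptions made for illustration; the
actual hangc-2.c attachment, which takes two arguments, may differ.

/* hang-sketch.c: hypothetical reconstruction of the described testcase.
 * Build: gcc -D_GNU_SOURCE hang-sketch.c -o hang-sketch -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static pthread_barrier_t barrier;
static volatile sig_atomic_t prof_ticks;

/* SIGPROF handler: just count profiling signals. */
static void on_sigprof(int sig)
{
	(void)sig;
	prof_ticks++;
}

/* Each thread synchronizes at the barrier, then exits. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&barrier);
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 4500;
	pthread_t *tids = malloc(nthreads * sizeof(*tids));
	pthread_attr_t attr;
	struct sigaction sa;
	struct itimerval it;

	/* Catch SIGPROF and turn on ITIMER_PROF, as in the report. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigprof;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGPROF, &sa, NULL);
	memset(&it, 0, sizeof(it));
	it.it_interval.tv_usec = 10000;	/* 10ms period (illustrative) */
	it.it_value.tv_usec = 10000;
	setitimer(ITIMER_PROF, &it, NULL);

	/* Small stacks so thousands of threads fit in the address space. */
	pthread_attr_init(&attr);
	pthread_attr_setstacksize(&attr, 64 * 1024);
	pthread_barrier_init(&barrier, NULL, nthreads + 1);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], &attr, worker, NULL);

	pthread_barrier_wait(&barrier);	/* release everyone at once */
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	printf("%d threads done, %d SIGPROF ticks\n", nthreads, (int)prof_ticks);
	free(tids);
	return 0;
}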
On Wed, 6 Feb 2008, Frank Mayhar wrote:
> On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> > It's probably better to handle this one via email, so please send that
> > testcase via reply-to-all to this email, thanks.
>
> Testcase attached.
>
> Build with
> gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
> gcc -lpthread -o hangc-2 hangc-2.o
>
> Run with
> hangc-2 4500 4500
FWIW this is not reproducible on 2.6.24/x86/CentOS-5.1. (I tried running
it nearly 1500 times in a loop.) Assuming that many tries are sufficient
to reproduce this bug, there seems to be something specific to the
environment/architecture/configuration that's necessary to trigger it.
It might be helpful to provide full details like glibc version, compiler
version, .config etc.
Parag
On Wed, 6 Feb 2008 21:57:38 -0500 (EST)
Parag Warudkar <[email protected]> wrote:
>
>
> On Wed, 6 Feb 2008, Frank Mayhar wrote:
>
> > On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> > > It's probably better to handle this one via email, so please send that
> > > testcase via reply-to-all to this email, thanks.
> >
> > Testcase attached.
> >
> > Build with
> > gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
> > gcc -lpthread -o hangc-2 hangc-2.o
> >
> > Run with
> > hangc-2 4500 4500
>
> FWIW this is not reproducible on 2.6.24/x86/CentOS-5.1. (I tried running
> it nearly 1500 times in a loop.) Assuming that many tries are sufficient
> to reproduce this bug, there seems to be something specific to the
> environment/architecture/configuration that's necessary to trigger it.
>
> It might be helpful to provide full details like glibc version, compiler
> version, .config etc.
I can reproduce it on my Ubuntu 7.10 system running kernel 2.6.24.
Note I use CC=gcc-4.2:
gcc --version
gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
Gnu C 4.1.3
Gnu make 3.81
binutils 2.18
util-linux 2.13
mount 2.13
module-init-tools 3.3-pre2
e2fsprogs 1.40.2
jfsutils 1.1.11
reiserfsprogs 3.6.19
pcmciautils 014
PPP 2.4.4
Linux C Library 2.6.1
Dynamic linker (ldd) 2.6.1
Procps 3.2.7
Net-tools 1.60
Kbd [option...][file
Console-tools 0.2.3
Sh-utils 5.97
udev 113
wireless-tools 29
Modules Loaded af_packet binfmt_misc rfcomm l2cap bluetooth ipv6 powernow_k8 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative nf_conntrack_ftp nf_conntrack_irc xt_tcpudp ipt_ULOG xt_limit xt_state iptable_filter nf_conntrack_ipv4 nf_conntrack ip_tables x_tables kvm_amd kvm w83627ehf hwmon_vid lp snd_hda_intel arc4 ecb blkcipher cryptomgr crypto_algapi snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401 snd_mpu401_uart snd_seq_dummy rt2500pci rt2x00pci rt2x00lib snd_seq_oss rfkill snd_seq_midi input_polldev snd_rawmidi crc_itu_t snd_seq_midi_event snd_seq mac80211 usbhid snd_timer snd_seq_device cfg80211 usblp ff_memless snd eeprom_93cx6 nvidia i2c_ali1535 i2c_ali15x3 evdev snd_page_alloc sr_mod cdrom button soundcore uli526x 8250_pnp 8250 serial_core i2c_core k8temp hwmon parport_pc parport pata_acpi pcspkr rtc floppy sg ata_generic ehci_hcd r8169 ohci_hcd usbcore unix thermal processor fan fuse
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc8
# Thu Jan 24 19:29:41 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_QUICKLIST is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_SUPPORTS_OPROFILE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_CGROUPS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=m
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"
CONFIG_PREEMPT_NOTIFIERS=y
#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_X86_VSMP is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
CONFIG_MK8=y
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_HPET_TIMER=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_NR_CPUS=2
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_NUMA is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MTRR=y
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
# CONFIG_HOTPLUG_CPU is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
#
# Power management options
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
CONFIG_SUSPEND_SMP_POSSIBLE=y
# CONFIG_SUSPEND is not set
CONFIG_HIBERNATION_SMP_POSSIBLE=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=m
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=m
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set
#
# shared options
#
CONFIG_X86_ACPI_CPUFREQ_PROC_INTF=y
# CONFIG_X86_SPEEDSTEP_LIB is not set
# CONFIG_CPU_IDLE is not set
#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_LEGACY is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set
#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
#
# Networking
#
CONFIG_NET=y
#
# Networking options
#
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=m
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_VERBOSE is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IP_VS is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
# CONFIG_IPV6_ROUTE_INFO is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETLABEL is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_BRIDGE_NETFILTER is not set
#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK_ENABLED=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_EVENTS is not set
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=m
# CONFIG_NF_CONNTRACK_H323 is not set
CONFIG_NF_CONNTRACK_IRC=m
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
#
# IP: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_IPRANGE=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_SAME=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m
#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_OWNER=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m
#
# Bridge: Netfilter Configuration
#
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
CONFIG_ATM_CLIP_NO_ICMP=y
CONFIG_ATM_LANE=m
CONFIG_ATM_MPOA=m
CONFIG_ATM_BR2684=m
CONFIG_ATM_BR2684_IPFILTER=y
CONFIG_BRIDGE=m
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
CONFIG_NET_SCHED=y
#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RR=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m
#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_CLS_POLICE is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m
#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
# CONFIG_BT_HCIUART_LL is not set
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIVHCI=m
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
#
# Wireless
#
CONFIG_CFG80211=m
CONFIG_NL80211=y
CONFIG_WIRELESS_EXT=y
CONFIG_MAC80211=m
CONFIG_MAC80211_RCSIMPLE=y
CONFIG_MAC80211_LEDS=y
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_DEBUG is not set
CONFIG_IEEE80211=y
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m
CONFIG_IEEE80211_SOFTMAC=m
# CONFIG_IEEE80211_SOFTMAC_DEBUG is not set
CONFIG_RFKILL=m
CONFIG_RFKILL_INPUT=m
CONFIG_RFKILL_LEDS=y
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
# CONFIG_NET_9P_DEBUG is not set
#
# Device Drivers
#
#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=m
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_PC_FIFO=y
CONFIG_PARPORT_PC_SUPERIO=y
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set
#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=8192
CONFIG_BLK_DEV_RAM_BLOCKSIZE=1024
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
CONFIG_CDROM_PKTCDVD_WCACHE=y
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_MISC_DEVICES is not set
CONFIG_EEPROM_93CX6=m
# CONFIG_IDE is not set
#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
# CONFIG_CHR_DEV_SCH is not set
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m
#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
CONFIG_SATA_SIL24=m
# CONFIG_SATA_SIS is not set
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
CONFIG_PATA_ACPI=m
CONFIG_PATA_ALI=y
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_JMICRON=m
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
CONFIG_PATA_NETCELL=m
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
CONFIG_PATA_VIA=m
# CONFIG_PATA_WINBOND is not set
# CONFIG_MD is not set
# CONFIG_FUSION is not set
#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_SBP2=m
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_NETDEVICES_MULTIQUEUE=y
# CONFIG_IFB is not set
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
CONFIG_VETH=m
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=m
#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
CONFIG_ICPLUS_PHY=m
CONFIG_FIXED_PHY=m
CONFIG_FIXED_MII_10_FDX=y
CONFIG_FIXED_MII_100_FDX=y
CONFIG_FIXED_MII_1000_FDX=y
CONFIG_FIXED_MII_AMNT=1
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
CONFIG_ULI526X=m
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_NET_PCI is not set
# CONFIG_B44 is not set
# CONFIG_NET_POCKET is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IP1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
CONFIG_R8169_NAPI=y
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set
#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
CONFIG_WLAN_80211=y
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_LIBERTAS is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set
# CONFIG_PRISM54 is not set
CONFIG_USB_ZD1201=m
CONFIG_RTL8187=m
# CONFIG_ADM8211 is not set
# CONFIG_P54_COMMON is not set
# CONFIG_IWLWIFI is not set
# CONFIG_HOSTAP is not set
# CONFIG_BCM43XX is not set
# CONFIG_B43 is not set
# CONFIG_B43LEGACY is not set
CONFIG_ZD1211RW=m
# CONFIG_ZD1211RW_DEBUG is not set
CONFIG_RT2X00=m
CONFIG_RT2X00_LIB=m
CONFIG_RT2X00_LIB_PCI=m
CONFIG_RT2X00_LIB_USB=m
CONFIG_RT2X00_LIB_FIRMWARE=y
CONFIG_RT2X00_LIB_RFKILL=y
# CONFIG_RT2400PCI is not set
CONFIG_RT2500PCI=m
CONFIG_RT2500PCI_RFKILL=y
# CONFIG_RT61PCI is not set
CONFIG_RT2500USB=m
CONFIG_RT73USB=m
# CONFIG_RT2X00_DEBUG is not set
#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_WAN is not set
# CONFIG_ATM_DRIVERS is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PLIP is not set
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
# CONFIG_PPPOL2TP is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=m
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=m
#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=m
CONFIG_INPUT_EVBUG=m
#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
CONFIG_KEYBOARD_XTKBD=m
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_VSXXXAA is not set
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_JOYSTICK_XPAD=m
CONFIG_JOYSTICK_XPAD_FF=y
CONFIG_JOYSTICK_XPAD_LEDS=y
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
CONFIG_INPUT_UINPUT=m
#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
CONFIG_SERIO_PCIPS2=m
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
# CONFIG_GAMEPORT_L4 is not set
CONFIG_GAMEPORT_EMU10K1=m
# CONFIG_GAMEPORT_FM801 is not set
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
# CONFIG_SERIAL_NONSTANDARD is not set
#
# Serial drivers
#
CONFIG_SERIAL_8250=m
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=m
CONFIG_SERIAL_8250_PNP=m
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=m
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
# CONFIG_PPDEV is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=m
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=m
CONFIG_RTC=m
CONFIG_GEN_RTC=m
CONFIG_GEN_RTC_X=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HPET_MMAP=y
CONFIG_HANGCHECK_TIMER=m
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m
#
# I2C Hardware Bus support
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
CONFIG_I2C_OCORES=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_TINY_USB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_EEPROM is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7470 is not set
CONFIG_SENSORS_K8TEMP=m
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
CONFIG_SENSORS_IT87=m
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83627HF is not set
CONFIG_SENSORS_W83627EHF=m
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
# CONFIG_WATCHDOG is not set
#
# Sonics Silicon Backplane
#
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
#
# Multifunction device drivers
#
# CONFIG_MFD_SM501 is not set
#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DVB_CORE is not set
# CONFIG_DAB is not set
#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_DRM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=m
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_UVESA=m
# CONFIG_FB_HECUBA is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_BACKLIGHT_CLASS_DEVICE=m
# CONFIG_BACKLIGHT_CORGI is not set
# CONFIG_BACKLIGHT_PROGEAR is not set
#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set
#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=m
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
CONFIG_FONT_8x16=y
# CONFIG_FONT_6x11 is not set
# CONFIG_FONT_7x14 is not set
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_10x18 is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
#
# Sound
#
CONFIG_SOUND=m
#
# Advanced Linux Sound Architecture
#
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_RTCTIMER=m
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
# CONFIG_SND_VERBOSE_PROCFS is not set
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
#
# Generic devices
#
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m
# CONFIG_SND_PORTMAN2X4 is not set
CONFIG_SND_SB_COMMON=m
CONFIG_SND_SB16_DSP=m
#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
CONFIG_SND_ALI5451=m
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
CONFIG_SND_CS5530=m
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
# CONFIG_SND_HDA_CODEC_ANALOG is not set
# CONFIG_SND_HDA_CODEC_SIGMATEL is not set
# CONFIG_SND_HDA_CODEC_VIA is not set
# CONFIG_SND_HDA_CODEC_ATIHDMI is not set
# CONFIG_SND_HDA_CODEC_CONEXANT is not set
# CONFIG_SND_HDA_CODEC_CMEDIA is not set
# CONFIG_SND_HDA_CODEC_SI3054 is not set
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
#
# System on Chip audio support
#
# CONFIG_SND_SOC is not set
#
# SoC Audio support for SuperH
#
#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
CONFIG_HIDRAW=y
#
# USB Input Devices
#
CONFIG_USB_HID=m
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
CONFIG_HID_FF=y
CONFIG_HID_PID=y
CONFIG_LOGITECH_FF=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_THRUSTMASTER_FF=y
CONFIG_ZEROPLUS_FF=y
CONFIG_USB_HIDDEV=y
#
# USB HID Boot Protocol drivers
#
CONFIG_USB_KBD=m
CONFIG_USB_MOUSE=m
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set
#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_PERSIST is not set
# CONFIG_USB_OTG is not set
#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
# CONFIG_USB_UHCI_HCD is not set
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#
#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
CONFIG_USB_LIBUSUAL=y
#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
CONFIG_USB_MON=y
#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set
#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
#
# USB DSL modem support
#
CONFIG_USB_ATM=m
CONFIG_USB_SPEEDTOUCH=m
# CONFIG_USB_CXACRU is not set
# CONFIG_USB_UEAGLEATM is not set
# CONFIG_USB_XUSBATM is not set
#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
# CONFIG_MMC is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m
#
# LED drivers
#
#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
CONFIG_RTC_DRV_TEST=m
#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
# CONFIG_RTC_DRV_DS1374 is not set
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
# CONFIG_RTC_DRV_M41T80 is not set
#
# SPI RTC drivers
#
#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=m
CONFIG_RTC_DRV_DS1553=m
# CONFIG_RTC_DRV_STK17TA8 is not set
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_M48T86=m
# CONFIG_RTC_DRV_M48T59 is not set
CONFIG_RTC_DRV_V3020=m
#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_INTEL is not set
CONFIG_KVM_AMD=m
#
# Userspace I/O
#
# CONFIG_UIO is not set
#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
CONFIG_JFS_FS=y
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
CONFIG_JFS_STATISTICS=y
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_DNOTIFY is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=m
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=850
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_CONFIGFS_FS=m
#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=m
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=y
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=m
CONFIG_CIFS_STATS=y
# CONFIG_CIFS_STATS2 is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
CONFIG_CIFS_EXPERIMENTAL=y
CONFIG_CIFS_UPCALL=y
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m
#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
CONFIG_LDM_PARTITION=y
CONFIG_LDM_DEBUG=y
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="cp850"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=m
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
CONFIG_NLS_CODEPAGE_860=m
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
CONFIG_NLS_ISO8859_3=m
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set
# CONFIG_INSTRUMENTATION is not set
#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_SAMPLES is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_IOMMU_DEBUG is not set
#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_CAPABILITIES=y
CONFIG_SECURITY_FILE_CAPABILITIES=y
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=m
CONFIG_CRYPTO_ABLKCIPHER=m
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_HASH=m
CONFIG_CRYPTO_MANAGER=m
CONFIG_CRYPTO_HMAC=m
CONFIG_CRYPTO_XCBC=m
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_XTS=m
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_CAMELLIA=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_HW is not set
#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:
> gcc --version
>
> gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
>
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>
> Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
>
So x86+SMP+GnuC-4.1.2+Glibc-2.5 = Not reproducible.
x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.
Not sure what the original reporter's $ARCH was.
So next thing worthwhile to try might be to disable PREEMPT and see if
that cures it.
Parag
On Thu, 7 Feb 2008, Parag Warudkar wrote:
> x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.
>
That should of course be
x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.
On Thu, 7 Feb 2008 10:56:16 -0500 (EST)
Parag Warudkar <[email protected]> wrote:
>
>
> On Thu, 7 Feb 2008, Parag Warudkar wrote:
>
> > x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.
> >
> That should of course be
> x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.
>
From my previous mail:
Note that I use CC=gcc-4.2
$gcc-4.2 --version
gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:
> On Thu, 7 Feb 2008 10:56:16 -0500 (EST)
> Parag Warudkar <[email protected]> wrote:
>
> >
> >
> > On Thu, 7 Feb 2008, Parag Warudkar wrote:
> >
> > > x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.
> > >
> > That should of course be
> > x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.
> >
> From my previous mail
>
> Note that I use CC=gcc-4.2
>
> $gcc-4.2 --version
>
> gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
>
Yep. I will enable PREEMPT and see if it reproduces for me.
Thanks
Parag
On Thu, 7 Feb 2008, Parag Warudkar wrote:
> Yep. I will enable PREEMPT and see if it reproduces for me.
Not reproducible with PREEMPT either.
Parag
On Thu, 2008-02-07 at 10:53 -0500, Parag Warudkar wrote:
>
> On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:
>
> > gcc --version
> >
> > gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> >
> > If some fields are empty or look unusual you may have an old version.
> > Compare to the current minimal requirements in Documentation/Changes.
> >
> > Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
> >
> So x86+SMP+GnuC-4.1.2+Glibc-2.5 = Not reproducible.
>
> x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.
>
> Not sure what the original reporter's $ARCH was.
Several, among which were i686+SMP+GnuC-4.0.3+Glibc-2.3.6. No PREEMPT.
Linux 2.6.18, 2.6.21 and 2.6.24-rc4.
--
Frank Mayhar <[email protected]>
Google, Inc.
On Thu, 2008-02-07 at 11:53 -0500, Parag Warudkar wrote:
> On Thu, 7 Feb 2008, Parag Warudkar wrote:
> > Yep. I will enable PREEMPT and see if it reproduces for me.
>
> Not reproducible with PREEMPT either.
Okay, here's an analysis of the problem and a potential solution. I
mentioned this in the bug itself but I'll repeat it here:
A couple of us here have been investigating this thing and have
concluded that the problem lies in the implementation of
run_posix_cpu_timers() and specifically in the quadratic nature of the
implementation. It calls check_process_timers() to sum the
utime/stime/sched_time (in 2.6.18.5, under another name in 2.6.24+) of
all threads in the thread group. This means that runtime there grows
with the number of threads. It can go through the list _again_ if and
when it decides to rebalance expiry times.
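To make the cost concrete, the summing in question is the usual
thread-group walk; a sketch of the 2.6.18-era pattern (essentially the
loop that the compat_sys_times() hunk later in this thread deletes):

	struct task_struct *t = tsk;
	cputime_t utime = cputime_zero;
	cputime_t stime = cputime_zero;
	unsigned long long sched_time = 0;

	/* One step per thread in the group, on every interesting tick. */
	do {
		utime = cputime_add(utime, t->utime);
		stime = cputime_add(stime, t->stime);
		sched_time += t->sched_time;
		t = next_thread(t);
	} while (t != tsk);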
After thinking through it, it seems clear that the critical number of
threads is that in which run_posix_cpu_timers() takes as long as or
longer than a tick to get its work done. The system makes progress to
that point but after that everything goes to hell as it gets further and
further behind. This explains all the symptoms we've seen, including
seeing run_posix_cpu_timers() at the top of a bunch of profiling stats
(I saw it get more than a third of overall processing time on a bunch of
tests, even where the system _didn't_ hang!). It explains the fact that
things get slow right before they go to hell and it explains why under
certain conditions the system can recover (if the threads have started
exiting by the time it hangs, for example).
I've come up with a potential fix for the problem. It does two things.
First, rather than summing the utime/stime/sched_time at interrupt, it
adds all of those times to new task_struct fields on the group leader,
then at interrupt just consults those fields; this avoids repeatedly
blowing the cache as well as a loop across all the threads.
Second, if there are more than 1000 threads in the process (as noted in
task->signal->live), it just punts all of the processing to a workqueue.
With these changes I've gone from a hang at 4500 (or fewer) threads to
running out of resources at more than 32000 threads on a single-CPU box.
When I've finished testing I'll polish the patch a bit and submit it to
the LKML but I thought you guys might want to know the state of things.
Oh, and one more note: This bug is also dependent on HZ, since it
matters how long a tick is. I've been running with HZ=1000. A faster
machine or one with HZ=100 would potentially need to generate a _lot_
more threads to see the hang.
--
Frank Mayhar <[email protected]>
Google, Inc.
Thanks for the detailed explanation and for bringing this to my attention.
This is a problem we knew about when I first implemented posix-cpu-timers
and process-wide SIGPROF/SIGVTALRM. I'm a little surprised it took this
long to become a problem in practice. I originally expected to have to
revisit it sooner than this, but I certainly haven't thought about it for
quite some time. I'd guess that HZ=1000 becoming common is what did it.
The obvious implementation for the process-wide clocks is to have the
tick interrupt increment shared utime/stime/sched_time fields in
signal_struct as well as the private task_struct fields. The all-threads
totals accumulate in the signal_struct fields, which would be atomic_t.
It's then trivial for the timer expiry checks to compare against those
totals.
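As a sketch, with field names that are hypothetical here (and with the
atomic_t requirement above glossed over for brevity), the tick path
would gain something like:

	/* In account_user_time(), next to the existing per-thread charge: */
	p->utime = cputime_add(p->utime, cputime);
	/* New: group-wide accumulator in signal_struct.  As noted above,
	 * this would really need to be atomic (or otherwise protected),
	 * since several CPUs can tick in threads of the same process. */
	p->signal->shared_utime = cputime_add(p->signal->shared_utime, cputime);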
The concern I had about this was multiple CPUs competing for the
signal_struct fields. (That is, several CPUs all running threads in the
same process.) If the ticks on each CPU are even close to synchronized,
then every single time all those CPUs will do an atomic_add on the same
word. I'm not any kind of expert on SMP and cache effects, but I know
this is bad. However bad it is, it's that bad all the time and however
few threads (down to 2) it's that bad for that many CPUs.
The implementation we have instead is obviously dismal for large numbers
of threads. I always figured we'd replace that with something based on
more sophisticated thinking about the CPU-clash issue.
I don't entirely follow your description of your patch. It sounds like it
should be two patches, though. The second of those patches (workqueue)
sounds like it could be an appropriate generic cleanup, or like it could
be a complication that might be unnecessary if we get a really good
solution to the main issue.
The first patch I'm not sure whether I understand what you said or not.
Can you elaborate? Or just post the unfinished patch as illustration,
marking it as not for submission until you've finished.
Thanks,
Roland
Put this on the patch but I'm emailing it as well.
On Mon, 2008-03-03 at 23:00 -0800, Roland McGrath wrote:
> Thanks for the detailed explanation and for bringing this to my attention.
You're quite welcome.
> This is a problem we knew about when I first implemented posix-cpu-timers
> and process-wide SIGPROF/SIGVTALRM. I'm a little surprised it took this
> long to become a problem in practice. I originally expected to have to
> revisit it sooner than this, but I certainly haven't thought about it for
> quite some time. I'd guess that HZ=1000 becoming common is what did it.
Well, the iron is getting bigger, too, so it's beginning to be feasible
to run _lots_ of threads.
> The obvious implementation for the process-wide clocks is to have the
> tick interrupt increment shared utime/stime/sched_time fields in
> signal_struct as well as the private task_struct fields. The all-threads
> totals accumulate in the signal_struct fields, which would be atomic_t.
> It's then trivial for the timer expiry checks to compare against those
> totals.
>
> The concern I had about this was multiple CPUs competing for the
> signal_struct fields. (That is, several CPUs all running threads in the
> same process.) If the ticks on each CPU are even close to synchronized,
> then every single time all those CPUs will do an atomic_add on the same
> word. I'm not any kind of expert on SMP and cache effects, but I know
> this is bad. However bad it is, it's that bad all the time and however
> few threads (down to 2) it's that bad for that many CPUs.
>
> The implementation we have instead is obviously dismal for large numbers
> of threads. I always figured we'd replace that with something based on
> more sophisticated thinking about the CPU-clash issue.
>
> I don't entirely follow your description of your patch. It sounds like it
> should be two patches, though. The second of those patches (workqueue)
> sounds like it could be an appropriate generic cleanup, or like it could
> be a complication that might be unnecessary if we get a really good
> solution to the main issue.
>
> The first patch I'm not sure whether I understand what you said or not.
> Can you elaborate? Or just post the unfinished patch as illustration,
> marking it as not for submission until you've finished.
My first patch did essentially what you outlined above, incrementing
shared utime/stime/sched_time fields, except that they were in the
task_struct of the group leader rather than in the signal_struct. It's
not clear to me exactly how the signal_struct is shared, whether it is
shared among all threads or if each has its own version.
So each timer routine had something like:
	/* If we're part of a thread group, add our time to the leader. */
	if (p->group_leader != NULL)
		p->group_leader->threads_sched_time += tmp;
and check_process_timers() had
	/* Times for the whole thread group are held by the group leader. */
	utime = cputime_add(utime, tsk->group_leader->threads_utime);
	stime = cputime_add(stime, tsk->group_leader->threads_stime);
	sched_time += tsk->group_leader->threads_sched_time;
Of course, this alone is insufficient. It speeds things up a tiny bit
but not nearly enough.
The other issue has to do with the rest of the processing in
run_posix_cpu_timers(), walking the timer lists and walking the whole
thread group (again) to rebalance expiry times. My second patch moved
all that work to a workqueue, but only if there were more than 100
threads in the process. This basically papered over the problem by
moving the processing out of interrupt and into a kernel thread. It's
still insufficient, though, because it takes just as long and will get
backed up just as badly on large numbers of threads. This was made
clear in a test I ran yesterday where I generated some 200,000 threads.
The work queue was unreasonably large, as you might expect.
I am looking for a way to do everything that needs to be done in fewer
operations, but unfortunately I'm not familiar enough with the
SIGPROF/SIGVTALRM semantics or with the details of the Linux
implementation to know where it is safe to consolidate things.
--
Frank Mayhar <[email protected]>
Google, Inc.
> My first patch did essentially what you outlined above, incrementing
> shared utime/stime/sched_time fields, except that they were in the
> task_struct of the group leader rather than in the signal_struct. It's
> not clear to me exactly how the signal_struct is shared, whether it is
> shared among all threads or if each has its own version.
There is a 1:1 correspondence between "shares signal_struct" and "member of
same thread group". signal_struct is the right place for such new fields.
Don't be confused by the existing fields utime, stime, gtime, and
sum_sched_runtime. All of those are accumulators only touched when a
non-leader thread dies (in __exit_signal), and governed by the siglock.
Their only purpose now is to represent the threads that are dead and gone
when calculating the cumulative total for the whole thread group. If you
were to provide cumulative totals that are updated on every tick, then
these old fields would not be needed.
> So each timer routine had something like:
>
> /* If we're part of a thread group, add our time to the leader. */
> if (p->group_leader != NULL)
> p->group_leader->threads_sched_time += tmp;
The task_struct.group_leader field is never NULL. Every thread is a member
of some thread group. The degenerate case is that it's the only member of
the group; then p->group_leader == p.
> and check_process_timers() had
>
> /* Times for the whole thread group are held by the group leader. */
> utime = cputime_add(utime, tsk->group_leader->threads_utime);
> stime = cputime_add(stime, tsk->group_leader->threads_stime);
> sched_time += tsk->group_leader->threads_sched_time;
>
> Of course, this alone is insufficient. It speeds things up a tiny bit
> but not nearly enough.
It sounds like you sped up only one of the sampling loops. Having a
cumulative total already on hand means cpu_clock_sample_group can also
become simple and cheap, as can the analogues in do_getitimer and
k_getrusage. These are what's used in clock_gettime and in the timer
manipulation calls, and in getitimer and getrusage. That's all just gravy.
The real benefit of having a cumulative total is for the basic logic of
run_posix_cpu_timers (check_process_timers) and the timer expiry setup.
It sounds like you didn't take advantage of the new fields for that.
When a cumulative total is on hand in the tick handler, then there is no
need at all to derive per-thread expiry times from group-wide CPU timers
("rebalance") either there or when arming the timer in the first place.
All of that complexity can just disappear from the implementation.
check_process_timers can look just like check_thread_timers, but
consulting the shared fields instead of the per-thread ones for both the
clock accumulators and the timers' expiry times. Likewise, arm_timer
only has to set signal->it_*_expires; process_timer_rebalance goes away.
If you do all that then the time spent in run_posix_cpu_timers should
not be affected at all by the number of threads. The only "walking the
timer lists" that happens is popping the expired timers off the head of
the lists that are kept in ascending order of expiry time. For each
flavor of timer, there are n+1 steps in the "walk" for n timers that
have expired. So no costs here should scale with the total number of
timers, just with the number of timers that all expire at the same time.
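(Sketched with the usual list idioms, where "timers" is the sorted
per-clock list head and "now" the current clock sample, that walk is
just:)

	struct cpu_timer_list *t;

	while (!list_empty(timers)) {
		t = list_entry(timers->next, struct cpu_timer_list, entry);
		if (cputime_lt(now, t->expires.cpu))
			break;		/* first unexpired timer: the +1 step */
		list_del_init(&t->entry);
		/* collect t on the firing list */
	}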
Back for a moment to the status quo and your second patch.
> The other issue has to do with the rest of the processing in
> run_posix_cpu_timers(), walking the timer lists and walking the whole
> thread group (again) to rebalance expiry times. My second patch moved
> all that work to a workqueue, but only if there were more than 100
> threads in the process. This basically papered over the problem by
> moving the processing out of interrupt and into a kernel thread. It's
> still insufficient, though, because it takes just as long and will get
> backed up just as badly on large numbers of threads. This was made
> clear in a test I ran yesterday where I generated some 200,000 threads.
> The work queue was unreasonably large, as you might expect.
What I would expect is that there be at most one item in the queue for
each process (thread group). If you have 200000 threads in one process,
you still only need one iteration of check_process_timers to run. If it
hasn't run by the time more threads in the same group get more ticks,
then all that matters is that it indeed runs once reasonably soon (for
an overall effect of not much less often than once per tick interval).
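(A sketch of the guard that would keep it to one item per group; the
flag and work fields here are hypothetical:)

	/* Only the first tick to notice pending expiry enqueues the work;
	 * the work function would clear the bit once check_process_timers
	 * has run for this group. */
	if (!test_and_set_bit(0, &sig->cpu_timer_work_pending))
		queue_work(cpu_timer_wq, &sig->cpu_timer_work);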
> I am looking for a way to do everything that needs to be done in fewer
> operations, but unfortunately I'm not familiar enough with the
> SIGPROF/SIGVTALRM semantics or with the details of the Linux
> implementation to know where it is safe to consolidate things.
I can help you with all of that. What I'll need from you is careful
performance analysis of all the effects of any changes we consider.
The simplifications I described above will obviously greatly improve
your test case (many threads and with some process timers expiring
pretty frequently). We need to consider and analyze the other kinds of
cases too. That is, cases with a few threads (not many more than the
number of CPUs); cases where no timer is close to expiring very often.
The most common cases, from one-thread cases to one-million thread
cases, are when no timers are going off before next Tuesday (if any are
set at all). Then run_posix_cpu_timers always bails out early, and none
of the costs you've seen become relevant at all. Any change to what the
timer interrupt handler does on every tick affects those cases too.
As I mentioned in my last message, my concern about this originally was
with the SMP cache/lock effects of multiple CPUs touching the same
memory in signal_struct on every tick (which presumably all tend to
happen simultaneously on all the CPUs). I'd insist that we have
measurements and analysis as thorough as possible of the effects of
introducing that frequent/synchronized sharing, before endorsing such
changes. I have a couple of guesses as to what might be reasonable ways
to mitigate that. But it needs a lot of measurement and wise opinion on
the low-level performance effects of each proposal.
Thanks,
Roland
On Tue, 2008-03-04 at 20:08 -0800, Roland McGrath wrote:
> check_process_timers can look just like check_thread_timers, but
> consulting the shared fields instead of the per-thread ones for both the
> clock accumulators and the timers' expiry times. Likewise, arm_timer
> only has to set signal->it_*_expires; process_timer_rebalance goes away.
Okay, my understanding of this is still evolving, so please (please!)
correct me when I get it wrong. I take what you're saying to mean that,
first, run_posix_cpu_timers() only needs to be run once per thread
group. It _sounds_ like it should be checking the shared fields rather
than the per-task fields for timer expiration (in fact, the more I think
about it the more sure I am that that's the case).
The old process_timer_rebalance() routine was intended to distribute the
remaining ticks across all the threads, so that the per-task fields
would cause run_posix_cpu_timers() to run at the appropriate time. With
it checking the shared fields, this is no longer necessary.
Since the shared fields are getting all the ticks, this will work for
per-thread timers as well.
The arm_timer() routine, instead of calling process_timer_rebalance(),
should just directly set signal->it_*_expires to the expiration time,
e.g.:
	switch (CPUCLOCK_WHICH(timer->it_clock)) {
	default:
		BUG();
	case CPUCLOCK_VIRT:
		if (!cputime_eq(p->signal->it_virt_expires,
				cputime_zero) &&
		    cputime_lt(p->signal->it_virt_expires,
			       timer->it.cpu.expires.cpu))
			break;
		p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
		goto rebalance;
	case CPUCLOCK_PROF:
		if (!cputime_eq(p->signal->it_prof_expires,
				cputime_zero) &&
		    cputime_lt(p->signal->it_prof_expires,
			       timer->it.cpu.expires.cpu))
			break;
		i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
		if (i != RLIM_INFINITY &&
		    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
			break;
		p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
		goto rebalance;
	case CPUCLOCK_SCHED:
		p->signal->it_sched_expires = timer->it.cpu.expires.sched;
		break;
	}
> If you do all that then the time spent in run_posix_cpu_timers should
> not be affected at all by the number of threads. The only "walking the
> timer lists" that happens is popping the expired timers off the head of
> the lists that are kept in ascending order of expiry time. For each
> flavor of timer, there are n+1 steps in the "walk" for n timers that
> have expired. So already no costs here should scale with the number of
> timers, just the with the number of timers that all expire at the same time.
It's still probably worthwhile to defer processing to a workqueue
thread, though, just because it's still a lot to do at interrupt. I'll
probably end up trying it both ways.
One thing that's still unclear to me is, if there were only one run of
run_posix_cpu_timers() per thread group per tick, how would per-thread
timers be serviced?
> The simplifications I described above will obviously greatly improve
> your test case (many threads and with some process timers expiring
> pretty frequently). We need to consider and analyze the other kinds of
> cases too. That is, cases with a few threads (not many more than the
> number of CPUs); cases where no timer is close to expiring very often.
> The most common cases, from one-thread cases to one-million thread
> cases, are when no timers are going off before next Tuesday (if any are
> set at all). Then run_posix_cpu_timers always bails out early, and none
> of the costs you've seen become relevant at all. Any change to what the
> timer interrupt handler does on every tick affects those cases too.
These are all on the roadmap, and in fact the null case should already
be covered. :-)
> As I mentioned in my last message, my concern about this originally was
> with the SMP cache/lock effects of multiple CPUs touching the same
> memory in signal_struct on every tick (which presumably all tend to
> happen simultaneously on all the CPUs). I'd insist that we have
> measurements and analysis as thorough as possible of the effects of
> introducing that frequent/synchronized sharing, before endorsing such
> changes. I have a couple of guesses as to what might be reasonable ways
> to mitigate that. But it needs a lot of measurement and wise opinion on
> the low-level performance effects of each proposal.
I've given this some thought. It seems clear that there's going to be
some performance penalty when multiple CPUs are competing trying to
update the same field at the tick. It would be much better if there
were cacheline-aligned per-cpu fields associated with either the task or
the signal structure; that way each CPU could update its own field
without competing with the others. Processing in run_posix_cpu_timers
would still be constant, although slightly higher for having to consult
multiple fields instead of just one. Not one per thread, though, just
one per CPU, a much smaller and fixed number.
--
Frank Mayhar <[email protected]>
Google, Inc.
Based on Roland's comments and from reading the source, I have a
possible fix. I'm posting the attached patch _not_ for submission but
_only_ for comment. For one thing it's based on 2.6.18.5 and for
another it hasn't had much testing yet. I wanted to get it out here for
comment, though, in case anyone can see where I might have gone wrong.
Comments, criticism and (especially!) testing enthusiastically
requested.
From my notes, this patch:
Replaces the utime, stime and sched_time fields in signal_struct with
shared_utime, shared_stime and shared_schedtime, respectively. It
also adds it_sched_expires to the signal struct.
Each place that loops through all threads in a thread group to sum
task->utime and/or task->stime now loads the value from
task->signal->shared_[us]time. This includes compat_sys_times(),
do_task_stat(), do_getitimer(), sys_times() and k_getrusage().
Certain routines that used task->signal->[us]time now use the shared
fields instead, which may change their semantics slightly. These
include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in
fs/proc/array.c), wait_task_zombie() and do_notify_parent().
The shared fields are updated at each tick, in update_cpu_clock()
(shared_schedtime), account_user_time() (shared_utime) and
account_system_time() (shared_stime). Each of these functions updates
the task-private field followed by the shared version in the signal
structure if one is present. Note that if different threads of the
same process are being run by different CPUs at the tick, there may
be serious cache contention here.
Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
First, run_posix_cpu_timers() decides whether a timer has expired by
consulting the it_*_expires and shared_* fields in the signal struct.
The check_process_timers() routine bases its computations on the new
shared fields, removing two loops through the threads. "Rebalancing"
is no longer required, the process_timer_rebalance() routine as
disappeared entirely and the arm_timer() routine merely fills
p->signal->it_*_expires from timer->it.cpu.expires.*. The
cpu_clock_sample_group_locked() loses its summing loops, consulting
the shared fields instead. Finally, set_process_cpu_timer() sets
tsk->signal->it_*_expires directly rather than calling the deleted
rebalance routine.
There are still a few open questions. In particular, it's possible
that cache contention on the tick update of the shared fields could
mean that the current scheme is not entirely sufficient. Further,
the semantics of the status-returning routines fill_prstatus(),
do_task_stat(), wait_task_zombie() and do_notify_parent() may no longer
follow standards. For that matter, ITIMER_PROF handling may be broken
entirely, although a brief test seems to show that it's working fine.
Stats:
fs/binfmt_elf.c | 18 +--
fs/proc/array.c | 6 -
include/linux/sched.h | 10 +-
kernel/exit.c | 13 --
kernel/fork.c | 25 +----
kernel/itimer.c | 18 ---
kernel/posix-cpu-timers.c | 224 ++++++++++------------------------------------
kernel/sched.c | 16 +++
kernel/signal.c | 6 -
kernel/sys.c | 17 ---
10 files changed, 105 insertions(+), 248 deletions(-)
--
Frank Mayhar <[email protected]>
Google, Inc.
On Fri, 2008-03-07 at 15:26 -0800, Frank Mayhar wrote:
> Based on Roland's comments and from reading the source, I have a
> possible fix. I'm posting the attached patch _not_ for submission but
> _only_ for comment. For one thing it's based on 2.6.18.5 and for
> another it hasn't had much testing yet. I wanted to get it out here for
> comment, though, in case anyone can see where I might have gone wrong.
> Comments, criticism and (especially!) testing enthusiastically
> requested.
The previous email was missing one small part of the patch, reproduced
below. Remain calm.
----------------------------------BEGIN------------------------------------
diff -urp /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c 2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c 2008-03-06 17:26:21.000000000 -0800
@@ -161,18 +161,11 @@ asmlinkage long compat_sys_times(struct
if (tbuf) {
struct compat_tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
+ utime = tsk->signal->shared_utime;
+ stime = tsk->signal->shared_stime;
/*
* While we have tasklist_lock read-locked, no dying thread
-----------------------------------END-------------------------------------
--
Frank Mayhar <[email protected]>
Google, Inc.
[I changed the subject and trimmed the CC list, as this is now quite far
away from the "some mysterious NPTL problem" subject. If anyone else
wanted to be individually CC'd, you can add them back in followups.]
> correct me when I get it wrong. I take what you're saying to mean that,
> first, run_posix_cpu_timers() only needs to be run once per thread group.
Not quite. check_process_timers only needs to be run once per thread
group (per interesting tick).
> It _sounds_ like it should be checking the shared fields rather than the
> per-task fields for timer expiration (in fact, the more I think about it
> the more sure I am that that's the case).
run_posix_cpu_timers does two things: thread CPU timers and process CPU
timers. The thread CPU timers track the thread CPU clocks, which are
what the per-thread fields in task_struct count. check_thread_timers
finds what thread CPU timers have fired. The task_struct.it_*_expires
fields are set when there are thread CPU timers set on those clocks.
The process CPU timers track the process CPU clocks. Each process CPU
clock (virt, prof, sched) is just the sum of the corresponding thread
CPU clock across all threads in the group. In the original code, these
clocks are never maintained in any storage as such, but sampled by
summing all the thread clocks when a current value is needed.
check_process_timers finds what process CPU timers have fired. The
signal_struct.it_*_expires fields are set when there are process CPU
timers set on those clocks.
The "rebalance" stuff also sets the task_struct.it_*_expires fields of
all the threads in the group when there are process CPU timers. So each
of these fields is really the minimum of the expiration time of the
earliest thread CPU timer and the "balanced" sample-and-update time
computed from the earliest process CPU timer's expiration time.
> The old process_timer_rebalance() routine was intended to distribute the
> remaining ticks across all the threads, so that the per-task fields
> would cause run_posix_cpu_timers() to run at the appropriate time. With
> it checking the shared fields this becomes no longer necessary.
This is true of check_process_timers.
> Since the shared fields are getting all the ticks, this will work for
> per-thread timers as well.
I do not follow your logic at all here. The signal_struct fields being
proposed track each process CPU clock's value. The thread CPU timers
react to the thread CPU clocks, not the process CPU clocks.
> The arm_timer() routine, instead of calling process_timer_rebalance(),
> should just directly set signal->it_*_expires to the expiration time,
Correct.
> It's still probably worthwhile to defer processing to a workqueue
> thread, though, just because it's still a lot to do at interrupt. I'll
> probably end up trying it both ways.
I think the natural place for it is run_timer_softirq. This is where
natural time timers run. The posix-cpu-timers processing is analogous
to the posix-timers processing, which runs via timer callbacks from
here. That roughly means doing it right after the interrupt normally,
and falling back to something similar to a workqueue when the load is
too heavy. In the common case, it doesn't entail any context switches,
as workqueues always do.
The issue with either the workqueue or the softirq approach is that it
means the call will sometimes (softirq) or always (workqueue) be made by
another task. Currently we always have current taking samples of its
own clocks and firing timers set on them. (Because of this you can't
actually use softirq in the simple fashion, i.e. moving the call to
run_posix_cpu_timers. It would only be guaranteed to run once per CPU
and wouldn't know which task it was supposed to look at. You'd have to
keep a per-CPU list of tasks pending consideration, i.e. a workqueue.)
I can't tell you off hand about serious complications of doing this work
from another task rather than by current on current. I think I might
have had some in mind when I did the original implementation, but I just
don't remember any more. It seems like a potential can of worms. My
first inclination is to do every other cleanup we like first before
touching this question.
Also note that the bulk of the work (and everything that's not O(1))
has to be done with interrupts disabled anyway. That's necessary to
take siglock. That lock both protects signal_struct, and it protects
the task_struct.cpu_timers lists. (You can do a cheap and lossy test
on current->signal->it_*_expires without taking the lock, for the
nothing-fires fast path.)
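(Roughly, that lossy test amounts to the following, assuming the
it_sched_expires field the patch adds to signal_struct:)

	struct signal_struct *sig = current->signal;

	/* Lockless and lossy, as noted: no expiry set means nothing to do. */
	if (cputime_eq(sig->it_prof_expires, cputime_zero) &&
	    cputime_eq(sig->it_virt_expires, cputime_zero) &&
	    sig->it_sched_expires == 0)
		return;		/* fast path out of run_posix_cpu_timers */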
> One thing that's still unclear to me is, if there were only one run of
> run_posix_cpu_timers() per thread group per tick, how would per-thread
> timers be serviced?
What I said is only actually necessary once per thread group is the work
that check_process_timers does. In the current style of code where
there is a loop through all the threads anyway, then you could in fact
weave in the check_thread_timers work there too and then all that would
only need to be done once per thread group per tick. (But I don't think
that's what I suggested last time.)
> I've given this some thought. It seems clear that there's going to be
> some performance penalty when multiple CPUs are competing trying to
> update the same field at the tick.
Indeed. That's why I said I would not endorse any patch that doesn't
address this up front, and show concrete measurements about this
overhead. (To a first approximation, the overhead on every tick for
every task in the system is always more important than the complications
for tasks that are using any CPU timers, including ITIMER_PROF and
ITIMER_VIRTUAL.)
> It would be much better if there were cacheline-aligned per-cpu fields
> associated with either the task or the signal structure; that way each
> CPU could update its own field without competing with the others.
> Processing in run_posix_cpu_timers would still be constant, although
> slightly higher for having to consult multiple fields instead of just
> one. Not one per thread, though, just one per CPU, a much smaller and
> fixed number.
Exactly this is the first idea I had about this. (I considered this in
the original implementation and decided for the first crack to err on
the side of no new code or data structures in the paths taken by every
thread, with the new hair only affecting processes that actually use any
process CPU timers.)
But this is not without its own issues. Currently on my configuration
(64-bit) the utime, stime, sum_sched_runtime fields (which now only
accumulate the contributions to process CPU clocks of threads that are
already dead) take 24 bytes. Throw in gtime (which is analogous in its
bookkeeping, but has no POSIX clock/timer interface to it) and call it
32 bytes. That's out of 904 bytes total in signal_struct.
On some common configurations, SMP_CACHE_BYTES is 128 and NR_CPUS is 64.
So the obvious static addition to signal_struct would bloat it by 8192
bytes (i.e. 904 to 9096, or more than 10x), of which 96*NR_CPUS (about
2/3 of the total) would be wasted even when you really have NR_CPUS
CPUs running. That is way
beyond the acceptable size (it's too big to even work right with the
kernel allocators), even if it weren't mostly wasted space.
This leads to some obvious follow-on ideas. With some fancy
footwork, you could use a pointer to a separately allocated chunk of
only num_possible_cpus() * SMP_CACHE_BYTES. You needn't allocate it
at all until the first timer is set on that process CPU clock. That
makes the bloat smaller, and limits it to the processes that actually
need to check on each tick.
With all of this properly encapsulated in a struct and some helper
functions, it would be simple to conditionalize the implementation.
For uniprocessor kernels, clearly it would be preferable just to use
the existing signal_struct fields. For NR_CPUS=2, it might be
reasonable to use the static aligned fields bloating signal_struct
(or perhaps for NR_CPUS * SMP_CACHE_BYTES < some threshold). etc.
An alternative notion is to have single shared fields per clock in
signal_struct but add to them only at context switch. If the threads
in the process don't all yield at the same time, then maybe that
works out ok for cache contention. It's not a well-developed idea.
This all adds up to me thinking there is no simple answer. I think
we need to consider several alternatives, and get real measurements
of their overheads in various workloads, etc. I am hesitant to pick
any "simple" changes to put in the kernel before we have examined the
real trade-offs fully.
Thanks,
Roland
On Tue, 2008-03-11 at 00:50 -0700, Roland McGrath wrote:
> > correct me when I get it wrong. I take what you're saying to mean that,
> > first, run_posix_cpu_timers() only needs to be run once per thread group.
> Not quite. check_process_timers only needs to be run once per thread
> group (per interesting tick).
Where "interesting tick" means "tick in which a process timer has
expired," correct?
> > It _sounds_ like it should be checking the shared fields rather than the
> > per-task fields for timer expiration (in fact, the more I think about it
> > the more sure I am that that's the case).
> run_posix_cpu_timers does two things: thread CPU timers and process CPU
> timers. The thread CPU timers track the thread CPU clocks, which are
> what the per-thread fields in task_struct count. check_thread_timers
> finds what thread CPU timers have fired. The task_struct.it_*_expires
> fields are set when there are thread CPU timers set on those clocks.
>
> The process CPU timers track the process CPU clocks. Each process CPU
> clock (virt, prof, sched) is just the sum of the corresponding thread
> CPU clock across all threads in the group. In the original code, these
> clocks are never maintained in any storage as such, but sampled by
> summing all the thread clocks when a current value is needed.
> check_process_timers finds what process CPU timers have fired. The
> signal_struct.it_*_expires fields are set when there are process CPU
> timers set on those clocks.
And my changes introduce these clocks as separate fields in the signal
struct, updated at the tick.
> > Since the shared fields are getting all the ticks, this will work for
> > per-thread timers as well.
>
> I do not follow your logic at all here. The signal_struct fields being
> proposed track each process CPU clock's value. The thread CPU timers
> react to the thread CPU clocks, not the process CPU clocks.
Okay, I hadn't been clear on the distinction between process-wide and
thread-only timers. So, really, run_posix_cpu_timers() needs to check
both sets, the versions in the signal struct for the process-wide timers
and the versions in the task struct for the thread-only timers.
> > It's still probably worthwhile to defer processing to a workqueue
> > thread, though, just because it's still a lot to do at interrupt. I'll
> > probably end up trying it both ways.
I'm going to table this for now. Based on my preliminary performance
results, the changes I've made mean that using a workqueue or softirq is
not necessary. In the profiles of a couple of testcases I've run,
run_posix_cpu_timers() didn't show up at all, whereas before my change
it was right at the top with ~35% of the time.
I'll hang on to your notes, though, for future reference.
> > One thing that's still unclear to me is, if there were only one run of
> > run_posix_cpu_timers() per thread group per tick, how would per-thread
> > timers be serviced?
> What I said is only actually necessary once per thread group is the work
> that check_process_timers does. In the current style of code where
> there is a loop through all the threads anyway, then you could in fact
> weave in the check_thread_timers work there too and then all that would
> only need to be done once per thread group per tick. (But I don't think
> that's what I suggested last time.)
So, check_process_timers() checks for and handles any expired timers for
the currently-running process, whereas check_thread_timers() checks for
and handles any expired timers for the currently-running thread. Is
that correct?
And, since these timers are only counting CPU time, if a thread is never
running at the tick (since that's how we account time in the first
place) any timers it might have will never expire. Sorry to repeat the
obvious but sometimes it's better to state things very explicitly.
At each tick a process-wide timer may have expired. Also, at each tick
a thread-only timer may have expired. Or, of course, both. So we need
to detect both events and fire the appropriate timer in the appropriate
context.
I think my current code (i.e. the patch I published for comment a few
days ago) does this, with one exception: If a thread-only timer expires
it _won't_ be detected when run_posix_cpu_timers() runs, since I'm only
checking the process-wide timers. This implies that I need to do twice
as many checks up front. I'll think about how to minimize that, though.
> > I've given this some thought. It seems clear that there's going to be
> > some performance penalty when multiple CPUs are competing trying to
> > update the same field at the tick.
> Indeed. That's why I said I would not endorse any patch that doesn't
> address this up front, and show concrete measurements about this
> overhead. (To a first approximation, the overhead on every tick for
> every task in the system is always more important than the complications
> for tasks that are using any CPU timers, including ITIMER_PROF and
> ITIMER_VIRTUAL.)
I've actually gotten a little bit of insight into this. I don't think
that a straight set of shared fields is sufficient except in UP (and
possibly dual-CPU) environments. I was able to run a reasonable test on
both a four-core Opteron system and a sixteen-core Opteron system.
(That's two dual-core CPUs and four four-core CPUs, no sixteen-core
Opteron chips here. :-) They had 1024K and 512K cache respectively. I
didn't have a chance to actually time the runs but I observed that the
sixteen-core run took substantially longer than the four-core run.
Something like double the time. While some part of this is likely
attributable to the smaller cache, the testcase was small enough that
that shouldn't have been a real issue. I'm pretty confident that it was
cache conflict among the sixteen cores that did the damage.
> > It would be much better if there were cacheline-aligned per-cpu fields
> > associated with either the task or the signal structure; that way each
> > CPU could update its own field without competing with the others.
> > Processing in run_posix_cpu_timers would still be constant, although
> > slightly higher for having to consult multiple fields instead of just
> > one. Not one per thread, though, just one per CPU, a much smaller and
> > fixed number.
> Exactly this is the first idea I had about this. (I considered this in
> the original implementation and decided for the first crack to err on
> the side of no new code or data structures in the paths taken by every
> thread, with the new hair only affecting processes that actually use any
> process CPU timers.)
> This leads to some obvious follow-on ideas. With some fancy
> footwork, you could use a pointer to a separately allocated chunk of
> only num_possible_cpus() * SMP_CACHE_BYTES. You needn't allocate it
> at all until the first timer is set on that process CPU clock. That
> makes the bloat smaller, and limits it to the processes that actually
> need to check on each tick.
I'm currently working on an implementation that uses the alloc_percpu()
mechanism and a separate structure. I'm encapsulating access to the
fields in shared_xxx_sum() inline functions, which could have different
implementations for UP, dual-CPU and generic SMP kernels. Each tick
then does something like:
	if (p->signal->shared_times) {	/* Set if timer running. */
		cpu = get_cpu();
		shared_times = per_cpu_ptr(p->signal->shared_times, cpu);
		shared_times->shared_stime =
			cputime_add(shared_times->shared_stime, cputime);
		put_cpu_no_resched();
	}
(Where "shared_times" is the structure encapsulating the shared fields.)
This adds overhead to sum the per-CPU values but means that multiple
CPUs updating at the same tick won't be competing for the cache line and
killing performance.
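The read side, a shared_xxx_sum() helper, is then a single loop over the
per-CPU copies; a sketch using the names from the snippet above:

	static inline cputime_t shared_stime_sum(struct signal_struct *sig)
	{
		cputime_t sum = cputime_zero;
		int i;

		if (!sig->shared_times)	/* no process timer ever set */
			return sum;
		for_each_possible_cpu(i)
			sum = cputime_add(sum,
				per_cpu_ptr(sig->shared_times, i)->shared_stime);
		return sum;
	}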
> An alternative notion is to have single shared fields per clock in
> signal_struct but add to them only at context switch. If the threads
> in the process don't all yield at the same time, then maybe that
> works out ok for cache contention. It's not a well-developed idea.
I'll keep this in mind.
> This all adds up to me thinking there is no simple answer. I think
> we need to consider several alternatives, and get real measurements
> of their overheads in various workloads, etc. I am hesitant to pick
> any "simple" changes to put in the kernel before we have examined the
> real trade-offs fully.
I personally think that the most promising approach is the one outlined
above (without considering the context-switch scheme for the moment).
It saves as much space as possible, doesn't penalize processes that
aren't using the posix timers and avoids cache contention between CPUs.
I'll go ahead and implement it and try to generate some numbers.
--
Frank Mayhar <[email protected]>
Google, Inc.
> > Not quite. check_process_timers only needs to be run once per thread
> > group (per interesting tick).
>
> Where "interesting tick" means "tick in which a process timer has
> expired," correct?
Or might have expired, in the current implementation style. Correct.
> > The process CPU timers track the process CPU clocks. [...]
>
> And my changes introduce these clocks as separate fields in the signal
> struct, updated at the tick.
Correct.
> Okay, I hadn't been clear on the distinction between process-wide and
> thread-only timers. So, really, run_posix_cpu_timers() needs to check
> both sets, the versions in the signal struct for the process-wide timers
> and the versions in the task struct for the thread-only timers.
Correct.
> I'm going to table this for now. [...]
Agreed.
> So, check_process_timers() checks for and handles any expired timers for
> the currently-running process, whereas check_thread_timers() checks for
> and handles any expired timers for the currently-running thread. Is
> that correct?
Correct.
> And, since these timers are only counting CPU time, if a thread is never
> running at the tick (since that's how we account time in the first
> place) any timers it might have will never expire.
Correct.
> At each tick a process-wide timer may have expired. Also, at each tick
> a thread-only timer may have expired. Or, of course, both. So we need
> to detect both events and fire the appropriate timer in the appropriate
> context.
Correct.
> [...] I'm pretty confident that it was
> cache conflict among the sixteen cores that did the damage.
I'm not surprised by this result. (I do want to see much more detailed
performance analysis before we decide on a final change.)
> I'm currently working on an implementation that uses the alloc_percpu()
> mechanism and a separate structure. I'm encapsulating access to the
> fields in shared_xxx_sum() inline functions, which could have different
> implementations for UP, dual-CPU and generic SMP kernels.
That is exactly what I had in mind. (I hadn't noticed alloc_percpu, and it
has one more level of indirection than I'd planned. But that wastes less
space when num_possible_cpus() is far greater than num_online_cpus(), and I
imagine it's vastly superior for NUMA.)
Don't forget do_[gs]etitimer and k_getrusage can use this too.
(Though maybe no reason to bother in k_getrusage since it has
to loop to sum the non-time counters anyway.)
> I personally think that the most promising approach is the one outlined
> above (without considering the context-switch scheme for the moment).
I tend to agree. It's the only plan I've thought through in detail.
But my remarks stand, about thorough analysis of performance impacts
of options we can think of.
Thanks,
Roland
After the recent conversation with Roland and after more testing, I have
another patch for review (although _not_ for submission, as again it's
against 2.6.18.5). This patch breaks the shared utime/stime/sched_time
fields out into their own structure which is allocated as needed via
alloc_percpu(). This avoids cache thrashing when running lots of
threads on lots of CPUs.
Please take a look and let me know what you think. In the meantime I'll
be working on a similar patch to 2.6-head that has optimizations for
uniprocessor and two-CPU operation, to avoid the overhead of the percpu
functions when they are unneeded.
This patch:
Replaces the utime, stime and sched_time fields in signal_struct with
the shared_times structure, which is cacheline aligned and allocated
when needed using the alloc_percpu() mechanism. There is one copy of
this structure per running CPU when it is being used.
Each place that loops through all threads in a thread group to sum
task->utime and/or task->stime now uses the shared_*_sum() inline
functions defined in sched.h to sum the per-CPU structures. This
includes compat_sys_times(), do_task_stat(), do_getitimer(),
sys_times() and k_getrusage().
Certain routines that used task->signal->[us]time now use the
shared_*_sum() functions instead, which may (but hopefully will not)
change their semantics slightly. These include fill_prstatus() (in
fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c),
wait_task_zombie() and do_notify_parent().
At each tick, update_cpu_clock(), account_user_time() and
account_system_time() update the relevant field of the shared_times
structure using a pointer obtained using per_cpu_ptr, with the effect
that these functions do not compete with one another for the cacheline.
Each of these functions updates the task-private field followed by the
shared_times version if one is present.
Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
First, run_posix_cpu_timers() decides whether a timer has expired by
consulting the it_*_expires fields in the task struct of the running
thread and the shared_*_sum() functions that cover the entire process.
The check_process_timers() routine bases its computations on the
shared structure, removing two loops through the threads. "Rebalancing"
is no longer required; the process_timer_rebalance() routine has
disappeared entirely and the arm_timer() routine merely fills
p->signal->it_*_expires from timer->it.cpu.expires.*. The
cpu_clock_sample_group_locked() routine loses its summing loops, using
the shared structure instead. Finally, set_process_cpu_timer() sets
tsk->signal->it_*_expires directly rather than calling the deleted
rebalance routine.
The only remaining open question is whether these changes break the
semantics of the status-returning routines fill_prstatus(),
do_task_stat(), wait_task_zombie() and do_notify_parent().
--
Frank Mayhar <[email protected]>
Google, Inc.
Sorry for the delay.
> Please take a look and let me know what you think. In the meantime I'll
> be working on a similar patch to 2.6-head that has optimizations for
> uniprocessor and two-CPU operation, to avoid the overhead of the percpu
> functions when they are unneeded.
My mention of a 2-CPU special case was just an off-hand idea. I don't
really have any idea if that would be optimal given the tradeoff of
increasing signal_struct size. The performance needs to be analyzed.
> disappeared entirely and the arm_timer() routine merely fills
> p->signal->it_*_expires from timer->it.cpu.expires.*. The
> cpu_clock_sample_group_locked() routine loses its summing loops, using
> the shared structure instead. Finally, set_process_cpu_timer() sets
> tsk->signal->it_*_expires directly rather than calling the deleted
> rebalance routine.
I think I misled you about the use of the it_*_expires fields, sorry.
The task_struct.it_*_expires fields are used solely as a cache of the
head of cpu_timers[]. Despite the poor choice of the same name, the
signal_struct.it_*_expires fields serve a different purpose. For an
analogous cache of the soonest timer to expire, you need to add new
fields. The signal_struct.it_{prof,virt}_{expires,incr} fields hold
the setitimer settings for ITIMER_{PROF,VTALRM}. You can't change
those in arm_timer. For a quick cache you need a new field that is
the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
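(A sketch for the PROF clock; the cache field at the end is the new,
hypothetical part:)

	/* Sooner of the setitimer setting and the earliest process timer. */
	cputime_t soonest = p->signal->it_prof_expires;
	struct list_head *head = &p->signal->cpu_timers[CPUCLOCK_PROF];

	if (!list_empty(head)) {
		struct cpu_timer_list *first =
			list_entry(head->next, struct cpu_timer_list, entry);
		if (cputime_eq(soonest, cputime_zero) ||
		    cputime_lt(first->expires.cpu, soonest))
			soonest = first->expires.cpu;
	}
	p->signal->prof_expires_cache = soonest;	/* hypothetical field */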
The shared_utime_sum et al names are somewhat oblique to anyone who
hasn't just been hacking on exactly this thing like you and I have.
Things like thread_group_*time make more sense.
There are now several places where you call both shared_utime_sum and
shared_stime_sum. It looks simple because they're nicely encapsulated.
But now you have two loops through all CPUs, and three loops in
check_process_timers.
I think what we want instead is this:
struct task_cputime
{
	cputime_t utime;
	cputime_t stime;
	unsigned long long schedtime;
};
Use one in task_struct to replace the utime, stime, and sum_sched_runtime
fields, and another to replace it_*_expires. Use a single inline function
thread_group_cputime() that fills a sum struct task_cputime using a single
loop. For the places only one or two of the sums is actually used, the
compiler should optimize away the extra summing from the loop.
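(A sketch of that single loop, assuming the percpu pointer in
signal_struct is named cputime:)

	static inline void thread_group_cputime(struct signal_struct *sig,
						struct task_cputime *times)
	{
		int i;

		times->utime = cputime_zero;
		times->stime = cputime_zero;
		times->schedtime = 0;
		for_each_possible_cpu(i) {
			struct task_cputime *cpu = per_cpu_ptr(sig->cputime, i);

			times->utime = cputime_add(times->utime, cpu->utime);
			times->stime = cputime_add(times->stime, cpu->stime);
			times->schedtime += cpu->schedtime;
		}
	}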
Don't use __cacheline_aligned on this struct type itself, because most of
the uses don't need that. When using alloc_percpu, you can rely on it to
take care of those needs--that's what it's for. If you implement a
variant that uses a flat array, you can use a wrapper struct with
__cacheline_aligned for that.
Thanks,
Roland
On Fri, 2008-03-21 at 00:18 -0700, Roland McGrath wrote:
> > Please take a look and let me know what you think. In the meantime I'll
> > be working on a similar patch to 2.6-head that has optimizations for
> > uniprocessor and two-CPU operation, to avoid the overhead of the percpu
> > functions when they are unneeded.
> My mention of a 2-CPU special case was just an off-hand idea. I don't
> really have any idea if that would be optimal given the tradeoff of
> increasing signal_struct size. The performance needs to be analyzed.
I would really like to just ignore the 2-cpu scenario and just have two
versions, the UP version and the n-way SMP version. It would make life,
and maintenance, simpler.
> > disappeared entirely and the arm_timer() routine merely fills
> > p->signal->it_*_expires from timer->it.cpu.expires.*. The
> > cpu_clock_sample_group_locked() loses its summing loops, using the
> > the shared structure instead. Finally, set_process_cpu_timer() sets
> > tsk->signal->it_*_expires directly rather than calling the deleted
> > rebalance routine.
> I think I misled you about the use of the it_*_expires fields, sorry.
> The task_struct.it_*_expires fields are used solely as a cache of the
> head of cpu_timers[]. Despite the poor choice of the same name, the
> signal_struct.it_*_expires fields serve a different purpose. For an
> analogous cache of the soonest timer to expire, you need to add new
> fields. The signal_struct.it_{prof,virt}_{expires,incr} fields hold
> the setitimer settings for ITIMER_{PROF,VTALRM}. You can't change
> those in arm_timer. For a quick cache you need a new field that is
> the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
Okay, I'll go back over this and make sure I got it right. It's
interesting, though, that my current patch (written without this
particular bit of knowledge) actually performs no differently from the
existing mechanism.
From my handy four-core AMD64 test system running 2.6.18.5, the old
kernel gets:
./nohangc-3 1300 200000
Interval timer off.
Threads: 1300
Max prime: 200000
Elapsed: 95.421s
Execution: User 356.001s, System 0.029s, Total 356.030s
Context switches: vol 1319, invol 7402
./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads: 1300
Max prime: 200000
Elapsed: 131.457s
Execution: User 435.037s, System 59.495s, Total 494.532s
Context switches: vol 1464, invol 10123
Ticks: 22612, tics/sec 45.724, secs/tic 0.022
(More than 1300 threads hangs the old kernel with this test.)
With my patch it gets:
./nohangc-3 1300 200000
Interval timer off.
Threads: 1300
Max prime: 200000
Elapsed: 94.097s
Execution: User 366.000s, System 0.052s, Total 366.052s
Context switches: vol 1336, invol 28928
./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads: 1300
Max prime: 200000
Elapsed: 93.583s
Execution: User 366.117s, System 0.047s, Total 366.164s
Context switches: vol 1323, invol 28875
Ticks: 12131, tics/sec 33.130, secs/tic 0.030
Also see below.
> The shared_utime_sum et al names are somewhat oblique to anyone who
> hasn't just been hacking on exactly this thing like you and I have.
> Things like thread_group_*time make more sense.
In the latest cut I've named them "process_*" but "thread_group" makes
more sense.
> There are now several places where you call both shared_utime_sum and
> shared_stime_sum. It looks simple because they're nicely encapsulated.
> But now you have two loops through all CPUs, and three loops in
> check_process_timers.
Good point, although so far it's been undetectable in my performance
testing. (I can't say that it will stay that way down the road a bit,
when we have systems with large numbers of cores.)
> I think what we want instead is this:
>
> struct task_cputime
> {
> 	cputime_t utime;
> 	cputime_t stime;
> 	unsigned long long schedtime;
> };
>
> Use one in task_struct to replace the utime, stime, and sum_sched_runtime
> fields, and another to replace it_*_expires. Use a single inline function
> thread_group_cputime() that fills a sum struct task_cputime using a single
> loop. For the places only one or two of the sums is actually used, the
> compiler should optimize away the extra summing from the loop.
Excellent idea! This method hadn't occurred to me since I was looking
at it from the viewpoint of the existing structure and keeping the
fields separated, but this makes more sense.
> Don't use __cacheline_aligned on this struct type itself, because most of
> the uses don't need that. When using alloc_percpu, you can rely on it to
> take care of those needs--that's what it's for. If you implement a
> variant that uses a flat array, you can use a wrapper struct with
> __cacheline_aligned for that.
Yeah, I had caught that one.
FYI, I've attached the latest version of the 2.6.18 patch; you might
want to take a look as it has changed a bit. I generated some numbers
as well (from a new README):
Testing was performed using a heavily-modified version of the test
that originally showed the problem. The test sets ITIMER_PROF (if
not run with "nohang" in the name of the executable) and catches
the SIGPROF signal (in any event), then starts some number of threads,
each of which computes the prime numbers up to a given maximum (this
function was lifted from the "cpu" benchmark of sysbench version
0.4.8). It takes as parameters the number of threads to create and
the maximum value for the prime number calculation. It starts the
threads, calls pthread_barrier_wait() to wait for them to complete and
rendezvous, then joins the threads. It uses gettimeofday() to get
the time and getrusage() to get resource usage before and after the
threads run and reports the number of threads, the difference in
elapsed time, user and system CPU time and in the number of voluntary
and involuntary context switches, and the total number of SIGPROF
signals received (this will be zero if the test is run as "nohang").
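(For reference, the timer setup such a test performs boils down to this
sketch; the real harness adds the thread creation, barrier wait and
reporting described above:)

	#include <signal.h>
	#include <sys/time.h>

	static volatile sig_atomic_t sigprof_count;

	static void on_prof(int sig)
	{
		sigprof_count++;	/* total reported at the end of a run */
	}

	int main(void)
	{
		struct sigaction sa = { .sa_handler = on_prof };
		struct itimerval it = {
			.it_interval = { 0, 10000 },	/* 0.010 sec */
			.it_value    = { 0, 10000 },
		};

		sigaction(SIGPROF, &sa, NULL);
		setitimer(ITIMER_PROF, &it, NULL);
		/* ... create threads, pthread_barrier_wait(), join ... */
		return 0;
	}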
On a four-core AMD64 system (two dual-core AMD64s), for 1300 threads
(more than that hung the kernel) and a max prime of 120,000, the old
kernel averaged roughly 70s elapsed, with about 240s user CPU and 35s
system CPU, with the profile timer ticking about every 0.02s. The new
kernel averaged roughly 45s elapsed, with about 181s user CPU and 0.04s
system CPU, and with the profile timer ticking about every 0.01s.
On a sixteen-core system (four quad-core AMD64s), for 1300 threads as
above but with a max prime of 300,000, the old kernel averaged roughly
65s elapsed, with about 600s user CPU and 91s system CPU, with the
profile timer ticking about every 0.02s. The new kernel averaged
roughly 70s elapsed, with about 239s user CPU and 35s system CPU,
and with the profile timer ticking about every 0.02s.
On the same sixteen-core system, 100,000 threads with a max prime of
100,000 ran in roughly 975s elapsed, with about 5,538s user CPU and
751s system CPU, with the profile timer ticking about every 0.025s.
In summary, the performance of the kernel with the fix is comparable to
the performance without it, with the advantage that many threads will
no longer hang the system.
The patch is attached.
--
Frank Mayhar <[email protected]>
Google, Inc.
On Fri, 2008-03-21 at 00:18 -0700, Roland McGrath wrote:
> I think I misled you about the use of the it_*_expires fields, sorry.
> The task_struct.it_*_expires fields are used solely as a cache of the
> head of cpu_timers[]. Despite the poor choice of the same name, the
> signal_struct.it_*_expires fields serve a different purpose. For an
> analogous cache of the soonest timer to expire, you need to add new
> fields. The signal_struct.it_{prof,virt}_{expires,incr} fields hold
> the setitimer settings for ITIMER_{PROF,VTALRM}. You can't change
> those in arm_timer. For a quick cache you need a new field that is
> the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
Actually, after looking at the code again and thinking about it a bit,
it appears that the signal_struct.it_*_incr field holds the actual
interval as set by setitimer. Initially the it_*_expires field holds
the expiration time as set by setitimer, but after the timer fires the
first time, that value becomes <firing time>+it_*_incr. In other words,
the first time it fires at the value set by setitimer(), but from then
on it fires at the time of the previous firing plus the value in
it_*_incr. This time is stored in signal_struct.it_*_expires.
I guess I could be wrong about this, but it appears to be what the code
is doing. If my analysis is correct, I really don't need a new field,
since the old fields work just fine.
--
Frank Mayhar <[email protected]>
Google, Inc.
> I would really like to just ignore the 2-cpu scenario and just have two
> versions, the UP version and the n-way SMP version. It would make life,
> and maintenance, simpler.
Like I've said, it's only something to investigate for best performance.
If the conditional code is encapsulated well, it will be simple to add
another variant later and experiment with it.
> Okay, I'll go back over this and make sure I got it right. It's
> interesting, though, that my current patch (written without this
> particular bit of knowledge) actually performs no differently from the
> existing mechanism.
Except for correctness in scenarios other than the one you are testing. :-)
> Testing was performed using a heavily-modified version of the test
> that originally showed the problem. The test sets ITIMER_PROF (if
> not run with "nohang" in the name of the executable) [...]
There are several important scenarios you did not test.
Analysis of combinations of all these variables is needed.
1. Tests with a few threads, like as many threads as CPUs or only 2x as many.
2. Tests with a process CPU timer set for a long expiration time.
i.e. a timer set, but that never goes off in your entire run.
(This is what a non-infinity RLIMIT_CPU limit does.)
With the old code, a long enough timer and a small enough number
of threads will never trigger a "rebalance".
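(An editorial sketch of scenario 2, illustrative only and not part of
any test posted in this thread: arm a CPU limit far beyond the expected
runtime so the process CPU timer is set but never goes off.)

	#include <stdio.h>
	#include <sys/resource.h>

	int main(void)
	{
		/* A one-hour CPU limit on a run that consumes far less:
		 * the RLIMIT_CPU timer is armed but never expires. */
		struct rlimit rl = { .rlim_cur = 3600, .rlim_max = 3600 };

		if (setrlimit(RLIMIT_CPU, &rl) != 0) {
			perror("setrlimit");
			return 1;
		}
		/* ... start the threaded workload here ... */
		return 0;
	}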
> Actually, after looking at the code again and thinking about it a bit,
> it appears that the signal_struct.it_*_incr field holds the actual
> interval as set by setitimer. Initially the it_*_expires field holds
> the expiration time as set by setitimer, but after the timer fires the
> first time, that value becomes <firing time>+it_*_incr. In other words,
> the first time it fires at the value set by setitimer(), but from then
> on it fires at the time of the previous firing plus the value in
> it_*_incr. This time is stored in signal_struct.it_*_expires.
That's correct. The it_*_expires fields store itimerval.it_value (the
current timer) and the it_*_incr fields store itimerval.it_interval (the
timer reload setting).
> I guess I could be wrong about this, but it appears to be what the code
> is doing. If my analysis is correct, I really don't need a new field,
> since the old fields work just fine.
The analysis above is correct but your conclusion here is wrong.
The current value of an itimer is a user feature, not just a piece
of internal bookkeeping.
getitimer returns in it_value the amount of time until the itimer
fires, regardless of whether or not it will reload after it fires or
with what value it will be reloaded. In a setitimer call, the
it_value sets the time at which the itimer must fire, regardless of
the reload setting in it_interval. Consider the case where
it_interval={0,0}; it_value is still meaningful.
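(An editorial userspace illustration of that point: a one-shot itimer
has an it_interval of zero, yet its it_value is fully meaningful and is
exactly what getitimer() reports back.)

	#include <stdio.h>
	#include <sys/time.h>

	int main(void)
	{
		/* One-shot profiling timer: fires once after 5s of process
		 * CPU time; it_interval = {0,0} means it never reloads. */
		struct itimerval set = {
			.it_interval = { 0, 0 },
			.it_value    = { 5, 0 },
		};
		struct itimerval cur;

		setitimer(ITIMER_PROF, &set, NULL);
		getitimer(ITIMER_PROF, &cur);
		/* cur.it_value counts down toward the firing point even
		 * though there is no reload interval. */
		printf("remaining: %ld.%06ld s\n",
		       (long)cur.it_value.tv_sec,
		       (long)cur.it_value.tv_usec);
		return 0;
	}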
Your code causes any timer_settime or timer_delete call on a process
CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
itimer setting just as if the user had made some setitimer call that
was never made or intended. That's wrong.
Thanks,
Roland
On Sat, 2008-03-22 at 14:58 -0700, Roland McGrath wrote:
> > I would really like to just ignore the 2-cpu scenario and just have two
> > versions, the UP version and the n-way SMP version. It would make life,
> > and maintenance, simpler.
> Like I've said, it's only something to investigate for best performance.
> If the conditional code is encapsulated well, it will be simple to add
> another variant later and experiment with it.
Well, if it's acceptable, for a first cut (and the patch I'll submit),
I'll handle the UP and SMP cases, encapsulating them in sched.h in such
a way as to make it invisible (as much as is possible) to the rest of
the code.
> There are several important scenarios you did not test.
> Analysis of combinations of all these variables is needed.
> 1. Tests with a few threads, like as many threads as CPUs or only 2x as many.
I've actually done this, although I didn't find the numbers particularly
interesting. I'll do it again and keep the numbers, though.
> 2. Tests with a process CPU timer set for a long expiration time.
> i.e. a timer set, but that never goes off in your entire run.
> (This is what a non-infinity RLIMIT_CPU limit does.)
> With the old code, a long enough timer and a small enough number
> of threads will never trigger a "rebalance".
I'll do this at some point.
> > I guess I could be wrong about this, but it appears to be what the code
> > is doing. If my analysis is correct, I really don't need a new field,
> > since the old fields work just fine.
>
> The analysis above is correct but your conclusion here is wrong.
> The current value of an itimer is a user feature, not just a piece
> of internal bookkeeping.
After looking at the code again, I now understand what you're talking
about. You overloaded it_*_expires to support both the POSIX interval
timers and RLIMIT_CPU. So the way I have things, setting one can stomp
the other.
> Your code causes any timer_settime or timer_delete call on a process
> CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
> itimer setting just as if the user had made some setitimer call that
> was never made or intended. That's wrong.
Right, because the original effect was to only set the it_*_expires on
each individual task struct, leaving the one in the signal struct alone.
Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
than rolling it into the itimer handling?
--
Frank Mayhar <[email protected]>
Google, Inc.
On Mon, 2008-03-24 at 10:34 -0700, Frank Mayhar wrote:
> On Sat, 2008-03-22 at 14:58 -0700, Roland McGrath wrote:
> > The analysis above is correct but your conclusion here is wrong.
> > The current value of an itimer is a user feature, not just a piece
> > of internal bookkeeping.
>
> After looking at the code again, I now understand what you're talking
> about. You overloaded it_*_expires to support both the POSIX interval
> timers and RLIMIT_CPU. So the way I have things, setting one can stomp
> the other.
>
> > Your code causes any timer_settime or timer_delete call on a process
> > CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
> > itimer setting just as if the user had made some setitimer call that
> > was never made or intended. That's wrong.
>
> Right, because the original effect was to only set the it_*_expires on
> each individual task struct, leaving the one in the signal struct alone.
>
> Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
> than rolling it into the itimer handling?
Okay, my proposed fix for this is to introduce a new field in
signal_struct, rlim_expires, a cputime_t. Everywhere that the
RLIMIT_CPU code formerly set it_prof_expires, it will now set
rlim_expires, and in run_posix_cpu_timers() I'll check it against the
thread group's profiling time.
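Roughly, the check in run_posix_cpu_timers() would then look something
like this (a sketch assuming the proposed rlim_expires field and a
thread_group_cputime() helper that fills in the summed group times; not
the final patch):

	struct thread_group_cputime times;
	cputime_t tg_prof;

	thread_group_cputime(&times, tsk->signal);
	tg_prof = cputime_add(times.utime, times.stime);

	/* RLIMIT_CPU is now tracked separately from the itimer fields. */
	if (!cputime_eq(tsk->signal->rlim_expires, cputime_zero) &&
	    cputime_ge(tg_prof, tsk->signal->rlim_expires)) {
		/* CPU limit reached; deliver SIGXCPU and so on. */
	}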
I believe that that will solve the problem, if I understand this
correctly. If I don't, I trust that you will set me straight. :-)
--
Frank Mayhar <[email protected]>
Google, Inc.
This is my official first cut at a patch that will fix bug 9906, "Weird
hang with NPTL and SIGPROF." The problem is that run_posix_cpu_timers()
repeatedly walks the entire thread group every time it runs, which
happens at interrupt time. With heavy load and lots of threads, this can
take longer than the tick, at which point the kernel stops doing anything
but servicing clock ticks and the occasional interrupt. Many thanks to
Roland McGrath for his help in my attempt to understand his code.
The change adds a new structure to the signal_struct,
thread_group_cputime. On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism. It is manipulated via a set of inline functions and macros
defined in sched.h, thread_group_times_init(),
thread_group_times_free(), thread_group_times_alloc(),
thread_group_update() (the macro) and thread_group_cputime(). The
thread_group_update macro is used to update a single field of the
thread_group_cputime structure when needed; the thread_group_cputime()
function sums the fields for each cpu into a passed structure.
In the uniprocessor case, the thread_group_cputime structure becomes a
simple substructure of the signal_struct; allocation and freeing go
away, and updating and "summing" become simple assignments.
In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT_CPU handling, replacing it with a new field,
rlim_expires, which is checked instead. This makes the code simpler and
more straightforward.
I've made some decisions in this work that could have gone in different
directions and I'm certainly happy to entertain comments and criticisms.
Performance with this fix is at least as good as before and in a few
cases is slightly improved, possibly due to the reduced tick overhead.
Signed-off-by: Frank Mayhar <[email protected]>
include/linux/sched.h | 172 ++++++++++++++++++++++++++++
kernel/compat.c | 31 ++++--
kernel/fork.c | 22 +---
kernel/itimer.c | 40 ++++---
kernel/posix-cpu-timers.c | 271 +++++++++++++--------------------------------
kernel/sched.c | 4 +
kernel/sched_fair.c | 2 +
kernel/sched_rt.c | 2 +
kernel/sys.c | 41 ++++---
security/selinux/hooks.c | 4 +-
10 files changed, 333 insertions(+), 256 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fed07d0..8d1b19d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -424,6 +424,18 @@ struct pacct_struct {
};
/*
+ * This structure contains the versions of utime, stime and sum_exec_runtime
+ * that are shared across threads within a process. It's only used for
+ * interval timers and is allocated via alloc_percpu() in the signal
+ * structure when such a timer is set up.
+ */
+struct thread_group_cputime {
+ cputime_t utime;
+ cputime_t stime;
+ unsigned long long sum_exec_runtime;
+};
+
+/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
@@ -468,6 +480,12 @@ struct signal_struct {
cputime_t it_prof_expires, it_virt_expires;
cputime_t it_prof_incr, it_virt_incr;
+ /* Scheduling timer for the process */
+ unsigned long long it_sched_expires;
+
+ /* RLIMIT_CPU timer for the process */
+ cputime_t rlim_expires;
+
/* job control IDs */
/*
@@ -492,6 +510,13 @@ struct signal_struct {
struct tty_struct *tty; /* NULL if no tty */
+ /* Process-wide times for POSIX interval timing. Per CPU. */
+#ifdef CONFIG_SMP
+ struct thread_group_cputime *thread_group_times;
+#else
+ struct thread_group_cputime thread_group_times;
+#endif
+
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1978,6 +2003,153 @@ static inline int spin_needbreak(spinlock_t *lock)
#endif
}
+#define thread_group_runtime_add(a, b) ((a) + (b))
+
+#ifdef CONFIG_SMP
+
+static inline void thread_group_times_init(struct signal_struct *sig)
+{
+ sig->thread_group_times = NULL;
+}
+
+static inline void thread_group_times_free(struct signal_struct *sig)
+{
+ if (sig->thread_group_times)
+ free_percpu(sig->thread_group_times);
+}
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the current
+ * values of the fields. Called from do_setitimer() when setting an interval
+ * timer (ITIMER_PROF or ITIMER_VIRTUAL). Assumes interrupts are enabled when
+ * it's called. Note that there is no corresponding deallocation done from
+ * do_setitimer(); the structure is freed at process exit.
+ */
+static inline int thread_group_times_alloc(struct task_struct *tsk)
+{
+ struct signal_struct *sig = tsk->signal;
+ struct thread_group_cputime *thread_group_times;
+ struct task_struct *t;
+ cputime_t utime, stime;
+ unsigned long long sum_exec_runtime;
+
+ /*
+ * If we don't already have a thread_group_cputime struct, allocate
+ * one and fill it in with the accumulated times.
+ */
+ if (sig->thread_group_times)
+ return 0;
+ thread_group_times = alloc_percpu(struct thread_group_cputime);
+ if (thread_group_times == NULL)
+ return -ENOMEM;
+ read_lock(&tasklist_lock);
+ spin_lock_irq(&tsk->sighand->siglock);
+ if (sig->thread_group_times) {
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ free_percpu(thread_group_times);
+ return 0;
+ }
+ sig->thread_group_times = thread_group_times;
+ utime = sig->utime;
+ stime = sig->stime;
+ sum_exec_runtime = tsk->se.sum_exec_runtime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ sum_exec_runtime += t->se.sum_exec_runtime;
+ } while_each_thread(tsk, t);
+ thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+ thread_group_times->utime = utime;
+ thread_group_times->stime = stime;
+ thread_group_times->sum_exec_runtime = sum_exec_runtime;
+ put_cpu_no_resched();
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ return 0;
+}
+
+#define thread_group_update(sig, field, val, op) ({ \
+ if (sig && sig->thread_group_times) { \
+ int cpu; \
+ struct thread_group_cputime *thread_group_times; \
+ \
+ cpu = get_cpu(); \
+ thread_group_times = \
+ per_cpu_ptr(sig->thread_group_times, cpu); \
+ thread_group_times->field = \
+ op(thread_group_times->field, val); \
+ put_cpu_no_resched(); \
+ } \
+})
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
+ struct signal_struct *sig)
+{
+ int i;
+ struct thread_group_cputime *tg_times;
+ cputime_t utime = cputime_zero;
+ cputime_t stime = cputime_zero;
+ unsigned long long sum_exec_runtime = 0;
+
+ if (!sig->thread_group_times)
+ return(0);
+ for_each_online_cpu(i) {
+ tg_times = per_cpu_ptr(sig->thread_group_times, i);
+ utime = cputime_add(utime, tg_times->utime);
+ stime = cputime_add(stime, tg_times->stime);
+ sum_exec_runtime += tg_times->sum_exec_runtime;
+ }
+ thread_group_times->utime = utime;
+ thread_group_times->stime = stime;
+ thread_group_times->sum_exec_runtime = sum_exec_runtime;
+ return(1);
+}
+
+#else /* CONFIG_SMP */
+
+static inline void thread_group_times_init(struct signal_struct *sig)
+{
+}
+
+static inline void thread_group_times_free(struct signal_struct *sig)
+{
+}
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the current
+ * values of the fields. Called from do_setitimer() when setting an interval
+ * timer (ITIMER_PROF or ITIMER_VIRTUAL). Assumes interrupts are enabled when
+ * it's called. Note that there is no corresponding deallocation done from
+ * do_setitimer(); the structure is freed at process exit.
+ */
+static inline int thread_group_times_alloc(struct task_struct *tsk)
+{
+ return 0;
+}
+
+#define thread_group_update(sig, field, val, op) ({ \
+ if (sig) \
+ sig->thread_group_times.field = \
+ op(sig->thread_group_times.field, val); \
+})
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
+ struct signal_struct *sig)
+{
+ *thread_group_times = sig->thread_group_times;
+ return(1);
+}
+
+#endif /* CONFIG_SMP */
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/kernel/compat.c b/kernel/compat.c
index 5f0e201..5c80f32 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -153,6 +153,8 @@ asmlinkage long compat_sys_setitimer(int which,
asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
{
+ struct thread_group_cputime thread_group_times;
+
/*
* In the SMP world we might just be unlucky and have one of
* the times increment as we use it. Since the value is an
@@ -162,18 +164,28 @@ asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
if (tbuf) {
struct compat_tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
+ /*
+ * If a POSIX interval timer is running use the process-wide
+ * fields, else fall back to brute force.
+ */
+ if (thread_group_cputime(&thread_group_times, tsk->signal)) {
+ utime = thread_group_times.utime;
+ stime = thread_group_times.stime;
+ } else {
+ struct task_struct *t;
+
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ } while_each_thread(tsk, t);
+ }
/*
* While we have tasklist_lock read-locked, no dying thread
@@ -1081,4 +1093,3 @@ compat_sys_sysinfo(struct compat_sysinfo __user *info)
return 0;
}
-
diff --git a/kernel/fork.c b/kernel/fork.c
index dd249c3..e05d224 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -914,10 +914,14 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->it_virt_incr = cputime_zero;
sig->it_prof_expires = cputime_zero;
sig->it_prof_incr = cputime_zero;
+ sig->it_sched_expires = 0;
+ sig->rlim_expires = cputime_zero;
sig->leader = 0; /* session leadership doesn't inherit */
sig->tty_old_pgrp = NULL;
+ thread_group_times_init(sig);
+
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
@@ -939,7 +943,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
* New sole thread in the process gets an expiry time
* of the whole CPU time limit.
*/
- tsk->it_prof_expires =
+ sig->rlim_expires =
secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
}
acct_init_pacct(&sig->pacct);
@@ -952,6 +956,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
void __cleanup_signal(struct signal_struct *sig)
{
exit_thread_group_keys(sig);
+ thread_group_times_free(sig);
kmem_cache_free(signal_cachep, sig);
}
@@ -1311,21 +1316,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD) {
p->group_leader = current->group_leader;
list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
-
- if (!cputime_eq(current->signal->it_virt_expires,
- cputime_zero) ||
- !cputime_eq(current->signal->it_prof_expires,
- cputime_zero) ||
- current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
- !list_empty(&current->signal->cpu_timers[0]) ||
- !list_empty(&current->signal->cpu_timers[1]) ||
- !list_empty(&current->signal->cpu_timers[2])) {
- /*
- * Have child wake up on its first tick to check
- * for process CPU timers.
- */
- p->it_prof_expires = jiffies_to_cputime(1);
- }
}
if (likely(p->pid)) {
diff --git a/kernel/itimer.c b/kernel/itimer.c
index ab98274..8310db2 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -60,12 +60,11 @@ int do_getitimer(int which, struct itimerval *value)
cval = tsk->signal->it_virt_expires;
cinterval = tsk->signal->it_virt_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_struct *t = tsk;
- cputime_t utime = tsk->signal->utime;
- do {
- utime = cputime_add(utime, t->utime);
- t = next_thread(t);
- } while (t != tsk);
+ struct thread_group_cputime thread_group_times;
+ cputime_t utime;
+
+ (void)thread_group_cputime(&thread_group_times, tsk->signal);
+ utime = thread_group_times.utime;
if (cputime_le(cval, utime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -83,15 +82,12 @@ int do_getitimer(int which, struct itimerval *value)
cval = tsk->signal->it_prof_expires;
cinterval = tsk->signal->it_prof_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_struct *t = tsk;
- cputime_t ptime = cputime_add(tsk->signal->utime,
- tsk->signal->stime);
- do {
- ptime = cputime_add(ptime,
- cputime_add(t->utime,
- t->stime));
- t = next_thread(t);
- } while (t != tsk);
+ struct thread_group_cputime thread_group_times;
+ cputime_t ptime;
+
+ (void)thread_group_cputime(&thread_group_times, tsk->signal);
+ ptime = cputime_add(thread_group_times.utime,
+ thread_group_times.stime);
if (cputime_le(cval, ptime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -185,6 +181,13 @@ again:
case ITIMER_VIRTUAL:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ /*
+ * If he's setting the timer for the first time, we need to
+ * allocate the percpu area. It's freed when the process
+ * exits.
+ */
+ if (!cputime_eq(nval, cputime_zero))
+ thread_group_times_alloc(tsk);
read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_virt_expires;
@@ -209,6 +212,13 @@ again:
case ITIMER_PROF:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ /*
+ * If he's setting the timer for the first time, we need to
+ * allocate the percpu area. It's freed when the process
+ * exits.
+ */
+ if (!cputime_eq(nval, cputime_zero))
+ thread_group_times_alloc(tsk);
read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_prof_expires;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 2eae91f..53a4486 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -227,31 +227,20 @@ static int cpu_clock_sample_group_locked(unsigned int clock_idx,
struct task_struct *p,
union cpu_time_count *cpu)
{
- struct task_struct *t = p;
- switch (clock_idx) {
+ struct thread_group_cputime thread_group_times;
+
+ (void)thread_group_cputime(&thread_group_times, p->signal);
+ switch (clock_idx) {
default:
return -EINVAL;
case CPUCLOCK_PROF:
- cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
- do {
- cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
- t = next_thread(t);
- } while (t != p);
+ cpu->cpu = cputime_add(thread_group_times.utime, thread_group_times.stime);
break;
case CPUCLOCK_VIRT:
- cpu->cpu = p->signal->utime;
- do {
- cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
- t = next_thread(t);
- } while (t != p);
+ cpu->cpu = thread_group_times.utime;
break;
case CPUCLOCK_SCHED:
- cpu->sched = p->signal->sum_sched_runtime;
- /* Add in each other live thread. */
- while ((t = next_thread(t)) != p) {
- cpu->sched += t->se.sum_exec_runtime;
- }
- cpu->sched += sched_ns(p);
+ cpu->sched = thread_group_times.sum_exec_runtime;
break;
}
return 0;
@@ -472,80 +461,13 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
{
- cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, tsk->signal->utime),
- cputime_add(tsk->stime, tsk->signal->stime),
- tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
-}
+ struct thread_group_cputime thread_group_times;
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
- unsigned int clock_idx,
- union cpu_time_count expires,
- union cpu_time_count val)
-{
- cputime_t ticks, left;
- unsigned long long ns, nsleft;
- struct task_struct *t = p;
- unsigned int nthreads = atomic_read(&p->signal->live);
-
- if (!nthreads)
- return;
-
- switch (clock_idx) {
- default:
- BUG();
- break;
- case CPUCLOCK_PROF:
- left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
- nthreads);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ticks = cputime_add(prof_ticks(t), left);
- if (cputime_eq(t->it_prof_expires,
- cputime_zero) ||
- cputime_gt(t->it_prof_expires, ticks)) {
- t->it_prof_expires = ticks;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- case CPUCLOCK_VIRT:
- left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
- nthreads);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ticks = cputime_add(virt_ticks(t), left);
- if (cputime_eq(t->it_virt_expires,
- cputime_zero) ||
- cputime_gt(t->it_virt_expires, ticks)) {
- t->it_virt_expires = ticks;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- case CPUCLOCK_SCHED:
- nsleft = expires.sched - val.sched;
- do_div(nsleft, nthreads);
- nsleft = max_t(unsigned long long, nsleft, 1);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ns = t->se.sum_exec_runtime + nsleft;
- if (t->it_sched_expires == 0 ||
- t->it_sched_expires > ns) {
- t->it_sched_expires = ns;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- }
+ (void)thread_group_cputime(&thread_group_times, tsk->signal);
+ cleanup_timers(tsk->signal->cpu_timers,
+ thread_group_times.utime,
+ thread_group_times.stime,
+ thread_group_times.sum_exec_runtime);
}
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -642,24 +564,18 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
cputime_lt(p->signal->it_virt_expires,
timer->it.cpu.expires.cpu))
break;
- goto rebalance;
+ p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
+ break;
case CPUCLOCK_PROF:
if (!cputime_eq(p->signal->it_prof_expires,
cputime_zero) &&
cputime_lt(p->signal->it_prof_expires,
timer->it.cpu.expires.cpu))
break;
- i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
- if (i != RLIM_INFINITY &&
- i <= cputime_to_secs(timer->it.cpu.expires.cpu))
- break;
- goto rebalance;
+ p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
+ break;
case CPUCLOCK_SCHED:
- rebalance:
- process_timer_rebalance(
- timer->it.cpu.task,
- CPUCLOCK_WHICH(timer->it_clock),
- timer->it.cpu.expires, now);
+ p->signal->it_sched_expires = timer->it.cpu.expires.sched;
break;
}
}
@@ -1053,10 +969,10 @@ static void check_process_timers(struct task_struct *tsk,
{
int maxfire;
struct signal_struct *const sig = tsk->signal;
- cputime_t utime, stime, ptime, virt_expires, prof_expires;
+ cputime_t utime, ptime, virt_expires, prof_expires;
unsigned long long sum_sched_runtime, sched_expires;
- struct task_struct *t;
struct list_head *timers = sig->cpu_timers;
+ struct thread_group_cputime thread_group_times;
/*
* Don't sample the current process CPU clocks if there are no timers.
@@ -1072,17 +988,10 @@ static void check_process_timers(struct task_struct *tsk,
/*
* Collect the current process totals.
*/
- utime = sig->utime;
- stime = sig->stime;
- sum_sched_runtime = sig->sum_sched_runtime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- sum_sched_runtime += t->se.sum_exec_runtime;
- t = next_thread(t);
- } while (t != tsk);
- ptime = cputime_add(utime, stime);
+ (void)thread_group_cputime(&thread_group_times, sig);
+ utime = thread_group_times.utime;
+ ptime = cputime_add(utime, thread_group_times.stime);
+ sum_sched_runtime = thread_group_times.sum_exec_runtime;
maxfire = 20;
prof_expires = cputime_zero;
@@ -1185,66 +1094,24 @@ static void check_process_timers(struct task_struct *tsk,
}
}
x = secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
- if (cputime_eq(prof_expires, cputime_zero) ||
- cputime_lt(x, prof_expires)) {
- prof_expires = x;
+ if (cputime_eq(sig->rlim_expires, cputime_zero) ||
+ cputime_lt(x, sig->rlim_expires)) {
+ sig->rlim_expires = x;
}
}
- if (!cputime_eq(prof_expires, cputime_zero) ||
- !cputime_eq(virt_expires, cputime_zero) ||
- sched_expires != 0) {
- /*
- * Rebalance the threads' expiry times for the remaining
- * process CPU timers.
- */
-
- cputime_t prof_left, virt_left, ticks;
- unsigned long long sched_left, sched;
- const unsigned int nthreads = atomic_read(&sig->live);
-
- if (!nthreads)
- return;
-
- prof_left = cputime_sub(prof_expires, utime);
- prof_left = cputime_sub(prof_left, stime);
- prof_left = cputime_div_non_zero(prof_left, nthreads);
- virt_left = cputime_sub(virt_expires, utime);
- virt_left = cputime_div_non_zero(virt_left, nthreads);
- if (sched_expires) {
- sched_left = sched_expires - sum_sched_runtime;
- do_div(sched_left, nthreads);
- sched_left = max_t(unsigned long long, sched_left, 1);
- } else {
- sched_left = 0;
- }
- t = tsk;
- do {
- if (unlikely(t->flags & PF_EXITING))
- continue;
-
- ticks = cputime_add(cputime_add(t->utime, t->stime),
- prof_left);
- if (!cputime_eq(prof_expires, cputime_zero) &&
- (cputime_eq(t->it_prof_expires, cputime_zero) ||
- cputime_gt(t->it_prof_expires, ticks))) {
- t->it_prof_expires = ticks;
- }
-
- ticks = cputime_add(t->utime, virt_left);
- if (!cputime_eq(virt_expires, cputime_zero) &&
- (cputime_eq(t->it_virt_expires, cputime_zero) ||
- cputime_gt(t->it_virt_expires, ticks))) {
- t->it_virt_expires = ticks;
- }
-
- sched = t->se.sum_exec_runtime + sched_left;
- if (sched_expires && (t->it_sched_expires == 0 ||
- t->it_sched_expires > sched)) {
- t->it_sched_expires = sched;
- }
- } while ((t = next_thread(t)) != tsk);
- }
+ if (!cputime_eq(prof_expires, cputime_zero) &&
+ (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+ cputime_gt(sig->it_prof_expires, prof_expires)))
+ sig->it_prof_expires = prof_expires;
+ if (!cputime_eq(virt_expires, cputime_zero) &&
+ (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+ cputime_gt(sig->it_virt_expires, virt_expires)))
+ sig->it_virt_expires = virt_expires;
+ if (sched_expires != 0 &&
+ (sig->it_sched_expires == 0 ||
+ sig->it_sched_expires > sched_expires))
+ sig->it_sched_expires = sched_expires;
}
/*
@@ -1321,19 +1188,40 @@ void run_posix_cpu_timers(struct task_struct *tsk)
{
LIST_HEAD(firing);
struct k_itimer *timer, *next;
+ struct thread_group_cputime thread_group_times;
+ cputime_t tg_virt, tg_prof;
+ unsigned long long tg_exec_runtime;
BUG_ON(!irqs_disabled());
-#define UNEXPIRED(clock) \
- (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
- cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+#define UNEXPIRED(p, prof, virt, sched) \
+ ((cputime_eq((p)->it_prof_expires, cputime_zero) || \
+ cputime_lt((prof), (p)->it_prof_expires)) && \
+ (cputime_eq((p)->it_virt_expires, cputime_zero) || \
+ cputime_lt((virt), (p)->it_virt_expires)) && \
+ ((p)->it_sched_expires == 0 || (sched) < (p)->it_sched_expires))
- if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
- (tsk->it_sched_expires == 0 ||
- tsk->se.sum_exec_runtime < tsk->it_sched_expires))
- return;
+ /*
+ * If there are no expired thread timers, no expired thread group
+ * timers and no expired RLIMIT_CPU timer, just return.
+ */
+ if (UNEXPIRED(tsk, prof_ticks(tsk),
+ virt_ticks(tsk), tsk->se.sum_exec_runtime)) {
+ if (unlikely(tsk->signal == NULL))
+ return;
+ if (!thread_group_cputime(&thread_group_times, tsk->signal))
+ return;
+ tg_virt = thread_group_times.utime;
+ tg_prof = cputime_add(thread_group_times.utime,
+ thread_group_times.stime);
+ tg_exec_runtime = thread_group_times.sum_exec_runtime;
+ if ((tsk->signal->rlim[RLIMIT_CPU].rlim_cur == RLIM_INFINITY ||
+ cputime_lt(tg_prof, tsk->signal->rlim_expires)) &&
+ UNEXPIRED(tsk->signal, tg_prof, tg_virt, tg_exec_runtime))
+ return;
+ }
-#undef UNEXPIRED
+#undef UNEXPIRED
/*
* Double-check with locks held.
@@ -1414,14 +1302,6 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
if (cputime_eq(*newval, cputime_zero))
return;
*newval = cputime_add(*newval, now.cpu);
-
- /*
- * If the RLIMIT_CPU timer will expire before the
- * ITIMER_PROF timer, we have nothing else to do.
- */
- if (tsk->signal->rlim[RLIMIT_CPU].rlim_cur
- < cputime_to_secs(*newval))
- return;
}
/*
@@ -1433,13 +1313,14 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
cputime_ge(list_first_entry(head,
struct cpu_timer_list, entry)->expires.cpu,
*newval)) {
- /*
- * Rejigger each thread's expiry time so that one will
- * notice before we hit the process-cumulative expiry time.
- */
- union cpu_time_count expires = { .sched = 0 };
- expires.cpu = *newval;
- process_timer_rebalance(tsk, clock_idx, expires, now);
+ switch (clock_idx) {
+ case CPUCLOCK_PROF:
+ tsk->signal->it_prof_expires = *newval;
+ break;
+ case CPUCLOCK_VIRT:
+ tsk->signal->it_virt_expires = *newval;
+ break;
+ }
}
}
diff --git a/kernel/sched.c b/kernel/sched.c
index 28c73f0..1ff1a32 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3594,6 +3594,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
cputime64_t tmp;
p->utime = cputime_add(p->utime, cputime);
+ thread_group_update(p->signal, utime, cputime, cputime_add);
/* Add user time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -3616,6 +3617,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
tmp = cputime_to_cputime64(cputime);
p->utime = cputime_add(p->utime, cputime);
+ thread_group_update(p->signal, utime, cputime, cputime_add);
p->gtime = cputime_add(p->gtime, cputime);
cpustat->user = cputime64_add(cpustat->user, tmp);
@@ -3649,6 +3651,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
return account_guest_time(p, cputime);
p->stime = cputime_add(p->stime, cputime);
+ thread_group_update(p->signal, stime, cputime, cputime_add);
/* Add system time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -3690,6 +3693,7 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
if (p == rq->idle) {
p->stime = cputime_add(p->stime, steal);
+ thread_group_update(p->signal, stime, steal, cputime_add);
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
else
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 86a9337..6f7d5d2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -353,6 +353,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct task_struct *curtask = task_of(curr);
cpuacct_charge(curtask, delta_exec);
+ thread_group_update(curtask->signal, sum_exec_runtime,
+ delta_exec, thread_group_runtime_add);
}
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..7a2cc40 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -256,6 +256,8 @@ static void update_curr_rt(struct rq *rq)
schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
curr->se.sum_exec_runtime += delta_exec;
+ thread_group_update(curr->signal, sum_exec_runtime,
+ delta_exec, thread_group_runtime_add);
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);
diff --git a/kernel/sys.c b/kernel/sys.c
index a626116..baa3130 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -864,6 +864,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
asmlinkage long sys_times(struct tms __user * tbuf)
{
+ struct thread_group_cputime thread_group_times;
+
/*
* In the SMP world we might just be unlucky and have one of
* the times increment as we use it. Since the value is an
@@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
if (tbuf) {
struct tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
spin_lock_irq(&tsk->sighand->siglock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
+ /*
+ * If a POSIX interval timer is running use the process-wide
+ * fields, else fall back to brute force.
+ */
+ if (thread_group_cputime(&thread_group_times, tsk->signal)) {
+ utime = thread_group_times.utime;
+ stime = thread_group_times.stime;
+ } else {
+ struct task_struct *t;
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ } while_each_thread(tsk, t);
+ }
cutime = tsk->signal->cutime;
cstime = tsk->signal->cstime;
spin_unlock_irq(&tsk->sighand->siglock);
@@ -1444,7 +1455,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
{
struct rlimit new_rlim, *old_rlim;
- unsigned long it_prof_secs;
+ unsigned long rlim_secs;
int retval;
if (resource >= RLIM_NLIMITS)
@@ -1490,15 +1501,11 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
if (new_rlim.rlim_cur == RLIM_INFINITY)
goto out;
- it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
- if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
- unsigned long rlim_cur = new_rlim.rlim_cur;
- cputime_t cputime;
-
- cputime = secs_to_cputime(rlim_cur);
+ rlim_secs = cputime_to_secs(current->signal->rlim_expires);
+ if (rlim_secs == 0 || new_rlim.rlim_cur <= rlim_secs) {
read_lock(&tasklist_lock);
spin_lock_irq(¤t->sighand->siglock);
- set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
+ current->signal->rlim_expires = secs_to_cputime(new_rlim.rlim_cur);
spin_unlock_irq(¤t->sighand->siglock);
read_unlock(&tasklist_lock);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 41a049f..62fed13 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2201,7 +2201,7 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
* This will cause RLIMIT_CPU calculations
* to be refigured.
*/
- current->it_prof_expires = jiffies_to_cputime(1);
+ current->signal->rlim_expires = jiffies_to_cputime(1);
}
}
@@ -5624,5 +5624,3 @@ int selinux_disable(void)
return 0;
}
#endif
-
-
--
Frank Mayhar <[email protected]>
Google, Inc.
* Frank Mayhar <[email protected]> wrote:
> +static inline void thread_group_times_init(struct signal_struct *sig)
> +{
> + sig->thread_group_times = NULL;
> +}
> +
> +static inline void thread_group_times_free(struct signal_struct *sig)
> +{
> + if (sig->thread_group_times)
> + free_percpu(sig->thread_group_times);
> +}
> +
> +/*
> + * Allocate the thread_group_cputime struct appropriately and fill in the current
> + * values of the fields. Called from do_setitimer() when setting an interval
> + * timer (ITIMER_PROF or ITIMER_VIRTUAL). Assumes interrupts are enabled when
> + * it's called. Note that there is no corresponding deallocation done from
> + * do_setitimer(); the structure is freed at process exit.
> + */
> +static inline int thread_group_times_alloc(struct task_struct *tsk)
> +{
> + struct signal_struct *sig = tsk->signal;
> + struct thread_group_cputime *thread_group_times;
> + struct task_struct *t;
> + cputime_t utime, stime;
> + unsigned long long sum_exec_runtime;
> +
> + /*
> + * If we don't already have a thread_group_cputime struct, allocate
> + * one and fill it in with the accumulated times.
> + */
> + if (sig->thread_group_times)
> + return 0;
> + thread_group_times = alloc_percpu(struct thread_group_cputime);
> + if (thread_group_times == NULL)
> + return -ENOMEM;
> + read_lock(&tasklist_lock);
> + spin_lock_irq(&tsk->sighand->siglock);
> + if (sig->thread_group_times) {
> + spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> + free_percpu(thread_group_times);
> + return 0;
> + }
> + sig->thread_group_times = thread_group_times;
> + utime = sig->utime;
> + stime = sig->stime;
> + sum_exec_runtime = tsk->se.sum_exec_runtime;
> + t = tsk;
> + do {
> + utime = cputime_add(utime, t->utime);
> + stime = cputime_add(stime, t->stime);
> + sum_exec_runtime += t->se.sum_exec_runtime;
> + } while_each_thread(tsk, t);
> + thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
> + thread_group_times->utime = utime;
> + thread_group_times->stime = stime;
> + thread_group_times->sum_exec_runtime = sum_exec_runtime;
> + put_cpu_no_resched();
> + spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> + return 0;
> +}
please don't put such a huge inline into a header.
> +
> +#define thread_group_update(sig, field, val, op) ({ \
> + if (sig && sig->thread_group_times) { \
> + int cpu; \
> + struct thread_group_cputime *thread_group_times; \
> + \
> + cpu = get_cpu(); \
> + thread_group_times = \
> + per_cpu_ptr(sig->thread_group_times, cpu); \
> + thread_group_times->field = \
> + op(thread_group_times->field, val); \
> + put_cpu_no_resched(); \
> + } \
> +})
nor use any macros that include code.
> +static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
> + struct signal_struct *sig)
ditto, line length as well.
> + if (!sig->thread_group_times)
> + return(0);
return 0 is the proper form - please run your patch through
scripts/checkpatch.pl.
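(For example, from the top of the tree, with a hypothetical patch file
name:

	$ scripts/checkpatch.pl posix-cpu-timers.patch

which reports the style errors and warnings, if any, per line.)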
Ingo
This is my second cut at a patch that will fix bug 9906, "Weird hang
with NPTL and SIGPROF." Thanks to Ingo Molnar who sent me feedback
regarding the first cut.
Again, the problem is that run_posix_cpu_timers() repeatedly walks the
entire thread group every time it runs, which happens at interrupt time.
With heavy load and lots of threads, this can take longer than the tick,
at which point the kernel stops doing anything but servicing clock ticks
and the occasional interrupt. Many thanks to Roland McGrath for his
help in my attempt to understand his code.
The change adds a new structure to the signal_struct,
thread_group_cputime. On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism. It is manipulated via a set of functions defined in sched.c
and sched.h. These new functions are
* thread_group_times_free(), inline function to free via
free_percpu() (SMP) or kfree (UP) the thread_group_cputime
structure.
* thread_group_times_alloc(), external function to allocate the
thread_group_cputime structure when needed.
* thread_group_update(), inline function to update a field of the
thread_group_cputime structure; called at interrupt from tick
handlers, generally. It depends on the "offsetof()" macro to
know which field to update and on compiler optimization to
remove the unused code paths in each case.
* thread_group_cputime(), inline function that sums the time
fields for all running CPUs (SMP) or snapshots the time fields
(UP) into a passed structure.
I've changed the uniprocessor case to retain the dynamic allocation of
the thread_group_cputime structure as needed; this makes the code
somewhat more consistent between SMP and UP and retains the feature of
reducing overhead for processes that don't use interval timers.
In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT_CPU handling, replacing it with a new field,
rlim_expires, which is checked instead. This makes the code simpler and
more straightforward.
The kernel/posix-cpu-timers.c file has changed pretty drastically, with
it no longer using the per-task times to know when to check for timer
expiration. Instead, it consecutively checks the per-task timers and
then the per-process timers for expiration, consulting the individual
expiration fields (including the new RLIMIT_CPU expiration field) which
are now logically separate. Rather than performing "rebalancing",
functions now do simple assignments, and all loops through the thread
group have gone away, replaced with calls to thread_group_cputime().
Elsewhere, do_getitimer(), compat_sys_times() and sys_times() now use
thread_group_cputime() to get the times if a POSIX interval timer is in
use, providing a faster path in that case.
This version moves the thread_group_times_alloc() routine to sched.c,
changes the thread_group_update() macro to an inline function, shortens
a few things and cleans up the sched.h changes a bit.
Again, performance with the fix is at least as good as before and in a
few cases is slightly improved, possibly due to the reduced tick
overhead.
Finally, the patch has been retargeted to 2.6.25-rc7 instead of -rc6.
Signed-off-by: Frank Mayhar <[email protected]>
include/linux/sched.h | 117 +++++++++++++++++++
kernel/compat.c | 30 ++++--
kernel/fork.c | 22 +---
kernel/itimer.c | 40 ++++---
kernel/posix-cpu-timers.c | 276 +++++++++++++--------------------------------
kernel/sched.c | 73 ++++++++++++
kernel/sched_fair.c | 3 +
kernel/sched_rt.c | 3 +
kernel/sys.c | 44 +++++---
security/selinux/hooks.c | 4 +-
10 files changed, 354 insertions(+), 258 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6a1e7af..f6052ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -424,6 +424,18 @@ struct pacct_struct {
};
/*
+ * This structure contains the versions of utime, stime and sum_exec_runtime
+ * that are shared across threads within a process. It's only used for
+ * interval timers and is allocated via alloc_percpu() in the signal
+ * structure when such a timer is set up.
+ */
+struct thread_group_cputime {
+ cputime_t utime; /* User time. */
+ cputime_t stime; /* System time. */
+ unsigned long long sum_exec_runtime; /* Scheduler time. */
+};
+
+/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
@@ -468,6 +480,12 @@ struct signal_struct {
cputime_t it_prof_expires, it_virt_expires;
cputime_t it_prof_incr, it_virt_incr;
+ /* Scheduling timer for the process */
+ unsigned long long it_sched_expires;
+
+ /* RLIMIT_CPU timer for the process */
+ cputime_t rlim_expires;
+
/* job control IDs */
/*
@@ -492,6 +510,9 @@ struct signal_struct {
struct tty_struct *tty; /* NULL if no tty */
+ /* Process-wide times for POSIX interval timing. Per CPU. */
+ struct thread_group_cputime *thread_group_times;
+
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1984,6 +2005,102 @@ static inline int spin_needbreak(spinlock_t *lock)
#endif
}
+#ifdef CONFIG_SMP
+
+static inline void thread_group_times_free(
+ struct thread_group_cputime *tg_times)
+{
+ free_percpu(tg_times);
+}
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline void thread_group_cputime(
+ struct thread_group_cputime *tg_times,
+ struct signal_struct *sig)
+{
+ int i, cpu;
+ struct thread_group_cputime *tg;
+
+ /*
+ * Get the values for the current CPU separately so we don't get
+ * preempted, then sum all the rest.
+ */
+ cpu = get_cpu();
+ tg = per_cpu_ptr(sig->thread_group_times, cpu);
+ *tg_times = *tg;
+ put_cpu_no_resched();
+ for_each_online_cpu(i) {
+ if (i == cpu)
+ continue;
+ tg = per_cpu_ptr(sig->thread_group_times, i);
+ tg_times->utime = cputime_add(tg_times->utime, tg->utime);
+ tg_times->stime = cputime_add(tg_times->stime, tg->stime);
+ tg_times->sum_exec_runtime += tg->sum_exec_runtime;
+ }
+}
+
+#else /* CONFIG_SMP */
+
+static inline void thread_group_times_free(
+ struct thread_group_cputime *tg_times)
+{
+ kfree(tg_times);
+}
+
+/*
+ * Snapshot the time fields.
+ */
+static inline void thread_group_cputime(
+ struct thread_group_cputime *tg_times,
+ struct signal_struct *sig)
+{
+ *tg_times = *sig->thread_group_times;
+}
+
+#endif /* CONFIG_SMP */
+
+/*
+ * Update one of the fields in the thread_group_cputime structure. This is
+ * passed the offset of the field to be updated (acquired via the "offsetof"
+ * macro) and uses that to determine the actual field.
+ */
+static inline void thread_group_update(struct signal_struct *sig,
+ const int fieldoff, void *val)
+{
+ cputime_t cputime;
+ unsigned long long sum_exec_runtime;
+ struct thread_group_cputime *tg_times;
+
+ if (!sig || !sig->thread_group_times)
+ return;
+#ifdef CONFIG_SMP
+ tg_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+#else
+ tg_times = sig->thread_group_times;
+#endif
+ switch (fieldoff) {
+ case offsetof(struct thread_group_cputime, utime):
+ cputime = *(cputime_t *)val;
+ tg_times->utime = cputime_add(tg_times->utime, cputime);
+ break;
+ case offsetof(struct thread_group_cputime, stime):
+ cputime = *(cputime_t *)val;
+ tg_times->stime = cputime_add(tg_times->stime, cputime);
+ break;
+ case offsetof(struct thread_group_cputime, sum_exec_runtime):
+ sum_exec_runtime = *(unsigned long long *)val;
+ tg_times->sum_exec_runtime += sum_exec_runtime;
+ break;
+ }
+#ifdef CONFIG_SMP
+ put_cpu_no_resched();
+#endif
+}
+
+/* The thread_group_cputime allocator. */
+extern int thread_group_times_alloc(struct task_struct *);
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/kernel/compat.c b/kernel/compat.c
index 5f0e201..06a7e3a 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -162,18 +162,29 @@ asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
if (tbuf) {
struct compat_tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
+ struct thread_group_cputime thread_group_times;
read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
+ /*
+ * If a POSIX interval timer is running use the process-wide
+ * fields, else fall back to brute force.
+ */
+ if (tsk->signal->thread_group_times) {
+ thread_group_cputime(&thread_group_times, tsk->signal);
+ utime = thread_group_times.utime;
+ stime = thread_group_times.stime;
+ } else {
+ struct task_struct *t;
+
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ } while_each_thread(tsk, t);
+ }
/*
* While we have tasklist_lock read-locked, no dying thread
@@ -1081,4 +1092,3 @@ compat_sys_sysinfo(struct compat_sysinfo __user *info)
return 0;
}
-
diff --git a/kernel/fork.c b/kernel/fork.c
index dd249c3..d4f6282 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -914,10 +914,14 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->it_virt_incr = cputime_zero;
sig->it_prof_expires = cputime_zero;
sig->it_prof_incr = cputime_zero;
+ sig->it_sched_expires = 0;
+ sig->rlim_expires = cputime_zero;
sig->leader = 0; /* session leadership doesn't inherit */
sig->tty_old_pgrp = NULL;
+ sig->thread_group_times = NULL;
+
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
@@ -939,7 +943,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
* New sole thread in the process gets an expiry time
* of the whole CPU time limit.
*/
- tsk->it_prof_expires =
+ sig->rlim_expires =
secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
}
acct_init_pacct(&sig->pacct);
@@ -952,6 +956,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
void __cleanup_signal(struct signal_struct *sig)
{
exit_thread_group_keys(sig);
+ thread_group_times_free(sig->thread_group_times);
kmem_cache_free(signal_cachep, sig);
}
@@ -1311,21 +1316,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD) {
p->group_leader = current->group_leader;
list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
-
- if (!cputime_eq(current->signal->it_virt_expires,
- cputime_zero) ||
- !cputime_eq(current->signal->it_prof_expires,
- cputime_zero) ||
- current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
- !list_empty(&current->signal->cpu_timers[0]) ||
- !list_empty(&current->signal->cpu_timers[1]) ||
- !list_empty(&current->signal->cpu_timers[2])) {
- /*
- * Have child wake up on its first tick to check
- * for process CPU timers.
- */
- p->it_prof_expires = jiffies_to_cputime(1);
- }
}
if (likely(p->pid)) {
diff --git a/kernel/itimer.c b/kernel/itimer.c
index ab98274..7c5b416 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -60,12 +60,11 @@ int do_getitimer(int which, struct itimerval *value)
cval = tsk->signal->it_virt_expires;
cinterval = tsk->signal->it_virt_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_struct *t = tsk;
- cputime_t utime = tsk->signal->utime;
- do {
- utime = cputime_add(utime, t->utime);
- t = next_thread(t);
- } while (t != tsk);
+ struct thread_group_cputime thread_group_times;
+ cputime_t utime;
+
+ thread_group_cputime(&thread_group_times, tsk->signal);
+ utime = thread_group_times.utime;
if (cputime_le(cval, utime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -83,15 +82,12 @@ int do_getitimer(int which, struct itimerval *value)
cval = tsk->signal->it_prof_expires;
cinterval = tsk->signal->it_prof_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_struct *t = tsk;
- cputime_t ptime = cputime_add(tsk->signal->utime,
- tsk->signal->stime);
- do {
- ptime = cputime_add(ptime,
- cputime_add(t->utime,
- t->stime));
- t = next_thread(t);
- } while (t != tsk);
+ struct thread_group_cputime thread_group_times;
+ cputime_t ptime;
+
+ thread_group_cputime(&thread_group_times, tsk->signal);
+ ptime = cputime_add(thread_group_times.utime,
+ thread_group_times.stime);
if (cputime_le(cval, ptime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -185,6 +181,13 @@ again:
case ITIMER_VIRTUAL:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ /*
+ * If he's setting the timer for the first time, we need to
+ * allocate the percpu area. It's freed when the process
+ * exits.
+ */
+ if (!cputime_eq(nval, cputime_zero))
+ thread_group_times_alloc(tsk);
read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_virt_expires;
@@ -209,6 +212,13 @@ again:
case ITIMER_PROF:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ /*
+ * If he's setting the timer for the first time, we need to
+ * allocate the percpu area. It's freed when the process
+ * exits.
+ */
+ if (!cputime_eq(nval, cputime_zero))
+ thread_group_times_alloc(tsk);
read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_prof_expires;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 2eae91f..309a7c4 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -227,31 +227,21 @@ static int cpu_clock_sample_group_locked(unsigned int clock_idx,
struct task_struct *p,
union cpu_time_count *cpu)
{
- struct task_struct *t = p;
- switch (clock_idx) {
+ struct thread_group_cputime thread_group_times;
+
+ thread_group_cputime(&thread_group_times, p->signal);
+ switch (clock_idx) {
default:
return -EINVAL;
case CPUCLOCK_PROF:
- cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
- do {
- cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
- t = next_thread(t);
- } while (t != p);
+ cpu->cpu = cputime_add(thread_group_times.utime,
+ thread_group_times.stime);
break;
case CPUCLOCK_VIRT:
- cpu->cpu = p->signal->utime;
- do {
- cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
- t = next_thread(t);
- } while (t != p);
+ cpu->cpu = thread_group_times.utime;
break;
case CPUCLOCK_SCHED:
- cpu->sched = p->signal->sum_sched_runtime;
- /* Add in each other live thread. */
- while ((t = next_thread(t)) != p) {
- cpu->sched += t->se.sum_exec_runtime;
- }
- cpu->sched += sched_ns(p);
+ cpu->sched = thread_group_times.sum_exec_runtime;
break;
}
return 0;
@@ -472,80 +462,13 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
{
- cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, tsk->signal->utime),
- cputime_add(tsk->stime, tsk->signal->stime),
- tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
-}
+ struct thread_group_cputime thread_group_times;
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
- unsigned int clock_idx,
- union cpu_time_count expires,
- union cpu_time_count val)
-{
- cputime_t ticks, left;
- unsigned long long ns, nsleft;
- struct task_struct *t = p;
- unsigned int nthreads = atomic_read(&p->signal->live);
-
- if (!nthreads)
- return;
-
- switch (clock_idx) {
- default:
- BUG();
- break;
- case CPUCLOCK_PROF:
- left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
- nthreads);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ticks = cputime_add(prof_ticks(t), left);
- if (cputime_eq(t->it_prof_expires,
- cputime_zero) ||
- cputime_gt(t->it_prof_expires, ticks)) {
- t->it_prof_expires = ticks;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- case CPUCLOCK_VIRT:
- left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
- nthreads);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ticks = cputime_add(virt_ticks(t), left);
- if (cputime_eq(t->it_virt_expires,
- cputime_zero) ||
- cputime_gt(t->it_virt_expires, ticks)) {
- t->it_virt_expires = ticks;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- case CPUCLOCK_SCHED:
- nsleft = expires.sched - val.sched;
- do_div(nsleft, nthreads);
- nsleft = max_t(unsigned long long, nsleft, 1);
- do {
- if (likely(!(t->flags & PF_EXITING))) {
- ns = t->se.sum_exec_runtime + nsleft;
- if (t->it_sched_expires == 0 ||
- t->it_sched_expires > ns) {
- t->it_sched_expires = ns;
- }
- }
- t = next_thread(t);
- } while (t != p);
- break;
- }
+ thread_group_cputime(&thread_group_times, tsk->signal);
+ cleanup_timers(tsk->signal->cpu_timers,
+ thread_group_times.utime,
+ thread_group_times.stime,
+ thread_group_times.sum_exec_runtime);
}
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -572,7 +495,6 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
struct list_head *head, *listpos;
struct cpu_timer_list *const nt = &timer->it.cpu;
struct cpu_timer_list *next;
- unsigned long i;
head = (CPUCLOCK_PERTHREAD(timer->it_clock) ?
p->cpu_timers : p->signal->cpu_timers);
@@ -642,24 +564,21 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
cputime_lt(p->signal->it_virt_expires,
timer->it.cpu.expires.cpu))
break;
- goto rebalance;
+ p->signal->it_virt_expires =
+ timer->it.cpu.expires.cpu;
+ break;
case CPUCLOCK_PROF:
if (!cputime_eq(p->signal->it_prof_expires,
cputime_zero) &&
cputime_lt(p->signal->it_prof_expires,
timer->it.cpu.expires.cpu))
break;
- i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
- if (i != RLIM_INFINITY &&
- i <= cputime_to_secs(timer->it.cpu.expires.cpu))
- break;
- goto rebalance;
+ p->signal->it_prof_expires =
+ timer->it.cpu.expires.cpu;
+ break;
case CPUCLOCK_SCHED:
- rebalance:
- process_timer_rebalance(
- timer->it.cpu.task,
- CPUCLOCK_WHICH(timer->it_clock),
- timer->it.cpu.expires, now);
+ p->signal->it_sched_expires =
+ timer->it.cpu.expires.sched;
break;
}
}
@@ -1053,10 +972,10 @@ static void check_process_timers(struct task_struct *tsk,
{
int maxfire;
struct signal_struct *const sig = tsk->signal;
- cputime_t utime, stime, ptime, virt_expires, prof_expires;
+ cputime_t utime, ptime, virt_expires, prof_expires;
unsigned long long sum_sched_runtime, sched_expires;
- struct task_struct *t;
struct list_head *timers = sig->cpu_timers;
+ struct thread_group_cputime thread_group_times;
/*
* Don't sample the current process CPU clocks if there are no timers.
@@ -1072,17 +991,10 @@ static void check_process_timers(struct task_struct *tsk,
/*
* Collect the current process totals.
*/
- utime = sig->utime;
- stime = sig->stime;
- sum_sched_runtime = sig->sum_sched_runtime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- sum_sched_runtime += t->se.sum_exec_runtime;
- t = next_thread(t);
- } while (t != tsk);
- ptime = cputime_add(utime, stime);
+ thread_group_cputime(&thread_group_times, sig);
+ utime = thread_group_times.utime;
+ ptime = cputime_add(utime, thread_group_times.stime);
+ sum_sched_runtime = thread_group_times.sum_exec_runtime;
maxfire = 20;
prof_expires = cputime_zero;
@@ -1185,66 +1097,24 @@ static void check_process_timers(struct task_struct *tsk,
}
}
x = secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
- if (cputime_eq(prof_expires, cputime_zero) ||
- cputime_lt(x, prof_expires)) {
- prof_expires = x;
+ if (cputime_eq(sig->rlim_expires, cputime_zero) ||
+ cputime_lt(x, sig->rlim_expires)) {
+ sig->rlim_expires = x;
}
}
- if (!cputime_eq(prof_expires, cputime_zero) ||
- !cputime_eq(virt_expires, cputime_zero) ||
- sched_expires != 0) {
- /*
- * Rebalance the threads' expiry times for the remaining
- * process CPU timers.
- */
-
- cputime_t prof_left, virt_left, ticks;
- unsigned long long sched_left, sched;
- const unsigned int nthreads = atomic_read(&sig->live);
-
- if (!nthreads)
- return;
-
- prof_left = cputime_sub(prof_expires, utime);
- prof_left = cputime_sub(prof_left, stime);
- prof_left = cputime_div_non_zero(prof_left, nthreads);
- virt_left = cputime_sub(virt_expires, utime);
- virt_left = cputime_div_non_zero(virt_left, nthreads);
- if (sched_expires) {
- sched_left = sched_expires - sum_sched_runtime;
- do_div(sched_left, nthreads);
- sched_left = max_t(unsigned long long, sched_left, 1);
- } else {
- sched_left = 0;
- }
- t = tsk;
- do {
- if (unlikely(t->flags & PF_EXITING))
- continue;
-
- ticks = cputime_add(cputime_add(t->utime, t->stime),
- prof_left);
- if (!cputime_eq(prof_expires, cputime_zero) &&
- (cputime_eq(t->it_prof_expires, cputime_zero) ||
- cputime_gt(t->it_prof_expires, ticks))) {
- t->it_prof_expires = ticks;
- }
-
- ticks = cputime_add(t->utime, virt_left);
- if (!cputime_eq(virt_expires, cputime_zero) &&
- (cputime_eq(t->it_virt_expires, cputime_zero) ||
- cputime_gt(t->it_virt_expires, ticks))) {
- t->it_virt_expires = ticks;
- }
-
- sched = t->se.sum_exec_runtime + sched_left;
- if (sched_expires && (t->it_sched_expires == 0 ||
- t->it_sched_expires > sched)) {
- t->it_sched_expires = sched;
- }
- } while ((t = next_thread(t)) != tsk);
- }
+ if (!cputime_eq(prof_expires, cputime_zero) &&
+ (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+ cputime_gt(sig->it_prof_expires, prof_expires)))
+ sig->it_prof_expires = prof_expires;
+ if (!cputime_eq(virt_expires, cputime_zero) &&
+ (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+ cputime_gt(sig->it_virt_expires, virt_expires)))
+ sig->it_virt_expires = virt_expires;
+ if (sched_expires != 0 &&
+ (sig->it_sched_expires == 0 ||
+ sig->it_sched_expires > sched_expires))
+ sig->it_sched_expires = sched_expires;
}
/*
@@ -1321,19 +1191,40 @@ void run_posix_cpu_timers(struct task_struct *tsk)
{
LIST_HEAD(firing);
struct k_itimer *timer, *next;
+ struct thread_group_cputime tg_times;
+ cputime_t tg_virt, tg_prof;
+ unsigned long long tg_exec_runtime;
BUG_ON(!irqs_disabled());
-#define UNEXPIRED(clock) \
- (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
- cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+#define UNEXPIRED(p, prof, virt, sched) \
+ ((cputime_eq((p)->it_prof_expires, cputime_zero) || \
+ cputime_lt((prof), (p)->it_prof_expires)) && \
+ (cputime_eq((p)->it_virt_expires, cputime_zero) || \
+ cputime_lt((virt), (p)->it_virt_expires)) && \
+ ((p)->it_sched_expires == 0 || (sched) < (p)->it_sched_expires))
- if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
- (tsk->it_sched_expires == 0 ||
- tsk->se.sum_exec_runtime < tsk->it_sched_expires))
- return;
+ /*
+ * If there are no expired thread timers, no expired thread group
+ * timers and no expired RLIMIT_CPU timer, just return.
+ */
+ if (UNEXPIRED(tsk, prof_ticks(tsk),
+ virt_ticks(tsk), tsk->se.sum_exec_runtime)) {
+ if (unlikely(tsk->signal == NULL))
+ return;
+		if (!tsk->signal->thread_group_times)
+ return;
+ thread_group_cputime(&tg_times, tsk->signal);
+ tg_prof = cputime_add(tg_times.utime, tg_times.stime);
+ tg_virt = tg_times.utime;
+ tg_exec_runtime = tg_times.sum_exec_runtime;
+ if ((tsk->signal->rlim[RLIMIT_CPU].rlim_cur == RLIM_INFINITY ||
+ cputime_lt(tg_prof, tsk->signal->rlim_expires)) &&
+		    UNEXPIRED(tsk->signal, tg_prof, tg_virt, tg_exec_runtime))
+ return;
+ }
-#undef UNEXPIRED
+#undef UNEXPIRED
/*
* Double-check with locks held.
@@ -1414,14 +1305,6 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
if (cputime_eq(*newval, cputime_zero))
return;
*newval = cputime_add(*newval, now.cpu);
-
- /*
- * If the RLIMIT_CPU timer will expire before the
- * ITIMER_PROF timer, we have nothing else to do.
- */
- if (tsk->signal->rlim[RLIMIT_CPU].rlim_cur
- < cputime_to_secs(*newval))
- return;
}
/*
@@ -1433,13 +1316,14 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
cputime_ge(list_first_entry(head,
struct cpu_timer_list, entry)->expires.cpu,
*newval)) {
- /*
- * Rejigger each thread's expiry time so that one will
- * notice before we hit the process-cumulative expiry time.
- */
- union cpu_time_count expires = { .sched = 0 };
- expires.cpu = *newval;
- process_timer_rebalance(tsk, clock_idx, expires, now);
+ switch (clock_idx) {
+ case CPUCLOCK_PROF:
+ tsk->signal->it_prof_expires = *newval;
+ break;
+ case CPUCLOCK_VIRT:
+ tsk->signal->it_virt_expires = *newval;
+ break;
+ }
}
}
diff --git a/kernel/sched.c b/kernel/sched.c
index 8dcdec6..81b61eb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3637,6 +3637,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
cputime64_t tmp;
p->utime = cputime_add(p->utime, cputime);
+ thread_group_update(p->signal,
+ offsetof(struct thread_group_cputime, utime),
+ (void *)&cputime);
/* Add user time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -3659,6 +3662,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
tmp = cputime_to_cputime64(cputime);
p->utime = cputime_add(p->utime, cputime);
+ thread_group_update(p->signal,
+ offsetof(struct thread_group_cputime, utime),
+ (void *)&cputime);
p->gtime = cputime_add(p->gtime, cputime);
cpustat->user = cputime64_add(cpustat->user, tmp);
@@ -3692,6 +3698,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
return account_guest_time(p, cputime);
p->stime = cputime_add(p->stime, cputime);
+ thread_group_update(p->signal,
+ offsetof(struct thread_group_cputime, stime),
+ (void *)&cputime);
/* Add system time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -3733,6 +3742,9 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
if (p == rq->idle) {
p->stime = cputime_add(p->stime, steal);
+ thread_group_update(p->signal,
+ offsetof(struct thread_group_cputime, stime),
+ (void *)&steal);
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
else
@@ -8138,3 +8150,64 @@ struct cgroup_subsys cpuacct_subsys = {
.subsys_id = cpuacct_subsys_id,
};
#endif /* CONFIG_CGROUP_CPUACCT */
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the
+ * current values of the fields. Called from do_setitimer() when setting an
+ * interval timer (ITIMER_PROF or ITIMER_VIRTUAL). Assumes interrupts are
+ * enabled when it's called. Note that there is no corresponding deallocation
+ * done from do_setitimer(); the structure is freed at process exit.
+ */
+int thread_group_times_alloc(struct task_struct *tsk)
+{
+ struct signal_struct *sig = tsk->signal;
+ struct thread_group_cputime *thread_group_times;
+ struct task_struct *t;
+ cputime_t utime, stime;
+ unsigned long long sum_exec_runtime;
+
+ /*
+ * If we don't already have a thread_group_cputime struct, allocate
+ * one and fill it in with the accumulated times.
+ */
+ if (sig->thread_group_times)
+ return 0;
+#ifdef CONFIG_SMP
+ thread_group_times = alloc_percpu(struct thread_group_cputime);
+#else
+ thread_group_times =
+ kmalloc(sizeof(struct thread_group_cputime), GFP_KERNEL);
+#endif
+ if (thread_group_times == NULL)
+ return -ENOMEM;
+ read_lock(&tasklist_lock);
+ spin_lock_irq(&tsk->sighand->siglock);
+ if (sig->thread_group_times) {
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ thread_group_times_free(thread_group_times);
+ return 0;
+ }
+ sig->thread_group_times = thread_group_times;
+ utime = sig->utime;
+ stime = sig->stime;
+ sum_exec_runtime = tsk->se.sum_exec_runtime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ sum_exec_runtime += t->se.sum_exec_runtime;
+ } while_each_thread(tsk, t);
+#ifdef CONFIG_SMP
+ thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+#endif
+ thread_group_times->utime = utime;
+ thread_group_times->stime = stime;
+ thread_group_times->sum_exec_runtime = sum_exec_runtime;
+#ifdef CONFIG_SMP
+ put_cpu_no_resched();
+#endif
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ return 0;
+}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 86a9337..fc5e269 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -353,6 +353,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct task_struct *curtask = task_of(curr);
cpuacct_charge(curtask, delta_exec);
+ thread_group_update(curtask->signal,
+ offsetof(struct thread_group_cputime, sum_exec_runtime),
+ (void *)&delta_exec);
}
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..ea48a92 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -256,6 +256,9 @@ static void update_curr_rt(struct rq *rq)
schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
curr->se.sum_exec_runtime += delta_exec;
+ thread_group_update(curr->signal,
+ offsetof(struct thread_group_cputime, sum_exec_runtime),
+ (void *)&delta_exec);
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);
diff --git a/kernel/sys.c b/kernel/sys.c
index a626116..ce70226 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -864,6 +864,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
asmlinkage long sys_times(struct tms __user * tbuf)
{
+ struct thread_group_cputime thread_group_times;
+
/*
* In the SMP world we might just be unlucky and have one of
* the times increment as we use it. Since the value is an
@@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
if (tbuf) {
struct tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
spin_lock_irq(&tsk->sighand->siglock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
-
+ /*
+ * If a POSIX interval timer is running use the process-wide
+ * fields, else fall back to brute force.
+ */
+		if (tsk->signal->thread_group_times) {
+ thread_group_cputime(&thread_group_times, tsk->signal);
+ utime = thread_group_times.utime;
+ stime = thread_group_times.stime;
+ } else {
+ struct task_struct *t;
+
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ } while_each_thread(tsk, t);
+ }
cutime = tsk->signal->cutime;
cstime = tsk->signal->cstime;
spin_unlock_irq(&tsk->sighand->siglock);
@@ -1444,7 +1455,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
{
struct rlimit new_rlim, *old_rlim;
- unsigned long it_prof_secs;
+ unsigned long rlim_secs;
int retval;
if (resource >= RLIM_NLIMITS)
@@ -1490,15 +1501,12 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
if (new_rlim.rlim_cur == RLIM_INFINITY)
goto out;
- it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
- if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
- unsigned long rlim_cur = new_rlim.rlim_cur;
- cputime_t cputime;
-
- cputime = secs_to_cputime(rlim_cur);
+ rlim_secs = cputime_to_secs(current->signal->rlim_expires);
+ if (rlim_secs == 0 || new_rlim.rlim_cur <= rlim_secs) {
read_lock(&tasklist_lock);
spin_lock_irq(&current->sighand->siglock);
- set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
+ current->signal->rlim_expires =
+ secs_to_cputime(new_rlim.rlim_cur);
spin_unlock_irq(&current->sighand->siglock);
read_unlock(&tasklist_lock);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 41a049f..62fed13 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2201,7 +2201,7 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
* This will cause RLIMIT_CPU calculations
* to be refigured.
*/
- current->it_prof_expires = jiffies_to_cputime(1);
+ current->signal->rlim_expires = jiffies_to_cputime(1);
}
}
--
Frank Mayhar <[email protected]>
Google, Inc.
> Well, if it's acceptable, for a first cut (and the patch I'll submit),
> I'll handle the UP and SMP cases, encapsulating them in sched.h in such
> a way as to make it invisible (as much as is possible) to the rest of
> the code.
That's fine.
> After looking at the code again, I now understand what you're talking
> about. You overloaded it_*_expires to support both the POSIX interval
> timers and RLIMIT_CPU. So the way I have things, setting one can stomp
> the other.
For clarity, please never mention that identifier without indicating which
struct you're talking about. signal_struct.it_*_expires has never been
overloaded. signal_struct.it_prof_expires is the ITIMER_PROF setting;
signal_struct.it_virt_expires is the ITIMER_VIRTUAL setting; there is no
signal_struct.it_sched_expires field. task_struct.it_*_expires has never
been overloaded. task_struct.it_prof_expires is the next value of
(task_struct.utime + task_struct.stime) at which run_posix_cpu_timers()
needs to check for work to do.
> Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
> than rolling it into the itimer handling?
It is not "rolled into the itimer handling".
run_posix_cpu_timers() handles three separate features:
1. ITIMER_VIRTUAL, ITIMER_PROF itimers (setitimer/getitimer)
2. POSIX CPU timers (timer_* calls)
3. CPU time rlimits (RLIMIT_CPU for process-wide, RLIMIT_RTTIME for each thread)
The poorly-named task_struct.it_*_expires fields serve a single purpose:
to optimize run_posix_cpu_timers(). task_struct.it_prof_expires is the
minimum of the values at which any of those three features need attention.
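Unrolled, that fast path is just the UNEXPIRED macro expanded (this is
the pre-patch check, for reference):

	if ((cputime_eq(tsk->it_prof_expires, cputime_zero) ||
	     cputime_lt(prof_ticks(tsk), tsk->it_prof_expires)) &&
	    (cputime_eq(tsk->it_virt_expires, cputime_zero) ||
	     cputime_lt(virt_ticks(tsk), tsk->it_virt_expires)) &&
	    (tsk->it_sched_expires == 0 ||
	     tsk->se.sum_exec_runtime < tsk->it_sched_expires))
		return;		/* nothing needs attention on any clock */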
> Okay, my proposed fix for this is to introduce a new field in
> signal_struct, rlim_expires, a cputime_t. Everywhere that the
> RLIMIT_CPU code formerly set it_prof_expires it will now set
> rlim_expires and in run_posix_cpu_timers() I'll check it against the
> thread group prof_time.
I don't see the point of adding this field at all. It serves solely to
cache the secs_to_cputime calculation on the RLIMIT_CPU rlim_cur value,
which is just a division.
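Computed on demand it is just (a sketch, using the existing
secs_to_cputime() and cputime_ge() helpers):

	unsigned long secs = sig->rlim[RLIMIT_CPU].rlim_cur;

	if (secs != RLIM_INFINITY &&
	    cputime_ge(ptime, secs_to_cputime(secs))) {
		/* RLIMIT_CPU reached; the existing code raises SIGXCPU */
	}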
> I've changed the uniprocessor case to retain the dynamic allocation of
> the thread_group_cputime structure as needed; this makes the code
> somewhat more consistent between SMP and UP and retains the feature of
> reducing overhead for processes that don't use interval timers.
This does not make sense. There is no need for any new state at all in
the UP case, just reorganizing what is already there.
The existing signal_struct fields utime, stime, and sum_sched_runtime are
no longer needed. These accumulate the times of dead threads in the group
(see __exit_signal in exit.c) solely so cpu_clock_sample_group can add
them in. Keeping both those old fields and the dynamically allocated
per-CPU counters is wrong: you will double-count every thread that
died since the struct was allocated (i.e. since the first timer was set).
For the UP case, just replace these with a single struct thread_group_cputime
in signal_struct, and increment it directly on every tick. __exit_signal
never touches it.
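For reference, the __exit_signal() accumulation in question is:

	sig->utime = cputime_add(sig->utime, tsk->utime);
	sig->stime = cputime_add(sig->stime, tsk->stime);
	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

With the group totals also bumped on every tick, leaving these adds in
would count each exiting thread a second time.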
For the SMP case, you need a bit of complication. When there are no
expirations (none of the three features in use on a process-wide CPU
clock) or only one live thread, then you don't need to allocate the
per-CPU counters. But you need one or the other kind of state as soon
as a thread dies while others live, or there are multiple threads while
any process-wide expiration is set. There are several options for how
to reconcile the dead-threads tracking with the started-on-demand
per-CPU counters.
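The read side then sums the per-CPU counters, roughly (a sketch,
assuming a per-CPU totals pointer as adopted later in this thread, and
omitting the not-yet-allocated case):

	static inline void thread_group_cputime(struct signal_struct *sig,
						struct task_cputime *times)
	{
		struct task_cputime *tot;
		int i;

		times->utime = times->stime = cputime_zero;
		times->sched_runtime = 0;
		for_each_possible_cpu(i) {
			tot = per_cpu_ptr(sig->cputime.totals, i);
			times->utime = cputime_add(times->utime, tot->utime);
			times->stime = cputime_add(times->stime, tot->stime);
			times->sched_runtime += tot->sched_runtime;
		}
	}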
You did not follow what I had in mind for abstracting the code.
Here is what I think will make it easiest to work through all these
angles without rewriting much of the code for each variant.
Define:
struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sched_runtime;
};
and then a second type to use in signal_struct.
You can clean up task_struct by replacing its it_*_expires fields with one:
struct task_cputime cputime_expires;
(That is, overload cputime_expires.stime for the utime+stime expiration
time. Even with that kludge, I think it's cleaner to use this struct
for all these places.)
In signal_struct there are no conditionals, it's just:
struct thread_group_cputime cputime;
The variants provide struct thread_group_cputime and the functions to go
with it. However many variants we need can go into separate big #if
blocks, keeping all the related code together.
The UP version just does:
struct thread_group_cputime {
struct task_cputime totals;
};
static inline void thread_group_cputime(struct signal_struct *sig,
struct task_cputime *cputime)
{
*cputime = sig->cputime;
}
static inline void account_group_user_time(struct task_struct *task,
cputime_t cputime)
{
struct task_cputime *times = &task->signal->cputime.totals;
times->utime = cputime_add(times->utime, cputime);
}
static inline void account_group_system_time(struct task_struct *task,
cputime_t cputime)
{
struct task_cputime *times = &task->signal->cputime.totals;
times->stime = cputime_add(times->stime, cputime);
}
static inline void account_group_exec_runtime(struct task_struct *task,
unsigned long long ns)
{
struct task_cputime *times = &task->signal->cputime.totals;
times->sched_runtime += ns;
}
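A caller-side sketch (assuming account_user_time() keeps its current
shape): the accounting paths call these helpers directly, in place of
the offsetof-based thread_group_update() calls in the patch above:

	void account_user_time(struct task_struct *p, cputime_t cputime)
	{
		p->utime = cputime_add(p->utime, cputime);
		account_group_user_time(p, cputime);
		/* ... cpustat accounting unchanged ... */
	}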
Finally, you could consider adding another field to signal_struct:
struct task_cputime cputime_expires;
This would be a cache, for each of the three process CPU clocks, of the
earliest expiration from any of the three features. Each of setitimer,
timer_settime, setrlimit, and implicit timer/itimer reloading, would
recalculate the minimum of the head of the cpu_timers list, the itimer
(it_*_expires), and the rlimit. The reason to do this is that the
common case in run_posix_cpu_timers() stays almost as cheap as it is now.
It also makes a nice parallel with the per-thread expiration cache.
i.e.:
static int task_cputime_expired(const struct task_cputime *sample,
const struct task_cputime *expires)
{
if (!cputime_eq(expires->utime, cputime_zero) &&
cputime_ge(sample->utime, expires->utime))
return 1;
if (!cputime_eq(expires->stime, cputime_zero) &&
cputime_ge(cputime_add(sample->utime, sample->stime),
expires->stime))
return 1;
if (expires->sched_runtime != 0 &&
sample->sched_runtime >= expires->sched_runtime)
return 1;
return 0;
}
...
struct signal_struct *sig = task->signal;
struct task_cputime task_sample = {
.utime = task->utime,
.stime = task->stime,
.sched_runtime = task->se.sum_exec_runtime
};
struct task_cputime group_sample;
thread_group_cputime(sig, &group_sample);
	if (!task_cputime_expired(&task_sample, &task->cputime_expires) &&
	    !task_cputime_expired(&group_sample, &sig->cputime_expires))
return 0;
...
Thanks,
Roland
Roland, I'm very much having to read between the lines of what you've
written. And, obviously, getting it wrong at least half the time. :-)
So you've cleared part of my understanding with your latest email.
Here's what I've gotten from it:
struct task_cputime {
cputime_t utime; /* User time. */
cputime_t stime; /* System time. */
unsigned long long sched_runtime; /* Scheduler time. */
};
This is for both SMP and UP, defined before signal_struct in sched.h
(since that structure refers to this one). Following that:
struct thread_group_cputime;
Which is a forward reference to the real definition later in the file.
The inline functions depend on signal_struct and task_struct, so they
have to come after:
#ifdef CONFIG_SMP
struct thread_group_cputime {
struct task_cputime *totals;
};
< ... inline functions ... >
#else /* CONFIG_SMP */
struct thread_group_cputime {
struct task_cputime totals;
};
< ... inline functions ... >
#endif
The SMP version is percpu, the UP version is just a substructure. In
signal_struct itself, delete utime & stime, add
struct thread_group_cputime cputime;
The inline functions include the ones you defined for UP plus equivalent
ones for SMP. The SMP inlines check the percpu pointer
(sig->cputime.totals) and don't update if it's NULL. One small
correction to one of your inlines, in thread_group_cputime:
*cputime = sig->cputime;
should be
*cputime = sig->cputime.totals;
A representative inline for SMP is:
static inline void account_group_system_time(struct task_struct *task,
cputime_t cputime)
{
	struct signal_struct *sig = task->signal;
	struct task_cputime *times;

	if (!sig->cputime.totals)
		return;
times = per_cpu_ptr(sig->cputime.totals, get_cpu());
times->stime = cputime_add(times->stime, cputime);
put_cpu_no_resched();
}
To deal with the need for bookkeeping with multiple threads in the SMP
case (where there isn't a per-cpu structure until it's needed), I'll
allocate the per-cpu structure in __exit_signal() where the relevant
fields are updated. I'll also allocate it where I do now, in
do_setitimer(), when needed. The allocation will be a "return 0" for UP
and a call to "thread_group_times_alloc_smp()" (which lives in sched.c)
for SMP.
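A sketch of that wrapper (the SMP helper's exact signature is a guess):

	#ifdef CONFIG_SMP
	extern int thread_group_times_alloc_smp(struct task_struct *tsk);

	static inline int thread_group_times_alloc(struct task_struct *tsk)
	{
		return thread_group_times_alloc_smp(tsk);
	}
	#else
	static inline int thread_group_times_alloc(struct task_struct *tsk)
	{
		return 0;	/* UP: totals are embedded in signal_struct */
	}
	#endif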
I'll also optimize run_posix_cpu_timers() as you suggest, and eliminate
rlim_expires.
Expect a new patch fairly soon.
--
Frank Mayhar <[email protected]>
Google, Inc.