2009-11-12 08:51:16

by Stephen Rothwell

Subject: linux-next: Tree for November 12

Hi all,

Changes since 20091111:

Linus' tree gained a build failure for which I applied a patch.

The net tree lost 3 conflicts.

The cpufreq tree gained a conflict against the acpi tree.

The rr tree lost its build failure but exposed build failures in the
powerpc and sparc trees for which I applied patches. It also had another
build failure due to an interaction with the kbuild tree for which I
applied another patch.

The i7core_edac tree gained a build failure for which I have applied a
patch.

The tip tree lost its build failure.

The sysctl tree gained a build failure for which I applied a patch.

The usb tree lost its conflicts.

----------------------------------------------------------------------------

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/v2.6/next/ ). If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one. You should use "git fetch" as mentioned in the FAQ on the wiki
(see below).

You can see which trees have been included by looking in the Next/Trees
file in the source. There are also quilt-import.log and merge.log files
in the Next directory. Between each merge, the tree was built with
a ppc64_defconfig for powerpc and an allmodconfig for x86_64. After the
final fixups (if any), it is also built with powerpc allnoconfig (32 and
64 bit), ppc44x_defconfig and allyesconfig (minus
CONFIG_PROFILE_ALL_BRANCHES - this fails its final link) and i386, sparc
and sparc64 defconfig. These builds also have
CONFIG_ENABLE_WARN_DEPRECATED, CONFIG_ENABLE_MUST_CHECK and
CONFIG_DEBUG_INFO disabled when necessary.

Below is a summary of the state of the merge.

We are up to 148 trees (counting Linus' and 22 trees of patches pending for
Linus' tree), more are welcome (even if they are currently empty).
Thanks to those who have contributed, and to those who haven't, please do.

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next . If maintainers want to give
advice about cross compilers/configs that work, we are always open to
adding more builds.

Thanks to Jan Dittmer for adding the linux-next tree to his build tests
at http://l4x.org/k/ , the guys at http://test.kernel.org/ and Randy
Dunlap for doing many randconfig builds.

There is a wiki covering stuff to do with linux-next at
http://linux.f-seidel.de/linux-next/pmwiki/ . Thanks to Frank Seidel.

--
Cheers,
Stephen Rothwell [email protected]

$ git checkout master
$ git reset --hard stable
Merging origin/master
Merging fixes/fixes
Merging arm-current/master
Merging m68k-current/for-linus
Merging powerpc-merge/merge
Merging sparc-current/master
Merging scsi-rc-fixes/master
Merging net-current/master
Merging sound-current/for-linus
Merging pci-current/for-linus
Merging wireless-current/master
Merging kbuild-current/master
Merging quilt/driver-core.current
Merging quilt/tty.current
Merging quilt/usb.current
Merging quilt/staging.current
Merging cpufreq-current/fixes
Merging input-current/for-linus
Merging md-current/for-linus
Merging audit-current/for-linus
Merging crypto-current/master
Merging ide-curent/master
Merging dwmw2/master
Applying: jbd: export log_start_commit for ext3
Merging arm/devel
Merging davinci/davinci-next
Merging msm/for-next
Merging omap/for-next
Merging pxa/for-next
Merging avr32/avr32-arch
CONFLICT (content): Merge conflict in arch/avr32/mach-at32ap/include/mach/cpu.h
Merging blackfin/for-linus
Merging cris/for-next
Merging ia64/test
Merging m68k/for-next
CONFLICT (content): Merge conflict in drivers/rtc/Kconfig
Merging m68knommu/for-next
Merging microblaze/next
Merging mips/mips-for-linux-next
Merging parisc/next
Merging powerpc/next
Merging 4xx/next
Merging 52xx-and-virtex/next
Merging galak/next
Merging s390/features
Merging sh/master
Merging sparc/master
Merging xtensa/master
Merging ceph/for-next
Merging cifs/master
Merging configfs/linux-next
Merging ecryptfs/next
Merging ext3/for_next
Merging ext4/next
Merging fatfs/master
Merging fuse/for-next
Merging gfs2/master
Merging jfs/next
Merging nfs/linux-next
Merging nfsd/nfsd-next
Merging nilfs2/for-next
Merging ocfs2/linux-next
Merging squashfs/master
Merging udf/for_next
Merging v9fs/for-next
Merging ubifs/linux-next
Merging xfs/master
Merging reiserfs-bkl/reiserfs/kill-bkl
Merging vfs/for-next
Merging pci/linux-next
Merging hid/for-next
Merging quilt/i2c
Merging quilt/jdelvare-hwmon
Merging quilt/kernel-doc
Merging v4l-dvb/master
CONFLICT (content): Merge conflict in drivers/media/common/tuners/tda18271-fe.c
Merging kbuild/master
Merging kconfig/for-next
Merging ide/master
Merging libata/NEXT
Merging infiniband/for-next
Merging acpi/test
Merging ieee1394/for-next
Merging ubi/linux-next
Merging kvm/linux-next
CONFLICT (content): Merge conflict in arch/powerpc/kvm/timing.h
Merging dlm/next
Merging scsi/master
Merging async_tx/next
Merging net/master
CONFLICT (delete/modify): drivers/net/sfc/sfe4001.c deleted in net/master and modified in HEAD. Version HEAD of drivers/net/sfc/sfe4001.c left in tree.
CONFLICT (content): Merge conflict in drivers/net/wireless/libertas/cmd.c
CONFLICT (content): Merge conflict in drivers/staging/Kconfig
CONFLICT (content): Merge conflict in drivers/staging/Makefile
CONFLICT (content): Merge conflict in drivers/staging/rtl8187se/Kconfig
CONFLICT (content): Merge conflict in drivers/staging/rtl8192e/Kconfig
$ git rm -f drivers/net/sfc/sfe4001.c
Applying: net: merge fixup for drivers/net/sfc/falcon_boards.c
Merging wireless/master
Merging mtd/master
Merging crypto/master
Merging sound/for-next
CONFLICT (content): Merge conflict in arch/arm/mach-omap2/board-omap3evm.c
Merging cpufreq/next
CONFLICT (content): Merge conflict in include/acpi/processor.h
Merging quilt/rr
Merging mmc/next
Merging tmio-mmc/linux-next
Merging input/next
Merging lsm/for-next
Merging block/for-next
Merging quilt/device-mapper
Merging embedded/master
Merging firmware/master
Merging pcmcia/master
CONFLICT (content): Merge conflict in drivers/mtd/maps/pcmciamtd.c
CONFLICT (content): Merge conflict in drivers/net/wireless/ray_cs.c
CONFLICT (content): Merge conflict in drivers/pcmcia/Makefile
Merging battery/master
Merging leds/for-mm
Merging backlight/for-mm
Merging kgdb/kgdb-next
Merging slab/for-next
Merging uclinux/for-next
Merging md/for-next
Merging mfd/for-next
Merging hdlc/hdlc-next
Merging drm/drm-next
Merging voltage/for-next
CONFLICT (content): Merge conflict in drivers/mfd/Kconfig
CONFLICT (content): Merge conflict in drivers/mfd/Makefile
Merging security-testing/next
CONFLICT (content): Merge conflict in Documentation/dontdiff
Merging lblnet/master
Merging agp/agp-next
Merging uwb/for-upstream
Merging watchdog/master
Merging bdev/master
Merging dwmw2-iommu/master
Merging cputime/cputime
Merging osd/linux-next
Merging jc_docs/docs-next
Merging nommu/master
Merging trivial/for-next
Merging audit/for-next
Merging quilt/aoe
Merging suspend/linux-next
Merging bluetooth/master
Merging fsnotify/for-next
Merging irda/for-next
Merging hwlat/for-linus
CONFLICT (content): Merge conflict in drivers/misc/Makefile
Merging drbd/for-jens
Merging catalin/for-next
Merging alacrity/linux-next
CONFLICT (content): Merge conflict in drivers/net/Kconfig
CONFLICT (content): Merge conflict in include/linux/Kbuild
CONFLICT (content): Merge conflict in lib/Kconfig
Merging i7core_edac/linux_next
Applying: i7core_edac: do not export static functions
Merging devicetree/next-devicetree
Merging limits/writable_limits
Merging tip/auto-latest
CONFLICT (content): Merge conflict in arch/x86/kernel/kgdb.c
CONFLICT (content): Merge conflict in kernel/irq/chip.c
Merging oprofile/for-next
Merging percpu/for-next
CONFLICT (content): Merge conflict in arch/powerpc/platforms/pseries/hvCall.S
CONFLICT (content): Merge conflict in arch/x86/kvm/svm.c
CONFLICT (content): Merge conflict in kernel/softlockup.c
CONFLICT (content): Merge conflict in mm/percpu.c
Applying: percpu: merge fixup for variable renaming
Merging sfi/sfi-test
Merging asm-generic/next
Merging hwpoison/hwpoison
Merging sysctl/master
Merging quilt/driver-core
Merging quilt/tty
Merging quilt/usb
Merging quilt/staging
CONFLICT (content): Merge conflict in drivers/staging/Kconfig
CONFLICT (content): Merge conflict in drivers/staging/Makefile
CONFLICT (content): Merge conflict in drivers/staging/comedi/drivers/cb_das16_cs.c
CONFLICT (content): Merge conflict in drivers/staging/comedi/drivers/ni_labpc_cs.c
CONFLICT (content): Merge conflict in drivers/staging/comedi/drivers/ni_mio_cs.c
Merging scsi-post-merge/master
Applying: modpost: autoconf.h has moved to include/generated
Applying: sysctl: fix build dependency on CONFIG_NET
Applying: powerpc: do not export pci_alloc/free_consistent
Applying: sparc64: don't export static inline pci_ functions



2009-11-12 11:53:34

by Sachin Sant

Subject: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Stephen Rothwell wrote:
> Hi all,
>
> Changes since 20091111:
>
I came across the following bug while executing cpu hotplug tests
on an x86_64 box. This is with next version 2.6.32-rc6-20091112.
(20280eab85704dcd05a20903f0de80be1c761c6e)

This is a 4-way box. The problem is not always reproducible and
can only be recreated after some amount of activity.

------------[ cut here ]------------
kernel BUG at kernel/sched.c:7359!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu1/online
CPU 0
Modules linked in: ipv6 fuse loop dm_mod sg mptctl bnx2 rtc_cmos rtc_core
rtc_lib i2c_piix4 tpm_tis serio_raw button shpchp pcspkr tpm i2c_core
pci_hotplug k8temp tpm_bios ohci_hcd ehci_hcd sd_mod crc_t10dif usbcore edd fan
thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas
scsi_mod
Pid: 11504, comm: hotplug04.sh Not tainted 2.6.32-rc6-autotest-next-20091112 #1
BladeCenter LS21 -[79716AA]-
RIP: 0010:[<ffffffff8134a744>] [<ffffffff8134a744>] migration_call+0x381/0x51a
RSP: 0018:ffff8801159fdd48 EFLAGS: 00010046
RAX: 0000000000000001 RBX: ffff88011e2de180 RCX: ffffffffff8d8f20
RDX: ffff880028280000 RSI: ffff880028293f88 RDI: ffff880127a3e708
RBP: ffff8801159fdd98 R08: 0000000000000000 R09: 000000046c250cb4
R10: dead000000100100 R11: 7fffffffffffffff R12: ffffffff816d7020
R13: ffff880028293f00 R14: ffff880127a3e6c0 R15: ffff880028293f00
FS: 00007f782aef66f0(0000) GS:ffff880028200000(0000) knlGS:0000000055731b00
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000061f4f0 CR3: 00000001271a0000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hotplug04.sh (pid: 11504, threadinfo ffff8801159fc000, task
ffff8801293e2600)
Stack:
0000000000000001 0000000000013f00 0000000100000000 0000000000000001
<0> ffff8801159fddb8 0000000000000000 00000000fffffffe ffffffff8176c800
<0> 0000000000000001 0000000000000007 ffff8801159fddd8 ffffffff81351b16
Call Trace:
[<ffffffff81351b16>] notifier_call_chain+0x33/0x5b
[<ffffffff8105db28>] raw_notifier_call_chain+0xf/0x11
[<ffffffff8133db32>] _cpu_down+0x1f7/0x2f1
[<ffffffff8134cb6a>] ? wait_for_completion+0x18/0x1a
[<ffffffff8133dc74>] cpu_down+0x48/0x80
[<ffffffff8133f89a>] store_online+0x2c/0x6f
[<ffffffff8128c44b>] sysdev_store+0x1b/0x1d
[<ffffffff811382f8>] sysfs_write_file+0xdf/0x114
[<ffffffff810e4401>] vfs_write+0xb4/0x186
[<ffffffff810e4597>] sys_write+0x47/0x6e
[<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b
Code: c6 75 05 48 8b 1b eb ed 49 8b 46 30 4c 89 f6 4c 89 ff ff 50 30 41 83 be
78 04 00 00 00 48 8b 45 b0 48 8b 14 c5 70 4d 77 81 75 04 <0f> 0b eb fe 49 8b 06
48 83 f8 40 75 04 0f 0b eb fe 48 8b 5d b8
RIP [<ffffffff8134a744>] migration_call+0x381/0x51a

kernel/sched.c:7359 corresponds to

/* called under rq->lock with disabled interrupts */
static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
{
struct rq *rq = cpu_rq(dead_cpu);

/* Must be exiting, otherwise would be on tasklist. */
BUG_ON(!p->exit_state); <<====

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-11-12 12:10:28

by Peter Zijlstra

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Thu, 2009-11-12 at 17:23 +0530, Sachin Sant wrote:
> Stephen Rothwell wrote:
> > Hi all,
> >
> > Changes since 20091111:
> >
> I came across the following bug while executing cpu hotplug tests
> on a x86_64 box. This is with next version 2.6.32-rc6-20091112.
> (20280eab85704dcd05a20903f0de80be1c761c6e)
>
> This is a 4 way box. The problem is not always reproducible and
> can be recreated only after some amount of activity.
>
> ------------[ cut here ]------------
> kernel BUG at kernel/sched.c:7359!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu1/online
> CPU 0
> Modules linked in: ipv6 fuse loop dm_mod sg mptctl bnx2 rtc_cmos rtc_core
> rtc_lib i2c_piix4 tpm_tis serio_raw button shpchp pcspkr tpm i2c_core
> pci_hotplug k8temp tpm_bios ohci_hcd ehci_hcd sd_mod crc_t10dif usbcore edd fan
> thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas
> scsi_mod
> Pid: 11504, comm: hotplug04.sh Not tainted 2.6.32-rc6-autotest-next-20091112 #1
> BladeCenter LS21 -[79716AA]-
> RIP: 0010:[<ffffffff8134a744>] [<ffffffff8134a744>] migration_call+0x381/0x51a
> RSP: 0018:ffff8801159fdd48 EFLAGS: 00010046
> RAX: 0000000000000001 RBX: ffff88011e2de180 RCX: ffffffffff8d8f20
> RDX: ffff880028280000 RSI: ffff880028293f88 RDI: ffff880127a3e708
> RBP: ffff8801159fdd98 R08: 0000000000000000 R09: 000000046c250cb4
> R10: dead000000100100 R11: 7fffffffffffffff R12: ffffffff816d7020
> R13: ffff880028293f00 R14: ffff880127a3e6c0 R15: ffff880028293f00
> FS: 00007f782aef66f0(0000) GS:ffff880028200000(0000) knlGS:0000000055731b00
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000000061f4f0 CR3: 00000001271a0000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process hotplug04.sh (pid: 11504, threadinfo ffff8801159fc000, task
> ffff8801293e2600)
> Stack:
> 0000000000000001 0000000000013f00 0000000100000000 0000000000000001
> <0> ffff8801159fddb8 0000000000000000 00000000fffffffe ffffffff8176c800
> <0> 0000000000000001 0000000000000007 ffff8801159fddd8 ffffffff81351b16
> Call Trace:
> [<ffffffff81351b16>] notifier_call_chain+0x33/0x5b
> [<ffffffff8105db28>] raw_notifier_call_chain+0xf/0x11
> [<ffffffff8133db32>] _cpu_down+0x1f7/0x2f1
> [<ffffffff8134cb6a>] ? wait_for_completion+0x18/0x1a
> [<ffffffff8133dc74>] cpu_down+0x48/0x80
> [<ffffffff8133f89a>] store_online+0x2c/0x6f
> [<ffffffff8128c44b>] sysdev_store+0x1b/0x1d
> [<ffffffff811382f8>] sysfs_write_file+0xdf/0x114
> [<ffffffff810e4401>] vfs_write+0xb4/0x186
> [<ffffffff810e4597>] sys_write+0x47/0x6e
> [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b
> Code: c6 75 05 48 8b 1b eb ed 49 8b 46 30 4c 89 f6 4c 89 ff ff 50 30 41 83 be
> 78 04 00 00 00 48 8b 45 b0 48 8b 14 c5 70 4d 77 81 75 04 <0f> 0b eb fe 49 8b 06
> 48 83 f8 40 75 04 0f 0b eb fe 48 8b 5d b8
> RIP [<ffffffff8134a744>] migration_call+0x381/0x51a
>
> kernel/sched.c:7359 corresponds to
>
> /* called under rq->lock with disabled interrupts */
> static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
> {
> struct rq *rq = cpu_rq(dead_cpu);
>
> /* Must be exiting, otherwise would be on tasklist. */
> BUG_ON(!p->exit_state); <<====

I'm pretty sure we stumbled on a TASK_WAKING task there; I'm trying to
sort out the locking, but it's a bit of a maze :/

How reproducible is this?

2009-11-12 12:23:27

by Sachin Sant

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
>> ------------[ cut here ]------------
>> kernel BUG at kernel/sched.c:7359!
>> invalid opcode: 0000 [#1] SMP
>> last sysfs file: /sys/devices/system/cpu/cpu1/online
>> CPU 0
>> Modules linked in: ipv6 fuse loop dm_mod sg mptctl bnx2 rtc_cmos rtc_core
>> rtc_lib i2c_piix4 tpm_tis serio_raw button shpchp pcspkr tpm i2c_core
>> pci_hotplug k8temp tpm_bios ohci_hcd ehci_hcd sd_mod crc_t10dif usbcore edd fan
>> thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas
>> scsi_mod
>> Pid: 11504, comm: hotplug04.sh Not tainted 2.6.32-rc6-autotest-next-20091112 #1
>> BladeCenter LS21 -[79716AA]-
>> RIP: 0010:[<ffffffff8134a744>] [<ffffffff8134a744>] migration_call+0x381/0x51a
>> RSP: 0018:ffff8801159fdd48 EFLAGS: 00010046
>> RAX: 0000000000000001 RBX: ffff88011e2de180 RCX: ffffffffff8d8f20
>> RDX: ffff880028280000 RSI: ffff880028293f88 RDI: ffff880127a3e708
>> RBP: ffff8801159fdd98 R08: 0000000000000000 R09: 000000046c250cb4
>> R10: dead000000100100 R11: 7fffffffffffffff R12: ffffffff816d7020
>> R13: ffff880028293f00 R14: ffff880127a3e6c0 R15: ffff880028293f00
>> FS: 00007f782aef66f0(0000) GS:ffff880028200000(0000) knlGS:0000000055731b00
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 000000000061f4f0 CR3: 00000001271a0000 CR4: 00000000000006f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process hotplug04.sh (pid: 11504, threadinfo ffff8801159fc000, task
>> ffff8801293e2600)
>> Stack:
>> 0000000000000001 0000000000013f00 0000000100000000 0000000000000001
>> <0> ffff8801159fddb8 0000000000000000 00000000fffffffe ffffffff8176c800
>> <0> 0000000000000001 0000000000000007 ffff8801159fddd8 ffffffff81351b16
>> Call Trace:
>> [<ffffffff81351b16>] notifier_call_chain+0x33/0x5b
>> [<ffffffff8105db28>] raw_notifier_call_chain+0xf/0x11
>> [<ffffffff8133db32>] _cpu_down+0x1f7/0x2f1
>> [<ffffffff8134cb6a>] ? wait_for_completion+0x18/0x1a
>> [<ffffffff8133dc74>] cpu_down+0x48/0x80
>> [<ffffffff8133f89a>] store_online+0x2c/0x6f
>> [<ffffffff8128c44b>] sysdev_store+0x1b/0x1d
>> [<ffffffff811382f8>] sysfs_write_file+0xdf/0x114
>> [<ffffffff810e4401>] vfs_write+0xb4/0x186
>> [<ffffffff810e4597>] sys_write+0x47/0x6e
>> [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b
>> Code: c6 75 05 48 8b 1b eb ed 49 8b 46 30 4c 89 f6 4c 89 ff ff 50 30 41 83 be
>> 78 04 00 00 00 48 8b 45 b0 48 8b 14 c5 70 4d 77 81 75 04 <0f> 0b eb fe 49 8b 06
>> 48 83 f8 40 75 04 0f 0b eb fe 48 8b 5d b8
>> RIP [<ffffffff8134a744>] migration_call+0x381/0x51a
> I'm pretty sure we stumbled on a TASK_WAKING task there, trying to sort
> out the locking there, its a bit of a maze :/
>
> How reproducable is this?
>
I was able to recreate this once out of three tries.

When I was able to recreate this bug, the box had been
running for a while and I had executed a series of tests
(kernbench, hackbench, hugetlbfs) before cpu_hotplug.

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-11-12 12:27:16

by Peter Zijlstra

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Thu, 2009-11-12 at 17:53 +0530, Sachin Sant wrote:
> > How reproducable is this?
> >
> I was able to recreate this once out of three tries.
>
> When i was able to recreate this bug, the box had been
> running for a while and i had executed series of tests
> (kernbench, hackbench, hugetlbfs) before cpu_hotplug.

OK good, it's easier to test patches when the thing is relatively easy
to reproduce. I'll send you something to test once I've got a handle on it.

Thanks!

2009-11-12 17:10:40

by Peter Zijlstra

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Thu, 2009-11-12 at 13:27 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-12 at 17:53 +0530, Sachin Sant wrote:
> > > How reproducable is this?
> > >
> > I was able to recreate this once out of three tries.
> >
> > When i was able to recreate this bug, the box had been
> > running for a while and i had executed series of tests
> > (kernbench, hackbench, hugetlbfs) before cpu_hotplug.
>
> OK good, its easier to test patches when the thing is relatively easy to
> reproduce. I'll send you something to test once I've got a handle on it.

OK.. so on hotplug we do:

cpu_down
  set_cpu_active(false)
  _cpu_down
    notify(CPU_DOWN_PREPARE)
    stop_machine(take_cpu_down)
      __cpu_disable()
      set_cpu_online(false);
      notify(CPU_DYING)
    __cpu_die() /* note no more stop_machine */
    notify(CPU_DEAD)

Then on the scheduler hotplug notifier (migration_call), we mostly deal
with CPU_DEAD, where we do:

case CPU_DEAD:
case CPU_DEAD_FROZEN:
	cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
	migrate_live_tasks(cpu);

	rq = cpu_rq(cpu);
	kthread_stop(rq->migration_thread);
	put_task_struct(rq->migration_thread);
	rq->migration_thread = NULL;

	/* Idle task back to normal (off runqueue, low prio) */
	spin_lock_irq(&rq->lock);
	update_rq_clock(rq);
	deactivate_task(rq, rq->idle, 0);
	rq->idle->static_prio = MAX_PRIO;
	__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
	rq->idle->sched_class = &idle_sched_class;
	migrate_dead_tasks(cpu);
	spin_unlock_irq(&rq->lock);

	cpuset_unlock();

Here migrate_live_tasks() basically iterates the full task list and,
for each task where task_cpu() == dead_cpu, invokes __migrate_task() to
move it to an online cpu.
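The walk described above can be sketched as a toy model (all names here
are illustrative stand-ins, not kernel API: a real task_struct carries far
more state, and __migrate_task() takes runqueue locks that this sketch
ignores):

```c
#include <stddef.h>

/* toy stand-in for a task: only the cpu it sits on matters here */
struct toy_task { int cpu; };

/* pick the first online cpu, or -1 if none (models "an online cpu") */
static int toy_pick_online(const int *online, int ncpus)
{
	for (int i = 0; i < ncpus; i++)
		if (online[i])
			return i;
	return -1;
}

/* iterate the full task list; every task still on dead_cpu is moved */
static void toy_migrate_live_tasks(struct toy_task *tasks, size_t ntasks,
				   const int *online, int ncpus, int dead_cpu)
{
	int target = toy_pick_online(online, ncpus);

	for (size_t i = 0; i < ntasks; i++)
		if (tasks[i].cpu == dead_cpu)	/* task_cpu(p) == dead_cpu */
			tasks[i].cpu = target;	/* stands in for __migrate_task() */
}
```

The race below hinges on this walk being a one-shot pass: any task woken
onto the dead cpu after the pass has finished is missed.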

Furthermore, the sched_domain notifier (update_sched_domains), will on
CPU_DEAD rebuild the sched_domain tree.

Now, I think this all can race against try_to_wake_up() when
select_task_rq_fair() hits the old sched_domain tree, because it only
overlays the sched_domain masks against p->cpus_allowed, without regard
for cpu_online_mask.

It could therefore return an offline cpu and move a task onto it after
migrate_live_tasks() and before migrate_dead_tasks().

The trivial solution that comes to mind is something like this:

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f2e99d..15dcb41 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2376,7 +2376,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 	p->state = TASK_WAKING;
 	task_rq_unlock(rq, &flags);
 
+again:
 	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+	if (!cpu_active(cpu))
+		goto again;
+
 	if (cpu != orig_cpu) {
 		local_irq_save(flags);
 		rq = cpu_rq(cpu);


However, Mike ran into a similar problem and we tried something similar
and that deadlocked for him -- something which I can see happen when we
do this from an interrupt on the machine running the CPU_DEAD notifier
and the update_sched_domains() notifier will be run after the
migration_call() notifier.

So what we need to do is make the whole of select_task_rq_fair()
cpu_online/active_mask aware, or give up and simply punt:

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f2e99d..62df61c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2377,6 +2377,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 	task_rq_unlock(rq, &flags);
 
 	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+	if (!cpu_active(cpu))
+		cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
+
 	if (cpu != orig_cpu) {
 		local_irq_save(flags);
 		rq = cpu_rq(cpu);


Something I think Mike also tried and didn't deadlock for him..

Sachin, Mike, could you try the above snippet and verify if it does
indeed solve your respective issues?

/me prays it does, because otherwise I'm fresh out of clue...
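As a sanity check, the punt in the second patch can be modeled with plain
bitmasks (a toy model with hypothetical helper names, not the kernel's
struct cpumask / cpumask_any_and(); bit i set == cpu i in the mask):

```c
/* first cpu present in both masks, or 32 if none
 * (32 models the kernel's "cpu >= nr_cpu_ids" no-cpu-found case) */
static int toy_any_and(unsigned int allowed, unsigned int active)
{
	unsigned int both = allowed & active;

	for (int cpu = 0; cpu < 32; cpu++)
		if (both & (1u << cpu))
			return cpu;
	return 32;
}

/* a stale selection from select_task_rq() gets re-checked: if the
 * chosen cpu went inactive meanwhile, punt to any active cpu the
 * task is allowed to run on */
static int toy_wake_cpu(int selected, unsigned int allowed,
			unsigned int active)
{
	if (!(active & (1u << selected)))
		return toy_any_and(allowed, active);
	return selected;
}
```

Note the model also shows the remaining hole: when allowed and active do
not intersect at all, there is no valid answer, which is the case the
later cpuset_cpus_allowed_locked() retry tries to handle.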

2009-11-12 17:41:11

by Randy Dunlap

Subject: Re: linux-next: Tree for November 12 (acpi/processor.h)

On Thu, 12 Nov 2009 19:51:01 +1100 Stephen Rothwell wrote:

> Hi all,
>
> Changes since 20091111:
>
> The cpufreq tree gained a conflict against the acpi tree.

when CONFIG_CPU_FREQ=n:

arch/x86/kernel/acpi/processor.o: In function `acpi_processor_get_bios_limit':
(.text+0x0): multiple definition of `acpi_processor_get_bios_limit'
arch/x86/kernel/acpi/cstate.o:(.text+0x0): first defined here


The function definition in include/acpi/processor.h needs to be "static inline"
at line 323.
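The pattern behind the fix, as a hypothetical sketch (name and signature
are illustrative stand-ins, not the real ACPI API): a function *defined*
in a header must be 'static inline', because without 'static' every
translation unit including the header emits an external-linkage
definition, and linking two such objects fails with exactly the
"multiple definition of ..." error above.

```c
/* header-style definition: 'static inline' gives each translation
 * unit its own private copy, so no duplicate symbols at link time */
static inline int toy_get_bios_limit(unsigned int *limit)
{
	*limit = 0;	/* this stub reports "no BIOS limit" */
	return 0;
}
```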

---
~Randy

2009-11-12 18:11:18

by Randy Dunlap

Subject: Re: linux-next: Tree for November 12 (acpi_processor_get_bios_limit)

On Thu, 12 Nov 2009 09:40:12 -0800 Randy Dunlap wrote:

> On Thu, 12 Nov 2009 19:51:01 +1100 Stephen Rothwell wrote:
>
> > Hi all,
> >
> > Changes since 20091111:
> >
> > The cpufreq tree gained a conflict against the acpi tree.
>
> when CONFIG_CPU_FREQ=n:
>
> arch/x86/kernel/acpi/processor.o: In function `acpi_processor_get_bios_limit':
> (.text+0x0): multiple definition of `acpi_processor_get_bios_limit'
> arch/x86/kernel/acpi/cstate.o:(.text+0x0): first defined here
>
>
> The function definition in include/acpi/processor.h needs to be "static inline"
> at line 323.

---
However, even with that fixed, when

CONFIG_CPU_FREQ=y
CONFIG_ACPI=n
CONFIG_SFI=y
CONFIG_APM=y

there is this build error:

arch/x86/kernel/cpu/cpufreq/powernow-k7.c:720: error: 'acpi_processor_get_bios_limit' undeclared here (not in a function)

---
~Randy

2009-11-12 23:46:27

by Randy Dunlap

Subject: [PATCH -next] staging/line6: fix printk formats

From: Randy Dunlap <[email protected]>

Fix printk format warnings in line6/pod.c; sizeof() is of type
size_t, so use %zu.

drivers/staging/line6/pod.c:581: warning: format '%d' expects type 'int', but argument 4 has type 'long unsigned int'
drivers/staging/line6/pod.c:693: warning: format '%d' expects type 'int', but argument 4 has type 'long unsigned int'
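A minimal illustration of the mismatch, using a stand-in struct rather
than the driver's real type: sizeof yields size_t, which on LP64 targets
is 64-bit while int is 32-bit, hence the %d warning; %zu is the C99
length modifier for size_t.

```c
#include <stdio.h>
#include <string.h>

/* stand-in for pod->prog_data; only its size matters here */
struct toy_prog_data { unsigned char raw[64]; };

/* formats the error message the way the patched dev_err() call does,
 * with %zu matching the size_t produced by sizeof */
static void toy_format_msg(char *buf, size_t buflen)
{
	snprintf(buf, buflen, "data block must be exactly %zu bytes",
		 sizeof(struct toy_prog_data));
}
```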

Signed-off-by: Randy Dunlap <[email protected]>
---
drivers/staging/line6/pod.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-next-20091112.orig/drivers/staging/line6/pod.c
+++ linux-next-20091112/drivers/staging/line6/pod.c
@@ -579,7 +579,7 @@ static ssize_t pod_set_dump(struct devic

if (count != sizeof(pod->prog_data)) {
dev_err(pod->line6.ifcdev,
- "data block must be exactly %d bytes\n",
+ "data block must be exactly %zu bytes\n",
sizeof(pod->prog_data));
return -EINVAL;
}
@@ -691,7 +691,7 @@ static ssize_t pod_set_dump_buf(struct d

if (count != sizeof(pod->prog_data)) {
dev_err(pod->line6.ifcdev,
- "data block must be exactly %d bytes\n",
+ "data block must be exactly %zu bytes\n",
sizeof(pod->prog_data));
return -EINVAL;
}

2009-11-13 09:00:31

by Sachin Sant

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> So what we need to do is make the whole of select_task_rq_fair()
> cpu_online/active_mask aware, or give up and simply punt:
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f2e99d..62df61c 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2377,6 +2377,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
> task_rq_unlock(rq, &flags);
>
> cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> + if (!cpu_active(cpu))
> + cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
> +
> if (cpu != orig_cpu) {
> local_irq_save(flags);
> rq = cpu_rq(cpu);
>
>
> Something I think Mike also tried and didn't deadlock for him..
>
> Sachin, Mike, could you try the above snippet and verify if it does
> indeed solve your respective issues?
>
Unfortunately the above patch made things worse. With this patch
the machine failed to boot with the following oops:

CPU0: Dual-Core AMD Opteron(tm) Processor 2218 stepping 02
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff81061f17>] set_task_cpu+0x189/0x1ed
PGD 0
Oops: 0000 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 3, comm: kthreadd Not tainted 2.6.32-rc7-next-20091113 #1 BladeCenter LS21 -[79716AA]-
RIP: 0010:[<ffffffff81061f17>] [<ffffffff81061f17>] set_task_cpu+0x189/0x1ed
RSP: 0018:ffff88012b357dd0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88012b340000
RBP: ffff88012b357e10 R08: 0000000000000004 R09: ffff88012b3401f8
R10: 00000000000cffa7 R11: 0000000000000000 R12: ffff88012b340000
R13: 000000000c28ccf6 R14: 0000000000000004 R15: ffff880028214cc0
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 000000000174e000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kthreadd (pid: 3, threadinfo ffff88012b356000, task ffff88012b3431c0)
Stack:
ffff880028214d20 0000000000000000 0000000028215640 0000000000000000
<0> ffff88012b340000 0000000000000001 ffff880028214cc0 0000000000000000
<0> ffff88012b357e60 ffffffff81063a75 0000000000000000 0000000000000000
Call Trace:
[<ffffffff81063a75>] try_to_wake_up+0x103/0x31f
[<ffffffff81063c9e>] default_wake_function+0xd/0xf
[<ffffffff810519a7>] __wake_up_common+0x46/0x76
[<ffffffff810648ae>] ? migration_thread+0x0/0x285
[<ffffffff810577c8>] complete+0x38/0x4b
[<ffffffff8108040a>] kthread+0x67/0x85
[<ffffffff810298fa>] child_rip+0xa/0x20
[<ffffffff810803a3>] ? kthread+0x0/0x85
[<ffffffff810298f0>] ? child_rip+0x0/0x20
Code: 00 8b 05 dd d7 df 04 85 c0 74 19 45 31 c0 31 c9 ba 01 00 00 00 be 01 00 00 00 bf 04 00 00 00 e8 79 02 07 00 48 8b 55 c8 44 89 f1 <48> 8b 42 20 48 8b 55 c0 49 03 84 24 88 00 00 00 48 2b 42 20 49
RIP [<ffffffff81061f17>] set_task_cpu+0x189/0x1ed
RSP <ffff88012b357dd0>
CR2: 0000000000000020
---[ end trace 4eaa2a86a8e2da22 ]---

I tried this with today's next (2.6.32-rc7-20091113) + the above patch.
Here is how the code looks after applying the patch...

	task_rq_unlock(rq, &flags);

	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
	if (!cpu_active(cpu))
		cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);

	if (cpu != orig_cpu)
		set_task_cpu(p, cpu);

Thanks
-Sachin


--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-11-13 09:06:30

by Peter Zijlstra

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-11-13 at 14:30 +0530, Sachin Sant wrote:
> Peter Zijlstra wrote:
> > So what we need to do is make the whole of select_task_rq_fair()
> > cpu_online/active_mask aware, or give up and simply punt:
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 1f2e99d..62df61c 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2377,6 +2377,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
> > task_rq_unlock(rq, &flags);
> >
> > cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> > + if (!cpu_active(cpu))
> > + cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
> > +
> > if (cpu != orig_cpu) {
> > local_irq_save(flags);
> > rq = cpu_rq(cpu);
> >
> >
> > Something I think Mike also tried and didn't deadlock for him..
> >
> > Sachin, Mike, could you try the above snippet and verify if it does
> > indeed solve your respective issues?
> >
> Unfortunately the above patch made things worse. With this patch
> the machine failed to boot with following oops

Ugh, more head scratching for me then..

Thanks for testing.

by Gautham R Shenoy

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Thu, Nov 12, 2009 at 06:10:31PM +0100, Peter Zijlstra wrote:
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f2e99d..62df61c 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2377,6 +2377,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
> task_rq_unlock(rq, &flags);
>

How about this ?

again:
	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
	if (!cpu_online(cpu))
		cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
	if (!cpu) {
		set_task_affinity();
		goto again;
	}
> +
> if (cpu != orig_cpu) {
> local_irq_save(flags);
> rq = cpu_rq(cpu);

Will it help further narrow down the window ?
>
>
> Something I think Mike also tried and didn't deadlock for him..
>
> Sachin, Mike, could you try the above snippet and verify if it does
> indeed solve your respective issues?
>
> /me prays it does, because otherwise I'm fresh out of clue...

--
Thanks and Regards
gautham

2009-11-13 10:16:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-11-13 at 15:28 +0530, Gautham R Shenoy wrote:
> On Thu, Nov 12, 2009 at 06:10:31PM +0100, Peter Zijlstra wrote:
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 1f2e99d..62df61c 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2377,6 +2377,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
> > task_rq_unlock(rq, &flags);
> >
>
> How about this ?
>
> again:
> cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> if (!cpu_online(cpu))
> cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
> if (!cpu) {
> set_task_affinity();
> goto again;
> }
> > +
> > if (cpu != orig_cpu) {
> > local_irq_save(flags);
> > rq = cpu_rq(cpu);

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2376,7 +2376,15 @@ static int try_to_wake_up(struct task_st
p->state = TASK_WAKING;
__task_rq_unlock(rq);

+again:
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ if (!cpu_online(cpu))
+ cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
+ if (cpu >= nr_cpu_ids) {
+ cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
+ goto again;
+ }
+
if (cpu != orig_cpu) {
rq = cpu_rq(cpu);
update_rq_clock(rq);

is what I stuck in and am compiling now.. we'll see what that does.

2009-11-13 10:31:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-11-13 at 11:16 +0100, Peter Zijlstra wrote:
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2376,7 +2376,15 @@ static int try_to_wake_up(struct task_st
> p->state = TASK_WAKING;
> __task_rq_unlock(rq);
>
> +again:
> cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> + if (!cpu_online(cpu))
> + cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
> + if (cpu >= nr_cpu_ids) {
> + cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> + goto again;
> + }
> +
> if (cpu != orig_cpu) {
> rq = cpu_rq(cpu);
> update_rq_clock(rq);
>
> is what I stuck in and am compiling now.. we'll see what that does.

Well, it boots for me, but then, I've not been able to reproduce any
issues anyway :/

/me goes try a PREEMPT=n kernel, since that is what Mike reports boot
funnies with..

Full running diff against -tip:

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 1f2e99d..7089063 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2374,17 +2374,24 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
p->state = TASK_WAKING;
- task_rq_unlock(rq, &flags);
+ __task_rq_unlock(rq);

+again:
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ if (!cpu_online(cpu))
+ cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
+ if (cpu >= nr_cpu_ids) {
+ printk(KERN_ERR "Breaking affinity on %d/%s\n", p->pid, p->comm);
+ cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
+ goto again;
+ }
+
if (cpu != orig_cpu) {
- local_irq_save(flags);
rq = cpu_rq(cpu);
update_rq_clock(rq);
set_task_cpu(p, cpu);
- local_irq_restore(flags);
}
- rq = task_rq_lock(p, &flags);
+ rq = __task_rq_lock(p);

WARN_ON(p->state != TASK_WAKING);
cpu = task_cpu(p);
@@ -7620,6 +7627,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
unsigned long flags;
struct rq *rq;

+ printk(KERN_ERR "migration call\n");
+
switch (action) {

case CPU_UP_PREPARE:
@@ -9186,6 +9195,8 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
static int update_sched_domains(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
+ printk(KERN_ERR "update_sched_domains\n");
+
switch (action) {
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5488a5d..0ff21af 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1345,6 +1345,37 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
}

/*
+ * Try and locate an idle CPU in the sched_domain.
+ */
+static int
+select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu = smp_processor_id();
+ int prev_cpu = task_cpu(p);
+ int i;
+
+ /*
+ * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
+ * test in select_task_rq_fair) and the prev_cpu is idle then that's
+ * always a better target than the current cpu.
+ */
+ if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+ return prev_cpu;
+
+ /*
+ * Otherwise, iterate the domain and find an eligible idle cpu.
+ */
+ for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+ if (!cpu_rq(i)->cfs.nr_running) {
+ target = i;
+ break;
+ }
+ }
+
+ return target;
+}
+
+/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
@@ -1398,37 +1429,34 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
want_sd = 0;
}

- if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
- int candidate = -1, i;
+ /*
+ * While iterating the domains looking for a spanning
+ * WAKE_AFFINE domain, adjust the affine target to any idle cpu
+ * in cache sharing domains along the way.
+ */
+ if (want_affine) {
+ int target = -1;

+ /*
+ * If both cpu and prev_cpu are part of this domain,
+ * cpu is a valid SD_WAKE_AFFINE target.
+ */
if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
- candidate = cpu;
+ target = cpu;

/*
- * Check for an idle shared cache.
+ * If there's an idle sibling in this domain, make that
+ * the wake_affine target instead of the current cpu.
*/
- if (tmp->flags & SD_PREFER_SIBLING) {
- if (candidate == cpu) {
- if (!cpu_rq(prev_cpu)->cfs.nr_running)
- candidate = prev_cpu;
- }
+ if (tmp->flags & SD_PREFER_SIBLING)
+ target = select_idle_sibling(p, tmp, target);

- if (candidate == -1 || candidate == cpu) {
- for_each_cpu(i, sched_domain_span(tmp)) {
- if (!cpumask_test_cpu(i, &p->cpus_allowed))
- continue;
- if (!cpu_rq(i)->cfs.nr_running) {
- candidate = i;
- break;
- }
- }
+ if (target >= 0) {
+ if (tmp->flags & SD_WAKE_AFFINE) {
+ affine_sd = tmp;
+ want_affine = 0;
}
- }
-
- if (candidate >= 0) {
- affine_sd = tmp;
- want_affine = 0;
- cpu = candidate;
+ cpu = target;
}
}


2009-11-13 10:49:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-11-13 at 11:31 +0100, Peter Zijlstra wrote:
> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
> funnies with..

Seems to boot just fine..

let me run a few benchmarks while I have:

while :;
do
echo 0 > /sys/devices/system/cpu/cpu1/online;
sleep .1;
echo 1 > /sys/devices/system/cpu/cpu1/online;
done &

running

2009-11-13 11:44:20

by Sachin Sant

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> Well, it boots for me, but then, I've not been able to reproduce any
> issues anyway :/
>
> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
> funnies with..
>
With the suggested changes against -next the machine boots fine.
After multiple runs of hackbench, kernbench and cpu_hotplug tests the
machine is still up and running. So at this point all is well.
I will continue to monitor the box for a while..

I just picked up the changes made to kernel/sched.c. Have attached
the changes here.

Thanks for all your help.

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


Attachments:
sched-next.patch (861.00 B)

2009-11-13 16:12:17

by Mike Galbraith

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-11-13 at 11:31 +0100, Peter Zijlstra wrote:

> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
> funnies with..

My highly intermittent funny is still there unfortunately.

dmesg|grep span
[ 0.504026] domain 0: span 0,3 level MC
[ 0.504032] domain 1: span 0-3 level CPU
[ 0.504042] domain 0: span 1-2 level MC
[ 0.504047] domain 1: span 0-3 level CPU
[ 0.504055] domain 0: span 1-2 level MC
[ 0.504060] domain 1: span 0-3 level CPU
[ 0.504069] domain 0: span 0,3 level MC
[ 0.504073] domain 1: span 0-3 level CPU

-Mike

2009-11-23 09:53:52

by Sachin Sant

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> Well, it boots for me, but then, I've not been able to reproduce any
> issues anyway :/
>
> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
> funnies with..
>
> Full running diff against -tip:
>
Peter, I can still recreate this issue with today's next (20091123).
Looks like the following patch hasn't been merged yet.

Thanks
-Sachin

> ---
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f2e99d..7089063 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2374,17 +2374,24 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
> if (task_contributes_to_load(p))
> rq->nr_uninterruptible--;
> p->state = TASK_WAKING;
> - task_rq_unlock(rq, &flags);
> + __task_rq_unlock(rq);
>
> +again:
> cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> + if (!cpu_online(cpu))
> + cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
> + if (cpu >= nr_cpu_ids) {
> + printk(KERN_ERR "Breaking affinity on %d/%s\n", p->pid, p->comm);
> + cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> + goto again;
> + }
> +
> if (cpu != orig_cpu) {
> - local_irq_save(flags);
> rq = cpu_rq(cpu);
> update_rq_clock(rq);
> set_task_cpu(p, cpu);
> - local_irq_restore(flags);
> }
> - rq = task_rq_lock(p, &flags);
> + rq = __task_rq_lock(p);
>
> WARN_ON(p->state != TASK_WAKING);
> cpu = task_cpu(p);
> @@ -7620,6 +7627,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> unsigned long flags;
> struct rq *rq;
>
> + printk(KERN_ERR "migration call\n");
> +
> switch (action) {
>
> case CPU_UP_PREPARE:
> @@ -9186,6 +9195,8 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
> static int update_sched_domains(struct notifier_block *nfb,
> unsigned long action, void *hcpu)
> {
> + printk(KERN_ERR "update_sched_domains\n");
> +
> switch (action) {
> case CPU_ONLINE:
> case CPU_ONLINE_FROZEN:
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5488a5d..0ff21af 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1345,6 +1345,37 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> }
>
> /*
> + * Try and locate an idle CPU in the sched_domain.
> + */
> +static int
> +select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + int cpu = smp_processor_id();
> + int prev_cpu = task_cpu(p);
> + int i;
> +
> + /*
> + * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
> + * test in select_task_rq_fair) and the prev_cpu is idle then that's
> + * always a better target than the current cpu.
> + */
> + if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
> + return prev_cpu;
> +
> + /*
> + * Otherwise, iterate the domain and find an eligible idle cpu.
> + */
> + for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> + if (!cpu_rq(i)->cfs.nr_running) {
> + target = i;
> + break;
> + }
> + }
> +
> + return target;
> +}
> +
> +/*
> * sched_balance_self: balance the current task (running on cpu) in domains
> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> * SD_BALANCE_EXEC.
> @@ -1398,37 +1429,34 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
> want_sd = 0;
> }
>
> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
> - int candidate = -1, i;
> + /*
> + * While iterating the domains looking for a spanning
> + * WAKE_AFFINE domain, adjust the affine target to any idle cpu
> + * in cache sharing domains along the way.
> + */
> + if (want_affine) {
> + int target = -1;
>
> + /*
> + * If both cpu and prev_cpu are part of this domain,
> + * cpu is a valid SD_WAKE_AFFINE target.
> + */
> if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> - candidate = cpu;
> + target = cpu;
>
> /*
> - * Check for an idle shared cache.
> + * If there's an idle sibling in this domain, make that
> + * the wake_affine target instead of the current cpu.
> */
> - if (tmp->flags & SD_PREFER_SIBLING) {
> - if (candidate == cpu) {
> - if (!cpu_rq(prev_cpu)->cfs.nr_running)
> - candidate = prev_cpu;
> - }
> + if (tmp->flags & SD_PREFER_SIBLING)
> + target = select_idle_sibling(p, tmp, target);
>
> - if (candidate == -1 || candidate == cpu) {
> - for_each_cpu(i, sched_domain_span(tmp)) {
> - if (!cpumask_test_cpu(i, &p->cpus_allowed))
> - continue;
> - if (!cpu_rq(i)->cfs.nr_running) {
> - candidate = i;
> - break;
> - }
> - }
> + if (target >= 0) {
> + if (tmp->flags & SD_WAKE_AFFINE) {
> + affine_sd = tmp;
> + want_affine = 0;
> }
> - }
> -
> - if (candidate >= 0) {
> - affine_sd = tmp;
> - want_affine = 0;
> - cpu = candidate;
> + cpu = target;
> }
> }
>
>
>
>


--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-11-25 13:43:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Mon, 2009-11-23 at 15:23 +0530, Sachin Sant wrote:
> Peter Zijlstra wrote:
> > Well, it boots for me, but then, I've not been able to reproduce any
> > issues anyway :/
> >
> > /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
> > funnies with..
> >
> > Full running diff against -tip:
> >
> Peter, I can still recreate this issue with today's next (20091123).
> Looks like the following patch hasn't been merged yet.

Correct, Ingo objected to the fastpath overhead.

Could you please try the below patch which tries to address the issue
differently.

---
Subject: sched: Fix balance vs hotplug race
From: Peter Zijlstra <[email protected]>
Date: Wed Nov 25 13:31:39 CET 2009

Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is supposed to
rule scheduler migration and load-balancing, except it never did.

The particular problem being solved here is a crash in
try_to_wake_up() where select_task_rq() ends up selecting an offline
cpu because select_task_rq_fair() trusts the sched_domain tree to reflect
the current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.

However, the sched_domains are updated from CPU_DEAD, which is after
the cpu is taken offline and after stop_machine is done. Therefore it
can race perfectly well with code assuming the domains are right.

Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/cpumask.h | 2 ++
kernel/cpu.c | 18 +++++++++++++-----
kernel/cpuset.c | 16 +++++++++-------
kernel/sched.c | 44 ++++++++++++++++++++++++++------------------
4 files changed, 50 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/cpumask.h
===================================================================
--- linux-2.6.orig/include/linux/cpumask.h
+++ linux-2.6/include/linux/cpumask.h
@@ -84,6 +84,7 @@ extern const struct cpumask *const cpu_a
#define num_online_cpus() cpumask_weight(cpu_online_mask)
#define num_possible_cpus() cpumask_weight(cpu_possible_mask)
#define num_present_cpus() cpumask_weight(cpu_present_mask)
+#define num_active_cpus() cpumask_weight(cpu_active_mask)
#define cpu_online(cpu) cpumask_test_cpu((cpu), cpu_online_mask)
#define cpu_possible(cpu) cpumask_test_cpu((cpu), cpu_possible_mask)
#define cpu_present(cpu) cpumask_test_cpu((cpu), cpu_present_mask)
@@ -92,6 +93,7 @@ extern const struct cpumask *const cpu_a
#define num_online_cpus() 1
#define num_possible_cpus() 1
#define num_present_cpus() 1
+#define num_active_cpus() 1
#define cpu_online(cpu) ((cpu) == 0)
#define cpu_possible(cpu) ((cpu) == 0)
#define cpu_present(cpu) ((cpu) == 0)
Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c
+++ linux-2.6/kernel/cpuset.c
@@ -872,7 +872,7 @@ static int update_cpumask(struct cpuset
if (retval < 0)
return retval;

- if (!cpumask_subset(trialcs->cpus_allowed, cpu_online_mask))
+ if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
return -EINVAL;
}
retval = validate_change(cs, trialcs);
@@ -2010,7 +2010,7 @@ static void scan_for_empty_cpusets(struc
}

/* Continue past cpusets with all cpus, mems online */
- if (cpumask_subset(cp->cpus_allowed, cpu_online_mask) &&
+ if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
continue;

@@ -2019,7 +2019,7 @@ static void scan_for_empty_cpusets(struc
/* Remove offline cpus and mems from this cpuset. */
mutex_lock(&callback_mutex);
cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_online_mask);
+ cpu_active_mask);
nodes_and(cp->mems_allowed, cp->mems_allowed,
node_states[N_HIGH_MEMORY]);
mutex_unlock(&callback_mutex);
@@ -2057,8 +2057,10 @@ static int cpuset_track_online_cpus(stru
switch (phase) {
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
break;

default:
@@ -2067,7 +2069,7 @@ static int cpuset_track_online_cpus(stru

cgroup_lock();
mutex_lock(&callback_mutex);
- cpumask_copy(top_cpuset.cpus_allowed, cpu_online_mask);
+ cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
mutex_unlock(&callback_mutex);
scan_for_empty_cpusets(&top_cpuset);
ndoms = generate_sched_domains(&doms, &attr);
@@ -2114,7 +2116,7 @@ static int cpuset_track_online_nodes(str

void __init cpuset_init_smp(void)
{
- cpumask_copy(top_cpuset.cpus_allowed, cpu_online_mask);
+ cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

hotcpu_notifier(cpuset_track_online_cpus, 0);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2323,6 +2323,12 @@ void task_oncpu_function_call(struct tas
preempt_enable();
}

+static inline
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+{
+ return p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2376,7 +2382,7 @@ static int try_to_wake_up(struct task_st
p->state = TASK_WAKING;
task_rq_unlock(rq, &flags);

- cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
if (cpu != orig_cpu) {
local_irq_save(flags);
rq = cpu_rq(cpu);
@@ -2593,7 +2599,7 @@ void sched_fork(struct task_struct *p, i
p->sched_class = &fair_sched_class;

#ifdef CONFIG_SMP
- cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
#endif
local_irq_save(flags);
update_rq_clock(cpu_rq(cpu));
@@ -3156,7 +3162,7 @@ out:
void sched_exec(void)
{
int new_cpu, this_cpu = get_cpu();
- new_cpu = current->sched_class->select_task_rq(current, SD_BALANCE_EXEC, 0);
+ new_cpu = select_task_rq(current, SD_BALANCE_EXEC, 0);
put_cpu();
if (new_cpu != this_cpu)
sched_migrate_task(current, new_cpu);
@@ -4134,7 +4140,7 @@ static int load_balance(int this_cpu, st
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

- cpumask_copy(cpus, cpu_online_mask);
+ cpumask_copy(cpus, cpu_active_mask);

/*
* When power savings policy is enabled for the parent domain, idle
@@ -4297,7 +4303,7 @@ load_balance_newidle(int this_cpu, struc
int all_pinned = 0;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

- cpumask_copy(cpus, cpu_online_mask);
+ cpumask_copy(cpus, cpu_active_mask);

/*
* When power savings policy is enabled for the parent domain, idle
@@ -4694,7 +4700,7 @@ int select_nohz_load_balancer(int stop_t
cpumask_set_cpu(cpu, nohz.cpu_mask);

/* time for ilb owner also to sleep */
- if (cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
+ if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
if (atomic_read(&nohz.load_balancer) == cpu)
atomic_set(&nohz.load_balancer, -1);
return 0;
@@ -7071,7 +7077,7 @@ int set_cpus_allowed_ptr(struct task_str
int ret = 0;

rq = task_rq_lock(p, &flags);
- if (!cpumask_intersects(new_mask, cpu_online_mask)) {
+ if (!cpumask_intersects(new_mask, cpu_active_mask)) {
ret = -EINVAL;
goto out;
}
@@ -7093,7 +7099,7 @@ int set_cpus_allowed_ptr(struct task_str
if (cpumask_test_cpu(task_cpu(p), new_mask))
goto out;

- if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
+ if (migrate_task(p, cpumask_any_and(cpu_active_mask, new_mask), &req)) {
/* Need help from migration thread: drop lock and wait. */
struct task_struct *mt = rq->migration_thread;

@@ -7247,19 +7253,19 @@ static void move_task_off_dead_cpu(int d

again:
/* Look for allowed, online CPU in same node. */
- for_each_cpu_and(dest_cpu, nodemask, cpu_online_mask)
+ for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
goto move;

/* Any allowed, online CPU? */
- dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
+ dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
if (dest_cpu < nr_cpu_ids)
goto move;

/* No more Mr. Nice Guy. */
if (dest_cpu >= nr_cpu_ids) {
cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
- dest_cpu = cpumask_any_and(cpu_online_mask, &p->cpus_allowed);
+ dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);

/*
* Don't tell them about moving exiting tasks or
@@ -7288,7 +7294,7 @@ move:
*/
static void migrate_nr_uninterruptible(struct rq *rq_src)
{
- struct rq *rq_dest = cpu_rq(cpumask_any(cpu_online_mask));
+ struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
unsigned long flags;

local_irq_save(flags);
@@ -7542,7 +7548,7 @@ static ctl_table *sd_alloc_ctl_cpu_table
static struct ctl_table_header *sd_sysctl_header;
static void register_sched_domain_sysctl(void)
{
- int i, cpu_num = num_online_cpus();
+ int i, cpu_num = num_possible_cpus();
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
char buf[32];

@@ -7552,7 +7558,7 @@ static void register_sched_domain_sysctl
if (entry == NULL)
return;

- for_each_online_cpu(i) {
+ for_each_possible_cpu(i) {
snprintf(buf, 32, "cpu%d", i);
entry->procname = kstrdup(buf, GFP_KERNEL);
entry->mode = 0555;
@@ -9064,7 +9070,7 @@ match1:
if (doms_new == NULL) {
ndoms_cur = 0;
doms_new = &fallback_doms;
- cpumask_andnot(doms_new[0], cpu_online_mask, cpu_isolated_map);
+ cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
WARN_ON_ONCE(dattr_new);
}

@@ -9195,8 +9201,10 @@ static int update_sched_domains(struct n
switch (action) {
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
partition_sched_domains(1, NULL, NULL);
return NOTIFY_OK;

@@ -9243,7 +9251,7 @@ void __init sched_init_smp(void)
#endif
get_online_cpus();
mutex_lock(&sched_domains_mutex);
- arch_init_sched_domains(cpu_online_mask);
+ arch_init_sched_domains(cpu_active_mask);
cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
if (cpumask_empty(non_isolated_cpus))
cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -212,6 +212,8 @@ static int __ref _cpu_down(unsigned int
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
if (err == NOTIFY_BAD) {
+ set_cpu_active(cpu, true);
+
nr_calls--;
__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu, nr_calls, NULL);
@@ -223,11 +225,11 @@ static int __ref _cpu_down(unsigned int

/* Ensure that we are not runnable on dying cpu */
cpumask_copy(old_allowed, &current->cpus_allowed);
- set_cpus_allowed_ptr(current,
- cpumask_of(cpumask_any_but(cpu_online_mask, cpu)));
+ set_cpus_allowed_ptr(current, cpu_active_mask);

err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
if (err) {
+ set_cpu_active(cpu, true);
/* CPU didn't die: tell everyone. Can't complain. */
if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu) == NOTIFY_BAD)
@@ -292,9 +294,6 @@ int __ref cpu_down(unsigned int cpu)

err = _cpu_down(cpu, 0);

- if (cpu_online(cpu))
- set_cpu_active(cpu, true);
-
out:
cpu_maps_update_done();
stop_machine_destroy();
@@ -387,6 +386,15 @@ int disable_nonboot_cpus(void)
* with the userspace trying to use the CPU hotplug at the same time
*/
cpumask_clear(frozen_cpus);
+
+ for_each_online_cpu(cpu) {
+ if (cpu == first_cpu)
+ continue;
+ set_cpu_active(cpu, false);
+ }
+
+ synchronize_sched();
+
printk("Disabling non-boot CPUs ...\n");
for_each_online_cpu(cpu) {
if (cpu == first_cpu)

2009-11-26 04:39:13

by Sachin Sant

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> Correct, Ingo objected to the fastpath overhead.
>
> Could you please try the below patch which tries to address the issue
> differently.
>
Works great. Thanks

Tested-by: Sachin Sant <[email protected]>

Regards
-Sachin

> ---
> Subject: sched: Fix balance vs hotplug race
> From: Peter Zijlstra <[email protected]>
> Date: Wed Nov 25 13:31:39 CET 2009
>
> Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
> sched domain managment) we have cpu_active_mask which is supposed to
> rule scheduler migration and load-balancing, except it never did.
>
> The particular problem being solved here is a crash in
> try_to_wake_up() where select_task_rq() ends up selecting an offline
> cpu because select_task_rq_fair() trusts the sched_domain tree to reflect
> the current state of affairs, similarly select_task_rq_rt() trusts the
> root_domain.
>
> However, the sched_domains are updated from CPU_DEAD, which is after
> the cpu is taken offline and after stop_machine is done. Therefore it
> can race perfectly well with code assuming the domains are right.
>
> Cure this by building the domains from cpu_active_mask on
> CPU_DOWN_PREPARE.
>
>


--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-12-04 12:06:09

by Sachin Sant

[permalink] [raw]
Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 15:23 +0530, Sachin Sant wrote:
>
>> Peter Zijlstra wrote:
>>
>>> Well, it boots for me, but then, I've not been able to reproduce any
>>> issues anyway :/
>>>
>>> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
>>> funnies with..
>>>
>>> Full running diff against -tip:
>>>
>>>
>> Peter, I can still recreate this issue with today's next (20091123).
>> Looks like the following patch hasn't been merged yet.
>>
>
> Correct, Ingo objected to the fastpath overhead.
>
> Could you please try the below patch which tries to address the issue
> differently.
>
>
Peter,

Ping on this patch. Still missing from linux-next.

thanks
-Sachin

> ---
> Subject: sched: Fix balance vs hotplug race
> From: Peter Zijlstra <[email protected]>
> Date: Wed Nov 25 13:31:39 CET 2009
>
> Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
> sched domain managment) we have cpu_active_mask which is supposed to
> rule scheduler migration and load-balancing, except it never did.
>
> The particular problem being solved here is a crash in
> try_to_wake_up() where select_task_rq() ends up selecting an offline
> cpu because select_task_rq_fair() trusts the sched_domain tree to reflect
> the current state of affairs, similarly select_task_rq_rt() trusts the
> root_domain.
>
> However, the sched_domains are updated from CPU_DEAD, which is after
> the cpu is taken offline and after stop_machine is done. Therefore it
> can race perfectly well with code assuming the domains are right.
>
> Cure this by building the domains from cpu_active_mask on
> CPU_DOWN_PREPARE.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/cpumask.h | 2 ++
> kernel/cpu.c | 18 +++++++++++++-----
> kernel/cpuset.c | 16 +++++++++-------
> kernel/sched.c | 44 ++++++++++++++++++++++++++------------------
> 4 files changed, 50 insertions(+), 30 deletions(-)
>
> Index: linux-2.6/include/linux/cpumask.h
> ===================================================================
> --- linux-2.6.orig/include/linux/cpumask.h
> +++ linux-2.6/include/linux/cpumask.h
> @@ -84,6 +84,7 @@ extern const struct cpumask *const cpu_a
> #define num_online_cpus() cpumask_weight(cpu_online_mask)


--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-12-04 12:17:17

by Peter Zijlstra

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

On Fri, 2009-12-04 at 17:36 +0530, Sachin Sant wrote:
>
> Ping on this patch. Still missing from linux-next.

I know, it got stuck in my queue too long, but I handed it to Ingo
yesterday, should appear in tip soonish.

2009-12-07 06:17:01

by Sachin Sant

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> On Fri, 2009-12-04 at 17:36 +0530, Sachin Sant wrote:
>
>> Ping on this patch. Still missing from linux-next.
>>
>
> I know, it got stuck in my queue too long, but I handed it to Ingo
> yesterday, should appear in tip soonish.
Thanks Peter.

The same problem appeared in 2.6.32-git1 (6ec22f9...) as well.

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-12-12 07:09:19

by Max Krasnyansky

Subject: Re: -next: Nov 12 - kernel BUG at kernel/sched.c:7359!

Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 15:23 +0530, Sachin Sant wrote:
>> Peter Zijlstra wrote:
>>> Well, it boots for me, but then, I've not been able to reproduce any
>>> issues anyway :/
>>>
>>> /me goes try a PREEMPT=n kernel, since that is what Mike reports boot
>>> funnies with..
>>>
>>> Full running diff against -tip:
>>>
>> Peter, I can still recreate this issue with today's next (20091123).
>> Looks like the following patch hasn't been merged yet.
>
> Correct, Ingo objected to the fastpath overhead.
>
> Could you please try the below patch which tries to address the issue
> differently.
>
> ---
> Subject: sched: Fix balance vs hotplug race
> From: Peter Zijlstra <[email protected]>
> Date: Wed Nov 25 13:31:39 CET 2009
>
> Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
> sched domain managment) we have cpu_active_mask, which is supposed to
> rule scheduler migration and load-balancing, except it never did.

The original patch was addressing some other issue, and to be honest I don't
remember the details now (it's been a while) :(.
This change looks fine.

Max




>
> The particular problem being solved here is a crash in
> try_to_wake_up() where select_task_rq() ends up selecting an offline
> cpu because select_task_rq_fair() trusts the sched_domain tree to reflect
> the current state of affairs, similarly select_task_rq_rt() trusts the
> root_domain.
>
> However, the sched_domains are updated from CPU_DEAD, which is after
> the cpu is taken offline and after stop_machine is done. Therefore it
> can race perfectly well with code assuming the domains are right.
>
> Cure this by building the domains from cpu_active_mask on
> CPU_DOWN_PREPARE.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/cpumask.h | 2 ++
> kernel/cpu.c | 18 +++++++++++++-----
> kernel/cpuset.c | 16 +++++++++-------
> kernel/sched.c | 44 ++++++++++++++++++++++++++------------------
> 4 files changed, 50 insertions(+), 30 deletions(-)
>
> Index: linux-2.6/include/linux/cpumask.h
> ===================================================================
> --- linux-2.6.orig/include/linux/cpumask.h
> +++ linux-2.6/include/linux/cpumask.h
> @@ -84,6 +84,7 @@ extern const struct cpumask *const cpu_a
> #define num_online_cpus() cpumask_weight(cpu_online_mask)
> #define num_possible_cpus() cpumask_weight(cpu_possible_mask)
> #define num_present_cpus() cpumask_weight(cpu_present_mask)
> +#define num_active_cpus() cpumask_weight(cpu_active_mask)
> #define cpu_online(cpu) cpumask_test_cpu((cpu), cpu_online_mask)
> #define cpu_possible(cpu) cpumask_test_cpu((cpu), cpu_possible_mask)
> #define cpu_present(cpu) cpumask_test_cpu((cpu), cpu_present_mask)
> @@ -92,6 +93,7 @@ extern const struct cpumask *const cpu_a
> #define num_online_cpus() 1
> #define num_possible_cpus() 1
> #define num_present_cpus() 1
> +#define num_active_cpus() 1
> #define cpu_online(cpu) ((cpu) == 0)
> #define cpu_possible(cpu) ((cpu) == 0)
> #define cpu_present(cpu) ((cpu) == 0)
> Index: linux-2.6/kernel/cpuset.c
> ===================================================================
> --- linux-2.6.orig/kernel/cpuset.c
> +++ linux-2.6/kernel/cpuset.c
> @@ -872,7 +872,7 @@ static int update_cpumask(struct cpuset
> if (retval < 0)
> return retval;
>
> - if (!cpumask_subset(trialcs->cpus_allowed, cpu_online_mask))
> + if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
> return -EINVAL;
> }
> retval = validate_change(cs, trialcs);
> @@ -2010,7 +2010,7 @@ static void scan_for_empty_cpusets(struc
> }
>
> /* Continue past cpusets with all cpus, mems online */
> - if (cpumask_subset(cp->cpus_allowed, cpu_online_mask) &&
> + if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
> nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
> continue;
>
> @@ -2019,7 +2019,7 @@ static void scan_for_empty_cpusets(struc
> /* Remove offline cpus and mems from this cpuset. */
> mutex_lock(&callback_mutex);
> cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
> - cpu_online_mask);
> + cpu_active_mask);
> nodes_and(cp->mems_allowed, cp->mems_allowed,
> node_states[N_HIGH_MEMORY]);
> mutex_unlock(&callback_mutex);
> @@ -2057,8 +2057,10 @@ static int cpuset_track_online_cpus(stru
> switch (phase) {
> case CPU_ONLINE:
> case CPU_ONLINE_FROZEN:
> - case CPU_DEAD:
> - case CPU_DEAD_FROZEN:
> + case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> break;
>
> default:
> @@ -2067,7 +2069,7 @@ static int cpuset_track_online_cpus(stru
>
> cgroup_lock();
> mutex_lock(&callback_mutex);
> - cpumask_copy(top_cpuset.cpus_allowed, cpu_online_mask);
> + cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> mutex_unlock(&callback_mutex);
> scan_for_empty_cpusets(&top_cpuset);
> ndoms = generate_sched_domains(&doms, &attr);
> @@ -2114,7 +2116,7 @@ static int cpuset_track_online_nodes(str
>
> void __init cpuset_init_smp(void)
> {
> - cpumask_copy(top_cpuset.cpus_allowed, cpu_online_mask);
> + cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
>
> hotcpu_notifier(cpuset_track_online_cpus, 0);
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2323,6 +2323,12 @@ void task_oncpu_function_call(struct tas
> preempt_enable();
> }
>
> +static inline
> +int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
> +{
> + return p->sched_class->select_task_rq(p, sd_flags, wake_flags);
> +}
> +
> /***
> * try_to_wake_up - wake up a thread
> * @p: the to-be-woken-up thread
> @@ -2376,7 +2382,7 @@ static int try_to_wake_up(struct task_st
> p->state = TASK_WAKING;
> task_rq_unlock(rq, &flags);
>
> - cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> + cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> if (cpu != orig_cpu) {
> local_irq_save(flags);
> rq = cpu_rq(cpu);
> @@ -2593,7 +2599,7 @@ void sched_fork(struct task_struct *p, i
> p->sched_class = &fair_sched_class;
>
> #ifdef CONFIG_SMP
> - cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
> + cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
> #endif
> local_irq_save(flags);
> update_rq_clock(cpu_rq(cpu));
> @@ -3156,7 +3162,7 @@ out:
> void sched_exec(void)
> {
> int new_cpu, this_cpu = get_cpu();
> - new_cpu = current->sched_class->select_task_rq(current, SD_BALANCE_EXEC, 0);
> + new_cpu = select_task_rq(current, SD_BALANCE_EXEC, 0);
> put_cpu();
> if (new_cpu != this_cpu)
> sched_migrate_task(current, new_cpu);
> @@ -4134,7 +4140,7 @@ static int load_balance(int this_cpu, st
> unsigned long flags;
> struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
>
> - cpumask_copy(cpus, cpu_online_mask);
> + cpumask_copy(cpus, cpu_active_mask);
>
> /*
> * When power savings policy is enabled for the parent domain, idle
> @@ -4297,7 +4303,7 @@ load_balance_newidle(int this_cpu, struc
> int all_pinned = 0;
> struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
>
> - cpumask_copy(cpus, cpu_online_mask);
> + cpumask_copy(cpus, cpu_active_mask);
>
> /*
> * When power savings policy is enabled for the parent domain, idle
> @@ -4694,7 +4700,7 @@ int select_nohz_load_balancer(int stop_t
> cpumask_set_cpu(cpu, nohz.cpu_mask);
>
> /* time for ilb owner also to sleep */
> - if (cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
> + if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
> if (atomic_read(&nohz.load_balancer) == cpu)
> atomic_set(&nohz.load_balancer, -1);
> return 0;
> @@ -7071,7 +7077,7 @@ int set_cpus_allowed_ptr(struct task_str
> int ret = 0;
>
> rq = task_rq_lock(p, &flags);
> - if (!cpumask_intersects(new_mask, cpu_online_mask)) {
> + if (!cpumask_intersects(new_mask, cpu_active_mask)) {
> ret = -EINVAL;
> goto out;
> }
> @@ -7093,7 +7099,7 @@ int set_cpus_allowed_ptr(struct task_str
> if (cpumask_test_cpu(task_cpu(p), new_mask))
> goto out;
>
> - if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
> + if (migrate_task(p, cpumask_any_and(cpu_active_mask, new_mask), &req)) {
> /* Need help from migration thread: drop lock and wait. */
> struct task_struct *mt = rq->migration_thread;
>
> @@ -7247,19 +7253,19 @@ static void move_task_off_dead_cpu(int d
>
> again:
> /* Look for allowed, online CPU in same node. */
> - for_each_cpu_and(dest_cpu, nodemask, cpu_online_mask)
> + for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
> if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
> goto move;
>
> /* Any allowed, online CPU? */
> - dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
> + dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
> if (dest_cpu < nr_cpu_ids)
> goto move;
>
> /* No more Mr. Nice Guy. */
> if (dest_cpu >= nr_cpu_ids) {
> cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> - dest_cpu = cpumask_any_and(cpu_online_mask, &p->cpus_allowed);
> + dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
>
> /*
> * Don't tell them about moving exiting tasks or
> @@ -7288,7 +7294,7 @@ move:
> */
> static void migrate_nr_uninterruptible(struct rq *rq_src)
> {
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_online_mask));
> + struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> unsigned long flags;
>
> local_irq_save(flags);
> @@ -7542,7 +7548,7 @@ static ctl_table *sd_alloc_ctl_cpu_table
> static struct ctl_table_header *sd_sysctl_header;
> static void register_sched_domain_sysctl(void)
> {
> - int i, cpu_num = num_online_cpus();
> + int i, cpu_num = num_possible_cpus();
> struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
> char buf[32];
>
> @@ -7552,7 +7558,7 @@ static void register_sched_domain_sysctl
> if (entry == NULL)
> return;
>
> - for_each_online_cpu(i) {
> + for_each_possible_cpu(i) {
> snprintf(buf, 32, "cpu%d", i);
> entry->procname = kstrdup(buf, GFP_KERNEL);
> entry->mode = 0555;
> @@ -9064,7 +9070,7 @@ match1:
> if (doms_new == NULL) {
> ndoms_cur = 0;
> doms_new = &fallback_doms;
> - cpumask_andnot(doms_new[0], cpu_online_mask, cpu_isolated_map);
> + cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
> WARN_ON_ONCE(dattr_new);
> }
>
> @@ -9195,8 +9201,10 @@ static int update_sched_domains(struct n
> switch (action) {
> case CPU_ONLINE:
> case CPU_ONLINE_FROZEN:
> - case CPU_DEAD:
> - case CPU_DEAD_FROZEN:
> + case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> partition_sched_domains(1, NULL, NULL);
> return NOTIFY_OK;
>
> @@ -9243,7 +9251,7 @@ void __init sched_init_smp(void)
> #endif
> get_online_cpus();
> mutex_lock(&sched_domains_mutex);
> - arch_init_sched_domains(cpu_online_mask);
> + arch_init_sched_domains(cpu_active_mask);
> cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
> if (cpumask_empty(non_isolated_cpus))
> cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
> Index: linux-2.6/kernel/cpu.c
> ===================================================================
> --- linux-2.6.orig/kernel/cpu.c
> +++ linux-2.6/kernel/cpu.c
> @@ -212,6 +212,8 @@ static int __ref _cpu_down(unsigned int
> err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
> hcpu, -1, &nr_calls);
> if (err == NOTIFY_BAD) {
> + set_cpu_active(cpu, true);
> +
> nr_calls--;
> __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
> hcpu, nr_calls, NULL);
> @@ -223,11 +225,11 @@ static int __ref _cpu_down(unsigned int
>
> /* Ensure that we are not runnable on dying cpu */
> cpumask_copy(old_allowed, &current->cpus_allowed);
> - set_cpus_allowed_ptr(current,
> - cpumask_of(cpumask_any_but(cpu_online_mask, cpu)));
> + set_cpus_allowed_ptr(current, cpu_active_mask);
>
> err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
> if (err) {
> + set_cpu_active(cpu, true);
> /* CPU didn't die: tell everyone. Can't complain. */
> if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
> hcpu) == NOTIFY_BAD)
> @@ -292,9 +294,6 @@ int __ref cpu_down(unsigned int cpu)
>
> err = _cpu_down(cpu, 0);
>
> - if (cpu_online(cpu))
> - set_cpu_active(cpu, true);
> -
> out:
> cpu_maps_update_done();
> stop_machine_destroy();
> @@ -387,6 +386,15 @@ int disable_nonboot_cpus(void)
> * with the userspace trying to use the CPU hotplug at the same time
> */
> cpumask_clear(frozen_cpus);
> +
> + for_each_online_cpu(cpu) {
> + if (cpu == first_cpu)
> + continue;
> + set_cpu_active(cpu, false);
> + }
> +
> + synchronize_sched();
> +
> printk("Disabling non-boot CPUs ...\n");
> for_each_online_cpu(cpu) {
> if (cpu == first_cpu)
>
>