2009-01-01 01:19:35

by Rusty Russell

Subject: [PULL] cpumask tree

OK, this is the bulk of the conversion to the new cpumask operators.
The x86-specific parts (the most aggressive large-NR_CPUS arch) are going
via Ingo's tree.

Goals:
1) Get cpumasks off the stack when CONFIG_CPUMASK_OFFSTACK=y. This should
be achieved for core & x86 by 2.6.29. cpumask_var_t is struct cpumask[1]
for CONFIG_CPUMASK_OFFSTACK=n (and alloc_cpumask_var et al. are no-ops);
see the usage sketch after this list.
2) Convert to the new cpumask functions, which only go up to nr_cpu_ids for
large NR_CPUS, so booting kernels configured with a huge NR_CPUS on small
machines doesn't suck.
3) Allocate smaller cpumasks when nr_cpu_ids < NR_CPUS, when
CONFIG_CPUMASK_OFFSTACK=y. This requires (2) to be completed.
4) Use cpumask_var_t for static cpumasks as well, or raw bitmaps if we
really have to. The former will save space when nr_cpu_ids << NR_CPUS.
5) Ban on-stack cpumasks (to ensure (1) doesn't get reverted) and cpumask
assignment (for (3)) by making struct cpumask undefined when
CONFIG_CPUMASK_OFFSTACK=y. This means (4) needs to be finished.
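
For reference, the pattern (1) is aiming at looks like this (a hypothetical
caller, not code from this series; frob_online_cpus() is made up for
illustration):

    static int frob_online_cpus(void)
    {
        cpumask_var_t mask;   /* struct cpumask[1] if OFFSTACK=n, a pointer if =y */
        int cpu;

        if (!alloc_cpumask_var(&mask, GFP_KERNEL))
            return -ENOMEM;

        cpumask_copy(mask, cpu_online_mask);
        for_each_cpu(cpu, mask)   /* stops at nr_cpu_ids, not NR_CPUS */
            pr_debug("cpu %d\n", cpu);

        free_cpumask_var(mask);
        return 0;
    }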

Between this and Ingo's tree, we achieve (1) and part of (2) and (4).
Completing the work is expected by 2.6.30.

Note that we can't stop people creating bitmaps of NR_CPUS on the stack
and using to_cpumask() on them. But at least it should stand out.
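
For completeness, that escape hatch looks like this (hypothetical stack
user, exactly the kind of thing reviewers should spot):

    DECLARE_BITMAP(bits, NR_CPUS);       /* still NR_CPUS bits, on the stack */
    struct cpumask *mask = to_cpumask(bits);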

Cheers,
Rusty.

The following changes since commit 6a94cb73064c952255336cc57731904174b2c58f:
Linus Torvalds (1):
Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask.git master

Li Zefan (1):
cpumask: fix bogus kernel-doc

Mike Travis (4):
cpumask: Add alloc_cpumask_var_node()
cpumask: documentation for cpumask_var_t
cpumask: add sysfs displays for configured and disabled cpu maps
sysfs: add documentation to cputopology.txt for system cpumasks

Rusty Russell (64):
cpumask: centralize cpu_online_map and cpu_possible_map
cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers.
cpumask: make irq_set_affinity() take a const struct cpumask
cpumask: convert struct clock_event_device to cpumask pointers.
cpumask: Add CONFIG_CPUMASK_OFFSTACK
cpumask: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: Use all NR_CPUS bits unless CONFIG_CPUMASK_OFFSTACK
cpumask: x86: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: sparc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: sh: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: powerpc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: IA64: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: Mips: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: alpha: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: cpu_coregroup_mask(): x86
cpumask: cpu_coregroup_mask(): sparc
cpumask: cpu_coregroup_mask(): s390
cpumask: Replace cpu_coregroup_map with cpu_coregroup_mask
Merge branch 'master' of git://git.kernel.org/.../torvalds/linux-2.6
cpumask: make CONFIG_NR_CPUS always valid.
bitmap: test for constant as well as small size for inline versions
bitmap: fix seq_bitmap and seq_cpumask to take const pointer
cpumask: switch over to cpu_online/possible/active/present_mask: core
cpumask: make cpumask.h eat its own dogfood.
cpumask: make set_cpu_*/init_cpu_* out-of-line
cpumask: smp_call_function_many()
cpumask: arch_send_call_function_ipi_mask: core
cpumask: use for_each_online_cpu() in drivers/infiniband/hw/ehca/ehca_irq.c
cpumask: use new cpumask API in drivers/infiniband/hw/ehca
cpumask: use new cpumask API in drivers/infiniband/hw/ipath
cpumask: Use nr_cpu_ids in seq_cpumask
Merge branch 'master' of git://git.kernel.org/.../torvalds/linux-2.6
cpumask: Remove IA64 definition of total_cpus now it's in core code
percpu: fix percpu accessors to potentially !cpu_possible() cpus: pnpbios
percpu: fix percpu accessors to potentially !cpu_possible() cpus: m32r
cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.: core
cpumask: Use accessors code in core
parisc: remove gratuitous cpu_online_map declaration.
avr32: define __fls
blackfin: define __fls
m68k: define __fls
m68knommu: define __fls
bitmap: find_last_bit()
cpumask: Use find_last_bit()
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): sparc
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): s390
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): powerpc
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): ia64
cpumask: convert kernel trace functions
cpumask: convert kernel trace functions further
cpumask: remove any_online_cpu() users: kernel/
cpumask: remove any_online_cpu() users: mm/
cpumask: convert kernel/compat.c
cpumask: convert kernel/workqueue.c
cpumask: convert kernel time functions
cpumask: convert kernel/irq
cpumask: convert RCU implementations
cpumask: convert kernel/profile.c
cpumask: convert kernel/cpu.c
cpumask: convert rest of files in kernel/
cpumask: convert mm/
cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
cpumask: zero extra bits in alloc_cpumask_var_node
cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS

Documentation/cputopology.txt | 48 ++++
arch/alpha/include/asm/smp.h | 1 -
arch/alpha/include/asm/topology.h | 17 ++
arch/alpha/kernel/irq.c | 5 +-
arch/alpha/kernel/process.c | 2 +
arch/alpha/kernel/setup.c | 5 +
arch/alpha/kernel/smp.c | 7 +-
arch/alpha/kernel/sys_dp264.c | 8 +-
arch/alpha/kernel/sys_titan.c | 4 +-
arch/arm/common/gic.c | 4 +-
arch/arm/kernel/irq.c | 2 +-
arch/arm/kernel/smp.c | 10 -
arch/arm/mach-at91/at91rm9200_time.c | 3 +-
arch/arm/mach-at91/at91sam926x_time.c | 2 +-
arch/arm/mach-davinci/time.c | 2 +-
arch/arm/mach-imx/time.c | 2 +-
arch/arm/mach-ixp4xx/common.c | 2 +-
arch/arm/mach-msm/timer.c | 2 +-
arch/arm/mach-ns9xxx/time-ns9360.c | 2 +-
arch/arm/mach-omap1/time.c | 2 +-
arch/arm/mach-omap1/timer32k.c | 2 +-
arch/arm/mach-omap2/timer-gp.c | 2 +-
arch/arm/mach-pxa/time.c | 2 +-
arch/arm/mach-realview/core.c | 2 +-
arch/arm/mach-realview/localtimer.c | 4 +-
arch/arm/mach-sa1100/time.c | 2 +-
arch/arm/mach-versatile/core.c | 2 +-
arch/arm/oprofile/op_model_mpcore.c | 4 +-
arch/arm/plat-mxc/time.c | 2 +-
arch/arm/plat-orion/time.c | 2 +-
arch/avr32/include/asm/bitops.h | 5 +
arch/avr32/kernel/time.c | 2 +-
arch/blackfin/include/asm/bitops.h | 1 +
arch/blackfin/kernel/time-ts.c | 2 +-
arch/cris/arch-v32/kernel/irq.c | 4 +-
arch/cris/arch-v32/kernel/smp.c | 4 -
arch/cris/include/asm/smp.h | 1 -
arch/ia64/hp/sim/hpsim_irq.c | 2 +-
arch/ia64/include/asm/smp.h | 1 -
arch/ia64/include/asm/topology.h | 9 +-
arch/ia64/kernel/acpi.c | 3 +-
arch/ia64/kernel/iosapic.c | 35 ++--
arch/ia64/kernel/irq.c | 9 +-
arch/ia64/kernel/msi_ia64.c | 12 +-
arch/ia64/kernel/smpboot.c | 10 +-
arch/ia64/kernel/topology.c | 2 +-
arch/ia64/sn/kernel/irq.c | 6 +-
arch/ia64/sn/kernel/msi_sn.c | 7 +-
arch/ia64/sn/kernel/sn2/sn_hwperf.c | 27 +--
arch/m32r/Kconfig | 1 +
arch/m32r/kernel/smpboot.c | 8 +-
arch/m68knommu/include/asm/bitops.h | 1 +
arch/m68knommu/platform/coldfire/pit.c | 2 +-
arch/mips/include/asm/irq.h | 3 +-
arch/mips/include/asm/mach-ip27/topology.h | 4 +-
arch/mips/include/asm/smp.h | 3 -
arch/mips/jazz/irq.c | 2 +-
arch/mips/kernel/cevt-bcm1480.c | 4 +-
arch/mips/kernel/cevt-ds1287.c | 2 +-
arch/mips/kernel/cevt-gt641xx.c | 2 +-
arch/mips/kernel/cevt-r4k.c | 2 +-
arch/mips/kernel/cevt-sb1250.c | 4 +-
arch/mips/kernel/cevt-smtc.c | 2 +-
arch/mips/kernel/cevt-txx9.c | 2 +-
arch/mips/kernel/i8253.c | 2 +-
arch/mips/kernel/irq-gic.c | 6 +-
arch/mips/kernel/smp-cmp.c | 6 +-
arch/mips/kernel/smp-mt.c | 2 +-
arch/mips/kernel/smp.c | 7 +-
arch/mips/kernel/smtc.c | 6 +-
arch/mips/mti-malta/malta-smtc.c | 6 +-
arch/mips/nxp/pnx8550/common/time.c | 1 +
arch/mips/pmc-sierra/yosemite/smp.c | 6 +-
arch/mips/sgi-ip27/ip27-smp.c | 2 +-
arch/mips/sgi-ip27/ip27-timer.c | 2 +-
arch/mips/sibyte/bcm1480/irq.c | 8 +-
arch/mips/sibyte/bcm1480/smp.c | 8 +-
arch/mips/sibyte/sb1250/irq.c | 8 +-
arch/mips/sibyte/sb1250/smp.c | 8 +-
arch/mips/sni/time.c | 2 +-
arch/parisc/Kconfig | 1 +
arch/parisc/include/asm/smp.h | 2 -
arch/parisc/kernel/irq.c | 6 +-
arch/parisc/kernel/smp.c | 15 --
arch/powerpc/include/asm/topology.h | 12 +-
arch/powerpc/kernel/irq.c | 2 +-
arch/powerpc/kernel/smp.c | 4 -
arch/powerpc/kernel/time.c | 2 +-
arch/powerpc/platforms/cell/spu_priv1_mmio.c | 6 +-
arch/powerpc/platforms/cell/spufs/sched.c | 4 +-
arch/powerpc/platforms/pseries/xics.c | 4 +-
arch/powerpc/sysdev/mpic.c | 4 +-
arch/powerpc/sysdev/mpic.h | 2 +-
arch/s390/Kconfig | 1 +
arch/s390/include/asm/topology.h | 2 +
arch/s390/kernel/smp.c | 6 -
arch/s390/kernel/time.c | 2 +-
arch/s390/kernel/topology.c | 5 +
arch/sh/include/asm/smp.h | 2 +-
arch/sh/include/asm/topology.h | 1 +
arch/sh/kernel/smp.c | 10 +-
arch/sh/kernel/timers/timer-broadcast.c | 2 +-
arch/sh/kernel/timers/timer-tmu.c | 2 +-
arch/sparc/include/asm/smp_32.h | 2 -
arch/sparc/include/asm/topology_64.h | 13 +-
arch/sparc/kernel/irq_64.c | 11 +-
arch/sparc/kernel/of_device_64.c | 4 +-
arch/sparc/kernel/pci_msi.c | 4 +-
arch/sparc/kernel/smp_32.c | 6 +-
arch/sparc/kernel/smp_64.c | 4 -
arch/sparc/kernel/sparc_ksyms_32.c | 4 -
arch/sparc/kernel/time_64.c | 2 +-
arch/um/kernel/smp.c | 7 -
arch/um/kernel/time.c | 2 +-
arch/x86/include/asm/pci.h | 10 +-
arch/x86/include/asm/topology.h | 36 ++-
arch/x86/kernel/apic.c | 8 +-
arch/x86/kernel/cpu/intel_cacheinfo.c | 4 +-
arch/x86/kernel/hpet.c | 8 +-
arch/x86/kernel/i8253.c | 2 +-
arch/x86/kernel/io_apic.c | 78 ++++----
arch/x86/kernel/irq_32.c | 2 +-
arch/x86/kernel/irq_64.c | 2 +-
arch/x86/kernel/mfgpt_32.c | 2 +-
arch/x86/kernel/setup_percpu.c | 10 +-
arch/x86/kernel/smpboot.c | 17 +-
arch/x86/kernel/vmiclock_32.c | 2 +-
arch/x86/lguest/boot.c | 2 +-
arch/x86/mach-voyager/voyager_smp.c | 7 -
arch/x86/xen/time.c | 2 +-
block/blk.h | 4 +-
drivers/base/cpu.c | 46 ++++-
drivers/base/node.c | 4 +-
drivers/base/topology.c | 4 +-
drivers/clocksource/tcb_clksrc.c | 2 +-
drivers/infiniband/hw/ehca/ehca_irq.c | 17 +-
drivers/infiniband/hw/ipath/ipath_file_ops.c | 8 +-
drivers/parisc/iosapic.c | 7 +-
drivers/pci/pci-sysfs.c | 4 +-
drivers/pci/probe.c | 4 +-
drivers/pnp/pnpbios/bioscalls.c | 2 +-
drivers/xen/events.c | 6 +-
fs/seq_file.c | 3 +-
include/asm-generic/topology.h | 14 ++-
include/asm-m32r/smp.h | 2 -
include/asm-m68k/bitops.h | 5 +
include/linux/bitmap.h | 35 ++--
include/linux/bitops.h | 13 +-
include/linux/clockchips.h | 4 +-
include/linux/cpumask.h | 295 +++++++++++++-------------
include/linux/interrupt.h | 6 +-
include/linux/irq.h | 3 +-
include/linux/rcuclassic.h | 4 +-
include/linux/seq_file.h | 7 +-
include/linux/smp.h | 18 +-
include/linux/stop_machine.h | 6 +-
include/linux/threads.h | 16 +-
include/linux/tick.h | 4 +-
init/Kconfig | 9 +
init/main.c | 13 +-
kernel/compat.c | 49 +++--
kernel/cpu.c | 145 +++++++++----
kernel/cpuset.c | 4 +-
kernel/irq/chip.c | 2 +-
kernel/irq/manage.c | 31 ++--
kernel/irq/migration.c | 14 +-
kernel/irq/proc.c | 59 ++++--
kernel/kexec.c | 2 +-
kernel/power/poweroff.c | 2 +-
kernel/profile.c | 38 +++--
kernel/rcuclassic.c | 32 ++--
kernel/rcupreempt.c | 19 +-
kernel/rcutorture.c | 27 ++-
kernel/sched.c | 10 +-
kernel/sched_stats.h | 2 +-
kernel/smp.c | 145 +++++--------
kernel/softirq.c | 2 +-
kernel/softlockup.c | 10 +-
kernel/stop_machine.c | 8 +-
kernel/taskstats.c | 41 +++--
kernel/time/clockevents.c | 2 +
kernel/time/clocksource.c | 9 +-
kernel/time/tick-broadcast.c | 113 +++++-----
kernel/time/tick-common.c | 18 +-
kernel/trace/ring_buffer.c | 42 +++--
kernel/trace/trace.c | 68 ++++---
kernel/trace/trace.h | 2 +-
kernel/trace/trace_boot.c | 2 +-
kernel/trace/trace_functions_graph.c | 2 +-
kernel/trace/trace_hw_branches.c | 6 +-
kernel/trace/trace_power.c | 2 +-
kernel/trace/trace_sysprof.c | 13 +-
kernel/workqueue.c | 26 ++-
lib/Kconfig | 15 ++
lib/Makefile | 1 +
lib/cpumask.c | 62 +++++-
lib/find_last_bit.c | 45 ++++
mm/pdflush.c | 16 ++-
mm/slab.c | 2 +-
mm/slub.c | 20 +-
mm/vmscan.c | 4 +-
mm/vmstat.c | 4 +-
security/selinux/selinuxfs.c | 2 +-
203 files changed, 1393 insertions(+), 1056 deletions(-)
create mode 100644 lib/find_last_bit.c

commit 98a79d6a50181ca1ecf7400eda01d5dc1bc0dbf0
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:19:41 2008 +1030

cpumask: centralize cpu_online_map and cpu_possible_map

Impact: cleanup

Each SMP arch defines these themselves. Move them to a central
location.

Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
CONFIG_INIT_ALL_POSSIBLE for this rather than break them.

2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
Those archs simply have phys_cpu_present_map replaced everywhere.

3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
so I just manipulate them both in sync.

4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
declarations.

Signed-off-by: Rusty Russell <[email protected]>
Reviewed-by: Grant Grundler <[email protected]>
Tested-by: Tony Luck <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

commit 29c0177e6a4ac094302bed54a1d4bbb6b740a9ef
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:25 2008 +1030

cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers.

Impact: change calling convention of existing cpumask APIs

Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.

These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Cc: [email protected]

commit 0de26520c7cabf36e1de090ea8092f011a6106ce
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:26 2008 +1030

cpumask: make irq_set_affinity() take a const struct cpumask

Impact: change existing irq_chip API

Not much point with gentle transition here: the struct irq_chip's
setaffinity method signature needs to change.

Fortunately, not widely used code, but hits a few architectures.

Note: In irq_select_affinity() I save a temporary in by mangling
irq_desc[irq].affinity directly. Ingo, does this break anything?

(Folded in fix from KOSAKI Motohiro)

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Reviewed-by: Grant Grundler <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: KOSAKI Motohiro <[email protected]>

commit 320ab2b0b1e08e3805a3e1084a2f0eb1938d5d67
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:26 2008 +1030

cpumask: convert struct clock_event_device to cpumask pointers.

Impact: change calling convention of existing clock_event APIs

struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.

Another single-patch change. For safety, we BUG_ON() in
clockevents_register_device() if it's not set.
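
For illustration, a typical driver conversion ends up looking like this
("foo_clockevent" is a made-up device name):

    foo_clockevent.cpumask = cpumask_of(smp_processor_id());
    clockevents_register_device(&foo_clockevent);

i.e. the device now just points at a constant mask instead of copying a
cpumask_t into its own struct.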

Signed-off-by: Rusty Russell <[email protected]>
Cc: Ingo Molnar <[email protected]>

commit aab46da0520af9c99b7802cebe4f14a81ff39415
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:27 2008 +1030

cpumask: Add CONFIG_CPUMASK_OFFSTACK

Impact: Add config option to enable code in cpumask.h

Currently it can be set if DEBUG_PER_CPU_MAPS, or set specifically by
an arch.

Signed-off-by: Rusty Russell <[email protected]>

commit f0b848ce6fe9062d504d997e9e97fe0f87d57217
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:27 2008 +1030

cpumask: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

This defines them in the generic non-NUMA case.
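
Roughly, for a hypothetical caller:

    /* old: copies a whole cpumask_t onto the caller's stack */
    cpumask_t mask = node_to_cpumask(node);

    /* new: just a const pointer, nothing on the stack */
    const struct cpumask *mask = cpumask_of_node(node);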

Signed-off-by: Rusty Russell <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>

commit 7be7585393d311866653564fbcd10a3232773c0b
Author: Rusty Russell <[email protected]>
Date: Sat Dec 13 21:20:28 2008 +1030

cpumask: Use all NR_CPUS bits unless CONFIG_CPUMASK_OFFSTACK

Impact: futureproof as we convert more code to new APIs

The old cpumask operators treat all NR_CPUS bits as relevent, the new
ones use nr_cpumask_bits. For large NR_CPUS and small nr_cpu_ids, this
makes a difference.

However, mixing the two can cause problems with undefined bits. An
arch which sets CONFIG_CPUMASK_OFFSTACK should have converted across
to the new operators, so it's safe in that case.

(Thanks to Stephen Rothwell for bisecting the initial unused-bits bug,
and Mike Travis for this solution).

Signed-off-by: Rusty Russell <[email protected]>
Cc: Mike Travis <[email protected]>

commit 7b4967c532045a1983d6d4af5c69cc7c5109f62b
Author: Mike Travis <[email protected]>
Date: Fri Dec 19 16:56:37 2008 +1030

cpumask: Add alloc_cpumask_var_node()

Impact: New API

This will be needed in x86 code to allocate the domain and old_domain
cpumasks on the same node as where the containing irq_cfg struct is
allocated.

(Also fixes double-dump_stack on rare CONFIG_DEBUG_PER_CPU_MAPS case)

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Rusty Russell <[email protected]> (re-impl alloc_cpumask_var)

commit ec26b805879c7e77865b39ee91b737985e80006d
Author: Mike Travis <[email protected]>
Date: Fri Dec 19 16:56:52 2008 +1030

cpumask: documentation for cpumask_var_t

Impact: New kerneldoc comments

Additional documentation added to all the alloc_cpumask and free_cpumask
functions.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Rusty Russell <[email protected]> (minor additions)

commit e057d7aea9d8f2a46cd440d8bfb72245d4e72d79
Author: Mike Travis <[email protected]>
Date: Mon Dec 15 20:26:48 2008 -0800

cpumask: add sysfs displays for configured and disabled cpu maps

Impact: add new sysfs files.

Add sysfs files "kernel_max" and "offline" to display the max CPU index
allowed (NR_CPUS-1), and the map of cpus that are offline.

Cpus can be offlined via HOTPLUG, disabled by the BIOS ACPI tables, or
if they exceed the number of cpus allowed by the NR_CPUS config option,
or the "maxcpus=NUM" kernel start parameter.

The "possible_cpus=NUM" parameter can also extend the number of possible
cpus allowed, in which case the cpus not present at startup will be
in the offline state. (These cpus can be HOTPLUGGED ON after system
startup [pending a follow-on patch to provide the capability via the
/sys/devices/sys/cpu/cpuN/online mechanism to bring them online.])

By design, the "offlined cpus > possible cpus" display will always
use the following formats:

* all possible cpus online: "x$" or "x-y$"
* some possible cpus offline: ".*,x$" or ".*,x-y$"

where:
x == number of possible cpus (nr_cpu_ids); and
y == number of cpus >= NR_CPUS or maxcpus (if y > x).

One use of this feature is for distros to select (or configure) the
appropriate kernel to install for the resident system.

Notes:
* cpus offlined <= possible cpus will be printed for all architectures.
* cpus offlined > possible cpus will only be printed for arches that
set 'total_cpus' [X86 only in this patch].

Based on tip/cpus4096 + .../rusty/linux-2.6-for-ingo.git/master +
x86-only-patches sent 12/15.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>

commit d62720ade82c5e5b8f9585e5ed02c89573ebf111
Author: Mike Travis <[email protected]>
Date: Wed Dec 17 14:14:30 2008 -0800

sysfs: add documentation to cputopology.txt for system cpumasks

Add information to cputopology.txt explaining the output of various
system cpumask's.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>

commit 393d68fb9929817cde7ab31c82d66fcb28ad35fc
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:38 2008 +1030

cpumask: x86: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

Also makes __pcibus_to_node take a const pointer.

Signed-off-by: Rusty Russell <[email protected]>
Acked-by: Ingo Molnar <[email protected]>

commit 96d76a74870d5f11ce2abdd09a8dcdc401d714d1
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:38 2008 +1030

cpumask: sparc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

Signed-off-by: Rusty Russell <[email protected]>
Acked-by: David S. Miller <[email protected]>

commit 7479a2939df4957ba794cce814379b6d10914bdc
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:39 2008 +1030

cpumask: sh: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

Signed-off-by: Rusty Russell <[email protected]>
Cc: Paul Mundt <[email protected]>

commit 86c6f274f52c3e991d429869780945c0790e7b65
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:39 2008 +1030

cpumask: powerpc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

(Also replaces powerpc internal uses of node_to_cpumask).

Signed-off-by: Rusty Russell <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>

commit fbb776c3ca4501d5a2821bf1e9bceefcaec7ae47
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:40 2008 +1030

cpumask: IA64: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

We can also use the new for_each_cpu_and() to avoid a temporary cpumask,
and a gratuitous test in sn_topology_show.

(Includes fix from KOSAKI Motohiro <[email protected]>)

Signed-off-by: Rusty Russell <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>

commit b4a2f916a8326065816a0743dd1b0ca2ffd18f5f
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:40 2008 +1030

cpumask: Mips: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

Signed-off-by: Rusty Russell <[email protected]>
Cc: Ralf Baechle <[email protected]>

commit 2258a5bb1064351b552aceaff29393967d694fa3
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:41 2008 +1030

cpumask: alpha: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask

Impact: New APIs

The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.

I'm not sure the existing code even compiles, but the new version is
straightforward.

Signed-off-by: Rusty Russell <[email protected]>
Cc: Richard Henderson <[email protected]>

commit 030bb203e01db12e3f2866799f4f03a114d06349
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:41 2008 +1030

cpumask: cpu_coregroup_mask(): x86

Impact: New API

Like cpu_coregroup_map, but returns a (const) pointer.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Cc: Ingo Molnar <[email protected]>

commit a0ae09b46a516f05ea76e3419ad43c46f52c1165
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:42 2008 +1030

cpumask: cpu_coregroup_mask(): sparc

Like cpu_coregroup_map, but returns a (const) pointer.

Compile-tested on sparc64 (defconfig).

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 9be3eec2c83848a1ca57ebad13c63c95d0df01e2
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:42 2008 +1030

cpumask: cpu_coregroup_mask(): s390

Like cpu_coregroup_map, but returns a (const) pointer.

Compile-tested on s390 (defconfig).

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit be4d638c1597580ed2294d899d9f1a2cd10e462c
Author: Rusty Russell <[email protected]>
Date: Fri Dec 26 22:23:43 2008 +1030

cpumask: Replace cpu_coregroup_map with cpu_coregroup_mask

cpu_coregroup_map returned a cpumask_t: it's going away.

(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ingo Molnar <[email protected]>

commit 33edcf133ba93ecba2e4b6472e97b689895d805c
Merge: be4d638... 3c92ec8...
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 08:02:35 2008 +1030

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

commit 278d1ed65e25d80af7c3a112d707b3f70516ddb4
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:12 2008 +1030

cpumask: make CONFIG_NR_CPUS always valid.

Impact: cleanup

Currently we have NR_CPUS, which is 1 on UP, and CONFIG_NR_CPUS on
SMP. If we make CONFIG_NR_CPUS always valid (and always 1 on !SMP),
we can skip the middleman.

This also allows us to find and check all the unaudited NR_CPUS usage
as we prepare for v. large NR_CPUS.

To avoid breaking every arch, we cheat and do this for the moment
in the header if the arch doesn't.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 4b0bc0bca83f3fb7cf920e2ec80684c15d2269c0
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:13 2008 +1030

bitmap: test for constant as well as small size for inline versions

Impact: reduce text size

bitmap_zero et al have a fastpath for nbits <= BITS_PER_LONG, but this
should really only apply where the nbits is known at compile time.

This only saves about 1200 bytes on an allyesconfig kernel, but with
cpumasks going variable that number will increase.

text data bss dec hex filename
35327852 5035607 6782976 47146435 2cf65c3 vmlinux-before
35326640 5035607 6782976 47145223 2cf6107 vmlinux-after
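
The idiom is roughly the following (sketch only; check include/linux/bitmap.h
for the real helper name):

    #define small_const_nbits(nbits) \
        (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)

    static inline void bitmap_zero(unsigned long *dst, int nbits)
    {
        if (small_const_nbits(nbits))
            *dst = 0UL;
        else
            memset(dst, 0, BITS_TO_LONGS(nbits) * sizeof(unsigned long));
    }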

Signed-off-by: Rusty Russell <[email protected]>

commit cb78a0ce69fad2026825f957e24e2d9cda1ec9f1
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:14 2008 +1030

bitmap: fix seq_bitmap and seq_cpumask to take const pointer

Impact: cleanup

seq_bitmap just calls bitmap_scnprintf on the bits: that arg can be const.
Similarly, seq_cpumask just calls seq_bitmap.

Signed-off-by: Rusty Russell <[email protected]>

commit b3199c025d1646e25e7d1d640dd605db251dccf8
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:14 2008 +1030

cpumask: switch over to cpu_online/possible/active/present_mask: core

Impact: cleanup

This implements the obsolescent cpu_online_map in terms of
cpu_online_mask, rather than the other way around. Same for the other
maps.

The documentation comments are also updated to refer to _mask rather
than _map.
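
Roughly (simplified sketch of the direction change):

    extern const struct cpumask *const cpu_online_mask;

    /* the obsolescent map is now just a view of the mask, not the master copy */
    #define cpu_online_map  (*(cpumask_t *)cpu_online_mask)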

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit ae7a47e72e1a0b5e2b46d1596bc2c22942a73023
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:15 2008 +1030

cpumask: make cpumask.h eat its own dogfood.

Changes:
1) cpumask_t to struct cpumask,
2) cpus_weight_nr to cpumask_weight,
3) cpu_isset to cpumask_test_cpu,
4) ->bits to cpumask_bits()
5) cpu_*_map to cpu_*_mask.
6) for_each_cpu_mask_nr to for_each_cpu

Signed-off-by: Rusty Russell <[email protected]>

commit 3fa41520696fec2815e2d88fbcccdda77ba4d693
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:16 2008 +1030

cpumask: make set_cpu_*/init_cpu_* out-of-line

They're only for use in boot/cpu hotplug code anyway, and this avoids
the use of deprecated cpu_*_map.

Stephen Rothwell points out that gcc 4.2.4 (on powerpc at least)
didn't like the cast away of const anyway:

include/linux/cpumask.h: In function 'set_cpu_possible':
include/linux/cpumask.h:1052: warning: passing argument 2 of 'cpumask_set_cpu' discards qualifiers from pointer target type

So this kills two birds with one stone.
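
So boot and hotplug code now does (sketch of a hypothetical call site):

    set_cpu_online(cpu, true);    /* rather than cpu_set(cpu, cpu_online_map) */
    set_cpu_present(cpu, false);  /* rather than cpu_clear(cpu, cpu_present_map) */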

Signed-off-by: Rusty Russell <[email protected]>

commit 54b11e6d57a10aa9d0009efd93873e17bffd5d30
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:16 2008 +1030

cpumask: smp_call_function_many()

Impact: Implementation change to remove cpumask_t from stack.

Actually change smp_call_function_mask() to smp_call_function_many().
We avoid cpumasks on the stack in this version.

(S390 has its own version, but that's going away apparently).

We have to do some dancing to figure out if 0 or 1 other cpus are in
the mask supplied and the online mask without allocating a tmp
cpumask. It's still fairly cheap.

We allocate the cpumask at the end of the call_function_data
structure: if allocation fails we fallback to smp_call_function_single
rather than using the baroque quiescing code (which needs a cpumask on
stack).

(Thanks to Hiroshi Shimamoto for spotting several bugs in previous versions!)
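
The allocation trick, as a sketch (field names here are illustrative, not
necessarily the real ones):

    struct call_function_data {
        struct call_single_data csd;
        unsigned int refs;
        unsigned long cpumask_bits[];   /* trailing cpumask storage */
    };

    /* one kmalloc covers the struct plus nr_cpumask_bits worth of mask */
    cfd = kmalloc(sizeof(*cfd) + cpumask_size(), GFP_ATOMIC);
    if (cfd)
        cpumask_and(to_cpumask(cfd->cpumask_bits), mask, cpu_online_mask);
    /* else: degrade to smp_call_function_single() for each cpu in mask */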

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Cc: Hiroshi Shimamoto <[email protected]>
Cc: [email protected]
Cc: [email protected]

commit ce47d974f71af26d00832e83a43ac79bec272d99
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:17 2008 +1030

cpumask: arch_send_call_function_ipi_mask: core

Impact: new API to reduce stack usage

We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask().

Signed-off-by: Rusty Russell <[email protected]>

commit 259c4ddd00237e5072921afa15a900839643fd98
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:17 2008 +1030

cpumask: use for_each_online_cpu() in drivers/infiniband/hw/ehca/ehca_irq.c

Impact: cleanup

In future, accessing cpu numbers beyond nr_cpu_ids (the runtime limit)
will be undefined. We can avoid future problems by using
for_each_online_cpu() here.

Signed-off-by: Rusty Russell <[email protected]>
Acked-by: Hoang-Nam Nguyen <[email protected]>
Tested-by: Hoang-Nam Nguyen <[email protected]>
Cc: Christoph Raisch <[email protected]>

commit b29179c3d32021d79c11ece7199a1da41d31b1b7
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:18 2008 +1030

cpumask: use new cpumask API in drivers/infiniband/hw/ehca

Impact: cleanup

We're moving from handing around cpumask_t's to handing around struct
cpumask *'s. cpus_*, cpumask_t and cpu_*_map are deprecated: convert
to cpumask_*, cpu_*_mask.

Signed-off-by: Rusty Russell <[email protected]>
Acked-by: Hoang-Nam Nguyen <[email protected]>
Tested-by: Hoang-Nam Nguyen <[email protected]>
Cc: Christoph Raisch <[email protected]>

commit cbe31f02f5b5536f17dd978118e25052af528071
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:18 2008 +1030

cpumask: use new cpumask API in drivers/infiniband/hw/ipath

Impact: cleanup

We're moving from handing around cpumask_t's to handing around struct
cpumask *'s. cpus_*, cpumask_t and cpu_*_map are deprecated: convert
to cpumask_*, cpu_*_mask.

Signed-off-by: Rusty Russell <[email protected]>
Cc: Ralph Campbell <[email protected]>

commit e12f0102ac81d660c9f801d0a0e10ccf4537a9de
Author: Rusty Russell <[email protected]>
Date: Tue Dec 30 09:05:19 2008 +1030

cpumask: Use nr_cpu_ids in seq_cpumask

Impact: cleanup, futureproof

nr_cpu_ids is the (badly named) runtime limit on possible CPU numbers;
ie. the variable version of NR_CPUS.

With the new cpumask operators, only bits less than this are defined.
So we should use it everywhere, rather than NR_CPUS. Eventually this
will make it possible to allocate cpumasks of the minimal length at runtime.
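
The general flavour of the change (hypothetical loop):

    -   for (i = 0; i < NR_CPUS; i++)
    +   for (i = 0; i < nr_cpu_ids; i++)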

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Ingo Molnar <[email protected]>

commit 2ca1a615835d9f4990f42102ab1f2ef434e7e89c
Merge: e12f010... 6a94cb7...
Author: Rusty Russell <[email protected]>
Date: Wed Dec 31 23:05:57 2008 +1030

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Conflicts:

arch/x86/kernel/io_apic.c

commit f320786063a9d1f885d2cf34ab44aa69c1d88f43
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:13 2009 +1030

cpumask: Remove IA64 definition of total_cpus now it's in core code

Impact: fix IA64 compile

Fortunately, they have exactly the same semantics.

Signed-off-by: Rusty Russell <[email protected]>

commit e9690a6e4b1615cb0102e425e04b7ce29e7858e2
Author: Li Zefan <[email protected]>
Date: Wed Dec 31 16:45:50 2008 +0800

cpumask: fix bogus kernel-doc

Impact: fix kernel-doc

alloc_bootmem_cpumask_var() returns void.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>

commit 6aaa8ce523c7ce954b81b8c0b3e32c8be599af8d
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:14 2009 +1030

percpu: fix percpu accessors to potentially !cpu_possible() cpus: pnpbios

Impact: CPU iterator bugfixes

Percpu areas are only allocated for possible cpus. In general, you
shouldn't access random cpu's percpu areas.
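
The class of bug being fixed, as a hypothetical hunk ("foo" is a made-up
per-cpu variable):

    -   for (i = 0; i < NR_CPUS; i++)
    -       per_cpu(foo, i) = 0;
    +   for_each_possible_cpu(i)
    +       per_cpu(foo, i) = 0;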

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Adam Belay <[email protected]>

commit 9e2f913df70b378379a358a44e7d286f7b765e8e
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:14 2009 +1030

percpu: fix percpu accessors to potentially !cpu_possible() cpus: m32r

Impact: CPU iterator bugfixes

Percpu areas are only allocated for possible cpus. In general, you
shouldn't access random cpu's percpu areas.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Hirokazu Takata <[email protected]>

commit 4f4b6c1a94a8735bbdc030a2911cf395495645b6
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:15 2009 +1030

cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.: core

Impact: cleanup

In future, all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids. So use that instead of NR_CPUS in iterators
and other comparisons.

This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Acked-by: James Morris <[email protected]>
Cc: Eric Biederman <[email protected]>

commit 915441b601e6662e79f6c958e7be307967a96977
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:15 2009 +1030

cpumask: Use accessors code in core

Impact: use new API

cpu_*_map are going away in favour of cpu_*_mask, but const pointers.
So we have accessors where we really do want to frob them. Archs
will also need the (trivial) conversion before we can finally remove
cpu_*_map.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 165ac433fa3f01ba99b29972f3adc283d03b0f17
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:16 2009 +1030

parisc: remove gratuitous cpu_online_map declaration.

This is defined in linux/cpumask.h (included in this file already),
and this is now defined differently.

Signed-off-by: Rusty Russell <[email protected]>
Cc: [email protected]

commit 96b8d4c19d797200b973caab57ca842531184c13
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:16 2009 +1030

avr32: define __fls

Like fls, but can't be handed 0 and returns the bit number.

(I broke this arch in linux-next by using __fls in generic code).
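
The semantics, in generic C (the real avr32 version presumably uses the
hardware bit-scan; this is just to pin down the contract):

    static inline unsigned long __fls(unsigned long word)
    {
        /* index of the most significant set bit; undefined for word == 0 */
        return BITS_PER_LONG - 1 - __builtin_clzl(word);
    }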

Signed-off-by: Rusty Russell <[email protected]>

commit ccec25ff69d5f48c7a088c16fe2dc7e11d9e87fe
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:17 2009 +1030

blackfin: define __fls

Like fls, but can't be handed 0 and returns the bit number.

(I broke this arch in linux-next by using __fls in generic code).

Signed-off-by: Rusty Russell <[email protected]>
Acked-by: Mike Frysinger <[email protected]>

commit 434ae514c23047db87a8bbf39cebc9e1767aea44
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:18 2009 +1030

m68k: define __fls

Like fls, but can't be handed 0 and returns the bit number.

(I broke this arch in linux-next by using __fls in generic code).

Signed-off-by: Rusty Russell <[email protected]>

commit 0db5d3d2f58804edb394e8008c7d9744110338a2
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:18 2009 +1030

m68knommu: define __fls

Like fls, but can't be handed 0 and returns the bit number.

(I broke this arch in linux-next by using __fls in generic code).

Signed-off-by: Rusty Russell <[email protected]>

commit ab53d472e785e51fdfc08fc1d66252c1153e6c0f
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:19 2009 +1030

bitmap: find_last_bit()

Impact: New API

As the name suggests. For the moment everyone uses the generic one.

Signed-off-by: Rusty Russell <[email protected]>

commit e0c0ba736547e81c4f986ce192307c549d214167
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:19 2009 +1030

cpumask: Use find_last_bit()

Impact: cleanup

There's one obvious place to use it: to find the highest possible cpu.
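
i.e. something along these lines (sketch):

    /* highest set bit in cpu_possible_mask, plus one, gives nr_cpu_ids */
    nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;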

Signed-off-by: Rusty Russell <[email protected]>

commit 78fd744f827586615da5b387fa9f0af1888601b6
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:20 2009 +1030

cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): sparc

Impact: New API

The old topology_core_siblings() and topology_thread_siblings() return
a cpumask_t; these new ones return a (const) struct cpumask *.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 2bb23a63f22f0e2d91fee93ff5ca9c29e180b146
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:20 2009 +1030

cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): s390

Impact: New API

The old topology_core_siblings() and topology_thread_siblings() return
a cpumask_t; these new ones return a (const) struct cpumask *.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 9150641dd17fe9e213ab3391c8ebfc228daa2d9d
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:21 2009 +1030

cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): powerpc

Impact: New API

The old topology_core_siblings() and topology_thread_siblings() return
a cpumask_t; these new ones return a (const) struct cpumask *.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 333af15341b2f6cd813c054e1b441d7b6d8e9318
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:21 2009 +1030

cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): ia64

Impact: New API

The old topology_core_siblings() and topology_thread_siblings() return
a cpumask_t; these new ones return a (const) struct cpumask *.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 9e01c1b74c9531e301c900edaa92a99fcb7738f2
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:22 2009 +1030

cpumask: convert kernel trace functions

Impact: Reduce future memory usage, use new cpumask API.

(Eventually, cpumask_var_t will be allocated based on nr_cpu_ids, not NR_CPUS).

Convert kernel trace functions to use struct cpumask API:
1) Use cpumask_copy/cpumask_test_cpu/for_each_cpu.
2) Use cpumask_var_t and alloc_cpumask_var/free_cpumask_var everywhere.
3) Use on_each_cpu instead of playing with current->cpus_allowed.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Acked-by: Steven Rostedt <[email protected]>

commit 4462344ee9ea9224d026801b877887f2f39774a3
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:23 2009 +1030

cpumask: convert kernel trace functions further

Impact: Reduce future memory usage, use new cpumask API.

Since the last patch was created and acked, more old cpumask users
slipped into kernel/trace.

Mostly trivial conversions, except struct trace_iterator's "started"
member becomes a cpumask_var_t.

Signed-off-by: Rusty Russell <[email protected]>

commit f1fc057c79cb2d27602fb3ad08a031f13459ef27
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:23 2009 +1030

cpumask: remove any_online_cpu() users: kernel/

Impact: Remove obsolete API usage

any_online_cpu() is a good name, but it takes a cpumask_t, not a
pointer.

There are several places where any_online_cpu() doesn't really want a
mask arg at all. Replace all callers with cpumask_any() and
cpumask_any_and().
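
A hypothetical before/after, where "mask" is a cpumask_t local:

    -   target = any_online_cpu(mask);
    +   target = cpumask_any_and(&mask, cpu_online_mask);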

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit 3e597945384dee1457240158eb81e3afb90b68c2
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:24 2009 +1030

cpumask: remove any_online_cpu() users: mm/

Impact: Remove obsolete API usage

any_online_cpu() is a good name, but it takes a cpumask_t, not a
pointer.

There are several places where any_online_cpu() doesn't really want a
mask arg at all. Replace all callers with cpumask_any() and
cpumask_any_and().

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit a45185d2d7108b01b90b9e0293377be4d6346dde
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:24 2009 +1030

cpumask: convert kernel/compat.c

Impact: Reduce stack usage, use new cpumask API.

Straightforward conversion; cpumasks' size is given by cpumask_size() (now
a variable rather than fixed) and on-stack cpu masks use cpumask_var_t.

Signed-off-by: Rusty Russell <[email protected]>

commit e7577c50f2fb2d1c167e2c04a4b4c2cc042acb82
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:25 2009 +1030

cpumask: convert kernel/workqueue.c

Impact: Reduce memory usage, use new cpumask API.

cpu_populated_map becomes a cpumask_var_t, and cpu_singlethread_map is
simply a cpumask pointer: it's just the cpumask containing the first
possible CPU anyway.
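
Roughly (initialization sketched, names as in the changelog):

    singlethread_cpu = cpumask_first(cpu_possible_mask);
    cpu_singlethread_map = cpumask_of(singlethread_cpu);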

Signed-off-by: Rusty Russell <[email protected]>

commit 6b954823c24f04ed026a8517f6bab5abda279db8
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:25 2009 +1030

cpumask: convert kernel time functions

Impact: Use new APIs

Convert kernel/time functions to use struct cpumask *.

Note the ugly bitmap declarations in tick-broadcast.c. These should
be cpumask_var_t, but there was no obvious initialization function to
put the alloc_cpumask_var() calls in. This was safe.

(Eventually 'struct cpumask' will be undefined for CONFIG_CPUMASK_OFFSTACK,
so we use a bitmap here to show we really mean it).
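
The flavour of those declarations (names approximate):

    /* should be cpumask_var_t; a bare bitmap until there's an init hook */
    static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS);

    /* used via to_cpumask() at the call sites */
    cpumask_set_cpu(cpu, to_cpumask(tick_broadcast_mask));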

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>

commit d036e67b40f52bdd95392390108defbac7e53837
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:26 2009 +1030

cpumask: convert kernel/irq

Impact: Reduce stack usage, use new cpumask API. ALPHA mod!

Main change is that irq_default_affinity becomes a cpumask_var_t, so
treat it as a pointer (this affects alpha).

Signed-off-by: Rusty Russell <[email protected]>

commit bd232f97b30f6bb630efa136a777647545db3039
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:26 2009 +1030

cpumask: convert RCU implementations

Impact: use new cpumask API.

rcu_ctrlblk contains a cpumask, and it's highly optimized so I don't want
a cpumask_var_t (ie. a pointer) for the CONFIG_CPUMASK_OFFSTACK case. It
could use a dangling bitmap, and be allocated in __rcu_init to save memory,
but for the moment we use a bitmap.

(Eventually 'struct cpumask' will be undefined for CONFIG_CPUMASK_OFFSTACK,
so we use a bitmap here to show we really mean it).

We remove on-stack cpumasks, using cpumask_var_t for
rcu_torture_shuffle_tasks() and for_each_cpu_and in force_quiescent_state().

Signed-off-by: Rusty Russell <[email protected]>

commit c309b917cab55799ea489d7b5f1b77025d9f8462
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:27 2009 +1030

cpumask: convert kernel/profile.c

Impact: Reduce kernel memory usage, use new cpumask API.

Avoid a static cpumask_t for prof_cpu_mask, and an on-stack cpumask_t
in prof_cpu_mask_write_proc. Both become cpumask_var_t.

prof_cpu_mask is only allocated when profiling is on, but the NULL
checks are optimized out by gcc for the !CPUMASK_OFFSTACK case.

Also removed some strange and unnecessary casts.

Signed-off-by: Rusty Russell <[email protected]>

commit e0b582ec56f1a1d8b30ebf340a7b91fb09f26c8c
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:28 2009 +1030

cpumask: convert kernel/cpu.c

Impact: Reduce kernel stack and memory usage, use new cpumask API.

Use cpumask_var_t for take_cpu_down() stack var, and frozen_cpus.

Note that notify_cpu_starting() can be called before core_initcall
allocates frozen_cpus, but the NULL check is optimized out by gcc for
the CONFIG_CPUMASK_OFFSTACK=n case.

Signed-off-by: Rusty Russell <[email protected]>

commit 41c7bb9588904eb060a95bcad47bd3804a1ece25
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:28 2009 +1030

cpumask: convert rest of files in kernel/

Impact: Reduce stack usage, use new cpumask API.

Mainly changing cpumask_t to 'struct cpumask' and similar simple API
conversion. Two conversions worth mentioning:

1) we use cpumask_any_but to avoid a temporary in kernel/softlockup.c,
2) Use cpumask_var_t in taskstats_user_cmd().

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Ingo Molnar <[email protected]>

commit 174596a0b9f21e8844d70566a6bb29bf48a87750
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:29 2009 +1030

cpumask: convert mm/

Impact: Use new API

Convert kernel mm functions to use struct cpumask.

We skip include/linux/percpu.h and mm/allocpercpu.c, which are in flux.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>

commit 5db0e1e9e0f30f160b832a0b5cd1131954bf4f6e
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:29 2009 +1030

cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/

Impact: cleanup

Simple replacement, now the _nr is redundant.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Cc: Ingo Molnar <[email protected]>

commit 2a53008033189ed09bfe241c6b33811ba4ce980d
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:30 2009 +1030

cpumask: zero extra bits in alloc_cpumask_var_node

Impact: extra safety checks during transition

When CONFIG_CPUMASK_OFFSTACK is set, the new cpumask_ operators only
use bits up to nr_cpu_ids, not NR_CPUS. Using the old cpus_ operators
on these masks can mean accessing undefined bits.

After some discussion, Mike and I decided to err on the side of caution;
we zero the "undefined" bits in alloc_cpumask_var_node() until all the
old cpumask functions are removed.
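
The effect, in its simplest form (the real patch presumably does this with
a memset rather than a bit-by-bit loop):

    if (*mask) {
        int cpu;

        /* old cpus_* ops look at bits up to NR_CPUS: make sure they're zero */
        for (cpu = nr_cpumask_bits; cpu < NR_CPUS; cpu++)
            clear_bit(cpu, cpumask_bits(*mask));
    }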

Signed-off-by: Rusty Russell <[email protected]>

commit 8c384cdee3e04d6194a2c2b192b624754f990835
Author: Rusty Russell <[email protected]>
Date: Thu Jan 1 10:12:30 2009 +1030

cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS

Impact: new debug CONFIG options

This helps find unconverted code. It currently breaks compile horribly,
but we never wanted a flag day so that's expected.

Signed-off-by: Rusty Russell <[email protected]>


2009-01-02 20:06:58

by Linus Torvalds

Subject: Re: [PULL] cpumask tree



On Thu, 1 Jan 2009, Rusty Russell wrote:
>
> OK, this is the bulk of the conversion to the new cpumask operators.
> The x86-specific parts (the most aggressive large-NR_CPUS arch) are going
> via Ingo's tree.

This gets lots of conflicts for me. Some of them look simple enough, but
not all. io_apic.c gets lots of nasty conflicts, and it _looks_ like I
should just pick the version of the file that I already have (because the
only thing that comes in from that is yet another merge commit), but
kernel/sched.c also gets conflicts in areas with FIXME's etc.

Rusty, Ingo, can you work this out? I pushed out my current tree.

Linus

2009-01-02 20:39:13

by Ingo Molnar

Subject: Re: [PULL] cpumask tree


* Linus Torvalds <[email protected]> wrote:

> On Thu, 1 Jan 2009, Rusty Russell wrote:
> >
> > OK, this is the bulk of the conversion to the new cpumask operators.
> > The x86-specific parts (the most aggressive large-NR_CPUS arch) are
> > going via Ingo's tree.
>
> This gets lots of conflicts for me. Some of them look simple enough, but
> not all. io_apic.c gets lots of nasty conflicts, and it _looks_ like I
> should just pick the version of the file that I already have (because
> the only thing that comes in from that is yet another merge commit), but
> kernel/sched.c also gets conflicts in areas with FIXME's etc.
>
> Rusty, Ingo, can you work this out? I pushed out my current tree.

yes, we have those conflicts all resolved already in the second phase of
the tip/cpus4096 changes: Mike did all those difficult conflict
resolutions over the holidays and i pulled it yesterday.

The end result looks nice as a tree but it is not fully cooked yet: since
i pulled yesterday i found a couple of build and runtime test failures
with Rusty's latest cpumask tree:

- architectures that have no __fls (8 out of 21) fail to build:

arch/cris
arch/frv
arch/h8300
arch/m32r
arch/m68k
arch/mn10300
arch/xtensa

- there's a new circular locking lockdep splat in CPU hotplug tests when
the var-cpumask code is enabled. (needs a handful of default-off
options enabled in the .config)

So i didn't want to push the second phase to you until those known bugs are
sorted out - i think we need some more time for that - a day or two at
most.

Linus, would you like to pull that, despite the pending regressions? Can
send a pull request right away.

Rusty, would it be fine with you if we did all the remaining bits via
tip/cpus4096? It's your tree and your bits and we wanted to send our
remaining bits after your tree went to Linus but the conflict resolutions
from Mike are valuable so i think we should reconsider the ordering.

All the commits you sent to Linus in this pull request are already
included in tip/cpus4096; the conflicts that Linus hit are all non-trivial,
Mike resolved them correctly, and it merges cleanly with Linus's latest
tree:

earth4:~/tip> git pull
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask.git master
From git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask
* branch master -> FETCH_HEAD
Already up-to-date.

The pending diff is:

108 files changed, 1442 insertions(+), 979 deletions(-)

which is pretty OK and straightforward.

With that, Rusty and Mike have done 99% of the cpumask conversions, so the
most difficult phase of the conversion should be dealt with in .29. Cool
stuff.

Ingo

2009-01-02 23:32:22

by Linus Torvalds

Subject: Re: [PULL] cpumask tree



On Fri, 2 Jan 2009, Ingo Molnar wrote:
>
> Linus, would you like to pull that, despite the pending regressions? Can
> send a pull request right away.

No, if there are known pending regressions, please get those fixed first.

Linus

2009-01-03 00:23:40

by Rusty Russell

Subject: Re: [PULL] cpumask tree

On Saturday 03 January 2009 06:36:33 Linus Torvalds wrote:
>
> On Thu, 1 Jan 2009, Rusty Russell wrote:
> >
> > OK, this is the bulk of the conversion to the new cpumask operators.
> > The x86-specific parts (the most aggressive large-NR_CPUS arch) are going
> > via Ingo's tree.
>
> This gets lots of conflicts for me. Some of them look simple enough, but
> not all. io_apic.c gets lots of nasty conflicts, and it _looks_ like I
> should just pick the version of the file that I already have (because the
> only thing that comes in from that is yet another merge commit), but
> kernel/sched.c also gets conflicts in areas with FIXME's etc.
>
> Rusty, Ingo, can you work this out? I pushed out my current tree.

Yes, some went via Ingo's tree and there are some known overlaps.

I'll clean it up and resend pull req this weekend.

Thanks,
Rusty.

2009-01-03 07:21:18

by Rusty Russell

Subject: Re: [PULL] cpumask tree

On Saturday 03 January 2009 07:08:40 Ingo Molnar wrote:
> - architectures that have no __fls (8 out of 21) fail to build:
>
> arch/cris
> arch/frv
> arch/h8300
> arch/m32r
> arch/m68k
> arch/mn10300
> arch/xtensa

Fixes pushed, m68k should be OK tho; is this an actual compile test? You have
to look in include/asm-m68k to see __fls.

> Rusty, would it be fine with you if we did all the remaining bits via
> tip/cpus4096? It's your tree and your bits and we wanted to send our
> remaining bits after your tree went to Linus but the conflict resolutions
> from Mike are valuable so i think we should reconsider the ordering.

Yeah, no reason for us to do the merge twice. As long as it ends upstream,
I'm a happy camper.

Thanks,
Rusty.

2009-01-03 10:52:34

by Ingo Molnar

Subject: Re: [PULL] cpumask tree


* Rusty Russell <[email protected]> wrote:

> On Saturday 03 January 2009 07:08:40 Ingo Molnar wrote:
> > - architectures that have no __fls (8 out of 21) fail to build:
> >
> > arch/cris
> > arch/frv
> > arch/h8300
> > arch/m32r
> > arch/m68k
> > arch/mn10300
> > arch/xtensa
>
> Fixes pushed, m68k should be OK tho; is this an actual compile test? You have
> to look in include/asm-m68k to see __fls.

yeah, i stopped the tests after the first two build failures - the rest is
a grep result from arch/*/, that's why include/asm-m68k/ was left out.

> > Rusty, would it be fine with you if we did all the remaining bits via
> > tip/cpus4096? It's your tree and your bits and we wanted to send our
> > remaining bits after your tree went to Linus but the conflict
> > resolutions from Mike are valuable so i think we should reconsider the
> > ordering.
>
> Yeah, no reason for us to do the merge twice. As long as it ends
> upstream, I'm a happy camper.

great - let's do it that way then. I have pulled your fixes into the
cpus4096 tree:

5ece5c5: xtensa: define __fls
5c134da: mn10300: define __fls
16a2062: m32r: define __fls
9ddabc2: h8300: define __fls
ee38e51: frv: define __fls
0999769: cris: define __fls

Once we have figured out the CPU-hotplug lockdep splat (possibly due to
Mike's changes, not yours) i'll send it to Linus. Thanks,

Ingo

2009-01-03 11:59:37

by Ingo Molnar

Subject: [PATCH] ia64: cpumask fix for is_affinity_mask_valid()


* Ingo Molnar <[email protected]> wrote:

> * Rusty Russell <[email protected]> wrote:
>
> > On Saturday 03 January 2009 07:08:40 Ingo Molnar wrote:
> > > - architectures that have no __fls (8 out of 21) fail to build:
> > >
> > > arch/cris
> > > arch/frv
> > > arch/h8300
> > > arch/m32r
> > > arch/m68k
> > > arch/mn10300
> > > arch/xtensa
> >
> > Fixes pushed, m68k should be OK tho; is this an actual compile test? You have
> > to look in include/asm-m68k to see __fls.
>
> yeah, i stopped the tests after the first two build failures - the rest is
> a grep result from arch/*/, that's why include/asm-m68k/ was left out.
>
> > > Rusty, would it be fine with you if we did all the remaining bits via
> > > tip/cpus4096? It's your tree and your bits and we wanted to send our
> > > remaining bits after your tree went to Linus but the conflict
> > > resolutions from Mike are valuable so i think we should reconsider the
> > > ordering.
> >
> > Yeah, no reason for us to do the merge twice. As long as it ends
> > upstream, I'm a happy camper.
>
> great - lets do it that way then. I have pulled your fixes into the
> cpus4096 tree:
>
> 5ece5c5: xtensa: define __fls
> 5c134da: mn10300: define __fls
> 16a2062: m32r: define __fls
> 9ddabc2: h8300: define __fls
> ee38e51: frv: define __fls
> 0999769: cris: define __fls

ok, these architectures build fine now.

There's one new build failure due to cpumask changes: ia64. I have fixed
it via the patch below. (if it looks good to you i'll carry it via
tip/cpus4096, ok?)

Breakage came via this commit i think:

d036e67: cpumask: convert kernel/irq

[ia64 is the only architecture that defines its own is_affinity_mask_valid().]
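
Roughly, the mismatch looks like this (illustrative sketch only, not part of
the patch below; the helper name is made up): with CONFIG_CPUMASK_OFFSTACK=n,
cpumask_var_t is struct cpumask[1], so it is passed around like a pointer,
while the old ia64 prototype still took a full cpumask_t by value:

        /* hypothetical caller, mirroring what kernel/irq/proc.c does after d036e67 */
        static int example_affinity_check(const char __user *buf, unsigned long count)
        {
                cpumask_var_t new_value;
                int err;

                if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
                        return -ENOMEM;

                err = cpumask_parse_user(buf, count, new_value);
                if (!err && !is_affinity_mask_valid(new_value))
                        err = -EINVAL;          /* passed directly, no '*' dereference */

                free_cpumask_var(new_value);
                return err;
        }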

Ingo

----------------------->
From f1faf35ad450c3d648138b74b9f58b283dc19c5c Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Sat, 3 Jan 2009 12:50:46 +0100
Subject: [PATCH] ia64: cpumask fix for is_affinity_mask_valid()

Impact: build fix on ia64

On ia64, is_affinity_mask_valid() still used the old cpumask_t API, which
broke the build in default_affinity_write():

/home/mingo/tip/kernel/irq/proc.c: In function `default_affinity_write':
/home/mingo/tip/kernel/irq/proc.c:114: error: incompatible type for argument 1 of `is_affinity_mask_valid'
make[3]: *** [kernel/irq/proc.o] Error 1
make[3]: *** Waiting for unfinished jobs....

update it to cpumask_var_t.

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/include/asm/irq.h | 2 +-
arch/ia64/kernel/irq.c | 4 ++--
kernel/irq/proc.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/irq.h b/arch/ia64/include/asm/irq.h
index 3627116..36429a5 100644
--- a/arch/ia64/include/asm/irq.h
+++ b/arch/ia64/include/asm/irq.h
@@ -27,7 +27,7 @@ irq_canonicalize (int irq)
}

extern void set_irq_affinity_info (unsigned int irq, int dest, int redir);
-bool is_affinity_mask_valid(cpumask_t cpumask);
+bool is_affinity_mask_valid(cpumask_var_t cpumask);

#define is_affinity_mask_valid is_affinity_mask_valid

diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c
index 0b6db53..95ff16c 100644
--- a/arch/ia64/kernel/irq.c
+++ b/arch/ia64/kernel/irq.c
@@ -112,11 +112,11 @@ void set_irq_affinity_info (unsigned int irq, int hwid, int redir)
}
}

-bool is_affinity_mask_valid(cpumask_t cpumask)
+bool is_affinity_mask_valid(cpumask_var_t cpumask)
{
if (ia64_platform_is("sn2")) {
/* Only allow one CPU to be specified in the smp_affinity mask */
- if (cpus_weight(cpumask) != 1)
+ if (cpumask_weight(cpumask) != 1)
return false;
}
return true;
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 2abd3a7..aae3f74 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -54,7 +54,7 @@ static ssize_t irq_affinity_proc_write(struct file *file,
if (err)
goto free_cpumask;

- if (!is_affinity_mask_valid(*new_value)) {
+ if (!is_affinity_mask_valid(new_value)) {
err = -EINVAL;
goto free_cpumask;
}

2009-01-03 12:20:08

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] cpumask: convert RCU implementations, fix


there's a build warning regression as well on CONFIG_RCU_CLASSIC:

kernel/rcuclassic.c: In function 'rcu_start_batch':
kernel/rcuclassic.c:397: warning: passing argument 1 of 'cpumask_andnot' from incompatible pointer type

caused by:

bd232f9: cpumask: convert RCU implementations

fixed via the patch below. (i'll carry this in cpus4096 too, ok?)
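
For context, a sketch of the pattern (not from the patch; it assumes the
rcu_ctrlblk field is a DECLARE_BITMAP(), which is what the to_cpumask() fix
implies):

        /* field as converted by bd232f9, inside struct rcu_ctrlblk */
        DECLARE_BITMAP(cpumask, NR_CPUS);

        /* &rcp->cpumask points at the raw bitmap array, hence the
         * incompatible-pointer warning; to_cpumask() wraps the bitmap as
         * a struct cpumask * for the new operators:
         */
        cpumask_andnot(to_cpumask(rcp->cpumask), cpu_online_mask, nohz_cpu_mask);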

Ingo

---------------------->
From 9caab32c0d69e569e33caa03845cec73ae701f32 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Sat, 3 Jan 2009 13:16:09 +0100
Subject: [PATCH] cpumask: convert RCU implementations, fix
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Impact: cleanup

This warning:

kernel/rcuclassic.c: In function 'rcu_start_batch':
kernel/rcuclassic.c:397: warning: passing argument 1 of 'cpumask_andnot' from incompatible pointer type

triggers because one usage site of rcp->cpumask was not converted
to to_cpumask(rcp->cpumask). There are no ill effects from this bug.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/rcuclassic.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index 6ec495f..490934f 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -394,7 +394,8 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
* unnecessarily.
*/
smp_mb();
- cpumask_andnot(&rcp->cpumask, cpu_online_mask, nohz_cpu_mask);
+ cpumask_andnot(to_cpumask(rcp->cpumask),
+ cpu_online_mask, nohz_cpu_mask);

rcp->signaled = 0;
}

2009-01-03 14:59:07

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Rusty Russell <[email protected]> wrote:
>
>> On Saturday 03 January 2009 07:08:40 Ingo Molnar wrote:
>>> - architectures that have no __fls (8 out of 21) fail to build:
>>>
>>> arch/cris
>>> arch/frv
>>> arch/h8300
>>> arch/m32r
>>> arch/m68k
>>> arch/mn10300
>>> arch/xtensa
>> Fixes pushed, m68k should be OK though; is this an actual compile test? You have
>> to look in include/asm-m68k to see __fls.
>
> yeah, i stopped the tests after the first two build failures - the rest is
> a grep result from arch/*/, that's why include/asm-m68k/ was left out.
>
>>> Rusty, would it be fine with you if we did all the remaining bits via
>>> tip/cpus4096? It's your tree and your bits and we wanted to send our
>>> remaining bits after your tree went to Linus but the conflict
>>> resolutions from Mike are valuable so i think we should reconsider the
>>> ordering.
>> Yeah, no reason for us to do the merge twice. As long as it ends
>> upstream, I'm a happy camper.
>
> great - lets do it that way then. I have pulled your fixes into the
> cpus4096 tree:
>
> 5ece5c5: xtensa: define __fls
> 5c134da: mn10300: define __fls
> 16a2062: m32r: define __fls
> 9ddabc2: h8300: define __fls
> ee38e51: frv: define __fls
> 0999769: cris: define __fls
>
> Once we have figured out the CPU-hotplug lockdep splat (possibly due to
> Mike's changes not yours) i'll send it to Linus. Thanks,
>
> Ingo

Thanks! Am working on that now.

Mike

2009-01-03 15:07:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Mike Travis <[email protected]> wrote:

> > Once we have figured out the CPU-hotplug lockdep splat (possibly due
> > to Mike's changes not yours) i'll send it to Linus. Thanks,
> >
> > Ingo
>
> Thanks! Am working on that now.

do you suspect any of the commits? I'm bisecting it right now but if you
have bisected it already i wont repeat it.

Ingo

2009-01-03 15:31:53

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>>> Once we have figured out the CPU-hotplug lockdep splat (possibly due
>>> to Mike's changes not yours) i'll send it to Linus. Thanks,
>>>
>>> Ingo
>> Thanks! Am working on that now.
>
> do you suspect any of the commits? I'm bisecting it right now but if you
> have bisected it already i wont repeat it.
>
> Ingo

I haven't done that yet, I'm just now getting your config to work pre-patches.

But yes, I suspect one of the changes to use "work_on_cpu" -- this usually
causes the 2nd call to get_online_cpus().

Thanks,
Mike

2009-01-03 15:47:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Mike Travis <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Mike Travis <[email protected]> wrote:
> >
> >>> Once we have figured out the CPU-hotplug lockdep splat (possibly due
> >>> to Mike's changes not yours) i'll send it to Linus. Thanks,
> >>>
> >>> Ingo
> >> Thanks! Am working on that now.
> >
> > do you suspect any of the commits? I'm bisecting it right now but if you
> > have bisected it already i wont repeat it.
> >
> > Ingo
>
> I haven't done that yet, I'm just now getting your config to work pre-patches.
>
> But yes, I suspect one of the changes to use "work_on_cpu" -- this
> usually causes the 2nd call to get_online_cpus().

i suspect it's:

| commit 2d22bd5e74519854458ad372a89006e65f45e628
| Author: Mike Travis <[email protected]>
| Date: Wed Dec 31 18:08:46 2008 -0800
|
| x86: cleanup remaining cpumask_t code in microcode_core.c

as the microcode is loaded during CPU onlining.

Ingo

2009-01-03 15:52:19

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> * Mike Travis <[email protected]> wrote:
>>>
>>>>> Once we have figured out the CPU-hotplug lockdep splat (possibly due
>>>>> to Mike's changes not yours) i'll send it to Linus. Thanks,
>>>>>
>>>>> Ingo
>>>> Thanks! Am working on that now.
>>> do you suspect any of the commits? I'm bisecting it right now but if you
>>> have bisected it already i wont repeat it.
>>>
>>> Ingo
>> I haven't done that yet, I'm just now getting your config to work pre-patches.
>>
>> But yes, I suspect one of the changes to use "work_on_cpu" -- this
>> usually causes the 2nd call to get_online_cpus().
>
> i suspect it's:
>
> | commit 2d22bd5e74519854458ad372a89006e65f45e628
> | Author: Mike Travis <[email protected]>
> | Date: Wed Dec 31 18:08:46 2008 -0800
> |
> | x86: cleanup remaining cpumask_t code in microcode_core.c
>
> as the microcode is loaded during CPU onlining.
>
> Ingo

I think you're right. You need the coincidence of both writing to the
/sys/devices/system/cpu/cpuX/online file and the cpu coming up quickly
enough to attempt the microcode load before the write has returned.
(Or something similar.)

Btw, I'm resending your "fix ia64" patch shortly.

Thanks,
Mike

2009-01-03 16:00:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Ingo Molnar <[email protected]> wrote:

> i suspect it's:
>
> | commit 2d22bd5e74519854458ad372a89006e65f45e628
> | Author: Mike Travis <[email protected]>
> | Date: Wed Dec 31 18:08:46 2008 -0800
> |
> | x86: cleanup remaining cpumask_t code in microcode_core.c
>
> as the microcode is loaded during CPU onlining.

yep, that's the bad one. Should i revert it or do you have a safe fix in
mind?

Ingo

2009-01-03 16:09:59

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> i suspect it's:
>>
>> | commit 2d22bd5e74519854458ad372a89006e65f45e628
>> | Author: Mike Travis <[email protected]>
>> | Date: Wed Dec 31 18:08:46 2008 -0800
>> |
>> | x86: cleanup remaining cpumask_t code in microcode_core.c
>>
>> as the microcode is loaded during CPU onlining.
>
> yep, that's the bad one. Should i revert it or do you have a safe fix in
> mind?
>
> Ingo

Probably revert for now. There are a few more following patches that also
use 'work_on_cpu' so a better (more global?) fix should be used.

Any thought on using a recursive lock for cpu-hotplug-lock? (At least for
get_online_cpus()?)

Thanks,
Mike

2009-01-03 16:43:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Mike Travis <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> i suspect it's:
> >>
> >> | commit 2d22bd5e74519854458ad372a89006e65f45e628
> >> | Author: Mike Travis <[email protected]>
> >> | Date: Wed Dec 31 18:08:46 2008 -0800
> >> |
> >> | x86: cleanup remaining cpumask_t code in microcode_core.c
> >>
> >> as the microcode is loaded during CPU onlining.
> >
> > yep, that's the bad one. Should i revert it or do you have a safe fix in
> > mind?
> >
> > Ingo
>
> Probably revert for now. There are a few more following patches that
> also use 'work_on_cpu' so a better (more global?) fix should be used.
>
> Any thought on using a recursive lock for cpu-hotplug-lock? (At least
> for get_online_cpus()?)

but the problem has nothing to do with self-recursion. Take a look at the
lockdep warning i posted (also below) - the locks are simply taken in the
wrong order.

your change adds this cpu_hotplug.lock usage:

[ 43.652000] -> #1 (&cpu_hotplug.lock){--..}:
[ 43.652000] [<ffffffff8027a7c0>] __lock_acquire+0xf10/0x1360
[ 43.652000] [<ffffffff8027aca9>] lock_acquire+0x99/0xd0
[ 43.652000] [<ffffffff809b5e4a>] __mutex_lock_common+0xaa/0x450
[ 43.652000] [<ffffffff809b62cf>] mutex_lock_nested+0x3f/0x50
[ 43.652000] [<ffffffff802516ba>] get_online_cpus+0x3a/0x50
[ 43.652000] [<ffffffff802648dc>] work_on_cpu+0x6c/0xc0
[ 43.652000] [<ffffffff8022b2a2>] mc_sysdev_add+0x92/0xa0
[ 43.652000] [<ffffffff8050a800>] sysdev_driver_register+0xb0/0x140
[ 43.652000] [<ffffffff8163c792>] microcode_init+0xb2/0x13b
[ 43.652000] [<ffffffff8020a041>] do_one_initcall+0x41/0x180
[ 43.652000] [<ffffffff8162e6cb>] kernel_init+0x145/0x19d
[ 43.652000] [<ffffffff802146aa>] child_rip+0xa/0x20
[ 43.652000] [<ffffffffffffffff>] 0xffffffffffffffff

which nests it inside sysdev_drivers_lock - which is wrong
[sysdev_drivers_lock is a pretty low-level lock that generally nests inside
the CPU hotplug lock].

If you want to use work_on_cpu() it should be done on a higher level, so
that sysdev_drivers_lock is taken after the hotplug lock.
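
To make the inversion concrete, an illustrative stripped-down model of the
two paths (plain mutexes named after the real locks; not the actual kernel
code):

        static DEFINE_MUTEX(sysdev_drivers_lock);       /* drivers/base/sys.c */
        static DEFINE_MUTEX(cpu_hotplug_lock);          /* stands in for cpu_hotplug.lock */

        static void boot_path(void)             /* microcode_init chain (#1 below) */
        {
                mutex_lock(&sysdev_drivers_lock);       /* sysdev_driver_register() */
                mutex_lock(&cpu_hotplug_lock);          /* work_on_cpu() -> get_online_cpus() */
                mutex_unlock(&cpu_hotplug_lock);
                mutex_unlock(&sysdev_drivers_lock);
        }

        static void cpu_unplug_path(void)       /* cpu_down chain (#0 below) */
        {
                mutex_lock(&cpu_hotplug_lock);          /* cpu_hotplug_begin() */
                mutex_lock(&sysdev_drivers_lock);       /* mce_cpu_callback() -> sysdev_unregister() */
                mutex_unlock(&sysdev_drivers_lock);
                mutex_unlock(&cpu_hotplug_lock);
        }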

Ingo

[ 43.376051] lockdep: fixing up alternatives.
[ 43.380007] SMP alternatives: switching to UP code
[ 43.616014] CPU0 attaching NULL sched-domain.
[ 43.620068] CPU1 attaching NULL sched-domain.
[ 43.644482] CPU0 attaching NULL sched-domain.
[ 43.648264]
[ 43.648265] =======================================================
[ 43.652000] [ INFO: possible circular locking dependency detected ]
[ 43.652000] 2.6.28-05081-geeff031-dirty #37
[ 43.652000] -------------------------------------------------------
[ 43.652000] S99local/1238 is trying to acquire lock:
[ 43.652000] (sysdev_drivers_lock){--..}, at: [<ffffffff8050a52d>] sysdev_unregister+0x1d/0x80
[ 43.652000]
[ 43.652000] but task is already holding lock:
[ 43.652000] (&cpu_hotplug.lock){--..}, at: [<ffffffff802515d7>] cpu_hotplug_begin+0x27/0x60
[ 43.652000]
[ 43.652000] which lock already depends on the new lock.
[ 43.652000]
[ 43.652000]
[ 43.652000] the existing dependency chain (in reverse order) is:
[ 43.652000]
[ 43.652000] -> #1 (&cpu_hotplug.lock){--..}:
[ 43.652000] [<ffffffff8027a7c0>] __lock_acquire+0xf10/0x1360
[ 43.652000] [<ffffffff8027aca9>] lock_acquire+0x99/0xd0
[ 43.652000] [<ffffffff809b5e4a>] __mutex_lock_common+0xaa/0x450
[ 43.652000] [<ffffffff809b62cf>] mutex_lock_nested+0x3f/0x50
[ 43.652000] [<ffffffff802516ba>] get_online_cpus+0x3a/0x50
[ 43.652000] [<ffffffff802648dc>] work_on_cpu+0x6c/0xc0
[ 43.652000] [<ffffffff8022b2a2>] mc_sysdev_add+0x92/0xa0
[ 43.652000] [<ffffffff8050a800>] sysdev_driver_register+0xb0/0x140
[ 43.652000] [<ffffffff8163c792>] microcode_init+0xb2/0x13b
[ 43.652000] [<ffffffff8020a041>] do_one_initcall+0x41/0x180
[ 43.652000] [<ffffffff8162e6cb>] kernel_init+0x145/0x19d
[ 43.652000] [<ffffffff802146aa>] child_rip+0xa/0x20
[ 43.652000] [<ffffffffffffffff>] 0xffffffffffffffff
[ 43.652000]
[ 43.652000] -> #0 (sysdev_drivers_lock){--..}:
[ 43.652000] [<ffffffff8027a89c>] __lock_acquire+0xfec/0x1360
[ 43.652000] [<ffffffff8027aca9>] lock_acquire+0x99/0xd0
[ 43.652000] [<ffffffff809b5e4a>] __mutex_lock_common+0xaa/0x450
[ 43.652000] [<ffffffff809b62cf>] mutex_lock_nested+0x3f/0x50
[ 43.652000] [<ffffffff8050a52d>] sysdev_unregister+0x1d/0x80
[ 43.652000] [<ffffffff809af9d1>] mce_cpu_callback+0xce/0x101
[ 43.652000] [<ffffffff809bbb75>] notifier_call_chain+0x65/0xa0
[ 43.652000] [<ffffffff8026d696>] raw_notifier_call_chain+0x16/0x20
[ 43.652000] [<ffffffff80964a00>] _cpu_down+0x240/0x350
[ 43.652000] [<ffffffff80964b8b>] cpu_down+0x7b/0xa0
[ 43.652000] [<ffffffff80966268>] store_online+0x48/0xa0
[ 43.652000] [<ffffffff80509e90>] sysdev_store+0x20/0x30
[ 43.652000] [<ffffffff80335ddf>] sysfs_write_file+0xcf/0x140
[ 43.652000] [<ffffffff802dc1f7>] vfs_write+0xc7/0x150
[ 43.652000] [<ffffffff802dc375>] sys_write+0x55/0x90
[ 43.652000] [<ffffffff802133ca>] system_call_fastpath+0x16/0x1b
[ 43.652000] [<ffffffffffffffff>] 0xffffffffffffffff
[ 43.652000]
[ 43.652000] other info that might help us debug this:
[ 43.652000]
[ 43.652000] 3 locks held by S99local/1238:
[ 43.652000] #0: (&buffer->mutex){--..}, at: [<ffffffff80335d58>] sysfs_write_file+0x48/0x140
[ 43.652000] #1: (cpu_add_remove_lock){--..}, at: [<ffffffff80964b3f>] cpu_down+0x2f/0xa0
[ 43.652000] #2: (&cpu_hotplug.lock){--..}, at: [<ffffffff802515d7>] cpu_hotplug_begin+0x27/0x60
[ 43.652000]
[ 43.652000] stack backtrace:
[ 43.652000] Pid: 1238, comm: S99local Not tainted 2.6.28-05081-geeff031-dirty #37
[ 43.652000] Call Trace:
[ 43.652000] [<ffffffff80277f24>] print_circular_bug_tail+0xa4/0x100
[ 43.652000] [<ffffffff8027a89c>] __lock_acquire+0xfec/0x1360
[ 43.652000] [<ffffffff8027aca9>] lock_acquire+0x99/0xd0
[ 43.652000] [<ffffffff8050a52d>] ? sysdev_unregister+0x1d/0x80
[ 43.652000] [<ffffffff809b5e4a>] __mutex_lock_common+0xaa/0x450
[ 43.652000] [<ffffffff8050a52d>] ? sysdev_unregister+0x1d/0x80
[ 43.652000] [<ffffffff8050a52d>] ? sysdev_unregister+0x1d/0x80
[ 43.652000] [<ffffffff809b62cf>] mutex_lock_nested+0x3f/0x50
[ 43.652000] [<ffffffff8050a52d>] sysdev_unregister+0x1d/0x80
[ 43.652000] [<ffffffff809af9d1>] mce_cpu_callback+0xce/0x101
[ 43.652000] [<ffffffff809bbb75>] notifier_call_chain+0x65/0xa0
[ 43.652000] [<ffffffff8026d696>] raw_notifier_call_chain+0x16/0x20
[ 43.652000] [<ffffffff80964a00>] _cpu_down+0x240/0x350
[ 43.652000] [<ffffffff809b4763>] ? wait_for_common+0xe3/0x1b0
[ 43.652000] [<ffffffff80964b8b>] cpu_down+0x7b/0xa0
[ 43.652000] [<ffffffff80966268>] store_online+0x48/0xa0
[ 43.652000] [<ffffffff80509e90>] sysdev_store+0x20/0x30
[ 43.652000] [<ffffffff80335ddf>] sysfs_write_file+0xcf/0x140
[ 43.652000] [<ffffffff802dc1f7>] vfs_write+0xc7/0x150
[ 43.652000] [<ffffffff802dc375>] sys_write+0x55/0x90
[ 43.652000] [<ffffffff802133ca>] system_call_fastpath+0x16/0x1b
[ 43.652104] device: 'msr1': device_unregister
[ 43.656005] PM: Removing info for No Bus:msr1

2009-01-03 16:48:42

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> * Ingo Molnar <[email protected]> wrote:
>>>
>>>> i suspect it's:
>>>>
>>>> | commit 2d22bd5e74519854458ad372a89006e65f45e628
>>>> | Author: Mike Travis <[email protected]>
>>>> | Date: Wed Dec 31 18:08:46 2008 -0800
>>>> |
>>>> | x86: cleanup remaining cpumask_t code in microcode_core.c
>>>>
>>>> as the microcode is loaded during CPU onlining.
>>> yep, that's the bad one. Should i revert it or do you have a safe fix in
>>> mind?
>>>
>>> Ingo
>> Probably revert for now. There are a few more following patches that
>> also use 'work_on_cpu' so a better (more global?) fix should be used.
>>
>> Any thought on using a recursive lock for cpu-hotplug-lock? (At least
>> for get_online_cpus()?)
>
> but the problem has nothing to do with self-recursion. Take a look at the
> lockdep warning i posted (also below) - the locks are simply taken in the
> wrong order.
>
> your change adds this cpu_hotplug.lock usage:
>
> [ 43.652000] -> #1 (&cpu_hotplug.lock){--..}:
> [ 43.652000] [<ffffffff8027a7c0>] __lock_acquire+0xf10/0x1360
> [ 43.652000] [<ffffffff8027aca9>] lock_acquire+0x99/0xd0
> [ 43.652000] [<ffffffff809b5e4a>] __mutex_lock_common+0xaa/0x450
> [ 43.652000] [<ffffffff809b62cf>] mutex_lock_nested+0x3f/0x50
> [ 43.652000] [<ffffffff802516ba>] get_online_cpus+0x3a/0x50
> [ 43.652000] [<ffffffff802648dc>] work_on_cpu+0x6c/0xc0
> [ 43.652000] [<ffffffff8022b2a2>] mc_sysdev_add+0x92/0xa0
> [ 43.652000] [<ffffffff8050a800>] sysdev_driver_register+0xb0/0x140
> [ 43.652000] [<ffffffff8163c792>] microcode_init+0xb2/0x13b
> [ 43.652000] [<ffffffff8020a041>] do_one_initcall+0x41/0x180
> [ 43.652000] [<ffffffff8162e6cb>] kernel_init+0x145/0x19d
> [ 43.652000] [<ffffffff802146aa>] child_rip+0xa/0x20
> [ 43.652000] [<ffffffffffffffff>] 0xffffffffffffffff
>
> which nests it inside sysdev_drivers_lock - which is wrong
> [sysdev_drivers_lock is a pretty low-level lock that generally nests inside
> the CPU hotplug lock].
>
> If you want to use work_on_cpu() it should be done on a higher level, so
> that sysdev_drivers_lock is taken after the hotplug lock.
>
> Ingo

Ok, thanks, I will look in that direction.

Mike

2009-01-03 17:45:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Mike Travis <[email protected]> wrote:

> > yep, that's the bad one. Should i revert it or do you have a safe fix
> > in mind?
>
> Probably revert for now. [...]

done.

But -tip testing found another bug today as well, a boot crash with
certain (rare) 64-bit configs:

[ 1.588202] ACPI: PCI Interrupt Link [LNKA] (IRQs<1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 1.588012] IP: [<ffffffff80239778>] find_busiest_group+0x198/0xa20

[ 1.588048] Call Trace:
[ 1.588049] <IRQ> <0> [<ffffffff80240f96>] rebalance_domains+0x196/0x5e0
[ 1.588052] [<ffffffff80270b15>] ? lock_release_holdtime+0x35/0x1e0
[ 1.588055] [<ffffffff80983f70>] ? _spin_unlock_irq+0x30/0x40
[ 1.588058] [<ffffffff8024300e>] run_rebalance_domains+0x4e/0x120
[ 1.588060] [<ffffffff8024f80c>] __do_softirq+0xac/0x190
[ 1.588063] [<ffffffff8020d13c>] call_softirq+0x1c/0x30
[ 1.588066] [<ffffffff8020ef35>] do_softirq+0x75/0xa0
[ 1.588067] [<ffffffff8024f47d>] irq_exit+0x9d/0xb0
[ 1.588069] [<ffffffff80984f5d>] smp_apic_timer_interrupt+0x8d/0xc3
[ 1.588071] [<ffffffff8020cb73>] apic_timer_interrupt+0x13/0x20

i just bisected it back to:

| 74c5409893751c400547184751410c61930043b2 is first bad commit
| commit 74c5409893751c400547184751410c61930043b2
| Author: Mike Travis <[email protected]>
| Date: Wed Dec 31 18:08:45 2008 -0800
|
| x86: cleanup remaining cpumask_t ops in smpboot code
|
| Impact: Reduce memory usage and use new cpumask API.

this is in the final pieces of changes you did after pulling Rusty's tree:

26e2013: x86: setup_per_cpu_areas() cleanup
44aa683: cpumask: fix compile error when CONFIG_NR_CPUS is not defined
eeff031: cpumask: use alloc_cpumask_var_node where appropriate
40fbcb0: cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var
197e99c: x86: use cpumask_var_t in acpi/boot.c
2d22bd5: x86: cleanup remaining cpumask_t code in microcode_core.c
22022f5: x86: cleanup remaining cpumask_t code in mce_amd_64.c
b5f3096: x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
efb897c: sched: put back some stack hog changes that were undone in kernel/sched.c
74c5409: x86: cleanup remaining cpumask_t ops in smpboot code
8627b2a: x86: enable cpus display of kernel_max and offlined cpus
095fb96: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/ru

i think i'll just rebase the tail portion of cpus4096 starting at 8627b2a
- this keeps most of the history intact and avoids these ugly reverts.

Also, while bisecting this window of commits i found that neither would
build successfully due to a typo - and the typo is fixed in 095fb96. So
since we rebase this portion anyway due to excessive amount of bugs, i'll
make it fully bisectable by rebasing right at 095fb96, backmerge the
fixlet from eeff031 and redo the whole series dropping the two bad
patches. Since this portion of the tree has no appreciable testing value
the rebase is the right thing to do here.

Ingo

2009-01-03 18:13:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PULL] cpumask tree


* Ingo Molnar <[email protected]> wrote:

> Also, while bisecting this window of commits i found that neither would
> build successfully due to a typo - and the typo is fixed in 095fb96. So
> since we rebase this portion anyway due to excessive amount of bugs,
> i'll make it fully bisectable by rebasing right at 095fb96, backmerge
> the fixlet from eeff031 and redo the whole series dropping the two bad
> patches. Since this portion of the tree has no appreciable testing value
> the rebase is the right thing to do here.

okay, i've done this now and pushed out the resulting tree to
tip/cpus4096-v2. It is a no-content-changed reshuffle of the tail ~10
commits of tip/cpus4096:

$ git diff cpus4096..cpus4096-v2
$

i dropped the two broken patches i bisected today, backmerged a fixlet to
make it more bisectable, and reordered the fixes next to the merge point -
the new changes now come after that.

this finally is something that has no known regressions and looks pushable
to Linus. I've started the -tip tests, let's see how well this holds up
now.

Ingo

2009-01-03 18:14:36

by Mike Travis

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>>> yep, that's the bad one. Should i revert it or do you have a safe fix
>>> in mind?
>> Probably revert for now. [...]
>
> done.
>
> But -tip testing found another bug today as well, a boot crash with
> certain (rare) 64-bit configs:
>
> [ 1.588202] ACPI: PCI Interrupt Link [LNKA] (IRQs<1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> [ 1.588012] IP: [<ffffffff80239778>] find_busiest_group+0x198/0xa20
>
> [ 1.588048] Call Trace:
> [ 1.588049] <IRQ> <0> [<ffffffff80240f96>] rebalance_domains+0x196/0x5e0
> [ 1.588052] [<ffffffff80270b15>] ? lock_release_holdtime+0x35/0x1e0
> [ 1.588055] [<ffffffff80983f70>] ? _spin_unlock_irq+0x30/0x40
> [ 1.588058] [<ffffffff8024300e>] run_rebalance_domains+0x4e/0x120
> [ 1.588060] [<ffffffff8024f80c>] __do_softirq+0xac/0x190
> [ 1.588063] [<ffffffff8020d13c>] call_softirq+0x1c/0x30
> [ 1.588066] [<ffffffff8020ef35>] do_softirq+0x75/0xa0
> [ 1.588067] [<ffffffff8024f47d>] irq_exit+0x9d/0xb0
> [ 1.588069] [<ffffffff80984f5d>] smp_apic_timer_interrupt+0x8d/0xc3
> [ 1.588071] [<ffffffff8020cb73>] apic_timer_interrupt+0x13/0x20
>
> i just bisected it back to:
>
> | 74c5409893751c400547184751410c61930043b2 is first bad commit
> | commit 74c5409893751c400547184751410c61930043b2
> | Author: Mike Travis <[email protected]>
> | Date: Wed Dec 31 18:08:45 2008 -0800
> |
> | x86: cleanup remaining cpumask_t ops in smpboot code
> |
> | Impact: Reduce memory usage and use new cpumask API.
>
> this is in the final pieces of changes you did after pulling Rusty's tree:
>
> 26e2013: x86: setup_per_cpu_areas() cleanup
> 44aa683: cpumask: fix compile error when CONFIG_NR_CPUS is not defined
> eeff031: cpumask: use alloc_cpumask_var_node where appropriate
> 40fbcb0: cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var
> 197e99c: x86: use cpumask_var_t in acpi/boot.c
> 2d22bd5: x86: cleanup remaining cpumask_t code in microcode_core.c
> 22022f5: x86: cleanup remaining cpumask_t code in mce_amd_64.c
> b5f3096: x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
> efb897c: sched: put back some stack hog changes that were undone in kernel/sched.c
> 74c5409: x86: cleanup remaining cpumask_t ops in smpboot code
> 8627b2a: x86: enable cpus display of kernel_max and offlined cpus
> 095fb96: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/ru
>
> i think i'll just rebase the tail portion of cpus4096 starting at 8627b2a
> - this keeps most of the history intact and avoids these ugly reverts.
>
> Also, while bisecting this window of commits i found that neither would
> build successfully due to a typo - and the typo is fixed in 095fb96. So
> since we rebase this portion anyway due to excessive amount of bugs, i'll
> make it fully bisectable by rebasing right at 095fb96, backmerge the
> fixlet from eeff031 and redo the whole series dropping the two bad
> patches. Since this portion of the tree has no appreciable testing value
> the rebase is the right thing to do here.
>
> Ingo

Ok, thanks. Still working through my queue... I'll re-pull when you've
got your part done.

Mike

2009-01-03 19:39:53

by Ingo Molnar

[permalink] [raw]
Subject: [git pull] cpus4096 tree, part 3


* Linus Torvalds <[email protected]> wrote:

> On Fri, 2 Jan 2009, Ingo Molnar wrote:
> >
> > Linus, would you like to pull that, despite the pending regressions?
> > Can send a pull request right away.
>
> No, if there are known pending regressions, please get those fixed
> first.

ok. The pending regressions are all fixed now, and i've just finished my
standard tests on the latest tree and all the tests passed fine.

Please pull the cpus4096-for-linus-3 git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git cpus4096-for-linus-3

Thanks,

Ingo

------------------>
Cyrill Gorcunov (1):
x86: setup_per_cpu_areas() cleanup

Ingo Molnar (2):
cpumask: convert RCU implementations, fix
ia64: cpumask fix for is_affinity_mask_valid()

Li Zefan (1):
cpumask: fix bogus kernel-doc

Mike Travis (9):
cpumask: Add alloc_cpumask_var_node()
cpumask: documentation for cpumask_var_t
cpumask: add sysfs displays for configured and disabled cpu maps
sysfs: add documentation to cputopology.txt for system cpumasks
x86: enable cpus display of kernel_max and offlined cpus
sched: put back some stack hog changes that were undone in kernel/sched.c
x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
cpumask: use alloc_cpumask_var_node where appropriate
cpumask: fix compile error when CONFIG_NR_CPUS is not defined

Rusty Russell (63):
cpumask: x86: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: sparc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: sh: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: powerpc: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: IA64: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: Mips: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: alpha: Introduce cpumask_of_{node,pcibus} to replace {node,pcibus}_to_cpumask
cpumask: cpu_coregroup_mask(): x86
cpumask: cpu_coregroup_mask(): sparc
cpumask: cpu_coregroup_mask(): s390
cpumask: Replace cpu_coregroup_map with cpu_coregroup_mask
cpumask: make CONFIG_NR_CPUS always valid.
bitmap: test for constant as well as small size for inline versions
bitmap: fix seq_bitmap and seq_cpumask to take const pointer
cpumask: switch over to cpu_online/possible/active/present_mask: core
cpumask: make cpumask.h eat its own dogfood.
cpumask: make set_cpu_*/init_cpu_* out-of-line
cpumask: smp_call_function_many()
cpumask: arch_send_call_function_ipi_mask: core
cpumask: use for_each_online_cpu() in drivers/infiniband/hw/ehca/ehca_irq.c
cpumask: use new cpumask API in drivers/infiniband/hw/ehca
cpumask: use new cpumask API in drivers/infiniband/hw/ipath
cpumask: Use nr_cpu_ids in seq_cpumask
cpumask: Remove IA64 definition of total_cpus now it's in core code
percpu: fix percpu accessors to potentially !cpu_possible() cpus: pnpbios
percpu: fix percpu accessors to potentially !cpu_possible() cpus: m32r
cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.: core
cpumask: Use accessors code in core
parisc: remove gratuitous cpu_online_map declaration.
avr32: define __fls
blackfin: define __fls
m68k: define __fls
m68knommu: define __fls
bitmap: find_last_bit()
cpumask: Use find_last_bit()
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): sparc
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): s390
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): powerpc
cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): ia64
cpumask: convert kernel trace functions
cpumask: convert kernel trace functions further
cpumask: remove any_online_cpu() users: kernel/
cpumask: remove any_online_cpu() users: mm/
cpumask: convert kernel/compat.c
cpumask: convert kernel/workqueue.c
cpumask: convert kernel time functions
cpumask: convert kernel/irq
cpumask: convert RCU implementations
cpumask: convert kernel/profile.c
cpumask: convert kernel/cpu.c
cpumask: convert rest of files in kernel/
cpumask: convert mm/
cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
cpumask: zero extra bits in alloc_cpumask_var_node
cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
cris: define __fls
frv: define __fls
h8300: define __fls
m32r: define __fls
mn10300: define __fls
xtensa: define __fls
x86: use cpumask_var_t in acpi/boot.c
cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t

Sergio Luis (1):
x86: mark get_cpu_leaves() with __cpuinit annotation


Documentation/cputopology.txt | 48 ++++++
arch/alpha/include/asm/topology.h | 17 ++
arch/alpha/kernel/irq.c | 3 +-
arch/alpha/kernel/setup.c | 5 +
arch/avr32/include/asm/bitops.h | 5 +
arch/blackfin/include/asm/bitops.h | 1 +
arch/cris/include/asm/bitops.h | 1 +
arch/h8300/include/asm/bitops.h | 1 +
arch/ia64/include/asm/irq.h | 2 +-
arch/ia64/include/asm/topology.h | 9 +-
arch/ia64/kernel/acpi.c | 3 +-
arch/ia64/kernel/iosapic.c | 23 ++--
arch/ia64/kernel/irq.c | 4 +-
arch/ia64/sn/kernel/sn2/sn_hwperf.c | 27 ++--
arch/m32r/kernel/smpboot.c | 2 +-
arch/m68knommu/include/asm/bitops.h | 1 +
arch/mips/include/asm/mach-ip27/topology.h | 4 +-
arch/parisc/include/asm/smp.h | 2 -
arch/powerpc/include/asm/topology.h | 12 +-
arch/powerpc/platforms/cell/spu_priv1_mmio.c | 6 +-
arch/powerpc/platforms/cell/spufs/sched.c | 4 +-
arch/s390/include/asm/topology.h | 2 +
arch/s390/kernel/topology.c | 5 +
arch/sh/include/asm/topology.h | 1 +
arch/sparc/include/asm/topology_64.h | 13 +-
arch/sparc/kernel/of_device_64.c | 2 +-
arch/sparc/kernel/pci_msi.c | 2 +-
arch/x86/include/asm/es7000/apic.h | 32 +----
arch/x86/include/asm/lguest.h | 2 +-
arch/x86/include/asm/numaq/apic.h | 4 +-
arch/x86/include/asm/pci.h | 10 +-
arch/x86/include/asm/summit/apic.h | 42 +----
arch/x86/include/asm/topology.h | 36 +++--
arch/x86/kernel/acpi/boot.c | 31 +++-
arch/x86/kernel/apic.c | 4 +-
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 28 +++-
arch/x86/kernel/cpu/cpufreq/powernow-k7.c | 9 +
arch/x86/kernel/cpu/cpufreq/powernow-k8.c | 24 ++-
arch/x86/kernel/cpu/intel_cacheinfo.c | 2 +-
arch/x86/kernel/cpuid.c | 2 +-
arch/x86/kernel/io_apic.c | 6 +-
arch/x86/kernel/msr.c | 2 +-
arch/x86/kernel/reboot.c | 4 +-
arch/x86/kernel/setup_percpu.c | 33 ++---
arch/x86/kernel/smpboot.c | 15 ++-
arch/x86/mach-voyager/voyager_smp.c | 7 +-
block/blk.h | 4 +-
drivers/acpi/processor_core.c | 14 ++-
drivers/acpi/processor_perflib.c | 28 ++--
drivers/acpi/processor_throttling.c | 80 ++++++----
drivers/base/cpu.c | 44 +++++
drivers/infiniband/hw/ehca/ehca_irq.c | 17 +-
drivers/infiniband/hw/ipath/ipath_file_ops.c | 8 +-
drivers/pnp/pnpbios/bioscalls.c | 2 +-
fs/seq_file.c | 3 +-
include/acpi/processor.h | 4 +-
include/asm-frv/bitops.h | 13 ++
include/asm-m32r/bitops.h | 1 +
include/asm-m68k/bitops.h | 5 +
include/asm-mn10300/bitops.h | 11 ++
include/asm-xtensa/bitops.h | 11 ++
include/linux/bitmap.h | 35 +++--
include/linux/bitops.h | 13 ++-
include/linux/cpumask.h | 221 +++++++++++---------------
include/linux/interrupt.h | 2 +-
include/linux/rcuclassic.h | 4 +-
include/linux/seq_file.h | 7 +-
include/linux/smp.h | 18 ++-
include/linux/stop_machine.h | 6 +-
include/linux/threads.h | 16 +-
include/linux/tick.h | 4 +-
init/main.c | 13 +-
kernel/compat.c | 49 ++++---
kernel/cpu.c | 144 ++++++++++++------
kernel/irq/manage.c | 11 +-
kernel/irq/proc.c | 34 +++--
kernel/kexec.c | 2 +-
kernel/power/poweroff.c | 2 +-
kernel/profile.c | 38 +++--
kernel/rcuclassic.c | 32 ++--
kernel/rcupreempt.c | 19 ++-
kernel/rcutorture.c | 27 ++--
kernel/sched.c | 53 ++-----
kernel/sched_rt.c | 3 +-
kernel/smp.c | 145 +++++++-----------
kernel/softirq.c | 2 +-
kernel/softlockup.c | 10 +-
kernel/stop_machine.c | 8 +-
kernel/taskstats.c | 39 +++--
kernel/time/clocksource.c | 9 +-
kernel/time/tick-broadcast.c | 115 +++++++-------
kernel/time/tick-common.c | 6 +-
kernel/trace/ring_buffer.c | 42 +++--
kernel/trace/trace.c | 72 ++++++---
kernel/trace/trace.h | 2 +-
kernel/trace/trace_boot.c | 2 +-
kernel/trace/trace_functions_graph.c | 2 +-
kernel/trace/trace_hw_branches.c | 6 +-
kernel/trace/trace_power.c | 2 +-
kernel/trace/trace_sysprof.c | 13 +--
kernel/workqueue.c | 26 ++--
lib/Kconfig | 8 +
lib/Makefile | 1 +
lib/cpumask.c | 62 +++++++-
lib/find_last_bit.c | 45 ++++++
mm/pdflush.c | 16 ++-
mm/slab.c | 2 +-
mm/slub.c | 20 ++-
mm/vmscan.c | 4 +-
mm/vmstat.c | 4 +-
security/selinux/selinuxfs.c | 2 +-
112 files changed, 1285 insertions(+), 878 deletions(-)
create mode 100644 lib/find_last_bit.c

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index bd699da..45932ec 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -31,3 +31,51 @@ not defined by include/asm-XXX/topology.h:
2) core_id: 0
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
+
+Additionally, cpu topology information is provided under
+/sys/devices/system/cpu and includes these files. The internal
+source for the output is in brackets ("[]").
+
+ kernel_max: the maximum cpu index allowed by the kernel configuration.
+ [NR_CPUS-1]
+
+ offline: cpus that are not online because they have been
+ HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
+ of cpus allowed by the kernel configuration (kernel_max
+ above). [~cpu_online_mask + cpus >= NR_CPUS]
+
+ online: cpus that are online and being scheduled [cpu_online_mask]
+
+ possible: cpus that have been allocated resources and can be
+ brought online if they are present. [cpu_possible_mask]
+
+ present: cpus that have been identified as being present in the
+ system. [cpu_present_mask]
+
+The format for the above output is compatible with cpulist_parse()
+[see <linux/cpumask.h>]. Some examples follow.
+
+In this example, there are 64 cpus in the system but cpus 32-63 exceed
+the kernel max which is limited to 0..31 by the NR_CPUS config option
+being 32. Note also that cpus 2 and 4-31 are not online but could be
+brought online as they are both present and possible.
+
+ kernel_max: 31
+ offline: 2,4-31,32-63
+ online: 0-1,3
+ possible: 0-31
+ present: 0-31
+
+In this example, the NR_CPUS config option is 128, but the kernel was
+started with possible_cpus=144. There are 4 cpus in the system and cpu2
+was manually taken offline (and is the only cpu that can be brought
+online.)
+
+ kernel_max: 127
+ offline: 2,4-127,128-143
+ online: 0-1,3
+ possible: 0-127
+ present: 0-3
+
+See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
+as well as more information on the various cpumask's.
diff --git a/arch/alpha/include/asm/topology.h b/arch/alpha/include/asm/topology.h
index 149532e..b4f284c 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -39,7 +39,24 @@ static inline cpumask_t node_to_cpumask(int node)
return node_cpu_mask;
}

+extern struct cpumask node_to_cpumask_map[];
+/* FIXME: This is dumb, recalculating every time. But simple. */
+static const struct cpumask *cpumask_of_node(int node)
+{
+ int cpu;
+
+ cpumask_clear(&node_to_cpumask_map[node]);
+
+ for_each_online_cpu(cpu) {
+ if (cpu_to_node(cpu) == node)
+ cpumask_set_cpu(cpu, &node_to_cpumask_map[node]);
+ }
+
+ return &node_to_cpumask_map[node];
+}
+
#define pcibus_to_cpumask(bus) (cpu_online_map)
+#define cpumask_of_pcibus(bus) (cpu_online_mask)

#endif /* !CONFIG_NUMA */
# include <asm-generic/topology.h>
diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index d0f1620..703731a 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -50,7 +50,8 @@ int irq_select_affinity(unsigned int irq)
if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
return 1;

- while (!cpu_possible(cpu) || !cpu_isset(cpu, irq_default_affinity))
+ while (!cpu_possible(cpu) ||
+ !cpumask_test_cpu(cpu, irq_default_affinity))
cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
last_cpu = cpu;

diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c
index a449e99..02bee69 100644
--- a/arch/alpha/kernel/setup.c
+++ b/arch/alpha/kernel/setup.c
@@ -79,6 +79,11 @@ int alpha_l3_cacheshape;
unsigned long alpha_verbose_mcheck = CONFIG_VERBOSE_MCHECK_ON;
#endif

+#ifdef CONFIG_NUMA
+struct cpumask node_to_cpumask_map[MAX_NUMNODES] __read_mostly;
+EXPORT_SYMBOL(node_to_cpumask_map);
+#endif
+
/* Which processor we booted from. */
int boot_cpuid;

diff --git a/arch/avr32/include/asm/bitops.h b/arch/avr32/include/asm/bitops.h
index 1a50b69..f7dd5f7 100644
--- a/arch/avr32/include/asm/bitops.h
+++ b/arch/avr32/include/asm/bitops.h
@@ -263,6 +263,11 @@ static inline int fls(unsigned long word)
return 32 - result;
}

+static inline int __fls(unsigned long word)
+{
+ return fls(word) - 1;
+}
+
unsigned long find_first_zero_bit(const unsigned long *addr,
unsigned long size);
unsigned long find_next_zero_bit(const unsigned long *addr,
diff --git a/arch/blackfin/include/asm/bitops.h b/arch/blackfin/include/asm/bitops.h
index b39a175..c428e41 100644
--- a/arch/blackfin/include/asm/bitops.h
+++ b/arch/blackfin/include/asm/bitops.h
@@ -213,6 +213,7 @@ static __inline__ int __test_bit(int nr, const void *addr)
#endif /* __KERNEL__ */

#include <asm-generic/bitops/fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>

#endif /* _BLACKFIN_BITOPS_H */
diff --git a/arch/cris/include/asm/bitops.h b/arch/cris/include/asm/bitops.h
index c0e62f8..9e69cfb 100644
--- a/arch/cris/include/asm/bitops.h
+++ b/arch/cris/include/asm/bitops.h
@@ -148,6 +148,7 @@ static inline int test_and_change_bit(int nr, volatile unsigned long *addr)
#define ffs kernel_ffs

#include <asm-generic/bitops/fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>
#include <asm-generic/bitops/hweight.h>
#include <asm-generic/bitops/find.h>
diff --git a/arch/h8300/include/asm/bitops.h b/arch/h8300/include/asm/bitops.h
index cb18e3b..cb9ddf5 100644
--- a/arch/h8300/include/asm/bitops.h
+++ b/arch/h8300/include/asm/bitops.h
@@ -207,6 +207,7 @@ static __inline__ unsigned long __ffs(unsigned long word)
#endif /* __KERNEL__ */

#include <asm-generic/bitops/fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>

#endif /* _H8300_BITOPS_H */
diff --git a/arch/ia64/include/asm/irq.h b/arch/ia64/include/asm/irq.h
index 3627116..36429a5 100644
--- a/arch/ia64/include/asm/irq.h
+++ b/arch/ia64/include/asm/irq.h
@@ -27,7 +27,7 @@ irq_canonicalize (int irq)
}

extern void set_irq_affinity_info (unsigned int irq, int dest, int redir);
-bool is_affinity_mask_valid(cpumask_t cpumask);
+bool is_affinity_mask_valid(cpumask_var_t cpumask);

#define is_affinity_mask_valid is_affinity_mask_valid

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a3cc9f6..76a33a9 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -34,6 +34,7 @@
* Returns a bitmask of CPUs on Node 'node'.
*/
#define node_to_cpumask(node) (node_to_cpu_mask[node])
+#define cpumask_of_node(node) (&node_to_cpu_mask[node])

/*
* Returns the number of the node containing Node 'nid'.
@@ -45,7 +46,7 @@
/*
* Returns the number of the first CPU on Node 'node'.
*/
-#define node_to_first_cpu(node) (first_cpu(node_to_cpumask(node)))
+#define node_to_first_cpu(node) (cpumask_first(cpumask_of_node(node)))

/*
* Determines the node for a given pci bus
@@ -109,6 +110,8 @@ void build_cpu_to_node_map(void);
#define topology_core_id(cpu) (cpu_data(cpu)->core_id)
#define topology_core_siblings(cpu) (cpu_core_map[cpu])
#define topology_thread_siblings(cpu) (per_cpu(cpu_sibling_map, cpu))
+#define topology_core_cpumask(cpu) (&cpu_core_map[cpu])
+#define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
#define smt_capable() (smp_num_siblings > 1)
#endif

@@ -119,6 +122,10 @@ extern void arch_fix_phys_package_id(int num, u32 slot);
node_to_cpumask(pcibus_to_node(bus)) \
)

+#define cpumask_of_pcibus(bus) (pcibus_to_node(bus) == -1 ? \
+ cpu_all_mask : \
+ cpumask_of_node(pcibus_to_node(bus)))
+
#include <asm-generic/topology.h>

#endif /* _ASM_IA64_TOPOLOGY_H */
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index bd7acc7..0553648 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -202,7 +202,6 @@ char *__init __acpi_map_table(unsigned long phys_addr, unsigned long size)
Boot-time Table Parsing
-------------------------------------------------------------------------- */

-static int total_cpus __initdata;
static int available_cpus __initdata;
struct acpi_table_madt *acpi_madt __initdata;
static u8 has_8259;
@@ -1001,7 +1000,7 @@ acpi_map_iosapic(acpi_handle handle, u32 depth, void *context, void **ret)
node = pxm_to_node(pxm);

if (node >= MAX_NUMNODES || !node_online(node) ||
- cpus_empty(node_to_cpumask(node)))
+ cpumask_empty(cpumask_of_node(node)))
return AE_OK;

/* We know a gsi to node mapping! */
diff --git a/arch/ia64/kernel/iosapic.c b/arch/ia64/kernel/iosapic.c
index c8adecd..5cfd3d9 100644
--- a/arch/ia64/kernel/iosapic.c
+++ b/arch/ia64/kernel/iosapic.c
@@ -695,32 +695,31 @@ get_target_cpu (unsigned int gsi, int irq)
#ifdef CONFIG_NUMA
{
int num_cpus, cpu_index, iosapic_index, numa_cpu, i = 0;
- cpumask_t cpu_mask;
+ const struct cpumask *cpu_mask;

iosapic_index = find_iosapic(gsi);
if (iosapic_index < 0 ||
iosapic_lists[iosapic_index].node == MAX_NUMNODES)
goto skip_numa_setup;

- cpu_mask = node_to_cpumask(iosapic_lists[iosapic_index].node);
- cpus_and(cpu_mask, cpu_mask, domain);
- for_each_cpu_mask(numa_cpu, cpu_mask) {
- if (!cpu_online(numa_cpu))
- cpu_clear(numa_cpu, cpu_mask);
+ cpu_mask = cpumask_of_node(iosapic_lists[iosapic_index].node);
+ num_cpus = 0;
+ for_each_cpu_and(numa_cpu, cpu_mask, &domain) {
+ if (cpu_online(numa_cpu))
+ num_cpus++;
}

- num_cpus = cpus_weight(cpu_mask);
-
if (!num_cpus)
goto skip_numa_setup;

/* Use irq assignment to distribute across cpus in node */
cpu_index = irq % num_cpus;

- for (numa_cpu = first_cpu(cpu_mask) ; i < cpu_index ; i++)
- numa_cpu = next_cpu(numa_cpu, cpu_mask);
+ for_each_cpu_and(numa_cpu, cpu_mask, &domain)
+ if (cpu_online(numa_cpu) && i++ >= cpu_index)
+ break;

- if (numa_cpu != NR_CPUS)
+ if (numa_cpu < nr_cpu_ids)
return cpu_physical_id(numa_cpu);
}
skip_numa_setup:
@@ -731,7 +730,7 @@ skip_numa_setup:
* case of NUMA.)
*/
do {
- if (++cpu >= NR_CPUS)
+ if (++cpu >= nr_cpu_ids)
cpu = 0;
} while (!cpu_online(cpu) || !cpu_isset(cpu, domain));

diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c
index 0b6db53..95ff16c 100644
--- a/arch/ia64/kernel/irq.c
+++ b/arch/ia64/kernel/irq.c
@@ -112,11 +112,11 @@ void set_irq_affinity_info (unsigned int irq, int hwid, int redir)
}
}

-bool is_affinity_mask_valid(cpumask_t cpumask)
+bool is_affinity_mask_valid(cpumask_var_t cpumask)
{
if (ia64_platform_is("sn2")) {
/* Only allow one CPU to be specified in the smp_affinity mask */
- if (cpus_weight(cpumask) != 1)
+ if (cpumask_weight(cpumask) != 1)
return false;
}
return true;
diff --git a/arch/ia64/sn/kernel/sn2/sn_hwperf.c b/arch/ia64/sn/kernel/sn2/sn_hwperf.c
index 636588e..be33947 100644
--- a/arch/ia64/sn/kernel/sn2/sn_hwperf.c
+++ b/arch/ia64/sn/kernel/sn2/sn_hwperf.c
@@ -385,7 +385,6 @@ static int sn_topology_show(struct seq_file *s, void *d)
int j;
const char *slabname;
int ordinal;
- cpumask_t cpumask;
char slice;
struct cpuinfo_ia64 *c;
struct sn_hwperf_port_info *ptdata;
@@ -473,23 +472,21 @@ static int sn_topology_show(struct seq_file *s, void *d)
* CPUs on this node, if any
*/
if (!SN_HWPERF_IS_IONODE(obj)) {
- cpumask = node_to_cpumask(ordinal);
- for_each_online_cpu(i) {
- if (cpu_isset(i, cpumask)) {
- slice = 'a' + cpuid_to_slice(i);
- c = cpu_data(i);
- seq_printf(s, "cpu %d %s%c local"
- " freq %luMHz, arch ia64",
- i, obj->location, slice,
- c->proc_freq / 1000000);
- for_each_online_cpu(j) {
- seq_printf(s, j ? ":%d" : ", dist %d",
- node_distance(
+ for_each_cpu_and(i, cpu_online_mask,
+ cpumask_of_node(ordinal)) {
+ slice = 'a' + cpuid_to_slice(i);
+ c = cpu_data(i);
+ seq_printf(s, "cpu %d %s%c local"
+ " freq %luMHz, arch ia64",
+ i, obj->location, slice,
+ c->proc_freq / 1000000);
+ for_each_online_cpu(j) {
+ seq_printf(s, j ? ":%d" : ", dist %d",
+ node_distance(
cpu_to_node(i),
cpu_to_node(j)));
- }
- seq_putc(s, '\n');
}
+ seq_putc(s, '\n');
}
}
}
diff --git a/arch/m32r/kernel/smpboot.c b/arch/m32r/kernel/smpboot.c
index 0f06b37..2547d6c 100644
--- a/arch/m32r/kernel/smpboot.c
+++ b/arch/m32r/kernel/smpboot.c
@@ -592,7 +592,7 @@ int setup_profiling_timer(unsigned int multiplier)
* accounting. At that time they also adjust their APIC timers
* accordingly.
*/
- for (i = 0; i < NR_CPUS; ++i)
+ for_each_possible_cpu(i)
per_cpu(prof_multiplier, i) = multiplier;

return 0;
diff --git a/arch/m68knommu/include/asm/bitops.h b/arch/m68knommu/include/asm/bitops.h
index 6f3685e..9d3cbe5 100644
--- a/arch/m68knommu/include/asm/bitops.h
+++ b/arch/m68knommu/include/asm/bitops.h
@@ -331,6 +331,7 @@ found_middle:
#endif /* __KERNEL__ */

#include <asm-generic/bitops/fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>

#endif /* _M68KNOMMU_BITOPS_H */
diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h
index 1fb959f..55d4815 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -25,11 +25,13 @@ extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
#define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid)
#define parent_node(node) (node)
#define node_to_cpumask(node) (hub_data(node)->h_cpus)
-#define node_to_first_cpu(node) (first_cpu(node_to_cpumask(node)))
+#define cpumask_of_node(node) (&hub_data(node)->h_cpus)
+#define node_to_first_cpu(node) (cpumask_first(cpumask_of_node(node)))
struct pci_bus;
extern int pcibus_to_node(struct pci_bus *);

#define pcibus_to_cpumask(bus) (cpu_online_map)
+#define cpumask_of_pcibus(bus) (cpu_online_mask)

extern unsigned char __node_distances[MAX_COMPACT_NODES][MAX_COMPACT_NODES];

diff --git a/arch/parisc/include/asm/smp.h b/arch/parisc/include/asm/smp.h
index 409e698..6ef4b78 100644
--- a/arch/parisc/include/asm/smp.h
+++ b/arch/parisc/include/asm/smp.h
@@ -16,8 +16,6 @@
#include <linux/cpumask.h>
typedef unsigned long address_t;

-extern cpumask_t cpu_online_map;
-

/*
* Private routines/data
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 373fca3..3752585 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -22,11 +22,11 @@ static inline cpumask_t node_to_cpumask(int node)
return numa_cpumask_lookup_table[node];
}

+#define cpumask_of_node(node) (&numa_cpumask_lookup_table[node])
+
static inline int node_to_first_cpu(int node)
{
- cpumask_t tmp;
- tmp = node_to_cpumask(node);
- return first_cpu(tmp);
+ return cpumask_first(cpumask_of_node(node));
}

int of_node_to_nid(struct device_node *device);
@@ -46,6 +46,10 @@ static inline int pcibus_to_node(struct pci_bus *bus)
node_to_cpumask(pcibus_to_node(bus)) \
)

+#define cpumask_of_pcibus(bus) (pcibus_to_node(bus) == -1 ? \
+ cpu_all_mask : \
+ cpumask_of_node(pcibus_to_node(bus)))
+
/* sched_domains SD_NODE_INIT for PPC64 machines */
#define SD_NODE_INIT (struct sched_domain) { \
.parent = NULL, \
@@ -108,6 +112,8 @@ static inline void sysfs_remove_device_from_node(struct sys_device *dev,

#define topology_thread_siblings(cpu) (per_cpu(cpu_sibling_map, cpu))
#define topology_core_siblings(cpu) (per_cpu(cpu_core_map, cpu))
+#define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
+#define topology_core_cpumask(cpu) (&per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
#endif
#endif
diff --git a/arch/powerpc/platforms/cell/spu_priv1_mmio.c b/arch/powerpc/platforms/cell/spu_priv1_mmio.c
index 906a0a2..1410443 100644
--- a/arch/powerpc/platforms/cell/spu_priv1_mmio.c
+++ b/arch/powerpc/platforms/cell/spu_priv1_mmio.c
@@ -80,10 +80,10 @@ static void cpu_affinity_set(struct spu *spu, int cpu)
u64 route;

if (nr_cpus_node(spu->node)) {
- cpumask_t spumask = node_to_cpumask(spu->node);
- cpumask_t cpumask = node_to_cpumask(cpu_to_node(cpu));
+ const struct cpumask *spumask = cpumask_of_node(spu->node),
+ *cpumask = cpumask_of_node(cpu_to_node(cpu));

- if (!cpus_intersects(spumask, cpumask))
+ if (!cpumask_intersects(spumask, cpumask))
return;
}

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 2ad914c..6a0ad19 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -166,9 +166,9 @@ void spu_update_sched_info(struct spu_context *ctx)
static int __node_allowed(struct spu_context *ctx, int node)
{
if (nr_cpus_node(node)) {
- cpumask_t mask = node_to_cpumask(node);
+ const struct cpumask *mask = cpumask_of_node(node);

- if (cpus_intersects(mask, ctx->cpus_allowed))
+ if (cpumask_intersects(mask, &ctx->cpus_allowed))
return 1;
}

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index d96c916..c93eb50 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -6,10 +6,12 @@
#define mc_capable() (1)

cpumask_t cpu_coregroup_map(unsigned int cpu);
+const struct cpumask *cpu_coregroup_mask(unsigned int cpu);

extern cpumask_t cpu_core_map[NR_CPUS];

#define topology_core_siblings(cpu) (cpu_core_map[cpu])
+#define topology_core_cpumask(cpu) (&cpu_core_map[cpu])

int topology_set_cpu_management(int fc);
void topology_schedule_update(void);
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 90e9ba1..cc362c9 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -97,6 +97,11 @@ cpumask_t cpu_coregroup_map(unsigned int cpu)
return mask;
}

+const struct cpumask *cpu_coregroup_mask(unsigned int cpu)
+{
+ return &cpu_core_map[cpu];
+}
+
static void add_cpus_to_core(struct tl_cpu *tl_cpu, struct core_info *core)
{
unsigned int cpu;
diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 279d9cc..066f0fb 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -32,6 +32,7 @@
#define parent_node(node) ((void)(node),0)

#define node_to_cpumask(node) ((void)node, cpu_online_map)
+#define cpumask_of_node(node) ((void)node, cpu_online_mask)
#define node_to_first_cpu(node) ((void)(node),0)

#define pcibus_to_node(bus) ((void)(bus), -1)
diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 001c040..b8a65b6 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -16,8 +16,12 @@ static inline cpumask_t node_to_cpumask(int node)
{
return numa_cpumask_lookup_table[node];
}
+#define cpumask_of_node(node) (&numa_cpumask_lookup_table[node])

-/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
+/*
+ * Returns a pointer to the cpumask of CPUs on Node 'node'.
+ * Deprecated: use "const struct cpumask *mask = cpumask_of_node(node)"
+ */
#define node_to_cpumask_ptr(v, node) \
cpumask_t *v = &(numa_cpumask_lookup_table[node])

@@ -26,9 +30,7 @@ static inline cpumask_t node_to_cpumask(int node)

static inline int node_to_first_cpu(int node)
{
- cpumask_t tmp;
- tmp = node_to_cpumask(node);
- return first_cpu(tmp);
+ return cpumask_first(cpumask_of_node(node));
}

struct pci_bus;
@@ -77,10 +79,13 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
#define topology_core_id(cpu) (cpu_data(cpu).core_id)
#define topology_core_siblings(cpu) (cpu_core_map[cpu])
#define topology_thread_siblings(cpu) (per_cpu(cpu_sibling_map, cpu))
+#define topology_core_cpumask(cpu) (&cpu_core_map[cpu])
+#define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
#define mc_capable() (sparc64_multi_core)
#define smt_capable() (sparc64_multi_core)
#endif /* CONFIG_SMP */

#define cpu_coregroup_map(cpu) (cpu_core_map[cpu])
+#define cpu_coregroup_mask(cpu) (&cpu_core_map[cpu])

#endif /* _ASM_SPARC64_TOPOLOGY_H */
diff --git a/arch/sparc/kernel/of_device_64.c b/arch/sparc/kernel/of_device_64.c
index 322046c..4873f28 100644
--- a/arch/sparc/kernel/of_device_64.c
+++ b/arch/sparc/kernel/of_device_64.c
@@ -778,7 +778,7 @@ static unsigned int __init build_one_device_irq(struct of_device *op,
out:
nid = of_node_to_nid(dp);
if (nid != -1) {
- cpumask_t numa_mask = node_to_cpumask(nid);
+ cpumask_t numa_mask = *cpumask_of_node(nid);

irq_set_affinity(irq, &numa_mask);
}
diff --git a/arch/sparc/kernel/pci_msi.c b/arch/sparc/kernel/pci_msi.c
index 0d0cd81..4ef282e 100644
--- a/arch/sparc/kernel/pci_msi.c
+++ b/arch/sparc/kernel/pci_msi.c
@@ -286,7 +286,7 @@ static int bringup_one_msi_queue(struct pci_pbm_info *pbm,

nid = pbm->numa_node;
if (nid != -1) {
- cpumask_t numa_mask = node_to_cpumask(nid);
+ cpumask_t numa_mask = *cpumask_of_node(nid);

irq_set_affinity(irq, &numa_mask);
}
diff --git a/arch/x86/include/asm/es7000/apic.h b/arch/x86/include/asm/es7000/apic.h
index 51ac123..bc53d5e 100644
--- a/arch/x86/include/asm/es7000/apic.h
+++ b/arch/x86/include/asm/es7000/apic.h
@@ -157,7 +157,7 @@ cpu_mask_to_apicid_cluster(const struct cpumask *cpumask)

num_bits_set = cpumask_weight(cpumask);
/* Return id to all */
- if (num_bits_set == NR_CPUS)
+ if (num_bits_set == nr_cpu_ids)
return 0xFF;
/*
* The cpus in the mask must all be on the apic cluster. If are not
@@ -190,7 +190,7 @@ static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask)

num_bits_set = cpus_weight(*cpumask);
/* Return id to all */
- if (num_bits_set == NR_CPUS)
+ if (num_bits_set == nr_cpu_ids)
return cpu_to_logical_apicid(0);
/*
* The cpus in the mask must all be on the apic cluster. If are not
@@ -218,9 +218,6 @@ static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask)
static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *inmask,
const struct cpumask *andmask)
{
- int num_bits_set;
- int cpus_found = 0;
- int cpu;
int apicid = cpu_to_logical_apicid(0);
cpumask_var_t cpumask;

@@ -229,31 +226,8 @@ static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *inmask,

cpumask_and(cpumask, inmask, andmask);
cpumask_and(cpumask, cpumask, cpu_online_mask);
+ apicid = cpu_mask_to_apicid(cpumask);

- num_bits_set = cpumask_weight(cpumask);
- /* Return id to all */
- if (num_bits_set == NR_CPUS)
- goto exit;
- /*
- * The cpus in the mask must all be on the apic cluster. If are not
- * on the same apicid cluster return default value of TARGET_CPUS.
- */
- cpu = cpumask_first(cpumask);
- apicid = cpu_to_logical_apicid(cpu);
- while (cpus_found < num_bits_set) {
- if (cpumask_test_cpu(cpu, cpumask)) {
- int new_apicid = cpu_to_logical_apicid(cpu);
- if (apicid_cluster(apicid) !=
- apicid_cluster(new_apicid)){
- printk ("%s: Not a valid mask!\n", __func__);
- return cpu_to_logical_apicid(0);
- }
- apicid = new_apicid;
- cpus_found++;
- }
- cpu++;
- }
-exit:
free_cpumask_var(cpumask);
return apicid;
}
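
The es7000 (and, further down, summit) cpu_mask_to_apicid_and() now just
allocates a scratch mask, ands the inputs together and defers to
cpu_mask_to_apicid(). The scratch-mask idiom itself, sketched generically
(the function and names are illustrative, not from the patch):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int weight_of_intersection(const struct cpumask *a,
                                  const struct cpumask *b)
{
        cpumask_var_t tmp;
        int ret;

        /* A real allocation only with CONFIG_CPUMASK_OFFSTACK=y;
         * otherwise cpumask_var_t is a one-element array and this is a no-op. */
        if (!alloc_cpumask_var(&tmp, GFP_ATOMIC))
                return -ENOMEM;

        cpumask_and(tmp, a, b);
        ret = cpumask_weight(tmp);

        free_cpumask_var(tmp);
        return ret;
}
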
diff --git a/arch/x86/include/asm/lguest.h b/arch/x86/include/asm/lguest.h
index d28a507..1caf576 100644
--- a/arch/x86/include/asm/lguest.h
+++ b/arch/x86/include/asm/lguest.h
@@ -15,7 +15,7 @@
#define SHARED_SWITCHER_PAGES \
DIV_ROUND_UP(end_switcher_text - start_switcher_text, PAGE_SIZE)
/* Pages for switcher itself, then two pages per cpu */
-#define TOTAL_SWITCHER_PAGES (SHARED_SWITCHER_PAGES + 2 * NR_CPUS)
+#define TOTAL_SWITCHER_PAGES (SHARED_SWITCHER_PAGES + 2 * nr_cpu_ids)

/* We map at -4M for ease of mapping into the guest (one PTE page). */
#define SWITCHER_ADDR 0xFFC00000
diff --git a/arch/x86/include/asm/numaq/apic.h b/arch/x86/include/asm/numaq/apic.h
index c80f00d..bf37bc4 100644
--- a/arch/x86/include/asm/numaq/apic.h
+++ b/arch/x86/include/asm/numaq/apic.h
@@ -63,8 +63,8 @@ static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_map)
extern u8 cpu_2_logical_apicid[];
static inline int cpu_to_logical_apicid(int cpu)
{
- if (cpu >= NR_CPUS)
- return BAD_APICID;
+ if (cpu >= nr_cpu_ids)
+ return BAD_APICID;
return (int)cpu_2_logical_apicid[cpu];
}

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 66834c4..a977de2 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -102,9 +102,9 @@ extern void pci_iommu_alloc(void);

#ifdef CONFIG_NUMA
/* Returns the node based on pci bus */
-static inline int __pcibus_to_node(struct pci_bus *bus)
+static inline int __pcibus_to_node(const struct pci_bus *bus)
{
- struct pci_sysdata *sd = bus->sysdata;
+ const struct pci_sysdata *sd = bus->sysdata;

return sd->node;
}
@@ -113,6 +113,12 @@ static inline cpumask_t __pcibus_to_cpumask(struct pci_bus *bus)
{
return node_to_cpumask(__pcibus_to_node(bus));
}
+
+static inline const struct cpumask *
+cpumask_of_pcibus(const struct pci_bus *bus)
+{
+ return cpumask_of_node(__pcibus_to_node(bus));
+}
#endif

#endif /* _ASM_X86_PCI_H */
diff --git a/arch/x86/include/asm/summit/apic.h b/arch/x86/include/asm/summit/apic.h
index 99327d1..4bb5fb3 100644
--- a/arch/x86/include/asm/summit/apic.h
+++ b/arch/x86/include/asm/summit/apic.h
@@ -52,7 +52,7 @@ static inline void init_apic_ldr(void)
int i;

/* Create logical APIC IDs by counting CPUs already in cluster. */
- for (count = 0, i = NR_CPUS; --i >= 0; ) {
+ for (count = 0, i = nr_cpu_ids; --i >= 0; ) {
lid = cpu_2_logical_apicid[i];
if (lid != BAD_APICID && apicid_cluster(lid) == my_cluster)
++count;
@@ -97,8 +97,8 @@ static inline int apicid_to_node(int logical_apicid)
static inline int cpu_to_logical_apicid(int cpu)
{
#ifdef CONFIG_SMP
- if (cpu >= NR_CPUS)
- return BAD_APICID;
+ if (cpu >= nr_cpu_ids)
+ return BAD_APICID;
return (int)cpu_2_logical_apicid[cpu];
#else
return logical_smp_processor_id();
@@ -107,7 +107,7 @@ static inline int cpu_to_logical_apicid(int cpu)

static inline int cpu_present_to_apicid(int mps_cpu)
{
- if (mps_cpu < NR_CPUS)
+ if (mps_cpu < nr_cpu_ids)
return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu);
else
return BAD_APICID;
@@ -146,7 +146,7 @@ static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask)

num_bits_set = cpus_weight(*cpumask);
/* Return id to all */
- if (num_bits_set == NR_CPUS)
+ if (num_bits_set >= nr_cpu_ids)
return (int) 0xFF;
/*
* The cpus in the mask must all be on the apic cluster. If are not
@@ -173,42 +173,16 @@ static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask)
static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *inmask,
const struct cpumask *andmask)
{
- int num_bits_set;
- int cpus_found = 0;
- int cpu;
- int apicid = 0xFF;
+ int apicid = cpu_to_logical_apicid(0);
cpumask_var_t cpumask;

if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC))
- return (int) 0xFF;
+ return apicid;

cpumask_and(cpumask, inmask, andmask);
cpumask_and(cpumask, cpumask, cpu_online_mask);
+ apicid = cpu_mask_to_apicid(cpumask);

- num_bits_set = cpumask_weight(cpumask);
- /* Return id to all */
- if (num_bits_set == nr_cpu_ids)
- goto exit;
- /*
- * The cpus in the mask must all be on the apic cluster. If are not
- * on the same apicid cluster return default value of TARGET_CPUS.
- */
- cpu = cpumask_first(cpumask);
- apicid = cpu_to_logical_apicid(cpu);
- while (cpus_found < num_bits_set) {
- if (cpumask_test_cpu(cpu, cpumask)) {
- int new_apicid = cpu_to_logical_apicid(cpu);
- if (apicid_cluster(apicid) !=
- apicid_cluster(new_apicid)){
- printk ("%s: Not a valid mask!\n", __func__);
- return 0xFF;
- }
- apicid = apicid | new_apicid;
- cpus_found++;
- }
- cpu++;
- }
-exit:
free_cpumask_var(cpumask);
return apicid;
}
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 79e31e9..4e2f2e0 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -61,13 +61,19 @@ static inline int cpu_to_node(int cpu)
*
* Side note: this function creates the returned cpumask on the stack
* so with a high NR_CPUS count, excessive stack space is used. The
- * node_to_cpumask_ptr function should be used whenever possible.
+ * cpumask_of_node function should be used whenever possible.
*/
static inline cpumask_t node_to_cpumask(int node)
{
return node_to_cpumask_map[node];
}

+/* Returns a bitmask of CPUs on Node 'node'. */
+static inline const struct cpumask *cpumask_of_node(int node)
+{
+ return &node_to_cpumask_map[node];
+}
+
#else /* CONFIG_X86_64 */

/* Mappings between node number and cpus on that node. */
@@ -82,7 +88,7 @@ DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map);
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
extern int cpu_to_node(int cpu);
extern int early_cpu_to_node(int cpu);
-extern const cpumask_t *_node_to_cpumask_ptr(int node);
+extern const cpumask_t *cpumask_of_node(int node);
extern cpumask_t node_to_cpumask(int node);

#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
@@ -103,7 +109,7 @@ static inline int early_cpu_to_node(int cpu)
}

/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
-static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+static inline const cpumask_t *cpumask_of_node(int node)
{
return &node_to_cpumask_map[node];
}
@@ -116,12 +122,15 @@ static inline cpumask_t node_to_cpumask(int node)

#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */

-/* Replace default node_to_cpumask_ptr with optimized version */
+/*
+ * Replace default node_to_cpumask_ptr with optimized version
+ * Deprecated: use "const struct cpumask *mask = cpumask_of_node(node)"
+ */
#define node_to_cpumask_ptr(v, node) \
- const cpumask_t *v = _node_to_cpumask_ptr(node)
+ const cpumask_t *v = cpumask_of_node(node)

#define node_to_cpumask_ptr_next(v, node) \
- v = _node_to_cpumask_ptr(node)
+ v = cpumask_of_node(node)

#endif /* CONFIG_X86_64 */

@@ -187,7 +196,7 @@ extern int __node_distance(int, int);
#define cpu_to_node(cpu) 0
#define early_cpu_to_node(cpu) 0

-static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+static inline const cpumask_t *cpumask_of_node(int node)
{
return &cpu_online_map;
}
@@ -200,12 +209,15 @@ static inline int node_to_first_cpu(int node)
return first_cpu(cpu_online_map);
}

-/* Replace default node_to_cpumask_ptr with optimized version */
+/*
+ * Replace default node_to_cpumask_ptr with optimized version
+ * Deprecated: use "const struct cpumask *mask = cpumask_of_node(node)"
+ */
#define node_to_cpumask_ptr(v, node) \
- const cpumask_t *v = _node_to_cpumask_ptr(node)
+ const cpumask_t *v = cpumask_of_node(node)

#define node_to_cpumask_ptr_next(v, node) \
- v = _node_to_cpumask_ptr(node)
+ v = cpumask_of_node(node)
#endif

#include <asm-generic/topology.h>
@@ -214,12 +226,12 @@ static inline int node_to_first_cpu(int node)
/* Returns the number of the first CPU on Node 'node'. */
static inline int node_to_first_cpu(int node)
{
- node_to_cpumask_ptr(mask, node);
- return first_cpu(*mask);
+ return cpumask_first(cpumask_of_node(node));
}
#endif

extern cpumask_t cpu_coregroup_map(int cpu);
+extern const struct cpumask *cpu_coregroup_mask(int cpu);

#ifdef ENABLE_TOPO_DEFINES
#define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 65d0b72..29dc0c8 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -538,9 +538,10 @@ static int __cpuinit _acpi_map_lsapic(acpi_handle handle, int *pcpu)
struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
union acpi_object *obj;
struct acpi_madt_local_apic *lapic;
- cpumask_t tmp_map, new_map;
+ cpumask_var_t tmp_map, new_map;
u8 physid;
int cpu;
+ int retval = -ENOMEM;

if (ACPI_FAILURE(acpi_evaluate_object(handle, "_MAT", NULL, &buffer)))
return -EINVAL;
@@ -569,23 +570,37 @@ static int __cpuinit _acpi_map_lsapic(acpi_handle handle, int *pcpu)
buffer.length = ACPI_ALLOCATE_BUFFER;
buffer.pointer = NULL;

- tmp_map = cpu_present_map;
+ if (!alloc_cpumask_var(&tmp_map, GFP_KERNEL))
+ goto out;
+
+ if (!alloc_cpumask_var(&new_map, GFP_KERNEL))
+ goto free_tmp_map;
+
+ cpumask_copy(tmp_map, cpu_present_mask);
acpi_register_lapic(physid, lapic->lapic_flags & ACPI_MADT_ENABLED);

/*
* If mp_register_lapic successfully generates a new logical cpu
* number, then the following will get us exactly what was mapped
*/
- cpus_andnot(new_map, cpu_present_map, tmp_map);
- if (cpus_empty(new_map)) {
+ cpumask_andnot(new_map, cpu_present_mask, tmp_map);
+ if (cpumask_empty(new_map)) {
printk ("Unable to map lapic to logical cpu number\n");
- return -EINVAL;
+ retval = -EINVAL;
+ goto free_new_map;
}

- cpu = first_cpu(new_map);
+ cpu = cpumask_first(new_map);

*pcpu = cpu;
- return 0;
+ retval = 0;
+
+free_new_map:
+ free_cpumask_var(new_map);
+free_tmp_map:
+ free_cpumask_var(tmp_map);
+out:
+ return retval;
}

/* wrapper to silence section mismatch warning */
@@ -598,7 +613,7 @@ EXPORT_SYMBOL(acpi_map_lsapic);
int acpi_unmap_lsapic(int cpu)
{
per_cpu(x86_cpu_to_apicid, cpu) = -1;
- cpu_clear(cpu, cpu_present_map);
+ set_cpu_present(cpu, false);
num_processors--;

return (0);
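
_acpi_map_lsapic() keeps its original trick -- snapshot the present mask,
register the lapic, and whatever cpu appears afterwards is the new one -- only
now with heap-backed masks and an unwind path. The core of that trick, sketched
with hypothetical names (register_fn stands in for acpi_register_lapic()):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int find_newly_present_cpu(void (*register_fn)(void))
{
        cpumask_var_t before, added;
        int cpu = -EINVAL;

        if (!alloc_cpumask_var(&before, GFP_KERNEL))
                return -ENOMEM;
        if (!alloc_cpumask_var(&added, GFP_KERNEL)) {
                free_cpumask_var(before);
                return -ENOMEM;
        }

        cpumask_copy(before, cpu_present_mask);
        register_fn();
        cpumask_andnot(added, cpu_present_mask, before);

        if (!cpumask_empty(added))
                cpu = cpumask_first(added);

        free_cpumask_var(added);
        free_cpumask_var(before);
        return cpu;
}
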
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 6b7f824..9958924 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -140,7 +140,7 @@ static int lapic_next_event(unsigned long delta,
struct clock_event_device *evt);
static void lapic_timer_setup(enum clock_event_mode mode,
struct clock_event_device *evt);
-static void lapic_timer_broadcast(const cpumask_t *mask);
+static void lapic_timer_broadcast(const struct cpumask *mask);
static void apic_pm_activate(void);

/*
@@ -453,7 +453,7 @@ static void lapic_timer_setup(enum clock_event_mode mode,
/*
* Local APIC timer broadcast function
*/
-static void lapic_timer_broadcast(const cpumask_t *mask)
+static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 42e0853..3f95a40 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -355,7 +355,7 @@ void __cpuinit detect_ht(struct cpuinfo_x86 *c)
printk(KERN_INFO "CPU: Hyper-Threading is disabled\n");
} else if (smp_num_siblings > 1) {

- if (smp_num_siblings > NR_CPUS) {
+ if (smp_num_siblings > nr_cpu_ids) {
printk(KERN_WARNING "CPU: Unsupported number of siblings %d",
smp_num_siblings);
smp_num_siblings = 1;
diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
index 88ea02d..28102ad 100644
--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -517,6 +517,17 @@ acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
}
}

+static void free_acpi_perf_data(void)
+{
+ unsigned int i;
+
+ /* Freeing a NULL pointer is OK, and alloc_percpu zeroes. */
+ for_each_possible_cpu(i)
+ free_cpumask_var(per_cpu_ptr(acpi_perf_data, i)
+ ->shared_cpu_map);
+ free_percpu(acpi_perf_data);
+}
+
/*
* acpi_cpufreq_early_init - initialize ACPI P-States library
*
@@ -527,6 +538,7 @@ acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
*/
static int __init acpi_cpufreq_early_init(void)
{
+ unsigned int i;
dprintk("acpi_cpufreq_early_init\n");

acpi_perf_data = alloc_percpu(struct acpi_processor_performance);
@@ -534,6 +546,16 @@ static int __init acpi_cpufreq_early_init(void)
dprintk("Memory allocation error for acpi_perf_data.\n");
return -ENOMEM;
}
+ for_each_possible_cpu(i) {
+ if (!alloc_cpumask_var_node(
+ &per_cpu_ptr(acpi_perf_data, i)->shared_cpu_map,
+ GFP_KERNEL, cpu_to_node(i))) {
+
+ /* Freeing a NULL pointer is OK: alloc_percpu zeroes. */
+ free_acpi_perf_data();
+ return -ENOMEM;
+ }
+ }

/* Do initialization in ACPI core */
acpi_processor_preregister_performance(acpi_perf_data);
@@ -604,9 +626,9 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
*/
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
- policy->cpus = perf->shared_cpu_map;
+ cpumask_copy(&policy->cpus, perf->shared_cpu_map);
}
- policy->related_cpus = perf->shared_cpu_map;
+ cpumask_copy(&policy->related_cpus, perf->shared_cpu_map);

#ifdef CONFIG_SMP
dmi_check_system(sw_any_bug_dmi_table);
@@ -795,7 +817,7 @@ static int __init acpi_cpufreq_init(void)

ret = cpufreq_register_driver(&acpi_cpufreq_driver);
if (ret)
- free_percpu(acpi_perf_data);
+ free_acpi_perf_data();

return ret;
}
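
acpi_cpufreq_early_init() is also an early user of alloc_cpumask_var_node()
from this series: a mask that belongs to a particular cpu gets allocated on
that cpu's node. A small sketch of the call (the helper name is made up):

#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/gfp.h>
#include <linux/errno.h>

/* Hypothetical helper: back 'mask' with memory local to 'cpu'. */
static int alloc_mask_near_cpu(cpumask_var_t *mask, int cpu)
{
        if (!alloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)))
                return -ENOMEM;
        return 0;
}
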
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k7.c b/arch/x86/kernel/cpu/cpufreq/powernow-k7.c
index 7c7d56b..1b446d7 100644
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k7.c
+++ b/arch/x86/kernel/cpu/cpufreq/powernow-k7.c
@@ -310,6 +310,12 @@ static int powernow_acpi_init(void)
goto err0;
}

+ if (!alloc_cpumask_var(&acpi_processor_perf->shared_cpu_map,
+ GFP_KERNEL)) {
+ retval = -ENOMEM;
+ goto err05;
+ }
+
if (acpi_processor_register_performance(acpi_processor_perf, 0)) {
retval = -EIO;
goto err1;
@@ -412,6 +418,8 @@ static int powernow_acpi_init(void)
err2:
acpi_processor_unregister_performance(acpi_processor_perf, 0);
err1:
+ free_cpumask_var(acpi_processor_perf->shared_cpu_map);
+err05:
kfree(acpi_processor_perf);
err0:
printk(KERN_WARNING PFX "ACPI perflib can not be used in this platform\n");
@@ -652,6 +660,7 @@ static int powernow_cpu_exit (struct cpufreq_policy *policy) {
#ifdef CONFIG_X86_POWERNOW_K7_ACPI
if (acpi_processor_perf) {
acpi_processor_unregister_performance(acpi_processor_perf, 0);
+ free_cpumask_var(acpi_processor_perf->shared_cpu_map);
kfree(acpi_processor_perf);
}
#endif
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
index 7f05f44..c3c9adb 100644
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
+++ b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
@@ -766,7 +766,7 @@ static void powernow_k8_acpi_pst_values(struct powernow_k8_data *data, unsigned
static int powernow_k8_cpu_init_acpi(struct powernow_k8_data *data)
{
struct cpufreq_frequency_table *powernow_table;
- int ret_val;
+ int ret_val = -ENODEV;

if (acpi_processor_register_performance(&data->acpi_data, data->cpu)) {
dprintk("register performance failed: bad ACPI data\n");
@@ -815,6 +815,13 @@ static int powernow_k8_cpu_init_acpi(struct powernow_k8_data *data)
/* notify BIOS that we exist */
acpi_processor_notify_smm(THIS_MODULE);

+ if (!alloc_cpumask_var(&data->acpi_data.shared_cpu_map, GFP_KERNEL)) {
+ printk(KERN_ERR PFX
+ "unable to alloc powernow_k8_data cpumask\n");
+ ret_val = -ENOMEM;
+ goto err_out_mem;
+ }
+
return 0;

err_out_mem:
@@ -826,7 +833,7 @@ err_out:
/* data->acpi_data.state_count informs us at ->exit() whether ACPI was used */
data->acpi_data.state_count = 0;

- return -ENODEV;
+ return ret_val;
}

static int fill_powernow_table_pstate(struct powernow_k8_data *data, struct cpufreq_frequency_table *powernow_table)
@@ -929,6 +936,7 @@ static void powernow_k8_cpu_exit_acpi(struct powernow_k8_data *data)
{
if (data->acpi_data.state_count)
acpi_processor_unregister_performance(&data->acpi_data, data->cpu);
+ free_cpumask_var(data->acpi_data.shared_cpu_map);
}

#else
@@ -1134,7 +1142,8 @@ static int __cpuinit powernowk8_cpu_init(struct cpufreq_policy *pol)
data->cpu = pol->cpu;
data->currpstate = HW_PSTATE_INVALID;

- if (powernow_k8_cpu_init_acpi(data)) {
+ rc = powernow_k8_cpu_init_acpi(data);
+ if (rc) {
/*
* Use the PSB BIOS structure. This is only availabe on
* an UP version, and is deprecated by AMD.
@@ -1152,20 +1161,17 @@ static int __cpuinit powernowk8_cpu_init(struct cpufreq_policy *pol)
"ACPI maintainers and complain to your BIOS "
"vendor.\n");
#endif
- kfree(data);
- return -ENODEV;
+ goto err_out;
}
if (pol->cpu != 0) {
printk(KERN_ERR FW_BUG PFX "No ACPI _PSS objects for "
"CPU other than CPU0. Complain to your BIOS "
"vendor.\n");
- kfree(data);
- return -ENODEV;
+ goto err_out;
}
rc = find_psb_table(data);
if (rc) {
- kfree(data);
- return -ENODEV;
+ goto err_out;
}
}

diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index c6ecda6..48533d7 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -534,7 +534,7 @@ static void __cpuinit free_cache_attributes(unsigned int cpu)
per_cpu(cpuid4_info, cpu) = NULL;
}

-static void get_cpu_leaves(void *_retval)
+static void __cpuinit get_cpu_leaves(void *_retval)
{
int j, *retval = _retval, cpu = smp_processor_id();

diff --git a/arch/x86/kernel/cpuid.c b/arch/x86/kernel/cpuid.c
index 72cefd1..62a3c23 100644
--- a/arch/x86/kernel/cpuid.c
+++ b/arch/x86/kernel/cpuid.c
@@ -121,7 +121,7 @@ static int cpuid_open(struct inode *inode, struct file *file)
lock_kernel();

cpu = iminor(file->f_path.dentry->d_inode);
- if (cpu >= NR_CPUS || !cpu_online(cpu)) {
+ if (cpu >= nr_cpu_ids || !cpu_online(cpu)) {
ret = -ENXIO; /* No such CPU */
goto out;
}
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index 62ecfc9..37715bb 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -214,11 +214,11 @@ static struct irq_cfg *get_one_free_irq_cfg(int cpu)

cfg = kzalloc_node(sizeof(*cfg), GFP_ATOMIC, node);
if (cfg) {
- /* FIXME: needs alloc_cpumask_var_node() */
- if (!alloc_cpumask_var(&cfg->domain, GFP_ATOMIC)) {
+ if (!alloc_cpumask_var_node(&cfg->domain, GFP_ATOMIC, node)) {
kfree(cfg);
cfg = NULL;
- } else if (!alloc_cpumask_var(&cfg->old_domain, GFP_ATOMIC)) {
+ } else if (!alloc_cpumask_var_node(&cfg->old_domain,
+ GFP_ATOMIC, node)) {
free_cpumask_var(cfg->domain);
kfree(cfg);
cfg = NULL;
diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 82a7c7e..7262666 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -136,7 +136,7 @@ static int msr_open(struct inode *inode, struct file *file)
lock_kernel();
cpu = iminor(file->f_path.dentry->d_inode);

- if (cpu >= NR_CPUS || !cpu_online(cpu)) {
+ if (cpu >= nr_cpu_ids || !cpu_online(cpu)) {
ret = -ENXIO; /* No such CPU */
goto out;
}
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 39643b1..bc1535e 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -501,7 +501,7 @@ void native_machine_shutdown(void)

#ifdef CONFIG_X86_32
/* See if there has been given a command line override */
- if ((reboot_cpu != -1) && (reboot_cpu < NR_CPUS) &&
+ if ((reboot_cpu != -1) && (reboot_cpu < nr_cpu_ids) &&
cpu_online(reboot_cpu))
reboot_cpu_id = reboot_cpu;
#endif
@@ -511,7 +511,7 @@ void native_machine_shutdown(void)
reboot_cpu_id = smp_processor_id();

/* Make certain I only run on the appropriate processor */
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(reboot_cpu_id));
+ set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));

/* O.K Now that I'm on the appropriate processor,
* stop all of the others.
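
reboot.c shows the one-liner that recurs all over the tree:
&cpumask_of_cpu(cpu) builds a full cpumask_t just to take its address, while
cpumask_of(cpu) returns a pointer into the constant cpu_bit_bitmap table and
copies nothing. Sketch:

#include <linux/cpumask.h>
#include <linux/sched.h>

static int pin_current_to(int cpu)
{
        /* Was: set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu)); */
        return set_cpus_allowed_ptr(current, cpumask_of(cpu));
}
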
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 0b63b08..a4b619c 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -153,12 +153,10 @@ void __init setup_per_cpu_areas(void)
align = max_t(unsigned long, PAGE_SIZE, align);
size = roundup(old_size, align);

- printk(KERN_INFO
- "NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
+ pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);

- printk(KERN_INFO "PERCPU: Allocating %zd bytes of per cpu data\n",
- size);
+ pr_info("PERCPU: Allocating %zd bytes of per cpu data\n", size);

for_each_possible_cpu(cpu) {
#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -169,22 +167,15 @@ void __init setup_per_cpu_areas(void)
if (!node_online(node) || !NODE_DATA(node)) {
ptr = __alloc_bootmem(size, align,
__pa(MAX_DMA_ADDRESS));
- printk(KERN_INFO
- "cpu %d has no node %d or node-local memory\n",
+ pr_info("cpu %d has no node %d or node-local memory\n",
cpu, node);
- if (ptr)
- printk(KERN_DEBUG
- "per cpu data for cpu%d at %016lx\n",
- cpu, __pa(ptr));
- }
- else {
+ pr_debug("per cpu data for cpu%d at %016lx\n",
+ cpu, __pa(ptr));
+ } else {
ptr = __alloc_bootmem_node(NODE_DATA(node), size, align,
__pa(MAX_DMA_ADDRESS));
- if (ptr)
- printk(KERN_DEBUG
- "per cpu data for cpu%d on node%d "
- "at %016lx\n",
- cpu, node, __pa(ptr));
+ pr_debug("per cpu data for cpu%d on node%d at %016lx\n",
+ cpu, node, __pa(ptr));
}
#endif
per_cpu_offset(cpu) = ptr - __per_cpu_start;
@@ -339,25 +330,25 @@ static const cpumask_t cpu_mask_none;
/*
* Returns a pointer to the bitmask of CPUs on Node 'node'.
*/
-const cpumask_t *_node_to_cpumask_ptr(int node)
+const cpumask_t *cpumask_of_node(int node)
{
if (node_to_cpumask_map == NULL) {
printk(KERN_WARNING
- "_node_to_cpumask_ptr(%d): no node_to_cpumask_map!\n",
+ "cpumask_of_node(%d): no node_to_cpumask_map!\n",
node);
dump_stack();
return (const cpumask_t *)&cpu_online_map;
}
if (node >= nr_node_ids) {
printk(KERN_WARNING
- "_node_to_cpumask_ptr(%d): node > nr_node_ids(%d)\n",
+ "cpumask_of_node(%d): node > nr_node_ids(%d)\n",
node, nr_node_ids);
dump_stack();
return &cpu_mask_none;
}
return &node_to_cpumask_map[node];
}
-EXPORT_SYMBOL(_node_to_cpumask_ptr);
+EXPORT_SYMBOL(cpumask_of_node);

/*
* Returns a bitmask of CPUs on Node 'node'.
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 31869bf..6bd4d9b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -496,7 +496,7 @@ void __cpuinit set_cpu_sibling_map(int cpu)
}

/* maps the cpu to the sched domain representing multi-core */
-cpumask_t cpu_coregroup_map(int cpu)
+const struct cpumask *cpu_coregroup_mask(int cpu)
{
struct cpuinfo_x86 *c = &cpu_data(cpu);
/*
@@ -504,9 +504,14 @@ cpumask_t cpu_coregroup_map(int cpu)
* And for power savings, we return cpu_core_map
*/
if (sched_mc_power_savings || sched_smt_power_savings)
- return per_cpu(cpu_core_map, cpu);
+ return &per_cpu(cpu_core_map, cpu);
else
- return c->llc_shared_map;
+ return &c->llc_shared_map;
+}
+
+cpumask_t cpu_coregroup_map(int cpu)
+{
+ return *cpu_coregroup_mask(cpu);
}

static void impress_friends(void)
@@ -1149,7 +1154,7 @@ static void __init smp_cpu_index_default(void)
for_each_possible_cpu(i) {
c = &cpu_data(i);
/* mark all to hotplug */
- c->cpu_index = NR_CPUS;
+ c->cpu_index = nr_cpu_ids;
}
}

@@ -1293,6 +1298,8 @@ __init void prefill_possible_map(void)
else
possible = setup_possible_cpus;

+ total_cpus = max_t(int, possible, num_processors + disabled_cpus);
+
if (possible > CONFIG_NR_CPUS) {
printk(KERN_WARNING
"%d Processors exceeds NR_CPUS limit of %d\n",
diff --git a/arch/x86/mach-voyager/voyager_smp.c b/arch/x86/mach-voyager/voyager_smp.c
index a5bc054..9840b7e 100644
--- a/arch/x86/mach-voyager/voyager_smp.c
+++ b/arch/x86/mach-voyager/voyager_smp.c
@@ -357,9 +357,8 @@ void __init find_smp_config(void)
printk("VOYAGER SMP: Boot cpu is %d\n", boot_cpu_id);

/* initialize the CPU structures (moved from smp_boot_cpus) */
- for (i = 0; i < NR_CPUS; i++) {
+ for (i = 0; i < nr_cpu_ids; i++)
cpu_irq_affinity[i] = ~0;
- }
cpu_online_map = cpumask_of_cpu(boot_cpu_id);

/* The boot CPU must be extended */
@@ -1227,7 +1226,7 @@ int setup_profiling_timer(unsigned int multiplier)
* new values until the next timer interrupt in which they do process
* accounting.
*/
- for (i = 0; i < NR_CPUS; ++i)
+ for (i = 0; i < nr_cpu_ids; ++i)
per_cpu(prof_multiplier, i) = multiplier;

return 0;
@@ -1257,7 +1256,7 @@ void __init voyager_smp_intr_init(void)
int i;

/* initialize the per cpu irq mask to all disabled */
- for (i = 0; i < NR_CPUS; i++)
+ for (i = 0; i < nr_cpu_ids; i++)
vic_irq_mask[i] = 0xFFFF;

VIC_SET_GATE(VIC_CPI_LEVEL0, vic_cpi_interrupt);
diff --git a/block/blk.h b/block/blk.h
index d2e49af..6e1ed40 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -99,8 +99,8 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)
static inline int blk_cpu_to_group(int cpu)
{
#ifdef CONFIG_SCHED_MC
- cpumask_t mask = cpu_coregroup_map(cpu);
- return first_cpu(mask);
+ const struct cpumask *mask = cpu_coregroup_mask(cpu);
+ return cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
return first_cpu(per_cpu(cpu_sibling_map, cpu));
#else
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 3494836..0cc2fd3 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -826,6 +826,11 @@ static int acpi_processor_add(struct acpi_device *device)
if (!pr)
return -ENOMEM;

+ if (!alloc_cpumask_var(&pr->throttling.shared_cpu_map, GFP_KERNEL)) {
+ kfree(pr);
+ return -ENOMEM;
+ }
+
pr->handle = device->handle;
strcpy(acpi_device_name(device), ACPI_PROCESSOR_DEVICE_NAME);
strcpy(acpi_device_class(device), ACPI_PROCESSOR_CLASS);
@@ -845,10 +850,8 @@ static int acpi_processor_remove(struct acpi_device *device, int type)

pr = acpi_driver_data(device);

- if (pr->id >= nr_cpu_ids) {
- kfree(pr);
- return 0;
- }
+ if (pr->id >= nr_cpu_ids)
+ goto free;

if (type == ACPI_BUS_REMOVAL_EJECT) {
if (acpi_processor_handle_eject(pr))
@@ -873,6 +876,9 @@ static int acpi_processor_remove(struct acpi_device *device, int type)

per_cpu(processors, pr->id) = NULL;
per_cpu(processor_device_array, pr->id) = NULL;
+
+free:
+ free_cpumask_var(pr->throttling.shared_cpu_map);
kfree(pr);

return 0;
diff --git a/drivers/acpi/processor_perflib.c b/drivers/acpi/processor_perflib.c
index 0d7b772..846e227 100644
--- a/drivers/acpi/processor_perflib.c
+++ b/drivers/acpi/processor_perflib.c
@@ -588,12 +588,15 @@ int acpi_processor_preregister_performance(
int count, count_target;
int retval = 0;
unsigned int i, j;
- cpumask_t covered_cpus;
+ cpumask_var_t covered_cpus;
struct acpi_processor *pr;
struct acpi_psd_package *pdomain;
struct acpi_processor *match_pr;
struct acpi_psd_package *match_pdomain;

+ if (!alloc_cpumask_var(&covered_cpus, GFP_KERNEL))
+ return -ENOMEM;
+
mutex_lock(&performance_mutex);

retval = 0;
@@ -617,7 +620,7 @@ int acpi_processor_preregister_performance(
}

pr->performance = percpu_ptr(performance, i);
- cpu_set(i, pr->performance->shared_cpu_map);
+ cpumask_set_cpu(i, pr->performance->shared_cpu_map);
if (acpi_processor_get_psd(pr)) {
retval = -EINVAL;
continue;
@@ -650,18 +653,18 @@ int acpi_processor_preregister_performance(
}
}

- cpus_clear(covered_cpus);
+ cpumask_clear(covered_cpus);
for_each_possible_cpu(i) {
pr = per_cpu(processors, i);
if (!pr)
continue;

- if (cpu_isset(i, covered_cpus))
+ if (cpumask_test_cpu(i, covered_cpus))
continue;

pdomain = &(pr->performance->domain_info);
- cpu_set(i, pr->performance->shared_cpu_map);
- cpu_set(i, covered_cpus);
+ cpumask_set_cpu(i, pr->performance->shared_cpu_map);
+ cpumask_set_cpu(i, covered_cpus);
if (pdomain->num_processors <= 1)
continue;

@@ -699,8 +702,8 @@ int acpi_processor_preregister_performance(
goto err_ret;
}

- cpu_set(j, covered_cpus);
- cpu_set(j, pr->performance->shared_cpu_map);
+ cpumask_set_cpu(j, covered_cpus);
+ cpumask_set_cpu(j, pr->performance->shared_cpu_map);
count++;
}

@@ -718,8 +721,8 @@ int acpi_processor_preregister_performance(

match_pr->performance->shared_type =
pr->performance->shared_type;
- match_pr->performance->shared_cpu_map =
- pr->performance->shared_cpu_map;
+ cpumask_copy(match_pr->performance->shared_cpu_map,
+ pr->performance->shared_cpu_map);
}
}

@@ -731,14 +734,15 @@ err_ret:

/* Assume no coordination on any error parsing domain info */
if (retval) {
- cpus_clear(pr->performance->shared_cpu_map);
- cpu_set(i, pr->performance->shared_cpu_map);
+ cpumask_clear(pr->performance->shared_cpu_map);
+ cpumask_set_cpu(i, pr->performance->shared_cpu_map);
pr->performance->shared_type = CPUFREQ_SHARED_TYPE_ALL;
}
pr->performance = NULL; /* Will be set for real in register */
}

mutex_unlock(&performance_mutex);
+ free_cpumask_var(covered_cpus);
return retval;
}
EXPORT_SYMBOL(acpi_processor_preregister_performance);
diff --git a/drivers/acpi/processor_throttling.c b/drivers/acpi/processor_throttling.c
index a0c38c9..d278381 100644
--- a/drivers/acpi/processor_throttling.c
+++ b/drivers/acpi/processor_throttling.c
@@ -61,11 +61,14 @@ static int acpi_processor_update_tsd_coord(void)
int count, count_target;
int retval = 0;
unsigned int i, j;
- cpumask_t covered_cpus;
+ cpumask_var_t covered_cpus;
struct acpi_processor *pr, *match_pr;
struct acpi_tsd_package *pdomain, *match_pdomain;
struct acpi_processor_throttling *pthrottling, *match_pthrottling;

+ if (!alloc_cpumask_var(&covered_cpus, GFP_KERNEL))
+ return -ENOMEM;
+
/*
* Now that we have _TSD data from all CPUs, lets setup T-state
* coordination between all CPUs.
@@ -91,19 +94,19 @@ static int acpi_processor_update_tsd_coord(void)
if (retval)
goto err_ret;

- cpus_clear(covered_cpus);
+ cpumask_clear(covered_cpus);
for_each_possible_cpu(i) {
pr = per_cpu(processors, i);
if (!pr)
continue;

- if (cpu_isset(i, covered_cpus))
+ if (cpumask_test_cpu(i, covered_cpus))
continue;
pthrottling = &pr->throttling;

pdomain = &(pthrottling->domain_info);
- cpu_set(i, pthrottling->shared_cpu_map);
- cpu_set(i, covered_cpus);
+ cpumask_set_cpu(i, pthrottling->shared_cpu_map);
+ cpumask_set_cpu(i, covered_cpus);
/*
* If the number of processor in the TSD domain is 1, it is
* unnecessary to parse the coordination for this CPU.
@@ -144,8 +147,8 @@ static int acpi_processor_update_tsd_coord(void)
goto err_ret;
}

- cpu_set(j, covered_cpus);
- cpu_set(j, pthrottling->shared_cpu_map);
+ cpumask_set_cpu(j, covered_cpus);
+ cpumask_set_cpu(j, pthrottling->shared_cpu_map);
count++;
}
for_each_possible_cpu(j) {
@@ -165,12 +168,14 @@ static int acpi_processor_update_tsd_coord(void)
* If some CPUS have the same domain, they
* will have the same shared_cpu_map.
*/
- match_pthrottling->shared_cpu_map =
- pthrottling->shared_cpu_map;
+ cpumask_copy(match_pthrottling->shared_cpu_map,
+ pthrottling->shared_cpu_map);
}
}

err_ret:
+ free_cpumask_var(covered_cpus);
+
for_each_possible_cpu(i) {
pr = per_cpu(processors, i);
if (!pr)
@@ -182,8 +187,8 @@ err_ret:
*/
if (retval) {
pthrottling = &(pr->throttling);
- cpus_clear(pthrottling->shared_cpu_map);
- cpu_set(i, pthrottling->shared_cpu_map);
+ cpumask_clear(pthrottling->shared_cpu_map);
+ cpumask_set_cpu(i, pthrottling->shared_cpu_map);
pthrottling->shared_type = DOMAIN_COORD_TYPE_SW_ALL;
}
}
@@ -567,7 +572,7 @@ static int acpi_processor_get_tsd(struct acpi_processor *pr)
pthrottling = &pr->throttling;
pthrottling->tsd_valid_flag = 1;
pthrottling->shared_type = pdomain->coord_type;
- cpu_set(pr->id, pthrottling->shared_cpu_map);
+ cpumask_set_cpu(pr->id, pthrottling->shared_cpu_map);
/*
* If the coordination type is not defined in ACPI spec,
* the tsd_valid_flag will be clear and coordination type
@@ -826,7 +831,7 @@ static int acpi_processor_get_throttling_ptc(struct acpi_processor *pr)

static int acpi_processor_get_throttling(struct acpi_processor *pr)
{
- cpumask_t saved_mask;
+ cpumask_var_t saved_mask;
int ret;

if (!pr)
@@ -834,14 +839,20 @@ static int acpi_processor_get_throttling(struct acpi_processor *pr)

if (!pr->flags.throttling)
return -ENODEV;
+
+ if (!alloc_cpumask_var(&saved_mask, GFP_KERNEL))
+ return -ENOMEM;
+
/*
* Migrate task to the cpu pointed by pr.
*/
- saved_mask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(pr->id));
+ cpumask_copy(saved_mask, &current->cpus_allowed);
+ /* FIXME: use work_on_cpu() */
+ set_cpus_allowed_ptr(current, cpumask_of(pr->id));
ret = pr->throttling.acpi_processor_get_throttling(pr);
/* restore the previous state */
- set_cpus_allowed_ptr(current, &saved_mask);
+ set_cpus_allowed_ptr(current, saved_mask);
+ free_cpumask_var(saved_mask);

return ret;
}
@@ -986,13 +997,13 @@ static int acpi_processor_set_throttling_ptc(struct acpi_processor *pr,

int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
{
- cpumask_t saved_mask;
+ cpumask_var_t saved_mask;
int ret = 0;
unsigned int i;
struct acpi_processor *match_pr;
struct acpi_processor_throttling *p_throttling;
struct throttling_tstate t_state;
- cpumask_t online_throttling_cpus;
+ cpumask_var_t online_throttling_cpus;

if (!pr)
return -EINVAL;
@@ -1003,17 +1014,25 @@ int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
if ((state < 0) || (state > (pr->throttling.state_count - 1)))
return -EINVAL;

- saved_mask = current->cpus_allowed;
+ if (!alloc_cpumask_var(&saved_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ if (!alloc_cpumask_var(&online_throttling_cpus, GFP_KERNEL)) {
+ free_cpumask_var(saved_mask);
+ return -ENOMEM;
+ }
+
+ cpumask_copy(saved_mask, &current->cpus_allowed);
t_state.target_state = state;
p_throttling = &(pr->throttling);
- cpus_and(online_throttling_cpus, cpu_online_map,
- p_throttling->shared_cpu_map);
+ cpumask_and(online_throttling_cpus, cpu_online_mask,
+ p_throttling->shared_cpu_map);
/*
* The throttling notifier will be called for every
* affected cpu in order to get one proper T-state.
* The notifier event is THROTTLING_PRECHANGE.
*/
- for_each_cpu_mask_nr(i, online_throttling_cpus) {
+ for_each_cpu(i, online_throttling_cpus) {
t_state.cpu = i;
acpi_processor_throttling_notifier(THROTTLING_PRECHANGE,
&t_state);
@@ -1025,7 +1044,8 @@ int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
* it can be called only for the cpu pointed by pr.
*/
if (p_throttling->shared_type == DOMAIN_COORD_TYPE_SW_ANY) {
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(pr->id));
+ /* FIXME: use work_on_cpu() */
+ set_cpus_allowed_ptr(current, cpumask_of(pr->id));
ret = p_throttling->acpi_processor_set_throttling(pr,
t_state.target_state);
} else {
@@ -1034,7 +1054,7 @@ int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
* it is necessary to set T-state for every affected
* cpus.
*/
- for_each_cpu_mask_nr(i, online_throttling_cpus) {
+ for_each_cpu(i, online_throttling_cpus) {
match_pr = per_cpu(processors, i);
/*
* If the pointer is invalid, we will report the
@@ -1056,7 +1076,8 @@ int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
continue;
}
t_state.cpu = i;
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(i));
+ /* FIXME: use work_on_cpu() */
+ set_cpus_allowed_ptr(current, cpumask_of(i));
ret = match_pr->throttling.
acpi_processor_set_throttling(
match_pr, t_state.target_state);
@@ -1068,13 +1089,16 @@ int acpi_processor_set_throttling(struct acpi_processor *pr, int state)
* affected cpu to update the T-states.
* The notifier event is THROTTLING_POSTCHANGE
*/
- for_each_cpu_mask_nr(i, online_throttling_cpus) {
+ for_each_cpu(i, online_throttling_cpus) {
t_state.cpu = i;
acpi_processor_throttling_notifier(THROTTLING_POSTCHANGE,
&t_state);
}
/* restore the previous state */
- set_cpus_allowed_ptr(current, &saved_mask);
+ /* FIXME: use work_on_cpu() */
+ set_cpus_allowed_ptr(current, saved_mask);
+ free_cpumask_var(online_throttling_cpus);
+ free_cpumask_var(saved_mask);
return ret;
}

@@ -1120,7 +1144,7 @@ int acpi_processor_get_throttling_info(struct acpi_processor *pr)
if (acpi_processor_get_tsd(pr)) {
pthrottling = &pr->throttling;
pthrottling->tsd_valid_flag = 0;
- cpu_set(pr->id, pthrottling->shared_cpu_map);
+ cpumask_set_cpu(pr->id, pthrottling->shared_cpu_map);
pthrottling->shared_type = DOMAIN_COORD_TYPE_SW_ALL;
}
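
The throttling code keeps its migrate-run-restore dance, just with the saved
affinity in a cpumask_var_t, and the added FIXMEs note that work_on_cpu() is
the intended long-term replacement. The shape of the dance, with a hypothetical
callback:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int run_on_cpu(int cpu, int (*fn)(void *), void *data)
{
        cpumask_var_t saved;
        int ret;

        if (!alloc_cpumask_var(&saved, GFP_KERNEL))
                return -ENOMEM;

        cpumask_copy(saved, &current->cpus_allowed);
        set_cpus_allowed_ptr(current, cpumask_of(cpu));
        ret = fn(data);
        set_cpus_allowed_ptr(current, saved);   /* restore previous affinity */

        free_cpumask_var(saved);
        return ret;
}
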

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 4259072..719ee5c 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -128,10 +128,54 @@ print_cpus_func(online);
print_cpus_func(possible);
print_cpus_func(present);

+/*
+ * Print values for NR_CPUS and offlined cpus
+ */
+static ssize_t print_cpus_kernel_max(struct sysdev_class *class, char *buf)
+{
+ int n = snprintf(buf, PAGE_SIZE-2, "%d\n", NR_CPUS - 1);
+ return n;
+}
+static SYSDEV_CLASS_ATTR(kernel_max, 0444, print_cpus_kernel_max, NULL);
+
+/* arch-optional setting to enable display of offline cpus >= nr_cpu_ids */
+unsigned int total_cpus;
+
+static ssize_t print_cpus_offline(struct sysdev_class *class, char *buf)
+{
+ int n = 0, len = PAGE_SIZE-2;
+ cpumask_var_t offline;
+
+ /* display offline cpus < nr_cpu_ids */
+ if (!alloc_cpumask_var(&offline, GFP_KERNEL))
+ return -ENOMEM;
+ cpumask_complement(offline, cpu_online_mask);
+ n = cpulist_scnprintf(buf, len, offline);
+ free_cpumask_var(offline);
+
+ /* display offline cpus >= nr_cpu_ids */
+ if (total_cpus && nr_cpu_ids < total_cpus) {
+ if (n && n < len)
+ buf[n++] = ',';
+
+ if (nr_cpu_ids == total_cpus-1)
+ n += snprintf(&buf[n], len - n, "%d", nr_cpu_ids);
+ else
+ n += snprintf(&buf[n], len - n, "%d-%d",
+ nr_cpu_ids, total_cpus-1);
+ }
+
+ n += snprintf(&buf[n], len - n, "\n");
+ return n;
+}
+static SYSDEV_CLASS_ATTR(offline, 0444, print_cpus_offline, NULL);
+
static struct sysdev_class_attribute *cpu_state_attr[] = {
&attr_online_map,
&attr_possible_map,
&attr_present_map,
+ &attr_kernel_max,
+ &attr_offline,
};

static int cpu_states_init(void)
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 757035e..3128a50 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -659,12 +659,12 @@ static inline int find_next_online_cpu(struct ehca_comp_pool *pool)

WARN_ON_ONCE(!in_interrupt());
if (ehca_debug_level >= 3)
- ehca_dmp(&cpu_online_map, sizeof(cpumask_t), "");
+ ehca_dmp(cpu_online_mask, cpumask_size(), "");

spin_lock_irqsave(&pool->last_cpu_lock, flags);
- cpu = next_cpu_nr(pool->last_cpu, cpu_online_map);
+ cpu = cpumask_next(pool->last_cpu, cpu_online_mask);
if (cpu >= nr_cpu_ids)
- cpu = first_cpu(cpu_online_map);
+ cpu = cpumask_first(cpu_online_mask);
pool->last_cpu = cpu;
spin_unlock_irqrestore(&pool->last_cpu_lock, flags);

@@ -855,7 +855,7 @@ static int __cpuinit comp_pool_callback(struct notifier_block *nfb,
case CPU_UP_CANCELED_FROZEN:
ehca_gen_dbg("CPU: %x (CPU_CANCELED)", cpu);
cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
- kthread_bind(cct->task, any_online_cpu(cpu_online_map));
+ kthread_bind(cct->task, cpumask_any(cpu_online_mask));
destroy_comp_task(pool, cpu);
break;
case CPU_ONLINE:
@@ -902,7 +902,7 @@ int ehca_create_comp_pool(void)
return -ENOMEM;

spin_lock_init(&pool->last_cpu_lock);
- pool->last_cpu = any_online_cpu(cpu_online_map);
+ pool->last_cpu = cpumask_any(cpu_online_mask);

pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task);
if (pool->cpu_comp_tasks == NULL) {
@@ -934,10 +934,9 @@ void ehca_destroy_comp_pool(void)

unregister_hotcpu_notifier(&comp_pool_callback_nb);

- for (i = 0; i < NR_CPUS; i++) {
- if (cpu_online(i))
- destroy_comp_task(pool, i);
- }
+ for_each_online_cpu(i)
+ destroy_comp_task(pool, i);
+
free_percpu(pool->cpu_comp_tasks);
kfree(pool);
}
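
ehca's find_next_online_cpu() is a round-robin over the online mask; the
conversion swaps next_cpu_nr()/first_cpu() on cpu_online_map for
cpumask_next()/cpumask_first() on cpu_online_mask, with nr_cpu_ids as the wrap
condition. Distilled (names illustrative, hotplug locking omitted):

#include <linux/cpumask.h>

static int pick_next_online(int last)
{
        int cpu = cpumask_next(last, cpu_online_mask);

        if (cpu >= nr_cpu_ids)                  /* ran off the end: wrap around */
                cpu = cpumask_first(cpu_online_mask);
        return cpu;
}
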
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 239d4e8..2317398 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -1679,7 +1679,7 @@ static int find_best_unit(struct file *fp,
* InfiniPath chip to that processor (we assume reasonable connectivity,
* for now). This code assumes that if affinity has been set
* before this point, that at most one cpu is set; for now this
- * is reasonable. I check for both cpus_empty() and cpus_full(),
+ * is reasonable. I check for both cpumask_empty() and cpumask_full(),
* in case some kernel variant sets none of the bits when no
* affinity is set. 2.6.11 and 12 kernels have all present
* cpus set. Some day we'll have to fix it up further to handle
@@ -1688,11 +1688,11 @@ static int find_best_unit(struct file *fp,
* information. There may be some issues with dual core numbering
* as well. This needs more work prior to release.
*/
- if (!cpus_empty(current->cpus_allowed) &&
- !cpus_full(current->cpus_allowed)) {
+ if (!cpumask_empty(&current->cpus_allowed) &&
+ !cpumask_full(&current->cpus_allowed)) {
int ncpus = num_online_cpus(), curcpu = -1, nset = 0;
for (i = 0; i < ncpus; i++)
- if (cpu_isset(i, current->cpus_allowed)) {
+ if (cpumask_test_cpu(i, &current->cpus_allowed)) {
ipath_cdbg(PROC, "%s[%u] affinity set for "
"cpu %d/%d\n", current->comm,
current->pid, i, ncpus);
diff --git a/drivers/pnp/pnpbios/bioscalls.c b/drivers/pnp/pnpbios/bioscalls.c
index 7ff8244..7e6b5a3 100644
--- a/drivers/pnp/pnpbios/bioscalls.c
+++ b/drivers/pnp/pnpbios/bioscalls.c
@@ -481,7 +481,7 @@ void pnpbios_calls_init(union pnp_bios_install_struct *header)

set_base(bad_bios_desc, __va((unsigned long)0x40 << 4));
_set_limit((char *)&bad_bios_desc, 4095 - (0x40 << 4));
- for (i = 0; i < NR_CPUS; i++) {
+ for_each_possible_cpu(i) {
struct desc_struct *gdt = get_cpu_gdt_table(i);
if (!gdt)
continue;
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 99d8b8c..b569ff1 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -468,7 +468,8 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, char *esc)
return -1;
}

-int seq_bitmap(struct seq_file *m, unsigned long *bits, unsigned int nr_bits)
+int seq_bitmap(struct seq_file *m, const unsigned long *bits,
+ unsigned int nr_bits)
{
if (m->count < m->size) {
int len = bitmap_scnprintf(m->buf + m->count,
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index 3795590..0574add 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -127,7 +127,7 @@ struct acpi_processor_performance {
unsigned int state_count;
struct acpi_processor_px *states;
struct acpi_psd_package domain_info;
- cpumask_t shared_cpu_map;
+ cpumask_var_t shared_cpu_map;
unsigned int shared_type;
};

@@ -172,7 +172,7 @@ struct acpi_processor_throttling {
unsigned int state_count;
struct acpi_processor_tx_tss *states_tss;
struct acpi_tsd_package domain_info;
- cpumask_t shared_cpu_map;
+ cpumask_var_t shared_cpu_map;
int (*acpi_processor_get_throttling) (struct acpi_processor * pr);
int (*acpi_processor_set_throttling) (struct acpi_processor * pr,
int state);
diff --git a/include/asm-frv/bitops.h b/include/asm-frv/bitops.h
index 39456ba..287f6f6 100644
--- a/include/asm-frv/bitops.h
+++ b/include/asm-frv/bitops.h
@@ -339,6 +339,19 @@ int __ffs(unsigned long x)
return 31 - bit;
}

+/**
+ * __fls - find last (most-significant) set bit in a long word
+ * @word: the word to search
+ *
+ * Undefined if no set bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __fls(unsigned long word)
+{
+ unsigned long bit;
+ asm("scan %1,gr0,%0" : "=r"(bit) : "r"(word));
+ return bit;
+}
+
/*
* special slimline version of fls() for calculating ilog2_u32()
* - note: no protection against n == 0
diff --git a/include/asm-m32r/bitops.h b/include/asm-m32r/bitops.h
index 6dc9b81..aaddf0d 100644
--- a/include/asm-m32r/bitops.h
+++ b/include/asm-m32r/bitops.h
@@ -251,6 +251,7 @@ static __inline__ int test_and_change_bit(int nr, volatile void * addr)
#include <asm-generic/bitops/ffz.h>
#include <asm-generic/bitops/__ffs.h>
#include <asm-generic/bitops/fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>

#ifdef __KERNEL__
diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
index 3e81064..9bde784 100644
--- a/include/asm-m68k/bitops.h
+++ b/include/asm-m68k/bitops.h
@@ -315,6 +315,11 @@ static inline int fls(int x)
return 32 - cnt;
}

+static inline int __fls(int x)
+{
+ return fls(x) - 1;
+}
+
#include <asm-generic/bitops/fls64.h>
#include <asm-generic/bitops/sched.h>
#include <asm-generic/bitops/hweight.h>
diff --git a/include/asm-mn10300/bitops.h b/include/asm-mn10300/bitops.h
index cc6d40c..0b610f4 100644
--- a/include/asm-mn10300/bitops.h
+++ b/include/asm-mn10300/bitops.h
@@ -196,6 +196,17 @@ int fls(int x)
}

/**
+ * __fls - find last (most-significant) set bit in a long word
+ * @word: the word to search
+ *
+ * Undefined if no set bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __fls(unsigned long word)
+{
+ return __ilog2_u32(word);
+}
+
+/**
* ffs - find first bit set
* @x: the word to search
*
diff --git a/include/asm-xtensa/bitops.h b/include/asm-xtensa/bitops.h
index 23261e8..6c39303 100644
--- a/include/asm-xtensa/bitops.h
+++ b/include/asm-xtensa/bitops.h
@@ -82,6 +82,16 @@ static inline int fls (unsigned int x)
return 32 - __cntlz(x);
}

+/**
+ * __fls - find last (most-significant) set bit in a long word
+ * @word: the word to search
+ *
+ * Undefined if no set bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __fls(unsigned long word)
+{
+ return 31 - __cntlz(word);
+}
#else

/* Use the generic implementation if we don't have the nsa/nsau instructions. */
@@ -90,6 +100,7 @@ static inline int fls (unsigned int x)
# include <asm-generic/bitops/__ffs.h>
# include <asm-generic/bitops/ffz.h>
# include <asm-generic/bitops/fls.h>
+# include <asm-generic/bitops/__fls.h>

#endif

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index a08c33a..2878811 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -137,9 +137,12 @@ extern void bitmap_copy_le(void *dst, const unsigned long *src, int nbits);
(1UL<<((nbits) % BITS_PER_LONG))-1 : ~0UL \
)

+#define small_const_nbits(nbits) \
+ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
+
static inline void bitmap_zero(unsigned long *dst, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = 0UL;
else {
int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
@@ -150,7 +153,7 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
static inline void bitmap_fill(unsigned long *dst, int nbits)
{
size_t nlongs = BITS_TO_LONGS(nbits);
- if (nlongs > 1) {
+ if (!small_const_nbits(nbits)) {
int len = (nlongs - 1) * sizeof(unsigned long);
memset(dst, 0xff, len);
}
@@ -160,7 +163,7 @@ static inline void bitmap_fill(unsigned long *dst, int nbits)
static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src;
else {
int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
@@ -171,7 +174,7 @@ static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
static inline void bitmap_and(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src1 & *src2;
else
__bitmap_and(dst, src1, src2, nbits);
@@ -180,7 +183,7 @@ static inline void bitmap_and(unsigned long *dst, const unsigned long *src1,
static inline void bitmap_or(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src1 | *src2;
else
__bitmap_or(dst, src1, src2, nbits);
@@ -189,7 +192,7 @@ static inline void bitmap_or(unsigned long *dst, const unsigned long *src1,
static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src1 ^ *src2;
else
__bitmap_xor(dst, src1, src2, nbits);
@@ -198,7 +201,7 @@ static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
static inline void bitmap_andnot(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src1 & ~(*src2);
else
__bitmap_andnot(dst, src1, src2, nbits);
@@ -207,7 +210,7 @@ static inline void bitmap_andnot(unsigned long *dst, const unsigned long *src1,
static inline void bitmap_complement(unsigned long *dst, const unsigned long *src,
int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = ~(*src) & BITMAP_LAST_WORD_MASK(nbits);
else
__bitmap_complement(dst, src, nbits);
@@ -216,7 +219,7 @@ static inline void bitmap_complement(unsigned long *dst, const unsigned long *sr
static inline int bitmap_equal(const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return ! ((*src1 ^ *src2) & BITMAP_LAST_WORD_MASK(nbits));
else
return __bitmap_equal(src1, src2, nbits);
@@ -225,7 +228,7 @@ static inline int bitmap_equal(const unsigned long *src1,
static inline int bitmap_intersects(const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
else
return __bitmap_intersects(src1, src2, nbits);
@@ -234,7 +237,7 @@ static inline int bitmap_intersects(const unsigned long *src1,
static inline int bitmap_subset(const unsigned long *src1,
const unsigned long *src2, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return ! ((*src1 & ~(*src2)) & BITMAP_LAST_WORD_MASK(nbits));
else
return __bitmap_subset(src1, src2, nbits);
@@ -242,7 +245,7 @@ static inline int bitmap_subset(const unsigned long *src1,

static inline int bitmap_empty(const unsigned long *src, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return ! (*src & BITMAP_LAST_WORD_MASK(nbits));
else
return __bitmap_empty(src, nbits);
@@ -250,7 +253,7 @@ static inline int bitmap_empty(const unsigned long *src, int nbits)

static inline int bitmap_full(const unsigned long *src, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return ! (~(*src) & BITMAP_LAST_WORD_MASK(nbits));
else
return __bitmap_full(src, nbits);
@@ -258,7 +261,7 @@ static inline int bitmap_full(const unsigned long *src, int nbits)

static inline int bitmap_weight(const unsigned long *src, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits));
return __bitmap_weight(src, nbits);
}
@@ -266,7 +269,7 @@ static inline int bitmap_weight(const unsigned long *src, int nbits)
static inline void bitmap_shift_right(unsigned long *dst,
const unsigned long *src, int n, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = *src >> n;
else
__bitmap_shift_right(dst, src, n, nbits);
@@ -275,7 +278,7 @@ static inline void bitmap_shift_right(unsigned long *dst,
static inline void bitmap_shift_left(unsigned long *dst,
const unsigned long *src, int n, int nbits)
{
- if (nbits <= BITS_PER_LONG)
+ if (small_const_nbits(nbits))
*dst = (*src << n) & BITMAP_LAST_WORD_MASK(nbits);
else
__bitmap_shift_left(dst, src, n, nbits);
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 024f2b0..6182913 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -134,9 +134,20 @@ extern unsigned long find_first_bit(const unsigned long *addr,
*/
extern unsigned long find_first_zero_bit(const unsigned long *addr,
unsigned long size);
-
#endif /* CONFIG_GENERIC_FIND_FIRST_BIT */

+#ifdef CONFIG_GENERIC_FIND_LAST_BIT
+/**
+ * find_last_bit - find the last set bit in a memory region
+ * @addr: The address to start the search at
+ * @size: The maximum size to search
+ *
+ * Returns the bit number of the last set bit, or size if no bits are set.
+ */
+extern unsigned long find_last_bit(const unsigned long *addr,
+ unsigned long size);
+#endif /* CONFIG_GENERIC_FIND_LAST_BIT */
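
A rough usage sketch for the new helper (the bitmap below is purely
illustrative): find_last_bit() scans downward, so the highest set bit comes
out of a single call rather than a walk over every bit:

	DECLARE_BITMAP(ids, 128);
	unsigned long last;

	bitmap_zero(ids, 128);
	set_bit(3, ids);
	set_bit(42, ids);
	last = find_last_bit(ids, 128);	/* 42; returns 128 if no bit is set */

setup_nr_cpu_ids() in the init/main.c hunk further down uses the same call on
cpu_possible_mask.
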
+
#ifdef CONFIG_GENERIC_FIND_NEXT_BIT

/**
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d4bf526..9f31538 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -144,6 +144,7 @@
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
extern cpumask_t _unused_cpumask_arg_;

+#ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
#define cpu_set(cpu, dst) __cpu_set((cpu), &(dst))
static inline void __cpu_set(int cpu, volatile cpumask_t *dstp)
{
@@ -267,6 +268,26 @@ static inline void __cpus_shift_left(cpumask_t *dstp,
{
bitmap_shift_left(dstp->bits, srcp->bits, n, nbits);
}
+#endif /* !CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS */
+
+/**
+ * to_cpumask - convert an NR_CPUS bitmap to a struct cpumask *
+ * @bitmap: the bitmap
+ *
+ * There are a few places where cpumask_var_t isn't appropriate and
+ * static cpumasks must be used (eg. very early boot), yet we don't
+ * expose the definition of 'struct cpumask'.
+ *
+ * This does the conversion, and can be used as a constant initializer.
+ */
+#define to_cpumask(bitmap) \
+ ((struct cpumask *)(1 ? (bitmap) \
+ : (void *)sizeof(__check_is_bitmap(bitmap))))
+
+static inline int __check_is_bitmap(const unsigned long *bitmap)
+{
+ return 1;
+}
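
A quick illustrative sketch (the names here are made up): a static NR_CPUS
bitmap keeps working the old way, and to_cpumask() is how it gets handed to
the pointer-based cpumask API, including in initializers:

	static DECLARE_BITMAP(early_cpu_bits, NR_CPUS) __read_mostly;
	#define early_cpu_mask	to_cpumask(early_cpu_bits)

	static void mark_early_cpu(unsigned int cpu)
	{
		cpumask_set_cpu(cpu, early_cpu_mask);
	}

The type check via __check_is_bitmap() means handing it anything other than
an unsigned long bitmap at least warns at compile time.
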

/*
* Special-case data structure for "single bit set only" constant CPU masks.
@@ -278,13 +299,14 @@ static inline void __cpus_shift_left(cpumask_t *dstp,
extern const unsigned long
cpu_bit_bitmap[BITS_PER_LONG+1][BITS_TO_LONGS(NR_CPUS)];

-static inline const cpumask_t *get_cpu_mask(unsigned int cpu)
+static inline const struct cpumask *get_cpu_mask(unsigned int cpu)
{
const unsigned long *p = cpu_bit_bitmap[1 + cpu % BITS_PER_LONG];
p -= cpu / BITS_PER_LONG;
- return (const cpumask_t *)p;
+ return to_cpumask(p);
}

+#ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
/*
* In cases where we take the address of the cpumask immediately,
* gcc optimizes it out (it's a constant) and there's no huge stack
@@ -370,19 +392,22 @@ static inline void __cpus_fold(cpumask_t *dstp, const cpumask_t *origp,
{
bitmap_fold(dstp->bits, origp->bits, sz, nbits);
}
+#endif /* !CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS */

#if NR_CPUS == 1

#define nr_cpu_ids 1
+#ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
#define first_cpu(src) ({ (void)(src); 0; })
#define next_cpu(n, src) ({ (void)(src); 1; })
#define any_online_cpu(mask) 0
#define for_each_cpu_mask(cpu, mask) \
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
-
+#endif /* !CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS */
#else /* NR_CPUS > 1 */

extern int nr_cpu_ids;
+#ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
int __first_cpu(const cpumask_t *srcp);
int __next_cpu(int n, const cpumask_t *srcp);
int __any_online_cpu(const cpumask_t *mask);
@@ -394,8 +419,10 @@ int __any_online_cpu(const cpumask_t *mask);
for ((cpu) = -1; \
(cpu) = next_cpu((cpu), (mask)), \
(cpu) < NR_CPUS; )
+#endif /* !CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS */
#endif

+#ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
#if NR_CPUS <= 64

#define next_cpu_nr(n, src) next_cpu(n, src)
@@ -413,77 +440,67 @@ int __next_cpu_nr(int n, const cpumask_t *srcp);
(cpu) < nr_cpu_ids; )

#endif /* NR_CPUS > 64 */
+#endif /* !CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS */

/*
* The following particular system cpumasks and operations manage
- * possible, present, active and online cpus. Each of them is a fixed size
- * bitmap of size NR_CPUS.
+ * possible, present, active and online cpus.
*
- * #ifdef CONFIG_HOTPLUG_CPU
- * cpu_possible_map - has bit 'cpu' set iff cpu is populatable
- * cpu_present_map - has bit 'cpu' set iff cpu is populated
- * cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
- * cpu_active_map - has bit 'cpu' set iff cpu available to migration
- * #else
- * cpu_possible_map - has bit 'cpu' set iff cpu is populated
- * cpu_present_map - copy of cpu_possible_map
- * cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
- * #endif
+ * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
+ * cpu_present_mask - has bit 'cpu' set iff cpu is populated
+ * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
+ * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
*
- * In either case, NR_CPUS is fixed at compile time, as the static
- * size of these bitmaps. The cpu_possible_map is fixed at boot
- * time, as the set of CPU id's that it is possible might ever
- * be plugged in at anytime during the life of that system boot.
- * The cpu_present_map is dynamic(*), representing which CPUs
- * are currently plugged in. And cpu_online_map is the dynamic
- * subset of cpu_present_map, indicating those CPUs available
- * for scheduling.
+ * If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
*
- * If HOTPLUG is enabled, then cpu_possible_map is forced to have
+ * The cpu_possible_mask is fixed at boot time, as the set of CPU ids
+ * that could ever be plugged in at any time during the life of that
+ * system boot. The cpu_present_mask is dynamic(*),
+ * representing which CPUs are currently plugged in. And
+ * cpu_online_mask is the dynamic subset of cpu_present_mask,
+ * indicating those CPUs available for scheduling.
+ *
+ * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.
*
- * If HOTPLUG is enabled, then cpu_present_map varies dynamically,
+ * If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
* depending on what ACPI reports as currently plugged in, otherwise
- * cpu_present_map is just a copy of cpu_possible_map.
+ * cpu_present_mask is just a copy of cpu_possible_mask.
*
- * (*) Well, cpu_present_map is dynamic in the hotplug case. If not
- * hotplug, it's a copy of cpu_possible_map, hence fixed at boot.
+ * (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
+ * hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
*
* Subtleties:
* 1) UP arch's (NR_CPUS == 1, CONFIG_SMP not defined) hardcode
* assumption that their single CPU is online. The UP
- * cpu_{online,possible,present}_maps are placebos. Changing them
+ * cpu_{online,possible,present}_masks are placebos. Changing them
 * will have no useful effect on the following num_*_cpus()
* and cpu_*() macros in the UP case. This ugliness is a UP
* optimization - don't waste any instructions or memory references
* asking if you're online or how many CPUs there are if there is
* only one CPU.
- * 2) Most SMP arch's #define some of these maps to be some
- * other map specific to that arch. Therefore, the following
- * must be #define macros, not inlines. To see why, examine
- * the assembly code produced by the following. Note that
- * set1() writes phys_x_map, but set2() writes x_map:
- * int x_map, phys_x_map;
- * #define set1(a) x_map = a
- * inline void set2(int a) { x_map = a; }
- * #define x_map phys_x_map
- * main(){ set1(3); set2(5); }
*/

-extern cpumask_t cpu_possible_map;
-extern cpumask_t cpu_online_map;
-extern cpumask_t cpu_present_map;
-extern cpumask_t cpu_active_map;
+extern const struct cpumask *const cpu_possible_mask;
+extern const struct cpumask *const cpu_online_mask;
+extern const struct cpumask *const cpu_present_mask;
+extern const struct cpumask *const cpu_active_mask;
+
+/* These strip const, as traditionally they weren't const. */
+#define cpu_possible_map (*(cpumask_t *)cpu_possible_mask)
+#define cpu_online_map (*(cpumask_t *)cpu_online_mask)
+#define cpu_present_map (*(cpumask_t *)cpu_present_mask)
+#define cpu_active_map (*(cpumask_t *)cpu_active_mask)

#if NR_CPUS > 1
-#define num_online_cpus() cpus_weight_nr(cpu_online_map)
-#define num_possible_cpus() cpus_weight_nr(cpu_possible_map)
-#define num_present_cpus() cpus_weight_nr(cpu_present_map)
-#define cpu_online(cpu) cpu_isset((cpu), cpu_online_map)
-#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map)
-#define cpu_present(cpu) cpu_isset((cpu), cpu_present_map)
-#define cpu_active(cpu) cpu_isset((cpu), cpu_active_map)
+#define num_online_cpus() cpumask_weight(cpu_online_mask)
+#define num_possible_cpus() cpumask_weight(cpu_possible_mask)
+#define num_present_cpus() cpumask_weight(cpu_present_mask)
+#define cpu_online(cpu) cpumask_test_cpu((cpu), cpu_online_mask)
+#define cpu_possible(cpu) cpumask_test_cpu((cpu), cpu_possible_mask)
+#define cpu_present(cpu) cpumask_test_cpu((cpu), cpu_present_mask)
+#define cpu_active(cpu) cpumask_test_cpu((cpu), cpu_active_mask)
#else
#define num_online_cpus() 1
#define num_possible_cpus() 1
@@ -496,10 +513,6 @@ extern cpumask_t cpu_active_map;

#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))

-#define for_each_possible_cpu(cpu) for_each_cpu_mask_nr((cpu), cpu_possible_map)
-#define for_each_online_cpu(cpu) for_each_cpu_mask_nr((cpu), cpu_online_map)
-#define for_each_present_cpu(cpu) for_each_cpu_mask_nr((cpu), cpu_present_map)
-
/* These are the new versions of the cpumask operators: passed by pointer.
* The older versions will be implemented in terms of these, then deleted. */
#define cpumask_bits(maskp) ((maskp)->bits)
@@ -687,7 +700,7 @@ static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
* No static inline type checking - see Subtlety (1) above.
*/
#define cpumask_test_cpu(cpu, cpumask) \
- test_bit(cpumask_check(cpu), (cpumask)->bits)
+ test_bit(cpumask_check(cpu), cpumask_bits((cpumask)))

/**
* cpumask_test_and_set_cpu - atomically test and set a cpu in a cpumask
@@ -930,7 +943,7 @@ static inline void cpumask_copy(struct cpumask *dstp,
static inline int cpumask_scnprintf(char *buf, int len,
const struct cpumask *srcp)
{
- return bitmap_scnprintf(buf, len, srcp->bits, nr_cpumask_bits);
+ return bitmap_scnprintf(buf, len, cpumask_bits(srcp), nr_cpumask_bits);
}

/**
@@ -944,7 +957,7 @@ static inline int cpumask_scnprintf(char *buf, int len,
static inline int cpumask_parse_user(const char __user *buf, int len,
struct cpumask *dstp)
{
- return bitmap_parse_user(buf, len, dstp->bits, nr_cpumask_bits);
+ return bitmap_parse_user(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
}

/**
@@ -959,7 +972,8 @@ static inline int cpumask_parse_user(const char __user *buf, int len,
static inline int cpulist_scnprintf(char *buf, int len,
const struct cpumask *srcp)
{
- return bitmap_scnlistprintf(buf, len, srcp->bits, nr_cpumask_bits);
+ return bitmap_scnlistprintf(buf, len, cpumask_bits(srcp),
+ nr_cpumask_bits);
}

/**
@@ -972,26 +986,7 @@ static inline int cpulist_scnprintf(char *buf, int len,
*/
static inline int cpulist_parse(const char *buf, struct cpumask *dstp)
{
- return bitmap_parselist(buf, dstp->bits, nr_cpumask_bits);
-}
-
-/**
- * to_cpumask - convert an NR_CPUS bitmap to a struct cpumask *
- * @bitmap: the bitmap
- *
- * There are a few places where cpumask_var_t isn't appropriate and
- * static cpumasks must be used (eg. very early boot), yet we don't
- * expose the definition of 'struct cpumask'.
- *
- * This does the conversion, and can be used as a constant initializer.
- */
-#define to_cpumask(bitmap) \
- ((struct cpumask *)(1 ? (bitmap) \
- : (void *)sizeof(__check_is_bitmap(bitmap))))
-
-static inline int __check_is_bitmap(const unsigned long *bitmap)
-{
- return 1;
+ return bitmap_parselist(buf, cpumask_bits(dstp), nr_cpumask_bits);
}

/**
@@ -1025,6 +1020,7 @@ static inline size_t cpumask_size(void)
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;

+bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags, int node);
bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags);
void alloc_bootmem_cpumask_var(cpumask_var_t *mask);
void free_cpumask_var(cpumask_var_t mask);
@@ -1038,6 +1034,12 @@ static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
return true;
}

+static inline bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags,
+ int node)
+{
+ return true;
+}
+
static inline void alloc_bootmem_cpumask_var(cpumask_var_t *mask)
{
}
@@ -1051,12 +1053,6 @@ static inline void free_bootmem_cpumask_var(cpumask_var_t mask)
}
#endif /* CONFIG_CPUMASK_OFFSTACK */

-/* The pointer versions of the maps, these will become the primary versions. */
-#define cpu_possible_mask ((const struct cpumask *)&cpu_possible_map)
-#define cpu_online_mask ((const struct cpumask *)&cpu_online_map)
-#define cpu_present_mask ((const struct cpumask *)&cpu_present_map)
-#define cpu_active_mask ((const struct cpumask *)&cpu_active_map)
-
/* It's common to want to use cpu_all_mask in struct member initializers,
* so it has to refer to an address rather than a pointer. */
extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
@@ -1065,51 +1061,16 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
/* First bits of cpu_bit_bitmap are in fact unset. */
#define cpu_none_mask to_cpumask(cpu_bit_bitmap[0])

-/* Wrappers for arch boot code to manipulate normally-constant masks */
-static inline void set_cpu_possible(unsigned int cpu, bool possible)
-{
- if (possible)
- cpumask_set_cpu(cpu, &cpu_possible_map);
- else
- cpumask_clear_cpu(cpu, &cpu_possible_map);
-}
-
-static inline void set_cpu_present(unsigned int cpu, bool present)
-{
- if (present)
- cpumask_set_cpu(cpu, &cpu_present_map);
- else
- cpumask_clear_cpu(cpu, &cpu_present_map);
-}
-
-static inline void set_cpu_online(unsigned int cpu, bool online)
-{
- if (online)
- cpumask_set_cpu(cpu, &cpu_online_map);
- else
- cpumask_clear_cpu(cpu, &cpu_online_map);
-}
+#define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
+#define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
+#define for_each_present_cpu(cpu) for_each_cpu((cpu), cpu_present_mask)

-static inline void set_cpu_active(unsigned int cpu, bool active)
-{
- if (active)
- cpumask_set_cpu(cpu, &cpu_active_map);
- else
- cpumask_clear_cpu(cpu, &cpu_active_map);
-}
-
-static inline void init_cpu_present(const struct cpumask *src)
-{
- cpumask_copy(&cpu_present_map, src);
-}
-
-static inline void init_cpu_possible(const struct cpumask *src)
-{
- cpumask_copy(&cpu_possible_map, src);
-}
-
-static inline void init_cpu_online(const struct cpumask *src)
-{
- cpumask_copy(&cpu_online_map, src);
-}
+/* Wrappers for arch boot code to manipulate normally-constant masks */
+void set_cpu_possible(unsigned int cpu, bool possible);
+void set_cpu_present(unsigned int cpu, bool present);
+void set_cpu_online(unsigned int cpu, bool online);
+void set_cpu_active(unsigned int cpu, bool active);
+void init_cpu_present(const struct cpumask *src);
+void init_cpu_possible(const struct cpumask *src);
+void init_cpu_online(const struct cpumask *src);
#endif /* __LINUX_CPUMASK_H */
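
Pulling the cpumask.h changes together, a hedged sketch of the calling
pattern the new API wants (the function and mask names are invented):

	static int count_allowed_online(const struct cpumask *allowed)
	{
		cpumask_var_t tmp;
		int n;

		if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
			return -ENOMEM;

		cpumask_and(tmp, allowed, cpu_online_mask);
		n = cpumask_weight(tmp);

		free_cpumask_var(tmp);
		return n;
	}

With CONFIG_CPUMASK_OFFSTACK=n this costs nothing: cpumask_var_t is a
one-element array and the alloc/free calls are no-ops.  With it on, the mask
comes from the heap and nothing bigger than a pointer lands on the stack.
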
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 990355f..0702c4d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -109,7 +109,7 @@ extern void enable_irq(unsigned int irq);

#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_HARDIRQS)

-extern cpumask_t irq_default_affinity;
+extern cpumask_var_t irq_default_affinity;

extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
extern int irq_can_set_affinity(unsigned int irq);
diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index 301dda8..f3f697d 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -59,8 +59,8 @@ struct rcu_ctrlblk {
int signaled;

spinlock_t lock ____cacheline_internodealigned_in_smp;
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
+ DECLARE_BITMAP(cpumask, NR_CPUS); /* CPUs that need to switch for */
+ /* current batch to proceed. */
} ____cacheline_internodealigned_in_smp;

/* Is batch a before batch b ? */
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index b3dfa72..40ea505 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -50,10 +50,11 @@ int seq_path(struct seq_file *, struct path *, char *);
int seq_dentry(struct seq_file *, struct dentry *, char *);
int seq_path_root(struct seq_file *m, struct path *path, struct path *root,
char *esc);
-int seq_bitmap(struct seq_file *m, unsigned long *bits, unsigned int nr_bits);
-static inline int seq_cpumask(struct seq_file *m, cpumask_t *mask)
+int seq_bitmap(struct seq_file *m, const unsigned long *bits,
+ unsigned int nr_bits);
+static inline int seq_cpumask(struct seq_file *m, const struct cpumask *mask)
{
- return seq_bitmap(m, mask->bits, NR_CPUS);
+ return seq_bitmap(m, mask->bits, nr_cpu_ids);
}

static inline int seq_nodemask(struct seq_file *m, nodemask_t *mask)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6e7ba16..b824669 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -21,6 +21,9 @@ struct call_single_data {
u16 priv;
};

+/* total number of cpus in this system (may exceed NR_CPUS) */
+extern unsigned int total_cpus;
+
#ifdef CONFIG_SMP

#include <linux/preempt.h>
@@ -64,15 +67,16 @@ extern void smp_cpus_done(unsigned int max_cpus);
* Call a function on all other processors
*/
int smp_call_function(void(*func)(void *info), void *info, int wait);
-/* Deprecated: use smp_call_function_many() which uses a cpumask ptr. */
-int smp_call_function_mask(cpumask_t mask, void(*func)(void *info), void *info,
- int wait);
+void smp_call_function_many(const struct cpumask *mask,
+ void (*func)(void *info), void *info, bool wait);

-static inline void smp_call_function_many(const struct cpumask *mask,
- void (*func)(void *info), void *info,
- int wait)
+/* Deprecated: Use smp_call_function_many which takes a pointer to the mask. */
+static inline int
+smp_call_function_mask(cpumask_t mask, void(*func)(void *info), void *info,
+ int wait)
{
- smp_call_function_mask(*mask, func, info, wait);
+ smp_call_function_many(&mask, func, info, wait);
+ return 0;
}

int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
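
A before/after sketch at a hypothetical call site (my_cpus and do_flush are
illustrative), showing what the conversion buys: nothing the size of a
cpumask_t gets copied by value any more.

	/* Old, deprecated: copies NR_CPUS bits onto the stack. */
	smp_call_function_mask(my_cpus, do_flush, NULL, 1);

	/* New: passes a pointer; offline CPUs in the mask are skipped. */
	smp_call_function_many(&my_cpus, do_flush, NULL, true);
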
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index faf1519..74d59a6 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -23,7 +23,7 @@
*
* This can be thought of as a very heavy write lock, equivalent to
* grabbing every spinlock in the kernel. */
-int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus);
+int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);

/**
* __stop_machine: freeze the machine on all CPUs and run this function
@@ -34,11 +34,11 @@ int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus);
* Description: This is a special version of the above, which assumes cpus
* won't come or go while it's being called. Used by hotplug cpu.
*/
-int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus);
+int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);
#else

static inline int stop_machine(int (*fn)(void *), void *data,
- const cpumask_t *cpus)
+ const struct cpumask *cpus)
{
int ret;
local_irq_disable();
diff --git a/include/linux/threads.h b/include/linux/threads.h
index 38d1a5d..052b12b 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -8,17 +8,17 @@
*/

/*
- * Maximum supported processors that can run under SMP. This value is
- * set via configure setting. The maximum is equal to the size of the
- * bitmasks used on that platform, i.e. 32 or 64. Setting this smaller
- * saves quite a bit of memory.
+ * Maximum supported processors. Setting this smaller saves quite a
+ * bit of memory. Use nr_cpu_ids instead of this except for static bitmaps.
*/
-#ifdef CONFIG_SMP
-#define NR_CPUS CONFIG_NR_CPUS
-#else
-#define NR_CPUS 1
+#ifndef CONFIG_NR_CPUS
+/* FIXME: This should be fixed in the arch's Kconfig */
+#define CONFIG_NR_CPUS 1
#endif

+/* Places which use this should consider cpumask_var_t. */
+#define NR_CPUS CONFIG_NR_CPUS
+
#define MIN_THREADS_LEFT_FOR_ROOT 4

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b6ec818..469b82d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -84,10 +84,10 @@ static inline void tick_cancel_sched_timer(int cpu) { }

# ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
extern struct tick_device *tick_get_broadcast_device(void);
-extern cpumask_t *tick_get_broadcast_mask(void);
+extern struct cpumask *tick_get_broadcast_mask(void);

# ifdef CONFIG_TICK_ONESHOT
-extern cpumask_t *tick_get_broadcast_oneshot_mask(void);
+extern struct cpumask *tick_get_broadcast_oneshot_mask(void);
# endif

# endif /* BROADCAST */
diff --git a/init/main.c b/init/main.c
index ad8f9f5..cd168eb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -371,12 +371,7 @@ EXPORT_SYMBOL(nr_cpu_ids);
/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
static void __init setup_nr_cpu_ids(void)
{
- int cpu, highest_cpu = 0;
-
- for_each_possible_cpu(cpu)
- highest_cpu = cpu;
-
- nr_cpu_ids = highest_cpu + 1;
+ nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
@@ -518,9 +513,9 @@ static void __init boot_cpu_init(void)
{
int cpu = smp_processor_id();
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
- cpu_set(cpu, cpu_online_map);
- cpu_set(cpu, cpu_present_map);
- cpu_set(cpu, cpu_possible_map);
+ set_cpu_online(cpu, true);
+ set_cpu_present(cpu, true);
+ set_cpu_possible(cpu, true);
}

void __init __weak smp_setup_processor_id(void)
diff --git a/kernel/compat.c b/kernel/compat.c
index 8eafe3e..d52e2ec 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -454,16 +454,16 @@ asmlinkage long compat_sys_waitid(int which, compat_pid_t pid,
}

static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
- unsigned len, cpumask_t *new_mask)
+ unsigned len, struct cpumask *new_mask)
{
unsigned long *k;

- if (len < sizeof(cpumask_t))
- memset(new_mask, 0, sizeof(cpumask_t));
- else if (len > sizeof(cpumask_t))
- len = sizeof(cpumask_t);
+ if (len < cpumask_size())
+ memset(new_mask, 0, cpumask_size());
+ else if (len > cpumask_size())
+ len = cpumask_size();

- k = cpus_addr(*new_mask);
+ k = cpumask_bits(new_mask);
return compat_get_bitmap(k, user_mask_ptr, len * 8);
}

@@ -471,40 +471,51 @@ asmlinkage long compat_sys_sched_setaffinity(compat_pid_t pid,
unsigned int len,
compat_ulong_t __user *user_mask_ptr)
{
- cpumask_t new_mask;
+ cpumask_var_t new_mask;
int retval;

- retval = compat_get_user_cpu_mask(user_mask_ptr, len, &new_mask);
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ retval = compat_get_user_cpu_mask(user_mask_ptr, len, new_mask);
if (retval)
- return retval;
+ goto out;

- return sched_setaffinity(pid, &new_mask);
+ retval = sched_setaffinity(pid, new_mask);
+out:
+ free_cpumask_var(new_mask);
+ return retval;
}

asmlinkage long compat_sys_sched_getaffinity(compat_pid_t pid, unsigned int len,
compat_ulong_t __user *user_mask_ptr)
{
int ret;
- cpumask_t mask;
+ cpumask_var_t mask;
unsigned long *k;
- unsigned int min_length = sizeof(cpumask_t);
+ unsigned int min_length = cpumask_size();

- if (NR_CPUS <= BITS_PER_COMPAT_LONG)
+ if (nr_cpu_ids <= BITS_PER_COMPAT_LONG)
min_length = sizeof(compat_ulong_t);

if (len < min_length)
return -EINVAL;

- ret = sched_getaffinity(pid, &mask);
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ ret = sched_getaffinity(pid, mask);
if (ret < 0)
- return ret;
+ goto out;

- k = cpus_addr(mask);
+ k = cpumask_bits(mask);
ret = compat_put_bitmap(user_mask_ptr, k, min_length * 8);
- if (ret)
- return ret;
+ if (ret == 0)
+ ret = min_length;

- return min_length;
+out:
+ free_cpumask_var(mask);
+ return ret;
}

int get_compat_itimerspec(struct itimerspec *dst,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bae131a..47fff3b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -15,30 +15,8 @@
#include <linux/stop_machine.h>
#include <linux/mutex.h>

-/*
- * Represents all cpu's present in the system
- * In systems capable of hotplug, this map could dynamically grow
- * as new cpu's are detected in the system via any platform specific
- * method, such as ACPI for e.g.
- */
-cpumask_t cpu_present_map __read_mostly;
-EXPORT_SYMBOL(cpu_present_map);
-
-/*
- * Represents all cpu's that are currently online.
- */
-cpumask_t cpu_online_map __read_mostly;
-EXPORT_SYMBOL(cpu_online_map);
-
-#ifdef CONFIG_INIT_ALL_POSSIBLE
-cpumask_t cpu_possible_map __read_mostly = CPU_MASK_ALL;
-#else
-cpumask_t cpu_possible_map __read_mostly;
-#endif
-EXPORT_SYMBOL(cpu_possible_map);
-
#ifdef CONFIG_SMP
-/* Serializes the updates to cpu_online_map, cpu_present_map */
+/* Serializes the updates to cpu_online_mask, cpu_present_mask */
static DEFINE_MUTEX(cpu_add_remove_lock);

static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
@@ -65,8 +43,6 @@ void __init cpu_hotplug_init(void)
cpu_hotplug.refcount = 0;
}

-cpumask_t cpu_active_map;
-
#ifdef CONFIG_HOTPLUG_CPU

void get_online_cpus(void)
@@ -97,7 +73,7 @@ EXPORT_SYMBOL_GPL(put_online_cpus);

/*
* The following two API's must be used when attempting
- * to serialize the updates to cpu_online_map, cpu_present_map.
+ * to serialize the updates to cpu_online_mask, cpu_present_mask.
*/
void cpu_maps_update_begin(void)
{
@@ -218,7 +194,7 @@ static int __ref take_cpu_down(void *_param)
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
int err, nr_calls = 0;
- cpumask_t old_allowed, tmp;
+ cpumask_var_t old_allowed;
void *hcpu = (void *)(long)cpu;
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct take_cpu_down_param tcd_param = {
@@ -232,6 +208,9 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
if (!cpu_online(cpu))
return -EINVAL;

+ if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
+ return -ENOMEM;
+
cpu_hotplug_begin();
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
@@ -246,13 +225,11 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
}

/* Ensure that we are not runnable on dying cpu */
- old_allowed = current->cpus_allowed;
- cpus_setall(tmp);
- cpu_clear(cpu, tmp);
- set_cpus_allowed_ptr(current, &tmp);
- tmp = cpumask_of_cpu(cpu);
+ cpumask_copy(old_allowed, &current->cpus_allowed);
+ set_cpus_allowed_ptr(current,
+ cpumask_of(cpumask_any_but(cpu_online_mask, cpu)));

- err = __stop_machine(take_cpu_down, &tcd_param, &tmp);
+ err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
if (err) {
/* CPU didn't die: tell everyone. Can't complain. */
if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
@@ -278,7 +255,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
check_for_tasks(cpu);

out_allowed:
- set_cpus_allowed_ptr(current, &old_allowed);
+ set_cpus_allowed_ptr(current, old_allowed);
out_release:
cpu_hotplug_done();
if (!err) {
@@ -286,6 +263,7 @@ out_release:
hcpu) == NOTIFY_BAD)
BUG();
}
+ free_cpumask_var(old_allowed);
return err;
}

@@ -304,7 +282,7 @@ int __ref cpu_down(unsigned int cpu)

/*
 * Make sure all the cpus did the reschedule and are not
- * using stale version of the cpu_active_map.
+ * using stale version of the cpu_active_mask.
 * This is not strictly necessary because stop_machine()
* that we run down the line already provides the required
* synchronization. But it's really a side effect and we do not
@@ -368,7 +346,7 @@ out_notify:
int __cpuinit cpu_up(unsigned int cpu)
{
int err = 0;
- if (!cpu_isset(cpu, cpu_possible_map)) {
+ if (!cpu_possible(cpu)) {
printk(KERN_ERR "can't online cpu %d because it is not "
"configured as may-hotadd at boot time\n", cpu);
#if defined(CONFIG_IA64) || defined(CONFIG_X86_64)
@@ -393,25 +371,25 @@ out:
}

#ifdef CONFIG_PM_SLEEP_SMP
-static cpumask_t frozen_cpus;
+static cpumask_var_t frozen_cpus;

int disable_nonboot_cpus(void)
{
int cpu, first_cpu, error = 0;

cpu_maps_update_begin();
- first_cpu = first_cpu(cpu_online_map);
+ first_cpu = cpumask_first(cpu_online_mask);
/* We take down all of the non-boot CPUs in one shot to avoid races
* with the userspace trying to use the CPU hotplug at the same time
*/
- cpus_clear(frozen_cpus);
+ cpumask_clear(frozen_cpus);
printk("Disabling non-boot CPUs ...\n");
for_each_online_cpu(cpu) {
if (cpu == first_cpu)
continue;
error = _cpu_down(cpu, 1);
if (!error) {
- cpu_set(cpu, frozen_cpus);
+ cpumask_set_cpu(cpu, frozen_cpus);
printk("CPU%d is down\n", cpu);
} else {
printk(KERN_ERR "Error taking CPU%d down: %d\n",
@@ -437,11 +415,11 @@ void __ref enable_nonboot_cpus(void)
/* Allow everyone to use the CPU hotplug again */
cpu_maps_update_begin();
cpu_hotplug_disabled = 0;
- if (cpus_empty(frozen_cpus))
+ if (cpumask_empty(frozen_cpus))
goto out;

printk("Enabling non-boot CPUs ...\n");
- for_each_cpu_mask_nr(cpu, frozen_cpus) {
+ for_each_cpu(cpu, frozen_cpus) {
error = _cpu_up(cpu, 1);
if (!error) {
printk("CPU%d is up\n", cpu);
@@ -449,10 +427,18 @@ void __ref enable_nonboot_cpus(void)
}
printk(KERN_WARNING "Error taking CPU%d up: %d\n", cpu, error);
}
- cpus_clear(frozen_cpus);
+ cpumask_clear(frozen_cpus);
out:
cpu_maps_update_done();
}
+
+static int alloc_frozen_cpus(void)
+{
+ if (!alloc_cpumask_var(&frozen_cpus, GFP_KERNEL|__GFP_ZERO))
+ return -ENOMEM;
+ return 0;
+}
+core_initcall(alloc_frozen_cpus);
#endif /* CONFIG_PM_SLEEP_SMP */

/**
@@ -468,7 +454,7 @@ void __cpuinit notify_cpu_starting(unsigned int cpu)
unsigned long val = CPU_STARTING;

#ifdef CONFIG_PM_SLEEP_SMP
- if (cpu_isset(cpu, frozen_cpus))
+ if (frozen_cpus != NULL && cpumask_test_cpu(cpu, frozen_cpus))
val = CPU_STARTING_FROZEN;
#endif /* CONFIG_PM_SLEEP_SMP */
raw_notifier_call_chain(&cpu_chain, val, (void *)(long)cpu);
@@ -480,7 +466,7 @@ void __cpuinit notify_cpu_starting(unsigned int cpu)
* cpu_bit_bitmap[] is a special, "compressed" data structure that
* represents all NR_CPUS bits binary values of 1<<nr.
*
- * It is used by cpumask_of_cpu() to get a constant address to a CPU
+ * It is used by cpumask_of() to get a constant address to a CPU
* mask value that has a single bit set only.
*/

@@ -503,3 +489,71 @@ EXPORT_SYMBOL_GPL(cpu_bit_bitmap);

const DECLARE_BITMAP(cpu_all_bits, NR_CPUS) = CPU_BITS_ALL;
EXPORT_SYMBOL(cpu_all_bits);
+
+#ifdef CONFIG_INIT_ALL_POSSIBLE
+static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly
+ = CPU_BITS_ALL;
+#else
+static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
+#endif
+const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
+EXPORT_SYMBOL(cpu_possible_mask);
+
+static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);
+EXPORT_SYMBOL(cpu_online_mask);
+
+static DECLARE_BITMAP(cpu_present_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_present_mask = to_cpumask(cpu_present_bits);
+EXPORT_SYMBOL(cpu_present_mask);
+
+static DECLARE_BITMAP(cpu_active_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
+EXPORT_SYMBOL(cpu_active_mask);
+
+void set_cpu_possible(unsigned int cpu, bool possible)
+{
+ if (possible)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_possible_bits));
+}
+
+void set_cpu_present(unsigned int cpu, bool present)
+{
+ if (present)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_present_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_present_bits));
+}
+
+void set_cpu_online(unsigned int cpu, bool online)
+{
+ if (online)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+}
+
+void set_cpu_active(unsigned int cpu, bool active)
+{
+ if (active)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_active_bits));
+}
+
+void init_cpu_present(const struct cpumask *src)
+{
+ cpumask_copy(to_cpumask(cpu_present_bits), src);
+}
+
+void init_cpu_possible(const struct cpumask *src)
+{
+ cpumask_copy(to_cpumask(cpu_possible_bits), src);
+}
+
+void init_cpu_online(const struct cpumask *src)
+{
+ cpumask_copy(to_cpumask(cpu_online_bits), src);
+}
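
A hedged sketch of what arch boot code looks like against these accessors now
that the bitmaps themselves are private to kernel/cpu.c (the function name
and loop are illustrative):

	void __init arch_register_possible_cpus(unsigned int ncpus)
	{
		unsigned int cpu;

		for (cpu = 0; cpu < ncpus; cpu++) {
			set_cpu_possible(cpu, true);
			set_cpu_present(cpu, true);
		}
	}

Or, when an arch already has a mask in hand, init_cpu_possible(mask) copies
it in wholesale.
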
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 61c4a9b..cd0cd8d 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -16,8 +16,15 @@
#include "internals.h"

#ifdef CONFIG_SMP
+cpumask_var_t irq_default_affinity;

-cpumask_t irq_default_affinity = CPU_MASK_ALL;
+static int init_irq_default_affinity(void)
+{
+ alloc_cpumask_var(&irq_default_affinity, GFP_KERNEL);
+ cpumask_setall(irq_default_affinity);
+ return 0;
+}
+core_initcall(init_irq_default_affinity);

/**
* synchronize_irq - wait for pending IRQ handlers (on other CPUs)
@@ -127,7 +134,7 @@ int do_irq_select_affinity(unsigned int irq, struct irq_desc *desc)
desc->status &= ~IRQ_AFFINITY_SET;
}

- cpumask_and(&desc->affinity, cpu_online_mask, &irq_default_affinity);
+ cpumask_and(&desc->affinity, cpu_online_mask, irq_default_affinity);
set_affinity:
desc->chip->set_affinity(irq, &desc->affinity);

diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index d2c0e5e..aae3f74 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -20,7 +20,7 @@ static struct proc_dir_entry *root_irq_dir;
static int irq_affinity_proc_show(struct seq_file *m, void *v)
{
struct irq_desc *desc = irq_to_desc((long)m->private);
- cpumask_t *mask = &desc->affinity;
+ const struct cpumask *mask = &desc->affinity;

#ifdef CONFIG_GENERIC_PENDING_IRQ
if (desc->status & IRQ_MOVE_PENDING)
@@ -54,7 +54,7 @@ static ssize_t irq_affinity_proc_write(struct file *file,
if (err)
goto free_cpumask;

- if (!is_affinity_mask_valid(*new_value)) {
+ if (!is_affinity_mask_valid(new_value)) {
err = -EINVAL;
goto free_cpumask;
}
@@ -93,7 +93,7 @@ static const struct file_operations irq_affinity_proc_fops = {

static int default_affinity_show(struct seq_file *m, void *v)
{
- seq_cpumask(m, &irq_default_affinity);
+ seq_cpumask(m, irq_default_affinity);
seq_putc(m, '\n');
return 0;
}
@@ -101,27 +101,37 @@ static int default_affinity_show(struct seq_file *m, void *v)
static ssize_t default_affinity_write(struct file *file,
const char __user *buffer, size_t count, loff_t *ppos)
{
- cpumask_t new_value;
+ cpumask_var_t new_value;
int err;

- err = cpumask_parse_user(buffer, count, &new_value);
+ if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+ return -ENOMEM;
+
+ err = cpumask_parse_user(buffer, count, new_value);
if (err)
- return err;
+ goto out;

- if (!is_affinity_mask_valid(new_value))
- return -EINVAL;
+ if (!is_affinity_mask_valid(new_value)) {
+ err = -EINVAL;
+ goto out;
+ }

/*
* Do not allow disabling IRQs completely - it's a too easy
* way to make the system unusable accidentally :-) At least
* one online CPU still has to be targeted.
*/
- if (!cpus_intersects(new_value, cpu_online_map))
- return -EINVAL;
+ if (!cpumask_intersects(new_value, cpu_online_mask)) {
+ err = -EINVAL;
+ goto out;
+ }

- irq_default_affinity = new_value;
+ cpumask_copy(irq_default_affinity, new_value);
+ err = count;

- return count;
+out:
+ free_cpumask_var(new_value);
+ return err;
}

static int default_affinity_open(struct inode *inode, struct file *file)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index ac0fde7..3fb855a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1116,7 +1116,7 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
struct elf_prstatus prstatus;
u32 *buf;

- if ((cpu < 0) || (cpu >= NR_CPUS))
+ if ((cpu < 0) || (cpu >= nr_cpu_ids))
return;

/* Using ELF notes here is opportunistic.
diff --git a/kernel/power/poweroff.c b/kernel/power/poweroff.c
index 72016f0..9789083 100644
--- a/kernel/power/poweroff.c
+++ b/kernel/power/poweroff.c
@@ -27,7 +27,7 @@ static DECLARE_WORK(poweroff_work, do_poweroff);
static void handle_poweroff(int key, struct tty_struct *tty)
{
/* run sysrq poweroff on boot cpu */
- schedule_work_on(first_cpu(cpu_online_map), &poweroff_work);
+ schedule_work_on(cpumask_first(cpu_online_mask), &poweroff_work);
}

static struct sysrq_key_op sysrq_poweroff_op = {
diff --git a/kernel/profile.c b/kernel/profile.c
index 4cb7d68..d18e2d2 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -45,7 +45,7 @@ static unsigned long prof_len, prof_shift;
int prof_on __read_mostly;
EXPORT_SYMBOL_GPL(prof_on);

-static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
+static cpumask_var_t prof_cpu_mask;
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits);
static DEFINE_PER_CPU(int, cpu_profile_flip);
@@ -113,9 +113,13 @@ int __ref profile_init(void)
buffer_bytes = prof_len*sizeof(atomic_t);
if (!slab_is_available()) {
prof_buffer = alloc_bootmem(buffer_bytes);
+ alloc_bootmem_cpumask_var(&prof_cpu_mask);
return 0;
}

+ if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
+ return -ENOMEM;
+
prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
if (prof_buffer)
return 0;
@@ -128,6 +132,7 @@ int __ref profile_init(void)
if (prof_buffer)
return 0;

+ free_cpumask_var(prof_cpu_mask);
return -ENOMEM;
}

@@ -386,13 +391,15 @@ out_free:
return NOTIFY_BAD;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
- cpu_set(cpu, prof_cpu_mask);
+ if (prof_cpu_mask != NULL)
+ cpumask_set_cpu(cpu, prof_cpu_mask);
break;
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- cpu_clear(cpu, prof_cpu_mask);
+ if (prof_cpu_mask != NULL)
+ cpumask_clear_cpu(cpu, prof_cpu_mask);
if (per_cpu(cpu_profile_hits, cpu)[0]) {
page = virt_to_page(per_cpu(cpu_profile_hits, cpu)[0]);
per_cpu(cpu_profile_hits, cpu)[0] = NULL;
@@ -430,7 +437,8 @@ void profile_tick(int type)

if (type == CPU_PROFILING && timer_hook)
timer_hook(regs);
- if (!user_mode(regs) && cpu_isset(smp_processor_id(), prof_cpu_mask))
+ if (!user_mode(regs) && prof_cpu_mask != NULL &&
+ cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
profile_hit(type, (void *)profile_pc(regs));
}

@@ -442,7 +450,7 @@ void profile_tick(int type)
static int prof_cpu_mask_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
- int len = cpumask_scnprintf(page, count, (cpumask_t *)data);
+ int len = cpumask_scnprintf(page, count, data);
if (count - len < 2)
return -EINVAL;
len += sprintf(page + len, "\n");
@@ -452,16 +460,20 @@ static int prof_cpu_mask_read_proc(char *page, char **start, off_t off,
static int prof_cpu_mask_write_proc(struct file *file,
const char __user *buffer, unsigned long count, void *data)
{
- cpumask_t *mask = (cpumask_t *)data;
+ struct cpumask *mask = data;
unsigned long full_count = count, err;
- cpumask_t new_value;
+ cpumask_var_t new_value;

- err = cpumask_parse_user(buffer, count, &new_value);
- if (err)
- return err;
+ if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+ return -ENOMEM;

- *mask = new_value;
- return full_count;
+ err = cpumask_parse_user(buffer, count, new_value);
+ if (!err) {
+ cpumask_copy(mask, new_value);
+ err = full_count;
+ }
+ free_cpumask_var(new_value);
+ return err;
}

void create_prof_cpu_mask(struct proc_dir_entry *root_irq_dir)
@@ -472,7 +484,7 @@ void create_prof_cpu_mask(struct proc_dir_entry *root_irq_dir)
entry = create_proc_entry("prof_cpu_mask", 0600, root_irq_dir);
if (!entry)
return;
- entry->data = (void *)&prof_cpu_mask;
+ entry->data = prof_cpu_mask;
entry->read_proc = prof_cpu_mask_read_proc;
entry->write_proc = prof_cpu_mask_write_proc;
}
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index c03ca3e..490934f 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -63,14 +63,14 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
.completed = -300,
.pending = -300,
.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
- .cpumask = CPU_MASK_NONE,
+ .cpumask = CPU_BITS_NONE,
};
static struct rcu_ctrlblk rcu_bh_ctrlblk = {
.cur = -300,
.completed = -300,
.pending = -300,
.lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
- .cpumask = CPU_MASK_NONE,
+ .cpumask = CPU_BITS_NONE,
};

DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
@@ -85,7 +85,6 @@ static void force_quiescent_state(struct rcu_data *rdp,
struct rcu_ctrlblk *rcp)
{
int cpu;
- cpumask_t cpumask;
unsigned long flags;

set_need_resched();
@@ -96,10 +95,10 @@ static void force_quiescent_state(struct rcu_data *rdp,
* Don't send IPI to itself. With irqs disabled,
* rdp->cpu is the current cpu.
*
- * cpu_online_map is updated by the _cpu_down()
+ * cpu_online_mask is updated by the _cpu_down()
* using __stop_machine(). Since we're in irqs disabled
 * section, __stop_machine() is not executing, hence
- * the cpu_online_map is stable.
+ * the cpu_online_mask is stable.
*
* However, a cpu might have been offlined _just_ before
* we disabled irqs while entering here.
@@ -107,13 +106,14 @@ static void force_quiescent_state(struct rcu_data *rdp,
* notification, leading to the offlined cpu's bit
* being set in the rcp->cpumask.
*
- * Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent
+ * Hence cpumask = (rcp->cpumask & cpu_online_mask) to prevent
* sending smp_reschedule() to an offlined CPU.
*/
- cpus_and(cpumask, rcp->cpumask, cpu_online_map);
- cpu_clear(rdp->cpu, cpumask);
- for_each_cpu_mask_nr(cpu, cpumask)
- smp_send_reschedule(cpu);
+ for_each_cpu_and(cpu,
+ to_cpumask(rcp->cpumask), cpu_online_mask) {
+ if (cpu != rdp->cpu)
+ smp_send_reschedule(cpu);
+ }
}
spin_unlock_irqrestore(&rcp->lock, flags);
}
@@ -193,7 +193,7 @@ static void print_other_cpu_stall(struct rcu_ctrlblk *rcp)

printk(KERN_ERR "INFO: RCU detected CPU stalls:");
for_each_possible_cpu(cpu) {
- if (cpu_isset(cpu, rcp->cpumask))
+ if (cpumask_test_cpu(cpu, to_cpumask(rcp->cpumask)))
printk(" %d", cpu);
}
printk(" (detected by %d, t=%ld jiffies)\n",
@@ -221,7 +221,8 @@ static void check_cpu_stall(struct rcu_ctrlblk *rcp)
long delta;

delta = jiffies - rcp->jiffies_stall;
- if (cpu_isset(smp_processor_id(), rcp->cpumask) && delta >= 0) {
+ if (cpumask_test_cpu(smp_processor_id(), to_cpumask(rcp->cpumask)) &&
+ delta >= 0) {

/* We haven't checked in, so go dump stack. */
print_cpu_stall(rcp);
@@ -393,7 +394,8 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
* unnecessarily.
*/
smp_mb();
- cpumask_andnot(&rcp->cpumask, cpu_online_mask, nohz_cpu_mask);
+ cpumask_andnot(to_cpumask(rcp->cpumask),
+ cpu_online_mask, nohz_cpu_mask);

rcp->signaled = 0;
}
@@ -406,8 +408,8 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
*/
static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp)
{
- cpu_clear(cpu, rcp->cpumask);
- if (cpus_empty(rcp->cpumask)) {
+ cpumask_clear_cpu(cpu, to_cpumask(rcp->cpumask));
+ if (cpumask_empty(to_cpumask(rcp->cpumask))) {
/* batch completed ! */
rcp->completed = rcp->cur;
rcu_start_batch(rcp);
diff --git a/kernel/rcupreempt.c b/kernel/rcupreempt.c
index 0498265..f9dc8f3 100644
--- a/kernel/rcupreempt.c
+++ b/kernel/rcupreempt.c
@@ -164,7 +164,8 @@ static char *rcu_try_flip_state_names[] =
{ "idle", "waitack", "waitzero", "waitmb" };
#endif /* #ifdef CONFIG_RCU_TRACE */

-static cpumask_t rcu_cpu_online_map __read_mostly = CPU_MASK_NONE;
+static DECLARE_BITMAP(rcu_cpu_online_map, NR_CPUS) __read_mostly
+ = CPU_BITS_NONE;

/*
* Enum and per-CPU flag to determine when each CPU has seen
@@ -758,7 +759,7 @@ rcu_try_flip_idle(void)

/* Now ask each CPU for acknowledgement of the flip. */

- for_each_cpu_mask_nr(cpu, rcu_cpu_online_map) {
+ for_each_cpu(cpu, to_cpumask(rcu_cpu_online_map)) {
per_cpu(rcu_flip_flag, cpu) = rcu_flipped;
dyntick_save_progress_counter(cpu);
}
@@ -776,7 +777,7 @@ rcu_try_flip_waitack(void)
int cpu;

RCU_TRACE_ME(rcupreempt_trace_try_flip_a1);
- for_each_cpu_mask_nr(cpu, rcu_cpu_online_map)
+ for_each_cpu(cpu, to_cpumask(rcu_cpu_online_map))
if (rcu_try_flip_waitack_needed(cpu) &&
per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1);
@@ -808,7 +809,7 @@ rcu_try_flip_waitzero(void)
/* Check to see if the sum of the "last" counters is zero. */

RCU_TRACE_ME(rcupreempt_trace_try_flip_z1);
- for_each_cpu_mask_nr(cpu, rcu_cpu_online_map)
+ for_each_cpu(cpu, to_cpumask(rcu_cpu_online_map))
sum += RCU_DATA_CPU(cpu)->rcu_flipctr[lastidx];
if (sum != 0) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1);
@@ -823,7 +824,7 @@ rcu_try_flip_waitzero(void)
smp_mb(); /* ^^^^^^^^^^^^ */

/* Call for a memory barrier from each CPU. */
- for_each_cpu_mask_nr(cpu, rcu_cpu_online_map) {
+ for_each_cpu(cpu, to_cpumask(rcu_cpu_online_map)) {
per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed;
dyntick_save_progress_counter(cpu);
}
@@ -843,7 +844,7 @@ rcu_try_flip_waitmb(void)
int cpu;

RCU_TRACE_ME(rcupreempt_trace_try_flip_m1);
- for_each_cpu_mask_nr(cpu, rcu_cpu_online_map)
+ for_each_cpu(cpu, to_cpumask(rcu_cpu_online_map))
if (rcu_try_flip_waitmb_needed(cpu) &&
per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_me1);
@@ -1032,7 +1033,7 @@ void rcu_offline_cpu(int cpu)
RCU_DATA_CPU(cpu)->rcu_flipctr[0] = 0;
RCU_DATA_CPU(cpu)->rcu_flipctr[1] = 0;

- cpu_clear(cpu, rcu_cpu_online_map);
+ cpumask_clear_cpu(cpu, to_cpumask(rcu_cpu_online_map));

spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);

@@ -1072,7 +1073,7 @@ void __cpuinit rcu_online_cpu(int cpu)
struct rcu_data *rdp;

spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
- cpu_set(cpu, rcu_cpu_online_map);
+ cpumask_set_cpu(cpu, to_cpumask(rcu_cpu_online_map));
spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);

/*
@@ -1430,7 +1431,7 @@ void __init __rcu_init(void)
* We don't need protection against CPU-Hotplug here
* since
* a) If a CPU comes online while we are iterating over the
- * cpu_online_map below, we would only end up making a
+ * cpu_online_mask below, we would only end up making a
* duplicate call to rcu_online_cpu() which sets the corresponding
* CPU's mask in the rcu_cpu_online_map.
*
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index b310655..3245b40 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -868,49 +868,52 @@ static int rcu_idle_cpu; /* Force all torture tasks off this CPU */
*/
static void rcu_torture_shuffle_tasks(void)
{
- cpumask_t tmp_mask;
+ cpumask_var_t tmp_mask;
int i;

- cpus_setall(tmp_mask);
+ if (!alloc_cpumask_var(&tmp_mask, GFP_KERNEL))
+ BUG();
+
+ cpumask_setall(tmp_mask);
get_online_cpus();

/* No point in shuffling if there is only one online CPU (ex: UP) */
- if (num_online_cpus() == 1) {
- put_online_cpus();
- return;
- }
+ if (num_online_cpus() == 1)
+ goto out;

if (rcu_idle_cpu != -1)
- cpu_clear(rcu_idle_cpu, tmp_mask);
+ cpumask_clear_cpu(rcu_idle_cpu, tmp_mask);

- set_cpus_allowed_ptr(current, &tmp_mask);
+ set_cpus_allowed_ptr(current, tmp_mask);

if (reader_tasks) {
for (i = 0; i < nrealreaders; i++)
if (reader_tasks[i])
set_cpus_allowed_ptr(reader_tasks[i],
- &tmp_mask);
+ tmp_mask);
}

if (fakewriter_tasks) {
for (i = 0; i < nfakewriters; i++)
if (fakewriter_tasks[i])
set_cpus_allowed_ptr(fakewriter_tasks[i],
- &tmp_mask);
+ tmp_mask);
}

if (writer_task)
- set_cpus_allowed_ptr(writer_task, &tmp_mask);
+ set_cpus_allowed_ptr(writer_task, tmp_mask);

if (stats_task)
- set_cpus_allowed_ptr(stats_task, &tmp_mask);
+ set_cpus_allowed_ptr(stats_task, tmp_mask);

if (rcu_idle_cpu == -1)
rcu_idle_cpu = num_online_cpus() - 1;
else
rcu_idle_cpu--;

+out:
put_online_cpus();
+ free_cpumask_var(tmp_mask);
}

/* Shuffle tasks across CPUs, with the intent of allowing each CPU in the
diff --git a/kernel/sched.c b/kernel/sched.c
index 27ba1d6..dd862d7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3715,7 +3715,7 @@ redo:
* don't kick the migration_thread, if the curr
* task on busiest cpu can't be moved to this_cpu
*/
- if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ if (!cpumask_test_cpu(this_cpu, &busiest->curr->cpus_allowed)) {
double_unlock_balance(this_rq, busiest);
all_pinned = 1;
return ld_moved;
@@ -6220,9 +6220,7 @@ static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
int dest_cpu;
- /* FIXME: Use cpumask_of_node here. */
- cpumask_t _nodemask = node_to_cpumask(cpu_to_node(dead_cpu));
- const struct cpumask *nodemask = &_nodemask;
+ const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));

again:
/* Look for allowed, online CPU in same node. */
@@ -7133,21 +7131,18 @@ static int find_next_best_node(int node, nodemask_t *used_nodes)
static void sched_domain_node_span(int node, struct cpumask *span)
{
nodemask_t used_nodes;
- /* FIXME: use cpumask_of_node() */
- node_to_cpumask_ptr(nodemask, node);
int i;

- cpus_clear(*span);
+ cpumask_clear(span);
nodes_clear(used_nodes);

- cpus_or(*span, *span, *nodemask);
+ cpumask_or(span, span, cpumask_of_node(node));
node_set(node, used_nodes);

for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
int next_node = find_next_best_node(node, &used_nodes);

- node_to_cpumask_ptr_next(nodemask, next_node);
- cpus_or(*span, *span, *nodemask);
+ cpumask_or(span, span, cpumask_of_node(next_node));
}
}
#endif /* CONFIG_NUMA */
@@ -7227,9 +7222,7 @@ cpu_to_phys_group(int cpu, const struct cpumask *cpu_map,
{
int group;
#ifdef CONFIG_SCHED_MC
- /* FIXME: Use cpu_coregroup_mask. */
- *mask = cpu_coregroup_map(cpu);
- cpus_and(*mask, *mask, *cpu_map);
+ cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
group = cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
cpumask_and(mask, &per_cpu(cpu_sibling_map, cpu), cpu_map);
@@ -7259,10 +7252,8 @@ static int cpu_to_allnodes_group(int cpu, const struct cpumask *cpu_map,
struct cpumask *nodemask)
{
int group;
- /* FIXME: use cpumask_of_node */
- node_to_cpumask_ptr(pnodemask, cpu_to_node(cpu));

- cpumask_and(nodemask, pnodemask, cpu_map);
+ cpumask_and(nodemask, cpumask_of_node(cpu_to_node(cpu)), cpu_map);
group = cpumask_first(nodemask);

if (sg)
@@ -7313,10 +7304,8 @@ static void free_sched_groups(const struct cpumask *cpu_map,

for (i = 0; i < nr_node_ids; i++) {
struct sched_group *oldsg, *sg = sched_group_nodes[i];
- /* FIXME: Use cpumask_of_node */
- node_to_cpumask_ptr(pnodemask, i);

- cpus_and(*nodemask, *pnodemask, *cpu_map);
+ cpumask_and(nodemask, cpumask_of_node(i), cpu_map);
if (cpumask_empty(nodemask))
continue;

@@ -7525,9 +7514,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
for_each_cpu(i, cpu_map) {
struct sched_domain *sd = NULL, *p;

- /* FIXME: use cpumask_of_node */
- *nodemask = node_to_cpumask(cpu_to_node(i));
- cpus_and(*nodemask, *nodemask, *cpu_map);
+ cpumask_and(nodemask, cpumask_of_node(cpu_to_node(i)), cpu_map);

#ifdef CONFIG_NUMA
if (cpumask_weight(cpu_map) >
@@ -7568,9 +7555,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
sd = &per_cpu(core_domains, i).sd;
SD_INIT(sd, MC);
set_domain_attribute(sd, attr);
- *sched_domain_span(sd) = cpu_coregroup_map(i);
- cpumask_and(sched_domain_span(sd),
- sched_domain_span(sd), cpu_map);
+ cpumask_and(sched_domain_span(sd), cpu_map,
+ cpu_coregroup_mask(i));
sd->parent = p;
p->child = sd;
cpu_to_core_group(i, cpu_map, &sd->groups, tmpmask);
@@ -7606,9 +7592,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
#ifdef CONFIG_SCHED_MC
/* Set up multi-core groups */
for_each_cpu(i, cpu_map) {
- /* FIXME: Use cpu_coregroup_mask */
- *this_core_map = cpu_coregroup_map(i);
- cpus_and(*this_core_map, *this_core_map, *cpu_map);
+ cpumask_and(this_core_map, cpu_coregroup_mask(i), cpu_map);
if (i != cpumask_first(this_core_map))
continue;

@@ -7620,9 +7604,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

/* Set up physical groups */
for (i = 0; i < nr_node_ids; i++) {
- /* FIXME: Use cpumask_of_node */
- *nodemask = node_to_cpumask(i);
- cpus_and(*nodemask, *nodemask, *cpu_map);
+ cpumask_and(nodemask, cpumask_of_node(i), cpu_map);
if (cpumask_empty(nodemask))
continue;

@@ -7644,11 +7626,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
struct sched_group *sg, *prev;
int j;

- /* FIXME: Use cpumask_of_node */
- *nodemask = node_to_cpumask(i);
cpumask_clear(covered);
-
- cpus_and(*nodemask, *nodemask, *cpu_map);
+ cpumask_and(nodemask, cpumask_of_node(i), cpu_map);
if (cpumask_empty(nodemask)) {
sched_group_nodes[i] = NULL;
continue;
@@ -7679,8 +7658,6 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

for (j = 0; j < nr_node_ids; j++) {
int n = (i + j) % nr_node_ids;
- /* FIXME: Use cpumask_of_node */
- node_to_cpumask_ptr(pnodemask, n);

cpumask_complement(notcovered, covered);
cpumask_and(tmpmask, notcovered, cpu_map);
@@ -7688,7 +7665,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
if (cpumask_empty(tmpmask))
break;

- cpumask_and(tmpmask, tmpmask, pnodemask);
+ cpumask_and(tmpmask, tmpmask, cpumask_of_node(n));
if (cpumask_empty(tmpmask))
continue;

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 833b6d4..954e1a8 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1383,7 +1383,8 @@ static inline void init_sched_rt_class(void)
unsigned int i;

for_each_possible_cpu(i)
- alloc_cpumask_var(&per_cpu(local_cpu_mask, i), GFP_KERNEL);
+ alloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
+ GFP_KERNEL, cpu_to_node(i));
}
#endif /* CONFIG_SMP */

diff --git a/kernel/smp.c b/kernel/smp.c
index 75c8dde..5cfa0e5 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -24,8 +24,8 @@ struct call_function_data {
struct call_single_data csd;
spinlock_t lock;
unsigned int refs;
- cpumask_t cpumask;
struct rcu_head rcu_head;
+ unsigned long cpumask_bits[];
};

struct call_single_queue {
@@ -110,13 +110,13 @@ void generic_smp_call_function_interrupt(void)
list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
int refs;

- if (!cpu_isset(cpu, data->cpumask))
+ if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits)))
continue;

data->csd.func(data->csd.info);

spin_lock(&data->lock);
- cpu_clear(cpu, data->cpumask);
+ cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits));
WARN_ON(data->refs == 0);
data->refs--;
refs = data->refs;
@@ -223,7 +223,7 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
local_irq_save(flags);
func(info);
local_irq_restore(flags);
- } else if ((unsigned)cpu < NR_CPUS && cpu_online(cpu)) {
+ } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
struct call_single_data *data = NULL;

if (!wait) {
@@ -266,51 +266,19 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
generic_exec_single(cpu, data);
}

-/* Dummy function */
-static void quiesce_dummy(void *unused)
-{
-}
-
-/*
- * Ensure stack based data used in call function mask is safe to free.
- *
- * This is needed by smp_call_function_mask when using on-stack data, because
- * a single call function queue is shared by all CPUs, and any CPU may pick up
- * the data item on the queue at any time before it is deleted. So we need to
- * ensure that all CPUs have transitioned through a quiescent state after
- * this call.
- *
- * This is a very slow function, implemented by sending synchronous IPIs to
- * all possible CPUs. For this reason, we have to alloc data rather than use
- * stack based data even in the case of synchronous calls. The stack based
- * data is then just used for deadlock/oom fallback which will be very rare.
- *
- * If a faster scheme can be made, we could go back to preferring stack based
- * data -- the data allocation/free is non-zero cost.
- */
-static void smp_call_function_mask_quiesce_stack(cpumask_t mask)
-{
- struct call_single_data data;
- int cpu;
-
- data.func = quiesce_dummy;
- data.info = NULL;
-
- for_each_cpu_mask(cpu, mask) {
- data.flags = CSD_FLAG_WAIT;
- generic_exec_single(cpu, &data);
- }
-}
+/* FIXME: Shim for archs using old arch_send_call_function_ipi API. */
+#ifndef arch_send_call_function_ipi_mask
+#define arch_send_call_function_ipi_mask(maskp) \
+ arch_send_call_function_ipi(*(maskp))
+#endif

/**
- * smp_call_function_mask(): Run a function on a set of other CPUs.
- * @mask: The set of cpus to run on.
+ * smp_call_function_many(): Run a function on a set of other CPUs.
+ * @mask: The set of cpus to run on (only runs on online subset).
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait (atomically) until function has completed on other CPUs.
*
- * Returns 0 on success, else a negative status code.
- *
* If @wait is true, then returns once @func has returned. Note that @wait
* will be implicitly turned on in case of allocation failures, since
* we fall back to on-stack allocation.
@@ -319,53 +287,57 @@ static void smp_call_function_mask_quiesce_stack(cpumask_t mask)
* hardware interrupt handler or from a bottom half handler. Preemption
* must be disabled when calling this function.
*/
-int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
- int wait)
+void smp_call_function_many(const struct cpumask *mask,
+ void (*func)(void *), void *info,
+ bool wait)
{
- struct call_function_data d;
- struct call_function_data *data = NULL;
- cpumask_t allbutself;
+ struct call_function_data *data;
unsigned long flags;
- int cpu, num_cpus;
- int slowpath = 0;
+ int cpu, next_cpu;

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());

- cpu = smp_processor_id();
- allbutself = cpu_online_map;
- cpu_clear(cpu, allbutself);
- cpus_and(mask, mask, allbutself);
- num_cpus = cpus_weight(mask);
-
- /*
- * If zero CPUs, return. If just a single CPU, turn this request
- * into a targetted single call instead since it's faster.
- */
- if (!num_cpus)
- return 0;
- else if (num_cpus == 1) {
- cpu = first_cpu(mask);
- return smp_call_function_single(cpu, func, info, wait);
+ /* So, what's a CPU they want? Ignoring this one. */
+ cpu = cpumask_first_and(mask, cpu_online_mask);
+ if (cpu == smp_processor_id())
+ cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+ /* No online cpus? We're done. */
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /* Do we have another CPU which isn't us? */
+ next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+ if (next_cpu == smp_processor_id())
+ next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
+
+ /* Fastpath: do that cpu by itself. */
+ if (next_cpu >= nr_cpu_ids) {
+ smp_call_function_single(cpu, func, info, wait);
+ return;
}

- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data) {
- data->csd.flags = CSD_FLAG_ALLOC;
- if (wait)
- data->csd.flags |= CSD_FLAG_WAIT;
- } else {
- data = &d;
- data->csd.flags = CSD_FLAG_WAIT;
- wait = 1;
- slowpath = 1;
+ data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
+ if (unlikely(!data)) {
+ /* Slow path. */
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ if (cpumask_test_cpu(cpu, mask))
+ smp_call_function_single(cpu, func, info, wait);
+ }
+ return;
}

spin_lock_init(&data->lock);
+ data->csd.flags = CSD_FLAG_ALLOC;
+ if (wait)
+ data->csd.flags |= CSD_FLAG_WAIT;
data->csd.func = func;
data->csd.info = info;
- data->refs = num_cpus;
- data->cpumask = mask;
+ cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits));
+ data->refs = cpumask_weight(to_cpumask(data->cpumask_bits));

spin_lock_irqsave(&call_function_lock, flags);
list_add_tail_rcu(&data->csd.list, &call_function_queue);
@@ -377,18 +349,13 @@ int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
smp_mb();

/* Send a message to all CPUs in the map */
- arch_send_call_function_ipi(mask);
+ arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits));

/* optionally wait for the CPUs to complete */
- if (wait) {
+ if (wait)
csd_flag_wait(&data->csd);
- if (unlikely(slowpath))
- smp_call_function_mask_quiesce_stack(mask);
- }
-
- return 0;
}
-EXPORT_SYMBOL(smp_call_function_mask);
+EXPORT_SYMBOL(smp_call_function_many);

/**
* smp_call_function(): Run a function on all other CPUs.
@@ -396,7 +363,7 @@ EXPORT_SYMBOL(smp_call_function_mask);
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait (atomically) until function has completed on other CPUs.
*
- * Returns 0 on success, else a negative status code.
+ * Returns 0.
*
* If @wait is true, then returns once @func has returned; otherwise
* it returns just before the target cpu calls @func. In case of allocation
@@ -407,12 +374,10 @@ EXPORT_SYMBOL(smp_call_function_mask);
*/
int smp_call_function(void (*func)(void *), void *info, int wait)
{
- int ret;
-
preempt_disable();
- ret = smp_call_function_mask(cpu_online_map, func, info, wait);
+ smp_call_function_many(cpu_online_mask, func, info, wait);
preempt_enable();
- return ret;
+ return 0;
}
EXPORT_SYMBOL(smp_call_function);

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 670c1ec..bdbe9de 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -733,7 +733,7 @@ static int __cpuinit cpu_callback(struct notifier_block *nfb,
break;
/* Unbind so it can run. Fall thru. */
kthread_bind(per_cpu(ksoftirqd, hotcpu),
- any_online_cpu(cpu_online_map));
+ cpumask_any(cpu_online_mask));
case CPU_DEAD:
case CPU_DEAD_FROZEN: {
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index 1ab790c..d9188c6 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -303,17 +303,15 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
- check_cpu = any_online_cpu(cpu_online_map);
+ check_cpu = cpumask_any(cpu_online_mask);
wake_up_process(per_cpu(watchdog_task, hotcpu));
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
if (hotcpu == check_cpu) {
- cpumask_t temp_cpu_online_map = cpu_online_map;
-
- cpu_clear(hotcpu, temp_cpu_online_map);
- check_cpu = any_online_cpu(temp_cpu_online_map);
+ /* Pick any other online cpu. */
+ check_cpu = cpumask_any_but(cpu_online_mask, hotcpu);
}
break;

@@ -323,7 +321,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
/* Unbind so it can run. Fall thru. */
kthread_bind(per_cpu(watchdog_task, hotcpu),
- any_online_cpu(cpu_online_map));
+ cpumask_any(cpu_online_mask));
case CPU_DEAD:
case CPU_DEAD_FROZEN:
p = per_cpu(watchdog_task, hotcpu);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 24e8cea..286c417 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -69,10 +69,10 @@ static void stop_cpu(struct work_struct *unused)
int err;

if (!active_cpus) {
- if (cpu == first_cpu(cpu_online_map))
+ if (cpu == cpumask_first(cpu_online_mask))
smdata = &active;
} else {
- if (cpu_isset(cpu, *active_cpus))
+ if (cpumask_test_cpu(cpu, active_cpus))
smdata = &active;
}
/* Simple state machine */
@@ -109,7 +109,7 @@ static int chill(void *unused)
return 0;
}

-int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
+int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
struct work_struct *sm_work;
int i, ret;
@@ -142,7 +142,7 @@ int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
return ret;
}

-int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
+int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
int ret;

diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 6d7dc4e..888adbc 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -290,18 +290,17 @@ ret:
return;
}

-static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
+static int add_del_listener(pid_t pid, const struct cpumask *mask, int isadd)
{
struct listener_list *listeners;
struct listener *s, *tmp;
unsigned int cpu;
- cpumask_t mask = *maskp;

- if (!cpus_subset(mask, cpu_possible_map))
+ if (!cpumask_subset(mask, cpu_possible_mask))
return -EINVAL;

if (isadd == REGISTER) {
- for_each_cpu_mask_nr(cpu, mask) {
+ for_each_cpu(cpu, mask) {
s = kmalloc_node(sizeof(struct listener), GFP_KERNEL,
cpu_to_node(cpu));
if (!s)
@@ -320,7 +319,7 @@ static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)

/* Deregister or cleanup */
cleanup:
- for_each_cpu_mask_nr(cpu, mask) {
+ for_each_cpu(cpu, mask) {
listeners = &per_cpu(listener_array, cpu);
down_write(&listeners->sem);
list_for_each_entry_safe(s, tmp, &listeners->list, list) {
@@ -335,7 +334,7 @@ cleanup:
return 0;
}

-static int parse(struct nlattr *na, cpumask_t *mask)
+static int parse(struct nlattr *na, struct cpumask *mask)
{
char *data;
int len;
@@ -428,23 +427,33 @@ err:

static int taskstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
{
- int rc = 0;
+ int rc;
struct sk_buff *rep_skb;
struct taskstats *stats;
size_t size;
- cpumask_t mask;
+ cpumask_var_t mask;
+
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+ return -ENOMEM;

- rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], &mask);
+ rc = parse(info->attrs[TASKSTATS_CMD_ATTR_REGISTER_CPUMASK], mask);
if (rc < 0)
- return rc;
- if (rc == 0)
- return add_del_listener(info->snd_pid, &mask, REGISTER);
+ goto free_return_rc;
+ if (rc == 0) {
+ rc = add_del_listener(info->snd_pid, mask, REGISTER);
+ goto free_return_rc;
+ }

- rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], &mask);
+ rc = parse(info->attrs[TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK], mask);
if (rc < 0)
+ goto free_return_rc;
+ if (rc == 0) {
+ rc = add_del_listener(info->snd_pid, mask, DEREGISTER);
+free_return_rc:
+ free_cpumask_var(mask);
return rc;
- if (rc == 0)
- return add_del_listener(info->snd_pid, &mask, DEREGISTER);
+ }
+ free_cpumask_var(mask);

/*
* Size includes space for nested attributes
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9ed2eec..ca89e15 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -145,10 +145,11 @@ static void clocksource_watchdog(unsigned long data)
* Cycle through CPUs to check if the CPUs stay
* synchronized to each other.
*/
- int next_cpu = next_cpu_nr(raw_smp_processor_id(), cpu_online_map);
+ int next_cpu = cpumask_next(raw_smp_processor_id(),
+ cpu_online_mask);

if (next_cpu >= nr_cpu_ids)
- next_cpu = first_cpu(cpu_online_map);
+ next_cpu = cpumask_first(cpu_online_mask);
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
}
@@ -173,7 +174,7 @@ static void clocksource_check_watchdog(struct clocksource *cs)
watchdog_last = watchdog->read();
watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer,
- first_cpu(cpu_online_map));
+ cpumask_first(cpu_online_mask));
}
} else {
if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
@@ -195,7 +196,7 @@ static void clocksource_check_watchdog(struct clocksource *cs)
watchdog_timer.expires =
jiffies + WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer,
- first_cpu(cpu_online_map));
+ cpumask_first(cpu_online_mask));
}
}
}
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 9590af2..118a3b3 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -28,7 +28,9 @@
*/

struct tick_device tick_broadcast_device;
-static cpumask_t tick_broadcast_mask;
+/* FIXME: Use cpumask_var_t. */
+static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS);
+static DECLARE_BITMAP(tmpmask, NR_CPUS);
static DEFINE_SPINLOCK(tick_broadcast_lock);
static int tick_broadcast_force;

@@ -46,9 +48,9 @@ struct tick_device *tick_get_broadcast_device(void)
return &tick_broadcast_device;
}

-cpumask_t *tick_get_broadcast_mask(void)
+struct cpumask *tick_get_broadcast_mask(void)
{
- return &tick_broadcast_mask;
+ return to_cpumask(tick_broadcast_mask);
}

/*
@@ -72,7 +74,7 @@ int tick_check_broadcast_device(struct clock_event_device *dev)

clockevents_exchange_device(NULL, dev);
tick_broadcast_device.evtdev = dev;
- if (!cpus_empty(tick_broadcast_mask))
+ if (!cpumask_empty(tick_get_broadcast_mask()))
tick_broadcast_start_periodic(dev);
return 1;
}
@@ -104,7 +106,7 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
*/
if (!tick_device_is_functional(dev)) {
dev->event_handler = tick_handle_periodic;
- cpu_set(cpu, tick_broadcast_mask);
+ cpumask_set_cpu(cpu, tick_get_broadcast_mask());
tick_broadcast_start_periodic(tick_broadcast_device.evtdev);
ret = 1;
} else {
@@ -116,7 +118,7 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
if (!(dev->features & CLOCK_EVT_FEAT_C3STOP)) {
int cpu = smp_processor_id();

- cpu_clear(cpu, tick_broadcast_mask);
+ cpumask_clear_cpu(cpu, tick_get_broadcast_mask());
tick_broadcast_clear_oneshot(cpu);
}
}
@@ -125,9 +127,9 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
}

/*
- * Broadcast the event to the cpus, which are set in the mask
+ * Broadcast the event to the cpus, which are set in the mask (mangled).
*/
-static void tick_do_broadcast(cpumask_t mask)
+static void tick_do_broadcast(struct cpumask *mask)
{
int cpu = smp_processor_id();
struct tick_device *td;
@@ -135,22 +137,21 @@ static void tick_do_broadcast(cpumask_t mask)
/*
* Check, if the current cpu is in the mask
*/
- if (cpu_isset(cpu, mask)) {
- cpu_clear(cpu, mask);
+ if (cpumask_test_cpu(cpu, mask)) {
+ cpumask_clear_cpu(cpu, mask);
td = &per_cpu(tick_cpu_device, cpu);
td->evtdev->event_handler(td->evtdev);
}

- if (!cpus_empty(mask)) {
+ if (!cpumask_empty(mask)) {
/*
* It might be necessary to actually check whether the devices
* have different broadcast functions. For now, just use the
* one of the first device. This works as long as we have this
* misfeature only on x86 (lapic)
*/
- cpu = first_cpu(mask);
- td = &per_cpu(tick_cpu_device, cpu);
- td->evtdev->broadcast(&mask);
+ td = &per_cpu(tick_cpu_device, cpumask_first(mask));
+ td->evtdev->broadcast(mask);
}
}

@@ -160,12 +161,11 @@ static void tick_do_broadcast(cpumask_t mask)
*/
static void tick_do_periodic_broadcast(void)
{
- cpumask_t mask;
-
spin_lock(&tick_broadcast_lock);

- cpus_and(mask, cpu_online_map, tick_broadcast_mask);
- tick_do_broadcast(mask);
+ cpumask_and(to_cpumask(tmpmask),
+ cpu_online_mask, tick_get_broadcast_mask());
+ tick_do_broadcast(to_cpumask(tmpmask));

spin_unlock(&tick_broadcast_lock);
}
@@ -228,13 +228,13 @@ static void tick_do_broadcast_on_off(void *why)
if (!tick_device_is_functional(dev))
goto out;

- bc_stopped = cpus_empty(tick_broadcast_mask);
+ bc_stopped = cpumask_empty(tick_get_broadcast_mask());

switch (*reason) {
case CLOCK_EVT_NOTIFY_BROADCAST_ON:
case CLOCK_EVT_NOTIFY_BROADCAST_FORCE:
- if (!cpu_isset(cpu, tick_broadcast_mask)) {
- cpu_set(cpu, tick_broadcast_mask);
+ if (!cpumask_test_cpu(cpu, tick_get_broadcast_mask())) {
+ cpumask_set_cpu(cpu, tick_get_broadcast_mask());
if (tick_broadcast_device.mode ==
TICKDEV_MODE_PERIODIC)
clockevents_shutdown(dev);
@@ -244,8 +244,8 @@ static void tick_do_broadcast_on_off(void *why)
break;
case CLOCK_EVT_NOTIFY_BROADCAST_OFF:
if (!tick_broadcast_force &&
- cpu_isset(cpu, tick_broadcast_mask)) {
- cpu_clear(cpu, tick_broadcast_mask);
+ cpumask_test_cpu(cpu, tick_get_broadcast_mask())) {
+ cpumask_clear_cpu(cpu, tick_get_broadcast_mask());
if (tick_broadcast_device.mode ==
TICKDEV_MODE_PERIODIC)
tick_setup_periodic(dev, 0);
@@ -253,7 +253,7 @@ static void tick_do_broadcast_on_off(void *why)
break;
}

- if (cpus_empty(tick_broadcast_mask)) {
+ if (cpumask_empty(tick_get_broadcast_mask())) {
if (!bc_stopped)
clockevents_shutdown(bc);
} else if (bc_stopped) {
@@ -272,7 +272,7 @@ out:
*/
void tick_broadcast_on_off(unsigned long reason, int *oncpu)
{
- if (!cpu_isset(*oncpu, cpu_online_map))
+ if (!cpumask_test_cpu(*oncpu, cpu_online_mask))
printk(KERN_ERR "tick-broadcast: ignoring broadcast for "
"offline CPU #%d\n", *oncpu);
else
@@ -303,10 +303,10 @@ void tick_shutdown_broadcast(unsigned int *cpup)
spin_lock_irqsave(&tick_broadcast_lock, flags);

bc = tick_broadcast_device.evtdev;
- cpu_clear(cpu, tick_broadcast_mask);
+ cpumask_clear_cpu(cpu, tick_get_broadcast_mask());

if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC) {
- if (bc && cpus_empty(tick_broadcast_mask))
+ if (bc && cpumask_empty(tick_get_broadcast_mask()))
clockevents_shutdown(bc);
}

@@ -342,10 +342,10 @@ int tick_resume_broadcast(void)

switch (tick_broadcast_device.mode) {
case TICKDEV_MODE_PERIODIC:
- if(!cpus_empty(tick_broadcast_mask))
+ if (!cpumask_empty(tick_get_broadcast_mask()))
tick_broadcast_start_periodic(bc);
- broadcast = cpu_isset(smp_processor_id(),
- tick_broadcast_mask);
+ broadcast = cpumask_test_cpu(smp_processor_id(),
+ tick_get_broadcast_mask());
break;
case TICKDEV_MODE_ONESHOT:
broadcast = tick_resume_broadcast_oneshot(bc);
@@ -360,14 +360,15 @@ int tick_resume_broadcast(void)

#ifdef CONFIG_TICK_ONESHOT

-static cpumask_t tick_broadcast_oneshot_mask;
+/* FIXME: use cpumask_var_t. */
+static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS);

/*
- * Debugging: see timer_list.c
+ * Exposed for debugging: see timer_list.c
*/
-cpumask_t *tick_get_broadcast_oneshot_mask(void)
+struct cpumask *tick_get_broadcast_oneshot_mask(void)
{
- return &tick_broadcast_oneshot_mask;
+ return to_cpumask(tick_broadcast_oneshot_mask);
}

static int tick_broadcast_set_event(ktime_t expires, int force)
@@ -389,7 +390,7 @@ int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
*/
void tick_check_oneshot_broadcast(int cpu)
{
- if (cpu_isset(cpu, tick_broadcast_oneshot_mask)) {
+ if (cpumask_test_cpu(cpu, to_cpumask(tick_broadcast_oneshot_mask))) {
struct tick_device *td = &per_cpu(tick_cpu_device, cpu);

clockevents_set_mode(td->evtdev, CLOCK_EVT_MODE_ONESHOT);
@@ -402,7 +403,6 @@ void tick_check_oneshot_broadcast(int cpu)
static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
{
struct tick_device *td;
- cpumask_t mask;
ktime_t now, next_event;
int cpu;

@@ -410,13 +410,13 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
again:
dev->next_event.tv64 = KTIME_MAX;
next_event.tv64 = KTIME_MAX;
- mask = CPU_MASK_NONE;
+ cpumask_clear(to_cpumask(tmpmask));
now = ktime_get();
/* Find all expired events */
- for_each_cpu_mask_nr(cpu, tick_broadcast_oneshot_mask) {
+ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
td = &per_cpu(tick_cpu_device, cpu);
if (td->evtdev->next_event.tv64 <= now.tv64)
- cpu_set(cpu, mask);
+ cpumask_set_cpu(cpu, to_cpumask(tmpmask));
else if (td->evtdev->next_event.tv64 < next_event.tv64)
next_event.tv64 = td->evtdev->next_event.tv64;
}
@@ -424,7 +424,7 @@ again:
/*
* Wakeup the cpus which have an expired event.
*/
- tick_do_broadcast(mask);
+ tick_do_broadcast(to_cpumask(tmpmask));

/*
* Two reasons for reprogram:
@@ -476,15 +476,16 @@ void tick_broadcast_oneshot_control(unsigned long reason)
goto out;

if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
- if (!cpu_isset(cpu, tick_broadcast_oneshot_mask)) {
- cpu_set(cpu, tick_broadcast_oneshot_mask);
+ if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
+ cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
if (dev->next_event.tv64 < bc->next_event.tv64)
tick_broadcast_set_event(dev->next_event, 1);
}
} else {
- if (cpu_isset(cpu, tick_broadcast_oneshot_mask)) {
- cpu_clear(cpu, tick_broadcast_oneshot_mask);
+ if (cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
+ cpumask_clear_cpu(cpu,
+ tick_get_broadcast_oneshot_mask());
clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
if (dev->next_event.tv64 != KTIME_MAX)
tick_program_event(dev->next_event, 1);
@@ -502,15 +503,16 @@ out:
*/
static void tick_broadcast_clear_oneshot(int cpu)
{
- cpu_clear(cpu, tick_broadcast_oneshot_mask);
+ cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask());
}

-static void tick_broadcast_init_next_event(cpumask_t *mask, ktime_t expires)
+static void tick_broadcast_init_next_event(struct cpumask *mask,
+ ktime_t expires)
{
struct tick_device *td;
int cpu;

- for_each_cpu_mask_nr(cpu, *mask) {
+ for_each_cpu(cpu, mask) {
td = &per_cpu(tick_cpu_device, cpu);
if (td->evtdev)
td->evtdev->next_event = expires;
@@ -526,7 +528,6 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
if (bc->event_handler != tick_handle_oneshot_broadcast) {
int was_periodic = bc->mode == CLOCK_EVT_MODE_PERIODIC;
int cpu = smp_processor_id();
- cpumask_t mask;

bc->event_handler = tick_handle_oneshot_broadcast;
clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
@@ -540,13 +541,15 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
* oneshot_mask bits for those and program the
* broadcast device to fire.
*/
- mask = tick_broadcast_mask;
- cpu_clear(cpu, mask);
- cpus_or(tick_broadcast_oneshot_mask,
- tick_broadcast_oneshot_mask, mask);
-
- if (was_periodic && !cpus_empty(mask)) {
- tick_broadcast_init_next_event(&mask, tick_next_period);
+ cpumask_copy(to_cpumask(tmpmask), tick_get_broadcast_mask());
+ cpumask_clear_cpu(cpu, to_cpumask(tmpmask));
+ cpumask_or(tick_get_broadcast_oneshot_mask(),
+ tick_get_broadcast_oneshot_mask(),
+ to_cpumask(tmpmask));
+
+ if (was_periodic && !cpumask_empty(to_cpumask(tmpmask))) {
+ tick_broadcast_init_next_event(to_cpumask(tmpmask),
+ tick_next_period);
tick_broadcast_set_event(tick_next_period, 1);
} else
bc->next_event.tv64 = KTIME_MAX;
@@ -585,7 +588,7 @@ void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
* Clear the broadcast mask flag for the dead cpu, but do not
* stop the broadcast device!
*/
- cpu_clear(cpu, tick_broadcast_oneshot_mask);
+ cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask());

spin_unlock_irqrestore(&tick_broadcast_lock, flags);
}
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index f8372be..63e05d4 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -254,7 +254,7 @@ static int tick_check_new_device(struct clock_event_device *newdev)
curdev = NULL;
}
clockevents_exchange_device(curdev, newdev);
- tick_setup_device(td, newdev, cpu, &cpumask_of_cpu(cpu));
+ tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
if (newdev->features & CLOCK_EVT_FEAT_ONESHOT)
tick_oneshot_notify();

@@ -299,9 +299,9 @@ static void tick_shutdown(unsigned int *cpup)
}
/* Transfer the do_timer job away from this cpu */
if (*cpup == tick_do_timer_cpu) {
- int cpu = first_cpu(cpu_online_map);
+ int cpu = cpumask_first(cpu_online_mask);

- tick_do_timer_cpu = (cpu != NR_CPUS) ? cpu :
+ tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
TICK_DO_TIMER_NONE;
}
spin_unlock_irqrestore(&tick_device_lock, flags);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d601a7..a9d9760 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -195,7 +195,7 @@ void *ring_buffer_event_data(struct ring_buffer_event *event)
EXPORT_SYMBOL_GPL(ring_buffer_event_data);

#define for_each_buffer_cpu(buffer, cpu) \
- for_each_cpu_mask(cpu, buffer->cpumask)
+ for_each_cpu(cpu, buffer->cpumask)

#define TS_SHIFT 27
#define TS_MASK ((1ULL << TS_SHIFT) - 1)
@@ -267,7 +267,7 @@ struct ring_buffer {
unsigned pages;
unsigned flags;
int cpus;
- cpumask_t cpumask;
+ cpumask_var_t cpumask;
atomic_t record_disabled;

struct mutex mutex;
@@ -458,6 +458,9 @@ struct ring_buffer *ring_buffer_alloc(unsigned long size, unsigned flags)
if (!buffer)
return NULL;

+ if (!alloc_cpumask_var(&buffer->cpumask, GFP_KERNEL))
+ goto fail_free_buffer;
+
buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
buffer->flags = flags;

@@ -465,14 +468,14 @@ struct ring_buffer *ring_buffer_alloc(unsigned long size, unsigned flags)
if (buffer->pages == 1)
buffer->pages++;

- buffer->cpumask = cpu_possible_map;
+ cpumask_copy(buffer->cpumask, cpu_possible_mask);
buffer->cpus = nr_cpu_ids;

bsize = sizeof(void *) * nr_cpu_ids;
buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
GFP_KERNEL);
if (!buffer->buffers)
- goto fail_free_buffer;
+ goto fail_free_cpumask;

for_each_buffer_cpu(buffer, cpu) {
buffer->buffers[cpu] =
@@ -492,6 +495,9 @@ struct ring_buffer *ring_buffer_alloc(unsigned long size, unsigned flags)
}
kfree(buffer->buffers);

+ fail_free_cpumask:
+ free_cpumask_var(buffer->cpumask);
+
fail_free_buffer:
kfree(buffer);
return NULL;
@@ -510,6 +516,8 @@ ring_buffer_free(struct ring_buffer *buffer)
for_each_buffer_cpu(buffer, cpu)
rb_free_cpu_buffer(buffer->buffers[cpu]);

+ free_cpumask_var(buffer->cpumask);
+
kfree(buffer);
}
EXPORT_SYMBOL_GPL(ring_buffer_free);
@@ -1283,7 +1291,7 @@ ring_buffer_lock_reserve(struct ring_buffer *buffer,

cpu = raw_smp_processor_id();

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
goto out;

cpu_buffer = buffer->buffers[cpu];
@@ -1396,7 +1404,7 @@ int ring_buffer_write(struct ring_buffer *buffer,

cpu = raw_smp_processor_id();

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
goto out;

cpu_buffer = buffer->buffers[cpu];
@@ -1478,7 +1486,7 @@ void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return;

cpu_buffer = buffer->buffers[cpu];
@@ -1498,7 +1506,7 @@ void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return;

cpu_buffer = buffer->buffers[cpu];
@@ -1515,7 +1523,7 @@ unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return 0;

cpu_buffer = buffer->buffers[cpu];
@@ -1532,7 +1540,7 @@ unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return 0;

cpu_buffer = buffer->buffers[cpu];
@@ -1850,7 +1858,7 @@ rb_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
struct buffer_page *reader;
int nr_loops = 0;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return NULL;

cpu_buffer = buffer->buffers[cpu];
@@ -2025,7 +2033,7 @@ ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
struct ring_buffer_event *event;
unsigned long flags;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return NULL;

spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
@@ -2062,7 +2070,7 @@ ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
struct ring_buffer_iter *iter;
unsigned long flags;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return NULL;

iter = kmalloc(sizeof(*iter), GFP_KERNEL);
@@ -2172,7 +2180,7 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
unsigned long flags;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return;

spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
@@ -2228,7 +2236,7 @@ int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;

- if (!cpu_isset(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
return 1;

cpu_buffer = buffer->buffers[cpu];
@@ -2252,8 +2260,8 @@ int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
struct ring_buffer_per_cpu *cpu_buffer_a;
struct ring_buffer_per_cpu *cpu_buffer_b;

- if (!cpu_isset(cpu, buffer_a->cpumask) ||
- !cpu_isset(cpu, buffer_b->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer_a->cpumask) ||
+ !cpumask_test_cpu(cpu, buffer_b->cpumask))
return -EINVAL;

/* At least make sure the two buffers are somewhat the same */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e91f43..c580233 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -89,10 +89,10 @@ static inline void ftrace_enable_cpu(void)
preempt_enable();
}

-static cpumask_t __read_mostly tracing_buffer_mask;
+static cpumask_var_t __read_mostly tracing_buffer_mask;

#define for_each_tracing_cpu(cpu) \
- for_each_cpu_mask(cpu, tracing_buffer_mask)
+ for_each_cpu(cpu, tracing_buffer_mask)

/*
* ftrace_dump_on_oops - variable to dump ftrace buffer on oops
@@ -1811,10 +1811,10 @@ static void test_cpu_buff_start(struct trace_iterator *iter)
if (!(iter->iter_flags & TRACE_FILE_ANNOTATE))
return;

- if (cpu_isset(iter->cpu, iter->started))
+ if (cpumask_test_cpu(iter->cpu, iter->started))
return;

- cpu_set(iter->cpu, iter->started);
+ cpumask_set_cpu(iter->cpu, iter->started);
trace_seq_printf(s, "##### CPU %u buffer started ####\n", iter->cpu);
}

@@ -2646,13 +2646,7 @@ static struct file_operations show_traces_fops = {
/*
* Only trace on a CPU if the bitmask is set:
*/
-static cpumask_t tracing_cpumask = CPU_MASK_ALL;
-
-/*
- * When tracing/tracing_cpu_mask is modified then this holds
- * the new bitmask we are about to install:
- */
-static cpumask_t tracing_cpumask_new;
+static cpumask_var_t tracing_cpumask;

/*
* The tracer itself will not take this lock, but still we want
@@ -2674,7 +2668,7 @@ tracing_cpumask_read(struct file *filp, char __user *ubuf,

mutex_lock(&tracing_cpumask_update_lock);

- len = cpumask_scnprintf(mask_str, count, &tracing_cpumask);
+ len = cpumask_scnprintf(mask_str, count, tracing_cpumask);
if (count - len < 2) {
count = -EINVAL;
goto out_err;
@@ -2693,9 +2687,13 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf,
size_t count, loff_t *ppos)
{
int err, cpu;
+ cpumask_var_t tracing_cpumask_new;
+
+ if (!alloc_cpumask_var(&tracing_cpumask_new, GFP_KERNEL))
+ return -ENOMEM;

mutex_lock(&tracing_cpumask_update_lock);
- err = cpumask_parse_user(ubuf, count, &tracing_cpumask_new);
+ err = cpumask_parse_user(ubuf, count, tracing_cpumask_new);
if (err)
goto err_unlock;

@@ -2706,26 +2704,28 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf,
* Increase/decrease the disabled counter if we are
* about to flip a bit in the cpumask:
*/
- if (cpu_isset(cpu, tracing_cpumask) &&
- !cpu_isset(cpu, tracing_cpumask_new)) {
+ if (cpumask_test_cpu(cpu, tracing_cpumask) &&
+ !cpumask_test_cpu(cpu, tracing_cpumask_new)) {
atomic_inc(&global_trace.data[cpu]->disabled);
}
- if (!cpu_isset(cpu, tracing_cpumask) &&
- cpu_isset(cpu, tracing_cpumask_new)) {
+ if (!cpumask_test_cpu(cpu, tracing_cpumask) &&
+ cpumask_test_cpu(cpu, tracing_cpumask_new)) {
atomic_dec(&global_trace.data[cpu]->disabled);
}
}
__raw_spin_unlock(&ftrace_max_lock);
local_irq_enable();

- tracing_cpumask = tracing_cpumask_new;
+ cpumask_copy(tracing_cpumask, tracing_cpumask_new);

mutex_unlock(&tracing_cpumask_update_lock);
+ free_cpumask_var(tracing_cpumask_new);

return count;

err_unlock:
mutex_unlock(&tracing_cpumask_update_lock);
+ free_cpumask_var(tracing_cpumask);

return err;
}
@@ -3114,10 +3114,15 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
if (!iter)
return -ENOMEM;

+ if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) {
+ kfree(iter);
+ return -ENOMEM;
+ }
+
mutex_lock(&trace_types_lock);

/* trace pipe does not show start of buffer */
- cpus_setall(iter->started);
+ cpumask_setall(iter->started);

iter->tr = &global_trace;
iter->trace = current_trace;
@@ -3134,6 +3139,7 @@ static int tracing_release_pipe(struct inode *inode, struct file *file)
{
struct trace_iterator *iter = file->private_data;

+ free_cpumask_var(iter->started);
kfree(iter);
atomic_dec(&tracing_reader);

@@ -3752,7 +3758,6 @@ void ftrace_dump(void)
static DEFINE_SPINLOCK(ftrace_dump_lock);
/* use static because iter can be a bit big for the stack */
static struct trace_iterator iter;
- static cpumask_t mask;
static int dump_ran;
unsigned long flags;
int cnt = 0, cpu;
@@ -3786,8 +3791,6 @@ void ftrace_dump(void)
* and then release the locks again.
*/

- cpus_clear(mask);
-
while (!trace_empty(&iter)) {

if (!cnt)
@@ -3823,19 +3826,28 @@ __init static int tracer_alloc_buffers(void)
{
struct trace_array_cpu *data;
int i;
+ int ret = -ENOMEM;

- /* TODO: make the number of buffers hot pluggable with CPUS */
- tracing_buffer_mask = cpu_possible_map;
+ if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL))
+ goto out;
+
+ if (!alloc_cpumask_var(&tracing_cpumask, GFP_KERNEL))
+ goto out_free_buffer_mask;
+
+ cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
+ cpumask_copy(tracing_cpumask, cpu_all_mask);

+ /* TODO: make the number of buffers hot pluggable with CPUS */
global_trace.buffer = ring_buffer_alloc(trace_buf_size,
TRACE_BUFFER_FLAGS);
if (!global_trace.buffer) {
printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
WARN_ON(1);
- return 0;
+ goto out_free_cpumask;
}
global_trace.entries = ring_buffer_size(global_trace.buffer);

+
#ifdef CONFIG_TRACER_MAX_TRACE
max_tr.buffer = ring_buffer_alloc(trace_buf_size,
TRACE_BUFFER_FLAGS);
@@ -3843,7 +3855,7 @@ __init static int tracer_alloc_buffers(void)
printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
WARN_ON(1);
ring_buffer_free(global_trace.buffer);
- return 0;
+ goto out_free_cpumask;
}
max_tr.entries = ring_buffer_size(max_tr.buffer);
WARN_ON(max_tr.entries != global_trace.entries);
@@ -3873,8 +3885,14 @@ __init static int tracer_alloc_buffers(void)
&trace_panic_notifier);

register_die_notifier(&trace_die_notifier);
+ ret = 0;

- return 0;
+out_free_cpumask:
+ free_cpumask_var(tracing_cpumask);
+out_free_buffer_mask:
+ free_cpumask_var(tracing_buffer_mask);
+out:
+ return ret;
}
early_initcall(tracer_alloc_buffers);
fs_initcall(tracer_init_debugfs);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cc7a4f8..4d3d381 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -368,7 +368,7 @@ struct trace_iterator {
loff_t pos;
long idx;

- cpumask_t started;
+ cpumask_var_t started;
};

int tracing_is_enabled(void);
diff --git a/kernel/trace/trace_boot.c b/kernel/trace/trace_boot.c
index 3ccebde..366c8c3 100644
--- a/kernel/trace/trace_boot.c
+++ b/kernel/trace/trace_boot.c
@@ -42,7 +42,7 @@ static int boot_trace_init(struct trace_array *tr)
int cpu;
boot_trace = tr;

- for_each_cpu_mask(cpu, cpu_possible_map)
+ for_each_cpu(cpu, cpu_possible_mask)
tracing_reset(tr, cpu);

tracing_sched_switch_assign_trace(tr);
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 4bf39fc..930c08e 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -79,7 +79,7 @@ print_graph_cpu(struct trace_seq *s, int cpu)
int i;
int ret;
int log10_this = log10_cpu(cpu);
- int log10_all = log10_cpu(cpus_weight_nr(cpu_online_map));
+ int log10_all = log10_cpu(cpumask_weight(cpu_online_mask));


/*
diff --git a/kernel/trace/trace_hw_branches.c b/kernel/trace/trace_hw_branches.c
index b6a3e20..649df22 100644
--- a/kernel/trace/trace_hw_branches.c
+++ b/kernel/trace/trace_hw_branches.c
@@ -46,7 +46,7 @@ static void bts_trace_start(struct trace_array *tr)

tracing_reset_online_cpus(tr);

- for_each_cpu_mask(cpu, cpu_possible_map)
+ for_each_cpu(cpu, cpu_possible_mask)
smp_call_function_single(cpu, bts_trace_start_cpu, NULL, 1);
}

@@ -62,7 +62,7 @@ static void bts_trace_stop(struct trace_array *tr)
{
int cpu;

- for_each_cpu_mask(cpu, cpu_possible_map)
+ for_each_cpu(cpu, cpu_possible_mask)
smp_call_function_single(cpu, bts_trace_stop_cpu, NULL, 1);
}

@@ -172,7 +172,7 @@ static void trace_bts_prepare(struct trace_iterator *iter)
{
int cpu;

- for_each_cpu_mask(cpu, cpu_possible_map)
+ for_each_cpu(cpu, cpu_possible_mask)
smp_call_function_single(cpu, trace_bts_cpu, iter->tr, 1);
}

diff --git a/kernel/trace/trace_power.c b/kernel/trace/trace_power.c
index a7172a3..7bda248 100644
--- a/kernel/trace/trace_power.c
+++ b/kernel/trace/trace_power.c
@@ -39,7 +39,7 @@ static int power_trace_init(struct trace_array *tr)

trace_power_enabled = 1;

- for_each_cpu_mask(cpu, cpu_possible_map)
+ for_each_cpu(cpu, cpu_possible_mask)
tracing_reset(tr, cpu);
return 0;
}
diff --git a/kernel/trace/trace_sysprof.c b/kernel/trace/trace_sysprof.c
index a5779bd..eaca5ad 100644
--- a/kernel/trace/trace_sysprof.c
+++ b/kernel/trace/trace_sysprof.c
@@ -196,9 +196,9 @@ static enum hrtimer_restart stack_trace_timer_fn(struct hrtimer *hrtimer)
return HRTIMER_RESTART;
}

-static void start_stack_timer(int cpu)
+static void start_stack_timer(void *unused)
{
- struct hrtimer *hrtimer = &per_cpu(stack_trace_hrtimer, cpu);
+ struct hrtimer *hrtimer = &__get_cpu_var(stack_trace_hrtimer);

hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer->function = stack_trace_timer_fn;
@@ -208,14 +208,7 @@ static void start_stack_timer(int cpu)

static void start_stack_timers(void)
{
- cpumask_t saved_mask = current->cpus_allowed;
- int cpu;
-
- for_each_online_cpu(cpu) {
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
- start_stack_timer(cpu);
- }
- set_cpus_allowed_ptr(current, &saved_mask);
+ on_each_cpu(start_stack_timer, NULL, 1);
}

static void stop_stack_timer(int cpu)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4952322..2f44583 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -73,7 +73,7 @@ static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);

static int singlethread_cpu __read_mostly;
-static cpumask_t cpu_singlethread_map __read_mostly;
+static const struct cpumask *cpu_singlethread_map __read_mostly;
/*
* _cpu_down() first removes CPU from cpu_online_map, then CPU_DEAD
* flushes cwq->worklist. This means that flush_workqueue/wait_on_work
@@ -81,7 +81,7 @@ static cpumask_t cpu_singlethread_map __read_mostly;
* use cpu_possible_map, the cpumask below is more a documentation
* than optimization.
*/
-static cpumask_t cpu_populated_map __read_mostly;
+static cpumask_var_t cpu_populated_map __read_mostly;

/* If it's single threaded, it isn't in the list of workqueues. */
static inline int is_wq_single_threaded(struct workqueue_struct *wq)
@@ -89,10 +89,10 @@ static inline int is_wq_single_threaded(struct workqueue_struct *wq)
return wq->singlethread;
}

-static const cpumask_t *wq_cpu_map(struct workqueue_struct *wq)
+static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
{
return is_wq_single_threaded(wq)
- ? &cpu_singlethread_map : &cpu_populated_map;
+ ? cpu_singlethread_map : cpu_populated_map;
}

static
@@ -410,7 +410,7 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*/
void flush_workqueue(struct workqueue_struct *wq)
{
- const cpumask_t *cpu_map = wq_cpu_map(wq);
+ const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

might_sleep();
@@ -532,7 +532,7 @@ static void wait_on_work(struct work_struct *work)
{
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
- const cpumask_t *cpu_map;
+ const struct cpumask *cpu_map;
int cpu;

might_sleep();
@@ -903,7 +903,7 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- const cpumask_t *cpu_map = wq_cpu_map(wq);
+ const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

cpu_maps_update_begin();
@@ -933,7 +933,7 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,

switch (action) {
case CPU_UP_PREPARE:
- cpu_set(cpu, cpu_populated_map);
+ cpumask_set_cpu(cpu, cpu_populated_map);
}
undo:
list_for_each_entry(wq, &workqueues, list) {
@@ -964,7 +964,7 @@ undo:
switch (action) {
case CPU_UP_CANCELED:
case CPU_POST_DEAD:
- cpu_clear(cpu, cpu_populated_map);
+ cpumask_clear_cpu(cpu, cpu_populated_map);
}

return ret;
@@ -1017,9 +1017,11 @@ EXPORT_SYMBOL_GPL(work_on_cpu);

void __init init_workqueues(void)
{
- cpu_populated_map = cpu_online_map;
- singlethread_cpu = first_cpu(cpu_possible_map);
- cpu_singlethread_map = cpumask_of_cpu(singlethread_cpu);
+ alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
+
+ cpumask_copy(cpu_populated_map, cpu_online_mask);
+ singlethread_cpu = cpumask_first(cpu_possible_mask);
+ cpu_singlethread_map = cpumask_of(singlethread_cpu);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
BUG_ON(!keventd_wq);
diff --git a/lib/Kconfig b/lib/Kconfig
index 2ba43c4..03c2c24 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -13,6 +13,10 @@ config GENERIC_FIND_FIRST_BIT
config GENERIC_FIND_NEXT_BIT
bool

+config GENERIC_FIND_LAST_BIT
+ bool
+ default y
+
config CRC_CCITT
tristate "CRC-CCITT functions"
help
@@ -166,4 +170,8 @@ config CPUMASK_OFFSTACK
them on the stack. This is a bit more expensive, but avoids
stack overflow.

+config DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
+ bool "Disable obsolete cpumask functions" if DEBUG_PER_CPU_MAPS
+ depends on EXPERIMENTAL && BROKEN
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 80fe8a3..32b0e64 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -37,6 +37,7 @@ lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
+lib-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_PLIST) += plist.o
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 8d03f22..3389e24 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -76,15 +76,28 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)

/* These are not inline because of header tangles. */
#ifdef CONFIG_CPUMASK_OFFSTACK
-bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
+/**
+ * alloc_cpumask_var_node - allocate a struct cpumask on a given node
+ * @mask: pointer to cpumask_var_t where the cpumask is returned
+ * @flags: GFP_ flags
+ *
+ * Only defined when CONFIG_CPUMASK_OFFSTACK=y, otherwise is
+ * a nop returning a constant 1 (in <linux/cpumask.h>)
+ * Returns TRUE if memory allocation succeeded, FALSE otherwise.
+ *
+ * In addition, mask will be NULL if this fails. Note that gcc is
+ * usually smart enough to know that mask can never be NULL if
+ * CONFIG_CPUMASK_OFFSTACK=n, so does code elimination in that case
+ * too.
+ */
+bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags, int node)
{
if (likely(slab_is_available()))
- *mask = kmalloc(cpumask_size(), flags);
+ *mask = kmalloc_node(cpumask_size(), flags, node);
else {
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
printk(KERN_ERR
"=> alloc_cpumask_var: kmalloc not available!\n");
- dump_stack();
#endif
*mask = NULL;
}
@@ -94,21 +107,64 @@ bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
dump_stack();
}
#endif
+ /* FIXME: Bandaid to save us from old primitives which go to NR_CPUS. */
+ if (*mask) {
+ unsigned int tail;
+ tail = BITS_TO_LONGS(NR_CPUS - nr_cpumask_bits) * sizeof(long);
+ memset(cpumask_bits(*mask) + cpumask_size() - tail,
+ 0, tail);
+ }
+
return *mask != NULL;
}
+EXPORT_SYMBOL(alloc_cpumask_var_node);
+
+/**
+ * alloc_cpumask_var - allocate a struct cpumask
+ * @mask: pointer to cpumask_var_t where the cpumask is returned
+ * @flags: GFP_ flags
+ *
+ * Only defined when CONFIG_CPUMASK_OFFSTACK=y, otherwise is
+ * a nop returning a constant 1 (in <linux/cpumask.h>).
+ *
+ * See alloc_cpumask_var_node.
+ */
+bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
+{
+ return alloc_cpumask_var_node(mask, flags, numa_node_id());
+}
EXPORT_SYMBOL(alloc_cpumask_var);

+/**
+ * alloc_bootmem_cpumask_var - allocate a struct cpumask from the bootmem arena.
+ * @mask: pointer to cpumask_var_t where the cpumask is returned
+ *
+ * Only defined when CONFIG_CPUMASK_OFFSTACK=y, otherwise is
+ * a nop (in <linux/cpumask.h>).
+ * Either returns an allocated (zero-filled) cpumask, or causes the
+ * system to panic.
+ */
void __init alloc_bootmem_cpumask_var(cpumask_var_t *mask)
{
*mask = alloc_bootmem(cpumask_size());
}

+/**
+ * free_cpumask_var - frees memory allocated for a struct cpumask.
+ * @mask: cpumask to free
+ *
+ * This is safe on a NULL mask.
+ */
void free_cpumask_var(cpumask_var_t mask)
{
kfree(mask);
}
EXPORT_SYMBOL(free_cpumask_var);

+/**
+ * free_bootmem_cpumask_var - frees result of alloc_bootmem_cpumask_var
+ * @mask: cpumask to free
+ */
void __init free_bootmem_cpumask_var(cpumask_var_t mask)
{
free_bootmem((unsigned long)mask, cpumask_size());
diff --git a/lib/find_last_bit.c b/lib/find_last_bit.c
new file mode 100644
index 0000000..5d202e3
--- /dev/null
+++ b/lib/find_last_bit.c
@@ -0,0 +1,45 @@
+/* find_last_bit.c: fallback find next bit implementation
+ *
+ * Copyright (C) 2008 IBM Corporation
+ * Written by Rusty Russell <[email protected]>
+ * (Inspired by David Howell's find_next_bit implementation)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <asm/types.h>
+#include <asm/byteorder.h>
+
+unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
+{
+ unsigned long words;
+ unsigned long tmp;
+
+ /* Start at final word. */
+ words = size / BITS_PER_LONG;
+
+ /* Partial final word? */
+ if (size & (BITS_PER_LONG-1)) {
+ tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
+ - (size & (BITS_PER_LONG-1)))));
+ if (tmp)
+ goto found;
+ }
+
+ while (words) {
+ tmp = addr[--words];
+ if (tmp) {
+found:
+ return words * BITS_PER_LONG + __fls(tmp);
+ }
+ }
+
+ /* Not found */
+ return size;
+}
+EXPORT_SYMBOL(find_last_bit);
diff --git a/mm/pdflush.c b/mm/pdflush.c
index a0a14c4..15de509 100644
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -172,7 +172,16 @@ static int __pdflush(struct pdflush_work *my_work)
static int pdflush(void *dummy)
{
struct pdflush_work my_work;
- cpumask_t cpus_allowed;
+ cpumask_var_t cpus_allowed;
+
+ /*
+ * Since the caller doesn't even check kthread_run() worked, let's not
+ * freak out too much if this fails.
+ */
+ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
+ printk(KERN_WARNING "pdflush failed to allocate cpumask\n");
+ return 0;
+ }

/*
* pdflush can spend a lot of time doing encryption via dm-crypt. We
@@ -187,8 +196,9 @@ static int pdflush(void *dummy)
* This is needed as pdflush's are dynamically created and destroyed.
* The boottime pdflush's are easily placed w/o these 2 lines.
*/
- cpuset_cpus_allowed(current, &cpus_allowed);
- set_cpus_allowed_ptr(current, &cpus_allowed);
+ cpuset_cpus_allowed(current, cpus_allowed);
+ set_cpus_allowed_ptr(current, cpus_allowed);
+ free_cpumask_var(cpus_allowed);

return __pdflush(&my_work);
}
diff --git a/mm/slab.c b/mm/slab.c
index f97e564..ddc41f3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2157,7 +2157,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,

/*
* We use cache_chain_mutex to ensure a consistent view of
- * cpu_online_map as well. Please see cpuup_callback
+ * cpu_online_mask as well. Please see cpuup_callback
*/
get_online_cpus();
mutex_lock(&cache_chain_mutex);
diff --git a/mm/slub.c b/mm/slub.c
index 0d861c3..f0e2892 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1970,7 +1970,7 @@ static DEFINE_PER_CPU(struct kmem_cache_cpu,
kmem_cache_cpu)[NR_KMEM_CACHE_CPU];

static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
+static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);

static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
int cpu, gfp_t flags)
@@ -2045,13 +2045,13 @@ static void init_alloc_cpu_cpu(int cpu)
{
int i;

- if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
+ if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
return;

for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);

- cpu_set(cpu, kmem_cach_cpu_free_init_once);
+ cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
}

static void __init init_alloc_cpu(void)
@@ -3451,7 +3451,7 @@ struct location {
long max_time;
long min_pid;
long max_pid;
- cpumask_t cpus;
+ DECLARE_BITMAP(cpus, NR_CPUS);
nodemask_t nodes;
};

@@ -3526,7 +3526,8 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
if (track->pid > l->max_pid)
l->max_pid = track->pid;

- cpu_set(track->cpu, l->cpus);
+ cpumask_set_cpu(track->cpu,
+ to_cpumask(l->cpus));
}
node_set(page_to_nid(virt_to_page(track)), l->nodes);
return 1;
@@ -3556,8 +3557,8 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
l->max_time = age;
l->min_pid = track->pid;
l->max_pid = track->pid;
- cpus_clear(l->cpus);
- cpu_set(track->cpu, l->cpus);
+ cpumask_clear(to_cpumask(l->cpus));
+ cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
nodes_clear(l->nodes);
node_set(page_to_nid(virt_to_page(track)), l->nodes);
return 1;
@@ -3638,11 +3639,12 @@ static int list_locations(struct kmem_cache *s, char *buf,
len += sprintf(buf + len, " pid=%ld",
l->min_pid);

- if (num_online_cpus() > 1 && !cpus_empty(l->cpus) &&
+ if (num_online_cpus() > 1 &&
+ !cpumask_empty(to_cpumask(l->cpus)) &&
len < PAGE_SIZE - 60) {
len += sprintf(buf + len, " cpus=");
len += cpulist_scnprintf(buf + len, PAGE_SIZE - len - 50,
- &l->cpus);
+ to_cpumask(l->cpus));
}

if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62e7f62..d196f46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1902,7 +1902,7 @@ static int kswapd(void *p)
};
node_to_cpumask_ptr(cpumask, pgdat->node_id);

- if (!cpus_empty(*cpumask))
+ if (!cpumask_empty(cpumask))
set_cpus_allowed_ptr(tsk, cpumask);
current->reclaim_state = &reclaim_state;

@@ -2141,7 +2141,7 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
pg_data_t *pgdat = NODE_DATA(nid);
node_to_cpumask_ptr(mask, pgdat->node_id);

- if (any_online_cpu(*mask) < nr_cpu_ids)
+ if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
/* One of our CPUs online: restore mask */
set_cpus_allowed_ptr(pgdat->kswapd, mask);
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c3ccfda..9114974 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -20,7 +20,7 @@
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
EXPORT_PER_CPU_SYMBOL(vm_event_states);

-static void sum_vm_events(unsigned long *ret, cpumask_t *cpumask)
+static void sum_vm_events(unsigned long *ret, const struct cpumask *cpumask)
{
int cpu;
int i;
@@ -43,7 +43,7 @@ static void sum_vm_events(unsigned long *ret, cpumask_t *cpumask)
void all_vm_events(unsigned long *ret)
{
get_online_cpus();
- sum_vm_events(ret, &cpu_online_map);
+ sum_vm_events(ret, cpu_online_mask);
put_online_cpus();
}
EXPORT_SYMBOL_GPL(all_vm_events);
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index c863036..e552099 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1211,7 +1211,7 @@ static struct avc_cache_stats *sel_avc_get_stat_idx(loff_t *idx)
{
int cpu;

- for (cpu = *idx; cpu < NR_CPUS; ++cpu) {
+ for (cpu = *idx; cpu < nr_cpu_ids; ++cpu) {
if (!cpu_possible(cpu))
continue;
*idx = cpu + 1;

2009-01-03 20:29:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3



On Sat, 3 Jan 2009, Ingo Molnar wrote:
>
> ok. The pending regressions are all fixed now, and i've just finished my
> standard tests on the latest tree and all the tests passed fine.

Ok, pulled and pushed out.

Has anybody looked at what the stack size is with MAXSMP set with an
allyesconfig? And what areas are still problematic, if any? Are we going
to have some code-paths that still essentially have 1kB+ of stack space
just because they haven't been converted and still have the cpu mask on
stack?

Linus

2009-01-03 20:36:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Linus Torvalds <[email protected]> wrote:

> On Sat, 3 Jan 2009, Ingo Molnar wrote:
> >
> > ok. The pending regressions are all fixed now, and i've just finished
> > my standard tests on the latest tree and all the tests passed fine.
>
> Ok, pulled and pushed out.

thanks!

> Has anybody looked at what the stack size is with MAXSMP set with an
> allyesconfig? And what areas are still problematic, if any? Are we going
> to have some code-paths that still essentially have 1kB+ of stack space
> just because they haven't been converted and still have the cpu mask on
> stack?

ok, indeed testing of that is in order now.

I'll check what the worst-case runtime stack footprint is for an
allyesconfig 64-bit bootup - that should be the worst-case scenario on
x86. We have a low number of leftover places, but the serious ones, like
the IPI paths which triggered stack overflows in the past, have all been
fixed.

The test is underway with:

CONFIG_64BIT=y
CONFIG_NR_CPUS=4096
CONFIG_MAXSMP=y

Ingo

2009-01-03 20:56:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3



On Sat, 3 Jan 2009, Ingo Molnar wrote:
>
> > Has anybody looked at what the stack size is with MAXSMP set with an
> > allyesconfig? And what areas are still problematic, if any? Are we going
> > to have some code-paths that still essentially have 1kB+ of stack space
> > just because they haven't been converted and still have the cpu mask on
> > stack?
>
> ok, indeed testing of that is in order now.

Well, since I can compile an allyesconfig pretty quickly, I did the static
part. It looks better than it used to, and I think most of the huge stacks
are totally unrelated to cpu masks. But not all.

But it looks like we have a few:

- flush_tlb_current_task:
cpumask_t cpu_mask;
- flush_tlb_mm:
cpumask_t cpu_mask;
- local_cpus_show:
cpumask_t mask;
- local_cpulist_show:
cpumask_t mask;
- acpi_cpufreq_target:
cpumask_t online_policy_cpus

and then we have a number of things that have "struct cpufreq_policy" on
the stack, and those things have two cpumask_t's in each.

The rest of the high-stack-usage cases - from a _very_ quick look - seem
to be unrelated to CPU masks, but in the "more than 1kB of stack" group
about a third (wild handwaving eyeballing) of them do seem to be related
to cpumask.

And those things (VM, ACPI and PCI) are easily triggered in real use and
don't need odd hardware. So I think they need to be fixed before 2.6.29,
otherwise I'll have to disable MAXSMP and again limit MAX_CPU back to 128
or something.

Linus
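
For reference, the fix for the call sites listed above is the same
cpumask_var_t pattern used throughout the diff earlier in this thread.
A minimal sketch follows; the helper and its purpose are hypothetical,
only alloc_cpumask_var, cpumask_and, cpumask_weight, free_cpumask_var
and cpu_online_mask are the real API:

	/*
	 * Sketch only, not from the patch.  With CONFIG_CPUMASK_OFFSTACK=n
	 * cpumask_var_t is still a plain array, so this costs nothing extra;
	 * with =y the mask is kmalloc'ed instead of taking NR_CPUS/8 bytes
	 * of stack.
	 */
	#include <linux/cpumask.h>
	#include <linux/slab.h>

	static int count_requested_online_cpus(const struct cpumask *requested)
	{
		cpumask_var_t mask;
		int count;

		if (!alloc_cpumask_var(&mask, GFP_KERNEL))
			return -ENOMEM;

		/* was: cpumask_t mask; cpus_and(mask, *requested, cpu_online_map); */
		cpumask_and(mask, requested, cpu_online_mask);
		count = cpumask_weight(mask);

		free_cpumask_var(mask);
		return count;
	}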

2009-01-03 20:58:38

by Mike Travis

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

Linus Torvalds wrote:
>
> On Sat, 3 Jan 2009, Ingo Molnar wrote:
>> ok. The pending regressions are all fixed now, and i've just finished my
>> standard tests on the latest tree and all the tests passed fine.
>
> Ok, pulled and pushed out.
>
> Has anybody looked at what the stack size is with MAXSMP set with an
> allyesconfig? And what areas are still problematic, if any? Are we going
> to have some code-paths that still essentially have 1kB+ of stack space
> just because they haven't been converted and still have the cpu mask on
> stack?
>
> Linus

Hi Linus,

Yes, I do periodically collect stats for memory and stack usage. Here is
a recent stack summary (not "allyes", but most all trace/debug options
turned on). It shows the stack growth from a 128 NR_CPUS config to a
MAXSMP (4096 NR_CPUS) config. Most of the changes to correct these "stack
hogs" have been sitting in a queue until the changes affecting non-x86
architectures have been accepted (which you just did), though some are
because of new code from the merge activity.

Rusty has introduced a config option that disables the old cpumask_t
operations, which really highlights where the offenders still are.
Ultimately, that should prevent any new stack hogs from being introduced,
but it won't be settable until the 2.6.30 time frame.

====== Stack (-l 500)
1 - 128-defconfig
2 - 4k-defconfig

.1. .2. ..final..
0 +1640 1640 . acpi_cpufreq_target
0 +1368 1368 . cpufreq_add_dev
0 +1344 1344 . store_scaling_governor
0 +1328 1328 . store_scaling_min_freq
0 +1328 1328 . store_scaling_max_freq
0 +1328 1328 . cpufreq_update_policy
0 +1328 1328 . cpu_has_cpufreq
0 +1048 1048 . get_cur_val
0 +1032 1032 . local_cpus_show
0 +1032 1032 . local_cpulist_show
0 +1024 1024 . pci_bus_show_cpuaffinity
0 +808 808 . cpuset_write_resmask
0 +736 736 . update_flag
0 +648 648 . init_intel_cacheinfo
0 +640 640 . cpuset_attach
0 +584 584 . shmem_getpage
0 +584 584 . __percpu_alloc_mask
0 +552 552 . smp_call_function_many
0 +536 536 . pci_device_probe
0 +536 536 . native_flush_tlb_others
0 +536 536 . cpuset_common_file_read
0 +520 520 . show_related_cpus
0 +520 520 . show_affected_cpus
0 +520 520 . get_measured_perf
0 +520 520 . flush_tlb_page
0 +520 520 . cpuset_can_attach
0 +512 512 . flush_tlb_mm
0 +512 512 . flush_tlb_current_task
0 +512 512 . find_lowest_rq
0 +512 512 . acpi_processor_ffh_cstate_probe

====== Text/Data ()

Overall memory reservation looks like this:

.1. .2. ..final..
5799936 +4096 5804032 +0.07% TextSize
3772416 +139264 3911680 +3.69% DataSize
8822784 +1234944 10057728 +13% BssSize
2445312 +794624 3239936 +32% InitSize
1884160 +4096 1888256 +0.22% PerCPU
143360 +708608 851968 +494% OtherSize
22867968 +2885632 25753600 +12% Totals


I will update these with the latest changes (and use an
allyesconfig config) and post them again soon.

Thanks,
Mike

2009-01-03 21:39:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Ingo Molnar <[email protected]> wrote:

> The test is underway with:
>
> CONFIG_64BIT=y
> CONFIG_NR_CPUS=4096
> CONFIG_MAXSMP=y

the worst-case stack footprint i was able to trigger on an allyesconfig
bootup (instrumented via CONFIG_STACK_TRACER=y and the 'stacktrace' boot
option) was 5.6K:

# ls -l /debug/tracing/*stack*
-rw-r--r-- 1 root root 0 2009-01-03 22:42 /debug/tracing/stack_max_size
-r--r--r-- 1 root root 0 2009-01-03 22:42 /debug/tracing/stack_trace

# cat /debug/tracing/*stack*
5624
Depth Size Location (38 entries)
----- ---- --------
0) 5496 232 lookup_address+0x51/0x20e
1) 5264 496 __change_page_attr_set_clr+0xa7/0xa15
2) 4768 128 kernel_map_pages+0x121/0x140
3) 4640 224 get_page_from_freelist+0x4bb/0x6ec
4) 4416 160 __alloc_pages_internal+0x150/0x4af
5) 4256 48 alloc_pages_current+0xbe/0xc7
6) 4208 16 alloc_slab_page+0x1e/0x6e
7) 4192 64 new_slab+0x51/0x1e2
8) 4128 96 __slab_alloc+0x260/0x3fa
9) 4032 80 kmem_cache_alloc+0xa8/0x120
10) 3952 48 radix_tree_preload+0x38/0x86
11) 3904 64 add_to_page_cache_locked+0x51/0xed
12) 3840 32 add_to_page_cache_lru+0x31/0x6b
13) 3808 64 find_or_create_page+0x56/0x7d
14) 3744 80 __getblk+0x136/0x262
15) 3664 32 __bread+0x13/0x82
16) 3632 64 ext3_get_branch+0x7b/0xef
17) 3568 368 ext3_get_blocks_handle+0xa2/0x907
18) 3200 96 ext3_get_block+0xc3/0x101
19) 3104 224 do_mpage_readpage+0x1ad/0x4d0
20) 2880 240 mpage_readpages+0xb6/0xf9
21) 2640 16 ext3_readpages+0x1f/0x21
22) 2624 144 __do_page_cache_readahead+0x147/0x1bd
23) 2480 48 do_page_cache_readahead+0x5b/0x68
24) 2432 112 filemap_fault+0x176/0x34b
25) 2320 160 __do_fault+0x58/0x410
26) 2160 176 handle_mm_fault+0x4b2/0xa3e
27) 1984 800 do_page_fault+0x86d/0xcec
28) 1184 208 page_fault+0x25/0x30
29) 976 16 clear_user+0x30/0x38
30) 960 288 load_elf_binary+0x61c/0x18f0
31) 672 80 search_binary_handler+0xd9/0x279
32) 592 176 load_script+0x1b3/0x1c8
33) 416 80 search_binary_handler+0xd9/0x279
34) 336 96 do_execve+0x1df/0x296
35) 240 64 sys_execve+0x43/0x5e
36) 176 176 stub_execve+0x6a/0xc0

None of the cpumask_t callsites showed up - but i agree that they should
be fixed: 512 bytes of cpumask_t on the stack is not good no matter how
one looks at it, and it will only make an already strained kernel stack
footprint situation worse.

This one looks a bit high:

1) 5264 496 __change_page_attr_set_clr+0xa7/0xa15

(Venki, Suresh and Arjan Cc:-ed)

This one isn't too nice either:

27) 1984 800 do_page_fault+0x86d/0xcec

not sure why it happens - will investigate tomorrow, it's getting late
here.

Ingo

2009-01-03 21:58:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Linus Torvalds <[email protected]> wrote:

> And those things (VM, ACPI and PCI) are easily triggered in real use and
> don't need odd hardware. So I think they need to be fixed before 2.6.29,
> otherwise I'll have to disable MAXSMP and again limit MAX_CPU back to
> 128 or something.

agreed.

Ingo

2009-01-03 22:00:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3



On Sat, 3 Jan 2009, Ingo Molnar wrote:
>
> This one looks a bit high:
>
> 1) 5264 496 __change_page_attr_set_clr+0xa7/0xa15
>
> (Venki, Suresh and Arjan Cc:-ed)
>
> This one isnt too nice either:
>
> 27) 1984 800 do_page_fault+0x86d/0xcec
>
> not sure why it happens - will investigate tomorrow, it's getting late
> here.

The most common case tends to be insane gcc inlining (vmalloc_fault and
spurious_fault), and then gcc not re-using stack slots even if they have
no overlap in usage. Some people continue to claim that gcc reuses them,
but it's definitely not the case in any complex situation, so I suspect
the re-use is probably purely for some simple spilling case, not for
variables allocated on the stack.

do_page_fault() in particular has a lot of gunk in it for the special
cases.
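
The standard way out of that is to push the cold cases into noinline
helpers, so their big locals never land in the hot function's stack
frame. Schematically (made-up names, not the actual fault.c code, but
it is essentially what Nick's do_page_fault cleanup does for the
bad_area/no_context paths):

#include <linux/kernel.h>

struct pt_regs;

/* cold path: the 256-byte buffer lives only in this helper's frame */
static noinline void report_bad_access(struct pt_regs *regs,
                                       unsigned long address)
{
        char buf[256];

        snprintf(buf, sizeof(buf), "unhandled access at %lx", address);
        printk(KERN_ERR "%s\n", buf);
}

/* hot path: keeps a small frame because gcc cannot inline the helper
 * back into it */
void check_access(struct pt_regs *regs, unsigned long address, int bad)
{
        if (likely(!bad))
                return;
        report_bad_access(regs, address);
}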

What happened to Nick's cleanup patch to do_page_fault (a month or two
ago)? I complained about some of the issues in his first version and asked
for some further cleanups, but I think that whole discussion ended with
him saying "I am going to add those changes that you suggested (in fact, I
already have)".

And then I didn't see anything further. Maybe I just missed the end
result. Or maybe we have it in some -mm branch or something?

Linus

2009-01-03 22:38:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Linus Torvalds <[email protected]> wrote:

> What happened to Nick's cleanup patch to do_page_fault (a month or two
> ago? I complained about some of the issues in his first version and
> asked for some further cleanups, but I think that whole discussion ended
> with him saying "I am going to add those changes that you suggested (in
> fact, I already have)".
>
> And then I didn't see anything further. Maybe I just missed the end
> result. Or maybe we have it in some -mm branch or something?

they would have been in tip/x86/mm and would be upstream now had Nick
re-sent a v2 series but that never happened. I think they might have
fallen victim to a serious attention deficit caused by the SLQB patch ;-)

Ingo

2009-01-04 03:35:30

by Rusty Russell

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Sunday 04 January 2009 07:26:03 Linus Torvalds wrote:
>
> On Sat, 3 Jan 2009, Ingo Molnar wrote:
> >
> > > Has anybody looked at what the stack size is with MAXSMP set with an
> > > allyesconfig? And what areas are still problematic, if any? Are we going
> > > to have some code-paths that still essentially have 1kB+ of stack space
> > > just because they haven't been converted and still have the cpu mask on
> > > stack?
> >
> > ok, indeed testing of that is in order now.
>
> Well, since I can compile a allyesconfig pretty quickly, I did the static
> part. It looks better than it used to, and I think most of the huge stacks
> are totally unrealted to cpu masks. But not all.
>
> But it looks like we have a few:
>
> - flush_tlb_current_task:
> cpumask_t cpu_mask;
> - flush_tlb_mm:
> cpumask_t cpu_mask;
...
> - acpi_cpufreq_target:
> cpumask_t online_policy_cpus

Mike? These are x86-specific...

> - local_cpus_show:
> cpumask_t mask;
> - local_cpulist_show:
> cpumask_t mask;

Yes, this removal is still in my queue. I'll double-check that all the
archs have the new "cpumask_of_pcibus". (cpumask:replace-and-remove-pcibus_to_cpumask.patch "cpumask: remove the now-obsoleted pcibus_to_cpumask()").
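
For the curious, the shape of that fix (a sketch, not the queued patch
itself, and it assumes the pointer-taking cpumask_scnprintf() and the
new cpumask_of_pcibus() from the cpumask tree): the sysfs show routine
stops copying a mask onto its stack and prints through a pointer to the
bus's mask instead:

#include <linux/cpumask.h>
#include <linux/pci.h>

static ssize_t local_cpus_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
{
        const struct cpumask *mask;
        int len;

        /* no cpumask_t local (512 bytes at NR_CPUS=4096); just point
         * at the bus's mask */
        mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
        len = cpumask_scnprintf(buf, PAGE_SIZE - 2, mask);
        buf[len++] = '\n';
        buf[len] = '\0';
        return len;
}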

> and then we have a number of things that have "struct cpufreq_policy" on
> the stack, and those things have two cpumask_t's in each.

Yep, we have the conversion for that too. Mike, it's cpumask:convert-drivers_acpi.patch "cpumask: convert struct cpufreq_policy to cpumask_var_t."

> The rest of the high-stack-usage cases - from a _very_ quick look - seem
> to be unrelated to CPU masks, but in the "more than 1kB of stack" group
> about a third (wild handwaving eyeballing) of them do seem to be related
> to cpumask.

Mike was tracking this; I think he has a script to set NR_CPUS small then
large and dump the changes.

Cheers,
Rusty.

2009-01-04 03:43:25

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] ia64: cpumask fix for is_affinity_mask_valid()

On Saturday 03 January 2009 22:29:04 Ingo Molnar wrote:
> update it to cpumask_var_t.

Please change it to a const struct cpumask_t *.

It should be const, and of course 'const cpumask_var_t' isn't the same.
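
To spell that out: with CONFIG_CPUMASK_OFFSTACK=y cpumask_var_t is a
plain pointer, so 'const' binds to the pointer itself, not to the mask
it points at. A tiny illustration (not from the tree):

#include <linux/cpumask.h>

/* OFFSTACK=y: cpumask_var_t is 'struct cpumask *', so only the pointer
 * is const here - the mask it points at can still be modified: */
static void takes_var(const cpumask_var_t mask)
{
        cpumask_set_cpu(0, mask);               /* compiles fine */
}

/* with a pointer-to-const the mask really is read-only: */
static int takes_const_ptr(const struct cpumask *mask)
{
        /* cpumask_set_cpu(0, mask);       <-- would be a build error */
        return cpumask_test_cpu(0, mask);       /* reads are fine */
}

(And with OFFSTACK=n the same 'const cpumask_var_t' parameter decays to
a pointer-to-const, so its meaning silently changes with the config -
one more reason to spell out 'const struct cpumask *'.)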

Thanks,
Rusty.

2009-01-04 04:20:30

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH] ia64: cpumask fix for is_affinity_mask_valid()

Rusty Russell wrote:
> On Saturday 03 January 2009 22:29:04 Ingo Molnar wrote:
>> update it to cpumask_var_t.
>
> Please change it to a const struct cpumask_t *.
>
> It should be const, and of course 'const cpumask_var_t' isn't the same.
>
> Thanks,
> Rusty.

I have a fixup patch coming to correct that and remove an unneeded cpumask
off the stack.

Mike

2009-01-04 04:28:33

by Mike Travis

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

Rusty Russell wrote:
> On Sunday 04 January 2009 07:26:03 Linus Torvalds wrote:
>> On Sat, 3 Jan 2009, Ingo Molnar wrote:
>>>> Has anybody looked at what the stack size is with MAXSMP set with an
>>>> allyesconfig? And what areas are still problematic, if any? Are we going
>>>> to have some code-paths that still essentially have 1kB+ of stack space
>>>> just because they haven't been converted and still have the cpu mask on
>>>> stack?
>>> ok, indeed testing of that is in order now.
>> Well, since I can compile a allyesconfig pretty quickly, I did the static
>> part. It looks better than it used to, and I think most of the huge stacks
>> are totally unrealted to cpu masks. But not all.
>>
>> But it looks like we have a few:
>>
>> - flush_tlb_current_task:
>> cpumask_t cpu_mask;
>> - flush_tlb_mm:
>> cpumask_t cpu_mask;
> ...
>> - acpi_cpufreq_target:
>> cpumask_t online_policy_cpus
>
> Mike? These are x86-specific...

I've been testing the heck out of it... ;-)

>
>> - local_cpus_show:
>> cpumask_t mask;
>> - local_cpulist_show:
>> cpumask_t mask;

Yes, these are in my "real soon now" patchset. Trivial.
>
> Yes, this removal is still in my queue. I'll double-check that all the
> archs have the new "cpumask_of_pcibus". (cpumask:replace-and-remove-pcibus_to_cpumask.patch "cpumask: remove the now-obsoleted pcibus_to_cpumask()").
>
>> and then we have a number of things that have "struct cpufreq_policy" on
>> the stack, and those things have two cpumask_t's in each.
>
> Yep, we have the conversion for that too. Mike, it's cpumask:convert-drivers_acpi.patch "cpumask: convert struct cpufreq_policy to cpumask_var_t."
>
That's part of what I'm testing above.

>> The rest of the high-stack-usage cases - from a _very_ quick look - seem
>> to be unrelated to CPU masks, but in the "more than 1kB of stack" group
>> about a third (wild handwaving eyeballing) of them do seem to be related
>> to cpumask.
>
> Mike was tracking this; I think he has a script to set NR_CPUS small then
> large and dump the changes.

It's looking pretty good: only 11 are > 1k and 19 more are > 512 bytes.

Thanks,
Mike

2009-01-04 12:38:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ia64: cpumask fix for is_affinity_mask_valid()


* Mike Travis <[email protected]> wrote:

> Rusty Russell wrote:
> > On Saturday 03 January 2009 22:29:04 Ingo Molnar wrote:
> >> update it to cpumask_var_t.
> >
> > Please change it to a const struct cpumask_t *.

yeah.

> > It should be const, and of course 'const cpumask_var_t' isn't the
> > same.
> >
> > Thanks,
> > Rusty.
>
> I have a fixup patch coming to correct that and remove an unneeded
> cpumask off the stack.

ok, will wait for that.

Ingo

2009-01-05 01:14:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > ago? I complained about some of the issues in his first version and
> > asked for some further cleanups, but I think that whole discussion ended
> > with him saying "I am going to add those changes that you suggested (in
> > fact, I already have)".
> >
> > And then I didn't see anything further. Maybe I just missed the end
> > result. Or maybe we have it in some -mm branch or something?
>
> they would have been in tip/x86/mm and would be upstream now had Nick
> re-sent a v2 series but that never happened. I think they might have
> fallen victim to a serious attention deficit caused by the SLQB patch ;-)

Well, I already added Linus's suggestions but didn't submit it because
there was a bit of work going on in that file as far as I could see, both
in the x86 tree and in -mm:

(http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)

It isn't a big deal to resolve either way, but I don't want to make Andrew's
life harder.

[Yes OK now I'm the guilty one of pushing in an x86 patch not via the
x86 tree ;) This one is easy to break in pieces, but I didn't want
to create a dependency between the trees]

I didn't really consider it to be urgent, so I was waiting for that patch
to go in, but I was still hoping to get this into 2.6.29... This is what it
looks like now with your suggestions, and I've just merged it to your
current tree (untested).

I'll cc the linux-arch list here too, because it might be nice to keep these
things as structurally similar as possible (and they'll all want to look at
the -mm patch above, although I'll probably end up having to write the
patches!).

---

Optimise x86's do_page_fault (C entry point for the page fault path).

gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines.

Also, ensure the order of arguments to functions is always the same: regs,
error_code, address. This can reduce code size a tiny bit, and just looks
neater too.

Add a couple of branch annotations.

One real behavioural difference this makes is that the OOM-init-task case
will no longer loop around the page fault handler, but will return to
userspace and presumably retry the fault. Effectively the same macro-behaviour,
but it is a notable difference. Such change in behaviour should disappear after
the "call oom killer from page fault" patch.

Before:
do_page_fault:
subq $360, %rsp #,

After:
do_page_fault:
subq $56, %rsp #,

bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680

Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd its original size.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.

---
arch/x86/mm/fault.c | 458 ++++++++++++++++++++++++++++++----------------------
1 file changed, 265 insertions(+), 193 deletions(-)

Index: linux-2.6/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/fault.c
+++ linux-2.6/arch/x86/mm/fault.c
@@ -91,8 +91,8 @@ static inline int notify_page_fault(stru
*
* Opcode checker based on code by Richard Brunner
*/
-static int is_prefetch(struct pt_regs *regs, unsigned long addr,
- unsigned long error_code)
+static int is_prefetch(struct pt_regs *regs, unsigned long error_code,
+ unsigned long addr)
{
unsigned char *instr;
int scan_more = 1;
@@ -409,15 +409,15 @@ static void show_fault_oops(struct pt_re
}

#ifdef CONFIG_X86_64
-static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs,
- unsigned long error_code)
+static noinline void pgtable_bad(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
{
unsigned long flags = oops_begin();
int sig = SIGKILL;
- struct task_struct *tsk;
+ struct task_struct *tsk = current;

printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
- current->comm, address);
+ tsk->comm, address);
dump_pagetable(address);
tsk = current;
tsk->thread.cr2 = address;
@@ -429,6 +429,200 @@ static noinline void pgtable_bad(unsigne
}
#endif

+static noinline void no_context(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+#ifdef CONFIG_X86_64
+ unsigned long flags;
+ int sig;
+#endif
+
+ /* Are we prepared to handle this kernel fault? */
+ if (fixup_exception(regs))
+ return;
+
+ /*
+ * X86_32
+ * Valid to do another page fault here, because if this fault
+ * had been triggered by is_prefetch fixup_exception would have
+ * handled it.
+ *
+ * X86_64
+ * Hall of shame of CPU/BIOS bugs.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata93(regs, address))
+ return;
+
+ /*
+ * Oops. The kernel tried to access some bad page. We'll have to
+ * terminate things with extreme prejudice.
+ */
+#ifdef CONFIG_X86_32
+ bust_spinlocks(1);
+#else
+ flags = oops_begin();
+#endif
+
+ show_fault_oops(regs, error_code, address);
+
+ tsk->thread.cr2 = address;
+ tsk->thread.trap_no = 14;
+ tsk->thread.error_code = error_code;
+
+#ifdef CONFIG_X86_32
+ die("Oops", regs, error_code);
+ bust_spinlocks(0);
+ do_exit(SIGKILL);
+#else
+ sig = SIGKILL;
+ if (__die("Oops", regs, error_code))
+ sig = 0;
+ /* Executive summary in case the body of the oops scrolled away */
+ printk(KERN_EMERG "CR2: %016lx\n", address);
+ oops_end(flags, regs, sig);
+#endif
+}
+
+static void __bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct task_struct *tsk = current;
+
+ /* User mode accesses just cause a SIGSEGV */
+ if (error_code & PF_USER) {
+ /*
+ * It's possible to have interrupts off here.
+ */
+ local_irq_enable();
+
+ /*
+ * Valid to do another page fault here because this one came
+ * from user space.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata100(regs, address))
+ return;
+
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ printk_ratelimit()) {
+ printk(
+ "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
+ task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+ tsk->comm, task_pid_nr(tsk), address,
+ (void *) regs->ip, (void *) regs->sp, error_code);
+ print_vma_addr(" in ", regs->ip);
+ printk("\n");
+ }
+
+ tsk->thread.cr2 = address;
+ /* Kernel addresses are always protection faults */
+ tsk->thread.error_code = error_code | (address >= TASK_SIZE);
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGSEGV, si_code, address, tsk);
+ return;
+ }
+
+ if (is_f00f_bug(regs, address))
+ return;
+
+ no_context(regs, error_code, address);
+}
+
+static noinline void bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+}
+
+static void __bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct mm_struct *mm = current->mm;
+
+ /*
+ * Something tried to access memory that isn't in our memory map..
+ * Fix it, but check if it's kernel or user first..
+ */
+ up_read(&mm->mmap_sem);
+
+ __bad_area_nosemaphore(regs, error_code, address, si_code);
+}
+
+static noinline void bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_MAPERR);
+}
+
+static noinline void bad_area_access_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_ACCERR);
+}
+
+/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
+static void out_of_memory(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+ /*
+ * We ran out of memory, or some other thing happened to us that made
+ * us unable to handle the page fault gracefully.
+ */
+ up_read(&mm->mmap_sem);
+ if (is_global_init(tsk)) {
+ yield();
+ return;
+ }
+
+ printk("VM: killing process %s\n", tsk->comm);
+ if (error_code & PF_USER)
+ do_group_exit(SIGKILL);
+ no_context(regs, error_code, address);
+}
+
+static void do_sigbus(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+
+ up_read(&mm->mmap_sem);
+
+ /* Kernel mode? Handle exceptions or die */
+ if (!(error_code & PF_USER))
+ no_context(regs, error_code, address);
+#ifdef CONFIG_X86_32
+ /* User space => ok to do another page fault */
+ if (is_prefetch(regs, error_code, address))
+ return;
+#endif
+ tsk->thread.cr2 = address;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+}
+
+static noinline void mm_fault_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address, unsigned int fault)
+{
+ if (fault & VM_FAULT_OOM)
+ out_of_memory(regs, error_code, address);
+ else if (fault & VM_FAULT_SIGBUS)
+ do_sigbus(regs, error_code, address);
+ else
+ BUG();
+}
+
static int spurious_fault_check(unsigned long error_code, pte_t *pte)
{
if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -448,8 +642,8 @@ static int spurious_fault_check(unsigned
* There are no security implications to leaving a stale TLB when
* increasing the permissions on a page.
*/
-static int spurious_fault(unsigned long address,
- unsigned long error_code)
+static noinline int spurious_fault(unsigned long error_code,
+ unsigned long address)
{
pgd_t *pgd;
pud_t *pud;
@@ -494,7 +688,7 @@ static int spurious_fault(unsigned long
*
* This assumes no large pages in there.
*/
-static int vmalloc_fault(unsigned long address)
+static noinline int vmalloc_fault(unsigned long address)
{
#ifdef CONFIG_X86_32
unsigned long pgd_paddr;
@@ -573,6 +767,25 @@ static int vmalloc_fault(unsigned long a

int show_unhandled_signals = 1;

+static inline int access_error(unsigned long error_code, int write,
+ struct vm_area_struct *vma)
+{
+ if (write) {
+ /* write, present and write, not present */
+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ return 1;
+ } else if (unlikely(error_code & PF_PROT)) {
+ /* read, present */
+ return 1;
+ } else {
+ /* read, not present */
+ if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+ return 1;
+ }
+
+ return 0;
+}
+
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
@@ -583,16 +796,12 @@ asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
+ unsigned long address;
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
- unsigned long address;
- int write, si_code;
+ int write;
int fault;
-#ifdef CONFIG_X86_64
- unsigned long flags;
- int sig;
-#endif

tsk = current;
mm = tsk->mm;
@@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r
/* get the address */
address = read_cr2();

- si_code = SEGV_MAPERR;
-
- if (notify_page_fault(regs))
+ if (unlikely(notify_page_fault(regs)))
return;
if (unlikely(kmmio_fault(regs, address)))
return;
@@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
*/
- goto bad_area_nosemaphore;
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
}

-
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
@@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r

#ifdef CONFIG_X86_64
if (unlikely(error_code & PF_RSVD))
- pgtable_bad(address, regs, error_code);
+ pgtable_bad(regs, error_code, address);
#endif

/*
* If we're in an interrupt, have no user context or are running in an
* atomic region then we must not take the fault.
*/
- if (unlikely(in_atomic() || !mm))
- goto bad_area_nosemaphore;
+ if (unlikely(in_atomic() || !mm)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }

-again:
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
@@ -684,20 +892,26 @@ again:
* source. If this is invalid we can skip the address space check,
* thus avoiding the deadlock.
*/
- if (!down_read_trylock(&mm->mmap_sem)) {
+ if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if ((error_code & PF_USER) == 0 &&
- !search_exception_tables(regs->ip))
- goto bad_area_nosemaphore;
+ !search_exception_tables(regs->ip)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }
down_read(&mm->mmap_sem);
}

vma = find_vma(mm, address);
- if (!vma)
- goto bad_area;
- if (vma->vm_start <= address)
+ if (unlikely(!vma)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+ if (likely(vma->vm_start <= address))
goto good_area;
- if (!(vma->vm_flags & VM_GROWSDOWN))
- goto bad_area;
+ if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
if (error_code & PF_USER) {
/*
* Accessing the stack below %sp is always a bug.
@@ -705,31 +919,25 @@ again:
* and pusha to work. ("enter $65535,$31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
- if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
- goto bad_area;
+ if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
}
- if (expand_stack(vma, address))
- goto bad_area;
-/*
- * Ok, we have a good vm_area for this memory access, so
- * we can handle it..
- */
+ if (unlikely(expand_stack(vma, address))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+
+ /*
+ * Ok, we have a good vm_area for this memory access, so
+ * we can handle it..
+ */
good_area:
- si_code = SEGV_ACCERR;
- write = 0;
- switch (error_code & (PF_PROT|PF_WRITE)) {
- default: /* 3: write, present */
- /* fall through */
- case PF_WRITE: /* write, not present */
- if (!(vma->vm_flags & VM_WRITE))
- goto bad_area;
- write++;
- break;
- case PF_PROT: /* read, present */
- goto bad_area;
- case 0: /* read, not present */
- if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
- goto bad_area;
+ write = error_code & PF_WRITE;
+ if (unlikely(access_error(error_code, write, vma))) {
+ bad_area_access_error(regs, error_code, address);
+ return;
}

/*
@@ -739,11 +947,8 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, write);
if (unlikely(fault & VM_FAULT_ERROR)) {
- if (fault & VM_FAULT_OOM)
- goto out_of_memory;
- else if (fault & VM_FAULT_SIGBUS)
- goto do_sigbus;
- BUG();
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
@@ -761,139 +966,6 @@ good_area:
}
#endif
up_read(&mm->mmap_sem);
- return;
-
-/*
- * Something tried to access memory that isn't in our memory map..
- * Fix it, but check if it's kernel or user first..
- */
-bad_area:
- up_read(&mm->mmap_sem);
-
-bad_area_nosemaphore:
- /* User mode accesses just cause a SIGSEGV */
- if (error_code & PF_USER) {
- /*
- * It's possible to have interrupts off here.
- */
- local_irq_enable();
-
- /*
- * Valid to do another page fault here because this one came
- * from user space.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata100(regs, address))
- return;
-
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
- printk_ratelimit()) {
- printk(
- "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
- task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
- tsk->comm, task_pid_nr(tsk), address,
- (void *) regs->ip, (void *) regs->sp, error_code);
- print_vma_addr(" in ", regs->ip);
- printk("\n");
- }
-
- tsk->thread.cr2 = address;
- /* Kernel addresses are always protection faults */
- tsk->thread.error_code = error_code | (address >= TASK_SIZE);
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGSEGV, si_code, address, tsk);
- return;
- }
-
- if (is_f00f_bug(regs, address))
- return;
-
-no_context:
- /* Are we prepared to handle this kernel fault? */
- if (fixup_exception(regs))
- return;
-
- /*
- * X86_32
- * Valid to do another page fault here, because if this fault
- * had been triggered by is_prefetch fixup_exception would have
- * handled it.
- *
- * X86_64
- * Hall of shame of CPU/BIOS bugs.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata93(regs, address))
- return;
-
-/*
- * Oops. The kernel tried to access some bad page. We'll have to
- * terminate things with extreme prejudice.
- */
-#ifdef CONFIG_X86_32
- bust_spinlocks(1);
-#else
- flags = oops_begin();
-#endif
-
- show_fault_oops(regs, error_code, address);
-
- tsk->thread.cr2 = address;
- tsk->thread.trap_no = 14;
- tsk->thread.error_code = error_code;
-
-#ifdef CONFIG_X86_32
- die("Oops", regs, error_code);
- bust_spinlocks(0);
- do_exit(SIGKILL);
-#else
- sig = SIGKILL;
- if (__die("Oops", regs, error_code))
- sig = 0;
- /* Executive summary in case the body of the oops scrolled away */
- printk(KERN_EMERG "CR2: %016lx\n", address);
- oops_end(flags, regs, sig);
-#endif
-
-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
-out_of_memory:
- up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- /*
- * Re-lookup the vma - in theory the vma tree might
- * have changed:
- */
- goto again;
- }
-
- printk("VM: killing process %s\n", tsk->comm);
- if (error_code & PF_USER)
- do_group_exit(SIGKILL);
- goto no_context;
-
-do_sigbus:
- up_read(&mm->mmap_sem);
-
- /* Kernel mode? Handle exceptions or die */
- if (!(error_code & PF_USER))
- goto no_context;
-#ifdef CONFIG_X86_32
- /* User space => ok to do another page fault */
- if (is_prefetch(regs, address, error_code))
- return;
-#endif
- tsk->thread.cr2 = address;
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}

DEFINE_SPINLOCK(pgd_lock);

2009-01-05 01:16:46

by Nick Piggin

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

Really cc linux-arch this time

On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote:
> On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > > ago? I complained about some of the issues in his first version and
> > > asked for some further cleanups, but I think that whole discussion ended
> > > with him saying "I am going to add those changes that you suggested (in
> > > fact, I already have)".
> > >
> > > And then I didn't see anything further. Maybe I just missed the end
> > > result. Or maybe we have it in some -mm branch or something?
> >
> > they would have been in tip/x86/mm and would be upstream now had Nick
> > re-sent a v2 series but that never happened. I think they might have
> > fallen victim to a serious attention deficit caused by the SLQB patch ;-)
>
> Well, I already added Linus's suggestions but didn't submit it because
> there was a bit of work going on in that file as far as I could see, both
> in the x86 tree and in -mm:
>
> (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)
>
> It isn't a big deal to resolve either way, but I don't want to make Andrew's
> life harder.
>
> [Yes OK now I'm the guilty one of pushing in an x86 patch not via the
> x86 tree ;) This one is easy to break in pieces, but I didn't want
> to create a dependency between the trees]
>
> I didn't really consider it to be urgent, so I was waiting for that patch
> to go in, but I was still hoping to get this into 2.6.29... This is what
> it looks like now with your suggestions, and just merged it to your current
> tree (untested).
>
> I'll cc the linux-arch list here too, because it might be nice to keep these
> things as structurally similar as possible (and they'll all want to look at
> the -mm patch above, although I'll probably end up having to write the
> patches!).

---
Optimise x86's do_page_fault (C entry point for the page fault path).

gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines.

Also, ensure the order of arguments to functions is always the same: regs,
error_code, address. This can reduce code size a tiny bit, and just looks
neater too.

Add a couple of branch annotations.

One real behavioural difference this makes is that the OOM-init-task case
will no longer loop around the page fault handler, but will return to
userspace and presumably retry the fault. Effectively the same macro-behaviour,
but it is a notable difference. Such change in behaviour should disappear after
the "call oom killer from page fault" patch.

Before:
do_page_fault:
subq $360, %rsp #,

After:
do_page_fault:
subq $56, %rsp #,

bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680

Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd its original size.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.

---
arch/x86/mm/fault.c | 458 ++++++++++++++++++++++++++++++----------------------
1 file changed, 265 insertions(+), 193 deletions(-)

Index: linux-2.6/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/fault.c
+++ linux-2.6/arch/x86/mm/fault.c
@@ -91,8 +91,8 @@ static inline int notify_page_fault(stru
*
* Opcode checker based on code by Richard Brunner
*/
-static int is_prefetch(struct pt_regs *regs, unsigned long addr,
- unsigned long error_code)
+static int is_prefetch(struct pt_regs *regs, unsigned long error_code,
+ unsigned long addr)
{
unsigned char *instr;
int scan_more = 1;
@@ -409,15 +409,15 @@ static void show_fault_oops(struct pt_re
}

#ifdef CONFIG_X86_64
-static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs,
- unsigned long error_code)
+static noinline void pgtable_bad(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
{
unsigned long flags = oops_begin();
int sig = SIGKILL;
- struct task_struct *tsk;
+ struct task_struct *tsk = current;

printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
- current->comm, address);
+ tsk->comm, address);
dump_pagetable(address);
tsk = current;
tsk->thread.cr2 = address;
@@ -429,6 +429,200 @@ static noinline void pgtable_bad(unsigne
}
#endif

+static noinline void no_context(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+#ifdef CONFIG_X86_64
+ unsigned long flags;
+ int sig;
+#endif
+
+ /* Are we prepared to handle this kernel fault? */
+ if (fixup_exception(regs))
+ return;
+
+ /*
+ * X86_32
+ * Valid to do another page fault here, because if this fault
+ * had been triggered by is_prefetch fixup_exception would have
+ * handled it.
+ *
+ * X86_64
+ * Hall of shame of CPU/BIOS bugs.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata93(regs, address))
+ return;
+
+ /*
+ * Oops. The kernel tried to access some bad page. We'll have to
+ * terminate things with extreme prejudice.
+ */
+#ifdef CONFIG_X86_32
+ bust_spinlocks(1);
+#else
+ flags = oops_begin();
+#endif
+
+ show_fault_oops(regs, error_code, address);
+
+ tsk->thread.cr2 = address;
+ tsk->thread.trap_no = 14;
+ tsk->thread.error_code = error_code;
+
+#ifdef CONFIG_X86_32
+ die("Oops", regs, error_code);
+ bust_spinlocks(0);
+ do_exit(SIGKILL);
+#else
+ sig = SIGKILL;
+ if (__die("Oops", regs, error_code))
+ sig = 0;
+ /* Executive summary in case the body of the oops scrolled away */
+ printk(KERN_EMERG "CR2: %016lx\n", address);
+ oops_end(flags, regs, sig);
+#endif
+}
+
+static void __bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct task_struct *tsk = current;
+
+ /* User mode accesses just cause a SIGSEGV */
+ if (error_code & PF_USER) {
+ /*
+ * It's possible to have interrupts off here.
+ */
+ local_irq_enable();
+
+ /*
+ * Valid to do another page fault here because this one came
+ * from user space.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata100(regs, address))
+ return;
+
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ printk_ratelimit()) {
+ printk(
+ "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
+ task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+ tsk->comm, task_pid_nr(tsk), address,
+ (void *) regs->ip, (void *) regs->sp, error_code);
+ print_vma_addr(" in ", regs->ip);
+ printk("\n");
+ }
+
+ tsk->thread.cr2 = address;
+ /* Kernel addresses are always protection faults */
+ tsk->thread.error_code = error_code | (address >= TASK_SIZE);
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGSEGV, si_code, address, tsk);
+ return;
+ }
+
+ if (is_f00f_bug(regs, address))
+ return;
+
+ no_context(regs, error_code, address);
+}
+
+static noinline void bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+}
+
+static void __bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct mm_struct *mm = current->mm;
+
+ /*
+ * Something tried to access memory that isn't in our memory map..
+ * Fix it, but check if it's kernel or user first..
+ */
+ up_read(&mm->mmap_sem);
+
+ __bad_area_nosemaphore(regs, error_code, address, si_code);
+}
+
+static noinline void bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_MAPERR);
+}
+
+static noinline void bad_area_access_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_ACCERR);
+}
+
+/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
+static void out_of_memory(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+ /*
+ * We ran out of memory, or some other thing happened to us that made
+ * us unable to handle the page fault gracefully.
+ */
+ up_read(&mm->mmap_sem);
+ if (is_global_init(tsk)) {
+ yield();
+ return;
+ }
+
+ printk("VM: killing process %s\n", tsk->comm);
+ if (error_code & PF_USER)
+ do_group_exit(SIGKILL);
+ no_context(regs, error_code, address);
+}
+
+static void do_sigbus(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+
+ up_read(&mm->mmap_sem);
+
+ /* Kernel mode? Handle exceptions or die */
+ if (!(error_code & PF_USER))
+ no_context(regs, error_code, address);
+#ifdef CONFIG_X86_32
+ /* User space => ok to do another page fault */
+ if (is_prefetch(regs, error_code, address))
+ return;
+#endif
+ tsk->thread.cr2 = address;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+}
+
+static noinline void mm_fault_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address, unsigned int fault)
+{
+ if (fault & VM_FAULT_OOM)
+ out_of_memory(regs, error_code, address);
+ else if (fault & VM_FAULT_SIGBUS)
+ do_sigbus(regs, error_code, address);
+ else
+ BUG();
+}
+
static int spurious_fault_check(unsigned long error_code, pte_t *pte)
{
if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -448,8 +642,8 @@ static int spurious_fault_check(unsigned
* There are no security implications to leaving a stale TLB when
* increasing the permissions on a page.
*/
-static int spurious_fault(unsigned long address,
- unsigned long error_code)
+static noinline int spurious_fault(unsigned long error_code,
+ unsigned long address)
{
pgd_t *pgd;
pud_t *pud;
@@ -494,7 +688,7 @@ static int spurious_fault(unsigned long
*
* This assumes no large pages in there.
*/
-static int vmalloc_fault(unsigned long address)
+static noinline int vmalloc_fault(unsigned long address)
{
#ifdef CONFIG_X86_32
unsigned long pgd_paddr;
@@ -573,6 +767,25 @@ static int vmalloc_fault(unsigned long a

int show_unhandled_signals = 1;

+static inline int access_error(unsigned long error_code, int write,
+ struct vm_area_struct *vma)
+{
+ if (write) {
+ /* write, present and write, not present */
+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ return 1;
+ } else if (unlikely(error_code & PF_PROT)) {
+ /* read, present */
+ return 1;
+ } else {
+ /* read, not present */
+ if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+ return 1;
+ }
+
+ return 0;
+}
+
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
@@ -583,16 +796,12 @@ asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
+ unsigned long address;
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
- unsigned long address;
- int write, si_code;
+ int write;
int fault;
-#ifdef CONFIG_X86_64
- unsigned long flags;
- int sig;
-#endif

tsk = current;
mm = tsk->mm;
@@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r
/* get the address */
address = read_cr2();

- si_code = SEGV_MAPERR;
-
- if (notify_page_fault(regs))
+ if (unlikely(notify_page_fault(regs)))
return;
if (unlikely(kmmio_fault(regs, address)))
return;
@@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
*/
- goto bad_area_nosemaphore;
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
}

-
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
@@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r

#ifdef CONFIG_X86_64
if (unlikely(error_code & PF_RSVD))
- pgtable_bad(address, regs, error_code);
+ pgtable_bad(regs, error_code, address);
#endif

/*
* If we're in an interrupt, have no user context or are running in an
* atomic region then we must not take the fault.
*/
- if (unlikely(in_atomic() || !mm))
- goto bad_area_nosemaphore;
+ if (unlikely(in_atomic() || !mm)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }

-again:
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
@@ -684,20 +892,26 @@ again:
* source. If this is invalid we can skip the address space check,
* thus avoiding the deadlock.
*/
- if (!down_read_trylock(&mm->mmap_sem)) {
+ if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if ((error_code & PF_USER) == 0 &&
- !search_exception_tables(regs->ip))
- goto bad_area_nosemaphore;
+ !search_exception_tables(regs->ip)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }
down_read(&mm->mmap_sem);
}

vma = find_vma(mm, address);
- if (!vma)
- goto bad_area;
- if (vma->vm_start <= address)
+ if (unlikely(!vma)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+ if (likely(vma->vm_start <= address))
goto good_area;
- if (!(vma->vm_flags & VM_GROWSDOWN))
- goto bad_area;
+ if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
if (error_code & PF_USER) {
/*
* Accessing the stack below %sp is always a bug.
@@ -705,31 +919,25 @@ again:
* and pusha to work. ("enter $65535,$31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
- if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
- goto bad_area;
+ if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
}
- if (expand_stack(vma, address))
- goto bad_area;
-/*
- * Ok, we have a good vm_area for this memory access, so
- * we can handle it..
- */
+ if (unlikely(expand_stack(vma, address))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+
+ /*
+ * Ok, we have a good vm_area for this memory access, so
+ * we can handle it..
+ */
good_area:
- si_code = SEGV_ACCERR;
- write = 0;
- switch (error_code & (PF_PROT|PF_WRITE)) {
- default: /* 3: write, present */
- /* fall through */
- case PF_WRITE: /* write, not present */
- if (!(vma->vm_flags & VM_WRITE))
- goto bad_area;
- write++;
- break;
- case PF_PROT: /* read, present */
- goto bad_area;
- case 0: /* read, not present */
- if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
- goto bad_area;
+ write = error_code & PF_WRITE;
+ if (unlikely(access_error(error_code, write, vma))) {
+ bad_area_access_error(regs, error_code, address);
+ return;
}

/*
@@ -739,11 +947,8 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, write);
if (unlikely(fault & VM_FAULT_ERROR)) {
- if (fault & VM_FAULT_OOM)
- goto out_of_memory;
- else if (fault & VM_FAULT_SIGBUS)
- goto do_sigbus;
- BUG();
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
@@ -761,139 +966,6 @@ good_area:
}
#endif
up_read(&mm->mmap_sem);
- return;
-
-/*
- * Something tried to access memory that isn't in our memory map..
- * Fix it, but check if it's kernel or user first..
- */
-bad_area:
- up_read(&mm->mmap_sem);
-
-bad_area_nosemaphore:
- /* User mode accesses just cause a SIGSEGV */
- if (error_code & PF_USER) {
- /*
- * It's possible to have interrupts off here.
- */
- local_irq_enable();
-
- /*
- * Valid to do another page fault here because this one came
- * from user space.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata100(regs, address))
- return;
-
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
- printk_ratelimit()) {
- printk(
- "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
- task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
- tsk->comm, task_pid_nr(tsk), address,
- (void *) regs->ip, (void *) regs->sp, error_code);
- print_vma_addr(" in ", regs->ip);
- printk("\n");
- }
-
- tsk->thread.cr2 = address;
- /* Kernel addresses are always protection faults */
- tsk->thread.error_code = error_code | (address >= TASK_SIZE);
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGSEGV, si_code, address, tsk);
- return;
- }
-
- if (is_f00f_bug(regs, address))
- return;
-
-no_context:
- /* Are we prepared to handle this kernel fault? */
- if (fixup_exception(regs))
- return;
-
- /*
- * X86_32
- * Valid to do another page fault here, because if this fault
- * had been triggered by is_prefetch fixup_exception would have
- * handled it.
- *
- * X86_64
- * Hall of shame of CPU/BIOS bugs.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata93(regs, address))
- return;
-
-/*
- * Oops. The kernel tried to access some bad page. We'll have to
- * terminate things with extreme prejudice.
- */
-#ifdef CONFIG_X86_32
- bust_spinlocks(1);
-#else
- flags = oops_begin();
-#endif
-
- show_fault_oops(regs, error_code, address);
-
- tsk->thread.cr2 = address;
- tsk->thread.trap_no = 14;
- tsk->thread.error_code = error_code;
-
-#ifdef CONFIG_X86_32
- die("Oops", regs, error_code);
- bust_spinlocks(0);
- do_exit(SIGKILL);
-#else
- sig = SIGKILL;
- if (__die("Oops", regs, error_code))
- sig = 0;
- /* Executive summary in case the body of the oops scrolled away */
- printk(KERN_EMERG "CR2: %016lx\n", address);
- oops_end(flags, regs, sig);
-#endif
-
-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
-out_of_memory:
- up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- /*
- * Re-lookup the vma - in theory the vma tree might
- * have changed:
- */
- goto again;
- }
-
- printk("VM: killing process %s\n", tsk->comm);
- if (error_code & PF_USER)
- do_group_exit(SIGKILL);
- goto no_context;
-
-do_sigbus:
- up_read(&mm->mmap_sem);
-
- /* Kernel mode? Handle exceptions or die */
- if (!(error_code & PF_USER))
- goto no_context;
-#ifdef CONFIG_X86_32
- /* User space => ok to do another page fault */
- if (is_prefetch(regs, address, error_code))
- return;
-#endif
- tsk->thread.cr2 = address;
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}

DEFINE_SPINLOCK(pgd_lock);

2009-01-07 17:31:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Nick Piggin <[email protected]> wrote:

> On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > > ago? I complained about some of the issues in his first version and
> > > asked for some further cleanups, but I think that whole discussion ended
> > > with him saying "I am going to add those changes that you suggested (in
> > > fact, I already have)".
> > >
> > > And then I didn't see anything further. Maybe I just missed the end
> > > result. Or maybe we have it in some -mm branch or something?
> >
> > they would have been in tip/x86/mm and would be upstream now had Nick
> > re-sent a v2 series but that never happened. I think they might have
> > fallen victim to a serious attention deficit caused by the SLQB patch ;-)
>
> Well, I already added Linus's suggestions but didn't submit it because
> there was a bit of work going on in that file as far as I could see,
> both in the x86 tree and in -mm:
>
> (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)
>
> It isn't a big deal to resolve either way, but I don't want to make
> Andrew's life harder.
>
> [Yes OK now I'm the guilty one of pushing in an x86 patch not via the
> x86 tree ;) This one is easy to break in pieces, but I didn't want to
> create a dependency between the trees]

That's OK, and the oom-killer patch impact on x86 was incidental, so it
was correct to push it via -mm IMO.

Now that the bits that went in via Andrew's tree are upstream, there are a
handful of new conflicts in the patch - so would you mind (re-)sending a
merged-up patch against the latest -git?

Ingo

2009-01-08 19:11:17

by David Daney

[permalink] [raw]
Subject: Re: [PULL] cpumask tree

Rusty Russell wrote:
> commit d036e67b40f52bdd95392390108defbac7e53837
> Author: Rusty Russell <[email protected]>
> Date: Thu Jan 1 10:12:26 2009 +1030
>
> cpumask: convert kernel/irq
>
> Impact: Reduce stack usage, use new cpumask API. ALPHA mod!
>
> Main change is that irq_default_affinity becomes a cpumask_var_t, so
> treat it as a pointer (this effects alpha).
>
> Signed-off-by: Rusty Russell <[email protected]>
>

Which contains:

> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 61c4a9b..cd0cd8d 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -16,8 +16,15 @@
> #include "internals.h"
>
> #ifdef CONFIG_SMP
> +cpumask_var_t irq_default_affinity;
>
> -cpumask_t irq_default_affinity = CPU_MASK_ALL;
> +static int init_irq_default_affinity(void)
> +{
> + alloc_cpumask_var(&irq_default_affinity, GFP_KERNEL);
> + cpumask_setall(irq_default_affinity);
> + return 0;
> +}
> +core_initcall(init_irq_default_affinity);

I think core_initcall is too late to be initializing
irq_default_affinity. This happens way after init_IRQ() is called, and
for my target (mips/cavium_octeon) after the timer and SMP-related irqs
are set up.

I had been setting irq_default_affinity in init_IRQ(), and I could
probably do it later with no real problem, but this seems wrong to me.
Data that is potentially used in interrupt configuration and processing
should be initialized before it is used.
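
One way out - sketched here on the assumption that early_irq_init(),
which start_kernel() runs before init_IRQ(), is an acceptable home for
it, and glossing over which GFP flags are safe that early in boot -
would be to do the allocation there instead of from an initcall:

#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/gfp.h>

extern cpumask_var_t irq_default_affinity;

static void __init init_irq_default_affinity(void)
{
        /* exact GFP flags at this point in boot need checking */
        alloc_cpumask_var(&irq_default_affinity, GFP_KERNEL);
        cpumask_setall(irq_default_affinity);
}

int __init early_irq_init(void)
{
        init_irq_default_affinity();
        /* ... existing irq descriptor setup ... */
        return 0;
}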

David Daney

2009-01-26 19:01:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Mon, 5 Jan 2009 02:16:30 +0100
Nick Piggin <[email protected]> wrote:

> Really cc linux-arch this time
>
> On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote:
> > On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
> > >
> > > * Linus Torvalds <[email protected]> wrote:
> > >
> > > > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > > > ago? I complained about some of the issues in his first version and
> > > > asked for some further cleanups, but I think that whole discussion ended
> > > > with him saying "I am going to add those changes that you suggested (in
> > > > fact, I already have)".
> > > >
> > > > And then I didn't see anything further. Maybe I just missed the end
> > > > result. Or maybe we have it in some -mm branch or something?
> > >
> > > they would have been in tip/x86/mm and would be upstream now had Nick
> > > re-sent a v2 series but that never happened. I think they might have
> > > fallen victim to a serious attention deficit caused by the SLQB patch ;-)
> >
> > Well, I already added Linus's suggestions but didn't submit it because
> > there was a bit of work going on in that file as far as I could see, both
> > in the x86 tree and in -mm:
> >
> > (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)
> >
> > It isn't a big deal to resolve either way, but I don't want to make Andrew's
> > life harder.
> >
> > [Yes OK now I'm the guilty one of pushing in an x86 patch not via the
> > x86 tree ;) This one is easy to break in pieces, but I didn't want
> > to create a dependency between the trees]
> >
> > I didn't really consider it to be urgent, so I was waiting for that patch
> > to go in, but I was still hoping to get this into 2.6.29... This is what
> > it looks like now with your suggestions, and just merged it to your current
> > tree (untested).
> >
> > I'll cc the linux-arch list here too, because it might be nice to keep these
> > things as structurally similar as possible (and they'll all want to look at
> > the -mm patch above, although I'll probably end up having to write the
> > patches!).
>
> ---
> Optimise x86's do_page_fault (C entry point for the page fault path).

It took rather a lot of hunting to find this email. Please do try to
make the email subject match the final patch's title?

> * This routine handles page faults. It determines the address,
> * and the problem, and then passes it off to one of the appropriate
> @@ -583,16 +796,12 @@ asmlinkage
> #endif
> void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
> {
> + unsigned long address;
> struct task_struct *tsk;
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> - unsigned long address;
> - int write, si_code;
> + int write;
> int fault;
> -#ifdef CONFIG_X86_64
> - unsigned long flags;
> - int sig;
> -#endif
>
> tsk = current;
> mm = tsk->mm;
> @@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r
> /* get the address */
> address = read_cr2();
>
> - si_code = SEGV_MAPERR;
> -
> - if (notify_page_fault(regs))
> + if (unlikely(notify_page_fault(regs)))
> return;
> if (unlikely(kmmio_fault(regs, address)))
> return;
> @@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r
> * Don't take the mm semaphore here. If we fixup a prefetch
> * fault we could otherwise deadlock.
> */
> - goto bad_area_nosemaphore;
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> }
>
> -
> /*
> * It's safe to allow irq's after cr2 has been saved and the
> * vmalloc fault has been handled.
> @@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r
>
> #ifdef CONFIG_X86_64
> if (unlikely(error_code & PF_RSVD))
> - pgtable_bad(address, regs, error_code);
> + pgtable_bad(regs, error_code, address);
> #endif
>
> /*
> * If we're in an interrupt, have no user context or are running in an
> * atomic region then we must not take the fault.
> */
> - if (unlikely(in_atomic() || !mm))
> - goto bad_area_nosemaphore;
> + if (unlikely(in_atomic() || !mm)) {
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> + }
>
> -again:
> /*
> * When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> @@ -684,20 +892,26 @@ again:
> * source. If this is invalid we can skip the address space check,
> * thus avoiding the deadlock.
> */
> - if (!down_read_trylock(&mm->mmap_sem)) {
> + if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> if ((error_code & PF_USER) == 0 &&
> - !search_exception_tables(regs->ip))
> - goto bad_area_nosemaphore;
> + !search_exception_tables(regs->ip)) {
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> + }
> down_read(&mm->mmap_sem);
> }
>
> vma = find_vma(mm, address);
> - if (!vma)
> - goto bad_area;
> - if (vma->vm_start <= address)
> + if (unlikely(!vma)) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> + if (likely(vma->vm_start <= address))
> goto good_area;
> - if (!(vma->vm_flags & VM_GROWSDOWN))
> - goto bad_area;
> + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> if (error_code & PF_USER) {
> /*
> * Accessing the stack below %sp is always a bug.
> @@ -705,31 +919,25 @@ again:
> * and pusha to work. ("enter $65535,$31" pushes
> * 32 pointers and then decrements %sp by 65535.)
> */
> - if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
> - goto bad_area;
> + if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> }
> - if (expand_stack(vma, address))
> - goto bad_area;
> -/*
> - * Ok, we have a good vm_area for this memory access, so
> - * we can handle it..
> - */
> + if (unlikely(expand_stack(vma, address))) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> +
> + /*
> + * Ok, we have a good vm_area for this memory access, so
> + * we can handle it..
> + */
> good_area:
> - si_code = SEGV_ACCERR;
> - write = 0;
> - switch (error_code & (PF_PROT|PF_WRITE)) {
> - default: /* 3: write, present */
> - /* fall through */
> - case PF_WRITE: /* write, not present */
> - if (!(vma->vm_flags & VM_WRITE))
> - goto bad_area;
> - write++;
> - break;
> - case PF_PROT: /* read, present */
> - goto bad_area;
> - case 0: /* read, not present */
> - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
> - goto bad_area;
> + write = error_code & PF_WRITE;

What's going on here? We set `error_code' to PF_WRITE, which is some
x86-specific thing.

> + if (unlikely(access_error(error_code, write, vma))) {
> + bad_area_access_error(regs, error_code, address);
> + return;
> }
>
> /*
> @@ -739,11 +947,8 @@ good_area:
> */
> fault = handle_mm_fault(mm, vma, address, write);

and then pass it into handle_mm_fault(), which is expecting a bunch of
flags in the FAULT_FLAG_foo domain.

IOW, the code will accidentally set FAULT_FLAG_NONLINEAR!


Methinks we want something like this,

--- a/arch/x86/mm/fault.c~fix-x86-optimise-x86s-do_page_fault-c-entry-point-for-the-page-fault-path
+++ a/arch/x86/mm/fault.c
@@ -942,7 +942,7 @@ void __kprobes do_page_fault(struct pt_r
* we can handle it..
*/
good_area:
- write = error_code & PF_WRITE;
+ write = (error_code & PF_WRITE) ? FAULT_FLAG_WRITE : 0;
if (unlikely(access_error(error_code, write, vma))) {
bad_area_access_error(regs, error_code, address);
return;
_


but why did the current code pass testing at all??

2009-01-26 19:11:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3



On Mon, 26 Jan 2009, Andrew Morton wrote:
> > + write = error_code & PF_WRITE;
>
> What's going on here? We set `error_code' to PF_WRITE, which is some
> x86-specific thing.

No. We set "write" to non-zero if it was a write fault.

> > fault = handle_mm_fault(mm, vma, address, write);
>
> and then pass it into handle_mm_fault(), which is expecting a bunch of
> flags in the FAULT_FLAG_foo domain.

No. "handle_mm_fault()" takes an integer that is non-zero if it's a write,
zero if it's a read. That's how it has _always_ worked.

I don't see where you find that FAULT_FLAG_foo thing. That's much deeper
down, when people do things like

unsigned int flags = FAULT_FLAG_NONLINEAR |
(write_access ? FAULT_FLAG_WRITE : 0);

based on that whole "write_access" flag.
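
A small standalone C model of this distinction (illustrative only; the PF_WRITE and FAULT_FLAG_* values mirror the 2.6.28-era headers, and nothing below is actual kernel code):

#include <stdio.h>

#define PF_WRITE              0x2   /* x86 page-fault error_code "write" bit */
#define FAULT_FLAG_WRITE      0x1   /* value as in that era's linux/mm.h */
#define FAULT_FLAG_NONLINEAR  0x2   /* value as in that era's linux/mm.h */

/* Models handle_mm_fault(): write_access is only tested for zero/non-zero. */
static const char *model_handle_mm_fault(int write_access)
{
	/* Only deeper down is a FAULT_FLAG_* word built from the boolean. */
	unsigned int flags = write_access ? FAULT_FLAG_WRITE : 0;

	return (flags & FAULT_FLAG_WRITE) ? "write fault" : "read fault";
}

int main(void)
{
	unsigned long error_code = PF_WRITE;	/* pretend it was a write fault */
	int write = error_code & PF_WRITE;	/* 0x2, i.e. simply non-zero */

	/* write happens to equal FAULT_FLAG_NONLINEAR numerically, but at this
	 * level it is never interpreted as a flag word, only as a boolean. */
	printf("%s\n", model_handle_mm_fault(write));
	return 0;
}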

Linus

2009-01-26 19:31:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Mon, 26 Jan 2009 11:09:59 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 26 Jan 2009, Andrew Morton wrote:
> > > + write = error_code & PF_WRITE;
> >
> > What's going on here? We set `error_code' to PF_WRITE, which is some
> > x86-specific thing.
>
> No. We set "write" to non-zero if it was a write fault.
>
> > > fault = handle_mm_fault(mm, vma, address, write);
> >
> > and then pass it into handle_mm_fault(), which is expecting a bunch of
> > flags in the FAULT_FLAG_foo domain.
>
> No. "handle_mm_fault()" takes an integer that is non-zero if it's a write,
> zero if it's a read. That's how it has _always_ worked.
>
> I don't see where you find that FAULT_FLAG_foo thing. That's much deeper
> down, when people do things like
>
> unsigned int flags = FAULT_FLAG_NONLINEAR |
> (write_access ? FAULT_FLAG_WRITE : 0);
>
> based on that whole "write_access" flag.
>

OK, thanks. It's actually page_fault-retry-with-nopage_retry.patch
which got those things confused, and then confused me. I'll go address
that in the other thread..

2009-01-26 20:10:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Andrew Morton <[email protected]> wrote:

> but why did the current code pass testing at all??

i queued it up a week ago and beyond a same-day breakage i reported to
Nick (and which he fixed) this commit was problem-free and passed all
testing here.

Does it cause problems for you? If yes then please describe the kind of
problems.

Note: i see that -mm modifies a few other details of the x86 pagefault
handling path (there's a pagefault-retry patch in there) - so there might be
contextual interactions there. But this particular cleanup/improvement
from Nick is working fine on a wide range of systems here.

Btw., regarding pagefault retry. The bits that are in -mm currently i
find a bit ugly:

> +++ a/arch/x86/mm/fault.c
> @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r
> struct vm_area_struct *vma;
> int write;
> int fault;
> - unsigned int retry_flag = FAULT_FLAG_RETRY;
> + int retry_flag = 1;
>
> tsk = current;
> mm = tsk->mm;
> @@ -951,6 +951,7 @@ good_area:
> }
>
> write |= retry_flag;
> +
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -969,8 +970,8 @@ good_area:
> * be removed or changed after the retry.
> */
> if (fault & VM_FAULT_RETRY) {
> - if (write & FAULT_FLAG_RETRY) {
> - retry_flag &= ~FAULT_FLAG_RETRY;
> + if (retry_flag) {
> + retry_flag = 0;
> goto retry;
> }
> BUG();

as this complicates every architecture with a 'can the fault be retried'
logic and open-coded retry loop.

But that logic is rather repetitive and once an architecture filters out
all its special in-kernel sources of faults and the hw quirks it has, the
handling of pte faults is rather generic and largely offloaded into
handle_pte_fault() already.

So when this patch was submitted a few weeks ago i suggested that retry
should be done purely in mm/memory.c instead, and the low level code
should at most be refactored to suit this method, but not complicated any
further.
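
A rough sketch of that shape (illustrative only; the __handle_mm_fault() helper name is hypothetical, standing in for the existing fault path, and FAULT_FLAG_RETRY/VM_FAULT_RETRY are the flags from the -mm patch quoted above):

/* Sketch only: the retry loop lives once in mm/memory.c instead of
 * being open-coded in every architecture's do_page_fault(). */
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		    unsigned long address, int write_access)
{
	unsigned int flags = write_access ? FAULT_FLAG_WRITE : 0;
	int ret;

	flags |= FAULT_FLAG_RETRY;		/* allow a single retry */
retry:
	ret = __handle_mm_fault(mm, vma, address, flags);
	if (unlikely(ret & VM_FAULT_RETRY)) {
		if (flags & FAULT_FLAG_RETRY) {
			flags &= ~FAULT_FLAG_RETRY;
			goto retry;
		}
		BUG();
	}
	return ret;
}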

Any deep reasons for why such a more generic approach is not desirable?

Ingo

2009-01-26 20:45:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Mon, 26 Jan 2009 21:09:57 +0100
Ingo Molnar <[email protected]> wrote:

> ...
>
> Btw., regarding pagefault retry. The bits that are in -mm currently i
> find a bit ugly:
>
> > +++ a/arch/x86/mm/fault.c
> > @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r
> > struct vm_area_struct *vma;
> > int write;
> > int fault;
> > - unsigned int retry_flag = FAULT_FLAG_RETRY;
> > + int retry_flag = 1;
> >
> > tsk = current;
> > mm = tsk->mm;
> > @@ -951,6 +951,7 @@ good_area:
> > }
> >
> > write |= retry_flag;
> > +
> > /*
> > * If for any reason at all we couldn't handle the fault,
> > * make sure we exit gracefully rather than endlessly redo
> > @@ -969,8 +970,8 @@ good_area:
> > * be removed or changed after the retry.
> > */
> > if (fault & VM_FAULT_RETRY) {
> > - if (write & FAULT_FLAG_RETRY) {
> > - retry_flag &= ~FAULT_FLAG_RETRY;
> > + if (retry_flag) {
> > + retry_flag = 0;
> > goto retry;
> > }
> > BUG();
>
> as this complicates every architecture with a 'can the fault be retried'
> logic and open-coded retry loop.
>
> But that logic is rather repetitive and once an architecture filters out
> all its special in-kernel sources of faults and the hw quirks it has, the
> handling of pte faults is rather generic and largely offloaded into
> handle_pte_fault() already.
>
> So when this patch was submitted a few weeks ago i suggested that retry
> should be done purely in mm/memory.c instead, and the low level code
> should at most be refactored to suit this method, but not complicated any
> further.
>
> Any deep reasons for why such a more generic approach is not desirable?
>

Let's cc the people who wrote it.

2009-01-26 23:22:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3


* Ying Han <[email protected]> wrote:

> Thank you Ingo and Andrew for the comments. I will take a look into it
> ASAP and update it here.

Note, my objection wasn't a hard NAK - just an observation. If all things
considered Andrew still favors the VM_FAULT_RETRY approach then that's
fine too i guess.

It's just that a quick look gave me the feeling of a retry flag tacked on
to an existing codepath [and all the micro-overhead and complexity that
this brings], instead of a clean refactoring of pagefault handling
functionality into a higher MM level retry loop.

So the alternative has to be looked at and rejected because it's
technically inferior - not because it's more difficult to implement.
(which it certainly is)

Ingo

2009-01-26 23:45:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [git pull] cpus4096 tree, part 3

On Tue, 27 Jan 2009 00:21:39 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Ying Han <[email protected]> wrote:
>
> > Thank you Ingo and Andrew for the comments. I will take a look into it
> > ASAP and update it here.
>
> Note, my objection wasn't a hard NAK - just an observation. If all things
> considered Andrew still favors the VM_FAULT_RETRY approach then that's
> fine too i guess.
>
> It's just that a quick look gave me the feeling of a retry flag tacked on
> to an existing codepath [and all the micro-overhead and complexity that
> this brings], instead of a clean refactoring of pagefault handling
> functionality into a higher MM level retry loop.
>
> So the alternative has to be looked at and rejected because it's
> technically inferior - not because it's more difficult to implement.
> (which it certainly is)
>

I have wobbly feelings about this patch. There are your issues, and a
long string of problems and fixes. And my recent half-assed
linux-next-related fix which I didn't really think about.

It all needs a revisit/rereview/reunderstand cycle.