2011-04-19 04:58:10

by Linus Torvalds

Subject: Linux 2.6.39-rc4

So things have sadly not continued to calm down even further. We had
more commits in -rc4 than we had in -rc3, and I sincerely hope that
upward trend doesn't continue.

That said, so far the only thing that has really caused problems this
release cycle has been the block layer plugging changes, and as of
-rc4 the issues we had with MD should hopefully now be behind us. So
we're making progress on that front too.

The plugging code still seems to trigger some issue with what looks
like an infinite stream of disk-change notifications on CD-ROMs - but
Jens is hopefully going to squish that problem soon. In the meantime,
you can avoid the problem by either running SMP or having preemption
enabled.

Other than that? We may have a few more commits than in -rc3, but it
hasn't been _too_ bad. There's certainly nothing overly exciting:
aside from the block/MD fixups, we've got some filesystem updates
(btrfs, cifs and ubifs) and some driver updates (the largest chunk of
which is actually a duplicate driver removal). USB, some KMS, nothing
really earthshaking.

Shortlog appended for the curious.

Linus

---
Abhilash Kesavan (2):
ARM: S5P: Remove unused s3c_pm_check_resume_pin
ARM: SAMSUNG: Fix build failure in PM CRC check code

Alan Stern (1):
USB: EHCI: unlink unused QHs when the controller is stopped

Alberto Mardegan (1):
samsung-laptop: Samsung R410P backlight driver

Alex Deucher (8):
drm/radeon/kms: pll tweaks for rv6xx
drm/radeon/kms: make radeon i2c put/get bytes less noisy
drm/radeon/kms: clean up gart dummy page handling
drm/radeon/kms: fix suspend on rv530 asics
drm/radeon/kms: fix pcie_p callbacks on btc and cayman
drm/radeon/kms: add voltage type to atom set voltage function
drm/radeon/kms: properly program vddci on evergreen+
i2c-algo-bit: Call pre/post_xfer for bit_test

Alexander Clouter (1):
MAINTAINERS: add ARM/ts78xx-setup platform maintainer

Alexandre Bounine (2):
RapidIO: add IDT CPS-1432 switch definitions
RapidIO/mpc85xx: fix possible mport registration problems

Alexey Dobriyan (2):
kstrtox: fix compile warnings in test
kstrtox: simpler code in _kstrtoull()

Alexey Khoroshilov (1):
USB: usb-gadget: unlock data->lock mutex on error path in ep_read()

Andi Kleen (1):
mm: add VM counters for transparent hugepages

Andiry Xu (2):
usbcore: Bug fix: system can't suspend with USB3.0 device
connected to USB3.0 hub
xHCI: Implement AMD PLL quirk

Andreas Bießmann (1):
avr32: add ATAG_BOARDINFO

Aneesh Kumar K.V (5):
fs/9p: Fix revalidate to return correct value
fs/9p: Use write_inode for data sync on server
9p: revert tsyncfs related changes
fs/9p: Fix error reported by coccicheck
9p: Fix sparse error

Anton Blanchard (1):
powerpc: Fix oops if scan_dispatch_log is called too early

Antonio Ospite (1):
leds/leds-regulator.c: fix handling of already enabled regulators

Arne Jansen (1):
btrfs: using cached extent_state in set/unlock combinations

Artem Bityutskiy (1):
UBIFS: fix oops when R/O file-system is fsync'ed

Axel Lin (2):
Input: twl4030_keypad - fix potential NULL dereference in
twl4030_kp_probe()
drivers/rtc/rtc-mc13xxx.c: fix unterminated platform_device_id table

Ben Hutchings (2):
avr32: Fix .size directive for cpu_enter_idle
mm/thp: use conventional format for boolean attributes

Ben Skeggs (5):
drm/nouveau: implement init table opcode 0x5c
drm/nouveau: quirk for XFX GT-240X-YA
drm/nv50: use "nv86" tlb flush method on everything except 0x50/0xac
drm/nv50-nvc0: remove some code that doesn't belong here
drm/nvc0: improve vm flush function

Benjamin Herrenschmidt (1):
powerpc/powermac: Build fix with SMP and CPU hotplug

Bob Liu (1):
ramfs: fix memleak on no-mmu arch

Catalin Marinas (3):
ARM: 6866/1: Do not restrict HIGHPTE to !OUTER_CACHE
ARM: 6867/1: Introduce THREAD_NOTIFY_COPY for copy_thread() hooks
ARM: 6868/1: Preserve the VFP state during fork

Chase Douglas (1):
Input: document event types and codes and their intended use

Chris Mason (4):
Btrfs: make uncache_state unconditional
Btrfs: don't force chunk allocation in find_free_extent
Btrfs end_bio_extent_readpage should look for locked bits
Btrfs: fix free space cache leak

Christian Simon (1):
USB: ftdi_sio: Added IDs for CTI USB Serial Devices

Christoph Fritz (1):
Input: h3600_ts - fix error handling at connect

Christoph Hellwig (2):
block: cleanup the block plug helper functions
block: add blk_run_queue_async

Christoph Lameter (1):
vmstat: update comment regarding stat_threshold

Colin Cross (1):
ARM: tegra: gpio: Fix unused variable warnings

Corentin Chary (3):
asus-laptop: remove removed features from feature-removal-schedule.txt
asus-wmi: swap input name and phys
eeepc-wmi: add keys found on EeePC 1215T

Dan Carpenter (6):
USB: musb: add missing unlock in cppi_interrupt()
USB: musb: using 0 instead of NULL
USB: musb: silence printk format warning
USB: musb: dereferencing an iomem pointer
usb: pch_udc: unlock on allocation failure
USB: xhci: unsigned char never equals -1

Daniel J Blueman (1):
fix user annotation in ioctl.c

Daniel Kiper (1):
mm: optimize pfn calculation in online_page()

Darren Hart (1):
futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

Dave Airlie (3):
i915: restore only the mode of this driver on lastclose
Revert "ttm: Utilize the DMA API for pages that have
TTM_PAGE_FLAG_DMA32 set."
Revert "i915: restore only the mode of this driver on lastclose"

Dave Chinner (1):
nfs: don't call __mark_inode_dirty while holding i_lock

David Brown (2):
msm: Remove extraneous ffa device check
msm: timer: fix missing return value

David Dillow (1):
drm/nv50-nvc0: work around an evo channel hang that some people see

Dmitry Eremin-Solenikov (2):
pcmcia: limit pxa2xx_balloon3 subdriver to balloon3 platform
pcmcia: limit pxa2xx_trizeps4 subdriver to trizeps4 platform

Dmitry Torokhov (6):
USB: fix formatting of SuperSpeed endpoints in /proc/bus/usb/devices
USB: xhci - fix unsafe macro definitions
USB: xhci - remove excessive 'inline' markings
USB: xhci: simplify logic of skipping missed isoc TDs
USB: xhci - fix math in xhci_get_endpoint_interval()
USB: xhci - also free streams when resetting devices

Emil Velikov (1):
nv30: Fix parsing of perf table

Eric B Munson (1):
powerpc/perf_event: Skip updating kernel counters if register
value shrinks

Eric Dumazet (2):
perf: Fix a build error with some GCC versions
memcg: fix mem_cgroup_rotate_reclaimable_page()

Eric Miao (1):
ARM: pxa: convert incorrect IRQ_TO_IRQ() to irq_to_gpio()

Felipe Balbi (2):
usb: musb: temporarily make it bool
usb: musb: gadget: check the correct list_head

Feng Tang (1):
RTC: rtc-mrst: follow on to the change of rtc_device_register()

Geert Uytterhoeven (1):
m68k,m68knommu: Wire up name_to_handle_at, open_by_handle_at,
clock_adjtime, syncfs

Graf Yang (1):
Blackfin: SMP: make all barriers handle cache issues

Greg Kroah-Hartman (2):
samsung-laptop: add support for N230 model
Revert "USB: isp1760-hcd: move imask clear after pending work is done"

Hans J. Koch (1):
MAINTAINERS: change mail adress of Hans J. Koch

Haojian Zhuang (3):
ARM: pxa: always clear LPM bits for PXA168 MFPR
ARM: pxa: align NR_BUILTIN_GPIO with GPIO interrupt number
ARM: mmp: align NR_BUILTIN_GPIO with gpio interrupt number

Harsh Prateek Bora (1):
net/9p: nwname should be an unsigned int

Hema HK (1):
usb: musb: Fix the crash issue during reboot

Hugh Dickins (1):
tmpfs: fix off-by-one in max_blocks checks

Igor Mammedov (1):
Input: xen-kbdfront - fix mouse getting stuck after save/restore

Jacob Pan (1):
x86/mrst: Fix boot crash caused by incorrect pin to irq mapping

Jarod Wilson (1):
Input: add KEY_IMAGES specifically for AL Image Browser

Jean Delvare (1):
i2c: Improve deprecation warnings

Jean-Christophe PLAGNIOL-VILLARD (1):
avr32: At32ap: pio fix typo "))" on gpio_irq_unmask prototype

Jeff Brown (2):
Input: evdev - indicate buffer overrun with SYN_DROPPED
Input: estimate number of events per packet

Jeff Layton (9):
cifs: check for private_data before trying to put it
cifs: replace /proc/fs/cifs/Experimental with a module parm
cifs: always do is_path_accessible check in cifs_mount
cifs: fix broken BCC check in is_valid_oplock_break
cifs: set ra_pages in backing_dev_info
cifs: clean up length checks in check2ndT2
cifs: clean up various nits in unicode routines (try #2)
cifs: wrap received signature check in srv_mutex
cifs: don't allow mmap'ed pages to be dirtied while under
writeback (try #3)

Jeff Mahoney (1):
fs/fhandle.c: add <linux/personality.h> for ia64

Jens Axboe (13):
block: remove block_unplug_timer() trace point
block: fixup block IO unplug trace call
block: add comment on why we save and disable interrupts in
flush_plug_list()
block: add callback function for unplug notification
block: readd plug trace event
block: kill queue_sync_plugs()
block: move queue run on unplug to kblockd
block: only force kblockd unplugging from the schedule() path
block: let io_schedule() flush the plug inline
block: make unplug timer trace event correspond to the schedule() unplug
Revert "block: add callback function for unplug notification"
block: drop queue lock before calling __blk_run_queue() for kblockd punt
block: blk_delay_queue() should use kblockd workqueue

Jeremy Fitzhardinge (1):
xen: just completely disable XSAVE

Jiri Kosina (1):
brk: COMPAT_BRK: fix detection of randomized brk

Joe Perches (2):
MAINTAINERS: update m68knommu patterns
MAINTAINERS: update various tty patterns

Joerg Roedel (2):
USB host: Fix lockdep warning in AMD PLL quirk
x86, amd: Disable GartTlbWlkErr when BIOS forgets it

Johan Hovold (2):
usb: musb: omap2430: fix build failure
USB: ftdi_sio: add PID for OCT DK201 docking station

John Stultz (1):
RTC: Fix early irqs caused by calling rtc_set_alarm too early

Josef Bacik (11):
Btrfs: deal with the case that we run out of space in the cache
Btrfs: only retry transaction reservation once
Btrfs: map the inode item when doing fill_inode_item
Btrfs: do not call btrfs_update_inode in endio if nothing changed
Btrfs: don't split dio bios if we don't have to
Btrfs: do not use async submit for small DIO io's
Btrfs: reuse the extent_map we found when calling btrfs_get_extent
Btrfs: check for duplicate iov_base's when doing dio reads
Btrfs: check for duplicate iov_base's when doing dio reads
Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc

Justin P. Mattock (1):
ARM: 6872/1: arch:common:Makefile Remove unused config in the Makefile.

KOSAKI Motohiro (3):
vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
oom-kill: remove boost_dying_task_prio()
x86, NUMA: Fix fakenuma boot failure

Keith Packard (1):
thinkpad-acpi fails to load with newer Thinkpad X201s BIOS

Ken Chen (2):
sched: Fix sched-domain avg_load calculation
sched: Fix erroneous all_pinned logic

Konrad Rzeszutek Wilk (1):
xen/debug: Don't be so verbose with WARN on 1-1 mapping errors.

Konstantin Khlebnikov (1):
i915: select VIDEO_OUTPUT_CONTROL for ACPI_VIDEO

Kumar Gala (2):
powerpc/book3e: Fix CPU feature handling on 64-bit e5500
powerpc/85xx: disable Suspend support if SMP enabled

Lee, Chun-Yi (1):
acer-wmi: Fix capitalisation of GUID in module alias

Li Zefan (2):
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()

Linus Torvalds (9):
Revert "vfs: Export file system uuid via /proc/<pid>/mountinfo"
vm: fix mlock() on stack guard page
vfs: Re-introduce s_uuid in the superblock
vm: fix vm_pgoff wrap in stack expansion
block: don't flush plugged IO on forced preemtion scheduling
vfs: fix incorrect dentry_update_name_case() BUG_ON() test
next_pidmap: fix overflow condition
proc: do proper range check on readdir offset
Linux 2.6.39-rc4

Liu Yuan (1):
block, blk-sysfs: Use the variable directly instead of a function call

Maksim Rayskiy (1):
UBIFS: fix compilation warnings when compiling with gcc 4.5

Marcin Slusarz (1):
drm/nouveau: fix oops on unload with disabled LVDS panel

Marco Chiappero (1):
sony-laptop: keyboard backlight fixes

Marek Vasut (1):
ARM: pxafb: Fix access to nonexistent member of pxafb_info

Marius B. Kotsbak (1):
USB: option: Added support for Samsung GT-B3730/GT-B3710 LTE USB modem.

Matt Fleming (1):
avr32: init cannot ignore signals sent by force_sig_info()

Matthew Garrett (1):
x86 platform drivers: Build fix for intel_pmic_gpio

Matthew Wilcox (1):
USB: Fix unplug of device with active streams

Mattia Dongili (2):
sony-laptop: fix early NULL pointer dereference
sony-laptop: only show the handles sysfs file in debug mode

Maurus Cuelenaere (1):
ARM: SAMSUNG: Fix warning 's3c_pm_show_resume_irqs' defined but not used

Mian Yousaf Kaukab (2):
usb: musb: clear AUTOSET while clearing DMAENAB
usb: musb: ux500: copy dma mask from platform device to musb device

Miao Xie (2):
Btrfs: Fix incorrect inode nlink in btrfs_link()
Btrfs: Check validity before setting an acl

Michael Ellerman (1):
mm: check that we have the right vma in __access_remote_vm()

Michal Marek (2):
staging: samsung-laptop has moved to platform/x86
samsung-laptop: set backlight type

Michal Simek (1):
usb: Fix Kconfig unmet dependencies for Microblaze EHCI

Michel Dänzer (2):
radeon: Fix KMS CP writeback on big endian machines.
drm/radeon: Fix KMS legacy backlight support if
CONFIG_BACKLIGHT_CLASS_DEVICE=m.

Mike Frysinger (4):
RTC: add missing "return 0" in new alarm func for rtc-bfin.c
USB: musb: blackfin: work around anomaly 05000450
Blackfin: gptimers: fix thinko when disabling timers
Blackfin: time-ts: ack gptimer sooner to avoid missing short ints

Milton Miller (1):
fs: synchronize_rcu when unregister_filesystem success not failure

NeilBrown (8):
block: splice plug list to local context
block: Enhance new plugging support to support general callbacks
md: use new plugging interface for RAID IO.
md/dm - remove remains of plug_fn callback.
md - remove old plugging code.
md: provide generic support for handling unplug callbacks.
md: incorporate new plugging into raid5.
md: fix up raid1/raid10 unplugging.

Nicolas Kaiser (2):
xen: events: fix error checks in bind_*_to_irqhandler()
arm: tegra: fix error check in tegra2_clocks.c

Nicolas Pitre (3):
ARM: 6877/1: the ADDR_NO_RANDOMIZE personality flag should be
honored with mmap()
ARM: 6878/1: fix personality flag propagation across an exec
ARM: 6879/1: fix personality test wrt usage of domain handlers

Nishanth Aravamudan (1):
powerpc/pseries: Use a kmem cache for DTL buffers

Ole Henrik Jahren (1):
avr32: fix deadlock when reading clock list in debugfs

Paul Friedrich (1):
USB: ftdi_sio: add ids for Hameg HO720 and HO730

Paul Gortmaker (1):
powerpc/kexec: Fix regression causing compile failure on UP

Paul Mundt (1):
mm/page_alloc.c: silence build_all_zonelists() section mismatch

Prabhakar Kushwaha (2):
powerpc/85xx: Don't add disabled PCIe devices
powerpc: Check device status before adding serial device

Rafael J. Wysocki (1):
PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS

Randy Dunlap (3):
msi-laptop: fix config-dependent build error
usb: fix ips1760-hcd printk format warning
MAINTAINERS: update STABLE BRANCH info

Richard Henderson (4):
alpha: Don't force -Werror.
alpha: Remove set but unused variables.
alpha: Fix RTC interrupt setup.
alpha: Fix uninitialized value in read_persistent_clock.

Richard Retanubun (1):
USB: isp1760-hcd: move imask clear after pending work is done

Richard Weinberger (2):
um: fix call tracer and bug handler
um: disable CONFIG_CMPXCHG_LOCAL

Roy Spliet (1):
drm/nouveau: correct memtiming table parsing for nv4x

Russell King (2):
ARM: Make consolidated PM sleep code depend on PM_SLEEP
ARM: Only allow PM_SLEEP with CPUs which support suspend

Sage Weil (1):
libceph: fix linger request requeueing

Samuel Ortiz (1):
mfd: Fetch cell pointer from platform_device->mfd_cell

Sarah Sharp (2):
xhci: Fix NULL pointer deref in handle_port_status()
xhci: Tell USB core both roothubs lost power.

Scott Wood (1):
powerpc/e500mc: Remove CPU_FTR_MAYBE_CAN_NAP/CPU_FTR_MAYBE_CAN_DOZE

Sebastian Andrzej Siewior (2):
x86/ce4100: Add reg property to bridges
usb/gadget: don't leak hs_descriptors

Sergei Trofimovich (1):
btrfs: properly handle overlapping areas in memmove_extent_buffer

Shan Haitao (1):
xen: Allow PV-OPS kernel to detect whether XSAVE is supported

Shriram Rajagopalan (1):
fix XEN_SAVE_RESTORE Kconfig dependencies

Shubhrajyoti D (1):
Input: twl4030_keypad - avoid potential NULL-pointer dereference

Sonic Zhang (1):
Blackfin: SMP: fix cache flush loop

Stefan Roese (1):
powerpc: Don't write protect kernel text with
CONFIG_DYNAMIC_FTRACE enabled

Stephane Eranian (1):
perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()

Stephen Boyd (1):
ARM: 6876/1: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS

Steve French (6):
Allow user names longer than 32 bytes
Max share size is too small
Elminate sparse __CHECK_ENDIAN__ warnings on port conversion
various endian fixes to cifs
[CIFS] cifs: clarify the meaning of tcpStatus == CifsGood
[CIFS] Warn on requesting default security (ntlm) on mount

Steven Hardy (3):
usb: Fix qcserial memory leak on rmmod
usb: qcserial avoid pointing to freed memory
usb: qcserial add missing errorpath kfrees

Thomas Gleixner (1):
platform-drivers: x86: pmic: Restore the dropped buslock/unlock

Tim Chen (1):
vfs: Fix absolute RCU path walk failures due to uninitialized seq number

Timo Warns (1):
fs/partitions/ldm.c: fix oops caused by corrupted partition table

Uwe Kleine-König (1):
don't check platform_get_irq's return value against zero

Valentin Longchamp (1):
USB: fsl_qe_udc: send ZLP when zero flag and length % maxpacket == 0

Vasily Khoruzhick (1):
RTC: Fix s3c compile error due to missing s3c_rtc_setpie

Wanlong Gao (2):
fix the wrong argument of the functions definition
drivers/misc/sgi-gru/grufile.c: fix the wrong members of gru_chip

Will Deacon (2):
ARM: 6864/1: hw_breakpoint: clear DBGVCR out of reset
ARM: 6865/1: perf: ensure pass through zero is counted on overflow

Xin Zhong (1):
Btrfs: fix subvolume mount by name problem when default mount
subvolume is set

Yauheni Kaliuta (1):
usb: gadget: eem: fix echo command processing

Yoichi Yuasa (1):
USB: ohci-au1xxx: fix warning "__BIG_ENDIAN" is not defined

Yoshihiro Shimoda (1):
usb: r8a66597-udc: fix spinlock usage

Yoshinori Sano (1):
Btrfs: fix memory leaks in btrfs_new_inode()


2011-04-19 20:05:04

by Randy Dunlap

Subject: [PATCH] uml: fix hppfs build

From: Randy Dunlap <[email protected]>

Make HoneyPot ProcFS depend on CONFIG_PROC_FS so that it will build.
Recommended by Christoph Hellwig.

Fixes kernel bugzilla #33692:
https://bugzilla.kernel.org/show_bug.cgi?id=33692

Reported-by: Simon Danner <[email protected]>
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
---
arch/um/Kconfig.um | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- lnx-2639-rc4.orig/arch/um/Kconfig.um
+++ lnx-2639-rc4/arch/um/Kconfig.um
@@ -47,7 +47,7 @@ config HOSTFS
 
 config HPPFS
 	tristate "HoneyPot ProcFS (EXPERIMENTAL)"
-	depends on EXPERIMENTAL
+	depends on EXPERIMENTAL && PROC_FS
 	help
 	  hppfs (HoneyPot ProcFS) is a filesystem which allows UML /proc
 	  entries to be overridden, removed, or fabricated from the host.

2011-04-19 20:09:37

by Richard Weinberger

Subject: Re: [PATCH] uml: fix hppfs build

On Tuesday 19 April 2011, 22:04:19, Randy Dunlap wrote:
> From: Randy Dunlap <[email protected]>
>
> Make HoneyPot ProcFS depend on CONFIG_PROC_FS so that it will build.
> Recommended by Christoph Hellwig.
>
> Fixes kernel bugzilla #33692:
> https://bugzilla.kernel.org/show_bug.cgi?id=33692
>
> Reported-by: Simon Danner <[email protected]>
> Signed-off-by: Randy Dunlap <[email protected]>
> Cc: Jeff Dike <[email protected]>
> Cc: Richard Weinberger <[email protected]>
> Cc: [email protected]
> Cc: Christoph Hellwig <[email protected]>
> ---
> arch/um/Kconfig.um | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- lnx-2639-rc4.orig/arch/um/Kconfig.um
> +++ lnx-2639-rc4/arch/um/Kconfig.um
> @@ -47,7 +47,7 @@ config HOSTFS
>
>  config HPPFS
>  	tristate "HoneyPot ProcFS (EXPERIMENTAL)"
> -	depends on EXPERIMENTAL
> +	depends on EXPERIMENTAL && PROC_FS
>  	help
>  	  hppfs (HoneyPot ProcFS) is a filesystem which allows UML /proc
>  	  entries to be overridden, removed, or fabricated from the host.

Applied.

Thanks,
//richard

2011-04-20 15:39:14

by Andreas Herrmann

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

Following patch breaks real NUMA on multi-node CPUs like AMD
Magny-Cours and should be reverted (or changed to just take effect in
case of numa=fake):

commit 7d6b46707f2491a94f4bd3b4329d2d7f809e9368
Author: KOSAKI Motohiro <[email protected]>
Date: Fri Apr 15 20:39:01 2011 +0900

x86, NUMA: Fix fakenuma boot failure

...

Thus, this patch implements a reassignment of node-ids if buggy firmware
or numa emulation makes wrong cpu node map. It enforces that all logical cpus
in the same physical cpu share the same node.

...

+static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
+{
+	int node1 = early_cpu_to_node(cpu1);
+	int node2 = early_cpu_to_node(cpu2);
+
+	/*
+	 * Our CPU scheduler assumes all logical cpus in the same physical cpu
+	 * share the same node. But, buggy ACPI or NUMA emulation might assign
+	 * them to different node. Fix it.
+	 */

...

This is a false assumption. Magny-Cours has two nodes in the same
physical package. The scheduler was (kind of) fixed to work around
this boot problem for multi-node CPUs (with 2.6.32). If this is also
an issue with wrong cpu node maps in case of NUMA emulation, it might
be fixed similarly, or this quirk should only be applied in case of NUMA
emulation.

With this patch Linux shows

root # numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 8189 MB
node 0 free: 7937 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 16129 MB
node 2 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 2 size: 8192 MB
node 2 free: 8024 MB
node 3 cpus:
node 3 size: 16384 MB
node 3 free: 16129 MB
node 4 cpus: 24 25 26 27 28 29 30 31 32 33 34 35
node 4 size: 8192 MB
node 4 free: 8013 MB
node 5 cpus:
node 5 size: 16384 MB
node 5 free: 16129 MB
node 6 cpus: 36 37 38 39 40 41 42 43 44 45 46 47
node 6 size: 8192 MB
node 6 free: 8025 MB
node 7 cpus:
node 7 size: 16384 MB
node 7 free: 16128 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10


which is bogus. The correct NUMA information (based on SRAT, without this
patch) is

linux # numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8189 MB
node 0 free: 7947 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 16384 MB
node 1 free: 16114 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8192 MB
node 2 free: 7941 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 16384 MB
node 3 free: 16120 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 8192 MB
node 4 free: 8028 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 16384 MB
node 5 free: 16116 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 8192 MB
node 6 free: 8033 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 16384 MB
node 7 free: 16120 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10



Regards,

Andreas

2011-04-21 00:46:20

by David Rientjes

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

On Wed, 20 Apr 2011, Andreas Herrmann wrote:

> Following patch breaks real NUMA on multi-node CPUs like AMD
> Magny-Cours and should be reverted (or changed to just take effect in
> case of numa=fake):
>
> commit 7d6b46707f2491a94f4bd3b4329d2d7f809e9368
> Author: KOSAKI Motohiro <[email protected]>
> Date: Fri Apr 15 20:39:01 2011 +0900
>
> x86, NUMA: Fix fakenuma boot failure
>
> ...
>
> Thus, this patch implements a reassignment of node-ids if buggy firmware
> or numa emulation makes wrong cpu node map. It enforces that all logical cpus
> in the same physical cpu share the same node.
>
> ...
>
> +static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
> +{
> +	int node1 = early_cpu_to_node(cpu1);
> +	int node2 = early_cpu_to_node(cpu2);
> +
> +	/*
> +	 * Our CPU scheduler assumes all logical cpus in the same physical cpu
> +	 * share the same node. But, buggy ACPI or NUMA emulation might assign
> +	 * them to different node. Fix it.
> +	 */
>
> ...
>
> This is a false assumption. Magny-Cours has two nodes in the same
> physical package. The scheduler was (kind of) fixed to work around
> this boot problem for multi-node CPUs (with 2.6.32). If this is also
> an issue with wrong cpu node maps in case of NUMA emulation, it might
> be fixed similarly, or this quirk should only be applied in case of NUMA
> emulation.
>

Right, this yields cpuless nodes that the scheduler can't handle. Prior
to the unification and cleanup, NUMA emulation would bind cpus to all
nodes that are allocated on the physical node that it has affinity with on
the board. This causes all nodes to have bound cpus such that
node_to_cpumask() correctly reveals the proximity that cpus have to its
nodes, either emulated or otherwise.

We usually don't touch NUMA code for real architectures to fix a problem
that can only happen with NUMA emulation, so 7d6b46707f24 should probably
be reverted.

With that patch reverted, NUMA emulation works fine for me; for example,
with numa=fake=8:

/sys/devices/system/node/node0/cpulist:0-3
/sys/devices/system/node/node1/cpulist:4-7
/sys/devices/system/node/node2/cpulist:8-11
/sys/devices/system/node/node3/cpulist:12-15
/sys/devices/system/node/node4/cpulist:0-3
/sys/devices/system/node/node5/cpulist:4-7
/sys/devices/system/node/node6/cpulist:8-11
/sys/devices/system/node/node7/cpulist:12-15

I'm not sure what it's trying to address (yes, there is a problem with the
binding for CONFIG_NUMA_EMU && CONFIG_DEBUG_PER_CPU_MAPS, but not
otherwise).

KOSAKI-san?

2011-04-21 02:04:33

by KOSAKI Motohiro

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

> Right, this yields cpuless nodes that the scheduler can't handle. Prior
> to the unification and cleanup, NUMA emulation would bind cpus to all
> nodes that are allocated on the physical node that it has affinity with on
> the board. This causes all nodes to have bound cpus such that
> node_to_cpumask() correctly reveals the proximity that cpus have to its
> nodes, either emulated or otherwise.
>
> We usually don't touch NUMA code for real architectures to fix a problem
> that can only happen with NUMA emulation, so 7d6b46707f24 should probably
> be reverted.
>
> With that patch reverted, NUMA emulation works fine for me; for example,
> with numa=fake=8:
>
> /sys/devices/system/node/node0/cpulist:0-3
> /sys/devices/system/node/node1/cpulist:4-7
> /sys/devices/system/node/node2/cpulist:8-11
> /sys/devices/system/node/node3/cpulist:12-15
> /sys/devices/system/node/node4/cpulist:0-3
> /sys/devices/system/node/node5/cpulist:4-7
> /sys/devices/system/node/node6/cpulist:8-11
> /sys/devices/system/node/node7/cpulist:12-15
>
> I'm not sure what it's trying to address (yes, there is a problem with the
> binding for CONFIG_NUMA_EMU && CONFIG_DEBUG_PER_CPU_MAPS, but not
> otherwise).
>
> KOSAKI-san?

Simply reverting 7d6b46707f24 brings back the same boot failure.

[ 0.215976] Pid: 1, comm: swapper Not tainted 2.6.39-rc4+ #10 FUJITSU-SV PRIMERGY /D2559-A1
[ 0.215976] RIP: 0010:[<ffffffff81085b94>] [<ffffffff81085b94>] find_busiest_group+0x464/0xea0
[ 0.215976] RSP: 0018:ffff88003c67d850 EFLAGS: 00010046
[ 0.215976] RAX: 0000000000000000 RBX: 00000000001d2ec0 RCX: 0000000000000000
[ 0.215976] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 0.215976] RBP: ffff88003c67da10 R08: 0000000000000000 R09: 0000000000000000
[ 0.215976] R10: 0000000000000400 R11: 0000000000000000 R12: 00000000001d2ec0
[ 0.215976] R13: 00000000ffffffff R14: ffff88003c640780 R15: 0000000000000001
[ 0.215976] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 0.215976] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.215976] CR2: 0000000000000000 CR3: 0000000001a03000 CR4: 00000000000006f0
[ 0.215976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.215976] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.215976] Process swapper (pid: 1, threadinfo ffff88003c67c000, task ffff88003c678040)
[ 0.215976] Stack:
[ 0.215976] ffff88003c678078 ffff88003c67d9a0 ffff88003c67d880 ffff88003fc00000
[ 0.215976] 0000000000000000 00000000001d2ec0 ffff88003c67db00 0100000000000002
[ 0.215976] ffff88003c67dbdc 0000000000000001 ffff88003fc0e4a0 000000003c678040
[ 0.215976] Call Trace:
[ 0.215976] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.215976] [<ffffffff8108c875>] load_balance+0xc5/0x990
[ 0.215976] [<ffffffff810d05ed>] ? trace_hardirqs_off+0xd/0x10
[ 0.215976] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.215976] [<ffffffff8107e6a2>] ? update_shares+0x162/0x1a0
[ 0.215976] [<ffffffff8107e6ba>] ? update_shares+0x17a/0x1a0
[ 0.215976] [<ffffffff8107e540>] ? update_cfs_shares+0x1d0/0x1d0
[ 0.215976] [<ffffffff815a2673>] schedule+0xb03/0xb10
[ 0.215976] [<ffffffff810d48e1>] ? __lock_acquire+0x541/0x1e80
[ 0.215976] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.215976] [<ffffffff815a2fa5>] schedule_timeout+0x265/0x320
[ 0.215976] [<ffffffff810d05ed>] ? trace_hardirqs_off+0xd/0x10
[ 0.215976] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.215976] [<ffffffff810d0625>] ? lock_release_holdtime+0x35/0x180
[ 0.215976] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.215976] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.215976] [<ffffffff815a2a80>] wait_for_common+0x130/0x190
[ 0.215976] [<ffffffff8108ddb0>] ? try_to_wake_up+0x520/0x520
[ 0.215976] [<ffffffff815a2bbd>] wait_for_completion+0x1d/0x20
[ 0.215976] [<ffffffff810bafbc>] kthread_create_on_node+0xac/0x150
[ 0.215976] [<ffffffff810b3870>] ? process_scheduled_works+0x40/0x40
[ 0.215976] [<ffffffff815a299f>] ? wait_for_common+0x4f/0x190
[ 0.215976] [<ffffffff810b5f03>] __alloc_workqueue_key+0x1a3/0x590
[ 0.215976] [<ffffffff81cc2864>] cpuset_init_smp+0x64/0x74
[ 0.215976] [<ffffffff81ca8cd7>] kernel_init+0xa9/0x168
[ 0.215976] [<ffffffff815af4e4>] kernel_thread_helper+0x4/0x10
[ 0.215976] [<ffffffff815a61d4>] ? retint_restore_args+0x13/0x13
[ 0.215976] [<ffffffff81ca8c2e>] ? start_kernel+0x3f6/0x3f6
[ 0.215976] [<ffffffff815af4e0>] ? gs_change+0x13/0x13
[ 0.215976] Code: 50 fe ff ff 41 89 50 08 0f 1f 80 00 00 00 00 48 8b 95 b0 fe ff ff 48 8b 7d 98 44 8b 42 08 48 89 f8 31 d2 48 c1 e0 0a 48 8b 4d a0
[ 0.215976] f7 f0 48 85 c9 48 89 c6 49 89 c1 48 89 45 90 74 1f 31 d2 48
[ 0.215976] RIP [<ffffffff81085b94>] find_busiest_group+0x464/0xea0
[ 0.215976] RSP <ffff88003c67d850>
[ 0.215976] divide error: 0000 [#2]
[ 0.215976] ---[ end trace 93d72a36b9146f22 ]---
[ 0.215990] swapper used greatest stack depth: 3608 bytes left
[ 0.216000] Kernel panic - not syncing: Attempted to kill init!
[ 0.216002] Pid: 1, comm: swapper Tainted: G D 2.6.39-rc4+ #10
[ 0.216003] Call Trace:
[ 0.216006] [<ffffffff815a1816>] panic+0x91/0x1ab
[ 0.216009] [<ffffffff815a5a20>] ? _raw_write_unlock_irq+0x30/0x40
[ 0.216011] [<ffffffff8109b0ca>] ? do_exit+0x80a/0x970
[ 0.216013] [<ffffffff8109b183>] do_exit+0x8c3/0x970
[ 0.216016] [<ffffffff815a71ef>] oops_end+0xaf/0xf0
[ 0.216019] [<ffffffff81040fab>] die+0x5b/0x90
[ 0.216021] [<ffffffff815a68e4>] do_trap+0xc4/0x170
[ 0.216023] [<ffffffff8103de4f>] do_divide_error+0x8f/0xb0
[ 0.216025] [<ffffffff81085b94>] ? find_busiest_group+0x464/0xea0
[ 0.216028] [<ffffffff812c8d2d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 0.216030] [<ffffffff815a6204>] ? restore_args+0x30/0x30
[ 0.216033] [<ffffffff815af2fb>] divide_error+0x1b/0x20
[ 0.216035] [<ffffffff81085b94>] ? find_busiest_group+0x464/0xea0
[ 0.216038] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.216041] [<ffffffff8108c875>] load_balance+0xc5/0x990
[ 0.216043] [<ffffffff810d05ed>] ? trace_hardirqs_off+0xd/0x10
[ 0.216046] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.216048] [<ffffffff8107e6a2>] ? update_shares+0x162/0x1a0
[ 0.216051] [<ffffffff8107e6ba>] ? update_shares+0x17a/0x1a0
[ 0.216053] [<ffffffff8107e540>] ? update_cfs_shares+0x1d0/0x1d0
[ 0.216055] [<ffffffff815a2673>] schedule+0xb03/0xb10
[ 0.216058] [<ffffffff810d48e1>] ? __lock_acquire+0x541/0x1e80
[ 0.216060] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.216062] [<ffffffff815a2fa5>] schedule_timeout+0x265/0x320
[ 0.216064] [<ffffffff810d05ed>] ? trace_hardirqs_off+0xd/0x10
[ 0.216066] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.216069] [<ffffffff810d0625>] ? lock_release_holdtime+0x35/0x180
[ 0.216071] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.216073] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.216076] [<ffffffff815a2a80>] wait_for_common+0x130/0x190
[ 0.216078] [<ffffffff8108ddb0>] ? try_to_wake_up+0x520/0x520
[ 0.216080] [<ffffffff815a2bbd>] wait_for_completion+0x1d/0x20
[ 0.216083] [<ffffffff810bafbc>] kthread_create_on_node+0xac/0x150
[ 0.216085] [<ffffffff810b3870>] ? process_scheduled_works+0x40/0x40
[ 0.216088] [<ffffffff815a299f>] ? wait_for_common+0x4f/0x190
[ 0.216090] [<ffffffff810b5f03>] __alloc_workqueue_key+0x1a3/0x590
[ 0.216092] [<ffffffff81cc2864>] cpuset_init_smp+0x64/0x74
[ 0.216095] [<ffffffff81ca8cd7>] kernel_init+0xa9/0x168
[ 0.216097] [<ffffffff815af4e4>] kernel_thread_helper+0x4/0x10
[ 0.216099] [<ffffffff815a61d4>] ? retint_restore_args+0x13/0x13
[ 0.216101] [<ffffffff81ca8c2e>] ? start_kernel+0x3f6/0x3f6
[ 0.216103] [<ffffffff815af4e0>] ? gs_change+0x13/0x13
[ 0.215976] SMP
[ 0.215976] last sysfs file:
[ 0.215976] CPU 1
[ 0.215976] Modules linked in:
[ 0.215976]
[ 0.215976] Pid: 2, comm: kthreadd Tainted: G D 2.6.39-rc4+ #10 FUJITSU-SV PRIMERGY /D2559-A1
[ 0.215976] RIP: 0010:[<ffffffff81084d65>] [<ffffffff81084d65>] select_task_rq_fair+0x855/0xb80
[ 0.215976] RSP: 0000:ffff88003c67fc40 EFLAGS: 00010046
[ 0.215976] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 0.215976] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000002
[ 0.215976] RBP: ffff88003c67fcf0 R08: ffff88007aa133f0 R09: 0000000000000000
[ 0.215976] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88007aa133f0
[ 0.215976] R13: ffff88007aa133d8 R14: 0000000000000000 R15: 0000000000000000
[ 0.215976] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 0.215976] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.215976] CR2: 0000000000000000 CR3: 0000000001a03000 CR4: 00000000000006e0
[ 0.215976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.215976] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.215976] Process kthreadd (pid: 2, threadinfo ffff88003c67e000, task ffff88003c680080)
[ 0.215976] Stack:
[ 0.215976] ffffffff815a5a20 000000007aa886e8 ffff88007fdd2ed8 0000000000000002
[ 0.215976] 0000000000000000 00000000001d2ec0 000000000000007d 0000000000000200
[ 0.215976] ffffffffffffffff 0000000000000000 0000000100000008 ffffffff00000001
[ 0.215976] Call Trace:
[ 0.215976] [<ffffffff815a5a20>] ? _raw_write_unlock_irq+0x30/0x40
[ 0.215976] [<ffffffff8108e201>] wake_up_new_task+0x41/0x1b0
[ 0.215976] [<ffffffff810b6cd0>] ? __task_pid_nr_ns+0xc0/0x100
[ 0.215976] [<ffffffff810b6c10>] ? cpumask_weight+0x20/0x20
[ 0.215976] [<ffffffff81095112>] do_fork+0xe2/0x3a0
[ 0.215976] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.215976] [<ffffffff815a59e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 0.215976] [<ffffffff81044885>] ? native_sched_clock+0x15/0x70
[ 0.215976] [<ffffffff810c24ff>] ? local_clock+0x6f/0x80
[ 0.215976] [<ffffffff810456d6>] kernel_thread+0x76/0x80
[ 0.215976] [<ffffffff810bac70>] ? __init_kthread_worker+0x70/0x70
[ 0.215976] [<ffffffff815af4e0>] ? gs_change+0x13/0x13
[ 0.215976] [<ffffffff810bb1c3>] kthreadd+0x113/0x150
[ 0.215976] [<ffffffff815af4e4>] kernel_thread_helper+0x4/0x10
[ 0.215976] [<ffffffff815a61d4>] ? retint_restore_args+0x13/0x13
[ 0.215976] [<ffffffff810bb0b0>] ? tsk_fork_get_node+0x30/0x30
[ 0.215976] [<ffffffff815af4e0>] ? gs_change+0x13/0x13
[ 0.215976] Code: ff ff 44 89 fe 89 c7 e8 4a 26 ff ff 8b 8d 68 ff ff ff 8b 95 70 ff ff ff eb 93 0f 1f 40 00 31 d2 48 89 d8 41 8b 4d 08 48 c1 e0 0a
[ 0.215976] f7 f1 45 85 f6 75 43 48 3b 45 90 0f 83 d9 fe ff ff 4c 89 6d
[ 0.215976] RIP [<ffffffff81084d65>] select_task_rq_fair+0x855/0xb80
[ 0.215976] RSP <ffff88003c67fc40>
[ 0.215976] ---[ end trace 93d72a36b9146f23 ]---



2011-04-21 02:04:36

by KOSAKI Motohiro

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

> Following patch breaks real NUMA on multi-node CPUs like AMD
> Magny-Cours and should be reverted (or changed to just take effect in
> case of numa=fake):
>
> commit 7d6b46707f2491a94f4bd3b4329d2d7f809e9368
> Author: KOSAKI Motohiro <[email protected]>
> Date: Fri Apr 15 20:39:01 2011 +0900
>
> x86, NUMA: Fix fakenuma boot failure
>
> ...
>
> Thus, this patch implements a reassignment of node-ids if buggy firmware
> or numa emulation makes a wrong cpu node map. It enforces that all logical
> cpus in the same physical cpu share the same node.
>
> ...
>
> +static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
> +{
> + int node1 = early_cpu_to_node(cpu1);
> + int node2 = early_cpu_to_node(cpu2);
> +
> + /*
> + * Our CPU scheduler assumes all logical cpus in the same physical cpu
> + * share the same node. But, buggy ACPI or NUMA emulation might assign
> + * them to different node. Fix it.
> + */
>
> ...
>
> This is a false assumption. Magny-Cours has two nodes in the same
> physical package. The scheduler was (kind of) fixed to work around
> this boot problem for multi-node CPUs (with 2.6.32).

I agree we have to fix this ASAP. I also think we have to avoid reintroducing
the same problem again. Can you please tell me the commit-id of this one?

> If this is also
> an issue with wrong cpu node maps in case of NUMA emulation, this might
> be fixed similarly, or this quirk should only be applied in case of NUMA
> emulation.

Indeed.

Tejun, do you remember that I sent a NUMA-emulation-specific patch at first?
I'm now siding with Andreas, because I bet the current NUMA fallback code
(the one you pointed out) has no users.

Or, please let us know if you have an alternative patch.



Attached are the revert and a fakenuma-specific fix patch.


Attachments:
0001-Revert-x86-NUMA-Fix-fakenuma-boot-failure.patch (4.73 kB)
0002-x86-64-NUMA-reimplement-cpu-node-map-initialization-.patch (3.69 kB)

2011-04-21 02:17:27

by David Rientjes

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:

> Simply reverting 7d6b46707f24 brings back the same boot failure.
>

Do you have CONFIG_DEBUG_PER_CPU_MAPS enabled? If not, please send your
.config.

2011-04-21 02:19:16

by David Rientjes

Subject: [patch 1/2] x86, numa: Revert "Fix fakenuma boot failure"

7d6b46707f24 (x86, NUMA: Fix fakenuma boot failure) could cause physical
NUMA topologies to move sibling cpus to a single node when in reality
they are in separate domains. This may result in some nodes being
completely void of cpus, which doesn't accurately represent the correct
topology.

This commit was intended as a fix for NUMA emulation, but should not
cause a regression for real NUMA machines as a side effect.

Reported-by: Andreas Herrmann <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
arch/x86/kernel/smpboot.c | 23 -----------------------
1 files changed, 0 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -312,26 +312,6 @@ void __cpuinit smp_store_cpu_info(int id)
identify_secondary_cpu(c);
}

-static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
-{
- int node1 = early_cpu_to_node(cpu1);
- int node2 = early_cpu_to_node(cpu2);
-
- /*
- * Our CPU scheduler assumes all logical cpus in the same physical cpu
- * share the same node. But, buggy ACPI or NUMA emulation might assign
- * them to different node. Fix it.
- */
- if (node1 != node2) {
- pr_warning("CPU %d in node %d and CPU %d in node %d are in the same physical CPU. forcing same node %d\n",
- cpu1, node1, cpu2, node2, node2);
-
- numa_remove_cpu(cpu1);
- numa_set_node(cpu1, node2);
- numa_add_cpu(cpu1);
- }
-}
-
static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
{
cpumask_set_cpu(cpu1, cpu_sibling_mask(cpu2));
@@ -340,7 +320,6 @@ static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
cpumask_set_cpu(cpu2, cpu_core_mask(cpu1));
cpumask_set_cpu(cpu1, cpu_llc_shared_mask(cpu2));
cpumask_set_cpu(cpu2, cpu_llc_shared_mask(cpu1));
- check_cpu_siblings_on_same_node(cpu1, cpu2);
}


@@ -382,12 +361,10 @@ void __cpuinit set_cpu_sibling_map(int cpu)
per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
- check_cpu_siblings_on_same_node(cpu, i);
}
if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
cpumask_set_cpu(i, cpu_core_mask(cpu));
cpumask_set_cpu(cpu, cpu_core_mask(i));
- check_cpu_siblings_on_same_node(cpu, i);
/*
* Does this new cpu bringup a new core?
*/
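
For illustration only (not part of the patch): a minimal userspace sketch of
why the reverted check can leave a node without cpus on a package that spans
two nodes. The array names, the 4-cpu/2-node layout, and the bring-up loop
are assumptions made for the sketch; only the "move cpu1 onto cpu2's node"
step mirrors the removed check_cpu_siblings_on_same_node().

```c
#include <assert.h>

#define NR_CPUS  4
#define NR_NODES 2

/* Hypothetical stand-ins for the per-cpu node map and the per-node cpumasks. */
static int cpu_to_node_map[NR_CPUS];
static unsigned int node_cpumask[NR_NODES];

/* One package spanning two nodes: cpus 0-1 on node 0, cpus 2-3 on node 1. */
static void init_two_node_package(void)
{
	int cpu;

	node_cpumask[0] = node_cpumask[1] = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		cpu_to_node_map[cpu] = cpu / 2;
		node_cpumask[cpu / 2] |= 1u << cpu;
	}
}

/* Same decision as the reverted check: if the nodes differ, move cpu1. */
static void force_same_node(int cpu1, int cpu2)
{
	int node1, node2;

	assert(cpu1 >= 0 && cpu1 < NR_CPUS);
	assert(cpu2 >= 0 && cpu2 < NR_CPUS);

	node1 = cpu_to_node_map[cpu1];
	node2 = cpu_to_node_map[cpu2];
	if (node1 != node2) {
		node_cpumask[node1] &= ~(1u << cpu1);	/* numa_remove_cpu() */
		cpu_to_node_map[cpu1] = node2;		/* numa_set_node()   */
		node_cpumask[node2] |= 1u << cpu1;	/* numa_add_cpu()    */
	}
}

/* Assumed bring-up order: each new cpu is checked against all earlier ones,
 * since every cpu here shares the same physical package. */
static void bring_up_all_cpus(void)
{
	int cpu, i;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (i = 0; i < cpu; i++)
			force_same_node(cpu, i);
}
```

Calling init_two_node_package() followed by bring_up_all_cpus() in this sketch
ends with all four cpus in node 0's mask and node 1's mask empty, which is the
"nodes being completely void of cpus" situation described in the changelog.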

2011-04-21 02:19:21

by David Rientjes

Subject: [patch 2/2] x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS

cpu nodemasks under CONFIG_DEBUG_PER_CPU_MAPS when NUMA emulation is
enabled are currently broken because the code does not iterate through every
emulated node and bind cpus that have affinity to it. NUMA emulation
should bind each cpu to every local node to accurately represent the true
NUMA topology of the underlying machine.

debug_cpumask_set_cpu() needs to be fixed at the same time so that the
debugging information that it emits shows the new cpumask of the node
being assigned when the cpu is being added or removed. It can now take
responsibility of setting or clearing the cpu itself to remove the need
for duplicate code.

Also change its last formal parameter, "enable", to have the correct bool type
since it can only be true or false.

Signed-off-by: David Rientjes <[email protected]>
---
arch/x86/include/asm/numa.h | 2 +-
arch/x86/mm/numa.c | 27 +++++++++++----------------
arch/x86/mm/numa_emulation.c | 20 ++++++--------------
3 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -51,7 +51,7 @@ static inline void numa_remove_cpu(int cpu) { }
#endif /* CONFIG_NUMA */

#ifdef CONFIG_DEBUG_PER_CPU_MAPS
-struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable);
+void debug_cpumask_set_cpu(int cpu, int node, bool enable);
#endif

#endif /* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,9 +213,8 @@ int early_cpu_to_node(int cpu)
return per_cpu(x86_cpu_to_node_map, cpu);
}

-struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
+void debug_cpumask_set_cpu(int cpu, int node, bool enable)
{
- int node = early_cpu_to_node(cpu);
struct cpumask *mask;
char buf[64];

@@ -227,9 +226,14 @@ struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
if (!mask) {
pr_err("node_to_cpumask_map[%i] NULL\n", node);
dump_stack();
- return NULL;
+ return;
}

+ if (enable)
+ cpumask_set_cpu(cpu, mask);
+ else
+ cpumask_clear_cpu(cpu, mask);
+
cpulist_scnprintf(buf, sizeof(buf), mask);
printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
enable ? "numa_add_cpu" : "numa_remove_cpu",
@@ -238,28 +242,19 @@ struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
}

# ifndef CONFIG_NUMA_EMU
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
+static void __cpuinit numa_set_cpumask(int cpu, bool enable)
{
- struct cpumask *mask;
-
- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
+ debug_cpumask_set_cpu(cpu, early_cpu_to_node(cpu), enable);
}

void __cpuinit numa_add_cpu(int cpu)
{
- numa_set_cpumask(cpu, 1);
+ numa_set_cpumask(cpu, true);
}

void __cpuinit numa_remove_cpu(int cpu)
{
- numa_set_cpumask(cpu, 0);
+ numa_set_cpumask(cpu, false);
}
# endif /* !CONFIG_NUMA_EMU */

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -454,10 +454,9 @@ void __cpuinit numa_remove_cpu(int cpu)
cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
}
#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
+static void __cpuinit numa_set_cpumask(int cpu, bool enable)
{
- struct cpumask *mask;
- int nid, physnid, i;
+ int nid, physnid;

nid = early_cpu_to_node(cpu);
if (nid == NUMA_NO_NODE) {
@@ -467,28 +466,21 @@ static void __cpuinit numa_set_cpumask(int cpu, int enable)

physnid = emu_nid_to_phys[nid];

- for_each_online_node(i) {
+ for_each_online_node(nid) {
if (emu_nid_to_phys[nid] != physnid)
continue;

- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
+ debug_cpumask_set_cpu(cpu, nid, enable);
}
}

void __cpuinit numa_add_cpu(int cpu)
{
- numa_set_cpumask(cpu, 1);
+ numa_set_cpumask(cpu, true);
}

void __cpuinit numa_remove_cpu(int cpu)
{
- numa_set_cpumask(cpu, 0);
+ numa_set_cpumask(cpu, false);
}
#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
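
For illustration only (not part of the patch): a minimal userspace sketch of
the binding rule the changelog describes, where a cpu is added to every
emulated node that sits on the same physical node as the cpu's own emulated
node. The 4-emulated-node layout, the bitmask representation, and the
function names are assumptions made for the sketch. Note that in the removed
code the loop variable was i while the comparison used nid, and the helper
always looked up the cpu's own node, so only one mask was ever touched.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_EMU_NODES 4

/* Hypothetical layout: emulated nodes 0-1 split physical node 0,
 * emulated nodes 2-3 split physical node 1. */
static const int emu_nid_to_phys[NR_EMU_NODES] = { 0, 0, 1, 1 };

/* Stand-in for node_to_cpumask_map[]: one bit per cpu, one word per node. */
static unsigned int node_to_cpumask[NR_EMU_NODES];

/* Set or clear one cpu in one node's mask, as debug_cpumask_set_cpu()
 * does after the patch (minus the debug printk). */
static void set_cpu_in_node(int cpu, int nid, bool enable)
{
	assert(cpu >= 0 && cpu < 32);
	assert(nid >= 0 && nid < NR_EMU_NODES);

	if (enable)
		node_to_cpumask[nid] |= 1u << cpu;
	else
		node_to_cpumask[nid] &= ~(1u << cpu);
}

/* home_nid is the emulated node the cpu was originally assigned to. */
static void numa_set_cpumask_sketch(int cpu, int home_nid, bool enable)
{
	int physnid, nid;

	assert(home_nid >= 0 && home_nid < NR_EMU_NODES);
	physnid = emu_nid_to_phys[home_nid];

	/* Bind the cpu to every emulated node backed by the same physical node. */
	for (nid = 0; nid < NR_EMU_NODES; nid++) {
		if (emu_nid_to_phys[nid] != physnid)
			continue;
		set_cpu_in_node(cpu, nid, enable);
	}
}
```

With this layout, adding cpu 5 with home node 2 sets its bit in the masks of
emulated nodes 2 and 3 and leaves nodes 0 and 1 untouched; the masks are not
exclusive, so one cpu can appear in several emulated nodes.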

2011-04-21 05:45:49

by KOSAKI Motohiro

Subject: Re: [patch 1/2] x86, numa: Revert "Fix fakenuma boot failure"

> 7d6b46707f24 (x86, NUMA: Fix fakenuma boot failure) could cause physical
> NUMA topologies to move sibling cpus to a single node when in reality
> they are in separate domains. This may result in some nodes being
> completely void of cpus, which doesn't accurately represent the correct
> topology.
>
> This commit was intended as a fix for NUMA emulation, but should not
> cause a regression for real NUMA machines as a side effect.
>
> Reported-by: Andreas Herrmann <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Acked-by: KOSAKI Motohiro <[email protected]>


2011-04-21 05:46:00

by KOSAKI Motohiro

Subject: Re: [patch 2/2] x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS

> cpu nodemasks under CONFIG_DEBUG_PER_CPU_MAPS when NUMA emulation is
> enabled are currently broken because the code does not iterate through every
> emulated node and bind cpus that have affinity to it. NUMA emulation
> should bind each cpu to every local node to accurately represent the true
> NUMA topology of the underlying machine.
>
> debug_cpumask_set_cpu() needs to be fixed at the same time so that the
> debugging information that it emits shows the new cpumask of the node
> being assigned when the cpu is being added or removed. It can now take
> responsibility of setting or clearing the cpu itself to remove the need
> for duplicate code.
>
> Also change its last formal parameter, "enable", to have the correct bool type
> since it can only be true or false.
>
> Signed-off-by: David Rientjes <[email protected]>

OK, this is better. I hadn't realized that node_to_cpumask_map[] doesn't
need an exclusive cpu map.

Thank you!


Tested-by: KOSAKI Motohiro <[email protected]>



However

> -struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
> +void debug_cpumask_set_cpu(int cpu, int node, bool enable)
> {
> - int node = early_cpu_to_node(cpu);
> struct cpumask *mask;
> char buf[64];
>
> @@ -227,9 +226,14 @@ struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
> if (!mask) {
> pr_err("node_to_cpumask_map[%i] NULL\n", node);
> dump_stack();
> - return NULL;
> + return;
> }
>
> + if (enable)
> + cpumask_set_cpu(cpu, mask);
> + else
> + cpumask_clear_cpu(cpu, mask);
> +

Shouldn't the following patch also be applied?


From aaca24826696f7911bd66380baa18cfbe4f4b18e Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Thu, 21 Apr 2011 14:01:42 +0900
Subject: [PATCH] Fix

debug_cpumask_set_cpu() has three return statements. We have to change
the remaining two return statements as well.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
arch/x86/mm/numa.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 0471b1d6..745258d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -220,7 +220,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)

if (node == NUMA_NO_NODE) {
/* early_cpu_to_node() already emits a warning and trace */
- return NULL;
+ return;
}
mask = node_to_cpumask_map[node];
if (!mask) {
@@ -238,7 +238,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)
printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
enable ? "numa_add_cpu" : "numa_remove_cpu",
cpu, node, buf);
- return mask;
+ return;
}

# ifndef CONFIG_NUMA_EMU
--
1.7.3.1



2011-04-21 05:45:48

by KOSAKI Motohiro

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

> On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:
>
> > Simply reverting 7d6b46707f24 brings back the same boot failure.
> >
>
> Do you have CONFIG_DEBUG_PER_CPU_MAPS enabled? If not, please send your
> .config.

Oops. Yes, I have CONFIG_DEBUG_PER_CPU_MAPS=y.


2011-04-21 05:57:10

by Andreas Herrmann

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

On Thu, Apr 21, 2011 at 11:04:27AM +0900, KOSAKI Motohiro wrote:
> > Following patch breaks real NUMA on multi-node CPUs like AMD
> > Magny-Cours and should be reverted (or changed to just take effect in
> > case of numa=fake):
> >
> > commit 7d6b46707f2491a94f4bd3b4329d2d7f809e9368
> > Author: KOSAKI Motohiro <[email protected]>
> > Date: Fri Apr 15 20:39:01 2011 +0900
> >
> > x86, NUMA: Fix fakenuma boot failure
> >
> > ...
> >
> > Thus, this patch implements a reassignment of node-ids if buggy firmware
> > or numa emulation makes a wrong cpu node map. It enforces that all logical
> > cpus in the same physical cpu share the same node.
> >
> > ...
> >
> > +static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
> > +{
> > + int node1 = early_cpu_to_node(cpu1);
> > + int node2 = early_cpu_to_node(cpu2);
> > +
> > + /*
> > + * Our CPU scheduler assumes all logical cpus in the same physical cpu
> > + * share the same node. But, buggy ACPI or NUMA emulation might assign
> > + * them to different node. Fix it.
> > + */
> >
> > ...
> >
> > This is a false assumption. Magny-Cours has two nodes in the same
> > physical package. The scheduler was (kind of) fixed to work around
> > this boot problem for multi-node CPUs (with 2.6.32).
>
> I agree we have to fix this ASAP. I also think we have to avoid reintroducing
> the same problem again. Can you please tell me the commit-id of this one?

It's

commit 5a925b4282d7f805deafde62001a83dbaf8be275
Author: Andreas Herrmann <[email protected]>
Date: Thu Sep 3 09:44:28 2009 +0200

x86, sched: Workaround broken sched domain creation for AMD Magny-Cours



> > If this is also
> > an issue with wrong cpu node maps in case of NUMA emulation, this might
> > be fixed similarly, or this quirk should only be applied in case of NUMA
> > emulation.
>
> Indeed.
>
> Tejun, do you remember that I sent a NUMA-emulation-specific patch at first?
> I'm now siding with Andreas, because I bet the current NUMA fallback code
> (the one you pointed out) has no users.
>
> Or, please let us know if you have an alternative patch.
>
> Attached are the revert and a fakenuma-specific fix patch.


Andreas

2011-04-21 12:10:42

by David Rientjes

Subject: [tip:x86/urgent] Revert "x86, NUMA: Fix fakenuma boot failure"

Commit-ID: 37f8527dbfd05af0f670aa02370d0c4cca7fbda6
Gitweb: http://git.kernel.org/tip/37f8527dbfd05af0f670aa02370d0c4cca7fbda6
Author: David Rientjes <[email protected]>
AuthorDate: Wed, 20 Apr 2011 19:19:10 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 21 Apr 2011 11:30:59 +0200

Revert "x86, NUMA: Fix fakenuma boot failure"

Andreas Herrmann reported that 7d6b46707f24 ("x86, NUMA: Fix fakenuma
boot failure") causes certain physical NUMA topologies (for example
AMD Magny-Cours) to move sibling cpus to a single node when in reality
they are in separate domains.

This may result in some nodes being completely void of cpus, which
doesn't accurately represent the correct topology. The system will
boot, but will have suboptimal NUMA performance.

This commit was intended as a fix for NUMA emulation, but should
not cause a regression for real NUMA machines as a side effect.

( There will be a separate fix for the numa-debug code, which
will not affect physical topologies. )

Reported-by: Andreas Herrmann <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/smpboot.c | 23 -----------------------
1 files changed, 0 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8ed8908..c2871d3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -312,26 +312,6 @@ void __cpuinit smp_store_cpu_info(int id)
identify_secondary_cpu(c);
}

-static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
-{
- int node1 = early_cpu_to_node(cpu1);
- int node2 = early_cpu_to_node(cpu2);
-
- /*
- * Our CPU scheduler assumes all logical cpus in the same physical cpu
- * share the same node. But, buggy ACPI or NUMA emulation might assign
- * them to different node. Fix it.
- */
- if (node1 != node2) {
- pr_warning("CPU %d in node %d and CPU %d in node %d are in the same physical CPU. forcing same node %d\n",
- cpu1, node1, cpu2, node2, node2);
-
- numa_remove_cpu(cpu1);
- numa_set_node(cpu1, node2);
- numa_add_cpu(cpu1);
- }
-}
-
static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
{
cpumask_set_cpu(cpu1, cpu_sibling_mask(cpu2));
@@ -340,7 +320,6 @@ static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
cpumask_set_cpu(cpu2, cpu_core_mask(cpu1));
cpumask_set_cpu(cpu1, cpu_llc_shared_mask(cpu2));
cpumask_set_cpu(cpu2, cpu_llc_shared_mask(cpu1));
- check_cpu_siblings_on_same_node(cpu1, cpu2);
}


@@ -382,12 +361,10 @@ void __cpuinit set_cpu_sibling_map(int cpu)
per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
- check_cpu_siblings_on_same_node(cpu, i);
}
if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
cpumask_set_cpu(i, cpu_core_mask(cpu));
cpumask_set_cpu(cpu, cpu_core_mask(i));
- check_cpu_siblings_on_same_node(cpu, i);
/*
* Does this new cpu bringup a new core?
*/

2011-04-21 12:11:11

by David Rientjes

Subject: [tip:x86/urgent] x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS

Commit-ID: 7a6c6547825a2324faa76cff856db11d78de075e
Gitweb: http://git.kernel.org/tip/7a6c6547825a2324faa76cff856db11d78de075e
Author: David Rientjes <[email protected]>
AuthorDate: Wed, 20 Apr 2011 19:19:13 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 21 Apr 2011 11:31:00 +0200

x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS

The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y
when NUMA emulation is enabled are currently broken because the code does
not iterate through every emulated node and bind cpus that have
affinity to it.

NUMA emulation should bind each cpu to every local node to
accurately represent the true NUMA topology of the underlying
machine.

debug_cpumask_set_cpu() needs to be fixed at the same time so
that the debugging information that it emits shows the new
cpumask of the node being assigned when the cpu is being added
or removed.

It can now take responsibility of setting or clearing the cpu
itself to remove the need for duplicate code.

Also change its last parameter, "enable", to have the correct bool
type since it can only be true or false.

-v2: Fix the return statements, by Kosaki Motohiro

Acked-and-Tested-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/numa.h | 2 +-
arch/x86/mm/numa.c | 31 +++++++++++++------------------
arch/x86/mm/numa_emulation.c | 20 ++++++--------------
3 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 3d4dab4..a50fc9f 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -51,7 +51,7 @@ static inline void numa_remove_cpu(int cpu) { }
#endif /* CONFIG_NUMA */

#ifdef CONFIG_DEBUG_PER_CPU_MAPS
-struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable);
+void debug_cpumask_set_cpu(int cpu, int node, bool enable);
#endif

#endif /* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9559d36..745258d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,53 +213,48 @@ int early_cpu_to_node(int cpu)
return per_cpu(x86_cpu_to_node_map, cpu);
}

-struct cpumask __cpuinit *debug_cpumask_set_cpu(int cpu, int enable)
+void debug_cpumask_set_cpu(int cpu, int node, bool enable)
{
- int node = early_cpu_to_node(cpu);
struct cpumask *mask;
char buf[64];

if (node == NUMA_NO_NODE) {
/* early_cpu_to_node() already emits a warning and trace */
- return NULL;
+ return;
}
mask = node_to_cpumask_map[node];
if (!mask) {
pr_err("node_to_cpumask_map[%i] NULL\n", node);
dump_stack();
- return NULL;
+ return;
}

+ if (enable)
+ cpumask_set_cpu(cpu, mask);
+ else
+ cpumask_clear_cpu(cpu, mask);
+
cpulist_scnprintf(buf, sizeof(buf), mask);
printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
enable ? "numa_add_cpu" : "numa_remove_cpu",
cpu, node, buf);
- return mask;
+ return;
}

# ifndef CONFIG_NUMA_EMU
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
+static void __cpuinit numa_set_cpumask(int cpu, bool enable)
{
- struct cpumask *mask;
-
- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
+ debug_cpumask_set_cpu(cpu, early_cpu_to_node(cpu), enable);
}

void __cpuinit numa_add_cpu(int cpu)
{
- numa_set_cpumask(cpu, 1);
+ numa_set_cpumask(cpu, true);
}

void __cpuinit numa_remove_cpu(int cpu)
{
- numa_set_cpumask(cpu, 0);
+ numa_set_cpumask(cpu, false);
}
# endif /* !CONFIG_NUMA_EMU */

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index ad091e4..de84cc1 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -454,10 +454,9 @@ void __cpuinit numa_remove_cpu(int cpu)
cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
}
#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void __cpuinit numa_set_cpumask(int cpu, int enable)
+static void __cpuinit numa_set_cpumask(int cpu, bool enable)
{
- struct cpumask *mask;
- int nid, physnid, i;
+ int nid, physnid;

nid = early_cpu_to_node(cpu);
if (nid == NUMA_NO_NODE) {
@@ -467,28 +466,21 @@ static void __cpuinit numa_set_cpumask(int cpu, int enable)

physnid = emu_nid_to_phys[nid];

- for_each_online_node(i) {
+ for_each_online_node(nid) {
if (emu_nid_to_phys[nid] != physnid)
continue;

- mask = debug_cpumask_set_cpu(cpu, enable);
- if (!mask)
- return;
-
- if (enable)
- cpumask_set_cpu(cpu, mask);
- else
- cpumask_clear_cpu(cpu, mask);
+ debug_cpumask_set_cpu(cpu, nid, enable);
}
}

void __cpuinit numa_add_cpu(int cpu)
{
- numa_set_cpumask(cpu, 1);
+ numa_set_cpumask(cpu, true);
}

void __cpuinit numa_remove_cpu(int cpu)
{
- numa_set_cpumask(cpu, 0);
+ numa_set_cpumask(cpu, false);
}
#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */

2011-04-21 19:43:29

by David Rientjes

Subject: Re: [patch 2/2] x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS

On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:

> From aaca24826696f7911bd66380baa18cfbe4f4b18e Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Thu, 21 Apr 2011 14:01:42 +0900
> Subject: [PATCH] Fix
>
> debug_cpumask_set_cpu() has three return statements. We have to change
> the remaining two return statements as well.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> arch/x86/mm/numa.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 0471b1d6..745258d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -220,7 +220,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)
>
> if (node == NUMA_NO_NODE) {
> /* early_cpu_to_node() already emits a warning and trace */
> - return NULL;
> + return;
> }
> mask = node_to_cpumask_map[node];
> if (!mask) {
> @@ -238,7 +238,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)
> printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
> enable ? "numa_add_cpu" : "numa_remove_cpu",
> cpu, node, buf);
> - return mask;
> + return;
> }
>
> # ifndef CONFIG_NUMA_EMU

Yes, it looks like Ingo fixed that up when it was merged in the latest
git as 7a6c6547825a, thanks for pointing it out.

2011-04-21 19:45:53

by David Rientjes

Subject: Re: Linux 2.6.39-rc4 (regression: NUMA on multi-node CPUs broken)

On Wed, 20 Apr 2011, David Rientjes wrote:

> I'm not sure what it's trying to address (yes, there is a problem with the
> binding for CONFIG_NUMA_EMU && CONFIG_DEBUG_PER_CPU_MAPS, but not
> otherwise).
>

Andreas, the revert (37f8527dbfd0) and the new NUMA emulation fix
(7a6c6547825a) have been merged into the latest -git, please let us know
if there are any other issues that you notice. Thanks!