The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
http://userweb.kernel.org/~akpm/mmotm/
and will soon be available at
git://git.zen-sources.org/zen/mmotm.git
It contains the following patches against 2.6.31:
origin.patch
module-fix-up-config_kallsyms=n-build.patch
linux-next.patch
linux-next-git-rejects.patch
next-remove-localversion.patch
i-need-old-gcc.patch
kernel-core-add-smp_call_function_any.patch
arch-x86-kernel-cpu-cpufreq-acpi-cpufreqc-avoid-cross-cpu-interrupts-by-using-smp_call_function_any.patch
dell_laptop-when-the-hardware-switch-is-disabled-dont-actually-allow-changing-the-softblock-status.patch
acpi-reintroduce-acpi_device_ops-shutdown-method.patch
cs5535-gpio-add-amd-cs5535-cs5536-gpio-driver-support.patch
cs5535-gpio-request-function-mask-names-added.patch
alsa-cs5535audio-free-olpc-quirks-from-reliance-on-mgeode_lx-cpu-optimization.patch
agp-correct-missing-cleanup-on-error-in-agp_add_bridge.patch
avr32-convert-to-asm-generic-hardirqh.patch
powerpc-sky-cpu-redundant-or-incorrect-tests-on-unsigned.patch
hpilo-add-locking-comment.patch
drm-via-add-pci-id-for-via-vx800-chipset.patch
genirq-switch-proc-irq-spurious-to-seq_file.patch
timer-stats-fix-del_timer_sync-and-try_to_del_timer_sync.patch
ia64-use-printk_once.patch
input-drivers-input-xpadc-improve-xbox-360-wireless-support-and-add-sysfs-interface.patch
input-documentation-input-xpadtxt-update-for-new-driver-functionality.patch
kbuild-generate-modulesbuiltin.patch
kbuild-rebuild-fix-for-makefilemodbuiltin.patch
kconfig-cross_compile-option.patch
ide-use-printk_once.patch
mips-decrease-size-of-au1xxx_dbdma_pm_regs.patch
msp71xx-request_irq-failure-ignored-in-msp_pcibios_config_access.patch
mtdpart-memory-accessor-interface-for-mtd-layer.patch
isdn-hisax-fix-lock-imbalance.patch
hfc_usb-fix-read-buffer-overflow.patch
isdn-fix-netjet-build-errors.patch
misdn-fix-reversed-if-in-st_own_ctrl.patch
isdn-eicon-use-offsetof.patch
isdn-eicon-return-on-error.patch
3x59x-fix-pci-resource-management.patch
bluetooth-fix-for-acer-bluetooth-optical-rechargeable-mouse.patch
sunrpc-use-formatting-of-module-name-in-sunrpc.patch
serial_txx9-use-container_of-instead-of-direct-cast.patch
icom-converting-space-to-tabs.patch
cyclades-read-buffer-overflow.patch
serial167-fix-read-buffer-overflow.patch
serial-add-parameter-to-force-skipping-the-test-for-the-txen-bug.patch
drivers-md-introduce-missing-kfree.patch
scsi-add-__init-__exit-macros-to-ibmvstgtc.patch
drivers-scsi-fnic-fnic_scsic-clean-up.patch
ibmmca-buffer-overflow.patch
scsi-eata-fix-buffer-overflow.patch
drivers-scsi-gdthc-fix-buffer-overflow.patch
drivers-scsi-u14-34fc-fix-uffer-overflow.patch
drivers-scsi-lpfc-lpfc_vportc-fix-read-buffer-overflow.patch
osst-fix-read-buffer-overflow.patch
scsi-fix-func-names-in-kernel-doc.patch
gdth-unmap-ccb_phys-when-scsi_add_host-fails-in-gdth_eisa_probe_one.patch
zfcp-test-kmalloc-failure-in-scsi_get_vpd_page.patch
st-fix-test-of-value-range-in-st_set_options.patch
sparc32-convert-to-asm-generic-hardirqh.patch
synaptics-touchscreen-for-htc-dream-check-that-smbus-is-available.patch
drivers-staging-octeon-ethernet-rgmiic-dont-ignore-request_irq-return-code.patch
revert-staging-android-lowmemorykillerc-fix-it-for-oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch
oom-move-oom_adj-value-from-task_struct-to-signal_struct-fix.patch
vfs-fix-vfs_rename_dir-for-fs_rename_does_d_move-filesystems.patch
raw-fix-rawctl-compat-ioctls-breakage-on-amd64-and-itanic.patch
vfs-improve-comment-describing-fget_light.patch
ecryptfs-another-lockdep-issue.patch
fs-remove-unneeded-dcache_unhashed-tricks.patch
fs-improve-remountro-vs-buffercache-coherency.patch
vfs-implement-posix-o_sync-and-o_dsync-semantics.patch
vfs-implement-posix-o_sync-and-o_dsync-semantics-checkpatch-fixes.patch
xtensa-use-generic-sys_pipe.patch
xtensa-convert-to-asm-generic-hardirqh.patch
percpu-avoid-calling-__pcpu_ptr_to_addrnull.patch
mm.patch
ksm-more-on-default-values.patch
mm-includecheck-fix-vmallocc.patch
fs-includecheck-fix-proc-kcorec.patch
video-includecheck-fix-msm-mddic.patch
video-includecheck-fix-da8xx-fbc.patch
cgroups-update-documentation-of-cgroups-tasks-and-procs-files.patch
drivers-input-inputc-fix-config_pm=n-warning.patch
x86-_end-symbol-missing-from-symbolmap.patch
sched-fix-htmldocs-schedc.patch
scsi_libc-avoid-calling-scsi_device_put-from-under-host_lock.patch
hrtimers-remove-the-timer_stats_active-check-when-setting-the-start-info.patch
drivers-net-wireless-ath-ar9170-phyc-fix-uninitialised-variable.patch
readahead-add-blk_run_backing_dev.patch
mm-warn-once-when-a-page-is-freed-with-pg_mlocked-set.patch
mm-vsmcan-check-shrink_active_list-sc-isolate_pages-return-value.patch
dev-mem-remove-redundant-test-on-len.patch
dev-mem-introduce-size_inside_page.patch
dev-mem-cleanup-unxlate_dev_mem_ptr-calls.patch
dev-mem-cleanup-unxlate_dev_mem_ptr-calls-fix.patch
dev-mem-cleanup-unxlate_dev_mem_ptr-calls-fix-fix.patch
dev-mem-make-size_inside_page-logic-straight.patch
dev-mem-remove-the-written-variable-in-write_kmem.patch
dev-mem-remove-the-written-variable-in-write_kmem-fix.patch
dev-mem-remove-redundant-parameter-from-do_write_kmem.patch
pcmcia-fix-controller-printk-warnings.patch
frv-duplicate-output_buffer-of-e03.patch
frv-duplicate-output_buffer-of-e03-checkpatch-fixes.patch
cris-convert-to-use-arch_gettimeoffset.patch
generic-ipi-cleanup-for-generic_smp_call_function_interrupt.patch
maintainers-update-generic-uio-for-pci-devices.patch
maintainers-update-tracing-section.patch
maintainers-update-omap-tony-lindgren-email-name.patch
maintainers-change-atm-mailing-list-to-moderated.patch
maintainers-use-tab-not-spaces-after-field-types.patch
maintainers-update-kernel-janitors-after-mismerge.patch
maintainers-update-score-architecture-name-style-and-add-file-pattern.patch
maintainers-simple-firmware-interface-update-email-style.patch
maintainers-winbond-cir-integrate-p-m-lines-fixup-david-h?rdemans-name.patch
maintainers-fix-up-peripheral-spelling.patch
maintainers-update-wolfson-microelectronics.patch
mmc-make-the-configuration-memory-resource-optional.patch
tmio_mmc-optionally-support-using-platform-clock.patch
sh-switch-migo-r-to-use-the-tmio-mmc-driver-instead-of-spi.patch
sdio-recognize-io-card-without-powercycle.patch
sdio-pass-whitelisted-cis-funce-tuples-to-sdio-drivers.patch
hwmon-driver-for-texas-instruments-amc6821-chip.patch
rtc-set-wakeup-capability-for-i2c-and-spi-rtc-drivers.patch
gpiolib-add-names-file-in-gpio-chip-sysfs.patch
fbdev-bfin-lq035q1-fb-new-blackfin-landscape-lcd-ez-extender-driver.patch
fbdev-bfin-t350mcqb-fb-handle-all-resources-in-suspend-resume.patch
fbdev-bfin-t350mcqb-fb-fix-lcd-dimensions.patch
intelfb-fix-setting-of-active-pipe-with-lvds-displays.patch
hfsplus-identify-journal-info-block-in-volume-header.patch
hfsplus-fix-journal-detection.patch
reiserfs-remove-proc-fs-reiserfs-version.patch
reiserfs-dont-compile-procfso-at-all-if-no-support.patch
reiserfs-truncate-blocks-not-used-by-a-write-v2.patch
fatfs-use-common-time_to_tm-in-fat_time_unix2fat.patch
do_wait-optimization-do-not-place-sub-threads-on-task_struct-children-list.patch
utrace-core.patch
cyclades-allow-overriding-isa-defaults-also-when-the-driver-is-built-in.patch
sound-core-pcm_timerc-use-lib-gcdc.patch
net-netfilter-ipvs-ip_vs_wrrc-use-lib-gcdc.patch
net-netfilter-ipvs-ip_vs_wrrc-use-lib-gcdc-fix.patch
vfs-take-2add-set_page_dirty_notag.patch
reiser4-vfs-add-super_operationssync_inodes-2.patch
reiser4-export-remove_from_page_cache.patch
reiser4-export-remove_from_page_cache-fix.patch
reiser4-export-find_get_pages.patch
reiser4.patch
reiser4-adjust-to-the-new-aops.patch
reiser4-adjust-to-the-new-aops-fixup.patch
reiser4-remove-simple_prepare_write-usage.patch
reiser4-remove-simple_prepare_write-usage-checkpatch-fixes.patch
fs-symlink-write_begin-allocation-context-fix-reiser4-fix.patch
reiser4-handling-error-returned-by-d_obtain_alias-fixup.patch
reiser4-update-names-of-quota-methods.patch
reiser4-use-set_page_dirty_notag.patch
fs-reiser4-add-parenths-around-x-y.patch
fs-reiser4-contextc-current_is_pdflush-got-removed.patch
reiser4-fix.patch
reiser4-disable.patch
make-sure-nobodys-leaking-resources.patch
journal_add_journal_head-debug.patch
releasing-resources-with-children.patch
make-frame_pointer-default=y.patch
mutex-subsystem-synchro-test-module.patch
slab-leaks3-default-y.patch
put_bh-debug.patch
add-debugging-aid-for-memory-initialisation-problems.patch
keep-track-of-network-interface-renaming.patch
workaround-for-a-pci-restoring-bug.patch
prio_tree-debugging-patch.patch
single_open-seq_release-leak-diagnostics.patch
add-a-refcount-check-in-dput.patch
getblk-handle-2tb-devices.patch
getblk-handle-2tb-devices-fix.patch
undeprecate-pci_find_device.patch
notify_change-callers-must-hold-i_mutex.patch
On Fri, 25 Sep 2009 14:35:46 -0700
[email protected] wrote:
> The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
>
> http://userweb.kernel.org/~akpm/mmotm/
>
> and will soon be available at
>
> git://git.zen-sources.org/zen/mmotm.git
>
> It contains the following patches against 2.6.31:
>
At testing my (small) patch, with high memory pressure to
memcg+hierarchy+softlimit, following is shown.
==
INFO: RCU detected CPU 0 stall (t=10000 jiffies)
sending NMI to all CPUs:
NMI backtrace for cpu 0
CPU 0:
Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptabl
e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mo
d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
RIP: 0010:[<ffffffff810878fe>] [<ffffffff810878fe>] trace_hardirqs_off_ca
ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 EFLAGS: 00000006
RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
FS: 00007fdeca0856f0(0000) GS:ffff88004fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
<#DB[1]> <<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
Call Trace:
<NMI> [<ffffffff8100af79>] ? show_regs+0x49/0x50
[<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
[<ffffffff81428891>] do_nmi+0x1b1/0x2e0
[<ffffffff8142808a>] nmi+0x1a/0x2c
[<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
[<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
<<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
[<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
[<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
[<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
[<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
[<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
[<ffffffff81063a26>] update_process_times+0x46/0x70
[<ffffffff81085930>] tick_sched_timer+0x60/0x160
[<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
[<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
[<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
[<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
[<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
[<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
[<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
[<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
[<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
[<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
[<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
[<ffffffff8115de03>] ? seq_read+0xf3/0x420
[<ffffffff811d9926>] ? security_file_permission+0x16/0x20
[<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
[<ffffffff8113b9b5>] ? sys_read+0x55/0x90
[<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
.....
==
mem_cgroup_walk_tree() is not very young function and I've been doing the
same kind of tests but this is 1st messeage for me. So, I'm a bit confused.
and not sure how I start debug from...
Does this mean mem_cgroup_walk_tree() blocks RCU's progress
over 10000 jiffies ? What should I doubt ?
Thanks,
-Kame
On Mon, 28 Sep 2009 15:42:13 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Fri, 25 Sep 2009 14:35:46 -0700
> [email protected] wrote:
>
> > The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
> >
> > http://userweb.kernel.org/~akpm/mmotm/
> >
> > and will soon be available at
> >
> > git://git.zen-sources.org/zen/mmotm.git
> >
> > It contains the following patches against 2.6.31:
> >
>
> At testing my (small) patch, with high memory pressure to
> memcg+hierarchy+softlimit, following is shown.
I'm sorry that I think I found something wrong in memcg.
I'll report it. sorry for noise.
Thanks,
-Kame
> ==
> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
> sending NMI to all CPUs:
> NMI backtrace for cpu 0
> CPU 0:
> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptabl
> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mo
> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
> RIP: 0010:[<ffffffff810878fe>] [<ffffffff810878fe>] trace_hardirqs_off_ca
> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 EFLAGS: 00000006
> RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
> RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
> RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
> FS: 00007fdeca0856f0(0000) GS:ffff88004fa00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> <#DB[1]> <<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
> Call Trace:
> <NMI> [<ffffffff8100af79>] ? show_regs+0x49/0x50
> [<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
> [<ffffffff81428891>] do_nmi+0x1b1/0x2e0
> [<ffffffff8142808a>] nmi+0x1a/0x2c
> [<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
> [<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
> [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
> [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
> [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
> [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
> [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
> [<ffffffff81063a26>] update_process_times+0x46/0x70
> [<ffffffff81085930>] tick_sched_timer+0x60/0x160
> [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
> [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
> [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
> [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
> [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
> <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
> [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
> [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
> [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
> [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
> [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
> [<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
> [<ffffffff8115de03>] ? seq_read+0xf3/0x420
> [<ffffffff811d9926>] ? security_file_permission+0x16/0x20
> [<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
> [<ffffffff8113b9b5>] ? sys_read+0x55/0x90
> [<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
> .....
> ==
> mem_cgroup_walk_tree() is not very young function and I've been doing the
> same kind of tests but this is 1st messeage for me. So, I'm a bit confused.
> and not sure how I start debug from...
>
> Does this mean mem_cgroup_walk_tree() blocks RCU's progress
> over 10000 jiffies ? What should I doubt ?
>
> Thanks,
> -Kame
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
> At testing my (small) patch, with high memory pressure to
> memcg+hierarchy+softlimit, following is shown.
> ==
> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
> sending NMI to all CPUs:
> NMI backtrace for cpu 0
> CPU 0:
> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptabl
> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mo
> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
> RIP: 0010:[<ffffffff810878fe>] [<ffffffff810878fe>] trace_hardirqs_off_ca
> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 EFLAGS: 00000006
> RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
> RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
> RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
> FS: 00007fdeca0856f0(0000) GS:ffff88004fa00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> <#DB[1]> <<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
> Call Trace:
> <NMI> [<ffffffff8100af79>] ? show_regs+0x49/0x50
> [<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
> [<ffffffff81428891>] do_nmi+0x1b1/0x2e0
> [<ffffffff8142808a>] nmi+0x1a/0x2c
> [<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
> [<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
> [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
> [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
> [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
> [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
> [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
> [<ffffffff81063a26>] update_process_times+0x46/0x70
> [<ffffffff81085930>] tick_sched_timer+0x60/0x160
> [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
> [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
> [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
> [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
> [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
> <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
> [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
> [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
> [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
> [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
> [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
> [<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
> [<ffffffff8115de03>] ? seq_read+0xf3/0x420
> [<ffffffff811d9926>] ? security_file_permission+0x16/0x20
> [<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
> [<ffffffff8113b9b5>] ? sys_read+0x55/0x90
> [<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
> .....
> ==
This is a patch for 2.6.31-rc1 (maybe no hunk with -mm)
==
__mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
with incremnted mz->mem->css's refcnt.
Then, the caller of this function has to call css_put(mz->mem->css).
But, mz can be !NULL even if "not found" i.e. without css_get().
By this, css->refcnt will go down to minus.
This may cause various things...one of results will be
initite-loop in css_tryget() as this.
INFO: RCU detected CPU 0 stall (t=10000 jiffies)
sending NMI to all CPUs:
NMI backtrace for cpu 0
CPU 0:
<snip>
<<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
[<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
[<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
[<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
[<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
[<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
[<ffffffff81063a26>] update_process_times+0x46/0x70
[<ffffffff81085930>] tick_sched_timer+0x60/0x160
[<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
[<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
[<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
[<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
[<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
[<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
[<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
[<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
[<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
[<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
Above shows CPU0 caught in css_tryget()'s inifinite loop because
of bad refcnt.
This is a fix to set mz=NULL at the top of retry path.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6.32-rc1/mm/memcontrol.c
===================================================================
--- linux-2.6.32-rc1.orig/mm/memcontrol.c
+++ linux-2.6.32-rc1/mm/memcontrol.c
@@ -447,9 +447,10 @@ static struct mem_cgroup_per_zone *
__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
{
struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz = NULL;
+ struct mem_cgroup_per_zone *mz;
retry:
+ mz = NULL;
rightmost = rb_last(&mctz->rb_root);
if (!rightmost)
goto done; /* Nothing to reclaim from */
This is a patch for checking css->refcnt's sanity at css_put().
BTW, I noticed that...css->refcnt may overflow if used with memcg...
Now, refcnt is incremented per a page. Paul, do you have any idea ?
(Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
it will be big change..)
==
__css_put() doesn't check a bug as refcnt goes to minus.
I think it should be caught. This patch adds a check for it.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
kernel/cgroup.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6.32-rc1/kernel/cgroup.c
===================================================================
--- linux-2.6.32-rc1.orig/kernel/cgroup.c
+++ linux-2.6.32-rc1/kernel/cgroup.c
@@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
void __css_put(struct cgroup_subsys_state *css)
{
struct cgroup *cgrp = css->cgroup;
+ int val;
rcu_read_lock();
- if (atomic_dec_return(&css->refcnt) == 1) {
+ val = atomic_dec_return(&css->refcnt);
+ if (val == 1) {
if (notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
@@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
cgroup_wakeup_rmdir_waiter(cgrp);
}
rcu_read_unlock();
+ WARN_ON(val < 1);
}
/*
> [email protected] wrote:
>
>> The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
>>
>> ? ?http://userweb.kernel.org/~akpm/mmotm/
>>
>> and will soon be available at
>>
>> ? ?git://git.zen-sources.org/zen/mmotm.git
>>
>> It contains the following patches against 2.6.31:
>>
>
> At testing my (small) patch, with high memory pressure to
> memcg+hierarchy+softlimit, following is shown.
> ==
> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
> sending NMI to all CPUs:
> NMI backtrace for cpu 0
> CPU 0:
> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptabl
> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mo
> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> ?Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
> ?RIP: 0010:[<ffffffff810878fe>] ?[<ffffffff810878fe>] trace_hardirqs_off_ca
> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 ?EFLAGS: 00000006
> ?RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
> ?RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
> ?RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
> ?R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> ?R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
> ?FS: ?00007fdeca0856f0(0000) GS:ffff88004fa00000(0000) knlGS:0000000000000000
> ?CS: ?0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> ?CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
> ?DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> ?DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> ?Call Trace:
> ?<#DB[1]> ?<<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
> ?Call Trace:
> ?<NMI> ?[<ffffffff8100af79>] ? show_regs+0x49/0x50
> ?[<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
> ?[<ffffffff81428891>] do_nmi+0x1b1/0x2e0
> ?[<ffffffff8142808a>] nmi+0x1a/0x2c
> ?[<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
> ?[<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
> ?<<EOE>> ?<IRQ> ?[<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
> ?[<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
> ?[<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
> ?[<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
> ?[<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
> ?[<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
> ?[<ffffffff81063a26>] update_process_times+0x46/0x70
> ?[<ffffffff81085930>] tick_sched_timer+0x60/0x160
> ?[<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
> ?[<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
> ?[<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
> ?[<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> ?[<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
> ?[<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
> ?<EOI> ?[<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
> ?[<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
> ?[<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
> ?[<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
> ?[<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
> ?[<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
> ?[<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
> ?[<ffffffff8115de03>] ? seq_read+0xf3/0x420
> ?[<ffffffff811d9926>] ? security_file_permission+0x16/0x20
> ?[<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
> ?[<ffffffff8113b9b5>] ? sys_read+0x55/0x90
> ?[<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
> .....
> ==
> mem_cgroup_walk_tree() is not very young function and I've been doing the
> same kind of tests but this is 1st messeage for me. So, I'm a bit confused.
> and not sure how I start debug from...
>
> Does this mean mem_cgroup_walk_tree() blocks RCU's progress
> over 10000 jiffies ? What should I doubt ?
Could you share your test case so that I can reproduce the problem? My
tests seem to work fine so far
Balbir Singh
> __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
> with incremnted mz->mem->css's refcnt.
> Then, the caller of this function has to call css_put(mz->mem->css).
>
> But, mz can be !NULL even if "not found" i.e. without css_get().
> By this, css->refcnt will go down to minus.
>
> This may cause various things...one of results will be
> initite-loop in css_tryget() ?as this.
>
> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
> sending NMI to all CPUs:
> NMI backtrace for cpu 0
> CPU 0:
> <snip>
>
> ?<<EOE>> ?<IRQ> ?[<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
> ?[<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
> ?[<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
> ?[<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
> ?[<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
> ?[<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
> ?[<ffffffff81063a26>] update_process_times+0x46/0x70
> ?[<ffffffff81085930>] tick_sched_timer+0x60/0x160
> ?[<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
> ?[<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
> ?[<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
> ?[<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> ?[<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
> ?[<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
> ?<EOI> ?[<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
> ?[<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
> ?[<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
> ?[<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
> ?[<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
> ?[<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
>
> Above shows CPU0 caught in css_tryget()'s inifinite loop because
> of bad refcnt.
>
> This is a fix to set mz=NULL at the top of retry path.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> ?mm/memcontrol.c | ? ?3 ++-
> ?1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.32-rc1/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.32-rc1.orig/mm/memcontrol.c
> +++ linux-2.6.32-rc1/mm/memcontrol.c
> @@ -447,9 +447,10 @@ static struct mem_cgroup_per_zone *
> ?__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> ?{
> ? ? ? ?struct rb_node *rightmost = NULL;
> - ? ? ? struct mem_cgroup_per_zone *mz = NULL;
> + ? ? ? struct mem_cgroup_per_zone *mz;
>
> ?retry:
> + ? ? ? mz = NULL;
> ? ? ? ?rightmost = rb_last(&mctz->rb_root);
> ? ? ? ?if (!rightmost)
> ? ? ? ? ? ? ? ?goto done; ? ? ? ? ? ? ?/* Nothing to reclaim from */
>
Good catch! So we fail at css_tryget() once, but mz is valid, we
return a non NULL mz and we do a css_put() causing ref count to go
bad. The next iteration that uses this mz will hang? I've not been
able to hit, may be I should test in a really small cgroup under
stress.
Balbir Singh.
On Mon, Sep 28, 2009 at 03:42:13PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Sep 2009 14:35:46 -0700
> [email protected] wrote:
>
> > The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
> >
> > http://userweb.kernel.org/~akpm/mmotm/
> >
> > and will soon be available at
> >
> > git://git.zen-sources.org/zen/mmotm.git
> >
> > It contains the following patches against 2.6.31:
> >
>
> At testing my (small) patch, with high memory pressure to
> memcg+hierarchy+softlimit, following is shown.
> ==
> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
> sending NMI to all CPUs:
> NMI backtrace for cpu 0
> CPU 0:
> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptabl
> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mo
> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
> RIP: 0010:[<ffffffff810878fe>] [<ffffffff810878fe>] trace_hardirqs_off_ca
> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 EFLAGS: 00000006
> RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
> RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
> RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
> FS: 00007fdeca0856f0(0000) GS:ffff88004fa00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> <#DB[1]> <<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
> Call Trace:
> <NMI> [<ffffffff8100af79>] ? show_regs+0x49/0x50
> [<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
> [<ffffffff81428891>] do_nmi+0x1b1/0x2e0
> [<ffffffff8142808a>] nmi+0x1a/0x2c
> [<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
> [<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
> [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
> [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
> [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
> [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
> [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
> [<ffffffff81063a26>] update_process_times+0x46/0x70
> [<ffffffff81085930>] tick_sched_timer+0x60/0x160
> [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
> [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
> [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
> [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
> [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
> <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
> [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
> [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
mem_cgroup_walk_tree() does have a loop. Not sure why it is showing
up on the stack twice. But it might well be the culprit. Also,
it repeatedly calls css_get_next(), which acquires ss->id_lock.
If some other CPU is holding this lock too long, or failed to release
it, then we could see a stall spinning on the lock. (Though these
sorts of problems are often caught by other diagnostics.)
> [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
mem_cgroup_get_local_stat() is straight-line code, so should not be
the problem.
> [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
mem_control_stat_show() has a pair of fixed-iteration loops, so
should not be the problem.
> [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
Straight-line code, should not be a problem.
> [<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
Straight-line code, should not be a problem.
> [<ffffffff8115de03>] ? seq_read+0xf3/0x420
I suppose that seq_read() could be repeatedly getting -EAGAIN out of
traverse(), but unless I am missing something, we would be seeing a
different stack trace in that case.
> [<ffffffff811d9926>] ? security_file_permission+0x16/0x20
... and so on ...
> [<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
> [<ffffffff8113b9b5>] ? sys_read+0x55/0x90
> [<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
> .....
> ==
> mem_cgroup_walk_tree() is not very young function and I've been doing the
> same kind of tests but this is 1st messeage for me. So, I'm a bit confused.
> and not sure how I start debug from...
>
> Does this mean mem_cgroup_walk_tree() blocks RCU's progress
> over 10000 jiffies ? What should I doubt ?
It means that this CPU has indeed been blocking RCU's progress for more
than 10 seconds (10,000 jiffies if HZ==1000). One thing would be to
wait for another 30 seconds and see if you get another stall warning.
If not, then you could try decreasing RCU_SECONDS_TILL_STALL_RECHECK to
(say) 5*HZ.
Either way, if you get two stack traces during the same CPU-stall event,
it can give you a much better idea where in the callstack the stall is
actually occurring.
Thanx, Paul
Balbir Singh wrote:
>> __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone
>> "mz"
>> with incremnted mz->mem->css's refcnt.
>> Then, the caller of this function has to call css_put(mz->mem->css).
>>
>> But, mz can be !NULL even if "not found" i.e. without css_get().
>> By this, css->refcnt will go down to minus.
>>
>> This may cause various things...one of results will be
>> initite-loop in css_tryget()  as this.
>>
>> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 0
>> CPU 0:
>> <snip>
>>
>>  <<EOE>>  <IRQ>  [<ffffffff810884bd>]
trace_hardirqs_off+0xd/0x10
>>  [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
>>  [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
>>  [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
>>  [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
>>  [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
>>  [<ffffffff81063a26>] update_process_times+0x46/0x70
>>  [<ffffffff81085930>] tick_sched_timer+0x60/0x160
>>  [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
>>  [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
>>  [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
>>  [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>>  [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
>>  [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
>>  <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
>>  [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
>>  [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
>>  [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
>>  [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
>>  [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
>>
>> Above shows CPU0 caught in css_tryget()'s inifinite loop because
>> of bad refcnt.
>>
>> This is a fix to set mz=NULL at the top of retry path.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> ---
>>  mm/memcontrol.c |    3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6.32-rc1/mm/memcontrol.c
>> ===================================================================
>> --- linux-2.6.32-rc1.orig/mm/memcontrol.c
>> +++ linux-2.6.32-rc1/mm/memcontrol.c
>> @@ -447,9 +447,10 @@ static struct mem_cgroup_per_zone *
>>  __mem_cgroup_largest_soft_limit_node(struct
mem_cgroup_tree_per_zone *mctz)
>>  {
>>        struct rb_node *rightmost = NULL;
>> -       struct mem_cgroup_per_zone *mz = NULL;
>> +       struct mem_cgroup_per_zone *mz;
>>
>>  retry:
>> +       mz = NULL;
>>        rightmost = rb_last(&mctz->rb_root);
>>        if (!rightmost)
>>                goto done;
             /* Nothing to reclaim
from */
>>
>
> Good catch! So we fail at css_tryget() once, but mz is valid, we
> return a non NULL mz and we do a css_put() causing ref count to go
> bad.
yes.
> The next iteration that uses this mz will hang?
Maybe no. next css_tryget() hangs if css->refcnt is 0.
> I've not been
> able to hit, may be I should test in a really small cgroup under
> stress.
>
Maybe my small patch (not posted yet) affected and I couldn't reproduce it.
But the bug itself should be fixed.
(So, I want to add a sanity check to css_put()).
Thanks,
-Kame
> Balbir Singh.
>
Balbir Singh wrote:
>> [email protected] wrote:
>>
>>> The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
>>>
>>>    http://userweb.kernel.org/~akpm/mmotm/
>>>
>>> and will soon be available at
>>>
>>>    git://git.zen-sources.org/zen/mmotm.git
>>>
>>> It contains the following patches against 2.6.31:
>>>
>>
>> At testing my (small) patch, with high memory pressure to
>> memcg+hierarchy+softlimit, following is shown.
>> ==
>> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 0
>> CPU 0:
>> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill
>> iptabl
>> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6
>> cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
>> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e
>> parport_pc parport button shpchp megaraid_sas sd_mo
>> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded:
>> microcode]
>>  Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
>>  RIP: 0010:[<ffffffff810878fe>]  [<ffffffff810878fe>]
trace_hardirqs_off_ca
>> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98  EFLAGS: 00000006
>>  RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
>>  RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
>>  RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
>>  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
>>  R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
>>  FS:  00007fdeca0856f0(0000) GS:ffff88004fa00000(0000)
knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
>>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>  Call Trace:
>>  <#DB[1]>  <<EOE>> Pid: 2886, comm: ruby Not tainted
2.6.31-mm1 #2
>>  Call Trace:
>>  <NMI>  [<ffffffff8100af79>] ? show_regs+0x49/0x50
>>  [<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
>>  [<ffffffff81428891>] do_nmi+0x1b1/0x2e0
>>  [<ffffffff8142808a>] nmi+0x1a/0x2c
>>  [<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
>>  [<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
>>  <<EOE>>  <IRQ>  [<ffffffff810884bd>]
trace_hardirqs_off+0xd/0x10
>>  [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
>>  [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
>>  [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
>>  [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
>>  [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
>>  [<ffffffff81063a26>] update_process_times+0x46/0x70
>>  [<ffffffff81085930>] tick_sched_timer+0x60/0x160
>>  [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
>>  [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
>>  [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
>>  [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>>  [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
>>  [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
>>  <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
>>  [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
>>  [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
>>  [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
>>  [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
>>  [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
>>  [<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
>>  [<ffffffff8115de03>] ? seq_read+0xf3/0x420
>>  [<ffffffff811d9926>] ? security_file_permission+0x16/0x20
>>  [<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
>>  [<ffffffff8113b9b5>] ? sys_read+0x55/0x90
>>  [<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
>> .....
>> ==
>> mem_cgroup_walk_tree() is not very young function and I've been doing
>> the
>> same kind of tests but this is 1st messeage for me. So, I'm a bit
>> confused.
>> and not sure how I start debug from...
>>
>> Does this mean mem_cgroup_walk_tree() blocks RCU's progress
>> over 10000 jiffies ? What should I doubt ?
>
> Could you share your test case so that I can reproduce the problem? My
> tests seem to work fine so far
>
Create following hierarchy
/cgroup/A .... use_hierarchy=1, softlimit=300M, no task
/01 .... add task which use much memory
/02 .... no task
B .... add task which use much memory
Then, kswapd will reclaim memory from A/01.
I couldn't reproduce this. I doubt ,y own patch may affect.
Thanks,
-Kame
Paul E. McKenney wrote:
> On Mon, Sep 28, 2009 at 03:42:13PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Fri, 25 Sep 2009 14:35:46 -0700
>> [email protected] wrote:
>>
>> > The mm-of-the-moment snapshot 2009-09-25-14-35 has been uploaded to
>> >
>> > http://userweb.kernel.org/~akpm/mmotm/
>> >
>> > and will soon be available at
>> >
>> > git://git.zen-sources.org/zen/mmotm.git
>> >
>> > It contains the following patches against 2.6.31:
>> >
>>
>> At testing my (small) patch, with high memory pressure to
>> memcg+hierarchy+softlimit, following is shown.
>> ==
>> INFO: RCU detected CPU 0 stall (t=10000 jiffies)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 0
>> CPU 0:
>> Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill
>> iptabl
>> e_filter ip_tables ip6table_filter ip6_tables x_tables ipv6
>> cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log d
>> m_multipath dm_mod uinput ppdev i2c_i801 pcspkr i2c_core bnx2 sg e1000e
>> parport_pc parport button shpchp megaraid_sas sd_mo
>> d scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded:
>> microcode]
>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2 PRIMERGY
>> RIP: 0010:[<ffffffff810878fe>] [<ffffffff810878fe>]
>> trace_hardirqs_off_ca
>> ller+0x3e/0xb RSP: 0018:ffff88004fa03d98 EFLAGS: 00000006
>> RAX: 0000000000000046 RBX: 0000000000000c00 RCX: 000000000000e501
>> RDX: ffff8806133564f0 RSI: 0000000000000002 RDI: ffffffff8102a940
>> RBP: ffff88004fa03d98 R08: 0000000000000001 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
>> R13: 0000000000000046 R14: 00000000000000ff R15: ffff88004fa03f48
>> FS: 00007fdeca0856f0(0000) GS:ffff88004fa00000(0000)
>> knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fdeca09e000 CR3: 0000000619fc6000 CR4: 00000000000006f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Call Trace:
>> <#DB[1]> <<EOE>> Pid: 2886, comm: ruby Not tainted 2.6.31-mm1 #2
>> Call Trace:
>> <NMI> [<ffffffff8100af79>] ? show_regs+0x49/0x50
>> [<ffffffff81429385>] nmi_watchdog_tick+0x1e5/0x210
>> [<ffffffff81428891>] do_nmi+0x1b1/0x2e0
>> [<ffffffff8142808a>] nmi+0x1a/0x2c
>> [<ffffffff8102a940>] ? flat_send_IPI_mask+0x90/0xb0
>> [<ffffffff810878fe>] ? trace_hardirqs_off_caller+0x3e/0xb0
>> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
>> [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
>> [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
>> [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
>> [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
>> [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
>> [<ffffffff81063a26>] update_process_times+0x46/0x70
>> [<ffffffff81085930>] tick_sched_timer+0x60/0x160
>> [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
>> [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
>> [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
>> [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>> [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
>> [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
>> <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
>> [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
>> [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
>
> mem_cgroup_walk_tree() does have a loop. Not sure why it is showing
> up on the stack twice. But it might well be the culprit. Also,
> it repeatedly calls css_get_next(), which acquires ss->id_lock.
> If some other CPU is holding this lock too long, or failed to release
> it, then we could see a stall spinning on the lock. (Though these
> sorts of problems are often caught by other diagnostics.)
>
Thanks. I found above stack trace is in css_tryget() which has a loop.
It loops forever if css->refcnt is 0. Then, I doubt refcnt-miss in css.
>> [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
>
> mem_cgroup_get_local_stat() is straight-line code, so should not be
> the problem.
>
>> [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
>
> mem_control_stat_show() has a pair of fixed-iteration loops, so
> should not be the problem.
>
>> [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
>
> Straight-line code, should not be a problem.
>
>> [<ffffffff810a5b90>] ? cgroup_map_add+0x0/0x30
>
> Straight-line code, should not be a problem.
>
>> [<ffffffff8115de03>] ? seq_read+0xf3/0x420
>
> I suppose that seq_read() could be repeatedly getting -EAGAIN out of
> traverse(), but unless I am missing something, we would be seeing a
> different stack trace in that case.
>
>> [<ffffffff811d9926>] ? security_file_permission+0x16/0x20
>
> ... and so on ...
>
>> [<ffffffff8113b7ec>] ? vfs_read+0xcc/0x190
>> [<ffffffff8113b9b5>] ? sys_read+0x55/0x90
>> [<ffffffff8100bf9b>] ? system_call_fastpath+0x16/0x1b
>> .....
>> ==
>> mem_cgroup_walk_tree() is not very young function and I've been doing
>> the
>> same kind of tests but this is 1st messeage for me. So, I'm a bit
>> confused.
>> and not sure how I start debug from...
>>
>> Does this mean mem_cgroup_walk_tree() blocks RCU's progress
>> over 10000 jiffies ? What should I doubt ?
>
> It means that this CPU has indeed been blocking RCU's progress for more
> than 10 seconds (10,000 jiffies if HZ==1000). One thing would be to
> wait for another 30 seconds and see if you get another stall warning.
> If not, then you could try decreasing RCU_SECONDS_TILL_STALL_RECHECK to
> (say) 5*HZ.
>
I saw twice in the same place. Thanks. I now doubt memcg's code.
> Either way, if you get two stack traces during the same CPU-stall event,
> it can give you a much better idea where in the callstack the stall is
> actually occurring.
>
Thank you for your advices :)
I'll review and add bug-check codes in memcg more.
Regards,
-Kame
On Mon, Sep 28, 2009 at 2:13 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> This is a patch for checking css->refcnt's sanity at css_put().
>
> BTW, I noticed that...css->refcnt may overflow if used with memcg...
> Now, refcnt is incremented per a page. Paul, do you have any idea ?
> (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> ?it will be big change..)
>
> ==
> __css_put() doesn't check a bug as refcnt goes to minus.
> I think it should be caught. This patch adds a check for it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Paul Menage <[email protected]>
Looks reasonable, although there's no guarantee that it will warn on a
buggy release rather than a correct release that occur after a buggy
release.
> ---
> ?kernel/cgroup.c | ? ?5 ++++-
> ?1 file changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.32-rc1/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> +++ linux-2.6.32-rc1/kernel/cgroup.c
> @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> ?void __css_put(struct cgroup_subsys_state *css)
> ?{
> ? ? ? ?struct cgroup *cgrp = css->cgroup;
> + ? ? ? int val;
> ? ? ? ?rcu_read_lock();
> - ? ? ? if (atomic_dec_return(&css->refcnt) == 1) {
> + ? ? ? val = atomic_dec_return(&css->refcnt);
> + ? ? ? if (val == 1) {
> ? ? ? ? ? ? ? ?if (notify_on_release(cgrp)) {
> ? ? ? ? ? ? ? ? ? ? ? ?set_bit(CGRP_RELEASABLE, &cgrp->flags);
> ? ? ? ? ? ? ? ? ? ? ? ?check_for_release(cgrp);
> @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> ? ? ? ? ? ? ? ?cgroup_wakeup_rmdir_waiter(cgrp);
> ? ? ? ?}
> ? ? ? ?rcu_read_unlock();
> + ? ? ? WARN_ON(val < 1);
> ?}
>
> ?/*
>
>
From: Randy Dunlap <[email protected]>
ecryptfs uses crypto APIs so it should depend on CRYPTO.
Otherwise many build errors occur. [63 lines not pasted]
Signed-off-by: Randy Dunlap <[email protected]>
---
fs/ecryptfs/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- mmotm-2009-0925-1435.orig/fs/ecryptfs/Kconfig
+++ mmotm-2009-0925-1435/fs/ecryptfs/Kconfig
@@ -1,6 +1,6 @@
config ECRYPT_FS
tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
- depends on EXPERIMENTAL && KEYS && NET
+ depends on EXPERIMENTAL && KEYS && NET && CRYPTO
select CRYPTO_ECB
select CRYPTO_CBC
help
On 09/28/2009 03:34 PM, Randy Dunlap wrote:
> From: Randy Dunlap <[email protected]>
>
> ecryptfs uses crypto APIs so it should depend on CRYPTO.
> Otherwise many build errors occur. [63 lines not pasted]
>
> Signed-off-by: Randy Dunlap <[email protected]>
> ---
> fs/ecryptfs/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- mmotm-2009-0925-1435.orig/fs/ecryptfs/Kconfig
> +++ mmotm-2009-0925-1435/fs/ecryptfs/Kconfig
> @@ -1,6 +1,6 @@
> config ECRYPT_FS
> tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
> - depends on EXPERIMENTAL && KEYS && NET
> + depends on EXPERIMENTAL && KEYS && NET && CRYPTO
> select CRYPTO_ECB
> select CRYPTO_CBC
> help
Hi Randy - Thanks for the patch! Unfortunately, I think it defeats what
Dave Hansen was wanting to do with commit
382684984e93039a3bbd83b04d341b0ceb831519.
When I pulled that patch in, I was under the assumption that the select
would also select all necessary dependencies. According to
Documentation/kbuild/kconfig-language.txt, that's not the case:
select should be used with care. select will force
a symbol to a value without visiting the dependencies.
By abusing select you are able to select a symbol FOO even
if FOO depends on BAR that is not set.
Maybe we should do it how other folks are tackling this problem and
select CRYPTO, along with CRYPTO_ECB and CRYPTO_CBC. While we're at it,
we should probably throw in CRYPTO_AES (aes-128 is the default cipher,
but the cipher is configurable at mount so it might be too obtrusive for
us to select it) and CRYPTO_MD5 (our default hash alg, not currently
configurable). Also, we don't depend on NET anymore because our netlink
interface is no longer around. It may not hurt to select KEYS, rather
than depend on it. Does all of this sound sane to you?
Tyler
On Mon, 28 Sep 2009 19:10:00 -0500 Tyler Hicks wrote:
> On 09/28/2009 03:34 PM, Randy Dunlap wrote:
> > From: Randy Dunlap <[email protected]>
> >
> > ecryptfs uses crypto APIs so it should depend on CRYPTO.
> > Otherwise many build errors occur. [63 lines not pasted]
> >
> > Signed-off-by: Randy Dunlap <[email protected]>
> > ---
> > fs/ecryptfs/Kconfig | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > --- mmotm-2009-0925-1435.orig/fs/ecryptfs/Kconfig
> > +++ mmotm-2009-0925-1435/fs/ecryptfs/Kconfig
> > @@ -1,6 +1,6 @@
> > config ECRYPT_FS
> > tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
> > - depends on EXPERIMENTAL && KEYS && NET
> > + depends on EXPERIMENTAL && KEYS && NET && CRYPTO
> > select CRYPTO_ECB
> > select CRYPTO_CBC
> > help
>
> Hi Randy - Thanks for the patch! Unfortunately, I think it defeats what
> Dave Hansen was wanting to do with commit
> 382684984e93039a3bbd83b04d341b0ceb831519.
>
> When I pulled that patch in, I was under the assumption that the select
> would also select all necessary dependencies. According to
> Documentation/kbuild/kconfig-language.txt, that's not the case:
>
> select should be used with care. select will force
> a symbol to a value without visiting the dependencies.
> By abusing select you are able to select a symbol FOO even
> if FOO depends on BAR that is not set.
>
> Maybe we should do it how other folks are tackling this problem and
> select CRYPTO, along with CRYPTO_ECB and CRYPTO_CBC. While we're at it,
> we should probably throw in CRYPTO_AES (aes-128 is the default cipher,
> but the cipher is configurable at mount so it might be too obtrusive for
> us to select it) and CRYPTO_MD5 (our default hash alg, not currently
> configurable). Also, we don't depend on NET anymore because our netlink
> interface is no longer around. It may not hurt to select KEYS, rather
> than depend on it. Does all of this sound sane to you?
It selects too much stuff. "select" should not be used to enable
a full subsystem (that's my general rule, not in kconfig-language.txt).
What kconfig-language.txt says that applies here is just after your
quoted text:
In general use select only for non-visible symbols
(no prompts anywhere) and for symbols with no dependencies.
That will limit the usefulness but on the other hand avoid
the illegal configurations all over.
CRYPTO does not fit that.
One of the big problems with selecting kconfig symbols (like subsystem
ones) is that it makes it difficult to disable that symbol, which some
of us often want to do.
---
~Randy
KAMEZAWA Hiroyuki wrote:
> This is a patch for checking css->refcnt's sanity at css_put().
>
> BTW, I noticed that...css->refcnt may overflow if used with memcg...
> Now, refcnt is incremented per a page. Paul, do you have any idea ?
atomic64_t ?
But for 4K pagesize, it won't overflow until when the referenced
memory is > 8T?
> (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> it will be big change..)
>
> ==
> __css_put() doesn't check a bug as refcnt goes to minus.
> I think it should be caught. This patch adds a check for it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Li Zefan <[email protected]>
> ---
> kernel/cgroup.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.32-rc1/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> +++ linux-2.6.32-rc1/kernel/cgroup.c
> @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> void __css_put(struct cgroup_subsys_state *css)
> {
> struct cgroup *cgrp = css->cgroup;
> + int val;
> rcu_read_lock();
> - if (atomic_dec_return(&css->refcnt) == 1) {
> + val = atomic_dec_return(&css->refcnt);
> + if (val == 1) {
> if (notify_on_release(cgrp)) {
> set_bit(CGRP_RELEASABLE, &cgrp->flags);
> check_for_release(cgrp);
> @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> cgroup_wakeup_rmdir_waiter(cgrp);
> }
> rcu_read_unlock();
> + WARN_ON(val < 1);
When we run into this, it'll probably fill up the syslog quickly,
so I think WARN_ON_ONCE() is a bit better.
> }
>
On Tue, 29 Sep 2009 08:50:33 +0800
Li Zefan <[email protected]> wrote:
> KAMEZAWA Hiroyuki wrote:
> > This is a patch for checking css->refcnt's sanity at css_put().
> >
> > BTW, I noticed that...css->refcnt may overflow if used with memcg...
> > Now, refcnt is incremented per a page. Paul, do you have any idea ?
>
> atomic64_t ?
>
maybe. atomic_long_t ?
> But for 4K pagesize, it won't overflow until when the referenced
> memory is > 8T?
>
you're right. But there tends to be a few users who use unbelievable amounts
of memory in the world.
(Such user uses memcg or not is another problem ;)
> > (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> > it will be big change..)
> >
> > ==
> > __css_put() doesn't check a bug as refcnt goes to minus.
> > I think it should be caught. This patch adds a check for it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Acked-by: Li Zefan <[email protected]>
>
> > ---
> > kernel/cgroup.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.32-rc1/kernel/cgroup.c
> > ===================================================================
> > --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> > +++ linux-2.6.32-rc1/kernel/cgroup.c
> > @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> > void __css_put(struct cgroup_subsys_state *css)
> > {
> > struct cgroup *cgrp = css->cgroup;
> > + int val;
> > rcu_read_lock();
> > - if (atomic_dec_return(&css->refcnt) == 1) {
> > + val = atomic_dec_return(&css->refcnt);
> > + if (val == 1) {
> > if (notify_on_release(cgrp)) {
> > set_bit(CGRP_RELEASABLE, &cgrp->flags);
> > check_for_release(cgrp);
> > @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> > cgroup_wakeup_rmdir_waiter(cgrp);
> > }
> > rcu_read_unlock();
> > + WARN_ON(val < 1);
>
> When we run into this, it'll probably fill up the syslog quickly,
> so I think WARN_ON_ONCE() is a bit better.
>
Hmm, ok. I'll rewrite.
Thanks,
-Kame
> > }
> >
>
On Mon, 28 Sep 2009 07:20:08 -0700
Paul Menage <[email protected]> wrote:
> On Mon, Sep 28, 2009 at 2:13 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > This is a patch for checking css->refcnt's sanity at css_put().
> >
> > BTW, I noticed that...css->refcnt may overflow if used with memcg...
> > Now, refcnt is incremented per a page. Paul, do you have any idea ?
> > (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> > it will be big change..)
> >
> > ==
> > __css_put() doesn't check a bug as refcnt goes to minus.
> > I think it should be caught. This patch adds a check for it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Acked-by: Paul Menage <[email protected]>
>
Thanks.
> Looks reasonable, although there's no guarantee that it will warn on a
> buggy release rather than a correct release that occur after a buggy
> release.
>
yes, it's a problem of refcnt.
Thanks,
-Kame
> > ---
> > kernel/cgroup.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.32-rc1/kernel/cgroup.c
> > ===================================================================
> > --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> > +++ linux-2.6.32-rc1/kernel/cgroup.c
> > @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> > void __css_put(struct cgroup_subsys_state *css)
> > {
> > struct cgroup *cgrp = css->cgroup;
> > + int val;
> > rcu_read_lock();
> > - if (atomic_dec_return(&css->refcnt) == 1) {
> > + val = atomic_dec_return(&css->refcnt);
> > + if (val == 1) {
> > if (notify_on_release(cgrp)) {
> > set_bit(CGRP_RELEASABLE, &cgrp->flags);
> > check_for_release(cgrp);
> > @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> > cgroup_wakeup_rmdir_waiter(cgrp);
> > }
> > rcu_read_unlock();
> > + WARN_ON(val < 1);
> > }
> >
> > /*
> >
> >
>
__css_put() doesn't check a buggy case as refcnt goes to minus.
This patch adds a check for it.
Changelog:
- using WARN_ON_ONCE() instead of WARN_ON()
Acked-by: Paul Menage <[email protected]>
Acked-by: Li Zefan <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
kernel/cgroup.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6.32-rc1/kernel/cgroup.c
===================================================================
--- linux-2.6.32-rc1.orig/kernel/cgroup.c
+++ linux-2.6.32-rc1/kernel/cgroup.c
@@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
void __css_put(struct cgroup_subsys_state *css)
{
struct cgroup *cgrp = css->cgroup;
+ int val;
rcu_read_lock();
- if (atomic_dec_return(&css->refcnt) == 1) {
+ val = atomic_dec_return(&css->refcnt);
+ if (val == 1) {
if (notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
@@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
cgroup_wakeup_rmdir_waiter(cgrp);
}
rcu_read_unlock();
+ WARN_ON_ONCE(val < 1);
}
/*
On 09/28/2009 07:20 PM, Randy Dunlap wrote:
> On Mon, 28 Sep 2009 19:10:00 -0500 Tyler Hicks wrote:
>
>> On 09/28/2009 03:34 PM, Randy Dunlap wrote:
>>> From: Randy Dunlap <[email protected]>
>>>
>>> ecryptfs uses crypto APIs so it should depend on CRYPTO.
>>> Otherwise many build errors occur. [63 lines not pasted]
>>>
>>> Signed-off-by: Randy Dunlap <[email protected]>
>>> ---
>>> fs/ecryptfs/Kconfig | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> --- mmotm-2009-0925-1435.orig/fs/ecryptfs/Kconfig
>>> +++ mmotm-2009-0925-1435/fs/ecryptfs/Kconfig
>>> @@ -1,6 +1,6 @@
>>> config ECRYPT_FS
>>> tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
>>> - depends on EXPERIMENTAL && KEYS && NET
>>> + depends on EXPERIMENTAL && KEYS && NET && CRYPTO
>>> select CRYPTO_ECB
>>> select CRYPTO_CBC
>>> help
>>
>> Hi Randy - Thanks for the patch! Unfortunately, I think it defeats what
>> Dave Hansen was wanting to do with commit
>> 382684984e93039a3bbd83b04d341b0ceb831519.
>>
>> When I pulled that patch in, I was under the assumption that the select
>> would also select all necessary dependencies. According to
>> Documentation/kbuild/kconfig-language.txt, that's not the case:
>>
>> select should be used with care. select will force
>> a symbol to a value without visiting the dependencies.
>> By abusing select you are able to select a symbol FOO even
>> if FOO depends on BAR that is not set.
>>
>> Maybe we should do it how other folks are tackling this problem and
>> select CRYPTO, along with CRYPTO_ECB and CRYPTO_CBC. While we're at it,
>> we should probably throw in CRYPTO_AES (aes-128 is the default cipher,
>> but the cipher is configurable at mount so it might be too obtrusive for
>> us to select it) and CRYPTO_MD5 (our default hash alg, not currently
>> configurable). Also, we don't depend on NET anymore because our netlink
>> interface is no longer around. It may not hurt to select KEYS, rather
>> than depend on it. Does all of this sound sane to you?
>
> It selects too much stuff. "select" should not be used to enable
> a full subsystem (that's my general rule, not in kconfig-language.txt).
> What kconfig-language.txt says that applies here is just after your
> quoted text:
>
> In general use select only for non-visible symbols
> (no prompts anywhere) and for symbols with no dependencies.
> That will limit the usefulness but on the other hand avoid
> the illegal configurations all over.
>
> CRYPTO does not fit that.
>
> One of the big problems with selecting kconfig symbols (like subsystem
> ones) is that it makes it difficult to disable that symbol, which some
> of us often want to do.
>
>
> ---
> ~Randy
eCryptfs wouldn't be the first to select CRYPTO:
$ grep -r "select CRYPTO$" --include=Kconfig . | wc -l
26
But after trying to deselect CRYPTO with one of my custom configs, I
realized that you are right. :) Depending on CRYPTO and then selecting
the proper CRYPTO_* symbols is the way to go.
Applied to
git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6.git#next
Thanks again!
Tyler
On Tue, 29 Sep 2009 12:08:55 -0500 Tyler Hicks wrote:
> On 09/28/2009 07:20 PM, Randy Dunlap wrote:
> > On Mon, 28 Sep 2009 19:10:00 -0500 Tyler Hicks wrote:
> >
> >> On 09/28/2009 03:34 PM, Randy Dunlap wrote:
> >>> From: Randy Dunlap <[email protected]>
> >>>
> >>> ecryptfs uses crypto APIs so it should depend on CRYPTO.
> >>> Otherwise many build errors occur. [63 lines not pasted]
> >>>
> >>> Signed-off-by: Randy Dunlap <[email protected]>
> >>> ---
> >>> fs/ecryptfs/Kconfig | 2 +-
> >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> --- mmotm-2009-0925-1435.orig/fs/ecryptfs/Kconfig
> >>> +++ mmotm-2009-0925-1435/fs/ecryptfs/Kconfig
> >>> @@ -1,6 +1,6 @@
> >>> config ECRYPT_FS
> >>> tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
> >>> - depends on EXPERIMENTAL && KEYS && NET
> >>> + depends on EXPERIMENTAL && KEYS && NET && CRYPTO
> >>> select CRYPTO_ECB
> >>> select CRYPTO_CBC
> >>> help
> >>
> >> Hi Randy - Thanks for the patch! Unfortunately, I think it defeats what
> >> Dave Hansen was wanting to do with commit
> >> 382684984e93039a3bbd83b04d341b0ceb831519.
> >>
> >> When I pulled that patch in, I was under the assumption that the select
> >> would also select all necessary dependencies. According to
> >> Documentation/kbuild/kconfig-language.txt, that's not the case:
> >>
> >> select should be used with care. select will force
> >> a symbol to a value without visiting the dependencies.
> >> By abusing select you are able to select a symbol FOO even
> >> if FOO depends on BAR that is not set.
> >>
> >> Maybe we should do it how other folks are tackling this problem and
> >> select CRYPTO, along with CRYPTO_ECB and CRYPTO_CBC. While we're at it,
> >> we should probably throw in CRYPTO_AES (aes-128 is the default cipher,
> >> but the cipher is configurable at mount so it might be too obtrusive for
> >> us to select it) and CRYPTO_MD5 (our default hash alg, not currently
> >> configurable). Also, we don't depend on NET anymore because our netlink
> >> interface is no longer around. It may not hurt to select KEYS, rather
> >> than depend on it. Does all of this sound sane to you?
> >
> > It selects too much stuff. "select" should not be used to enable
> > a full subsystem (that's my general rule, not in kconfig-language.txt).
> > What kconfig-language.txt says that applies here is just after your
> > quoted text:
> >
> > In general use select only for non-visible symbols
> > (no prompts anywhere) and for symbols with no dependencies.
> > That will limit the usefulness but on the other hand avoid
> > the illegal configurations all over.
> >
> > CRYPTO does not fit that.
> >
> > One of the big problems with selecting kconfig symbols (like subsystem
> > ones) is that it makes it difficult to disable that symbol, which some
> > of us often want to do.
> >
> >
> > ---
> > ~Randy
>
> eCryptfs wouldn't be the first to select CRYPTO:
>
> $ grep -r "select CRYPTO$" --include=Kconfig . | wc -l
> 26
Yes, I realize that. and INPUT is selected to much also. :(
> But after trying to deselect CRYPTO with one of my custom configs, I
> realized that you are right. :) Depending on CRYPTO and then selecting
> the proper CRYPTO_* symbols is the way to go.
>
> Applied to
> git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6.git#next
>
> Thanks again!
Thanks, glad that you tried the experiment.
---
~Randy