From: Bartlomiej Zolnierkiewicz
To: Michael Wang
Cc: "Rafael J. Wysocki", Viresh Kumar, Borislav Petkov, Jiri Kosina, Tomasz Figa, linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [v3.10 regression] deadlock on cpu hotplug
Date: Mon, 08 Jul 2013 17:26:55 +0200
Message-id: <1443144.WnBWEpaopK@amdc1032>

Hi,

Commit 2f7021a8 ("cpufreq: protect 'policy->cpus' from offlining
during __gov_queue_work()") causes the following deadlock for me when using
kernel v3.10 on ARM EXYNOS4412:

[  960.380000] INFO: task kworker/0:1:34 blocked for more than 120 seconds.
[  960.385000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  960.390000] kworker/0:1 D c0613d88 0 34 2 0x00000000
[  960.395000] Workqueue: events od_dbs_timer
[  960.400000] [] (__schedule+0x39c/0x7ec) from [] (schedule_preempt_disabled+0x24/0x34)
[  960.410000] [] (schedule_preempt_disabled+0x24/0x34) from [] (__mutex_lock_slowpath+0x15c/0x21c)
[  960.420000] [] (__mutex_lock_slowpath+0x15c/0x21c) from [] (mutex_lock+0x48/0x4c)
[  960.430000] [] (mutex_lock+0x48/0x4c) from [] (get_online_cpus+0x2c/0x48)
[  960.440000] [] (get_online_cpus+0x2c/0x48) from [] (gov_queue_work+0x1c/0xb8)
[  960.450000] [] (gov_queue_work+0x1c/0xb8) from [] (od_dbs_timer+0xa8/0x12c)
[  960.455000] [] (od_dbs_timer+0xa8/0x12c) from [] (process_one_work+0x138/0x43c)
[  960.465000] [] (process_one_work+0x138/0x43c) from [] (worker_thread+0x134/0x3b8)
[  960.475000] [] (worker_thread+0x134/0x3b8) from [] (kthread+0xa4/0xb0)
[  960.485000] [] (kthread+0xa4/0xb0) from [] (ret_from_fork+0x14/0x3c)
[  960.490000] INFO: task bash:2497 blocked for more than 120 seconds.
[  960.495000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  960.505000] bash D c0613d88 0 2497 2496 0x00000001
[  960.510000] [] (__schedule+0x39c/0x7ec) from [] (schedule_timeout+0x14c/0x20c)
[  960.520000] [] (schedule_timeout+0x14c/0x20c) from [] (wait_for_common+0xac/0x150)
[  960.530000] [] (wait_for_common+0xac/0x150) from [] (flush_work+0xa4/0x130)
[  960.540000] [] (flush_work+0xa4/0x130) from [] (__cancel_work_timer+0x74/0x100)
[  960.545000] [] (__cancel_work_timer+0x74/0x100) from [] (cpufreq_governor_dbs+0x4dc/0x630)
[  960.555000] [] (cpufreq_governor_dbs+0x4dc/0x630) from [] (__cpufreq_governor+0x60/0x120)
[  960.565000] [] (__cpufreq_governor+0x60/0x120) from [] (cpufreq_add_dev+0x2f4/0x4d8)
[  960.575000] [] (cpufreq_add_dev+0x2f4/0x4d8) from [] (cpufreq_cpu_callback+0x54/0x5c)
[  960.585000] [] (cpufreq_cpu_callback+0x54/0x5c) from [] (notifier_call_chain+0x44/0x84)
[  960.595000] [] (notifier_call_chain+0x44/0x84) from [] (__cpu_notify+0x2c/0x48)
[  960.605000] [] (__cpu_notify+0x2c/0x48) from [] (_cpu_up+0x10c/0x15c)
[  960.615000] [] (_cpu_up+0x10c/0x15c) from [] (cpu_up+0x64/0x80)
[  960.620000] [] (cpu_up+0x64/0x80) from [] (store_online+0x4c/0x74)
[  960.630000] [] (store_online+0x4c/0x74) from [] (dev_attr_store+0x18/0x24)
[  960.635000] [] (dev_attr_store+0x18/0x24) from [] (sysfs_write_file+0xfc/0x164)
[  960.645000] [] (sysfs_write_file+0xfc/0x164) from [] (vfs_write+0xbc/0x184)
[  960.655000] [] (vfs_write+0xbc/0x184) from [] (SyS_write+0x40/0x68)
[  960.665000] [] (SyS_write+0x40/0x68) from [] (ret_fast_syscall+0x0/0x30)

Tomek has also managed to get lockdep info for the problem:

[   36.570161]
[   36.570233] ======================================================
[   36.576376] [ INFO: possible circular locking dependency detected ]
[   36.582637] 3.10.0-rc5-00336-g6905065-dirty #1069 Not tainted
[   36.588355] -------------------------------------------------------
[   36.594608] bash/2783 is trying to acquire lock:
[   36.599205]  ((&(&j_cdbs->work)->work)){+.+.+.}, at: [] flush_work+0x0/0x27c
[   36.607104]
[   36.607104] but task is already holding lock:
[   36.612917]  (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2c/0x58
[   36.620731]
[   36.620731] which lock already depends on the new lock.
[   36.620731]
[   36.628890]
[   36.628890] the existing dependency chain (in reverse order) is:
[   36.636355]
-> #2 (cpu_hotplug.lock){+.+.+.}:
[   36.640867]        [] lock_acquire+0x9c/0x130
[   36.646075]        [] mutex_lock_nested+0x68/0x3ec
[   36.651716]        [] get_online_cpus+0x40/0x60
[   36.657097]        [] gov_queue_work+0x1c/0xac
[   36.662392]        [] od_dbs_timer+0xb8/0x16c
[   36.667600]        [] process_one_work+0x198/0x4d8
[   36.673242]        [] worker_thread+0x134/0x3e8
[   36.678624]        [] kthread+0xa8/0xb4
[   36.683311]        [] ret_from_fork+0x14/0x2c
[   36.688522]
-> #1 (&j_cdbs->timer_mutex){+.+.+.}:
[   36.693380]        [] lock_acquire+0x9c/0x130
[   36.698588]        [] mutex_lock_nested+0x68/0x3ec
[   36.704229]        [] od_dbs_timer+0x40/0x16c
[   36.709438]        [] process_one_work+0x198/0x4d8
[   36.715080]        [] worker_thread+0x134/0x3e8
[   36.720461]        [] kthread+0xa8/0xb4
[   36.725148]        [] ret_from_fork+0x14/0x2c
[   36.730360]
-> #0 ((&(&j_cdbs->work)->work)){+.+.+.}:
[   36.735565]        [] __lock_acquire+0x17c0/0x1e64
[   36.741206]        [] lock_acquire+0x9c/0x130
[   36.746414]        [] flush_work+0x38/0x27c
[   36.751449]        [] __cancel_work_timer+0x84/0x120
[   36.757265]        [] cpufreq_governor_dbs+0x4f4/0x674
[   36.763254]        [] __cpufreq_governor+0x60/0x120
[   36.768983]        [] __cpufreq_remove_dev.clone.4+0x7c/0x480
[   36.775579]        [] cpufreq_cpu_callback+0x48/0x5c
[   36.781396]        [] notifier_call_chain+0x44/0x84
[   36.787124]        [] __cpu_notify+0x2c/0x48
[   36.792245]        [] _cpu_down+0xb4/0x238
[   36.797193]        [] cpu_down+0x28/0x3c
[   36.801966]        [] store_online+0x30/0x74
[   36.807088]        [] dev_attr_store+0x18/0x24
[   36.812382]        [] sysfs_write_file+0x104/0x18c
[   36.818025]        [] vfs_write+0xc8/0x194
[   36.822973]        [] SyS_write+0x44/0x70
[   36.827833]        [] ret_fast_syscall+0x0/0x48
[   36.833218]
[   36.833218] other info that might help us debug this:
[   36.833218]
[   36.841203] Chain exists of:
  (&(&j_cdbs->work)->work) --> &j_cdbs->timer_mutex --> cpu_hotplug.lock

[   36.850663]  Possible unsafe locking scenario:
[   36.850663]
[   36.856565]        CPU0                    CPU1
[   36.861079]        ----                    ----
[   36.865591]   lock(cpu_hotplug.lock);
[   36.869237]                                lock(&j_cdbs->timer_mutex);
[   36.875746]                                lock(cpu_hotplug.lock);
[   36.881909]   lock((&(&j_cdbs->work)->work));
[   36.886250]
[   36.886250]  *** DEADLOCK ***
[   36.886250]
[   36.892163] 5 locks held by bash/2783:
[   36.895886]  #0:  (sb_writers#7){.+.+.+}, at: [] vfs_write+0x190/0x194
[   36.903264]  #1:  (&buffer->mutex){+.+.+.}, at: [] sysfs_write_file+0x28/0x18c
[   36.911336]  #2:  (s_active#50){.+.+..}, at: [] sysfs_write_file+0xe0/0x18c
[   36.919148]  #3:  (cpu_add_remove_lock){+.+.+.}, at: [] cpu_down+0xc/0x3c
[   36.926787]  #4:  (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2c/0x58
[   36.935032]
[   36.935032] stack backtrace:
[   36.939394] CPU: 1 PID: 2783 Comm: bash Not tainted 3.10.0-rc5-00336-g6905065-dirty #1069
[   36.947591] [] (unwind_backtrace+0x0/0x13c) from [] (show_stack+0x10/0x14)
[   36.956164] [] (show_stack+0x10/0x14) from [] (print_circular_bug+0x1c8/0x304)
[   36.965105] [] (print_circular_bug+0x1c8/0x304) from [] (__lock_acquire+0x17c0/0x1e64)
[   36.974735] [] (__lock_acquire+0x17c0/0x1e64) from [] (lock_acquire+0x9c/0x130)
[   36.983767] [] (lock_acquire+0x9c/0x130) from [] (flush_work+0x38/0x27c)
[   36.992187] [] (flush_work+0x38/0x27c) from [] (__cancel_work_timer+0x84/0x120)
[   37.001225] [] (__cancel_work_timer+0x84/0x120) from [] (cpufreq_governor_dbs+0x4f4/0x674)
[   37.011196] [] (cpufreq_governor_dbs+0x4f4/0x674) from [] (__cpufreq_governor+0x60/0x120)
[   37.021087] [] (__cpufreq_governor+0x60/0x120) from [] (__cpufreq_remove_dev.clone.4+0x7c/0x480)
[   37.031607] [] (__cpufreq_remove_dev.clone.4+0x7c/0x480) from [] (cpufreq_cpu_callback+0x48/0x5c)
[   37.042207] [] (cpufreq_cpu_callback+0x48/0x5c) from [] (notifier_call_chain+0x44/0x84)
[   37.051910] [] (notifier_call_chain+0x44/0x84) from [] (__cpu_notify+0x2c/0x48)
[   37.060931] [] (__cpu_notify+0x2c/0x48) from [] (_cpu_down+0xb4/0x238)
[   37.069177] [] (_cpu_down+0xb4/0x238) from [] (cpu_down+0x28/0x3c)
[   37.077074] [] (cpu_down+0x28/0x3c) from [] (store_online+0x30/0x74)
[   37.085147] [] (store_online+0x30/0x74) from [] (dev_attr_store+0x18/0x24)
[   37.093740] [] (dev_attr_store+0x18/0x24) from [] (sysfs_write_file+0x104/0x18c)
[   37.102852] [] (sysfs_write_file+0x104/0x18c) from [] (vfs_write+0xc8/0x194)
[   37.111616] [] (vfs_write+0xc8/0x194) from [] (SyS_write+0x44/0x70)
[   37.119613] [] (SyS_write+0x44/0x70) from [] (ret_fast_syscall+0x0/0x48)

Reproducing the issue is very easy, i.e.:

# echo 0 > /sys/devices/system/cpu/cpu3/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu1/online
# while true;do echo 1 > /sys/devices/system/cpu/cpu1/online;echo 0 > /sys/devices/system/cpu/cpu1/online;done

The commit in question (2f7021a8) was merged in v3.10-rc5 as a fix for
commit 031299b ("cpufreq: governors: Avoid unnecessary per cpu timer
interrupts") which was causing a kernel warning to show up.

Michael/Viresh: do you have some idea how to fix the issue?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
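
For anyone who wants to run the reproduction steps without an endless loop, a bounded variant might look like the sketch below. The names CPU_SYSFS, ITERATIONS, set_online and stress_hotplug are illustrative and not part of the original report. The sketch assumes a four-core system where cpu1 to cpu3 can be hot-unplugged, and writing to the real sysfs tree requires root.

```sh
#!/bin/sh
# Bounded sketch of the cpu1 online/offline loop from the report.
# CPU_SYSFS and ITERATIONS are hypothetical knobs added for illustration.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"
ITERATIONS="${ITERATIONS:-100}"

set_online() {
    # $1 = CPU number, $2 = 0 (offline) or 1 (online)
    echo "$2" > "$CPU_SYSFS/cpu$1/online"
}

stress_hotplug() {
    # Same order as the report: take cpu3, cpu2 and cpu1 offline first.
    set_online 3 0 || return 1
    set_online 2 0 || return 1
    set_online 1 0 || return 1

    # Then toggle cpu1 a fixed number of times instead of forever.
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        set_online 1 1 || return 1
        set_online 1 0 || return 1
        i=$((i + 1))
    done
    echo "completed $i hotplug cycles"
}
```

After sourcing the file as root, calling stress_hotplug runs the loop. On an affected kernel the function would be expected to hang in one of the writes rather than print the completion line, at which point the hung-task messages should show up in dmesg after the timeout.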