Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754793AbZFQP33 (ORCPT ); Wed, 17 Jun 2009 11:29:29 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1753886AbZFQP3T (ORCPT ); Wed, 17 Jun 2009 11:29:19 -0400
Received: from cantor2.suse.de ([195.135.220.15]:32947 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753406AbZFQP3S (ORCPT ); Wed, 17 Jun 2009 11:29:18 -0400
From: Thomas Renninger
Organization: SUSE Products GmbH
To: "Pallipadi, Venkatesh"
Subject: Re: [Bug #13475] suspend/hibernate lockdep warning
Date: Wed, 17 Jun 2009 17:29:12 +0200
User-Agent: KMail/1.10.3 (Linux/2.6.27.19-3.2-default; KDE/4.1.3; x86_64; ; )
Cc: Mathieu Desnoyers, Simon Holm Thøgersen, Dave Jones, Pekka Enberg,
	Dave Young, "Rafael J. Wysocki", Linux Kernel Mailing List,
	Kernel Testers List, "cpufreq@vger.kernel.org", Rusty Russell,
	"sven.wegener@stealer.net"
References: <20090611152329.GB28099@Krystal> <20090617003925.GA3900@linux-os.sc.intel.com>
In-Reply-To: <20090617003925.GA3900@linux-os.sc.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Message-Id: <200906171729.16272.trenn@suse.de>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by alpha.home.local id n5HFUM54019804
Content-Length: 13324
Lines: 15

On Wednesday 17 June 2009 02:39:25 Pallipadi, Venkatesh wrote:
> On Thu, Jun 11, 2009 at 08:23:29AM -0700, Mathieu Desnoyers wrote:
> > * Simon Holm Thøgersen (odie@cs.aau.dk) wrote:
> > > Mon, 08 06 2009 at 10:32 -0400, Dave Jones wrote:
> > > > On Mon, Jun 08, 2009 at 08:48:45AM -0400, Mathieu Desnoyers wrote:
> > > >
> > > > > > > >> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13475
> > > > > > > >> Subject : suspend/hibernate lockdep warning
> > > > > > > >> References : http://marc.info/?l=linux-kernel&m=124393723321241&w=4
> > > > > > >
> > > > > > > I suspect the following commit; after reverting this patch I tested 5 times
> > > > > > > without lockdep warnings.
> > > > > > >
> > > > > > > commit b14893a62c73af0eca414cfed505b8c09efc613c
> > > > > > > Author: Mathieu Desnoyers
> > > > > > > Date: Sun May 17 10:30:45 2009 -0400
> > > > > > >
> > > > > > > [CPUFREQ] fix timer teardown in ondemand governor
> > > > > >
> > > > > > The patch is probably not at fault here. I suspect it's some latent bug
> > > > > > that simply got exposed by the change to cancel_delayed_work_sync(). In
> > > > > > any case, Mathieu, can you take a look at this please?
> > > > >
> > > > > Yes, it's been looked at and discussed on the cpufreq ML. The short
> > > > > answer is that they plan to re-engineer cpufreq and remove the policy
> > > > > rwlock taken around almost every operation at the cpufreq level.
> > > > >
> > > > > The short-term solution, which is recognised as ugly, would be to do the
> > > > > following before doing the cancel_delayed_work_sync() :
> > > > >
> > > > > unlock policy rwlock write lock
> > > > >
> > > > > lock policy rwlock write lock
> > > > >
> > > > > It basically works because this rwlock is unneeded for teardown, hence
> > > > > the future re-work planned.
> > > > >
> > > > > I'm sorry I cannot prepare a patch currently... I've got quite a few pages
> > > > > of Ph.D. thesis due for the beginning of July.
> > > >
> > > > I'm kinda scared to touch this code at all for .30 due to the number of
> > > > unexpected gotchas we seem to run into every time we touch something
> > > > locking related.
> > > > So I'm inclined to just live with the lockdep warning
> > > > for .30, and see how the real fixes look for .31, and push them back
> > > > as -stable updates if they work out.
> > >
> > > Unfortunately I don't think it is just theoretical, I've actually hit
> > > the following (that hasn't got anything to do with suspend/hibernate)
> > >
> > > INFO: task cpufreqd:4676 blocked for more than 120 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > cpufreqd D eee2ac60 0 4676 1
> > > ee01bd68 00000086 eee2aad0 eee2ac60 00000533 eee2aad0 eee2ac60 0002b16f
> > > 00000000 eee2ac60 7fffffff 7fffffff eee2ac60 7fffffff 7fffffff 00000000
> > > ee01bd70 c03117ee ee01bdbc c0311c0c eee2aad0 eecf6900 eee2aad0 eecf6900
> > > Call Trace:
> > > [] schedule+0x12/0x24
> > > [] schedule_timeout+0x17/0x170
> > > [] ? __wake_up+0x2b/0x51
> > > [] wait_for_common+0xc4/0x135
> > > [] ? default_wake_function+0x0/0xd
> > > [] wait_for_completion+0x12/0x14
> > > [] __cancel_work_timer+0xfe/0x129
> > > [] ? wq_barrier_func+0x0/0xd
> > > [] cancel_delayed_work_sync+0xb/0xd
> > > [] cpufreq_governor_dbs+0x22e/0x291 [cpufreq_ondemand]
> > > [] __cpufreq_governor+0x65/0x9d
> > > [] __cpufreq_set_policy+0xd1/0x11f
> > > [] store_scaling_governor+0x18a/0x1b2
> > > [] ? handle_update+0x0/0xd
> > > [] ? store_scaling_governor+0x0/0x1b2
> > > [] store+0x48/0x61
> > > [] sysfs_write_file+0xb4/0xdf
> > > [] ? sysfs_write_file+0x0/0xdf
> > > [] vfs_write+0x8a/0x104
> > > [] sys_write+0x3b/0x60
> > > [] sysenter_do_call+0x12/0x2c
> > > INFO: task kondemand/0:4956 blocked for more than 120 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > kondemand/0 D 00000533 0 4956 2
> > > ee1d9efc 00000046 c011815f 00000533 071148de ee1e0080 ee1e0210 00000000
> > > c03ff478 9189e633 00000082 c03ff478 ee1e0210 c04159f4 c04159f0 00000000
> > > ee1d9f04 c03117ee ee1d9f28 c0313104 ee1d9f30 c04159f4 ee1e0080 c01183be
> > > Call Trace:
> > > [] ? update_curr+0x6c/0x14b
> > > [] schedule+0x12/0x24
> > > [] rwsem_down_failed_common+0x150/0x16e
> > > [] ? dequeue_task_fair+0x51/0x56
> > > [] rwsem_down_write_failed+0x1b/0x23
> > > [] call_rwsem_down_write_failed+0x6/0x8
> > > [] ? down_write+0x14/0x16
> > > [] lock_policy_rwsem_write+0x1d/0x33
> > > [] do_dbs_timer+0x45/0x266 [cpufreq_ondemand]
> > > [] worker_thread+0x165/0x212
> > > [] ? do_dbs_timer+0x0/0x266 [cpufreq_ondemand]
> > > [] ? autoremove_wake_function+0x0/0x33
> > > [] ? worker_thread+0x0/0x212
> > > [] kthread+0x42/0x67
> > > [] ? kthread+0x0/0x67
> > > [] kernel_thread_helper+0x7/0x10
> > >
> > > I've only seen it once in 5 boots and CONFIG_PROVE_LOCKING does not give any
> > > warnings about this, though it does yell when switching governor as reported
> > > by others in bug #13493.
> > >
> > > Let's hope Mathieu nails it, though I know he's busy with his thesis.
> > >
> >
> > Thanks for the lockdep reports,
> >
> > I'm currently looking into it, and it's not pretty. Basically we have :
> >
> > A
> >   B
> > (means B nested in A)
> >
> > work
> >   read rwlock policy
> >
> > dbs_mutex
> >   work
> >     read rwlock policy
> >
> > write rwlock policy
> >   dbs_mutex
> >
> > So the added dbs_mutex <- work <- rwlock policy dependency (for proper
> > teardown) is firing the reverse dependency between policy rwlock and
> > dbs_mutex.
> >
> > The real way to fix this is to not take the rwlock policy around
> > non-policy-related actions, like governor START/STOP doing worker
> > creation/teardown.
> >
> > One simple short-term solution would be to take a mutex outside of the
> > policy rwlock write lock in cpufreq.c. This mutex would be the
> > equivalent of dbs_mutex "lifted" outside of the rwlock write lock.
> > For teardown, we only need to hold this mutex, not the rwlock write lock.
> > Then we can remove the dbs_mutex from the governors.
> >
> > But looking at cpufreq.c's cpufreq_add_dev() is very much like kicking a
> > wasp nest: a lot of error paths are not handled properly, and I fear
> > someone will have to go through the code, fix the currently incorrect
> > code paths, and then add the lifted mutex.
> >
> > I currently have no time for implementation due to my thesis, but I'll
> > be happy to review a patch.
> >
>
> How about below patch on top of Mathieu's patch here
> http://marc.info/?l=linux-kernel&m=124448150529838&w=2
>
> [PATCH] cpufreq: Eliminate lockdep issue with dbs_mutex and policy_rwsem
>
> This removes the unneeded dependency of
> write rwlock policy
>   dbs_mutex
>
> dbs_mutex does not have anything to do with timer_init and timer_exit. It
> is just to protect dbs tunables in sysfs cpufreq/ondemand

Why is sysfs tunables protection needed at all?

The ondemand locking very much looks like it was taken over from the
userspace governor. There you need the lock because a write to set_speed
directly calls ->target.

What is urgently missing is a description of what the locks are really
used for, not only in which cases they deadlock.

From your comment above:
> dbs_mutex does not have anything to do with timer_init and timer_exit.
But this is what it seems to do?
If it's not needed to protect calling timer_init while in timer_exit
(or the other way around) and sysfs_create_group while in
sysfs_remove_group, I think the mutex can be deleted.

What do you think about this patch (compile tested only and not
for .30)?
Is anyone aware of any test scenarios I could run to try without
the mutex and run into trouble?
Am I totally missing something here or does this make sense?

Thanks,

   Thomas

-----

CPUFREQ ondemand: Remove unneeded dbs_mutex

There is no need to protect general (not per core) ondemand sysfs variables
against per core governor (de-)activation (GOV_START/GOV_STOP).
It must just be assured that these are only initialized once, before userspace
can modify them (otherwise userspace modifications will be overridden by
re-initializing the general variables).
This should already be the case.

Signed-off-by: Thomas Renninger

---
 drivers/cpufreq/cpufreq_ondemand.c |   64 +++++++------------------------------
 1 file changed, 13 insertions(+), 51 deletions(-)

Index: linux-2.6.29-master/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-2.6.29-master.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-2.6.29-master/drivers/cpufreq/cpufreq_ondemand.c
@@ -17,7 +17,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -91,16 +90,6 @@ static DEFINE_PER_CPU(struct cpu_dbs_inf
 
 static unsigned int dbs_enable;	/* number of CPUs using this policy */
 
-/*
- * DEADLOCK ALERT! There is a ordering requirement between cpu_hotplug
- * lock and dbs_mutex. cpu_hotplug lock should always be held before
- * dbs_mutex. If any function that can potentially take cpu_hotplug lock
- * (like __cpufreq_driver_target()) is being called with dbs_mutex taken, then
- * cpu_hotplug lock should be taken before that. Note that cpu_hotplug lock
- * is recursive for the same process. -Venki
- */
-static DEFINE_MUTEX(dbs_mutex);
-
 static struct workqueue_struct *kondemand_wq;
 
 static struct dbs_tuners {
@@ -266,14 +255,7 @@ static ssize_t store_sampling_rate(struc
 	int ret;
 	ret = sscanf(buf, "%u", &input);
 
-	mutex_lock(&dbs_mutex);
-	if (ret != 1) {
-		mutex_unlock(&dbs_mutex);
-		return -EINVAL;
-	}
 	dbs_tuners_ins.sampling_rate = max(input, minimum_sampling_rate());
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -284,16 +266,12 @@ static ssize_t store_up_threshold(struct
 	int ret;
 	ret = sscanf(buf, "%u", &input);
 
-	mutex_lock(&dbs_mutex);
 	if (ret != 1 || input > MAX_FREQUENCY_UP_THRESHOLD ||
 			input < MIN_FREQUENCY_UP_THRESHOLD) {
-		mutex_unlock(&dbs_mutex);
 		return -EINVAL;
 	}
 
 	dbs_tuners_ins.up_threshold = input;
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -312,9 +290,7 @@ static ssize_t store_ignore_nice_load(st
 	if (input > 1)
 		input = 1;
 
-	mutex_lock(&dbs_mutex);
 	if (input == dbs_tuners_ins.ignore_nice) { /* nothing to do */
-		mutex_unlock(&dbs_mutex);
 		return count;
 	}
 	dbs_tuners_ins.ignore_nice = input;
@@ -329,8 +305,6 @@ static ssize_t store_ignore_nice_load(st
 			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
 
 	}
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -347,11 +321,8 @@ static ssize_t store_powersave_bias(stru
 	if (input > 1000)
 		input = 1000;
 
-	mutex_lock(&dbs_mutex);
 	dbs_tuners_ins.powersave_bias = input;
 	ondemand_powersave_bias_init();
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -580,16 +551,6 @@ static int cpufreq_governor_dbs(struct c
 		if (this_dbs_info->enable) /* Already enabled */
 			break;
 
-		mutex_lock(&dbs_mutex);
-		dbs_enable++;
-
-		rc = sysfs_create_group(&policy->kobj, &dbs_attr_group);
-		if (rc) {
-			dbs_enable--;
-			mutex_unlock(&dbs_mutex);
-			return rc;
-		}
-
 		for_each_cpu(j, policy->cpus) {
 			struct cpu_dbs_info_s *j_dbs_info;
 			j_dbs_info = &per_cpu(cpu_dbs_info, j);
@@ -604,10 +565,10 @@ static int cpufreq_governor_dbs(struct c
 		}
 		this_dbs_info->cpu = cpu;
 		/*
-		 * Start the timerschedule work, when this governor
-		 * is used for first time
+		 * Initialize general ondemand tunables only once, not for
+		 * each core
 		 */
-		if (dbs_enable == 1) {
+		if (!dbs_enable) {
 			unsigned int latency;
 			/* policy latency is in nS. Convert it to uS first */
 			latency = policy->cpuinfo.transition_latency / 1000;
@@ -619,30 +580,31 @@ static int cpufreq_governor_dbs(struct c
 					MIN_STAT_SAMPLING_RATE);
 
 			dbs_tuners_ins.sampling_rate = def_sampling_rate;
+		}
 
+		rc = sysfs_create_group(&policy->kobj, &dbs_attr_group);
+		if (rc) {
+			this_dbs_info->enable = 0;
+			return rc;
 		}
 		dbs_timer_init(this_dbs_info);
-
-		mutex_unlock(&dbs_mutex);
+		dbs_enable++;
 		break;
 
 	case CPUFREQ_GOV_STOP:
-		mutex_lock(&dbs_mutex);
-		dbs_timer_exit(this_dbs_info);
-		sysfs_remove_group(&policy->kobj, &dbs_attr_group);
+		if (this_dbs_info->enable) {
+			dbs_timer_exit(this_dbs_info);
+			sysfs_remove_group(&policy->kobj, &dbs_attr_group);
+		}
 		dbs_enable--;
-		mutex_unlock(&dbs_mutex);
-
 		break;
 
 	case CPUFREQ_GOV_LIMITS:
-		mutex_lock(&dbs_mutex);
 		if (policy->max < this_dbs_info->cur_policy->cur)
 			__cpufreq_driver_target(this_dbs_info->cur_policy,
 				policy->max, CPUFREQ_RELATION_H);
 		else if (policy->min > this_dbs_info->cur_policy->cur)
 			__cpufreq_driver_target(this_dbs_info->cur_policy,
 				policy->min, CPUFREQ_RELATION_L);
-		mutex_unlock(&dbs_mutex);
 		break;
 	}
 	return 0;
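
PS: For anyone trying to follow the "B nested in A" lists quoted earlier in
this thread, the chain can be checked mechanically. What follows is only a
minimal userspace sketch in C. It is not kernel code and not how lockdep
itself is implemented; the enum names and the helpers record_nesting(),
record_listed_dependencies() and has_inversion() are hypothetical, made up
for illustration. It records nesting pairs and reports whether they form a
cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for this sketch; names follow the thread. */
enum lock_class { POLICY_RWSEM, DBS_MUTEX, WORK, NR_CLASSES };

/* nested[a][b] is true when b has been taken (or waited for) while holding a. */
struct lock_graph {
	bool nested[NR_CLASSES][NR_CLASSES];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record "inner nested in outer", i.e. the "A / B" notation from the thread. */
static void record_nesting(struct lock_graph *g, enum lock_class outer,
			   enum lock_class inner)
{
	g->nested[outer][inner] = true;
}

/* Depth-first search: is 'target' reachable from 'from' along recorded edges? */
static bool reaches(const struct lock_graph *g, enum lock_class from,
		    enum lock_class target, bool *visited)
{
	if (from == target)
		return true;
	if (visited[from])
		return false;
	visited[from] = true;
	for (int next = 0; next < NR_CLASSES; next++) {
		if (g->nested[from][next] &&
		    reaches(g, (enum lock_class)next, target, visited))
			return true;
	}
	return false;
}

/* An inversion exists if some edge a -> b is closed by a path b -> ... -> a. */
static bool has_inversion(const struct lock_graph *g)
{
	for (int a = 0; a < NR_CLASSES; a++) {
		for (int b = 0; b < NR_CLASSES; b++) {
			bool visited[NR_CLASSES] = { false };

			if (g->nested[a][b] &&
			    reaches(g, (enum lock_class)b,
				    (enum lock_class)a, visited))
				return true;
		}
	}
	return false;
}

/* The three nestings listed in the quoted analysis, nothing more. */
static void record_listed_dependencies(struct lock_graph *g)
{
	record_nesting(g, WORK, POLICY_RWSEM);      /* work / rwlock policy */
	record_nesting(g, DBS_MUTEX, WORK);         /* dbs_mutex / work */
	record_nesting(g, POLICY_RWSEM, DBS_MUTEX); /* write rwlock policy / dbs_mutex */
}
```

With the three listed nestings recorded, has_inversion() returns true. If the
"write rwlock policy / dbs_mutex" pair is left out, it returns false for this
model. The model only knows the edges it is given, so it says nothing about
other dependencies in the real code, such as waiting for the work item while
the policy rwsem is held.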