Date: Wed, 16 Mar 2016 18:25:23 +0800
From: Ming Lei <ming.lei@canonical.com>
To: Chandra Sekhar Lingutla <clingutla@codeaurora.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        linux-arm-msm@vger.kernel.org, Russell King <linux@arm.linux.org.uk>
Subject: Re: Possible race at user mode helper in request_firmware
Message-ID: <20160316182523.14976eef@tom-T450>
In-Reply-To: <56E92C9C.9020205@codeaurora.org>
References: <56E2768C.1050000@codeaurora.com>
	<CACVXFVP4m-qLFFJZsOLSRfQ=JzLnxgqeoT7RcNi=Qv==76OdFQ@mail.gmail.com>
	<CACVXFVPyB3fkpgU-bO34V4SC-6+iiQBQDk4VOJrK7u_S_Eqa6g@mail.gmail.com>
	<56E904EC.8090802@codeaurora.org>
	<CACVXFVNT=7qLUq1GdrYuMmtmkKxF+W_MZUB8kjzaMAotp=Wu=Q@mail.gmail.com>
	<56E91DF7.7050005@codeaurora.org>
	<CACVXFVM+vCpn7Od46bpc5muuu3Nw+ePsoDRSVO3vSn0+WzHFUw@mail.gmail.com>
	<56E92198.6040501@codeaurora.org>
	<56E92C9C.9020205@codeaurora.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 26039
Lines: 658

On Wed, 16 Mar 2016 15:21:24 +0530
Chandra Sekhar Lingutla <clingutla@codeaurora.org> wrote:

> Hi Ming,
> 
> On 03/16/2016 02:34 PM, Chandra Sekhar Lingutla wrote:
> > On 03/16/2016 02:28 PM, Ming Lei wrote:
> >> On Wed, Mar 16, 2016 at 4:48 PM, Chandra Sekhar Lingutla
> >> <clingutla@codeaurora.org> wrote:
> >>> On 03/16/2016 02:12 PM, Ming Lei wrote:
> >>>>
> >>>> On Wed, Mar 16, 2016 at 3:02 PM, Chandra Sekhar Lingutla
> >>>> <clingutla@codeaurora.org> wrote:
> >>>>>
> >>>>> Hi Ming,
> >>>>>
> >>>>>
> >>>>> On 03/16/2016 07:36 AM, Ming Lei wrote:
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 15, 2016 at 11:09 PM, Ming Lei <ming.lei@canonical.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On Mon, Mar 14, 2016 at 8:00 PM,  <clingutla@codeaurora.org> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I see possible race in _request_firmware_load function, on which I
> >>>>>>>> wanted to
> >>>>>>>> take your opinion.
> >>>>>>>>
> >>>>>>>> When system is going to low power mode, device_cache_fw_images() is
> >>>>>>>> called
> >>>>>>>> from pm notifier which schedules asynchronous workers to cache devices
> >>>>>>>> firmware.
> >>>>>>>> If more than 2 async requests falls under user helper mode, then there
> >>>>>>>> is a
> >>>>>>>> use after free of "firmware" directory kobject.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If two requests are for getting same firmware image, only one can be
> >>>>>>> sent
> >>>>>>> out and be completed with user helper, please see
> >>>>>>> fw_lookup_and_allocate_buf(),
> >>>>>>> so the race shouldn't have been triggered in this case.
> >>>>>>>
> >>>>>>> So I guess your test must be involved with at least two different
> >>>>>>> requests.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The race will hit, when one asynchronous worker calls device_del()
> >>>>>>>> while
> >>>>>>>> other asynchronous worker calls get_device_parent() in
> >>>>>>>> _request_firmware_load().
> >>>>>>>> After loading firmware, first worker calls device_del(), and
> >>>>>>>> "firmware"
> >>>>>>>> directory ref count is 1, so cleanup_glue_dir() calls kobject_put()
> >>>>>>>> with
> >>>>>>>> gdp_mutex lock held.
> >>>>>>>> Meanwhile, second async worker calls device_add(), and
> >>>>>>>> get_device_parent()
> >>>>>>>> is still able to find the firmware directory kobject in ksets after
> >>>>>>>> getting
> >>>>>>>> gdp_mutex.
> >>>>>>>> This kobject gets added as parent to the second device kobject.
> >>>>>>>> By the time of accessing the parent kobject, its ref count was 0 and
> >>>>>>>> prints
> >>>>>>>> below warning stack followed by crash.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> The trick here is just that we don't use 'struct device' in
> >>>>>>> device_cache_fw_images(),
> >>>>>>> then the firmware device's parent is null during calling device_add()
> >>>>>>> from
> >>>>>>> _request_firmware_load(), but this should be allowed by driver core.
> >>>>>>>
> >>>>>>>    From firmware class view, both device_add()/device_del() is run with
> >>>>>>> pair,
> >>>>>>> and driver core still complains during get_device_parent(), so
> >>>>>>> something
> >>>>>>> inside driver core should be be wrong, IMO.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thinking about the issue further, I believe it is a false positive which
> >>>>>> is
> >>>>>> caused by CONFIG_DEBUG_KOBJECT_RELEASE, follows the story:
> >>>>>>
> >>>>>> 1) device_add() path
> >>>>>>
> >>>>>> get_device_parent()
> >>>>>>                   ......
> >>>>>>                   mutex_lock(&gdp_mutex);
> >>>>>>
> >>>>>>                    /* find our class-directory at the parent and
> >>>>>> reference
> >>>>>> it */
> >>>>>>                    spin_lock(&dev->class->p->glue_dirs.list_lock);
> >>>>>>                    list_for_each_entry(k, &dev->class->p->glue_dirs.list,
> >>>>>> entry)
> >>>>>>                            if (k->parent == parent_kobj) {
> >>>>>>                                    kobj = kobject_get(k);
> >>>>>>                                    break;
> >>>>>>                            }
> >>>>>>                    spin_unlock(&dev->class->p->glue_dirs.list_lock);
> >>>>>>                    if (kobj) {
> >>>>>>                            mutex_unlock(&gdp_mutex);
> >>>>>>                            return kobj;
> >>>>>>                    }
> >>>>>>                    k = class_dir_create_and_add(dev->class, parent_kobj);
> >>>>>>                    mutex_unlock(&gdp_mutex);
> >>>>>>                    ......
> >>>>>>
> >>>>>> 2) device_del() path
> >>>>>>
> >>>>>> cleanup_device_parent()
> >>>>>>           ->cleanup_glue_dir()
> >>>>>>                ->mutex_lock(&gdp_mutex);
> >>>>>>                ->kobject_put(glue_dir);
> >>>>>>                       ->kobject_cleanup()
> >>>>>>                             ->kobject_del()
> >>>>>>                                    ->kobj_kset_leave()
> >>>>>>                ->mutex_unlock(&gdp_mutex);
> >>>>>>
> >>>>>> So gdp_mutex is held wrt. both looking up/adding the parent kobject
> >>>>>> and removing the parent kobject. Actually with this lock, both glur_dir
> >>>>>> kobject's referece counting and joining/leaving kset are run atomically.
> >>>>>>
> >>>>>> The race you found is triggered by CONFIG_DEBUG_KOBJECT_RELEASE,
> >>>>>> which causes kobject_cleanup() to be run in workqueue context and the
> >>>>>> lock of gdp_mutex can't be hold in that situation anymore. That is the
> >>>>>> root
> >>>>>> cause for your race.
> >>>>>>
> >>>>>> That also said DEBUG_KOBJECT_RELEASE might cause trouble in
> >>>>>> some situations.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ming
> >>>>>>
> >>>>> I agree that, CONFIG_DEBUG_KOBJECT_RELEASE can cause the race, as it
> >>>>> delays
> >>>>> call to kobject_cleanup() which deletes kobj entry in glue_dir list.
> >>>>>
> >>>>> I see the race even after disabling CONFIG_DEBUG_KOBJECT_RELEASE.
> >>>>>
> >>>>> The race is between kobject_put() and kobject_get() for the "firmware"
> >>>>> directory.
> >>>>>
> >>>>> In  device_del() path, I see the race after the call of
> >>>>> cleanup_device_parent().
> >>>>> cleanup_device_parent() calls kobject_put(glue_dir) with gdp_mutex lock
> >>>>> holding,
> >>>>> So ideally, to call kobject_put/kobject_get(glue_dir) we need to acquire
> >>>>> gdp_mutex.
> >>>>>
> >>>>> but kobject_del(kobj) will not hold the gdp_mutex lock while calling
> >>>>> kobject_put(kobj->parent).
> >>>>>
> >>>>> Here is the race:
> >>>>>
> >>>>> Consider 2 async works calls request_firmware() for 2 different binaries
> >>>>> (fw1.bin, fw2.bin)
> >>>>> Async work1 got scheduled first, and it falls in user helper mode so it
> >>>>> creates "firmware" dir
> >>>>> and then creates "fw1.bin" directory to load data from user.
> >>>>> After load completion or timeout, it starts cleanup dir's, meantime async
> >>>>> work2 got scheduled and
> >>>>> it also falls in user helper mode, and story starts here:
> >>>>>
> >>>>> PATH1   req fw for "fw1.bin"
> >>>>> PATH2
> >>>>>    req fw for "fw2.bin"
> >>>>> _request_firmware_load()
> >>>>> device_del(dev1)
> >>>>>
> >>>>>           ->cleanup_device_parent()
> >>>>>
> >>>>>                   mutex_lock(&gdp_mutex);
> >>>>>
> >>>>> _request_firmware_load("fw2.bin")
> >>>>>                   kobject_put(kobj->parent); --> firmware ref ->1
> >>>>>                    ->device_add(dev2)
> >>>>>                   mutex_unlock(&gdp_mutex);
> >>>>> ->get_device_parent()
> >>>>>           ->kobject_del()
> >>>>> mutex_lock(&gdp_mutex)
> >>>>>                   kobj_kset_leave(kobj); -->del "fw1.bin" kobj in kset
> >>>>> list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)
> >>>>>                   kobject_put(kobj->parent); /* No lock :)*/
> >>>>> if (k->parent == parent_kobj) {
> >>>>>
> >>>>>                           ->kref_put()    --> firmware ref ->0
> >>>>> kobj = kobject_get(k); -> k is "firmware"
> >>>>>    dir
> >>>>>                                   ->kboject_release()
> >>>>> break;
> >>>>>
> >>>>>                                           ->kobject_del()
> >>>>> }
> >>>>>
> >>>>>                                   /*here deletes firmware entry in kset*/
> >>>>> mutex_unlock(&gdp_mutex)
> >>>>>
> >>>>> ->dev2->kobj.parent = kobj    -> kobj is released already
> >>>>>
> >>>>>
> >>>>>
> >>>>> PATH 1 is started removing the kobject of "fw1.bin" and its parent
> >>>>> "firmware"
> >>>>> .
> >>>>> cleanup_device_parent(dev1) makes firmware ref count to 1, by calling
> >>>>> kobject_put(firmware) with gdp_mutex held.
> >>>>>
> >>>>> and kobject_del(dev1->kobj) will again calls kobject_put(kobj->parent)
> >>>>> without holding gdp_mutex.
> >>>>>
> >>>>> In PATH2, glue_dir is able to find "firmware" kobj, as firmware dir kobj
> >>>>> is
> >>>>> deleted from the glue_dir list at kboject_release().
> >>>>> and it calls kobject_get(firmware), by the time PATH1 calls
> >>>>> kobject_release(firmware).
> >>>>>
> >>>>> So PATH2 got reference of freed firmware kobj and added as parent to
> >>>>> "fw2.bin" kobj.
> >>>>>
> >>>>> I think, here is possible fix for this race:
> >>>>>
> >>>>> ------------------------8<-------------------------------------
> >>>>> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> >>>>>
> >>>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
> >>>>> index 5d8b713..47bc082 100644
> >>>>> --- a/drivers/base/core.c
> >>>>> +++ b/drivers/base/core.c
> >>>>> @@ -1224,7 +1224,13 @@ void device_del(struct device *dev)
> >>>>>                                                BUS_NOTIFY_REMOVED_DEVICE,
> >>>>> dev);
> >>>>>           kobject_uevent(&dev->kobj, KOBJ_REMOVE);
> >>>>>           cleanup_device_parent(dev);
> >>>>> -       kobject_del(&dev->kobj);
> >>>>> +
> >>>>> +       if (dev->kobj.parent->kset == &dev->class->p->glue_dirs) {
> >>>>> +               mutex_lock(&gdp_mutex);
> >>>>> +               kobject_del(&dev->kobj);
> >>>>> +               mutex_unlock(&gdp_mutex);
> >>>>> +       } else
> >>>>> +               kobject_del(&dev->kobj);
> >>>>>           put_device(parent);
> >>>>>    }
> >>>>>    EXPORT_SYMBOL_GPL(device_del);
> >>>>>
> >>>>> --------------------------->8-----------------------------------
> >>>>
> >>>>
> >>>> OK, one simpler way is to move cleanup_device_parent(dev)
> >>>> after kobject_del(&dev->kobj), which looks more clean too.
> >>>>
> >>> No, We can't move that, bcz kobject_del(&dev->kobj) will clear the parent.
> >>
> >> Yeah, but I think it is better to do that and just call
> >> cleanup_glue_dir() directly
> >> because cleanup_device_parent() should have released the glue_dir.
> >>
> Sorry, i am not able understand correctly, If we reorder kobje_del and cleanup_device_parent(),
> kobject_del() clears kobj->parent, and cleanup_device_parent() gets glue_dir with dev->kobj->parent(),
> cleanup_glue_dir() returns without calling kobject_put() as glue_dir is NULL, so firmware dir becomes dangling kobj.
> Am I missing something here?

I mean something like below, and what do you think about it?

---

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0a8bdad..3f2e1f8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -836,11 +836,31 @@ static struct kobject *get_device_parent(struct device *dev,
 	return NULL;
 }
 
+static inline bool live_in_glue_dir(struct kobject *kobj,
+				    struct device *dev)
+{
+	if (!kobj || !dev->class ||
+	    kobj->kset != &dev->class->p->glue_dirs)
+		return true;
+	return false;
+}
+
+static inline struct kobject *get_glue_dir(struct device *dev)
+{
+	if (live_in_glue_dir(&dev->kobj, dev))
+		return dev->kobj.parent;
+	return NULL;
+}
+
+/*
+ * make sure cleaning up dir as the last step, we need to make
+ * sure .release handler of kobject is run with holding the
+ * global lock
+ */
 static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
 {
 	/* see if we live in a "glue" directory */
-	if (!glue_dir || !dev->class ||
-	    glue_dir->kset != &dev->class->p->glue_dirs)
+	if (!live_in_glue_dir(glue_dir, dev))
 		return;
 
 	mutex_lock(&gdp_mutex);
@@ -848,11 +868,6 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
 	mutex_unlock(&gdp_mutex);
 }
 
-static void cleanup_device_parent(struct device *dev)
-{
-	cleanup_glue_dir(dev, dev->kobj.parent);
-}
-
 static int device_add_class_symlinks(struct device *dev)
 {
 	struct device_node *of_node = dev_of_node(dev);
@@ -1028,6 +1043,7 @@ int device_add(struct device *dev)
 	struct kobject *kobj;
 	struct class_interface *class_intf;
 	int error = -EINVAL;
+	struct kobject *glue_dir = NULL;
 
 	dev = get_device(dev);
 	if (!dev)
@@ -1072,8 +1088,10 @@ int device_add(struct device *dev)
 	/* first, register with generic layer. */
 	/* we require the name to be set before, and pass NULL */
 	error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
-	if (error)
+	if (error) {
+		glue_dir = get_glue_dir(dev);
 		goto Error;
+	}
 
 	/* notify platform of device entry */
 	if (platform_notify)
@@ -1154,9 +1172,10 @@ done:
 	device_remove_file(dev, &dev_attr_uevent);
  attrError:
 	kobject_uevent(&dev->kobj, KOBJ_REMOVE);
+	glue_dir = get_glue_dir(dev);
 	kobject_del(&dev->kobj);
  Error:
-	cleanup_device_parent(dev);
+	cleanup_glue_dir(dev, glue_dir);
 	put_device(parent);
 name_error:
 	kfree(dev->p);
@@ -1232,6 +1251,7 @@ EXPORT_SYMBOL_GPL(put_device);
 void device_del(struct device *dev)
 {
 	struct device *parent = dev->parent;
+	struct kobject *glue_dir = NULL;
 	struct class_interface *class_intf;
 
 	/* Notify clients of device removal.  This call must come
@@ -1276,8 +1296,9 @@ void device_del(struct device *dev)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_REMOVED_DEVICE, dev);
 	kobject_uevent(&dev->kobj, KOBJ_REMOVE);
-	cleanup_device_parent(dev);
+	glue_dir = get_glue_dir(dev);
 	kobject_del(&dev->kobj);
+	cleanup_glue_dir(dev, glue_dir);
 	put_device(parent);
 }
 EXPORT_SYMBOL_GPL(device_del);


> 
> >> BTW, failure path of device_add() has been broken already.
> >>
> > Yah, forgot to mention that. we need to reorder like above fix, can we?.
> >
> >>>
> >>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> [21147.455038] ------------[ cut here ]------------
> >>>>>>>> [21147.458910] WARNING: at <...>/kernel/include/linux/kref.h:47
> >>>>>>>> kobject_get+0x50/0x68()
> >>>>>>>> [21147.481199] Modules linked in: core_ctl(PO) qdrbg_module(O)
> >>>>>>>> qcrypto_module(O)
> >>>>>>>> [21147.481256] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O
> >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1
> >>>>>>>> [21147.481284] Workqueue: events_unbound async_run_entry_fn
> >>>>>>>> [21147.481296] Call trace:
> >>>>>>>> [21147.481315] [<ffffffc000206b60>] dump_backtrace+0x0/0x270
> >>>>>>>> [21147.481331] [<ffffffc000206de0>] show_stack+0x10/0x1c
> >>>>>>>> [21147.481353] [<ffffffc000d569ec>] dump_stack+0x1c/0x28
> >>>>>>>> [21147.481370] [<ffffffc00021cc18>] warn_slowpath_common+0x74/0x9c
> >>>>>>>> [21147.481383] [<ffffffc00021ce4c>] warn_slowpath_null+0x14/0x20
> >>>>>>>> [21147.481397] [<ffffffc000456b10>] kobject_get+0x4c/0x68
> >>>>>>>> [21147.481416] [<ffffffc000609f64>] get_device_parent+0xa0/0x194
> >>>>>>>> [21147.481430] [<ffffffc00060a3f4>] device_add+0x100/0x600
> >>>>>>>> [21147.481446] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80
> >>>>>>>> [21147.481459] [<ffffffc00061c58c>] request_firmware+0x30/0x3c
> >>>>>>>> [21147.481473] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0
> >>>>>>>> [21147.481490] [<ffffffc00061c65c>]
> >>>>>>>> __async_dev_cache_fw_image+0x14/0x54
> >>>>>>>> [21147.481505] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c
> >>>>>>>> [21147.481521] [<ffffffc000237e88>] process_one_work+0x264/0x3dc
> >>>>>>>> [21147.481535] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340
> >>>>>>>> [21147.481553] [<ffffffc00023e500>] kthread+0xac/0xb8
> >>>>>>>> [21147.481563] ---[ end trace dabc98ea48b8ba59 ]---
> >>>>>>>>
> >>>>>>>> [21147.486436] Unable to handle kernel paging request at virtual
> >>>>>>>> address
> >>>>>>>> 0x6b6b6b6b6b6b6bcb
> >>>>>>>> [21147.493816] pgd = ffffffc0a982d000
> >>>>>>>> [21147.498337] [6b6b6b6b6b6b6bcb] *pgd=0000000000000000
> >>>>>>>> [21147.502287] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> >>>>>>>> [21147.507826] Modules linked in: core_ctl(PO) qdrbg_module(O)
> >>>>>>>> qcrypto_module(O)
> >>>>>>>> [21147.514951] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O
> >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1
> >>>>>>>> [21147.524686] Workqueue: events_unbound async_run_entry_fn
> >>>>>>>> [21147.529965] task: ffffffc008138000 ti: ffffffc007880000 task.ti:
> >>>>>>>> ffffffc007880000
> >>>>>>>> [21147.537438] PC is at sysfs_create_dir+0x34/0xd4
> >>>>>>>> [21147.541949] LR is at kobject_add_internal+0x120/0x24c
> >>>>>>>> [21147.546979] pc : [<ffffffc000361fa8>] lr : [<ffffffc000456f5c>]
> >>>>>>>> pstate:
> >>>>>>>> 80000145
> >>>>>>>> [21147.996295] Process kworker/u16:8 (pid: 23935, stack limit =
> >>>>>>>> 0xffffffc007880058)
> >>>>>>>> [21148.003673] Call trace:
> >>>>>>>> [21148.006117] [<ffffffc000361fa8>] sysfs_create_dir+0x34/0xd4
> >>>>>>>> [21148.011663] [<ffffffc000456f58>] kobject_add_internal+0x11c/0x24c
> >>>>>>>> [21148.017737] [<ffffffc0004573ec>] kobject_add+0xc8/0xe4
> >>>>>>>> [21148.022863] [<ffffffc00060a410>] device_add+0x11c/0x600
> >>>>>>>> [21148.028069] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80
> >>>>>>>> [21148.033886] [<ffffffc00061c58c>] request_firmware+0x30/0x3c
> >>>>>>>> [21148.039439] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0
> >>>>>>>> [21148.044822] [<ffffffc00061c65c>]
> >>>>>>>> __async_dev_cache_fw_image+0x14/0x54
> >>>>>>>> [21148.051249] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c
> >>>>>>>> [21148.057064] [<ffffffc000237e88>] process_one_work+0x264/0x3dc
> >>>>>>>> [21148.062792] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340
> >>>>>>>> [21148.068263] [<ffffffc00023e500>] kthread+0xac/0xb8
> >>>>>>>> [21148.073039] Code: b5000094 14000026 d000ac54 9125a294 (7940c281)
> >>>>>>>> [21148.080312] ---[ end trace dabc98ea48b8ba5a ]---
> >>>>>>>>
> >>>>>>>> Below is the callstack where the parent is freed from the slub_debug
> >>>>>>>> info.
> >>>>>>>> 0xFFFFFFC000608D74  \\vmlinux\base/core\class_dir_release+0x0C
> >>>>>>>> 0xFFFFFFC0002FE6BC \\vmlinux\slub\free_debug_processing\fail+0x114
> >>>>>>>> 0xFFFFFFC0002FE7A0 \\vmlinux\slub\__slab_free+0x44
> >>>>>>>> 0xFFFFFFC0002FEE60 \\vmlinux\slub\kfree+0x1F8
> >>>>>>>> 0xFFFFFFC000608D70 \\vmlinux\base/core\class_dir_release+0x8
> >>>>>>>> 0xFFFFFFC000456D70 \\vmlinux\kobject\kobject_release+0x134
> >>>>>>>> 0xFFFFFFC000456E10 \\vmlinux\kobject\kobject_put+0x58
> >>>>>>>> 0xFFFFFFC000456C28 \\vmlinux\kobject\kobject_del+0x64
> >>>>>>>> 0xFFFFFFC000609D78 \\vmlinux\base/core\device_del+0x150
> >>>>>>>> 0xFFFFFFC00061C524
> >>>>>>>> \\vmlinux\firmware_class\_request_firmware\out+0x71C
> >>>>>>>> 0xFFFFFFC00061C58C \\vmlinux\firmware_class\request_firmware+0x30
> >>>>>>>> 0xFFFFFFC00061C5EC \\vmlinux\firmware_class\cache_firmware+0x54
> >>>>>>>> 0xFFFFFFC00061C65C
> >>>>>>>> \\vmlinux\firmware_class\__async_dev_cache_fw_image+0x14
> >>>>>>>> 0xFFFFFFC000243794 \\vmlinux\async\async_run_entry_fn+0x6C
> >>>>>>>> 0xFFFFFFC000237E88 \\vmlinux\workqueue\process_one_work+0x264
> >>>>>>>> 0xFFFFFFC0002392C0 \\vmlinux\workqueue\worker_thread\recheck+0x1A0
> >>>>>>>> 0xFFFFFFC00023E500 \\vmlinux\kthread\kthread+0xAC
> >>>>>>>>
> >>>>>>>> For similar type of issue, I see there is an existing fix: "sysfs:
> >>>>>>>> driver
> >>>>>>>> core: Fix glue dir race condition by gdp_mutex".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/base/core.c?id=e4a60d139060975eb956717e4f63ae348d4d8cc5
> >>>>>>>>
> >>>>>>>> Still I am able to reproduce the issue. I have verified this race on
> >>>>>>>> kernels: 3.10, 3.18 and 4.4.
> >>>>>>>>
> >>>>>>>> I followed below procedure to reproduce the issue:
> >>>>>>>>         1. Enable "CONFIG_DEBUG_KOBJECT_RELEASE"
> >>>>>>>>         2. Use test_firmware, modified the test for async calls
> >>>>>>>>         3. Replace WARN_ON with BUG_ON in kref_get().
> >>>>>>>> I ran below script from shell:
> >>>>>>>>        count=0
> >>>>>>>>        while [ 1 ]
> >>>>>>>>        do
> >>>>>>>>                echo 3 >
> >>>>>>>> /sys/devices/virtual/misc/test_firmware/trigger_ufw
> >>>>>>>>                if [ $? -ne 0 ]; then
> >>>>>>>>                        print "Exiting.. "
> >>>>>>>>                        exit 1
> >>>>>>>>                fi
> >>>>>>>>                count=$(($count +1))
> >>>>>>>>                echo "count $count"
> >>>>>>>>         done
> >>>>>>>>
> >>>>>>>> With below patch, I could reproduce the issue in the 2nd iteration.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> I will run your test script to see if there is new findings.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -------------------------8<----------------------------------------------
> >>>>>>>>
> >>>>>>>> To reproduce the race condition, use firmware test and
> >>>>>>>> CONFIG_DEBUG_KOBJECT_RELEASE=y, which delays kobject
> >>>>>>>> releases. So that we can catch the bug easily.
> >>>>>>>>
> >>>>>>>> Enabled below flags in kernel config file:
> >>>>>>>>
> >>>>>>>> +CONFIG_DEBUG_KOBJECT_RELEASE=y
> >>>>>>>> +CONFIG_TEST_FIRMWARE=y
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Looks I can't reproduce the issue with your approach.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> diff --git a/include/linux/kref.h b/include/linux/kref.h
> >>>>>>>> index 484604d..727fb24 100644
> >>>>>>>> --- a/include/linux/kref.h
> >>>>>>>> +++ b/include/linux/kref.h
> >>>>>>>> @@ -44,7 +44,7 @@ static inline void kref_get(struct kref *kref)
> >>>>>>>>          * condition when this kref is freeing by some other thread
> >>>>>>>> right
> >>>>>>>> now.
> >>>>>>>>          * In this case one should use kref_get_unless_zero()
> >>>>>>>>          */
> >>>>>>>> -    WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);
> >>>>>>>> +    BUG_ON(atomic_inc_return(&kref->refcount) < 2);
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     /**
> >>>>>>>> diff --git a/lib/test_firmware.c b/lib/test_firmware.c
> >>>>>>>> old mode 100644
> >>>>>>>> new mode 100755
> >>>>>>>> index 86374c1..14c9598
> >>>>>>>> --- a/lib/test_firmware.c
> >>>>>>>> +++ b/lib/test_firmware.c
> >>>>>>>> @@ -18,6 +18,7 @@
> >>>>>>>>     #include <linux/miscdevice.h>
> >>>>>>>>     #include <linux/slab.h>
> >>>>>>>>     #include <linux/uaccess.h>
> >>>>>>>> +#include <linux/async.h>
> >>>>>>>>
> >>>>>>>> +
> >>>>>>>> +static ASYNC_DOMAIN_EXCLUSIVE(ufw_domain);
> >>>>>>>> +
> >>>>>>>> +static void __async_ufw_work(void *name,
> >>>>>>>> +                                       async_cookie_t cookie)
> >>>>>>>> +{
> >>>>>>>> +    const struct firmware *fw;
> >>>>>>>> +    const char *fw_name = name;
> >>>>>>>> +    int ret;
> >>>>>>>> +
> >>>>>>>> +    pr_err("requesting %s\n", fw_name);
> >>>>>>>> +
> >>>>>>>> +    ret = request_firmware(&fw, fw_name, NULL);
> >>>>>>>> +    pr_err("loaded: %zu\n", fw ? fw->size : 0);
> >>>>>>>> +        if (!ret)
> >>>>>>>> +                kfree(fw);
> >>>>>>>> +
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static char *name[] = { "fw1.bin", "fw2.bin"};
> >>>>>>>> +
> >>>>>>>> +static ssize_t trigger_ufw_store(struct device *dev,
> >>>>>>>> +        struct device_attribute *attr,
> >>>>>>>> +        const char *buf, size_t count)
> >>>>>>>> +{
> >>>>>>>> +    int rc, i, c=0;
> >>>>>>>> +    u32 start;
> >>>>>>>> +    char *fw_name;
> >>>>>>>> +
> >>>>>>>> +    rc = kstrtou32(buf, 0, &start);
> >>>>>>>> +    if (rc){
> >>>>>>>> +        pr_err("Invalid option\n");
> >>>>>>>> +        return rc;
> >>>>>>>> +    }
> >>>>>>>> +    pr_err(" no of iterations %d\n", start);
> >>>>>>>> +
> >>>>>>>> +    for (i=0; i< start; i++)
> >>>>>>>> +    {
> >>>>>>>> +        if (c >= 2)
> >>>>>>>> +            c = 0;
> >>>>>>>> +        fw_name = name[c++];
> >>>>>>>> +        async_schedule_domain(__async_ufw_work, (void *)fw_name,
> >>>>>>>> &ufw_domain);
> >>>>>>>> +    }
> >>>>>>>> +    async_synchronize_full_domain(&ufw_domain);
> >>>>>>>> +    return count;
> >>>>>>>> +}
> >>>>>>>>
> >>>>>>>>     static DEVICE_ATTR_WO(trigger_request);
> >>>>>>>> +static DEVICE_ATTR_WO(trigger_ufw);
> >>>>>>>>
> >>>>>>>>     static int __init test_firmware_init(void)
> >>>>>>>>     {
> >>>>>>>> @@ -92,6 +140,12 @@ static int __init test_firmware_init(void)
> >>>>>>>>             goto dereg;
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> +    rc = device_create_file(test_fw_misc_device.this_device,
> >>>>>>>> +                             &dev_attr_trigger_ufw);
> >>>>>>>> +    if (rc) {
> >>>>>>>> +        pr_err("could not create sysfs interface: %d\n", rc);
> >>>>>>>> +        goto dereg;
> >>>>>>>> +    }
> >>>>>>>>         pr_warn("interface ready\n");
> >>>>>>>>
> >>>>>>>>         return 0;
> >>>>>>>> @@ -106,7 +160,9 @@ static void __exit test_firmware_exit(void)
> >>>>>>>>     {
> >>>>>>>>         release_firmware(test_firmware);
> >>>>>>>>         device_remove_file(test_fw_misc_device.this_device,
> >>>>>>>> -               &dev_attr_trigger_request);
> >>>>>>>> +            &dev_attr_trigger_request);
> >>>>>>>> +    device_remove_file(test_fw_misc_device.this_device,
> >>>>>>>> +            &dev_attr_trigger_ufw);
> >>>>>>>>         misc_deregister(&test_fw_misc_device);
> >>>>>>>>         pr_warn("removed interface\n");
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ----------------------------8<--------------------------------------------
> >>>>>>>>
> >>>>>>>> Thanks and regards,
> >>>>>>>> Chandrasekhar L.
> >>>>>>>> --
> >>>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >>>>>>>> member of
> >>>>>>>> Code Aurora Forum,
> >>>>>>>>     a Linux Foundation Collaborative Project.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html