2012-08-08 23:08:06

by Fernando Guzman Lugo

Subject: [PATCH 0/3] remoteproc: introduce rproc recovery

This set of patches makes it possible for a remoteproc to recover after a crash.
This is a hard recovery, meaning the remoteproc is reset and starts again
from the beginning. When a crash happens, all the virtio devices are
destroyed. Therefore, both rpmsg drivers and devices are gracefully
removed, which also causes the rproc user count to drop to 0, turning the
remoteproc off. After the virtio devices are destroyed, the crash handler
function reads the virtio information from the firmware in order to
recreate the virtio devices, which boots the remoteproc and makes
everything functional again.
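The hard-recovery flow described above can be sketched as a small userspace model (plain C; all names here are hypothetical stand-ins, not the kernel API): virtio devices hold the only references to the rproc, so destroying them powers it off, and recreating them from the firmware boots it again.

```c
#include <stddef.h>

/* Hypothetical userspace model of the recovery flow; not kernel code. */
enum model_state { MODEL_OFFLINE, MODEL_RUNNING, MODEL_CRASHED };

struct model_rproc {
	enum model_state state;
	int users;   /* boot/shutdown refcount, like rproc->power */
	int vdevs;   /* live virtio devices */
};

static void model_boot(struct model_rproc *r)
{
	if (r->users++ == 0)
		r->state = MODEL_RUNNING;
}

static void model_shutdown(struct model_rproc *r)
{
	if (--r->users == 0)
		r->state = MODEL_OFFLINE;
}

/* each virtio device holds one reference while it exists */
static void model_create_vdev(struct model_rproc *r)
{
	r->vdevs++;
	model_boot(r);
}

static void model_destroy_vdev(struct model_rproc *r)
{
	r->vdevs--;
	model_shutdown(r);
}

/* hard recovery: tear down every vdev (the rproc powers off when the
 * last reference drops), then recreate them as the firmware parse
 * would, booting the rproc again */
static void model_crash_handler(struct model_rproc *r)
{
	r->state = MODEL_CRASHED;
	while (r->vdevs)
		model_destroy_vdev(r);
	model_create_vdev(r);
}
```

The model only illustrates why the teardown itself turns the rproc off: no explicit shutdown call is needed once the last virtio-held reference is released.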

Fernando Guzman Lugo (3):
remoteproc: add rproc_report_crash function to notify rproc crashes
remoteproc: recover a remoteproc when it has crashed
remoteproc: create debugfs entry to disable/enable recovery
dynamically

Documentation/remoteproc.txt | 7 ++
drivers/remoteproc/remoteproc_core.c | 107 ++++++++++++++++++++++++++++--
drivers/remoteproc/remoteproc_debugfs.c | 83 +++++++++++++++++++++++
drivers/remoteproc/remoteproc_internal.h | 1 +
include/linux/remoteproc.h | 20 ++++++
5 files changed, 211 insertions(+), 7 deletions(-)


2012-08-08 23:08:19

by Fernando Guzman Lugo

Subject: [PATCH 3/3] remoteproc: create debugfs entry to disable/enable recovery dynamically

Add a debugfs entry (named "recovery") so that recovery can be disabled
dynamically at runtime. This entry is very useful when you are trying to
debug an rproc crash; without it, a recovery would take place, making it
impossible to debug the issue.

Original idea from Ohad Ben-Cohen and contributions from
Subramaniam Chanderashekarapuram

Example:
-disabling recovery:
$ echo disabled > <debugfs>/remoteproc/remoteproc0/recovery

-enabling recovery:
$ echo enabled > <debugfs>/remoteproc/remoteproc0/recovery

-in case you have disabled recovery and you want to continue
debugging, you can recover the remoteproc once using "recover".
This will not change the state of the recovery entry; it will
only recover the rproc if its state is RPROC_CRASHED
$ echo recover > <debugfs>/remoteproc/remoteproc0/recovery
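For illustration, the string handling such a write handler performs (copy the buffer, strip the trailing newline that echo appends, match a keyword) can be sketched in plain userspace C. `parse_recovery_cmd` and the enum are hypothetical names for this sketch, not part of the kernel interface:

```c
#include <string.h>

/* Hypothetical userspace sketch of the debugfs command parsing. */
enum recovery_cmd { CMD_ENABLED, CMD_DISABLED, CMD_RECOVER, CMD_INVALID };

static enum recovery_cmd parse_recovery_cmd(const char *user_buf, size_t count)
{
	char buf[10];

	if (count == 0 || count > sizeof(buf) - 1)
		return CMD_INVALID;

	memcpy(buf, user_buf, count);
	buf[count] = '\0';

	/* echo appends '\n'; drop it, as the kernel handler does */
	if (buf[count - 1] == '\n')
		buf[count - 1] = '\0';

	if (!strcmp(buf, "enabled"))
		return CMD_ENABLED;
	if (!strcmp(buf, "disabled"))
		return CMD_DISABLED;
	if (!strcmp(buf, "recover"))
		return CMD_RECOVER;
	return CMD_INVALID;
}
```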

Signed-off-by: Fernando Guzman Lugo <[email protected]>
---
drivers/remoteproc/remoteproc_core.c | 3 +-
drivers/remoteproc/remoteproc_debugfs.c | 83 +++++++++++++++++++++++++++++++
include/linux/remoteproc.h | 2 +
3 files changed, 87 insertions(+), 1 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index c879069..0b52169 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -932,7 +932,8 @@ static void rproc_crash_handler_work(struct work_struct *work)
++rproc->crash_cnt, rproc->name);
mutex_unlock(&rproc->lock);

- rproc_trigger_recover(rproc);
+ if (!rproc->recovery_disabled)
+ rproc_trigger_recover(rproc);
}

/**
diff --git a/drivers/remoteproc/remoteproc_debugfs.c b/drivers/remoteproc/remoteproc_debugfs.c
index 0383385..aa95cde 100644
--- a/drivers/remoteproc/remoteproc_debugfs.c
+++ b/drivers/remoteproc/remoteproc_debugfs.c
@@ -28,6 +28,9 @@
#include <linux/debugfs.h>
#include <linux/remoteproc.h>
#include <linux/device.h>
+#include <linux/uaccess.h>
+
+#include "remoteproc_internal.h"

/* remoteproc debugfs parent dir */
static struct dentry *rproc_dbg;
@@ -111,6 +114,84 @@ static const struct file_operations rproc_name_ops = {
.llseek = generic_file_llseek,
};

+/* expose recovery flag via debugfs */
+static ssize_t rproc_recovery_read(struct file *filp, char __user *userbuf,
+ size_t count, loff_t *ppos)
+{
+ struct rproc *rproc = filp->private_data;
+ char *buf = rproc->recovery_disabled ? "disabled\n" : "enabled\n";
+
+ return simple_read_from_buffer(userbuf, count, ppos, buf, strlen(buf));
+}
+
+
+/*
+ * By writing to the recovery debugfs entry we can change the recovery
+ * behavior dynamically. The default value of this entry is "enabled".
+ *
+ * There are 3 possible options you can write to the recovery debugfs entry:
+ * "enabled", "disabled" and "recover"
+ *
+ * enabled: recovery is enabled; every time an rproc crashes it will be
+ * recovered. If recovery was disabled when the rproc crashed, it
+ * will be recovered as soon as recovery is re-enabled.
+ * disabled: recovery is disabled; if an rproc crashes it remains in the
+ * crashed state and is therefore no longer functional. This
+ * option exists for debugging purposes; otherwise, debugging a
+ * crash would not be possible.
+ * recover: trigger a recovery regardless of the recovery state
+ * (enabled/disabled) and without changing it. This is useful
+ * when you are debugging a crash and get another crash right
+ * after re-enabling recovery: with recovery enabled, the rproc
+ * is recovered immediately, without letting you debug the new
+ * crash. The recommended flow is therefore to disable recovery,
+ * debug, use the "recover" command while still debugging, and
+ * write "enabled" only when you are done.
+ */
+static ssize_t rproc_recovery_write(struct file *filp,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ struct rproc *rproc = filp->private_data;
+ char buf[10];
+ int ret;
+
+ if (count < 1 || count > sizeof(buf))
+ return -EINVAL;
+
+ ret = copy_from_user(buf, user_buf, count);
+ if (ret)
+ return -EFAULT;
+
+ /* remove end of line */
+ if (buf[count - 1] == '\n')
+ buf[count - 1] = '\0';
+
+ if (!strncmp(buf, "enabled", count)) {
+ rproc->recovery_disabled = false;
+ /* if rproc has crashed trigger recovery */
+ if (rproc->state == RPROC_CRASHED)
+ rproc_trigger_recover(rproc);
+ } else if (!strncmp(buf, "disabled", count)) {
+ rproc->recovery_disabled = true;
+ } else if (!strncmp(buf, "recover", count)) {
+ /* if rproc has crashed trigger recovery */
+ if (rproc->state == RPROC_CRASHED)
+ rproc_trigger_recover(rproc);
+ }
+
+ return count;
+}
+
+static const struct file_operations rproc_recovery_ops = {
+ .read = rproc_recovery_read,
+ .write = rproc_recovery_write,
+ .open = simple_open,
+ .llseek = generic_file_llseek,
+};
+
void rproc_remove_trace_file(struct dentry *tfile)
{
debugfs_remove(tfile);
@@ -154,6 +235,8 @@ void rproc_create_debug_dir(struct rproc *rproc)
rproc, &rproc_name_ops);
debugfs_create_file("state", 0400, rproc->dbg_dir,
rproc, &rproc_state_ops);
+ debugfs_create_file("recovery", 0400, rproc->dbg_dir,
+ rproc, &rproc_recovery_ops);
}

void __init rproc_init_debugfs(void)
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index a46ed27..1f73c75 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -398,6 +398,7 @@ enum rproc_crash_type {
* @index: index of this rproc device
* @crash_handler: workqueue for handling a crash
* @crash_cnt: crash counter
+ * @recovery_disabled: flag that states if recovery is disabled
*/
struct rproc {
struct klist_node node;
@@ -423,6 +424,7 @@ struct rproc {
int index;
struct work_struct crash_handler;
unsigned crash_cnt;
+ bool recovery_disabled;
};

/* we currently support only two vrings per rvdev */
--
1.7.1

2012-08-08 23:08:56

by Fernando Guzman Lugo

Subject: [PATCH 2/3] remoteproc: recover a remoteproc when it has crashed

This patch introduces the rproc_trigger_recover function, which is in
charge of recovering the rproc. One way to recover the rproc after a crash
is to reset all its virtio devices. Doing that, all rpmsg drivers are
restored along with the rpmsg devices, which also causes the reset of
the remoteproc, making rpmsg communication with the remoteproc
functional again. So far, rproc_trigger_recover only resets all virtio
devices; if in the future other rproc features are introduced that need
to be reset too, rproc_trigger_recover should take care of that.
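The teardown loop in this patch uses the `_safe` list iterator because each removal unlinks and frees the current entry. The idea can be sketched in plain C with a hypothetical singly linked list standing in for the rproc's vdev list: cache the next pointer before freeing the current node.

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for the rproc_vdev list. */
struct vdev {
	struct vdev *next;
};

/* remove every node, as the recovery path removes every virtio device;
 * returns how many nodes were removed */
static int remove_all_vdevs(struct vdev **head)
{
	struct vdev *cur = *head, *tmp;
	int n = 0;

	while (cur) {
		tmp = cur->next;  /* cache next before freeing, like the
				   * list_for_each_entry_safe() pattern */
		free(cur);
		cur = tmp;
		n++;
	}
	*head = NULL;
	return n;
}
```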

Signed-off-by: Fernando Guzman Lugo <[email protected]>
---
drivers/remoteproc/remoteproc_core.c | 28 +++++++++++++++++++++++++++-
drivers/remoteproc/remoteproc_internal.h | 1 +
2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index 3a6f1a1..c879069 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -882,6 +882,32 @@ out:
}

/**
+ * rproc_trigger_recover() - recover a remoteproc
+ * @rproc: the remote processor
+ *
+ * The recovery is done by resetting all the virtio devices; that way all the
+ * rpmsg drivers will be reset along with the remote processor, making the
+ * remoteproc functional again.
+ *
+ * This function can sleep, so it must not be called from atomic context.
+ */
+int rproc_trigger_recover(struct rproc *rproc)
+{
+ struct rproc_vdev *rvdev, *rvtmp;
+
+ dev_err(&rproc->dev, "recovering %s\n", rproc->name);
+
+ /* clean up remote vdev entries */
+ list_for_each_entry_safe(rvdev, rvtmp, &rproc->rvdevs, node)
+ rproc_remove_virtio_dev(rvdev);
+
+ /* run rproc_fw_config_virtio to create vdevs again */
+ return request_firmware_nowait(THIS_MODULE, FW_ACTION_HOTPLUG,
+ rproc->firmware, &rproc->dev, GFP_KERNEL,
+ rproc, rproc_fw_config_virtio);
+}
+
+/**
* rproc_crash_handler_work() - handle a crash
*
* This function needs to handle everything related to a crash, like cpu
@@ -906,7 +932,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
++rproc->crash_cnt, rproc->name);
mutex_unlock(&rproc->lock);

- /* TODO: handle crash */
+ rproc_trigger_recover(rproc);
}

/**
diff --git a/drivers/remoteproc/remoteproc_internal.h b/drivers/remoteproc/remoteproc_internal.h
index a690ebe..d9c0730 100644
--- a/drivers/remoteproc/remoteproc_internal.h
+++ b/drivers/remoteproc/remoteproc_internal.h
@@ -63,6 +63,7 @@ void rproc_free_vring(struct rproc_vring *rvring);
int rproc_alloc_vring(struct rproc_vdev *rvdev, int i);

void *rproc_da_to_va(struct rproc *rproc, u64 da, int len);
+int rproc_trigger_recover(struct rproc *rproc);

static inline
int rproc_fw_sanity_check(struct rproc *rproc, const struct firmware *fw)
--
1.7.1

2012-08-08 23:09:34

by Fernando Guzman Lugo

Subject: [PATCH 1/3] remoteproc: add rproc_report_crash function to notify rproc crashes

This patch exports the rproc_report_crash function, which can be used
to report an rproc crash to the remoteproc core. This function is
specifically intended to be called by low-level remoteproc driver code
when it detects a crash (the remoteproc is not functional anymore). Using
this function from another driver (a non-rproc driver) should be analyzed
very carefully; most of the time it will be considered wrong.

rproc_report_crash can be called from any context; that is, it can be
called from atomic context without any problem. The reporter function
schedules a new thread (a workqueue work item) in charge of handling
the crash (if possible).

This new thread is created for two main reasons. The first is to make
the function callable from atomic context, given that many crashes
trigger an interrupt, so this function can be called directly from ISR
context. The second is to avoid any deadlock which could occur if
rproc_report_crash were called from a function which is indirectly
holding an rproc lock.
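The defer-to-worker pattern described above can be sketched in userspace with pthreads (a condition variable standing in for schedule_work; all names hypothetical, not the kernel API): the reporter only records the crash and signals, while the heavyweight handling runs later on a worker thread.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace sketch of deferred crash handling. */
static pthread_mutex_t crash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t crash_cond = PTHREAD_COND_INITIALIZER;
static bool crash_pending;
static int crashes_handled;

/* cheap report, analogous to schedule_work(): record and signal only */
static void report_crash(void)
{
	pthread_mutex_lock(&crash_lock);
	crash_pending = true;
	pthread_cond_signal(&crash_cond);
	pthread_mutex_unlock(&crash_lock);
}

/* worker: the expensive handling (dumps, recovery, ...) runs here,
 * outside the reporter's calling context */
static void *crash_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&crash_lock);
	while (!crash_pending)
		pthread_cond_wait(&crash_cond, &crash_lock);
	crash_pending = false;
	crashes_handled++;
	pthread_mutex_unlock(&crash_lock);
	return NULL;
}
```

Note the sketch takes a mutex in `report_crash()`, which a real ISR could not do; the kernel's schedule_work() is what makes the pattern safe from interrupt context there.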

The reporter function schedules the crash handler task. This task is
intended to eventually provide features like:

-remoteproc register dump
-remoteproc stack dump
-remoteproc core dump
-saving the remoteproc traces so they remain visible after the crash
-resetting the remoteproc to make it functional again (hard recovery)

Right now, it only prints the crash type that was detected. The types
of crashes are represented by an enum. I have only added the mmufault
crash type; remoteproc low-level drivers can add more types when needed.

Signed-off-by: Fernando Guzman Lugo <[email protected]>
---
Documentation/remoteproc.txt | 7 +++
drivers/remoteproc/remoteproc_core.c | 80 +++++++++++++++++++++++++++++++---
include/linux/remoteproc.h | 18 ++++++++
3 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/Documentation/remoteproc.txt b/Documentation/remoteproc.txt
index 23a09b8..e6469fd 100644
--- a/Documentation/remoteproc.txt
+++ b/Documentation/remoteproc.txt
@@ -129,6 +129,13 @@ int dummy_rproc_example(struct rproc *my_rproc)

Returns 0 on success and -EINVAL if @rproc isn't valid.

+ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
+ - Report a crash in a remoteproc
+ This function must be called every time a crash is detected by the
+ platform-specific rproc implementation. It should not be called from a
+ non-remoteproc driver. This function can be called from atomic/interrupt
+ context.
+
5. Implementation callbacks

These callbacks should be provided by platform-specific remoteproc
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index d5c2dbf..3a6f1a1 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -50,6 +50,18 @@ typedef int (*rproc_handle_resource_t)(struct rproc *rproc, void *, int avail);
/* Unique indices for remoteproc devices */
static DEFINE_IDA(rproc_dev_index);

+static const char * const rproc_crash_names[] = {
+ [RPROC_MMUFAULT] = "mmufault",
+};
+
+/* translate rproc_crash_type to string */
+static const char *rproc_crash_to_string(enum rproc_crash_type type)
+{
+ if (type < ARRAY_SIZE(rproc_crash_names))
+ return rproc_crash_names[type];
+ return "unknown";
+}
+
/*
* This is the IOMMU fault handler we register with the IOMMU API
* (when relevant; not all remote processors access memory through
@@ -57,19 +69,17 @@ static DEFINE_IDA(rproc_dev_index);
*
* IOMMU core will invoke this handler whenever the remote processor
* will try to access an unmapped device address.
- *
- * Currently this is mostly a stub, but it will be later used to trigger
- * the recovery of the remote processor.
*/
static int rproc_iommu_fault(struct iommu_domain *domain, struct device *dev,
unsigned long iova, int flags, void *token)
{
+ struct rproc *rproc = token;
+
dev_err(dev, "iommu fault: da 0x%lx flags 0x%x\n", iova, flags);

- /*
- * Let the iommu core know we're not really handling this fault;
- * we just plan to use this as a recovery trigger.
- */
+ rproc_report_crash(rproc, RPROC_MMUFAULT);
+
+ /* Let the iommu core know we're not really handling this fault; */
return -ENOSYS;
}

@@ -872,6 +882,34 @@ out:
}

/**
+ * rproc_crash_handler_work() - handle a crash
+ *
+ * This function needs to handle everything related to a crash, like cpu
+ * registers and stack dump, information to help to debug the fatal error, etc.
+ */
+static void rproc_crash_handler_work(struct work_struct *work)
+{
+ struct rproc *rproc = container_of(work, struct rproc, crash_handler);
+ struct device *dev = &rproc->dev;
+
+ dev_dbg(dev, "enter %s\n", __func__);
+
+ mutex_lock(&rproc->lock);
+ if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
+ /* handle only the first crash detected */
+ mutex_unlock(&rproc->lock);
+ return;
+ }
+
+ rproc->state = RPROC_CRASHED;
+ dev_err(&rproc->dev, "handling crash #%u in %s\n",
+ ++rproc->crash_cnt, rproc->name);
+ mutex_unlock(&rproc->lock);
+
+ /* TODO: handle crash */
+}
+
+/**
* rproc_boot() - boot a remote processor
* @rproc: handle of a remote processor
*
@@ -1165,6 +1203,8 @@ struct rproc *rproc_alloc(struct device *dev, const char *name,
INIT_LIST_HEAD(&rproc->traces);
INIT_LIST_HEAD(&rproc->rvdevs);

+ INIT_WORK(&rproc->crash_handler, rproc_crash_handler_work);
+
rproc->state = RPROC_OFFLINE;

return rproc;
@@ -1221,6 +1261,32 @@ int rproc_del(struct rproc *rproc)
}
EXPORT_SYMBOL(rproc_del);

+/**
+ * rproc_report_crash() - rproc crash reporter function
+ * @rproc: remote processor
+ * @type: crash type
+ *
+ * This function must be called every time a crash is detected by the low-level
+ * drivers implementing a specific remoteproc. This should not be called from a
+ * non-remoteproc driver.
+ *
+ * This function can be called from atomic/interrupt context.
+ */
+void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
+{
+ if (!rproc) {
+ pr_err("NULL rproc pointer\n");
+ return;
+ }
+
+ dev_err(&rproc->dev, "crash detected in %s: type %s\n",
+ rproc->name, rproc_crash_to_string(type));
+
+ /* create a new task to handle the error */
+ schedule_work(&rproc->crash_handler);
+}
+EXPORT_SYMBOL(rproc_report_crash);
+
static int __init remoteproc_init(void)
{
rproc_init_debugfs();
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index 131b539..a46ed27 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -361,6 +361,19 @@ enum rproc_state {
};

/**
+ * enum rproc_crash_type - remote processor crash types
+ * @RPROC_MMUFAULT: iommu fault
+ *
+ * Each element of the enum is used as an array index, so the element
+ * values must always stay sane for that purpose.
+ *
+ * Feel free to add more types when needed.
+ */
+enum rproc_crash_type {
+ RPROC_MMUFAULT,
+};
+
+/**
* struct rproc - represents a physical remote processor device
* @node: klist node of this rproc object
* @domain: iommu domain
@@ -383,6 +396,8 @@ enum rproc_state {
* @rvdevs: list of remote virtio devices
* @notifyids: idr for dynamically assigning rproc-wide unique notify ids
* @index: index of this rproc device
+ * @crash_handler: workqueue for handling a crash
+ * @crash_cnt: crash counter
*/
struct rproc {
struct klist_node node;
@@ -406,6 +421,8 @@ struct rproc {
struct list_head rvdevs;
struct idr notifyids;
int index;
+ struct work_struct crash_handler;
+ unsigned crash_cnt;
};

/* we currently support only two vrings per rvdev */
@@ -460,6 +477,7 @@ int rproc_del(struct rproc *rproc);

int rproc_boot(struct rproc *rproc);
void rproc_shutdown(struct rproc *rproc);
+void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type);

static inline struct rproc_vdev *vdev_to_rvdev(struct virtio_device *vdev)
{
--
1.7.1

2012-08-20 13:07:38

by Sjur BRENDELAND

Subject: Re: [PATCH 2/3] remoteproc: recover a remoteproc when it has crashed

Hi Fernando,

>This patch introduces the rproc_trigger_recover function, which is in
>charge of recovering the rproc. One way to recover the rproc after a crash
>is to reset all its virtio devices. Doing that, all rpmsg drivers are
>restored along with the rpmsg devices, which also causes the reset of
>the remoteproc, making rpmsg communication with the remoteproc
>functional again. So far, rproc_trigger_recover only resets all virtio
>devices; if in the future other rproc features are introduced that need
>to be reset too, rproc_trigger_recover should take care of that.

I think you drop the driver module's ref count during recovery, because
rproc_shutdown calls module_put(). Maybe you should move driver
ref count handling to rproc_add and rproc_type_release, instead of
rproc_boot() and rproc_shutdown()?

...
>+int rproc_trigger_recover(struct rproc *rproc)
>+{
>+ struct rproc_vdev *rvdev, *rvtmp;
>+
>+ dev_err(&rproc->dev, "recovering %s\n", rproc->name);
>+
>+ /* clean up remote vdev entries */
>+ list_for_each_entry_safe(rvdev, rvtmp, &rproc->rvdevs, node)
>+ rproc_remove_virtio_dev(rvdev);

...

Is this safe? Are you guaranteed that rproc->power
has counted down to zero at this point? Won't this fail if the
firmware loading starts before all clients have called
rproc_shutdown()?

I guess virtio drivers could potentially defer the call to del_vqs()
to a work queue. In that case you cannot be guaranteed that the
rproc is shut down at this point. You may also have e.g. a platform
driver which has previously called rproc_boot() and calls
rproc_shutdown() after calling rproc_crash().

I think you should wait for the rproc->power count to go to zero
before initiating the firmware loading. You could e.g.
move the firmware loading to rproc_shutdown(), or add a
completion.
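The completion-based variant of this suggestion can be sketched in userspace C (a condition variable standing in for a kernel completion; `power_gate` and friends are hypothetical names): recovery blocks until the power refcount reaches zero before reloading firmware.

```c
#include <pthread.h>

/* Hypothetical userspace sketch: gate firmware reload on refcount 0. */
struct power_gate {
	pthread_mutex_t lock;
	pthread_cond_t zero;
	int power;              /* analogous to rproc->power */
};

static void gate_get(struct power_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->power++;
	pthread_mutex_unlock(&g->lock);
}

static void gate_put(struct power_gate *g)
{
	pthread_mutex_lock(&g->lock);
	if (--g->power == 0)
		pthread_cond_broadcast(&g->zero);  /* like complete_all() */
	pthread_mutex_unlock(&g->lock);
}

/* recovery path: block until every user has called gate_put(),
 * like wait_for_completion() before starting the firmware load */
static void wait_until_unused(struct power_gate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->power > 0)
		pthread_cond_wait(&g->zero, &g->lock);
	pthread_mutex_unlock(&g->lock);
}
```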

BTW: I ran into a bug in unregister_virtio_device when testing
your feature with SLAB_DEBUG=y. The dev->index is accessed
after the device is freed. I'll submit a patch on this at some point.

After fixing that, your patch works for me!

Regards,
Sjur

2012-08-20 20:45:10

by Fernando Guzman Lugo

Subject: Re: [PATCH 2/3] remoteproc: recover a remoteproc when it has crashed

Hi Sjur,

On Mon, Aug 20, 2012 at 8:07 AM, Sjur Brændeland
<[email protected]> wrote:
>
> Hi Fernando,
>
> >This patch introduces the rproc_trigger_recover function, which is in
> >charge of recovering the rproc. One way to recover the rproc after a
> > crash
> >is to reset all its virtio devices. Doing that, all rpmsg drivers are
> >restored along with the rpmsg devices, which also causes the reset of
> >the remoteproc, making rpmsg communication with the remoteproc
> >functional again. So far, rproc_trigger_recover only
> > resets
> >all virtio devices; if in the future other rproc features are introduced
> >that need to be reset too, rproc_trigger_recover should take care
> >of that.
>
> I think you drop the driver module's ref count during recovery, because
> rproc_shutdown calls module_put(). Maybe you should move driver
> ref count handling to rproc_add and rproc_type_release, instead of
> rproc_boot() and rproc_shutdown()?

How could you remove the remoteproc modules then?

omap remoteproc for example:

# modprobe omap_remoteproc

# lsmod
Module Size Used by Tainted: G
virtio_rpmsg_bus 12557 0
omap_remoteproc 6194 1
remoteproc 31112 1 omap_remoteproc
virtio_ring 9820 2 virtio_rpmsg_bus,remoteproc
virtio 4865 2 virtio_rpmsg_bus,remoteproc

here if I want to remove the modules I can do:

modprobe -r virtio_rpmsg_bus

getting:

# lsmod
Module Size Used by Tainted: G
omap_remoteproc 6194 0
remoteproc 31112 1 omap_remoteproc
virtio_ring 9820 1 remoteproc
virtio

So, at this moment the omap_remoteproc use count becomes 0 and I can then
remove the omap_remoteproc and remoteproc modules.

However if I do what you suggested then:
- omap_remoteproc depends on the remoteproc module and will increase the use
count of the remoteproc module.
- as soon as omap_remoteproc calls rproc_add, the remoteproc module will call
module_get on the omap_remoteproc module, increasing the omap_remoteproc use
counter.

At this point we have a circular dependency: omap_remoteproc depends on
remoteproc and vice versa, making it impossible to remove the remoteproc
modules. So, I don't think it is a good idea to move the module_get/put calls.

What I could do is call module_get before calling rproc_remove_virtio_dev
for the virtio devices and call module_put after loading the firmware.
However, I am not sure why you would want to address something like that.
Don't expect recovery to work if you call rmmod <platform rproc driver> in
the middle of the recovery; it is as if, in my example, I did

# modprobe omap_remoteproc

# rmmod virtio_rpmsg_bus
# rmmod omap_remoteproc

# modprobe virtio_rpmsg_bus

At this moment the remote processor will not boot because there are no more
omap remoteproc devices. That is expected, and it won't generate any crash.
In the same way, recovery won't generate any crash: it will simply do
nothing after the devices are removed, or it will fail gracefully because
the rproc name no longer exists.

However, if you still have concerns about this, or a valid test case where
it would behave unexpectedly, we can think of a way to fix it.
Probably just calling module_get/put inside rproc_trigger_recover is the
easiest way. Please let me know what you think.

>
>
> ...
> >+int rproc_trigger_recover(struct rproc *rproc)
> >+{
> >+ struct rproc_vdev *rvdev, *rvtmp;
> >+
> >+ dev_err(&rproc->dev, "recovering %s\n", rproc->name);
> >+
> >+ /* clean up remote vdev entries */
> >+ list_for_each_entry_safe(rvdev, rvtmp, &rproc->rvdevs, node)
> >+ rproc_remove_virtio_dev(rvdev);
>
> ...
>
> Is this safe? Are you guaranteed that rproc->power
> is counted down to zero at this point.

If the remoteproc is using rpmsg and there is no other module/user calling
rproc_boot, it is guaranteed. Otherwise, it is not.

>
> Won't this fail if the
> firmware loading starts before all clients has called
> rproc_shutdown()?

Yes, the rproc refcount will never reach 0 and therefore the remoteproc will
not be reset. At the moment we only support recovery of rprocs which are
used exclusively by the rpmsg module. If there is any other user calling
rproc_boot directly, recovery will not work for that case. AFAIK, there are
no such users right now, and we would need to implement some kind of rproc
crash notification in order to address those cases. At the moment we don't
need that, but if you have some of those test cases we can discuss how to
address them. Please let me know what you think.

>
>
> I guess virtio drivers potentially could defer the call to del_vqs()
> to a work-queue.

That would require changing remoteproc_virtio.c, which means changing the
remoteproc module; that is not something the platform remoteproc
implementation can change. So, if there were changes in the remoteproc
core module to defer find_vqs, the recovery code would also have to change
to guarantee it remains functional.

>
> In this case you cannot be guaranteed that the
> rproc is shut down at this point. You may also have e.g. platform
> driver who has previously called rproc_boot() and calls
> rproc_shutdown() after calling rproc_crash().

Do you mean rproc_report_crash()? That function is not supposed to be called
by any platform driver which uses remoteproc, except for the platform driver
which implements a specific remoteproc. So, that sequence should not be
valid in the first place.

>
>
> I think you should wait for the rproc->power count to go to zero
> before initiating the firmware_loading. You could e.g.
> move firmware_loading to rproc_shutdown(), or add a
> completion.

Yes, that sounds good, and it will work once we support rproc_boot/shutdown
calls from anywhere and not only from the rpmsg module (find_vqs). Also, I
think a wait_for_completion inside rproc_trigger_recover will make it
clearer how the recovery is performed (namely, the need for the ref counter
to reach 0 for it to work).

I will try that and send a new patch to get more comments.

>
>
> BTW: I ran into a bug in unregister_virtio_device when testing
> your feature with SLAB_DEBUG=y. The dev->index is accessed
> after the device is freed. I'll submit a patch on this at some point.

good!

>
>
> After fixing that, your patch works for me!

Excellent news :).

Thanks for the comments and for trying them out on your system.

Regards,
Fernando.

>
>
> Regards,
> Sjur

2012-08-22 11:57:49

by Sjur BRENDELAND

Subject: RE: [PATCH 2/3] remoteproc: recover a remoteproc when it has crashed

Hi Fernando,

> > I think you drop the driver module's ref count during recovery, because
> > rproc_shutdown calls module_put(). Maybe you should move driver
> > ref count handling to rproc_add and rproc_type_release, instead of
> > rproc_boot() and rproc_shutdown()?
>
> How could you remove the remoteproc modules then?
...
> At this point we have a circular dependency, omap_remoteproc depends on
> remoteproc and vice-versa. Making impossible to remove the remoteproc
> modules. So, I don't think it is a good idea to move module_get/put
> calls.

Yes, I see - you're right. What I proposed won't work.

> However, if you still have concerns about that or a valid testcase where
> that would have an unexpected behaviour we can think in a way to fix it.
> Probably just calling module_get/put inside rproc_trigger_recover is
> the easiest way. Please let me know what you think.

My concern was the unloading of pdev or driver modules during async firmware
loading. However, it turns out that my issue was that I was using
rproc_put() instead of rproc_del() when unloading my driver module.
So after fixing my driver, your patch works fine even when unloading the
driver during firmware loading.

> > >+int rproc_trigger_recover(struct rproc *rproc)
> > >+{
> > >+ struct rproc_vdev *rvdev, *rvtmp;
> > >+
> > >+ dev_err(&rproc->dev, "recovering %s\n", rproc->name);
> > >+
> > >+ /* clean up remote vdev entries */
> > >+ list_for_each_entry_safe(rvdev, rvtmp, &rproc->rvdevs, node)
> > >+ rproc_remove_virtio_dev(rvdev);
> > ...
> > Is this safe? Are you guaranteed that rproc->power
> > is counted down to zero at this point.
...
> > I think you should wait for the rproc->power count to go to zero
> > before initiating the firmware_loading. You could e.g.
> > move firmware_loading to rproc_shutdown(), or add a
> > completion.
>
> Yes, that sounds good, and it will work once we support
> rproc_boot/shutdown calls from anywhere and not only from the
> rpmsg module (find_vqs). Also, I think a wait_for_completion
> inside rproc_trigger_recover will make it clearer how the recovery
> is performed (namely, the need for the ref counter to
> reach 0 for it to work).
>
> I will try that and send a new patch to get more comments.


Ok, looking forward to your next patch then.

Regards,
Sjur