2012-10-22 08:34:52

by Ming Lei

Subject: [RFC PATCH v2 0/6] solve deadlock caused by memory allocation with I/O

Hi,

This patchset tries to solve a deadlock problem that can be caused
by memory allocation involving block I/O during runtime resume and in the
block device error handling path. Traditionally, the problem has been
addressed by statically passing GFP_NOIO to mm, but that is not an
effective solution; see the detailed description in patch 1's commit log.

This patch set introduces a new process flag and uses it to fix the
deadlock problem on block/network devices during runtime resume or USB bus reset.

The 1st patch is the change to include/linux/sched.h and mm.

The 2nd patch introduces the memalloc_noio_resume flag in 'struct
dev_pm_info', together with pm_runtime_set_memalloc_noio(), so that the
PM core can teach mm not to allocate memory with GFP_IOFS during the
runtime_resume callback, but only on devices with the flag set.

The following 2 patches apply the introduced pm_runtime_set_memalloc_noio()
to mark all devices in the path from the block or network device to the
root device in the device tree as memalloc_noio_resume.

The last 2 patches are applied against the PM and USB subsystems to
demonstrate how to use the introduced mechanism to fix the deadlock problem.

V2:
- remove changes on 'may_writepage' and 'may_swap' (1/6)
- unset GFP_IOFS in the try_to_free_pages() path (1/6)
- introduce pm_runtime_set_memalloc_noio()
- only apply the mechanism to block/network devices and their ancestors
in the runtime resume context
V1:
- take Minchan's change to avoid the check in the alloc_page hot path
- change the helpers' style into save/restore as suggested by Alan
- do memory allocation with no I/O in the USB bus reset path for all
devices, as suggested by Greg and Oliver

block/genhd.c | 8 +++++
drivers/base/power/runtime.c | 69 +++++++++++++++++++++++++++++++++++++++++-
drivers/usb/core/hub.c | 11 +++++++
include/linux/pm.h | 1 +
include/linux/pm_runtime.h | 3 ++
include/linux/sched.h | 10 ++++++
mm/page_alloc.c | 10 +++++-
mm/vmscan.c | 12 ++++++++
net/core/net-sysfs.c | 5 +++
9 files changed, 127 insertions(+), 2 deletions(-)


Thanks,
--
Ming Lei


2012-10-22 08:35:12

by Ming Lei

Subject: [RFC PATCH v2 1/6] mm: teach mm by current context info to not do I/O during memory allocation

This patch introduces the PF_MEMALLOC_NOIO process flag (in the 'flags'
field of 'struct task_struct'), so that a task can set the flag
to avoid doing I/O inside memory allocation in the task's context.

The patch tries to solve a deadlock problem caused by block devices,
and the problem may happen at least in the situations below:

- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside the runtime_resume callback of any one
of its ancestors (or of the block device itself), a deadlock may be
triggered inside the memory allocation, since it might not complete
until the block device becomes active and the involved page I/O finishes.
This situation was first pointed out by Alan Stern. It is not a good
approach to convert all the GFP_KERNEL[1] allocations in the path into
GFP_NOIO, because several subsystems may be involved (for example, PCI,
USB and SCSI may be involved for a usb mass storage device, and network
devices are involved too in the iSCSI case)

- during error handling of a usb mass storage device, a USB bus reset
will be performed on the device, so no memory allocation with
GFP_KERNEL should happen during the USB bus reset, otherwise
a deadlock similar to the above may be triggered. Unfortunately, any
usb device may include one mass storage interface in theory, so it
requires all usb interface drivers to handle the situation. In fact,
most usb drivers don't know how to handle a bus reset on their device
and don't provide .pre_reset() and .post_reset() callbacks at all, so
the USB core has to unbind and rebind the driver for these devices. So
it is still not practical to resort to GFP_NOIO to solve the problem.

Also, the introduced solution can be used by the block subsystem or
block drivers too, for example, by setting the PF_MEMALLOC_NOIO flag
before doing the actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL allocations in the
affected path into GFP_NOIO, because the functions doing the allocation
may be implemented as library code and called in many other contexts.

In fact, with memalloc_noio() some of the current static GFP_NOIO
allocations can be converted back into GFP_KERNEL in other, non-affected
contexts; at least almost all GFP_NOIO allocations in the USB subsystem
can be converted into GFP_KERNEL after applying this approach, so that
allocation with GFP_IO generally only happens in the runtime resume/bus
reset/block I/O transfer contexts.

[1] several examples of GFP_KERNEL allocation in the runtime resume path

- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found so far. Unfortunately, such allocations
can only be found by human inspection at the moment, and there are
probably many more not found yet, since any function in the resume
path (call tree) may allocate memory with GFP_KERNEL.

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Ming Lei <[email protected]>

---
v2:
- remove changes on 'may_writepage' and 'may_swap' because they
aren't related to this patchset and can't introduce I/O in the
allocation path if GFP_IOFS is unset; handling 'may_swap'
and 'may_writepage' for GFP_NOIO or GFP_NOFS should be an
mm-internal thing, so let the mm guys deal with that, :-).

Looks like clearing the two may_XXX flags only excludes dirty pages
and anon pages from reclaiming, and that behaviour should be decided
by the GFP flags, IMO.

- unset GFP_IOFS in the try_to_free_pages() path since alloc_page_buffers()
and dma_alloc_from_contiguous() may drop into that path, as
pointed out by KAMEZAWA Hiroyuki
v1:
- take Minchan's change to avoid the check in the alloc_page hot path

- change the helpers' style into save/restore as suggested by Alan
Stern
---
include/linux/sched.h | 10 ++++++++++
mm/page_alloc.c | 10 +++++++++-
mm/vmscan.c | 12 ++++++++++++
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7a76fa..ac5234a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1793,6 +1793,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
#define PF_KSWAPD 0x00040000 /* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -1830,6 +1831,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

+#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
+#define memalloc_noio_save(flag) do { \
+ (flag) = current->flags & PF_MEMALLOC_NOIO; \
+ current->flags |= PF_MEMALLOC_NOIO; \
+} while (0)
+#define memalloc_noio_restore(flag) do { \
+ current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
+} while (0)
+
/*
* task->jobctl flags
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c871fc..a7b76ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2630,10 +2630,18 @@ retry_cpuset:
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, alloc_flags,
preferred_zone, migratetype);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ /*
+ * Runtime resume, block I/O and its error handling path
+ * can deadlock because I/O on the device might not
+ * complete.
+ */
+ if (unlikely(memalloc_noio()))
+ gfp_mask &= ~GFP_IOFS;
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ }

trace_mm_page_alloc(page, order, gfp_mask, migratetype);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..5bf8290 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2298,6 +2298,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};

+ if (unlikely(memalloc_noio())) {
+ gfp_mask &= ~GFP_IOFS;
+ sc.gfp_mask = gfp_mask;
+ shrink.gfp_mask = sc.gfp_mask;
+ }
+
throttle_direct_reclaim(gfp_mask, zonelist, nodemask);

/*
@@ -3298,6 +3304,12 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
};
unsigned long nr_slab_pages0, nr_slab_pages1;

+ if (unlikely(memalloc_noio())) {
+ gfp_mask &= ~GFP_IOFS;
+ sc.gfp_mask = gfp_mask;
+ shrink.gfp_mask = sc.gfp_mask;
+ }
+
cond_resched();
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP
--
1.7.9.5

2012-10-22 08:35:27

by Ming Lei

Subject: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

This patch introduces the memalloc_noio_resume flag in
'struct dev_pm_info' to help the PM core teach mm not to allocate
memory with the GFP_KERNEL flag, in order to avoid a probable
deadlock.

As explained in the comment, any GFP_KERNEL allocation inside the
runtime_resume callback of any device in the path from a block
or network device to the root device in the device tree may cause
a deadlock; the introduced pm_runtime_set_memalloc_noio() sets or
clears the flag on the devices in that path recursively.

Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/base/power/runtime.c | 53 ++++++++++++++++++++++++++++++++++++++++++
include/linux/pm.h | 1 +
include/linux/pm_runtime.h | 3 +++
3 files changed, 57 insertions(+)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3148b10..a75eeca 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -124,6 +124,59 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
}
EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);

+static int dev_memalloc_noio(struct device *dev, void *data)
+{
+ if (dev->power.memalloc_noio_resume)
+ return 1;
+ return 0;
+}
+
+/**
+ * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
+ * @dev: Device to handle.
+ * @enable: True for setting the flag and False for clearing the flag.
+ *
+ * Set the flag for all devices in the path from the device to the
+ * root device in the device tree if @enable is true, otherwise clear
+ * the flag for devices in the path whose siblings don't set the flag.
+ *
+ * The function should only be called by a block device or network
+ * device driver to solve the deadlock problem during runtime
+ * resume:
+ * if memory allocation with GFP_KERNEL is called inside the runtime
+ * resume callback of any one of its ancestors (or the block device
+ * itself), the deadlock may be triggered inside the memory
+ * allocation since it might not complete until the block device
+ * becomes active and the involved page I/O finishes. The situation
+ * was first pointed out by Alan Stern. Network devices are involved
+ * in the iSCSI kind of situation.
+ *
+ * No locking is provided in the function to handle the hotplug race,
+ * because pm_runtime_set_memalloc_noio(false) is called in the
+ * parent's remove path.
+ */
+void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
+{
+ dev->power.memalloc_noio_resume = enable;
+
+ if (!dev->parent)
+ return;
+
+ if (enable) {
+ pm_runtime_set_memalloc_noio(dev->parent, 1);
+ } else {
+ /* only clear the flag for a device if none of
+ * the device's children has the flag set.
+ */
+ if (device_for_each_child(dev->parent, NULL,
+ dev_memalloc_noio))
+ return;
+
+ pm_runtime_set_memalloc_noio(dev->parent, 0);
+ }
+}
+EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);
+
/**
* rpm_check_suspend_allowed - Test whether a device may be suspended.
* @dev: Device to test.
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 007e687..5b0ee4d 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -538,6 +538,7 @@ struct dev_pm_info {
unsigned int irq_safe:1;
unsigned int use_autosuspend:1;
unsigned int timer_autosuspends:1;
+ unsigned int memalloc_noio_resume:1;
enum rpm_request request;
enum rpm_status runtime_status;
int runtime_error;
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index f271860..775e063 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -47,6 +47,7 @@ extern void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
extern unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
extern void pm_runtime_update_max_time_suspended(struct device *dev,
s64 delta_ns);
+extern void pm_runtime_set_memalloc_noio(struct device *dev, bool enable);

static inline bool pm_children_suspended(struct device *dev)
{
@@ -149,6 +150,8 @@ static inline void pm_runtime_set_autosuspend_delay(struct device *dev,
int delay) {}
static inline unsigned long pm_runtime_autosuspend_expiration(
struct device *dev) { return 0; }
+static inline void pm_runtime_set_memalloc_noio(struct device *dev,
+ bool enable){}

#endif /* !CONFIG_PM_RUNTIME */

--
1.7.9.5

2012-10-22 08:35:44

by Ming Lei

Subject: [RFC PATCH v2 3/6] block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices

This patch applies the introduced pm_runtime_set_memalloc_noio() to
block devices so that the PM core will teach mm not to allocate memory
with GFP_IOFS when calling the runtime_resume callback for block devices.

Cc: Jens Axboe <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/genhd.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 9e02cd6..c5f10ea 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -18,6 +18,7 @@
#include <linux/mutex.h>
#include <linux/idr.h>
#include <linux/log2.h>
+#include <linux/pm_runtime.h>

#include "blk.h"

@@ -519,6 +520,12 @@ static void register_disk(struct gendisk *disk)

dev_set_name(ddev, disk->disk_name);

+ /* avoid a probable deadlock caused by allocating memory with
+ * GFP_KERNEL in the runtime_resume callback of any of its
+ * ancestor devices
+ */
+ pm_runtime_set_memalloc_noio(ddev, true);
+
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);

@@ -661,6 +668,7 @@ void del_gendisk(struct gendisk *disk)
disk->driverfs_dev = NULL;
if (!sysfs_deprecated)
sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
+ pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
device_del(disk_to_dev(disk));
}
EXPORT_SYMBOL(del_gendisk);
--
1.7.9.5

2012-10-22 08:35:57

by Ming Lei

Subject: [RFC PATCH v2 4/6] net/core: apply pm_runtime_set_memalloc_noio on network devices

A deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume callback of network devices in the iSCSI situation, so
mark network devices and their ancestors as 'memalloc_noio_resume'
with the introduced pm_runtime_set_memalloc_noio().

Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David Decotigny <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
net/core/net-sysfs.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bcf02f6..9aba5be 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -22,6 +22,7 @@
#include <linux/vmalloc.h>
#include <linux/export.h>
#include <linux/jiffies.h>
+#include <linux/pm_runtime.h>
#include <net/wext.h>

#include "net-sysfs.h"
@@ -1386,6 +1387,8 @@ void netdev_unregister_kobject(struct net_device * net)

remove_queue_kobjects(net);

+ pm_runtime_set_memalloc_noio(dev, false);
+
device_del(dev);
}

@@ -1411,6 +1414,8 @@ int netdev_register_kobject(struct net_device *net)
*groups++ = &netstat_group;
#endif /* CONFIG_SYSFS */

+ pm_runtime_set_memalloc_noio(dev, true);
+
error = device_add(dev);
if (error)
return error;
--
1.7.9.5

2012-10-22 08:36:16

by Ming Lei

Subject: [RFC PATCH v2 5/6] PM / Runtime: force memory allocation with no I/O during runtime_resume callback

This patch applies the introduced memalloc_noio_save() and
memalloc_noio_restore() to force memory allocation with no I/O
during the runtime_resume callback on devices marked as
memalloc_noio_resume.

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/base/power/runtime.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index a75eeca..c61b7b0 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -556,6 +556,7 @@ static int rpm_resume(struct device *dev, int rpmflags)
int (*callback)(struct device *);
struct device *parent = NULL;
int retval = 0;
+ unsigned int noio_flag;

trace_rpm_resume(dev, rpmflags);

@@ -705,7 +706,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
if (!callback && dev->driver && dev->driver->pm)
callback = dev->driver->pm->runtime_resume;

- retval = rpm_callback(callback, dev);
+ /*
+ * A deadlock might be caused if memory allocation with GFP_KERNEL
+ * happens inside the runtime_resume callback of a block device's
+ * ancestor or of the block device itself. A network device might be
+ * thought of as part of an iSCSI block device, so network devices
+ * and their ancestors should be marked as memalloc_noio_resume too.
+ */
+ if (dev->power.memalloc_noio_resume) {
+ memalloc_noio_save(noio_flag);
+ retval = rpm_callback(callback, dev);
+ memalloc_noio_restore(noio_flag);
+ } else {
+ retval = rpm_callback(callback, dev);
+ }
if (retval) {
__update_runtime_status(dev, RPM_SUSPENDED);
pm_runtime_cancel_pending(dev);
--
1.7.9.5

2012-10-22 08:36:34

by Ming Lei

Subject: [RFC PATCH v2 6/6] USB: forbid memory allocation with I/O during bus reset

If a storage interface or a USB network interface (the iSCSI case)
exists in the current configuration, memory allocation with
GFP_KERNEL during usb_reset_device() might trigger an I/O transfer
on the storage interface itself and cause a deadlock: 'us->dev_mutex'
is held in .pre_reset(), so the storage interface can't do I/O
transfers while a reset triggered by another interface is in
progress, and error handling can't complete if the reset is
triggered by the storage interface itself (the error handling path).

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/usb/core/hub.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 522ad57..106a80a 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -5038,6 +5038,7 @@ int usb_reset_device(struct usb_device *udev)
{
int ret;
int i;
+ unsigned int noio_flag;
struct usb_host_config *config = udev->actconfig;

if (udev->state == USB_STATE_NOTATTACHED ||
@@ -5047,6 +5048,15 @@ int usb_reset_device(struct usb_device *udev)
return -EINVAL;
}

+ /*
+ * Don't allocate memory with GFP_KERNEL in the current
+ * context, to avoid a possible deadlock if a usb mass
+ * storage interface or a usbnet interface (the iSCSI case)
+ * is included in the current configuration. The easiest
+ * approach is to do it for all devices.
+ */
+ memalloc_noio_save(noio_flag);
+
/* Prevent autosuspend during the reset */
usb_autoresume_device(udev);

@@ -5091,6 +5101,7 @@ int usb_reset_device(struct usb_device *udev)
}

usb_autosuspend_device(udev);
+ memalloc_noio_restore(noio_flag);
return ret;
}
EXPORT_SYMBOL_GPL(usb_reset_device);
--
1.7.9.5

2012-10-22 14:33:17

by Alan Stern

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Mon, 22 Oct 2012, Ming Lei wrote:

> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> +{
> + dev->power.memalloc_noio_resume = enable;
> +
> + if (!dev->parent)
> + return;
> +
> + if (enable) {
> + pm_runtime_set_memalloc_noio(dev->parent, 1);
> + } else {
> + /* only clear the flag for one device if all
> + * children of the device don't set the flag.
> + */
> + if (device_for_each_child(dev->parent, NULL,
> + dev_memalloc_noio))
> + return;
> +
> + pm_runtime_set_memalloc_noio(dev->parent, 0);
> + }
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);

Tail recursion should be implemented as a loop, not as an explicit
recursion. That is, the function should be:

void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
{
do {
dev->power.memalloc_noio_resume = enable;

if (!enable) {
/*
* Don't clear the parent's flag if any of the
* parent's children have their flag set.
*/
if (device_for_each_child(dev->parent, NULL,
dev_memalloc_noio))
return;
}
dev = dev->parent;
} while (dev);
}

except that you need to add locking, for two reasons:

There's a race. What happens if another child sets the flag
between the time device_for_each_child() runs and the next loop
iteration?

Even without a race, access to bitfields is not SMP-safe
without locking.

Alan Stern

2012-10-22 14:37:12

by Alan Stern

Subject: Re: [RFC PATCH v2 6/6] USB: forbid memory allocation with I/O during bus reset

On Mon, 22 Oct 2012, Ming Lei wrote:

> If one storage interface or usb network interface(iSCSI case)
> exists in current configuration, memory allocation with
> GFP_KERNEL during usb_device_reset() might trigger I/O transfer
> on the storage interface itself and cause deadlock because
> the 'us->dev_mutex' is held in .pre_reset() and the storage
> interface can't do I/O transfer when the reset is triggered
> by other interface, or the error handling can't be completed
> if the reset is triggered by the storage itself(error handling path).
>
> Cc: Alan Stern <[email protected]>
> Cc: Oliver Neukum <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> drivers/usb/core/hub.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 522ad57..106a80a 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -5038,6 +5038,7 @@ int usb_reset_device(struct usb_device *udev)
> {
> int ret;
> int i;
> + unsigned int noio_flag;
> struct usb_host_config *config = udev->actconfig;
>
> if (udev->state == USB_STATE_NOTATTACHED ||
> @@ -5047,6 +5048,15 @@ int usb_reset_device(struct usb_device *udev)
> return -EINVAL;
> }
>
> + /*
> + * Don't allocate memory with GFP_KERNEL in current
> + * context to avoid possible deadlock if usb mass
> + * storage interface or usbnet interface(iSCSI case)
> + * is included in current configuration. The easiest
> + * approach is to do it for all devices.
> + */
> + memalloc_noio_save(noio_flag);

Why not check dev->power.memalloc_noio_resume here too?

Alan Stern

2012-10-22 19:18:32

by Alan Stern

Subject: Re: [RFC PATCH v2 4/6] net/core: apply pm_runtime_set_memalloc_noio on network devices

On Mon, 22 Oct 2012, Ming Lei wrote:

> Deadlock might be caused by allocating memory with GFP_KERNEL in
> runtime_resume callback of network devices in iSCSI situation, so
> mark network devices and its ancestor as 'memalloc_noio_resume'
> with the introduced pm_runtime_set_memalloc_noio().

Is this really needed? Even with iSCSI, doesn't register_disk() have
to be called for the underlying block device? And given your 3/6
patch, wouldn't that mark the network device?

Alan Stern

2012-10-23 08:42:09

by Ming Lei

Subject: Re: [RFC PATCH v2 4/6] net/core: apply pm_runtime_set_memalloc_noio on network devices

On Tue, Oct 23, 2012 at 3:18 AM, Alan Stern <[email protected]> wrote:
> On Mon, 22 Oct 2012, Ming Lei wrote:

> Is this really needed? Even with iSCSI, doesn't register_disk() have
> to be called for the underlying block device? And given your 3/6
> patch, wouldn't that mark the network device?

The problem is that the network device is not an ancestor of the
iSCSI disk device, which transfers its data over the TCP stack.


Thanks,
--
Ming Lei

2012-10-23 08:44:42

by Ming Lei

Subject: Re: [RFC PATCH v2 6/6] USB: forbid memory allocation with I/O during bus reset

On Mon, Oct 22, 2012 at 10:37 PM, Alan Stern <[email protected]> wrote:
> On Mon, 22 Oct 2012, Ming Lei wrote:

>>
>> + /*
>> + * Don't allocate memory with GFP_KERNEL in current
>> + * context to avoid possible deadlock if usb mass
>> + * storage interface or usbnet interface(iSCSI case)
>> + * is included in current configuration. The easiest
>> + * approach is to do it for all devices.
>> + */
>> + memalloc_noio_save(noio_flag);
>
> Why not check dev->power.memalloc_noio_resume here too?

Yes, we can use the flag here too, even though it was introduced
for the runtime_resume case.

Thanks,
--
Ming Lei

2012-10-23 09:08:50

by Ming Lei

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <[email protected]> wrote:
>
> Tail recursion should be implemented as a loop, not as an explicit
> recursion. That is, the function should be:
>
> void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> {
> do {
> dev->power.memalloc_noio_resume = enable;
>
> if (!enable) {
> /*
> * Don't clear the parent's flag if any of the
> * parent's children have their flag set.
> */
> if (device_for_each_child(dev->parent, NULL,
> dev_memalloc_noio))
> return;
> }
> dev = dev->parent;
> } while (dev);
> }

OK, I will take the non-recursive implementation to save kernel
stack space.

>
> except that you need to add locking, for two reasons:
>
> There's a race. What happens if another child sets the flag
> between the time device_for_each_child() runs and the next loop
> iteration?

Yes, I know about the race; I didn't add a lock because the function
is mostly called in a .probe() or .remove() callback, where its parent's
device lock is held, which avoids this race.

Considering that it may be called in an async probe() (SCSI disk), a lock
is needed; the simplest way is to add a global lock. Any suggestions?

>
> Even without a race, access to bitfields is not SMP-safe
> without locking.

You mean that one ancestor device might not be active when
one of its descendants is being probed or removed?


Thanks,
--
Ming Lei

2012-10-23 14:46:52

by Alan Stern

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Tue, 23 Oct 2012, Ming Lei wrote:

> On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <[email protected]> wrote:
> >
> > Tail recursion should be implemented as a loop, not as an explicit
> > recursion. That is, the function should be:
> >
> > void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> > {
> > do {
> > dev->power.memalloc_noio_resume = enable;
> >
> > if (!enable) {
> > /*
> > * Don't clear the parent's flag if any of the
> > * parent's children have their flag set.
> > */
> > if (device_for_each_child(dev->parent, NULL,
> > dev_memalloc_noio))
> > return;
> > }
> > dev = dev->parent;
> > } while (dev);
> > }
>
> OK, will take the non-recursion implementation for saving kernel
> stack space.
>
> >
> > except that you need to add locking, for two reasons:
> >
> > There's a race. What happens if another child sets the flag
> > between the time device_for_each_child() runs and the next loop
> > iteration?
>
> Yes, I know the race, and not adding a lock because the function
> is mostly called in .probe() or .remove() callback and its parent's device
> lock is held to avoid this race.
>
> Considered that it may be called in async probe() (scsi disk), one lock
> is needed, the simplest way is to add a global lock. Any suggestion?

No. Because of where you put the new flag, it must be protected by
dev->power.lock. And this means the iterative implementation shown
above can't be used as is. It will have to be more like this:

void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
{
spin_lock_irq(&dev->power.lock);
dev->power.memalloc_noio_resume = enable;

while (dev->parent) {
spin_unlock_irq(&dev->power.lock);
dev = dev->parent;

spin_lock_irq(&dev->power.lock);
/*
* Don't clear the parent's flag if any of the
* parent's children have their flag set.
*/
if (!enable && device_for_each_child(dev->parent, NULL,
dev_memalloc_noio))
break;
dev->power.memalloc_noio_resume = enable;
}
spin_unlock_irq(&dev->power.lock);
}

> > Even without a race, access to bitfields is not SMP-safe
> > without locking.
>
> You mean one ancestor device might not be in active when
> one of its descendants is being probed or removed?

No. Consider this example:

struct foo {
int a:1;
int b:1;
} x;

Consider what happens if CPU 0 does "x.a = 1" at the same time as
another CPU 1 does "x.b = 1". The compiler might produce object code
looking like this for CPU 0:

move x, reg1
or 0x1, reg1
move reg1, x

and this for CPU 1:

move x, reg2
or 0x2, reg2
move reg2, x

With no locking, the two "or" instructions could execute
simultaneously. What will the final value of x be?

The two CPUs will interfere, even though they are touching different
bitfields.

Alan Stern

2012-10-23 15:18:36

by Ming Lei

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Tue, Oct 23, 2012 at 10:46 PM, Alan Stern <[email protected]> wrote:
> On Tue, 23 Oct 2012, Ming Lei wrote:
>
>> On Mon, Oct 22, 2012 at 10:33 PM, Alan Stern <[email protected]> wrote:
>> >
>> > Tail recursion should be implemented as a loop, not as an explicit
>> > recursion. That is, the function should be:
>> >
>> > void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
>> > {
>> > do {
>> > dev->power.memalloc_noio_resume = enable;
>> >
>> > if (!enable) {
>> > /*
>> > * Don't clear the parent's flag if any of the
>> > * parent's children have their flag set.
>> > */
>> > if (device_for_each_child(dev->parent, NULL,
>> > dev_memalloc_noio))
>> > return;
>> > }
>> > dev = dev->parent;
>> > } while (dev);
>> > }
>>
>> OK, will take the non-recursion implementation for saving kernel
>> stack space.
>>
>> >
>> > except that you need to add locking, for two reasons:
>> >
>> > There's a race. What happens if another child sets the flag
>> > between the time device_for_each_child() runs and the next loop
>> > iteration?
>>
>> Yes, I know the race, and not adding a lock because the function
>> is mostly called in .probe() or .remove() callback and its parent's device
>> lock is held to avoid this race.
>>
>> Considered that it may be called in async probe() (scsi disk), one lock
>> is needed, the simplest way is to add a global lock. Any suggestion?
>
> No. Because of where you put the new flag, it must be protected by
> dev->power.lock. And this means the iterative implementation shown
> above can't be used as is. It will have to be more like this:
>
> void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> {
> 	spin_lock_irq(&dev->power.lock);
> 	dev->power.memalloc_noio_resume = enable;
>
> 	while (dev->parent) {
> 		spin_unlock_irq(&dev->power.lock);
> 		dev = dev->parent;
>
> 		spin_lock_irq(&dev->power.lock);
> 		/*
> 		 * Don't clear the parent's flag if any of the
> 		 * parent's children have their flag set.
> 		 */
> 		if (!enable && device_for_each_child(dev->parent, NULL,

s/dev->parent/dev

> 				dev_memalloc_noio))
> 			break;
> 		dev->power.memalloc_noio_resume = enable;
> 	}
> 	spin_unlock_irq(&dev->power.lock);
> }

Because bitfield access is not SMP-safe, power.lock should be held, but
that is not enough to prevent children from being probed or disconnected
concurrently. It looks like another lock is still needed; I think a global
lock is acceptable on this infrequent path.

>
>> > Even without a race, access to bitfields is not SMP-safe
>> > without locking.
>>
>> You mean one ancestor device might not be active when
>> one of its descendants is being probed or removed?
>
> No. Consider this example:
>
> struct foo {
> int a:1;
> int b:1;
> } x;
>
> Consider what happens if CPU 0 does "x.a = 1" at the same time as
> another CPU 1 does "x.b = 1". The compiler might produce object code
> looking like this for CPU 0:
>
> move x, reg1
> or 0x1, reg1
> move reg1, x
>
> and this for CPU 1:
>
> move x, reg2
> or 0x2, reg2
> move reg2, x
>
> With no locking, the two "or" instructions could execute
> simultaneously. What will the final value of x be?
>
> The two CPUs will interfere, even though they are touching different
> bitfields.

Got it, thanks for your detailed explanation.

It looks like the problem is worse than the above: not only bitfields are
affected, adjacent fields might be involved too, see:

http://lwn.net/Articles/478657/


Thanks,
--
Ming Lei

2012-10-23 18:16:23

by Alan Stern

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Tue, 23 Oct 2012, Ming Lei wrote:

> Because bitfield access is not SMP-safe, power.lock should be held, but
> that is not enough to prevent children from being probed or disconnected
> concurrently. It looks like another lock is still needed; I think a global
> lock is acceptable on this infrequent path.

Agreed.

> Got it, thanks for your detailed explanation.
>
> It looks like the problem is worse than the above: not only bitfields are
> affected, adjacent fields might be involved too, see:
>
> http://lwn.net/Articles/478657/

Linus made it clear (in various emails at the time) that the kernel
requires the compiler not to do the sort of things discussed in that
article. But even the restrictions he wanted would not prevent
adjacent bitfields from interfering with each other.

Alan Stern

2012-10-24 09:13:48

by David Laight

Subject: RE: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

> It looks like the problem is worse than the above: not only bitfields are
> affected, adjacent fields might be involved too, see:
>
> http://lwn.net/Articles/478657/

Not mentioned in there is that even with x86/amd64 given
a struct with the following adjacent fields:
char a;
char b;
char c;
then foo->b |= 0x80; might do a 32bit RMW cycle.
This will (well might - but probably does) happen
if compiled to a 'BTS' instruction.
The x86 instruction set docs are actually unclear
as to whether the 32bit cycle might even be misaligned!
amd64 might do a 64bit cycle (not checked the docs).

David


2012-10-24 12:21:42

by Alan

Subject: Re: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

On Wed, 24 Oct 2012 10:06:36 +0100
"David Laight" <[email protected]> wrote:

> > It looks like the problem is worse than the above: not only bitfields are
> > affected, adjacent fields might be involved too, see:
> >
> > http://lwn.net/Articles/478657/
>
> Not mentioned in there is that even with x86/amd64 given
> a struct with the following adjacent fields:
> char a;
> char b;
> char c;
> then foo->b |= 0x80; might do a 32bit RMW cycle.

There are processors that will do this, for the char case at least, since
they implement byte operations as a mix of 32-bit operations and rotates.

> This will (well might - but probably does) happen
> if compiled to a 'BTS' instruction.
> The x86 instruction set docs are actually unclear
> as to whether the 32bit cycle might even be misaligned!
> amd64 might do a 64bit cycle (not checked the docs).

Even with a suitably aligned field the compiler is at liberty to generate
things like

reg = 0x80
reg |= foo->b
foo->b = reg;

One reason it's a good idea to use set_bit/test_bit and friends.

Alan