2012-10-15 05:14:55

by Ming Lei

[permalink] [raw]
Subject: [RFC PATCH 0/3] mm/PM/USB: force memory allocation with no io in need

Hi,

This patch set introduces one process flag and trys to fix one deadlock
problem on block device during runtime resume or usb bus reset.

The 1st one is the change on include/sched.h and mm/page_alloc.c.

The other 2 patches are applied again PM and USB subsystem to demo
how to use the introduced mechanism to fix the deadlock problem.

drivers/base/power/runtime.c | 13 +++++++++++++
drivers/usb/core/hub.c | 42 +++++++++++++++++++++++++++++++++++++++++-
drivers/usb/storage/uas.c | 4 ++++
drivers/usb/storage/usb.c | 4 ++++
include/linux/sched.h | 5 +++++
include/linux/usb.h | 1 +
mm/page_alloc.c | 2 ++


Thanks,
--
Ming Lei


2012-10-15 05:15:16

by Ming Lei

[permalink] [raw]
Subject: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task
to avoid doing I/O inside memory allocation in the task's context.

The patch trys to solve one deadlock problem caused by block device,
and the problem can be occured at least in the below situations:

- during block device runtime resume situation, if memory allocation
with GFP_KERNEL is called inside runtime resume callback of any one
of its ancestors(or the block device itself), the deadlock may be
triggered inside the memory allocation since it might not complete
until the block device becomes active and the involed page I/O finishes.
The situation is pointed out first by Alan Stern. It is not a good
approach to convert all GFP_KERNEL in the path into GFP_NOIO because
several subsystems may be involved(for example, PCI, USB and SCSI may
be involved for usb mass stoarage device)

- during error handling situation of usb mass storage deivce, USB
bus reset will be put on the device, so there shouldn't have any
memory allocation with GFP_KERNEL during USB bus reset, otherwise
the deadlock similar with above may be triggered. Unfortunately, any
usb device may include one mass storage interface in theory, so it
requires all usb interface drivers to handle the situation. In fact,
most usb drivers don't know how to handle bus reset on the device
and don't provide .pre_set() and .post_reset() callback at all, so
USB core has to unbind and bind driver for these devices. So it
is still not practical to resort to GFP_NOIO for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: linux-mm <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
---
include/linux/sched.h | 5 +++++
mm/page_alloc.c | 2 ++
2 files changed, 7 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f6961c9..33be290 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1811,6 +1811,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
#define PF_KSWAPD 0x00040000 /* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -1848,6 +1849,10 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

+#define tsk_memalloc_no_io(p) ((p)->flags & PF_MEMALLOC_NOIO)
+#define tsk_memalloc_allow_io(p) do { (p)->flags &= ~PF_MEMALLOC_NOIO; } while (0)
+#define tsk_memalloc_forbid_io(p) do { (p)->flags |= PF_MEMALLOC_NOIO; } while (0)
+
/*
* task->jobctl flags
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e1be1c..e15381f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2596,6 +2596,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;

gfp_mask &= gfp_allowed_mask;
+ if (unlikely(tsk_memalloc_no_io(current)))
+ gfp_mask &= ~GFP_IOFS;

lockdep_trace_alloc(gfp_mask);

--
1.7.9.5

2012-10-15 05:15:29

by Ming Lei

[permalink] [raw]
Subject: [RFC PATCH 2/3] PM / Runtime: force memory allocation with no I/O during runtime_resume callbcack

This patch applies the introduces tsk_memalloc_forbid_io() and
tsk_memalloc_allow_io() to force memory allocation with no I/O
during runtime_resume callback.

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/base/power/runtime.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3148b10..76836c1 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -652,7 +652,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
if (!callback && dev->driver && dev->driver->pm)
callback = dev->driver->pm->runtime_resume;

+ /*
+ * Deadlock might be caused if memory allocation with GFP_KERNEL
+ * happens inside runtime_resume callback of one block device's
+ * ancestor or the block device itself. The easiest approach is
+ * to forbid I/O inside runtime_resume of all devices.
+ *
+ * In fact, it can be done only if the deivce is a block device
+ * or there is one block device descendant. But that may become
+ * complicated and not efficient because device tree traversing
+ * is involved.
+ */
+ tsk_memalloc_forbid_io(current);
retval = rpm_callback(callback, dev);
+ tsk_memalloc_allow_io(current);
if (retval) {
__update_runtime_status(dev, RPM_SUSPENDED);
pm_runtime_cancel_pending(dev);
--
1.7.9.5

2012-10-15 05:15:37

by Ming Lei

[permalink] [raw]
Subject: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

If one storage interface exists in the device, memory allocation
with GFP_KERNEL during usb_device_reset() might trigger I/O transfer
on the storage interface itself and cause deadlock because the
'us->dev_mutex' is held in .pre_reset() and the storage interface
can't do I/O transfer when the reset is triggered by other
interface, or the error handling can't be completed if the reset
is triggered by mass storage itself(error handling path).

Cc: Alan Stern <[email protected]>
Cc: Oliver Neukum <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/usb/core/hub.c | 42 +++++++++++++++++++++++++++++++++++++++++-
drivers/usb/storage/uas.c | 4 ++++
drivers/usb/storage/usb.c | 4 ++++
include/linux/usb.h | 1 +
4 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 9dc8ff2..f6958f7 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -5004,6 +5004,33 @@ re_enumerate:
return -ENODEV;
}

+static inline int is_bound_usb_storage_intf(struct usb_interface *intf)
+{
+ struct usb_driver *drv;
+
+ if (!intf->dev.driver)
+ return 0;
+
+ drv = to_usb_driver(intf->dev.driver);
+
+ return drv->for_storage;
+}
+
+static inline int has_bound_usb_storage_intf(struct usb_device *udev)
+{
+ struct usb_host_config *config = udev->actconfig;
+
+ if (config) {
+ int i;
+ for (i = 0; i < config->desc.bNumInterfaces; ++i) {
+ struct usb_interface *cintf = config->interface[i];
+ if (is_bound_usb_storage_intf(cintf))
+ return 1;
+ }
+ }
+ return 0;
+}
+
/**
* usb_reset_device - warn interface drivers and perform a USB port reset
* @udev: device to reset (not in SUSPENDED or NOTATTACHED state)
@@ -5027,7 +5054,7 @@ re_enumerate:
int usb_reset_device(struct usb_device *udev)
{
int ret;
- int i;
+ int i, no_io = 0;
struct usb_host_config *config = udev->actconfig;

if (udev->state == USB_STATE_NOTATTACHED ||
@@ -5037,6 +5064,15 @@ int usb_reset_device(struct usb_device *udev)
return -EINVAL;
}

+ /*
+ * If bound mass storage interface exists, don't allocate memory
+ * with GFP_KERNEL in current context to avoid possible deadlock
+ */
+ if (has_bound_usb_storage_intf(udev)) {
+ no_io = 1;
+ tsk_memalloc_forbid_io(current);
+ }
+
/* Prevent autosuspend during the reset */
usb_autoresume_device(udev);

@@ -5081,6 +5117,10 @@ int usb_reset_device(struct usb_device *udev)
}

usb_autosuspend_device(udev);
+
+ if (no_io)
+ tsk_memalloc_allow_io(current);
+
return ret;
}
EXPORT_SYMBOL_GPL(usb_reset_device);
diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 98b98ee..89daecb 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -888,6 +888,10 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
struct Scsi_Host *shost;
struct uas_dev_info *devinfo;
struct usb_device *udev = interface_to_usbdev(intf);
+ struct usb_driver *drv = to_usb_driver(intf->dev.driver);
+
+ if (!drv->for_storage)
+ drv->for_storage = 1;

if (uas_switch_interface(udev, intf))
return -ENODEV;
diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
index 12aa726..f11ed09 100644
--- a/drivers/usb/storage/usb.c
+++ b/drivers/usb/storage/usb.c
@@ -905,9 +905,13 @@ int usb_stor_probe1(struct us_data **pus,
struct Scsi_Host *host;
struct us_data *us;
int result;
+ struct usb_driver *drv = to_usb_driver(intf->dev.driver);

US_DEBUGP("USB Mass Storage device detected\n");

+ if (!drv->for_storage)
+ drv->for_storage = 1;
+
/*
* Ask the SCSI layer to allocate a host structure, with extra
* space at the end for our private us_data structure.
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 07915a3..3216b00 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -1024,6 +1024,7 @@ struct usb_driver {
unsigned int supports_autosuspend:1;
unsigned int disable_hub_initiated_lpm:1;
unsigned int soft_unbind:1;
+ unsigned int for_storage:1;
};
#define to_usb_driver(d) container_of(d, struct usb_driver, drvwrap.driver)

--
1.7.9.5

2012-10-15 09:36:25

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Monday 15 October 2012 13:14:19 Ming Lei wrote:
> If one storage interface exists in the device, memory allocation
> with GFP_KERNEL during usb_device_reset() might trigger I/O transfer
> on the storage interface itself and cause deadlock because the
> 'us->dev_mutex' is held in .pre_reset() and the storage interface
> can't do I/O transfer when the reset is triggered by other
> interface, or the error handling can't be completed if the reset
> is triggered by mass storage itself(error handling path).

I think limiting this to devices which have a storage device is not
productive. What if you are using iSCSI or nbd? In the long run
we will see busses attached to busses and as soon as the daughter
bus is hotpluggable you are thwarted anyway. Just do it unconditionally.

Regards
Oliver

2012-10-15 12:06:39

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Mon, Oct 15, 2012 at 5:34 PM, Oliver Neukum <[email protected]> wrote:
>
> I think limiting this to devices which have a storage device is not
> productive. What if you are using iSCSI or nbd? In the long run

You mean other non-mass storage or non-uas usb interfaces may
be involved in iSCSI or nbd? If not, the patch should be OK.
If yes, could you list them?

> we will see busses attached to busses and as soon as the daughter
> bus is hotpluggable you are thwarted anyway. Just do it unconditionally.

IMO, doing it unconditionally is not good because big chunk buffer
is often allocated in probe() path.

Thanks,
--
Ming Lei

2012-10-15 12:33:04

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Monday 15 October 2012 20:06:36 Ming Lei wrote:
> On Mon, Oct 15, 2012 at 5:34 PM, Oliver Neukum <[email protected]> wrote:
> >
> > I think limiting this to devices which have a storage device is not
> > productive. What if you are using iSCSI or nbd? In the long run
>
> You mean other non-mass storage or non-uas usb interfaces may
> be involved in iSCSI or nbd? If not, the patch should be OK.
> If yes, could you list them?

All network devices?

> > we will see busses attached to busses and as soon as the daughter
> > bus is hotpluggable you are thwarted anyway. Just do it unconditionally.
>
> IMO, doing it unconditionally is not good because big chunk buffer
> is often allocated in probe() path.

It would be a chunk that has just been freed.

Regards
Oliver

2012-10-15 13:21:16

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Mon, Oct 15, 2012 at 8:30 PM, Oliver Neukum <[email protected]> wrote:

>
> All network devices?

Good point, but I am wondering if there are guys who would like to
bring up iSCSI over usb network dongle, which should be
very slow at least with high speed. For super speed device,
looks there are few network dongles in market now.

So how about keeping the limit now?

We can remove the limit or extend it to network device if there are
some USB 3.0 network dongles coming.

>
>> > we will see busses attached to busses and as soon as the daughter
>> > bus is hotpluggable you are thwarted anyway. Just do it unconditionally.
>>
>> IMO, doing it unconditionally is not good because big chunk buffer
>> is often allocated in probe() path.
>
> It would be a chunk that has just been freed.

The reset will last for a bit long time, and many memory allocations
are involved inside usb_reset_and_verify_device(), so it might be
hard to get the previous chunk.

Thanks,
--
Ming

2012-10-15 14:14:38

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] PM / Runtime: force memory allocation with no I/O during runtime_resume callbcack

On Mon, 15 Oct 2012, Ming Lei wrote:

> This patch applies the introduces tsk_memalloc_forbid_io() and
> tsk_memalloc_allow_io() to force memory allocation with no I/O
> during runtime_resume callback.
>
> Cc: Alan Stern <[email protected]>
> Cc: Oliver Neukum <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> drivers/base/power/runtime.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index 3148b10..76836c1 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -652,7 +652,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
> if (!callback && dev->driver && dev->driver->pm)
> callback = dev->driver->pm->runtime_resume;
>
> + /*
> + * Deadlock might be caused if memory allocation with GFP_KERNEL
> + * happens inside runtime_resume callback of one block device's
> + * ancestor or the block device itself. The easiest approach is
> + * to forbid I/O inside runtime_resume of all devices.
> + *
> + * In fact, it can be done only if the deivce is a block device
> + * or there is one block device descendant. But that may become
> + * complicated and not efficient because device tree traversing
> + * is involved.
> + */
> + tsk_memalloc_forbid_io(current);
> retval = rpm_callback(callback, dev);
> + tsk_memalloc_allow_io(current);

This is not so good. What happens if I/O was already forbidden when
this function was called?

Alan Stern

2012-10-15 14:33:41

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Mon, 15 Oct 2012, Ming Lei wrote:

> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> 'struct task_struct'), so that the flag can be set by one task
> to avoid doing I/O inside memory allocation in the task's context.
>
> The patch trys to solve one deadlock problem caused by block device,
> and the problem can be occured at least in the below situations:
>
> - during block device runtime resume situation, if memory allocation
> with GFP_KERNEL is called inside runtime resume callback of any one
> of its ancestors(or the block device itself), the deadlock may be
> triggered inside the memory allocation since it might not complete
> until the block device becomes active and the involed page I/O finishes.
> The situation is pointed out first by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
> several subsystems may be involved(for example, PCI, USB and SCSI may
> be involved for usb mass stoarage device)
>
> - during error handling situation of usb mass storage deivce, USB
> bus reset will be put on the device, so there shouldn't have any
> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> the deadlock similar with above may be triggered. Unfortunately, any
> usb device may include one mass storage interface in theory, so it
> requires all usb interface drivers to handle the situation. In fact,
> most usb drivers don't know how to handle bus reset on the device
> and don't provide .pre_set() and .post_reset() callback at all, so
> USB core has to unbind and bind driver for these devices. So it
> is still not practical to resort to GFP_NOIO for solving the problem.
>
> Also the introduced solution can be used by block subsystem or block
> drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
> actual I/O transfer.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1811,6 +1811,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN 0x00010000 /* frozen for system suspend */
> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
> #define PF_KSWAPD 0x00040000 /* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
> #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
> @@ -1848,6 +1849,10 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
> #define used_math() tsk_used_math(current)
>
> +#define tsk_memalloc_no_io(p) ((p)->flags & PF_MEMALLOC_NOIO)
> +#define tsk_memalloc_allow_io(p) do { (p)->flags &= ~PF_MEMALLOC_NOIO; } while (0)
> +#define tsk_memalloc_forbid_io(p) do { (p)->flags |= PF_MEMALLOC_NOIO; } while (0)

Instead of allow/forbid, the API should be save/restore (like
local_irq_save and local_irq_restore). This makes nesting much easier.

Also, do we really the "p" argument? This is not at all likely to be
used with any task other than the current one.

Alan Stern

2012-10-15 14:35:53

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] PM / Runtime: force memory allocation with no I/O during runtime_resume callbcack

On Mon, Oct 15, 2012 at 10:14 PM, Alan Stern <[email protected]> wrote:
> On Mon, 15 Oct 2012, Ming Lei wrote:
>
>> This patch applies the introduces tsk_memalloc_forbid_io() and
>> tsk_memalloc_allow_io() to force memory allocation with no I/O
>> during runtime_resume callback.
>>
>> Cc: Alan Stern <[email protected]>
>> Cc: Oliver Neukum <[email protected]>
>> Cc: Rafael J. Wysocki <[email protected]>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> drivers/base/power/runtime.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
>> index 3148b10..76836c1 100644
>> --- a/drivers/base/power/runtime.c
>> +++ b/drivers/base/power/runtime.c
>> @@ -652,7 +652,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
>> if (!callback && dev->driver && dev->driver->pm)
>> callback = dev->driver->pm->runtime_resume;
>>
>> + /*
>> + * Deadlock might be caused if memory allocation with GFP_KERNEL
>> + * happens inside runtime_resume callback of one block device's
>> + * ancestor or the block device itself. The easiest approach is
>> + * to forbid I/O inside runtime_resume of all devices.
>> + *
>> + * In fact, it can be done only if the deivce is a block device
>> + * or there is one block device descendant. But that may become
>> + * complicated and not efficient because device tree traversing
>> + * is involved.
>> + */
>> + tsk_memalloc_forbid_io(current);
>> retval = rpm_callback(callback, dev);
>> + tsk_memalloc_allow_io(current);
>
> This is not so good. What happens if I/O was already forbidden when
> this function was called?

You are right, the old flag should be saved before forbidding and restored
after allowing.

Thanks,
--
Ming Lei

2012-10-15 14:41:51

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Mon, Oct 15, 2012 at 10:33 PM, Alan Stern <[email protected]> wrote:
>
> Instead of allow/forbid, the API should be save/restore (like
> local_irq_save and local_irq_restore). This makes nesting much easier.

Good point.

> Also, do we really the "p" argument? This is not at all likely to be
> used with any task other than the current one.

Yes, only 'current' can be passed now. I keep it because no performance
effect with macro implementation. But that is not good since it may
cause misuse. Will remove the 'p' argument.

Thanks,
--
Ming Lei

2012-10-15 15:47:32

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> 'struct task_struct'), so that the flag can be set by one task
> to avoid doing I/O inside memory allocation in the task's context.
>
> The patch trys to solve one deadlock problem caused by block device,
> and the problem can be occured at least in the below situations:
>
> - during block device runtime resume situation, if memory allocation
> with GFP_KERNEL is called inside runtime resume callback of any one
> of its ancestors(or the block device itself), the deadlock may be
> triggered inside the memory allocation since it might not complete
> until the block device becomes active and the involed page I/O finishes.
> The situation is pointed out first by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
> several subsystems may be involved(for example, PCI, USB and SCSI may
> be involved for usb mass stoarage device)

Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
suspend path?

>
> - during error handling situation of usb mass storage deivce, USB
> bus reset will be put on the device, so there shouldn't have any
> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> the deadlock similar with above may be triggered. Unfortunately, any
> usb device may include one mass storage interface in theory, so it
> requires all usb interface drivers to handle the situation. In fact,
> most usb drivers don't know how to handle bus reset on the device
> and don't provide .pre_set() and .post_reset() callback at all, so
> USB core has to unbind and bind driver for these devices. So it
> is still not practical to resort to GFP_NOIO for solving the problem.

I hope this case could be handled by usb core like usb_restrict_gfp_mask
rather than adding new branch on fast path.

>
> Also the introduced solution can be used by block subsystem or block
> drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
> actual I/O transfer.
>
> Cc: Alan Stern <[email protected]>
> Cc: Oliver Neukum <[email protected]>
> Cc: Jiri Kosina <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: linux-mm <[email protected]>
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/sched.h | 5 +++++
> mm/page_alloc.c | 2 ++
> 2 files changed, 7 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f6961c9..33be290 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1811,6 +1811,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN 0x00010000 /* frozen for system suspend */
> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
> #define PF_KSWAPD 0x00040000 /* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
> #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
> @@ -1848,6 +1849,10 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
> #define used_math() tsk_used_math(current)
>
> +#define tsk_memalloc_no_io(p) ((p)->flags & PF_MEMALLOC_NOIO)
> +#define tsk_memalloc_allow_io(p) do { (p)->flags &= ~PF_MEMALLOC_NOIO; } while (0)
> +#define tsk_memalloc_forbid_io(p) do { (p)->flags |= PF_MEMALLOC_NOIO; } while (0)
> +
> /*
> * task->jobctl flags
> */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e1be1c..e15381f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2596,6 +2596,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
>
> gfp_mask &= gfp_allowed_mask;
> + if (unlikely(tsk_memalloc_no_io(current)))
> + gfp_mask &= ~GFP_IOFS;
>
> lockdep_trace_alloc(gfp_mask);
>
> --
> 1.7.9.5
>

--
Kind Regards,
Minchan Kim

2012-10-15 16:06:13

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Monday 15 October 2012 21:21:13 Ming Lei wrote:
> On Mon, Oct 15, 2012 at 8:30 PM, Oliver Neukum <[email protected]> wrote:
>
> >
> > All network devices?
>
> Good point, but I am wondering if there are guys who would like to
> bring up iSCSI over usb network dongle, which should be

LTE, UMTS. Those people may have no choice.

> very slow at least with high speed. For super speed device,
> looks there are few network dongles in market now.
>
> So how about keeping the limit now?

I am perfectly happy with _adding_ code only when it is needed.
Removal looks like another thing to me.

Regards
Oliver

2012-10-15 16:48:14

by Greg KH

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Mon, Oct 15, 2012 at 09:21:13PM +0800, Ming Lei wrote:
> On Mon, Oct 15, 2012 at 8:30 PM, Oliver Neukum <[email protected]> wrote:
>
> >
> > All network devices?
>
> Good point, but I am wondering if there are guys who would like to
> bring up iSCSI over usb network dongle, which should be
> very slow at least with high speed. For super speed device,
> looks there are few network dongles in market now.
>
> So how about keeping the limit now?
>
> We can remove the limit or extend it to network device if there are
> some USB 3.0 network dongles coming.

No, don't limit us for no good reason, that's not acceptable at all.
Patching old kernels when new devices show up that work on those old
kernels (like USB network devices that follow the spec with no driver
change needed), is not a good idea.

So I agree with Oliver here, this isn't acceptable.

greg k-h

2012-10-15 16:49:05

by Greg KH

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Mon, Oct 15, 2012 at 08:06:36PM +0800, Ming Lei wrote:
> On Mon, Oct 15, 2012 at 5:34 PM, Oliver Neukum <[email protected]> wrote:
> >
> > I think limiting this to devices which have a storage device is not
> > productive. What if you are using iSCSI or nbd? In the long run
>
> You mean other non-mass storage or non-uas usb interfaces may
> be involved in iSCSI or nbd? If not, the patch should be OK.
> If yes, could you list them?

Networking over USB to serial devices (sadly, it does happen, and some
network devices actually show up this way.)

In other words, don't put policy like this in the USB core for no good
reason.

greg k-h

2012-10-16 01:56:51

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim <[email protected]> wrote:
> On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
>> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
>> 'struct task_struct'), so that the flag can be set by one task
>> to avoid doing I/O inside memory allocation in the task's context.
>>
>> The patch trys to solve one deadlock problem caused by block device,
>> and the problem can be occured at least in the below situations:
>>
>> - during block device runtime resume situation, if memory allocation
>> with GFP_KERNEL is called inside runtime resume callback of any one
>> of its ancestors(or the block device itself), the deadlock may be
>> triggered inside the memory allocation since it might not complete
>> until the block device becomes active and the involed page I/O finishes.
>> The situation is pointed out first by Alan Stern. It is not a good
>> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
>> several subsystems may be involved(for example, PCI, USB and SCSI may
>> be involved for usb mass stoarage device)
>
> Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
> suspend path?

IMO, we could, but it is not good and might trigger memory allocation problem.

pm_restrict_gfp_mask uses the global variable of gfp_allowed_mask to
avoid allocating page with GFP_IOFS in all contexts during system sleep,
when processes have been frozen.

But during runtime PM, the whole system is running and all processes are
runnable. Also runtime PM is per device and the whole system may have
lots of devices, so taking the global gfp_allowed_mask may keep page
allocation with ~GFP_IOFS for a considerable proportion of system
running time, then alloc_page() will return failure easier.

The above deadlock problem may be fixed by allocating memory with
~GFP_IOFS only in the context of calling runtime_resume, and that is
idea of the patch.

>
>>
>> - during error handling situation of usb mass storage deivce, USB
>> bus reset will be put on the device, so there shouldn't have any
>> memory allocation with GFP_KERNEL during USB bus reset, otherwise
>> the deadlock similar with above may be triggered. Unfortunately, any
>> usb device may include one mass storage interface in theory, so it
>> requires all usb interface drivers to handle the situation. In fact,
>> most usb drivers don't know how to handle bus reset on the device
>> and don't provide .pre_set() and .post_reset() callback at all, so
>> USB core has to unbind and bind driver for these devices. So it
>> is still not practical to resort to GFP_NOIO for solving the problem.
>
> I hope this case could be handled by usb core like usb_restrict_gfp_mask
> rather than adding new branch on fast path.

See above, applying the global gfp_allowed_mask is not good.


Thanks,
--
Ming Lei

2012-10-16 05:49:55

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Tue, Oct 16, 2012 at 09:56:48AM +0800, Ming Lei wrote:
> On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim <[email protected]> wrote:
> > On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
> >> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> >> 'struct task_struct'), so that the flag can be set by one task
> >> to avoid doing I/O inside memory allocation in the task's context.
> >>
> >> The patch trys to solve one deadlock problem caused by block device,
> >> and the problem can be occured at least in the below situations:
> >>
> >> - during block device runtime resume situation, if memory allocation
> >> with GFP_KERNEL is called inside runtime resume callback of any one
> >> of its ancestors(or the block device itself), the deadlock may be
> >> triggered inside the memory allocation since it might not complete
> >> until the block device becomes active and the involed page I/O finishes.
> >> The situation is pointed out first by Alan Stern. It is not a good
> >> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
> >> several subsystems may be involved(for example, PCI, USB and SCSI may
> >> be involved for usb mass stoarage device)
> >
> > Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
> > suspend path?
>
> IMO, we could, but it is not good and might trigger memory allocation problem.
>
> pm_restrict_gfp_mask uses the global variable of gfp_allowed_mask to
> avoid allocating page with GFP_IOFS in all contexts during system sleep,
> when processes have been frozen.
>
> But during runtime PM, the whole system is running and all processes are
> runnable. Also runtime PM is per device and the whole system may have
> lots of devices, so taking the global gfp_allowed_mask may keep page
> allocation with ~GFP_IOFS for a considerable proportion of system
> running time, then alloc_page() will return failure easier.
>
> The above deadlock problem may be fixed by allocating memory with
> ~GFP_IOFS only in the context of calling runtime_resume, and that is
> idea of the patch.

Fair enough but it wouldn't be a good idea that add new unlikely branch
in allocator's fast path. Please move the check into slow path which could
be in __alloc_pages_slowpath.

>
> >
> >>
> >> - during error handling situation of usb mass storage deivce, USB
> >> bus reset will be put on the device, so there shouldn't have any
> >> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> >> the deadlock similar with above may be triggered. Unfortunately, any
> >> usb device may include one mass storage interface in theory, so it
> >> requires all usb interface drivers to handle the situation. In fact,
> >> most usb drivers don't know how to handle bus reset on the device
> >> and don't provide .pre_set() and .post_reset() callback at all, so
> >> USB core has to unbind and bind driver for these devices. So it
> >> is still not practical to resort to GFP_NOIO for solving the problem.
> >
> > I hope this case could be handled by usb core like usb_restrict_gfp_mask
> > rather than adding new branch on fast path.
>
> See above, applying the global gfp_allowed_mask is not good.
>
>
> Thanks,
> --
> Ming Lei

--
Kind Regards,
Minchan Kim

2012-10-16 07:08:44

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Tue, Oct 16, 2012 at 1:49 PM, Minchan Kim <[email protected]> wrote:
>
> Fair enough but it wouldn't be a good idea that add new unlikely branch
> in allocator's fast path. Please move the check into slow path which could
> be in __alloc_pages_slowpath.

Thanks for your comment.

I have considered to add the branch into gfp_to_alloc_flags() before,
but didn't do it because I see that get_page_from_freelist() may use
the GFP_IO or GFP_FS flag at least in zone_reclaim() path.

So could you make sure it is safe to move the branch into
__alloc_pages_slowpath()? If so, I will add the check into
gfp_to_alloc_flags().


Thanks,
--
Ming Lei

2012-10-16 07:41:57

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] USB: forbid memory allocation with I/O during bus reset if storage interface exits

On Tue, Oct 16, 2012 at 12:48 AM, Greg Kroah-Hartman
<[email protected]> wrote:
>
> No, don't limit us for no good reason, that's not acceptable at all.
> Patching old kernels when new devices show up that work on those old
> kernels (like USB network devices that follow the spec with no driver
> change needed), is not a good idea.
>
> So I agree with Oliver here, this isn't acceptable.

OK, I will remove the limit in -v1 and add comments on the case.

thanks,
--
Ming Lei

2012-10-16 13:09:36

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Tue, Oct 16, 2012 at 03:08:41PM +0800, Ming Lei wrote:
> On Tue, Oct 16, 2012 at 1:49 PM, Minchan Kim <[email protected]> wrote:
> >
> > Fair enough but it wouldn't be a good idea that add new unlikely branch
> > in allocator's fast path. Please move the check into slow path which could
> > be in __alloc_pages_slowpath.
>
> Thanks for your comment.
>
> I have considered to add the branch into gfp_to_alloc_flags() before,
> but didn't do it because I see that get_page_from_freelist() may use
> the GFP_IO or GFP_FS flag at least in zone_reclaim() path.

Good point. You can check it in __zone_reclaim and change gfp_mask of scan_control
because it's never hot path.

>
> So could you make sure it is safe to move the branch into
> __alloc_pages_slowpath()? If so, I will add the check into
> gfp_to_alloc_flags().

How about this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d976957..b3607fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2614,10 +2614,16 @@ retry_cpuset:
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, alloc_flags,
preferred_zone, migratetype);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ /*
+ * Resume path can deadlock because block device
+ * isn't active yet.
+ */
+ if (unlikely(tsk_memalloc_no_io(current)))
+ gfp_mask &= ~GFP_IOFS;
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ }

trace_mm_page_alloc(page, order, gfp_mask, migratetype);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b5e45f4..6c2ccdd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3290,6 +3290,16 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
};
unsigned long nr_slab_pages0, nr_slab_pages1;

+ if (unlikely(tsk_memalloc_no_io(current))) {
+ sc.gfp_mask &= ~GFP_IOFS;
+ shrink.gfp_mask = sc.gfp_mask;
+ /*
+ * We allow to reclaim only clean pages.
+ * It can affect RECLAIM_SWAP and RECLAIM_WRITE mode
+ * but this is really rare event and allocator can
* fallback to other zones.
+ */
+ sc.may_writepage = 0;
+ sc.may_swap = 0;
+ }
+
cond_resched();
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP


>
>
> Thanks,
> --
> Ming Lei
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind Regards,
Minchan Kim

2012-10-16 13:47:09

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Tue, Oct 16, 2012 at 9:09 PM, Minchan Kim <[email protected]> wrote:
>
> Good point. You can check it in __zone_reclaim and change gfp_mask of scan_control
> because it's never hot path.
>
>>
>> So could you make sure it is safe to move the branch into
>> __alloc_pages_slowpath()? If so, I will add the check into
>> gfp_to_alloc_flags().
>
> How about this?

It is quite smart change, :-)

Considered that other part(sched.h) of the patch need update, I
will merge your change into -v1 for further review with your
Signed-off-by if you have no objection.

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d976957..b3607fa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2614,10 +2614,16 @@ retry_cpuset:
> page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> zonelist, high_zoneidx, alloc_flags,
> preferred_zone, migratetype);
> - if (unlikely(!page))
> + if (unlikely(!page)) {
> + /*
> + * Resume path can deadlock because block device
> + * isn't active yet.
> + */

Not only resume path, I/O transfer or its error handling path may deadlock too.

> + if (unlikely(tsk_memalloc_no_io(current)))
> + gfp_mask &= ~GFP_IOFS;
> page = __alloc_pages_slowpath(gfp_mask, order,
> zonelist, high_zoneidx, nodemask,
> preferred_zone, migratetype);
> + }
>
> trace_mm_page_alloc(page, order, gfp_mask, migratetype);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b5e45f4..6c2ccdd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3290,6 +3290,16 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> };
> unsigned long nr_slab_pages0, nr_slab_pages1;
>
> + if (unlikely(tsk_memalloc_no_io(current))) {
> + sc.gfp_mask &= ~GFP_IOFS;
> + shrink.gfp_mask = sc.gfp_mask;
> + /*
> + * We allow to reclaim only clean pages.
> + * It can affect RECLAIM_SWAP and RECLAIM_WRITE mode
> + * but this is really rare event and allocator can
> * fallback to other zones.
> + */
> + sc.may_writepage = 0;
> + sc.may_swap = 0;
> + }
> +
> cond_resched();
> /*
> * We need to be able to allocate from the reserves for RECLAIM_SWAP
>

Thanks,
--
Ming Lei

2012-10-16 13:53:17

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

On Tue, Oct 16, 2012 at 09:47:03PM +0800, Ming Lei wrote:
> On Tue, Oct 16, 2012 at 9:09 PM, Minchan Kim <[email protected]> wrote:
> >
> > Good point. You can check it in __zone_reclaim and change gfp_mask of scan_control
> > because it's never hot path.
> >
> >>
> >> So could you make sure it is safe to move the branch into
> >> __alloc_pages_slowpath()? If so, I will add the check into
> >> gfp_to_alloc_flags().
> >
> > How about this?
>
> It is quite smart change, :-)
>
> Considered that other part(sched.h) of the patch need update, I
> will merge your change into -v1 for further review with your
> Signed-off-by if you have no objection.

No problem. :)

--
Kind Regards,
Minchan Kim