2017-03-01 16:57:32

by Tejun Heo

Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hello,

On Tue, Feb 28, 2017 at 03:51:27PM -0800, Tahsin Erdogan wrote:
> On Tue, Feb 28, 2017 at 2:47 PM, Tejun Heo <[email protected]> wrote:
> >> + if (!blkcg_policy_enabled(q, pol)) {
> >> + ret = -EOPNOTSUPP;
> >> + goto fail;
> >
> > Pulling this out of the queue_lock doesn't seem safe to me. This
> > function may end up calling into callbacks of disabled policies this
> > way.
>
> I will move this back inside the lock. To make things safe, I am also
> thinking of rechecking both blkcg_policy_enabled() and
> blk_queue_bypass() after reacquiring the locks in each iteration.
>
> >> + parent = blkcg_parent(blkcg);
> >> + while (parent && !__blkg_lookup(parent, q, false)) {
> >> + pos = parent;
> >> + parent = blkcg_parent(parent);
> >> + }
> >
> > Hmm... how about adding @new_blkg to blkg_lookup_create() and calling
> > it with non-NULL @new_blkg until it succeeds? Wouldn't that be
> > simpler?
> >
> >> +
> >> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
>
> The challenge with that approach is creating a new_blkg with the right
> blkcg before passing it to blkg_lookup_create(). blkg_lookup_create()
> walks down the hierarchy and tries to fill the first missing entry,
> and the preallocated new_blkg must have been created with the right
> blkcg (feel free to send a code fragment if you think I am
> misunderstanding the suggestion).

Ah, indeed, but we can break out the allocation of a blkg from its
initialization, right? It's a bit more work, but then we'd be able to
do something like:


retry:
        new_blkg = alloc;
        lock;
        sanity checks;
        blkg = blkg_lookup_and_create(..., new_blkg);
        if (!blkg) {
                unlock;
                goto retry;
        }
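
Roughly, in C, that sketch could be fleshed out as below. Note that
blkg_lookup_and_create() doesn't exist yet; the fragment assumes it
always consumes @new_blkg and returns NULL when the preallocated blkg
was built for the wrong blkcg level, forcing another pass:

        struct blkcg_gq *blkg, *new_blkg;

retry:
        /* allocate with GFP_KERNEL while no locks are held */
        new_blkg = blkg_alloc(blkcg, q, GFP_KERNEL);
        if (!new_blkg)
                return ERR_PTR(-ENOMEM);

        rcu_read_lock();
        spin_lock_irq(q->queue_lock);

        /* sanity checks: queue state may have changed while unlocked */
        if (unlikely(blk_queue_bypass(q))) {
                blkg_free(new_blkg);
                spin_unlock_irq(q->queue_lock);
                rcu_read_unlock();
                return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
        }

        /* assumed to consume @new_blkg; NULL means retry with a fresh one */
        blkg = blkg_lookup_and_create(blkcg, q, new_blkg);
        if (!blkg) {
                spin_unlock_irq(q->queue_lock);
                rcu_read_unlock();
                goto retry;
        }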

Thanks.

--
tejun


2017-03-02 00:18:03

by Tahsin Erdogan

Subject: Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hi Tejun,
>
> Ah, indeed, but we can break out the allocation of a blkg from its
> initialization, right? It's a bit more work, but then we'd be able to
> do something like:
>
>
> retry:
>         new_blkg = alloc;
>         lock;
>         sanity checks;
>         blkg = blkg_lookup_and_create(..., new_blkg);
>         if (!blkg) {
>                 unlock;
>                 goto retry;
>         }
I tried doing it based on the sample above, but I wasn't happy with the
result; the amount of code grew too big. I sent a simplified version
that does the blkg allocation within blkg_lookup_create(). I think this
version is simpler; let me know what you think.



2017-03-02 02:50:56

by Tahsin Erdogan

Subject: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

blkg_conf_prep() currently calls blkg_lookup_create() while holding the
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like the one below:

pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, the percpu allocator cannot call vmalloc()
because the queue spinlock is held.

A failure in this call path gives grief to tools that are trying to
configure io weights. We see occasional failures shortly after reboots
even when the system is not under any memory pressure. Machines with
many CPUs are more vulnerable to this condition.

Add a flag to blkg_lookup_create() to indicate whether releasing locks
for memory allocation purposes is okay.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tahsin Erdogan <[email protected]>
---
v2:
Moved blkg creation into blkg_lookup_create() to avoid duplicating
blkg_lookup_create() logic.

block/blk-cgroup.c | 51 +++++++++++++++++++++++++++++++++++++++-------
include/linux/blk-cgroup.h | 4 ++--
2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2c8cc..afb16e998bf3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
* blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
+ * @wait_ok: whether blocking for memory allocations is okay
*
* Lookup blkg for the @blkcg - @q pair. If it doesn't exist, try to
* create one. blkg creation is performed recursively from blkcg_root such
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
+ * When @wait_ok is true, rcu and queue locks may be dropped for allocating
+ * memory. In this case, the locks will be reacquired on return.
+ *
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q)
+ struct request_queue *q, bool wait_ok)
{
struct blkcg_gq *blkg;

@@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}

- blkg = blkg_create(pos, q, NULL);
+ if (wait_ok) {
+ struct blkcg_gq *new_blkg;
+
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+
+ new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
+
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+
+ if (unlikely(!new_blkg))
+ return ERR_PTR(-ENOMEM);
+
+ if (unlikely(blk_queue_bypass(q))) {
+ blkg_free(new_blkg);
+ return ERR_PTR(blk_queue_dying(q) ?
+ -ENODEV : -EBUSY);
+ }
+
+ blkg = blkg_create(pos, q, new_blkg);
+ } else
+ blkg = blkg_create(pos, q, NULL);
+
if (pos == blkcg || IS_ERR(blkg))
return blkg;
}
@@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
{
struct gendisk *disk;
struct blkcg_gq *blkg;
+ struct request_queue *q;
struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
@@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
return -ENODEV;
}

+ q = disk->queue;
+
rcu_read_lock();
- spin_lock_irq(disk->queue->queue_lock);
+ spin_lock_irq(q->queue_lock);

- if (blkcg_policy_enabled(disk->queue, pol))
- blkg = blkg_lookup_create(blkcg, disk->queue);
- else
+ if (blkcg_policy_enabled(q, pol)) {
+ blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
+
+ /*
+ * blkg_lookup_create() may have dropped and reacquired the
+ * queue lock. Check policy enabled state again.
+ */
+ if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
+ blkg = ERR_PTR(-EOPNOTSUPP);
+ } else
blkg = ERR_PTR(-EOPNOTSUPP);

if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
rcu_read_unlock();
- spin_unlock_irq(disk->queue->queue_lock);
+ spin_unlock_irq(q->queue_lock);
owner = disk->fops->owner;
put_disk(disk);
module_put(owner);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7bac74..78067dd59c91 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -172,7 +172,7 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
struct request_queue *q, bool update_hint);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q);
+ struct request_queue *q, bool wait_ok);
int blkcg_init_queue(struct request_queue *q);
void blkcg_drain_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
@@ -694,7 +694,7 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
- blkg = blkg_lookup_create(blkcg, q);
+ blkg = blkg_lookup_create(blkcg, q, false /* wait_ok */);
if (IS_ERR(blkg))
blkg = NULL;
spin_unlock_irq(q->queue_lock);
--
2.12.0.rc1.440.g5b76565f74-goog

2017-03-02 19:45:05

by Tejun Heo

Subject: Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hello, Tahsin.

On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> - struct request_queue *q)
> + struct request_queue *q, bool wait_ok)

I'm okay with this direction, but it would probably be better if the
parameter were a gfp mask and we branched on __GFP_WAIT in the function.
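
As a rough fragment of what that branch could look like (the v3 patch
later in this thread ends up keying off gfpflags_allow_blocking(gfp)
rather than testing __GFP_WAIT directly):

        const bool drop_locks = gfpflags_allow_blocking(gfp);

        if (drop_locks) {
                spin_unlock_irq(q->queue_lock);
                rcu_read_unlock();
        }

        /* a blocking allocation is safe here when the gfp mask allows it */
        new_blkg = blkg_alloc(blkcg, q, gfp);

        if (drop_locks) {
                rcu_read_lock();
                spin_lock_irq(q->queue_lock);
        }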

> {
> struct blkcg_gq *blkg;
>
> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> parent = blkcg_parent(parent);
> }
>
> - blkg = blkg_create(pos, q, NULL);
> + if (wait_ok) {
> + struct blkcg_gq *new_blkg;
> +
> + spin_unlock_irq(q->queue_lock);
> + rcu_read_unlock();
> +
> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
> +
> + rcu_read_lock();
> + spin_lock_irq(q->queue_lock);
> +
> + if (unlikely(!new_blkg))
> + return ERR_PTR(-ENOMEM);
> +
> + if (unlikely(blk_queue_bypass(q))) {
> + blkg_free(new_blkg);
> + return ERR_PTR(blk_queue_dying(q) ?
> + -ENODEV : -EBUSY);
> + }
> +
> + blkg = blkg_create(pos, q, new_blkg);
> + } else
> + blkg = blkg_create(pos, q, NULL);

So, while I'm okay with the approach, we're now creating a hybrid
where we have both a pre-allocation mechanism and an allocation-mode
switch. If we're going to take this route, I think the right thing to
do is to pass @gfp_mask all the way down to blkg_create().

> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
> {
> struct gendisk *disk;
> struct blkcg_gq *blkg;
> + struct request_queue *q;
> struct module *owner;
> unsigned int major, minor;
> int key_len, part, ret;
> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
> return -ENODEV;
> }
>
> + q = disk->queue;
> +
> rcu_read_lock();
> - spin_lock_irq(disk->queue->queue_lock);
> + spin_lock_irq(q->queue_lock);
>
> - if (blkcg_policy_enabled(disk->queue, pol))
> - blkg = blkg_lookup_create(blkcg, disk->queue);
> - else
> + if (blkcg_policy_enabled(q, pol)) {
> + blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
> +
> + /*
> + * blkg_lookup_create() may have dropped and reacquired the
> + * queue lock. Check policy enabled state again.
> + */
> + if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
> + blkg = ERR_PTR(-EOPNOTSUPP);

And let blkg_create() verify these conditions after releasing and
regrabbing the lock.

This also means that the init path can simply pass in GFP_KERNEL.

Thanks.

--
tejun

2017-03-02 22:43:51

by Tahsin Erdogan

Subject: Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Thu, Mar 2, 2017 at 11:32 AM, Tejun Heo <[email protected]> wrote:
> Hello, Tahsin.
>
> On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
>> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>> struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> - struct request_queue *q)
>> + struct request_queue *q, bool wait_ok)
>
> I'm okay with this direction, but it would probably be better if the
> parameter were a gfp mask and we branched on __GFP_WAIT in the function.

Agreed, a gfp mask is better as a parameter.

>
>> {
>> struct blkcg_gq *blkg;
>>
>> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> parent = blkcg_parent(parent);
>> }
>>
>> - blkg = blkg_create(pos, q, NULL);
>> + if (wait_ok) {
>> + struct blkcg_gq *new_blkg;
>> +
>> + spin_unlock_irq(q->queue_lock);
>> + rcu_read_unlock();
>> +
>> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
>> +
>> + rcu_read_lock();
>> + spin_lock_irq(q->queue_lock);
>> +
>> + if (unlikely(!new_blkg))
>> + return ERR_PTR(-ENOMEM);
>> +
>> + if (unlikely(blk_queue_bypass(q))) {
>> + blkg_free(new_blkg);
>> + return ERR_PTR(blk_queue_dying(q) ?
>> + -ENODEV : -EBUSY);
>> + }
>> +
>> + blkg = blkg_create(pos, q, new_blkg);
>> + } else
>> + blkg = blkg_create(pos, q, NULL);
>
> So, while I'm okay with the approach, we're now creating a hybrid
> where we have both a pre-allocation mechanism and an allocation-mode
> switch. If we're going to take this route, I think the right thing to
> do is to pass @gfp_mask all the way down to blkg_create().
>
>> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>> {
>> struct gendisk *disk;
>> struct blkcg_gq *blkg;
>> + struct request_queue *q;
>> struct module *owner;
>> unsigned int major, minor;
>> int key_len, part, ret;
>> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>> return -ENODEV;
>> }
>>
>> + q = disk->queue;
>> +
>> rcu_read_lock();
>> - spin_lock_irq(disk->queue->queue_lock);
>> + spin_lock_irq(q->queue_lock);
>>
>> - if (blkcg_policy_enabled(disk->queue, pol))
>> - blkg = blkg_lookup_create(blkcg, disk->queue);
>> - else
>> + if (blkcg_policy_enabled(q, pol)) {
>> + blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
>> +
>> + /*
>> + * blkg_lookup_create() may have dropped and reacquired the
>> + * queue lock. Check policy enabled state again.
>> + */
>> + if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
>> + blkg = ERR_PTR(-EOPNOTSUPP);
>
> And let blkg_create() verify these conditions after releasing and
> regrabbing the lock.
>
> This also means that the init path can simply pass in GFP_KERNEL.

I tried that approach, but I encountered two issues that complicate things:

1) Pushing the blk_queue_bypass(q) check down into blkg_create() doesn't
quite work, because when blkcg_init_queue() calls blkg_create(), the
queue is still in bypassing mode.

2) Pushing down blkcg_policy_enabled() doesn't work well either,
because blkcg_init_queue() doesn't have a policy to pass down. We
could let it pass a NULL parameter, but that would make blkg_create()
uglier (see the sketch below).
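
For concreteness, under this scheme the init path would end up with a
call like the one below (which is what the v3 patch later in the thread
does), leaving blkg_create() to special-case the NULL policy:

        /* blkcg_init_queue(): queue still bypassing, no policy to check */
        blkg = blkg_lookup_create(&blkcg_root, q, GFP_KERNEL, NULL);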

2017-03-03 19:25:53

by Tejun Heo

Subject: Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hello, Tahsin.

On Thu, Mar 02, 2017 at 02:33:11PM -0800, Tahsin Erdogan wrote:
> > And let blkg_create() verify these conditions after releasing and
> > regrabbing the lock.
> >
> > This also means that the init path can simply pass in GFP_KERNEL.
>
> I tried that approach, but I encountered two issues that complicate things:
>
> 1) Pushing the blk_queue_bypass(q) check down into blkg_create() doesn't
> quite work, because when blkcg_init_queue() calls blkg_create(), the
> queue is still in bypassing mode.
>
> 2) Pushing down blkcg_policy_enabled() doesn't work well either,
> because blkcg_init_queue() doesn't have a policy to pass down. We
> could let it pass a NULL parameter, but that would make blkg_create()
> uglier.

I see. It really bothers me that we'll have two different modes
for non-atomic allocations. Can't we bind both checks to the policy
parameter? Skip them if the policy is NULL?
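
Concretely, binding both checks to @pol inside blkg_create() could look
like the sketch below, with a NULL @pol meaning an internal caller that
skips them; this is essentially the shape the v3 patch below takes:

        if (pol) {
                if (!blkcg_policy_enabled(q, pol))
                        return ERR_PTR(-EOPNOTSUPP);

                /* nothing may go through for a bypassing queue */
                if (unlikely(blk_queue_bypass(q)))
                        return ERR_PTR(blk_queue_dying(q) ?
                                       -ENODEV : -EBUSY);
        }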

Thanks.

--
tejun

2017-03-04 01:41:11

by Tahsin Erdogan

Subject: [PATCH v3] blkcg: allocate struct blkcg_gq outside request queue spinlock

blkg_conf_prep() currently calls blkg_lookup_create() while holding the
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like the one below:

pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, the percpu allocator cannot call vmalloc()
because the queue spinlock is held.

A failure in this call path gives grief to tools that are trying to
configure io weights. We see occasional failures shortly after reboots
even when the system is not under any memory pressure. Machines with
many CPUs are more vulnerable to this condition.

Update the blkg_create() function to temporarily drop the rcu and
queue locks when the gfp mask allows blocking.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tahsin Erdogan <[email protected]>
---
v3:
Pushed down all blkg allocations into blkg_create()

v2:
Moved blkg creation into blkg_lookup_create() to avoid duplicating
blkg_lookup_create() logic.

block/blk-cgroup.c | 96 +++++++++++++++++++++++++++++-----------------
include/linux/blk-cgroup.h | 20 ++++++++--
2 files changed, 76 insertions(+), 40 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2c8cc..9bc2b10f3b5a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -164,16 +164,17 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);

/*
- * If @new_blkg is %NULL, this function tries to allocate a new one as
- * necessary using %GFP_NOWAIT. @new_blkg is always consumed on return.
+ * If gfp mask allows blocking, this function temporarily drops rcu and queue
+ * locks to allocate memory.
*/
static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
- struct request_queue *q,
- struct blkcg_gq *new_blkg)
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
- struct blkcg_gq *blkg;
+ struct blkcg_gq *blkg = NULL;
struct bdi_writeback_congested *wb_congested;
int i, ret;
+ const bool drop_locks = gfpflags_allow_blocking(gfp);

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
@@ -184,24 +185,48 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}

+ if (drop_locks) {
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+ }
+
wb_congested = wb_congested_get_create(q->backing_dev_info,
- blkcg->css.id,
- GFP_NOWAIT | __GFP_NOWARN);
- if (!wb_congested) {
+ blkcg->css.id, gfp);
+ blkg = blkg_alloc(blkcg, q, gfp);
+
+ if (drop_locks) {
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+ }
+
+ if (unlikely(!wb_congested)) {
ret = -ENOMEM;
goto err_put_css;
+ } else if (unlikely(!blkg)) {
+ ret = -ENOMEM;
+ goto err_put_congested;
}

- /* allocate */
- if (!new_blkg) {
- new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
- if (unlikely(!new_blkg)) {
- ret = -ENOMEM;
+ blkg->wb_congested = wb_congested;
+
+ if (pol) {
+ WARN_ON(!drop_locks);
+
+ if (!blkcg_policy_enabled(q, pol)) {
+ ret = -EOPNOTSUPP;
+ goto err_put_congested;
+ }
+
+ /*
+ * This could be the first entry point of blkcg implementation
+ * and we shouldn't allow anything to go through for a bypassing
+ * queue.
+ */
+ if (unlikely(blk_queue_bypass(q))) {
+ ret = blk_queue_dying(q) ? -ENODEV : -EBUSY;
goto err_put_congested;
}
}
- blkg = new_blkg;
- blkg->wb_congested = wb_congested;

/* link parent */
if (blkcg_parent(blkcg)) {
@@ -250,7 +275,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
err_put_css:
css_put(&blkcg->css);
err_free_blkg:
- blkg_free(new_blkg);
+ blkg_free(blkg);
return ERR_PTR(ret);
}

@@ -258,31 +283,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
* blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
+ * @gfp: gfp mask
+ * @pol: blkcg policy (optional)
*
* Lookup blkg for the @blkcg - @q pair. If it doesn't exist, try to
* create one. blkg creation is performed recursively from blkcg_root such
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
+ * When gfp mask allows blocking, rcu and queue locks may be dropped for
+ * allocating memory. In this case, the locks will be reacquired on return.
+ *
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q)
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
struct blkcg_gq *blkg;

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);

- /*
- * This could be the first entry point of blkcg implementation and
- * we shouldn't allow anything to go through for a bypassing queue.
- */
- if (unlikely(blk_queue_bypass(q)))
- return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
-
blkg = __blkg_lookup(blkcg, q, true);
if (blkg)
return blkg;
@@ -300,7 +324,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}

- blkg = blkg_create(pos, q, NULL);
+ blkg = blkg_create(pos, q, gfp, pol);
if (pos == blkcg || IS_ERR(blkg))
return blkg;
}
@@ -789,6 +813,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
{
struct gendisk *disk;
struct blkcg_gq *blkg;
+ struct request_queue *q;
struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
@@ -812,18 +837,22 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
return -ENODEV;
}

+ q = disk->queue;
+
rcu_read_lock();
- spin_lock_irq(disk->queue->queue_lock);
+ spin_lock_irq(q->queue_lock);

- if (blkcg_policy_enabled(disk->queue, pol))
- blkg = blkg_lookup_create(blkcg, disk->queue);
- else
+ if (!blkcg_policy_enabled(q, pol))
blkg = ERR_PTR(-EOPNOTSUPP);
+ else if (unlikely(blk_queue_bypass(q)))
+ blkg = ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
+ else
+ blkg = blkg_lookup_create(blkcg, q, GFP_KERNEL, pol);

if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
rcu_read_unlock();
- spin_unlock_irq(disk->queue->queue_lock);
+ spin_unlock_irq(q->queue_lock);
owner = disk->fops->owner;
put_disk(disk);
module_put(owner);
@@ -1065,14 +1094,9 @@ int blkcg_init_queue(struct request_queue *q)

preloaded = !radix_tree_preload(GFP_KERNEL);

- /*
- * Make sure the root blkg exists and count the existing blkgs. As
- * @q is bypassing at this point, blkg_lookup_create() can't be
- * used. Open code insertion.
- */
rcu_read_lock();
spin_lock_irq(q->queue_lock);
- blkg = blkg_create(&blkcg_root, q, new_blkg);
+ blkg = blkg_lookup_create(&blkcg_root, q, GFP_KERNEL, NULL);
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7bac74..9dbe552c6ea0 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -172,7 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
struct request_queue *q, bool update_hint);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q);
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol);
int blkcg_init_queue(struct request_queue *q);
void blkcg_drain_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
@@ -694,9 +695,20 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
- blkg = blkg_lookup_create(blkcg, q);
- if (IS_ERR(blkg))
- blkg = NULL;
+
+ /*
+ * This could be the first entry point of blkcg implementation
+ * and we shouldn't allow anything to go through for a bypassing
+ * queue.
+ */
+ if (likely(!blk_queue_bypass(q))) {
+ blkg = blkg_lookup_create(blkcg, q,
+ GFP_NOWAIT | __GFP_NOWARN,
+ NULL);
+ if (IS_ERR(blkg))
+ blkg = NULL;
+ }
+
spin_unlock_irq(q->queue_lock);
}

--
2.12.0.rc1.440.g5b76565f74-goog

2017-03-04 19:23:08

by Tejun Heo

Subject: Re: [PATCH v3] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hello, Tahsin.

Generally looks good. Please see below.

> @@ -184,24 +185,48 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> + if (unlikely(!wb_congested)) {
> ret = -ENOMEM;
> goto err_put_css;
> + } else if (unlikely(!blkg)) {
> + ret = -ENOMEM;
> + goto err_put_congested;
> }

I'm not sure this error handling logic is correct. We can have any
combo of alloc failures here, right?
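
Indeed, once both allocations are attempted back to back, either pointer
can be NULL independently, so the unwind has to tolerate any combination;
the v4 revision below collapses this into a single check along these lines:

        if (unlikely(!wb_congested || !blkg)) {
                ret = -ENOMEM;
                goto err_put;   /* err_put handles a NULL wb_congested,
                                   blkg_free() a NULL blkg */
        }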

> @@ -694,9 +695,20 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
> blkg = blkg_lookup(blkcg, q);
> if (unlikely(!blkg)) {
> spin_lock_irq(q->queue_lock);
> - blkg = blkg_lookup_create(blkcg, q);
> - if (IS_ERR(blkg))
> - blkg = NULL;
> +
> + /*
> + * This could be the first entry point of blkcg implementation
> + * and we shouldn't allow anything to go through for a bypassing
> + * queue.
> + */
> + if (likely(!blk_queue_bypass(q))) {
> + blkg = blkg_lookup_create(blkcg, q,
> + GFP_NOWAIT | __GFP_NOWARN,
> + NULL);
> + if (IS_ERR(blkg))
> + blkg = NULL;
> + }

Heh, this seems a bit too fragile. I get that this covers both call
paths into lookup_create that need the checking, but it seems a bit
too nasty to do it this way. Would it be ugly if we factored out both
the policy-enabled and bypass tests into a helper and had an inner
__blkg_lookup_create() that skips them? We could call the same helper
from blkg_create() when @pol is set. Sorry that this is involving so
much bikeshedding.
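
As a sketch, the suggested helper might look like the fragment below;
blkg_check_create() is a made-up name here, and the v4 that follows
keeps the two tests open-coded rather than sharing a helper:

        /*
         * Hypothetical helper: the common sanity checks, shared by
         * blkg_lookup_create() and by blkg_create() when @pol is set.
         */
        static int blkg_check_create(struct request_queue *q,
                                     const struct blkcg_policy *pol)
        {
                if (pol && !blkcg_policy_enabled(q, pol))
                        return -EOPNOTSUPP;

                /* nothing may go through for a bypassing queue */
                if (unlikely(blk_queue_bypass(q)))
                        return blk_queue_dying(q) ? -ENODEV : -EBUSY;

                return 0;
        }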

Thanks!

--
tejun

2017-03-05 14:23:25

by Tahsin Erdogan

Subject: [PATCH v4] blkcg: allocate struct blkcg_gq outside request queue spinlock

blkg_conf_prep() currently calls blkg_lookup_create() while holding the
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like the one below:

pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, the percpu allocator cannot call vmalloc()
because the queue spinlock is held.

A failure in this call path gives grief to tools that are trying to
configure io weights. We see occasional failures shortly after reboots
even when the system is not under any memory pressure. Machines with
many CPUs are more vulnerable to this condition.

Update the blkg_create() function to temporarily drop the rcu and
queue locks when the gfp mask allows blocking.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tahsin Erdogan <[email protected]>
---
v4:
Simplified error checking in blkg_create()
Factored out __blkg_lookup_create()

v3:
Pushed down all blkg allocations into blkg_create()

v2:
Moved blkg creation into blkg_lookup_create() to avoid duplicating
blkg_lookup_create() logic.

block/blk-cgroup.c | 119 ++++++++++++++++++++++++++++++---------------
include/linux/blk-cgroup.h | 6 ++-
2 files changed, 83 insertions(+), 42 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2c8cc..f7bc93928818 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -164,16 +164,17 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);

/*
- * If @new_blkg is %NULL, this function tries to allocate a new one as
- * necessary using %GFP_NOWAIT. @new_blkg is always consumed on return.
+ * If gfp mask allows blocking, this function temporarily drops rcu and queue
+ * locks to allocate memory.
*/
static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
- struct request_queue *q,
- struct blkcg_gq *new_blkg)
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
- struct blkcg_gq *blkg;
+ struct blkcg_gq *blkg = NULL;
struct bdi_writeback_congested *wb_congested;
int i, ret;
+ const bool drop_locks = gfpflags_allow_blocking(gfp);

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
@@ -184,31 +185,52 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}

+ if (drop_locks) {
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+ }
+
wb_congested = wb_congested_get_create(q->backing_dev_info,
- blkcg->css.id,
- GFP_NOWAIT | __GFP_NOWARN);
- if (!wb_congested) {
+ blkcg->css.id, gfp);
+ blkg = blkg_alloc(blkcg, q, gfp);
+
+ if (drop_locks) {
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+ }
+
+ if (unlikely(!wb_congested || !blkg)) {
ret = -ENOMEM;
- goto err_put_css;
+ goto err_put;
}

- /* allocate */
- if (!new_blkg) {
- new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
- if (unlikely(!new_blkg)) {
- ret = -ENOMEM;
- goto err_put_congested;
+ blkg->wb_congested = wb_congested;
+
+ if (pol) {
+ WARN_ON(!drop_locks);
+
+ if (!blkcg_policy_enabled(q, pol)) {
+ ret = -EOPNOTSUPP;
+ goto err_put;
+ }
+
+ /*
+ * This could be the first entry point of blkcg implementation
+ * and we shouldn't allow anything to go through for a bypassing
+ * queue.
+ */
+ if (unlikely(blk_queue_bypass(q))) {
+ ret = blk_queue_dying(q) ? -ENODEV : -EBUSY;
+ goto err_put;
}
}
- blkg = new_blkg;
- blkg->wb_congested = wb_congested;

/* link parent */
if (blkcg_parent(blkcg)) {
blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
if (WARN_ON_ONCE(!blkg->parent)) {
ret = -ENODEV;
- goto err_put_congested;
+ goto err_put;
}
blkg_get(blkg->parent);
}
@@ -245,44 +267,43 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);

-err_put_congested:
- wb_congested_put(wb_congested);
-err_put_css:
+err_put:
+ if (wb_congested)
+ wb_congested_put(wb_congested);
css_put(&blkcg->css);
err_free_blkg:
- blkg_free(new_blkg);
+ blkg_free(blkg);
return ERR_PTR(ret);
}

/**
- * blkg_lookup_create - lookup blkg, try to create one if not there
+ * __blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
+ * @gfp: gfp mask
+ * @pol: blkcg policy (optional)
*
* Lookup blkg for the @blkcg - @q pair. If it doesn't exist, try to
* create one. blkg creation is performed recursively from blkcg_root such
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
+ * When gfp mask allows blocking, rcu and queue locks may be dropped for
+ * allocating memory. In this case, the locks will be reacquired on return.
+ *
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
*/
-struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q)
+struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
struct blkcg_gq *blkg;

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);

- /*
- * This could be the first entry point of blkcg implementation and
- * we shouldn't allow anything to go through for a bypassing queue.
- */
- if (unlikely(blk_queue_bypass(q)))
- return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
-
blkg = __blkg_lookup(blkcg, q, true);
if (blkg)
return blkg;
@@ -300,12 +321,35 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}

- blkg = blkg_create(pos, q, NULL);
+ blkg = blkg_create(pos, q, gfp, pol);
if (pos == blkcg || IS_ERR(blkg))
return blkg;
}
}

+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ *
+ * Performs an initial queue bypass check and then passes control to
+ * __blkg_lookup_create().
+ */
+struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ lockdep_assert_held(q->queue_lock);
+
+ /*
+ * This could be the first entry point of blkcg implementation and
+ * we shouldn't allow anything to go through for a bypassing queue.
+ */
+ if (unlikely(blk_queue_bypass(q)))
+ return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
+
+ return __blkg_lookup_create(blkcg, q, gfp, pol);
+}
+
static void blkg_destroy(struct blkcg_gq *blkg)
{
struct blkcg *blkcg = blkg->blkcg;
@@ -816,7 +860,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
spin_lock_irq(disk->queue->queue_lock);

if (blkcg_policy_enabled(disk->queue, pol))
- blkg = blkg_lookup_create(blkcg, disk->queue);
+ blkg = blkg_lookup_create(blkcg, disk->queue, GFP_KERNEL, pol);
else
blkg = ERR_PTR(-EOPNOTSUPP);

@@ -1065,14 +1109,9 @@ int blkcg_init_queue(struct request_queue *q)

preloaded = !radix_tree_preload(GFP_KERNEL);

- /*
- * Make sure the root blkg exists and count the existing blkgs. As
- * @q is bypassing at this point, blkg_lookup_create() can't be
- * used. Open code insertion.
- */
rcu_read_lock();
spin_lock_irq(q->queue_lock);
- blkg = blkg_create(&blkcg_root, q, new_blkg);
+ blkg = __blkg_lookup_create(&blkcg_root, q, GFP_KERNEL, NULL);
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7bac74..955903a8f6cb 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -172,7 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
struct request_queue *q, bool update_hint);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q);
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol);
int blkcg_init_queue(struct request_queue *q);
void blkcg_drain_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
@@ -694,7 +695,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
- blkg = blkg_lookup_create(blkcg, q);
+ blkg = blkg_lookup_create(blkcg, q, GFP_NOWAIT | __GFP_NOWARN,
+ NULL);
if (IS_ERR(blkg))
blkg = NULL;
spin_unlock_irq(q->queue_lock);
--
2.12.0.rc1.440.g5b76565f74-goog

2017-03-05 14:24:07

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [PATCH v4] blkcg: allocate struct blkcg_gq outside request queue spinlock

Hi Tejun,

>> @@ -184,24 +185,48 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>> + if (unlikely(!wb_congested)) {
>> ret = -ENOMEM;
>> goto err_put_css;
>> + } else if (unlikely(!blkg)) {
>> + ret = -ENOMEM;
>> + goto err_put_congested;
>> }
>
> I'm not sure this error handling logic is correct. We can have any
> combo of alloc failures here, right?

I tried to simplify this part in v4; please take a look.

>> @@ -694,9 +695,20 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
>> blkg = blkg_lookup(blkcg, q);
>> if (unlikely(!blkg)) {
>> spin_lock_irq(q->queue_lock);
>> - blkg = blkg_lookup_create(blkcg, q);
>> - if (IS_ERR(blkg))
>> - blkg = NULL;
>> +
>> + /*
>> + * This could be the first entry point of blkcg implementation
>> + * and we shouldn't allow anything to go through for a bypassing
>> + * queue.
>> + */
>> + if (likely(!blk_queue_bypass(q))) {
>> + blkg = blkg_lookup_create(blkcg, q,
>> + GFP_NOWAIT | __GFP_NOWARN,
>> + NULL);
>> + if (IS_ERR(blkg))
>> + blkg = NULL;
>> + }
>
> Heh, this seems a bit too fragile. I get that this covers both call
> paths into lookup_create that need the checking, but it seems a bit
> too nasty to do it this way. Would it be ugly if we factored out both
> the policy-enabled and bypass tests into a helper and had an inner
> __blkg_lookup_create() that skips them? We could call the same helper
> from blkg_create() when @pol is set. Sorry that this is involving so
> much bikeshedding.

I have added a __blkg_lookup_create() and moved the queue bypass check
into blkg_lookup_create(). This probably doesn't cover all your points,
but I hope it is going in the right direction. Do you still think we
need a helper function for the policy/queue-bypass checks?



2017-03-06 20:04:30

by Tejun Heo

Subject: Re: [PATCH v4] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Sun, Mar 05, 2017 at 06:12:43AM -0800, Tahsin Erdogan wrote:
> blkg_conf_prep() currently calls blkg_lookup_create() while holding the
> request queue spinlock. This means allocating memory for struct
> blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
> failures in call paths like the one below:
>
> [...]
>
> Suggested-by: Tejun Heo <[email protected]>
> Signed-off-by: Tahsin Erdogan <[email protected]>

Cc: [email protected] # v4.2+

Looks good to me.

Acked-by: Tejun Heo <[email protected]>

Thanks for your patience!

--
tejun

2017-03-09 05:49:01

by kernel test robot

Subject: [lkp-robot] [blkcg] ad63af3cb7: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h


FYI, we noticed the following commit:

commit: ad63af3cb70378a7f780dbef2387a6d13e53a6c9 ("blkcg: allocate struct blkcg_gq outside request queue spinlock")
url: https://github.com/0day-ci/linux/commits/Tahsin-Erdogan/blkcg-allocate-struct-blkcg_gq-outside-request-queue-spinlock/20170307-030921
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 512M

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):


+----------------------------------------------------------------+------------+------------+
| | 3695539290 | ad63af3cb7 |
+----------------------------------------------------------------+------------+------------+
| boot_successes | 12 | 12 |
| boot_failures | 8 | 20 |
| BUG:kernel_hang_in_test_stage | 8 | |
| BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h | 0 | 20 |
+----------------------------------------------------------------+------------+------------+



[ 23.511528] BUG: sleeping function called from invalid context at mm/slab.h:408
[ 23.543085] in_atomic(): 1, irqs_disabled(): 0, pid: 130, name: udevd
[ 23.563283] CPU: 0 PID: 130 Comm: udevd Not tainted 4.10.0-rc7-00236-gad63af3 #1
[ 23.592056] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 23.625535] Call Trace:
[ 23.638139] dump_stack+0x63/0x8a
[ 23.652227] ___might_sleep+0xd3/0x120
[ 23.667071] __might_sleep+0x4a/0x80
[ 23.681760] kmem_cache_alloc_node_trace+0x1d6/0x1f0
[ 23.699087] blkg_alloc+0x4b/0x230
[ 23.713446] blkg_lookup_create+0x3dc/0x610
[ 23.729228] ? blkg_alloc+0x158/0x230
[ 23.744066] blkcg_init_queue+0x62/0x100
[ 23.759658] blk_alloc_queue_node+0x25a/0x2c0
[ 23.775866] ? set_fdc+0x130/0x130 [floppy]
[ 23.791487] blk_init_queue_node+0x20/0x60
[ 23.807233] blk_init_queue+0x13/0x20
[ 23.822104] floppy_module_init+0x234/0xee2 [floppy]
[ 23.838763] ? vunmap_page_range+0x221/0x390
[ 23.853604] ? set_cmos+0x68/0x68 [floppy]
[ 23.868345] do_one_initcall+0x43/0x180
[ 23.883582] ? __might_sleep+0x4a/0x80
[ 23.898621] ? kmem_cache_alloc_trace+0x163/0x1b0
[ 23.915435] do_init_module+0x5f/0x1f8
[ 23.930273] load_module+0x149e/0x1b10
[ 23.945372] ? __symbol_put+0x40/0x40
[ 23.960025] ? kernel_read_file+0x1a3/0x1c0
[ 23.975740] ? kernel_read_file_from_fd+0x49/0x80
[ 23.992559] SYSC_finit_module+0xbc/0xf0
[ 24.007991] SyS_finit_module+0xe/0x10
[ 24.022920] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 24.039327] RIP: 0033:0x7f9380e1f4a9
[ 24.053934] RSP: 002b:00007ffc09ff8698 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 24.082614] RAX: ffffffffffffffda RBX: 0000000000a64c20 RCX: 00007f9380e1f4a9
[ 24.103686] RDX: 0000000000000000 RSI: 00007f93810eb0aa RDI: 0000000000000007
[ 24.124749] RBP: 00007ffc09ff8690 R08: 0000000000000000 R09: 0000000000a63490
[ 24.146260] R10: 0000000000000007 R11: 0000000000000202 R12: 0000000000a64c20
[ 24.167489] R13: 0000000000000000 R14: 0000000000a5a2f0 R15: 0000000000a63490
[ 24.191631] parport_pc 00:04: reported by Plug and Play ACPI
[ 24.254724] piix4_smbus 0000:00:01.3: SMBus Host Controller at 0x700, revision 0
[ 24.344624] libata version 3.00 loaded.
[ 24.355991] ata_piix 0000:00:01.1: version 2.13
[ 24.380594] input: PC Speaker as /devices/platform/pcspkr/input/input4
[ 24.454024] scsi host0: ata_piix
[ 24.460472] scsi host1: ata_piix
[ 24.460962] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc080 irq 14
[ 24.460985] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc088 irq 15
[ 24.483223] Error: Driver 'pcspkr' is already registered, aborting...
[ 24.640449] ata2.01: NODEV after polling detection
[ 24.643436] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
[ 24.649435] ata2.00: configured for MWDMA2
[ 24.653487] BUG: sleeping function called from invalid context at mm/slab.h:408
[ 24.653510] in_atomic(): 1, irqs_disabled(): 0, pid: 5, name: kworker/u2:0
[ 24.653515] CPU: 0 PID: 5 Comm: kworker/u2:0 Tainted: G W 4.10.0-rc7-00236-gad63af3 #1
[ 24.653518] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 24.653548] Workqueue: events_unbound async_run_entry_fn
[ 24.653551] Call Trace:
[ 24.653584] dump_stack+0x63/0x8a
[ 24.653609] ___might_sleep+0xd3/0x120
[ 24.653611] __might_sleep+0x4a/0x80
[ 24.653616] kmem_cache_alloc_node_trace+0x1d6/0x1f0
[ 24.653643] blkg_alloc+0x4b/0x230
[ 24.653647] blkg_lookup_create+0x3dc/0x610
[ 24.653670] ? blkg_alloc+0x158/0x230
[ 24.653674] blkcg_init_queue+0x62/0x100
[ 24.653678] blk_alloc_queue_node+0x25a/0x2c0
[ 24.653706] scsi_alloc_queue+0x1e/0xf0
[ 24.653709] scsi_alloc_sdev+0x1d3/0x300
[ 24.653712] scsi_probe_and_add_lun+0x99d/0xe70
[ 24.653738] ? rpm_resume+0x1a6/0x6e0
[ 24.653742] ? __pm_runtime_resume+0x5b/0x90
[ 24.653745] __scsi_add_device+0xff/0x110
[ 24.653876] ata_scsi_scan_host+0xa3/0x1d0 [libata]
[ 24.653937] async_port_probe+0x43/0x60 [libata]
[ 24.653941] async_run_entry_fn+0x39/0x170
[ 24.653966] process_one_work+0x1a3/0x480
[ 24.653970] ? try_to_del_timer_sync+0x4b/0x60
[ 24.653973] worker_thread+0x4e/0x4d0
[ 24.653997] kthread+0x10c/0x140
[ 24.654000] ? process_one_work+0x480/0x480
[ 24.654003] ? kthread_create_on_node+0x40/0x40
[ 24.654028] ret_from_fork+0x2c/0x40
[ 24.656370] scsi 1:0:0:0: CD-ROM QEMU QEMU DVD-ROM 2.5+ PQ: 0 ANSI: 5
[ 24.745810] ppdev: user-space parallel port driver
[ 24.849575] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray
[ 24.849599] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 24.854112] sr 1:0:0:0: Attached scsi CD-ROM sr0
[ 25.653532] BUG: sleeping function called from invalid context at mm/slab.h:408
[ 25.653536] in_atomic(): 1, irqs_disabled(): 0, pid: 130, name: udevd
[ 25.653542] CPU: 0 PID: 130 Comm: udevd Tainted: G W 4.10.0-rc7-00236-gad63af3 #1
[ 25.653564] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 25.653566] Call Trace:
[ 25.653603] dump_stack+0x63/0x8a
[ 25.653630] ___might_sleep+0xd3/0x120
[ 25.653634] __might_sleep+0x4a/0x80
[ 25.653639] kmem_cache_alloc_node_trace+0x1d6/0x1f0
[ 25.653666] blkg_alloc+0x4b/0x230
[ 25.653671] blkg_lookup_create+0x3dc/0x610
[ 25.653695] ? blkg_alloc+0x158/0x230
[ 25.653701] blkcg_init_queue+0x62/0x100
[ 25.653726] blk_alloc_queue_node+0x25a/0x2c0
[ 25.653760] ? set_fdc+0x130/0x130 [floppy]
[ 25.653802] blk_init_queue_node+0x20/0x60
[ 25.653827] blk_init_queue+0x13/0x20
[ 25.653856] floppy_module_init+0x234/0xee2 [floppy]
[ 25.653861] ? vunmap_page_range+0x221/0x390
[ 25.653889] ? set_cmos+0x68/0x68 [floppy]
[ 25.653895] do_one_initcall+0x43/0x180
[ 25.653898] ? __might_sleep+0x4a/0x80
[ 25.653923] ? kmem_cache_alloc_trace+0x163/0x1b0
[ 25.653932] do_init_module+0x5f/0x1f8
[ 25.653960] load_module+0x149e/0x1b10
[ 25.653964] ? __symbol_put+0x40/0x40
[ 25.653990] ? kernel_read_file+0x1a3/0x1c0
[ 25.653995] ? kernel_read_file_from_fd+0x49/0x80
[ 25.654020] SYSC_finit_module+0xbc/0xf0
[ 25.654025] SyS_finit_module+0xe/0x10
[ 25.654054] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 25.654058] RIP: 0033:0x7f9380e1f4a9
[ 25.654060] RSP: 002b:00007ffc09ff8698 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 25.654085] RAX: ffffffffffffffda RBX: 0000000000a64c20 RCX: 00007f9380e1f4a9
[ 25.654088] RDX: 0000000000000000 RSI: 00007f93810eb0aa RDI: 0000000000000007
[ 25.654090] RBP: 00007ffc09ff8690 R08: 0000000000000000 R09: 0000000000a63490
[ 25.654092] R10: 0000000000000007 R11: 0000000000000202 R12: 0000000000a64c20
[ 25.654115] R13: 0000000000000000 R14: 0000000000a5a2f0 R15: 0000000000a63490
[ 25.669580] Floppy drive(s): fd0 is 2.88M AMI BIOS
[ 25.691745] FDC 0 is a S82078B
[ 26.450439] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong



2017-03-09 07:59:57

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [lkp-robot] [blkcg] ad63af3cb7: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h

This is a good catch!

I will post a v5 of the patch shortly to the other email thread.

On Wed, Mar 8, 2017 at 9:25 PM, kernel test robot <[email protected]> wrote:
>
> FYI, we noticed the following commit:
>
> commit: ad63af3cb70378a7f780dbef2387a6d13e53a6c9 ("blkcg: allocate struct blkcg_gq outside request queue spinlock")
> url: https://github.com/0day-ci/linux/commits/Tahsin-Erdogan/blkcg-allocate-struct-blkcg_gq-outside-request-queue-spinlock/20170307-030921
> base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 512M

2017-03-09 08:06:30

by Tahsin Erdogan

[permalink] [raw]
Subject: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:

pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.

A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.

Update blkg_create() function to temporarily drop the rcu and queue
locks when it is allowed by gfp mask.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tahsin Erdogan <[email protected]>
---
v5:
Removed stale blkg_alloc() in blkcg_init_queue()

Pushed down radix_tree_preload() into blkg_create() because it
disables preemption on return, which makes blocking memory
allocations unsafe afterwards (see the ordering sketch after this
patch).

v4:
Simplified error checking in blkg_create()
Factored out __blkg_lookup_create()

v3:
Pushed down all blkg allocations into blkg_create()

v2:
Moved blkg creation into blkg_lookup_create() to avoid duplicating
blkg_lookup_create() logic.

block/blk-cgroup.c | 138 ++++++++++++++++++++++++++++-----------------
include/linux/blk-cgroup.h | 6 +-
2 files changed, 91 insertions(+), 53 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bbe7ee00bd3d..bdf87f0c1b1b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -165,16 +165,18 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);

/*
- * If @new_blkg is %NULL, this function tries to allocate a new one as
- * necessary using %GFP_NOWAIT. @new_blkg is always consumed on return.
+ * If gfp mask allows blocking, this function temporarily drops rcu and queue
+ * locks to allocate memory.
*/
static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
- struct request_queue *q,
- struct blkcg_gq *new_blkg)
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
- struct blkcg_gq *blkg;
+ struct blkcg_gq *blkg = NULL;
struct bdi_writeback_congested *wb_congested;
int i, ret;
+ const bool drop_locks = gfpflags_allow_blocking(gfp);
+ bool preloaded = false;

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
@@ -185,31 +187,53 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}

+ if (drop_locks) {
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+ }
+
wb_congested = wb_congested_get_create(q->backing_dev_info,
- blkcg->css.id,
- GFP_NOWAIT | __GFP_NOWARN);
- if (!wb_congested) {
+ blkcg->css.id, gfp);
+ blkg = blkg_alloc(blkcg, q, gfp);
+
+ if (drop_locks) {
+ preloaded = !radix_tree_preload(gfp);
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+ }
+
+ if (unlikely(!wb_congested || !blkg)) {
ret = -ENOMEM;
- goto err_put_css;
+ goto err_put;
}

- /* allocate */
- if (!new_blkg) {
- new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
- if (unlikely(!new_blkg)) {
- ret = -ENOMEM;
- goto err_put_congested;
+ blkg->wb_congested = wb_congested;
+
+ if (pol) {
+ WARN_ON(!drop_locks);
+
+ if (!blkcg_policy_enabled(q, pol)) {
+ ret = -EOPNOTSUPP;
+ goto err_put;
+ }
+
+ /*
+ * This could be the first entry point of blkcg implementation
+ * and we shouldn't allow anything to go through for a bypassing
+ * queue.
+ */
+ if (unlikely(blk_queue_bypass(q))) {
+ ret = blk_queue_dying(q) ? -ENODEV : -EBUSY;
+ goto err_put;
}
}
- blkg = new_blkg;
- blkg->wb_congested = wb_congested;

/* link parent */
if (blkcg_parent(blkcg)) {
blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
if (WARN_ON_ONCE(!blkg->parent)) {
ret = -ENODEV;
- goto err_put_congested;
+ goto err_put;
}
blkg_get(blkg->parent);
}
@@ -236,6 +260,9 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
pol->pd_online_fn(blkg->pd[i]);
}
}
+
+ if (preloaded)
+ radix_tree_preload_end();
blkg->online = true;
spin_unlock(&blkcg->lock);

@@ -246,44 +273,45 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);

-err_put_congested:
- wb_congested_put(wb_congested);
-err_put_css:
+err_put:
+ if (preloaded)
+ radix_tree_preload_end();
+ if (wb_congested)
+ wb_congested_put(wb_congested);
css_put(&blkcg->css);
err_free_blkg:
- blkg_free(new_blkg);
+ blkg_free(blkg);
return ERR_PTR(ret);
}

/**
- * blkg_lookup_create - lookup blkg, try to create one if not there
+ * __blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
+ * @gfp: gfp mask
+ * @pol: blkcg policy (optional)
*
* Lookup blkg for the @blkcg - @q pair. If it doesn't exist, try to
* create one. blkg creation is performed recursively from blkcg_root such
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
+ * When gfp mask allows blocking, rcu and queue locks may be dropped for
+ * allocating memory. In this case, the locks will be reacquired on return.
+ *
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
*/
-struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q)
+struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
{
struct blkcg_gq *blkg;

WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);

- /*
- * This could be the first entry point of blkcg implementation and
- * we shouldn't allow anything to go through for a bypassing queue.
- */
- if (unlikely(blk_queue_bypass(q)))
- return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
-
blkg = __blkg_lookup(blkcg, q, true);
if (blkg)
return blkg;
@@ -301,12 +329,35 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}

- blkg = blkg_create(pos, q, NULL);
+ blkg = blkg_create(pos, q, gfp, pol);
if (pos == blkcg || IS_ERR(blkg))
return blkg;
}
}

+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ *
+ * Performs an initial queue bypass check and then passes control to
+ * __blkg_lookup_create().
+ */
+struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ lockdep_assert_held(q->queue_lock);
+
+ /*
+ * This could be the first entry point of blkcg implementation and
+ * we shouldn't allow anything to go through for a bypassing queue.
+ */
+ if (unlikely(blk_queue_bypass(q)))
+ return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
+
+ return __blkg_lookup_create(blkcg, q, gfp, pol);
+}
+
static void blkg_destroy(struct blkcg_gq *blkg)
{
struct blkcg *blkcg = blkg->blkcg;
@@ -817,7 +868,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
spin_lock_irq(disk->queue->queue_lock);

if (blkcg_policy_enabled(disk->queue, pol))
- blkg = blkg_lookup_create(blkcg, disk->queue);
+ blkg = blkg_lookup_create(blkcg, disk->queue, GFP_KERNEL, pol);
else
blkg = ERR_PTR(-EOPNOTSUPP);

@@ -1056,30 +1107,15 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
*/
int blkcg_init_queue(struct request_queue *q)
{
- struct blkcg_gq *new_blkg, *blkg;
- bool preloaded;
+ struct blkcg_gq *blkg;
int ret;

- new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
- if (!new_blkg)
- return -ENOMEM;
-
- preloaded = !radix_tree_preload(GFP_KERNEL);
-
- /*
- * Make sure the root blkg exists and count the existing blkgs. As
- * @q is bypassing at this point, blkg_lookup_create() can't be
- * used. Open code insertion.
- */
rcu_read_lock();
spin_lock_irq(q->queue_lock);
- blkg = blkg_create(&blkcg_root, q, new_blkg);
+ blkg = __blkg_lookup_create(&blkcg_root, q, GFP_KERNEL, NULL);
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();

- if (preloaded)
- radix_tree_preload_end();
-
if (IS_ERR(blkg))
return PTR_ERR(blkg);

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7bac74..955903a8f6cb 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -172,7 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
struct request_queue *q, bool update_hint);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q);
+ struct request_queue *q, gfp_t gfp,
+ const struct blkcg_policy *pol);
int blkcg_init_queue(struct request_queue *q);
void blkcg_drain_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
@@ -694,7 +695,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
- blkg = blkg_lookup_create(blkcg, q);
+ blkg = blkg_lookup_create(blkcg, q, GFP_NOWAIT | __GFP_NOWARN,
+ NULL);
if (IS_ERR(blkg))
blkg = NULL;
spin_unlock_irq(q->queue_lock);
--
2.12.0.246.ga2ecc84866-goog
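
The radix_tree_preload() note in the changelog is worth spelling out:
radix_tree_preload() returns with preemption disabled, so any blocking
allocation has to happen before it. A minimal sketch of the safe
ordering, modeled on the pre-patch blkcg_init_queue() (simplified,
with error handling trimmed):

        /* blocking allocation first, while sleeping is still allowed */
        new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
        if (!new_blkg)
                return -ENOMEM;

        /* disables preemption on success; no sleeping past this point */
        preloaded = !radix_tree_preload(GFP_KERNEL);

        rcu_read_lock();
        spin_lock_irq(q->queue_lock);
        blkg = blkg_create(&blkcg_root, q, new_blkg);  /* consumes new_blkg */
        spin_unlock_irq(q->queue_lock);
        rcu_read_unlock();

        if (preloaded)
                radix_tree_preload_end();  /* preemption back on */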

2017-03-09 18:27:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Thu, Mar 09, 2017 at 12:05:31AM -0800, Tahsin Erdogan wrote:
> blkg_conf_prep() currently calls blkg_lookup_create() while holding
> request queue spinlock. This means allocating memory for struct
> blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
> failures in call paths like below:
>
> pcpu_alloc+0x68f/0x710
> __alloc_percpu_gfp+0xd/0x10
> __percpu_counter_init+0x55/0xc0
> cfq_pd_alloc+0x3b2/0x4e0
> blkg_alloc+0x187/0x230
> blkg_create+0x489/0x670
> blkg_lookup_create+0x9a/0x230
> blkg_conf_prep+0x1fb/0x240
> __cfqg_set_weight_device.isra.105+0x5c/0x180
> cfq_set_weight_on_dfl+0x69/0xc0
> cgroup_file_write+0x39/0x1c0
> kernfs_fop_write+0x13f/0x1d0
> __vfs_write+0x23/0x120
> vfs_write+0xc2/0x1f0
> SyS_write+0x44/0xb0
> entry_SYSCALL_64_fastpath+0x18/0xad
>
> In the code path above, percpu allocator cannot call vmalloc() due to
> queue spinlock.
>
> A failure in this call path gives grief to tools which are trying to
> configure io weights. We see occasional failures happen shortly after
> reboots even when system is not under any memory pressure. Machines
> with a lot of cpus are more vulnerable to this condition.
>
> Update blkg_create() function to temporarily drop the rcu and queue
> locks when it is allowed by gfp mask.
>
> Suggested-by: Tejun Heo <[email protected]>
> Signed-off-by: Tahsin Erdogan <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2017-03-11 22:42:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

> @@ -185,31 +187,53 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> goto err_free_blkg;
> }
>
> + if (drop_locks) {
> + spin_unlock_irq(q->queue_lock);
> + rcu_read_unlock();
> + }

I have a general dislike for code like that, where you conditionally
drop locks. And this one seems even worse, since the knowledge of
whether the locks should/could be dropped or not is embedded in the gfp
flags.
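
To make the hazard concrete, here is a minimal sketch (illustrative
only, not the patch's literal code) of what a caller sees when the
callee's locking behavior depends on the gfp mask:

        spin_lock_irq(q->queue_lock);
        /* lookups and invariant checks done here ... */
        blkg = blkg_create(blkcg, q, GFP_KERNEL, pol);
        /*
         * The lock is held again at this point, but everything observed
         * before the call may be stale: blkg_create() released the lock
         * while its allocations slept.
         */
        spin_unlock_irq(q->queue_lock);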

> +/**
> + * blkg_lookup_create - lookup blkg, try to create one if not there
> + *
> + * Performs an initial queue bypass check and then passes control to
> + * __blkg_lookup_create().
> + */
> +struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> + struct request_queue *q, gfp_t gfp,
> + const struct blkcg_policy *pol)
> +{
> + WARN_ON_ONCE(!rcu_read_lock_held());
> + lockdep_assert_held(q->queue_lock);

This seems problematic, as blkcg_bio_issue_check() calls this with the
rcu read lock held.

--
Jens Axboe

2017-03-11 22:52:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

On 03/11/2017 03:42 PM, Jens Axboe wrote:
>> @@ -185,31 +187,53 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>> goto err_free_blkg;
>> }
>>
>> + if (drop_locks) {
>> + spin_unlock_irq(q->queue_lock);
>> + rcu_read_unlock();
>> + }
>
> I have a general dislike for code like that, where you conditionally
> drop locks. And this one seems even worse, since the knowledge of
> whether the locks should/could be dropped or not is embedded in the gfp
> flags.

Talked to Tejun about this as well, and we both agree that splitting
this into separate init/alloc paths would be much cleaner. I can't
apply the current patch, sorry, it's just too ugly to live.

>> +/**
>> + * blkg_lookup_create - lookup blkg, try to create one if not there
>> + *
>> + * Performs an initial queue bypass check and then passes control to
>> + * __blkg_lookup_create().
>> + */
>> +struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> + struct request_queue *q, gfp_t gfp,
>> + const struct blkcg_policy *pol)
>> +{
>> + WARN_ON_ONCE(!rcu_read_lock_held());
>> + lockdep_assert_held(q->queue_lock);
>
> This seems problematic, as blkcg_bio_issue_check() calls this with the
> rcu read lock held.

Brain fart, that part is fine.

--
Jens Axboe

2017-03-12 04:35:50

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Sat, Mar 11, 2017 at 2:52 PM, Jens Axboe <[email protected]> wrote:

>
> Talked to Tejun about this as well, and we both agree that splitting
> this into separate init/alloc paths would be much cleaner. I can't
> apply the current patch, sorry, it's just too ugly to live.

Do you mean you prefer the approach taken in the v1 patch, or
something else?

2017-03-13 14:33:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

On 03/11/2017 09:35 PM, Tahsin Erdogan wrote:
> On Sat, Mar 11, 2017 at 2:52 PM, Jens Axboe <[email protected]> wrote:
>
>>
>> Talked to Tejun about this as well, and we both agree that splitting
>> this into separate init/alloc paths would be much cleaner. I can't
>> apply the current patch, sorry, it's just too ugly to live.
>
> Do you mean you prefer the approach taken in the v1 patch, or
> something else?

I can no longer find v1 of the patch, just v2 and on. Can you send a
link to it?

--
Jens Axboe

2017-03-13 16:18:06

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

>> Do you mean you prefer the approach taken in the v1 patch, or
>> something else?
>
> I can no longer find v1 of the patch, just v2 and on. Can you send a
> link to it?

https://lkml.org/lkml/2017/2/28/8

2017-03-24 21:57:03

by Tahsin Erdogan

[permalink] [raw]
Subject: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:

pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.

A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.

Do struct blkcg_gq allocations outside the queue spinlock to allow
blocking during memory allocations.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tahsin Erdogan <[email protected]>
---
v6:
Due to Jens' objection to conditionally dropping locks based on gfp
flags, go back to the v1 approach.
Perform queue bypass and policy-enabled checks at every iteration
(distilled in the sketch after this patch).
Add blkg_lookup_check() to reduce code duplication.

v5:
Removed stale blkg_alloc() in blkcg_init_queue()

Pushed down radix_tree_preload() into blkg_create() because it
disables preemption on return, which makes it unsafe to perform
blocking memory allocations.

v4:
Simplified error checking in blkg_create()
Factored out __blkg_lookup_create()

v3:
Pushed down all blkg allocations into blkg_create()

v2:
Moved blkg creation into blkg_lookup_create() to avoid duplicating
blkg_lookup_create() logic.

block/blk-cgroup.c | 123 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 98 insertions(+), 25 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bbe7ee00bd3d..7c2947128f58 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -772,6 +772,27 @@ struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
}
EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);

+/* Performs queue bypass and policy enabled checks then looks up blkg. */
+static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg,
+ const struct blkcg_policy *pol,
+ struct request_queue *q)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ lockdep_assert_held(q->queue_lock);
+
+ if (!blkcg_policy_enabled(q, pol))
+ return ERR_PTR(-EOPNOTSUPP);
+
+ /*
+ * This could be the first entry point of blkcg implementation and
+ * we shouldn't allow anything to go through for a bypassing queue.
+ */
+ if (unlikely(blk_queue_bypass(q)))
+ return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
+
+ return __blkg_lookup(blkcg, q, true /* update_hint */);
+}
+
/**
* blkg_conf_prep - parse and prepare for per-blkg config update
* @blkcg: target block cgroup
@@ -789,6 +810,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
__acquires(rcu) __acquires(disk->queue->queue_lock)
{
struct gendisk *disk;
+ struct request_queue *q;
struct blkcg_gq *blkg;
struct module *owner;
unsigned int major, minor;
@@ -807,44 +829,95 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
if (!disk)
return -ENODEV;
if (part) {
- owner = disk->fops->owner;
- put_disk(disk);
- module_put(owner);
- return -ENODEV;
+ ret = -ENODEV;
+ goto fail;
}

- rcu_read_lock();
- spin_lock_irq(disk->queue->queue_lock);
+ q = disk->queue;

- if (blkcg_policy_enabled(disk->queue, pol))
- blkg = blkg_lookup_create(blkcg, disk->queue);
- else
- blkg = ERR_PTR(-EOPNOTSUPP);
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);

+ blkg = blkg_lookup_check(blkcg, pol, q);
if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
+ goto fail_unlock;
+ }
+
+ if (blkg)
+ goto success;
+
+ /*
+ * Create blkgs walking down from blkcg_root to @blkcg, so that all
+ * non-root blkgs have access to their parents.
+ */
+ while (true) {
+ struct blkcg *pos = blkcg;
+ struct blkcg *parent;
+ struct blkcg_gq *new_blkg;
+
+ parent = blkcg_parent(blkcg);
+ while (parent && !__blkg_lookup(parent, q, false)) {
+ pos = parent;
+ parent = blkcg_parent(parent);
+ }
+
+ /* Drop locks to do new blkg allocation with GFP_KERNEL. */
+ spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
- spin_unlock_irq(disk->queue->queue_lock);
- owner = disk->fops->owner;
- put_disk(disk);
- module_put(owner);
- /*
- * If queue was bypassing, we should retry. Do so after a
- * short msleep(). It isn't strictly necessary but queue
- * can be bypassing for some time and it's always nice to
- * avoid busy looping.
- */
- if (ret == -EBUSY) {
- msleep(10);
- ret = restart_syscall();
+
+ new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
+ if (unlikely(!new_blkg)) {
+ ret = -ENOMEM;
+ goto fail;
}
- return ret;
- }

+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+
+ blkg = blkg_lookup_check(pos, pol, q);
+ if (IS_ERR(blkg)) {
+ ret = PTR_ERR(blkg);
+ goto fail_unlock;
+ }
+
+ if (blkg) {
+ blkg_free(new_blkg);
+ } else {
+ blkg = blkg_create(pos, q, new_blkg);
+ if (unlikely(IS_ERR(blkg))) {
+ ret = PTR_ERR(blkg);
+ goto fail_unlock;
+ }
+ }
+
+ if (pos == blkcg)
+ goto success;
+ }
+success:
ctx->disk = disk;
ctx->blkg = blkg;
ctx->body = body;
return 0;
+
+fail_unlock:
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+fail:
+ owner = disk->fops->owner;
+ put_disk(disk);
+ module_put(owner);
+ /*
+ * If queue was bypassing, we should retry. Do so after a
+ * short msleep(). It isn't strictly necessary but queue
+ * can be bypassing for some time and it's always nice to
+ * avoid busy looping.
+ */
+ if (ret == -EBUSY) {
+ msleep(10);
+ ret = restart_syscall();
+ }
+ return ret;
}
EXPORT_SYMBOL_GPL(blkg_conf_prep);

--
2.12.1.578.ge9c3154ca4-goog
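
The loop at the heart of this version is easiest to see with the
policy checks and error handling stripped away. A distilled sketch
(illustrative; highest_ancestor_without_blkg() is a stand-in for the
inner parent-walk loop, not a real kernel function):

        while (true) {
                struct blkcg *pos = highest_ancestor_without_blkg(blkcg, q);
                struct blkcg_gq *new_blkg;

                /* drop the locks so the allocation may sleep */
                spin_unlock_irq(q->queue_lock);
                rcu_read_unlock();

                new_blkg = blkg_alloc(pos, q, GFP_KERNEL);

                rcu_read_lock();
                spin_lock_irq(q->queue_lock);

                /* re-validate: the world may have changed while we slept */
                blkg = __blkg_lookup(pos, q, false);
                if (blkg)
                        blkg_free(new_blkg);  /* raced; discard ours */
                else
                        blkg = blkg_create(pos, q, new_blkg);

                if (pos == blkcg)
                        break;  /* created down to the target level */
        }

Each pass creates at most one level of the hierarchy, which is why the
bypass and policy-enabled checks also have to be redone on every
iteration after the locks are reacquired.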

2017-03-24 22:05:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

On 03/24/2017 03:56 PM, Tahsin Erdogan wrote:
> Do struct blkcg_gq allocations outside the queue spinlock to allow
> blocking during memory allocations.

This looks much simpler/cleaner to me, compared to v5. Tejun, what do
you think?

--
Jens Axboe

2017-03-26 10:54:55

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

Is an unlock needed before line 848? Or does blkg_lookup_check take care
of it?

julia

---------- Forwarded message ----------
Date: Sun, 26 Mar 2017 07:50:17 +0800
From: kbuild test robot <[email protected]>
To: [email protected]
Cc: Julia Lawall <[email protected]>
Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue
spinlock

CC: [email protected]
In-Reply-To: <[email protected]>
TO: Tahsin Erdogan <[email protected]>
CC: Tejun Heo <[email protected]>, Jens Axboe <[email protected]>, [email protected], David Rientjes <[email protected]>, [email protected], Tahsin Erdogan <[email protected]>

Hi Tahsin,

[auto build test WARNING on block/for-next]
[also build test WARNING on v4.11-rc3 next-20170324]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Tahsin-Erdogan/blkcg-allocate-struct-blkcg_gq-outside-request-queue-spinlock/20170326-020755
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
:::::: branch date: 6 hours ago
:::::: commit date: 6 hours ago

>> block/blk-cgroup.c:901:1-7: preceding lock on line 839
block/blk-cgroup.c:901:1-7: preceding lock on line 876

git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout d6344a76907e470f483dcb19998d438d19f6432d
vim +901 block/blk-cgroup.c

d6344a76 Tahsin Erdogan 2017-03-24 833 goto fail;
5f6c2d2b Tejun Heo 2015-07-22 834 }
e56da7e2 Tejun Heo 2012-03-05 835
d6344a76 Tahsin Erdogan 2017-03-24 836 q = disk->queue;
d6344a76 Tahsin Erdogan 2017-03-24 837
e56da7e2 Tejun Heo 2012-03-05 838 rcu_read_lock();
d6344a76 Tahsin Erdogan 2017-03-24 @839 spin_lock_irq(q->queue_lock);
da8b0662 Tejun Heo 2012-04-13 840
d6344a76 Tahsin Erdogan 2017-03-24 841 blkg = blkg_lookup_check(blkcg, pol, q);
d6344a76 Tahsin Erdogan 2017-03-24 842 if (IS_ERR(blkg)) {
d6344a76 Tahsin Erdogan 2017-03-24 843 ret = PTR_ERR(blkg);
d6344a76 Tahsin Erdogan 2017-03-24 844 goto fail_unlock;
d6344a76 Tahsin Erdogan 2017-03-24 845 }
d6344a76 Tahsin Erdogan 2017-03-24 846
d6344a76 Tahsin Erdogan 2017-03-24 847 if (blkg)
d6344a76 Tahsin Erdogan 2017-03-24 848 goto success;
d6344a76 Tahsin Erdogan 2017-03-24 849
d6344a76 Tahsin Erdogan 2017-03-24 850 /*
d6344a76 Tahsin Erdogan 2017-03-24 851 * Create blkgs walking down from blkcg_root to @blkcg, so that all
d6344a76 Tahsin Erdogan 2017-03-24 852 * non-root blkgs have access to their parents.
d6344a76 Tahsin Erdogan 2017-03-24 853 */
d6344a76 Tahsin Erdogan 2017-03-24 854 while (true) {
d6344a76 Tahsin Erdogan 2017-03-24 855 struct blkcg *pos = blkcg;
d6344a76 Tahsin Erdogan 2017-03-24 856 struct blkcg *parent;
d6344a76 Tahsin Erdogan 2017-03-24 857 struct blkcg_gq *new_blkg;
e56da7e2 Tejun Heo 2012-03-05 858
d6344a76 Tahsin Erdogan 2017-03-24 859 parent = blkcg_parent(blkcg);
d6344a76 Tahsin Erdogan 2017-03-24 860 while (parent && !__blkg_lookup(parent, q, false)) {
d6344a76 Tahsin Erdogan 2017-03-24 861 pos = parent;
d6344a76 Tahsin Erdogan 2017-03-24 862 parent = blkcg_parent(parent);
d6344a76 Tahsin Erdogan 2017-03-24 863 }
d6344a76 Tahsin Erdogan 2017-03-24 864
d6344a76 Tahsin Erdogan 2017-03-24 865 /* Drop locks to do new blkg allocation with GFP_KERNEL. */
d6344a76 Tahsin Erdogan 2017-03-24 866 spin_unlock_irq(q->queue_lock);
d6344a76 Tahsin Erdogan 2017-03-24 867 rcu_read_unlock();
d6344a76 Tahsin Erdogan 2017-03-24 868
d6344a76 Tahsin Erdogan 2017-03-24 869 new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
d6344a76 Tahsin Erdogan 2017-03-24 870 if (unlikely(!new_blkg)) {
d6344a76 Tahsin Erdogan 2017-03-24 871 ret = -ENOMEM;
d6344a76 Tahsin Erdogan 2017-03-24 872 goto fail;
d6344a76 Tahsin Erdogan 2017-03-24 873 }
d6344a76 Tahsin Erdogan 2017-03-24 874
d6344a76 Tahsin Erdogan 2017-03-24 875 rcu_read_lock();
d6344a76 Tahsin Erdogan 2017-03-24 876 spin_lock_irq(q->queue_lock);
d6344a76 Tahsin Erdogan 2017-03-24 877
d6344a76 Tahsin Erdogan 2017-03-24 878 blkg = blkg_lookup_check(pos, pol, q);
e56da7e2 Tejun Heo 2012-03-05 879 if (IS_ERR(blkg)) {
e56da7e2 Tejun Heo 2012-03-05 880 ret = PTR_ERR(blkg);
d6344a76 Tahsin Erdogan 2017-03-24 881 goto fail_unlock;
d6344a76 Tahsin Erdogan 2017-03-24 882 }
d6344a76 Tahsin Erdogan 2017-03-24 883
d6344a76 Tahsin Erdogan 2017-03-24 884 if (blkg) {
d6344a76 Tahsin Erdogan 2017-03-24 885 blkg_free(new_blkg);
d6344a76 Tahsin Erdogan 2017-03-24 886 } else {
d6344a76 Tahsin Erdogan 2017-03-24 887 blkg = blkg_create(pos, q, new_blkg);
d6344a76 Tahsin Erdogan 2017-03-24 888 if (unlikely(IS_ERR(blkg))) {
d6344a76 Tahsin Erdogan 2017-03-24 889 ret = PTR_ERR(blkg);
d6344a76 Tahsin Erdogan 2017-03-24 890 goto fail_unlock;
d6344a76 Tahsin Erdogan 2017-03-24 891 }
d6344a76 Tahsin Erdogan 2017-03-24 892 }
d6344a76 Tahsin Erdogan 2017-03-24 893
d6344a76 Tahsin Erdogan 2017-03-24 894 if (pos == blkcg)
d6344a76 Tahsin Erdogan 2017-03-24 895 goto success;
d6344a76 Tahsin Erdogan 2017-03-24 896 }
d6344a76 Tahsin Erdogan 2017-03-24 897 success:
d6344a76 Tahsin Erdogan 2017-03-24 898 ctx->disk = disk;
d6344a76 Tahsin Erdogan 2017-03-24 899 ctx->blkg = blkg;
d6344a76 Tahsin Erdogan 2017-03-24 900 ctx->body = body;
d6344a76 Tahsin Erdogan 2017-03-24 @901 return 0;
d6344a76 Tahsin Erdogan 2017-03-24 902
d6344a76 Tahsin Erdogan 2017-03-24 903 fail_unlock:
d6344a76 Tahsin Erdogan 2017-03-24 904 spin_unlock_irq(q->queue_lock);

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

2017-03-27 18:38:25

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Sun, Mar 26, 2017 at 3:54 AM, Julia Lawall <[email protected]> wrote:
> Is an unlock needed before line 848? Or does blkg_lookup_check take care
> of it?

An unlock is not needed. On success, the function returns with the
locks held. This is documented at line 805:

"... This function returns with RCU read
* lock and queue lock held and must be paired with blkg_conf_finish()."
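
A short usage sketch of that contract (illustrative;
example_set_weight() is a made-up caller, not code from the patch):

        static int example_set_weight(struct blkcg *blkcg,
                                      const struct blkcg_policy *pol,
                                      char *buf)
        {
                struct blkg_conf_ctx ctx;
                int ret;

                ret = blkg_conf_prep(blkcg, pol, buf, &ctx);
                if (ret)
                        return ret;  /* locks already released on failure */

                /* ... update per-device settings through ctx.blkg ... */

                blkg_conf_finish(&ctx);  /* drops queue lock and rcu read lock */
                return 0;
        }

This pairing is also why the kbuild warning above is a false positive:
returning at line 901 with the locks held is the documented success
behavior, and blkg_conf_finish() is where they are released.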

2017-03-28 21:54:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

On Fri, Mar 24, 2017 at 04:04:32PM -0600, Jens Axboe wrote:
> On 03/24/2017 03:56 PM, Tahsin Erdogan wrote:
> > Do struct blkcg_gq allocations outside the queue spinlock to allow
> > blocking during memory allocations.
>
> This looks much simpler/cleaner to me, compared to v5. Tejun, what do
> you think?

So, this patch in itself looks better, but now we end up with two
separate mechanisms to handle non-atomic allocations: the drop-lock /
alloc / relock / check-invariants dance in the main path, and the
preallocation logic used in the init path. Right now, neither of the
proposed implementations is all that satisfactory. Personally, I'd
prefer superficial ugliness to structural duplication, but, ideally,
we shouldn't have to make this choice. idk, it's a bug fix. We can
always clean things up later.

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2017-03-28 21:59:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

On 03/09/2017 01:05 AM, Tahsin Erdogan wrote:
> Update blkg_create() function to temporarily drop the rcu and queue
> locks when it is allowed by gfp mask.

Added for 4.12, thanks for your persistence in pulling this through.

--
Jens Axboe

2017-03-28 22:01:37

by Tahsin Erdogan

[permalink] [raw]
Subject: Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

Thanks!

On Tue, Mar 28, 2017 at 2:59 PM, Jens Axboe <[email protected]> wrote:
> On 03/09/2017 01:05 AM, Tahsin Erdogan wrote:
>> Update blkg_create() function to temporarily drop the rcu and queue
>> locks when it is allowed by gfp mask.
>
> Added for 4.12, thanks for your persistence in pulling this through.