When jumping to the out_put_disk label, we will call put_disk(), which will
trigger a call to disk_release(), which calls blk_put_queue().

Later in the cleanup code, we do blk_cleanup_queue(), which will also call
blk_put_queue().

Putting the queue twice is incorrect and will generate a KASAN splat.

Set the disk->queue pointer to NULL before calling put_disk(), so that the
first call to blk_put_queue() will not free the queue.

The second call to blk_put_queue() uses another pointer to the same queue,
so this call will still free the queue.
Fixes: 85136c010285 ("lightnvm: simplify geometry enumeration")
Signed-off-by: Niklas Cassel <[email protected]>
---
drivers/nvme/host/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 91c1bd659947..f2adea96b04c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3642,6 +3642,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	return;
  out_put_disk:
+	/* prevent double queue cleanup */
+	ns->disk->queue = NULL;
 	put_disk(ns->disk);
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
--
2.25.3
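To make the refcounting concrete, here is a minimal userspace C model of the
bug and of the fix (toy_queue, toy_disk and the helpers are invented for
illustration; the real logic lives in blk_put_queue(), disk_release() and
blk_cleanup_queue()):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_queue {
	int refcount;
};

struct toy_disk {
	struct toy_queue *queue;
};

static void toy_put_queue(struct toy_queue *q)
{
	assert(q->refcount > 0);
	if (--q->refcount == 0)
		free(q);
}

/* Models disk_release(): putting the disk also puts its queue pointer. */
static void toy_put_disk(struct toy_disk *disk)
{
	if (disk->queue)
		toy_put_queue(disk->queue);
	free(disk);
}

int main(void)
{
	struct toy_queue *q = malloc(sizeof(*q));
	struct toy_disk *disk = malloc(sizeof(*disk));

	q->refcount = 1;	/* single reference, held as ns->queue */
	disk->queue = q;	/* the disk borrows the same pointer, no extra ref */

	/* The fix: without the next line, toy_put_disk() and the later
	 * cleanup would both put the one reference, i.e. a double free. */
	disk->queue = NULL;
	toy_put_disk(disk);	/* no longer touches the queue */

	toy_put_queue(q);	/* models blk_cleanup_queue(ns->queue): frees once */
	printf("queue freed exactly once\n");
	return 0;
}

Commenting out the "disk->queue = NULL;" line makes the second put operate
on already-freed memory, the kind of use-after-free that KASAN reports.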
Applied to nvme-5.7.

Not a huge fan of it, though; I think I have an idea of how to sort some of
this mess out with better block layer APIs.
On 27.04.2020 14:34, Niklas Cassel wrote:
>[... patch quoted in full, snipped ...]
What about delaying the assignment of ns->disk?
diff --git i/drivers/nvme/host/core.c w/drivers/nvme/host/core.c
index a4d8c90ee7cc..6da4a9ced945 100644
--- i/drivers/nvme/host/core.c
+++ w/drivers/nvme/host/core.c
@@ -3541,7 +3541,6 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	disk->queue = ns->queue;
 	disk->flags = flags;
 	memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
-	ns->disk = disk;
 
 	__nvme_revalidate_disk(disk, id);
 
@@ -3553,6 +3552,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		}
 	}
 
+	ns->disk = disk;
+
 	down_write(&ctrl->namespaces_rwsem);
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	up_write(&ctrl->namespaces_rwsem);
Javier
On Mon, Apr 27, 2020 at 08:03:11PM +0200, Javier González wrote:
> On 27.04.2020 14:34, Niklas Cassel wrote:
> > [patch quoted in full, snipped]
> What about delaying the assignment of ns->disk?
> [Javier's proposed diff snipped; see his message above]
Hello Javier!

The only case where we jump to the out_put_disk label is if the
nvme_nvm_register() call fails.

In that case, we want to undo the alloc_disk_node() operation, i.e.,
decrease the refcount.

If we don't set "ns->disk = disk;" before the call to nvme_nvm_register(),
then, if the registration fails and we jump to out_put_disk,
put_disk(ns->disk) is called while ns->disk is still NULL, so the refcount
will not be decreased, and I assume the gendisk memory would then be leaked.
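A minimal userspace sketch of that leak (toy_put_disk() and all other names
here are invented; the only assumption is that, like put_disk(), the helper
ignores a NULL pointer):

#include <stdio.h>
#include <stdlib.h>

struct toy_disk { int dummy; };

/* Stand-in for put_disk(), which silently ignores a NULL pointer. */
static void toy_put_disk(struct toy_disk *disk)
{
	if (!disk)
		return;
	free(disk);
}

int main(void)
{
	struct toy_disk *disk = malloc(sizeof(*disk)); /* alloc_disk_node() */
	struct toy_disk *ns_disk = NULL;	/* ns->disk was never assigned */

	/* registration fails; the error path does put_disk(ns->disk): */
	toy_put_disk(ns_disk);	/* no-op, so the allocation behind "disk" leaks */

	printf("disk at %p was never released\n", (void *)disk);
	return 0;
}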
I think that the problem is that the block layer functions are a bit messy.
Most drivers seem to do blk_cleanup_queue() first and then put_disk(), but
some drivers do it in the opposite order, so we might have more
use-after-free bugs hiding in the drivers that do it that way.
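For reference, the common teardown order looks roughly like this; a sketch
only, with a made-up mydev driver (in this era __device_add_disk() takes an
extra queue reference that disk_release() later drops, which is why this
order is safe after add_disk() but not in a pre-add_disk error path like the
one patched above):

static void mydev_remove(struct mydev *dev)
{
	del_gendisk(dev->disk);		/* unhook the disk from the block layer */
	blk_cleanup_queue(dev->queue);	/* drain and drop the driver's queue ref */
	put_disk(dev->disk);		/* disk_release() drops the disk's queue ref */
}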
Kind regards,
Niklas
On 27.04.2020 18:22, Niklas Cassel wrote:
>[...]
Hi Niklas,

Yes, the out_put_disk label was introduced at the same time as the
LightNVM entry point. We can do a better job of separating the cleanup
functions, but as far as I can see ns->disk is not used in the LightNVM
initialization, so delaying the assignment should be OK. Part of this
would also be changing out_put_disk to call put_disk(disk).

Note that initializing the other namespace types here does not require
ns->disk either, so delaying the assignment should be OK. We have been
running with this patch locally for some time.

This said, this is just an alternative; your fix works as well.
Javier
On Tue, Apr 28, 2020 at 09:06:51AM +0200, Javier González wrote:
> [...]
>
> Yes, the out_put_disk label was introduced at the same time as the
> LightNVM entry point. We can do a better job of separating the cleanup
> functions, but as far as I can see ns->disk is not used in the LightNVM
> initialization, so delaying the assignment should be OK. Part of this
> would also be changing out_put_disk to call put_disk(disk).
Hello Javier,

If I understand correctly, you are suggesting this:
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3616,7 +3616,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	disk->queue = ns->queue;
 	disk->flags = flags;
 	memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
-	ns->disk = disk;
 
 	__nvme_revalidate_disk(disk, id);
 
@@ -3628,6 +3627,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		}
 	}
 
+	ns->disk = disk;
+
 	down_write(&ctrl->namespaces_rwsem);
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	up_write(&ctrl->namespaces_rwsem);
@@ -3642,7 +3643,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	return;
  out_put_disk:
-	put_disk(ns->disk);
+	put_disk(disk);
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
That would not solve the double free when the registration fails, since
disk->queue is still set, so this put_disk() call will still free the queue.

Later in the cleanup code, blk_cleanup_queue(ns->queue) is called, which
will then try to free what has already been freed.

To remove the double free by delaying assignments, we would need to delay
the assignment of disk->queue until after the LightNVM entry point, but I
don't think that is possible, since __nvme_revalidate_disk(), which is
called before the LightNVM entry point, uses disk->queue.
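Sketched as comments on the simplified flow (an illustration based on this
thread, not the actual function body):

	disk = alloc_disk_node(0, node);	/* gendisk ref taken, no extra queue ref */
	disk->queue = ns->queue;		/* must happen before the revalidate ... */
	__nvme_revalidate_disk(disk, id);	/* ... because this reads disk->queue */
	if (nvme_nvm_register(ns, disk_name, node))
		goto out_put_disk;		/* put_disk() will also put disk->queue */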
Christoph said that he would clean up this mess with better block layer
APIs. So perhaps the fix that is already queued is good enough until then.
Kind regards,
Niklas
On 28.04.2020 07:49, Niklas Cassel wrote:
>[...]
>
>Christoph said that he would clean up this mess with better block layer
>APIs. So perhaps the fix that is already queued is good enough until then.
Makes sense. I see some things have changed in this path since I last
looked at it.
Thanks!
Javier