2009-08-27 22:19:12

by Brad Bosch

[permalink] [raw]
Subject: Crypto oops in async_chainiv_do_postponed

Herbert,

I seem to have found a bug in chainiv.c. The following oops occured
while using blowfish and hmac-sha with UDP. The patch (against
2.6.27.30) after the oops output below is an completely untested
possible fix. We will be testing this fix starting tomorrow.

The null dereference occurs when subreq is dereferenced in
async_chainiv_do_postponed(). My guess is that the
skcipher_givcrypt_request block has been freed and subsequently
overwritten by the time the postponed request is processed. This
could happen if chainiv_givencrypt() returns anything other than
-EINPROGRESS after postponing the request. From what I can see, this
could indeed happen since the same err field in async_chainiv_ctx is
used both when the request is postponed and also when it is later
processed by the worker thread. Neither the lock, nor the
CHAINIV_STATE_INUSE bit in ctx->state will prevent this race which
results in the -EINPROGRESS err being overwritten between the time it
is placed in ctx->err in async_chainiv_postpone_request() and when it
is read back out in async_chainiv_schedule_work(). I am curious why
this method of returning the error code was used in the first place.

Please let me know if this change looks OK and I will follow up after I test
this patch.

Thanks!

--Brad

BUG: unable to handle kernel NULL pointer dereference at 0000003c
IP: [<c054994b>] async_chainiv_do_postponed+0x38/0x5d
*pde = 00000000
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: loop

Pid: 19, comm: events/0 Not tainted (2.6.27.31-wmspt #1)
EIP: 0060:[<c054994b>] EFLAGS: 00010286 CPU: 0
EIP is at async_chainiv_do_postponed+0x38/0x5d
EAX: f78c2000 EBX: fffffff0 ECX: 00000000 EDX: 00000000
ESI: f4f5bdf4 EDI: 00000000 EBP: f4f5bdf0 ESP: f78c3f5c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 19, ti=f78c2000 task=f78ae430 task.ti=f78c2000)
Stack: f4f5be14 f789bd40 f4f5be10 c0549913 c042c7e9 f789bd40 f78c3f98 f78c3fb8
00000005 c042c8fc 00000000 f78ae430 c042f33e f78c3fb0 f78c3fb0 c0928a80
c041baf3 00000000 00000000 f78ae430 c042f33e f78c3fb0 f78c3fb0 00000000
Call Trace:
[<c0549913>] async_chainiv_do_postponed+0x0/0x5d
[<c042c7e9>] run_workqueue+0x72/0xf8
[<c042c8fc>] worker_thread+0x8d/0x99
[<c042f33e>] autoremove_wake_function+0x0/0x2d
[<c041baf3>] __wake_up_common+0x37/0x61
[<c042f33e>] autoremove_wake_function+0x0/0x2d
[<c041bbdb>] complete+0x28/0x36
[<c042c86f>] worker_thread+0x0/0x99
[<c042ee53>] kthread+0x34/0x55
[<c042ee1f>] kthread+0x0/0x55
[<c04037df>] kernel_thread_helper+0x7/0x10
=======================
Code: eb 14 89 f0 e8 92 16 23 00 89 d8 e8 c2 df ff ff 8d 58 f0 89 c7 89 f0 e8 15 18 23 00 85 db 75 0b 5b 89 e8 5e 5f 5d e9 00 fe ff ff <81> 4f 3c 00 02 00 00 89 d8 e8 6f fe ff ff 89 c3 e8 c4 9e ed ff
EIP: [<c054994b>] async_chainiv_do_postponed+0x38/0x5d SS:ESP 0068:f78c3f5c
---[ end trace daceb93ad89f6e25 ]---


Index: chainiv.c
===================================================================
RCS file: /share/cvs/sdg/kernels/kernel.wms/kernel_2_6_27/src/crypto/chainiv.c,v
retrieving revision 1.1.1.1.4.2
diff -u -r1.1.1.1.4.2 chainiv.c
--- chainiv.c 10 Mar 2009 05:16:24 -0000 1.1.1.1.4.2
+++ chainiv.c 27 Aug 2009 19:40:27 -0000
@@ -36,7 +36,6 @@
unsigned long state;

spinlock_t lock;
- int err;

struct crypto_queue queue;
struct work_struct postponed;
@@ -114,10 +113,9 @@
return chainiv_init_common(tfm);
}

-static int async_chainiv_schedule_work(struct async_chainiv_ctx *ctx)
+static void async_chainiv_schedule_work(struct async_chainiv_ctx *ctx)
{
int queued;
- int err = ctx->err;

if (!ctx->queue.qlen) {
smp_mb__before_clear_bit();
@@ -125,14 +123,11 @@

if (!ctx->queue.qlen ||
test_and_set_bit(CHAINIV_STATE_INUSE, &ctx->state))
- goto out;
+ return;
}

queued = schedule_work(&ctx->postponed);
BUG_ON(!queued);
-
-out:
- return err;
}

static int async_chainiv_postpone_request(struct skcipher_givcrypt_request *req)
@@ -148,8 +143,8 @@
if (test_and_set_bit(CHAINIV_STATE_INUSE, &ctx->state))
return err;

- ctx->err = err;
- return async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
+ return err;
}

static int async_chainiv_givencrypt_tail(struct skcipher_givcrypt_request *req)
@@ -158,18 +153,20 @@
struct async_chainiv_ctx *ctx = crypto_ablkcipher_ctx(geniv);
struct ablkcipher_request *subreq = skcipher_givcrypt_reqctx(req);
unsigned int ivsize = crypto_ablkcipher_ivsize(geniv);
+ int err;

memcpy(req->giv, ctx->iv, ivsize);
memcpy(subreq->info, ctx->iv, ivsize);

- ctx->err = crypto_ablkcipher_encrypt(subreq);
- if (ctx->err)
+ err = crypto_ablkcipher_encrypt(subreq);
+ if (err)
goto out;

memcpy(ctx->iv, subreq->info, ivsize);

out:
- return async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
+ return err;
}

static int async_chainiv_givencrypt(struct skcipher_givcrypt_request *req)
@@ -236,7 +233,7 @@
spin_unlock_bh(&ctx->lock);

if (!req) {
- async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
return;
}



2009-08-29 10:46:08

by Herbert Xu

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

Hi Brad:

On Thu, Aug 27, 2009 at 05:13:04PM -0500, Brad Bosch wrote:
>
> I seem to have found a bug in chainiv.c. The following oops occured
> while using blowfish and hmac-sha with UDP. The patch (against
> 2.6.27.30) after the oops output below is an completely untested
> possible fix. We will be testing this fix starting tomorrow.

Thanks for the detailed analysis and patch!

> The null dereference occurs when subreq is dereferenced in
> async_chainiv_do_postponed(). My guess is that the
> skcipher_givcrypt_request block has been freed and subsequently
> overwritten by the time the postponed request is processed. This
> could happen if chainiv_givencrypt() returns anything other than
> -EINPROGRESS after postponing the request. From what I can see, this
> could indeed happen since the same err field in async_chainiv_ctx is
> used both when the request is postponed and also when it is later
> processed by the worker thread. Neither the lock, nor the
> CHAINIV_STATE_INUSE bit in ctx->state will prevent this race which
> results in the -EINPROGRESS err being overwritten between the time it
> is placed in ctx->err in async_chainiv_postpone_request() and when it
> is read back out in async_chainiv_schedule_work(). I am curious why
> this method of returning the error code was used in the first place.

Actually I think the problem is much less subtle than that :)

The problem is that whenever crypto_dequeue_request returns NULL,
chainiv will never see the NULL pointer because we end up converting
the pointer to skcipher_givcrypt_request which would have the value
(NULL - off) where off is non-zero.

Please let me know if this patch fixes the crash for you.

commit 0c7d400fafaeab6014504a6a6249f01bac7f7db4
Author: Herbert Xu <[email protected]>
Date: Sat Aug 29 20:44:04 2009 +1000

crypto: skcipher - Fix skcipher_dequeue_givcrypt NULL test

As struct skcipher_givcrypt_request includes struct crypto_request
at a non-zero offset, testing for NULL after converting the pointer
returned by crypto_dequeue_request does not work. This can result
in IPsec crashes when the queue is depleted.

This patch fixes it by doing the pointer conversion only when the
return value is non-NULL. In particular, we create a new function
__crypto_dequeue_request that does the pointer conversion.

Reported-by: Brad Bosch <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

diff --git a/crypto/algapi.c b/crypto/algapi.c
index 56c62e2..df0863d 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -692,7 +692,7 @@ out:
}
EXPORT_SYMBOL_GPL(crypto_enqueue_request);

-struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
+void *__crypto_dequeue_request(struct crypto_queue *queue, unsigned int offset)
{
struct list_head *request;

@@ -707,7 +707,14 @@ struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
request = queue->list.next;
list_del(request);

- return list_entry(request, struct crypto_async_request, list);
+ return (char *)list_entry(request, struct crypto_async_request, list) -
+ offset;
+}
+EXPORT_SYMBOL_GPL(__crypto_dequeue_request);
+
+struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
+{
+ return __crypto_dequeue_request(queue, 0);
}
EXPORT_SYMBOL_GPL(crypto_dequeue_request);

diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 0105454..5a2bd1c 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -137,6 +137,7 @@ struct crypto_instance *crypto_alloc_instance(const char *name,
void crypto_init_queue(struct crypto_queue *queue, unsigned int max_qlen);
int crypto_enqueue_request(struct crypto_queue *queue,
struct crypto_async_request *request);
+void *__crypto_dequeue_request(struct crypto_queue *queue, unsigned int offset);
struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue);
int crypto_tfm_in_queue(struct crypto_queue *queue, struct crypto_tfm *tfm);

diff --git a/include/crypto/internal/skcipher.h b/include/crypto/internal/skcipher.h
index 2ba42cd..3a748a6 100644
--- a/include/crypto/internal/skcipher.h
+++ b/include/crypto/internal/skcipher.h
@@ -79,8 +79,8 @@ static inline int skcipher_enqueue_givcrypt(
static inline struct skcipher_givcrypt_request *skcipher_dequeue_givcrypt(
struct crypto_queue *queue)
{
- return container_of(ablkcipher_dequeue_request(queue),
- struct skcipher_givcrypt_request, creq);
+ return __crypto_dequeue_request(
+ queue, offsetof(struct skcipher_givcrypt_request, creq.base));
}

static inline void *skcipher_givcrypt_reqctx(

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-08-31 16:12:08

by Brad Bosch

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

Herbert Xu writes:
> Thanks for the detailed analysis and patch!

No problem, thanks for looking at it!

>
> > The null dereference occurs when subreq is dereferenced in
> > async_chainiv_do_postponed(). My guess is that the
> > skcipher_givcrypt_request block has been freed and subsequently
> > overwritten by the time the postponed request is processed. This
> > could happen if chainiv_givencrypt() returns anything other than
> > -EINPROGRESS after postponing the request. From what I can see, this
> > could indeed happen since the same err field in async_chainiv_ctx is
> > used both when the request is postponed and also when it is later
> > processed by the worker thread. Neither the lock, nor the
> > CHAINIV_STATE_INUSE bit in ctx->state will prevent this race which
> > results in the -EINPROGRESS err being overwritten between the time it
> > is placed in ctx->err in async_chainiv_postpone_request() and when it
> > is read back out in async_chainiv_schedule_work(). I am curious why
> > this method of returning the error code was used in the first place.
>
> Actually I think the problem is much less subtle than that :)

OK. I was looking for something subtle because the crash takes a long
time to happen. But do you agree that the race I described above also
a real bug?

>
> The problem is that whenever crypto_dequeue_request returns NULL,
> chainiv will never see the NULL pointer because we end up converting
> the pointer to skcipher_givcrypt_request which would have the value
> (NULL - off) where off is non-zero.

Yes, I see that this bug must be the bug we would likely encounter first.
Apparently, async_chainiv_do_postponed was never tested? But I don't
see how the patch you proposed below helps. We still don't seem to be
returning NULL from skcipher_dequeue_givcrypt when we reach the end of
the queue because __crypto_dequeue_request is not checking for NULL
before it subtracts offset.

Wouldn't the following (simpler, but untested) patch work?

Index: skcipher.h
===================================================================
RCS file: /share/cvs/sdg/kernels/kernel.wms/kernel_2_6_27/src/include/crypto/internal/skcipher.h,v
retrieving revision 1.1.1.1.4.2
diff -u -r1.1.1.1.4.2 skcipher.h
--- skcipher.h 10 Mar 2009 05:25:25 -0000 1.1.1.1.4.2
+++ skcipher.h 31 Aug 2009 15:56:50 -0000
@@ -85,8 +85,9 @@
static inline struct skcipher_givcrypt_request *skcipher_dequeue_givcrypt(
struct crypto_queue *queue)
{
- return container_of(ablkcipher_dequeue_request(queue),
- struct skcipher_givcrypt_request, creq);
+ struct ablkcipher_request *req = ablkcipher_dequeue_request(queue);
+ return req ? container_of(req, struct skcipher_givcrypt_request, creq)
+ : NULL;
}

static inline void *skcipher_givcrypt_reqctx(


Thanks!

--Brad

>
> Please let me know if this patch fixes the crash for you.
>
> commit 0c7d400fafaeab6014504a6a6249f01bac7f7db4
> Author: Herbert Xu <[email protected]>
> Date: Sat Aug 29 20:44:04 2009 +1000
>
> crypto: skcipher - Fix skcipher_dequeue_givcrypt NULL test
>
> As struct skcipher_givcrypt_request includes struct crypto_request
> at a non-zero offset, testing for NULL after converting the pointer
> returned by crypto_dequeue_request does not work. This can result
> in IPsec crashes when the queue is depleted.
>
> This patch fixes it by doing the pointer conversion only when the
> return value is non-NULL. In particular, we create a new function
> __crypto_dequeue_request that does the pointer conversion.
>
> Reported-by: Brad Bosch <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>
>
> diff --git a/crypto/algapi.c b/crypto/algapi.c
> index 56c62e2..df0863d 100644
> --- a/crypto/algapi.c
> +++ b/crypto/algapi.c
> @@ -692,7 +692,7 @@ out:
> }
> EXPORT_SYMBOL_GPL(crypto_enqueue_request);
>
> -struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
> +void *__crypto_dequeue_request(struct crypto_queue *queue, unsigned int offset)
> {
> struct list_head *request;
>
> @@ -707,7 +707,14 @@ struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
> request = queue->list.next;
> list_del(request);
>
> - return list_entry(request, struct crypto_async_request, list);
> + return (char *)list_entry(request, struct crypto_async_request, list) -
> + offset;
> +}
> +EXPORT_SYMBOL_GPL(__crypto_dequeue_request);
> +
> +struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue)
> +{
> + return __crypto_dequeue_request(queue, 0);
> }
> EXPORT_SYMBOL_GPL(crypto_dequeue_request);
>
> diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
> index 0105454..5a2bd1c 100644
> --- a/include/crypto/algapi.h
> +++ b/include/crypto/algapi.h
> @@ -137,6 +137,7 @@ struct crypto_instance *crypto_alloc_instance(const char *name,
> void crypto_init_queue(struct crypto_queue *queue, unsigned int max_qlen);
> int crypto_enqueue_request(struct crypto_queue *queue,
> struct crypto_async_request *request);
> +void *__crypto_dequeue_request(struct crypto_queue *queue, unsigned int offset);
> struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue);
> int crypto_tfm_in_queue(struct crypto_queue *queue, struct crypto_tfm *tfm);
>
> diff --git a/include/crypto/internal/skcipher.h b/include/crypto/internal/skcipher.h
> index 2ba42cd..3a748a6 100644
> --- a/include/crypto/internal/skcipher.h
> +++ b/include/crypto/internal/skcipher.h
> @@ -79,8 +79,8 @@ static inline int skcipher_enqueue_givcrypt(
> static inline struct skcipher_givcrypt_request *skcipher_dequeue_givcrypt(
> struct crypto_queue *queue)
> {
> - return container_of(ablkcipher_dequeue_request(queue),
> - struct skcipher_givcrypt_request, creq);
> + return __crypto_dequeue_request(
> + queue, offsetof(struct skcipher_givcrypt_request, creq.base));
> }
>
> static inline void *skcipher_givcrypt_reqctx(
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-08-31 22:04:59

by Herbert Xu

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

On Mon, Aug 31, 2009 at 11:11:42AM -0500, Brad Bosch wrote:
>
> OK. I was looking for something subtle because the crash takes a long
> time to happen. But do you agree that the race I described above also
> a real bug?

No I don't think it is. CHAINV_STATE_INUSE guarantees that only
one entity can use ctx->err at any time.

> Yes, I see that this bug must be the bug we would likely encounter first.
> Apparently, async_chainiv_do_postponed was never tested? But I don't
> see how the patch you proposed below helps. We still don't seem to be
> returning NULL from skcipher_dequeue_givcrypt when we reach the end of
> the queue because __crypto_dequeue_request is not checking for NULL
> before it subtracts offset.

Where we subtract the offset the pointer can never be NULL. Please
try my patch.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-09-01 15:43:07

by Brad Bosch

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

Herbert Xu writes:
> On Mon, Aug 31, 2009 at 11:11:42AM -0500, Brad Bosch wrote:
> >
> > OK. I was looking for something subtle because the crash takes a long
> > time to happen. But do you agree that the race I described above also
> > a real bug?
>
> No I don't think it is. CHAINV_STATE_INUSE guarantees that only
> one entity can use ctx->err at any time.

I don't see how you are protecting ctx->err with the INUSE flag. For
example:

If two threads enter async_chainiv_givencrypt at the same time, one
thread will call async_chainiv_postpone_request (INUSE will be clear
until set by async_chainiv_postpone_request) and the other thread will
call async_chainiv_givencrypt_tail (INUSE may or may not be set yet).

Now, ctx-err may be used by both async_chainiv_postpone_request to
store the return value from skcipher_enqueue_givcrypt and by
async_chainiv_givencrypt_tail to store the return value from
crypto_ablkcipher_encrypt at the same time. This can cause the
calling function to think async_chainiv_givencrypt has completed it's
work, when in fact, the work was defered.

The patch I proposed earlier (included again below) avoids this and
also makes the error handling simpler and more direct without
requiring ctx->err at all. I still don't understand why ctx->err was
required in the first place.

Did I miss something with regard to the use of ctx->err?

Now, as to the other bug...

>
> Where we subtract the offset the pointer can never be NULL. Please
> try my patch.

OK. I see now that your offset patch should indeed solve that
problem. But why did you choose to fix it in a complex way? My
suggestion just adds a single test while yours adds new parameters, a
new function and an extra function call.

Thanks for your help.

--Brad

Index: chainiv.c
===================================================================
RCS file: /share/cvs/sdg/kernels/kernel.wms/kernel_2_6_27/src/crypto/chainiv.c,v
retrieving revision 1.1.1.1.4.2
diff -u -r1.1.1.1.4.2 chainiv.c
--- chainiv.c 10 Mar 2009 05:16:24 -0000 1.1.1.1.4.2
+++ chainiv.c 27 Aug 2009 19:40:27 -0000
@@ -36,7 +36,6 @@
unsigned long state;

spinlock_t lock;
- int err;

struct crypto_queue queue;
struct work_struct postponed;
@@ -114,10 +113,9 @@
return chainiv_init_common(tfm);
}

-static int async_chainiv_schedule_work(struct async_chainiv_ctx *ctx)
+static void async_chainiv_schedule_work(struct async_chainiv_ctx *ctx)
{
int queued;
- int err = ctx->err;

if (!ctx->queue.qlen) {
smp_mb__before_clear_bit();
@@ -125,14 +123,11 @@

if (!ctx->queue.qlen ||
test_and_set_bit(CHAINIV_STATE_INUSE, &ctx->state))
- goto out;
+ return;
}

queued = schedule_work(&ctx->postponed);
BUG_ON(!queued);
-
-out:
- return err;
}

static int async_chainiv_postpone_request(struct skcipher_givcrypt_request *req)
@@ -148,8 +143,8 @@
if (test_and_set_bit(CHAINIV_STATE_INUSE, &ctx->state))
return err;

- ctx->err = err;
- return async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
+ return err;
}

static int async_chainiv_givencrypt_tail(struct skcipher_givcrypt_request *req)
@@ -158,18 +153,20 @@
struct async_chainiv_ctx *ctx = crypto_ablkcipher_ctx(geniv);
struct ablkcipher_request *subreq = skcipher_givcrypt_reqctx(req);
unsigned int ivsize = crypto_ablkcipher_ivsize(geniv);
+ int err;

memcpy(req->giv, ctx->iv, ivsize);
memcpy(subreq->info, ctx->iv, ivsize);

- ctx->err = crypto_ablkcipher_encrypt(subreq);
- if (ctx->err)
+ err = crypto_ablkcipher_encrypt(subreq);
+ if (err)
goto out;

memcpy(ctx->iv, subreq->info, ivsize);

out:
- return async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
+ return err;
}

static int async_chainiv_givencrypt(struct skcipher_givcrypt_request *req)
@@ -236,7 +233,7 @@
spin_unlock_bh(&ctx->lock);

if (!req) {
- async_chainiv_schedule_work(ctx);
+ async_chainiv_schedule_work(ctx);
return;
}


2009-09-01 22:17:21

by Herbert Xu

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

On Tue, Sep 01, 2009 at 10:42:44AM -0500, Brad Bosch wrote:
>
> Now, ctx-err may be used by both async_chainiv_postpone_request to
> store the return value from skcipher_enqueue_givcrypt and by
> async_chainiv_givencrypt_tail to store the return value from
> crypto_ablkcipher_encrypt at the same time. This can cause the
> calling function to think async_chainiv_givencrypt has completed it's
> work, when in fact, the work was defered.

async_chainiv_postpone_request never touches ctx->err unless
it can obtain the INUSE bit lock. On the other hand, the normal
patch async_chainiv_givencrypt_tail never relinquishes the INUSE
bit until it is finisehd with ctx->err.

> OK. I see now that your offset patch should indeed solve that
> problem. But why did you choose to fix it in a complex way? My
> suggestion just adds a single test while yours adds new parameters, a
> new function and an extra function call.

Because that introduces two NULL checks where the second one is
useless. Not a big deal but then again, my patch wasn't that
complicated either :)

Please let me know whether it actually fixes your problem though
so I can get this upstream.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-09-02 14:15:03

by Brad Bosch

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

Herbert Xu writes:
> On Tue, Sep 01, 2009 at 10:42:44AM -0500, Brad Bosch wrote:
> >
> > Now, ctx-err may be used by both async_chainiv_postpone_request to
> > store the return value from skcipher_enqueue_givcrypt and by
> > async_chainiv_givencrypt_tail to store the return value from
> > crypto_ablkcipher_encrypt at the same time. This can cause the
> > calling function to think async_chainiv_givencrypt has completed it's
> > work, when in fact, the work was defered.
>
> async_chainiv_postpone_request never touches ctx->err unless
> it can obtain the INUSE bit lock. On the other hand, the normal
> patch async_chainiv_givencrypt_tail never relinquishes the INUSE
> bit until it is finisehd with ctx->err.

But the above statements are not adequate to demonstrate that your use
of the INUSE flag always prevents a condition where both
async_chainiv_postpone_request and async_chainiv_givencrypt_tail
operate on the same ctx at the same time. The flaw in your logic may
be that async_chainiv_schedule_work does not have solid assurance that
it's thread is the one that holds the INUSE bit when it calls
clear_bit.

I seem to have trouble getting the details right in describing a path
that causes both uses of ctx->err to happen at the same time. Let me
try again.

Assume the worker thread is executing between the dequeue in
async_chainiv_do_postponed and the clear_bit call in
async_chainiv_schedule_work. Further assume that we are processing
the last item on the queue so durring this time, ctx->queue.qlen =
0.

Meanwhile, three threads enter async_chainiv_givencrypt for the same
ctx at about the same time.

Thread one calls test_and_set_bit which returns 1 and calls
async_cahiniv_postpone_request but suppose it has not yet enqueued.
Now INUSE is set and qlen=0.

Next, the worker thread calls clear_bit in async_chainiv_schedule_work
but it is interrupted before it can call test_and_set_bit. Now INUSE
is clear and qlen=0

The test_and_set_bit in thread two is called at this moment and
returns 0 and then calls async_chainiv_givencrypt_tail. Now INUSE is
set and qlen=0.

Thread one now locks the ctx and calls skcipher_enqueue_givcrypt and
unlocks. Now INUSE is set and qlen=1.

Thread three calls test_and_set_bit which returns 1 and then it clears
INUSE since qlen=1 and it calls postpone with INUSE clear and qlen=1

Now thread three will use ctx->err to hold the return value of
skcipher_enqueue_givcrypt at the same time as thread two uses ctx->err
to hold the return value of crypto_ablkcipher_encrypt!

Did I make a mistake above? I suspect more bad things can happen as
well in this scenario, but I'm just focusing on the use of ctx->err here.

>
> Please let me know whether it actually fixes your problem though
> so I can get this upstream.

Unfortunately, the offset problem is not easily reproduced with our
application, so testing long enough to be sure the problem is fixed
(assuming that it was indeed the cause of the oops) may not be
practical. All I can say at the moment is that I have not seen the
crash since I introduced the two patches I sent you.

Thanks for taking the time to discuss this!

--Brad

2009-09-02 14:24:01

by Brad Bosch

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

(resent due to bounce notification for vger)
Herbert Xu writes:
> On Tue, Sep 01, 2009 at 10:42:44AM -0500, Brad Bosch wrote:
> >
> > Now, ctx-err may be used by both async_chainiv_postpone_request to
> > store the return value from skcipher_enqueue_givcrypt and by
> > async_chainiv_givencrypt_tail to store the return value from
> > crypto_ablkcipher_encrypt at the same time. This can cause the
> > calling function to think async_chainiv_givencrypt has completed it's
> > work, when in fact, the work was defered.
>
> async_chainiv_postpone_request never touches ctx->err unless
> it can obtain the INUSE bit lock. On the other hand, the normal
> patch async_chainiv_givencrypt_tail never relinquishes the INUSE
> bit until it is finisehd with ctx->err.

But the above statements are not adequate to demonstrate that your use
of the INUSE flag always prevents a condition where both
async_chainiv_postpone_request and async_chainiv_givencrypt_tail
operate on the same ctx at the same time. The flaw in your logic may
be that async_chainiv_schedule_work does not have solid assurance that
it's thread is the one that holds the INUSE bit when it calls
clear_bit.

I seem to have trouble getting the details right in describing a path
that causes both uses of ctx->err to happen at the same time. Let me
try again.

Assume the worker thread is executing between the dequeue in
async_chainiv_do_postponed and the clear_bit call in
async_chainiv_schedule_work. Further assume that we are processing
the last item on the queue so durring this time, ctx->queue.qlen =
0.

Meanwhile, three threads enter async_chainiv_givencrypt for the same
ctx at about the same time.

Thread one calls test_and_set_bit which returns 1 and calls
async_cahiniv_postpone_request but suppose it has not yet enqueued.
Now INUSE is set and qlen=0.

Next, the worker thread calls clear_bit in async_chainiv_schedule_work
but it is interrupted before it can call test_and_set_bit. Now INUSE
is clear and qlen=0

The test_and_set_bit in thread two is called at this moment and
returns 0 and then calls async_chainiv_givencrypt_tail. Now INUSE is
set and qlen=0.

Thread one now locks the ctx and calls skcipher_enqueue_givcrypt and
unlocks. Now INUSE is set and qlen=1.

Thread three calls test_and_set_bit which returns 1 and then it clears
INUSE since qlen=1 and it calls postpone with INUSE clear and qlen=1

Now thread three will use ctx->err to hold the return value of
skcipher_enqueue_givcrypt at the same time as thread two uses ctx->err
to hold the return value of crypto_ablkcipher_encrypt!

Did I make a mistake above? I suspect more bad things can happen as
well in this scenario, but I'm just focusing on the use of ctx->err here.

>
> Please let me know whether it actually fixes your problem though
> so I can get this upstream.

Unfortunately, the offset problem is not easily reproduced with our
application, so testing long enough to be sure the problem is fixed
(assuming that it was indeed the cause of the oops) may not be
practical. All I can say at the moment is that I have not seen the
crash since I introduced the two patches I sent you.

Thanks for taking the time to discuss this!

--Brad

2009-09-02 21:57:14

by Herbert Xu

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

On Wed, Sep 02, 2009 at 09:08:38AM -0500, Brad Bosch wrote:
>
> Assume the worker thread is executing between the dequeue in
> async_chainiv_do_postponed and the clear_bit call in
> async_chainiv_schedule_work. Further assume that we are processing

It cannot. The worker thread can only execute when it owns
the INUSE bit. In that case do_postponed will never call the
schedule_work function.

Perhaps you were misled by the clear_bit call in schedule_work.
That is only used if we end up not scheduling the work.

> Unfortunately, the offset problem is not easily reproduced with our
> application, so testing long enough to be sure the problem is fixed
> (assuming that it was indeed the cause of the oops) may not be
> practical. All I can say at the moment is that I have not seen the
> crash since I introduced the two patches I sent you.

OK I'll forward this upstream then.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-09-02 23:47:49

by Brad Bosch

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

Herbert Xu writes:
> On Wed, Sep 02, 2009 at 09:08:38AM -0500, Brad Bosch wrote:
> >
> > Assume the worker thread is executing between the dequeue in
> > async_chainiv_do_postponed and the clear_bit call in
> > async_chainiv_schedule_work. Further assume that we are processing
>
> It cannot. The worker thread can only execute when it owns
> the INUSE bit. In that case do_postponed will never call the
> schedule_work function.

In the example I cited (one entry in the queue when the worker
function starts), async_chainiv_schedule_work is indeed executed.
(indirectly) by async_chainiv_givencrypt_tail from the worker thread.
I'm sorry I didn't make it more clear that it is that code path I was
talking about.

>
> Perhaps you were misled by the clear_bit call in schedule_work.
> That is only used if we end up not scheduling the work.

No, I was not misled. But apparently, I was not clear. I do
understand how you use the INUSE bit. I did not say above that
INUSE is not set when the worker thread is running (at least not for
the first part of my example). If you had read further, you might
have noticed that the following paragraphs showed that indeed I do
understand that INUSE is set in the worker thread as evidenced by
"thread one calls test_and_set_bit which returns 1" I have added one
sentence (marked by **) to my event description below to make my
understanding more clear. Please read on.

Assume the worker thread is executing between the dequeue in
async_chainiv_do_postponed and the clear_bit call in
async_chainiv_schedule_work. Further assume that we are processing
the last item on the queue so durring this time, ctx->queue.qlen =
0. **INUSE is still set at this point.

Meanwhile, three threads enter async_chainiv_givencrypt for the same
ctx at about the same time.

Thread one calls test_and_set_bit which returns 1 and calls
async_cahiniv_postpone_request but suppose it has not yet enqueued.
Now INUSE is set and qlen=0.

Next, the worker thread calls clear_bit in async_chainiv_schedule_work
but it is interrupted before it can call test_and_set_bit. Now INUSE
is clear and qlen=0

The test_and_set_bit in thread two is called at this moment and
returns 0 and then calls async_chainiv_givencrypt_tail. Now INUSE is
set and qlen=0.

Thread one now locks the ctx and calls skcipher_enqueue_givcrypt and
unlocks. Now INUSE is set and qlen=1.

Thread three calls test_and_set_bit which returns 1 and then it clears
INUSE since qlen=1 and it calls postpone with INUSE clear and qlen=1

Now thread three will use ctx->err to hold the return value of
skcipher_enqueue_givcrypt at the same time as thread two uses ctx->err
to hold the return value of crypto_ablkcipher_encrypt!

Did I make a mistake above? I suspect more bad things can happen as
well in this scenario, but I'm just focusing on the use of ctx->err here.

Thanks

--Brad

2009-09-03 01:53:47

by Herbert Xu

[permalink] [raw]
Subject: Re: Crypto oops in async_chainiv_do_postponed

On Wed, Sep 02, 2009 at 06:47:49PM -0500, Brad Bosch wrote:
>
> Assume the worker thread is executing between the dequeue in
> async_chainiv_do_postponed and the clear_bit call in
> async_chainiv_schedule_work. Further assume that we are processing
> the last item on the queue so durring this time, ctx->queue.qlen =
> 0. **INUSE is still set at this point.
>
> Meanwhile, three threads enter async_chainiv_givencrypt for the same
> ctx at about the same time.
>
> Thread one calls test_and_set_bit which returns 1 and calls
> async_cahiniv_postpone_request but suppose it has not yet enqueued.
> Now INUSE is set and qlen=0.
>
> Next, the worker thread calls clear_bit in async_chainiv_schedule_work
> but it is interrupted before it can call test_and_set_bit. Now INUSE
> is clear and qlen=0
>
> The test_and_set_bit in thread two is called at this moment and
> returns 0 and then calls async_chainiv_givencrypt_tail. Now INUSE is
> set and qlen=0.
>
> Thread one now locks the ctx and calls skcipher_enqueue_givcrypt and
> unlocks. Now INUSE is set and qlen=1.
>
> Thread three calls test_and_set_bit which returns 1 and then it clears
> INUSE since qlen=1 and it calls postpone with INUSE clear and qlen=1

How can thread three clear INUSE if test_and_set_bit returned 1?
If thread three sees it set then it will postpone. It can only
clear it if it was not set originally.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt