2015-11-20 17:17:26

by Phil Sutter

[permalink] [raw]
Subject: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

The following series aims to improve lib/test_rhashtable in different
situations:

Patch 1 allows the kernel to reschedule so the test does not block too
long on slow systems.
Patch 2 fixes behaviour under pressure, retrying inserts in the
non-permanent error case (-EBUSY).
Patch 3 auto-adjusts the upper table size limit according to the number
of threads (in the concurrency test). In fact, the current default is
already too small.
Patch 4 makes it possible to retry inserts even in the supposedly
permanent error case (-ENOMEM) to expose rhashtable's remaining problem
of -ENOMEM not being as permanent as it is expected to be.

Changes since v1:
- Introduce insert_retry() which is then used in the single-threaded
test as well.
- Do not retry inserts by default if -ENOMEM was returned.
- Rename the retry counter to be a bit more verbose about what it
contains.
- Add patch 4 as a debugging aid.

Phil Sutter (4):
rhashtable-test: add cond_resched() to thread test
rhashtable-test: retry insert operations
rhashtable-test: calculate max_entries value by default
rhashtable-test: allow to retry even if -ENOMEM was returned

lib/test_rhashtable.c | 76 +++++++++++++++++++++++++++++++++------------------
1 file changed, 50 insertions(+), 26 deletions(-)

--
2.1.2


2015-11-20 17:17:24

by Phil Sutter

[permalink] [raw]
Subject: [PATCH v2 1/4] rhashtable-test: add cond_resched() to thread test

This should fix soft lockup bugs triggered on slow systems.

Signed-off-by: Phil Sutter <[email protected]>
---
lib/test_rhashtable.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 8c1ad1c..63654e3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -236,6 +236,8 @@ static int thread_lookup_test(struct thread_data *tdata)
obj->value, key);
err++;
}
+
+ cond_resched();
}
return err;
}
@@ -251,6 +253,7 @@ static int threadfunc(void *data)

for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
+ cond_resched();
err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
test_rht_params);
if (err == -ENOMEM || err == -EBUSY) {
@@ -285,6 +288,8 @@ static int threadfunc(void *data)
goto out;
}
tdata->objs[i].value = TEST_INSERT_FAIL;
+
+ cond_resched();
}
err = thread_lookup_test(tdata);
if (err) {
--
2.1.2

2015-11-20 17:17:28

by Phil Sutter

[permalink] [raw]
Subject: [PATCH v2 2/4] rhashtable-test: retry insert operations

After adding cond_resched() calls to threadfunc(), a surprisingly high
rate of insert failures occurred, probably because table resizes got a
better chance to run in the background. To avoid weakening the remaining
tests, retry inserts until they either succeed or fail permanently.

Also change the non-threaded test to retry insert operations.
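
The new helper returns a negative error code on permanent failure and
otherwise the number of retries that were needed, so callers can account
retries and bail out in one place. Roughly, usage looks like this
(illustrative only, mirroring the single-threaded hunk below):

	err = insert_retry(ht, &obj->node, test_rht_params);
	if (err > 0)
		insert_retries += err;	/* succeeded after some retries */
	else if (err)
		return err;		/* permanent error, give up */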

Suggested-by: Thomas Graf <[email protected]>
Signed-off-by: Phil Sutter <[email protected]>
---
lib/test_rhashtable.c | 53 ++++++++++++++++++++++++++++-----------------------
1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 63654e3..cfc3440 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -76,6 +76,20 @@ static struct rhashtable_params test_rht_params = {
static struct semaphore prestart_sem;
static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);

+static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
+ const struct rhashtable_params params)
+{
+ int err, retries = -1;
+
+ do {
+ retries++;
+ cond_resched();
+ err = rhashtable_insert_fast(ht, obj, params);
+ } while (err == -EBUSY);
+
+ return err ? : retries;
+}
+
static int __init test_rht_lookup(struct rhashtable *ht)
{
unsigned int i;
@@ -157,7 +171,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
{
struct test_obj *obj;
int err;
- unsigned int i, insert_fails = 0;
+ unsigned int i, insert_retries = 0;
s64 start, end;

/*
@@ -170,22 +184,16 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
struct test_obj *obj = &array[i];

obj->value = i * 2;
-
- err = rhashtable_insert_fast(ht, &obj->node, test_rht_params);
- if (err == -ENOMEM || err == -EBUSY) {
- /* Mark failed inserts but continue */
- obj->value = TEST_INSERT_FAIL;
- insert_fails++;
- } else if (err) {
+ err = insert_retry(ht, &obj->node, test_rht_params);
+ if (err > 0)
+ insert_retries += err;
+ else if (err)
return err;
- }
-
- cond_resched();
}

- if (insert_fails)
- pr_info(" %u insertions failed due to memory pressure\n",
- insert_fails);
+ if (insert_retries)
+ pr_info(" %u insertions retried due to memory pressure\n",
+ insert_retries);

test_bucket_stats(ht);
rcu_read_lock();
@@ -244,7 +252,7 @@ static int thread_lookup_test(struct thread_data *tdata)

static int threadfunc(void *data)
{
- int i, step, err = 0, insert_fails = 0;
+ int i, step, err = 0, insert_retries = 0;
struct thread_data *tdata = data;

up(&prestart_sem);
@@ -253,21 +261,18 @@ static int threadfunc(void *data)

for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
- cond_resched();
- err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
- test_rht_params);
- if (err == -ENOMEM || err == -EBUSY) {
- tdata->objs[i].value = TEST_INSERT_FAIL;
- insert_fails++;
+ err = insert_retry(&ht, &tdata->objs[i].node, test_rht_params);
+ if (err > 0) {
+ insert_retries += err;
} else if (err) {
pr_err(" thread[%d]: rhashtable_insert_fast failed\n",
tdata->id);
goto out;
}
}
- if (insert_fails)
- pr_info(" thread[%d]: %d insert failures\n",
- tdata->id, insert_fails);
+ if (insert_retries)
+ pr_info(" thread[%d]: %u insertions retried due to memory pressure\n",
+ tdata->id, insert_retries);

err = thread_lookup_test(tdata);
if (err) {
--
2.1.2

2015-11-20 17:17:26

by Phil Sutter

[permalink] [raw]
Subject: [PATCH v2 3/4] rhashtable-test: calculate max_entries value by default

A maximum table size of 64k entries is insufficient for the multiple
threads test even in the default configuration (10 threads * 50000
objects = 500000 objects in total). Since we know how many objects will
be inserted, calculate the max size unless overridden by parameter.

Note that specifying the exact number of objects upon table init won't
suffice as that value is rounded down to the next power of two -
anticipate this by rounding up to the next power of two beforehand.
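
For illustration, with the default parameters the concurrency test needs:

	tcount * entries           = 10 * 50000 = 500000 objects
	roundup_pow_of_two(500000) = 524288 (2^19)

whereas passing 500000 directly as max_size would be rounded down to
262144 (2^18), i.e. below the number of inserted objects.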

Signed-off-by: Phil Sutter <[email protected]>
---
lib/test_rhashtable.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index cfc3440..6fa77b3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -36,9 +36,9 @@ static int runs = 4;
module_param(runs, int, 0);
MODULE_PARM_DESC(runs, "Number of test runs per variant (default: 4)");

-static int max_size = 65536;
+static int max_size = 0;
module_param(max_size, int, 0);
-MODULE_PARM_DESC(runs, "Maximum table size (default: 65536)");
+MODULE_PARM_DESC(runs, "Maximum table size (default: calculated)");

static bool shrinking = false;
module_param(shrinking, bool, 0);
@@ -321,7 +321,7 @@ static int __init test_rht_init(void)
entries = min(entries, MAX_ENTRIES);

test_rht_params.automatic_shrinking = shrinking;
- test_rht_params.max_size = max_size;
+ test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
test_rht_params.nelem_hint = size;

pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
@@ -367,6 +367,8 @@ static int __init test_rht_init(void)
return -ENOMEM;
}

+ test_rht_params.max_size = max_size ? :
+ roundup_pow_of_two(tcount * entries);
err = rhashtable_init(&ht, &test_rht_params);
if (err < 0) {
pr_warn("Test failed: Unable to initialize hashtable: %d\n",
--
2.1.2

2015-11-20 17:19:26

by Phil Sutter

[permalink] [raw]
Subject: [PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned

This is rather a hack to expose the current issue of rhashtable
sometimes returning -ENOMEM under high pressure even though system
memory is not exhausted and a subsequent insert may succeed.

Signed-off-by: Phil Sutter <[email protected]>
---
lib/test_rhashtable.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 6fa77b3..270bf72 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -52,6 +52,10 @@ static int tcount = 10;
module_param(tcount, int, 0);
MODULE_PARM_DESC(tcount, "Number of threads to spawn (default: 10)");

+static bool enomem_retry = false;
+module_param(enomem_retry, bool, 0);
+MODULE_PARM_DESC(enomem_retry, "Retry insert even if -ENOMEM was returned (default: off)");
+
struct test_obj {
int value;
struct rhash_head node;
@@ -79,14 +83,22 @@ static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params)
{
- int err, retries = -1;
+ int err, retries = -1, enomem_retries = 0;

do {
retries++;
cond_resched();
err = rhashtable_insert_fast(ht, obj, params);
+ if (err == -ENOMEM && enomem_retry) {
+ enomem_retries++;
+ err = -EBUSY;
+ }
} while (err == -EBUSY);

+ if (enomem_retries)
+ pr_info(" %u insertions retried after -ENOMEM\n",
+ enomem_retries);
+
return err ? : retries;
}

--
2.1.2

2015-11-20 17:28:45

by Phil Sutter

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned

On Fri, Nov 20, 2015 at 06:17:20PM +0100, Phil Sutter wrote:
> This is rather a hack to expose the current issue with rhashtable to
> under high pressure sometimes return -ENOMEM even though system memory
> is not exhausted and a consecutive insert may succeed.

Please note that this problem does not show up every time the test is
run in its default configuration on my system. With an increased number
of threads though, it becomes very visible. Load test_rhashtable like so:

modprobe test_rhashtable enomem_retry=1 tcount=20

and grep dmesg for 'insertions retried after -ENOMEM'. In my case:

# dmesg | grep -E '(insertions retried after -ENOMEM|Started)' | tail
[ 34.642980] 1 insertions retried after -ENOMEM
[ 34.642989] 1 insertions retried after -ENOMEM
[ 34.642994] 1 insertions retried after -ENOMEM
[ 34.648353] 28294 insertions retried after -ENOMEM
[ 34.689687] 31262 insertions retried after -ENOMEM
[ 34.714015] 16280 insertions retried after -ENOMEM
[ 34.736019] 15327 insertions retried after -ENOMEM
[ 34.755100] 39012 insertions retried after -ENOMEM
[ 34.769116] 49369 insertions retried after -ENOMEM
[ 35.387200] Started 20 threads, 0 failed

Cheers, Phil

2015-11-23 17:38:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

From: Phil Sutter <[email protected]>
Date: Fri, 20 Nov 2015 18:17:16 +0100

> The following series aims to improve lib/test_rhashtable in different
> situations:
>
> Patch 1 allows the kernel to reschedule so the test does not block too
> long on slow systems.
> Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
> error case (-EBUSY).
> Patch 3 auto-adjusts the upper table size limit according to the number
> of threads (in concurrency test). In fact, the current default is
> already too small.
> Patch 4 makes it possible to retry inserts even in supposedly permanent
> error case (-ENOMEM) to expose rhashtable's remaining problem of
> -ENOMEM being not as permanent as it is expected to be.
>
> Changes since v1:
> - Introduce insert_retry() which is then used in single-threaded test as
> well.
> - Do not retry inserts by default if -ENOMEM was returned.
> - Rename the retry counter to be a bit more verbose about what it
> contains.
> - Add patch 4 as a debugging aid.

Series applied, thanks Phil.

2015-11-30 09:38:13

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

Phil Sutter <[email protected]> wrote:
> The following series aims to improve lib/test_rhashtable in different
> situations:
>
> Patch 1 allows the kernel to reschedule so the test does not block too
> long on slow systems.
> Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
> error case (-EBUSY).
> Patch 3 auto-adjusts the upper table size limit according to the number
> of threads (in concurrency test). In fact, the current default is
> already too small.
> Patch 4 makes it possible to retry inserts even in supposedly permanent
> error case (-ENOMEM) to expose rhashtable's remaining problem of
> -ENOMEM being not as permanent as it is expected to be.

I'm sorry but this patch series is simply bogus.

If rhashtable is indeed returning such errors under normal
conditions then rhashtable is broken and we must fix it instead
of working around it in the test code!

FWIW I still haven't been able to reproduce this problem, perhaps
because my machines have too few CPUs?

So can someone please help me reproduce this? Because just loading
test_rhashtable isn't doing it.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-11-30 10:14:08

by Phil Sutter

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

On Mon, Nov 30, 2015 at 05:37:55PM +0800, Herbert Xu wrote:
> Phil Sutter <[email protected]> wrote:
> > The following series aims to improve lib/test_rhashtable in different
> > situations:
> >
> > Patch 1 allows the kernel to reschedule so the test does not block too
> > long on slow systems.
> > Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
> > error case (-EBUSY).
> > Patch 3 auto-adjusts the upper table size limit according to the number
> > of threads (in concurrency test). In fact, the current default is
> > already too small.
> > Patch 4 makes it possible to retry inserts even in supposedly permanent
> > error case (-ENOMEM) to expose rhashtable's remaining problem of
> > -ENOMEM being not as permanent as it is expected to be.
>
> I'm sorry but this patch series is simply bogus.

The whole series?!

> If rhashtable is indeed returning such errors under normal
> conditions then rhashtable is broken and we must fix it instead
> of working around it in the test code!

You're stating the obvious. Remember, the reason I prepared patch 4 was
that you wanted to fix just that bug in rhashtable in the first place.

Just to make this clear: patches 1-3 are reasonable on their own; the
only connection to the bug is that patch 2 makes it visible (at least on
my system it wasn't visible before).

> FWIW I still haven't been able to reproduce this problem, perhaps
> because my machines have too few CPUs?

Did you try with my bogus patch series applied? How many CPUs does your
test system actually have?

> So can someone please help me reproduce this? Because just loading
> test_rhashtable isn't doing it.

As said, maybe you need to increase the number of spawned threads
(tcount=50 or so).

Cheers, Phil

2015-11-30 10:19:12

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

On Mon, Nov 30, 2015 at 11:14:01AM +0100, Phil Sutter wrote:
> On Mon, Nov 30, 2015 at 05:37:55PM +0800, Herbert Xu wrote:
> > Phil Sutter <[email protected]> wrote:
> > > The following series aims to improve lib/test_rhashtable in different
> > > situations:
> > >
> > > Patch 1 allows the kernel to reschedule so the test does not block too
> > > long on slow systems.
> > > Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
> > > error case (-EBUSY).
> > > Patch 3 auto-adjusts the upper table size limit according to the number
> > > of threads (in concurrency test). In fact, the current default is
> > > already too small.
> > > Patch 4 makes it possible to retry inserts even in supposedly permanent
> > > error case (-ENOMEM) to expose rhashtable's remaining problem of
> > > -ENOMEM being not as permanent as it is expected to be.
> >
> > I'm sorry but this patch series is simply bogus.
>
> The whole series?!

Well at least patch two and four seem clearly wrong because no
rhashtable user should need to retry insertions.

> Did you try with my bogus patch series applied? How many CPUs does your
> test system actually have?
>
> > So can someone please help me reproduce this? Because just loading
> > test_rhashtable isn't doing it.
>
> As said, maybe you need to increase the number of spawned threads
> (tcount=50 or so).

OK that's better. I think I see the problem. The test in
rhashtable_insert_rehash is racy and if two threads both try
to grow the table one of them may be tricked into doing a rehash
instead.

I'm working on a fix.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-03 12:41:48

by Herbert Xu

[permalink] [raw]
Subject: rhashtable: Prevent spurious EBUSY errors on insertion

On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
>
> OK that's better. I think I see the problem. The test in
> rhashtable_insert_rehash is racy and if two threads both try
> to grow the table one of them may be tricked into doing a rehash
> instead.
>
> I'm working on a fix.

OK this patch fixes the EBUSY problem as far as I can tell. Please
let me know if you still observe EBUSY with it. I'll respond to the
ENOMEM problem in another email.

---8<---
Thomas and Phil observed that under stress rhashtable insertion
sometimes failed with EBUSY, even though this error should only
ever be seen when we're under attack and our hash chain length
has grown to an unacceptable level, even after a rehash.

It turns out that the logic for detecting whether there is an
existing rehash is faulty. In particular, when two threads both
try to grow the same table at the same time, one of them may see
the newly grown table and thus erroneously conclude that it had
been rehashed. This is what leads to the EBUSY error.

This patch fixes this by remembering the current last table we
used during insertion so that rhashtable_insert_rehash can detect
when another thread has also done a resize/rehash. When this is
detected we will give up our resize/rehash and simply retry the
insertion with the new table.

Reported-by: Thomas Graf <[email protected]>
Reported-by: Phil Sutter <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 843ceca..e50b31d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -19,6 +19,7 @@

#include <linux/atomic.h>
#include <linux/compiler.h>
+#include <linux/err.h>
#include <linux/errno.h>
#include <linux/jhash.h>
#include <linux/list_nulls.h>
@@ -339,10 +340,11 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
int rhashtable_init(struct rhashtable *ht,
const struct rhashtable_params *params);

-int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
- struct rhash_head *obj,
- struct bucket_table *old_tbl);
-int rhashtable_insert_rehash(struct rhashtable *ht);
+struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
+ const void *key,
+ struct rhash_head *obj,
+ struct bucket_table *old_tbl);
+int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);

int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
void rhashtable_walk_exit(struct rhashtable_iter *iter);
@@ -598,9 +600,11 @@ restart:

new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
if (unlikely(new_tbl)) {
- err = rhashtable_insert_slow(ht, key, obj, new_tbl);
- if (err == -EAGAIN)
+ tbl = rhashtable_insert_slow(ht, key, obj, new_tbl);
+ if (!IS_ERR_OR_NULL(tbl))
goto slow_path;
+
+ err = PTR_ERR(tbl);
goto out;
}

@@ -611,7 +615,7 @@ restart:
if (unlikely(rht_grow_above_100(ht, tbl))) {
slow_path:
spin_unlock_bh(lock);
- err = rhashtable_insert_rehash(ht);
+ err = rhashtable_insert_rehash(ht, tbl);
rcu_read_unlock();
if (err)
return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a54ff89..2ff7ed9 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -389,33 +389,31 @@ static bool rhashtable_check_elasticity(struct rhashtable *ht,
return false;
}

-int rhashtable_insert_rehash(struct rhashtable *ht)
+int rhashtable_insert_rehash(struct rhashtable *ht,
+ struct bucket_table *tbl)
{
struct bucket_table *old_tbl;
struct bucket_table *new_tbl;
- struct bucket_table *tbl;
unsigned int size;
int err;

old_tbl = rht_dereference_rcu(ht->tbl, ht);
- tbl = rhashtable_last_table(ht, old_tbl);

size = tbl->size;

+ err = -EBUSY;
+
if (rht_grow_above_75(ht, tbl))
size *= 2;
/* Do not schedule more than one rehash */
else if (old_tbl != tbl)
- return -EBUSY;
+ goto fail;
+
+ err = -ENOMEM;

new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
- if (new_tbl == NULL) {
- /* Schedule async resize/rehash to try allocation
- * non-atomic context.
- */
- schedule_work(&ht->run_work);
- return -ENOMEM;
- }
+ if (new_tbl == NULL)
+ goto fail;

err = rhashtable_rehash_attach(ht, tbl, new_tbl);
if (err) {
@@ -426,12 +424,24 @@ int rhashtable_insert_rehash(struct rhashtable *ht)
schedule_work(&ht->run_work);

return err;
+
+fail:
+ /* Do not fail the insert if someone else did a rehash. */
+ if (likely(rcu_dereference_raw(tbl->future_tbl)))
+ return 0;
+
+ /* Schedule async rehash to retry allocation in process context. */
+ if (err == -ENOMEM)
+ schedule_work(&ht->run_work);
+
+ return err;
}
EXPORT_SYMBOL_GPL(rhashtable_insert_rehash);

-int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
- struct rhash_head *obj,
- struct bucket_table *tbl)
+struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
+ const void *key,
+ struct rhash_head *obj,
+ struct bucket_table *tbl)
{
struct rhash_head *head;
unsigned int hash;
@@ -467,7 +477,12 @@ int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
exit:
spin_unlock(rht_bucket_lock(tbl, hash));

- return err;
+ if (err == 0)
+ return NULL;
+ else if (err == -EAGAIN)
+ return tbl;
+ else
+ return ERR_PTR(err);
}
EXPORT_SYMBOL_GPL(rhashtable_insert_slow);

--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-03 12:51:27

by Herbert Xu

[permalink] [raw]
Subject: rhashtable: ENOMEM errors when hit with a flood of insertions

On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
>
> OK that's better. I think I see the problem. The test in
> rhashtable_insert_rehash is racy and if two threads both try
> to grow the table one of them may be tricked into doing a rehash
> instead.
>
> I'm working on a fix.

While the EBUSY errors are gone for me, I can still see plenty
of ENOMEM errors. In fact it turns out that the reason is quite
understandable. When you pound the rhashtable hard so that it
doesn't actually get a chance to grow the table in process context,
then the table will only grow with GFP_ATOMIC allocations.

For me this starts failing regularly at around 2^19 entries, which
requires about 1024 contiguous pages if I'm not mistaken.
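
For reference, a back-of-the-envelope check of that figure, assuming
8 bytes per bucket pointer and 4 KiB pages (and ignoring the
bucket_table header):

	2^19 buckets * 8 bytes/bucket = 4 MiB = 1024 pages of 4 KiB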

I've got a fairly straightforward solution for this, but it does
mean that we have to add another level of complexity to the
rhashtable implementation. So before I go there I want to be
absolutely sure that we need it.

I guess the question is do we care about users that pound rhashtable
in this fashion?

My answer would be yes but I'd like to hear your opinions.

My solution is to use a slightly more complex/less efficient hash
table when we fail the allocation in interrupt context. Instead
of allocating contiguous pages, we'll simply switch to allocating
individual pages and have a master page that points to them.

On a 64-bit platform, each page can accommodate 512 entries. So
with a two-level deep setup (meaning one extra access for a hash
lookup), this would accommodate 2^18 entries. Three levels (two
extra lookups) will give us 2^27 entries, which should be enough.
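
Spelled out, with the same assumptions as above:

	4096 bytes / 8 bytes per entry = 512 entries per page
	two levels:   512 * 512        = 262144    (2^18) entries
	three levels: 512 * 512 * 512  = 134217728 (2^27) entries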

When we do this we should of course schedule an async rehash so
that as soon as we get a chance we can move the entries into a
normal hash table that needs only a single lookup.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-03 15:09:59

by David Laight

[permalink] [raw]
Subject: RE: rhashtable: ENOMEM errors when hit with a flood of insertions

From: Herbert Xu
> Sent: 03 December 2015 12:51
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> >
> > OK that's better. I think I see the problem. The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> >
> > I'm working on a fix.
>
> While the EBUSY errors are gone for me, I can still see plenty
> of ENOMEM errors. In fact it turns out that the reason is quite
> understandable. When you pound the rhashtable hard so that it
> doesn't actually get a chance to grow the table in process context,
> then the table will only grow with GFP_ATOMIC allocations.
>
> For me this starts failing regularly at around 2^19 entries, which
> requires about 1024 contiguous pages if I'm not mistaken.

ISTM that you should always let the insert succeed - even if it makes
the average/maximum chain length increase beyond some limit.
Any limit on the number of hashed items should have been done earlier
by the calling code.
The slight performance decrease caused by scanning longer chains
is almost certainly more 'user friendly' than an error return.

Hoping to get 1024+ contiguous VA pages does seem over-optimistic.

With a 2-level lookup you could make all the 2nd level tables
a fixed size (maybe 4 or 8 pages?) and extend the first level
table as needed.
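
A minimal sketch of such a layout, names and sizes invented purely for
illustration (this is not from an actual patch):

	#include <linux/rhashtable.h>

	#define LEAF_PAGES	4	/* fixed-size 2nd-level tables */
	#define LEAF_BUCKETS	(LEAF_PAGES * PAGE_SIZE / \
				 sizeof(struct rhash_head __rcu *))

	struct two_level_table {
		unsigned int nleaves;	/* 1st-level slots in use */
		/* each leaf is an array of LEAF_BUCKETS bucket heads */
		struct rhash_head __rcu **leaves[];
	};

	/* Map a hash value to its bucket slot via the two levels. */
	static inline struct rhash_head __rcu **
	two_level_bucket(struct two_level_table *t, unsigned int hash)
	{
		return &t->leaves[hash / LEAF_BUCKETS][hash % LEAF_BUCKETS];
	}

Growing the table then means allocating a bigger first-level array plus
the extra leaves, never a single huge contiguous block.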

David

2015-12-03 15:38:09

by Phil Sutter

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Thu, Dec 03, 2015 at 08:41:29PM +0800, Herbert Xu wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> >
> > OK that's better. I think I see the problem. The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> >
> > I'm working on a fix.
>
> OK this patch fixes the EBUSY problem as far as I can tell. Please
> let me know if you still observe EBUSY with it. I'll respond to the
> ENOMEM problem in another email.
>
> ---8<---
> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
>
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty. In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed. This is what leads to the EBUSY error.
>
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash. When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
>
> Reported-by: Thomas Graf <[email protected]>
> Reported-by: Phil Sutter <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>

Tested-by: Phil Sutter <[email protected]>

2015-12-03 16:08:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: rhashtable: ENOMEM errors when hit with a flood of insertions

On Thu, 2015-12-03 at 20:51 +0800, Herbert Xu wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> >
> > OK that's better. I think I see the problem. The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> >
> > I'm working on a fix.
>
> While the EBUSY errors are gone for me, I can still see plenty
> of ENOMEM errors. In fact it turns out that the reason is quite
> understandable. When you pound the rhashtable hard so that it
> doesn't actually get a chance to grow the table in process context,
> then the table will only grow with GFP_ATOMIC allocations.
>
> For me this starts failing regularly at around 2^19 entries, which
> requires about 1024 contiguous pages if I'm not mistaken.

Well, it will fail before this point if memory is fragmented.

Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a54ff8949f91..9ef5d74963b2 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -120,8 +120,9 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) ||
gfp != GFP_KERNEL)
tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
- if (tbl == NULL && gfp == GFP_KERNEL)
- tbl = vzalloc(size);
+ if (tbl == NULL)
+ tbl = __vmalloc(size, gfp | __GFP_HIGHMEM | __GFP_ZERO,
+ PAGE_KERNEL);
if (tbl == NULL)
return NULL;


2015-12-04 00:07:17

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: ENOMEM errors when hit with a flood of insertions

On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
>
> Well, it will fail before this point if memory is fragmented.

Indeed, I was surprised that it even worked up to that point,
possibly because the previous resizes might have actually been
done in process context.

> Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?

Ah I didn't know that. That would be much simpler. I'll give it a
try.

Thanks Eric!
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-04 14:40:09

by Herbert Xu

[permalink] [raw]
Subject: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
>
> Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?

OK I've tried it and I no longer get any ENOMEM errors!

---8<---
When an rhashtable user pounds rhashtable hard with back-to-back
insertions we may end up growing the table in GFP_ATOMIC context.
Unfortunately when the table reaches a certain size this often
fails because we don't have enough physically contiguous pages
to hold the new table.

Eric Dumazet suggested (and in fact wrote this patch) using
__vmalloc instead which can be used in GFP_ATOMIC context.

Reported-by: Phil Sutter <[email protected]>
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a54ff89..1c624db 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -120,8 +120,9 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) ||
gfp != GFP_KERNEL)
tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
- if (tbl == NULL && gfp == GFP_KERNEL)
- tbl = vzalloc(size);
+ if (tbl == NULL)
+ tbl = __vmalloc(size, gfp | __GFP_HIGHMEM | __GFP_ZERO,
+ PAGE_KERNEL);
if (tbl == NULL)
return NULL;

--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-04 17:01:06

by Phil Sutter

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Fri, Dec 04, 2015 at 10:39:56PM +0800, Herbert Xu wrote:
> On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
> >
> > Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?
>
> OK I've tried it and I no longer get any ENOMEM errors!

I can't confirm this, sadly. Using 50 threads, results seem to be stable
and good. But when increasing the number of threads I can provoke the
ENOMEM condition again. See the attached log which shows a failing test
run with 100 threads.

I tried to extract logs of a test run with as few failing threads as
possible, but wasn't successful. It seems like the error amplifies
itself: while I had stable success with fewer than 70 threads, once I
went beyond a margin I could not identify exactly, many more threads
failed than expected. For instance, the attached log shows 70 out of 100
threads failing, while for me every single test with 50 threads was
successful.

HTH, Phil


Attachments:
test_rhashtable_fail.log (47.41 kB)

2015-12-04 17:45:24

by Eric Dumazet

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Fri, 2015-12-04 at 18:01 +0100, Phil Sutter wrote:
> On Fri, Dec 04, 2015 at 10:39:56PM +0800, Herbert Xu wrote:
> > On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
> > >
> > > Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?
> >
> > OK I've tried it and I no longer get any ENOMEM errors!
>
> I can't confirm this, sadly. Using 50 threads, results seem to be stable
> and good. But increasing the number of threads I can provoke ENOMEM
> condition again. See attached log which shows a failing test run with
> 100 threads.
>
> I tried to extract logs of a test run with as few as possible failing
> threads, but wasn't successful. It seems like the error amplifies
> itself: While having stable success with less than 70 threads, going
> beyond a margin I could not identify exactly, much more threads failed
> than expected. For instance, the attached log shows 70 out of 100
> threads failing, while for me every single test with 50 threads was
> successful.
>
> HTH, Phil

But this patch is about GFP_ATOMIC allocations; I doubt your test is
using GFP_ATOMIC.

Threads (process context) should use GFP_KERNEL allocations.

BTW, if 100 threads are simultaneously trying to vmalloc(32 MB), this
might not be very wise :(

Only one should really do this, while others are waiting.

If we really want parallelism (multiple cpus coordinating their effort),
it should be done very differently.


2015-12-04 18:15:59

by Phil Sutter

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Fri, Dec 04, 2015 at 09:45:20AM -0800, Eric Dumazet wrote:
> On Fri, 2015-12-04 at 18:01 +0100, Phil Sutter wrote:
> > On Fri, Dec 04, 2015 at 10:39:56PM +0800, Herbert Xu wrote:
> > > On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
> > > >
> > > > Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?
> > >
> > > OK I've tried it and I no longer get any ENOMEM errors!
> >
> > I can't confirm this, sadly. Using 50 threads, results seem to be stable
> > and good. But increasing the number of threads I can provoke ENOMEM
> > condition again. See attached log which shows a failing test run with
> > 100 threads.
> >
> > I tried to extract logs of a test run with as few as possible failing
> > threads, but wasn't successful. It seems like the error amplifies
> > itself: While having stable success with less than 70 threads, going
> > beyond a margin I could not identify exactly, much more threads failed
> > than expected. For instance, the attached log shows 70 out of 100
> > threads failing, while for me every single test with 50 threads was
> > successful.
>
> But this patch is about GFP_ATOMIC allocations, I doubt your test is
> using GFP_ATOMIC.
>
> Threads (process context) should use GFP_KERNEL allocations.

Well, I assumed Herbert did his tests using test_rhashtable, and
therefore fixed whatever code-path that triggers. Maybe I'm wrong,
though.

Looking at the vmalloc allocation failure trace, it seems like it is
indeed trying to use GFP_ATOMIC from inside those threads: if I'm not
missing anything, bucket_table_alloc is called from
rhashtable_insert_rehash, which passes GFP_ATOMIC unconditionally. But
then again bucket_table_alloc should use kzalloc if 'gfp != GFP_KERNEL',
so I'm probably just cross-eyed right now.
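
For reference, the allocation branch of bucket_table_alloc() with the
__vmalloc patch applied reads (copied from the diff earlier in the
thread, context trimmed):

	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) ||
	    gfp != GFP_KERNEL)
		tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
	if (tbl == NULL)
		tbl = __vmalloc(size, gfp | __GFP_HIGHMEM | __GFP_ZERO,
				PAGE_KERNEL);
	if (tbl == NULL)
		return NULL;

So for GFP_ATOMIC the kzalloc() is attempted first, but once it fails
the fallback now reaches __vmalloc() from atomic context as well - which
would match the trace.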

> BTW, if 100 threads are simultaneously trying to vmalloc(32 MB), this
> might not be very wise :(
>
> Only one should really do this, while others are waiting.

Sure, that was my previous understanding of how this thing works.

> If we really want parallelism (multiple cpus coordinating their effort),
> it should be done very differently.

Maybe my approach of stress-testing rhashtable was too naive in the
first place.

Thanks, Phil

2015-12-04 19:38:48

by David Miller

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

From: Herbert Xu <[email protected]>
Date: Thu, 3 Dec 2015 20:41:29 +0800

> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
>
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty. In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed. This is what leads to the EBUSY error.
>
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash. When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
>
> Reported-by: Thomas Graf <[email protected]>
> Reported-by: Phil Sutter <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>

Looks good, applied, thanks Herbert.

2015-12-04 21:53:39

by David Miller

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

From: Herbert Xu <[email protected]>
Date: Fri, 4 Dec 2015 22:39:56 +0800

> When an rhashtable user pounds rhashtable hard with back-to-back
> insertions we may end up growing the table in GFP_ATOMIC context.
> Unfortunately when the table reaches a certain size this often
> fails because we don't have enough physically contiguous pages
> to hold the new table.
>
> Eric Dumazet suggested (and in fact wrote this patch) using
> __vmalloc instead which can be used in GFP_ATOMIC context.
>
> Reported-by: Phil Sutter <[email protected]>
> Suggested-by: Eric Dumazet <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>

Applied, thanks Herbert.

2015-12-05 07:04:14

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Fri, Dec 04, 2015 at 04:53:34PM -0500, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Fri, 4 Dec 2015 22:39:56 +0800
>
> > When an rhashtable user pounds rhashtable hard with back-to-back
> > insertions we may end up growing the table in GFP_ATOMIC context.
> > Unfortunately when the table reaches a certain size this often
> > fails because we don't have enough physically contiguous pages
> > to hold the new table.
> >
> > Eric Dumazet suggested (and in fact wrote this patch) using
> > __vmalloc instead which can be used in GFP_ATOMIC context.
> >
> > Reported-by: Phil Sutter <[email protected]>
> > Suggested-by: Eric Dumazet <[email protected]>
> > Signed-off-by: Herbert Xu <[email protected]>
>
> Applied, thanks Herbert.

Sorry Dave but you'll have to revert this because I've been able
to trigger the following crash with the patch:

Testing concurrent rhashtable access from 50 threads
------------[ cut here ]------------
kernel BUG at ../mm/vmalloc.c:1337!
invalid opcode: 0000 [#1] PREEMPT SMP

The reason is that I was testing insertions with BH disabled,
and __vmalloc doesn't like that, even with GFP_ATOMIC. As we
obviously want to continue to support rhashtable users inserting
entries with BH disabled, we'll have to look for an alternate
solution.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-05 07:06:14

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Fri, Dec 04, 2015 at 07:15:55PM +0100, Phil Sutter wrote:
>
> > Only one should really do this, while others are waiting.
>
> Sure, that was my previous understanding of how this thing works.

Yes that's clearly how it should be. Unfortunately while adding
the locking to do this, I found out that you can't actually call
__vmalloc with BH disabled so this is a no-go.

Unless we can make __vmalloc work with BH disabled, I guess we'll
have to go back to multi-level lookups unless someone has a better
suggestion.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-06 03:48:35

by David Miller

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

From: Herbert Xu <[email protected]>
Date: Sat, 5 Dec 2015 15:03:54 +0800

> Sorry Dave but you'll have to revert this because I've been able
> to trigger the following crash with the patch:
>
> Testing concurrent rhashtable access from 50 threads
> ------------[ cut here ]------------
> kernel BUG at ../mm/vmalloc.c:1337!
> invalid opcode: 0000 [#1] PREEMPT SMP
>
> The reason is that because I was testing insertions with BH disabled,
> and __vmalloc doesn't like that, even with GFP_ATOMIC. As we
> obviously want to continue to support rhashtable users inserting
> entries with BH disabled, we'll have to look for an alternate
> solution.

Ok, reverted, thanks for the heads up.

2015-12-07 15:35:32

by Thomas Graf

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On 12/05/15 at 03:06pm, Herbert Xu wrote:
> On Fri, Dec 04, 2015 at 07:15:55PM +0100, Phil Sutter wrote:
> >
> > > Only one should really do this, while others are waiting.
> >
> > Sure, that was my previous understanding of how this thing works.
>
> Yes that's clearly how it should be. Unfortunately while adding
> the locking to do this, I found out that you can't actually call
> __vmalloc with BH disabled so this is a no-go.
>
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Thanks for fixing the race.

As for the remaining problem, I think we'll have to find a way to
serve a hard pounding user if we want to convert TCP hashtables
later on.

Did you look into what prevents __vmalloc from working with BH disabled?

2015-12-07 19:29:57

by David Miller

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

From: Thomas Graf <[email protected]>
Date: Mon, 7 Dec 2015 16:35:24 +0100

> Did you look into what __vmalloc prevents to work with BH disabled?

You can't issue the cross-cpu TLB flushes from atomic contexts.
It's the kernel page table updates that create the restriction.

2015-12-09 02:18:30

by Thomas Graf

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation


On 12/05/15 at 03:06pm, Herbert Xu wrote:
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Assuming that we only encounter this scenario with very large
table sizes, it might be OK to assume that deferring the actual
resize via the worker thread while continuing to insert above
100% utilization in atomic context is safe.

On 12/07/15 at 02:29pm, David Miller wrote:
> You can't issue the cross-cpu TLB flushes from atomic contexts.
> It's the kernel page table updates that create the restriction.

2015-12-09 02:25:14

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Wed, Dec 09, 2015 at 03:18:26AM +0100, Thomas Graf wrote:
>
> Assuming that we only encounter this scenario with very large
> table sizes, it might be OK to assume that deferring the actual
> resize via the worker thread while continuing to insert above
> 100% utilization in atomic context is safe.

As test_rhashtable has demonstrated already this approach doesn't
work. There is nothing in the kernel that will ensure that the
worker thread gets to run at all.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-09 02:36:36

by Thomas Graf

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On 12/09/15 at 10:24am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:18:26AM +0100, Thomas Graf wrote:
> >
> > Assuming that we only encounter this scenario with very large
> > table sizes, it might be OK to assume that deferring the actual
> > resize via the worker thread while continuing to insert above
> > 100% utilization in atomic context is safe.
>
> As test_rhashtable has demonstrated already this approach doesn't
> work. There is nothing in the kernel that will ensure that the
> worker thread gets to run at all.

If we define "work" as meaning that an insertion in atomic context
should never fail, then yes. I'm not sure you can guarantee that with a
segmented table either, though. I agree that the insertion behaviour is
much better defined.

My argument is that if we are in a situation in which a worker
thread is never invoked and we've grown 2x from the original
table size, do we still need entries to be inserted into the
table or can we fail?

Without knowing your exact implementation plans: introducing an
additional reference indirection for every lookup will have a
huge performance penalty as well.

Is your plan to only introduce the master table after an
allocation has failed?

2015-12-09 02:38:55

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On Wed, Dec 09, 2015 at 03:36:32AM +0100, Thomas Graf wrote:
>
> Without knowing your exact implementation plans: introducing an
> additional reference indirection for every lookup will have a
> huge performance penalty as well.
>
> Is your plan to only introduce the master table after an
> allocation has failed?

Right, obviously the extra indirections would only come into play
after a failed allocation. As soon as we can run the worker thread
it'll try to remove the extra indirections by doing vmalloc.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-09 02:42:53

by Thomas Graf

[permalink] [raw]
Subject: Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

On 12/09/15 at 10:38am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:36:32AM +0100, Thomas Graf wrote:
> >
> > Without knowing your exact implementation plans: introducing an
> > additional reference indirection for every lookup will have a
> > huge performance penalty as well.
> >
> > Is your plan to only introduce the master table after an
> > allocation has failed?
>
> Right, obviously the extra indirections would only come into play
> after a failed allocation. As soon as we can run the worker thread
> it'll try to remove the extra indirections by doing vmalloc.

OK, this sounds like a good compromise. The penalty is isolated
for the duration of the atomic burst.

2015-12-17 08:46:08

by Xin Long

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Thu, Dec 3, 2015 at 8:41 PM, Herbert Xu <[email protected]> wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
>>
>> OK that's better. I think I see the problem. The test in
>> rhashtable_insert_rehash is racy and if two threads both try
>> to grow the table one of them may be tricked into doing a rehash
>> instead.
>>
>> I'm working on a fix.
>
> OK this patch fixes the EBUSY problem as far as I can tell. Please
> let me know if you still observe EBUSY with it. I'll respond to the
> ENOMEM problem in another email.
>
> ---8<---
> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
>
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty. In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed. This is what leads to the EBUSY error.
>
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash. When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
>
> Reported-by: Thomas Graf <[email protected]>
> Reported-by: Phil Sutter <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>
>
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 843ceca..e50b31d 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -19,6 +19,7 @@
>
> #include <linux/atomic.h>
> #include <linux/compiler.h>
> +#include <linux/err.h>
> #include <linux/errno.h>
> #include <linux/jhash.h>
> #include <linux/list_nulls.h>
> @@ -339,10 +340,11 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
> int rhashtable_init(struct rhashtable *ht,
> const struct rhashtable_params *params);
>
> -int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> - struct rhash_head *obj,
> - struct bucket_table *old_tbl);
> -int rhashtable_insert_rehash(struct rhashtable *ht);
> +struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
> + const void *key,
> + struct rhash_head *obj,
> + struct bucket_table *old_tbl);
> +int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
>
> int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
> void rhashtable_walk_exit(struct rhashtable_iter *iter);
> @@ -598,9 +600,11 @@ restart:
>
> new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
> if (unlikely(new_tbl)) {
> - err = rhashtable_insert_slow(ht, key, obj, new_tbl);
> - if (err == -EAGAIN)
> + tbl = rhashtable_insert_slow(ht, key, obj, new_tbl);
> + if (!IS_ERR_OR_NULL(tbl))
> goto slow_path;
> +
> + err = PTR_ERR(tbl);
> goto out;
> }
>
> @@ -611,7 +615,7 @@ restart:
> if (unlikely(rht_grow_above_100(ht, tbl))) {
> slow_path:
> spin_unlock_bh(lock);
> - err = rhashtable_insert_rehash(ht);
> + err = rhashtable_insert_rehash(ht, tbl);
> rcu_read_unlock();
> if (err)
> return err;
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index a54ff89..2ff7ed9 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -389,33 +389,31 @@ static bool rhashtable_check_elasticity(struct rhashtable *ht,
> return false;
> }
>
> -int rhashtable_insert_rehash(struct rhashtable *ht)
> +int rhashtable_insert_rehash(struct rhashtable *ht,
> + struct bucket_table *tbl)
> {
> struct bucket_table *old_tbl;
> struct bucket_table *new_tbl;
> - struct bucket_table *tbl;
> unsigned int size;
> int err;
>
> old_tbl = rht_dereference_rcu(ht->tbl, ht);
> - tbl = rhashtable_last_table(ht, old_tbl);
>
> size = tbl->size;
>
> + err = -EBUSY;
> +
> if (rht_grow_above_75(ht, tbl))
> size *= 2;
> /* Do not schedule more than one rehash */
> else if (old_tbl != tbl)
> - return -EBUSY;
> + goto fail;
> +
> + err = -ENOMEM;
>
> new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
> - if (new_tbl == NULL) {
> - /* Schedule async resize/rehash to try allocation
> - * non-atomic context.
> - */
> - schedule_work(&ht->run_work);
> - return -ENOMEM;
> - }
> + if (new_tbl == NULL)
> + goto fail;
>
> err = rhashtable_rehash_attach(ht, tbl, new_tbl);
> if (err) {
> @@ -426,12 +424,24 @@ int rhashtable_insert_rehash(struct rhashtable *ht)
> schedule_work(&ht->run_work);
>
> return err;
> +
> +fail:
> + /* Do not fail the insert if someone else did a rehash. */
> + if (likely(rcu_dereference_raw(tbl->future_tbl)))
> + return 0;
> +
> + /* Schedule async rehash to retry allocation in process context. */
> + if (err == -ENOMEM)
> + schedule_work(&ht->run_work);
> +
> + return err;
> }
> EXPORT_SYMBOL_GPL(rhashtable_insert_rehash);
>
> -int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> - struct rhash_head *obj,
> - struct bucket_table *tbl)
> +struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
> + const void *key,
> + struct rhash_head *obj,
> + struct bucket_table *tbl)
> {
> struct rhash_head *head;
> unsigned int hash;
> @@ -467,7 +477,12 @@ int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> exit:
> spin_unlock(rht_bucket_lock(tbl, hash));
>
> - return err;
> + if (err == 0)
> + return NULL;
> + else if (err == -EAGAIN)
> + return tbl;
> + else
> + return ERR_PTR(err);
> }
> EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
>

Sorry for the late test, but unfortunately my case with rhashtable still
returns EBUSY.
I added some debug code in rhashtable_insert_rehash(), and found:
*future_tbl is null*

fail:
	/* Do not fail the insert if someone else did a rehash. */
	if (likely(rcu_dereference_raw(tbl->future_tbl))) {
		printk("future_tbl is there\n");
		return 0;
	} else {
		printk("future_tbl is null\n");
	}

any idea why ?

2015-12-17 08:48:23

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>
> sorry for late test, but unfortunately, my case with rhashtalbe still
> return EBUSY.
> I added some debug code in rhashtable_insert_rehash(), and found:
> *future_tbl is null*
>
> fail:
> /* Do not fail the insert if someone else did a rehash. */
> if (likely(rcu_dereference_raw(tbl->future_tbl))) {
> printk("future_tbl is there\n");
> return 0;
> } else {
> printk("future_tbl is null\n");
> }
>
> any idea why ?

That's presumably because you got a genuine double rehash.

Until you post your code we can't really help you.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-17 09:00:37

by Xin Long

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <[email protected]> wrote:
> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>
>> sorry for late test, but unfortunately, my case with rhashtalbe still
>> return EBUSY.
>> I added some debug code in rhashtable_insert_rehash(), and found:
>> *future_tbl is null*
>>
>> fail:
>> /* Do not fail the insert if someone else did a rehash. */
>> if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>> printk("future_tbl is there\n");
>> return 0;
>> } else {
>> printk("future_tbl is null\n");
>> }
>>
>> any idea why ?
>
> That's presumably because you got a genuine double rehash.
>
> Until you post your code we can't really help you.
>
I wish I could, but my code is a big patch for sctp, and this issue
happens in a special stress test based on this patch.
I'm trying to think how I can show you. :)

2015-12-17 16:07:12

by Xin Long

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Thu, Dec 17, 2015 at 5:00 PM, Xin Long <[email protected]> wrote:
> On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <[email protected]> wrote:
>> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>>
>>> sorry for late test, but unfortunately, my case with rhashtalbe still
>>> return EBUSY.
>>> I added some debug code in rhashtable_insert_rehash(), and found:
>>> *future_tbl is null*
>>>
>>> fail:
>>> /* Do not fail the insert if someone else did a rehash. */
>>> if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>>> printk("future_tbl is there\n");
>>> return 0;
>>> } else {
>>> printk("future_tbl is null\n");
>>> }
>>>
>>> any idea why ?
>>
>> That's presumably because you got a genuine double rehash.
>>
>> Until you post your code we can't really help you.
>>
> i wish i could , but my codes is a big patch for sctp, and this issue
> happens in a special stress test based on this patch.
> im trying to think how i can show you. :)

I'm just wondering, why don't we handle the genuine double rehash
issue inside rhashtable? I mean, it's just a temporary error that a
simple retry may fix.

2015-12-17 17:00:29

by David Miller

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

From: Xin Long <[email protected]>
Date: Thu, 17 Dec 2015 17:00:35 +0800

> On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <[email protected]> wrote:
>> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>>
>>> sorry for late test, but unfortunately, my case with rhashtalbe still
>>> return EBUSY.
>>> I added some debug code in rhashtable_insert_rehash(), and found:
>>> *future_tbl is null*
>>>
>>> fail:
>>> /* Do not fail the insert if someone else did a rehash. */
>>> if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>>> printk("future_tbl is there\n");
>>> return 0;
>>> } else {
>>> printk("future_tbl is null\n");
>>> }
>>>
>>> any idea why ?
>>
>> That's presumably because you got a genuine double rehash.
>>
>> Until you post your code we can't really help you.
>>
> i wish i could , but my codes is a big patch for sctp, and this issue
> happens in a special stress test based on this patch.
> im trying to think how i can show you. :)

Simply post it.

2015-12-18 02:27:11

by Herbert Xu

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Fri, Dec 18, 2015 at 12:07:08AM +0800, Xin Long wrote:
>
> I'm just wondering, why do not we handle the genuine double rehash
> issue inside rhashtable? i mean it's just a temporary error that a
> simple retry may fix it.

Because a double rehash means that someone has cracked your hash
function and there is no point in trying anymore.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-18 08:18:40

by Xin Long

[permalink] [raw]
Subject: Re: rhashtable: Prevent spurious EBUSY errors on insertion

On Fri, Dec 18, 2015 at 10:26 AM, Herbert Xu
<[email protected]> wrote:
> On Fri, Dec 18, 2015 at 12:07:08AM +0800, Xin Long wrote:
>>
>> I'm just wondering, why do not we handle the genuine double rehash
>> issue inside rhashtable? i mean it's just a temporary error that a
>> simple retry may fix it.
>
> Because a double rehash means that someone has cracked your hash
> function and there is no point in trying anymore.

OK, I get your point. But is it possible for this to be triggered by
legitimate cases under heavy insertion stress? We use rhashtable in
nftables, for example: if there is a big batch of sets to insert, may
this issue happen?

>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt