2019-04-01 23:09:04

by NeilBrown

[permalink] [raw]
Subject: [PATCH 0/4 v2] Convert rhashtable to use bitlocks

This series converts rhashtable to use a per-bucket bitlock
rather than a separate array of spinlocks.
This:
reduces memory usage
results in slightly fewer memory accesses
slightly improves parallelism
makes a configuration option unnecessary

The main change from previous version is to use a distinct type for
the pointer in the bucket which has a bit-lock in it. This
helped find two places where rht_ptr() was missed, one
in rhashtable_free_and_destroy() in print_ht in the test code.

Thanks,
NeilBrown

---

NeilBrown (4):
rhashtable: use cmpxchg() in nested_table_alloc()
rhashtable: allow rht_bucket_var to return NULL.
rhashtable: use bit_spin_locks to protect hash bucket.
rhashtable: add lockdep tracking to bucket bit-spin-locks.


include/linux/rhashtable-types.h | 2
include/linux/rhashtable.h | 271 ++++++++++++++++++++++++++------------
ipc/util.c | 1
lib/rhashtable.c | 161 ++++++++++++-----------
lib/test_rhashtable.c | 2
net/bridge/br_fdb.c | 1
net/bridge/br_multicast.c | 1
net/bridge/br_vlan.c | 1
net/bridge/br_vlan_tunnel.c | 1
net/ipv4/ipmr.c | 1
net/ipv6/ip6mr.c | 1
net/netfilter/nf_tables_api.c | 1
12 files changed, 271 insertions(+), 173 deletions(-)

--
Signature


2019-04-01 23:09:15

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/4] rhashtable: use cmpxchg() in nested_table_alloc()

nested_table_alloc() relies on the fact that there is
at most one spinlock allocated for every slot in the top
level nested table, so it is not possible for two threads
to try to allocate the same table at the same time.

This assumption is a little fragile (it is not explicit) and is
unnecessary as cmpxchg() can be used instead.

A future patch will replace the spinlocks by per-bucket bitlocks,
and then we won't be able to protect the slot pointer with a spinlock.

So replace rcu_assign_pointer() with cmpxchg() - which has equivalent
barrier properties.
If it the cmp fails, free the table that was just allocated.

Signed-off-by: NeilBrown <[email protected]>
---
lib/rhashtable.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 811d51b7cb86..6c4f5c8e9baa 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -131,9 +131,11 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
INIT_RHT_NULLS_HEAD(ntbl[i].bucket);
}

- rcu_assign_pointer(*prev, ntbl);
-
- return ntbl;
+ if (cmpxchg(prev, NULL, ntbl) == NULL)
+ return ntbl;
+ /* Raced with another thread. */
+ kfree(ntbl);
+ return rcu_dereference(*prev);
}

static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,


2019-04-01 23:09:32

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/4] rhashtable: allow rht_bucket_var to return NULL.

Rather than returning a pointer to a static nulls, rht_bucket_var()
now returns NULL if the bucket doesn't exist.
This will make the next patch, which stores a bitlock in the
bucket pointer, somewhat cleaner.

This change involves introducing __rht_bucket_nested() which is
like rht_bucket_nested(), but doesn't provide the static nulls,
and changing rht_bucket_nested() to call this and possible
provide a static nulls - as is still needed for the non-var case.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/rhashtable.h | 11 +++++++++--
lib/rhashtable.c | 29 ++++++++++++++++++++---------
2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 86dfa417848d..0c9175aeab8a 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -265,6 +265,8 @@ void rhashtable_destroy(struct rhashtable *ht);

struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
unsigned int hash);
+struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash);
struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
struct bucket_table *tbl,
unsigned int hash);
@@ -294,7 +296,7 @@ static inline struct rhash_head __rcu *const *rht_bucket(
static inline struct rhash_head __rcu **rht_bucket_var(
struct bucket_table *tbl, unsigned int hash)
{
- return unlikely(tbl->nest) ? rht_bucket_nested(tbl, hash) :
+ return unlikely(tbl->nest) ? __rht_bucket_nested(tbl, hash) :
&tbl->buckets[hash];
}

@@ -890,6 +892,8 @@ static inline int __rhashtable_remove_fast_one(
spin_lock_bh(lock);

pprev = rht_bucket_var(tbl, hash);
+ if (!pprev)
+ goto out;
rht_for_each_from(he, *pprev, tbl, hash) {
struct rhlist_head *list;

@@ -934,6 +938,7 @@ static inline int __rhashtable_remove_fast_one(
break;
}

+out:
spin_unlock_bh(lock);

if (err > 0) {
@@ -1042,6 +1047,8 @@ static inline int __rhashtable_replace_fast(
spin_lock_bh(lock);

pprev = rht_bucket_var(tbl, hash);
+ if (!pprev)
+ goto out;
rht_for_each_from(he, *pprev, tbl, hash) {
if (he != obj_old) {
pprev = &he->next;
@@ -1053,7 +1060,7 @@ static inline int __rhashtable_replace_fast(
err = 0;
break;
}
-
+out:
spin_unlock_bh(lock);

return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6c4f5c8e9baa..b28fdd560ea9 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -237,8 +237,10 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
goto out;

err = -ENOENT;
+ if (!pprev)
+ goto out;

- rht_for_each(entry, old_tbl, old_hash) {
+ rht_for_each_from(entry, *pprev, old_tbl, old_hash) {
err = 0;
next = rht_dereference_bucket(entry->next, old_tbl, old_hash);

@@ -496,6 +498,8 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,

elasticity = RHT_ELASTICITY;
pprev = rht_bucket_var(tbl, hash);
+ if (!pprev)
+ return ERR_PTR(-ENOENT);
rht_for_each_from(head, *pprev, tbl, hash) {
struct rhlist_head *list;
struct rhlist_head *plist;
@@ -1161,11 +1165,10 @@ void rhashtable_destroy(struct rhashtable *ht)
}
EXPORT_SYMBOL_GPL(rhashtable_destroy);

-struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
- unsigned int hash)
+struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash)
{
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
- static struct rhash_head __rcu *rhnull;
unsigned int index = hash & ((1 << tbl->nest) - 1);
unsigned int size = tbl->size >> tbl->nest;
unsigned int subhash = hash;
@@ -1183,15 +1186,23 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
subhash >>= shift;
}

- if (!ntbl) {
- if (!rhnull)
- INIT_RHT_NULLS_HEAD(rhnull);
- return &rhnull;
- }
+ if (!ntbl)
+ return NULL;

return &ntbl[subhash].bucket;

}
+EXPORT_SYMBOL_GPL(__rht_bucket_nested);
+
+struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash)
+{
+ static struct rhash_head __rcu *rhnull;
+
+ if (!rhnull)
+ INIT_RHT_NULLS_HEAD(rhnull);
+ return __rht_bucket_nested(tbl, hash) ?: &rhnull;
+}
EXPORT_SYMBOL_GPL(rht_bucket_nested);

struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,


2019-04-01 23:09:59

by NeilBrown

[permalink] [raw]
Subject: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".

Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/rhashtable-types.h | 2
include/linux/rhashtable.h | 261 ++++++++++++++++++++++++--------------
ipc/util.c | 1
lib/rhashtable.c | 141 ++++++++++-----------
lib/test_rhashtable.c | 2
net/bridge/br_fdb.c | 1
net/bridge/br_multicast.c | 1
net/bridge/br_vlan.c | 1
net/bridge/br_vlan_tunnel.c | 1
net/ipv4/ipmr.c | 1
net/ipv6/ip6mr.c | 1
net/netfilter/nf_tables_api.c | 1
12 files changed, 236 insertions(+), 178 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 763d613ce2c2..57467cbf4c5b 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -48,7 +48,6 @@ typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
* @head_offset: Offset of rhash_head in struct to be hashed
* @max_size: Maximum size while expanding
* @min_size: Minimum size while shrinking
- * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
* @automatic_shrinking: Enable automatic shrinking of tables
* @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
* @obj_hashfn: Function to hash object
@@ -62,7 +61,6 @@ struct rhashtable_params {
unsigned int max_size;
u16 min_size;
bool automatic_shrinking;
- u8 locks_mul;
rht_hashfn_t hashfn;
rht_obj_hashfn_t obj_hashfn;
rht_obj_cmpfn_t obj_cmpfn;
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 0c9175aeab8a..ccbbafdf5547 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -24,12 +24,27 @@
#include <linux/list_nulls.h>
#include <linux/workqueue.h>
#include <linux/rculist.h>
+#include <linux/bit_spinlock.h>

#include <linux/rhashtable-types.h>
/*
+ * Objects in an rhashtable have an embedded struct rhash_head
+ * which is linked into as hash chain from the hash table - or one
+ * of two or more hash tables when the rhashtable is being resized.
* The end of the chain is marked with a special nulls marks which has
- * the least significant bit set.
+ * the least significant bit set but otherwise stores the address of
+ * the hash bucket. This allows us to be be sure we've found the end
+ * of the right list.
+ * The value stored in the hash bucket has BIT(2) used as a lock bit.
+ * This bit must be atomically set before any changes are made to
+ * the chain. To avoid dereferencing this pointer without clearing
+ * the bit first, we use an opaque 'struct rhash_lock_head *' for the
+ * pointer stored in the bucket. This struct needs to be defined so
+ * that rcu_derefernce() works on it, but it has no content so a
+ * cast is needed for it to be useful. This ensures it isn't
+ * used by mistake with clearing the lock bit first.
*/
+struct rhash_lock_head {};

/* Maximum chain length before rehash
*
@@ -52,8 +67,6 @@
* @nest: Number of bits of first-level nested table.
* @rehash: Current bucket being rehashed
* @hash_rnd: Random seed to fold into hash
- * @locks_mask: Mask to apply before accessing locks[]
- * @locks: Array of spinlocks protecting individual buckets
* @walkers: List of active walkers
* @rcu: RCU structure for freeing the table
* @future_tbl: Table under construction during rehashing
@@ -64,16 +77,70 @@ struct bucket_table {
unsigned int size;
unsigned int nest;
u32 hash_rnd;
- unsigned int locks_mask;
- spinlock_t *locks;
struct list_head walkers;
struct rcu_head rcu;

struct bucket_table __rcu *future_tbl;

- struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
+ struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp;
};

+/*
+ * We lock a bucket by setting BIT(1) in the pointer - this is always
+ * zero in real pointers and in the nulls marker.
+ * bit_spin_locks do not handle contention well, but the whole point
+ * of the hashtable design is to achieve minimum per-bucket contention.
+ * A nested hash table might not have a bucket pointer. In that case
+ * we cannot get a lock. For remove and replace the bucket cannot be
+ * interesting and doesn't need locking.
+ * For insert we allocate the bucket if this is the last bucket_table,
+ * and then take the lock.
+ * Sometimes we unlock a bucket by writing a new pointer there. In that
+ * case we don't need to unlock, but we do need to reset state such as
+ * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer()
+ * provides the same release semantics that bit_spin_unlock() provides,
+ * this is safe.
+ */
+
+static inline void rht_lock(struct rhash_lock_head **bkt)
+{
+ local_bh_disable();
+ bit_spin_lock(1, (unsigned long *)bkt);
+}
+
+static inline void rht_unlock(struct rhash_lock_head **bkt)
+{
+ bit_spin_unlock(1, (unsigned long *)bkt);
+ local_bh_enable();
+}
+
+static inline void rht_assign_unlock(struct rhash_lock_head **bkt,
+ struct rhash_head *obj)
+{
+ struct rhash_head **p = (struct rhash_head **)bkt;
+
+ rcu_assign_pointer(*p, obj);
+ preempt_enable();
+ __release(bitlock);
+ local_bh_enable();
+}
+
+/*
+ * If 'p' is a bucket head and might be locked:
+ * rht_ptr() returns the address without the lock bit.
+ * rht_ptr_locked() returns the address WITH the lock bit.
+ */
+static inline struct rhash_head __rcu *rht_ptr(const struct rhash_lock_head *p)
+{
+ return (void *)(((unsigned long)p) & ~BIT(1));
+}
+
+static inline struct rhash_lock_head __rcu *rht_ptr_locked(const
+ struct rhash_head *p)
+{
+ return (void *)(((unsigned long)p) | BIT(1));
+}
+
/*
* NULLS_MARKER() expects a hash value with the low
* bits mostly likely to be significant, and it discards
@@ -206,25 +273,6 @@ static inline bool rht_grow_above_max(const struct rhashtable *ht,
return atomic_read(&ht->nelems) >= ht->max_elems;
}

-/* The bucket lock is selected based on the hash and protects mutations
- * on a group of hash buckets.
- *
- * A maximum of tbl->size/2 bucket locks is allocated. This ensures that
- * a single lock always covers both buckets which may both contains
- * entries which link to the same bucket of the old table during resizing.
- * This allows to simplify the locking as locking the bucket in both
- * tables during resize always guarantee protection.
- *
- * IMPORTANT: When holding the bucket lock of both the old and new table
- * during expansions and shrinking, the old bucket lock must always be
- * acquired first.
- */
-static inline spinlock_t *rht_bucket_lock(const struct bucket_table *tbl,
- unsigned int hash)
-{
- return &tbl->locks[hash & tbl->locks_mask];
-}
-
#ifdef CONFIG_PROVE_LOCKING
int lockdep_rht_mutex_is_held(struct rhashtable *ht);
int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash);
@@ -263,13 +311,13 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
void *arg);
void rhashtable_destroy(struct rhashtable *ht);

-struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
- unsigned int hash);
-struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
- unsigned int hash);
-struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
- struct bucket_table *tbl,
+struct rhash_lock_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash);
+struct rhash_lock_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
unsigned int hash);
+struct rhash_lock_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
+ struct bucket_table *tbl,
+ unsigned int hash);

#define rht_dereference(p, ht) \
rcu_dereference_protected(p, lockdep_rht_mutex_is_held(ht))
@@ -286,21 +334,21 @@ struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
#define rht_entry(tpos, pos, member) \
({ tpos = container_of(pos, typeof(*tpos), member); 1; })

-static inline struct rhash_head __rcu *const *rht_bucket(
+static inline struct rhash_lock_head __rcu *const *rht_bucket(
const struct bucket_table *tbl, unsigned int hash)
{
return unlikely(tbl->nest) ? rht_bucket_nested(tbl, hash) :
&tbl->buckets[hash];
}

-static inline struct rhash_head __rcu **rht_bucket_var(
+static inline struct rhash_lock_head __rcu **rht_bucket_var(
struct bucket_table *tbl, unsigned int hash)
{
return unlikely(tbl->nest) ? __rht_bucket_nested(tbl, hash) :
&tbl->buckets[hash];
}

-static inline struct rhash_head __rcu **rht_bucket_insert(
+static inline struct rhash_lock_head __rcu **rht_bucket_insert(
struct rhashtable *ht, struct bucket_table *tbl, unsigned int hash)
{
return unlikely(tbl->nest) ? rht_bucket_nested_insert(ht, tbl, hash) :
@@ -326,7 +374,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
* @hash: the hash value / bucket index
*/
#define rht_for_each(pos, tbl, hash) \
- rht_for_each_from(pos, *rht_bucket(tbl, hash), tbl, hash)
+ rht_for_each_from(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)

/**
* rht_for_each_entry_from - iterate over hash chain from given head
@@ -351,7 +399,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
* @member: name of the &struct rhash_head within the hashable struct.
*/
#define rht_for_each_entry(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_from(tpos, pos, *rht_bucket(tbl, hash), \
+ rht_for_each_entry_from(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
tbl, hash, member)

/**
@@ -367,7 +415,8 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
* remove the loop cursor from the list.
*/
#define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member) \
- for (pos = rht_dereference_bucket(*rht_bucket(tbl, hash), tbl, hash), \
+ for (pos = rht_dereference_bucket(rht_ptr(*rht_bucket(tbl, hash)), \
+ tbl, hash), \
next = !rht_is_a_nulls(pos) ? \
rht_dereference_bucket(pos->next, tbl, hash) : NULL; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
@@ -402,8 +451,12 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
* the _rcu mutation primitives such as rhashtable_insert() as long as the
* traversal is guarded by rcu_read_lock().
*/
-#define rht_for_each_rcu(pos, tbl, hash) \
- rht_for_each_rcu_from(pos, *rht_bucket(tbl, hash), tbl, hash)
+#define rht_for_each_rcu(pos, tbl, hash) \
+ for (({barrier(); }), \
+ pos = rht_ptr(rht_dereference_bucket_rcu( \
+ *rht_bucket(tbl, hash), tbl, hash)); \
+ !rht_is_a_nulls(pos); \
+ pos = rcu_dereference_raw(pos->next))

/**
* rht_for_each_entry_rcu_from - iterated over rcu hash chain from given head
@@ -437,7 +490,8 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
* traversal is guarded by rcu_read_lock().
*/
#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_rcu_from(tpos, pos, *rht_bucket(tbl, hash), \
+ rht_for_each_entry_rcu_from(tpos, pos, \
+ rht_ptr(*rht_bucket(tbl, hash)), \
tbl, hash, member)

/**
@@ -483,7 +537,7 @@ static inline struct rhash_head *__rhashtable_lookup(
.ht = ht,
.key = key,
};
- struct rhash_head __rcu * const *head;
+ struct rhash_lock_head __rcu * const *bkt;
struct bucket_table *tbl;
struct rhash_head *he;
unsigned int hash;
@@ -491,9 +545,10 @@ static inline struct rhash_head *__rhashtable_lookup(
tbl = rht_dereference_rcu(ht->tbl, ht);
restart:
hash = rht_key_hashfn(ht, tbl, key, params);
- head = rht_bucket(tbl, hash);
+ bkt = rht_bucket(tbl, hash);
do {
- rht_for_each_rcu_from(he, *head, tbl, hash) {
+ he = rht_ptr(rht_dereference_bucket_rcu(*bkt, tbl, hash));
+ rht_for_each_rcu_from(he, he, tbl, hash) {
if (params.obj_cmpfn ?
params.obj_cmpfn(&arg, rht_obj(ht, he)) :
rhashtable_compare(&arg, rht_obj(ht, he)))
@@ -503,7 +558,7 @@ static inline struct rhash_head *__rhashtable_lookup(
/* An object might have been moved to a different hash chain,
* while we walk along it - better check and retry.
*/
- } while (he != RHT_NULLS_MARKER(head));
+ } while (he != RHT_NULLS_MARKER(bkt));

/* Ensure we see any new tables. */
smp_rmb();
@@ -599,10 +654,10 @@ static inline void *__rhashtable_insert_fast(
.ht = ht,
.key = key,
};
+ struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
- spinlock_t *lock;
unsigned int hash;
int elasticity;
void *data;
@@ -611,23 +666,22 @@ static inline void *__rhashtable_insert_fast(

tbl = rht_dereference_rcu(ht->tbl, ht);
hash = rht_head_hashfn(ht, tbl, obj, params);
- lock = rht_bucket_lock(tbl, hash);
- spin_lock_bh(lock);
+ elasticity = RHT_ELASTICITY;
+ bkt = rht_bucket_insert(ht, tbl, hash);
+ data = ERR_PTR(-ENOMEM);
+ if (!bkt)
+ goto out;
+ pprev = NULL;
+ rht_lock(bkt);

if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
slow_path:
- spin_unlock_bh(lock);
+ rht_unlock(bkt);
rcu_read_unlock();
return rhashtable_insert_slow(ht, key, obj);
}

- elasticity = RHT_ELASTICITY;
- pprev = rht_bucket_insert(ht, tbl, hash);
- data = ERR_PTR(-ENOMEM);
- if (!pprev)
- goto out;
-
- rht_for_each_from(head, *pprev, tbl, hash) {
+ rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
struct rhlist_head *plist;
struct rhlist_head *list;

@@ -643,7 +697,7 @@ static inline void *__rhashtable_insert_fast(
data = rht_obj(ht, head);

if (!rhlist)
- goto out;
+ goto out_unlock;


list = container_of(obj, struct rhlist_head, rhead);
@@ -652,9 +706,13 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->next, plist);
head = rht_dereference_bucket(head->next, tbl, hash);
RCU_INIT_POINTER(list->rhead.next, head);
- rcu_assign_pointer(*pprev, obj);
-
- goto good;
+ if (pprev) {
+ rcu_assign_pointer(*pprev, obj);
+ rht_unlock(bkt);
+ } else
+ rht_assign_unlock(bkt, obj);
+ data = NULL;
+ goto out;
}

if (elasticity <= 0)
@@ -662,12 +720,13 @@ static inline void *__rhashtable_insert_fast(

data = ERR_PTR(-E2BIG);
if (unlikely(rht_grow_above_max(ht, tbl)))
- goto out;
+ goto out_unlock;

if (unlikely(rht_grow_above_100(ht, tbl)))
goto slow_path;

- head = rht_dereference_bucket(*pprev, tbl, hash);
+ /* Inserting at head of list makes unlocking free. */
+ head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));

RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -677,20 +736,21 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->next, NULL);
}

- rcu_assign_pointer(*pprev, obj);
-
atomic_inc(&ht->nelems);
+ rht_assign_unlock(bkt, obj);
+
if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);

-good:
data = NULL;
-
out:
- spin_unlock_bh(lock);
rcu_read_unlock();

return data;
+
+out_unlock:
+ rht_unlock(bkt);
+ goto out;
}

/**
@@ -699,9 +759,9 @@ static inline void *__rhashtable_insert_fast(
* @obj: pointer to hash head inside object
* @params: hash table parameters
*
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
* on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
*
* It is safe to call this function from atomic context.
*
@@ -728,9 +788,9 @@ static inline int rhashtable_insert_fast(
* @list: pointer to hash list head inside object
* @params: hash table parameters
*
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
* on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
*
* It is safe to call this function from atomic context.
*
@@ -751,9 +811,9 @@ static inline int rhltable_insert_key(
* @list: pointer to hash list head inside object
* @params: hash table parameters
*
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
* on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
*
* It is safe to call this function from atomic context.
*
@@ -880,21 +940,20 @@ static inline int __rhashtable_remove_fast_one(
struct rhash_head *obj, const struct rhashtable_params params,
bool rhlist)
{
+ struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
- spinlock_t * lock;
unsigned int hash;
int err = -ENOENT;

hash = rht_head_hashfn(ht, tbl, obj, params);
- lock = rht_bucket_lock(tbl, hash);
+ bkt = rht_bucket_var(tbl, hash);
+ if (!bkt)
+ return -ENOENT;
+ pprev = NULL;
+ rht_lock(bkt);

- spin_lock_bh(lock);
-
- pprev = rht_bucket_var(tbl, hash);
- if (!pprev)
- goto out;
- rht_for_each_from(he, *pprev, tbl, hash) {
+ rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
struct rhlist_head *list;

list = container_of(he, struct rhlist_head, rhead);
@@ -934,13 +993,17 @@ static inline int __rhashtable_remove_fast_one(
}
}

- rcu_assign_pointer(*pprev, obj);
- break;
+ if (pprev) {
+ rcu_assign_pointer(*pprev, obj);
+ rht_unlock(bkt);
+ } else {
+ rht_assign_unlock(bkt, obj);
+ }
+ goto unlocked;
}

-out:
- spin_unlock_bh(lock);
-
+ rht_unlock(bkt);
+unlocked:
if (err > 0) {
atomic_dec(&ht->nelems);
if (unlikely(ht->p.automatic_shrinking &&
@@ -1029,9 +1092,9 @@ static inline int __rhashtable_replace_fast(
struct rhash_head *obj_old, struct rhash_head *obj_new,
const struct rhashtable_params params)
{
+ struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
- spinlock_t *lock;
unsigned int hash;
int err = -ENOENT;

@@ -1042,27 +1105,33 @@ static inline int __rhashtable_replace_fast(
if (hash != rht_head_hashfn(ht, tbl, obj_new, params))
return -EINVAL;

- lock = rht_bucket_lock(tbl, hash);
+ bkt = rht_bucket_var(tbl, hash);
+ if (!bkt)
+ return -ENOENT;

- spin_lock_bh(lock);
+ pprev = NULL;
+ rht_lock(bkt);

- pprev = rht_bucket_var(tbl, hash);
- if (!pprev)
- goto out;
- rht_for_each_from(he, *pprev, tbl, hash) {
+ rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
if (he != obj_old) {
pprev = &he->next;
continue;
}

rcu_assign_pointer(obj_new->next, obj_old->next);
- rcu_assign_pointer(*pprev, obj_new);
+ if (pprev) {
+ rcu_assign_pointer(*pprev, obj_new);
+ rht_unlock(bkt);
+ } else {
+ rht_assign_unlock(bkt, obj_new);
+ }
err = 0;
- break;
+ goto unlocked;
}
-out:
- spin_unlock_bh(lock);

+ rht_unlock(bkt);
+
+unlocked:
return err;
}

diff --git a/ipc/util.c b/ipc/util.c
index 0af05752969f..095274a871f8 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -101,7 +101,6 @@ static const struct rhashtable_params ipc_kht_params = {
.head_offset = offsetof(struct kern_ipc_perm, khtnode),
.key_offset = offsetof(struct kern_ipc_perm, key),
.key_len = FIELD_SIZEOF(struct kern_ipc_perm, key),
- .locks_mul = 1,
.automatic_shrinking = true,
};

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index b28fdd560ea9..c5d0974467ee 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -31,11 +31,10 @@

#define HASH_DEFAULT_SIZE 64UL
#define HASH_MIN_SIZE 4U
-#define BUCKET_LOCKS_PER_CPU 32UL

union nested_table {
union nested_table __rcu *table;
- struct rhash_head __rcu *bucket;
+ struct rhash_lock_head __rcu *bucket;
};

static u32 head_hashfn(struct rhashtable *ht,
@@ -56,9 +55,11 @@ EXPORT_SYMBOL_GPL(lockdep_rht_mutex_is_held);

int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
{
- spinlock_t *lock = rht_bucket_lock(tbl, hash);
-
- return (debug_locks) ? lockdep_is_held(lock) : 1;
+ if (!debug_locks)
+ return 1;
+ if (unlikely(tbl->nest))
+ return 1;
+ return bit_spin_is_locked(1, (unsigned long *)&tbl->buckets[hash]);
}
EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
#else
@@ -104,7 +105,6 @@ static void bucket_table_free(const struct bucket_table *tbl)
if (tbl->nest)
nested_bucket_table_free(tbl);

- free_bucket_spinlocks(tbl->locks);
kvfree(tbl);
}

@@ -171,7 +171,7 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
gfp_t gfp)
{
struct bucket_table *tbl = NULL;
- size_t size, max_locks;
+ size_t size;
int i;

size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
@@ -189,16 +189,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,

tbl->size = size;

- max_locks = size >> 1;
- if (tbl->nest)
- max_locks = min_t(size_t, max_locks, 1U << tbl->nest);
-
- if (alloc_bucket_spinlocks(&tbl->locks, &tbl->locks_mask, max_locks,
- ht->p.locks_mul, gfp) < 0) {
- bucket_table_free(tbl);
- return NULL;
- }
-
rcu_head_init(&tbl->rcu);
INIT_LIST_HEAD(&tbl->walkers);

@@ -223,24 +213,23 @@ static struct bucket_table *rhashtable_last_table(struct rhashtable *ht,
return new_tbl;
}

-static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
+static int rhashtable_rehash_one(struct rhashtable *ht,
+ struct rhash_lock_head __rcu **bkt,
+ unsigned int old_hash)
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
- struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
int err = -EAGAIN;
struct rhash_head *head, *next, *entry;
- spinlock_t *new_bucket_lock;
+ struct rhash_head **pprev = NULL;
unsigned int new_hash;

if (new_tbl->nest)
goto out;

err = -ENOENT;
- if (!pprev)
- goto out;

- rht_for_each_from(entry, *pprev, old_tbl, old_hash) {
+ rht_for_each_from(entry, rht_ptr(*bkt), old_tbl, old_hash) {
err = 0;
next = rht_dereference_bucket(entry->next, old_tbl, old_hash);

@@ -255,18 +244,20 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)

new_hash = head_hashfn(ht, new_tbl, entry);

- new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
+ rht_lock(&new_tbl->buckets[new_hash]);

- spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
- head = rht_dereference_bucket(new_tbl->buckets[new_hash],
- new_tbl, new_hash);
+ head = rht_ptr(rht_dereference_bucket(new_tbl->buckets[new_hash],
+ new_tbl, new_hash));

RCU_INIT_POINTER(entry->next, head);

- rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
- spin_unlock(new_bucket_lock);
+ rht_assign_unlock(&new_tbl->buckets[new_hash], entry);

- rcu_assign_pointer(*pprev, next);
+ if (pprev)
+ rcu_assign_pointer(*pprev, next);
+ else
+ /* Need to preserved the bit lock. */
+ rcu_assign_pointer(*bkt, rht_ptr_locked(next));

out:
return err;
@@ -276,19 +267,19 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
unsigned int old_hash)
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
- spinlock_t *old_bucket_lock;
+ struct rhash_lock_head __rcu **bkt = rht_bucket_var(old_tbl, old_hash);
int err;

- old_bucket_lock = rht_bucket_lock(old_tbl, old_hash);
+ if (!bkt)
+ return 0;
+ rht_lock(bkt);

- spin_lock_bh(old_bucket_lock);
- while (!(err = rhashtable_rehash_one(ht, old_hash)))
+ while (!(err = rhashtable_rehash_one(ht, bkt, old_hash)))
;

if (err == -ENOENT)
err = 0;
-
- spin_unlock_bh(old_bucket_lock);
+ rht_unlock(bkt);

return err;
}
@@ -485,6 +476,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
}

static void *rhashtable_lookup_one(struct rhashtable *ht,
+ struct rhash_lock_head __rcu **bkt,
struct bucket_table *tbl, unsigned int hash,
const void *key, struct rhash_head *obj)
{
@@ -492,15 +484,12 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
.ht = ht,
.key = key,
};
- struct rhash_head __rcu **pprev;
+ struct rhash_head **pprev = NULL;
struct rhash_head *head;
int elasticity;

elasticity = RHT_ELASTICITY;
- pprev = rht_bucket_var(tbl, hash);
- if (!pprev)
- return ERR_PTR(-ENOENT);
- rht_for_each_from(head, *pprev, tbl, hash) {
+ rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
struct rhlist_head *list;
struct rhlist_head *plist;

@@ -522,7 +511,11 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
RCU_INIT_POINTER(list->next, plist);
head = rht_dereference_bucket(head->next, tbl, hash);
RCU_INIT_POINTER(list->rhead.next, head);
- rcu_assign_pointer(*pprev, obj);
+ if (pprev)
+ rcu_assign_pointer(*pprev, obj);
+ else
+ /* Need to preserve the bit lock */
+ rcu_assign_pointer(*bkt, rht_ptr_locked(obj));

return NULL;
}
@@ -534,12 +527,12 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
}

static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
+ struct rhash_lock_head __rcu **bkt,
struct bucket_table *tbl,
unsigned int hash,
struct rhash_head *obj,
void *data)
{
- struct rhash_head __rcu **pprev;
struct bucket_table *new_tbl;
struct rhash_head *head;

@@ -562,11 +555,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
if (unlikely(rht_grow_above_100(ht, tbl)))
return ERR_PTR(-EAGAIN);

- pprev = rht_bucket_insert(ht, tbl, hash);
- if (!pprev)
- return ERR_PTR(-ENOMEM);
-
- head = rht_dereference_bucket(*pprev, tbl, hash);
+ head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));

RCU_INIT_POINTER(obj->next, head);
if (ht->rhlist) {
@@ -576,7 +565,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
RCU_INIT_POINTER(list->next, NULL);
}

- rcu_assign_pointer(*pprev, obj);
+ /* bkt is always the head of the list, so it holds
+ * the lock, which we need to preserve
+ */
+ rcu_assign_pointer(*bkt, rht_ptr_locked(obj));

atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -590,6 +582,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
{
struct bucket_table *new_tbl;
struct bucket_table *tbl;
+ struct rhash_lock_head __rcu **bkt;
unsigned int hash;
void *data;

@@ -598,14 +591,25 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
do {
tbl = new_tbl;
hash = rht_head_hashfn(ht, tbl, obj, ht->p);
- spin_lock_bh(rht_bucket_lock(tbl, hash));
-
- data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
- new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
- if (PTR_ERR(new_tbl) != -EEXIST)
- data = ERR_CAST(new_tbl);
-
- spin_unlock_bh(rht_bucket_lock(tbl, hash));
+ if (rcu_access_pointer(tbl->future_tbl))
+ /* Failure is OK */
+ bkt = rht_bucket_var(tbl, hash);
+ else
+ bkt = rht_bucket_insert(ht, tbl, hash);
+ if (bkt == NULL) {
+ new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+ data = ERR_PTR(-EAGAIN);
+ } else {
+ rht_lock(bkt);
+ data = rhashtable_lookup_one(ht, bkt, tbl,
+ hash, key, obj);
+ new_tbl = rhashtable_insert_one(ht, bkt, tbl,
+ hash, obj, data);
+ if (PTR_ERR(new_tbl) != -EEXIST)
+ data = ERR_CAST(new_tbl);
+
+ rht_unlock(bkt);
+ }
} while (!IS_ERR_OR_NULL(new_tbl));

if (PTR_ERR(data) == -EAGAIN)
@@ -1032,11 +1036,6 @@ int rhashtable_init(struct rhashtable *ht,

size = rounded_hashtable_size(&ht->p);

- if (params->locks_mul)
- ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);
- else
- ht->p.locks_mul = BUCKET_LOCKS_PER_CPU;
-
ht->key_len = ht->p.key_len;
if (!params->hashfn) {
ht->p.hashfn = jhash;
@@ -1138,7 +1137,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
struct rhash_head *pos, *next;

cond_resched();
- for (pos = rht_dereference(*rht_bucket(tbl, i), ht),
+ for (pos = rht_ptr(rht_dereference(*rht_bucket(tbl, i), ht)),
next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
!rht_is_a_nulls(pos);
@@ -1165,8 +1164,8 @@ void rhashtable_destroy(struct rhashtable *ht)
}
EXPORT_SYMBOL_GPL(rhashtable_destroy);

-struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
- unsigned int hash)
+struct rhash_lock_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash)
{
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
unsigned int index = hash & ((1 << tbl->nest) - 1);
@@ -1194,10 +1193,10 @@ struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
}
EXPORT_SYMBOL_GPL(__rht_bucket_nested);

-struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
- unsigned int hash)
+struct rhash_lock_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
+ unsigned int hash)
{
- static struct rhash_head __rcu *rhnull;
+ static struct rhash_lock_head __rcu *rhnull;

if (!rhnull)
INIT_RHT_NULLS_HEAD(rhnull);
@@ -1205,9 +1204,9 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
}
EXPORT_SYMBOL_GPL(rht_bucket_nested);

-struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
- struct bucket_table *tbl,
- unsigned int hash)
+struct rhash_lock_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
+ struct bucket_table *tbl,
+ unsigned int hash)
{
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
unsigned int index = hash & ((1 << tbl->nest) - 1);
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 3bd2e91bfc29..02592c2a249c 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -500,7 +500,7 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
struct rhash_head *pos, *next;
struct test_obj_rhl *p;

- pos = rht_dereference(tbl->buckets[i], ht);
+ pos = rht_ptr(rht_dereference(tbl->buckets[i], ht));
next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : NULL;

if (!rht_is_a_nulls(pos)) {
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 00573cc46c98..b1c91f66d79c 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -33,7 +33,6 @@ static const struct rhashtable_params br_fdb_rht_params = {
.key_offset = offsetof(struct net_bridge_fdb_entry, key),
.key_len = sizeof(struct net_bridge_fdb_key),
.automatic_shrinking = true,
- .locks_mul = 1,
};

static struct kmem_cache *br_fdb_cache __read_mostly;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index f5343dfac282..4b77e17df73b 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -44,7 +44,6 @@ static const struct rhashtable_params br_mdb_rht_params = {
.key_offset = offsetof(struct net_bridge_mdb_entry, addr),
.key_len = sizeof(struct br_ip),
.automatic_shrinking = true,
- .locks_mul = 1,
};

static void br_multicast_start_querier(struct net_bridge *br,
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 96abf8feb9dc..0a02822b5667 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -21,7 +21,6 @@ static const struct rhashtable_params br_vlan_rht_params = {
.key_offset = offsetof(struct net_bridge_vlan, vid),
.key_len = sizeof(u16),
.nelem_hint = 3,
- .locks_mul = 1,
.max_size = VLAN_N_VID,
.obj_cmpfn = br_vlan_cmp,
.automatic_shrinking = true,
diff --git a/net/bridge/br_vlan_tunnel.c b/net/bridge/br_vlan_tunnel.c
index 6d2c4eed2dc8..758151863669 100644
--- a/net/bridge/br_vlan_tunnel.c
+++ b/net/bridge/br_vlan_tunnel.c
@@ -34,7 +34,6 @@ static const struct rhashtable_params br_vlan_tunnel_rht_params = {
.key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id),
.key_len = sizeof(__be64),
.nelem_hint = 3,
- .locks_mul = 1,
.obj_cmpfn = br_vlan_tunid_cmp,
.automatic_shrinking = true,
};
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2c931120c494..9a3f13edc98e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -373,7 +373,6 @@ static const struct rhashtable_params ipmr_rht_params = {
.key_offset = offsetof(struct mfc_cache, cmparg),
.key_len = sizeof(struct mfc_cache_cmp_arg),
.nelem_hint = 3,
- .locks_mul = 1,
.obj_cmpfn = ipmr_hash_cmp,
.automatic_shrinking = true,
};
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e4dd57976737..4e69847ed5be 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -355,7 +355,6 @@ static const struct rhashtable_params ip6mr_rht_params = {
.key_offset = offsetof(struct mfc6_cache, cmparg),
.key_len = sizeof(struct mfc6_cache_cmp_arg),
.nelem_hint = 3,
- .locks_mul = 1,
.obj_cmpfn = ip6mr_hash_cmp,
.automatic_shrinking = true,
};
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index ef7772e976cc..90e6b09ef2af 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -53,7 +53,6 @@ static const struct rhashtable_params nft_chain_ht_params = {
.hashfn = nft_chain_hash,
.obj_hashfn = nft_chain_hash_obj,
.obj_cmpfn = nft_chain_hash_cmp,
- .locks_mul = 1,
.automatic_shrinking = true,
};



2019-04-01 23:10:25

by NeilBrown

[permalink] [raw]
Subject: [PATCH 4/4] rhashtable: add lockdep tracking to bucket bit-spin-locks.

Native bit_spin_locks are not tracked by lockdep.

The bit_spin_locks used for rhashtable buckets are local
to the rhashtable implementation, so there is little opportunity
for the sort of misuse that lockdep might detect.
However locks are held while a hash function or compare
function is called, and if one of these took a lock,
a misbehaviour is possible.

As it is quite easy to add lockdep support this unlikely
possibility seems to be enough justification.

So create a lockdep class for bucket bit_spin_lock and attach
through a lockdep_map in each bucket_table.

Without the 'nested' annotation in rhashtable_rehash_one(), lockdep
correctly reports a possible problem as this lock is taken
while another bucket lock (in another table) is held. This
confirms that the added support works.
With the correct nested annotation in place, lockdep reports
no problems.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/rhashtable.h | 51 +++++++++++++++++++++++++++++---------------
lib/rhashtable.c | 15 ++++++++-----
2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index ccbbafdf5547..460c0eaf6b96 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -82,6 +82,8 @@ struct bucket_table {

struct bucket_table __rcu *future_tbl;

+ struct lockdep_map dep_map;
+
struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp;
};

@@ -102,23 +104,38 @@ struct bucket_table {
* this is safe.
*/

-static inline void rht_lock(struct rhash_lock_head **bkt)
+static inline void rht_lock(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt)
{
local_bh_disable();
bit_spin_lock(1, (unsigned long *)bkt);
+ lock_map_acquire(&tbl->dep_map);
+}
+
+static inline void rht_lock_nested(struct bucket_table *tbl,
+ struct rhash_lock_head **bucket,
+ unsigned int subclass)
+{
+ local_bh_disable();
+ bit_spin_lock(1, (unsigned long *)bucket);
+ lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
}

-static inline void rht_unlock(struct rhash_lock_head **bkt)
+static inline void rht_unlock(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt)
{
+ lock_map_release(&tbl->dep_map);
bit_spin_unlock(1, (unsigned long *)bkt);
local_bh_enable();
}

-static inline void rht_assign_unlock(struct rhash_lock_head **bkt,
+static inline void rht_assign_unlock(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt,
struct rhash_head *obj)
{
struct rhash_head **p = (struct rhash_head **)bkt;

+ lock_map_release(&tbl->dep_map);
rcu_assign_pointer(*p, obj);
preempt_enable();
__release(bitlock);
@@ -672,11 +689,11 @@ static inline void *__rhashtable_insert_fast(
if (!bkt)
goto out;
pprev = NULL;
- rht_lock(bkt);
+ rht_lock(tbl, bkt);

if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
slow_path:
- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
rcu_read_unlock();
return rhashtable_insert_slow(ht, key, obj);
}
@@ -708,9 +725,9 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->rhead.next, head);
if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
} else
- rht_assign_unlock(bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj);
data = NULL;
goto out;
}
@@ -737,7 +754,7 @@ static inline void *__rhashtable_insert_fast(
}

atomic_inc(&ht->nelems);
- rht_assign_unlock(bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj);

if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);
@@ -749,7 +766,7 @@ static inline void *__rhashtable_insert_fast(
return data;

out_unlock:
- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
goto out;
}

@@ -951,7 +968,7 @@ static inline int __rhashtable_remove_fast_one(
if (!bkt)
return -ENOENT;
pprev = NULL;
- rht_lock(bkt);
+ rht_lock(tbl, bkt);

rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
struct rhlist_head *list;
@@ -995,14 +1012,14 @@ static inline int __rhashtable_remove_fast_one(

if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
} else {
- rht_assign_unlock(bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj);
}
goto unlocked;
}

- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
unlocked:
if (err > 0) {
atomic_dec(&ht->nelems);
@@ -1110,7 +1127,7 @@ static inline int __rhashtable_replace_fast(
return -ENOENT;

pprev = NULL;
- rht_lock(bkt);
+ rht_lock(tbl, bkt);

rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
if (he != obj_old) {
@@ -1121,15 +1138,15 @@ static inline int __rhashtable_replace_fast(
rcu_assign_pointer(obj_new->next, obj_old->next);
if (pprev) {
rcu_assign_pointer(*pprev, obj_new);
- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
} else {
- rht_assign_unlock(bkt, obj_new);
+ rht_assign_unlock(tbl, bkt, obj_new);
}
err = 0;
goto unlocked;
}

- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);

unlocked:
return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c5d0974467ee..a8583af43b59 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -173,6 +173,7 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
struct bucket_table *tbl = NULL;
size_t size;
int i;
+ static struct lock_class_key __key;

size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
tbl = kvzalloc(size, gfp);
@@ -187,6 +188,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
if (tbl == NULL)
return NULL;

+ lockdep_init_map(&tbl->dep_map, "rhashtable_bucket", &__key, 0);
+
tbl->size = size;

rcu_head_init(&tbl->rcu);
@@ -244,14 +247,14 @@ static int rhashtable_rehash_one(struct rhashtable *ht,

new_hash = head_hashfn(ht, new_tbl, entry);

- rht_lock(&new_tbl->buckets[new_hash]);
+ rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash], SINGLE_DEPTH_NESTING);

head = rht_ptr(rht_dereference_bucket(new_tbl->buckets[new_hash],
new_tbl, new_hash));

RCU_INIT_POINTER(entry->next, head);

- rht_assign_unlock(&new_tbl->buckets[new_hash], entry);
+ rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry);

if (pprev)
rcu_assign_pointer(*pprev, next);
@@ -272,14 +275,14 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,

if (!bkt)
return 0;
- rht_lock(bkt);
+ rht_lock(old_tbl, bkt);

while (!(err = rhashtable_rehash_one(ht, bkt, old_hash)))
;

if (err == -ENOENT)
err = 0;
- rht_unlock(bkt);
+ rht_unlock(old_tbl, bkt);

return err;
}
@@ -600,7 +603,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
data = ERR_PTR(-EAGAIN);
} else {
- rht_lock(bkt);
+ rht_lock(tbl, bkt);
data = rhashtable_lookup_one(ht, bkt, tbl,
hash, key, obj);
new_tbl = rhashtable_insert_one(ht, bkt, tbl,
@@ -608,7 +611,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
if (PTR_ERR(new_tbl) != -EEXIST)
data = ERR_CAST(new_tbl);

- rht_unlock(bkt);
+ rht_unlock(tbl, bkt);
}
} while (!IS_ERR_OR_NULL(new_tbl));



2019-04-02 10:34:36

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

From: NeilBrown
> Sent: 02 April 2019 00:08
>
> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
> bucket pointer to lock the hash chain for that bucket.
...
> To enhance type checking, a new struct is introduced to represent the
> pointer plus lock-bit
> that is stored in the bucket-table. This is "struct rhash_lock_head"
> and is empty. A pointer to this needs to be cast to either an
> unsigned lock, or a "struct rhash_head *" to be useful.
> Variables of this type are most often called "bkt".

Did you try using a union of the pointer and an 'unsigned long' ?
Should remove a lot of the casts.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2019-04-02 21:12:04

by NeilBrown

[permalink] [raw]
Subject: RE: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On Tue, Apr 02 2019, David Laight wrote:

> From: NeilBrown
>> Sent: 02 April 2019 00:08
>>
>> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
>> bucket pointer to lock the hash chain for that bucket.
> ...
>> To enhance type checking, a new struct is introduced to represent the
>> pointer plus lock-bit
>> that is stored in the bucket-table. This is "struct rhash_lock_head"
>> and is empty. A pointer to this needs to be cast to either an
>> unsigned lock, or a "struct rhash_head *" to be useful.
>> Variables of this type are most often called "bkt".
>
> Did you try using a union of the pointer and an 'unsigned long' ?
> Should remove a lot of the casts.

It might, but I'm not sure it is what we want.
The value is not an unsigned long OR a pointer, it is both blended
together. So it really isn't a union.
We *want* it to require casts to access, so that it is clear that
something unusual is happening, and care is needed.

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2019-04-03 09:26:25

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

From: NeilBrown
> Sent: 02 April 2019 22:11
>
> On Tue, Apr 02 2019, David Laight wrote:
>
> > From: NeilBrown
> >> Sent: 02 April 2019 00:08
> >>
> >> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
> >> bucket pointer to lock the hash chain for that bucket.
> > ...
> >> To enhance type checking, a new struct is introduced to represent the
> >> pointer plus lock-bit
> >> that is stored in the bucket-table. This is "struct rhash_lock_head"
> >> and is empty. A pointer to this needs to be cast to either an
> >> unsigned lock, or a "struct rhash_head *" to be useful.
> >> Variables of this type are most often called "bkt".
> >
> > Did you try using a union of the pointer and an 'unsigned long' ?
> > Should remove a lot of the casts.
>
> It might, but I'm not sure it is what we want.
> The value is not an unsigned long OR a pointer, it is both blended
> together. So it really isn't a union.
> We *want* it to require casts to access, so that it is clear that
> something unusual is happening, and care is needed.

Right, but you also want to make it hard to forget to do it properly.
Using a union to access the memory as either a pointer or a long
is perfectly valid (and is valid with 'strict aliasing' enabled).
(Rather than the other use of a union to just save space.)

An interesting thought....
Are the only valid actions 'lock and read, and 'unlock with optional update' ?
If so there are only 2 bits of code that need to understand the encoding.
If you make the bit number(s) and polarity properties of the architecture
you should be able to make the stored value an invalid pointer (locked
and unlocked) on at least some architectures.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2019-04-04 00:14:21

by NeilBrown

[permalink] [raw]
Subject: RE: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On Wed, Apr 03 2019, David Laight wrote:

> From: NeilBrown
>> Sent: 02 April 2019 22:11
>>
>> On Tue, Apr 02 2019, David Laight wrote:
>>
>> > From: NeilBrown
>> >> Sent: 02 April 2019 00:08
>> >>
>> >> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
>> >> bucket pointer to lock the hash chain for that bucket.
>> > ...
>> >> To enhance type checking, a new struct is introduced to represent the
>> >> pointer plus lock-bit
>> >> that is stored in the bucket-table. This is "struct rhash_lock_head"
>> >> and is empty. A pointer to this needs to be cast to either an
>> >> unsigned lock, or a "struct rhash_head *" to be useful.
>> >> Variables of this type are most often called "bkt".
>> >
>> > Did you try using a union of the pointer and an 'unsigned long' ?
>> > Should remove a lot of the casts.
>>
>> It might, but I'm not sure it is what we want.
>> The value is not an unsigned long OR a pointer, it is both blended
>> together. So it really isn't a union.
>> We *want* it to require casts to access, so that it is clear that
>> something unusual is happening, and care is needed.
>
> Right, but you also want to make it hard to forget to do it properly.
> Using a union to access the memory as either a pointer or a long
> is perfectly valid (and is valid with 'strict aliasing' enabled).
> (Rather than the other use of a union to just save space.)

Agreed.... I personally think that a union make it easy to forget.

>
> An interesting thought....
> Are the only valid actions 'lock and read, and 'unlock with optional update' ?

No, there is also "read without locking" - use for lookups with RCU
protection. But yes: the set of valid actions is quite limited.

> If so there are only 2 bits of code that need to understand the encoding.
> If you make the bit number(s) and polarity properties of the architecture
> you should be able to make the stored value an invalid pointer (locked
> and unlocked) on at least some architectures.

I'd rather avoid writing arch-specific code if I can avoid it. It isn't
clear that the value of your proposal justifies the cost.
Over-loading the low-order bits of a pointer is (I think) a well
understood pattern. I'd rather stick with such patterns.

Thanks,
NeilBrown


>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


Attachments:
signature.asc (847.00 B)

2019-04-08 02:13:53

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 0/4 v2] Convert rhashtable to use bitlocks

From: NeilBrown <[email protected]>
Date: Tue, 02 Apr 2019 10:07:45 +1100

> This series converts rhashtable to use a per-bucket bitlock
> rather than a separate array of spinlocks.
> This:
> reduces memory usage
> results in slightly fewer memory accesses
> slightly improves parallelism
> makes a configuration option unnecessary
>
> The main change from previous version is to use a distinct type for
> the pointer in the bucket which has a bit-lock in it. This
> helped find two places where rht_ptr() was missed, one
> in rhashtable_free_and_destroy() in print_ht in the test code.

This looks good to me and I haven't seen any major objections.

I think however the thing is encoded, an unsigned long or a pointer,
the cleanliness is basically a wash.

Thanks.

2019-04-08 02:35:25

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

Hi Neil:

On Tue, Apr 02, 2019 at 10:07:45AM +1100, NeilBrown wrote:
>
> @@ -263,13 +311,13 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
> void *arg);
> void rhashtable_destroy(struct rhashtable *ht);
>
> -struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
> - unsigned int hash);
> -struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
> - unsigned int hash);
> -struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
> - struct bucket_table *tbl,
> +struct rhash_lock_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
> + unsigned int hash);

I don't think this opaque type should be marked as __rcu. Because
you can't directly dereference it and once you put it through rht_ptr
then that's the pointer that should carry the __rcu marker.

If you add the __rcu here then you generate a lot of extra noise
in the code that isn't needed.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-04-08 02:37:05

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 1/4] rhashtable: use cmpxchg() in nested_table_alloc()

On Tue, Apr 02, 2019 at 10:07:45AM +1100, NeilBrown wrote:
> nested_table_alloc() relies on the fact that there is
> at most one spinlock allocated for every slot in the top
> level nested table, so it is not possible for two threads
> to try to allocate the same table at the same time.
>
> This assumption is a little fragile (it is not explicit) and is
> unnecessary as cmpxchg() can be used instead.
>
> A future patch will replace the spinlocks by per-bucket bitlocks,
> and then we won't be able to protect the slot pointer with a spinlock.
>
> So replace rcu_assign_pointer() with cmpxchg() - which has equivalent
> barrier properties.
> If it the cmp fails, free the table that was just allocated.
>
> Signed-off-by: NeilBrown <[email protected]>

Acked-by: Herbert Xu <[email protected]>
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-04-08 02:38:13

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/4 v2] Convert rhashtable to use bitlocks

On Sun, Apr 07, 2019 at 07:12:55PM -0700, David Miller wrote:
>
> This looks good to me and I haven't seen any major objections.
>
> I think however the thing is encoded, an unsigned long or a pointer,
> the cleanliness is basically a wash.

Hi Dave:

The series as a whole definitely makes sense. Although I do think
patch 3 can be improved slightly. But if you want to apply it now
that's fine too as we can fix it up later.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-04-10 19:35:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

Hi,

On Tue, Apr 02, 2019 at 10:07:45AM +1100, NeilBrown wrote:
> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
> bucket pointer to lock the hash chain for that bucket.
>
> The benefits of a bit spin_lock are:
> - no need to allocate a separate array of locks.
> - no need to have a configuration option to guide the
> choice of the size of this array
> - locking cost is often a single test-and-set in a cache line
> that will have to be loaded anyway. When inserting at, or removing
> from, the head of the chain, the unlock is free - writing the new
> address in the bucket head implicitly clears the lock bit.
> For __rhashtable_insert_fast() we ensure this always happens
> when adding a new key.
> - even when lockings costs 2 updates (lock and unlock), they are
> in a cacheline that needs to be read anyway.
>
> The cost of using a bit spin_lock is a little bit of code complexity,
> which I think is quite manageable.
>
> Bit spin_locks are sometimes inappropriate because they are not fair -
> if multiple CPUs repeatedly contend of the same lock, one CPU can
> easily be starved. This is not a credible situation with rhashtable.
> Multiple CPUs may want to repeatedly add or remove objects, but they
> will typically do so at different buckets, so they will attempt to
> acquire different locks.
>
> As we have more bit-locks than we previously had spinlocks (by at
> least a factor of two) we can expect slightly less contention to
> go with the slightly better cache behavior and reduced memory
> consumption.
>
> To enhance type checking, a new struct is introduced to represent the
> pointer plus lock-bit
> that is stored in the bucket-table. This is "struct rhash_lock_head"
> and is empty. A pointer to this needs to be cast to either an
> unsigned lock, or a "struct rhash_head *" to be useful.
> Variables of this type are most often called "bkt".
>
> Previously "pprev" would sometimes point to a bucket, and sometimes a
> ->next pointer in an rhash_head. As these are now different types,
> pprev is NULL when it would have pointed to the bucket. In that case,
> 'blk' is used, together with correct locking protocol.
>
> Signed-off-by: NeilBrown <[email protected]>

This patch causes my qemu q800 boot test to crash reliably.

Starting network: Unable to handle kernel access at virtual address (ptrval)
Oops: 00000000
Modules linked in:
PC: [<00248b90>] sk_filter_trim_cap+0x56/0x158
SR: 2000 SP: (ptrval) a2: 07a30aa0
d0: 07836300 d1: 0783666c d2: 00000001 d3: 0025c192
d4: 0025a636 d5: 00248b3a a0: 078363fe a1: 60000000
Process ip (pid: 66, task=(ptrval))
Frame format=7 eff addr=6000000c ssw=0505 faddr=6000000c
wb 1 stat/addr/data: 0000 00000000 00000000
wb 2 stat/addr/data: 0000 00000000 00000000
wb 3 stat/addr/data: 0000 6000000c 00000000
push data: 00000000 00000000 00000000 00000000
Stack from 07a5bdec:
07a5be5c 0025c192 0025a636 00248b3a 00000000 ef9febc8 078363fe 0787a2d0
0783645c 07a5be5c 0025c192 0025a636 00248b3a 0787a2d0 07a2c000 0025c470
078363fe 0787a2d0 00000001 00000000 00000000 07a5be98 07a17500 00000000
0787a2d0 07a2c000 07a5bef8 07a5beac 7fffffff 0025c7e2 07a2c000 0787a2d0
00000000 00000000 00000000 0000000c ef9feb14 ef9feb14 00000000 0031a52e
07a5bef8 07a5bf28 0000000c 0781eba0 00000000 00000042 00000000 00000000
Call Trace: [<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<0025c470>] netlink_unicast+0x170/0x1be
[<0025c7e2>] netlink_sendmsg+0x288/0x2b2
[<0021a5be>] sock_sendmsg+0x1c/0x44
[<0021b8a6>] __sys_sendto+0xac/0xd2
[<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
[<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
[<00219906>] sock_alloc_file+0x50/0x80
[<000b7a1c>] fd_install+0x12/0x18
[<0021b1cc>] __sys_socket+0x7e/0x9c
[<0021b8ea>] sys_sendto+0x1e/0x24
[<00002a40>] syscall+0x8/0xc
[<0000c004>] ATANTBL+0x618/0x800
Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88
Disabling lock debugging due to kernel taint
Unable to handle kernel NULL pointer dereference at virtual address (ptrval)
Oops: 00000000
Modules linked in:
PC: [<0025bf9c>] netlink_release+0x284/0x446
SR: 2000 SP: (ptrval) a2: 07a30aa0
d0: 00000780 d1: 00000780 d2: 07a2c26e d3: 0001a0ba
d4: 07406160 d5: 000000ac a0: 078094ac a1: 00000780
Process ip (pid: 66, task=(ptrval))
Frame format=7 eff addr=00000780 ssw=0505 faddr=00000780
wb 1 stat/addr/data: 0000 00000000 00000000
wb 2 stat/addr/data: 0000 00000000 00000000
wb 3 stat/addr/data: 0000 00000780 00000000
push data: 00000000 00000000 00000000 00000000
Stack from 07a5bc2c:
00000000 078126a0 0741fe00 07a39208 00000001 ef9febc8 07406160 0740617c
00000000 00000000 002e8ca6 074061e8 07a5bcf4 00219830 07406160 00000008
07a39200 0740617c 002198a2 07406160 0740617c 000a3996 0740617c 07a39200
003deb54 07a39480 002e863c 000000b4 07a30aa0 002e7446 0002934a 07a39200
600000ff 00033e44 07a30aa0 0790f330 00018f34 07a30aa0 6000000c 0025c192
0025a636 00248b3a 00000001 ef9febc8 07a5bd84 00037986 0783645c 0790f368
Call Trace: [<002e8ca6>] __down_write+0xe/0x12
[<00219830>] __sock_release+0x34/0x98
[<002198a2>] sock_close+0xe/0x14
[<000a3996>] __fput+0xc4/0x16a
[<002e863c>] down_read+0x0/0x6
[<002e7446>] _cond_resched+0x0/0x2a
[<0002934a>] task_work_run+0x66/0x7c
[<00033e44>] up_read+0x0/0x6
[<00018f34>] do_exit+0x2a2/0x724
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<00037986>] printk+0x0/0x18
[<00004fd4>] die_if_kernel+0x52/0x56
[<00005d64>] send_fault_sig+0x74/0x86
[<00005094>] buserr_c+0xbc/0x518
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<00002944>] buserr+0x20/0x28
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<0025c192>] netlink_attachskb+0x0/0x138
[<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
[<00248b3a>] sk_filter_trim_cap+0x0/0x158
[<0025c470>] netlink_unicast+0x170/0x1be
[<0025c7e2>] netlink_sendmsg+0x288/0x2b2
[<0021a5be>] sock_sendmsg+0x1c/0x44
[<0021b8a6>] __sys_sendto+0xac/0xd2
[<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
[<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
[<00219906>] sock_alloc_file+0x50/0x80
[<000b7a1c>] fd_install+0x12/0x18
[<0021b1cc>] __sys_socket+0x7e/0x9c
[<0021b8ea>] sys_sendto+0x1e/0x24
[<00002a40>] syscall+0x8/0xc
[<0000c004>] ATANTBL+0x618/0x800
Code: 41f5 5800 6000 fe34 b082 670a 2200 2240 <2011> 6000 fe48 202b 026e 43f9 0001 a0ba 4a81 677a 2041 2080 4878 0200 2f3c 0025
Kernel panic - not syncing: Aiee, killing interrupt handler!
---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---
qemu-system-m68k: terminating on signal 15 from pid 19836 (/bin/bash)

Bisect log is attached. Obviously I have no idea if this is an issue with qemu
or with the code. qemu is from [email protected]:vivier/qemu-m68k.git, branch
m68k/q800-dev.

Guenter

---
# bad: [87b81df1a63d6791359a475241fa9d2f96fa30be] Add linux-next specific files for 20190410
# good: [15ade5d2e7775667cf191cf2f94327a4889f8b9d] Linux 5.1-rc4
git bisect start 'HEAD' 'v5.1-rc4'
# bad: [35a1393e49d19d625d0270fcdf409fad743263cd] Merge remote-tracking branch 'crypto/master'
git bisect bad 35a1393e49d19d625d0270fcdf409fad743263cd
# good: [d1472e6a10485eb679eec89324ab165533af0be1] Merge remote-tracking branch 'hid/for-next'
git bisect good d1472e6a10485eb679eec89324ab165533af0be1
# bad: [0c9381d9bcfbd7dbc26b6e5f296e90d6396ea4db] Merge branch 'netdevsim-small-spring-cleanup'
git bisect bad 0c9381d9bcfbd7dbc26b6e5f296e90d6396ea4db
# good: [356d71e00d278d865f8c7f68adebd6ce4698a7e2] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 356d71e00d278d865f8c7f68adebd6ce4698a7e2
# good: [fe1ec0bdfba4e7bfe6db81a1e4ac74beb36691e8] ehea: remove set but not used variables 'epa' and 'cq_handle_ref'
git bisect good fe1ec0bdfba4e7bfe6db81a1e4ac74beb36691e8
# bad: [ed514fc5615d7688b7c227a76863e98a92fb0d54] cxgb4: Don't return EAGAIN when TCAM is full.
git bisect bad ed514fc5615d7688b7c227a76863e98a92fb0d54
# good: [7090425104dbd87959bd424e9bd5a8711fde0986] net: phy: add amlogic g12a mdio mux support
git bisect good 7090425104dbd87959bd424e9bd5a8711fde0986
# good: [059477830022e1886f55a9641702461c249fa864] net: hsr: fix placement of logical operator in a multi-line statement
git bisect good 059477830022e1886f55a9641702461c249fa864
# good: [7a41c294c1463100fdc82a356e22e36bbaa6b0f9] rhashtable: use cmpxchg() in nested_table_alloc()
git bisect good 7a41c294c1463100fdc82a356e22e36bbaa6b0f9
# bad: [9186c90bbb9525f46eddb590be26c749b5b1def7] Merge branch 'rhashtable-bitlocks'
git bisect bad 9186c90bbb9525f46eddb590be26c749b5b1def7
# bad: [8f0db018006a421956965e1149234c4e8db718ee] rhashtable: use bit_spin_locks to protect hash bucket.
git bisect bad 8f0db018006a421956965e1149234c4e8db718ee
# good: [ff302db965b57c141297911ea647d36d11fedfbe] rhashtable: allow rht_bucket_var to return NULL.
git bisect good ff302db965b57c141297911ea647d36d11fedfbe
# first bad commit: [8f0db018006a421956965e1149234c4e8db718ee] rhashtable: use bit_spin_locks to protect hash bucket.

2019-04-11 00:49:49

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On Wed, Apr 10 2019, Guenter Roeck wrote:

> Hi,
>
> On Tue, Apr 02, 2019 at 10:07:45AM +1100, NeilBrown wrote:
>> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
>> bucket pointer to lock the hash chain for that bucket.
>>
>> The benefits of a bit spin_lock are:
>> - no need to allocate a separate array of locks.
>> - no need to have a configuration option to guide the
>> choice of the size of this array
>> - locking cost is often a single test-and-set in a cache line
>> that will have to be loaded anyway. When inserting at, or removing
>> from, the head of the chain, the unlock is free - writing the new
>> address in the bucket head implicitly clears the lock bit.
>> For __rhashtable_insert_fast() we ensure this always happens
>> when adding a new key.
>> - even when lockings costs 2 updates (lock and unlock), they are
>> in a cacheline that needs to be read anyway.
>>
>> The cost of using a bit spin_lock is a little bit of code complexity,
>> which I think is quite manageable.
>>
>> Bit spin_locks are sometimes inappropriate because they are not fair -
>> if multiple CPUs repeatedly contend of the same lock, one CPU can
>> easily be starved. This is not a credible situation with rhashtable.
>> Multiple CPUs may want to repeatedly add or remove objects, but they
>> will typically do so at different buckets, so they will attempt to
>> acquire different locks.
>>
>> As we have more bit-locks than we previously had spinlocks (by at
>> least a factor of two) we can expect slightly less contention to
>> go with the slightly better cache behavior and reduced memory
>> consumption.
>>
>> To enhance type checking, a new struct is introduced to represent the
>> pointer plus lock-bit
>> that is stored in the bucket-table. This is "struct rhash_lock_head"
>> and is empty. A pointer to this needs to be cast to either an
>> unsigned lock, or a "struct rhash_head *" to be useful.
>> Variables of this type are most often called "bkt".
>>
>> Previously "pprev" would sometimes point to a bucket, and sometimes a
>> ->next pointer in an rhash_head. As these are now different types,
>> pprev is NULL when it would have pointed to the bucket. In that case,
>> 'blk' is used, together with correct locking protocol.
>>
>> Signed-off-by: NeilBrown <[email protected]>
>
> This patch causes my qemu q800 boot test to crash reliably.
>
> Starting network: Unable to handle kernel access at virtual address (ptrval)
> Oops: 00000000
> Modules linked in:
> PC: [<00248b90>] sk_filter_trim_cap+0x56/0x158
> SR: 2000 SP: (ptrval) a2: 07a30aa0
> d0: 07836300 d1: 0783666c d2: 00000001 d3: 0025c192
> d4: 0025a636 d5: 00248b3a a0: 078363fe a1: 60000000
> Process ip (pid: 66, task=(ptrval))
> Frame format=7 eff addr=6000000c ssw=0505 faddr=6000000c
> wb 1 stat/addr/data: 0000 00000000 00000000
> wb 2 stat/addr/data: 0000 00000000 00000000
> wb 3 stat/addr/data: 0000 6000000c 00000000
> push data: 00000000 00000000 00000000 00000000
> Stack from 07a5bdec:
> 07a5be5c 0025c192 0025a636 00248b3a 00000000 ef9febc8 078363fe 0787a2d0
> 0783645c 07a5be5c 0025c192 0025a636 00248b3a 0787a2d0 07a2c000 0025c470
> 078363fe 0787a2d0 00000001 00000000 00000000 07a5be98 07a17500 00000000
> 0787a2d0 07a2c000 07a5bef8 07a5beac 7fffffff 0025c7e2 07a2c000 0787a2d0
> 00000000 00000000 00000000 0000000c ef9feb14 ef9feb14 00000000 0031a52e
> 07a5bef8 07a5bf28 0000000c 0781eba0 00000000 00000042 00000000 00000000
> Call Trace: [<0025c192>] netlink_attachskb+0x0/0x138
> [<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
> [<00248b3a>] sk_filter_trim_cap+0x0/0x158
> [<0025c192>] netlink_attachskb+0x0/0x138
> [<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
> [<00248b3a>] sk_filter_trim_cap+0x0/0x158
> [<0025c470>] netlink_unicast+0x170/0x1be
> [<0025c7e2>] netlink_sendmsg+0x288/0x2b2
> [<0021a5be>] sock_sendmsg+0x1c/0x44
> [<0021b8a6>] __sys_sendto+0xac/0xd2
> [<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
> [<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
> [<00219906>] sock_alloc_file+0x50/0x80
> [<000b7a1c>] fd_install+0x12/0x18
> [<0021b1cc>] __sys_socket+0x7e/0x9c
> [<0021b8ea>] sys_sendto+0x1e/0x24
> [<00002a40>] syscall+0x8/0xc
> [<0000c004>] ATANTBL+0x618/0x800
> Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88

Thanks for testing and for the report.
The above code disassembles to:

0: 4a89 tstl %a1
2: 6604 bnes 0x8
4: 4280 clrl %d0
6: 60ea bras 0xfffffff2
8: 2c2b 000c movel %a3@(12),%d6
c: 2748 000c movel %a0,%a3@(12)
10:* 2869 000c moveal %a1@(12),%a4 <-- trapping instruction
14: 082c 0003 0002 btst #3,%a4@(2)
1a: 6728 beqs 0x44
1c: 4878 0014 pea 0x14
20: 7620 moveq #32,%d3
22: 4873 3800 pea %a3@(0000000000000000,%d3:l)
26: 486e ffec pea %fp@(-20)
2a: 4eb9 002e 5b88 jsr 0x2e5b88

And as %a1 is 60000000, it crashes.
I'm not familiar with m68k assembler, but a bit of hunting suggests that
moveal %a1@(12),%a4

means "add 12 to %1, load an address from there, then dereference that
address again.
I think that both skb->sk and filter->prog are at offsets of 12, so I
guess
8: 2c2b 000c movel %a3@(12),%d6
is
struct sock *save_sk = skb->sk;

c: 2748 000c movel %a0,%a3@(12)
is
skb->sk = sk;
and
10:* 2869 000c moveal %a1@(12),%a4 <-- trapping instruction
14: 082c 0003 0002 btst #3,%a4@(2)
is
bpf_prog_run_save_cb(filter->prog, skb);
if (unlikely(prog->cb_access)) {

cb_access might be bit 3 of the byte 2 along from *prog, which is 12
along from a1.

That makes a1 'filter', loaded from sk->sk_filter, where 'sk' is %a0,
078363fe

That address is 2-byte aligned, which is probably wrong.... If addresses
of structs aren't always 4-byte aligned, then using BIT(1) for a lock
bit isn't going to work.

.... and after googling a bit I see that 68000 require 2-byte alignment,
but not 4-byte. Oh..

That means there aren't two spare bits in an address, so I cannot use
one for the NULLS and one for a lock bit. Bother.

I might be able to find a different way forward, but for now I think we
need to drop this series.

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2019-04-11 02:18:03

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

From: NeilBrown <[email protected]>
Date: Thu, 11 Apr 2019 10:48:42 +1000

> I might be able to find a different way forward, but for now I think we
> need to drop this series.

You need to send me an explicit revert of all of the changes then.

Thank you.

2019-04-11 06:14:25

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On Thu, Apr 11 2019, NeilBrown wrote:

> On Wed, Apr 10 2019, Guenter Roeck wrote:
>
>> Hi,
>>
.....
>>
>> This patch causes my qemu q800 boot test to crash reliably.
>>
....
>> Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88
>
> Thanks for testing and for the report.
.....
>
> .... and after googling a bit I see that 68000 require 2-byte alignment,
> but not 4-byte. Oh..
>
> That means there aren't two spare bits in an address, so I cannot use
> one for the NULLS and one for a lock bit. Bother.
>
> I might be able to find a different way forward, but for now I think we
> need to drop this series.

I have found a way forward that I like. It only requires one bit per
address to be over-loaded.

The following patch implements it and works for me.
Could you please confirm that it fixes your problem on m68k ??

I'll break it up into a few reviewable patches and post them separately.

Thanks again,
NeilBrown


diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 01253d809687..214487aaf3eb 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -40,7 +40,7 @@
* the chain. To avoid dereferencing this pointer without clearing
* the bit first, we use an opaque 'struct rhash_lock_head *' for the
* pointer stored in the bucket. This struct needs to be defined so
- * that rcu_derefernce() works on it, but it has no content so a
+ * that rcu_dereference() works on it, but it has no content so a
* cast is needed for it to be useful. This ensures it isn't
* used by mistake with clearing the lock bit first.
*/
@@ -85,70 +85,6 @@ struct bucket_table {
struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp;
};

-/*
- * We lock a bucket by setting BIT(0) in the pointer - this is always
- * zero in real pointers and in the nulls marker.
- * bit_spin_locks do not handle contention well, but the whole point
- * of the hashtable design is to achieve minimum per-bucket contention.
- * A nested hash table might not have a bucket pointer. In that case
- * we cannot get a lock. For remove and replace the bucket cannot be
- * interesting and doesn't need locking.
- * For insert we allocate the bucket if this is the last bucket_table,
- * and then take the lock.
- * Sometimes we unlock a bucket by writing a new pointer there. In that
- * case we don't need to unlock, but we do need to reset state such as
- * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer()
- * provides the same release semantics that bit_spin_unlock() provides,
- * this is safe.
- */
-
-static inline void rht_lock(struct rhash_lock_head **bkt)
-{
- local_bh_disable();
- bit_spin_lock(0, (unsigned long *)bkt);
-}
-
-static inline void rht_unlock(struct rhash_lock_head **bkt)
-{
- bit_spin_unlock(0, (unsigned long *)bkt);
- local_bh_enable();
-}
-
-static inline void rht_assign_unlock(struct rhash_lock_head **bkt,
- struct rhash_head *obj)
-{
- struct rhash_head **p = (struct rhash_head **)bkt;
-
- if (obj == RHT_NULLS_MARKER(bkt))
- obj = NULL;
- rcu_assign_pointer(*p, obj);
- preempt_enable();
- __release(bitlock);
- local_bh_enable();
-}
-
-/*
- * If 'p' is a bucket head and might be locked:
- * rht_ptr() returns the address without the lock bit.
- * rht_ptr_locked() returns the address WITH the lock bit.
- */
-static inline struct rhash_head __rcu *rht_ptr(const struct rhash_lock_head **bkt,
- struct bucket_table *tbl,
- unsigned int hash)
-{
- const struct rhash_lock_head *p =
- rht_dereference_bucket(*bkt, tbl, hash);
- if (!p)
- return RHT_NULLS_MARKER(bkt);
- return (void *)(((unsigned long)p) & ~BIT(0));
-}
-
-static inline struct rhash_lock_head __rcu *rht_ptr_locked(const
- struct rhash_head *p)
-{
- return (void *)(((unsigned long)p) | BIT(0));
-}
-
/*
* NULLS_MARKER() expects a hash value with the low
* bits mostly likely to be significant, and it discards
@@ -367,6 +303,93 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
&tbl->buckets[hash];
}

+/*
+ * We lock a bucket by setting BIT(0) in the pointer - this is always
+ * zero in real pointers and in the nulls marker.
+ * bit_spin_locks do not handle contention well, but the whole point
+ * of the hashtable design is to achieve minimum per-bucket contention.
+ * A nested hash table might not have a bucket pointer. In that case
+ * we cannot get a lock. For remove and replace the bucket cannot be
+ * interesting and doesn't need locking.
+ * For insert we allocate the bucket if this is the last bucket_table,
+ * and then take the lock.
+ * Sometimes we unlock a bucket by writing a new pointer there. In that
+ * case we don't need to unlock, but we do need to reset state such as
+ * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer()
+ * provides the same release semantics that bit_spin_unlock() provides,
+ * this is safe.
+ */
+
+static inline void rht_lock(struct rhash_lock_head **bkt)
+{
+ local_bh_disable();
+ bit_spin_lock(0, (unsigned long *)bkt);
+}
+
+static inline void rht_unlock(struct rhash_lock_head **bkt)
+{
+ bit_spin_unlock(0, (unsigned long *)bkt);
+ local_bh_enable();
+}
+
+/*
+ * If 'p' is a bucket head and might be locked:
+ * rht_ptr() returns the address without the lock bit.
+ * rht_ptr_locked() returns the address WITH the lock bit.
+ */
+static inline struct rhash_head __rcu *rht_ptr(struct rhash_lock_head __rcu * const *bkt,
+ struct bucket_table *tbl,
+ unsigned int hash)
+{
+ const struct rhash_lock_head *p =
+ rht_dereference_bucket_rcu(*bkt, tbl, hash);
+ if ((((unsigned long)p) & ~BIT(0)) == 0)
+ return RHT_NULLS_MARKER(bkt);
+ return (void *)(((unsigned long)p) & ~BIT(0));
+}
+
+/*
+ * This can be called when access is known to be exclusive,
+ * such as when destorying an rhashtable
+ */
+static inline struct rhash_head __rcu *rht_ptr_unprotected(
+ struct rhash_lock_head __rcu * const *bkt)
+{
+ const struct rhash_lock_head *p = rcu_dereference_protected(*bkt, true);
+ if (!p)
+ return RHT_NULLS_MARKER(bkt);
+ return (void *)(((unsigned long)p) & ~BIT(0));
+}
+
+static inline struct rhash_lock_head __rcu *rht_ptr_locked(const
+ struct rhash_head *p)
+{
+ return (void *)(((unsigned long)p) | BIT(0));
+}
+
+static inline void rht_assign_locked(struct rhash_lock_head __rcu **bkt,
+ struct rhash_head *obj)
+{
+ struct rhash_head __rcu **p = (struct rhash_head __rcu **)bkt;
+
+ if (rht_is_a_nulls(obj))
+ obj = NULL;
+ rcu_assign_pointer(*p, rht_ptr_locked(obj));
+}
+
+static inline void rht_assign_unlock(struct rhash_lock_head __rcu **bkt,
+ struct rhash_head *obj)
+{
+ struct rhash_head __rcu **p = (struct rhash_head __rcu **)bkt;
+
+ if (rht_is_a_nulls(obj))
+ obj = NULL;
+ rcu_assign_pointer(*p, obj);
+ preempt_enable();
+ __release(bitlock);
+ local_bh_enable();
+}
+
/**
* rht_for_each_from - iterate over hash chain from given head
* @pos: the &struct rhash_head to use as a loop cursor.
@@ -375,7 +398,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @hash: the hash value / bucket index
*/
#define rht_for_each_from(pos, head, tbl, hash) \
- for (pos = rht_dereference_bucket(head, tbl, hash); \
+ for (pos = head; \
!rht_is_a_nulls(pos); \
pos = rht_dereference_bucket((pos)->next, tbl, hash))

@@ -386,7 +409,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @hash: the hash value / bucket index
*/
#define rht_for_each(pos, tbl, hash) \
- rht_for_each_from(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)
+ rht_for_each_from(pos, rht_ptr(rht_bucket(tbl, hash), tbl, hash))

/**
* rht_for_each_entry_from - iterate over hash chain from given head
@@ -398,7 +421,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @member: name of the &struct rhash_head within the hashable struct.
*/
#define rht_for_each_entry_from(tpos, pos, head, tbl, hash, member) \
- for (pos = rht_dereference_bucket(head, tbl, hash); \
+ for (pos = head; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
pos = rht_dereference_bucket((pos)->next, tbl, hash))

@@ -411,8 +434,8 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @member: name of the &struct rhash_head within the hashable struct.
*/
#define rht_for_each_entry(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_from(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash, member)
+ rht_for_each_entry_from(tpos, pos, rht_ptr(rht_bucket(tbl, hash), \
+ tbl, hash, member))

/**
* rht_for_each_entry_safe - safely iterate over hash chain of given type
@@ -427,8 +450,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* remove the loop cursor from the list.
*/
#define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member) \
- for (pos = rht_dereference_bucket(rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash), \
+ for (pos = rht_ptr(rht_bucket(tbl, hash), tbl, hash)), \
next = !rht_is_a_nulls(pos) ? \
rht_dereference_bucket(pos->next, tbl, hash) : NULL; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
@@ -449,7 +471,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
*/
#define rht_for_each_rcu_from(pos, head, tbl, hash) \
for (({barrier(); }), \
- pos = rht_dereference_bucket_rcu(head, tbl, hash); \
+ pos = head; \
!rht_is_a_nulls(pos); \
pos = rcu_dereference_raw(pos->next))

@@ -464,10 +486,9 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* traversal is guarded by rcu_read_lock().
*/
#define rht_for_each_rcu(pos, tbl, hash) \
- for (({barrier(); }), \
- pos = rht_ptr(rht_dereference_bucket_rcu( \
- *rht_bucket(tbl, hash), tbl, hash)); \
- !rht_is_a_nulls(pos); \
+ for (({barrier(); }), \
+ pos = rht_ptr(rht_bucket(tbl, hash), tbl, hash); \
+ !rht_is_a_nulls(pos); \
pos = rcu_dereference_raw(pos->next))

/**
@@ -485,7 +506,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
*/
#define rht_for_each_entry_rcu_from(tpos, pos, head, tbl, hash, member) \
for (({barrier(); }), \
- pos = rht_dereference_bucket_rcu(head, tbl, hash); \
+ pos = head; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
pos = rht_dereference_bucket_rcu(pos->next, tbl, hash))

@@ -501,10 +522,10 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* the _rcu mutation primitives such as rhashtable_insert() as long as the
* traversal is guarded by rcu_read_lock().
*/
-#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_rcu_from(tpos, pos, \
- rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash, member)
+#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \
+ rht_for_each_entry_rcu_from(tpos, pos, \
+ rht_ptr(rht_bucket(tbl, hash), \
+ tbl, hash, member))

/**
* rhl_for_each_rcu - iterate over rcu hash table list
@@ -559,8 +580,7 @@ static inline struct rhash_head *__rhashtable_lookup(
hash = rht_key_hashfn(ht, tbl, key, params);
bkt = rht_bucket(tbl, hash);
do {
- he = rht_ptr(rht_dereference_bucket_rcu(*bkt, tbl, hash));
- rht_for_each_rcu_from(he, he, tbl, hash) {
+ rht_for_each_rcu_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (params.obj_cmpfn ?
params.obj_cmpfn(&arg, rht_obj(ht, he)) :
rhashtable_compare(&arg, rht_obj(ht, he)))
@@ -693,7 +713,7 @@ static inline void *__rhashtable_insert_fast(
return rhashtable_insert_slow(ht, key, obj);
}

- rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *plist;
struct rhlist_head *list;

@@ -738,7 +758,7 @@ static inline void *__rhashtable_insert_fast(
goto slow_path;

/* Inserting at head of list makes unlocking free. */
- head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));
+ head = rht_ptr(bkt, tbl, hash);

RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -965,7 +985,7 @@ static inline int __rhashtable_remove_fast_one(
pprev = NULL;
rht_lock(bkt);

- rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;

list = container_of(he, struct rhlist_head, rhead);
@@ -1124,7 +1144,7 @@ static inline int __rhashtable_replace_fast(
pprev = NULL;
rht_lock(bkt);

- rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (he != obj_old) {
pprev = &he->next;
continue;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c5d0974467ee..2eafc8463349 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -59,7 +59,7 @@ int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
return 1;
if (unlikely(tbl->nest))
return 1;
- return bit_spin_is_locked(1, (unsigned long *)&tbl->buckets[hash]);
+ return bit_spin_is_locked(0, (unsigned long *)&tbl->buckets[hash]);
}
EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
#else
@@ -221,7 +221,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
int err = -EAGAIN;
struct rhash_head *head, *next, *entry;
- struct rhash_head **pprev = NULL;
+ struct rhash_head __rcu **pprev = NULL;
unsigned int new_hash;

if (new_tbl->nest)
@@ -229,7 +229,8 @@ static int rhashtable_rehash_one(struct rhashtable *ht,

err = -ENOENT;

- rht_for_each_from(entry, rht_ptr(*bkt), old_tbl, old_hash) {
+ rht_for_each_from(entry, rht_ptr(bkt, old_tbl, old_hash),
+ old_tbl, old_hash) {
err = 0;
next = rht_dereference_bucket(entry->next, old_tbl, old_hash);

@@ -246,8 +247,8 @@ static int rhashtable_rehash_one(struct rhashtable *ht,

rht_lock(&new_tbl->buckets[new_hash]);

- head = rht_ptr(rht_dereference_bucket(new_tbl->buckets[new_hash],
- new_tbl, new_hash));
+ head = rht_ptr(new_tbl->buckets + new_hash,
+ new_tbl, new_hash);

RCU_INIT_POINTER(entry->next, head);

@@ -257,7 +258,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
rcu_assign_pointer(*pprev, next);
else
/* Need to preserved the bit lock. */
- rcu_assign_pointer(*bkt, rht_ptr_locked(next));
+ rht_assign_locked(bkt, next);

out:
return err;
@@ -484,12 +485,12 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
.ht = ht,
.key = key,
};
- struct rhash_head **pprev = NULL;
+ struct rhash_head __rcu **pprev = NULL;
struct rhash_head *head;
int elasticity;

elasticity = RHT_ELASTICITY;
- rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;
struct rhlist_head *plist;

@@ -515,7 +516,7 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
rcu_assign_pointer(*pprev, obj);
else
/* Need to preserve the bit lock */
- rcu_assign_pointer(*bkt, rht_ptr_locked(obj));
+ rht_assign_locked(bkt, obj);

return NULL;
}
@@ -555,7 +556,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
if (unlikely(rht_grow_above_100(ht, tbl)))
return ERR_PTR(-EAGAIN);

- head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));
+ head = rht_ptr(bkt, tbl, hash);

RCU_INIT_POINTER(obj->next, head);
if (ht->rhlist) {
@@ -568,7 +569,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
/* bkt is always the head of the list, so it holds
* the lock, which we need to preserve
*/
- rcu_assign_pointer(*bkt, rht_ptr_locked(obj));
+ rht_assign_locked(bkt, obj);

atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -1137,7 +1138,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
struct rhash_head *pos, *next;

cond_resched();
- for (pos = rht_ptr(rht_dereference(*rht_bucket(tbl, i), ht)),
+ for (pos = rht_ptr_unprotected(rht_bucket(tbl, i)),
next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
!rht_is_a_nulls(pos);
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 02592c2a249c..7b93cfefe195 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -500,7 +500,7 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
struct rhash_head *pos, *next;
struct test_obj_rhl *p;

- pos = rht_ptr(rht_dereference(tbl->buckets[i], ht));
+ pos = rht_ptr_unprotected(tbl->buckets + i);
next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : NULL;

if (!rht_is_a_nulls(pos)) {


Attachments:
signature.asc (847.00 B)

2019-04-11 06:42:33

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On Thu, Apr 11 2019, NeilBrown wrote:

> On Thu, Apr 11 2019, NeilBrown wrote:
>
>> On Wed, Apr 10 2019, Guenter Roeck wrote:
>>
>>> Hi,
>>>
> .....
>>>
>>> This patch causes my qemu q800 boot test to crash reliably.
>>>
> ....
>>> Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88
>>
>> Thanks for testing and for the report.
> .....
>>
>> .... and after googling a bit I see that 68000 require 2-byte alignment,
>> but not 4-byte. Oh..
>>
>> That means there aren't two spare bits in an address, so I cannot use
>> one for the NULLS and one for a lock bit. Bother.
>>
>> I might be able to find a different way forward, but for now I think we
>> need to drop this series.
>
> I have found a way forward that I like. It only requires one bit per
> address to be over-loaded.
>
> The following patch implements it and works for me.
> Could you please confirm that it fixes your problem on m68k ??

Sorry, that was on the wrong base.

Please try this one, against current net-next.

NeilBrown


diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 460c0eaf6b96..4a30306bcf1d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -35,12 +35,12 @@
* the least significant bit set but otherwise stores the address of
* the hash bucket. This allows us to be be sure we've found the end
* of the right list.
- * The value stored in the hash bucket has BIT(2) used as a lock bit.
+ * The value stored in the hash bucket has BIT(0) used as a lock bit.
* This bit must be atomically set before any changes are made to
* the chain. To avoid dereferencing this pointer without clearing
* the bit first, we use an opaque 'struct rhash_lock_head *' for the
* pointer stored in the bucket. This struct needs to be defined so
- * that rcu_derefernce() works on it, but it has no content so a
+ * that rcu_dereference() works on it, but it has no content so a
* cast is needed for it to be useful. This ensures it isn't
* used by mistake with clearing the lock bit first.
*/
@@ -87,90 +87,23 @@ struct bucket_table {
struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp;
};

-/*
- * We lock a bucket by setting BIT(1) in the pointer - this is always
- * zero in real pointers and in the nulls marker.
- * bit_spin_locks do not handle contention well, but the whole point
- * of the hashtable design is to achieve minimum per-bucket contention.
- * A nested hash table might not have a bucket pointer. In that case
- * we cannot get a lock. For remove and replace the bucket cannot be
- * interesting and doesn't need locking.
- * For insert we allocate the bucket if this is the last bucket_table,
- * and then take the lock.
- * Sometimes we unlock a bucket by writing a new pointer there. In that
- * case we don't need to unlock, but we do need to reset state such as
- * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer()
- * provides the same release semantics that bit_spin_unlock() provides,
- * this is safe.
- */
-
-static inline void rht_lock(struct bucket_table *tbl,
- struct rhash_lock_head **bkt)
-{
- local_bh_disable();
- bit_spin_lock(1, (unsigned long *)bkt);
- lock_map_acquire(&tbl->dep_map);
-}
-
-static inline void rht_lock_nested(struct bucket_table *tbl,
- struct rhash_lock_head **bucket,
- unsigned int subclass)
-{
- local_bh_disable();
- bit_spin_lock(1, (unsigned long *)bucket);
- lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
-}
-
-static inline void rht_unlock(struct bucket_table *tbl,
- struct rhash_lock_head **bkt)
-{
- lock_map_release(&tbl->dep_map);
- bit_spin_unlock(1, (unsigned long *)bkt);
- local_bh_enable();
-}
-
-static inline void rht_assign_unlock(struct bucket_table *tbl,
- struct rhash_lock_head **bkt,
- struct rhash_head *obj)
-{
- struct rhash_head **p = (struct rhash_head **)bkt;
-
- lock_map_release(&tbl->dep_map);
- rcu_assign_pointer(*p, obj);
- preempt_enable();
- __release(bitlock);
- local_bh_enable();
-}
-
-/*
- * If 'p' is a bucket head and might be locked:
- * rht_ptr() returns the address without the lock bit.
- * rht_ptr_locked() returns the address WITH the lock bit.
- */
-static inline struct rhash_head __rcu *rht_ptr(const struct rhash_lock_head *p)
-{
- return (void *)(((unsigned long)p) & ~BIT(1));
-}
-
-static inline struct rhash_lock_head __rcu *rht_ptr_locked(const
- struct rhash_head *p)
-{
- return (void *)(((unsigned long)p) | BIT(1));
-}
-
/*
* NULLS_MARKER() expects a hash value with the low
* bits mostly likely to be significant, and it discards
* the msb.
- * We git it an address, in which the bottom 2 bits are
+ * We give it an address, in which the bottom bit is
* always 0, and the msb might be significant.
* So we shift the address down one bit to align with
* expectations and avoid losing a significant bit.
+ *
+ * We never store the NULLS_MARKER in the hash table
+ * itself as we need the lsb for locking.
+ * Instead we store a NULL
*/
#define RHT_NULLS_MARKER(ptr) \
((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
#define INIT_RHT_NULLS_HEAD(ptr) \
- ((ptr) = RHT_NULLS_MARKER(&(ptr)))
+ ((ptr) = NULL)

static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
{
@@ -372,6 +305,108 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
&tbl->buckets[hash];
}

+/*
+ * We lock a bucket by setting BIT(0) in the pointer - this is always
+ * zero in real pointers and in the nulls marker.
+ * bit_spin_locks do not handle contention well, but the whole point
+ * of the hashtable design is to achieve minimum per-bucket contention.
+ * A nested hash table might not have a bucket pointer. In that case
+ * we cannot get a lock. For remove and replace the bucket cannot be
+ * interesting and doesn't need locking.
+ * For insert we allocate the bucket if this is the last bucket_table,
+ * and then take the lock.
+ * Sometimes we unlock a bucket by writing a new pointer there. In that
+ * case we don't need to unlock, but we do need to reset state such as
+ * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer()
+ * provides the same release semantics that bit_spin_unlock() provides,
+ * this is safe.
+ */
+
+static inline void rht_lock(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt)
+{
+ local_bh_disable();
+ bit_spin_lock(0, (unsigned long *)bkt);
+ lock_map_acquire(&tbl->dep_map);
+}
+
+static inline void rht_lock_nested(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt,
+ unsigned int subclass)
+{
+ local_bh_disable();
+ bit_spin_lock(0, (unsigned long *)bkt);
+ lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
+}
+
+static inline void rht_unlock(struct bucket_table *tbl,
+ struct rhash_lock_head **bkt)
+{
+ lock_map_release(&tbl->dep_map);
+ bit_spin_unlock(0, (unsigned long *)bkt);
+ local_bh_enable();
+}
+
+/*
+ * If 'p' is a bucket head and might be locked:
+ * rht_ptr() returns the address without the lock bit.
+ * rht_ptr_locked() returns the address WITH the lock bit.
+ */
+static inline struct rhash_head __rcu *rht_ptr(struct rhash_lock_head __rcu * const *bkt,
+ struct bucket_table *tbl,
+ unsigned int hash)
+{
+ const struct rhash_lock_head *p =
+ rht_dereference_bucket_rcu(*bkt, tbl, hash);
+ if ((((unsigned long)p) & ~BIT(0)) == 0)
+ return RHT_NULLS_MARKER(bkt);
+ return (void *)(((unsigned long)p) & ~BIT(0));
+}
+
+/*
+ * This can be called when access is known to be exclusive,
+ * such as when destorying an rhashtable
+ */
+static inline struct rhash_head __rcu *rht_ptr_unprotected(
+ struct rhash_lock_head __rcu * const *bkt)
+{
+ const struct rhash_lock_head *p = rcu_dereference_protected(*bkt, true);
+ if (!p)
+ return RHT_NULLS_MARKER(bkt);
+ return (void *)(((unsigned long)p) & ~BIT(0));
+}
+
+static inline struct rhash_lock_head __rcu *rht_ptr_locked(const
+ struct rhash_head *p)
+{
+ return (void *)(((unsigned long)p) | BIT(0));
+}
+
+static inline void rht_assign_locked(struct rhash_lock_head __rcu **bkt,
+ struct rhash_head *obj)
+{
+ struct rhash_head __rcu **p = (struct rhash_head __rcu **)bkt;
+
+ if (rht_is_a_nulls(obj))
+ obj = NULL;
+ rcu_assign_pointer(*p, rht_ptr_locked(obj));
+}
+
+static inline void rht_assign_unlock(struct bucket_table *tbl,
+ struct rhash_lock_head __rcu **bkt,
+ struct rhash_head *obj)
+{
+ struct rhash_head __rcu **p = (struct rhash_head __rcu **)bkt;
+
+ if (rht_is_a_nulls(obj))
+ obj = NULL;
+ lock_map_release(&tbl->dep_map);
+ rcu_assign_pointer(*p, obj);
+ preempt_enable();
+ __release(bitlock);
+ local_bh_enable();
+}
+
/**
* rht_for_each_from - iterate over hash chain from given head
* @pos: the &struct rhash_head to use as a loop cursor.
@@ -380,7 +415,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @hash: the hash value / bucket index
*/
#define rht_for_each_from(pos, head, tbl, hash) \
- for (pos = rht_dereference_bucket(head, tbl, hash); \
+ for (pos = head; \
!rht_is_a_nulls(pos); \
pos = rht_dereference_bucket((pos)->next, tbl, hash))

@@ -391,7 +426,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @hash: the hash value / bucket index
*/
#define rht_for_each(pos, tbl, hash) \
- rht_for_each_from(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)
+ rht_for_each_from(pos, rht_ptr(rht_bucket(tbl, hash), tbl, hash))

/**
* rht_for_each_entry_from - iterate over hash chain from given head
@@ -403,7 +438,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @member: name of the &struct rhash_head within the hashable struct.
*/
#define rht_for_each_entry_from(tpos, pos, head, tbl, hash, member) \
- for (pos = rht_dereference_bucket(head, tbl, hash); \
+ for (pos = head; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
pos = rht_dereference_bucket((pos)->next, tbl, hash))

@@ -416,8 +451,8 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* @member: name of the &struct rhash_head within the hashable struct.
*/
#define rht_for_each_entry(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_from(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash, member)
+ rht_for_each_entry_from(tpos, pos, rht_ptr(rht_bucket(tbl, hash), \
+ tbl, hash, member))

/**
* rht_for_each_entry_safe - safely iterate over hash chain of given type
@@ -432,8 +467,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* remove the loop cursor from the list.
*/
#define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member) \
- for (pos = rht_dereference_bucket(rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash), \
+ for (pos = rht_ptr(rht_bucket(tbl, hash), tbl, hash)), \
next = !rht_is_a_nulls(pos) ? \
rht_dereference_bucket(pos->next, tbl, hash) : NULL; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
@@ -454,7 +488,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
*/
#define rht_for_each_rcu_from(pos, head, tbl, hash) \
for (({barrier(); }), \
- pos = rht_dereference_bucket_rcu(head, tbl, hash); \
+ pos = head; \
!rht_is_a_nulls(pos); \
pos = rcu_dereference_raw(pos->next))

@@ -469,10 +503,9 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* traversal is guarded by rcu_read_lock().
*/
#define rht_for_each_rcu(pos, tbl, hash) \
- for (({barrier(); }), \
- pos = rht_ptr(rht_dereference_bucket_rcu( \
- *rht_bucket(tbl, hash), tbl, hash)); \
- !rht_is_a_nulls(pos); \
+ for (({barrier(); }), \
+ pos = rht_ptr(rht_bucket(tbl, hash), tbl, hash); \
+ !rht_is_a_nulls(pos); \
pos = rcu_dereference_raw(pos->next))

/**
@@ -490,7 +523,7 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
*/
#define rht_for_each_entry_rcu_from(tpos, pos, head, tbl, hash, member) \
for (({barrier(); }), \
- pos = rht_dereference_bucket_rcu(head, tbl, hash); \
+ pos = head; \
(!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \
pos = rht_dereference_bucket_rcu(pos->next, tbl, hash))

@@ -506,10 +539,10 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
* the _rcu mutation primitives such as rhashtable_insert() as long as the
* traversal is guarded by rcu_read_lock().
*/
-#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \
- rht_for_each_entry_rcu_from(tpos, pos, \
- rht_ptr(*rht_bucket(tbl, hash)), \
- tbl, hash, member)
+#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \
+ rht_for_each_entry_rcu_from(tpos, pos, \
+ rht_ptr(rht_bucket(tbl, hash), \
+ tbl, hash, member))

/**
* rhl_for_each_rcu - iterate over rcu hash table list
@@ -564,8 +597,7 @@ static inline struct rhash_head *__rhashtable_lookup(
hash = rht_key_hashfn(ht, tbl, key, params);
bkt = rht_bucket(tbl, hash);
do {
- he = rht_ptr(rht_dereference_bucket_rcu(*bkt, tbl, hash));
- rht_for_each_rcu_from(he, he, tbl, hash) {
+ rht_for_each_rcu_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (params.obj_cmpfn ?
params.obj_cmpfn(&arg, rht_obj(ht, he)) :
rhashtable_compare(&arg, rht_obj(ht, he)))
@@ -698,7 +730,7 @@ static inline void *__rhashtable_insert_fast(
return rhashtable_insert_slow(ht, key, obj);
}

- rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *plist;
struct rhlist_head *list;

@@ -743,7 +775,7 @@ static inline void *__rhashtable_insert_fast(
goto slow_path;

/* Inserting at head of list makes unlocking free. */
- head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));
+ head = rht_ptr(bkt, tbl, hash);

RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -970,7 +1002,7 @@ static inline int __rhashtable_remove_fast_one(
pprev = NULL;
rht_lock(tbl, bkt);

- rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;

list = container_of(he, struct rhlist_head, rhead);
@@ -1129,7 +1161,7 @@ static inline int __rhashtable_replace_fast(
pprev = NULL;
rht_lock(tbl, bkt);

- rht_for_each_from(he, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (he != obj_old) {
pprev = &he->next;
continue;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a8583af43b59..06fc674feb3d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -59,7 +59,7 @@ int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
return 1;
if (unlikely(tbl->nest))
return 1;
- return bit_spin_is_locked(1, (unsigned long *)&tbl->buckets[hash]);
+ return bit_spin_is_locked(0, (unsigned long *)&tbl->buckets[hash]);
}
EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
#else
@@ -224,7 +224,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
int err = -EAGAIN;
struct rhash_head *head, *next, *entry;
- struct rhash_head **pprev = NULL;
+ struct rhash_head __rcu **pprev = NULL;
unsigned int new_hash;

if (new_tbl->nest)
@@ -232,7 +232,8 @@ static int rhashtable_rehash_one(struct rhashtable *ht,

err = -ENOENT;

- rht_for_each_from(entry, rht_ptr(*bkt), old_tbl, old_hash) {
+ rht_for_each_from(entry, rht_ptr(bkt, old_tbl, old_hash),
+ old_tbl, old_hash) {
err = 0;
next = rht_dereference_bucket(entry->next, old_tbl, old_hash);

@@ -249,8 +250,8 @@ static int rhashtable_rehash_one(struct rhashtable *ht,

rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash], SINGLE_DEPTH_NESTING);

- head = rht_ptr(rht_dereference_bucket(new_tbl->buckets[new_hash],
- new_tbl, new_hash));
+ head = rht_ptr(new_tbl->buckets + new_hash,
+ new_tbl, new_hash);

RCU_INIT_POINTER(entry->next, head);

@@ -260,7 +261,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
rcu_assign_pointer(*pprev, next);
else
/* Need to preserved the bit lock. */
- rcu_assign_pointer(*bkt, rht_ptr_locked(next));
+ rht_assign_locked(bkt, next);

out:
return err;
@@ -487,12 +488,12 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
.ht = ht,
.key = key,
};
- struct rhash_head **pprev = NULL;
+ struct rhash_head __rcu **pprev = NULL;
struct rhash_head *head;
int elasticity;

elasticity = RHT_ELASTICITY;
- rht_for_each_from(head, rht_ptr(*bkt), tbl, hash) {
+ rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;
struct rhlist_head *plist;

@@ -518,7 +519,7 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
rcu_assign_pointer(*pprev, obj);
else
/* Need to preserve the bit lock */
- rcu_assign_pointer(*bkt, rht_ptr_locked(obj));
+ rht_assign_locked(bkt, obj);

return NULL;
}
@@ -558,7 +559,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
if (unlikely(rht_grow_above_100(ht, tbl)))
return ERR_PTR(-EAGAIN);

- head = rht_ptr(rht_dereference_bucket(*bkt, tbl, hash));
+ head = rht_ptr(bkt, tbl, hash);

RCU_INIT_POINTER(obj->next, head);
if (ht->rhlist) {
@@ -571,7 +572,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
/* bkt is always the head of the list, so it holds
* the lock, which we need to preserve
*/
- rcu_assign_pointer(*bkt, rht_ptr_locked(obj));
+ rht_assign_locked(bkt, obj);

atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -1140,7 +1141,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
struct rhash_head *pos, *next;

cond_resched();
- for (pos = rht_ptr(rht_dereference(*rht_bucket(tbl, i), ht)),
+ for (pos = rht_ptr_unprotected(rht_bucket(tbl, i)),
next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
!rht_is_a_nulls(pos);
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 02592c2a249c..7b93cfefe195 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -500,7 +500,7 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
struct rhash_head *pos, *next;
struct test_obj_rhl *p;

- pos = rht_ptr(rht_dereference(tbl->buckets[i], ht));
+ pos = rht_ptr_unprotected(tbl->buckets + i);
next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : NULL;

if (!rht_is_a_nulls(pos)) {


Attachments:
signature.asc (847.00 B)

2019-04-11 12:45:40

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

On 4/10/19 11:40 PM, NeilBrown wrote:
> On Thu, Apr 11 2019, NeilBrown wrote:
>
>> On Thu, Apr 11 2019, NeilBrown wrote:
>>
>>> On Wed, Apr 10 2019, Guenter Roeck wrote:
>>>
>>>> Hi,
>>>>
>> .....
>>>>
>>>> This patch causes my qemu q800 boot test to crash reliably.
>>>>
>> ....
>>>> Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88
>>>
>>> Thanks for testing and for the report.
>> .....
>>>
>>> .... and after googling a bit I see that 68000 require 2-byte alignment,
>>> but not 4-byte. Oh..
>>>
>>> That means there aren't two spare bits in an address, so I cannot use
>>> one for the NULLS and one for a lock bit. Bother.
>>>
>>> I might be able to find a different way forward, but for now I think we
>>> need to drop this series.
>>
>> I have found a way forward that I like. It only requires one bit per
>> address to be over-loaded.
>>
>> The following patch implements it and works for me.
>> Could you please confirm that it fixes your problem on m68k ??
>
> Sorry, that was on the wrong base.
>
> Please try this one, against current net-next.
>

First of all, excellent analysis!

With this patch applied:

Build reference: next-20190410-1-ge294005789ed

Building mcf5208evb:m5208:m5208evb_defconfig:initrd ... running .... passed
Building q800:m68040:mac_defconfig:initrd ... running .... passed
Building q800:m68040:mac_defconfig:rootfs ... running .... passed


I also reconfirmed the crash with next-20190410.

With that, feel free to add:

Tested-by: Guenter Roeck <[email protected]>

Thanks,
Guenter