Hi,
Here are the pending patches for the merge window which are currently
in the GFS2 tree.
The main topics this time are allocation, in the form of Bob's
improvements when searching resource groups, and several updates
to quotas, which should increase scalability. The quota changes
follow on from those in the last merge window, and there will
likely be further work to come in this area in due course.
There are also a few patches which help to improve efficiency
of adding entries into directories, and clean up some of that
code.
One on-disk change is included this time, which is to write some
additional information which should be useful to fsck and
also potentially for debugging.
Other than that, it's just a few small random bug fixes and
clean ups.
Steve.
From: Bob Peterson <[email protected]>
This is basically just a resend of a patch I posted earlier.
It is unchanged from the original, except in diff offsets, etc.:
This patch fixes a bug in the GFS2 block allocation code. The problem
starts if a process already has a multi-block reservation, but for
some reason, another process disqualifies it from further allocations.
For example, the other process might set the GFS2_RDF_ERROR bit.
The process holding the reservation jumps to label skip_rgrp, but
that label comes after the code that removes the reservation from the
tree. Therefore, the no longer usable reservation is not removed from
the rgrp's reservations tree; it's lost. Eventually, the lost reservation
causes the count of reserved blocks to drift, and that in turn
causes a BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free) to trigger.
This patch moves the call to gfs2_rs_deltree() to after the
skip_rgrp label so that the
disqualified reservation is properly removed from the tree, thus keeping
the rgrp rd_reserved count sane.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 809fecd..1ccf89a 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1944,15 +1944,16 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
return 0;
}
- /* Drop reservation, if we couldn't use reserved rgrp */
- if (gfs2_rs_active(rs))
- gfs2_rs_deltree(rs);
check_rgrp:
/* Check for unlinked inodes which can be reclaimed */
if (rs->rs_rbm.rgd->rd_flags & GFS2_RDF_CHECK)
try_rgrp_unlink(rs->rs_rbm.rgd, &last_unlinked,
ip->i_no_addr);
skip_rgrp:
+ /* Drop reservation, if we couldn't use reserved rgrp */
+ if (gfs2_rs_active(rs))
+ gfs2_rs_deltree(rs);
+
/* Unlock rgrp if required */
if (!rg_locked)
gfs2_glock_dq_uninit(&rs->rs_rgd_gh);
--
1.8.3.1
Prior to this patch, GFS2 kept all the quotas for each
super block in a single linked list. This is rather slow
when there are large numbers of quotas.
This patch introduces a hlist_bl based hash table, similar
to the one used for glocks. The initial look up of the quota
is now lockless in the case where it is already cached,
although we still have to take the per quota spinlock in
order to bump the ref count. Either way though, this is a
big improvement on what was there before.
The qd_lock and the per super block list are preserved, for
the time being. However, since the list is no longer used for
its original role, the intent is that it should be possible to
shrink the number of items on it in due course and to remove
the requirement to take qd_lock in qd_get.
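For reference, the cached-lookup fast path introduced here boils
down to the following (condensed from the patch below):

        rcu_read_lock();
        qd = gfs2_qd_search_bucket(hash, sdp, qid);
        rcu_read_unlock();

where gfs2_qd_search_bucket() walks the hash bucket with
hlist_bl_for_each_entry_rcu() and only takes the per quota lock,
via lockref_get_not_dead(), to bump the reference count on a
matching entry.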
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Abhijith Das <[email protected]>
Cc: Paul E. McKenney <[email protected]>
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a99f60c..59d99ec 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -428,10 +428,13 @@ enum {
};
struct gfs2_quota_data {
+ struct hlist_bl_node qd_hlist;
struct list_head qd_list;
struct kqid qd_id;
+ struct gfs2_sbd *qd_sbd;
struct lockref qd_lockref;
struct list_head qd_lru;
+ unsigned qd_hash;
unsigned long qd_flags; /* QDF_... */
@@ -450,6 +453,7 @@ struct gfs2_quota_data {
u64 qd_sync_gen;
unsigned long qd_last_warn;
+ struct rcu_head qd_rcu;
};
struct gfs2_trans {
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 0650db2..c272e73 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -76,6 +76,7 @@ static int __init init_gfs2_fs(void)
gfs2_str2qstr(&gfs2_qdot, ".");
gfs2_str2qstr(&gfs2_qdotdot, "..");
+ gfs2_quota_hash_init();
error = gfs2_sys_init();
if (error)
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 1b6b367..a1df01d 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -52,6 +52,10 @@
#include <linux/dqblk_xfs.h>
#include <linux/lockref.h>
#include <linux/list_lru.h>
+#include <linux/rcupdate.h>
+#include <linux/rculist_bl.h>
+#include <linux/bit_spinlock.h>
+#include <linux/jhash.h>
#include "gfs2.h"
#include "incore.h"
@@ -67,10 +71,43 @@
#include "inode.h"
#include "util.h"
-/* Lock order: qd_lock -> qd->lockref.lock -> lru lock */
+#define GFS2_QD_HASH_SHIFT 12
+#define GFS2_QD_HASH_SIZE (1 << GFS2_QD_HASH_SHIFT)
+#define GFS2_QD_HASH_MASK (GFS2_QD_HASH_SIZE - 1)
+
+/* Lock order: qd_lock -> bucket lock -> qd->lockref.lock -> lru lock */
static DEFINE_SPINLOCK(qd_lock);
struct list_lru gfs2_qd_lru;
+static struct hlist_bl_head qd_hash_table[GFS2_QD_HASH_SIZE];
+
+static unsigned int gfs2_qd_hash(const struct gfs2_sbd *sdp,
+ const struct kqid qid)
+{
+ unsigned int h;
+
+ h = jhash(&sdp, sizeof(struct gfs2_sbd *), 0);
+ h = jhash(&qid, sizeof(struct kqid), h);
+
+ return h & GFS2_QD_HASH_MASK;
+}
+
+static inline void spin_lock_bucket(unsigned int hash)
+{
+ hlist_bl_lock(&qd_hash_table[hash]);
+}
+
+static inline void spin_unlock_bucket(unsigned int hash)
+{
+ hlist_bl_unlock(&qd_hash_table[hash]);
+}
+
+static void gfs2_qd_dealloc(struct rcu_head *rcu)
+{
+ struct gfs2_quota_data *qd = container_of(rcu, struct gfs2_quota_data, qd_rcu);
+ kmem_cache_free(gfs2_quotad_cachep, qd);
+}
+
static void gfs2_qd_dispose(struct list_head *list)
{
struct gfs2_quota_data *qd;
@@ -87,6 +124,10 @@ static void gfs2_qd_dispose(struct list_head *list)
list_del(&qd->qd_list);
spin_unlock(&qd_lock);
+ spin_lock_bucket(qd->qd_hash);
+ hlist_bl_del_rcu(&qd->qd_hlist);
+ spin_unlock_bucket(qd->qd_hash);
+
gfs2_assert_warn(sdp, !qd->qd_change);
gfs2_assert_warn(sdp, !qd->qd_slot_count);
gfs2_assert_warn(sdp, !qd->qd_bh_count);
@@ -95,7 +136,7 @@ static void gfs2_qd_dispose(struct list_head *list)
atomic_dec(&sdp->sd_quota_count);
/* Delete it from the common reclaim list */
- kmem_cache_free(gfs2_quotad_cachep, qd);
+ call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
}
}
@@ -165,83 +206,95 @@ static u64 qd2offset(struct gfs2_quota_data *qd)
return offset;
}
-static int qd_alloc(struct gfs2_sbd *sdp, struct kqid qid,
- struct gfs2_quota_data **qdp)
+static struct gfs2_quota_data *qd_alloc(unsigned hash, struct gfs2_sbd *sdp, struct kqid qid)
{
struct gfs2_quota_data *qd;
int error;
qd = kmem_cache_zalloc(gfs2_quotad_cachep, GFP_NOFS);
if (!qd)
- return -ENOMEM;
+ return NULL;
+ qd->qd_sbd = sdp;
qd->qd_lockref.count = 1;
spin_lock_init(&qd->qd_lockref.lock);
qd->qd_id = qid;
qd->qd_slot = -1;
INIT_LIST_HEAD(&qd->qd_lru);
+ qd->qd_hash = hash;
error = gfs2_glock_get(sdp, qd2index(qd),
&gfs2_quota_glops, CREATE, &qd->qd_gl);
if (error)
goto fail;
- *qdp = qd;
-
- return 0;
+ return qd;
fail:
kmem_cache_free(gfs2_quotad_cachep, qd);
- return error;
+ return NULL;
}
-static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
- struct gfs2_quota_data **qdp)
+static struct gfs2_quota_data *gfs2_qd_search_bucket(unsigned int hash,
+ const struct gfs2_sbd *sdp,
+ struct kqid qid)
{
- struct gfs2_quota_data *qd = NULL, *new_qd = NULL;
- int error, found;
-
- *qdp = NULL;
+ struct gfs2_quota_data *qd;
+ struct hlist_bl_node *h;
- for (;;) {
- found = 0;
- spin_lock(&qd_lock);
- list_for_each_entry(qd, &sdp->sd_quota_list, qd_list) {
- if (qid_eq(qd->qd_id, qid) &&
- lockref_get_not_dead(&qd->qd_lockref)) {
- list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
- found = 1;
- break;
- }
+ hlist_bl_for_each_entry_rcu(qd, h, &qd_hash_table[hash], qd_hlist) {
+ if (!qid_eq(qd->qd_id, qid))
+ continue;
+ if (qd->qd_sbd != sdp)
+ continue;
+ if (lockref_get_not_dead(&qd->qd_lockref)) {
+ list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
+ return qd;
}
+ }
- if (!found)
- qd = NULL;
+ return NULL;
+}
- if (!qd && new_qd) {
- qd = new_qd;
- list_add(&qd->qd_list, &sdp->sd_quota_list);
- atomic_inc(&sdp->sd_quota_count);
- new_qd = NULL;
- }
- spin_unlock(&qd_lock);
+static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
+ struct gfs2_quota_data **qdp)
+{
+ struct gfs2_quota_data *qd, *new_qd;
+ unsigned int hash = gfs2_qd_hash(sdp, qid);
- if (qd) {
- if (new_qd) {
- gfs2_glock_put(new_qd->qd_gl);
- kmem_cache_free(gfs2_quotad_cachep, new_qd);
- }
- *qdp = qd;
- return 0;
- }
+ rcu_read_lock();
+ *qdp = qd = gfs2_qd_search_bucket(hash, sdp, qid);
+ rcu_read_unlock();
- error = qd_alloc(sdp, qid, &new_qd);
- if (error)
- return error;
+ if (qd)
+ return 0;
+
+ new_qd = qd_alloc(hash, sdp, qid);
+ if (!new_qd)
+ return -ENOMEM;
+
+ spin_lock(&qd_lock);
+ spin_lock_bucket(hash);
+ *qdp = qd = gfs2_qd_search_bucket(hash, sdp, qid);
+ if (qd == NULL) {
+ *qdp = new_qd;
+ list_add(&new_qd->qd_list, &sdp->sd_quota_list);
+ hlist_bl_add_head_rcu(&new_qd->qd_hlist, &qd_hash_table[hash]);
+ atomic_inc(&sdp->sd_quota_count);
+ }
+ spin_unlock_bucket(hash);
+ spin_unlock(&qd_lock);
+
+ if (qd) {
+ gfs2_glock_put(new_qd->qd_gl);
+ kmem_cache_free(gfs2_quotad_cachep, new_qd);
}
+
+ return 0;
}
+
static void qd_hold(struct gfs2_quota_data *qd)
{
struct gfs2_sbd *sdp = qd->qd_gl->gl_sbd;
@@ -1215,6 +1268,7 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
unsigned int blocks = size >> sdp->sd_sb.sb_bsize_shift;
unsigned int x, slot = 0;
unsigned int found = 0;
+ unsigned int hash;
u64 dblock;
u32 extlen = 0;
int error;
@@ -1272,8 +1326,9 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
if (!qc_change)
continue;
- error = qd_alloc(sdp, qc_id, &qd);
- if (error) {
+ hash = gfs2_qd_hash(sdp, qc_id);
+ qd = qd_alloc(hash, sdp, qc_id);
+ if (qd == NULL) {
brelse(bh);
goto fail;
}
@@ -1289,6 +1344,10 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
atomic_inc(&sdp->sd_quota_count);
spin_unlock(&qd_lock);
+ spin_lock_bucket(hash);
+ hlist_bl_add_head_rcu(&qd->qd_hlist, &qd_hash_table[hash]);
+ spin_unlock_bucket(hash);
+
found++;
}
@@ -1335,11 +1394,16 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
spin_unlock(&qd->qd_lockref.lock);
list_del(&qd->qd_list);
+
/* Also remove if this qd exists in the reclaim list */
list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
atomic_dec(&sdp->sd_quota_count);
spin_unlock(&qd_lock);
+ spin_lock_bucket(qd->qd_hash);
+ hlist_bl_del_rcu(&qd->qd_hlist);
+ spin_unlock_bucket(qd->qd_hash);
+
if (!qd->qd_lockref.count) {
gfs2_assert_warn(sdp, !qd->qd_change);
gfs2_assert_warn(sdp, !qd->qd_slot_count);
@@ -1348,7 +1412,7 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
gfs2_assert_warn(sdp, !qd->qd_bh_count);
gfs2_glock_put(qd->qd_gl);
- kmem_cache_free(gfs2_quotad_cachep, qd);
+ call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
spin_lock(&qd_lock);
}
@@ -1643,3 +1707,11 @@ const struct quotactl_ops gfs2_quotactl_ops = {
.get_dqblk = gfs2_get_dqblk,
.set_dqblk = gfs2_set_dqblk,
};
+
+void __init gfs2_quota_hash_init(void)
+{
+ unsigned i;
+
+ for(i = 0; i < GFS2_QD_HASH_SIZE; i++)
+ INIT_HLIST_BL_HEAD(&qd_hash_table[i]);
+}
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 96e4f34..55d506e 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -57,5 +57,6 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
extern const struct quotactl_ops gfs2_quotactl_ops;
extern struct shrinker gfs2_qd_shrinker;
extern struct list_lru gfs2_qd_lru;
+extern void __init gfs2_quota_hash_init(void);
#endif /* __QUOTA_DOT_H__ */
--
1.8.3.1
Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.
> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Al Viro <[email protected]>
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6f7a47c..ca0be6c 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1552,13 +1552,11 @@ void gfs2_glock_thaw(struct gfs2_sbd *sdp)
glock_hash_walk(thaw_glock, sdp);
}
-static int dump_glock(struct seq_file *seq, struct gfs2_glock *gl)
+static void dump_glock(struct seq_file *seq, struct gfs2_glock *gl)
{
- int ret;
spin_lock(&gl->gl_spin);
- ret = gfs2_dump_glock(seq, gl);
+ gfs2_dump_glock(seq, gl);
spin_unlock(&gl->gl_spin);
- return ret;
}
static void dump_glock_func(struct gfs2_glock *gl)
@@ -1647,10 +1645,9 @@ static const char *hflags2str(char *buf, unsigned flags, unsigned long iflags)
* @seq: the seq_file struct
* @gh: the glock holder
*
- * Returns: 0 on success, -ENOBUFS when we run out of space
*/
-static int dump_holder(struct seq_file *seq, const struct gfs2_holder *gh)
+static void dump_holder(struct seq_file *seq, const struct gfs2_holder *gh)
{
struct task_struct *gh_owner = NULL;
char flags_buf[32];
@@ -1666,7 +1663,6 @@ static int dump_holder(struct seq_file *seq, const struct gfs2_holder *gh)
gh_owner ? gh_owner->comm : "(ended)",
(void *)gh->gh_ip);
rcu_read_unlock();
- return 0;
}
static const char *gflags2str(char *buf, const struct gfs2_glock *gl)
@@ -1721,16 +1717,14 @@ static const char *gflags2str(char *buf, const struct gfs2_glock *gl)
* example. The field's are n = number (id of the object), f = flags,
* t = type, s = state, r = refcount, e = error, p = pid.
*
- * Returns: 0 on success, -ENOBUFS when we run out of space
*/
-int gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl)
+void gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl)
{
const struct gfs2_glock_operations *glops = gl->gl_ops;
unsigned long long dtime;
const struct gfs2_holder *gh;
char gflags_buf[32];
- int error = 0;
dtime = jiffies - gl->gl_demote_time;
dtime *= 1000000/HZ; /* demote time in uSec */
@@ -1747,15 +1741,11 @@ int gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl)
atomic_read(&gl->gl_revokes),
(int)gl->gl_lockref.count, gl->gl_hold_time);
- list_for_each_entry(gh, &gl->gl_holders, gh_list) {
- error = dump_holder(seq, gh);
- if (error)
- goto out;
- }
+ list_for_each_entry(gh, &gl->gl_holders, gh_list)
+ dump_holder(seq, gh);
+
if (gl->gl_state != LM_ST_UNLOCKED && glops->go_dump)
- error = glops->go_dump(seq, gl);
-out:
- return error;
+ glops->go_dump(seq, gl);
}
static int gfs2_glstats_seq_show(struct seq_file *seq, void *iter_ptr)
@@ -1953,7 +1943,8 @@ static void gfs2_glock_seq_stop(struct seq_file *seq, void *iter_ptr)
static int gfs2_glock_seq_show(struct seq_file *seq, void *iter_ptr)
{
- return dump_glock(seq, iter_ptr);
+ dump_glock(seq, iter_ptr);
+ return 0;
}
static void *gfs2_sbstats_seq_start(struct seq_file *seq, loff_t *pos)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 6647d77..32572f7 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -199,7 +199,7 @@ extern int gfs2_glock_nq_num(struct gfs2_sbd *sdp, u64 number,
struct gfs2_holder *gh);
extern int gfs2_glock_nq_m(unsigned int num_gh, struct gfs2_holder *ghs);
extern void gfs2_glock_dq_m(unsigned int num_gh, struct gfs2_holder *ghs);
-extern int gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl);
+extern void gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl);
#define GLOCK_BUG_ON(gl,x) do { if (unlikely(x)) { gfs2_dump_glock(NULL, gl); BUG(); } } while(0)
extern __printf(2, 3)
void gfs2_print_dbg(struct seq_file *seq, const char *fmt, ...);
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index b792dbc..3bf0631 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -437,21 +437,19 @@ static int inode_go_lock(struct gfs2_holder *gh)
* @seq: The iterator
* @ip: the inode
*
- * Returns: 0 on success, -ENOBUFS when we run out of space
*/
-static int inode_go_dump(struct seq_file *seq, const struct gfs2_glock *gl)
+static void inode_go_dump(struct seq_file *seq, const struct gfs2_glock *gl)
{
const struct gfs2_inode *ip = gl->gl_object;
if (ip == NULL)
- return 0;
+ return;
gfs2_print_dbg(seq, " I: n:%llu/%llu t:%u f:0x%02lx d:0x%08x s:%llu\n",
(unsigned long long)ip->i_no_formal_ino,
(unsigned long long)ip->i_no_addr,
IF2DT(ip->i_inode.i_mode), ip->i_flags,
(unsigned int)ip->i_diskflags,
(unsigned long long)i_size_read(&ip->i_inode));
- return 0;
}
/**
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 8c64e26..cf0e344 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -218,7 +218,7 @@ struct gfs2_glock_operations {
int (*go_demote_ok) (const struct gfs2_glock *gl);
int (*go_lock) (struct gfs2_holder *gh);
void (*go_unlock) (struct gfs2_holder *gh);
- int (*go_dump)(struct seq_file *seq, const struct gfs2_glock *gl);
+ void (*go_dump)(struct seq_file *seq, const struct gfs2_glock *gl);
void (*go_callback)(struct gfs2_glock *gl, bool remote);
const int go_type;
const unsigned long go_flags;
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 51cf10d..b8956f2 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -149,7 +149,7 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
ip = GFS2_I(inode);
if (!inode)
- return ERR_PTR(-ENOBUFS);
+ return ERR_PTR(-ENOMEM);
if (inode->i_state & I_NEW) {
struct gfs2_sbd *sdp = GFS2_SB(inode);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 06a66c9..1e712b5 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -231,7 +231,7 @@ static int gfs2_read_super(struct gfs2_sbd *sdp, sector_t sector, int silent)
page = alloc_page(GFP_NOFS);
if (unlikely(!page))
- return -ENOBUFS;
+ return -ENOMEM;
ClearPageUptodate(page);
ClearPageDirty(page);
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 183cf0f..e14e887 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2120,14 +2120,14 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
*
*/
-int gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl)
+void gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl)
{
struct gfs2_rgrpd *rgd = gl->gl_object;
struct gfs2_blkreserv *trs;
const struct rb_node *n;
if (rgd == NULL)
- return 0;
+ return;
gfs2_print_dbg(seq, " R: n:%llu f:%02x b:%u/%u i:%u r:%u e:%u\n",
(unsigned long long)rgd->rd_addr, rgd->rd_flags,
rgd->rd_free, rgd->rd_free_clone, rgd->rd_dinodes,
@@ -2138,7 +2138,6 @@ int gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl)
dump_rs(seq, trs);
}
spin_unlock(&rgd->rd_rsspin);
- return 0;
}
static void gfs2_rgrp_error(struct gfs2_rgrpd *rgd)
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 3a10d2f..463ab2e 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -68,7 +68,7 @@ extern void gfs2_rlist_add(struct gfs2_inode *ip, struct gfs2_rgrp_list *rlist,
extern void gfs2_rlist_alloc(struct gfs2_rgrp_list *rlist, unsigned int state);
extern void gfs2_rlist_free(struct gfs2_rgrp_list *rlist);
extern u64 gfs2_ri_total(struct gfs2_sbd *sdp);
-extern int gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl);
+extern void gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl);
extern int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
struct buffer_head *bh,
const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed);
--
1.8.3.1
From: Bob Peterson <[email protected]>
This is a small cleanup to function gfs2_rgrp_go_lock so that it
uses rgd instead of its more complicated twin.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index e14e887..a1da213 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1199,7 +1199,7 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
if (gh->gh_flags & GL_SKIP && sdp->sd_args.ar_rgrplvb)
return 0;
- return gfs2_rgrp_bh_get((struct gfs2_rgrpd *)gh->gh_gl->gl_object);
+ return gfs2_rgrp_bh_get(rgd);
}
/**
--
1.8.3.1
From: "J. Bruce Fields" <[email protected]>
0d0d110720d7960b77c03c9f2597faaff4b484ae asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".
That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index b8956f2..890588c 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -604,6 +604,9 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
error = PTR_ERR(inode);
if (!IS_ERR(inode)) {
d = d_splice_alias(inode, dentry);
+ error = PTR_ERR(d);
+ if (IS_ERR(d))
+ goto fail_gunlock;
error = 0;
if (file) {
if (S_ISREG(inode->i_mode)) {
@@ -799,6 +802,11 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
}
d = d_splice_alias(inode, dentry);
+ if (IS_ERR(d)) {
+ iput(inode);
+ gfs2_glock_dq_uninit(&gh);
+ return d;
+ }
if (file && S_ISREG(inode->i_mode))
error = finish_open(file, dentry, gfs2_open_common, opened);
--
1.8.3.1
Well I don't get the same warning locally as the kbuild
robot, but I guess this should fix the problem, anyway.
Here is the warning:
head: 2d9e72303d538024627fb1fe2cbde48aec12acc0
commit: ee2411a8db49a21bc55dc124e1b434ba194c8903 [19/20] GFS2: Clean up quota slot allocation
config: make ARCH=powerpc allmodconfig
All error/warnings:
fs/gfs2/quota.c: In function 'gfs2_quota_init':
>> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
>> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
>> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(sdp->sd_quota_bitmap);
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 02a2740..8bec0e31 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -56,6 +56,7 @@
#include <linux/rculist_bl.h>
#include <linux/bit_spinlock.h>
#include <linux/jhash.h>
+#include <linux/vmalloc.h>
#include "gfs2.h"
#include "incore.h"
--
1.8.3.1
Gradually, the global qd_lock is being used for less and less.
After this patch it will only be used for the per super block
list whose purpose is to allow syncing of changes back to the
master quota file from the local quota changes file. Fixing
up that process to make it more efficient will be the subject
of a later patch; however, this patch removes another barrier
to doing that.
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Abhijith Das <[email protected]>
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 4b9aa5b..8c64e26 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -734,6 +734,7 @@ struct gfs2_sbd {
unsigned int sd_quota_slots;
unsigned long *sd_quota_bitmap;
+ spinlock_t sd_bitmap_lock;
u64 sd_quota_sync_gen;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 27648e8..06a66c9 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -99,6 +99,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
init_waitqueue_head(&sdp->sd_quota_wait);
INIT_LIST_HEAD(&sdp->sd_trunc_list);
spin_lock_init(&sdp->sd_trunc_lock);
+ spin_lock_init(&sdp->sd_bitmap_lock);
mapping = &sdp->sd_aspace;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 79be67a..02a2740 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -76,6 +76,7 @@
#define GFS2_QD_HASH_MASK (GFS2_QD_HASH_SIZE - 1)
/* Lock order: qd_lock -> bucket lock -> qd->lockref.lock -> lru lock */
+/* -> sd_bitmap_lock */
static DEFINE_SPINLOCK(qd_lock);
struct list_lru gfs2_qd_lru;
@@ -319,7 +320,7 @@ static int slot_get(struct gfs2_quota_data *qd)
unsigned int bit;
int error = 0;
- spin_lock(&qd_lock);
+ spin_lock(&sdp->sd_bitmap_lock);
if (qd->qd_slot_count != 0)
goto out;
@@ -331,7 +332,7 @@ static int slot_get(struct gfs2_quota_data *qd)
out:
qd->qd_slot_count++;
}
- spin_unlock(&qd_lock);
+ spin_unlock(&sdp->sd_bitmap_lock);
return error;
}
@@ -340,23 +341,23 @@ static void slot_hold(struct gfs2_quota_data *qd)
{
struct gfs2_sbd *sdp = qd->qd_sbd;
- spin_lock(&qd_lock);
+ spin_lock(&sdp->sd_bitmap_lock);
gfs2_assert(sdp, qd->qd_slot_count);
qd->qd_slot_count++;
- spin_unlock(&qd_lock);
+ spin_unlock(&sdp->sd_bitmap_lock);
}
static void slot_put(struct gfs2_quota_data *qd)
{
struct gfs2_sbd *sdp = qd->qd_sbd;
- spin_lock(&qd_lock);
+ spin_lock(&sdp->sd_bitmap_lock);
gfs2_assert(sdp, qd->qd_slot_count);
if (!--qd->qd_slot_count) {
BUG_ON(!test_and_clear_bit(qd->qd_slot, sdp->sd_quota_bitmap));
qd->qd_slot = -1;
}
- spin_unlock(&qd_lock);
+ spin_unlock(&sdp->sd_bitmap_lock);
}
static int bh_get(struct gfs2_quota_data *qd)
@@ -434,8 +435,7 @@ static int qd_check_sync(struct gfs2_sbd *sdp, struct gfs2_quota_data *qd,
list_move_tail(&qd->qd_list, &sdp->sd_quota_list);
set_bit(QDF_LOCKED, &qd->qd_flags);
qd->qd_change_sync = qd->qd_change;
- gfs2_assert_warn(sdp, qd->qd_slot_count);
- qd->qd_slot_count++;
+ slot_hold(qd);
return 1;
}
--
1.8.3.1
Spotted by Andy Price. This should fix the odd messages from
lockdep caused by 70d4ee94b370c5ef54d0870600f16bd92d18013c
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Andrew Price <[email protected]>
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 94aab31..3b820b6 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -102,6 +102,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
mapping = &sdp->sd_aspace;
+ address_space_init_once(mapping);
mapping->a_ops = &gfs2_meta_aops;
mapping->host = sb->s_bdev->bd_inode;
mapping->flags = 0;
--
1.8.3.1
Quota slot allocation has historically used a vector of pages
and a set of homegrown find/test/set/clear bit functions. Since
the size of the bitmap is likely to be based on the default
qc file size, that's a couple of pages at most. So we ought
to be able to allocate that as a single chunk, with a vmalloc
fallback, just in case of memory fragmentation.
We are then able to use the kernel's own find/test/set/clear
bit functions, rather than rolling our own.
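The resulting allocation and free pattern is the usual
kmalloc-with-vmalloc-fallback one (condensed from the patch below):

        bm_size = DIV_ROUND_UP(sdp->sd_quota_slots, 8 * sizeof(unsigned long));
        bm_size *= sizeof(unsigned long);
        sdp->sd_quota_bitmap = kmalloc(bm_size, GFP_NOFS | __GFP_NOWARN);
        if (sdp->sd_quota_bitmap == NULL)
                sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
        ...
        if (is_vmalloc_addr(sdp->sd_quota_bitmap))
                vfree(sdp->sd_quota_bitmap);
        else
                kfree(sdp->sd_quota_bitmap);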
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Abhijith Das <[email protected]>
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 59d99ec..4b9aa5b 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -733,8 +733,7 @@ struct gfs2_sbd {
spinlock_t sd_trunc_lock;
unsigned int sd_quota_slots;
- unsigned int sd_quota_chunks;
- unsigned char **sd_quota_bitmap;
+ unsigned long *sd_quota_bitmap;
u64 sd_quota_sync_gen;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 3287d98..79be67a 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -315,50 +315,30 @@ static void qd_put(struct gfs2_quota_data *qd)
static int slot_get(struct gfs2_quota_data *qd)
{
- struct gfs2_sbd *sdp = qd->qd_gl->gl_sbd;
- unsigned int c, o = 0, b;
- unsigned char byte = 0;
+ struct gfs2_sbd *sdp = qd->qd_sbd;
+ unsigned int bit;
+ int error = 0;
spin_lock(&qd_lock);
+ if (qd->qd_slot_count != 0)
+ goto out;
- if (qd->qd_slot_count++) {
- spin_unlock(&qd_lock);
- return 0;
+ error = -ENOSPC;
+ bit = find_first_zero_bit(sdp->sd_quota_bitmap, sdp->sd_quota_slots);
+ if (bit < sdp->sd_quota_slots) {
+ set_bit(bit, sdp->sd_quota_bitmap);
+ qd->qd_slot = bit;
+out:
+ qd->qd_slot_count++;
}
-
- for (c = 0; c < sdp->sd_quota_chunks; c++)
- for (o = 0; o < PAGE_SIZE; o++) {
- byte = sdp->sd_quota_bitmap[c][o];
- if (byte != 0xFF)
- goto found;
- }
-
- goto fail;
-
-found:
- for (b = 0; b < 8; b++)
- if (!(byte & (1 << b)))
- break;
- qd->qd_slot = c * (8 * PAGE_SIZE) + o * 8 + b;
-
- if (qd->qd_slot >= sdp->sd_quota_slots)
- goto fail;
-
- sdp->sd_quota_bitmap[c][o] |= 1 << b;
-
spin_unlock(&qd_lock);
- return 0;
-
-fail:
- qd->qd_slot_count--;
- spin_unlock(&qd_lock);
- return -ENOSPC;
+ return error;
}
static void slot_hold(struct gfs2_quota_data *qd)
{
- struct gfs2_sbd *sdp = qd->qd_gl->gl_sbd;
+ struct gfs2_sbd *sdp = qd->qd_sbd;
spin_lock(&qd_lock);
gfs2_assert(sdp, qd->qd_slot_count);
@@ -366,34 +346,14 @@ static void slot_hold(struct gfs2_quota_data *qd)
spin_unlock(&qd_lock);
}
-static void gfs2_icbit_munge(struct gfs2_sbd *sdp, unsigned char **bitmap,
- unsigned int bit, int new_value)
-{
- unsigned int c, o, b = bit;
- int old_value;
-
- c = b / (8 * PAGE_SIZE);
- b %= 8 * PAGE_SIZE;
- o = b / 8;
- b %= 8;
-
- old_value = (bitmap[c][o] & (1 << b));
- gfs2_assert_withdraw(sdp, !old_value != !new_value);
-
- if (new_value)
- bitmap[c][o] |= 1 << b;
- else
- bitmap[c][o] &= ~(1 << b);
-}
-
static void slot_put(struct gfs2_quota_data *qd)
{
- struct gfs2_sbd *sdp = qd->qd_gl->gl_sbd;
+ struct gfs2_sbd *sdp = qd->qd_sbd;
spin_lock(&qd_lock);
gfs2_assert(sdp, qd->qd_slot_count);
if (!--qd->qd_slot_count) {
- gfs2_icbit_munge(sdp, sdp->sd_quota_bitmap, qd->qd_slot, 0);
+ BUG_ON(!test_and_clear_bit(qd->qd_slot, sdp->sd_quota_bitmap));
qd->qd_slot = -1;
}
spin_unlock(&qd_lock);
@@ -1269,6 +1229,7 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
unsigned int x, slot = 0;
unsigned int found = 0;
unsigned int hash;
+ unsigned int bm_size;
u64 dblock;
u32 extlen = 0;
int error;
@@ -1277,20 +1238,16 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
return -EIO;
sdp->sd_quota_slots = blocks * sdp->sd_qc_per_block;
- sdp->sd_quota_chunks = DIV_ROUND_UP(sdp->sd_quota_slots, 8 * PAGE_SIZE);
-
+ bm_size = DIV_ROUND_UP(sdp->sd_quota_slots, 8 * sizeof(unsigned long));
+ bm_size *= sizeof(unsigned long);
error = -ENOMEM;
-
- sdp->sd_quota_bitmap = kcalloc(sdp->sd_quota_chunks,
- sizeof(unsigned char *), GFP_NOFS);
+ sdp->sd_quota_bitmap = kmalloc(bm_size, GFP_NOFS|__GFP_NOWARN);
+ if (sdp->sd_quota_bitmap == NULL)
+ sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
if (!sdp->sd_quota_bitmap)
return error;
- for (x = 0; x < sdp->sd_quota_chunks; x++) {
- sdp->sd_quota_bitmap[x] = kzalloc(PAGE_SIZE, GFP_NOFS);
- if (!sdp->sd_quota_bitmap[x])
- goto fail;
- }
+ memset(sdp->sd_quota_bitmap, 0, bm_size);
for (x = 0; x < blocks; x++) {
struct buffer_head *bh;
@@ -1339,7 +1296,7 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
qd->qd_slot_count = 1;
spin_lock(&qd_lock);
- gfs2_icbit_munge(sdp, sdp->sd_quota_bitmap, slot, 1);
+ BUG_ON(test_and_set_bit(slot, sdp->sd_quota_bitmap));
list_add(&qd->qd_list, &sdp->sd_quota_list);
atomic_inc(&sdp->sd_quota_count);
spin_unlock(&qd_lock);
@@ -1370,7 +1327,6 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
{
struct list_head *head = &sdp->sd_quota_list;
struct gfs2_quota_data *qd;
- unsigned int x;
spin_lock(&qd_lock);
while (!list_empty(head)) {
@@ -1401,9 +1357,11 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
gfs2_assert_warn(sdp, !atomic_read(&sdp->sd_quota_count));
if (sdp->sd_quota_bitmap) {
- for (x = 0; x < sdp->sd_quota_chunks; x++)
- kfree(sdp->sd_quota_bitmap[x]);
- kfree(sdp->sd_quota_bitmap);
+ if (is_vmalloc_addr(sdp->sd_quota_bitmap))
+ vfree(sdp->sd_quota_bitmap);
+ else
+ kfree(sdp->sd_quota_bitmap);
+ sdp->sd_quota_bitmap = NULL;
}
}
--
1.8.3.1
This patch adds four new fields to directory leaf blocks.
The intent is not to use them in the kernel itself, although
we may be able to use them as hints at some later date,
but instead to provide more information for debug/fsck use.
One new field adds a pointer to the inode to which the leaf
belongs. This can be useful if the pointer to the leaf block
has become corrupt, as it will allow us to know which inode
this block should be associated with. This field is set when
the leaf is created and never changed over its lifetime.
The second field is a "distance from the hash table" field.
The meaning is as follows:
0 = An old leaf in which this value has not been set
1 = This leaf is pointed to directly from the hash table
2+ = This leaf is part of a chain, pointed to by another leaf
block, the value gives the position in the chain.
The third and fourth fields combine to give a time stamp of
the most recent directory insertion or deletion from this
leaf block. The time stamp is not updated when a new leaf
block is chained from the current one. The code is currently
written such that the timestamp on the dir inode will match
that of the leaf block for the most recent insertion/deletion.
For backwards compatibility, any of these new fields which is
zero should be considered to be "unknown".
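(As a quick sanity check on the layout: the new fields exactly
overlay the old reserved area, since 8 (lf_inode) + 4 (lf_dist) +
4 (lf_nsec) + 8 (lf_sec) + 40 (lf_reserved2) = 64 bytes, so the
on-disk size of struct gfs2_leaf is unchanged.)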
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index b2e5ebf..fa32655 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -834,6 +834,7 @@ static struct gfs2_leaf *new_leaf(struct inode *inode, struct buffer_head **pbh,
struct gfs2_leaf *leaf;
struct gfs2_dirent *dent;
struct qstr name = { .name = "" };
+ struct timespec tv = CURRENT_TIME;
error = gfs2_alloc_blocks(ip, &bn, &n, 0, NULL);
if (error)
@@ -850,7 +851,11 @@ static struct gfs2_leaf *new_leaf(struct inode *inode, struct buffer_head **pbh,
leaf->lf_entries = 0;
leaf->lf_dirent_format = cpu_to_be32(GFS2_FORMAT_DE);
leaf->lf_next = 0;
- memset(leaf->lf_reserved, 0, sizeof(leaf->lf_reserved));
+ leaf->lf_inode = cpu_to_be64(ip->i_no_addr);
+ leaf->lf_dist = cpu_to_be32(1);
+ leaf->lf_nsec = cpu_to_be32(tv.tv_nsec);
+ leaf->lf_sec = cpu_to_be64(tv.tv_sec);
+ memset(leaf->lf_reserved2, 0, sizeof(leaf->lf_reserved2));
dent = (struct gfs2_dirent *)(leaf+1);
gfs2_qstr2dirent(&name, bh->b_size - sizeof(struct gfs2_leaf), dent);
*pbh = bh;
@@ -1612,11 +1617,31 @@ out:
return ret;
}
+/**
+ * dir_new_leaf - Add a new leaf onto hash chain
+ * @inode: The directory
+ * @name: The name we are adding
+ *
+ * This adds a new dir leaf onto an existing leaf when there is not
+ * enough space to add a new dir entry. This is a last resort after
+ * we've expanded the hash table to max size and also split existing
+ * leaf blocks, so it will only occur for very large directories.
+ *
+ * The dist parameter is set to 1 for leaf blocks directly attached
+ * to the hash table, 2 for one layer of indirection, 3 for two layers
+ * etc. We are thus able to tell the difference between an old leaf
+ * with dist set to zero (i.e. "don't know") and a new one where we
+ * set this information for debug/fsck purposes.
+ *
+ * Returns: 0 on success, or -ve on error
+ */
+
static int dir_new_leaf(struct inode *inode, const struct qstr *name)
{
struct buffer_head *bh, *obh;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_leaf *leaf, *oleaf;
+ u32 dist = 1;
int error;
u32 index;
u64 bn;
@@ -1626,6 +1651,7 @@ static int dir_new_leaf(struct inode *inode, const struct qstr *name)
if (error)
return error;
do {
+ dist++;
oleaf = (struct gfs2_leaf *)obh->b_data;
bn = be64_to_cpu(oleaf->lf_next);
if (!bn)
@@ -1643,6 +1669,7 @@ static int dir_new_leaf(struct inode *inode, const struct qstr *name)
brelse(obh);
return -ENOSPC;
}
+ leaf->lf_dist = cpu_to_be32(dist);
oleaf->lf_next = cpu_to_be64(bh->b_blocknr);
brelse(bh);
brelse(obh);
@@ -1679,6 +1706,7 @@ int gfs2_dir_add(struct inode *inode, const struct qstr *name,
struct gfs2_inode *ip = GFS2_I(inode);
struct buffer_head *bh = da->bh;
struct gfs2_dirent *dent = da->dent;
+ struct timespec tv;
struct gfs2_leaf *leaf;
int error;
@@ -1693,15 +1721,18 @@ int gfs2_dir_add(struct inode *inode, const struct qstr *name,
dent = gfs2_init_dirent(inode, dent, name, bh);
gfs2_inum_out(nip, dent);
dent->de_type = cpu_to_be16(IF2DT(nip->i_inode.i_mode));
+ tv = CURRENT_TIME;
if (ip->i_diskflags & GFS2_DIF_EXHASH) {
leaf = (struct gfs2_leaf *)bh->b_data;
be16_add_cpu(&leaf->lf_entries, 1);
+ leaf->lf_nsec = cpu_to_be32(tv.tv_nsec);
+ leaf->lf_sec = cpu_to_be64(tv.tv_sec);
}
da->dent = NULL;
da->bh = NULL;
brelse(bh);
ip->i_entries++;
- ip->i_inode.i_mtime = ip->i_inode.i_ctime = CURRENT_TIME;
+ ip->i_inode.i_mtime = ip->i_inode.i_ctime = tv;
if (S_ISDIR(nip->i_inode.i_mode))
inc_nlink(&ip->i_inode);
mark_inode_dirty(inode);
@@ -1752,6 +1783,7 @@ int gfs2_dir_del(struct gfs2_inode *dip, const struct dentry *dentry)
const struct qstr *name = &dentry->d_name;
struct gfs2_dirent *dent, *prev = NULL;
struct buffer_head *bh;
+ struct timespec tv = CURRENT_TIME;
/* Returns _either_ the entry (if its first in block) or the
previous entry otherwise */
@@ -1777,13 +1809,15 @@ int gfs2_dir_del(struct gfs2_inode *dip, const struct dentry *dentry)
if (!entries)
gfs2_consist_inode(dip);
leaf->lf_entries = cpu_to_be16(--entries);
+ leaf->lf_nsec = cpu_to_be32(tv.tv_nsec);
+ leaf->lf_sec = cpu_to_be64(tv.tv_sec);
}
brelse(bh);
if (!dip->i_entries)
gfs2_consist_inode(dip);
dip->i_entries--;
- dip->i_inode.i_mtime = dip->i_inode.i_ctime = CURRENT_TIME;
+ dip->i_inode.i_mtime = dip->i_inode.i_ctime = tv;
if (S_ISDIR(dentry->d_inode->i_mode))
drop_nlink(&dip->i_inode);
mark_inode_dirty(&dip->i_inode);
diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
index b2de1f9..0f24c07 100644
--- a/include/uapi/linux/gfs2_ondisk.h
+++ b/include/uapi/linux/gfs2_ondisk.h
@@ -319,7 +319,16 @@ struct gfs2_leaf {
__be32 lf_dirent_format; /* Format of the dirents */
__be64 lf_next; /* Next leaf, if overflow */
- __u8 lf_reserved[64];
+ union {
+ __u8 lf_reserved[64];
+ struct {
+ __be64 lf_inode; /* Dir inode number */
+ __be32 lf_dist; /* Dist from inode on chain */
+ __be32 lf_nsec; /* Last ins/del usecs */
+ __be64 lf_sec; /* Last ins/del in secs */
+ __u8 lf_reserved2[40];
+ };
+ };
};
/*
--
1.8.3.1
While investigating a rather strange bit of code in the quota
clean up function, I spotted that the reason for its existence
was that when remounting read only, we were not stopping the
quotad thread, and thus it was possible for it to still have
a reference to some of the quotas in that case.
This patch moves the logd and quota thread start and stop into
the make_fs_rw/ro functions, so that we now stop those threads
when mounted read only.
This means that quotad will always be stopped before we call
the quota clean up function, and we can thus dispose of the
(rather hackish) code that waits for it to give up its
reference on the quotas.
Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Abhijith Das <[email protected]>
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 3b820b6..27648e8 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -969,40 +969,6 @@ fail:
return error;
}
-static int init_threads(struct gfs2_sbd *sdp, int undo)
-{
- struct task_struct *p;
- int error = 0;
-
- if (undo)
- goto fail_quotad;
-
- p = kthread_run(gfs2_logd, sdp, "gfs2_logd");
- if (IS_ERR(p)) {
- error = PTR_ERR(p);
- fs_err(sdp, "can't start logd thread: %d\n", error);
- return error;
- }
- sdp->sd_logd_process = p;
-
- p = kthread_run(gfs2_quotad, sdp, "gfs2_quotad");
- if (IS_ERR(p)) {
- error = PTR_ERR(p);
- fs_err(sdp, "can't start quotad thread: %d\n", error);
- goto fail;
- }
- sdp->sd_quotad_process = p;
-
- return 0;
-
-
-fail_quotad:
- kthread_stop(sdp->sd_quotad_process);
-fail:
- kthread_stop(sdp->sd_logd_process);
- return error;
-}
-
static const match_table_t nolock_tokens = {
{ Opt_jid, "jid=%d\n", },
{ Opt_err, NULL },
@@ -1267,15 +1233,11 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent
goto fail_per_node;
}
- error = init_threads(sdp, DO);
- if (error)
- goto fail_per_node;
-
if (!(sb->s_flags & MS_RDONLY)) {
error = gfs2_make_fs_rw(sdp);
if (error) {
fs_err(sdp, "can't make FS RW: %d\n", error);
- goto fail_threads;
+ goto fail_per_node;
}
}
@@ -1283,8 +1245,6 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent
gfs2_online_uevent(sdp);
return 0;
-fail_threads:
- init_threads(sdp, UNDO);
fail_per_node:
init_per_node(sdp, UNDO);
fail_inodes:
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index a1df01d..3287d98 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1376,23 +1376,6 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
while (!list_empty(head)) {
qd = list_entry(head->prev, struct gfs2_quota_data, qd_list);
- /*
- * To be removed in due course... we should be able to
- * ensure that all refs to the qd have done by this point
- * so that this rather odd test is not required
- */
- spin_lock(&qd->qd_lockref.lock);
- if (qd->qd_lockref.count > 1 ||
- (qd->qd_lockref.count && !test_bit(QDF_CHANGE, &qd->qd_flags))) {
- spin_unlock(&qd->qd_lockref.lock);
- list_move(&qd->qd_list, head);
- spin_unlock(&qd_lock);
- schedule();
- spin_lock(&qd_lock);
- continue;
- }
- spin_unlock(&qd->qd_lockref.lock);
-
list_del(&qd->qd_list);
/* Also remove if this qd exists in the reclaim list */
@@ -1404,11 +1387,8 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
hlist_bl_del_rcu(&qd->qd_hlist);
spin_unlock_bucket(qd->qd_hash);
- if (!qd->qd_lockref.count) {
- gfs2_assert_warn(sdp, !qd->qd_change);
- gfs2_assert_warn(sdp, !qd->qd_slot_count);
- } else
- gfs2_assert_warn(sdp, qd->qd_slot_count == 1);
+ gfs2_assert_warn(sdp, !qd->qd_change);
+ gfs2_assert_warn(sdp, !qd->qd_slot_count);
gfs2_assert_warn(sdp, !qd->qd_bh_count);
gfs2_glock_put(qd->qd_gl);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 35da5b1..60f60f6 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -369,6 +369,33 @@ int gfs2_jdesc_check(struct gfs2_jdesc *jd)
return 0;
}
+static int init_threads(struct gfs2_sbd *sdp)
+{
+ struct task_struct *p;
+ int error = 0;
+
+ p = kthread_run(gfs2_logd, sdp, "gfs2_logd");
+ if (IS_ERR(p)) {
+ error = PTR_ERR(p);
+ fs_err(sdp, "can't start logd thread: %d\n", error);
+ return error;
+ }
+ sdp->sd_logd_process = p;
+
+ p = kthread_run(gfs2_quotad, sdp, "gfs2_quotad");
+ if (IS_ERR(p)) {
+ error = PTR_ERR(p);
+ fs_err(sdp, "can't start quotad thread: %d\n", error);
+ goto fail;
+ }
+ sdp->sd_quotad_process = p;
+ return 0;
+
+fail:
+ kthread_stop(sdp->sd_logd_process);
+ return error;
+}
+
/**
* gfs2_make_fs_rw - Turn a Read-Only FS into a Read-Write one
* @sdp: the filesystem
@@ -384,10 +411,14 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
struct gfs2_log_header_host head;
int error;
- error = gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED, 0, &t_gh);
+ error = init_threads(sdp);
if (error)
return error;
+ error = gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED, 0, &t_gh);
+ if (error)
+ goto fail_threads;
+
j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
error = gfs2_find_jhead(sdp->sd_jdesc, &head);
@@ -417,7 +448,9 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
fail:
t_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq_uninit(&t_gh);
-
+fail_threads:
+ kthread_stop(sdp->sd_quotad_process);
+ kthread_stop(sdp->sd_logd_process);
return error;
}
@@ -800,6 +833,9 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
struct gfs2_holder t_gh;
int error;
+ kthread_stop(sdp->sd_quotad_process);
+ kthread_stop(sdp->sd_logd_process);
+
flush_workqueue(gfs2_delete_workqueue);
gfs2_quota_sync(sdp->sd_vfs, 0);
gfs2_statfs_sync(sdp->sd_vfs, 0);
@@ -857,9 +893,6 @@ restart:
}
spin_unlock(&sdp->sd_jindex_spin);
- kthread_stop(sdp->sd_quotad_process);
- kthread_stop(sdp->sd_logd_process);
-
if (!(sb->s_flags & MS_RDONLY)) {
error = gfs2_make_fs_ro(sdp);
if (error)
--
1.8.3.1
From: Bob Peterson <[email protected]>
This patch calls get_write_access in function gfs2_setattr_chown,
which merely increases inode->i_writecount for the duration of the
function. That will ensure that any file closes won't delete the
inode's multi-block reservation while the function is running.
It also ensures that a multi-block reservation exists when needed
for quota change operations during the chown.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 4acc584..51cf10d 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1623,10 +1623,22 @@ static int setattr_chown(struct inode *inode, struct iattr *attr)
if (!(attr->ia_valid & ATTR_GID) || gid_eq(ogid, ngid))
ogid = ngid = NO_GID_QUOTA_CHANGE;
- error = gfs2_quota_lock(ip, nuid, ngid);
+ error = get_write_access(inode);
if (error)
return error;
+ error = gfs2_rs_alloc(ip);
+ if (error)
+ goto out;
+
+ error = gfs2_rindex_update(sdp);
+ if (error)
+ goto out;
+
+ error = gfs2_quota_lock(ip, nuid, ngid);
+ if (error)
+ goto out;
+
if (!uid_eq(ouid, NO_UID_QUOTA_CHANGE) ||
!gid_eq(ogid, NO_GID_QUOTA_CHANGE)) {
error = gfs2_quota_check(ip, nuid, ngid);
@@ -1653,6 +1665,8 @@ out_end_trans:
gfs2_trans_end(sdp);
out_gunlock_q:
gfs2_quota_unlock(ip);
+out:
+ put_write_access(inode);
return error;
}
--
1.8.3.1
We recently fixed the writeback of pages prior to performing
direct i/o; however, the initial fix was perhaps a bit heavy-
handed. There is no need to invalidate pages if the direct i/o
is only a read, since they will be identical to what has been
flushed to disk anyway.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index cf858da..49436fa 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1032,8 +1032,9 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len);
rv = filemap_write_and_wait_range(mapping, lstart, end);
if (rv)
- return rv;
- truncate_inode_pages_range(mapping, lstart, end);
+ goto out;
+ if (rw == WRITE)
+ truncate_inode_pages_range(mapping, lstart, end);
}
rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
--
1.8.3.1
When we look to see if there is enough space to add a dir
entry without allocation, we have been repeating the
same search later when we do the actual insertion. This
patch caches the details of the location in the gfs2_diradd
structure, so that we do not have to repeat the search.
This will provide a performance improvement which will be
greater as the size of the directory increases.
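The intended calling pattern, condensed from the patch below, is
roughly:

        struct gfs2_diradd da = { .bh = NULL, };

        error = gfs2_diradd_alloc_required(dir, name, &da);
        /* on success, da.bh/da.dent cache the free slot, if one was found */
        ...
        error = gfs2_dir_add(dir, name, ip, &da);
        ...
        gfs2_dir_no_add(&da);   /* error paths: release any cached bh */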
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 0b6be20..d5988aa 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1659,26 +1659,34 @@ static int dir_new_leaf(struct inode *inode, const struct qstr *name)
/**
* gfs2_dir_add - Add new filename into directory
- * @dip: The GFS2 inode
- * @filename: The new name
- * @inode: The inode number of the entry
- * @type: The type of the entry
+ * @inode: The directory inode
+ * @name: The new name
+ * @nip: The GFS2 inode to be linked in to the directory
+ * @da: The directory addition info
+ *
+ * If the call to gfs2_diradd_alloc_required resulted in there being
+ * no need to allocate any new directory blocks, then it will contain
+ * a pointer to the directory entry and the bh in which it resides. We
+ * can use that without having to repeat the search. If there was no
+ * free space, then we must now create more space.
*
* Returns: 0 on success, error code on failure
*/
int gfs2_dir_add(struct inode *inode, const struct qstr *name,
- const struct gfs2_inode *nip)
+ const struct gfs2_inode *nip, struct gfs2_diradd *da)
{
struct gfs2_inode *ip = GFS2_I(inode);
- struct buffer_head *bh;
- struct gfs2_dirent *dent;
+ struct buffer_head *bh = da->bh;
+ struct gfs2_dirent *dent = da->dent;
struct gfs2_leaf *leaf;
int error;
while(1) {
- dent = gfs2_dirent_search(inode, name, gfs2_dirent_find_space,
- &bh);
+ if (da->bh == NULL) {
+ dent = gfs2_dirent_search(inode, name,
+ gfs2_dirent_find_space, &bh);
+ }
if (dent) {
if (IS_ERR(dent))
return PTR_ERR(dent);
@@ -1689,6 +1697,8 @@ int gfs2_dir_add(struct inode *inode, const struct qstr *name,
leaf = (struct gfs2_leaf *)bh->b_data;
be16_add_cpu(&leaf->lf_entries, 1);
}
+ da->dent = NULL;
+ da->bh = NULL;
brelse(bh);
ip->i_entries++;
ip->i_inode.i_mtime = ip->i_inode.i_ctime = CURRENT_TIME;
@@ -2030,6 +2040,8 @@ int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
struct buffer_head *bh;
da->nr_blocks = 0;
+ da->bh = NULL;
+ da->dent = NULL;
dent = gfs2_dirent_search(inode, name, gfs2_dirent_find_space, &bh);
if (!dent) {
@@ -2038,7 +2050,8 @@ int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
}
if (IS_ERR(dent))
return PTR_ERR(dent);
- brelse(bh);
+ da->bh = bh;
+ da->dent = dent;
return 0;
}
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index c5573e7..126c65d 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -16,9 +16,13 @@
struct inode;
struct gfs2_inode;
struct gfs2_inum;
+struct buffer_head;
+struct gfs2_dirent;
struct gfs2_diradd {
unsigned nr_blocks;
+ struct gfs2_dirent *dent;
+ struct buffer_head *bh;
};
extern struct inode *gfs2_dir_search(struct inode *dir,
@@ -27,7 +31,13 @@ extern struct inode *gfs2_dir_search(struct inode *dir,
extern int gfs2_dir_check(struct inode *dir, const struct qstr *filename,
const struct gfs2_inode *ip);
extern int gfs2_dir_add(struct inode *inode, const struct qstr *filename,
- const struct gfs2_inode *ip);
+ const struct gfs2_inode *ip, struct gfs2_diradd *da);
+static inline void gfs2_dir_no_add(struct gfs2_diradd *da)
+{
+ if (da->bh)
+ brelse(da->bh);
+ da->bh = NULL;
+}
extern int gfs2_dir_del(struct gfs2_inode *dip, const struct dentry *dentry);
extern int gfs2_dir_read(struct inode *inode, struct dir_context *ctx,
struct file_ra_state *f_ra);
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index fa4624f..4acc584 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -516,7 +516,7 @@ static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
goto fail_quota_locks;
}
- error = gfs2_dir_add(&dip->i_inode, name, ip);
+ error = gfs2_dir_add(&dip->i_inode, name, ip, da);
if (error)
goto fail_end_trans;
@@ -579,7 +579,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
struct dentry *d;
int error;
u32 aflags = 0;
- struct gfs2_diradd da;
+ struct gfs2_diradd da = { .bh = NULL, };
if (!name->len || name->len > GFS2_FNAMESIZE)
return -ENAMETOOLONG;
@@ -738,6 +738,7 @@ fail_free_inode:
free_inode_nonrcu(inode);
inode = NULL;
fail_gunlock:
+ gfs2_dir_no_add(&da);
gfs2_glock_dq_uninit(ghs);
if (inode && !IS_ERR(inode)) {
clear_nlink(inode);
@@ -836,7 +837,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_holder ghs[2];
struct buffer_head *dibh;
- struct gfs2_diradd da;
+ struct gfs2_diradd da = { .bh = NULL, };
int error;
if (S_ISDIR(inode->i_mode))
@@ -918,7 +919,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
if (error)
goto out_end_trans;
- error = gfs2_dir_add(dir, &dentry->d_name, ip);
+ error = gfs2_dir_add(dir, &dentry->d_name, ip, &da);
if (error)
goto out_brelse;
@@ -940,6 +941,7 @@ out_gunlock_q:
if (da.nr_blocks)
gfs2_quota_unlock(dip);
out_gunlock:
+ gfs2_dir_no_add(&da);
gfs2_glock_dq(ghs + 1);
out_child:
gfs2_glock_dq(ghs);
@@ -1454,7 +1456,7 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
if (error)
goto out_end_trans;
- error = gfs2_dir_add(ndir, &ndentry->d_name, ip);
+ error = gfs2_dir_add(ndir, &ndentry->d_name, ip, &da);
if (error)
goto out_end_trans;
@@ -1467,6 +1469,7 @@ out_gunlock_q:
if (da.nr_blocks)
gfs2_quota_unlock(ndip);
out_gunlock:
+ gfs2_dir_no_add(&da);
while (x--) {
gfs2_glock_dq(ghs + x);
gfs2_holder_uninit(ghs + x);
--
1.8.3.1
For most cases, only a single new block is needed when we reach
the point of converting from stuffed to exhash directory. The
exception is when the file name is so long that it will not
fit within the new leaf block.
So this patch adds a simple test for that situation so that we
do not need to request the full reservation size in this case.
(The test works because the leaf block header is smaller than the
dinode header it replaces, so everything which fitted in the
stuffed directory will fit in a single leaf, provided the new
entry's dirent is smaller than the difference between the two
header sizes.)
Potentially we could calculate more accurately the value to use
in other cases too, but that is much more complicated to do and
it is doubtful that the benefit would outweigh the extra cost
in code complexity.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index d5988aa..b2e5ebf 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2035,7 +2035,9 @@ out:
int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
struct gfs2_diradd *da)
{
+ struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
+ const unsigned int extra = sizeof(struct gfs2_dinode) - sizeof(struct gfs2_leaf);
struct gfs2_dirent *dent;
struct buffer_head *bh;
@@ -2046,6 +2048,9 @@ int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
dent = gfs2_dirent_search(inode, name, gfs2_dirent_find_space, &bh);
if (!dent) {
da->nr_blocks = sdp->sd_max_dirres;
+ if (!(ip->i_diskflags & GFS2_DIF_EXHASH) &&
+ (GFS2_DIRENT_SIZE(name->len) < extra))
+ da->nr_blocks = 1;
return 0;
}
if (IS_ERR(dent))
--
1.8.3.1
Prior to this patch, GFS2 had one address space for each rgrp,
stored in the glock. This patch changes them to use a single
address space in the super block. This therefore saves
(sizeof(struct address_space) * nr_of_rgrps) bytes of memory
and for large filesystems, that can be significant.
It would be nice to be able to do something similar and merge
the inode metadata address space into the same global
address space. However, that is rather more complicated as the
on-disk location doesn't have a 1:1 mapping with the inodes in
general. So while it could be done, it will be a more complicated
operation as it requires changing a lot more code paths.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 1b192c8d..b792dbc 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -133,7 +133,8 @@ void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
static void rgrp_go_sync(struct gfs2_glock *gl)
{
- struct address_space *metamapping = gfs2_glock2aspace(gl);
+ struct gfs2_sbd *sdp = gl->gl_sbd;
+ struct address_space *mapping = &sdp->sd_aspace;
struct gfs2_rgrpd *rgd;
int error;
@@ -141,10 +142,10 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
return;
GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
- gfs2_log_flush(gl->gl_sbd, gl);
- filemap_fdatawrite_range(metamapping, gl->gl_vm.start, gl->gl_vm.end);
- error = filemap_fdatawait_range(metamapping, gl->gl_vm.start, gl->gl_vm.end);
- mapping_set_error(metamapping, error);
+ gfs2_log_flush(sdp, gl);
+ filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
+ error = filemap_fdatawait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
+ mapping_set_error(mapping, error);
gfs2_ail_empty_gl(gl);
spin_lock(&gl->gl_spin);
@@ -166,10 +167,11 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
{
- struct address_space *mapping = gfs2_glock2aspace(gl);
+ struct gfs2_sbd *sdp = gl->gl_sbd;
+ struct address_space *mapping = &sdp->sd_aspace;
WARN_ON_ONCE(!(flags & DIO_METADATA));
- gfs2_assert_withdraw(gl->gl_sbd, !atomic_read(&gl->gl_ail_count));
+ gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
if (gl->gl_object) {
@@ -558,7 +560,7 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
.go_unlock = gfs2_rgrp_go_unlock,
.go_dump = gfs2_rgrp_dump,
.go_type = LM_TYPE_RGRP,
- .go_flags = GLOF_ASPACE | GLOF_LVB,
+ .go_flags = GLOF_LVB,
};
const struct gfs2_glock_operations gfs2_trans_glops = {
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6544ee..a99f60c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -736,6 +736,8 @@ struct gfs2_sbd {
/* Log stuff */
+ struct address_space sd_aspace;
+
spinlock_t sd_log_lock;
struct gfs2_trans *sd_log_tr;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index cdadb92..58f0640 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -589,8 +589,12 @@ static int buf_lo_scan_elements(struct gfs2_jdesc *jd, unsigned int start,
static void gfs2_meta_sync(struct gfs2_glock *gl)
{
struct address_space *mapping = gfs2_glock2aspace(gl);
+ struct gfs2_sbd *sdp = gl->gl_sbd;
int error;
+ if (mapping == NULL)
+ mapping = &sdp->sd_aspace;
+
filemap_fdatawrite(mapping);
error = filemap_fdatawait(mapping);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 52f177b..c7f2469 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -116,6 +116,9 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
unsigned long index;
unsigned int bufnum;
+ if (mapping == NULL)
+ mapping = &sdp->sd_aspace;
+
shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift;
index = blkno >> shift; /* convert block to page */
bufnum = blkno - (index << shift); /* block buf index within page */
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 52fa883..94aab31 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -36,6 +36,7 @@
#include "log.h"
#include "quota.h"
#include "dir.h"
+#include "meta_io.h"
#include "trace_gfs2.h"
#define DO 0
@@ -62,6 +63,7 @@ static void gfs2_tune_init(struct gfs2_tune *gt)
static struct gfs2_sbd *init_sbd(struct super_block *sb)
{
struct gfs2_sbd *sdp;
+ struct address_space *mapping;
sdp = kzalloc(sizeof(struct gfs2_sbd), GFP_KERNEL);
if (!sdp)
@@ -98,6 +100,16 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
INIT_LIST_HEAD(&sdp->sd_trunc_list);
spin_lock_init(&sdp->sd_trunc_lock);
+ mapping = &sdp->sd_aspace;
+
+ mapping->a_ops = &gfs2_meta_aops;
+ mapping->host = sb->s_bdev->bd_inode;
+ mapping->flags = 0;
+ mapping_set_gfp_mask(mapping, GFP_NOFS);
+ mapping->private_data = NULL;
+ mapping->backing_dev_info = sb->s_bdi;
+ mapping->writeback_index = 0;
+
spin_lock_init(&sdp->sd_log_lock);
atomic_set(&sdp->sd_log_pinned, 0);
INIT_LIST_HEAD(&sdp->sd_log_le_buf);
--
1.8.3.1
Each rgrp header is represented as a single extent on disk, so we
can calculate the position within the address space, since we are
using address spaces mapped 1:1 to the disk. This means that it
is possible to use the range based versions of filemap_fdatawrite/wait
and to invalidate only the relevant range of the page cache.
Our eventual intent is to then be able to merge the address spaces
used for rgrps into a single address space, rather than to have
one for each glock, saving memory and reducing complexity.
Since during umount, the rgrp structures are disposed of before
the glocks, we need to store the extent information in the glock
so that it is available for a final invalidation. This patch uses
a field which is otherwise unused in rgrp glocks to do that, so
that we do not have to expand the size of a glock.
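As an illustration (not part of the patch itself), here is a small
standalone sketch of the range arithmetic; the names mirror the fields
used in the diff below and the example values are made up:

/* Standalone model of the gl_vm range calculation; values are
 * illustrative only. The metadata address space maps the disk 1:1,
 * so a block number times the block size gives a byte offset. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t rd_addr = 17;     /* first block of the rgrp header extent */
	uint32_t rd_length = 2;    /* number of header blocks in the extent */
	uint32_t sb_bsize = 4096;  /* filesystem block size */

	uint64_t start = rd_addr * sb_bsize;
	uint64_t end = start + (uint64_t)rd_length * sb_bsize - 1;

	/* These are the byte offsets handed to the range based
	 * filemap_fdatawrite/wait and truncate calls. */
	printf("gl_vm.start = %llu, gl_vm.end = %llu\n",
	       (unsigned long long)start, (unsigned long long)end);
	return 0;
}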
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index f88dcd9..1b192c8d 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -142,8 +142,8 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
gfs2_log_flush(gl->gl_sbd, gl);
- filemap_fdatawrite(metamapping);
- error = filemap_fdatawait(metamapping);
+ filemap_fdatawrite_range(metamapping, gl->gl_vm.start, gl->gl_vm.end);
+ error = filemap_fdatawait_range(metamapping, gl->gl_vm.start, gl->gl_vm.end);
mapping_set_error(metamapping, error);
gfs2_ail_empty_gl(gl);
@@ -170,7 +170,7 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
WARN_ON_ONCE(!(flags & DIO_METADATA));
gfs2_assert_withdraw(gl->gl_sbd, !atomic_read(&gl->gl_ail_count));
- truncate_inode_pages(mapping, 0);
+ truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
if (gl->gl_object) {
struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 0132816..e6544ee 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -351,7 +351,15 @@ struct gfs2_glock {
atomic_t gl_ail_count;
atomic_t gl_revokes;
struct delayed_work gl_work;
- struct work_struct gl_delete;
+ union {
+ /* For inode and iopen glocks only */
+ struct work_struct gl_delete;
+ /* For rgrp glocks only */
+ struct {
+ loff_t start;
+ loff_t end;
+ } gl_vm;
+ };
struct rcu_head gl_rcu;
};
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 2584710..183cf0f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -886,6 +886,7 @@ static int rgd_insert(struct gfs2_rgrpd *rgd)
static int read_rindex_entry(struct gfs2_inode *ip)
{
struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
+ const unsigned bsize = sdp->sd_sb.sb_bsize;
loff_t pos = sdp->sd_rgrps * sizeof(struct gfs2_rindex);
struct gfs2_rindex buf;
int error;
@@ -923,6 +924,8 @@ static int read_rindex_entry(struct gfs2_inode *ip)
goto fail;
rgd->rd_gl->gl_object = rgd;
+ rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
+ rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
if (rgd->rd_data > sdp->sd_max_rg_data)
--
1.8.3.1
There are three cases where we need to calculate the number of
blocks to reserve in a transaction involving linking an inode
into a directory. The one in rename is a bit more complicated,
but the basis of it is the same as for link and create. So it
makes sense to move this calculation into a single function
rather than repeating it three times.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9ac8f13..fa4624f 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -469,6 +469,28 @@ static void init_dinode(struct gfs2_inode *dip, struct gfs2_inode *ip,
brelse(dibh);
}
+/**
+ * gfs2_trans_da_blks - Calculate number of blocks to link inode
+ * @dip: The directory we are linking into
+ * @da: The dir add information
+ * @nr_inodes: The number of inodes involved
+ *
+ * This calculates the number of blocks we need to reserve in a
+ * transaction to link @nr_inodes into a directory. In most cases
+ * @nr_inodes will be 2 (the directory plus the inode being linked in)
+ * but in case of rename, 4 may be required.
+ *
+ * Returns: Number of blocks
+ */
+
+static unsigned gfs2_trans_da_blks(const struct gfs2_inode *dip,
+ const struct gfs2_diradd *da,
+ unsigned nr_inodes)
+{
+ return da->nr_blocks + gfs2_rg_blocks(dip, da->nr_blocks) +
+ (nr_inodes * RES_DINODE) + RES_QUOTA + RES_STATFS;
+}
+
static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
struct gfs2_inode *ip, struct gfs2_diradd *da)
{
@@ -485,10 +507,7 @@ static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
if (error)
goto fail_quota_locks;
- error = gfs2_trans_begin(sdp, da->nr_blocks +
- gfs2_rg_blocks(dip, da->nr_blocks) +
- 2 * RES_DINODE +
- RES_STATFS + RES_QUOTA, 0);
+ error = gfs2_trans_begin(sdp, gfs2_trans_da_blks(dip, da, 2), 0);
if (error)
goto fail_ipreserv;
} else {
@@ -886,10 +905,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
if (error)
goto out_gunlock_q;
- error = gfs2_trans_begin(sdp, da.nr_blocks +
- gfs2_rg_blocks(dip, da.nr_blocks) +
- 2 * RES_DINODE + RES_STATFS +
- RES_QUOTA, 0);
+ error = gfs2_trans_begin(sdp, gfs2_trans_da_blks(dip, &da, 2), 0);
if (error)
goto out_ipres;
} else {
@@ -1403,10 +1419,8 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
if (error)
goto out_gunlock_q;
- error = gfs2_trans_begin(sdp, da.nr_blocks +
- gfs2_rg_blocks(ndip, da.nr_blocks) +
- 4 * RES_DINODE + 4 * RES_LEAF +
- RES_STATFS + RES_QUOTA + 4, 0);
+ error = gfs2_trans_begin(sdp, gfs2_trans_da_blks(ndip, &da, 4) +
+ 4 * RES_LEAF + 4, 0);
if (error)
goto out_ipreserv;
} else {
--
1.8.3.1
The intent is that this structure will hold the information
required when adding entries to a directory (linking). To
start with, it will contain only the number of blocks which
are required to link the new entry into the directory. The
current calculation returns either 0 or the maximum number of
blocks that can ever be requested by such a transaction.
The intent is that in a later patch, we can update the dir
code to calculate this value more accurately. In addition
further patches will also add further fields to the new
structure to increase its utility.
This patch also fixes a bug where the link used during
inode creation was requesting too many blocks in
some cases. This is harmless unless the fs is close to being
full.
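As a rough sketch of the resulting calling convention (a simplified
userspace model, not the kernel code; the helper below and the
constant 32 are stand-ins for gfs2_diradd_alloc_required() and
sd_max_dirres): callers fill a gfs2_diradd and then key reservation
and unwind decisions off da.nr_blocks rather than a separate
alloc_required flag.

/* Simplified model of the new calling pattern; illustration only. */
#include <stdio.h>

struct gfs2_diradd {
	unsigned nr_blocks;	/* blocks needed to add the directory entry */
};

static int diradd_alloc_required(int need_more_space, struct gfs2_diradd *da)
{
	/* Either no allocation is needed, or (for now) the maximum
	 * that such a transaction could ever require. */
	da->nr_blocks = need_more_space ? 32 : 0;
	return 0;
}

int main(void)
{
	struct gfs2_diradd da;

	if (diradd_alloc_required(1, &da) < 0)
		return 1;

	if (da.nr_blocks)
		printf("reserve %u blocks before linking the inode\n",
		       da.nr_blocks);
	else
		printf("the entry fits in the existing directory blocks\n");
	return 0;
}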
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 2e5fc26..0b6be20 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2017,18 +2017,24 @@ out:
* gfs2_diradd_alloc_required - find if adding entry will require an allocation
* @ip: the file being written to
* @filname: the filename that's going to be added
+ * @da: The structure to return dir alloc info
*
- * Returns: 1 if alloc required, 0 if not, -ve on error
+ * Returns: 0 if ok, -ve on error
*/
-int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name)
+int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
+ struct gfs2_diradd *da)
{
+ struct gfs2_sbd *sdp = GFS2_SB(inode);
struct gfs2_dirent *dent;
struct buffer_head *bh;
+ da->nr_blocks = 0;
+
dent = gfs2_dirent_search(inode, name, gfs2_dirent_find_space, &bh);
if (!dent) {
- return 1;
+ da->nr_blocks = sdp->sd_max_dirres;
+ return 0;
}
if (IS_ERR(dent))
return PTR_ERR(dent);
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index 4f03bbd..c5573e7 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -17,6 +17,10 @@ struct inode;
struct gfs2_inode;
struct gfs2_inum;
+struct gfs2_diradd {
+ unsigned nr_blocks;
+};
+
extern struct inode *gfs2_dir_search(struct inode *dir,
const struct qstr *filename,
bool fail_on_exist);
@@ -33,7 +37,8 @@ extern int gfs2_dir_mvino(struct gfs2_inode *dip, const struct qstr *filename,
extern int gfs2_dir_exhash_dealloc(struct gfs2_inode *dip);
extern int gfs2_diradd_alloc_required(struct inode *dir,
- const struct qstr *filename);
+ const struct qstr *filename,
+ struct gfs2_diradd *da);
extern int gfs2_dir_get_new_buffer(struct gfs2_inode *ip, u64 block,
struct buffer_head **bhp);
extern void gfs2_dir_hash_inval(struct gfs2_inode *ip);
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 7119504..9ac8f13 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -470,13 +470,13 @@ static void init_dinode(struct gfs2_inode *dip, struct gfs2_inode *ip,
}
static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
- struct gfs2_inode *ip, int arq)
+ struct gfs2_inode *ip, struct gfs2_diradd *da)
{
struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
- struct gfs2_alloc_parms ap = { .target = sdp->sd_max_dirres, };
+ struct gfs2_alloc_parms ap = { .target = da->nr_blocks, };
int error;
- if (arq) {
+ if (da->nr_blocks) {
error = gfs2_quota_lock_check(dip);
if (error)
goto fail_quota_locks;
@@ -485,8 +485,8 @@ static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
if (error)
goto fail_quota_locks;
- error = gfs2_trans_begin(sdp, sdp->sd_max_dirres +
- dip->i_rgd->rd_length +
+ error = gfs2_trans_begin(sdp, da->nr_blocks +
+ gfs2_rg_blocks(dip, da->nr_blocks) +
2 * RES_DINODE +
RES_STATFS + RES_QUOTA, 0);
if (error)
@@ -560,7 +560,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
struct dentry *d;
int error;
u32 aflags = 0;
- int arq;
+ struct gfs2_diradd da;
if (!name->len || name->len > GFS2_FNAMESIZE)
return -ENAMETOOLONG;
@@ -602,7 +602,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
goto fail_gunlock;
}
- arq = error = gfs2_diradd_alloc_required(dir, name);
+ error = gfs2_diradd_alloc_required(dir, name, &da);
if (error < 0)
goto fail_gunlock;
@@ -690,7 +690,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
if (error)
goto fail_gunlock3;
- error = link_dinode(dip, name, ip, arq);
+ error = link_dinode(dip, name, ip, &da);
if (error)
goto fail_gunlock3;
@@ -817,7 +817,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_holder ghs[2];
struct buffer_head *dibh;
- int alloc_required;
+ struct gfs2_diradd da;
int error;
if (S_ISDIR(inode->i_mode))
@@ -872,13 +872,12 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
if (ip->i_inode.i_nlink == (u32)-1)
goto out_gunlock;
- alloc_required = error = gfs2_diradd_alloc_required(dir, &dentry->d_name);
+ error = gfs2_diradd_alloc_required(dir, &dentry->d_name, &da);
if (error < 0)
goto out_gunlock;
- error = 0;
- if (alloc_required) {
- struct gfs2_alloc_parms ap = { .target = sdp->sd_max_dirres, };
+ if (da.nr_blocks) {
+ struct gfs2_alloc_parms ap = { .target = da.nr_blocks, };
error = gfs2_quota_lock_check(dip);
if (error)
goto out_gunlock;
@@ -887,8 +886,8 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
if (error)
goto out_gunlock_q;
- error = gfs2_trans_begin(sdp, sdp->sd_max_dirres +
- gfs2_rg_blocks(dip, sdp->sd_max_dirres) +
+ error = gfs2_trans_begin(sdp, da.nr_blocks +
+ gfs2_rg_blocks(dip, da.nr_blocks) +
2 * RES_DINODE + RES_STATFS +
RES_QUOTA, 0);
if (error)
@@ -919,10 +918,10 @@ out_brelse:
out_end_trans:
gfs2_trans_end(sdp);
out_ipres:
- if (alloc_required)
+ if (da.nr_blocks)
gfs2_inplace_release(dip);
out_gunlock_q:
- if (alloc_required)
+ if (da.nr_blocks)
gfs2_quota_unlock(dip);
out_gunlock:
gfs2_glock_dq(ghs + 1);
@@ -1254,7 +1253,7 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
struct gfs2_rgrpd *nrgd;
unsigned int num_gh;
int dir_rename = 0;
- int alloc_required = 0;
+ struct gfs2_diradd da = { .nr_blocks = 0, };
unsigned int x;
int error;
@@ -1388,14 +1387,14 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
goto out_gunlock;
}
- if (nip == NULL)
- alloc_required = gfs2_diradd_alloc_required(ndir, &ndentry->d_name);
- error = alloc_required;
- if (error < 0)
- goto out_gunlock;
+ if (nip == NULL) {
+ error = gfs2_diradd_alloc_required(ndir, &ndentry->d_name, &da);
+ if (error)
+ goto out_gunlock;
+ }
- if (alloc_required) {
- struct gfs2_alloc_parms ap = { .target = sdp->sd_max_dirres, };
+ if (da.nr_blocks) {
+ struct gfs2_alloc_parms ap = { .target = da.nr_blocks, };
error = gfs2_quota_lock_check(ndip);
if (error)
goto out_gunlock;
@@ -1404,8 +1403,8 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
if (error)
goto out_gunlock_q;
- error = gfs2_trans_begin(sdp, sdp->sd_max_dirres +
- gfs2_rg_blocks(ndip, sdp->sd_max_dirres) +
+ error = gfs2_trans_begin(sdp, da.nr_blocks +
+ gfs2_rg_blocks(ndip, da.nr_blocks) +
4 * RES_DINODE + 4 * RES_LEAF +
RES_STATFS + RES_QUOTA + 4, 0);
if (error)
@@ -1448,10 +1447,10 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
out_end_trans:
gfs2_trans_end(sdp);
out_ipreserv:
- if (alloc_required)
+ if (da.nr_blocks)
gfs2_inplace_release(ndip);
out_gunlock_q:
- if (alloc_required)
+ if (da.nr_blocks)
gfs2_quota_unlock(ndip);
out_gunlock:
while (x--) {
--
1.8.3.1
For historical reasons, we drop and retake the log lock in ->releasepage().
However, since there is no reason why we cannot hold the log lock over
the whole function, this allows some simplification. In particular,
pinning a buffer is only ever done under the log lock, so it is possible
here to remove the test for pinned buffers in the second loop, since it
is impossible for that to happen (it is also tested in the first loop).
As a result, two tests made later in the second loop become constants
and can also be reduced to the only possible branch. So the net result
is to remove various bits of unreachable code and make this more
readable.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 73f3e4e..cf858da 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1080,30 +1080,22 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
bh = bh->b_this_page;
} while(bh != head);
spin_unlock(&sdp->sd_ail_lock);
- gfs2_log_unlock(sdp);
head = bh = page_buffers(page);
do {
- gfs2_log_lock(sdp);
bd = bh->b_private;
if (bd) {
gfs2_assert_warn(sdp, bd->bd_bh == bh);
- if (!list_empty(&bd->bd_list)) {
- if (!buffer_pinned(bh))
- list_del_init(&bd->bd_list);
- else
- bd = NULL;
- }
- if (bd)
- bd->bd_bh = NULL;
+ if (!list_empty(&bd->bd_list))
+ list_del_init(&bd->bd_list);
+ bd->bd_bh = NULL;
bh->b_private = NULL;
- }
- gfs2_log_unlock(sdp);
- if (bd)
kmem_cache_free(gfs2_bufdata_cachep, bd);
+ }
bh = bh->b_this_page;
} while (bh != head);
+ gfs2_log_unlock(sdp);
return try_to_free_buffers(page);
--
1.8.3.1
From: Bob Peterson <[email protected]>
With the preceding patch, we started accepting block reservations
smaller than the ideal size, which requires a lot more parsing of the
bitmaps. To reduce the amount of bitmap searching, this patch
implements a scheme whereby each rgrp keeps track of the point
at which multi-block reservations will fail.
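To illustrate the idea (a toy model only, reconstructed from the
description and the diff below; none of this is the kernel code): once
a full search of an rgrp fails to find an extent of a given size, that
size is remembered, and later allocations asking for at least that
much skip the rgrp without parsing its bitmaps.

/* Toy model of rd_extfail_pt; illustration only. */
#include <stdio.h>

struct rgrp {
	unsigned rd_free;	/* free blocks in the rgrp */
	unsigned rd_extfail_pt;	/* smallest extent size known to fail */
};

/* Stand-in for a bitmap search: pretend the largest free run is 8. */
static int find_extent(struct rgrp *rgd, unsigned target)
{
	unsigned largest_run = 8;

	if (target <= largest_run)
		return 0;	/* success */
	/* Remember that asking for this much (or more) is hopeless
	 * until some blocks are freed in this rgrp. */
	if (target < rgd->rd_extfail_pt)
		rgd->rd_extfail_pt = target;
	return -1;
}

static int try_rgrp(struct rgrp *rgd, unsigned target)
{
	if (target > rgd->rd_extfail_pt)
		return -1;	/* skip the rgrp, no bitmap parsing */
	return find_extent(rgd, target);
}

int main(void)
{
	/* The failure point starts out maxed at rd_free. */
	struct rgrp rgd = { .rd_free = 100, .rd_extfail_pt = 100 };

	printf("first try:  %d\n", try_rgrp(&rgd, 20)); /* searches, fails */
	printf("second try: %d\n", try_rgrp(&rgd, 25)); /* skipped outright */
	printf("small try:  %d\n", try_rgrp(&rgd, 4));  /* still succeeds */
	return 0;
}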
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index ba1ea67..0132816 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -93,6 +93,7 @@ struct gfs2_rgrpd {
struct gfs2_rgrp_lvb *rd_rgl;
u32 rd_last_alloc;
u32 rd_flags;
+ u32 rd_extfail_pt; /* extent failure point */
#define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
#define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
#define GFS2_RDF_ERROR 0x40000000 /* error in rg */
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 010b9fb..cdadb92 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -83,6 +83,7 @@ static void maybe_release_space(struct gfs2_bufdata *bd)
bd->bd_bh->b_data + bi->bi_offset, bi->bi_len);
clear_bit(GBF_FULL, &bi->bi_flags);
rgd->rd_free_clone = rgd->rd_free;
+ rgd->rd_extfail_pt = rgd->rd_free;
}
/**
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ccf89a..797f1d3 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -641,9 +641,13 @@ static void __rs_deltree(struct gfs2_blkreserv *rs)
/* return reserved blocks to the rgrp */
BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free);
rs->rs_rbm.rgd->rd_reserved -= rs->rs_free;
+ /* The rgrp extent failure point is likely not to increase;
+ it will only do so if the freed blocks are somehow
+ contiguous with a span of free blocks that follows. Still,
+ it will force the number to be recalculated later. */
+ rgd->rd_extfail_pt += rs->rs_free;
rs->rs_free = 0;
clear_bit(GBF_FULL, &bi->bi_flags);
- smp_mb__after_clear_bit();
}
}
@@ -1132,6 +1136,8 @@ int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
rgd->rd_free_clone = rgd->rd_free;
+ /* max out the rgrp allocation failure point */
+ rgd->rd_extfail_pt = rgd->rd_free;
}
if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic) {
rgd->rd_rgl->rl_unlinked = cpu_to_be32(count_unlinked(rgd));
@@ -1593,6 +1599,8 @@ fail:
* Side effects:
* - If looking for free blocks, we set GBF_FULL on each bitmap which
* has no free blocks in it.
+ * - If looking for free blocks, we set rd_extfail_pt on each rgrp which
+ * has come up short on a free block search.
*
* Returns: 0 on success, -ENOSPC if there is no block of the requested state
*/
@@ -1604,6 +1612,8 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext,
struct buffer_head *bh;
int initial_bii;
u32 initial_offset;
+ int first_bii = rbm->bii;
+ u32 first_offset = rbm->offset;
u32 offset;
u8 *buffer;
int n = 0;
@@ -1679,6 +1689,13 @@ next_iter:
if (minext == NULL || state != GFS2_BLKST_FREE)
return -ENOSPC;
+ /* If the extent was too small, and it's smaller than the smallest
+ to have failed before, remember for future reference that it's
+ useless to search this rgrp again for this amount or more. */
+ if ((first_offset == 0) && (first_bii == 0) &&
+ (*minext < rbm->rgd->rd_extfail_pt))
+ rbm->rgd->rd_extfail_pt = *minext;
+
/* If the maximum extent we found is big enough to fulfill the
minimum requirements, use it anyway. */
if (maxext.len) {
@@ -1924,7 +1941,9 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
}
/* Skip unuseable resource groups */
- if (rs->rs_rbm.rgd->rd_flags & (GFS2_RGF_NOALLOC | GFS2_RDF_ERROR))
+ if ((rs->rs_rbm.rgd->rd_flags & (GFS2_RGF_NOALLOC |
+ GFS2_RDF_ERROR)) ||
+ (ap && (ap->target > rs->rs_rbm.rgd->rd_extfail_pt)))
goto skip_rgrp;
if (sdp->sd_args.ar_rgrplvb)
@@ -2106,10 +2125,10 @@ int gfs2_rgrp_dump(struct seq_file *seq, const struct gfs2_glock *gl)
if (rgd == NULL)
return 0;
- gfs2_print_dbg(seq, " R: n:%llu f:%02x b:%u/%u i:%u r:%u\n",
+ gfs2_print_dbg(seq, " R: n:%llu f:%02x b:%u/%u i:%u r:%u e:%u\n",
(unsigned long long)rgd->rd_addr, rgd->rd_flags,
rgd->rd_free, rgd->rd_free_clone, rgd->rd_dinodes,
- rgd->rd_reserved);
+ rgd->rd_reserved, rgd->rd_extfail_pt);
spin_lock(&rgd->rd_rsspin);
for (n = rb_first(&rgd->rd_rstree); n; n = rb_next(&trs->rs_node)) {
trs = rb_entry(n, struct gfs2_blkreserv, rs_node);
@@ -2228,9 +2247,10 @@ int gfs2_alloc_blocks(struct gfs2_inode *ip, u64 *bn, unsigned int *nblocks,
/* Since all blocks are reserved in advance, this shouldn't happen */
if (error) {
- fs_warn(sdp, "inum=%llu error=%d, nblocks=%u, full=%d\n",
+ fs_warn(sdp, "inum=%llu error=%d, nblocks=%u, full=%d fail_pt=%d\n",
(unsigned long long)ip->i_no_addr, error, *nblocks,
- test_bit(GBF_FULL, &rbm.rgd->rd_bits->bi_flags));
+ test_bit(GBF_FULL, &rbm.rgd->rd_bits->bi_flags),
+ rbm.rgd->rd_extfail_pt);
goto rgrp_error;
}
--
1.8.3.1
There is only one place where this is used: when reading in the quota
changes at mount time. It is not really required, and it is much
simpler to just convert the fields from the on-disk structure
as required.
There should be no functional change as a result of this patch.
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 98236d0..1b6b367 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -67,12 +67,6 @@
#include "inode.h"
#include "util.h"
-struct gfs2_quota_change_host {
- u64 qc_change;
- u32 qc_flags; /* GFS2_QCF_... */
- struct kqid qc_id;
-};
-
/* Lock order: qd_lock -> qd->lockref.lock -> lru lock */
static DEFINE_SPINLOCK(qd_lock);
struct list_lru gfs2_qd_lru;
@@ -1214,17 +1208,6 @@ int gfs2_quota_refresh(struct gfs2_sbd *sdp, struct kqid qid)
return error;
}
-static void gfs2_quota_change_in(struct gfs2_quota_change_host *qc, const void *buf)
-{
- const struct gfs2_quota_change *str = buf;
-
- qc->qc_change = be64_to_cpu(str->qc_change);
- qc->qc_flags = be32_to_cpu(str->qc_flags);
- qc->qc_id = make_kqid(&init_user_ns,
- (qc->qc_flags & GFS2_QCF_USER)?USRQUOTA:GRPQUOTA,
- be32_to_cpu(str->qc_id));
-}
-
int gfs2_quota_init(struct gfs2_sbd *sdp)
{
struct gfs2_inode *ip = GFS2_I(sdp->sd_qc_inode);
@@ -1257,6 +1240,7 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
for (x = 0; x < blocks; x++) {
struct buffer_head *bh;
+ const struct gfs2_quota_change *qc;
unsigned int y;
if (!extlen) {
@@ -1274,25 +1258,28 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
goto fail;
}
+ qc = (const struct gfs2_quota_change *)(bh->b_data + sizeof(struct gfs2_meta_header));
for (y = 0; y < sdp->sd_qc_per_block && slot < sdp->sd_quota_slots;
y++, slot++) {
- struct gfs2_quota_change_host qc;
struct gfs2_quota_data *qd;
-
- gfs2_quota_change_in(&qc, bh->b_data +
- sizeof(struct gfs2_meta_header) +
- y * sizeof(struct gfs2_quota_change));
- if (!qc.qc_change)
+ s64 qc_change = be64_to_cpu(qc->qc_change);
+ u32 qc_flags = be32_to_cpu(qc->qc_flags);
+ enum quota_type qtype = (qc_flags & GFS2_QCF_USER) ?
+ USRQUOTA : GRPQUOTA;
+ struct kqid qc_id = make_kqid(&init_user_ns, qtype,
+ be32_to_cpu(qc->qc_id));
+ qc++;
+ if (!qc_change)
continue;
- error = qd_alloc(sdp, qc.qc_id, &qd);
+ error = qd_alloc(sdp, qc_id, &qd);
if (error) {
brelse(bh);
goto fail;
}
set_bit(QDF_CHANGE, &qd->qd_flags);
- qd->qd_change = qc.qc_change;
+ qd->qd_change = qc_change;
qd->qd_slot = slot;
qd->qd_slot_count = 1;
--
1.8.3.1
Since gfs2_inplace_reserve() is always called with a valid
alloc parms structure, there is no need to test for this
within the function itself - and in any case we have already
dereferenced it by that point anyway.
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 797f1d3..2584710 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1943,7 +1943,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
/* Skip unuseable resource groups */
if ((rs->rs_rbm.rgd->rd_flags & (GFS2_RGF_NOALLOC |
GFS2_RDF_ERROR)) ||
- (ap && (ap->target > rs->rs_rbm.rgd->rd_extfail_pt)))
+ (ap->target > rs->rs_rbm.rgd->rd_extfail_pt))
goto skip_rgrp;
if (sdp->sd_args.ar_rgrplvb)
--
1.8.3.1
From: Bob Peterson <[email protected]>
Here is a second try at a patch I posted earlier, which also implements
suggestions Steve made:
Before this patch, GFS2 would keep searching through all the rgrps
until it found one that had a chunk of free blocks big enough to
satisfy the size hint, which is based on the file write size,
regardless of whether the chunk was big enough to perform the write.
However, when doing big writes there may not be a large enough
chunk of free blocks in any rgrp, due to file system fragmentation.
The largest chunk may be big enough to satisfy the write request,
but it may not meet the ideal reservation size from the "size hint".
The writes would slow to a crawl because every write would search
every rgrp, then finally give up and default to a single-block write.
In my case, performance would drop from 425MB/s to 18KB/s, or 24000
times slower.
This patch basically makes it so that if we can't find a contiguous
chunk of blocks big enough to satisfy the sizehint, we'll use the
largest chunk of blocks we found that will still contain the write.
It does so by keeping track of the largest run of blocks within the
rgrp.
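The gist of the fallback, as a small standalone sketch (a deliberate
simplification; the real search works across bitmaps, respects existing
reservations, and also checks that the fallback extent still covers the
actual write): scan the free runs, take the first one at least as large
as the size hint, and otherwise remember the largest run seen so it can
be used instead of giving up.

/* Toy model of the "use the largest extent found" fallback; not the
 * kernel bitmap search. Bits: 1 = free block, 0 = in use. */
#include <stdio.h>

struct extent {
	unsigned start;
	unsigned len;
};

static int find_free_run(const int *map, unsigned n, unsigned min_len,
			 struct extent *out)
{
	struct extent best = { 0, 0 };
	unsigned i = 0;

	while (i < n) {
		if (!map[i]) {
			i++;
			continue;
		}
		unsigned start = i;
		while (i < n && map[i])
			i++;
		unsigned len = i - start;
		if (len >= min_len) {	/* big enough for the size hint */
			out->start = start;
			out->len = min_len;
			return 0;
		}
		if (len > best.len) {	/* track the largest run seen */
			best.start = start;
			best.len = len;
		}
	}
	if (best.len) {
		*out = best;	/* largest run, smaller than the hint */
		return 0;
	}
	return -1;		/* genuinely no free blocks */
}

int main(void)
{
	/* Fragmented map: longest free run is 3 blocks. */
	int map[] = { 1, 0, 1, 1, 1, 0, 1, 1, 0, 1 };
	struct extent ext;

	if (find_free_run(map, 10, 5, &ext) == 0)
		printf("using run at %u, len %u\n", ext.start, ext.len);
	return 0;
}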
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index c8d6161..809fecd 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -57,6 +57,11 @@
* 3 = Used (metadata)
*/
+struct gfs2_extent {
+ struct gfs2_rbm rbm;
+ u32 len;
+};
+
static const char valid_change[16] = {
/* current */
/* n */ 0, 1, 1, 1,
@@ -65,8 +70,9 @@ static const char valid_change[16] = {
1, 0, 0, 0
};
-static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 minext,
- const struct gfs2_inode *ip, bool nowrap);
+static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext,
+ const struct gfs2_inode *ip, bool nowrap,
+ const struct gfs2_alloc_parms *ap);
/**
@@ -1455,7 +1461,7 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
if (WARN_ON(gfs2_rbm_from_block(&rbm, goal)))
return;
- ret = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, extlen, ip, true);
+ ret = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, &extlen, ip, true, ap);
if (ret == 0) {
rs->rs_rbm = rbm;
rs->rs_free = extlen;
@@ -1520,6 +1526,7 @@ static u64 gfs2_next_unreserved_block(struct gfs2_rgrpd *rgd, u64 block,
* @rbm: The current position in the resource group
* @ip: The inode for which we are searching for blocks
* @minext: The minimum extent length
+ * @maxext: A pointer to the maximum extent structure
*
* This checks the current position in the rgrp to see whether there is
* a reservation covering this block. If not then this function is a
@@ -1532,7 +1539,8 @@ static u64 gfs2_next_unreserved_block(struct gfs2_rgrpd *rgd, u64 block,
static int gfs2_reservation_check_and_update(struct gfs2_rbm *rbm,
const struct gfs2_inode *ip,
- u32 minext)
+ u32 minext,
+ struct gfs2_extent *maxext)
{
u64 block = gfs2_rbm_to_block(rbm);
u32 extlen = 1;
@@ -1545,8 +1553,7 @@ static int gfs2_reservation_check_and_update(struct gfs2_rbm *rbm,
*/
if (minext) {
extlen = gfs2_free_extlen(rbm, minext);
- nblock = block + extlen;
- if (extlen < minext)
+ if (extlen <= maxext->len)
goto fail;
}
@@ -1555,9 +1562,17 @@ static int gfs2_reservation_check_and_update(struct gfs2_rbm *rbm,
* and skip if parts of it are already reserved
*/
nblock = gfs2_next_unreserved_block(rbm->rgd, block, extlen, ip);
- if (nblock == block)
- return 0;
+ if (nblock == block) {
+ if (!minext || extlen >= minext)
+ return 0;
+
+ if (extlen > maxext->len) {
+ maxext->len = extlen;
+ maxext->rbm = *rbm;
+ }
fail:
+ nblock = block + extlen;
+ }
ret = gfs2_rbm_from_block(rbm, nblock);
if (ret < 0)
return ret;
@@ -1568,10 +1583,12 @@ fail:
* gfs2_rbm_find - Look for blocks of a particular state
* @rbm: Value/result starting position and final position
* @state: The state which we want to find
- * @minext: The requested extent length (0 for a single block)
+ * @minext: Pointer to the requested extent length (NULL for a single block)
+ * This is updated to be the actual reservation size.
* @ip: If set, check for reservations
* @nowrap: Stop looking at the end of the rgrp, rather than wrapping
* around until we've reached the starting point.
+ * @ap: the allocation parameters
*
* Side effects:
* - If looking for free blocks, we set GBF_FULL on each bitmap which
@@ -1580,8 +1597,9 @@ fail:
* Returns: 0 on success, -ENOSPC if there is no block of the requested state
*/
-static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 minext,
- const struct gfs2_inode *ip, bool nowrap)
+static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext,
+ const struct gfs2_inode *ip, bool nowrap,
+ const struct gfs2_alloc_parms *ap)
{
struct buffer_head *bh;
int initial_bii;
@@ -1592,6 +1610,7 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 minext,
int iters = rbm->rgd->rd_length;
int ret;
struct gfs2_bitmap *bi;
+ struct gfs2_extent maxext = { .rbm.rgd = rbm->rgd, };
/* If we are not starting at the beginning of a bitmap, then we
* need to add one to the bitmap count to ensure that we search
@@ -1620,7 +1639,9 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 minext,
return 0;
initial_bii = rbm->bii;
- ret = gfs2_reservation_check_and_update(rbm, ip, minext);
+ ret = gfs2_reservation_check_and_update(rbm, ip,
+ minext ? *minext : 0,
+ &maxext);
if (ret == 0)
return 0;
if (ret > 0) {
@@ -1655,6 +1676,17 @@ next_iter:
break;
}
+ if (minext == NULL || state != GFS2_BLKST_FREE)
+ return -ENOSPC;
+
+ /* If the maximum extent we found is big enough to fulfill the
+ minimum requirements, use it anyway. */
+ if (maxext.len) {
+ *rbm = maxext.rbm;
+ *minext = maxext.len;
+ return 0;
+ }
+
return -ENOSPC;
}
@@ -1680,7 +1712,8 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
while (1) {
down_write(&sdp->sd_log_flush_lock);
- error = gfs2_rbm_find(&rbm, GFS2_BLKST_UNLINKED, 0, NULL, true);
+ error = gfs2_rbm_find(&rbm, GFS2_BLKST_UNLINKED, NULL, NULL,
+ true, NULL);
up_write(&sdp->sd_log_flush_lock);
if (error == -ENOSPC)
break;
@@ -2184,11 +2217,12 @@ int gfs2_alloc_blocks(struct gfs2_inode *ip, u64 *bn, unsigned int *nblocks,
int error;
gfs2_set_alloc_start(&rbm, ip, dinode);
- error = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, 0, ip, false);
+ error = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, NULL, ip, false, NULL);
if (error == -ENOSPC) {
gfs2_set_alloc_start(&rbm, ip, dinode);
- error = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, 0, NULL, false);
+ error = gfs2_rbm_find(&rbm, GFS2_BLKST_FREE, NULL, NULL, false,
+ NULL);
}
/* Since all blocks are reserved in advance, this shouldn't happen */
--
1.8.3.1
On Mon, Jan 20, 2014 at 12:23:40PM +0000, Steven Whitehouse wrote:
> Prior to this patch, GFS2 kept all the quotas for each
> super block in a single linked list. This is rather slow
> when there are large numbers of quotas.
>
> This patch introduces a hlist_bl based hash table, similar
> to the one used for glocks. The initial look up of the quota
> is now lockless in the case where it is already cached,
> although we still have to take the per quota spinlock in
> order to bump the ref count. Either way though, this is a
> big improvement on what was there before.
>
> The qd_lock and the per super block list is preserved, for
> the time being. However it is intended that since this is no
> longer used for its original role, it should be possible to
> shrink the number of items on that list in due course and
> remove the requirement to take qd_lock in qd_get.
>
> Signed-off-by: Steven Whitehouse <[email protected]>
> Cc: Abhijith Das <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
Interesting! I thought that Sasha Levin had a hash table in the works,
but I don't see it, so CCing him.
A few questions and comments below.
Thanx, Paul
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a99f60c..59d99ec 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -428,10 +428,13 @@ enum {
> };
>
> struct gfs2_quota_data {
> + struct hlist_bl_node qd_hlist;
> struct list_head qd_list;
> struct kqid qd_id;
> + struct gfs2_sbd *qd_sbd;
> struct lockref qd_lockref;
> struct list_head qd_lru;
> + unsigned qd_hash;
>
> unsigned long qd_flags; /* QDF_... */
>
> @@ -450,6 +453,7 @@ struct gfs2_quota_data {
>
> u64 qd_sync_gen;
> unsigned long qd_last_warn;
> + struct rcu_head qd_rcu;
> };
>
> struct gfs2_trans {
> diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
> index 0650db2..c272e73 100644
> --- a/fs/gfs2/main.c
> +++ b/fs/gfs2/main.c
> @@ -76,6 +76,7 @@ static int __init init_gfs2_fs(void)
>
> gfs2_str2qstr(&gfs2_qdot, ".");
> gfs2_str2qstr(&gfs2_qdotdot, "..");
> + gfs2_quota_hash_init();
>
> error = gfs2_sys_init();
> if (error)
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 1b6b367..a1df01d 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -52,6 +52,10 @@
> #include <linux/dqblk_xfs.h>
> #include <linux/lockref.h>
> #include <linux/list_lru.h>
> +#include <linux/rcupdate.h>
> +#include <linux/rculist_bl.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/jhash.h>
>
> #include "gfs2.h"
> #include "incore.h"
> @@ -67,10 +71,43 @@
> #include "inode.h"
> #include "util.h"
>
> -/* Lock order: qd_lock -> qd->lockref.lock -> lru lock */
> +#define GFS2_QD_HASH_SHIFT 12
Should this be a function of the number of CPUs? (Might not be an issue
if the really big systems don't use GFS.)
> +#define GFS2_QD_HASH_SIZE (1 << GFS2_QD_HASH_SHIFT)
> +#define GFS2_QD_HASH_MASK (GFS2_QD_HASH_SIZE - 1)
> +
> +/* Lock order: qd_lock -> bucket lock -> qd->lockref.lock -> lru lock */
> static DEFINE_SPINLOCK(qd_lock);
> struct list_lru gfs2_qd_lru;
>
> +static struct hlist_bl_head qd_hash_table[GFS2_QD_HASH_SIZE];
> +
> +static unsigned int gfs2_qd_hash(const struct gfs2_sbd *sdp,
> + const struct kqid qid)
> +{
> + unsigned int h;
> +
> + h = jhash(&sdp, sizeof(struct gfs2_sbd *), 0);
> + h = jhash(&qid, sizeof(struct kqid), h);
> +
> + return h & GFS2_QD_HASH_MASK;
> +}
> +
> +static inline void spin_lock_bucket(unsigned int hash)
> +{
> + hlist_bl_lock(&qd_hash_table[hash]);
> +}
> +
> +static inline void spin_unlock_bucket(unsigned int hash)
> +{
> + hlist_bl_unlock(&qd_hash_table[hash]);
> +}
> +
> +static void gfs2_qd_dealloc(struct rcu_head *rcu)
> +{
> + struct gfs2_quota_data *qd = container_of(rcu, struct gfs2_quota_data, qd_rcu);
> + kmem_cache_free(gfs2_quotad_cachep, qd);
> +}
> +
> static void gfs2_qd_dispose(struct list_head *list)
> {
> struct gfs2_quota_data *qd;
> @@ -87,6 +124,10 @@ static void gfs2_qd_dispose(struct list_head *list)
> list_del(&qd->qd_list);
> spin_unlock(&qd_lock);
>
> + spin_lock_bucket(qd->qd_hash);
> + hlist_bl_del_rcu(&qd->qd_hlist);
> + spin_unlock_bucket(qd->qd_hash);
> +
Good, removed from the RCU-traversed list before invoking call_rcu().
> gfs2_assert_warn(sdp, !qd->qd_change);
> gfs2_assert_warn(sdp, !qd->qd_slot_count);
> gfs2_assert_warn(sdp, !qd->qd_bh_count);
> @@ -95,7 +136,7 @@ static void gfs2_qd_dispose(struct list_head *list)
> atomic_dec(&sdp->sd_quota_count);
>
> /* Delete it from the common reclaim list */
> - kmem_cache_free(gfs2_quotad_cachep, qd);
> + call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
> }
> }
>
> @@ -165,83 +206,95 @@ static u64 qd2offset(struct gfs2_quota_data *qd)
> return offset;
> }
>
> -static int qd_alloc(struct gfs2_sbd *sdp, struct kqid qid,
> - struct gfs2_quota_data **qdp)
> +static struct gfs2_quota_data *qd_alloc(unsigned hash, struct gfs2_sbd *sdp, struct kqid qid)
> {
> struct gfs2_quota_data *qd;
> int error;
>
> qd = kmem_cache_zalloc(gfs2_quotad_cachep, GFP_NOFS);
> if (!qd)
> - return -ENOMEM;
> + return NULL;
>
> + qd->qd_sbd = sdp;
> qd->qd_lockref.count = 1;
> spin_lock_init(&qd->qd_lockref.lock);
> qd->qd_id = qid;
> qd->qd_slot = -1;
> INIT_LIST_HEAD(&qd->qd_lru);
> + qd->qd_hash = hash;
>
> error = gfs2_glock_get(sdp, qd2index(qd),
> &gfs2_quota_glops, CREATE, &qd->qd_gl);
> if (error)
> goto fail;
>
> - *qdp = qd;
> -
> - return 0;
> + return qd;
>
> fail:
> kmem_cache_free(gfs2_quotad_cachep, qd);
> - return error;
> + return NULL;
> }
>
> -static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
> - struct gfs2_quota_data **qdp)
> +static struct gfs2_quota_data *gfs2_qd_search_bucket(unsigned int hash,
> + const struct gfs2_sbd *sdp,
> + struct kqid qid)
> {
> - struct gfs2_quota_data *qd = NULL, *new_qd = NULL;
> - int error, found;
> -
> - *qdp = NULL;
> + struct gfs2_quota_data *qd;
> + struct hlist_bl_node *h;
>
> - for (;;) {
> - found = 0;
> - spin_lock(&qd_lock);
> - list_for_each_entry(qd, &sdp->sd_quota_list, qd_list) {
> - if (qid_eq(qd->qd_id, qid) &&
> - lockref_get_not_dead(&qd->qd_lockref)) {
> - list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
> - found = 1;
> - break;
> - }
> + hlist_bl_for_each_entry_rcu(qd, h, &qd_hash_table[hash], qd_hlist) {
All callers of gfs2_qd_search_bucket() either hold the update-side lock
(acquired via spin_lock_bucket()) or have invoked rcu_read_lock(), so
this is good.
> + if (!qid_eq(qd->qd_id, qid))
> + continue;
> + if (qd->qd_sbd != sdp)
> + continue;
> + if (lockref_get_not_dead(&qd->qd_lockref)) {
> + list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
list_lru_del() acquires a lock, but it is from an array whose size
depends on the NODES_SHIFT Kconfig variable. The array size seems to
vary from 8 to 16 to 64 to 1024, depending on other configuration info,
so should be OK.
> + return qd;
> }
> + }
>
> - if (!found)
> - qd = NULL;
> + return NULL;
> +}
>
> - if (!qd && new_qd) {
> - qd = new_qd;
> - list_add(&qd->qd_list, &sdp->sd_quota_list);
> - atomic_inc(&sdp->sd_quota_count);
> - new_qd = NULL;
> - }
>
> - spin_unlock(&qd_lock);
> +static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
> + struct gfs2_quota_data **qdp)
> +{
> + struct gfs2_quota_data *qd, *new_qd;
> + unsigned int hash = gfs2_qd_hash(sdp, qid);
>
> - if (qd) {
> - if (new_qd) {
> - gfs2_glock_put(new_qd->qd_gl);
> - kmem_cache_free(gfs2_quotad_cachep, new_qd);
> - }
> - *qdp = qd;
> - return 0;
> - }
> + rcu_read_lock();
> + *qdp = qd = gfs2_qd_search_bucket(hash, sdp, qid);
> + rcu_read_unlock();
>
> - error = qd_alloc(sdp, qid, &new_qd);
> - if (error)
> - return error;
> + if (qd)
> + return 0;
> +
> + new_qd = qd_alloc(hash, sdp, qid);
> + if (!new_qd)
> + return -ENOMEM;
> +
> + spin_lock(&qd_lock);
> + spin_lock_bucket(hash);
> + *qdp = qd = gfs2_qd_search_bucket(hash, sdp, qid);
> + if (qd == NULL) {
> + *qdp = new_qd;
> + list_add(&new_qd->qd_list, &sdp->sd_quota_list);
> + hlist_bl_add_head_rcu(&new_qd->qd_hlist, &qd_hash_table[hash]);
> + atomic_inc(&sdp->sd_quota_count);
> + }
> + spin_unlock_bucket(hash);
> + spin_unlock(&qd_lock);
> +
> + if (qd) {
> + gfs2_glock_put(new_qd->qd_gl);
> + kmem_cache_free(gfs2_quotad_cachep, new_qd);
> }
> +
> + return 0;
> }
>
> +
> static void qd_hold(struct gfs2_quota_data *qd)
> {
> struct gfs2_sbd *sdp = qd->qd_gl->gl_sbd;
> @@ -1215,6 +1268,7 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
> unsigned int blocks = size >> sdp->sd_sb.sb_bsize_shift;
> unsigned int x, slot = 0;
> unsigned int found = 0;
> + unsigned int hash;
> u64 dblock;
> u32 extlen = 0;
> int error;
> @@ -1272,8 +1326,9 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
> if (!qc_change)
> continue;
>
> - error = qd_alloc(sdp, qc_id, &qd);
> - if (error) {
> + hash = gfs2_qd_hash(sdp, qc_id);
> + qd = qd_alloc(hash, sdp, qc_id);
> + if (qd == NULL) {
> brelse(bh);
> goto fail;
> }
> @@ -1289,6 +1344,10 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
> atomic_inc(&sdp->sd_quota_count);
> spin_unlock(&qd_lock);
>
> + spin_lock_bucket(hash);
> + hlist_bl_add_head_rcu(&qd->qd_hlist, &qd_hash_table[hash]);
> + spin_unlock_bucket(hash);
> +
> found++;
> }
>
> @@ -1335,11 +1394,16 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
> spin_unlock(&qd->qd_lockref.lock);
>
> list_del(&qd->qd_list);
> +
> /* Also remove if this qd exists in the reclaim list */
> list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
> atomic_dec(&sdp->sd_quota_count);
> spin_unlock(&qd_lock);
>
> + spin_lock_bucket(qd->qd_hash);
> + hlist_bl_del_rcu(&qd->qd_hlist);
Might just be my unfamiliarity with this code, but it took me a bit
to see the difference between ->qd_hlist and ->qd_list. Of course, until
I spotted the difference, I was wondering why you were removing the
item twice. ;-)
> + spin_unlock_bucket(qd->qd_hash);
> +
> if (!qd->qd_lockref.count) {
> gfs2_assert_warn(sdp, !qd->qd_change);
> gfs2_assert_warn(sdp, !qd->qd_slot_count);
> @@ -1348,7 +1412,7 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
> gfs2_assert_warn(sdp, !qd->qd_bh_count);
>
> gfs2_glock_put(qd->qd_gl);
> - kmem_cache_free(gfs2_quotad_cachep, qd);
> + call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
>
> spin_lock(&qd_lock);
> }
> @@ -1643,3 +1707,11 @@ const struct quotactl_ops gfs2_quotactl_ops = {
> .get_dqblk = gfs2_get_dqblk,
> .set_dqblk = gfs2_set_dqblk,
> };
> +
> +void __init gfs2_quota_hash_init(void)
> +{
> + unsigned i;
> +
> + for(i = 0; i < GFS2_QD_HASH_SIZE; i++)
> + INIT_HLIST_BL_HEAD(&qd_hash_table[i]);
> +}
> diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
> index 96e4f34..55d506e 100644
> --- a/fs/gfs2/quota.h
> +++ b/fs/gfs2/quota.h
> @@ -57,5 +57,6 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
> extern const struct quotactl_ops gfs2_quotactl_ops;
> extern struct shrinker gfs2_qd_shrinker;
> extern struct list_lru gfs2_qd_lru;
> +extern void __init gfs2_quota_hash_init(void);
>
> #endif /* __QUOTA_DOT_H__ */
> --
> 1.8.3.1
>
On 01/22/2014 12:32 AM, Paul E. McKenney wrote:
> On Mon, Jan 20, 2014 at 12:23:40PM +0000, Steven Whitehouse wrote:
>> >Prior to this patch, GFS2 kept all the quotas for each
>> >super block in a single linked list. This is rather slow
>> >when there are large numbers of quotas.
>> >
>> >This patch introduces a hlist_bl based hash table, similar
>> >to the one used for glocks. The initial look up of the quota
>> >is now lockless in the case where it is already cached,
>> >although we still have to take the per quota spinlock in
>> >order to bump the ref count. Either way though, this is a
>> >big improvement on what was there before.
>> >
>> >The qd_lock and the per super block list is preserved, for
>> >the time being. However it is intended that since this is no
>> >longer used for its original role, it should be possible to
>> >shrink the number of items on that list in due course and
>> >remove the requirement to take qd_lock in qd_get.
>> >
>> >Signed-off-by: Steven Whitehouse<[email protected]>
>> >Cc: Abhijith Das<[email protected]>
>> >Cc: Paul E. McKenney<[email protected]>
> Interesting! I thought that Sasha Levin had a hash table in the works,
> but I don't see it, so CCing him.
Indeed, there has been an hlist based hashtable at include/linux/hashtable.h for a couple of kernel
versions now. However, there's no hlist_bl one.
If there is a plan to add an hlist_bl hashtable for whatever reason, it should probably
be done by expanding hashtable.h so that more places that use hlist_bl would benefit from it (yes,
there are a couple more places that implement an hlist_bl hashtable).
Also, do we really want to use hlist_bl here? It doesn't seem like it's being done to conserve
memory, and that's the only reason it should be used. A single spinlock per bucket is
much more efficient than the bit locking scheme that hlist_bl uses.
Thanks,
Sasha
Hi,
On Wed, 2014-01-22 at 01:06 -0500, Sasha Levin wrote:
> On 01/22/2014 12:32 AM, Paul E. McKenney wrote:
> > On Mon, Jan 20, 2014 at 12:23:40PM +0000, Steven Whitehouse wrote:
> >> >Prior to this patch, GFS2 kept all the quotas for each
> >> >super block in a single linked list. This is rather slow
> >> >when there are large numbers of quotas.
> >> >
> >> >This patch introduces a hlist_bl based hash table, similar
> >> >to the one used for glocks. The initial look up of the quota
> >> >is now lockless in the case where it is already cached,
> >> >although we still have to take the per quota spinlock in
> >> >order to bump the ref count. Either way though, this is a
> >> >big improvement on what was there before.
> >> >
> >> >The qd_lock and the per super block list is preserved, for
> >> >the time being. However it is intended that since this is no
> >> >longer used for its original role, it should be possible to
> >> >shrink the number of items on that list in due course and
> >> >remove the requirement to take qd_lock in qd_get.
> >> >
> >> >Signed-off-by: Steven Whitehouse<[email protected]>
> >> >Cc: Abhijith Das<[email protected]>
> >> >Cc: Paul E. McKenney<[email protected]>
> > Interesting! I thought that Sasha Levin had a hash table in the works,
> > but I don't see it, so CCing him.
>
> Indeed, there is a hlist based hashtable at include/linux/hashtable.h for couple kernel
> versions now. However, there's no hlist_bl one.
>
> If there is a plan on adding a hlist_bl hashtable for whatever reason, it should probably
> be done by expanding hashtable.h so that more places that use hlist_bl would benefit from it (yes,
> there are couple more places that do hlist_bl hashtable).
>
> Also, do we really want to use hlist_bl here? It doesn't seem like it's being done to conserve on
> memory, and that's the only reason it should be used for. Doing a single spinlock per bucket is
> much more efficient than using the bit locking scheme that hlist_bl does.
>
>
> Thanks,
> Sasha
So this will probably make a bit more sense with a bit of history to
explain how we got here... The recent addition of the quota hash table
is modeled upon the one we are using for glocks at the moment. The glock
hash table at one stage, had one lock per bucket, and was then expanded
so that we had more buckets, but the same number of locks - to conserve
memory. It was then expanded again when we moved to hlist_bl a little
while ago. At each stage we gained performance, so the changes seemed to
be a good thing.
Really the max number of glocks depends on the size of memory, so
we should try to scale the hash table with increasing memory
size; however, a simpler static table has worked reasonably well for now.
If we could use a tree (or forest) instead of a hash table that would be
even better, but still a bit of a pain to do with RCU I think, which is
why I've not gone that extra step yet.
Now the quota hash table has been modeled as a smaller version of the
glock hash table for the time being. The question is how large it should
be... Anything more than a single hash list head is an improvement on the
previous code, and maybe the table should be larger than I've made it
currently. We'll see whether anybody runs into issues if they have large
numbers of quotas in due course.
So perhaps the memory saving is not the most important thing with the
quota hash table, but at least it matches in form the glock hash table
which has proven effective over many releases. However, if
we can make use of some generic code that solves many of the problems
for us, then I can certainly look into that.
The changes in the quota code are not complete yet, and with a bit of
luck we'll have a further set of changes ready for the next merge window
too,
Steve.
Hi,
On Tue, 2014-01-21 at 21:32 -0800, Paul E. McKenney wrote:
> On Mon, Jan 20, 2014 at 12:23:40PM +0000, Steven Whitehouse wrote:
> > Prior to this patch, GFS2 kept all the quotas for each
> > super block in a single linked list. This is rather slow
> > when there are large numbers of quotas.
> >
> > This patch introduces a hlist_bl based hash table, similar
> > to the one used for glocks. The initial look up of the quota
> > is now lockless in the case where it is already cached,
> > although we still have to take the per quota spinlock in
> > order to bump the ref count. Either way though, this is a
> > big improvement on what was there before.
> >
> > The qd_lock and the per super block list is preserved, for
> > the time being. However it is intended that since this is no
> > longer used for its original role, it should be possible to
> > shrink the number of items on that list in due course and
> > remove the requirement to take qd_lock in qd_get.
> >
> > Signed-off-by: Steven Whitehouse <[email protected]>
> > Cc: Abhijith Das <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
>
> Interesting! I thought that Sasha Levin had a hash table in the works,
> but I don't see it, so CCing him.
>
> A few questions and comments below.
>
> Thanx, Paul
>
Thanks for the review.
[snip]
> > +#define GFS2_QD_HASH_SHIFT 12
> Should this be a function of the number of CPUs? (Might not be an issue
> if the really big systems don't use GFS.)
I'm not sure... really it depends on how many quotas are in use, so on
the number of users, and even on relatively small systems there might be a
lot of them. So I'm guessing a bit, and we'll bump it up a bit if it's a
problem. There is a lot of extra complexity in changing hash table sizes
on the fly, which would be another possible solution. Either way it is a
vast improvement on what was there before :-)
[snip]
> > + if (!qid_eq(qd->qd_id, qid))
> > + continue;
> > + if (qd->qd_sbd != sdp)
> > + continue;
> > + if (lockref_get_not_dead(&qd->qd_lockref)) {
> > + list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
> list_lru_del() acquires a lock, but it is from an array whose size
> depends on the NODES_SHIFT Kconfig variable. The array size seems to
> vary from 8 to 16 to 64 to 1024, depending on other configuration info,
> so should be OK.
Yes, we've tried to make use of the generic lru code here, and I'd like
to do the same for the glocks; however, that's more complicated, so we've
not got that far just yet, even though we have made some steps in that
direction. Dave Chinner has pointed out to us that the lru code was
designed such that the lru lock should always be the innermost lock, so
that's the ordering that we've used here.
[snip]
> > @@ -1335,11 +1394,16 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
> > spin_unlock(&qd->qd_lockref.lock);
> >
> > list_del(&qd->qd_list);
> > +
> > /* Also remove if this qd exists in the reclaim list */
> > list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
> > atomic_dec(&sdp->sd_quota_count);
> > spin_unlock(&qd_lock);
> >
> > + spin_lock_bucket(qd->qd_hash);
> > + hlist_bl_del_rcu(&qd->qd_hlist);
>
> Might just be my unfamiliarity with this code, but it took me a bit
> to see the difference between ->qd_hlist and ->qd_list. Of course, until
> I spotted the difference, I was wondering why you were removing the
> item twice. ;-)
>
Well I hope that eventually qd_list might be able to go away. I'm still
working on a plan to improve the quota data writeback, which
should help to make that happen,
Steve.