These are all the remaining patches in my bcachefs tree that touch stuff outside
fs/bcachefs. Not all of them are suitable for inclusion as is; I wanted to get
some discussion first.
* pagecache add lock
This is the only one that touches existing code in nontrivial ways. The problem
it's solving is that there is no existing general mechanism for shooting down
pages in the page cache and keeping them removed, which is a real problem if
you're doing anything that modifies file data and isn't a buffered write.
Historically, the only problematic case has been direct IO, and people have been
willing to say "well, if you mix buffered and direct IO you get what you
deserve", and that's probably not unreasonable. But now we have fallocate insert
range and collapse range, and those are broken in ways I frankly don't want to
think about if they can't ensure consistency with the page cache.
Also, the mechanism truncate uses (i_size and sacrificing a goat) has
historically been rather fragile; IMO it might be a good thing if we switched it
to a more general, rigorous mechanism.
I need this solved for bcachefs because without this mechanism, the page cache
inconsistencies lead to various assertions popping (primarily when, going by
page cache state, we didn't think we needed to get a disk reservation, but then
we do the actual write and disk space accounting says oops, we did need one
after all). And having to reason about what can happen without a locking
mechanism for this is not something I care to spend brain cycles on.
That said, my patch is kind of ugly, and it requires filesystem changes for
other filesystems to take advantage of it. And unfortunately, since one of the
code paths that needs locking is readahead, I don't see any realistic way of
implementing the locking within just bcachefs code.
So I'm hoping someone has an idea for something cleaner (I think I recall
Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
if not I'll polish up my pagecache add lock patch and see what I can do to make
it less ugly, and hopefully other people find it palatable or at least useful.
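To make the intended semantics a bit more concrete, here's a rough sketch of how
a collapse-range style operation might use such a lock. To be clear,
pagecache_block_get()/pagecache_block_put() and mapping->add_lock are made-up
names for illustration, not the actual API in my patch:

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch only: pagecache_block_get()/pagecache_block_put() and
 * mapping->add_lock are made-up names, not the API from the patch.  The idea
 * is that paths which add pages to the page cache (readahead, buffered reads)
 * take the lock in shared mode, while operations that need pages to stay gone
 * take it exclusively around shooting down the range.
 */
static int collapse_range_sketch(struct inode *inode, loff_t offset, loff_t len)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/* block anything from adding pages to this mapping: */
	pagecache_block_get(&mapping->add_lock);

	truncate_pagecache_range(inode, offset, LLONG_MAX);

	/* ... shuffle extents around, knowing the range stays empty ... */
	ret = 0;

	/* readahead etc. may repopulate the page cache from here on: */
	pagecache_block_put(&mapping->add_lock);
	return ret;
}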
* lglocks
They were removed by Peter Zijlstra when the last in-kernel user was removed,
but I've found them useful. His commit message seems to imply he doesn't think
people should be using them, but I'm not sure why. They are a bit niche, though;
I can move them to fs/bcachefs if people would prefer.
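For reference, the usage pattern looks roughly like this - a sketch of the
classic lglock API; the declaration/init details in the version re-added below
may differ slightly (e.g. the new lg_lock_init()):

#include <linux/cpumask.h>
#include <linux/lglock.h>
#include <linux/percpu.h>

DEFINE_LGLOCK(counter_lock);
static DEFINE_PER_CPU(u64, counter);

/* fast path: only this CPU's lock is taken */
static void counter_add(u64 n)
{
	lg_local_lock(&counter_lock);
	this_cpu_add(counter, n);
	lg_local_unlock(&counter_lock);
}

/* slow path: takes every CPU's lock, excluding all the fast paths */
static u64 counter_read(void)
{
	u64 sum = 0;
	int cpu;

	lg_global_lock(&counter_lock);
	for_each_possible_cpu(cpu)
		sum += per_cpu(counter, cpu);
	lg_global_unlock(&counter_lock);

	return sum;
}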
* Generic radix trees
This is a very simple radix tree implementation that can store objects of
arbitrary size, not just pointers/unsigned longs. It could probably replace
flex arrays.
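As a quick illustration, here's a sketch using the API as added later in this
series (note that for now sizeof() of the stored type has to be a power of two,
no bigger than PAGE_SIZE):

#include <linux/generic-radix-tree.h>
#include <linux/gfp.h>

struct entry {
	u64	inum;
	u64	sectors;
};

static DEFINE_GENRADIX(entries, struct entry);

static int record_entry(size_t idx, u64 inum, u64 sectors)
{
	/* allocates (zeroed) interior nodes and the leaf slot as needed: */
	struct entry *e = genradix_ptr_alloc(&entries, idx, GFP_KERNEL);

	if (!e)
		return -ENOMEM;

	e->inum		= inum;
	e->sectors	= sectors;
	return 0;
}

static struct entry *lookup_entry(size_t idx)
{
	/* returns NULL if @idx was never allocated: */
	return genradix_ptr(&entries, idx);
}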
* Dynamic fault injection
I've actually had this code sitting in my tree since forever... I know we have
an existing fault injection framework, but I think this one is quite a bit nicer
to actually use.
It works very much like the dynamic debug infrastructure - for those who aren't
familiar, dynamic debug lets you list and individually enable/disable every
pr_debug() callsite via debugfs.
So to add a fault injection site with this, you just stick a call to
dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true if
you should fail whatever it is you're testing. And then it'll show up in
debugfs, where you can enable/disable faults by file/line number, module, name,
etc.
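For example - the function and fault class names here are made up, but the
control-file syntax matches the parser in lib/dynamic_fault.c below:

#include <linux/dynamic_fault.h>

static int start_gc(void)
{
	/* returns true only when this site has been enabled via debugfs: */
	if (dynamic_fault("gc_start"))
		return -ENOMEM;

	/* ... real work would go here ... */
	return 0;
}

/*
 * Then, at runtime:
 *
 *   cat /sys/kernel/debug/dynamic_fault/control              # list all sites
 *   echo "class gc_start enable"         > .../control       # always fail
 *   echo "func start_gc oneshot"         > .../control       # fail exactly once
 *   echo "module mymodule frequency 100" > .../control       # fail every 100th call
 *   echo "class gc_start disable"        > .../control
 */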
The patch then also adds macros that wrap all the various memory allocation
functions and fail if dynamic_fault("memory") returns true - which means you can
see in debugfs every place you're allocating memory, and fail all of them or
just individual ones (I have tests that iterate over all the faults and flip
them on one by one). I also use it in bcachefs to add fault injection points for
uncommon error paths in the filesystem startup/recovery path, and for various
hard-to-test slowpaths that only happen if we race in weird ways (race_fault()).
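Roughly what that looks like in practice (the surrounding code is made up; with
dynamic_fault.h included, kmalloc() below is the wrapping macro from the patch,
failable via the "memory" class):

#include <linux/atomic.h>
#include <linux/dynamic_fault.h>
#include <linux/slab.h>

struct item {
	int val;
};

static int publish_item(struct item **slot, int val)
{
	/* this allocation can be made to fail from debugfs ("class memory ..."): */
	struct item *new = kmalloc(sizeof(*new), GFP_KERNEL);

	if (!new)
		return -ENOMEM;
	new->val = val;

	/*
	 * race_fault() forces the "somebody else got there first" path,
	 * which is otherwise nearly impossible to hit in testing:
	 */
	if (race_fault() ||
	    cmpxchg(slot, NULL, new) != NULL) {
		kfree(new);
		return -EEXIST;
	}

	return 0;
}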
Kent Overstreet (10):
mm: pagecache add lock
mm: export find_get_pages()
locking: bring back lglocks
locking: export osq_lock()/osq_unlock()
don't use spin_lock_irqsave() unnecessarily
Generic radix trees
bcache: optimize continue_at_nobarrier()
bcache: move closures to lib/
closures: closure_wait_event()
Dynamic fault injection
drivers/md/bcache/Kconfig | 10 +-
drivers/md/bcache/Makefile | 6 +-
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/super.c | 1 -
drivers/md/bcache/util.h | 3 +-
fs/inode.c | 1 +
include/asm-generic/vmlinux.lds.h | 4 +
.../md/bcache => include/linux}/closure.h | 50 +-
include/linux/dynamic_fault.h | 117 +++
include/linux/fs.h | 23 +
include/linux/generic-radix-tree.h | 131 +++
include/linux/lglock.h | 97 +++
include/linux/sched.h | 4 +
init/init_task.c | 1 +
kernel/locking/Makefile | 1 +
kernel/locking/lglock.c | 105 +++
kernel/locking/osq_lock.c | 2 +
lib/Kconfig | 3 +
lib/Kconfig.debug | 14 +
lib/Makefile | 7 +-
{drivers/md/bcache => lib}/closure.c | 17 +-
lib/dynamic_fault.c | 760 ++++++++++++++++++
lib/generic-radix-tree.c | 167 ++++
mm/filemap.c | 92 ++-
mm/page-writeback.c | 5 +-
25 files changed, 1577 insertions(+), 46 deletions(-)
rename {drivers/md/bcache => include/linux}/closure.h (92%)
create mode 100644 include/linux/dynamic_fault.h
create mode 100644 include/linux/generic-radix-tree.h
create mode 100644 include/linux/lglock.h
create mode 100644 kernel/locking/lglock.c
rename {drivers/md/bcache => lib}/closure.c (95%)
create mode 100644 lib/dynamic_fault.c
create mode 100644 lib/generic-radix-tree.c
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/closure.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/include/linux/closure.h b/include/linux/closure.h
index 1072bf2c13..ef22d18f7c 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -375,4 +375,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
continue_at_nobarrier(cl, fn, wq);
}
+#define __closure_wait_event(waitlist, _cond) \
+do { \
+ struct closure cl; \
+ \
+ closure_init_stack(&cl); \
+ \
+ while (1) { \
+ closure_wait(waitlist, &cl); \
+ if (_cond) \
+ break; \
+ closure_sync(&cl); \
+ } \
+ closure_wake_up(waitlist); \
+ closure_sync(&cl); \
+} while (0)
+
+#define closure_wait_event(waitlist, _cond) \
+do { \
+ if (!(_cond)) \
+ __closure_wait_event(waitlist, _cond); \
+} while (0)
+
#endif /* _LINUX_CLOSURE_H */
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
include/asm-generic/vmlinux.lds.h | 4 +
include/linux/dynamic_fault.h | 117 +++++
lib/Kconfig.debug | 5 +
lib/Makefile | 2 +
lib/dynamic_fault.c | 760 ++++++++++++++++++++++++++++++
5 files changed, 888 insertions(+)
create mode 100644 include/linux/dynamic_fault.h
create mode 100644 lib/dynamic_fault.c
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6..a4c9dfcbbd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -246,6 +246,10 @@
VMLINUX_SYMBOL(__start___verbose) = .; \
KEEP(*(__verbose)) \
VMLINUX_SYMBOL(__stop___verbose) = .; \
+ . = ALIGN(8); \
+ VMLINUX_SYMBOL(__start___faults) = .; \
+ *(__faults) \
+ VMLINUX_SYMBOL(__stop___faults) = .; \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
new file mode 100644
index 0000000000..6e7bb56ae8
--- /dev/null
+++ b/include/linux/dynamic_fault.h
@@ -0,0 +1,117 @@
+#ifndef _DYNAMIC_FAULT_H
+#define _DYNAMIC_FAULT_H
+
+#include <linux/bio.h>
+#include <linux/jump_label.h>
+#include <linux/slab.h>
+
+enum dfault_enabled {
+ DFAULT_DISABLED,
+ DFAULT_ENABLED,
+ DFAULT_ONESHOT,
+};
+
+union dfault_state {
+ struct {
+ unsigned enabled:2;
+ unsigned count:30;
+ };
+
+ struct {
+ unsigned v;
+ };
+};
+
+/*
+ * An instance of this structure is created in a special
+ * ELF section at every dynamic fault callsite. At runtime,
+ * the special section is treated as an array of these.
+ */
+struct _dfault {
+ const char *modname;
+ const char *function;
+ const char *filename;
+ const char *class;
+
+ const u16 line;
+
+ unsigned frequency;
+ union dfault_state state;
+
+ struct static_key enabled;
+} __aligned(8);
+
+
+#ifdef CONFIG_DYNAMIC_FAULT
+
+int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
+int dfault_remove_module(char *mod_name);
+bool __dynamic_fault_enabled(struct _dfault *);
+
+#define dynamic_fault(_class) \
+({ \
+ static struct _dfault descriptor \
+ __used __aligned(8) __attribute__((section("__faults"))) = { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .line = __LINE__, \
+ .class = _class, \
+ }; \
+ \
+ static_key_false(&descriptor.enabled) && \
+ __dynamic_fault_enabled(&descriptor); \
+})
+
+#define memory_fault() dynamic_fault("memory")
+#define race_fault() dynamic_fault("race")
+
+#define kmalloc(...) \
+ (memory_fault() ? NULL : kmalloc(__VA_ARGS__))
+#define kzalloc(...) \
+ (memory_fault() ? NULL : kzalloc(__VA_ARGS__))
+#define krealloc(...) \
+ (memory_fault() ? NULL : krealloc(__VA_ARGS__))
+
+#define mempool_alloc(pool, gfp_mask) \
+ ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
+ ? NULL : mempool_alloc(pool, gfp_mask))
+
+#define __get_free_pages(...) \
+ (memory_fault() ? 0 : __get_free_pages(__VA_ARGS__))
+#define alloc_pages_node(...) \
+ (memory_fault() ? NULL : alloc_pages_node(__VA_ARGS__))
+#define alloc_pages_nodemask(...) \
+ (memory_fault() ? NULL : alloc_pages_nodemask(__VA_ARGS__))
+
+#define bio_alloc_bioset(gfp_mask, ...) \
+ ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
+ ? NULL : bio_alloc_bioset(gfp_mask, __VA_ARGS__))
+
+#define bio_clone(bio, gfp_mask) \
+ ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
+ ? NULL : bio_clone(bio, gfp_mask))
+
+#define bio_clone_bioset(bio, gfp_mask, bs) \
+ ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
+ ? NULL : bio_clone_bioset(bio, gfp_mask, bs))
+
+#define bio_kmalloc(...) \
+ (memory_fault() ? NULL : bio_kmalloc(__VA_ARGS__))
+#define bio_clone_kmalloc(...) \
+ (memory_fault() ? NULL : bio_clone_kmalloc(__VA_ARGS__))
+
+#define bio_iov_iter_get_pages(...) \
+ (memory_fault() ? -ENOMEM : bio_iov_iter_get_pages(__VA_ARGS__))
+
+#else /* CONFIG_DYNAMIC_FAULT */
+
+#define dfault_add_module(tab, n, modname) 0
+#define dfault_remove_module(mod) 0
+#define dynamic_fault(_class) 0
+#define memory_fault() 0
+#define race_fault() 0
+
+#endif /* CONFIG_DYNAMIC_FAULT */
+
+#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 427292a759..7b7ca0d813 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1599,6 +1599,11 @@ config LATENCYTOP
Enable this option if you want to use the LatencyTOP tool
to find out which userspace is blocking on what kernel operations.
+config DYNAMIC_FAULT
+ bool "Enable dynamic fault support"
+ default n
+ depends on DEBUG_FS
+
source kernel/trace/Kconfig
config PROVIDE_OHCI1394_DMA_INIT
diff --git a/lib/Makefile b/lib/Makefile
index 66d2231d70..f6f70f4771 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -158,6 +158,8 @@ obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
+obj-$(CONFIG_DYNAMIC_FAULT) += dynamic_fault.o
+
obj-$(CONFIG_NLATTR) += nlattr.o
obj-$(CONFIG_LRU_CACHE) += lru_cache.o
diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
new file mode 100644
index 0000000000..75fc9a1b4b
--- /dev/null
+++ b/lib/dynamic_fault.c
@@ -0,0 +1,760 @@
+/*
+ * lib/dynamic_fault.c
+ *
+ * make dynamic_fault() calls runtime configurable based upon their
+ * source module.
+ *
+ * Copyright (C) 2011 Adam Berkan <[email protected]>
+ * Based on dynamic_debug.c:
+ * Copyright (C) 2008 Jason Baron <[email protected]>
+ * By Greg Banks <[email protected]>
+ * Copyright (c) 2008 Silicon Graphics Inc. All Rights Reserved.
+ *
+ */
+
+#define pr_fmt(fmt) "dfault: " fmt "\n"
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kallsyms.h>
+#include <linux/version.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/list.h>
+#include <linux/sysctl.h>
+#include <linux/ctype.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/dynamic_fault.h>
+#include <linux/debugfs.h>
+#include <linux/slab.h>
+
+#undef kzalloc
+
+extern struct _dfault __start___faults[];
+extern struct _dfault __stop___faults[];
+
+struct dfault_table {
+ struct list_head link;
+ char *mod_name;
+ unsigned int num_dfaults;
+ struct _dfault *dfaults;
+};
+
+struct dfault_query {
+ const char *filename;
+ const char *module;
+ const char *function;
+ const char *class;
+ unsigned int first_line, last_line;
+ unsigned int first_index, last_index;
+
+ unsigned match_line:1;
+ unsigned match_index:1;
+
+ unsigned set_enabled:1;
+ unsigned enabled:2;
+
+ unsigned set_frequency:1;
+ unsigned frequency;
+};
+
+struct dfault_iter {
+ struct dfault_table *table;
+ unsigned int idx;
+};
+
+static DEFINE_MUTEX(dfault_lock);
+static LIST_HEAD(dfault_tables);
+
+bool __dynamic_fault_enabled(struct _dfault *df)
+{
+ union dfault_state old, new;
+ unsigned v = df->state.v;
+ bool ret;
+
+ do {
+ old.v = new.v = v;
+
+ if (new.enabled == DFAULT_DISABLED)
+ return false;
+
+ ret = df->frequency
+ ? ++new.count >= df->frequency
+ : true;
+ if (ret)
+ new.count = 0;
+ if (ret && new.enabled == DFAULT_ONESHOT)
+ new.enabled = DFAULT_DISABLED;
+ } while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
+
+ if (ret)
+ pr_debug("returned true for %s:%u", df->filename, df->line);
+
+ return ret;
+}
+EXPORT_SYMBOL(__dynamic_fault_enabled);
+
+/* Return the last part of a pathname */
+static inline const char *basename(const char *path)
+{
+ const char *tail = strrchr(path, '/');
+
+ return tail ? tail + 1 : path;
+}
+
+/* format a string into buf[] which describes the _dfault's flags */
+static char *dfault_describe_flags(struct _dfault *df, char *buf, size_t buflen)
+{
+ switch (df->state.enabled) {
+ case DFAULT_DISABLED:
+ strlcpy(buf, "disabled", buflen);
+ break;
+ case DFAULT_ENABLED:
+ strlcpy(buf, "enabled", buflen);
+ break;
+ case DFAULT_ONESHOT:
+ strlcpy(buf, "oneshot", buflen);
+ break;
+ default:
+ BUG();
+ }
+
+ return buf;
+}
+
+/*
+ * must be called with dfault_lock held
+ */
+
+/*
+ * Search the tables for _dfault's which match the given
+ * `query' and apply the `flags' and `mask' to them. Tells
+ * the user which dfault's were changed, or whether none
+ * were matched.
+ */
+static int dfault_change(const struct dfault_query *query)
+{
+ struct dfault_table *dt;
+ unsigned int nfound = 0;
+ unsigned i, index = 0;
+ char flagbuf[16];
+
+ /* search for matching dfaults */
+ mutex_lock(&dfault_lock);
+ list_for_each_entry(dt, &dfault_tables, link) {
+
+ /* match against the module name */
+ if (query->module != NULL &&
+ strcmp(query->module, dt->mod_name))
+ continue;
+
+ for (i = 0 ; i < dt->num_dfaults ; i++) {
+ struct _dfault *df = &dt->dfaults[i];
+
+ /* match against the source filename */
+ if (query->filename != NULL &&
+ strcmp(query->filename, df->filename) &&
+ strcmp(query->filename, basename(df->filename)))
+ continue;
+
+ /* match against the function */
+ if (query->function != NULL &&
+ strcmp(query->function, df->function))
+ continue;
+
+ /* match against the class */
+ if (query->class) {
+ size_t len = strlen(query->class);
+
+ if (strncmp(query->class, df->class, len))
+ continue;
+
+ if (df->class[len] && df->class[len] != ':')
+ continue;
+ }
+
+ /* match against the line number range */
+ if (query->match_line &&
+ (df->line < query->first_line ||
+ df->line > query->last_line))
+ continue;
+
+ /* match against the fault index */
+ if (query->match_index &&
+ (index < query->first_index ||
+ index > query->last_index)) {
+ index++;
+ continue;
+ }
+
+ if (query->set_enabled &&
+ query->enabled != df->state.enabled) {
+ if (query->enabled != DFAULT_DISABLED)
+ static_key_slow_inc(&df->enabled);
+ else if (df->state.enabled != DFAULT_DISABLED)
+ static_key_slow_dec(&df->enabled);
+
+ df->state.enabled = query->enabled;
+ }
+
+ if (query->set_frequency)
+ df->frequency = query->frequency;
+
+ pr_debug("changed %s:%d [%s]%s #%d %s",
+ df->filename, df->line, dt->mod_name,
+ df->function, index,
+ dfault_describe_flags(df, flagbuf,
+ sizeof(flagbuf)));
+
+ index++;
+ nfound++;
+ }
+ }
+ mutex_unlock(&dfault_lock);
+
+	pr_debug("%u matches", nfound);
+
+ return nfound ? 0 : -ENOENT;
+}
+
+/*
+ * Split the buffer `buf' into space-separated words.
+ * Handles simple " and ' quoting, i.e. without nested,
+ * embedded or escaped \". Return the number of words
+ * or <0 on error.
+ */
+static int dfault_tokenize(char *buf, char *words[], int maxwords)
+{
+ int nwords = 0;
+
+ while (*buf) {
+ char *end;
+
+ /* Skip leading whitespace */
+ buf = skip_spaces(buf);
+ if (!*buf)
+ break; /* oh, it was trailing whitespace */
+
+ /* Run `end' over a word, either whitespace separated or quoted
+ */
+ if (*buf == '"' || *buf == '\'') {
+ int quote = *buf++;
+
+ for (end = buf ; *end && *end != quote ; end++)
+ ;
+ if (!*end)
+ return -EINVAL; /* unclosed quote */
+ } else {
+ for (end = buf ; *end && !isspace(*end) ; end++)
+ ;
+ BUG_ON(end == buf);
+ }
+ /* Here `buf' is the start of the word, `end' is one past the
+ * end
+ */
+
+ if (nwords == maxwords)
+ return -EINVAL; /* ran out of words[] before bytes */
+ if (*end)
+ *end++ = '\0'; /* terminate the word */
+ words[nwords++] = buf;
+ buf = end;
+ }
+
+ return nwords;
+}
+
+/*
+ * Parse a range.
+ */
+static inline int parse_range(char *str,
+ unsigned int *first,
+ unsigned int *last)
+{
+ char *first_str = str;
+ char *last_str = strchr(first_str, '-');
+
+ if (last_str)
+ *last_str++ = '\0';
+
+ if (kstrtouint(first_str, 10, first))
+ return -EINVAL;
+
+ if (!last_str)
+ *last = *first;
+ else if (kstrtouint(last_str, 10, last))
+ return -EINVAL;
+
+ return 0;
+}
+
+enum dfault_token {
+ TOK_INVALID,
+
+ /* Queries */
+ TOK_FUNC,
+ TOK_FILE,
+ TOK_LINE,
+ TOK_MODULE,
+ TOK_CLASS,
+ TOK_INDEX,
+
+ /* Commands */
+ TOK_DISABLE,
+ TOK_ENABLE,
+ TOK_ONESHOT,
+ TOK_FREQUENCY,
+};
+
+static const struct {
+ const char *str;
+ enum dfault_token tok;
+ unsigned args_required;
+} dfault_token_strs[] = {
+ { "func", TOK_FUNC, 1, },
+ { "file", TOK_FILE, 1, },
+ { "line", TOK_LINE, 1, },
+ { "module", TOK_MODULE, 1, },
+ { "class", TOK_CLASS, 1, },
+ { "index", TOK_INDEX, 1, },
+ { "disable", TOK_DISABLE, 0, },
+ { "enable", TOK_ENABLE, 0, },
+ { "oneshot", TOK_ONESHOT, 0, },
+ { "frequency", TOK_FREQUENCY, 1, },
+};
+
+static enum dfault_token str_to_token(const char *word, unsigned nr_words)
+{
+ unsigned i;
+
+ for (i = 0; i < ARRAY_SIZE(dfault_token_strs); i++)
+ if (!strcmp(word, dfault_token_strs[i].str)) {
+ if (nr_words < dfault_token_strs[i].args_required) {
+ pr_debug("insufficient arguments to \"%s\"",
+ word);
+ return TOK_INVALID;
+ }
+
+ return dfault_token_strs[i].tok;
+ }
+
+ pr_debug("unknown keyword \"%s\"", word);
+
+ return TOK_INVALID;
+}
+
+static int dfault_parse_command(struct dfault_query *query,
+ enum dfault_token tok,
+ char *words[], size_t nr_words)
+{
+ unsigned i = 0;
+ int ret;
+
+ switch (tok) {
+ case TOK_INVALID:
+ return -EINVAL;
+ case TOK_FUNC:
+		query->function = words[i++];
+		return 1;
+ case TOK_FILE:
+ query->filename = words[i++];
+ return 1;
+ case TOK_LINE:
+ ret = parse_range(words[i++],
+ &query->first_line,
+ &query->last_line);
+ if (ret)
+ return ret;
+ query->match_line = true;
+ break;
+ case TOK_MODULE:
+ query->module = words[i++];
+ break;
+ case TOK_CLASS:
+ query->class = words[i++];
+ break;
+ case TOK_INDEX:
+ ret = parse_range(words[i++],
+ &query->first_index,
+ &query->last_index);
+ if (ret)
+ return ret;
+ query->match_index = true;
+ break;
+ case TOK_DISABLE:
+ query->set_enabled = true;
+ query->enabled = DFAULT_DISABLED;
+ break;
+ case TOK_ENABLE:
+ query->set_enabled = true;
+ query->enabled = DFAULT_ENABLED;
+ break;
+ case TOK_ONESHOT:
+ query->set_enabled = true;
+ query->enabled = DFAULT_ONESHOT;
+ break;
+ case TOK_FREQUENCY:
+ query->set_frequency = 1;
+ ret = kstrtouint(words[i++], 10, &query->frequency);
+ if (ret)
+ return ret;
+
+ if (!query->set_enabled) {
+ query->set_enabled = 1;
+ query->enabled = DFAULT_ENABLED;
+ }
+ break;
+ }
+
+ return i;
+}
+
+/*
+ * Parse words[] as a dfault query specification, which is a series
+ * of (keyword, value) pairs chosen from these possibilities:
+ *
+ * func <function-name>
+ * file <full-pathname>
+ * file <base-filename>
+ * module <module-name>
+ * line <lineno>
+ * line <first-lineno>-<last-lineno> // where either may be empty
+ * index <m>-<n> // dynamic faults numbered from <m>
+ * // to <n> inside each matching function
+ */
+static int dfault_parse_query(struct dfault_query *query,
+ char *words[], size_t nr_words)
+{
+ unsigned i = 0;
+
+ while (i < nr_words) {
+ const char *tok_str = words[i++];
+ enum dfault_token tok = str_to_token(tok_str, nr_words - i);
+ int ret = dfault_parse_command(query, tok, words + i,
+ nr_words - i);
+
+ if (ret < 0)
+ return ret;
+ i += ret;
+ BUG_ON(i > nr_words);
+ }
+
+ return 0;
+}
+
+/*
+ * File_ops->write method for <debugfs>/dynamic_fault/control. Gathers the
+ * command text from userspace, parses and executes it.
+ */
+static ssize_t dfault_proc_write(struct file *file, const char __user *ubuf,
+ size_t len, loff_t *offp)
+{
+ struct dfault_query query;
+#define MAXWORDS 9
+ int nwords;
+ char *words[MAXWORDS];
+ char tmpbuf[256];
+ int ret;
+
+ memset(&query, 0, sizeof(query));
+
+ if (len == 0)
+ return 0;
+ /* we don't check *offp -- multiple writes() are allowed */
+ if (len > sizeof(tmpbuf)-1)
+ return -E2BIG;
+ if (copy_from_user(tmpbuf, ubuf, len))
+ return -EFAULT;
+ tmpbuf[len] = '\0';
+
+ pr_debug("read %zu bytes from userspace", len);
+
+ nwords = dfault_tokenize(tmpbuf, words, MAXWORDS);
+ if (nwords < 0)
+ return -EINVAL;
+ if (dfault_parse_query(&query, words, nwords))
+ return -EINVAL;
+
+ /* actually go and implement the change */
+ ret = dfault_change(&query);
+ if (ret < 0)
+ return ret;
+
+ *offp += len;
+ return len;
+}
+
+/* Control file read code */
+
+/*
+ * Set the iterator to point to the first _dfault object
+ * and return a pointer to that first object. Returns
+ * NULL if there are no _dfaults at all.
+ */
+static struct _dfault *dfault_iter_first(struct dfault_iter *iter)
+{
+ if (list_empty(&dfault_tables)) {
+ iter->table = NULL;
+ iter->idx = 0;
+ return NULL;
+ }
+ iter->table = list_entry(dfault_tables.next,
+ struct dfault_table, link);
+ iter->idx = 0;
+ return &iter->table->dfaults[iter->idx];
+}
+
+/*
+ * Advance the iterator to point to the next _dfault
+ * object from the one the iterator currently points at,
+ * and returns a pointer to the new _dfault. Returns
+ * NULL if the iterator has seen all the _dfaults.
+ */
+static struct _dfault *dfault_iter_next(struct dfault_iter *iter)
+{
+ if (iter->table == NULL)
+ return NULL;
+ if (++iter->idx == iter->table->num_dfaults) {
+ /* iterate to next table */
+ iter->idx = 0;
+ if (list_is_last(&iter->table->link, &dfault_tables)) {
+ iter->table = NULL;
+ return NULL;
+ }
+ iter->table = list_entry(iter->table->link.next,
+ struct dfault_table, link);
+ }
+ return &iter->table->dfaults[iter->idx];
+}
+
+/*
+ * Seq_ops start method. Called at the start of every
+ * read() call from userspace. Takes the dfault_lock and
+ * seeks the seq_file's iterator to the given position.
+ */
+static void *dfault_proc_start(struct seq_file *m, loff_t *pos)
+{
+ struct dfault_iter *iter = m->private;
+ struct _dfault *dp;
+ int n = *pos;
+
+ mutex_lock(&dfault_lock);
+
+ if (n < 0)
+ return NULL;
+ dp = dfault_iter_first(iter);
+ while (dp != NULL && --n >= 0)
+ dp = dfault_iter_next(iter);
+ return dp;
+}
+
+/*
+ * Seq_ops next method. Called several times within a read()
+ * call from userspace, with dfault_lock held. Walks to the
+ * next _dfault object with a special case for the header line.
+ */
+static void *dfault_proc_next(struct seq_file *m, void *p, loff_t *pos)
+{
+ struct dfault_iter *iter = m->private;
+ struct _dfault *dp;
+
+ if (p == SEQ_START_TOKEN)
+ dp = dfault_iter_first(iter);
+ else
+ dp = dfault_iter_next(iter);
+ ++*pos;
+ return dp;
+}
+
+/*
+ * Seq_ops show method. Called several times within a read()
+ * call from userspace, with dfault_lock held. Formats the
+ * current _dfault as a single human-readable line, with a
+ * special case for the header line.
+ */
+static int dfault_proc_show(struct seq_file *m, void *p)
+{
+ struct dfault_iter *iter = m->private;
+ struct _dfault *df = p;
+ char flagsbuf[8];
+
+ seq_printf(m, "%s:%u class:%s module:%s func:%s %s \"\"\n",
+ df->filename, df->line, df->class,
+ iter->table->mod_name, df->function,
+ dfault_describe_flags(df, flagsbuf, sizeof(flagsbuf)));
+
+ return 0;
+}
+
+/*
+ * Seq_ops stop method. Called at the end of each read()
+ * call from userspace. Drops dfault_lock.
+ */
+static void dfault_proc_stop(struct seq_file *m, void *p)
+{
+ mutex_unlock(&dfault_lock);
+}
+
+static const struct seq_operations dfault_proc_seqops = {
+ .start = dfault_proc_start,
+ .next = dfault_proc_next,
+ .show = dfault_proc_show,
+ .stop = dfault_proc_stop
+};
+
+/*
+ * File_ops->open method for <debugfs>/dynamic_fault/control. Does the seq_file
+ * setup dance, and also creates an iterator to walk the _dfaults.
+ * Note that we create a seq_file always, even for O_WRONLY files
+ * where it's not needed, as doing so simplifies the ->release method.
+ */
+static int dfault_proc_open(struct inode *inode, struct file *file)
+{
+ struct dfault_iter *iter;
+ int err;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (iter == NULL)
+ return -ENOMEM;
+
+ err = seq_open(file, &dfault_proc_seqops);
+ if (err) {
+ kfree(iter);
+ return err;
+ }
+ ((struct seq_file *) file->private_data)->private = iter;
+ return 0;
+}
+
+static const struct file_operations dfault_proc_fops = {
+ .owner = THIS_MODULE,
+ .open = dfault_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+ .write = dfault_proc_write
+};
+
+/*
+ * Allocate a new dfault_table for the given module
+ * and add it to the global list.
+ */
+int dfault_add_module(struct _dfault *tab, unsigned int n,
+ const char *name)
+{
+ struct dfault_table *dt;
+ char *new_name;
+ const char *func = NULL;
+ int i;
+
+ dt = kzalloc(sizeof(*dt), GFP_KERNEL);
+ if (dt == NULL)
+ return -ENOMEM;
+ new_name = kstrdup(name, GFP_KERNEL);
+ if (new_name == NULL) {
+ kfree(dt);
+ return -ENOMEM;
+ }
+ dt->mod_name = new_name;
+ dt->num_dfaults = n;
+ dt->dfaults = tab;
+
+ mutex_lock(&dfault_lock);
+ list_add_tail(&dt->link, &dfault_tables);
+ mutex_unlock(&dfault_lock);
+
+ /* __attribute__(("section")) emits things in reverse order */
+ for (i = n - 1; i >= 0; i--)
+ if (!func || strcmp(tab[i].function, func))
+ func = tab[i].function;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dfault_add_module);
+
+static void dfault_table_free(struct dfault_table *dt)
+{
+ list_del_init(&dt->link);
+ kfree(dt->mod_name);
+ kfree(dt);
+}
+
+/*
+ * Called in response to a module being unloaded. Removes
+ * any dfault_table's which point at the module.
+ */
+int dfault_remove_module(char *mod_name)
+{
+ struct dfault_table *dt, *nextdt;
+ int ret = -ENOENT;
+
+ mutex_lock(&dfault_lock);
+ list_for_each_entry_safe(dt, nextdt, &dfault_tables, link) {
+ if (!strcmp(dt->mod_name, mod_name)) {
+ dfault_table_free(dt);
+ ret = 0;
+ }
+ }
+ mutex_unlock(&dfault_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dfault_remove_module);
+
+static void dfault_remove_all_tables(void)
+{
+ mutex_lock(&dfault_lock);
+ while (!list_empty(&dfault_tables)) {
+ struct dfault_table *dt = list_entry(dfault_tables.next,
+ struct dfault_table,
+ link);
+ dfault_table_free(dt);
+ }
+ mutex_unlock(&dfault_lock);
+}
+
+static int __init dynamic_fault_init(void)
+{
+ struct dentry *dir, *file;
+ struct _dfault *iter, *iter_start;
+ const char *modname = NULL;
+ int ret = 0;
+ int n = 0;
+
+ dir = debugfs_create_dir("dynamic_fault", NULL);
+ if (!dir)
+ return -ENOMEM;
+ file = debugfs_create_file("control", 0644, dir, NULL,
+ &dfault_proc_fops);
+ if (!file) {
+ debugfs_remove(dir);
+ return -ENOMEM;
+ }
+ if (__start___faults != __stop___faults) {
+ iter = __start___faults;
+ modname = iter->modname;
+ iter_start = iter;
+ for (; iter < __stop___faults; iter++) {
+ if (strcmp(modname, iter->modname)) {
+ ret = dfault_add_module(iter_start, n, modname);
+ if (ret)
+ goto out_free;
+ n = 0;
+ modname = iter->modname;
+ iter_start = iter;
+ }
+ n++;
+ }
+ ret = dfault_add_module(iter_start, n, modname);
+ }
+out_free:
+ if (ret) {
+ dfault_remove_all_tables();
+		debugfs_remove(file);
+		debugfs_remove(dir);
+	}
+	return ret;
+}
+module_init(dynamic_fault_init);
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
block/blk-sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cbea895a55..d6dd7d8198 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -497,6 +497,11 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
return count;
}
+static ssize_t queue_fua_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%u\n", test_bit(QUEUE_FLAG_FUA, &q->queue_flags));
+}
+
static ssize_t queue_dax_show(struct request_queue *q, char *page)
{
return queue_var_show(blk_queue_dax(q), page);
@@ -665,6 +670,11 @@ static struct queue_sysfs_entry queue_wc_entry = {
.store = queue_wc_store,
};
+static struct queue_sysfs_entry queue_fua_entry = {
+ .attr = {.name = "fua", .mode = S_IRUGO },
+ .show = queue_fua_show,
+};
+
static struct queue_sysfs_entry queue_dax_entry = {
.attr = {.name = "dax", .mode = S_IRUGO },
.show = queue_dax_show,
@@ -714,6 +724,7 @@ static struct attribute *default_attrs[] = {
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
+ &queue_fua_entry.attr,
&queue_dax_entry.attr,
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
--
2.17.0
Prep work for bcachefs - being a fork of bcache, it also uses closures
Signed-off-by: Kent Overstreet <[email protected]>
---
drivers/md/bcache/Kconfig | 10 +---------
drivers/md/bcache/Makefile | 6 +++---
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/super.c | 1 -
drivers/md/bcache/util.h | 3 +--
{drivers/md/bcache => include/linux}/closure.h | 17 ++++++++---------
lib/Kconfig | 3 +++
lib/Kconfig.debug | 9 +++++++++
lib/Makefile | 2 ++
{drivers/md/bcache => lib}/closure.c | 17 ++++++++---------
10 files changed, 36 insertions(+), 34 deletions(-)
rename {drivers/md/bcache => include/linux}/closure.h (97%)
rename {drivers/md/bcache => lib}/closure.c (95%)
diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 4d200883c5..45f1094c08 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -1,6 +1,7 @@
config BCACHE
tristate "Block device as cache"
+ select CLOSURES
---help---
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
@@ -15,12 +16,3 @@ config BCACHE_DEBUG
Enables extra debugging tools, allows expensive runtime checks to be
turned on.
-
-config BCACHE_CLOSURES_DEBUG
- bool "Debug closures"
- depends on BCACHE
- select DEBUG_FS
- ---help---
- Keeps all active closures in a linked list and provides a debugfs
- interface to list them, which makes it possible to see asynchronous
- operations that get stuck.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
obj-$(CONFIG_BCACHE) += bcache.o
-bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\
- io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
- util.o writeback.o
+bcache-y := alloc.o bset.o btree.o debug.o extents.o io.o\
+ journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+ writeback.o
CFLAGS_request.o += -Iblock
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 12e5197f18..d954dc44dd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
#include <linux/bcache.h>
#include <linux/bio.h>
+#include <linux/closure.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
@@ -191,7 +192,6 @@
#include "bset.h"
#include "util.h"
-#include "closure.h"
struct bucket {
atomic_t pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f2273143b3..5f1ac8e0a3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2148,7 +2148,6 @@ static int __init bcache_init(void)
mutex_init(&bch_register_lock);
init_waitqueue_head(&unregister_wait);
register_reboot_notifier(&reboot);
- closure_debug_init();
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index a6763db7f0..a75523ed0d 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
#define _BCACHE_UTIL_H
#include <linux/blkdev.h>
+#include <linux/closure.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/sched/clock.h>
@@ -12,8 +13,6 @@
#include <linux/vmalloc.h>
#include <linux/workqueue.h>
-#include "closure.h"
-
#define PAGE_SECTORS (PAGE_SIZE / 512)
struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 2392a46bcd..1072bf2c13 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -154,7 +154,7 @@ struct closure {
atomic_t remaining;
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
#define CLOSURE_MAGIC_DEAD 0xc054dead
#define CLOSURE_MAGIC_ALIVE 0xc054a11e
@@ -183,15 +183,13 @@ static inline void closure_sync(struct closure *cl)
__closure_sync(cl);
}
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
-void closure_debug_init(void);
void closure_debug_create(struct closure *cl);
void closure_debug_destroy(struct closure *cl);
#else
-static inline void closure_debug_init(void) {}
static inline void closure_debug_create(struct closure *cl) {}
static inline void closure_debug_destroy(struct closure *cl) {}
@@ -199,21 +197,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
static inline void closure_set_ip(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->ip = _THIS_IP_;
#endif
}
static inline void closure_set_ret_ip(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->ip = _RET_IP_;
#endif
}
static inline void closure_set_waiting(struct closure *cl, unsigned long f)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->waiting_on = f;
#endif
}
@@ -242,6 +240,7 @@ static inline void closure_queue(struct closure *cl)
*/
BUILD_BUG_ON(offsetof(struct closure, fn)
!= offsetof(struct work_struct, func));
+
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
queue_work(wq, &cl->work);
@@ -254,7 +253,7 @@ static inline void closure_queue(struct closure *cl)
*/
static inline void closure_get(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
BUG_ON((atomic_inc_return(&cl->remaining) &
CLOSURE_REMAINING_MASK) <= 1);
#else
@@ -270,7 +269,7 @@ static inline void closure_get(struct closure *cl)
*/
static inline void closure_init(struct closure *cl, struct closure *parent)
{
- memset(cl, 0, sizeof(struct closure));
+ cl->fn = NULL;
cl->parent = parent;
if (parent)
closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index e960894993..eff939519e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -393,6 +393,9 @@ config ASSOCIATIVE_ARRAY
for more information.
+config CLOSURES
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 64155e310a..427292a759 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1362,6 +1362,15 @@ config DEBUG_CREDENTIALS
source "kernel/rcu/Kconfig.debug"
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgets)"
+ depends on CLOSURES
+ select DEBUG_FS
+ ---help---
+ Keeps all active closures in a linked list and provides a debugfs
+ interface to list them, which makes it possible to see asynchronous
+ operations that get stuck.
+
config DEBUG_WQ_FORCE_RR_CPU
bool "Force round-robin CPU selection for unbound work items"
depends on DEBUG_KERNEL
diff --git a/lib/Makefile b/lib/Makefile
index 5db5a7fb1e..66d2231d70 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -172,6 +172,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
+obj-$(CONFIG_CLOSURES) += closure.o
+
obj-$(CONFIG_CORDIC) += cordic.o
obj-$(CONFIG_DQL) += dynamic_queue_limits.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 95%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index 7f12920c14..025f8d8f48 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -5,13 +5,12 @@
* Copyright 2012 Google, Inc.
*/
+#include <linux/closure.h>
#include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/seq_file.h>
#include <linux/sched/debug.h>
-#include "closure.h"
-
static inline void closure_put_after_sub(struct closure *cl, int flags)
{
int r = flags & CLOSURE_REMAINING_MASK;
@@ -126,7 +125,7 @@ void __sched __closure_sync(struct closure *cl)
}
EXPORT_SYMBOL(__closure_sync);
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
static LIST_HEAD(closure_list);
static DEFINE_SPINLOCK(closure_list_lock);
@@ -162,6 +161,7 @@ static struct dentry *debug;
static int debug_seq_show(struct seq_file *f, void *data)
{
struct closure *cl;
+
spin_lock_irq(&closure_list_lock);
list_for_each_entry(cl, &closure_list, all) {
@@ -180,7 +180,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
seq_printf(f, " W %pF\n",
(void *) cl->waiting_on);
- seq_printf(f, "\n");
+ seq_puts(f, "\n");
}
spin_unlock_irq(&closure_list_lock);
@@ -199,12 +199,11 @@ static const struct file_operations debug_ops = {
.release = single_release
};
-void __init closure_debug_init(void)
+static int __init closure_debug_init(void)
{
debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+ return 0;
}
+late_initcall(closure_debug_init)
#endif
-
-MODULE_AUTHOR("Kent Overstreet <[email protected]>");
-MODULE_LICENSE("GPL");
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 5c81391100..6689102f5d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1610,6 +1610,7 @@ void bio_set_pages_dirty(struct bio *bio)
set_page_dirty_lock(page);
}
}
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
static void bio_release_pages(struct bio *bio)
{
@@ -1693,6 +1694,7 @@ void bio_check_pages_dirty(struct bio *bio)
bio_put(bio);
}
}
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
void generic_start_io_acct(struct request_queue *q, int rw,
unsigned long sectors, struct hd_struct *part)
--
2.17.0
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.
Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 3 +++
block/blk-core.c | 8 +++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index ce8e259f9a..5c81391100 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1775,6 +1775,9 @@ void bio_endio(struct bio *bio)
if (!bio_integrity_endio(bio))
return;
+ if (WARN_ONCE(bio->bi_next, "driver left bi_next not NULL"))
+ bio->bi_next = NULL;
+
/*
* Need to have a real endio function for chained bios, otherwise
* various corner cases will break (like stacking block devices that
diff --git a/block/blk-core.c b/block/blk-core.c
index 66f24798ef..f3cf79198a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -204,6 +204,10 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
+ /*
+ * XXX this code looks suspicious - it's not consistent with advancing
+ * req->bio in caller
+ */
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
bio_endio(bio);
}
@@ -2982,8 +2986,10 @@ bool blk_update_request(struct request *req, blk_status_t error,
struct bio *bio = req->bio;
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
- if (bio_bytes == bio->bi_iter.bi_size)
+ if (bio_bytes == bio->bi_iter.bi_size) {
req->bio = bio->bi_next;
+ bio->bi_next = NULL;
+ }
/* Completion has already been traced */
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
drivers/md/bcache/closure.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962..2392a46bcd 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -244,7 +244,7 @@ static inline void closure_queue(struct closure *cl)
!= offsetof(struct work_struct, func));
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
- BUG_ON(!queue_work(wq, &cl->work));
+ queue_work(wq, &cl->work);
} else
cl->fn(cl);
}
@@ -337,8 +337,13 @@ do { \
*/
#define continue_at_nobarrier(_cl, _fn, _wq) \
do { \
- set_closure_fn(_cl, _fn, _wq); \
- closure_queue(_cl); \
+ closure_set_ip(_cl); \
+ if (_wq) { \
+ INIT_WORK(&(_cl)->work, (void *) _fn); \
+ queue_work((_wq), &(_cl)->work); \
+ } else { \
+ (_fn)(_cl); \
+ } \
} while (0)
/**
--
2.17.0
Since a bio can point to userspace pages (e.g. direct IO), flushing the dcache
for the pages we just copied into is generally necessary.
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index c58544d4bc..ce8e259f9a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -994,6 +994,8 @@ void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
kunmap_atomic(dst_p);
kunmap_atomic(src_p);
+ flush_dcache_page(dst_bv.bv_page);
+
bio_advance_iter(src, src_iter, bytes);
bio_advance_iter(dst, dst_iter, bytes);
}
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
kernel/locking/osq_lock.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f..dfa71347b0 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -202,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
return false;
}
+EXPORT_SYMBOL_GPL(osq_lock);
void osq_unlock(struct optimistic_spin_queue *lock)
{
@@ -229,3 +230,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
if (next)
WRITE_ONCE(next->locked, 1);
}
+EXPORT_SYMBOL_GPL(osq_unlock);
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
mm/filemap.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 31dd888785..78b99019bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
return ret;
}
+EXPORT_SYMBOL(find_get_pages_range);
/**
* find_get_pages_contig - gang contiguous pagecache lookup
--
2.17.0
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL when it should have been NULL - a driver had left
bi_next set to something after calling bio_endio().
Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 83 +++++++++++++++++++++++++----------------
drivers/block/pktcdvd.c | 2 +-
include/linux/bio.h | 5 ++-
3 files changed, 55 insertions(+), 35 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index d7bd765e9e..c58544d4bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -971,32 +971,16 @@ void bio_advance(struct bio *bio, unsigned bytes)
}
EXPORT_SYMBOL(bio_advance);
-void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
- struct bio *src, struct bvec_iter src_iter)
+void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
+ struct bio *src, struct bvec_iter *src_iter)
{
struct bio_vec src_bv, dst_bv;
void *src_p, *dst_p;
unsigned bytes;
- while (1) {
- if (!src_iter.bi_size) {
- src = src->bi_next;
- if (!src)
- break;
-
- src_iter = src->bi_iter;
- }
-
- if (!dst_iter.bi_size) {
- dst = dst->bi_next;
- if (!dst)
- break;
-
- dst_iter = dst->bi_iter;
- }
-
- src_bv = bio_iter_iovec(src, src_iter);
- dst_bv = bio_iter_iovec(dst, dst_iter);
+ while (src_iter->bi_size && dst_iter->bi_size) {
+ src_bv = bio_iter_iovec(src, *src_iter);
+ dst_bv = bio_iter_iovec(dst, *dst_iter);
bytes = min(src_bv.bv_len, dst_bv.bv_len);
@@ -1010,31 +994,66 @@ void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
kunmap_atomic(dst_p);
kunmap_atomic(src_p);
- bio_advance_iter(src, &src_iter, bytes);
- bio_advance_iter(dst, &dst_iter, bytes);
+ bio_advance_iter(src, src_iter, bytes);
+ bio_advance_iter(dst, dst_iter, bytes);
}
}
EXPORT_SYMBOL(bio_copy_data_iter);
/**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
+ * bio_copy_data - copy contents of data buffers from one bio to another
+ * @src: source bio
+ * @dst: destination bio
*
* Stops when it reaches the end of either @src or @dst - that is, copies
* min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
*/
void bio_copy_data(struct bio *dst, struct bio *src)
{
- bio_copy_data_iter(dst, dst->bi_iter,
- src, src->bi_iter);
+ struct bvec_iter src_iter = src->bi_iter;
+ struct bvec_iter dst_iter = dst->bi_iter;
+
+ bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
}
EXPORT_SYMBOL(bio_copy_data);
+/**
+ * bio_list_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * Stops when it reaches the end of either the @src list or @dst list - that is,
+ * copies min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of
+ * bios).
+ */
+void bio_list_copy_data(struct bio *dst, struct bio *src)
+{
+ struct bvec_iter src_iter = src->bi_iter;
+ struct bvec_iter dst_iter = dst->bi_iter;
+
+ while (1) {
+ if (!src_iter.bi_size) {
+ src = src->bi_next;
+ if (!src)
+ break;
+
+ src_iter = src->bi_iter;
+ }
+
+ if (!dst_iter.bi_size) {
+ dst = dst->bi_next;
+ if (!dst)
+ break;
+
+ dst_iter = dst->bi_iter;
+ }
+
+ bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
+ }
+}
+EXPORT_SYMBOL(bio_list_copy_data);
+
struct bio_map_data {
int is_our_pages;
struct iov_iter iter;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index c61d20c9f3..00ea788b17 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1285,7 +1285,7 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
* Fill-in bvec with data from orig_bios.
*/
spin_lock(&pkt->lock);
- bio_copy_data(pkt->w_bio, pkt->orig_bios.head);
+ bio_list_copy_data(pkt->w_bio, pkt->orig_bios.head);
pkt_set_state(pkt, PACKET_WRITE_WAIT_STATE);
spin_unlock(&pkt->lock);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a6ee955a8..98b175cc00 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,9 +505,10 @@ static inline void bio_flush_dcache_pages(struct bio *bi)
}
#endif
-extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
- struct bio *src, struct bvec_iter src_iter);
+extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
+ struct bio *src, struct bvec_iter *src_iter);
extern void bio_copy_data(struct bio *dst, struct bio *src);
+extern void bio_list_copy_data(struct bio *dst, struct bio *src);
extern void bio_free_pages(struct bio *bio);
extern struct bio *bio_copy_user_iov(struct request_queue *,
--
2.17.0
Minor optimization - remove a pointer indirection when using fs_bio_set.
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 7 +++----
block/blk-core.c | 2 +-
drivers/target/target_core_iblock.c | 2 +-
include/linux/bio.h | 4 ++--
4 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 980befd919..b7cdad6fc4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -53,7 +53,7 @@ static struct biovec_slab bvec_slabs[BVEC_POOL_NR] __read_mostly = {
* fs_bio_set is the bio_set containing bio and iovec memory pools used by
* IO code that does not need private memory pools.
*/
-struct bio_set *fs_bio_set;
+struct bio_set fs_bio_set;
EXPORT_SYMBOL(fs_bio_set);
/*
@@ -2055,11 +2055,10 @@ static int __init init_bio(void)
bio_integrity_init();
biovec_init_slabs();
- fs_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
- if (!fs_bio_set)
+ if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS))
panic("bio: can't allocate bios\n");
- if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
+ if (bioset_integrity_create(&fs_bio_set, BIO_POOL_SIZE))
panic("bio: can't create integrity pool\n");
return 0;
diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fa..66f24798ef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3409,7 +3409,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
struct bio *bio, *bio_src;
if (!bs)
- bs = fs_bio_set;
+ bs = &fs_bio_set;
__rq_for_each_bio(bio_src, rq_src) {
bio = bio_clone_fast(bio_src, gfp_mask, bs);
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 07c814c426..c969c01c7c 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -164,7 +164,7 @@ static int iblock_configure_device(struct se_device *dev)
goto out_blkdev_put;
}
pr_debug("IBLOCK setup BIP bs->bio_integrity_pool: %p\n",
- bs->bio_integrity_pool);
+ &bs->bio_integrity_pool);
}
dev->dev_attrib.hw_pi_prot_type = dev->dev_attrib.pi_prot_type;
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index fa3cf94a50..91b02520e2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -423,11 +423,11 @@ extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
-extern struct bio_set *fs_bio_set;
+extern struct bio_set fs_bio_set;
static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
- return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, &fs_bio_set);
}
static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/generic-radix-tree.h | 131 ++++++++++++++++++++++
lib/Makefile | 3 +-
lib/generic-radix-tree.c | 167 +++++++++++++++++++++++++++++
3 files changed, 300 insertions(+), 1 deletion(-)
create mode 100644 include/linux/generic-radix-tree.h
create mode 100644 lib/generic-radix-tree.c
diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
new file mode 100644
index 0000000000..4ede23e55f
--- /dev/null
+++ b/include/linux/generic-radix-tree.h
@@ -0,0 +1,131 @@
+#ifndef _LINUX_GENERIC_RADIX_TREE_H
+#define _LINUX_GENERIC_RADIX_TREE_H
+
+/*
+ * Generic radix trees/sparse arrays:
+ *
+ * A generic radix tree has all nodes of size PAGE_SIZE - both leaves and
+ * interior nodes.
+ */
+
+#include <asm/page.h>
+#include <linux/bug.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+
+struct genradix_node;
+
+struct __genradix {
+ struct genradix_node *root;
+ size_t depth;
+};
+
+/*
+ * NOTE: currently, sizeof(_type) must be a power of two and not larger than
+ * PAGE_SIZE:
+ */
+
+#define __GENRADIX_INITIALIZER \
+ { \
+ .tree = { \
+ .root = NULL, \
+ .depth = 0, \
+ } \
+ }
+
+/*
+ * We use a 0 size array to stash the type we're storing without taking any
+ * space at runtime - then the various accessor macros can use typeof() to get
+ * to it for casts/sizeof - we also force the alignment so that storing a type
+ * with a ridiculous alignment doesn't blow up the alignment or size of the
+ * genradix.
+ */
+
+#define GENRADIX(_type) \
+struct { \
+ struct __genradix tree; \
+ _type type[0] __aligned(1); \
+}
+
+#define DEFINE_GENRADIX(_name, _type) \
+ GENRADIX(_type) _name = __GENRADIX_INITIALIZER
+
+#define genradix_init(_radix) \
+do { \
+ *(_radix) = (typeof(*_radix)) __GENRADIX_INITIALIZER; \
+} while (0)
+
+void __genradix_free(struct __genradix *);
+
+#define genradix_free(_radix) __genradix_free(&(_radix)->tree)
+
+static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
+{
+ BUILD_BUG_ON(obj_size > PAGE_SIZE);
+
+ if (!is_power_of_2(obj_size)) {
+ size_t objs_per_page = PAGE_SIZE / obj_size;
+
+ return (idx / objs_per_page) * PAGE_SIZE +
+ (idx % objs_per_page) * obj_size;
+ } else {
+ return idx * obj_size;
+ }
+}
+
+#define __genradix_cast(_radix) (typeof((_radix)->type[0]) *)
+#define __genradix_obj_size(_radix) sizeof((_radix)->type[0])
+#define __genradix_idx_to_offset(_radix, _idx) \
+ __idx_to_offset(_idx, __genradix_obj_size(_radix))
+
+void *__genradix_ptr(struct __genradix *, size_t);
+
+/* Returns a pointer to element at @_idx */
+#define genradix_ptr(_radix, _idx) \
+ (__genradix_cast(_radix) \
+ __genradix_ptr(&(_radix)->tree, \
+ __genradix_idx_to_offset(_radix, _idx)))
+
+void *__genradix_ptr_alloc(struct __genradix *, size_t, gfp_t);
+
+/* Returns a pointer to element at @_idx, allocating it if necessary */
+#define genradix_ptr_alloc(_radix, _idx, _gfp) \
+ (__genradix_cast(_radix) \
+ __genradix_ptr_alloc(&(_radix)->tree, \
+ __genradix_idx_to_offset(_radix, _idx), \
+ _gfp))
+
+struct genradix_iter {
+ size_t offset;
+ size_t pos;
+};
+
+#define genradix_iter_init(_radix, _idx) \
+ ((struct genradix_iter) { \
+ .pos = (_idx), \
+ .offset = __genradix_idx_to_offset((_radix), (_idx)),\
+ })
+
+void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
+
+#define genradix_iter_peek(_iter, _radix) \
+ (__genradix_cast(_radix) \
+ __genradix_iter_peek(_iter, &(_radix)->tree, \
+ PAGE_SIZE / __genradix_obj_size(_radix)))
+
+static inline void __genradix_iter_advance(struct genradix_iter *iter,
+ size_t obj_size)
+{
+ iter->offset += obj_size;
+
+ if (!is_power_of_2(obj_size) &&
+ (iter->offset & (PAGE_SIZE - 1)) + obj_size > PAGE_SIZE)
+ iter->offset = round_up(iter->offset, PAGE_SIZE);
+
+ iter->pos++;
+}
+
+#define genradix_iter_advance(_iter, _radix) \
+ __genradix_iter_advance(_iter, __genradix_obj_size(_radix))
+
+#endif /* _LINUX_GENERIC_RADIX_TREE_H */
diff --git a/lib/Makefile b/lib/Makefile
index a90d4fcd74..5db5a7fb1e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
- once.o refcount.o usercopy.o errseq.o bucket_locks.o
+ once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+ generic-radix-tree.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
new file mode 100644
index 0000000000..5c4a275ea3
--- /dev/null
+++ b/lib/generic-radix-tree.c
@@ -0,0 +1,167 @@
+
+#include <linux/export.h>
+#include <linux/generic-radix-tree.h>
+#include <linux/gfp.h>
+
+#define GENRADIX_ARY (PAGE_SIZE / sizeof(struct genradix_node *))
+#define GENRADIX_ARY_SHIFT ilog2(GENRADIX_ARY)
+
+struct genradix_node {
+ union {
+ /* Interior node: */
+ struct genradix_node *children[GENRADIX_ARY];
+
+ /* Leaf: */
+ u8 data[PAGE_SIZE];
+ };
+};
+
+static inline unsigned genradix_depth_shift(unsigned depth)
+{
+ return PAGE_SHIFT + GENRADIX_ARY_SHIFT * depth;
+}
+
+/*
+ * Returns size (of data, in bytes) that a tree of a given depth holds:
+ */
+static inline size_t genradix_depth_size(unsigned depth)
+{
+ return 1UL << genradix_depth_shift(depth);
+}
+
+/*
+ * Returns pointer to the specified byte @offset within @radix, or NULL if not
+ * allocated
+ */
+void *__genradix_ptr(struct __genradix *radix, size_t offset)
+{
+ size_t level = radix->depth;
+ struct genradix_node *n = radix->root;
+
+ if (offset >= genradix_depth_size(radix->depth))
+ return NULL;
+
+ while (1) {
+ if (!n)
+ return NULL;
+ if (!level)
+ break;
+
+ level--;
+
+ n = n->children[offset >> genradix_depth_shift(level)];
+ offset &= genradix_depth_size(level) - 1;
+ }
+
+ return &n->data[offset];
+}
+EXPORT_SYMBOL(__genradix_ptr);
+
+/*
+ * Returns pointer to the specified byte @offset within @radix, allocating it if
+ * necessary - newly allocated slots are always zeroed out:
+ */
+void *__genradix_ptr_alloc(struct __genradix *radix, size_t offset,
+ gfp_t gfp_mask)
+{
+ struct genradix_node **n;
+ size_t level;
+
+ /* Increase tree depth if necessary: */
+
+ while (offset >= genradix_depth_size(radix->depth)) {
+ struct genradix_node *new_root =
+ (void *) __get_free_page(gfp_mask|__GFP_ZERO);
+
+ if (!new_root)
+ return NULL;
+
+ new_root->children[0] = radix->root;
+ radix->root = new_root;
+ radix->depth++;
+ }
+
+ n = &radix->root;
+ level = radix->depth;
+
+ while (1) {
+ if (!*n) {
+ *n = (void *) __get_free_page(gfp_mask|__GFP_ZERO);
+ if (!*n)
+ return NULL;
+ }
+
+ if (!level)
+ break;
+
+ level--;
+
+ n = &(*n)->children[offset >> genradix_depth_shift(level)];
+ offset &= genradix_depth_size(level) - 1;
+ }
+
+ return &(*n)->data[offset];
+}
+EXPORT_SYMBOL(__genradix_ptr_alloc);
+
+void *__genradix_iter_peek(struct genradix_iter *iter,
+ struct __genradix *radix,
+ size_t objs_per_page)
+{
+ struct genradix_node *n;
+ size_t level, i;
+
+ if (!radix->root)
+ return NULL;
+restart:
+ if (iter->offset >= genradix_depth_size(radix->depth))
+ return NULL;
+
+ n = radix->root;
+ level = radix->depth;
+
+ while (level) {
+ level--;
+
+ i = (iter->offset >> genradix_depth_shift(level)) &
+ (GENRADIX_ARY - 1);
+
+ while (!n->children[i]) {
+ i++;
+ iter->offset = round_down(iter->offset +
+ genradix_depth_size(level),
+ genradix_depth_size(level));
+ iter->pos = (iter->offset >> PAGE_SHIFT) *
+ objs_per_page;
+ if (i == GENRADIX_ARY)
+ goto restart;
+ }
+
+ n = n->children[i];
+ }
+
+ return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek);
+
+static void genradix_free_recurse(struct genradix_node *n, unsigned level)
+{
+ if (level) {
+ unsigned i;
+
+ for (i = 0; i < GENRADIX_ARY; i++)
+ if (n->children[i])
+ genradix_free_recurse(n->children[i], level - 1);
+ }
+
+ free_page((unsigned long) n);
+}
+
+void __genradix_free(struct __genradix *radix)
+{
+ genradix_free_recurse(radix->root, radix->depth);
+
+ radix->root = NULL;
+ radix->depth = 0;
+}
+EXPORT_SYMBOL(__genradix_free);
--
2.17.0
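To illustrate the generic radix tree API added above, here is a minimal usage sketch. struct foo, foo_array and foo_example() are made up for illustration; the wrapper struct open-codes the layout the genradix_*() macros expect (an embedded struct __genradix plus a zero-length 'type' array giving the element type), which the header's declaration macro would normally provide:

#include <linux/generic-radix-tree.h>

struct foo {
        int     a;
        int     b;
};

/* A zeroed genradix is a valid empty tree: */
static struct {
        struct __genradix       tree;
        struct foo              type[0];
} foo_array;

static int foo_example(void)
{
        struct genradix_iter iter = genradix_iter_init(&foo_array, 0);
        struct foo *f;

        /* Allocate (and zero) the slot for index 12, growing the tree as needed: */
        f = genradix_ptr_alloc(&foo_array, 12, GFP_KERNEL);
        if (!f)
                return -ENOMEM;
        f->a = 1;

        /* Lookup without allocating - NULL if that part of the tree isn't allocated: */
        if (!genradix_ptr(&foo_array, 1000))
                pr_debug("index 1000 not allocated\n");

        /* Iterate over every slot in allocated pages: */
        while ((f = genradix_iter_peek(&iter, &foo_array))) {
                pr_debug("pos %zu: a=%i\n", iter.pos, f->a);
                genradix_iter_advance(&iter, &foo_array);
        }

        return 0;
}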
bcachefs makes use of them - also, add a proper lg_lock_init()
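For illustration, typical usage with the dynamically-initialized form this patch adds looks roughly like this (foo_lock and the foo_*() functions are made up; error handling trimmed):

static struct lglock foo_lock;

static int foo_init(void)
{
        return lg_lock_init(&foo_lock);  /* allocates the percpu spinlocks */
}

static void foo_fast_path(void)
{
        lg_local_lock(&foo_lock);        /* cheap: only this cpu's lock */
        /* ... touch this cpu's data ... */
        lg_local_unlock(&foo_lock);
}

static void foo_slow_path(void)
{
        lg_global_lock(&foo_lock);       /* slow: takes every cpu's lock */
        /* ... touch all cpus' data ... */
        lg_global_unlock(&foo_lock);
}

static void foo_exit(void)
{
        lg_lock_free(&foo_lock);
}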
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/lglock.h | 97 +++++++++++++++++++++++++++++++++++++
kernel/locking/Makefile | 1 +
kernel/locking/lglock.c | 105 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 203 insertions(+)
create mode 100644 include/linux/lglock.h
create mode 100644 kernel/locking/lglock.c
diff --git a/include/linux/lglock.h b/include/linux/lglock.h
new file mode 100644
index 0000000000..c1fbe64bd2
--- /dev/null
+++ b/include/linux/lglock.h
@@ -0,0 +1,97 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ * another CPU's data allowed but possibly subject to contention, and to
+ * provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ * very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
+
+#ifdef CONFIG_SMP
+
+struct lglock {
+ arch_spinlock_t __percpu *lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+#define DEFINE_LGLOCK(name) \
+ static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) \
+ = __ARCH_SPIN_LOCK_UNLOCKED; \
+ struct lglock name = { .lock = &name ## _lock }
+
+#define DEFINE_STATIC_LGLOCK(name) \
+ static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) \
+ = __ARCH_SPIN_LOCK_UNLOCKED; \
+ static struct lglock name = { .lock = &name ## _lock }
+
+static inline void lg_lock_free(struct lglock *lg)
+{
+ free_percpu(lg->lock);
+}
+
+#define lg_lock_lockdep_init(lock) \
+do { \
+ static struct lock_class_key __key; \
+ \
+ lockdep_init_map(&(lock)->dep_map, #lock, &__key, 0); \
+} while (0)
+
+static inline int __lg_lock_init(struct lglock *lg)
+{
+ lg->lock = alloc_percpu(*lg->lock);
+ return lg->lock ? 0 : -ENOMEM;
+}
+
+#define lg_lock_init(lock) \
+({ \
+ lg_lock_lockdep_init(lock); \
+ __lg_lock_init(lock); \
+})
+
+void lg_local_lock(struct lglock *lg);
+void lg_local_unlock(struct lglock *lg);
+void lg_local_lock_cpu(struct lglock *lg, int cpu);
+void lg_local_unlock_cpu(struct lglock *lg, int cpu);
+
+void lg_double_lock(struct lglock *lg, int cpu1, int cpu2);
+void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2);
+
+void lg_global_lock(struct lglock *lg);
+void lg_global_unlock(struct lglock *lg);
+
+#else
+/* When !CONFIG_SMP, map lglock to spinlock */
+#define lglock spinlock
+#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
+#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
+#define lg_lock_init(lg) ({ spin_lock_init(lg); 0; })
+#define lg_lock_free(lg) do {} while (0)
+#define lg_local_lock spin_lock
+#define lg_local_unlock spin_unlock
+#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
+#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
+#define lg_global_lock spin_lock
+#define lg_global_unlock spin_unlock
+#endif
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f23af..e5bb62823d 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
+obj-$(CONFIG_SMP) += lglock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
diff --git a/kernel/locking/lglock.c b/kernel/locking/lglock.c
new file mode 100644
index 0000000000..051feaccc4
--- /dev/null
+++ b/kernel/locking/lglock.c
@@ -0,0 +1,105 @@
+/* See include/linux/lglock.h for description */
+#include <linux/module.h>
+#include <linux/lglock.h>
+#include <linux/cpu.h>
+#include <linux/string.h>
+
+/*
+ * Note there is no uninit, so lglocks cannot be defined in
+ * modules (but it's fine to use them from there)
+ * Could be added though, just undo lg_lock_init
+ */
+
+void lg_local_lock(struct lglock *lg)
+{
+ arch_spinlock_t *lock;
+
+ preempt_disable();
+ lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+ lock = this_cpu_ptr(lg->lock);
+ arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock);
+
+void lg_local_unlock(struct lglock *lg)
+{
+ arch_spinlock_t *lock;
+
+ lock_release(&lg->dep_map, 1, _RET_IP_);
+ lock = this_cpu_ptr(lg->lock);
+ arch_spin_unlock(lock);
+ preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock);
+
+void lg_local_lock_cpu(struct lglock *lg, int cpu)
+{
+ arch_spinlock_t *lock;
+
+ preempt_disable();
+ lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+ lock = per_cpu_ptr(lg->lock, cpu);
+ arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock_cpu);
+
+void lg_local_unlock_cpu(struct lglock *lg, int cpu)
+{
+ arch_spinlock_t *lock;
+
+ lock_release(&lg->dep_map, 1, _RET_IP_);
+ lock = per_cpu_ptr(lg->lock, cpu);
+ arch_spin_unlock(lock);
+ preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock_cpu);
+
+void lg_double_lock(struct lglock *lg, int cpu1, int cpu2)
+{
+ BUG_ON(cpu1 == cpu2);
+
+ /* lock in cpu order, just like lg_global_lock */
+ if (cpu2 < cpu1)
+ swap(cpu1, cpu2);
+
+ preempt_disable();
+ lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+ arch_spin_lock(per_cpu_ptr(lg->lock, cpu1));
+ arch_spin_lock(per_cpu_ptr(lg->lock, cpu2));
+}
+
+void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2)
+{
+ lock_release(&lg->dep_map, 1, _RET_IP_);
+ arch_spin_unlock(per_cpu_ptr(lg->lock, cpu1));
+ arch_spin_unlock(per_cpu_ptr(lg->lock, cpu2));
+ preempt_enable();
+}
+
+void lg_global_lock(struct lglock *lg)
+{
+ int i;
+
+ preempt_disable();
+ lock_acquire_exclusive(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+ for_each_possible_cpu(i) {
+ arch_spinlock_t *lock;
+ lock = per_cpu_ptr(lg->lock, i);
+ arch_spin_lock(lock);
+ }
+}
+EXPORT_SYMBOL(lg_global_lock);
+
+void lg_global_unlock(struct lglock *lg)
+{
+ int i;
+
+ lock_release(&lg->dep_map, 1, _RET_IP_);
+ for_each_possible_cpu(i) {
+ arch_spinlock_t *lock;
+ lock = per_cpu_ptr(lg->lock, i);
+ arch_spin_unlock(lock);
+ }
+ preempt_enable();
+}
+EXPORT_SYMBOL(lg_global_unlock);
--
2.17.0
Signed-off-by: Kent Overstreet <[email protected]>
---
mm/page-writeback.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c..17ccc294c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2460,20 +2460,19 @@ int __set_page_dirty_nobuffers(struct page *page)
lock_page_memcg(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- unsigned long flags;
if (!mapping) {
unlock_page_memcg(page);
return 1;
}
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irq(&mapping->tree_lock);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree, page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irq(&mapping->tree_lock);
unlock_page_memcg(page);
if (mapping->host) {
--
2.17.0
Similarly to mempool_init()/mempool_exit(), take a pointer indirection
out of allocation/freeing by allowing biosets to be embedded in other
structs.
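For example, usage would look something like this (a sketch - struct foo_device and the pool size are made up, not part of the patch):

struct foo_device {
        struct bio_set  bio_set;
        /* ... */
};

static int foo_device_init(struct foo_device *d)
{
        /* bio_set embedded in foo_device, no separate allocation: */
        return bioset_init(&d->bio_set, 4, 0, BIOSET_NEED_BVECS);
}

static void foo_device_exit(struct foo_device *d)
{
        /* safe even if foo_device was kzalloc()'d and never initialized: */
        bioset_exit(&d->bio_set);
}

Bios are then allocated with bio_alloc_bioset(gfp, nr_iovecs, &d->bio_set) exactly as before.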
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 93 +++++++++++++++++++++++++++++++--------------
include/linux/bio.h | 2 +
2 files changed, 67 insertions(+), 28 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 360e9bcea5..980befd919 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1856,21 +1856,83 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
return mempool_init_slab_pool(pool, pool_entries, bp->slab);
}
-void bioset_free(struct bio_set *bs)
+/*
+ * bioset_exit - exit a bioset initialized with bioset_init()
+ *
+ * May be called on a zeroed but uninitialized bioset (i.e. allocated with
+ * kzalloc()).
+ */
+void bioset_exit(struct bio_set *bs)
{
if (bs->rescue_workqueue)
destroy_workqueue(bs->rescue_workqueue);
+ bs->rescue_workqueue = NULL;
mempool_exit(&bs->bio_pool);
mempool_exit(&bs->bvec_pool);
bioset_integrity_free(bs);
- bio_put_slab(bs);
+ if (bs->bio_slab)
+ bio_put_slab(bs);
+ bs->bio_slab = NULL;
+}
+EXPORT_SYMBOL(bioset_exit);
+void bioset_free(struct bio_set *bs)
+{
+ bioset_exit(bs);
kfree(bs);
}
EXPORT_SYMBOL(bioset_free);
+/**
+ * bioset_init - Initialize a bio_set
+ * @pool_size: Number of bio and bio_vecs to cache in the mempool
+ * @front_pad: Number of bytes to allocate in front of the returned bio
+ * @flags: Flags to modify behavior, currently %BIOSET_NEED_BVECS
+ * and %BIOSET_NEED_RESCUER
+ *
+ * Similar to bioset_create(), but initializes a passed-in bioset instead of
+ * separately allocating it.
+ */
+int bioset_init(struct bio_set *bs,
+ unsigned int pool_size,
+ unsigned int front_pad,
+ int flags)
+{
+ unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
+
+ bs->front_pad = front_pad;
+
+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
+ bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
+ if (!bs->bio_slab)
+ return -ENOMEM;
+
+ if (mempool_init_slab_pool(&bs->bio_pool, pool_size, bs->bio_slab))
+ goto bad;
+
+ if ((flags & BIOSET_NEED_BVECS) &&
+ biovec_init_pool(&bs->bvec_pool, pool_size))
+ goto bad;
+
+ if (!(flags & BIOSET_NEED_RESCUER))
+ return 0;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;
+
+ return 0;
+bad:
+ bioset_exit(bs);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(bioset_init);
+
/**
* bioset_create - Create a bio_set
* @pool_size: Number of bio and bio_vecs to cache in the mempool
@@ -1895,43 +1957,18 @@ struct bio_set *bioset_create(unsigned int pool_size,
unsigned int front_pad,
int flags)
{
- unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;
bs = kzalloc(sizeof(*bs), GFP_KERNEL);
if (!bs)
return NULL;
- bs->front_pad = front_pad;
-
- spin_lock_init(&bs->rescue_lock);
- bio_list_init(&bs->rescue_list);
- INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
-
- bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
- if (!bs->bio_slab) {
+ if (bioset_init(bs, pool_size, front_pad, flags)) {
kfree(bs);
return NULL;
}
- if (mempool_init_slab_pool(&bs->bio_pool, pool_size, bs->bio_slab))
- goto bad;
-
- if ((flags & BIOSET_NEED_BVECS) &&
- biovec_init_pool(&bs->bvec_pool, pool_size))
- goto bad;
-
- if (!(flags & BIOSET_NEED_RESCUER))
- return bs;
-
- bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
- if (!bs->rescue_workqueue)
- goto bad;
-
return bs;
-bad:
- bioset_free(bs);
- return NULL;
}
EXPORT_SYMBOL(bioset_create);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 720f7261d0..fa3cf94a50 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -406,6 +406,8 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
return bio_split(bio, sectors, gfp, bs);
}
+extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags);
+extern void bioset_exit(struct bio_set *);
extern struct bio_set *bioset_create(unsigned int, unsigned int, int flags);
enum {
BIOSET_NEED_BVECS = BIT(0),
--
2.17.0
Add a per-address_space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and
hopefully also making truncate and dio a bit saner.
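A rough sketch of how a filesystem would use the block side of the lock around such an operation (foo_collapse_range() is hypothetical, not taken from this patch):

static int foo_collapse_range(struct inode *inode, loff_t offset)
{
        struct address_space *mapping = inode->i_mapping;
        int ret;

        /* block new page cache adds (readahead, faults, buffered reads): */
        pagecache_block_get(&mapping->add_lock);

        ret = filemap_write_and_wait_range(mapping, offset, LLONG_MAX);
        if (ret)
                goto out;

        truncate_pagecache_range(inode, offset, LLONG_MAX);

        /* ... shift extents on disk; nothing can repopulate the range ... */
out:
        pagecache_block_put(&mapping->add_lock);
        return ret;
}

pagecache_block_get() also records the lock in current->pagecache_lock, so page faults taken by this task (e.g. via get_user_pages()) skip readahead and don't deadlock against it - see the filemap_fault() hunk below.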
Signed-off-by: Kent Overstreet <[email protected]>
---
fs/inode.c | 1 +
include/linux/fs.h | 23 +++++++++++
include/linux/sched.h | 4 ++
init/init_task.c | 1 +
mm/filemap.c | 91 ++++++++++++++++++++++++++++++++++++++++---
5 files changed, 115 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ef362364d3..e7aaa39adb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ void address_space_init_once(struct address_space *mapping)
{
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
+ pagecache_lock_init(&mapping->add_lock);
spin_lock_init(&mapping->tree_lock);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf76761..18d2886a44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,9 +388,32 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+ atomic_long_t v;
+ wait_queue_head_t wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+ atomic_long_set(&lock->v, 0);
+ init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
+ struct pagecache_lock add_lock; /* protects adding new pages */
spinlock_t tree_lock; /* and lock protecting it */
atomic_t i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root_cached i_mmap; /* tree of private and shared mappings */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a90..e58465f61a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,7 @@ struct io_context;
struct mempolicy;
struct nameidata;
struct nsproxy;
+struct pagecache_lock;
struct perf_event_context;
struct pid_namespace;
struct pipe_inode_info;
@@ -865,6 +866,9 @@ struct task_struct {
unsigned int in_ubsan;
#endif
+ /* currently held lock, for avoiding recursing in fault path: */
+ struct pagecache_lock *pagecache_lock;
+
/* Journalling filesystem info: */
void *journal_info;
diff --git a/init/init_task.c b/init/init_task.c
index 3ac6e754cf..308d46eef9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,6 +106,7 @@ struct task_struct init_task
},
.blocked = {{0}},
.alloc_lock = __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+ .pagecache_lock = NULL,
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a..31dd888785 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,6 +111,73 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+ BUG_ON(atomic_long_read(&lock->v) == 0);
+
+ if (atomic_long_sub_return_release(i, &lock->v) == 0)
+ wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+ long v = atomic_long_read(&lock->v), old;
+
+ do {
+ old = v;
+
+ if (i > 0 ? v < 0 : v > 0)
+ return false;
+ } while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+ old, old + i)) != old);
+ return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+ wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+ __pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+ __pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+ __pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_put);
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+ __pagecache_lock_get(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_get);
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+ BUG_ON(current->pagecache_lock != lock);
+ current->pagecache_lock = NULL;
+ __pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(pagecache_block_put);
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+ __pagecache_lock_get(lock, -1);
+ BUG_ON(current->pagecache_lock);
+ current->pagecache_lock = lock;
+}
+EXPORT_SYMBOL(pagecache_block_get);
+
static int page_cache_tree_insert(struct address_space *mapping,
struct page *page, void **shadowp)
{
@@ -834,18 +901,21 @@ static int __add_to_page_cache_locked(struct page *page,
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
+ if (current->pagecache_lock != &mapping->add_lock)
+ pagecache_add_get(&mapping->add_lock);
+
if (!huge) {
error = mem_cgroup_try_charge(page, current->mm,
gfp_mask, &memcg, false);
if (error)
- return error;
+ goto err;
}
error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
- return error;
+ goto err;
}
get_page(page);
@@ -865,7 +935,11 @@ static int __add_to_page_cache_locked(struct page *page,
if (!huge)
mem_cgroup_commit_charge(page, memcg, false, false);
trace_mm_filemap_add_to_page_cache(page);
- return 0;
+err:
+ if (current->pagecache_lock != &mapping->add_lock)
+ pagecache_add_put(&mapping->add_lock);
+
+ return error;
err_insert:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
@@ -873,7 +947,7 @@ static int __add_to_page_cache_locked(struct page *page,
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- return error;
+ goto err;
}
/**
@@ -2511,7 +2585,14 @@ int filemap_fault(struct vm_fault *vmf)
* Do we have something in the page cache already?
*/
page = find_get_page(mapping, offset);
- if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+ if (unlikely(current->pagecache_lock == &mapping->add_lock)) {
+ /*
+ * fault from e.g. dio -> get_user_pages() - _don't_ want to do
+ * readahead, only read in page we need:
+ */
+ if (!page)
+ goto no_cached_page;
+ } else if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
/*
* We found the page, so try async readahead before
* waiting for the lock.
--
2.17.0
Add versions that take bvec_iter args instead of using bio->bi_iter - to
be used by bcachefs.
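For illustration, this lets a caller operate on a bio from an iterator other than bio->bi_iter - e.g. (a sketch; foo_copy_tail() and 'split' are made up):

static void foo_copy_tail(struct bio *dst, struct bio *src, unsigned split)
{
        struct bvec_iter src_iter = src->bi_iter;

        /* skip the first 'split' bytes of @src without touching bio->bi_iter: */
        bio_advance_iter(src, &src_iter, split);

        bio_copy_data_iter(dst, dst->bi_iter, src, src_iter);

        /* or, similarly, zero @src from 'split' onwards: */
        zero_fill_bio_iter(src, src_iter);
}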
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 44 ++++++++++++++++++++++++--------------------
include/linux/bio.h | 18 +++++++++++++++---
2 files changed, 39 insertions(+), 23 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index b7cdad6fc4..d7bd765e9e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -530,20 +530,20 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
}
EXPORT_SYMBOL(bio_alloc_bioset);
-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
{
unsigned long flags;
struct bio_vec bv;
struct bvec_iter iter;
- bio_for_each_segment(bv, bio, iter) {
+ __bio_for_each_segment(bv, bio, iter, start) {
char *data = bvec_kmap_irq(&bv, &flags);
memset(data, 0, bv.bv_len);
flush_dcache_page(bv.bv_page);
bvec_kunmap_irq(data, &flags);
}
}
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);
/**
* bio_put - release a reference to a bio
@@ -971,28 +971,13 @@ void bio_advance(struct bio *bio, unsigned bytes)
}
EXPORT_SYMBOL(bio_advance);
-/**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
- *
- * Stops when it reaches the end of either @src or @dst - that is, copies
- * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
- */
-void bio_copy_data(struct bio *dst, struct bio *src)
+void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
+ struct bio *src, struct bvec_iter src_iter)
{
- struct bvec_iter src_iter, dst_iter;
struct bio_vec src_bv, dst_bv;
void *src_p, *dst_p;
unsigned bytes;
- src_iter = src->bi_iter;
- dst_iter = dst->bi_iter;
-
while (1) {
if (!src_iter.bi_size) {
src = src->bi_next;
@@ -1029,6 +1014,25 @@ void bio_copy_data(struct bio *dst, struct bio *src)
bio_advance_iter(dst, &dst_iter, bytes);
}
}
+EXPORT_SYMBOL(bio_copy_data_iter);
+
+/**
+ * bio_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
+ * @src and @dst as linked lists of bios.
+ *
+ * Stops when it reaches the end of either @src or @dst - that is, copies
+ * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
+ */
+void bio_copy_data(struct bio *dst, struct bio *src)
+{
+ bio_copy_data_iter(dst, dst->bi_iter,
+ src, src->bi_iter);
+}
EXPORT_SYMBOL(bio_copy_data);
struct bio_map_data {
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 91b02520e2..5a6ee955a8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -67,8 +67,12 @@
#define bio_multiple_segments(bio) \
((bio)->bi_iter.bi_size != bio_iovec(bio).bv_len)
-#define bio_sectors(bio) ((bio)->bi_iter.bi_size >> 9)
-#define bio_end_sector(bio) ((bio)->bi_iter.bi_sector + bio_sectors((bio)))
+
+#define bvec_iter_sectors(iter) ((iter).bi_size >> 9)
+#define bvec_iter_end_sector(iter) ((iter).bi_sector + bvec_iter_sectors((iter)))
+
+#define bio_sectors(bio) bvec_iter_sectors((bio)->bi_iter)
+#define bio_end_sector(bio) bvec_iter_end_sector((bio)->bi_iter)
/*
* Return the data direction, READ or WRITE.
@@ -501,6 +505,8 @@ static inline void bio_flush_dcache_pages(struct bio *bi)
}
#endif
+extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
+ struct bio *src, struct bvec_iter src_iter);
extern void bio_copy_data(struct bio *dst, struct bio *src);
extern void bio_free_pages(struct bio *bio);
@@ -509,7 +515,13 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
struct iov_iter *,
gfp_t);
extern int bio_uncopy_user(struct bio *);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+ zero_fill_bio_iter(bio, bio->bi_iter);
+}
+
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
--
2.17.0
shit, git send-email pebkac error - disregard all the block patches in the
thread
Allows mempools to be embedded in other structs, getting rid of a
pointer indirection from allocation fastpaths.
mempool_exit() is safe to call on an uninitialized but zeroed mempool.
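For example (a sketch - struct foo_cache, foo_slab and the pool size are made up):

struct foo_cache {
        mempool_t       pool;
        /* ... */
};

static int foo_cache_init(struct foo_cache *c, struct kmem_cache *foo_slab)
{
        /* mempool embedded directly in foo_cache, no separate allocation: */
        return mempool_init_slab_pool(&c->pool, 16, foo_slab);
}

static void foo_cache_exit(struct foo_cache *c)
{
        /* fine on a kzalloc()'d, never-initialized foo_cache: */
        mempool_exit(&c->pool);
}

Elements are then allocated and freed with mempool_alloc(&c->pool, gfp) and mempool_free(elem, &c->pool).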
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/mempool.h | 34 +++++++++++++
mm/mempool.c | 108 ++++++++++++++++++++++++++++++----------
2 files changed, 115 insertions(+), 27 deletions(-)
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index b51f5c430c..0c964ac107 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -25,6 +25,18 @@ typedef struct mempool_s {
wait_queue_head_t wait;
} mempool_t;
+static inline bool mempool_initialized(mempool_t *pool)
+{
+ return pool->elements != NULL;
+}
+
+void mempool_exit(mempool_t *pool);
+int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id);
+int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data);
+
extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
@@ -43,6 +55,14 @@ extern void mempool_free(void *element, mempool_t *pool);
*/
void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);
+
+static inline int
+mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
+{
+ return mempool_init(pool, min_nr, mempool_alloc_slab,
+ mempool_free_slab, (void *) kc);
+}
+
static inline mempool_t *
mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
{
@@ -56,6 +76,13 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
*/
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
void mempool_kfree(void *element, void *pool_data);
+
+static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
+{
+ return mempool_init(pool, min_nr, mempool_kmalloc,
+ mempool_kfree, (void *) size);
+}
+
static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
{
return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
@@ -68,6 +95,13 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
*/
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);
+
+static inline int mempool_init_page_pool(mempool_t *pool, int min_nr, int order)
+{
+ return mempool_init(pool, min_nr, mempool_alloc_pages,
+ mempool_free_pages, (void *)(long)order);
+}
+
static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
{
return mempool_create(min_nr, mempool_alloc_pages, mempool_free_pages,
diff --git a/mm/mempool.c b/mm/mempool.c
index 5c9dce3471..df90ace400 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -137,6 +137,28 @@ static void *remove_element(mempool_t *pool, gfp_t flags)
return element;
}
+/**
+ * mempool_exit - exit a mempool initialized with mempool_init()
+ * @pool: pointer to the memory pool which was initialized with
+ * mempool_init().
+ *
+ * Free all reserved elements in @pool (but not @pool itself, which may be
+ * embedded in another structure). This function only sleeps if free_fn() sleeps.
+ *
+ * May be called on a zeroed but uninitialized mempool (i.e. allocated with
+ * kzalloc()).
+ */
+void mempool_exit(mempool_t *pool)
+{
+ while (pool->curr_nr) {
+ void *element = remove_element(pool, GFP_KERNEL);
+ pool->free(element, pool->pool_data);
+ }
+ kfree(pool->elements);
+ pool->elements = NULL;
+}
+EXPORT_SYMBOL(mempool_exit);
+
/**
* mempool_destroy - deallocate a memory pool
* @pool: pointer to the memory pool which was allocated via
@@ -150,15 +172,65 @@ void mempool_destroy(mempool_t *pool)
if (unlikely(!pool))
return;
- while (pool->curr_nr) {
- void *element = remove_element(pool, GFP_KERNEL);
- pool->free(element, pool->pool_data);
- }
- kfree(pool->elements);
+ mempool_exit(pool);
kfree(pool);
}
EXPORT_SYMBOL(mempool_destroy);
+int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
+{
+ spin_lock_init(&pool->lock);
+ pool->min_nr = min_nr;
+ pool->pool_data = pool_data;
+ pool->alloc = alloc_fn;
+ pool->free = free_fn;
+ init_waitqueue_head(&pool->wait);
+
+ pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
+ gfp_mask, node_id);
+ if (!pool->elements)
+ return -ENOMEM;
+
+ /*
+ * First pre-allocate the guaranteed number of buffers.
+ */
+ while (pool->curr_nr < pool->min_nr) {
+ void *element;
+
+ element = pool->alloc(gfp_mask, pool->pool_data);
+ if (unlikely(!element)) {
+ mempool_exit(pool);
+ return -ENOMEM;
+ }
+ add_element(pool, element);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(mempool_init_node);
+
+/**
+ * mempool_init - initialize a memory pool
+ * @min_nr: the minimum number of elements guaranteed to be
+ * allocated for this pool.
+ * @alloc_fn: user-defined element-allocation function.
+ * @free_fn: user-defined element-freeing function.
+ * @pool_data: optional private data available to the user-defined functions.
+ *
+ * Like mempool_create(), but initializes the pool in place (i.e. embedded in
+ * another structure) instead of allocating it.
+ */
+int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data)
+{
+ return mempool_init_node(pool, min_nr, alloc_fn, free_fn,
+ pool_data, GFP_KERNEL, NUMA_NO_NODE);
+
+}
+EXPORT_SYMBOL(mempool_init);
+
/**
* mempool_create - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
@@ -186,35 +258,17 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
gfp_t gfp_mask, int node_id)
{
mempool_t *pool;
+
pool = kzalloc_node(sizeof(*pool), gfp_mask, node_id);
if (!pool)
return NULL;
- pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
- gfp_mask, node_id);
- if (!pool->elements) {
+
+ if (mempool_init_node(pool, min_nr, alloc_fn, free_fn, pool_data,
+ gfp_mask, node_id)) {
kfree(pool);
return NULL;
}
- spin_lock_init(&pool->lock);
- pool->min_nr = min_nr;
- pool->pool_data = pool_data;
- init_waitqueue_head(&pool->wait);
- pool->alloc = alloc_fn;
- pool->free = free_fn;
- /*
- * First pre-allocate the guaranteed number of buffers.
- */
- while (pool->curr_nr < pool->min_nr) {
- void *element;
-
- element = pool->alloc(gfp_mask, pool->pool_data);
- if (unlikely(!element)) {
- mempool_destroy(pool);
- return NULL;
- }
- add_element(pool, element);
- }
return pool;
}
EXPORT_SYMBOL(mempool_create_node);
--
2.17.0
Minor performance improvement by getting rid of pointer indirections
from allocation/freeing fastpaths.
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio-integrity.c | 29 ++++++++++++++---------------
block/bio.c | 36 +++++++++++++++++-------------------
include/linux/bio.h | 10 +++++-----
3 files changed, 36 insertions(+), 39 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 9cfdd6c83b..add7c7c853 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -56,12 +56,12 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
struct bio_set *bs = bio->bi_pool;
unsigned inline_vecs;
- if (!bs || !bs->bio_integrity_pool) {
+ if (!bs || !mempool_initialized(&bs->bio_integrity_pool)) {
bip = kmalloc(sizeof(struct bio_integrity_payload) +
sizeof(struct bio_vec) * nr_vecs, gfp_mask);
inline_vecs = nr_vecs;
} else {
- bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
+ bip = mempool_alloc(&bs->bio_integrity_pool, gfp_mask);
inline_vecs = BIP_INLINE_VECS;
}
@@ -74,7 +74,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
unsigned long idx = 0;
bip->bip_vec = bvec_alloc(gfp_mask, nr_vecs, &idx,
- bs->bvec_integrity_pool);
+ &bs->bvec_integrity_pool);
if (!bip->bip_vec)
goto err;
bip->bip_max_vcnt = bvec_nr_vecs(idx);
@@ -90,7 +90,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
return bip;
err:
- mempool_free(bip, bs->bio_integrity_pool);
+ mempool_free(bip, &bs->bio_integrity_pool);
return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL(bio_integrity_alloc);
@@ -111,10 +111,10 @@ static void bio_integrity_free(struct bio *bio)
kfree(page_address(bip->bip_vec->bv_page) +
bip->bip_vec->bv_offset);
- if (bs && bs->bio_integrity_pool) {
- bvec_free(bs->bvec_integrity_pool, bip->bip_vec, bip->bip_slab);
+ if (bs && mempool_initialized(&bs->bio_integrity_pool)) {
+ bvec_free(&bs->bvec_integrity_pool, bip->bip_vec, bip->bip_slab);
- mempool_free(bip, bs->bio_integrity_pool);
+ mempool_free(bip, &bs->bio_integrity_pool);
} else {
kfree(bip);
}
@@ -465,16 +465,15 @@ EXPORT_SYMBOL(bio_integrity_clone);
int bioset_integrity_create(struct bio_set *bs, int pool_size)
{
- if (bs->bio_integrity_pool)
+ if (mempool_initialized(&bs->bio_integrity_pool))
return 0;
- bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
- if (!bs->bio_integrity_pool)
+ if (mempool_init_slab_pool(&bs->bio_integrity_pool,
+ pool_size, bip_slab))
return -1;
- bs->bvec_integrity_pool = biovec_create_pool(pool_size);
- if (!bs->bvec_integrity_pool) {
- mempool_destroy(bs->bio_integrity_pool);
+ if (biovec_init_pool(&bs->bvec_integrity_pool, pool_size)) {
+ mempool_exit(&bs->bio_integrity_pool);
return -1;
}
@@ -484,8 +483,8 @@ EXPORT_SYMBOL(bioset_integrity_create);
void bioset_integrity_free(struct bio_set *bs)
{
- mempool_destroy(bs->bio_integrity_pool);
- mempool_destroy(bs->bvec_integrity_pool);
+ mempool_exit(&bs->bio_integrity_pool);
+ mempool_exit(&bs->bvec_integrity_pool);
}
EXPORT_SYMBOL(bioset_integrity_free);
diff --git a/block/bio.c b/block/bio.c
index e1708db482..360e9bcea5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -254,7 +254,7 @@ static void bio_free(struct bio *bio)
bio_uninit(bio);
if (bs) {
- bvec_free(bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
+ bvec_free(&bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
/*
* If we have front padding, adjust the bio pointer before freeing
@@ -262,7 +262,7 @@ static void bio_free(struct bio *bio)
p = bio;
p -= bs->front_pad;
- mempool_free(p, bs->bio_pool);
+ mempool_free(p, &bs->bio_pool);
} else {
/* Bio was allocated by bio_kmalloc() */
kfree(bio);
@@ -454,7 +454,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
inline_vecs = nr_iovecs;
} else {
/* should not use nobvec bioset for nr_iovecs > 0 */
- if (WARN_ON_ONCE(!bs->bvec_pool && nr_iovecs > 0))
+ if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) &&
+ nr_iovecs > 0))
return NULL;
/*
* generic_make_request() converts recursion to iteration; this
@@ -483,11 +484,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;
- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ p = mempool_alloc(&bs->bio_pool, gfp_mask);
if (!p && gfp_mask != saved_gfp) {
punt_bios_to_rescuer(bs);
gfp_mask = saved_gfp;
- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ p = mempool_alloc(&bs->bio_pool, gfp_mask);
}
front_pad = bs->front_pad;
@@ -503,11 +504,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
if (nr_iovecs > inline_vecs) {
unsigned long idx = 0;
- bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
+ bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
if (!bvl && gfp_mask != saved_gfp) {
punt_bios_to_rescuer(bs);
gfp_mask = saved_gfp;
- bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
+ bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
}
if (unlikely(!bvl))
@@ -524,7 +525,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
return bio;
err_free:
- mempool_free(p, bs->bio_pool);
+ mempool_free(p, &bs->bio_pool);
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1848,11 +1849,11 @@ EXPORT_SYMBOL_GPL(bio_trim);
* create memory pools for biovec's in a bio_set.
* use the global biovec slabs created for general use.
*/
-mempool_t *biovec_create_pool(int pool_entries)
+int biovec_init_pool(mempool_t *pool, int pool_entries)
{
struct biovec_slab *bp = bvec_slabs + BVEC_POOL_MAX;
- return mempool_create_slab_pool(pool_entries, bp->slab);
+ return mempool_init_slab_pool(pool, pool_entries, bp->slab);
}
void bioset_free(struct bio_set *bs)
@@ -1860,8 +1861,8 @@ void bioset_free(struct bio_set *bs)
if (bs->rescue_workqueue)
destroy_workqueue(bs->rescue_workqueue);
- mempool_destroy(bs->bio_pool);
- mempool_destroy(bs->bvec_pool);
+ mempool_exit(&bs->bio_pool);
+ mempool_exit(&bs->bvec_pool);
bioset_integrity_free(bs);
bio_put_slab(bs);
@@ -1913,15 +1914,12 @@ struct bio_set *bioset_create(unsigned int pool_size,
return NULL;
}
- bs->bio_pool = mempool_create_slab_pool(pool_size, bs->bio_slab);
- if (!bs->bio_pool)
+ if (mempool_init_slab_pool(&bs->bio_pool, pool_size, bs->bio_slab))
goto bad;
- if (flags & BIOSET_NEED_BVECS) {
- bs->bvec_pool = biovec_create_pool(pool_size);
- if (!bs->bvec_pool)
- goto bad;
- }
+ if ((flags & BIOSET_NEED_BVECS) &&
+ biovec_init_pool(&bs->bvec_pool, pool_size))
+ goto bad;
if (!(flags & BIOSET_NEED_RESCUER))
return bs;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce547a25e8..720f7261d0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -412,7 +412,7 @@ enum {
BIOSET_NEED_RESCUER = BIT(1),
};
extern void bioset_free(struct bio_set *);
-extern mempool_t *biovec_create_pool(int pool_entries);
+extern int biovec_init_pool(mempool_t *pool, int pool_entries);
extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
extern void bio_put(struct bio *);
@@ -722,11 +722,11 @@ struct bio_set {
struct kmem_cache *bio_slab;
unsigned int front_pad;
- mempool_t *bio_pool;
- mempool_t *bvec_pool;
+ mempool_t bio_pool;
+ mempool_t bvec_pool;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
- mempool_t *bvec_integrity_pool;
+ mempool_t bio_integrity_pool;
+ mempool_t bvec_integrity_pool;
#endif
/*
--
2.17.0
On Fri, May 18, 2018 at 03:49:01AM -0400, Kent Overstreet wrote:
> Minor performance improvement by getting rid of pointer indirections
> from allocation/freeing fastpaths.
Reviewed-by: Johannes Thumshirn <[email protected]>
Although I'd prefer numbers in the changelog when claiming a
performance improvement.
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Looks good,
Reviewed-by: Johannes Thumshirn <[email protected]>
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Looks good,
Reviewed-by: Johannes Thumshirn <[email protected]>
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Looks good,
Reviewed-by: Johannes Thumshirn <[email protected]>
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> bcachefs makes use of them - also, add a proper lg_lock_init()
Why?! lglocks are horrid things, we got rid of them for a reason. They
have terrifying worst case preemption off latencies.
Why can't you use something like per-cpu rwsems?
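(For reference, the percpu-rwsem API being suggested looks roughly like this - foo_sem and the foo_*() functions are made up; note the read side is shared rather than per-cpu exclusive, which is part of the clash described in the follow-ups:)

static struct percpu_rw_semaphore foo_sem;

static int foo_init(void)
{
        return percpu_init_rwsem(&foo_sem);
}

static void foo_fast_path(void)
{
        percpu_down_read(&foo_sem);      /* cheap, but shared with other readers */
        /* ... */
        percpu_up_read(&foo_sem);
}

static void foo_slow_path(void)
{
        percpu_down_write(&foo_sem);     /* slow, global, waits out all readers */
        /* ... */
        percpu_up_write(&foo_sem);
}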
On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
No.. and most certainly not without a _very_ good reason.
On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > bcachefs makes use of them - also, add a proper lg_lock_init()
>
> Why?! lglocks are horrid things, we got rid of them for a reason. They
> have terrifying worst case preemption off latencies.
Ah. That was missing from your commit message.
> Why can't you use something like per-cpu rwsems?
Well,
a) in my use case, lg_global_lock() pretty much isn't used in normal operation -
it's only called when starting mark and sweep gc (which is no longer needed and
is disabled by default; it'll eventually get rolled into online fsck) and for
device resize.
b) I'm using it in conjunction with percpu counters, and while technically I
certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
for me than rwsems), there's a bit of a clash there: it would be ugly, messy
and more work for me, since this_cpu_ptr() no longer makes sense in that case
and I'd have to audit and convert all the code that touches the relevant data
structures.
If you're really that dead set against lglocks I might just come up with a new
lock with similar semantics that doesn't break this_cpu_ptr() but sleeps if the
global lock is held...
On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
>
> No.. and most certainly not without a _very_ good reason.
Ok, can I ask why?
Here's what it's for:
commit 61782bf71eef83919af100a9747d8d86dfdf3605
Author: Kent Overstreet <[email protected]>
Date: Fri May 18 06:14:56 2018 -0400
bcachefs: SIX locks (shared/intent/exclusive)
New lock for bcachefs, like read/write locks but with a third state,
intent.
Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.
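A usage sketch (foo_modify_node() is made up, not from the patch) - the point of the intent state is that a traversal can hold its position without blocking readers, and only exclude them for the actual update:

static void foo_modify_node(struct six_lock *lock)
{
        six_lock_intent(lock);          /* blocks other intent/write locks, not readers */

        /* ... find what to modify; readers still running ... */

        six_lock_write(lock);           /* requires intent held; now excludes readers */
        /* ... do the (brief) modification ... */
        six_unlock_write(lock);

        six_unlock_intent(lock);
}

six_lock_downgrade()/six_lock_tryupgrade() convert between read and intent, and six_relock_*() retakes a lock only if the sequence number hasn't changed since it was dropped.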
diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
new file mode 100644
index 0000000000..afa59a476a
--- /dev/null
+++ b/fs/bcachefs/six.c
@@ -0,0 +1,516 @@
+
+#include <linux/log2.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+
+#include "six.h"
+
+#define six_acquire(l, t) lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l) lock_release(l, 0, _RET_IP_)
+
+struct six_lock_vals {
+ /* Value we add to the lock in order to take the lock: */
+ u64 lock_val;
+
+ /* If the lock has this value (used as a mask), taking the lock fails: */
+ u64 lock_fail;
+
+ /* Value we add to the lock in order to release the lock: */
+ u64 unlock_val;
+
+ /* Mask that indicates lock is held for this type: */
+ u64 held_mask;
+
+ /* Waitlist we wakeup when releasing the lock: */
+ enum six_lock_type unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read __SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent __SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write __SIX_VAL(seq, 1)
+
+#define LOCK_VALS { \
+ [SIX_LOCK_read] = { \
+ .lock_val = __SIX_VAL(read_lock, 1), \
+ .lock_fail = __SIX_LOCK_HELD_write, \
+ .unlock_val = -__SIX_VAL(read_lock, 1), \
+ .held_mask = __SIX_LOCK_HELD_read, \
+ .unlock_wakeup = SIX_LOCK_write, \
+ }, \
+ [SIX_LOCK_intent] = { \
+ .lock_val = __SIX_VAL(intent_lock, 1), \
+ .lock_fail = __SIX_LOCK_HELD_intent, \
+ .unlock_val = -__SIX_VAL(intent_lock, 1), \
+ .held_mask = __SIX_LOCK_HELD_intent, \
+ .unlock_wakeup = SIX_LOCK_intent, \
+ }, \
+ [SIX_LOCK_write] = { \
+ .lock_val = __SIX_VAL(seq, 1), \
+ .lock_fail = __SIX_LOCK_HELD_read, \
+ .unlock_val = __SIX_VAL(seq, 1), \
+ .held_mask = __SIX_LOCK_HELD_write, \
+ .unlock_wakeup = SIX_LOCK_read, \
+ }, \
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+ union six_lock_state old)
+{
+ if (type != SIX_LOCK_intent)
+ return;
+
+ if (!old.intent_lock) {
+ EBUG_ON(lock->owner);
+ lock->owner = current;
+ } else {
+ EBUG_ON(lock->owner != current);
+ }
+}
+
+static inline void six_clear_owner(struct six_lock *lock, enum six_lock_type type)
+{
+ if (type != SIX_LOCK_intent)
+ return;
+
+ EBUG_ON(lock->owner != current);
+
+ if (lock->state.intent_lock == 1)
+ lock->owner = NULL;
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+ enum six_lock_type type)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+ union six_lock_state old;
+ u64 v = READ_ONCE(lock->state.v);
+
+ EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+
+ do {
+ old.v = v;
+
+ EBUG_ON(type == SIX_LOCK_write &&
+ ((old.v & __SIX_LOCK_HELD_write) ||
+ !(old.v & __SIX_LOCK_HELD_intent)));
+
+ if (old.v & l[type].lock_fail)
+ return false;
+ } while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+ old.v,
+ old.v + l[type].lock_val)) != old.v);
+
+ six_set_owner(lock, type, old);
+ return true;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ if (!do_six_trylock_type(lock, type))
+ return false;
+
+ six_acquire(&lock->dep_map, 1);
+ return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+ unsigned seq)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+ union six_lock_state old;
+ u64 v = READ_ONCE(lock->state.v);
+
+ do {
+ old.v = v;
+
+ if (old.seq != seq || old.v & l[type].lock_fail)
+ return false;
+ } while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+ old.v,
+ old.v + l[type].lock_val)) != old.v);
+
+ six_set_owner(lock, type, old);
+ six_acquire(&lock->dep_map, 1);
+ return true;
+}
+
+struct six_lock_waiter {
+ struct list_head list;
+ struct task_struct *task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+ struct task_struct *owner;
+ int retval = 1;
+
+ if (need_resched())
+ return 0;
+
+ rcu_read_lock();
+ owner = READ_ONCE(lock->owner);
+ if (owner)
+ retval = owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+ struct task_struct *owner)
+{
+ bool ret = true;
+
+ rcu_read_lock();
+ while (lock->owner == owner) {
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+
+ if (!owner->on_cpu || need_resched()) {
+ ret = false;
+ break;
+ }
+
+ cpu_relax();
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+ struct task_struct *task = current;
+
+ if (type == SIX_LOCK_write)
+ return false;
+
+ preempt_disable();
+ if (!six_can_spin_on_owner(lock))
+ goto fail;
+
+ if (!osq_lock(&lock->osq))
+ goto fail;
+
+ while (1) {
+ struct task_struct *owner;
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ owner = READ_ONCE(lock->owner);
+ if (owner && !six_spin_on_owner(lock, owner))
+ break;
+
+ if (do_six_trylock_type(lock, type)) {
+ osq_unlock(&lock->osq);
+ preempt_enable();
+ return true;
+ }
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ cpu_relax();
+ }
+
+ osq_unlock(&lock->osq);
+fail:
+ preempt_enable();
+
+ /*
+ * If we fell out of the spin path because of need_resched(),
+ * reschedule now, before we try-lock again. This avoids getting
+ * scheduled out right after we obtained the lock.
+ */
+ if (need_resched())
+ schedule();
+
+ return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+ return false;
+}
+
+#endif
+
+noinline
+static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+ union six_lock_state old, new;
+ struct six_lock_waiter wait;
+ u64 v;
+
+ if (six_optimistic_spin(lock, type))
+ return;
+
+ lock_contended(&lock->dep_map, _RET_IP_);
+
+ INIT_LIST_HEAD(&wait.list);
+ wait.task = current;
+
+ while (1) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (type == SIX_LOCK_write)
+ EBUG_ON(lock->owner != current);
+ else if (list_empty_careful(&wait.list)) {
+ raw_spin_lock(&lock->wait_lock);
+ list_add_tail(&wait.list, &lock->wait_list[type]);
+ raw_spin_unlock(&lock->wait_lock);
+ }
+
+ v = READ_ONCE(lock->state.v);
+ do {
+ new.v = old.v = v;
+
+ if (!(old.v & l[type].lock_fail))
+ new.v += l[type].lock_val;
+ else if (!(new.waiters & (1 << type)))
+ new.waiters |= 1 << type;
+ else
+ break; /* waiting bit already set */
+ } while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+ old.v, new.v)) != old.v);
+
+ if (!(old.v & l[type].lock_fail))
+ break;
+
+ schedule();
+ }
+
+ six_set_owner(lock, type, old);
+
+ __set_current_state(TASK_RUNNING);
+
+ if (!list_empty_careful(&wait.list)) {
+ raw_spin_lock(&lock->wait_lock);
+ list_del_init(&wait.list);
+ raw_spin_unlock(&lock->wait_lock);
+ }
+}
+
+__always_inline
+static void __six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ six_acquire(&lock->dep_map, 0);
+
+ if (!do_six_trylock_type(lock, type))
+ __six_lock_type_slowpath(lock, type);
+
+ lock_acquired(&lock->dep_map, _RET_IP_);
+}
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+ union six_lock_state state,
+ unsigned waitlist_id)
+{
+ struct list_head *wait_list = &lock->wait_list[waitlist_id];
+ struct six_lock_waiter *w, *next;
+
+ if (waitlist_id == SIX_LOCK_write && state.read_lock)
+ return;
+
+ if (!(state.waiters & (1 << waitlist_id)))
+ return;
+
+ clear_bit(waitlist_bitnr(waitlist_id),
+ (unsigned long *) &lock->state.v);
+
+ if (waitlist_id == SIX_LOCK_write) {
+ struct task_struct *p = READ_ONCE(lock->owner);
+
+ if (p)
+ wake_up_process(p);
+ return;
+ }
+
+ raw_spin_lock(&lock->wait_lock);
+
+ list_for_each_entry_safe(w, next, wait_list, list) {
+ list_del_init(&w->list);
+
+ if (wake_up_process(w->task) &&
+ waitlist_id != SIX_LOCK_read) {
+ if (!list_empty(wait_list))
+ set_bit(waitlist_bitnr(waitlist_id),
+ (unsigned long *) &lock->state.v);
+ break;
+ }
+ }
+
+ raw_spin_unlock(&lock->wait_lock);
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+ union six_lock_state state;
+
+ EBUG_ON(!(lock->state.v & l[type].held_mask));
+ EBUG_ON(type == SIX_LOCK_write &&
+ !(lock->state.v & __SIX_LOCK_HELD_intent));
+
+ six_clear_owner(lock, type);
+
+ state.v = atomic64_add_return_release(l[type].unlock_val,
+ &lock->state.counter);
+ six_release(&lock->dep_map);
+ six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#ifdef SIX_LOCK_SEPARATE_LOCKFNS
+
+#define __SIX_LOCK(type) \
+bool six_trylock_##type(struct six_lock *lock) \
+{ \
+ return __six_trylock_type(lock, SIX_LOCK_##type); \
+} \
+ \
+bool six_relock_##type(struct six_lock *lock, u32 seq) \
+{ \
+ return __six_relock_type(lock, SIX_LOCK_##type, seq); \
+} \
+ \
+void six_lock_##type(struct six_lock *lock) \
+{ \
+ __six_lock_type(lock, SIX_LOCK_##type); \
+} \
+ \
+void six_unlock_##type(struct six_lock *lock) \
+{ \
+ __six_unlock_type(lock, SIX_LOCK_##type); \
+}
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+#else
+
+bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ return __six_trylock_type(lock, type);
+}
+
+bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+ unsigned seq)
+{
+ return __six_relock_type(lock, type, seq);
+
+}
+
+void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ __six_lock_type(lock, type);
+}
+
+void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ __six_unlock_type(lock, type);
+}
+
+#endif
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+ six_lock_increment(lock, SIX_LOCK_read);
+ six_unlock_intent(lock);
+}
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+ union six_lock_state old, new;
+ u64 v = READ_ONCE(lock->state.v);
+
+ do {
+ new.v = old.v = v;
+
+ EBUG_ON(!(old.v & l[SIX_LOCK_read].held_mask));
+
+ new.v += l[SIX_LOCK_read].unlock_val;
+
+ if (new.v & l[SIX_LOCK_intent].lock_fail)
+ return false;
+
+ new.v += l[SIX_LOCK_intent].lock_val;
+ } while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+ old.v, new.v)) != old.v);
+
+ six_set_owner(lock, SIX_LOCK_intent, old);
+ six_lock_wakeup(lock, new, l[SIX_LOCK_read].unlock_wakeup);
+
+ return true;
+}
+
+bool six_trylock_convert(struct six_lock *lock,
+ enum six_lock_type from,
+ enum six_lock_type to)
+{
+ EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+ if (to == from)
+ return true;
+
+ if (to == SIX_LOCK_read) {
+ six_lock_downgrade(lock);
+ return true;
+ } else {
+ return six_lock_tryupgrade(lock);
+ }
+}
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+ const struct six_lock_vals l[] = LOCK_VALS;
+
+ EBUG_ON(type == SIX_LOCK_write);
+ six_acquire(&lock->dep_map, 0);
+
+ /* XXX: assert already locked, and that we don't overflow: */
+
+ atomic64_add(l[type].lock_val, &lock->state.counter);
+}
diff --git a/fs/bcachefs/six.h b/fs/bcachefs/six.h
new file mode 100644
index 0000000000..f518c64c40
--- /dev/null
+++ b/fs/bcachefs/six.h
@@ -0,0 +1,190 @@
+#ifndef _BCACHEFS_SIX_H
+#define _BCACHEFS_SIX_H
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#include "util.h"
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+/*
+ * LOCK STATES:
+ *
+ * read, intent, write (i.e. shared/intent/exclusive, hence the name)
+ *
+ * read and write work as with normal read/write locks - a lock can have
+ * multiple readers, but write excludes reads and other write locks.
+ *
+ * Intent does not block read, but it does block other intent locks. The idea is
+ * by taking an intent lock, you can then later upgrade to a write lock without
+ * dropping your read lock and without deadlocking - because no other thread has
+ * the intent lock and thus no other thread could be trying to take the write
+ * lock.
+ */
+
+union six_lock_state {
+ struct {
+ atomic64_t counter;
+ };
+
+ struct {
+ u64 v;
+ };
+
+ struct {
+ /* for waitlist_bitnr() */
+ unsigned long l;
+ };
+
+ struct {
+ unsigned read_lock:26;
+ unsigned intent_lock:3;
+ unsigned waiters:3;
+ /*
+ * seq works much like in seqlocks: it's incremented every time
+ * we lock and unlock for write.
+ *
+	 * If it's odd, a write lock is held; if it's even, it's unlocked.
+ *
+ * Thus readers can unlock, and then lock again later iff it
+ * hasn't been modified in the meantime.
+ */
+ u32 seq;
+ };
+};
+
+#define SIX_LOCK_MAX_RECURSE ((1 << 3) - 1)
+
+enum six_lock_type {
+ SIX_LOCK_read,
+ SIX_LOCK_intent,
+ SIX_LOCK_write,
+};
+
+struct six_lock {
+ union six_lock_state state;
+ struct task_struct *owner;
+ struct optimistic_spin_queue osq;
+
+ raw_spinlock_t wait_lock;
+ struct list_head wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+ const char *name,
+ struct lock_class_key *key)
+{
+ atomic64_set(&lock->state.counter, 0);
+ raw_spin_lock_init(&lock->wait_lock);
+ INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+ INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+ lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock) \
+do { \
+ static struct lock_class_key __key; \
+ \
+ __six_lock_init((lock), #lock, &__key); \
+} while (0)
+
+#define __SIX_VAL(field, _v) (((union six_lock_state) { .field = _v }).v)
+
+#ifdef SIX_LOCK_SEPARATE_LOCKFNS
+
+#define __SIX_LOCK(type) \
+bool six_trylock_##type(struct six_lock *); \
+bool six_relock_##type(struct six_lock *, u32); \
+void six_lock_##type(struct six_lock *); \
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...) \
+ switch (type) { \
+ case SIX_LOCK_read: \
+ return fn##_read(__VA_ARGS__); \
+ case SIX_LOCK_intent: \
+ return fn##_intent(__VA_ARGS__); \
+ case SIX_LOCK_write: \
+ return fn##_write(__VA_ARGS__); \
+ default: \
+ BUG(); \
+ }
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+ unsigned seq)
+{
+ SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ SIX_LOCK_DISPATCH(type, six_lock, lock);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+ SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+#else
+
+bool six_trylock_type(struct six_lock *, enum six_lock_type);
+bool six_relock_type(struct six_lock *, enum six_lock_type, unsigned);
+void six_lock_type(struct six_lock *, enum six_lock_type);
+void six_unlock_type(struct six_lock *, enum six_lock_type);
+
+#define __SIX_LOCK(type) \
+static __always_inline bool six_trylock_##type(struct six_lock *lock) \
+{ \
+ return six_trylock_type(lock, SIX_LOCK_##type); \
+} \
+ \
+static __always_inline bool six_relock_##type(struct six_lock *lock, u32 seq)\
+{ \
+ return six_relock_type(lock, SIX_LOCK_##type, seq); \
+} \
+ \
+static __always_inline void six_lock_##type(struct six_lock *lock) \
+{ \
+ six_lock_type(lock, SIX_LOCK_##type); \
+} \
+ \
+static __always_inline void six_unlock_##type(struct six_lock *lock) \
+{ \
+ six_unlock_type(lock, SIX_LOCK_##type); \
+}
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#endif
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+ enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+#endif /* _BCACHEFS_SIX_H */
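A minimal usage sketch of the read/intent/write pattern the header comment
describes - hypothetical caller code, not part of the patch:

#include "six.h"

/*
 * Hold intent across the lookup so no other thread can be preparing the same
 * update, then take the write lock only around the actual modification.
 */
static void six_example_update(struct six_lock *lock)
{
	six_lock_intent(lock);		/* readers still make progress */

	/* ... traverse, decide what to change ... */

	six_lock_write(lock);		/* briefly exclude readers */
	/* ... modify ... */
	six_unlock_write(lock);

	six_unlock_intent(lock);
}

/*
 * Readers can also drop the lock and revalidate: save lock->state.seq before
 * six_unlock_read(), and six_relock_read(lock, seq) later succeeds only if
 * nothing was written in the meantime.
 */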
On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > bcachefs makes use of them - also, add a proper lg_lock_init()
> >
> > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > have terrifying worst case preemption off latencies.
>
> Ah. That was missing from your commit message.
Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/
> > Why can't you use something like per-cpu rwsems?
>
> Well,
>
> a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
> it's only called when starting mark and sweep gc (which is not needed
> anymore and disabled by default, it'll eventually get rolled into online
> fsck) and for device resize
>
> b) I'm using it in conjunction with percpu counters, and technically yes I
> certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
> for me than rwsems), there's a bit of a clash there and it's going to be a
> bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> makes any sense in that case, so it'd mean auditing and converting all the
> code that touches the relevant data structures).
Well, lg is a reader-writer style lock per definition, as you want
concurrency on the local and full exclusion against the global, so I'm
not sure how mutexes fit into this.
In any case, have a look at percpu_down_read_preempt_disable() and
percpu_up_read_preempt_enable(); they're a bit of a hack but they should
work for you I think.
They will sleep at down_read, but the entire actual critical section
will be with preemption disabled -- therefore it had better be short and
bounded, and the RT guys will thank you for not using spinlock_t under
it (use raw_spinlock_t if you have to).
The (global) writer side will block and be preemptible like normal.
> If you're really that dead set against lglocks I might just come up with a new
> lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
> global lock is held...
See above, we already have this ;-)
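For reference, a rough sketch of that replacement, assuming the percpu-rwsem
API as of this thread - the lock name and the two callers are made up for
illustration:

#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(mark_lock);	/* hypothetical */

static void mark_update(void)		/* the old lg_local_lock() side */
{
	percpu_down_read_preempt_disable(&mark_lock);
	/*
	 * Short, bounded critical section with preemption off, so
	 * this_cpu_ptr() on the usage counters still makes sense here.
	 */
	percpu_up_read_preempt_enable(&mark_lock);
}

static void mark_gc_or_resize(void)	/* the old lg_global_lock() side */
{
	percpu_down_write(&mark_lock);	/* sleeps until all readers drain */
	/* ... walk every cpu's counters ... */
	percpu_up_write(&mark_lock);
}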
On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> >
> > No.. and most certainly not without a _very_ good reason.
>
> Ok, can I ask why?
Because it is an internal helper for lock implementations that want to
do optimistic spinning, it isn't a lock on its own and lacks several
things you would expect.
Using it is tricky and I don't trust random module authors to get 1+1
right, let alone use this thing correctly (no judgement on your code,
just in general).
> Here's what it's for:
I'll try and have a look soon :-) But does that really _have_ to live in
a module?
On Fri, May 18, 2018 at 01:08:08PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > >
> > > No.. and most certainly not without a _very_ good reason.
> >
> > Ok, can I ask why?
>
> Because it is an internal helper for lock implementations that want to
> do optimistic spinning, it isn't a lock on its own and lacks several
> things you would expect.
>
> Using it is tricky and I don't trust random module authors to get 1+1
> right, let alone use this thing correctly (no judgement on your code,
> just in general).
Yeah, that's true. I just modelled my usage on the rwsem code.
It does strike me that the optimistic spin algorithms
(mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
out. They've been growing more optimizations, I see, and the optimizations
mostly aren't specific to either lock.
> > Here's what it's for:
>
> I'll try and have a look soon :-) But does that really _have_ to live in
> a module?
No, I'd be completely fine with moving six locks out of bcachefs, just don't
know that there'd be any other users. But I suppose we do have other filesystems
that use btrees, and that's what they're for.
On Fri, May 18, 2018 at 01:03:39PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > >
> > > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > > have terrifying worst case preemption off latencies.
> >
> > Ah. That was missing from your commit message.
>
> Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/
>
> > > Why can't you use something like per-cpu rwsems?
> >
> > Well,
> >
> > a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
> > it's only called when starting mark and sweep gc (which is not needed
> > anymore and disabled by default, it'll eventually get rolled into online
> > fsck) and for device resize
> >
> > b) I'm using it in conjunction with percpu counters, and technically yes I
> > certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
> > for me than rwsems), there's a bit of a clash there and it's going to be a
> > bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> > makes any sense in that case, so it'd mean auditing and converting all the
> > code that touches the relevant data structures).
>
> Well, lg is a reader-writer style lock per definition, as you want
> concurrency on the local and full exclusion against the global, so I'm
> not sure how mutexes fit into this.
>
> In any case, have a look at percpu_down_read_preempt_disable() and
> percpu_up_read_preempt_enable(); they're a bit of a hack but they should
> work for you I think.
>
> They will sleep at down_read, but the entire actual critical section
> will be with preemption disabled -- therefore it had better be short and
> bounded, and the RT guys will thank you for not using spinlock_t under
> it (use raw_spinlock_t if you have to).
>
> The (global) writer side will block and be preemptible like normal.
>
> > If you're really that dead set against lglocks I might just come up with a new
> > lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
> > global lock is held...
>
> See above, we already have this ;-)
Ok, I think this might work. I'll have to stare awhile and make sure I remember
everything I'm currently depending on the lglock for...
On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> It does strike me that the optimistic spin algorithms
> (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> out. They've been growing more optimizations, I see, and the optimizations
> mostly aren't specific to either lock.
There's a few unfortunate differences between the two; but yes it would
be good if we could reduce some of the duplication found there.
One of the distinct differences is that mutex (now) has the lock state
and owner in a single atomic word, while rwsem still tracks the owner in
a separate word and thus needs to account for 'inconsistent' owner
state.
And then there's warts such as ww_mutex and rwsem_owner_is_reader and
similar.
On Fri, May 18, 2018 at 01:40:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> > It does strike me that the optimistic spin algorithms
> > (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> > out. They've been growing more optimizations, I see, and the optimizations
> > mostly aren't specific to either lock.
>
> There's a few unfortunate differences between the two; but yes it would
> be good if we could reduce some of the duplication found there.
>
> One of the distinct differences is that mutex (now) has the lock state
> and owner in a single atomic word, while rwsem still tracks the owner in
> a separate word and thus needs to account for 'inconsistent' owner
> state.
>
> And then there's warts such as ww_mutex and rwsem_owner_is_reader and
> similar.
The rwsem code seems workable: osq_optimistic_spin() takes callbacks for trylock
and get_owner - I just added an OWNER_NO_SPIN value that get_owner() can return.
The mutex code though... wtf...
commit 5c7defe81ee722f5087cbeda9be6e7ee715515d7
Author: Kent Overstreet <[email protected]>
Date: Fri May 18 08:35:55 2018 -0400
Factor out osq_optimistic_spin()
diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
index afa59a476a..dbaf52abef 100644
--- a/fs/bcachefs/six.c
+++ b/fs/bcachefs/six.c
@@ -1,5 +1,6 @@
#include <linux/log2.h>
+#include <linux/osq_optimistic_spin.h>
#include <linux/preempt.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
@@ -146,127 +147,26 @@ struct six_lock_waiter {
/* This is probably up there with the more evil things I've done */
#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
-#ifdef CONFIG_LOCK_SPIN_ON_OWNER
-
-static inline int six_can_spin_on_owner(struct six_lock *lock)
-{
- struct task_struct *owner;
- int retval = 1;
-
- if (need_resched())
- return 0;
-
- rcu_read_lock();
- owner = READ_ONCE(lock->owner);
- if (owner)
- retval = owner->on_cpu;
- rcu_read_unlock();
- /*
- * if lock->owner is not set, the mutex owner may have just acquired
- * it and not set the owner yet or the mutex has been released.
- */
- return retval;
-}
-
-static inline bool six_spin_on_owner(struct six_lock *lock,
- struct task_struct *owner)
+static inline struct task_struct *six_osq_get_owner(struct optimistic_spin_queue *osq)
{
- bool ret = true;
-
- rcu_read_lock();
- while (lock->owner == owner) {
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking lock->owner still matches owner. If that fails,
- * owner might point to freed memory. If it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
- */
- barrier();
-
- if (!owner->on_cpu || need_resched()) {
- ret = false;
- break;
- }
-
- cpu_relax();
- }
- rcu_read_unlock();
+ struct six_lock *lock = container_of(osq, struct six_lock, osq);
- return ret;
+ return lock->owner;
}
-static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+static inline bool six_osq_trylock_read(struct optimistic_spin_queue *osq)
{
- struct task_struct *task = current;
-
- if (type == SIX_LOCK_write)
- return false;
-
- preempt_disable();
- if (!six_can_spin_on_owner(lock))
- goto fail;
-
- if (!osq_lock(&lock->osq))
- goto fail;
-
- while (1) {
- struct task_struct *owner;
-
- /*
- * If there's an owner, wait for it to either
- * release the lock or go to sleep.
- */
- owner = READ_ONCE(lock->owner);
- if (owner && !six_spin_on_owner(lock, owner))
- break;
+ struct six_lock *lock = container_of(osq, struct six_lock, osq);
- if (do_six_trylock_type(lock, type)) {
- osq_unlock(&lock->osq);
- preempt_enable();
- return true;
- }
-
- /*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
- */
- if (!owner && (need_resched() || rt_task(task)))
- break;
-
- /*
- * The cpu_relax() call is a compiler barrier which forces
- * everything in this loop to be re-loaded. We don't need
- * memory barriers as we'll eventually observe the right
- * values at the cost of a few extra spins.
- */
- cpu_relax();
- }
-
- osq_unlock(&lock->osq);
-fail:
- preempt_enable();
-
- /*
- * If we fell out of the spin path because of need_resched(),
- * reschedule now, before we try-lock again. This avoids getting
- * scheduled out right after we obtained the lock.
- */
- if (need_resched())
- schedule();
-
- return false;
+ return do_six_trylock_type(lock, SIX_LOCK_read);
}
-#else /* CONFIG_LOCK_SPIN_ON_OWNER */
-
-static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+static inline bool six_osq_trylock_intent(struct optimistic_spin_queue *osq)
{
- return false;
-}
+ struct six_lock *lock = container_of(osq, struct six_lock, osq);
-#endif
+ return do_six_trylock_type(lock, SIX_LOCK_intent);
+}
noinline
static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
@@ -276,8 +176,20 @@ static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type t
struct six_lock_waiter wait;
u64 v;
- if (six_optimistic_spin(lock, type))
- return;
+ switch (type) {
+ case SIX_LOCK_read:
+ if (osq_optimistic_spin(&lock->osq, six_osq_get_owner,
+ six_osq_trylock_read))
+ return;
+ break;
+ case SIX_LOCK_intent:
+ if (osq_optimistic_spin(&lock->osq, six_osq_get_owner,
+ six_osq_trylock_intent))
+ return;
+ break;
+ case SIX_LOCK_write:
+ break;
+ }
lock_contended(&lock->dep_map, _RET_IP_);
diff --git a/include/linux/osq_optimistic_spin.h b/include/linux/osq_optimistic_spin.h
new file mode 100644
index 0000000000..1f5fd1f85b
--- /dev/null
+++ b/include/linux/osq_optimistic_spin.h
@@ -0,0 +1,155 @@
+#ifndef __LINUX_OSQ_OPTIMISTIC_SPIN_H
+#define __LINUX_OSQ_OPTIMISTIC_SPIN_H
+
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+
+typedef struct task_struct *(*osq_get_owner_fn)(struct optimistic_spin_queue *osq);
+typedef bool (*osq_trylock_fn)(struct optimistic_spin_queue *osq);
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+#define OWNER_NO_SPIN ((struct task_struct *) 1UL)
+
+static inline bool osq_can_spin_on_owner(struct optimistic_spin_queue *lock,
+ osq_get_owner_fn get_owner)
+{
+ struct task_struct *owner;
+ bool ret;
+
+ if (need_resched())
+ return false;
+
+ rcu_read_lock();
+ owner = get_owner(lock);
+ /*
+ * if lock->owner is not set, the lock owner may have just acquired
+ * it and not set the owner yet, or it may have just been unlocked
+ */
+ if (!owner)
+ ret = true;
+ else if (owner == OWNER_NO_SPIN)
+ ret = false;
+ else
+ ret = owner->on_cpu != 0;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline bool osq_spin_on_owner(struct optimistic_spin_queue *lock,
+ struct task_struct *owner,
+ osq_get_owner_fn get_owner)
+{
+ if (!owner)
+ return true;
+
+ if (owner == OWNER_NO_SPIN)
+ return false;
+
+ while (1) {
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+ if (get_owner(lock) != owner)
+ return true;
+
+ if (!owner->on_cpu || need_resched() ||
+ vcpu_is_preempted(task_cpu(owner)))
+ return false;
+
+ cpu_relax();
+ }
+}
+
+static inline bool osq_optimistic_spin(struct optimistic_spin_queue *lock,
+ osq_get_owner_fn get_owner,
+ osq_trylock_fn trylock)
+{
+ struct task_struct *task = current;
+
+ preempt_disable();
+ if (!osq_can_spin_on_owner(lock, get_owner))
+ goto fail;
+
+ if (!osq_lock(lock))
+ goto fail;
+
+ while (1) {
+ struct task_struct *owner;
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ rcu_read_lock();
+ owner = get_owner(lock);
+ if (!osq_spin_on_owner(lock, owner, get_owner)) {
+ rcu_read_unlock();
+ break;
+ }
+ rcu_read_unlock();
+
+ if (trylock(lock)) {
+ osq_unlock(lock);
+ preempt_enable();
+ return true;
+ }
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ cpu_relax();
+ }
+
+ osq_unlock(lock);
+fail:
+ preempt_enable();
+
+ /*
+ * If we fell out of the spin path because of need_resched(),
+ * reschedule now, before we try-lock again. This avoids getting
+ * scheduled out right after we obtained the lock.
+ */
+ if (need_resched())
+ schedule();
+
+ return false;
+}
+
+static inline bool osq_has_spinner(struct optimistic_spin_queue *lock)
+{
+ return osq_is_locked(lock);
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool osq_optimistic_spin(struct optimistic_spin_queue *lock,
+				       osq_get_owner_fn get_owner,
+				       osq_trylock_fn trylock)
+{
+	return false;
+}
+
+static inline bool osq_has_spinner(struct optimistic_spin_queue *lock)
+{
+ return false;
+}
+
+#endif
+
+#endif /* __LINUX_OSQ_OPTIMISTIC_SPIN_H */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index e795908f36..a25ee6668f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -13,6 +13,7 @@
#include <linux/rwsem.h>
#include <linux/init.h>
#include <linux/export.h>
+#include <linux/osq_optimistic_spin.h>
#include <linux/sched/signal.h>
#include <linux/sched/rt.h>
#include <linux/sched/wake_q.h>
@@ -325,11 +326,21 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
}
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+
+static inline struct task_struct *rwsem_osq_get_owner(struct optimistic_spin_queue *osq)
+{
+ struct rw_semaphore *sem = container_of(osq, struct rw_semaphore, osq);
+ struct task_struct *owner = READ_ONCE(sem->owner);
+
+ return rwsem_owner_is_reader(owner) ? OWNER_NO_SPIN : owner;
+}
+
/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_osq_trylock(struct optimistic_spin_queue *osq)
{
+ struct rw_semaphore *sem = container_of(osq, struct rw_semaphore, osq);
long old, count = atomic_long_read(&sem->count);
while (true) {
@@ -347,125 +358,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
}
}
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner;
- bool ret = true;
-
- if (need_resched())
- return false;
-
- rcu_read_lock();
- owner = READ_ONCE(sem->owner);
- if (!rwsem_owner_is_writer(owner)) {
- /*
- * Don't spin if the rwsem is readers owned.
- */
- ret = !rwsem_owner_is_reader(owner);
- goto done;
- }
-
- /*
- * As lock holder preemption issue, we both skip spinning if task is not
- * on cpu or its cpu is preempted
- */
- ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-done:
- rcu_read_unlock();
- return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner = READ_ONCE(sem->owner);
-
- if (!rwsem_owner_is_writer(owner))
- goto out;
-
- rcu_read_lock();
- while (sem->owner == owner) {
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
- */
- barrier();
-
- /*
- * abort spinning when need_resched or owner is not running or
- * owner's cpu is preempted.
- */
- if (!owner->on_cpu || need_resched() ||
- vcpu_is_preempted(task_cpu(owner))) {
- rcu_read_unlock();
- return false;
- }
-
- cpu_relax();
- }
- rcu_read_unlock();
-out:
- /*
- * If there is a new owner or the owner is not set, we continue
- * spinning.
- */
- return !rwsem_owner_is_reader(READ_ONCE(sem->owner));
-}
-
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
- bool taken = false;
-
- preempt_disable();
-
- /* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
- if (!osq_lock(&sem->osq))
- goto done;
-
- /*
- * Optimistically spin on the owner field and attempt to acquire the
- * lock whenever the owner changes. Spinning will be stopped when:
- * 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
- */
- while (rwsem_spin_on_owner(sem)) {
- /*
- * Try to acquire the lock
- */
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
- break;
- }
-
- /*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
- */
- if (!sem->owner && (need_resched() || rt_task(current)))
- break;
-
- /*
- * The cpu_relax() call is a compiler barrier which forces
- * everything in this loop to be re-loaded. We don't need
- * memory barriers as we'll eventually observe the right
- * values at the cost of a few extra spins.
- */
- cpu_relax();
- }
- osq_unlock(&sem->osq);
-done:
- preempt_enable();
- return taken;
+ return osq_optimistic_spin(&sem->osq, rwsem_osq_get_owner,
+ rwsem_osq_trylock);
}
/*
On Fri, May 18, 2018 at 03:49:00AM -0400, Kent Overstreet wrote:
> Add a per address space lock around adding pages to the pagecache - making it
> possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
> hopefully making truncate and dio a bit saner.
(moving this section here from the overall description so I can reply
to it in one place)
> * pagecache add lock
>
> This is the only one that touches existing code in nontrivial ways.
> The problem it's solving is that there is no existing general mechanism
> for shooting down pages in the page and keeping them removed, which is a
> real problem if you're doing anything that modifies file data and isn't
> buffered writes.
>
> Historically, the only problematic case has been direct IO, and people
> have been willing to say "well, if you mix buffered and direct IO you
> get what you deserve", and that's probably not unreasonable. But now we
> have fallocate insert range and collapse range, and those are broken in
> ways I frankly don't want to think about if they can't ensure consistency
> with the page cache.
ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
You may get pushback on the grounds that this ought to be a
filesystem-specific lock rather than one embedded in the generic inode.
> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good think if we
> switched it to a more general rigorous mechanism.
>
> I need this solved for bcachefs because without this mechanism, the page
> cache inconsistencies lead to various assertions popping (primarily when
> we didn't think we need to get a disk reservation going by page cache
> state, but then do the actual write and disk space accounting says oops,
> we did need one). And having to reason about what can happen without
> a locking mechanism for this is not something I care to spend brain
> cycles on.
>
> That said, my patch is kind of ugly, and it requires filesystem changes
> for other filesystems to take advantage of it. And unfortunately, since
> one of the code paths that needs locking is readahead, I don't see any
> realistic way of implementing the locking within just bcachefs code.
>
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this),
> but if not I'll polish up my pagecache add lock patch and see what I can
> do to make it less ugly, and hopefully other people find it palatable
> or at least useful.
My idea with the XArray is that we have a number of reserved entries which
we can use as blocking entries. I was originally planning on making this
an XArray feature, but I now believe it's a page-cache-special feature.
We can always revisit that decision if it turns out to be useful to
another user.
API:
int filemap_block_range(struct address_space *mapping, loff_t start,
loff_t end);
void filemap_remove_block(struct address_space *mapping, loff_t start,
loff_t end);
- After removing a block, the pagecache is empty between [start, end].
- You have to treat the block as a single entity; don't unblock only
a subrange of the range you originally blocked.
- Lookups of a page within a blocked range return NULL.
- Attempts to add a page to a blocked range sleep on one of the
page_wait_table queues.
- Attempts to block a blocked range will also sleep on one of the
page_wait_table queues. Is this restriction acceptable for your use
case? It's clearly not a problem for fallocate insert/collapse. It
would only be a problem for Direct I/O if people are doing subpage
directio from within the same page. I think that's rare enough to
not be a problem (but please tell me if I'm wrong!)
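To make the proposal concrete, a sketch of how a filesystem's collapse range
path might use it - purely illustrative, since filemap_block_range() and
filemap_remove_block() only exist as the proposal above, fs_shift_extents_down()
is invented, and whether blocking itself shoots down existing pages (or the
caller still truncates) is one of the details left open:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int fs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/* lookups in the blocked range return NULL, page adds sleep */
	ret = filemap_block_range(mapping, offset, LLONG_MAX);
	if (ret)
		return ret;

	truncate_pagecache_range(inode, offset, (loff_t) -1);

	ret = fs_shift_extents_down(inode, offset, len);

	/* unblock exactly the range that was blocked */
	filemap_remove_block(mapping, offset, LLONG_MAX);
	return ret;
}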
On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > Historically, the only problematic case has been direct IO, and people
> > have been willing to say "well, if you mix buffered and direct IO you
> > get what you deserve", and that's probably not unreasonable. But now we
> > have fallocate insert range and collapse range, and those are broken in
> > ways I frankly don't want to think about if they can't ensure consistency
> > with the page cache.
>
> ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> You may get pushback on the grounds that this ought to be a
> filesystem-specific lock rather than one embedded in the generic inode.
Honestly I think this probably should be in the core. But IFF we move
it to the core the existing users of per-fs locks need to be moved
over first. E.g. XFS as the very first one, and at least ext4 and f2fs
that copied the approach, and probably more if you audit deep enough.
On Fri, May 18, 2018 at 03:49:02AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> mm/filemap.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 31dd888785..78b99019bf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
>
> return ret;
> }
> +EXPORT_SYMBOL(find_get_pages_range);
EXPORT_SYMBOL_GPL and only together with the actual submission of
a user of the interface.
On Fri, May 18, 2018 at 03:49:08AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <[email protected]>
Looks generally fine. A little changelog with an explanation of
how we obviously never could get here with irqs disabled would
be nice, though.
Completely lacks any explanation, including why this should be
in lib/. Also should come in the same series with actual users
of the infrastructure.
On Fri, May 18, 2018 at 03:49:13AM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures
Hell no. This code needs to go away and not actually be promoted to
lib/.
On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <[email protected]>
Completely lacks any explanation or argument why it would be useful.
On Fri, May 18, 2018 at 09:02:45AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> Completely lacks any explanation or argument why it would be useful.
It's in the cover letter...
* Dynamic fault injection
I've actually had this code sitting in my tree since forever... I know we have
an existing fault injection framework, but I think this one is quite a bit
nicer to actually use.
It works very much like the dynamic debug infrastructure - for those who
aren't familiar, dynamic debug makes it so you can list and individually
enable/disable every pr_debug() callsite in debugfs.
So to add a fault injection site with this, you just stick a call to
dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true
if you should fail whatever it is you're testing. And then it'll show up in
debugfs, where you can enable/disable faults by file/linenumber, module, name,
etc.
The patch then also adds macros that wrap all the various memory allocation
functions and fail if dynamic_fault("memory") returns true - which means you
can see in debugfs every place you're allocating memory and fail all of them
or just individually (I have tests that iterate over all the faults and flip
them on one by one). I also use it in bcachefs to add fault injection points
for uncommon error paths in the filesystem startup/recovery path, and for
various hard to test slowpaths that only happen if we race in weird ways
(race_fault()).
On Fri, May 18, 2018 at 09:02:03AM -0700, Christoph Hellwig wrote:
> Completely lacks any explanation, including why this should be
> in lib/. Also should come in the same series with actual users
> of the infrastructure.
Also in the cover letter...
* Generic radix trees
This is a very simple radix tree implementation that can store types of
arbitrary size, not just pointers/unsigned long. It could probably replace
flex arrays.
On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> >
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
>
> Honestly I think this probably should be in the core. But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.
I didn't know about i_mmap_sem, thanks
But using an rw semaphore wouldn't work for dio writes, and I do need dio writes
to block pagecache adds in bcachefs, since the dio write could overwrite
uncompressed data or a reservation with compressed data, which means new writes
need a disk reservation.
Also, I'm guessing ext4 takes the lock at the top of the read path? That sucks
for reads that are already cached; the generic buffered read path can do cached
reads without taking any per-inode locks - that's why I pushed the lock down to
only be taken when we have to add a page to the pagecache.
Definitely less ugly doing it that way though...
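A sketch of that split with invented names - just the shape of the idea, not
the actual pagecache add lock patch:

/* dio write / fallocate / truncate side: exclude and wait out page adds */
static int example_dio_write(struct address_space *mapping)
{
	pagecache_block_get(&mapping->add_lock);	/* hypothetical */
	/* ... write knowing the range stays out of the page cache ... */
	pagecache_block_put(&mapping->add_lock);
	return 0;
}

/*
 * Only the paths that actually insert pages take the cheap shared side, so
 * already-cached reads never touch the lock at all.
 */
static int example_add_page(struct address_space *mapping,
			    struct page *page, pgoff_t index)
{
	int ret;

	pagecache_add_get(&mapping->add_lock);		/* hypothetical */
	ret = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
	pagecache_add_put(&mapping->add_lock);
	return ret;
}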
On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> These are all the remaining patches in my bcachefs tree that touch stuff outside
> fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> some discussion first.
>
> * pagecache add lock
>
> This is the only one that touches existing code in nontrivial ways. The problem
> it's solving is that there is no existing general mechanism for shooting down
> pages in the page and keeping them removed, which is a real problem if you're
> doing anything that modifies file data and isn't buffered writes.
>
> Historically, the only problematic case has been direct IO, and people have been
> willing to say "well, if you mix buffered and direct IO you get what you
> deserve", and that's probably not unreasonable. But now we have fallocate insert
> range and collapse range, and those are broken in ways I frankly don't want to
> think about if they can't ensure consistency with the page cache.
>
> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good think if we switched it
> to a more general rigorous mechanism.
>
> I need this solved for bcachefs because without this mechanism, the page cache
> inconsistencies lead to various assertions popping (primarily when we didn't
> think we need to get a disk reservation going by page cache state, but then do
> the actual write and disk space accounting says oops, we did need one). And
> having to reason about what can happen without a locking mechanism for this is
> not something I care to spend brain cycles on.
>
> That said, my patch is kind of ugly, and it requires filesystem changes for
> other filesystems to take advantage of it. And unfortunately, since one of the
> code paths that needs locking is readahead, I don't see any realistic way of
> implementing the locking within just bcachefs code.
>
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> if not I'll polish up my pagecache add lock patch and see what I can do to make
> it less ugly, and hopefully other people find it palatable or at least useful.
>
> * lglocks
>
> They were removed by Peter Zijlstra when the last in kernel user was removed,
> but I've found them useful. His commit message seems to imply he doesn't think
> people should be using them, but I'm not sure why. They are a bit niche though,
> I can move them to fs/bcachefs if people would prefer.
>
> * Generic radix trees
>
> This is a very simple radix tree implementation that can store types of
> arbitrary size, not just pointers/unsigned long. It could probably replace
> flex arrays.
>
> * Dynamic fault injection
>
I've not looked at this at all so this may not cover your use case, but I
implemented a bpf_override_return() to do focused error injection a year ago. I
have this script
https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
that does it generically, all you have to do is tag the function you want to be
error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
things like a debugfs interface to trigger them or use the above script to
trigger specific errors and such. Thanks,
Josef
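For context, the tagging Josef describes looks roughly like this - the
function is made up, and the header path is from memory (the error injection
infrastructure that went in around 4.16):

#include <linux/error-injection.h>
#include <linux/fs.h>

/*
 * Any int-returning function can be tagged; the injected return value is
 * then interpreted as an -errno.
 */
static int myfs_read_super(struct super_block *sb)
{
	/* ... real work that can fail ... */
	return 0;
}
ALLOW_ERROR_INJECTION(myfs_read_super, ERRNO);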
On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > These are all the remaining patches in my bcachefs tree that touch stuff outside
> > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> > some discussion first.
> >
> > * pagecache add lock
> >
> > This is the only one that touches existing code in nontrivial ways. The problem
> > it's solving is that there is no existing general mechanism for shooting down
> > pages in the page and keeping them removed, which is a real problem if you're
> > doing anything that modifies file data and isn't buffered writes.
> >
> > Historically, the only problematic case has been direct IO, and people have been
> > willing to say "well, if you mix buffered and direct IO you get what you
> > deserve", and that's probably not unreasonable. But now we have fallocate insert
> > range and collapse range, and those are broken in ways I frankly don't want to
> > think about if they can't ensure consistency with the page cache.
> >
> > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > historically been rather fragile, IMO it might be a good think if we switched it
> > to a more general rigorous mechanism.
> >
> > I need this solved for bcachefs because without this mechanism, the page cache
> > inconsistencies lead to various assertions popping (primarily when we didn't
> > think we need to get a disk reservation going by page cache state, but then do
> > the actual write and disk space accounting says oops, we did need one). And
> > having to reason about what can happen without a locking mechanism for this is
> > not something I care to spend brain cycles on.
> >
> > That said, my patch is kind of ugly, and it requires filesystem changes for
> > other filesystems to take advantage of it. And unfortunately, since one of the
> > code paths that needs locking is readahead, I don't see any realistic way of
> > implementing the locking within just bcachefs code.
> >
> > So I'm hoping someone has an idea for something cleaner (I think I recall
> > Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> > if not I'll polish up my pagecache add lock patch and see what I can do to make
> > it less ugly, and hopefully other people find it palatable or at least useful.
> >
> > * lglocks
> >
> > They were removed by Peter Zijlstra when the last in kernel user was removed,
> > but I've found them useful. His commit message seems to imply he doesn't think
> > people should be using them, but I'm not sure why. They are a bit niche though,
> > I can move them to fs/bcachefs if people would prefer.
> >
> > * Generic radix trees
> >
> > This is a very simple radix tree implementation that can store types of
> > arbitrary size, not just pointers/unsigned long. It could probably replace
> > flex arrays.
> >
> > * Dynamic fault injection
> >
>
> I've not looked at this at all so this may not cover your use case, but I
> implemented a bpf_override_return() to do focused error injection a year ago. I
> have this script
>
> https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
>
> that does it generically, all you have to do is tag the function you want to be
> error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
> things like a debugfs interface to trigger them or use the above script to
> trigger specific errors and such. Thanks,
That sounds pretty cool...
What about being able to add a random fault injection point in the middle of an
existing function? Being able to stick race_fault() in random places was a
pretty big win in terms of getting good code coverage out of realistic tests.
On Fri, May 18, 2018 at 01:49:12PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> > On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > > These are all the remaining patches in my bcachefs tree that touch stuff outside
> > > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> > > some discussion first.
> > >
> > > * pagecache add lock
> > >
> > > This is the only one that touches existing code in nontrivial ways. The problem
> > > it's solving is that there is no existing general mechanism for shooting down
> > > pages in the page and keeping them removed, which is a real problem if you're
> > > doing anything that modifies file data and isn't buffered writes.
> > >
> > > Historically, the only problematic case has been direct IO, and people have been
> > > willing to say "well, if you mix buffered and direct IO you get what you
> > > deserve", and that's probably not unreasonable. But now we have fallocate insert
> > > range and collapse range, and those are broken in ways I frankly don't want to
> > > think about if they can't ensure consistency with the page cache.
> > >
> > > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > > historically been rather fragile, IMO it might be a good think if we switched it
> > > to a more general rigorous mechanism.
> > >
> > > I need this solved for bcachefs because without this mechanism, the page cache
> > > inconsistencies lead to various assertions popping (primarily when we didn't
> > > think we need to get a disk reservation going by page cache state, but then do
> > > the actual write and disk space accounting says oops, we did need one). And
> > > having to reason about what can happen without a locking mechanism for this is
> > > not something I care to spend brain cycles on.
> > >
> > > That said, my patch is kind of ugly, and it requires filesystem changes for
> > > other filesystems to take advantage of it. And unfortunately, since one of the
> > > code paths that needs locking is readahead, I don't see any realistic way of
> > > implementing the locking within just bcachefs code.
> > >
> > > So I'm hoping someone has an idea for something cleaner (I think I recall
> > > Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> > > if not I'll polish up my pagecache add lock patch and see what I can do to make
> > > it less ugly, and hopefully other people find it palatable or at least useful.
> > >
> > > * lglocks
> > >
> > > They were removed by Peter Zijlstra when the last in kernel user was removed,
> > > but I've found them useful. His commit message seems to imply he doesn't think
> > > people should be using them, but I'm not sure why. They are a bit niche though,
> > > I can move them to fs/bcachefs if people would prefer.
> > >
> > > * Generic radix trees
> > >
> > > This is a very simple radix tree implementation that can store types of
> > > arbitrary size, not just pointers/unsigned long. It could probably replace
> > > flex arrays.
> > >
> > > * Dynamic fault injection
> > >
> >
> > I've not looked at this at all so this may not cover your use case, but I
> > implemented a bpf_override_return() to do focused error injection a year ago. I
> > have this script
> >
> > https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
> >
> > that does it generically, all you have to do is tag the function you want to be
> > error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
> > things like a debugfs interface to trigger them or use the above script to
> > trigger specific errors and such. Thanks,
>
> That sounds pretty cool...
>
> What about being able to add a random fault injection point in the middle of an
> existing function? Being able to stick race_fault() in random places was a
> pretty big win in terms of getting good code coverage out of realistic tests.
There's nothing stopping us from doing that, it just uses a kprobe to override
the function with our helper, so we could conceivably put it anywhere in the
function. The reason I limited it to individual functions was because it was
easier than trying to figure out the side-effects of stopping mid-function. If
I needed to fail mid-function I just added a helper where I needed it and failed
that instead. I imagine safety is going to be a larger concern if we allow bpf
scripts to randomly return anywhere inside a function, even if the function is
marked as allowing error injection. Thanks,
Josef
On Fri, May 18, 2018 at 02:03:25PM -0400, Josef Bacik wrote:
> There's nothing stopping us from doing that, it just uses a kprobe to override
> the function with our helper, so we could conceivably put it anywhere in the
> function. The reason I limited it to individual functions was because it was
> easier than trying to figure out the side-effects of stopping mid-function. If
> I needed to fail mid-function I just added a helper where I needed it and failed
> that instead. I imagine safety is going to be of larger concern if we allow bpf
> scripts to randomly return anywhere inside a function, even if the function is
> marked as allowing error injection. Thanks,
Ahh no, that's not what I want... here's an example:
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/btree_cache.c#n674
Here we've got to do this thing which can race - which is fine, we just need to
check for and handle the race, on line 709 - but actually exercising that with a
test is difficult since it requires a heavily multithreaded workload with btree
nodes getting evicted to see it happen, so - it pretends the race happened if
race_fault() returns true. The race_fault() invocation shows up in debugfs,
where userspace can tell it to fire.
The way it works is that dynamic_fault() is a macro that expands to a static
struct _dfault descriptor, stuck in a particular linker section so the dynamic
fault code can find all the descriptors and stick them in debugfs (which is
also the way dynamic debug works).
#define dynamic_fault(_class) \
({ \
static struct _dfault descriptor \
__used __aligned(8) __attribute__((section("__faults"))) = { \
.modname = KBUILD_MODNAME, \
.function = __func__, \
.filename = __FILE__, \
.line = __LINE__, \
.class = _class, \
}; \
\
static_key_false(&descriptor.enabled) && \
__dynamic_fault_enabled(&descriptor); \
})
Honestly it still seems like the cleanest and safest way of doing it to me...
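As a made-up illustration of the race_fault() pattern being described - force
the rarely-taken retry path so it gets coverage in ordinary tests (the header
is the one added by the patch under discussion; the surrounding code is
invented):

#include <linux/dynamic_fault.h>

static struct btree *btree_node_get(struct cache_set *c, u64 id)
{
	struct btree *b;
retry:
	b = btree_node_lookup(c, id);
	if (!b)
		return btree_node_fill(c, id);

	/*
	 * The node can be freed and reused between lookup and use; that's
	 * nearly impossible to hit in a test, so race_fault() lets userspace
	 * pretend it happened via debugfs:
	 */
	if (b->id != id || race_fault())
		goto retry;

	return b;
}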
On May 18, 2018, at 1:49 AM, Kent Overstreet <[email protected]> wrote:
>
> Signed-off-by: Kent Overstreet <[email protected]>
I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself. The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.
That said, I think this is useful functionality. We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem; it's just a CPP macro with an unlikely()
branch, while this looks more sophisticated, with some added functionality
like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.
Some things that are missing from this patch that is in our code:
- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
- timeout: sleep for N msec to simulate network/disk/locking delays
- race: wait with one thread until a second thread hits a matching check
We also have a "fail_val" that allows making the check conditional (e.g.
only operations on server "N" should fail, only RPC opcode "N", etc.).
Cheers, Andreas
> ---
> include/asm-generic/vmlinux.lds.h | 4 +
> include/linux/dynamic_fault.h | 117 +++++
> lib/Kconfig.debug | 5 +
> lib/Makefile | 2 +
> lib/dynamic_fault.c | 760 ++++++++++++++++++++++++++++++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
> VMLINUX_SYMBOL(__start___verbose) = .; \
> KEEP(*(__verbose)) \
> VMLINUX_SYMBOL(__stop___verbose) = .; \
> + . = ALIGN(8); \
> + VMLINUX_SYMBOL(__start___faults) = .; \
> + *(__faults) \
> + VMLINUX_SYMBOL(__stop___faults) = .; \
> LIKELY_PROFILE() \
> BRANCH_PROFILE() \
> TRACE_PRINTKS() \
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 0000000000..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include <linux/bio.h>
> +#include <linux/jump_label.h>
> +#include <linux/slab.h>
> +
> +enum dfault_enabled {
> + DFAULT_DISABLED,
> + DFAULT_ENABLED,
> + DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> + struct {
> + unsigned enabled:2;
> + unsigned count:30;
> + };
> +
> + struct {
> + unsigned v;
> + };
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite. At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> + const char *modname;
> + const char *function;
> + const char *filename;
> + const char *class;
> +
> + const u16 line;
> +
> + unsigned frequency;
> + union dfault_state state;
> +
> + struct static_key enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class) \
> +({ \
> + static struct _dfault descriptor \
> + __used __aligned(8) __attribute__((section("__faults"))) = { \
> + .modname = KBUILD_MODNAME, \
> + .function = __func__, \
> + .filename = __FILE__, \
> + .line = __LINE__, \
> + .class = _class, \
> + }; \
> + \
> + static_key_false(&descriptor.enabled) && \
> + __dynamic_fault_enabled(&descriptor); \
> +})
> +
> +#define memory_fault() dynamic_fault("memory")
> +#define race_fault() dynamic_fault("race")
> +
> +#define kmalloc(...) \
> + (memory_fault() ? NULL : kmalloc(__VA_ARGS__))
> +#define kzalloc(...) \
> + (memory_fault() ? NULL : kzalloc(__VA_ARGS__))
> +#define krealloc(...) \
> + (memory_fault() ? NULL : krealloc(__VA_ARGS__))
> +
> +#define mempool_alloc(pool, gfp_mask) \
> + ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
> + ? NULL : mempool_alloc(pool, gfp_mask))
> +
> +#define __get_free_pages(...) \
> + (memory_fault() ? 0 : __get_free_pages(__VA_ARGS__))
> +#define alloc_pages_node(...) \
> + (memory_fault() ? NULL : alloc_pages_node(__VA_ARGS__))
> +#define alloc_pages_nodemask(...) \
> + (memory_fault() ? NULL : alloc_pages_nodemask(__VA_ARGS__))
> +
> +#define bio_alloc_bioset(gfp_mask, ...) \
> + ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
> + ? NULL : bio_alloc_bioset(gfp_mask, __VA_ARGS__))
> +
> +#define bio_clone(bio, gfp_mask) \
> + ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
> + ? NULL : bio_clone(bio, gfp_mask))
> +
> +#define bio_clone_bioset(bio, gfp_mask, bs) \
> + ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
> + ? NULL : bio_clone_bioset(bio, gfp_mask, bs))
> +
> +#define bio_kmalloc(...) \
> + (memory_fault() ? NULL : bio_kmalloc(__VA_ARGS__))
> +#define bio_clone_kmalloc(...) \
> + (memory_fault() ? NULL : bio_clone_kmalloc(__VA_ARGS__))
> +
> +#define bio_iov_iter_get_pages(...) \
> + (memory_fault() ? -ENOMEM : bio_iov_iter_get_pages(__VA_ARGS__))
> +
> +#else /* CONFIG_DYNAMIC_FAULT */
> +
> +#define dfault_add_module(tab, n, modname) 0
> +#define dfault_remove_module(mod) 0
> +#define dynamic_fault(_class) 0
> +#define memory_fault() 0
> +#define race_fault() 0
> +
> +#endif /* CONFIG_DYNAMIC_FAULT */
> +
> +#endif
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 427292a759..7b7ca0d813 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1599,6 +1599,11 @@ config LATENCYTOP
> Enable this option if you want to use the LatencyTOP tool
> to find out which userspace is blocking on what kernel operations.
>
> +config DYNAMIC_FAULT
> + bool "Enable dynamic fault support"
> + default n
> + depends on DEBUG_FS
> +
> source kernel/trace/Kconfig
>
> config PROVIDE_OHCI1394_DMA_INIT
> diff --git a/lib/Makefile b/lib/Makefile
> index 66d2231d70..f6f70f4771 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -158,6 +158,8 @@ obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
>
> obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
>
> +obj-$(CONFIG_DYNAMIC_FAULT) += dynamic_fault.o
> +
> obj-$(CONFIG_NLATTR) += nlattr.o
>
> obj-$(CONFIG_LRU_CACHE) += lru_cache.o
> diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
> new file mode 100644
> index 0000000000..75fc9a1b4b
> --- /dev/null
> +++ b/lib/dynamic_fault.c
> @@ -0,0 +1,760 @@
> +/*
> + * lib/dynamic_fault.c
> + *
> + * make dynamic_fault() calls runtime configurable based upon their
> + * source module.
> + *
> + * Copyright (C) 2011 Adam Berkan <[email protected]>
> + * Based on dynamic_debug.c:
> + * Copyright (C) 2008 Jason Baron <[email protected]>
> + * By Greg Banks <[email protected]>
> + * Copyright (c) 2008 Silicon Graphics Inc. All Rights Reserved.
> + *
> + */
> +
> +#define pr_fmt(fmt) "dfault: " fmt "\n"
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/kallsyms.h>
> +#include <linux/version.h>
> +#include <linux/types.h>
> +#include <linux/mutex.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/list.h>
> +#include <linux/sysctl.h>
> +#include <linux/ctype.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/dynamic_fault.h>
> +#include <linux/debugfs.h>
> +#include <linux/slab.h>
> +
> +#undef kzalloc
> +
> +extern struct _dfault __start___faults[];
> +extern struct _dfault __stop___faults[];
> +
> +struct dfault_table {
> + struct list_head link;
> + char *mod_name;
> + unsigned int num_dfaults;
> + struct _dfault *dfaults;
> +};
> +
> +struct dfault_query {
> + const char *filename;
> + const char *module;
> + const char *function;
> + const char *class;
> + unsigned int first_line, last_line;
> + unsigned int first_index, last_index;
> +
> + unsigned match_line:1;
> + unsigned match_index:1;
> +
> + unsigned set_enabled:1;
> + unsigned enabled:2;
> +
> + unsigned set_frequency:1;
> + unsigned frequency;
> +};
> +
> +struct dfault_iter {
> + struct dfault_table *table;
> + unsigned int idx;
> +};
> +
> +static DEFINE_MUTEX(dfault_lock);
> +static LIST_HEAD(dfault_tables);
> +
> +bool __dynamic_fault_enabled(struct _dfault *df)
> +{
> + union dfault_state old, new;
> + unsigned v = df->state.v;
> + bool ret;
> +
> + do {
> + old.v = new.v = v;
> +
> + if (new.enabled == DFAULT_DISABLED)
> + return false;
> +
> + ret = df->frequency
> + ? ++new.count >= df->frequency
> + : true;
> + if (ret)
> + new.count = 0;
> + if (ret && new.enabled == DFAULT_ONESHOT)
> + new.enabled = DFAULT_DISABLED;
> + } while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
> +
> + if (ret)
> + pr_debug("returned true for %s:%u", df->filename, df->line);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__dynamic_fault_enabled);
> +
> +/* Return the last part of a pathname */
> +static inline const char *basename(const char *path)
> +{
> + const char *tail = strrchr(path, '/');
> +
> + return tail ? tail + 1 : path;
> +}
> +
> +/* format a string into buf[] which describes the _dfault's flags */
> +static char *dfault_describe_flags(struct _dfault *df, char *buf, size_t buflen)
> +{
> + switch (df->state.enabled) {
> + case DFAULT_DISABLED:
> + strlcpy(buf, "disabled", buflen);
> + break;
> + case DFAULT_ENABLED:
> + strlcpy(buf, "enabled", buflen);
> + break;
> + case DFAULT_ONESHOT:
> + strlcpy(buf, "oneshot", buflen);
> + break;
> + default:
> + BUG();
> + }
> +
> + return buf;
> +}
> +
> +/*
> + * must be called with dfault_lock held
> + */
> +
> +/*
> + * Search the tables for _dfault's which match the given
> + * `query' and apply the `flags' and `mask' to them. Tells
> + * the user which dfault's were changed, or whether none
> + * were matched.
> + */
> +static int dfault_change(const struct dfault_query *query)
> +{
> + struct dfault_table *dt;
> + unsigned int nfound = 0;
> + unsigned i, index = 0;
> + char flagbuf[16];
> +
> + /* search for matching dfaults */
> + mutex_lock(&dfault_lock);
> + list_for_each_entry(dt, &dfault_tables, link) {
> +
> + /* match against the module name */
> + if (query->module != NULL &&
> + strcmp(query->module, dt->mod_name))
> + continue;
> +
> + for (i = 0 ; i < dt->num_dfaults ; i++) {
> + struct _dfault *df = &dt->dfaults[i];
> +
> + /* match against the source filename */
> + if (query->filename != NULL &&
> + strcmp(query->filename, df->filename) &&
> + strcmp(query->filename, basename(df->filename)))
> + continue;
> +
> + /* match against the function */
> + if (query->function != NULL &&
> + strcmp(query->function, df->function))
> + continue;
> +
> + /* match against the class */
> + if (query->class) {
> + size_t len = strlen(query->class);
> +
> + if (strncmp(query->class, df->class, len))
> + continue;
> +
> + if (df->class[len] && df->class[len] != ':')
> + continue;
> + }
> +
> + /* match against the line number range */
> + if (query->match_line &&
> + (df->line < query->first_line ||
> + df->line > query->last_line))
> + continue;
> +
> + /* match against the fault index */
> + if (query->match_index &&
> + (index < query->first_index ||
> + index > query->last_index)) {
> + index++;
> + continue;
> + }
> +
> + if (query->set_enabled &&
> + query->enabled != df->state.enabled) {
> + if (query->enabled != DFAULT_DISABLED)
> + static_key_slow_inc(&df->enabled);
> + else if (df->state.enabled != DFAULT_DISABLED)
> + static_key_slow_dec(&df->enabled);
> +
> + df->state.enabled = query->enabled;
> + }
> +
> + if (query->set_frequency)
> + df->frequency = query->frequency;
> +
> + pr_debug("changed %s:%d [%s]%s #%d %s",
> + df->filename, df->line, dt->mod_name,
> + df->function, index,
> + dfault_describe_flags(df, flagbuf,
> + sizeof(flagbuf)));
> +
> + index++;
> + nfound++;
> + }
> + }
> + mutex_unlock(&dfault_lock);
> +
> +	pr_debug("%u matches", nfound);
> +
> + return nfound ? 0 : -ENOENT;
> +}
> +
> +/*
> + * Split the buffer `buf' into space-separated words.
> + * Handles simple " and ' quoting, i.e. without nested,
> + * embedded or escaped \". Return the number of words
> + * or <0 on error.
> + */
> +static int dfault_tokenize(char *buf, char *words[], int maxwords)
> +{
> + int nwords = 0;
> +
> + while (*buf) {
> + char *end;
> +
> + /* Skip leading whitespace */
> + buf = skip_spaces(buf);
> + if (!*buf)
> + break; /* oh, it was trailing whitespace */
> +
> + /* Run `end' over a word, either whitespace separated or quoted
> + */
> + if (*buf == '"' || *buf == '\'') {
> + int quote = *buf++;
> +
> + for (end = buf ; *end && *end != quote ; end++)
> + ;
> + if (!*end)
> + return -EINVAL; /* unclosed quote */
> + } else {
> + for (end = buf ; *end && !isspace(*end) ; end++)
> + ;
> + BUG_ON(end == buf);
> + }
> + /* Here `buf' is the start of the word, `end' is one past the
> + * end
> + */
> +
> + if (nwords == maxwords)
> + return -EINVAL; /* ran out of words[] before bytes */
> + if (*end)
> + *end++ = '\0'; /* terminate the word */
> + words[nwords++] = buf;
> + buf = end;
> + }
> +
> + return nwords;
> +}
> +
> +/*
> + * Parse a range.
> + */
> +static inline int parse_range(char *str,
> + unsigned int *first,
> + unsigned int *last)
> +{
> + char *first_str = str;
> + char *last_str = strchr(first_str, '-');
> +
> + if (last_str)
> + *last_str++ = '\0';
> +
> + if (kstrtouint(first_str, 10, first))
> + return -EINVAL;
> +
> + if (!last_str)
> + *last = *first;
> + else if (kstrtouint(last_str, 10, last))
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +enum dfault_token {
> + TOK_INVALID,
> +
> + /* Queries */
> + TOK_FUNC,
> + TOK_FILE,
> + TOK_LINE,
> + TOK_MODULE,
> + TOK_CLASS,
> + TOK_INDEX,
> +
> + /* Commands */
> + TOK_DISABLE,
> + TOK_ENABLE,
> + TOK_ONESHOT,
> + TOK_FREQUENCY,
> +};
> +
> +static const struct {
> + const char *str;
> + enum dfault_token tok;
> + unsigned args_required;
> +} dfault_token_strs[] = {
> + { "func", TOK_FUNC, 1, },
> + { "file", TOK_FILE, 1, },
> + { "line", TOK_LINE, 1, },
> + { "module", TOK_MODULE, 1, },
> + { "class", TOK_CLASS, 1, },
> + { "index", TOK_INDEX, 1, },
> + { "disable", TOK_DISABLE, 0, },
> + { "enable", TOK_ENABLE, 0, },
> + { "oneshot", TOK_ONESHOT, 0, },
> + { "frequency", TOK_FREQUENCY, 1, },
> +};
> +
> +static enum dfault_token str_to_token(const char *word, unsigned nr_words)
> +{
> + unsigned i;
> +
> + for (i = 0; i < ARRAY_SIZE(dfault_token_strs); i++)
> + if (!strcmp(word, dfault_token_strs[i].str)) {
> + if (nr_words < dfault_token_strs[i].args_required) {
> + pr_debug("insufficient arguments to \"%s\"",
> + word);
> + return TOK_INVALID;
> + }
> +
> + return dfault_token_strs[i].tok;
> + }
> +
> + pr_debug("unknown keyword \"%s\"", word);
> +
> + return TOK_INVALID;
> +}
> +
> +static int dfault_parse_command(struct dfault_query *query,
> + enum dfault_token tok,
> + char *words[], size_t nr_words)
> +{
> + unsigned i = 0;
> + int ret;
> +
> + switch (tok) {
> + case TOK_INVALID:
> + return -EINVAL;
> +	case TOK_FUNC:
> +		query->function = words[i++];
> +		break;
> +	case TOK_FILE:
> +		query->filename = words[i++];
> +		break;
> + case TOK_LINE:
> + ret = parse_range(words[i++],
> + &query->first_line,
> + &query->last_line);
> + if (ret)
> + return ret;
> + query->match_line = true;
> + break;
> + case TOK_MODULE:
> + query->module = words[i++];
> + break;
> + case TOK_CLASS:
> + query->class = words[i++];
> + break;
> + case TOK_INDEX:
> + ret = parse_range(words[i++],
> + &query->first_index,
> + &query->last_index);
> + if (ret)
> + return ret;
> + query->match_index = true;
> + break;
> + case TOK_DISABLE:
> + query->set_enabled = true;
> + query->enabled = DFAULT_DISABLED;
> + break;
> + case TOK_ENABLE:
> + query->set_enabled = true;
> + query->enabled = DFAULT_ENABLED;
> + break;
> + case TOK_ONESHOT:
> + query->set_enabled = true;
> + query->enabled = DFAULT_ONESHOT;
> + break;
> + case TOK_FREQUENCY:
> + query->set_frequency = 1;
> + ret = kstrtouint(words[i++], 10, &query->frequency);
> + if (ret)
> + return ret;
> +
> + if (!query->set_enabled) {
> + query->set_enabled = 1;
> + query->enabled = DFAULT_ENABLED;
> + }
> + break;
> + }
> +
> + return i;
> +}
> +
> +/*
> + * Parse words[] as a dfault query specification, which is a series
> + * of (keyword, value) pairs chosen from these possibilities:
> + *
> + * func <function-name>
> + * file <full-pathname>
> + * file <base-filename>
> + * module <module-name>
> + * line <lineno>
> + * line <first-lineno>-<last-lineno> // where either may be empty
> + * index <m>-<n> // dynamic faults numbered from <m>
> + * // to <n> inside each matching function
> + */
> +static int dfault_parse_query(struct dfault_query *query,
> + char *words[], size_t nr_words)
> +{
> + unsigned i = 0;
> +
> + while (i < nr_words) {
> + const char *tok_str = words[i++];
> + enum dfault_token tok = str_to_token(tok_str, nr_words - i);
> + int ret = dfault_parse_command(query, tok, words + i,
> + nr_words - i);
> +
> + if (ret < 0)
> + return ret;
> + i += ret;
> + BUG_ON(i > nr_words);
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * File_ops->write method for <debugfs>/dynamic_fault/control. Gathers the
> + * command text from userspace, parses and executes it.
> + */
> +static ssize_t dfault_proc_write(struct file *file, const char __user *ubuf,
> + size_t len, loff_t *offp)
> +{
> + struct dfault_query query;
> +#define MAXWORDS 9
> + int nwords;
> + char *words[MAXWORDS];
> + char tmpbuf[256];
> + int ret;
> +
> + memset(&query, 0, sizeof(query));
> +
> + if (len == 0)
> + return 0;
> + /* we don't check *offp -- multiple writes() are allowed */
> + if (len > sizeof(tmpbuf)-1)
> + return -E2BIG;
> + if (copy_from_user(tmpbuf, ubuf, len))
> + return -EFAULT;
> + tmpbuf[len] = '\0';
> +
> + pr_debug("read %zu bytes from userspace", len);
> +
> + nwords = dfault_tokenize(tmpbuf, words, MAXWORDS);
> + if (nwords < 0)
> + return -EINVAL;
> + if (dfault_parse_query(&query, words, nwords))
> + return -EINVAL;
> +
> + /* actually go and implement the change */
> + ret = dfault_change(&query);
> + if (ret < 0)
> + return ret;
> +
> + *offp += len;
> + return len;
> +}
> +
> +/* Control file read code */
> +
> +/*
> + * Set the iterator to point to the first _dfault object
> + * and return a pointer to that first object. Returns
> + * NULL if there are no _dfaults at all.
> + */
> +static struct _dfault *dfault_iter_first(struct dfault_iter *iter)
> +{
> + if (list_empty(&dfault_tables)) {
> + iter->table = NULL;
> + iter->idx = 0;
> + return NULL;
> + }
> + iter->table = list_entry(dfault_tables.next,
> + struct dfault_table, link);
> + iter->idx = 0;
> + return &iter->table->dfaults[iter->idx];
> +}
> +
> +/*
> + * Advance the iterator to point to the next _dfault
> + * object from the one the iterator currently points at,
> + * and returns a pointer to the new _dfault. Returns
> + * NULL if the iterator has seen all the _dfaults.
> + */
> +static struct _dfault *dfault_iter_next(struct dfault_iter *iter)
> +{
> + if (iter->table == NULL)
> + return NULL;
> + if (++iter->idx == iter->table->num_dfaults) {
> + /* iterate to next table */
> + iter->idx = 0;
> + if (list_is_last(&iter->table->link, &dfault_tables)) {
> + iter->table = NULL;
> + return NULL;
> + }
> + iter->table = list_entry(iter->table->link.next,
> + struct dfault_table, link);
> + }
> + return &iter->table->dfaults[iter->idx];
> +}
> +
> +/*
> + * Seq_ops start method. Called at the start of every
> + * read() call from userspace. Takes the dfault_lock and
> + * seeks the seq_file's iterator to the given position.
> + */
> +static void *dfault_proc_start(struct seq_file *m, loff_t *pos)
> +{
> + struct dfault_iter *iter = m->private;
> + struct _dfault *dp;
> + int n = *pos;
> +
> + mutex_lock(&dfault_lock);
> +
> + if (n < 0)
> + return NULL;
> + dp = dfault_iter_first(iter);
> + while (dp != NULL && --n >= 0)
> + dp = dfault_iter_next(iter);
> + return dp;
> +}
> +
> +/*
> + * Seq_ops next method. Called several times within a read()
> + * call from userspace, with dfault_lock held. Walks to the
> + * next _dfault object with a special case for the header line.
> + */
> +static void *dfault_proc_next(struct seq_file *m, void *p, loff_t *pos)
> +{
> + struct dfault_iter *iter = m->private;
> + struct _dfault *dp;
> +
> + if (p == SEQ_START_TOKEN)
> + dp = dfault_iter_first(iter);
> + else
> + dp = dfault_iter_next(iter);
> + ++*pos;
> + return dp;
> +}
> +
> +/*
> + * Seq_ops show method. Called several times within a read()
> + * call from userspace, with dfault_lock held. Formats the
> + * current _dfault as a single human-readable line, with a
> + * special case for the header line.
> + */
> +static int dfault_proc_show(struct seq_file *m, void *p)
> +{
> + struct dfault_iter *iter = m->private;
> + struct _dfault *df = p;
> + char flagsbuf[8];
> +
> + seq_printf(m, "%s:%u class:%s module:%s func:%s %s \"\"\n",
> + df->filename, df->line, df->class,
> + iter->table->mod_name, df->function,
> + dfault_describe_flags(df, flagsbuf, sizeof(flagsbuf)));
> +
> + return 0;
> +}
> +
> +/*
> + * Seq_ops stop method. Called at the end of each read()
> + * call from userspace. Drops dfault_lock.
> + */
> +static void dfault_proc_stop(struct seq_file *m, void *p)
> +{
> + mutex_unlock(&dfault_lock);
> +}
> +
> +static const struct seq_operations dfault_proc_seqops = {
> + .start = dfault_proc_start,
> + .next = dfault_proc_next,
> + .show = dfault_proc_show,
> + .stop = dfault_proc_stop
> +};
> +
> +/*
> + * File_ops->open method for <debugfs>/dynamic_fault/control. Does the seq_file
> + * setup dance, and also creates an iterator to walk the _dfaults.
> + * Note that we create a seq_file always, even for O_WRONLY files
> + * where it's not needed, as doing so simplifies the ->release method.
> + */
> +static int dfault_proc_open(struct inode *inode, struct file *file)
> +{
> + struct dfault_iter *iter;
> + int err;
> +
> + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> + if (iter == NULL)
> + return -ENOMEM;
> +
> + err = seq_open(file, &dfault_proc_seqops);
> + if (err) {
> + kfree(iter);
> + return err;
> + }
> + ((struct seq_file *) file->private_data)->private = iter;
> + return 0;
> +}
> +
> +static const struct file_operations dfault_proc_fops = {
> + .owner = THIS_MODULE,
> + .open = dfault_proc_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = seq_release_private,
> + .write = dfault_proc_write
> +};
> +
> +/*
> + * Allocate a new dfault_table for the given module
> + * and add it to the global list.
> + */
> +int dfault_add_module(struct _dfault *tab, unsigned int n,
> + const char *name)
> +{
> + struct dfault_table *dt;
> + char *new_name;
> + const char *func = NULL;
> + int i;
> +
> + dt = kzalloc(sizeof(*dt), GFP_KERNEL);
> + if (dt == NULL)
> + return -ENOMEM;
> + new_name = kstrdup(name, GFP_KERNEL);
> + if (new_name == NULL) {
> + kfree(dt);
> + return -ENOMEM;
> + }
> + dt->mod_name = new_name;
> + dt->num_dfaults = n;
> + dt->dfaults = tab;
> +
> + mutex_lock(&dfault_lock);
> + list_add_tail(&dt->link, &dfault_tables);
> + mutex_unlock(&dfault_lock);
> +
> + /* __attribute__(("section")) emits things in reverse order */
> + for (i = n - 1; i >= 0; i--)
> + if (!func || strcmp(tab[i].function, func))
> + func = tab[i].function;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dfault_add_module);
> +
> +static void dfault_table_free(struct dfault_table *dt)
> +{
> + list_del_init(&dt->link);
> + kfree(dt->mod_name);
> + kfree(dt);
> +}
> +
> +/*
> + * Called in response to a module being unloaded. Removes
> + * any dfault_table's which point at the module.
> + */
> +int dfault_remove_module(char *mod_name)
> +{
> + struct dfault_table *dt, *nextdt;
> + int ret = -ENOENT;
> +
> + mutex_lock(&dfault_lock);
> + list_for_each_entry_safe(dt, nextdt, &dfault_tables, link) {
> + if (!strcmp(dt->mod_name, mod_name)) {
> + dfault_table_free(dt);
> + ret = 0;
> + }
> + }
> + mutex_unlock(&dfault_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dfault_remove_module);
> +
> +static void dfault_remove_all_tables(void)
> +{
> + mutex_lock(&dfault_lock);
> + while (!list_empty(&dfault_tables)) {
> + struct dfault_table *dt = list_entry(dfault_tables.next,
> + struct dfault_table,
> + link);
> + dfault_table_free(dt);
> + }
> + mutex_unlock(&dfault_lock);
> +}
> +
> +static int __init dynamic_fault_init(void)
> +{
> + struct dentry *dir, *file;
> + struct _dfault *iter, *iter_start;
> + const char *modname = NULL;
> + int ret = 0;
> + int n = 0;
> +
> + dir = debugfs_create_dir("dynamic_fault", NULL);
> + if (!dir)
> + return -ENOMEM;
> + file = debugfs_create_file("control", 0644, dir, NULL,
> + &dfault_proc_fops);
> + if (!file) {
> + debugfs_remove(dir);
> + return -ENOMEM;
> + }
> + if (__start___faults != __stop___faults) {
> + iter = __start___faults;
> + modname = iter->modname;
> + iter_start = iter;
> + for (; iter < __stop___faults; iter++) {
> + if (strcmp(modname, iter->modname)) {
> + ret = dfault_add_module(iter_start, n, modname);
> + if (ret)
> + goto out_free;
> + n = 0;
> + modname = iter->modname;
> + iter_start = iter;
> + }
> + n++;
> + }
> + ret = dfault_add_module(iter_start, n, modname);
> + }
> +out_free:
> + if (ret) {
> + dfault_remove_all_tables();
> +		debugfs_remove(file);
> +		debugfs_remove(dir);
> +	}
> +	return ret;
> +}
> +module_init(dynamic_fault_init);
> --
> 2.17.0
>
Cheers, Andreas
On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet <[email protected]> wrote:
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself. The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
>
>
> That said, I think this is a useful functionality. We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated. This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.
This is pretty much what I was looking for, I just wanted to know if this patch
was interesting enough to anyone that I should spend more time on it or just
drop it :) Agreed on documentation. I think it's also worth factoring out the
functionality for the elf section trick that dynamic debug uses too.
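For anyone who hasn't seen it, the section trick is nothing more than having
each callsite emit a descriptor into a dedicated ELF section and walking the
linker-provided start/stop symbols at init time - that's what the
__start___faults/__stop___faults walk in the patch above is doing. A toy
userspace version of the same mechanism (names made up, purely illustrative):

#include <stdio.h>

struct callsite {
	const char *file;
	unsigned int line;
};

/* each use drops one descriptor into the "callsites" section */
#define CALLSITE_HERE()							\
do {									\
	static const struct callsite here				\
		__attribute__((section("callsites"), used)) =		\
		{ __FILE__, __LINE__ };					\
	(void)here;							\
} while (0)

/*
 * GNU ld provides __start_<sec>/__stop_<sec> symbols for any section
 * whose name is a valid C identifier, bracketing everything placed in it.
 */
extern const struct callsite __start_callsites[];
extern const struct callsite __stop_callsites[];

static void foo(void) { CALLSITE_HERE(); }
static void bar(void) { CALLSITE_HERE(); }

int main(void)
{
	const struct callsite *cs;

	foo();
	bar();
	for (cs = __start_callsites; cs != __stop_callsites; cs++)
		printf("callsite at %s:%u\n", cs->file, cs->line);
	return 0;
}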
> Some things that are missing from this patch that is in our code:
>
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
> - timeout: sleep for N msec to simulate network/disk/locking delays
> - race: wait with one thread until a second thread hits matching check
>
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).
Those all sound like good ideas... fail_val especially, I think with that we'd
have all the functionality the existing fault injection framework has (which is
way too heavyweight to actually get used, imo)
On May 18, 2018, at 1:10 PM, Kent Overstreet <[email protected]> wrote:
>
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <[email protected]> wrote:
>>>
>>> Signed-off-by: Kent Overstreet <[email protected]>
>>
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself. The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>>
>>
>> That said, I think this is a useful functionality. We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated. This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
>
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
>
>> Some things that are missing from this patch that is in our code:
>>
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>> - timeout: sleep for N msec to simulate network/disk/locking delays
>> - race: wait with one thread until a second thread hits matching check
>>
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
>
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)
The other thing we have, slightly orthogonal to your modes and possible
because we just use a __u32 for the fault location, is that "oneshot" is
simply a mask added to the fault location; combined with "fail_val" we can
add other masks like "fail N times", "fail randomly 1/N times", or "pass N
times before failure".  Another mask is set by the kernel when the fault is
actually hit, so that test scripts can poll until that happens and then
continue running.
The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?
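Very roughly, the encoding works like the below - the bit names and helpers
here are made up to illustrate the idea, they're not our actual ones, and
there's no locking:

#include <linux/types.h>
#include <linux/random.h>

#define FAIL_LOC_MASK	0x0000ffff	/* which check site to trigger */
#define FAIL_ONESHOT	0x01000000	/* disarm after the first hit */
#define FAIL_RANDOM	0x02000000	/* fail 1 time in fail_val */
#define FAIL_SKIP	0x04000000	/* pass fail_val times, then fail */
#define FAIL_CHECKED	0x10000000	/* set by the kernel once hit */

static u32 fail_loc;	/* written from userspace via proc/debugfs */
static u32 fail_val;

static bool fail_check(u32 site)
{
	if ((fail_loc & FAIL_LOC_MASK) != site)
		return false;

	if ((fail_loc & FAIL_RANDOM) &&
	    fail_val && prandom_u32() % fail_val)
		return false;

	if ((fail_loc & FAIL_SKIP) && fail_val) {
		fail_val--;
		return false;
	}

	/* test scripts poll for this bit, then carry on */
	fail_loc |= FAIL_CHECKED;

	if (fail_loc & FAIL_ONESHOT)
		fail_loc &= ~FAIL_LOC_MASK;

	return true;
}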
Cheers, Andreas
On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> >
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
>
> Honestly I think this probably should be in the core. But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.
I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
merging bcachefs. Sorry, but that's a bit crazy.
I am more than happy to work on the locking itself if we can agree on what
semantics we want out of it. We have two possible approaches, and we're going to
have to pick one first: the locking can be done at the top of the IO stack (like
ext4 and I'm guessing xfs), but then we're adding locking overhead to buffered
reads and writes that don't need it because they're only touching pages that are
already in cache.
Or we can go with my approach, pushing down the locking to only when we need to
add pages to the page cache. I think if we started out by merging my approach,
it would be pretty easy to have it make use of Matthew's fancy xarray based
range locking when that goes in; the semantics should be similar enough.
If people are ok with and willing to use my approach, I can polish it up - add
lockdep support and whatever else I can think of, and attempt to get rid of the
stupid recursive part.
But that's got to be decided first, where in the call stack the locking should
be done.
On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> >
> > Honestly I think this probably should be in the core. But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
>
> I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> merging bcachefs. Sorry, but that's a bit crazy.
It isn't crazy at all. In general we expect people to do their fair
share of core work to get their pet feature in. How much is acceptable
is a difficult question and not black and white.
But if you want to grow a critical core structure you better take a stab
at converting existing users. Without that the tradeoff of growing
that core structure is simply not given.
Or to put it in words for this exact feature: unless your new field
is also used by mainstream file systems it is not going to be added.
On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > >
> > > Honestly I think this probably should be in the core. But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> >
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
>
> It isn't crazy at all. In general we expect people to do their fair
> share of core work to get their pet feature in. How much is acceptable
> is a difficult question and not black and white.
>
> But if you want to grow a critical core structure you better take a stab
> at converting existing users. Without that the tradeoff of growing
> that core structure is simply not given.
>
> Or to put it in words for this exact feature: unless your new field
> is also used by mainstream file systems it is not going to be added.
Christoph, I'm really not someone you can accuse of avoiding my share of core
work and refactoring and you know it.
When are you going to get around to converting existing users of fs/direct-io.c
to iomap so it can be deleted? The kernel is carrying around two dio
implementations right now thanks to you. Not a good situation, is it?
On Fri, May 18, 2018 at 01:45:49PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> > On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > > Historically, the only problematic case has been direct IO, and people
> > > > have been willing to say "well, if you mix buffered and direct IO you
> > > > get what you deserve", and that's probably not unreasonable. But now we
> > > > have fallocate insert range and collapse range, and those are broken in
> > > > ways I frankly don't want to think about if they can't ensure consistency
> > > > with the page cache.
> > >
> > > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > > You may get pushback on the grounds that this ought to be a
> > > filesystem-specific lock rather than one embedded in the generic inode.
> >
> > Honestly I think this probably should be in the core. But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
>
> I didn't know about i_mmap_sem, thanks
>
> But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
> to block pagecache adds in bcachefs since the dio write could overwrite
> uncompressed data or a reservation with compressed data, which means new writes
> need a disk reservation.
>
> Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
> for reads that are already cached, the generic buffered read path can do cached
> reads without taking any per inode locks - that's why I pushed the lock down to
> only be taken when we have to add a page to the pagecache.
>
> Definitely less ugly doing it that way though...
More notes on locking design: Matthew, this is very relevant to any sort of range
locking, too.
There's some really tricky issues related to dio writes + page faults. If your
particular filesystem doesn't care about minor page cache consistency issues
caused by dio writes most of this may not be relevant to you, but I honestly
would find it a little hard to believe this isn't an issue for _any_ other
filesystems.
Current situation today, for most filesystems: top of the dio write path shoots
down the region of the pagecache for the file it's writing to, with
filemap_write_and_wait_range() then invalidate_inode_pages2_range().
This is racy though and does _not_ guarantee page cache consistency, i.e. we
can end up in a situation where the write completes, but we have stale data -
and worse, potentially stale metadata, in the buffer heads or whatever your
filesystem uses.
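To make the current sequence concrete, the top of the dio write path does
roughly the following (see generic_file_direct_write() for the real thing,
error handling trimmed):

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch of the current shootdown at the top of a dio write */
static int dio_write_shootdown(struct address_space *mapping,
			       loff_t pos, size_t count)
{
	int ret;

	/* flush out anything dirty in the range we're about to overwrite */
	ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (ret)
		return ret;

	/*
	 * ...then toss it from the page cache.  Nothing stops the pages
	 * from being redirtied or faulted right back in after this, which
	 * is the race described above.
	 */
	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					     (pos + count - 1) >> PAGE_SHIFT);
}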
Ok, solving that is easy enough, we need a lock dio writes can take that
prevents pages from being added to a mapping for their duration (or alternately,
range locks, but range locks can be viewed as just an optimization).
This explains my locking design - if you have a lock that can be taken for "add"
or "block", where multiple threads can take it for add or block, but it can't be
in both states simultaneously then it does what we want, you can have multiple
outstanding dio writes or multiple buffered IO operations bringing pages in and
it doesn't add much overhead.
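Concretely, something like the below gives those semantics - sketch only,
not the actual patch, no lockdep or fairness: a signed count that's positive
while held for "add" and negative while held for "block".

#include <linux/atomic.h>
#include <linux/wait.h>

struct pagecache_lock {
	atomic_long_t		v;	/* > 0: held for add, < 0: held for block */
	wait_queue_head_t	wait;
};

static void pagecache_lock_init(struct pagecache_lock *lock)
{
	atomic_long_set(&lock->v, 0);
	init_waitqueue_head(&lock->wait);
}

static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
{
	long v;

	do {
		/* sleep until the lock isn't held in the opposite mode */
		wait_event(lock->wait,
			   (v = atomic_long_read(&lock->v)) * i >= 0);
	} while (atomic_long_cmpxchg(&lock->v, v, v + i) != v);
}

static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
{
	if (atomic_long_sub_return(i, &lock->v) == 0)
		wake_up_all(&lock->wait);
}

/* buffered IO/page faults adding pages to the page cache: */
#define pagecache_add_get(l)	__pagecache_lock_get(l, 1)
#define pagecache_add_put(l)	__pagecache_lock_put(l, 1)
/* dio writes, truncate, fallocate etc. keeping pages out: */
#define pagecache_block_get(l)	__pagecache_lock_get(l, -1)
#define pagecache_block_put(l)	__pagecache_lock_put(l, -1)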
This isn't enough, though.
- calling filemap_write_and_wait_range then invalidate_inode_pages2_range can
race - the call to invalidate_inode_pages2_range will fail if any of the
pages have been redirtied, and we'll have stale pages in the page cache.
The ideal solution for this would be a function to do both, that removes
pages from the page cache atomically with clearing PageDirty and kicking off
writeback. Alternately, you can have .page_mkwrite and the buffered write
path take the pagecache add lock when they have to dirty a page, but that
kind of sucks.
- pagefaults, via gup()
this is the really annoying one, if userspace does a dio write where the buf
they're writing is mmapped and overlaps with the part of the file they're
writing to, yay fun
we call gup() after shooting down the part of the pagecache we're writing
to, so gup() just faults it back in again and we're left with stale data in
the page cache again.
the generic dio code tries to deal with this these days by calling
invalidate_inode_pages2_range() again after the dio write completes. Again
though, invalidate_inode_pages2_range() will fail if the pages are dirty, and
we're left with stale data in the page cache.
I _think_ (haven't tried this yet) it ought to be possible to move the second
call to invalidate_inode_pages2_range() to immediately after the call to
gup() - this change wouldn't make sense in the current code without locking,
but it would make sense if the dio write path is holding a pagecache add lock
to prevent anything but its own faults via gup() from bringing pages back in.
Also need to prevent the pages gup() faulted in from being dirtied until they
can be removed again...
Also, one of the things my patch did was change filemap_fault() to not kick
off readahead and only read in single pages if we're being called with
pagecache block lock held (i.e. from dio write to the same file). I'm pretty
sure this is unnecessary though, if I add the second
invalidate_inode_pages2_range() call to my own dio code after gup() (which I
believe is the correct solution anyways).
- lock recursion
my locking approach pushes down the locking to only when we're adding pages
to the radix tree, unlike, I believe, the ext4/xfs approach.
this means that dio write takes page_cache_block lock, and page faults take
page_cache_add lock, so dio write -> gup -> fault is a recursive locking
deadlock unless we do something about it.
my solution was to add a pointer to a pagecache_lock to task_struct, so we
can detect the recursion when we go to take pagecache_add lock (rough sketch
below, after this list). It's ugly, but I haven't thought of a better way to
do it.
I'm pretty sure that the xarray based range locking Matthew wants to do will
hit the same issue, it's just inherent to pushing the locking down to where
we manipulate the radix tree - which IMO we want to do, so that buffered
reads/writes to cached data aren't taking these locks unnecessarily; those
are the fast paths.
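To sketch the recursion check from the last item (building on the lock
sketch above - the task_struct field and helper names are made up, and the
matching add-side put needs the same check):

#include <linux/sched.h>

/* assumes a new field in task_struct: struct pagecache_lock *pagecache_lock */

static void pagecache_block_get_recursable(struct pagecache_lock *lock)
{
	__pagecache_lock_get(lock, -1);
	current->pagecache_lock = lock;
}

static void pagecache_block_put_recursable(struct pagecache_lock *lock)
{
	current->pagecache_lock = NULL;
	__pagecache_lock_put(lock, -1);
}

static void pagecache_add_get_check_recursion(struct pagecache_lock *lock)
{
	/*
	 * dio write -> gup() -> fault in the same file: this task already
	 * holds the lock for "block", so taking it for "add" here would
	 * deadlock.  Skip it - the dio write is then responsible for
	 * invalidating whatever its own faults pull back in.
	 */
	if (current->pagecache_lock == lock)
		return;

	__pagecache_lock_get(lock, 1);
}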
If anyone has any better ideas than what I've come up with, or sees any gaping
holes in my design, please speak up...
Even if you don't care about dio write consistency, having this locking in the
core VFS could mean converting truncate to use it, which I think would be a
major benefit too.