2014-07-01 21:57:07

by Thomas Schoebel-Theuer

Subject: Please review: generic brick framework + first application: asynchronous block device replication

Hi all,

after almost 20 years, I am happy to be back in the kernel hacking
community with a new project called MARS Light (Multiversion
Asynchronous Replication System).

Its application area is _different_ from that of DRBD:

MARS replicates generic block devices asynchronously, over long distances
and through network bottlenecks, while the synchronous DRBD works best
over crossover cables (running DRBD through long-distance network
bottlenecks may lead to the serious problems described in the
presentation below, which we have also observed in practice -- however,
I must clearly emphasize that from our experience in 1&1 datacenters
DRBD runs very well in appropriate short-distance scenarios -- so the
two systems simply have different application areas, no more, no less).

In addition, MARS can replicate to k > 2 replicas out of the box.

For a quick overview of the differences to DRBD (conceptual /
behavioural), feature comparisons (also with the commercial DRBD Proxy),
etc., please look at the presentation slides from LinuxTag 2014:

https://github.com/schoebel/mars/blob/master/docu/MARS_LinuxTag2014.pdf?raw=true

...which is an extended version of my LCA2014 presentation from January
2014, where some attending kernel hackers already got a first impression.

If you want a deeper understanding of concepts and operations, please
read the manual at

https://github.com/schoebel/mars/blob/master/docu/mars-manual.pdf?raw=true

MARS has been in production at 1&1 Internet AG since March 2014.

In addition, MARS has been extensively tested with a fully automatic
test suite developed by Frank Liepold (also available at
https://github.com/schoebel/mars ). It contains more than 100 test cases.

Although the test suite has some shortcomings (many false positives
when run uncustomized/unmodified on different hardware/networks), it has
proved to be a valuable tool, at least for regression testing.
Unfortunately, Frank is no longer at 1&1. If I had more time, I would
fix the test suite to make it more robust. Alternatively, help from
the community would be highly appreciated! Please contact me
by email if you are seriously interested.

The github version of MARS should be compilable out-of-tree against
older kernels (starting at least from 2.6.32).

In contrast, the attached patches are for kernel 3.16 and no longer
contain code for backwards compatibility (they also contain many other
code cleanups, in order to pass checkpatch.pl except for some probable
false positives and except for LONG_LINE).

The github version can be converted almost fully automatically into the
(proposed) upstream version via ./rework-mars-for-upstream.pl, which
not only renames some identifiers to (hopefully) better names / more
systematic naming conventions via some heavy regex magic, but also
moves files to different (configurable) locations. If anyone wants
a different location than drivers/block/mars/ (e.g. for the generic
brick framework part, which doesn't really belong in "drivers" because
it /potentially/ can be used almost everywhere), it should be very
easy to adapt this.

If possible and if it makes sense, I will also fix many _systematic_
review complaints in ./rework-mars-for-upstream.pl instead of in the
C sources. ./rework-mars-for-upstream.pl starts from the branch
WIP-BASE in the out-of-tree MARS repo (see github) and creates two
branches: WIP-PORTABLE (which contains the intended future base
for the out-of-tree version) and WIP-PROPOSED-UPSTREAM (where the
code for backwards compatibility is already stripped off).
Finally, the files are transferred to the kernel repo (using
different paths) and the kernel patchset is generated, in which
the new files appear as starting afresh.

For some limited time (a few years), the out-of-tree repo must be
maintained in parallel to the kernel upstream, because 1&1
(and probably other people in the world) will keep using very old
kernels, at least for some time. My long-term goal is to freeze
the out-of-tree version some day and to maintain only the in-tree
version permanently.

The attached kernel patchset (as generated by rework-mars-for-upstream.pl)
contains 4 parts which could theoretically be submitted independently
of each other, but IMHO that wouldn't make sense if the goal is a
_working_ system:

1) the generic brick framework. Many concepts stem from my old Athomux
research project at the University of Stuttgart. The current Linux
implementation is only "instance based", while Athomux was the first
prototype implementation of a fully "instance oriented" (IOP) system.
The future "MARS Full" is planned to make full use of IOP.

Details on the IOP concepts can be found at http://www.athomux.net under
papers/ (also look for the monograph written in German if you are /very/
deeply interested -- and of course I will be happy to explain it
personally to anyone, preferably at a meeting opportunity).

2) the first framework personality, called "XIO" (eXtended IO). It is
conceptually similar to AIO, and conceptually a true superset of BIO.

3) the first application "MARS Light" which uses the XIO personality.

Notice that 1) to 3) make _no_ _modifications_ to any other parts of
the kernel! They each reside in their own subdirectory.

IMHO, 1) and 2) potentially form a new subsystem in the kernel. Of course,
there might be different opinions on that, so I prefer to start with a
small version containing only the things needed for MARS, and to
move / extend it later only when needed.

4) only 2 patches (the last two in the patchset) which make only
_trivial_ modifications to the rest of the kernel: mostly some additional
EXPORT_SYMBOL() calls and of course some one-liners for Kconfig and Makefile.

The attached version of item 4) is the so-called "generic" pre-patch,
which is also needed for out-of-tree builds against older kernels.
The current version of MARS can only be compiled as a module
(if needed, this restriction could be overcome some day).

Please, if possible, include this pre-patch (or a substitute) rather
quickly if the main code review should take a longer time. You would
help me establish MARS more widely in the world / at Linux distros
via the out-of-tree version.

It would be great if the maintainers of older *.y kernel branches would
also include the corresponding pre-patch for their version; this
would help me _greatly_. Specialized versions for older kernels can be
found at github in the pre-patches/ subdirectory.

The "generic" pre-patch generically calls EXPORT_SYMBOL() on all
sys_*() functions, instead of marking only the needed ones. IMHO,
this has the advantage that no maintainance is needed whenever some
future extension of MARS (or any other external kernel modules) need
dynamic linking on such a symbol. Of course, it has the disadvantage
of growing the symbol table. IMHO, the sys_* are _anyway_ standardized
by POSIX and other standards, forming one of the most stable APIs in the
world. So there should be no other drawback when mass exporting those
symbols - even better than exporting any other kernel symbol.
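
To illustrate what the pre-patch does (a schematic sketch only, not a
verbatim excerpt -- the actual patch enumerates the exports
mechanically):

/* schematic effect of the "generic" pre-patch: after the definition
 * of each syscall, the corresponding sys_* symbol gets exported so
 * that modules like mars.ko can link against it dynamically
 */
EXPORT_SYMBOL(sys_read);
EXPORT_SYMBOL(sys_write);
EXPORT_SYMBOL(sys_lseek);
/* ... and so on for all sys_*() entry points ... */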

If the "generic" version of the pre-patch is objected / rejected
for any reason, I will happily provide you a new version exporting only
the needed symbols.

Although I am very busy working at 1&1 (not always on MARS),
I will try to answer all your questions in a timely manner.

I would be glad to be invited to the Kernel Summit, and I would
like to meet some old friends again from the ancient times when I was
active in the community, but to whom I sadly lost the connection due
to fateful private reasons.

Thanks and cheers,

Thomas


2014-07-01 21:47:43

by Thomas Schoebel-Theuer

Subject: [PATCH 01/50] mars: add new file include/linux/brick/lamport.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lamport.h | 10 ++++++++++
1 file changed, 10 insertions(+)
create mode 100644 include/linux/brick/lamport.h

diff --git a/include/linux/brick/lamport.h b/include/linux/brick/lamport.h
new file mode 100644
index 0000000..f567eee
--- /dev/null
+++ b/include/linux/brick/lamport.h
@@ -0,0 +1,10 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef LAMPORT_H
+#define LAMPORT_H
+
+#include <linux/time.h>
+
+extern void get_lamport(struct timespec *now);
+extern void set_lamport(struct timespec *old);
+
+#endif
--
2.0.0
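
A note for readers new to the concept: the intent is that every
outgoing network message gets stamped via get_lamport(), and every
received stamp is fed back via set_lamport(), so the local clock never
runs behind any timestamp it has already seen. A minimal sketch of a
conforming implementation (for illustration only -- the locking and
tie-breaking details of the real lamport.c may differ):

#include <linux/time.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lamport_lock);
static struct timespec lamport_now;

void get_lamport(struct timespec *now)
{
        struct timespec real = CURRENT_TIME;

        spin_lock(&lamport_lock);
        /* never step backwards; always advance strictly monotonically */
        if (timespec_compare(&real, &lamport_now) > 0)
                lamport_now = real;
        else
                timespec_add_ns(&lamport_now, 1);
        *now = lamport_now;
        spin_unlock(&lamport_lock);
}

void set_lamport(struct timespec *old)
{
        spin_lock(&lamport_lock);
        /* catch up with any remote timestamp we have already seen */
        if (timespec_compare(old, &lamport_now) > 0)
                lamport_now = *old;
        spin_unlock(&lamport_lock);
}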

2014-07-01 21:47:58

by Thomas Schoebel-Theuer

Subject: [PATCH 21/50] mars: add new file include/linux/xio_net.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio_net.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 126 insertions(+)
create mode 100644 include/linux/xio_net.h

diff --git a/include/linux/xio_net.h b/include/linux/xio_net.h
new file mode 100644
index 0000000..c9f48f1
--- /dev/null
+++ b/include/linux/xio_net.h
@@ -0,0 +1,126 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_NET_H
+#define XIO_NET_H
+
+#include <net/sock.h>
+#include <net/ipconfig.h>
+#include <net/tcp.h>
+
+#include <linux/brick/brick.h>
+
+extern int xio_net_default_port;
+extern bool xio_net_is_alive;
+
+#define MAX_DESC_CACHE 16
+
+/* The original struct socket has no refcount. This leads to problems
+ * during long-lasting system calls when racing with socket shutdown.
+ *
+ * The original idea of struct xio_socket was just a small wrapper
+ * adding a refcount and some debugging aid.
+ * Later, some buffering was added in order to take advantage of
+ * kernel_sendpage().
+ * Caching of meta description has also been added.
+ */
+struct xio_socket {
+ struct socket *s_socket;
+ void *s_buffer;
+ atomic_t s_count;
+ int s_pos;
+ int s_debug_nr;
+ int s_send_abort;
+ int s_recv_abort;
+ int s_send_cnt;
+ int s_recv_cnt;
+ bool s_shutdown_on_err;
+ bool s_alive;
+
+ u8 s_send_proto;
+ u8 s_recv_proto;
+ struct xio_desc_cache *s_desc_send[MAX_DESC_CACHE];
+ struct xio_desc_cache *s_desc_recv[MAX_DESC_CACHE];
+};
+
+struct xio_tcp_params {
+ int ip_tos;
+ int tcp_window_size;
+ int tcp_nodelay;
+ int tcp_timeout;
+ int tcp_keepcnt;
+ int tcp_keepintvl;
+ int tcp_keepidle;
+};
+
+extern struct xio_tcp_params default_tcp_params;
+
+enum {
+ CMD_NOP,
+ CMD_NOTIFY,
+ CMD_CONNECT,
+ CMD_GETINFO,
+ CMD_GETENTS,
+ CMD_AIO,
+ CMD_CB,
+};
+
+#define CMD_FLAG_MASK 255
+#define CMD_FLAG_HAS_DATA 256
+
+struct xio_cmd {
+ struct timespec cmd_stamp; /* for automatic lamport clock */
+ int cmd_code;
+ int cmd_int1;
+
+ /* int cmd_int2; */
+ /* int cmd_int3; */
+ char *cmd_str1;
+
+ /* char *cmd_str2; */
+ /* char *cmd_str3; */
+};
+
+extern const struct meta xio_cmd_meta[];
+
+extern char *(*xio_translate_hostname)(const char *name);
+
+/* Low-level network traffic
+ */
+extern int xio_create_sockaddr(struct sockaddr_storage *addr, const char *spec);
+
+extern int xio_create_socket(struct xio_socket *msock, struct sockaddr_storage *addr, bool is_server);
+extern int xio_accept_socket(struct xio_socket *new_msock, struct xio_socket *old_msock);
+extern bool xio_get_socket(struct xio_socket *msock);
+extern void xio_put_socket(struct xio_socket *msock);
+extern void xio_shutdown_socket(struct xio_socket *msock);
+extern bool xio_socket_is_alive(struct xio_socket *msock);
+
+extern int xio_send_raw(struct xio_socket *msock, const void *buf, int len, bool cork);
+extern int xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen);
+
+/* Mid-level generic field data exchange
+ */
+extern int xio_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta);
+#define xio_recv_struct(_sock_, _data_, _meta_) \
+ ({ \
+ _xio_recv_struct(_sock_, _data_, _meta_, __LINE__); \
+ })
+extern int _xio_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line);
+
+/* High-level transport of xio structures
+ */
+extern int xio_send_dent_list(struct xio_socket *msock, struct list_head *anchor);
+extern int xio_recv_dent_list(struct xio_socket *msock, struct list_head *anchor);
+
+extern int xio_send_aio(struct xio_socket *msock, struct aio_object *aio);
+extern int xio_recv_aio(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd);
+extern int xio_send_cb(struct xio_socket *msock, struct aio_object *aio);
+extern int xio_recv_cb(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio_net(void);
+extern void exit_xio_net(void);
+
+#endif
--
2.0.0
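
To make the calling conventions concrete, here is a hedged sketch of a
client-side exchange built only from the declarations above (the
address literal, the use of CMD_NOTIFY, and the minimal error handling
are illustrative assumptions, not an excerpt from MARS):

        struct xio_socket sock = {};
        struct sockaddr_storage addr;
        struct xio_cmd cmd = {
                .cmd_code = CMD_NOTIFY,
        };
        int status;

        status = xio_create_sockaddr(&addr, "192.168.1.1:7777");
        if (status >= 0)
                status = xio_create_socket(&sock, &addr, false); /* client side */
        if (status >= 0) {
                get_lamport(&cmd.cmd_stamp); /* feed the automatic lamport clock */
                status = xio_send_struct(&sock, &cmd, xio_cmd_meta);
                if (status >= 0)
                        status = xio_recv_struct(&sock, &cmd, xio_cmd_meta);
                xio_put_socket(&sock);
        }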

2014-07-01 21:48:08

by Thomas Schoebel-Theuer

Subject: [PATCH 27/50] mars: add new file include/linux/xio/xio_bio.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_bio.h | 69 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 69 insertions(+)
create mode 100644 include/linux/xio/xio_bio.h

diff --git a/include/linux/xio/xio_bio.h b/include/linux/xio/xio_bio.h
new file mode 100644
index 0000000..73fad4e
--- /dev/null
+++ b/include/linux/xio/xio_bio.h
@@ -0,0 +1,69 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_BIO_H
+#define XIO_BIO_H
+
+#define BIO_SUBMIT_MAX_LATENCY 250 /* 250 us */
+#define BIO_IO_R_MAX_LATENCY 40000 /* 40 ms */
+#define BIO_IO_W_MAX_LATENCY 100000 /* 100 ms */
+
+extern struct threshold bio_submit_threshold;
+extern struct threshold bio_io_threshold[2];
+
+#include <linux/blkdev.h>
+
+struct bio_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ struct bio *bio;
+ struct bio_output *output;
+ unsigned long long start_stamp;
+ int status_code;
+ int hash_pos;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct bio_brick {
+ XIO_BRICK(bio);
+ /* tunables */
+ int ra_pages;
+ int bg_threshold;
+ int bg_maxfly;
+ bool do_noidle;
+ bool do_sync;
+ bool do_unplug;
+
+ /* readonly */
+ loff_t total_size;
+ atomic_t fly_count[XIO_PRIO_NR];
+ atomic_t queue_count[XIO_PRIO_NR];
+ atomic_t completed_count;
+ atomic_t total_completed_count[XIO_PRIO_NR];
+
+ /* private */
+ spinlock_t lock;
+ struct list_head queue_list[XIO_PRIO_NR];
+ struct list_head submitted_list[2];
+ struct list_head completed_list;
+
+ wait_queue_head_t submit_event;
+ wait_queue_head_t response_event;
+ struct mapfree_info *mf;
+ struct block_device *bdev;
+ struct task_struct *submit_thread;
+ struct task_struct *response_thread;
+ int bvec_max;
+ bool submitted;
+};
+
+struct bio_input {
+ XIO_INPUT(bio);
+};
+
+struct bio_output {
+ XIO_OUTPUT(bio);
+};
+
+XIO_TYPES(bio);
+
+#endif
--
2.0.0

2014-07-01 21:48:06

by Thomas Schoebel-Theuer

Subject: [PATCH 23/50] mars: add new file include/linux/lib_mapfree.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/lib_mapfree.h | 65 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
create mode 100644 include/linux/lib_mapfree.h

diff --git a/include/linux/lib_mapfree.h b/include/linux/lib_mapfree.h
new file mode 100644
index 0000000..416a901
--- /dev/null
+++ b/include/linux/lib_mapfree.h
@@ -0,0 +1,65 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_LIB_MAPFREE_H
+#define XIO_LIB_MAPFREE_H
+
+/* Mapfree infrastructure.
+ *
+ * Purposes:
+ *
+ * 1) Open files only once when possible, do ref-counting on struct mapfree_info
+ *
+ * 2) Automatically call invalidate_mapping_pages() in the background on
+ * "unused" areas to free resources.
+ * Used areas can be indicated by calling mapfree_set() frequently.
+ * Usage model: tailored to sequential logfiles.
+ *
+ * 3) Do it all in a completely decoupled manner, in order to prevent resource deadlocks.
+ *
+ * 4) Also to prevent deadlocks: always set mapping_set_gfp_mask() accordingly.
+ */
+
+#include <linux/xio.h>
+
+extern int mapfree_period_sec;
+extern int mapfree_grace_keep_mb;
+
+struct mapfree_info {
+ struct list_head mf_head;
+ struct list_head mf_dirty_anchor;
+ char *mf_name;
+ struct file *mf_filp;
+ int mf_flags;
+ atomic_t mf_count;
+ spinlock_t mf_lock;
+ loff_t mf_min[2];
+ loff_t mf_last;
+ loff_t mf_max;
+ long long mf_jiffies;
+};
+
+struct dirty_info {
+ struct list_head dirty_head;
+ struct aio_object *dirty_aio;
+ int dirty_stage;
+};
+
+struct mapfree_info *mapfree_get(const char *filename, int flags);
+
+void mapfree_put(struct mapfree_info *mf);
+
+void mapfree_set(struct mapfree_info *mf, loff_t min, loff_t max);
+
+/***************** dirty IOs on the fly *****************/
+
+void mf_insert_dirty(struct mapfree_info *mf, struct dirty_info *di);
+void mf_remove_dirty(struct mapfree_info *mf, struct dirty_info *di);
+void mf_get_dirty(struct mapfree_info *mf, loff_t *min, loff_t *max, int min_stage, int max_stage);
+void mf_get_any_dirty(const char *filename, loff_t *min, loff_t *max, int min_stage, int max_stage);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_mapfree(void);
+
+void exit_xio_mapfree(void);
+
+#endif
--
2.0.0
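
Following the usage model from the header comment (sequential
logfiles), a typical caller would look roughly like this (the pathname
and the assumption that mapfree_get() returns NULL on failure are
illustrative):

        struct mapfree_info *mf;

        mf = mapfree_get("/mars/resource-r1/log-000000001", O_RDWR);
        if (unlikely(!mf))
                return -ENOMEM;

        /* report the currently "hot" window frequently; everything
         * outside of it becomes a candidate for background
         * invalidate_mapping_pages()
         */
        mapfree_set(mf, log_pos, log_pos + io_len);

        /* ... do the actual IO on mf->mf_filp ... */

        mapfree_put(mf); /* drop the reference when done */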

2014-07-01 21:48:03

by Thomas Schoebel-Theuer

Subject: [PATCH 20/50] mars: add new file drivers/block/mars/xio_bricks/xio.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio.c | 183 ++++++++++++++++++++++++++++++++++++
1 file changed, 183 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio.c

diff --git a/drivers/block/mars/xio_bricks/xio.c b/drivers/block/mars/xio_bricks/xio.c
new file mode 100644
index 0000000..cc13478
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio.c
@@ -0,0 +1,183 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/utsname.h>
+
+#include <linux/xio.h>
+
+/************************************************************/
+
+/* infrastructure */
+
+struct banning xio_global_ban = {};
+EXPORT_SYMBOL_GPL(xio_global_ban);
+atomic_t xio_global_io_flying = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(xio_global_io_flying);
+
+static char *id;
+
+/* TODO: better use MAC addresses (or motherboard IDs where available).
+ * Or, at least, some checks for MAC addresses should be recorded / added.
+ * When the nodename is misconfigured, data might be scrambled.
+ * MAC addresses should be more secure.
+ * In the ideal case, further checks should be added to prohibit accidental
+ * name clashes.
+ */
+char *my_id(void)
+{
+ struct new_utsname *u;
+
+ if (!id) {
+ /* down_read(&uts_sem); // FIXME: this is currently not EXPORTed from the kernel! */
+ u = utsname();
+ if (u)
+ id = brick_strdup(u->nodename);
+ /* up_read(&uts_sem); */
+ }
+ return id;
+}
+EXPORT_SYMBOL_GPL(my_id);
+
+/************************************************************/
+
+/* object stuff */
+
+const struct generic_object_type aio_type = {
+ .object_type_name = "aio",
+ .default_size = sizeof(struct aio_object),
+ .object_type_nr = OBJ_TYPE_AIO,
+};
+EXPORT_SYMBOL_GPL(aio_type);
+
+/************************************************************/
+
+/* brick stuff */
+
+/*******************************************************************/
+
+/* meta descriptions */
+
+const struct meta xio_info_meta[] = {
+ META_INI(current_size, struct xio_info, FIELD_INT),
+ META_INI(tf_align, struct xio_info, FIELD_INT),
+ META_INI(tf_min_size, struct xio_info, FIELD_INT),
+ {}
+};
+EXPORT_SYMBOL_GPL(xio_info_meta);
+
+const struct meta xio_aio_meta[] = {
+ META_INI(_object_cb.cb_error, struct aio_object, FIELD_INT),
+ META_INI(io_pos, struct aio_object, FIELD_INT),
+ META_INI(io_len, struct aio_object, FIELD_INT),
+ META_INI(io_may_write, struct aio_object, FIELD_INT),
+ META_INI(io_prio, struct aio_object, FIELD_INT),
+ META_INI(io_cs_mode, struct aio_object, FIELD_INT),
+ META_INI(io_timeout, struct aio_object, FIELD_INT),
+ META_INI(io_total_size, struct aio_object, FIELD_INT),
+ META_INI(io_checksum, struct aio_object, FIELD_RAW),
+ META_INI(io_flags, struct aio_object, FIELD_INT),
+ META_INI(io_rw, struct aio_object, FIELD_INT),
+ META_INI(io_id, struct aio_object, FIELD_INT),
+ META_INI(io_skip_sync, struct aio_object, FIELD_INT),
+ {}
+};
+EXPORT_SYMBOL_GPL(xio_aio_meta);
+
+const struct meta xio_timespec_meta[] = {
+ META_INI_TRANSFER(tv_sec, struct timespec, FIELD_UINT, 8),
+ META_INI_TRANSFER(tv_nsec, struct timespec, FIELD_UINT, 4),
+ {}
+};
+EXPORT_SYMBOL_GPL(xio_timespec_meta);
+
+/************************************************************/
+
+/* crypto stuff */
+
+#include <linux/scatterlist.h>
+#include <linux/crypto.h>
+
+static struct crypto_hash *xio_tfm;
+static struct semaphore tfm_sem = __SEMAPHORE_INITIALIZER(tfm_sem, 1);
+int xio_digest_size;
+EXPORT_SYMBOL_GPL(xio_digest_size);
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+ struct hash_desc desc = {
+ .tfm = xio_tfm,
+ .flags = 0,
+ };
+ struct scatterlist sg;
+
+ memset(digest, 0, xio_digest_size);
+
+ /* TODO: use per-thread instance, omit locking */
+ down(&tfm_sem);
+
+ crypto_hash_init(&desc);
+ sg_init_table(&sg, 1);
+ sg_set_buf(&sg, data, len);
+ crypto_hash_update(&desc, &sg, sg.length);
+ crypto_hash_final(&desc, digest);
+ up(&tfm_sem);
+}
+EXPORT_SYMBOL_GPL(xio_digest);
+
+void aio_checksum(struct aio_object *aio)
+{
+ unsigned char checksum[xio_digest_size];
+ int len;
+
+ if (aio->io_cs_mode <= 0 || !aio->io_data)
+ goto out_return;
+ xio_digest(checksum, aio->io_data, aio->io_len);
+
+ len = sizeof(aio->io_checksum);
+ if (len > xio_digest_size)
+ len = xio_digest_size;
+ memcpy(&aio->io_checksum, checksum, len);
+out_return:;
+}
+EXPORT_SYMBOL_GPL(aio_checksum);
+
+/*******************************************************************/
+
+/* init stuff */
+
+int __init init_xio(void)
+{
+ XIO_INF("init_xio()\n");
+
+ xio_tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
+ /* crypto_alloc_hash() reports failure via ERR_PTR(), never via NULL */
+ if (IS_ERR(xio_tfm)) {
+ XIO_ERR("alloc crypto hash failed, status = %d\n", (int)PTR_ERR(xio_tfm));
+ return PTR_ERR(xio_tfm);
+ }
+ xio_digest_size = crypto_hash_digestsize(xio_tfm);
+ XIO_INF("digest_size = %d\n", xio_digest_size);
+
+ return 0;
+}
+
+void exit_xio(void)
+{
+ XIO_INF("exit_xio()\n");
+
+ if (xio_tfm)
+ crypto_free_hash(xio_tfm);
+
+ if (id) {
+ brick_string_free(id);
+ id = NULL;
+ }
+}
--
2.0.0
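
Usage note (illustrative sketch): after init_xio() has determined
xio_digest_size (16 in the md5 case), any buffer can be checksummed as
follows; aio objects additionally carry the result in io_checksum when
io_cs_mode > 0:

        unsigned char digest[16]; /* >= xio_digest_size for md5 */

        xio_digest(digest, data, len);
        /* or, on a fully set up aio object: */
        aio_checksum(aio);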

2014-07-01 21:48:00

by Thomas Schoebel-Theuer

Subject: [PATCH 16/50] mars: add new file drivers/block/mars/lib_timing.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/lib_timing.c | 51 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 drivers/block/mars/lib_timing.c

diff --git a/drivers/block/mars/lib_timing.c b/drivers/block/mars/lib_timing.c
new file mode 100644
index 0000000..9221c0b
--- /dev/null
+++ b/drivers/block/mars/lib_timing.c
@@ -0,0 +1,51 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/brick/lib_timing.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_MARS_DEBUG
+
+int report_timing(struct timing_stats *tim, char *str, int maxlen)
+{
+ int len = 0;
+ int time = 1;
+ int resol = 1;
+
+ static const char * const units[] = {
+ "us",
+ "ms",
+ "s",
+ "ERROR"
+ };
+ const char *unit = units[0];
+ int unit_index = 0;
+ int i;
+
+ for (i = 0; i < TIMING_MAX; i++) {
+ int this_len = scnprintf(str, maxlen,
+ "<%d%s = %d (%lld) ",
+ resol, unit,
+ tim->tim_count[i],
+ (long long)tim->tim_count[i] * time);
+ str += this_len;
+ len += this_len;
+ maxlen -= this_len;
+ if (maxlen <= 1)
+ break;
+ resol <<= 1;
+ time <<= 1;
+ if (resol >= 1000) {
+ resol = 1;
+ unit = units[++unit_index];
+ }
+ }
+ return len;
+}
+EXPORT_SYMBOL_GPL(report_timing);
+
+#endif
--
2.0.0

2014-07-01 21:47:56

by Thomas Schoebel-Theuer

Subject: [PATCH 10/50] mars: add new file drivers/block/mars/brick.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/brick.c | 801 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 801 insertions(+)
create mode 100644 drivers/block/mars/brick.c

diff --git a/drivers/block/mars/brick.c b/drivers/block/mars/brick.c
new file mode 100644
index 0000000..8826a0d
--- /dev/null
+++ b/drivers/block/mars/brick.c
@@ -0,0 +1,801 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#define _STRATEGY
+
+#include <linux/brick/brick.h>
+#include <linux/brick/brick_mem.h>
+
+/************************************************************/
+
+/* init / exit functions */
+
+void _generic_output_init(struct generic_brick *brick,
+ const struct generic_output_type *type,
+ struct generic_output *output)
+{
+ output->brick = brick;
+ output->type = type;
+ output->ops = type->master_ops;
+ output->nr_connected = 0;
+ INIT_LIST_HEAD(&output->output_head);
+}
+EXPORT_SYMBOL_GPL(_generic_output_init);
+
+void _generic_output_exit(struct generic_output *output)
+{
+ list_del_init(&output->output_head);
+ output->brick = NULL;
+ output->type = NULL;
+ output->ops = NULL;
+ output->nr_connected = 0;
+}
+EXPORT_SYMBOL_GPL(_generic_output_exit);
+
+int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick)
+{
+ brick->aspect_context.brick_index = get_brick_nr();
+ brick->type = type;
+ brick->ops = type->master_ops;
+ brick->nr_inputs = 0;
+ brick->nr_outputs = 0;
+ brick->power.off_led = true;
+ init_waitqueue_head(&brick->power.event);
+ INIT_LIST_HEAD(&brick->tmp_head);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_brick_init);
+
+void generic_brick_exit(struct generic_brick *brick)
+{
+ list_del_init(&brick->tmp_head);
+ brick->type = NULL;
+ brick->ops = NULL;
+ brick->nr_inputs = 0;
+ brick->nr_outputs = 0;
+ put_brick_nr(brick->aspect_context.brick_index);
+}
+EXPORT_SYMBOL_GPL(generic_brick_exit);
+
+int generic_input_init(struct generic_brick *brick,
+ int index,
+ const struct generic_input_type *type,
+ struct generic_input *input)
+{
+ if (index < 0 || index >= brick->type->max_inputs)
+ return -EINVAL;
+ if (brick->inputs[index])
+ return -EEXIST;
+ input->brick = brick;
+ input->type = type;
+ input->connect = NULL;
+ INIT_LIST_HEAD(&input->input_head);
+ brick->inputs[index] = input;
+ brick->nr_inputs++;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_input_init);
+
+void generic_input_exit(struct generic_input *input)
+{
+ list_del_init(&input->input_head);
+ input->brick = NULL;
+ input->type = NULL;
+ input->connect = NULL;
+}
+EXPORT_SYMBOL_GPL(generic_input_exit);
+
+int generic_output_init(struct generic_brick *brick,
+ int index,
+ const struct generic_output_type *type,
+ struct generic_output *output)
+{
+ if (index < 0 || index >= brick->type->max_outputs)
+ return -EINVAL;
+ if (brick->outputs[index])
+ return -EEXIST;
+ _generic_output_init(brick, type, output);
+ brick->outputs[index] = output;
+ brick->nr_outputs++;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_output_init);
+
+int generic_size(const struct generic_brick_type *brick_type)
+{
+ int size = brick_type->brick_size;
+ int i;
+
+ size += brick_type->max_inputs * sizeof(void *);
+ for (i = 0; i < brick_type->max_inputs; i++)
+ size += brick_type->default_input_types[i]->input_size;
+ size += brick_type->max_outputs * sizeof(void *);
+ for (i = 0; i < brick_type->max_outputs; i++)
+ size += brick_type->default_output_types[i]->output_size;
+ return size;
+}
+EXPORT_SYMBOL_GPL(generic_size);
+
+int generic_connect(struct generic_input *input, struct generic_output *output)
+{
+ BRICK_DBG("generic_connect(input=%p, output=%p)\n", input, output);
+ if (unlikely(!input || !output))
+ return -EINVAL;
+ if (unlikely(input->connect))
+ return -EEXIST;
+ if (unlikely(!list_empty(&input->input_head)))
+ return -EINVAL;
+ /* helps only against the most common errors */
+ if (unlikely(input->brick == output->brick))
+ return -EDEADLK;
+
+ input->connect = output;
+ output->nr_connected++;
+ list_add(&input->input_head, &output->output_head);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_connect);
+
+int generic_disconnect(struct generic_input *input)
+{
+ struct generic_output *connect;
+
+ BRICK_DBG("generic_disconnect(input=%p)\n", input);
+ if (!input)
+ return -EINVAL;
+ connect = input->connect;
+ if (connect) {
+ connect->nr_connected--;
+ input->connect = NULL;
+ list_del_init(&input->input_head);
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_disconnect);
+
+/************************************************************/
+
+/* general */
+
+int _brick_msleep(int msecs, bool shorten)
+{
+ unsigned long timeout;
+
+ flush_signals(current);
+ if (msecs <= 0) {
+ schedule();
+ return 0;
+ }
+ timeout = msecs_to_jiffies(msecs) + 1;
+
+ timeout = schedule_timeout_interruptible(timeout);
+
+ if (!shorten) {
+ while ((long)timeout > 0)
+ timeout = schedule_timeout_uninterruptible(timeout);
+ }
+
+ return jiffies_to_msecs(timeout);
+}
+EXPORT_SYMBOL_GPL(_brick_msleep);
+
+struct kthread {
+ unsigned long flags;
+ unsigned int cpu;
+ void *data;
+ struct completion parked;
+ struct completion exited;
+};
+
+enum KTHREAD_BITS {
+ KTHREAD_IS_PER_CPU = 0,
+ KTHREAD_SHOULD_STOP,
+ KTHREAD_SHOULD_PARK,
+ KTHREAD_IS_PARKED,
+};
+
+#define to_kthread(tsk) \
+ container_of((tsk)->vfork_done, struct kthread, exited)
+
+/* Return the live struct kthread of @k, or NULL if it has already exited. */
+static struct kthread *task_get_live_kthread(struct task_struct *k)
+{
+ struct kthread *kthread;
+
+ get_task_struct(k);
+ kthread = to_kthread(k);
+ /* It might have exited */
+ barrier();
+ if (k->vfork_done != NULL)
+ return kthread;
+ return NULL;
+}
+
+/**
+ * kthread_stop_nowait - like kthread_stop(), but don't wait for termination.
+ * @k: thread created by kthread_create().
+ *
+ * If threadfn() may call do_exit() itself, the caller must ensure
+ * task_struct can't go away.
+ *
+ * Therefore, you must not call this twice (or after kthread_stop()), at least
+ * if you don't get_task_struct() yourself.
+ */
+void kthread_stop_nowait(struct task_struct *k)
+{
+ struct kthread *kthread = task_get_live_kthread(k);
+
+ if (kthread) {
+ set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ wake_up_process(k);
+ /* in contrast to kthread_stop(), do not wait for kthread->exited */
+ }
+
+ put_task_struct(k);
+}
+EXPORT_SYMBOL_GPL(kthread_stop_nowait);
+
+void brick_thread_stop_nowait(struct task_struct *k)
+{
+ kthread_stop_nowait(k);
+}
+EXPORT_SYMBOL_GPL(brick_thread_stop_nowait);
+
+/************************************************************/
+
+/* number management */
+
+static char *nr_table;
+int nr_max = 256;
+EXPORT_SYMBOL_GPL(nr_max);
+
+int get_brick_nr(void)
+{
+ char *new;
+ int nr;
+
+ if (unlikely(!nr_table))
+ nr_table = brick_zmem_alloc(nr_max);
+
+ for (;;) {
+ for (nr = 1; nr < nr_max; nr++) {
+ if (!nr_table[nr]) {
+ nr_table[nr] = 1;
+ return nr;
+ }
+ }
+ new = brick_zmem_alloc(nr_max << 1);
+ memcpy(new, nr_table, nr_max);
+ brick_mem_free(nr_table);
+ nr_table = new;
+ nr_max <<= 1;
+ }
+}
+EXPORT_SYMBOL_GPL(get_brick_nr);
+
+void put_brick_nr(int nr)
+{
+ if (likely(nr_table && nr > 0 && nr < nr_max))
+ nr_table[nr] = 0;
+}
+EXPORT_SYMBOL_GPL(put_brick_nr);
+
+/************************************************************/
+
+/* object stuff */
+
+/************************************************************/
+
+/* brick stuff */
+
+static int nr_brick_types;
+static const struct generic_brick_type *brick_types[MAX_BRICK_TYPES];
+
+int generic_register_brick_type(const struct generic_brick_type *new_type)
+{
+ int i;
+ int found = -1;
+
+ BRICK_DBG("generic_register_brick_type() name=%s\n", new_type->type_name);
+ for (i = 0; i < nr_brick_types; i++) {
+ if (!brick_types[i]) {
+ found = i;
+ continue;
+ }
+ if (!strcmp(brick_types[i]->type_name, new_type->type_name))
+ return 0;
+ }
+ if (found < 0) {
+ if (nr_brick_types >= MAX_BRICK_TYPES) {
+ BRICK_ERR("sorry, cannot register bricktype %s.\n", new_type->type_name);
+ return -ENOMEM;
+ }
+ found = nr_brick_types++;
+ }
+ brick_types[found] = new_type;
+ BRICK_DBG("generic_register_brick_type() done.\n");
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_register_brick_type);
+
+int generic_unregister_brick_type(const struct generic_brick_type *old_type)
+{
+ BRICK_DBG("generic_unregister_brick_type()\n");
+ return -1; /* NYI */
+}
+EXPORT_SYMBOL_GPL(generic_unregister_brick_type);
+
+int generic_brick_init_full(
+ void *data,
+ int size,
+ const struct generic_brick_type *brick_type,
+ const struct generic_input_type **input_types,
+ const struct generic_output_type **output_types)
+{
+ struct generic_brick *brick = data;
+ int status;
+ int i;
+
+ if (unlikely(!data)) {
+ BRICK_ERR("invalid memory\n");
+ return -EINVAL;
+ }
+
+ /* call the generic constructors */
+
+ status = generic_brick_init(brick_type, brick);
+ if (status)
+ return status;
+ data += brick_type->brick_size;
+ size -= brick_type->brick_size;
+ if (size < 0) {
+ BRICK_ERR("Not enough MEMORY\n");
+ return -ENOMEM;
+ }
+ if (!input_types) {
+ input_types = brick_type->default_input_types;
+ if (unlikely(!input_types)) {
+ BRICK_ERR("no input types specified\n");
+ return -EINVAL;
+ }
+ }
+ brick->inputs = data;
+ data += sizeof(void *) * brick_type->max_inputs;
+ size -= sizeof(void *) * brick_type->max_inputs;
+ if (size < 0)
+ return -ENOMEM;
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ struct generic_input *input = data;
+ const struct generic_input_type *type = *input_types++;
+
+ if (!type || type->input_size <= 0)
+ return -EINVAL;
+ BRICK_DBG("generic_brick_init_full: calling generic_input_init()\n");
+ status = generic_input_init(brick, i, type, input);
+ if (status < 0)
+ return status;
+ data += type->input_size;
+ size -= type->input_size;
+ if (size < 0)
+ return -ENOMEM;
+ }
+ if (!output_types) {
+ output_types = brick_type->default_output_types;
+ if (unlikely(!output_types)) {
+ BRICK_ERR("no output types specified\n");
+ return -EINVAL;
+ }
+ }
+ brick->outputs = data;
+ data += sizeof(void *) * brick_type->max_outputs;
+ size -= sizeof(void *) * brick_type->max_outputs;
+ if (size < 0)
+ return -ENOMEM;
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ struct generic_output *output = data;
+ const struct generic_output_type *type = *output_types++;
+
+ if (!type || type->output_size <= 0)
+ return -EINVAL;
+ BRICK_DBG("generic_brick_init_full: calling generic_output_init()\n");
+ status = generic_output_init(brick, i, type, output);
+ if (status < 0)
+ return status;
+ data += type->output_size;
+ size -= type->output_size;
+ if (size < 0)
+ return -ENOMEM;
+ }
+
+ /* call the specific constructors */
+ if (brick_type->brick_construct) {
+ BRICK_DBG("generic_brick_init_full: calling brick_construct()\n");
+ status = brick_type->brick_construct(brick);
+ if (status < 0)
+ return status;
+ }
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ struct generic_input *input = brick->inputs[i];
+
+ if (!input)
+ continue;
+ if (!input->type) {
+ BRICK_ERR("input has no associated type!\n");
+ continue;
+ }
+ if (input->type->input_construct) {
+ BRICK_DBG("generic_brick_init_full: calling input_construct()\n");
+ status = input->type->input_construct(input);
+ if (status < 0)
+ return status;
+ }
+ }
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->type->output_construct) {
+ BRICK_DBG("generic_brick_init_full: calling output_construct()\n");
+ status = output->type->output_construct(output);
+ if (status < 0)
+ return status;
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_brick_init_full);
+
+int generic_brick_exit_full(struct generic_brick *brick)
+{
+ int i;
+ int status;
+
+ /* first, check all outputs */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->nr_connected) {
+ BRICK_ERR("output is connected!\n");
+ return -EPERM;
+ }
+ }
+ /* ok, test succeeded. start destruction... */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->type->output_destruct) {
+ BRICK_DBG("generic_brick_exit_full: calling output_destruct()\n");
+ status = output->type->output_destruct(output);
+ if (status < 0)
+ return status;
+ _generic_output_exit(output);
+ brick->outputs[i] = NULL; /* others may remain leftover */
+ }
+ }
+ for (i = 0; i < brick->type->max_inputs; i++) {
+ struct generic_input *input = brick->inputs[i];
+
+ if (!input)
+ continue;
+ if (!input->type) {
+ BRICK_ERR("input has no associated type!\n");
+ continue;
+ }
+ if (input->type->input_destruct) {
+ status = generic_disconnect(input);
+ if (status < 0)
+ return status;
+ BRICK_DBG("generic_brick_exit_full: calling input_destruct()\n");
+ status = input->type->input_destruct(input);
+ if (status < 0)
+ return status;
+ brick->inputs[i] = NULL; /* others may remain leftover */
+ generic_input_exit(input);
+ }
+ }
+ if (brick->type->brick_destruct) {
+ BRICK_DBG("generic_brick_exit_full: calling brick_destruct()\n");
+ status = brick->type->brick_destruct(brick);
+ if (status < 0)
+ return status;
+ }
+ generic_brick_exit(brick);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(generic_brick_exit_full);
+
+/**********************************************************************/
+
+/* default implementations */
+
+struct generic_object *generic_alloc(struct generic_object_layout *object_layout,
+ const struct generic_object_type *object_type)
+{
+ struct generic_object *object;
+ void *data;
+ int object_size;
+ int aspect_nr_max;
+ int total_size;
+ int hint_size;
+
+ CHECK_PTR_NULL(object_type, err);
+ CHECK_PTR(object_layout, err);
+
+ object_size = object_type->default_size;
+ aspect_nr_max = nr_max;
+ total_size = object_size + aspect_nr_max * sizeof(void *);
+ hint_size = object_layout->size_hint;
+ if (likely(total_size <= hint_size)) {
+ total_size = hint_size;
+ } else { /* usually happens only at the first time */
+ object_layout->size_hint = total_size;
+ }
+
+ data = brick_zmem_alloc(total_size);
+
+ atomic_inc(&object_layout->alloc_count);
+ atomic_inc(&object_layout->total_alloc_count);
+
+ object = data;
+ object->object_type = object_type;
+ object->object_layout = object_layout;
+ object->aspects = data + object_size;
+ object->aspect_nr_max = aspect_nr_max;
+ object->free_offset = object_size + aspect_nr_max * sizeof(void *);
+ object->max_offset = total_size;
+
+ if (object_type->init_fn) {
+ int status = object_type->init_fn(object);
+
+ if (status < 0)
+ goto err_free;
+ }
+
+ return object;
+
+err_free:
+ brick_mem_free(data);
+err:
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(generic_alloc);
+
+void generic_free(struct generic_object *object)
+{
+ const struct generic_object_type *object_type;
+ struct generic_object_layout *object_layout;
+ int i;
+
+ CHECK_PTR(object, done);
+ object_type = object->object_type;
+ CHECK_PTR_NULL(object_type, done);
+ object_layout = object->object_layout;
+ CHECK_PTR(object_layout, done);
+ _CHECK_ATOMIC(&object->obj_count, !=, 0);
+
+ atomic_dec(&object_layout->alloc_count);
+ for (i = 0; i < object->aspect_nr_max; i++) {
+ const struct generic_aspect_type *aspect_type;
+ struct generic_aspect *aspect = object->aspects[i];
+
+ if (!aspect)
+ continue;
+ object->aspects[i] = NULL;
+ aspect_type = aspect->aspect_type;
+ CHECK_PTR_NULL(aspect_type, done);
+ if (aspect_type->exit_fn)
+ aspect_type->exit_fn(aspect);
+ if (aspect->shortcut)
+ continue;
+ brick_mem_free(aspect);
+ atomic_dec(&object_layout->aspect_count);
+ }
+ if (object_type->exit_fn)
+ object_type->exit_fn(object);
+ brick_mem_free(object);
+done:;
+}
+EXPORT_SYMBOL_GPL(generic_free);
+
+static inline
+struct generic_aspect *_new_aspect(const struct generic_aspect_type *aspect_type, struct generic_object *obj)
+{
+ struct generic_aspect *res = NULL;
+ int size;
+ int rest;
+
+ size = aspect_type->aspect_size;
+ rest = obj->max_offset - obj->free_offset;
+ if (likely(size <= rest)) {
+ /* Optimisation: re-use single memory allocation for both
+ * the object and the new aspect.
+ */
+ res = ((void *)obj) + obj->free_offset;
+ obj->free_offset += size;
+ res->shortcut = true;
+ } else {
+ struct generic_object_layout *object_layout = obj->object_layout;
+
+ CHECK_PTR(object_layout, done);
+ /* Maintain the size hint.
+ * In future, only small aspects should be integrated into
+ * the same memory block, and the hint should not grow larger
+ * than PAGE_SIZE if it was smaller before.
+ */
+ if (size < PAGE_SIZE / 2) {
+ int max;
+
+ max = obj->free_offset + size;
+ /* This is racy, but races won't do any harm because
+ * it is just a hint, not essential.
+ */
+ if ((max < PAGE_SIZE || object_layout->size_hint > PAGE_SIZE) &&
+ object_layout->size_hint < max)
+ object_layout->size_hint = max;
+ }
+
+ res = brick_zmem_alloc(size);
+ atomic_inc(&object_layout->aspect_count);
+ atomic_inc(&object_layout->total_aspect_count);
+ }
+ res->object = obj;
+ res->aspect_type = aspect_type;
+
+ if (aspect_type->init_fn) {
+ int status = aspect_type->init_fn(res);
+
+ if (unlikely(status < 0)) {
+ BRICK_ERR("aspect init %p %p %p status = %d\n", aspect_type, obj, res, status);
+ goto done;
+ }
+ }
+
+done:
+ return res;
+}
+
+struct generic_aspect *generic_get_aspect(struct generic_brick *brick, struct generic_object *obj)
+{
+ struct generic_aspect *res = NULL;
+ int nr;
+
+ CHECK_PTR(brick, done);
+ CHECK_PTR(obj, done);
+
+ nr = brick->aspect_context.brick_index;
+ if (unlikely(nr <= 0 || nr >= obj->aspect_nr_max)) {
+ BRICK_ERR("bad nr = %d\n", nr);
+ goto done;
+ }
+
+ res = obj->aspects[nr];
+ if (!res) {
+ const struct generic_object_type *object_type = obj->object_type;
+ const struct generic_brick_type *brick_type = brick->type;
+ const struct generic_aspect_type *aspect_type;
+ int object_type_nr;
+
+ CHECK_PTR_NULL(object_type, done);
+ CHECK_PTR_NULL(brick_type, done);
+ object_type_nr = object_type->object_type_nr;
+ aspect_type = brick_type->aspect_types[object_type_nr];
+ CHECK_PTR_NULL(aspect_type, done);
+
+ res = _new_aspect(aspect_type, obj);
+
+ obj->aspects[nr] = res;
+ }
+ CHECK_PTR(res, done);
+ CHECK_PTR(res->object, done);
+ _CHECK(res->object == obj, done);
+
+done:
+ return res;
+}
+EXPORT_SYMBOL_GPL(generic_get_aspect);
+
+/***************************************************************/
+
+/* helper stuff */
+
+void set_button(struct generic_switch *sw, bool val, bool force)
+{
+ bool oldval = sw->button;
+
+ sw->force_off |= force;
+ if (sw->force_off)
+ val = false;
+ if (val != oldval) {
+ sw->button = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+EXPORT_SYMBOL_GPL(set_button);
+
+void set_on_led(struct generic_switch *sw, bool val)
+{
+ bool oldval = sw->on_led;
+
+ if (val != oldval) {
+ sw->on_led = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+EXPORT_SYMBOL_GPL(set_on_led);
+
+void set_off_led(struct generic_switch *sw, bool val)
+{
+ bool oldval = sw->off_led;
+
+ if (val != oldval) {
+ sw->off_led = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+EXPORT_SYMBOL_GPL(set_off_led);
+
+void set_button_wait(struct generic_brick *brick, bool val, bool force, int timeout)
+{
+ set_button(&brick->power, val, force);
+ if (brick->ops)
+ (void)brick->ops->brick_switch(brick);
+ if (val)
+ wait_event_interruptible_timeout(brick->power.event, brick->power.on_led, timeout);
+ else
+ wait_event_interruptible_timeout(brick->power.event, brick->power.off_led, timeout);
+}
+EXPORT_SYMBOL_GPL(set_button_wait);
+
+/***************************************************************/
+
+/* meta stuff */
+
+const struct meta *find_meta(const struct meta *meta, const char *field_name)
+{
+ const struct meta *tmp;
+
+ for (tmp = meta; tmp->field_name; tmp++) {
+ if (!strcmp(field_name, tmp->field_name))
+ return tmp;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(find_meta);
+
+/***********************************************************************/
+
+/* module init stuff */
+
+int __init init_brick(void)
+{
+ nr_table = brick_zmem_alloc(nr_max);
+ return 0;
+}
+
+void exit_brick(void)
+{
+ if (nr_table) {
+ brick_mem_free(nr_table);
+ nr_table = NULL;
+ }
+}
--
2.0.0
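
To see how the pieces above fit together: a strategy-level caller
allocates one contiguous memory block of generic_size(type),
initializes it with generic_brick_init_full(), and wires bricks via
generic_connect(). A hedged sketch (brick_type and prev are assumed to
be given; real MARS code uses higher-level wrappers on top of this):

        int size = generic_size(brick_type);
        struct generic_brick *brick = brick_zmem_alloc(size);
        int status;

        status = generic_brick_init_full(brick, size, brick_type, NULL, NULL);
        if (status >= 0)
                /* wire our input 0 to output 0 of the preceding brick */
                status = generic_connect(brick->inputs[0], prev->outputs[0]);
        /* error handling and teardown (generic_brick_exit_full(),
         * brick_mem_free()) are elided in this sketch
         */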

2014-07-01 21:47:55

by Thomas Schoebel-Theuer

Subject: [PATCH 09/50] mars: add new file include/linux/brick/brick.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/brick.h | 632 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 632 insertions(+)
create mode 100644 include/linux/brick/brick.h

diff --git a/include/linux/brick/brick.h b/include/linux/brick/brick.h
new file mode 100644
index 0000000..14769b2
--- /dev/null
+++ b/include/linux/brick/brick.h
@@ -0,0 +1,632 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_H
+#define BRICK_H
+
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kthread.h>
+
+#include <linux/atomic.h>
+
+#include <linux/brick/brick_say.h>
+#include <linux/brick/meta.h>
+
+#define MAX_BRICK_TYPES 64
+
+#define brick_msleep(msecs) _brick_msleep(msecs, false)
+extern int _brick_msleep(int msecs, bool shorten);
+#define brick_yield() brick_msleep(0)
+
+/***********************************************************************/
+
+/* printk() replacements */
+
+#define SAFE_STR(str) ((str) ? (str) : "NULL")
+
+#define _BRICK_MSG(_class, _dump, _fmt, _args...) \
+ brick_say(_class, _dump, "BRICK", __BASE_FILE__, __LINE__, __func__, _fmt, ##_args)
+
+#define BRICK_FAT(_fmt, _args...) _BRICK_MSG(SAY_FATAL, true, _fmt, ##_args)
+#define BRICK_ERR(_fmt, _args...) _BRICK_MSG(SAY_ERROR, false, _fmt, ##_args)
+#define BRICK_WRN(_fmt, _args...) _BRICK_MSG(SAY_WARN, false, _fmt, ##_args)
+#define BRICK_INF(_fmt, _args...) _BRICK_MSG(SAY_INFO, false, _fmt, ##_args)
+
+#ifdef BRICK_DEBUGGING
+#define BRICK_DBG(_fmt, _args...) _BRICK_MSG(SAY_DEBUG, false, _fmt, ##_args)
+#else
+#define BRICK_DBG(_args...) /**/
+#endif
+
+#include <linux/brick/brick_checking.h>
+
+/***********************************************************************/
+
+/* number management helpers */
+
+extern int get_brick_nr(void);
+extern void put_brick_nr(int nr);
+
+/***********************************************************************/
+
+/* definitions for generic objects with aspects */
+
+struct generic_object;
+struct generic_aspect;
+
+#define GENERIC_ASPECT_TYPE(OBJTYPE) \
+ /* readonly from outside */ \
+ const char *aspect_type_name; \
+ const struct generic_object_type *object_type; \
+ /* private */ \
+ int aspect_size; \
+ int (*init_fn)(struct OBJTYPE##_aspect *ini); \
+ void (*exit_fn)(struct OBJTYPE##_aspect *ini); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_aspect_type {
+ GENERIC_ASPECT_TYPE(generic);
+};
+
+#define GENERIC_OBJECT_TYPE(OBJTYPE) \
+ /* readonly from outside */ \
+ const char *object_type_name; \
+ /* private */ \
+ int default_size; \
+ int object_type_nr; \
+ int (*init_fn)(struct OBJTYPE##_object *ini); \
+ void (*exit_fn)(struct OBJTYPE##_object *ini); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_object_type {
+ GENERIC_OBJECT_TYPE(generic);
+};
+
+#define GENERIC_OBJECT_LAYOUT(OBJTYPE) \
+ /* private */ \
+ int size_hint; \
+ atomic_t alloc_count; \
+ atomic_t aspect_count; \
+ atomic_t total_alloc_count; \
+ atomic_t total_aspect_count; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_object_layout {
+ GENERIC_OBJECT_LAYOUT(generic);
+};
+
+#define GENERIC_OBJECT(OBJTYPE) \
+ /* maintenance, access by macros */ \
+ atomic_t obj_count; /* reference counter */ \
+ bool obj_initialized; /* internally used for checking */ \
+ /* readonly from outside */ \
+ const struct generic_object_type *object_type; \
+ /* private */ \
+ struct generic_object_layout *object_layout; \
+ struct OBJTYPE##_aspect **aspects; \
+ int aspect_nr_max; \
+ int free_offset; \
+ int max_offset; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_object {
+ GENERIC_OBJECT(generic);
+};
+
+#define GENERIC_ASPECT(OBJTYPE) \
+ /* readonly from outside */ \
+ struct OBJTYPE##_object *object; \
+ const struct generic_aspect_type *aspect_type; \
+ /* private */ \
+ bool shortcut; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_aspect {
+ GENERIC_ASPECT(generic);
+};
+
+#define GENERIC_ASPECT_CONTEXT(OBJTYPE) \
+ /* private (for any layer) */ \
+ int brick_index; /* globally unique */ \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_aspect_context {
+ GENERIC_ASPECT_CONTEXT(generic);
+};
+
+#define obj_check(aio) \
+ ({ \
+ if (unlikely(BRICK_CHECKING && !(aio)->obj_initialized)) {\
+ XIO_ERR("aio %p is not initialized\n", (aio)); \
+ } \
+ CHECK_ATOMIC(&(aio)->obj_count, 1); \
+ })
+
+#define obj_get_first(aio) \
+ ({ \
+ if (unlikely(BRICK_CHECKING && (aio)->obj_initialized)) {\
+ XIO_ERR("aio %p is already initialized\n", (aio));\
+ } \
+ _CHECK_ATOMIC(&(aio)->obj_count, !=, 0); \
+ (aio)->obj_initialized = true; \
+ atomic_inc(&(aio)->obj_count); \
+ })
+
+#define obj_get(aio) \
+ ({ \
+ obj_check(aio); \
+ atomic_inc(&(aio)->obj_count); \
+ })
+
+#define obj_put(aio) \
+ ({ \
+ obj_check(aio); \
+ atomic_dec_and_test(&(aio)->obj_count); \
+ })
+
+#define obj_free(aio) \
+ ({ \
+ if (likely(aio)) { \
+ generic_free((struct generic_object *)(aio)); \
+ } \
+ })
+
+/***********************************************************************/
+
+/* definitions for asynchronous callback objects */
+
+#define GENERIC_CALLBACK(OBJTYPE) \
+ /* set by macros, afterwards readonly from outside */ \
+ void (*cb_fn)(struct OBJTYPE##_callback *cb); \
+ void *cb_private; \
+ int cb_error; \
+ /* private */ \
+ struct generic_callback *cb_next; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_callback {
+ GENERIC_CALLBACK(generic);
+};
+
+#define CALLBACK_OBJECT(OBJTYPE) \
+ GENERIC_OBJECT(OBJTYPE); \
+ /* private, access by macros */ \
+ struct generic_callback *object_cb; \
+ struct generic_callback _object_cb; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct callback_object {
+ CALLBACK_OBJECT(generic);
+};
+
+/* Initial setup of the callback chain
+ */
+#define _SETUP_CALLBACK(obj, fn, priv) \
+do { \
+ (obj)->_object_cb.cb_fn = (fn); \
+ (obj)->_object_cb.cb_private = (priv); \
+ (obj)->_object_cb.cb_error = 0; \
+ (obj)->_object_cb.cb_next = NULL; \
+ (obj)->object_cb = &(obj)->_object_cb; \
+} while (0)
+
+#ifdef BRICK_DEBUGGING
+#define SETUP_CALLBACK(obj, fn, priv) \
+do { \
+ if (unlikely((obj)->_object_cb.cb_fn)) { \
+ BRICK_ERR("callback function %p is already installed (new=%p)\n",\
+ (obj)->_object_cb.cb_fn, (fn)); \
+ } \
+ _SETUP_CALLBACK(obj, fn, priv) \
+} while (0)
+#else
+#define SETUP_CALLBACK(obj, fn, priv) _SETUP_CALLBACK(obj, fn, priv)
+#endif
+
+/* Insert a new member into the callback chain
+ */
+#define _INSERT_CALLBACK(obj, new, fn, priv) \
+do { \
+ if (likely(!(new)->cb_fn)) { \
+ (new)->cb_fn = (fn); \
+ (new)->cb_private = (priv); \
+ (new)->cb_error = 0; \
+ (new)->cb_next = (obj)->object_cb; \
+ (obj)->object_cb = (new); \
+ } \
+} while (0)
+
+#ifdef BRICK_DEBUGGING
+#define INSERT_CALLBACK(obj, new, fn, priv) \
+do { \
+ if (unlikely(!(obj)->_object_cb.cb_fn)) { \
+ BRICK_ERR("initical callback function is missing\n"); \
+ } \
+ if (unlikely((new)->cb_fn)) { \
+ BRICK_ERR("new object %p is not pristine\n", (new)->cb_fn);\
+ } \
+ _INSERT_CALLBACK(obj, new, fn, priv); \
+} while (0)
+#else
+#define INSERT_CALLBACK(obj, new, fn, priv) _INSERT_CALLBACK(obj, new, fn, priv)
+#endif
+
+/* Call the first callback in the chain.
+ */
+#define SIMPLE_CALLBACK(obj, err) \
+do { \
+ if (likely(obj)) { \
+ struct generic_callback *__cb = (obj)->object_cb; \
+ if (likely(__cb)) { \
+ __cb->cb_error = (err); \
+ __cb->cb_fn(__cb); \
+ } else { \
+ BRICK_ERR("callback object_cb pointer is NULL\n");\
+ } \
+ } else { \
+ BRICK_ERR("callback obj pointer is NULL\n"); \
+ } \
+} while (0)
+
+#define CHECKED_CALLBACK(obj, err, done) \
+do { \
+ struct generic_callback *__cb; \
+ CHECK_PTR(obj, done); \
+ __cb = (obj)->object_cb; \
+ CHECK_PTR_NULL(__cb, done); \
+ __cb->cb_error = (err); \
+ __cb->cb_fn(__cb); \
+} while (0)
+
+/* An intermediate callback handler must call this
+ * to continue the callback chain.
+ */
+#define NEXT_CHECKED_CALLBACK(cb, done) \
+do { \
+ struct generic_callback *__next_cb = (cb)->cb_next; \
+ CHECK_PTR_NULL(__next_cb, done); \
+ __next_cb->cb_error = (cb)->cb_error; \
+ __next_cb->cb_fn(__next_cb); \
+} while (0)
+
+/* The last callback handler in the chain should call this
+ * for checking whether the end of the chain has been reached
+ */
+#define LAST_CALLBACK(cb) \
+do { \
+ struct generic_callback *__next_cb = (cb)->cb_next; \
+ if (unlikely(__next_cb)) { \
+ BRICK_ERR("end of callback chain %p has not been reached, rest = %p\n", (cb), __next_cb);\
+ } \
+} while (0)
+
+/* Query the callback status.
+ * This uses always the first member of the chain!
+ */
+#define CALLBACK_ERROR(obj) \
+ ((obj)->object_cb ? (obj)->object_cb->cb_error : -EINVAL)
+
+/***********************************************************************/
+
+/* definitions for generic bricks */
+
+struct generic_input;
+struct generic_output;
+struct generic_brick_ops;
+struct generic_output_ops;
+struct generic_brick_type;
+
+struct generic_switch {
+ /* set by strategy layer, readonly from worker layer */
+ bool button;
+
+ /* set by worker layer, readonly from strategy layer */
+ bool on_led;
+ bool off_led;
+
+ /* private (for any layer) */
+ bool force_off;
+ int percent_done;
+
+ wait_queue_head_t event;
+};
+
+#define GENERIC_BRICK(BRITYPE) \
+ /* accessible */ \
+ struct generic_switch power; \
+ /* set by strategy layer, readonly from worker layer */ \
+ const struct BRITYPE##_brick_type *type; \
+ int nr_inputs; \
+ int nr_outputs; \
+ struct BRITYPE##_input **inputs; \
+ struct BRITYPE##_output **outputs; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_brick_ops *ops; \
+ struct generic_aspect_context aspect_context; \
+ int (*free)(struct BRITYPE##_brick *del); \
+ struct list_head tmp_head; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_brick {
+ GENERIC_BRICK(generic);
+};
+
+#define GENERIC_INPUT(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ struct BRITYPE##_brick *brick; \
+ const struct BRITYPE##_input_type *type; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_output *connect; \
+ struct list_head input_head; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_input {
+ GENERIC_INPUT(generic);
+};
+
+#define GENERIC_OUTPUT(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ struct BRITYPE##_brick *brick; \
+ const struct BRITYPE##_output_type *type; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_output_ops *ops; \
+ struct list_head output_head; \
+ int nr_connected; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_output {
+ GENERIC_OUTPUT(generic);
+};
+
+#define GENERIC_OUTPUT_CALL(OUTPUT, OP, ARGS...) \
+ ( \
+ (OUTPUT) && (OUTPUT)->ops->OP ? \
+ (OUTPUT)->ops->OP(OUTPUT, ##ARGS) : \
+ -ENOSYS \
+ )
+
+#define GENERIC_INPUT_CALL(INPUT, OP, ARGS...) \
+ ( \
+ (INPUT) && (INPUT)->connect ? \
+ GENERIC_OUTPUT_CALL((INPUT)->connect, OP, ##ARGS) : \
+ -ENOTCONN \
+ )
+
+#define GENERIC_BRICK_OPS(BRITYPE) \
+ int (*brick_switch)(struct BRITYPE##_brick *brick); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_brick_ops {
+ GENERIC_BRICK_OPS(generic);
+};
+
+#define GENERIC_OUTPUT_OPS(BRITYPE) \
+ /*int (*output_start)(struct BRITYPE##_output *output);*/ \
+ /*int (*output_stop)(struct BRITYPE##_output *output);*/ \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_output_ops {
+ GENERIC_OUTPUT_OPS(generic)
+};
+
+/* although possible, *_type should never be extended */
+#define GENERIC_BRICK_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ const char *type_name; \
+ int max_inputs; \
+ int max_outputs; \
+ const struct BRITYPE##_input_type **default_input_types; \
+ const char **default_input_names; \
+ const struct BRITYPE##_output_type **default_output_types; \
+ const char **default_output_names; \
+ /* private (for any layer) */ \
+ int brick_size; \
+ struct BRITYPE##_brick_ops *master_ops; \
+ const struct generic_aspect_type **aspect_types; \
+ const struct BRITYPE##_input_type **default_type; \
+ int (*brick_construct)(struct BRITYPE##_brick *brick); \
+ int (*brick_destruct)(struct BRITYPE##_brick *brick); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_brick_type {
+ GENERIC_BRICK_TYPE(generic);
+};
+
+#define GENERIC_INPUT_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ char *type_name; \
+ /* private (for any layer) */ \
+ int input_size; \
+ int (*input_construct)(struct BRITYPE##_input *input); \
+ int (*input_destruct)(struct BRITYPE##_input *input); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_input_type {
+ GENERIC_INPUT_TYPE(generic);
+};
+
+#define GENERIC_OUTPUT_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ char *type_name; \
+ /* private (for any layer) */ \
+ int output_size; \
+ struct BRITYPE##_output_ops *master_ops; \
+ int (*output_construct)(struct BRITYPE##_output *output); \
+ int (*output_destruct)(struct BRITYPE##_output *output); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct generic_output_type {
+ GENERIC_OUTPUT_TYPE(generic);
+};
+
+int generic_register_brick_type(const struct generic_brick_type *new_type);
+int generic_unregister_brick_type(const struct generic_brick_type *old_type);
+
+extern void _generic_output_init(struct generic_brick *brick,
+ const struct generic_output_type *type,
+ struct generic_output *output);
+
+extern void _generic_output_exit(struct generic_output *output);
+
+#ifdef _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+/* you need this only if you circumvent generic_brick_init_full() */
+extern int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick);
+
+extern void generic_brick_exit(struct generic_brick *brick);
+
+extern int generic_input_init(struct generic_brick *brick,
+ int index,
+ const struct generic_input_type *type,
+ struct generic_input *input);
+
+extern void generic_input_exit(struct generic_input *input);
+
+extern int generic_output_init(struct generic_brick *brick,
+ int index,
+ const struct generic_output_type *type,
+ struct generic_output *output);
+
+extern int generic_size(const struct generic_brick_type *brick_type);
+
+extern int generic_connect(struct generic_input *input, struct generic_output *output);
+
+extern int generic_disconnect(struct generic_input *input);
+
+/* If possible, use this instead of generic_*_init().
+ * input_types and output_types may be NULL => use default_*_types
+ */
+int generic_brick_init_full(
+ void *data,
+ int size,
+ const struct generic_brick_type *brick_type,
+ const struct generic_input_type **input_types,
+ const struct generic_output_type **output_types);
+
+int generic_brick_exit_full(
+ struct generic_brick *brick);
+
+#endif /* _STRATEGY */
+
+/* simple wrappers for type safety */
+
+#define DECLARE_BRICK_FUNCTIONS(BRITYPE) \
+extern inline int BRITYPE##_register_brick_type(void) \
+{ \
+ extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+ extern int BRITYPE##_brick_nr; \
+ if (unlikely(BRITYPE##_brick_nr >= 0)) { \
+ BRICK_ERR("brick type " #BRITYPE " is already registered.\n");\
+ return -EEXIST; \
+ } \
+ BRITYPE##_brick_nr = generic_register_brick_type((const struct generic_brick_type *)&BRITYPE##_brick_type);\
+ return BRITYPE##_brick_nr < 0 ? BRITYPE##_brick_nr : 0; \
+} \
+ \
+extern inline int BRITYPE##_unregister_brick_type(void) \
+{ \
+ extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+ return generic_unregister_brick_type((const struct generic_brick_type *)&BRITYPE##_brick_type);\
+} \
+ \
+extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+extern const struct BRITYPE##_input_type BRITYPE##_input_type; \
+extern const struct BRITYPE##_output_type BRITYPE##_output_type; \
+/* this comment is for keeping TRAILING_SEMICOLON happy */
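+
+/* Usage sketch (illustrative; "foo" is a hypothetical brick type):
+ *
+ *	DECLARE_BRICK_FUNCTIONS(foo);
+ *
+ * expands to foo_register_brick_type() / foo_unregister_brick_type()
+ * plus extern declarations for foo_brick_type, foo_input_type and
+ * foo_output_type. A module would typically call the register function
+ * from its init code and the unregister function from its exit code.
+ */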
+
+/*********************************************************************/
+
+/* default operations on objects / aspects */
+
+extern struct generic_object *generic_alloc(struct generic_object_layout *object_layout,
+ const struct generic_object_type *object_type);
+
+extern void generic_free(struct generic_object *object);
+extern struct generic_aspect *generic_get_aspect(struct generic_brick *brick, struct generic_object *obj);
+
+#define DECLARE_OBJECT_FUNCTIONS(OBJTYPE) \
+extern inline struct OBJTYPE##_object *alloc_##OBJTYPE(struct generic_object_layout *layout)\
+{ \
+ return (void *)generic_alloc(layout, &OBJTYPE##_type); \
+}
+
+#define DECLARE_ASPECT_FUNCTIONS(BRITYPE, OBJTYPE) \
+ \
+extern inline struct OBJTYPE##_object *BRITYPE##_alloc_##OBJTYPE(struct BRITYPE##_brick *brick)\
+{ \
+ return alloc_##OBJTYPE(&brick->OBJTYPE##_object_layout); \
+} \
+ \
+extern inline struct BRITYPE##_##OBJTYPE##_aspect *BRITYPE##_##OBJTYPE##_get_aspect(struct BRITYPE##_brick *brick,\
+ struct OBJTYPE##_object *obj) \
+{ \
+ return (void *)generic_get_aspect((struct generic_brick *)brick, (struct generic_object *)obj);\
+}
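+
+/* Usage sketch (illustrative; "foo" bricks operating on "mref" objects):
+ *
+ *	DECLARE_OBJECT_FUNCTIONS(mref);
+ *	DECLARE_ASPECT_FUNCTIONS(foo, mref);
+ *
+ * yields alloc_mref(), foo_alloc_mref() and foo_mref_get_aspect(), so
+ * a brick can reach its private per-object data without open-coded
+ * casts scattered throughout the code.
+ */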
+
+/*********************************************************************/
+
+/* some general helpers */
+
+#ifdef _STRATEGY /* call this only from the strategy implementation */
+
+/* Generic interface to simple brick status changes.
+ */
+extern void set_button(struct generic_switch *sw, bool val, bool force);
+extern void set_on_led(struct generic_switch *sw, bool val);
+extern void set_off_led(struct generic_switch *sw, bool val);
+/*
+ * "Forced switch off" means that it cannot be switched on again.
+ */
+extern void set_button_wait(struct generic_brick *brick, bool val, bool force, int timeout);
+
+#endif
+
+/***********************************************************************/
+
+/* thread automation (avoid code duplication) */
+
+#define brick_thread_create(_thread_fn, _data, _fmt, _args...) \
+ ({ \
+ struct task_struct *_thr = kthread_create(_thread_fn, _data, _fmt, ##_args);\
+ if (unlikely(IS_ERR(_thr))) { \
+ int _err = PTR_ERR(_thr); \
+ BRICK_ERR("cannot create thread '%s', status = %d\n", _fmt, _err);\
+ _thr = NULL; \
+ } else { \
+ struct say_channel *ch = get_binding(current); \
+ if (likely(ch)) \
+ bind_to_channel(ch, _thr); \
+ get_task_struct(_thr); \
+ wake_up_process(_thr); \
+ } \
+ _thr; \
+ })
+
+extern void brick_thread_stop_nowait(struct task_struct *k);
+
+#define brick_thread_stop(_thread) \
+ do { \
+ if (likely(_thread)) { \
+ BRICK_DBG("stopping thread '%s'\n", (_thread)->comm);\
+ kthread_stop(_thread); \
+ BRICK_DBG("thread '%s' finished.\n", (_thread)->comm);\
+ remove_binding(_thread); \
+ put_task_struct(_thread); \
+ _thread = NULL; \
+ } \
+ } while (0)
+
+#define brick_thread_should_stop() \
+ ({ \
+ brick_yield(); \
+ kthread_should_stop(); \
+ })
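+
+/* Typical lifecycle (sketch; my_worker() and the variable names are
+ * illustrative):
+ *
+ *	static int my_worker(void *data)
+ *	{
+ *		while (!brick_thread_should_stop()) {
+ *			... do some work ...
+ *		}
+ *		return 0;
+ *	}
+ *
+ *	thr = brick_thread_create(my_worker, data, "my_worker%d", nr);
+ *	...
+ *	brick_thread_stop(thr);
+ */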
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_brick(void);
+extern void exit_brick(void);
+
+#endif
--
2.0.0

2014-07-01 21:49:33

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 48/50] mars: add new file drivers/block/mars/Kconfig

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/Kconfig | 371 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 371 insertions(+)
create mode 100644 drivers/block/mars/Kconfig

diff --git a/drivers/block/mars/Kconfig b/drivers/block/mars/Kconfig
new file mode 100644
index 0000000..89765f2
--- /dev/null
+++ b/drivers/block/mars/Kconfig
@@ -0,0 +1,371 @@
+#
+# MARS configuration
+#
+
+config MARS
+ tristate "storage system MARS (EXPERIMENTAL)"
+ depends on BLOCK && PROC_SYSCTL && HIGH_RES_TIMERS && !DEBUG_SLAB && !DEBUG_SG
+ default n
+ ---help---
+ MARS is a long-distance replication system for generic block devices.
+ It works asynchronously and tolerates network bottlenecks.
+ Please read the full documentation at
+ https://github.com/schoebel/mars/blob/master/docu/mars-manual.pdf?raw=true
+ Always compile MARS as a module!
+
+config MARS_CHECKS
+ bool "enable simple runtime checks in MARS"
+ depends on MARS
+ default y
+ ---help---
+ These checks should be rather lightweight. Use them
+ for beta testing and for production systems where
+ safety is more important than performance.
+ In case of bugs in the reference counting, an automatic repair
+ is attempted, which lowers the risk of memory corruptions.
+ Disable only if you need the absolutely last grain of
+ performance.
+ If unsure, say Y here.
+
+config MARS_DEBUG
+ bool "enable full runtime checks and some tracing in MARS"
+ depends on MARS
+ default n
+ ---help---
+ Some of these checks and some additional error tracing may
+ consume noticeable amounts of memory. However, this is extremely
+ valuable for finding bugs, even in production systems.
+
+ OFF for production systems. ON for testing!
+
+ If you encounter bugs in production systems, you
+ may (and should) also use this in production, provided
+ that you carefully monitor your systems.
+
+config MARS_DEBUG_MEM
+ bool "debug memory operations"
+ depends on MARS_DEBUG
+ default n
+ ---help---
+ This adds considerable space and time overhead, but catches
+ many errors (including some that are not caught by kmemleak).
+
+ OFF for production systems. ON for testing!
+ Use only for development and thorough testing!
+
+config MARS_DEBUG_MEM_STRONG
+ bool "intensified debugging of memory operations"
+ depends on MARS_DEBUG_MEM
+ default y
+ ---help---
+ Trace all block allocations, find more errors.
+ Adds some overhead.
+
+ Use for debugging of new bricks or for intensified
+ regression testing.
+
+config MARS_DEBUG_ORDER0
+ bool "also debug order0 operations"
+ depends on MARS_DEBUG_MEM
+ default n
+ ---help---
+ This turns even order-0 allocations into order-1 ones, which may
+ provoke heavy memory fragmentation in the buddy allocator,
+ but it catches some additional memory problems.
+ Use only if you know what you are doing!
+ Normally OFF.
+
+config MARS_DEFAULT_PORT
+ int "port number where MARS is listening"
+ depends on MARS
+ default 7777
+ ---help---
+ Best practice is to uniformly use the same port number
+ in a cluster. Therefore, this is a compile-time constant.
+ You may override this at insmod time via the mars_port= parameter.
+
+config MARS_SEPARATE_PORTS
+ bool "use separate port numbers for traffic shaping"
+ depends on MARS
+ default y
+ ---help---
+ When enabled, the following port assignments will be used:
+
+ CONFIG_MARS_DEFAULT_PORT : updates of symlinks
+ CONFIG_MARS_DEFAULT_PORT + 1 : replication of logfiles
+ CONFIG_MARS_DEFAULT_PORT + 2 : (initial) sync traffic
+
+ As a consequence, external traffic shaping may be used to
+ individually control the network speed for different types
+ of traffic.
+
+ Please don't hinder the symlink updates in any way -- they are
+ most vital, and they produce no mass traffic at all
+ (it's only some kind of meta-information traffic).
+
+ Say Y if you have a big datacenter.
+ Say N if you cannot afford a bigger hole in your firewall.
+ If unsure, say Y.
+
+
+config MARS_LOGDIR
+ string "absolute path to the logging directory"
+ depends on MARS
+ default "/mars"
+ ---help---
+ Path to the directory where all MARS messages will reside.
+ Usually this is equal to the global /mars directory.
+
+ Logfiles and status files obey the following naming conventions:
+ 0.debug.log
+ 1.info.log
+ 2.warn.log
+ 3.error.log
+ 4.fatal.log
+ 5.total.log
+ Logfiles must already exist in order to be appended.
+ Logfiles can be rotated by renaming them and creating
+ a new empty file in place of the old one.
+
+ Status files follow the same rules, but .log is replaced
+ by .status, and they are created automatically. However,
+ their content is limited to the last few seconds or minutes.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_ROLLOVER_INTERVAL
+ int "rollover time of logging status files (in seconds)"
+ depends on MARS
+ default 3
+ ---help---
+ This may influence the system load; don't use too low numbers.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_SCAN_INTERVAL
+ int "re-scanning of symlinks in /mars/ (in seconds)"
+ depends on MARS
+ default 5
+ ---help---
+ This may influence the system load; don't use too low numbers.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_PROPAGATE_INTERVAL
+ int "network propagation delay of changes in /mars/ (in seconds)"
+ depends on MARS
+ default 5
+ ---help---
+ This may influence the system load; don't use too low numbers.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_SYNC_FLIP_INTERVAL
+ int "interrpt sync by logfile update after (seconds)"
+ depends on MARS
+ default 60
+ ---help---
+ 0 = OFF. Normally ON.
+ When disabled, application of logfiles may wait for
+ a very long time until the full sync has finished. As a
+ consequence, your /mars/ filesystem may run out
+ of space. When enabled, the applied logfiles can
+ be deleted, freeing space on /mars/. Therefore, you
+ will usually want this. However, you may increase
+ the time interval to trade latency for throughput.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_NETIO_TIMEOUT
+ int "timeout for remote IO operations (in seconds)"
+ depends on MARS
+ default 30
+ ---help---
+ In case of network hangs, don't wait forever, but rather
+ abort with -ENOTCONN.
+ When set to 0, wait forever (this may lead to hanging
+ operations, similar to NFS hard mounts).
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_FAST_FULLSYNC
+ bool "decrease network traffic at initial sync"
+ depends on MARS
+ default y
+ ---help---
+ Normally ON.
+ When on, both sides will read the data, compute an md5
+ checksum, and compare the checksums. Only when the checksums
+ mismatch is the data actually transferred over
+ the network. This may increase the IO traffic in exchange
+ for reduced network traffic. Usually it does no harm to re-read
+ the same data twice (only in case of mismatches) via bio,
+ because RAID controllers will usually cache their data
+ for some time. In case of buffered aio reads from filesystems,
+ the data is cached by the kernel anyway.
+
+config MARS_MIN_SPACE_4
+ int "absolutely necessary free space in /mars/ (hard limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ HARDEST EMERGENCY LIMIT
+
+ When free space in /mars/ drops under this limit,
+ transaction logging to /mars/ will stop completely,
+ even at all primary resources. All IO will directly go to the
+ underlying raw devices. The transaction logfile sequence numbers
+ will be disrupted, deliberately leaving holes in the sequence.
+
+ This is a last-resort desperate action of the kernel.
+
+ As a consequence, all secondaries will have no chance to
+ replay past that gap, even if they got the logfiles.
+ The secondaries will stop at the gap, left in an outdated,
+ but logically consistent state.
+
+ After the problem has been fixed, the secondaries must
+ start a full-sync in order to continue replication at the
+ recent state.
+
+ This is the hardest measure the kernel can take in order
+ to TRY to continue undisrupted operation at the primary side.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_3
+ int "free space in /mars/ for primary logfiles (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ MEDIUM EMERGENCY LIMIT
+
+ When free space in /mars/ drops under
+ MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3,
+ older transaction logfiles will be deleted at primary resources.
+
+ As a consequence, the secondaries may no longer be able to
+ get a consecutive series of logfile copies.
+ As a result, they may get stuck somewhere in between, at an
+ outdated but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed, some secondaries may need to
+ start a full-sync in order to continue replication at the
+ recent state.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_2
+ int "free space in /mars/ for secondary logfiles (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ MEDIUM EMERGENCY LIMIT
+
+ When free space in /mars/ drops under
+ MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2,
+ older transaction logfiles will be deleted at secondary resources.
+
+ As a consequence, some local secondary resources
+ may get stuck somewhere in between, at an
+ outdated but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed and the free space becomes
+ larger than MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1, the secondary tries to fetch the missing
+ logfiles from the primary again.
+
+ However, if the necessary logfiles have been deleted at the
+ primary side in the meantime, this may fail.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_1
+ int "free space in /mars/ for replication (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ LOWEST EMERGENCY LIMIT
+
+ When free space in /mars/ drops under MARS_MIN_SPACE_4
+ + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2 + MARS_MIN_SPACE_1,
+ fetching of transaction logfiles will stop at local secondary
+ resources.
+
+ As a consequence, some local secondary resources
+ may get stuck somewhere in between, at an
+ outdated but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed and the free space becomes
+ larger than MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1, the secondary will continue fetching its
+ copy of logfiles from the primary side.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_0
+ int "total space needed in /mars/ for (additional limit in GB)"
+ depends on MARS
+ default 12
+ ---help---
+ Operational prerequisite.
+
+ In order to use MARS, the total space available in /mars/ must
+ be at least MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1 + MARS_MIN_SPACE_0.
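+
+ With the default values, this amounts to
+ 2 + 2 + 2 + 2 + 12 = 20 GB in total.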
+
+ If you cannot afford that amount of storage space, please use
+ DRBD in place of MARS.
+
+config MARS_LOGROT_AUTO
+ int "automatic logrotate when logfile exceeds size (in GB)"
+ depends on MARS
+ default 32
+ ---help---
+ You can switch this off by setting it to 0. However, deletion
+ of really huge logfiles can take several minutes, or even substantial
+ fractions of an hour (depending on the underlying filesystem).
+ Thus it is highly recommended to limit the logfile size to some
+ reasonable maximum. Switch it off only for experiments!
+
+## remove_this
+config MARS_PREFER_SIO
+ bool "prefer sio bricks instead of aio"
+ depends on MARS_DEBUG
+ default n
+ ---help---
+ Normally OFF for production systems.
+ Only use as alternative for testing.
+
+## end_remove_this
--
2.0.0

2014-07-01 21:47:53

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 03/50] mars: add new file include/linux/brick/brick_say.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/brick_say.h | 80 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 80 insertions(+)
create mode 100644 include/linux/brick/brick_say.h

diff --git a/include/linux/brick/brick_say.h b/include/linux/brick/brick_say.h
new file mode 100644
index 0000000..9f240c8
--- /dev/null
+++ b/include/linux/brick/brick_say.h
@@ -0,0 +1,80 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_SAY_H
+#define BRICK_SAY_H
+
+/***********************************************************************/
+
+extern int brick_say_logging;
+extern int brick_say_debug;
+extern int brick_say_syslog_min;
+extern int brick_say_syslog_max;
+extern int brick_say_syslog_flood_class;
+extern int brick_say_syslog_flood_limit;
+extern int brick_say_syslog_flood_recovery;
+extern int delay_say_on_overflow;
+
+/* printk() replacements */
+
+enum {
+ SAY_DEBUG,
+ SAY_INFO,
+ SAY_WARN,
+ SAY_ERROR,
+ SAY_FATAL,
+ SAY_TOTAL,
+ MAX_SAY_CLASS
+};
+
+extern const char *say_class[MAX_SAY_CLASS];
+
+struct say_channel;
+
+extern struct say_channel *default_channel;
+
+extern struct say_channel *make_channel(const char *name, bool must_exit);
+
+extern void del_channel(struct say_channel *ch);
+
+extern void bind_to_channel(struct say_channel *ch, struct task_struct *whom);
+
+#define bind_me(_name) \
+ bind_to_channel(make_channel(_name), current)
+
+extern struct say_channel *get_binding(struct task_struct *whom);
+
+extern void remove_binding_from(struct say_channel *ch, struct task_struct *whom);
+extern void remove_binding(struct task_struct *whom);
+
+extern void rollover_channel(struct say_channel *ch);
+extern void rollover_all(void);
+
+extern void say_to(struct say_channel *ch, int class, const char *fmt, ...) __printf(3, 4);
+
+#define say(_class, _fmt, _args...) \
+ say_to(NULL, _class, _fmt, ##_args)
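+
+/* Example (sketch; "name" is illustrative):
+ *
+ *	say(SAY_INFO, "resource %s is ready\n", name);
+ *
+ * writes via the channel bound to the current task (or the default
+ * channel when no binding exists), under the "info" message class.
+ */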
+
+extern void brick_say_to(struct say_channel *ch,
+ int class,
+ bool dump,
+ const char *prefix,
+ const char *file,
+ int line,
+ const char *func,
+ const char *fmt,
+ ...) __printf(8, 9);
+
+#define brick_say(_class, _dump, _prefix, _file, _line, _func, _fmt, _args...)\
+ brick_say_to(NULL, _class, _dump, _prefix, _file, _line, _func, _fmt, ##_args)
+
+extern void init_say(void);
+extern void exit_say(void);
+
+#ifdef CONFIG_MARS_DEBUG
+extern void brick_dump_stack(void);
+#else /* CONFIG_MARS_DEBUG */
+#define brick_dump_stack() /*empty*/
+#endif /* CONFIG_MARS_DEBUG */
+
+#endif
--
2.0.0

2014-07-01 21:50:18

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 47/50] mars: add new file drivers/block/mars/Makefile

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/Makefile | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
create mode 100644 drivers/block/mars/Makefile

diff --git a/drivers/block/mars/Makefile b/drivers/block/mars/Makefile
new file mode 100644
index 0000000..6764e8b
--- /dev/null
+++ b/drivers/block/mars/Makefile
@@ -0,0 +1,38 @@
+#
+# Makefile for MARS
+#
+
+obj-$(CONFIG_MARS) += mars.o
+
+KBUILD_CFLAGS += -fdelete-null-pointer-checks
+
+mars-objs := \
+ lamport.o \
+ brick_say.o \
+ brick_mem.o \
+ brick.o \
+ xio_bricks/xio.o \
+ xio_bricks/lib_log.o \
+ lib_rank.o \
+ lib_limiter.o \
+ lib_timing.o \
+ xio_bricks/lib_mapfree.o \
+ xio_bricks/xio_net.o \
+ xio_bricks/xio_server.o \
+ xio_bricks/xio_client.o \
+ xio_bricks/xio_aio.o \
+ xio_bricks/xio_bio.o \
+ xio_bricks/xio_if.o \
+ xio_bricks/xio_copy.o \
+ xio_bricks/xio_trans_logger.o \
+ mars_light/light_strategy.o \
+ mars_light/light_net.o \
+ mars_light/mars_proc.o \
+ mars_light/mars_light.o
+
+ifdef CONFIG_DEBUG_KERNEL
+
+KBUILD_CFLAGS += -fno-inline-functions -fno-inline-small-functions -fno-inline-functions-called-once
+
+endif
+
--
2.0.0

2014-07-01 21:47:50

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 04/50] mars: add new file drivers/block/mars/brick_say.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/brick_say.c | 931 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 931 insertions(+)
create mode 100644 drivers/block/mars/brick_say.c

diff --git a/drivers/block/mars/brick_say.c b/drivers/block/mars/brick_say.c
new file mode 100644
index 0000000..09b975f
--- /dev/null
+++ b/drivers/block/mars/brick_say.c
@@ -0,0 +1,931 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/brick/brick_say.h>
+#include <linux/brick/lamport.h>
+
+/*******************************************************************/
+
+/* messaging */
+
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/file.h>
+#include <linux/sched.h>
+#include <linux/preempt.h>
+#include <linux/hardirq.h>
+#include <linux/smp.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/syscalls.h>
+
+#include <linux/uaccess.h>
+
+#ifndef GFP_BRICK
+#define GFP_BRICK GFP_NOIO
+#endif
+
+#define SAY_ORDER 0
+#define SAY_BUFMAX (PAGE_SIZE << SAY_ORDER)
+#define SAY_BUF_LIMIT (SAY_BUFMAX - 1500)
+#define MAX_FILELEN 16
+#define MAX_IDS 1000
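+
+/* Illustrative arithmetic, assuming PAGE_SIZE == 4096: with SAY_ORDER 0,
+ * SAY_BUFMAX is 4096 bytes and SAY_BUF_LIMIT is 2596, i.e. 1500 bytes of
+ * headroom remain for one last message plus the overflow marker.
+ */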
+
+const char *say_class[MAX_SAY_CLASS] = {
+ [SAY_DEBUG] = "debug",
+ [SAY_INFO] = "info",
+ [SAY_WARN] = "warn",
+ [SAY_ERROR] = "error",
+ [SAY_FATAL] = "fatal",
+ [SAY_TOTAL] = "total",
+};
+EXPORT_SYMBOL_GPL(say_class);
+
+int brick_say_logging = 1;
+EXPORT_SYMBOL_GPL(brick_say_logging);
+module_param_named(say_logging, brick_say_logging, int, 0);
+int brick_say_debug = 0;
+EXPORT_SYMBOL_GPL(brick_say_debug);
+module_param_named(say_debug, brick_say_debug, int, 0);
+
+int brick_say_syslog_min = 1;
+EXPORT_SYMBOL_GPL(brick_say_syslog_min);
+int brick_say_syslog_max = -1;
+EXPORT_SYMBOL_GPL(brick_say_syslog_max);
+int brick_say_syslog_flood_class = 3;
+EXPORT_SYMBOL_GPL(brick_say_syslog_flood_class);
+int brick_say_syslog_flood_limit = 20;
+EXPORT_SYMBOL_GPL(brick_say_syslog_flood_limit);
+int brick_say_syslog_flood_recovery = 300;
+EXPORT_SYMBOL_GPL(brick_say_syslog_flood_recovery);
+int delay_say_on_overflow =
+#ifdef CONFIG_MARS_DEBUG
+ 1;
+#else
+ 0;
+#endif
+EXPORT_SYMBOL_GPL(delay_say_on_overflow);
+
+static atomic_t say_alloc_channels = ATOMIC_INIT(0);
+static atomic_t say_alloc_names = ATOMIC_INIT(0);
+static atomic_t say_alloc_pages = ATOMIC_INIT(0);
+
+static unsigned long flood_start_jiffies;
+static int flood_count;
+
+struct say_channel {
+ char *ch_name;
+ struct say_channel *ch_next;
+ spinlock_t ch_lock[MAX_SAY_CLASS];
+ char *ch_buf[MAX_SAY_CLASS][2];
+
+ short ch_index[MAX_SAY_CLASS];
+ struct file *ch_filp[MAX_SAY_CLASS][2];
+ int ch_overflow[MAX_SAY_CLASS];
+ bool ch_written[MAX_SAY_CLASS];
+ bool ch_rollover;
+ bool ch_must_exist;
+ bool ch_is_dir;
+ bool ch_delete;
+ int ch_status_written;
+ int ch_id_max;
+ void *ch_ids[MAX_IDS];
+
+ wait_queue_head_t ch_progress;
+};
+
+struct say_channel *default_channel = NULL;
+EXPORT_SYMBOL_GPL(default_channel);
+
+static struct say_channel *channel_list;
+
+static rwlock_t say_lock = __RW_LOCK_UNLOCKED(say_lock);
+
+static struct task_struct *say_thread;
+
+static DECLARE_WAIT_QUEUE_HEAD(say_event);
+
+bool say_dirty = false;
+
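+/* True when called from a context that must not sleep: in interrupt,
+ * softirq or NMI context, with preemption disabled, or with IRQs off.
+ */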
+#define use_atomic() \
+ ((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0 || irqs_disabled())
+
+static
+void wait_channel(struct say_channel *ch, int class)
+{
+ if (delay_say_on_overflow && ch->ch_index[class] > SAY_BUF_LIMIT) {
+ if (!use_atomic()) {
+ say_dirty = true;
+ wake_up_interruptible(&say_event);
+ wait_event_interruptible_timeout(ch->ch_progress,
+ ch->ch_index[class] < SAY_BUF_LIMIT,
+ HZ / 10);
+ }
+ }
+}
+
+static
+struct say_channel *find_channel(const void *id)
+{
+ struct say_channel *res = default_channel;
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ int i;
+
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == id) {
+ res = ch;
+ goto found;
+ }
+ }
+ }
+found:
+ read_unlock(&say_lock);
+ return res;
+}
+
+static
+void _remove_binding(struct task_struct *whom)
+{
+ struct say_channel *ch;
+ int i;
+
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom)
+ ch->ch_ids[i] = NULL;
+ }
+ }
+}
+
+void bind_to_channel(struct say_channel *ch, struct task_struct *whom)
+{
+ int i;
+
+ write_lock(&say_lock);
+ _remove_binding(whom);
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (!ch->ch_ids[i]) {
+ ch->ch_ids[i] = whom;
+ goto done;
+ }
+ }
+ if (likely(ch->ch_id_max < MAX_IDS - 1))
+ ch->ch_ids[ch->ch_id_max++] = whom;
+ else
+ goto err;
+done:
+ write_unlock(&say_lock);
+ goto out_return;
+err:
+ write_unlock(&say_lock);
+
+ say_to(default_channel, SAY_ERROR, "ID overflow for thread '%s'\n", whom->comm);
+out_return:;
+}
+EXPORT_SYMBOL_GPL(bind_to_channel);
+
+struct say_channel *get_binding(struct task_struct *whom)
+{
+ struct say_channel *ch;
+ int i;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom)
+ goto found;
+ }
+ }
+ ch = NULL;
+found:
+ read_unlock(&say_lock);
+ return ch;
+}
+EXPORT_SYMBOL_GPL(get_binding);
+
+void remove_binding_from(struct say_channel *ch, struct task_struct *whom)
+{
+ bool found = false;
+ int i;
+
+ write_lock(&say_lock);
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom) {
+ ch->ch_ids[i] = NULL;
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ _remove_binding(whom);
+ write_unlock(&say_lock);
+}
+EXPORT_SYMBOL_GPL(remove_binding_from);
+
+void remove_binding(struct task_struct *whom)
+{
+ write_lock(&say_lock);
+ _remove_binding(whom);
+ write_unlock(&say_lock);
+}
+EXPORT_SYMBOL_GPL(remove_binding);
+
+void rollover_channel(struct say_channel *ch)
+{
+ if (!ch)
+ ch = find_channel(current);
+ if (likely(ch))
+ ch->ch_rollover = true;
+}
+EXPORT_SYMBOL_GPL(rollover_channel);
+
+void rollover_all(void)
+{
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next)
+ ch->ch_rollover = true;
+ read_unlock(&say_lock);
+}
+EXPORT_SYMBOL_GPL(rollover_all);
+
+void del_channel(struct say_channel *ch)
+{
+ if (unlikely(!ch))
+ goto out_return;
+ if (unlikely(ch == default_channel)) {
+ say_to(default_channel, SAY_ERROR, "thread '%s' tried to delete the default channel\n", current->comm);
+ goto out_return;
+ }
+
+ ch->ch_delete = true;
+out_return:;
+}
+EXPORT_SYMBOL_GPL(del_channel);
+
+static
+void _del_channel(struct say_channel *ch)
+{
+ struct say_channel *tmp;
+ struct say_channel **_tmp;
+ int i, j;
+
+ if (!ch)
+ goto out_return;
+ write_lock(&say_lock);
+ for (_tmp = &channel_list; (tmp = *_tmp) != NULL; _tmp = &tmp->ch_next) {
+ if (tmp == ch) {
+ *_tmp = tmp->ch_next;
+ break;
+ }
+ }
+ write_unlock(&say_lock);
+
+ for (i = 0; i < MAX_SAY_CLASS; i++) {
+ for (j = 0; j < 2; j++) {
+ if (ch->ch_filp[i][j]) {
+ filp_close(ch->ch_filp[i][j], NULL);
+ ch->ch_filp[i][j] = NULL;
+ }
+ }
+ for (j = 0; j < 2; j++) {
+ char *buf = ch->ch_buf[i][j];
+
+ if (buf) {
+ __free_pages(virt_to_page((unsigned long)buf), SAY_ORDER);
+ atomic_dec(&say_alloc_pages);
+ }
+ }
+ }
+ if (ch->ch_name) {
+ atomic_dec(&say_alloc_names);
+ kfree(ch->ch_name);
+ }
+ kfree(ch);
+ atomic_dec(&say_alloc_channels);
+out_return:;
+}
+
+static
+struct say_channel *_make_channel(const char *name, bool must_exist)
+{
+ struct say_channel *res = NULL;
+ struct kstat kstat = {};
+ int i, j;
+ unsigned long mode = use_atomic() ? GFP_ATOMIC : GFP_BRICK;
+
+ mm_segment_t oldfs;
+ bool is_dir = false;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = vfs_stat((char *)name, &kstat);
+ set_fs(oldfs);
+
+ if (unlikely(status < 0)) {
+ if (must_exist) {
+ say(SAY_ERROR, "cannot create channel '%s', status = %d\n", name, status);
+ goto done;
+ }
+ } else {
+ is_dir = S_ISDIR(kstat.mode);
+ }
+
+restart:
+ res = kzalloc(sizeof(struct say_channel), mode);
+ if (unlikely(!res)) {
+ schedule();
+ goto restart;
+ }
+ atomic_inc(&say_alloc_channels);
+ res->ch_must_exist = must_exist;
+ res->ch_is_dir = is_dir;
+ init_waitqueue_head(&res->ch_progress);
+restart2:
+ res->ch_name = kstrdup(name, mode);
+ if (unlikely(!res->ch_name)) {
+ schedule();
+ goto restart2;
+ }
+ atomic_inc(&say_alloc_names);
+ for (i = 0; i < MAX_SAY_CLASS; i++) {
+ spin_lock_init(&res->ch_lock[i]);
+ for (j = 0; j < 2; j++) {
+ char *buf;
+
+restart3:
+ buf = (void *)__get_free_pages(mode, SAY_ORDER);
+ if (unlikely(!buf)) {
+ schedule();
+ goto restart3;
+ }
+ atomic_inc(&say_alloc_pages);
+ res->ch_buf[i][j] = buf;
+ }
+ }
+done:
+ return res;
+}
+
+struct say_channel *make_channel(const char *name, bool must_exist)
+{
+ struct say_channel *res = NULL;
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (!strcmp(ch->ch_name, name)) {
+ res = ch;
+ break;
+ }
+ }
+ read_unlock(&say_lock);
+
+ if (unlikely(!res)) {
+ res = _make_channel(name, must_exist);
+ if (unlikely(!res))
+ goto done;
+
+ write_lock(&say_lock);
+
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (ch != res && unlikely(!strcmp(ch->ch_name, name))) {
+ _del_channel(res);
+ res = ch;
+ goto race_found;
+ }
+ }
+
+ res->ch_next = channel_list;
+ channel_list = res;
+
+race_found:
+ write_unlock(&say_lock);
+ }
+
+done:
+ return res;
+}
+EXPORT_SYMBOL_GPL(make_channel);
+
+/* tell gcc to check for varargs errors */
+static
+void _say(struct say_channel *ch, int class, va_list args, bool use_args, const char *fmt, ...) __printf(5, 6);
+
+static
+void _say(struct say_channel *ch, int class, va_list args, bool use_args, const char *fmt, ...)
+{
+ char *start;
+ int offset;
+ int rest;
+ int written;
+
+ if (unlikely(!ch))
+ goto out_return;
+ if (unlikely(ch->ch_delete && ch != default_channel)) {
+ say_to(default_channel, SAY_ERROR, "thread '%s' tried to write on deleted channel\n", current->comm);
+ goto out_return;
+ }
+
+ offset = ch->ch_index[class];
+ start = ch->ch_buf[class][0] + offset;
+ rest = SAY_BUFMAX - 1 - offset;
+ if (unlikely(rest <= 0)) {
+ ch->ch_overflow[class]++;
+ goto out_return;
+ }
+
+ if (use_args) {
+ va_list args2;
+
+ va_start(args2, fmt);
+ written = vscnprintf(start, rest, fmt, args2);
+ va_end(args2);
+ } else {
+ written = vscnprintf(start, rest, fmt, args);
+ }
+
+ if (likely(rest > written)) {
+ start[written] = '\0';
+ ch->ch_index[class] += written;
+ say_dirty = true;
+ } else {
+ /* indicate overflow */
+ start[0] = '\0';
+ ch->ch_overflow[class]++;
+ }
+out_return:;
+}
+
+void say_to(struct say_channel *ch, int class, const char *fmt, ...)
+{
+ va_list args;
+ unsigned long flags;
+
+ if (!class && !brick_say_debug)
+ goto out_return;
+ if (!ch)
+ ch = find_channel(current);
+
+ if (likely(ch)) {
+ if (!ch->ch_is_dir)
+ class = SAY_TOTAL;
+ if (likely(class >= 0 && class < MAX_SAY_CLASS)) {
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+ }
+ }
+
+ ch = default_channel;
+ if (likely(ch)) {
+ class = SAY_TOTAL;
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+
+ wake_up_interruptible(&say_event);
+ }
+out_return:;
+}
+EXPORT_SYMBOL_GPL(say_to);
+
+void brick_say_to(struct say_channel *ch,
+ int class,
+ bool dump,
+ const char *prefix,
+ const char *file,
+ int line,
+ const char *func,
+ const char *fmt,
+ ...)
+{
+ const char *channel_name = "-";
+ struct timespec s_now;
+ struct timespec l_now;
+ int filelen;
+ int orig_class;
+ va_list args;
+ unsigned long flags;
+
+ if (!class && !brick_say_debug)
+ goto out_return;
+ s_now = CURRENT_TIME;
+ get_lamport(&l_now);
+
+ if (!ch)
+ ch = find_channel(current);
+
+ orig_class = class;
+
+ /* limit the filename */
+ filelen = strlen(file);
+ if (filelen > MAX_FILELEN)
+ file += filelen - MAX_FILELEN;
+
+ if (likely(ch)) {
+ channel_name = ch->ch_name;
+ if (!ch->ch_is_dir)
+ class = SAY_TOTAL;
+ if (likely(class >= 0 && class < MAX_SAY_CLASS)) {
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ _say(ch, class, NULL, true,
+ "%ld.%09ld %ld.%09ld %s %s[%d] %s:%d %s(): ",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ prefix,
+ current->comm, (int)smp_processor_id(),
+ file, line,
+ func);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+ }
+ }
+
+ ch = default_channel;
+ if (likely(ch)) {
+ wait_channel(ch, SAY_TOTAL);
+ spin_lock_irqsave(&ch->ch_lock[SAY_TOTAL], flags);
+
+ _say(ch, SAY_TOTAL, NULL, true,
+ "%ld.%09ld %ld.%09ld %s_%-5s %s %s[%d] %s:%d %s(): ",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ prefix, say_class[orig_class],
+ channel_name,
+ current->comm, (int)smp_processor_id(),
+ file, line,
+ func);
+
+ va_start(args, fmt);
+ _say(ch, SAY_TOTAL, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[SAY_TOTAL], flags);
+
+ }
+#ifdef CONFIG_MARS_DEBUG
+ if (dump)
+ brick_dump_stack();
+#endif
+ wake_up_interruptible(&say_event);
+out_return:;
+}
+EXPORT_SYMBOL_GPL(brick_say_to);
+
+static
+void try_open_file(struct file **file, char *filename, bool creat)
+{
+ struct address_space *mapping;
+ int flags = O_APPEND | O_WRONLY | O_LARGEFILE;
+ int prot = 0600;
+
+ if (creat)
+ flags |= O_CREAT;
+
+ *file = filp_open(filename, flags, prot);
+ if (unlikely(IS_ERR(*file))) {
+ *file = NULL;
+ goto out_return;
+ }
+ mapping = (*file)->f_mapping;
+ if (likely(mapping))
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+out_return:;
+}
+
+static
+void out_to_file(struct file *file, char *buf, int len)
+{
+ loff_t log_pos = 0;
+
+ mm_segment_t oldfs;
+
+ if (file) {
+ oldfs = get_fs();
+ set_fs(get_ds());
+ (void)vfs_write(file, buf, len, &log_pos);
+ set_fs(oldfs);
+ }
+}
+
+static inline
+void reset_flood(void)
+{
+ if (flood_start_jiffies &&
+ time_is_before_jiffies(flood_start_jiffies + brick_say_syslog_flood_recovery * HZ)) {
+ flood_start_jiffies = 0;
+ flood_count = 0;
+ }
+}
+
+/* checkpatch.pl PREFER_PR_LEVEL:
+ *
+ * This is intended as _the_ _one_ subsystem-specific place where
+ * subsystem-specific {BRICK, MARS}_{INF, WRN, ERR, ...}() functions are
+ * mapped to the ordinary kernel printk().
+ *
+ * As noted elsewhere, dev_info() and friends cannot be used in XIO_Light,
+ * because /dev/mars/ devices are created dynamically at runtime.
+ * On secondaries, no /dev/mars/ devices may exist at all.
+ *
+ * In case you want different naming conventions such as
+ * brick_err_printk() or similar, this could be automatically converted
+ * via the ./rework-mars-for-upstream.pl script.
+ * IMHO, the current naming conventions are short, systematic, and
+ * intuitive, so I personally would keep them in their current form.
+ */
+static
+void printk_with_class(int class, char *buf)
+{
+ switch (class) {
+ case SAY_INFO:
+ printk(KERN_INFO "%s", buf);
+ break;
+ case SAY_WARN:
+ printk(KERN_WARNING "%s", buf);
+ break;
+ case SAY_ERROR:
+ case SAY_FATAL:
+ printk(KERN_ERR "%s", buf);
+ break;
+ default:
+ printk(KERN_DEBUG "%s", buf);
+ }
+}
+
+static
+void out_to_syslog(int class, char *buf, int len)
+{
+ reset_flood();
+ if (class >= brick_say_syslog_min && class <= brick_say_syslog_max) {
+ buf[len] = '\0';
+ printk_with_class(class, buf);
+ } else if (class >= brick_say_syslog_flood_class && brick_say_syslog_flood_class >= 0 && class != SAY_TOTAL) {
+ flood_start_jiffies = jiffies;
+ if (++flood_count <= brick_say_syslog_flood_limit) {
+ buf[len] = '\0';
+ printk_with_class(class, buf);
+ }
+ }
+}
+
+static inline
+char *_make_filename(struct say_channel *ch, int class, int transact, int add_tmp)
+{
+ char *filename;
+
+restart:
+ filename = kmalloc(1024, GFP_KERNEL);
+ if (unlikely(!filename)) {
+ schedule();
+ goto restart;
+ }
+ atomic_inc(&say_alloc_names);
+ if (ch->ch_is_dir) {
+ snprintf(filename,
+ 1023,
+ "%s/%d.%s.%s%s",
+ ch->ch_name,
+ class,
+ say_class[class],
+ transact ? "status" : "log",
+ add_tmp ? ".tmp" : "");
+ } else {
+ snprintf(filename, 1023, "%s.%s%s", ch->ch_name, transact ? "status" : "log", add_tmp ? ".tmp" : "");
+ }
+ return filename;
+}
+
+static
+void _rollover_channel(struct say_channel *ch)
+{
+ int start = 0;
+ int class;
+
+ ch->ch_rollover = false;
+ ch->ch_status_written = 0;
+
+ if (!ch->ch_is_dir)
+ start = SAY_TOTAL;
+
+ for (class = start; class < MAX_SAY_CLASS; class++) {
+ char *old = _make_filename(ch, class, 1, 1);
+ char *new = _make_filename(ch, class, 1, 0);
+
+ if (likely(old && new)) {
+ int i;
+
+ mm_segment_t oldfs;
+
+ for (i = 0; i < 2; i++) {
+ if (ch->ch_filp[class][i]) {
+ filp_close(ch->ch_filp[class][i], NULL);
+ ch->ch_filp[class][i] = NULL;
+ }
+ }
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ sys_rename(old, new);
+ set_fs(oldfs);
+ }
+
+ if (likely(old)) {
+ kfree(old);
+ atomic_dec(&say_alloc_names);
+ }
+ if (likely(new)) {
+ kfree(new);
+ atomic_dec(&say_alloc_names);
+ }
+ }
+}
+
+static
+void treat_channel(struct say_channel *ch, int class)
+{
+ int len;
+ int overflow;
+ int transact;
+ int start;
+ char *buf;
+ char *tmp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ buf = ch->ch_buf[class][0];
+ tmp = ch->ch_buf[class][1];
+ ch->ch_buf[class][1] = buf;
+ ch->ch_buf[class][0] = tmp;
+ len = ch->ch_index[class];
+ ch->ch_index[class] = 0;
+ overflow = ch->ch_overflow[class];
+ ch->ch_overflow[class] = 0;
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+
+ wake_up_interruptible(&ch->ch_progress);
+
+ ch->ch_status_written += len;
+ out_to_syslog(class, buf, len);
+ start = 0;
+ if (!brick_say_logging)
+ start++;
+ for (transact = start; transact < 2; transact++) {
+ if (unlikely(!ch->ch_filp[class][transact])) {
+ char *filename = _make_filename(ch, class, transact, transact);
+
+ if (likely(filename)) {
+ try_open_file(&ch->ch_filp[class][transact], filename, transact);
+ kfree(filename);
+ atomic_dec(&say_alloc_names);
+ }
+ }
+ out_to_file(ch->ch_filp[class][transact], buf, len);
+ }
+
+ if (unlikely(overflow > 0)) {
+ struct timespec s_now = CURRENT_TIME;
+ struct timespec l_now;
+
+ get_lamport(&l_now);
+ len = scnprintf(buf,
+ SAY_BUFMAX,
+ "%ld.%09ld %ld.%09ld %s %d OVERFLOW %d times\n",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ ch->ch_name,
+ class,
+ overflow);
+ ch->ch_status_written += len;
+ out_to_syslog(class, buf, len);
+ for (transact = 0; transact < 2; transact++)
+ out_to_file(ch->ch_filp[class][transact], buf, len);
+ }
+}
+
+static
+int _say_thread(void *data)
+{
+ while (!kthread_should_stop()) {
+ struct say_channel *ch;
+ int i;
+
+ wait_event_interruptible_timeout(say_event, say_dirty, HZ);
+ say_dirty = false;
+
+restart_rollover:
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (ch->ch_rollover && ch->ch_status_written > 0) {
+ read_unlock(&say_lock);
+ _rollover_channel(ch);
+ goto restart_rollover;
+ }
+ }
+ read_unlock(&say_lock);
+
+restart:
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ int start = 0;
+
+ if (!ch->ch_is_dir)
+ start = SAY_TOTAL;
+ for (i = start; i < MAX_SAY_CLASS; i++) {
+ if (ch->ch_index[i] > 0) {
+ read_unlock(&say_lock);
+ treat_channel(ch, i);
+ goto restart;
+ }
+ }
+ if (ch->ch_delete) {
+ read_unlock(&say_lock);
+ _del_channel(ch);
+ goto restart;
+ }
+ }
+ read_unlock(&say_lock);
+ }
+
+ return 0;
+}
+
+void init_say(void)
+{
+ default_channel = make_channel(CONFIG_MARS_LOGDIR, true);
+ say_thread = kthread_create(_say_thread, NULL, "brick_say");
+ if (IS_ERR(say_thread)) {
+ say_thread = NULL;
+ } else {
+ get_task_struct(say_thread);
+ wake_up_process(say_thread);
+ }
+
+}
+EXPORT_SYMBOL_GPL(init_say);
+
+void exit_say(void)
+{
+ int memleak_channels;
+ int memleak_names;
+ int memleak_pages;
+
+ if (say_thread) {
+ kthread_stop(say_thread);
+ put_task_struct(say_thread);
+ say_thread = NULL;
+ }
+
+ default_channel = NULL;
+ while (channel_list)
+ _del_channel(channel_list);
+
+ memleak_channels = atomic_read(&say_alloc_channels);
+ memleak_names = atomic_read(&say_alloc_names);
+ memleak_pages = atomic_read(&say_alloc_pages);
+ if (unlikely(memleak_channels || memleak_names || memleak_pages))
+ printk("MEMLEAK: channels=%d names=%d pages=%d\n", memleak_channels, memleak_names, memleak_pages);
+}
+EXPORT_SYMBOL_GPL(exit_say);
+
+#ifdef CONFIG_MARS_DEBUG
+
+static int dump_max = 5;
+
+void brick_dump_stack(void)
+{
+ if (dump_max > 0) {
+ dump_max--; /* racy, but does no harm */
+ dump_stack();
+ }
+}
+EXPORT_SYMBOL(brick_dump_stack);
+
+#endif
--
2.0.0

2014-07-01 21:47:47

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 07/50] mars: add new file include/linux/brick/brick_checking.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/brick_checking.h | 88 ++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)
create mode 100644 include/linux/brick/brick_checking.h

diff --git a/include/linux/brick/brick_checking.h b/include/linux/brick/brick_checking.h
new file mode 100644
index 0000000..e5d107d
--- /dev/null
+++ b/include/linux/brick/brick_checking.h
@@ -0,0 +1,88 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_CHECKING_H
+#define BRICK_CHECKING_H
+
+/***********************************************************************/
+
+/* checking */
+
+#if defined(CONFIG_MARS_DEBUG) || defined(CONFIG_MARS_CHECKS)
+#define BRICK_CHECKING true
+#else
+#define BRICK_CHECKING false
+#endif
+
+#define _CHECK_ATOMIC(atom, OP, minval) \
+do { \
+ if (BRICK_CHECKING) { \
+ int __test = atomic_read(atom); \
+ if (unlikely(__test OP(minval))) { \
+ atomic_set(atom, minval); \
+ BRICK_ERR("%d: atomic " #atom " " #OP " " #minval " (%d)\n", __LINE__, __test);\
+ } \
+ } \
+} while (0)
+
+#define CHECK_ATOMIC(atom, minval) \
+ _CHECK_ATOMIC(atom, <, minval)
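+
+/* Example (sketch; the field name is illustrative):
+ *
+ *	CHECK_ATOMIC(&output->work_count, 0);
+ *
+ * logs an error and repairs the counter by resetting it to the minimum
+ * whenever it has dropped below, instead of letting the underflow
+ * propagate.
+ */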
+
+#define CHECK_HEAD_EMPTY(head) \
+do { \
+ if (BRICK_CHECKING && unlikely(!list_empty(head) && (head)->next)) {\
+ list_del_init(head); \
+ BRICK_ERR("%d: list_head " #head " (%p) not empty\n", __LINE__, head);\
+ } \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG_MEM
+#define CHECK_PTR_DEAD(ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely((ptr) == (void *)0x5a5a5a5a5a5a5a5a)) {\
+ BRICK_FAT("%d: pointer '" #ptr "' is DEAD\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
+#else
+#define CHECK_PTR_DEAD(ptr, label) /*empty*/
+#endif
+
+#define CHECK_PTR_NULL(ptr, label) \
+do { \
+ CHECK_PTR_DEAD(ptr, label); \
+ if (BRICK_CHECKING && unlikely(!(ptr))) { \
+ BRICK_FAT("%d: pointer '" #ptr "' is NULL\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG
+#define CHECK_PTR(ptr, label) \
+do { \
+ CHECK_PTR_NULL(ptr, label); \
+ if (BRICK_CHECKING && unlikely(!virt_addr_valid(ptr))) { \
+ BRICK_FAT("%d: pointer '" #ptr "' (%p) is no valid virtual KERNEL address\n", __LINE__, ptr);\
+ goto label; \
+ } \
+} while (0)
+#else
+#define CHECK_PTR(ptr, label) CHECK_PTR_NULL(ptr, label)
+#endif
+
+#define CHECK_ASPECT(a_ptr, o_ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely((a_ptr)->object != o_ptr)) { \
+ BRICK_FAT("%d: aspect pointer '" #a_ptr "' (%p) belongs to object %p, not to " #o_ptr " (%p)\n",\
+ __LINE__, a_ptr, (a_ptr)->object, o_ptr); \
+ goto label; \
+ } \
+} while (0)
+
+#define _CHECK(ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely(!(ptr))) { \
+ BRICK_FAT("%d: condition '" #ptr "' is VIOLATED\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
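+
+/* Example (sketch; names are illustrative):
+ *
+ *	CHECK_PTR(brick, err);
+ *	...
+ * err:
+ *	return -EINVAL;
+ *
+ * On a NULL or otherwise invalid pointer, a fatal message is logged and
+ * control jumps to the given label instead of dereferencing the pointer.
+ */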
+
+#endif
--
2.0.0

2014-07-01 21:50:59

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 40/50] mars: add new file drivers/block/mars/mars_light/light_strategy.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/mars_light/light_strategy.c | 1898 ++++++++++++++++++++++++
1 file changed, 1898 insertions(+)
create mode 100644 drivers/block/mars/mars_light/light_strategy.c

diff --git a/drivers/block/mars/mars_light/light_strategy.c b/drivers/block/mars/mars_light/light_strategy.c
new file mode 100644
index 0000000..04a26fe
--- /dev/null
+++ b/drivers/block/mars/mars_light/light_strategy.c
@@ -0,0 +1,1898 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#define XIO_DEBUGGING
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/utsname.h>
+
+#include <linux/mars_light/light_strategy.h>
+
+#include <linux/lib_mapfree.h>
+#include <linux/xio/xio_client.h>
+
+#include <linux/syscalls.h>
+#include <linux/namei.h>
+#include <linux/kthread.h>
+#include <linux/statfs.h>
+
+#define SKIP_BIO false
+
+/*******************************************************************/
+
+/* meta descriptions */
+
+const struct meta mars_kstat_meta[] = {
+ META_INI(ino, struct kstat, FIELD_UINT),
+ META_INI(mode, struct kstat, FIELD_UINT),
+ META_INI(size, struct kstat, FIELD_INT),
+ META_INI_SUB(atime, struct kstat, xio_timespec_meta),
+ META_INI_SUB(mtime, struct kstat, xio_timespec_meta),
+ META_INI_SUB(ctime, struct kstat, xio_timespec_meta),
+ META_INI_TRANSFER(blksize, struct kstat, FIELD_UINT, 4),
+ {}
+};
+EXPORT_SYMBOL_GPL(mars_kstat_meta);
+
+const struct meta mars_dent_meta[] = {
+ META_INI(d_name, struct mars_dent, FIELD_STRING),
+ META_INI(d_rest, struct mars_dent, FIELD_STRING),
+ META_INI(d_path, struct mars_dent, FIELD_STRING),
+ META_INI(d_type, struct mars_dent, FIELD_UINT),
+ META_INI(d_class, struct mars_dent, FIELD_INT),
+ META_INI(d_serial, struct mars_dent, FIELD_INT),
+ META_INI(d_corr_A, struct mars_dent, FIELD_INT),
+ META_INI(d_corr_B, struct mars_dent, FIELD_INT),
+ META_INI_SUB(stat_val, struct mars_dent, mars_kstat_meta),
+ META_INI(link_val, struct mars_dent, FIELD_STRING),
+ META_INI(d_args, struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[0], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[1], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[2], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[3], struct mars_dent, FIELD_STRING),
+ {}
+};
+EXPORT_SYMBOL_GPL(mars_dent_meta);
+
+/*******************************************************************/
+
+/* some helpers */
+
+static inline
+int _length_paranoia(int len, int line)
+{
+ if (unlikely(len < 0)) {
+ XIO_ERR("implausible string length %d (line=%d)\n", len, line);
+ len = PAGE_SIZE - 2;
+ } else if (unlikely(len > PAGE_SIZE - 2)) {
+ XIO_WRN("string length %d will be truncated to %d (line=%d)\n",
+ len, (int)PAGE_SIZE - 2, line);
+ len = PAGE_SIZE - 2;
+ }
+ return len;
+}
+
+int mars_stat(const char *path, struct kstat *stat, bool use_lstat)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ if (use_lstat)
+ status = vfs_lstat((char *)path, stat);
+ else
+ status = vfs_stat((char *)path, stat);
+ set_fs(oldfs);
+
+ if (likely(status >= 0))
+ set_lamport(&stat->mtime);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_stat);
+
+int mars_mkdir(const char *path)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_mkdir(path, 0700);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_mkdir);
+
+int mars_rmdir(const char *path)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_rmdir(path);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_rmdir);
+
+int mars_unlink(const char *path)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_unlink(path);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_unlink);
+
+int mars_symlink(const char *oldpath, const char *newpath, const struct timespec *stamp, uid_t uid)
+{
+ char *tmp = backskip_replace(newpath, '/', true, "/.tmp-");
+
+ mm_segment_t oldfs;
+ struct kstat stat = {};
+ struct timespec times[2];
+ int status = -ENOMEM;
+
+ if (unlikely(!tmp))
+ goto done;
+
+ if (stamp)
+ memcpy(&times[0], stamp, sizeof(times[0]));
+ else
+ get_lamport(&times[0]);
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ /* Some filesystems have only full second resolution.
+ * Thus it may happen that the new timestamp is not
+ * truly moving forward when called twice shortly.
+ * This is a _workaround_, to be replaced by a better
+ * method somewhen.
+ */
+ status = vfs_lstat((char *)newpath, &stat);
+ if (status >= 0 &&
+ !stamp &&
+ !stat.mtime.tv_nsec &&
+ times[0].tv_sec == stat.mtime.tv_sec) {
+ XIO_DBG("workaround timestamp tv_sec=%ld\n", stat.mtime.tv_sec);
+ times[0].tv_sec = stat.mtime.tv_sec + 1;
+ /* Setting tv_nsec to 1 prevents us from unnecessarily re-entering
+ * this workaround if the original tv_nsec accidentally was 0,
+ * or if the workaround had already been triggered.
+ */
+ times[0].tv_nsec = 1;
+ }
+
+ (void)sys_unlink(tmp);
+
+ status = sys_symlink(oldpath, tmp);
+
+ if (status >= 0) {
+ sys_lchown(tmp, uid, 0);
+ memcpy(&times[1], &times[0], sizeof(struct timespec));
+ status = do_utimes(AT_FDCWD, tmp, times, AT_SYMLINK_NOFOLLOW);
+ }
+
+ if (status >= 0) {
+ set_lamport(&times[0]);
+ status = mars_rename(tmp, newpath);
+ }
+ set_fs(oldfs);
+ brick_string_free(tmp);
+
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_symlink);
+
+char *mars_readlink(const char *newpath)
+{
+ char *res = NULL;
+ struct path path = {};
+
+ mm_segment_t oldfs;
+ struct inode *inode;
+ int len;
+ int status = -ENOMEM;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ status = user_path_at(AT_FDCWD, newpath, 0, &path);
+ if (unlikely(status < 0)) {
+ XIO_DBG("link '%s' does not exist, status = %d\n", newpath, status);
+ goto done_fs;
+ }
+
+ inode = path.dentry->d_inode;
+ if (unlikely(!inode || !S_ISLNK(inode->i_mode))) {
+ XIO_ERR("link '%s' has invalid inode\n", newpath);
+ status = -EINVAL;
+ goto done_put;
+ }
+
+ len = i_size_read(inode);
+ if (unlikely(len <= 0 || len > PAGE_SIZE)) {
+ XIO_ERR("link '%s' invalid length = %d\n", newpath, len);
+ status = -EINVAL;
+ goto done_put;
+ }
+ res = brick_string_alloc(len + 2);
+
+ status = inode->i_op->readlink(path.dentry, res, len + 1);
+ if (unlikely(status < 0))
+ XIO_ERR("cannot read link '%s', status = %d\n", newpath, status);
+ else
+ set_lamport(&inode->i_mtime);
+done_put:
+ path_put(&path);
+
+done_fs:
+ set_fs(oldfs);
+ if (unlikely(status < 0)) {
+ if (unlikely(!res))
+ res = brick_string_alloc(1);
+ res[0] = '\0';
+ }
+ return res;
+}
+EXPORT_SYMBOL_GPL(mars_readlink);
+
+int mars_rename(const char *oldpath, const char *newpath)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_rename(oldpath, newpath);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_rename);
+
+int mars_chmod(const char *path, mode_t mode)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_chmod(path, mode);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_chmod);
+
+int mars_lchown(const char *path, uid_t uid)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = sys_lchown(path, uid, 0);
+ set_fs(oldfs);
+
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_lchown);
+
+loff_t _compute_space(struct kstatfs *kstatfs, loff_t raw_val)
+{
+ int fsize = kstatfs->f_frsize;
+
+ if (fsize <= 0)
+ fsize = kstatfs->f_bsize;
+
+ XIO_INF("fsize = %d raw_val = %lld\n", fsize, raw_val);
+ /* illegal values? cannot do anything.... */
+ if (fsize <= 0)
+ return 0;
+
+ /* prevent intermediate integer overflows */
+ if (fsize <= 1024)
+ return raw_val / (1024 / fsize);
+
+ return raw_val * (fsize / 1024);
+}
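+
+/* The result is in KiB: e.g. with a fragment size of 4096 the raw block
+ * count is multiplied by 4, while with 512 it is divided by 2
+ * (illustrative arithmetic for common fragment sizes).
+ */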
+
+void mars_remaining_space(const char *fspath, loff_t *total, loff_t *remaining)
+{
+ struct path path = {};
+ struct kstatfs kstatfs = {};
+
+ mm_segment_t oldfs;
+ int res;
+
+ *total = *remaining = 0;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ res = user_path_at(AT_FDCWD, fspath, 0, &path);
+
+ set_fs(oldfs);
+
+ if (unlikely(res < 0)) {
+ XIO_ERR("cannot get fspath '%s', err = %d\n\n", fspath, res);
+ goto err;
+ }
+ if (unlikely(!path.dentry)) {
+ XIO_ERR("bad dentry for fspath '%s'\n", fspath);
+ res = -ENXIO;
+ goto done;
+ }
+
+#ifdef ST_RDONLY
+ res = vfs_statfs(&path, &kstatfs);
+#else
+ res = vfs_statfs(path.dentry, &kstatfs);
+#endif
+ if (unlikely(res < 0))
+ goto done;
+
+ *total = _compute_space(&kstatfs, kstatfs.f_blocks);
+ *remaining = _compute_space(&kstatfs, kstatfs.f_bavail);
+
+done:
+ path_put(&path);
+err:;
+}
+EXPORT_SYMBOL_GPL(mars_remaining_space);
+
+/************************************************************/
+
+/* thread binding */
+
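+/* Bind the current thread to the say channel associated with a dent.
+ * The channel is inherited from the nearest ancestor with d_use_channel
+ * set (falling back to the default channel) and memoized in the dent.
+ * Any previous binding is released first; calling with dent == NULL
+ * just drops the current binding.
+ */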
+void bind_to_dent(struct mars_dent *dent, struct say_channel **ch)
+{
+ if (!dent) {
+ if (*ch) {
+ remove_binding_from(*ch, current);
+ *ch = NULL;
+ }
+ goto out_return;
+ }
+ /* Memoize the channel. This is executed only once for each dent. */
+ if (unlikely(!dent->d_say_channel)) {
+ struct mars_dent *test = dent->d_parent;
+
+ for (;;) {
+ if (!test) {
+ dent->d_say_channel = default_channel;
+ break;
+ }
+ if (test->d_use_channel && test->d_path) {
+ dent->d_say_channel = make_channel(test->d_path, true);
+ break;
+ }
+ test = test->d_parent;
+ }
+ }
+ if (dent->d_say_channel != *ch) {
+ if (*ch)
+ remove_binding_from(*ch, current);
+ *ch = dent->d_say_channel;
+ if (*ch)
+ bind_to_channel(*ch, current);
+ }
+out_return:;
+}
+EXPORT_SYMBOL_GPL(bind_to_dent);
+
+/************************************************************/
+
+/* infrastructure */
+
+struct mars_global *mars_global = NULL;
+EXPORT_SYMBOL_GPL(mars_global);
+
+void (*_local_trigger)(void) = NULL;
+EXPORT_SYMBOL_GPL(_local_trigger);
+
+static
+void __local_trigger(void)
+{
+ if (mars_global) {
+ mars_global->main_trigger = true;
+ wake_up_interruptible_all(&mars_global->main_event);
+ }
+}
+
+bool xio_check_inputs(struct xio_brick *brick)
+{
+ int max_inputs;
+ int i;
+
+ if (likely(brick->type)) {
+ max_inputs = brick->type->max_inputs;
+ } else {
+ XIO_ERR("uninitialized brick '%s' '%s'\n", SAFE_STR(brick->brick_name), SAFE_STR(brick->brick_path));
+ return true;
+ }
+ for (i = 0; i < max_inputs; i++) {
+ struct xio_input *input = brick->inputs[i];
+ struct xio_output *prev_output;
+ struct xio_brick *prev_brick;
+
+ if (!input)
+ continue;
+ prev_output = input->connect;
+ if (!prev_output)
+ continue;
+ prev_brick = prev_output->brick;
+ CHECK_PTR(prev_brick, done);
+ if (prev_brick->power.on_led)
+ continue;
+done:
+ return true;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(xio_check_inputs);
+
+bool xio_check_outputs(struct xio_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct xio_output *output = brick->outputs[i];
+
+ if (!output || !output->nr_connected)
+ continue;
+ return true;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(xio_check_outputs);
+
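+/* Request a power state change for a brick. The transition is validated
+ * against the brick graph: switching on requires all predecessors to be
+ * on, and switching off requires that no successor is connected at all.
+ * brick_switch() is called even when nothing changes, so implementations
+ * must be idempotent.
+ */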
+int mars_power_button(struct xio_brick *brick, bool val, bool force_off)
+{
+ int status = 0;
+ bool oldval = brick->power.button;
+
+ if (force_off && !val)
+ brick->power.force_off = true;
+
+ if (brick->power.force_off)
+ val = false;
+
+ if (val != oldval) {
+ /* check whether switching is possible */
+ status = -EINVAL;
+ if (val) { /* check all inputs */
+ if (unlikely(xio_check_inputs(brick))) {
+ XIO_ERR("CANNOT SWITCH ON: brick '%s' '%s' has a turned-off predecessor\n",
+ brick->brick_name,
+ brick->brick_path);
+ goto done;
+ }
+ } else { /* check all outputs */
+ if (unlikely(xio_check_outputs(brick))) {
+ /* For now, we have a strong rule:
+ * Switching off is only allowed when no successor brick
+ * exists at all. This could be relaxed to checking
+ * whether all successor bricks are actually switched off.
+ * Probably it is a good idea to retain the stronger rule
+ * as long as nobody needs the relaxed one.
+ */
+ XIO_ERR("CANNOT SWITCH OFF: brick '%s' '%s' has a successor\n",
+ brick->brick_name,
+ brick->brick_path);
+ goto done;
+ }
+ }
+
+ XIO_DBG("brick '%s' '%s' type '%s' power button %d -> %d\n",
+ brick->brick_name,
+ brick->brick_path,
+ brick->type->type_name,
+ oldval,
+ val);
+
+ set_button(&brick->power, val, false);
+ }
+
+ if (unlikely(!brick->ops)) {
+ XIO_ERR("brick '%s' '%s' has no brick_switch() method\n", brick->brick_name, brick->brick_path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* Always call the switch function, even if nothing changes.
+ * The implementations must be idempotent.
+ * They may exploit the regular calls for some maintenance operations
+ * (e.g. changing disk capacity etc).
+ */
+ status = brick->ops->brick_switch(brick);
+
+ if (val != oldval)
+ local_trigger();
+
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(mars_power_button);
+
+/*******************************************************************/
+
+/* strategy layer */
+
+struct mars_cookie {
+ struct mars_global *global;
+
+ mars_dent_checker_fn checker;
+ char *path;
+ struct mars_dent *parent;
+ int allocsize;
+ int depth;
+ bool hit;
+};
+
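+/* Refresh the stat information of a dent via vfs_lstat(). For symlinks,
+ * the link target is (re)read into dent->link_val; for logfiles
+ * (names starting with "log-"), any dirty regions are queried in order
+ * to compute the correction offsets d_corr_A / d_corr_B.
+ */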
+static
+int get_inode(char *newpath, struct mars_dent *dent)
+{
+ mm_segment_t oldfs;
+ int status;
+ struct kstat tmp = {};
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ status = vfs_lstat(newpath, &tmp);
+ if (status < 0) {
+ XIO_WRN("cannot stat '%s', status = %d\n", newpath, status);
+ goto done;
+ }
+
+ memcpy(&dent->stat_val, &tmp, sizeof(dent->stat_val));
+
+ if (S_ISLNK(dent->stat_val.mode)) {
+ struct path path = {};
+ int len = dent->stat_val.size;
+ struct inode *inode;
+ char *link;
+
+ if (unlikely(len <= 0)) {
+ XIO_ERR("symlink '%s' bad len = %d\n", newpath, len);
+ status = -EINVAL;
+ goto done;
+ }
+ len = _length_paranoia(len, __LINE__);
+
+ status = user_path_at(AT_FDCWD, newpath, 0, &path);
+ if (unlikely(status < 0)) {
+ XIO_WRN("cannot read link '%s'\n", newpath);
+ goto done;
+ }
+
+ inode = path.dentry->d_inode;
+
+ status = -ENOMEM;
+ link = brick_string_alloc(len + 2);
+ status = inode->i_op->readlink(path.dentry, link, len + 1);
+ link[len] = '\0';
+ if (status < 0 ||
+ (dent->link_val && !strncmp(dent->link_val, link, len))) {
+ brick_string_free(link);
+ } else {
+ brick_string_free(dent->link_val);
+ dent->link_val = link;
+ }
+ path_put(&path);
+ } else if (S_ISREG(dent->stat_val.mode) && dent->d_name && !strncmp(dent->d_name, "log-", 4)) {
+ loff_t min = dent->stat_val.size;
+ loff_t max = 0;
+
+ dent->d_corr_A = 0;
+ dent->d_corr_B = 0;
+ mf_get_any_dirty(newpath, &min, &max, 0, 2);
+ if (min < dent->stat_val.size) {
+ XIO_DBG("file '%s' A size=%lld min=%lld max=%lld\n", newpath, dent->stat_val.size, min, max);
+ dent->d_corr_A = min;
+ }
+ mf_get_any_dirty(newpath, &min, &max, 0, 3);
+ if (min < dent->stat_val.size) {
+ XIO_DBG("file '%s' B size=%lld min=%lld max=%lld\n", newpath, dent->stat_val.size, min, max);
+ dent->d_corr_B = min;
+ }
+ }
+
+done:
+ set_fs(oldfs);
+ return status;
+}
+
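+/* Total order on dents: first by class, then by serial number, and
+ * finally by path. This keeps the global dent list sorted.
+ */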
+static
+int dent_compare(struct mars_dent *a, struct mars_dent *b)
+{
+ if (a->d_class < b->d_class)
+ return -1;
+ if (a->d_class > b->d_class)
+ return 1;
+ if (a->d_serial < b->d_serial)
+ return -1;
+ if (a->d_serial > b->d_serial)
+ return 1;
+ return strcmp(a->d_path, b->d_path);
+}
+
+struct mars_dir_context {
+ struct dir_context ctx;
+ struct mars_cookie *cookie;
+};
+
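+/* Callback for iterate_dir(): classify each directory entry via the
+ * checker function, build its full path, and insert it into the sorted
+ * global dent list, reusing an already existing dent when class, serial
+ * and path match.
+ */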
+static
+int mars_filler(void *__buf, const char *name, int namlen, loff_t offset,
+ u64 ino, unsigned int d_type)
+{
+ struct mars_dir_context *buf = __buf;
+ struct mars_cookie *cookie = buf->cookie;
+
+ struct mars_global *global = cookie->global;
+ struct list_head *anchor = &global->dent_anchor;
+ struct list_head *start = anchor;
+ struct mars_dent *dent;
+ struct list_head *tmp;
+ char *newpath;
+ int prefix = 0;
+ int pathlen;
+ int class;
+ int serial = 0;
+ bool use_channel = false;
+
+ cookie->hit = true;
+
+ if (name[0] == '.')
+ return 0;
+
+ class = cookie->checker(cookie->parent, name, namlen, d_type, &prefix, &serial, &use_channel);
+ if (class < 0)
+ return 0;
+
+ pathlen = strlen(cookie->path);
+ newpath = brick_string_alloc(pathlen + namlen + 2);
+ memcpy(newpath, cookie->path, pathlen);
+ newpath[pathlen++] = '/';
+ memcpy(newpath + pathlen, name, namlen);
+ pathlen += namlen;
+ newpath[pathlen] = '\0';
+
+ dent = brick_zmem_alloc(cookie->allocsize);
+
+ dent->d_class = class;
+ dent->d_serial = serial;
+ dent->d_path = newpath;
+
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ struct mars_dent *test = container_of(tmp, struct mars_dent, dent_link);
+ int cmp = dent_compare(test, dent);
+
+ if (!cmp) {
+ brick_mem_free(dent);
+ dent = test;
+ goto found;
+ }
+ /* keep the list sorted. find the next smallest member. */
+ if (cmp > 0)
+ break;
+ start = tmp;
+ }
+
+ dent->d_name = brick_string_alloc(namlen + 1);
+ memcpy(dent->d_name, name, namlen);
+ dent->d_name[namlen] = '\0';
+ dent->d_rest = brick_strdup(dent->d_name + prefix);
+
+ newpath = NULL;
+
+ INIT_LIST_HEAD(&dent->dent_link);
+ INIT_LIST_HEAD(&dent->brick_list);
+
+ list_add(&dent->dent_link, start);
+
+found:
+ dent->d_type = d_type;
+ dent->d_class = class;
+ dent->d_serial = serial;
+ if (dent->d_parent)
+ dent->d_parent->d_child_count--;
+ dent->d_parent = cookie->parent;
+ if (dent->d_parent)
+ dent->d_parent->d_child_count++;
+ dent->d_depth = cookie->depth;
+ dent->d_global = global;
+ dent->d_killme = false;
+ dent->d_use_channel = use_channel;
+ brick_string_free(newpath);
+ return 0;
+}
+
+static int _mars_readdir(struct mars_cookie *cookie)
+{
+ struct file *f;
+ struct address_space *mapping;
+
+ mm_segment_t oldfs;
+ int status = 0;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ f = filp_open(cookie->path, O_DIRECTORY | O_RDONLY, 0);
+ set_fs(oldfs);
+ if (unlikely(IS_ERR(f)))
+ return PTR_ERR(f);
+ mapping = f->f_mapping;
+ if (mapping)
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ for (;;) {
+ struct mars_dir_context buf = {
+ .ctx.actor = mars_filler,
+ .cookie = cookie,
+ };
+
+ cookie->hit = false;
+ status = iterate_dir(f, &buf.ctx);
+ if (!cookie->hit)
+ break;
+ if (unlikely(status < 0)) {
+ XIO_ERR("readdir() on path='%s' status=%d\n", cookie->path, status);
+ break;
+ }
+ }
+
+ filp_close(f, NULL);
+ return status;
+}
+
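+/* Scan the directory tree and run the worker over all dents in several
+ * passes: (1) readdir plus inode refresh, recursing into subdirectories
+ * up to maxdepth, (2) a non-destructive preparation pass, (3) removal
+ * of all dents marked d_killme, (4) a forward pass and (5) a backward
+ * pass over the sorted dent list.
+ */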
+int mars_dent_work(struct mars_global *global,
+ char *dirname,
+ int allocsize,
+ mars_dent_checker_fn checker,
+ mars_dent_worker_fn worker,
+ void *buf,
+ int maxdepth)
+{
+ static int version;
+
+ struct mars_cookie cookie = {
+ .global = global,
+ .checker = checker,
+ .path = dirname,
+ .parent = NULL,
+ .allocsize = allocsize,
+ .depth = 0,
+ };
+ struct say_channel *say_channel = NULL;
+ struct list_head *tmp;
+ struct list_head *next;
+ int rounds = 0;
+ int status;
+ int total_status = 0;
+ bool found_dir;
+
+ /* Initialize the flat dent list
+ */
+ version++;
+ global->global_version = version;
+ total_status = _mars_readdir(&cookie);
+
+ if (total_status || !worker)
+ goto done;
+
+ down_write(&global->dent_mutex);
+
+restart:
+ found_dir = false;
+
+ /* First, get all the inode information in a separate pass
+ * before starting work.
+ * The separate pass is necessary because some dents may
+ * forward-reference other dents, and it would be a pity if
+ * some inodes were not available or were outdated.
+ */
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ /* handle each member only once during this invocation */
+ if (dent->d_version == version)
+ continue;
+ dent->d_version = version;
+
+ bind_to_dent(dent, &say_channel);
+
+ status = get_inode(dent->d_path, dent);
+ total_status |= status;
+
+ /* recurse into subdirectories by inserting into the flat list */
+ if (S_ISDIR(dent->stat_val.mode) && dent->d_depth <= maxdepth) {
+ struct mars_cookie sub_cookie = {
+ .global = global,
+ .checker = checker,
+ .path = dent->d_path,
+ .allocsize = allocsize,
+ .parent = dent,
+ .depth = dent->d_depth + 1,
+ };
+ found_dir = true;
+ status = _mars_readdir(&sub_cookie);
+ total_status |= status;
+ if (status < 0)
+ XIO_INF("forward: status %d on '%s'\n", status, dent->d_path);
+ }
+ }
+ bind_to_dent(NULL, &say_channel);
+
+ if (found_dir && ++rounds < 10) {
+ brick_yield();
+ goto restart;
+ }
+
+ up_write(&global->dent_mutex);
+
+ /* Preparation pass.
+ * Here is a chance to mark some dents for removal
+ * (or other types of non-destructive operations)
+ */
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next,
+ next = tmp->next; tmp != &global->dent_anchor; tmp = next,
+ next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, true, false);
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ }
+ up_read(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+ /* Remove all dents marked for removal.
+ */
+ down_write(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next,
+ next = tmp->next; tmp != &global->dent_anchor; tmp = next,
+ next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!dent->d_killme)
+ continue;
+
+ bind_to_dent(dent, &say_channel);
+
+ XIO_DBG("killing dent '%s'\n", dent->d_path);
+ list_del_init(tmp);
+ xio_free_dent(dent);
+ }
+ up_write(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+ /* Forward pass.
+ */
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next,
+ next = tmp->next; tmp != &global->dent_anchor; tmp = next,
+ next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, false, false);
+
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ }
+ bind_to_dent(NULL, &say_channel);
+
+ /* Backward pass.
+ */
+ for (tmp = global->dent_anchor.prev,
+ next = tmp->prev; tmp != &global->dent_anchor; tmp = next,
+ next = next->prev) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, false, true);
+
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ if (status < 0)
+ XIO_INF("backwards: status %d on '%s'\n", status, dent->d_path);
+ }
+ up_read(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+done:
+ return total_status;
+}
+EXPORT_SYMBOL_GPL(mars_dent_work);
+
+struct mars_dent *_mars_find_dent(struct mars_global *global, const char *path)
+{
+ struct mars_dent *res = NULL;
+ struct list_head *tmp;
+
+ if (!rwsem_is_locked(&global->dent_mutex))
+ XIO_ERR("dent_mutex not held!\n");
+
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *tmp_dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!strcmp(tmp_dent->d_path, path)) {
+ res = tmp_dent;
+ break;
+ }
+ }
+
+ return res;
+}
+EXPORT_SYMBOL_GPL(_mars_find_dent);
+
+struct mars_dent *mars_find_dent(struct mars_global *global, const char *path)
+{
+ struct mars_dent *res;
+
+ if (!global)
+ return NULL;
+ down_read(&global->dent_mutex);
+ res = _mars_find_dent(global, path);
+ up_read(&global->dent_mutex);
+ return res;
+}
+EXPORT_SYMBOL_GPL(mars_find_dent);
+
+int mars_find_dent_all(struct mars_global *global, char *prefix, struct mars_dent ***table)
+{
+ int max = 1024; /* provisional */
+ int count = 0;
+ struct list_head *tmp;
+ struct mars_dent **res;
+ int prefix_len = strlen(prefix);
+
+ if (unlikely(!global))
+ goto done;
+
+ res = brick_zmem_alloc(max * sizeof(void *));
+ *table = res;
+
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *tmp_dent = container_of(tmp, struct mars_dent, dent_link);
+ int this_len;
+
+ if (!tmp_dent->d_path)
+ continue;
+ this_len = strlen(tmp_dent->d_path);
+ if (this_len < prefix_len || strncmp(tmp_dent->d_path, prefix, prefix_len))
+ continue;
+ res[count++] = tmp_dent;
+ if (count >= max)
+ break;
+ }
+ up_read(&global->dent_mutex);
+
+done:
+ return count;
+}
+EXPORT_SYMBOL_GPL(mars_find_dent_all);
+
+void xio_kill_dent(struct mars_dent *dent)
+{
+ dent->d_killme = true;
+ xio_kill_brick_all(NULL, &dent->brick_list, true);
+}
+EXPORT_SYMBOL_GPL(xio_kill_dent);
+
+void xio_free_dent(struct mars_dent *dent)
+{
+ int i;
+
+ xio_kill_dent(dent);
+
+ CHECK_HEAD_EMPTY(&dent->dent_link);
+ CHECK_HEAD_EMPTY(&dent->brick_list);
+
+ for (i = 0; i < MARS_ARGV_MAX; i++)
+ brick_string_free(dent->d_argv[i]);
+ brick_string_free(dent->d_args);
+ brick_string_free(dent->d_name);
+ brick_string_free(dent->d_rest);
+ brick_string_free(dent->d_path);
+ brick_string_free(dent->link_val);
+ if (likely(dent->d_parent))
+ dent->d_parent->d_child_count--;
+ if (dent->d_private_destruct)
+ dent->d_private_destruct(dent->d_private);
+ brick_mem_free(dent->d_private);
+ brick_mem_free(dent);
+}
+EXPORT_SYMBOL_GPL(xio_free_dent);
+
+void xio_free_dent_all(struct mars_global *global, struct list_head *anchor)
+{
+ LIST_HEAD(tmp_list);
+
+ if (global)
+ down_write(&global->dent_mutex);
+ list_replace_init(anchor, &tmp_list);
+ if (global)
+ up_write(&global->dent_mutex);
+ XIO_DBG("is_empty=%d\n", list_empty(&tmp_list));
+ while (!list_empty(&tmp_list)) {
+ struct mars_dent *dent;
+
+ dent = container_of(tmp_list.prev, struct mars_dent, dent_link);
+ list_del_init(&dent->dent_link);
+ xio_free_dent(dent);
+ }
+}
+EXPORT_SYMBOL_GPL(xio_free_dent_all);
+
+/*******************************************************************/
+
+/* low-level brick instantiation */
+
+struct xio_brick *mars_find_brick(struct mars_global *global, const void *brick_type, const char *path)
+{
+ struct list_head *tmp;
+
+ if (!global || !path)
+ return NULL;
+
+ down_read(&global->brick_mutex);
+
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test = container_of(tmp, struct xio_brick, global_brick_link);
+
+ if (!strcmp(test->brick_path, path)) {
+ up_read(&global->brick_mutex);
+ if (brick_type && test->type != brick_type) {
+ XIO_ERR("bad brick type\n");
+ return NULL;
+ }
+ return test;
+ }
+ }
+
+ up_read(&global->brick_mutex);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(mars_find_brick);
+
+int xio_free_brick(struct xio_brick *brick)
+{
+ struct mars_global *global;
+ int i;
+ int count;
+ int status;
+ int sleeptime;
+ int maxsleep;
+
+ if (!brick) {
+ XIO_ERR("bad brick parameter\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ if (!brick->power.force_off || !brick->power.off_led) {
+ XIO_WRN("brick '%s' is not freeable\n", brick->brick_path);
+ status = -ETXTBSY;
+ goto done;
+ }
+
+ /* first check whether the brick is in use somewhere */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct xio_output *output = brick->outputs[i];
+
+ if (output && output->nr_connected > 0) {
+ XIO_WRN("brick '%s' not freeable, output %i is used\n", brick->brick_path, i);
+ status = -EEXIST;
+ goto done;
+ }
+ }
+
+ /* Should not happen, but workaround: wait until flying IO has vanished */
+ maxsleep = 20000;
+ sleeptime = 1000;
+ for (;;) {
+ count = atomic_read(&brick->aio_object_layout.alloc_count);
+ if (likely(!count))
+ break;
+ if (maxsleep > 0) {
+ XIO_WRN("MEMLEAK: brick '%s' has %d aios allocated (total = %d, maxsleep = %d)\n",
+ brick->brick_path,
+ count,
+ atomic_read(&brick->aio_object_layout.total_alloc_count),
+ maxsleep);
+ } else {
+ XIO_ERR("MEMLEAK: brick '%s' has %d aios allocated (total = %d)\n",
+ brick->brick_path,
+ count,
+ atomic_read(&brick->aio_object_layout.total_alloc_count));
+ break;
+ }
+ brick_msleep(sleeptime);
+ maxsleep -= sleeptime;
+ }
+
+ XIO_DBG("===> freeing brick name = '%s' path = '%s'\n", brick->brick_name, brick->brick_path);
+
+ global = brick->global;
+ if (global) {
+ down_write(&global->brick_mutex);
+ list_del_init(&brick->global_brick_link);
+ list_del_init(&brick->dent_brick_link);
+ up_write(&global->brick_mutex);
+ }
+
+ for (i = 0; i < brick->type->max_inputs; i++) {
+ struct xio_input *input = brick->inputs[i];
+
+ if (input) {
+ XIO_DBG("disconnecting input %i\n", i);
+ generic_disconnect((void *)input);
+ }
+ }
+
+ XIO_DBG("deallocate name = '%s' path = '%s'\n", SAFE_STR(brick->brick_name), SAFE_STR(brick->brick_path));
+ brick_string_free(brick->brick_name);
+ brick_string_free(brick->brick_path);
+
+ status = generic_brick_exit_full((void *)brick);
+
+ if (status >= 0) {
+ brick_mem_free(brick);
+ local_trigger();
+ } else {
+ XIO_ERR("error freeing brick, status = %d\n", status);
+ }
+
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_free_brick);
+
+struct xio_brick *xio_make_brick(struct mars_global *global,
+ struct mars_dent *belongs,
+ const void *_brick_type,
+ const char *path,
+ const char *name)
+{
+ const struct generic_brick_type *brick_type = _brick_type;
+ const struct generic_input_type **input_types;
+ const struct generic_output_type **output_types;
+ struct xio_brick *res;
+ int size;
+ int i;
+ int status;
+
+ size = brick_type->brick_size +
+ (brick_type->max_inputs + brick_type->max_outputs) * sizeof(void *);
+ input_types = brick_type->default_input_types;
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ const struct generic_input_type *type = *input_types++;
+
+ if (unlikely(!type)) {
+ XIO_ERR("input_type %d is missing\n", i);
+ goto err_name;
+ }
+ if (unlikely(type->input_size <= 0)) {
+ XIO_ERR("bad input_size at %d\n", i);
+ goto err_name;
+ }
+ size += type->input_size;
+ }
+ output_types = brick_type->default_output_types;
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ const struct generic_output_type *type = *output_types++;
+
+ if (unlikely(!type)) {
+ XIO_ERR("output_type %d is missing\n", i);
+ goto err_name;
+ }
+ if (unlikely(type->output_size <= 0)) {
+ XIO_ERR("bad output_size at %d\n", i);
+ goto err_name;
+ }
+ size += type->output_size;
+ }
+
+ res = brick_zmem_alloc(size);
+ res->global = global;
+ INIT_LIST_HEAD(&res->dent_brick_link);
+ res->brick_name = brick_strdup(name);
+ res->brick_path = brick_strdup(path);
+
+ status = generic_brick_init_full(res, size, brick_type, NULL, NULL);
+ XIO_DBG("brick '%s' init '%s' '%s' (status=%d)\n", brick_type->type_name, path, name, status);
+ if (status < 0) {
+ XIO_ERR("cannot init brick %s\n", brick_type->type_name);
+ goto err_path;
+ }
+ res->free = xio_free_brick;
+
+ /* Immediately make it visible, regardless of internal state.
+ * Switching on / etc must be done separately.
+ */
+ down_write(&global->brick_mutex);
+ list_add(&res->global_brick_link, &global->brick_anchor);
+ if (belongs)
+ list_add_tail(&res->dent_brick_link, &belongs->brick_list);
+ up_write(&global->brick_mutex);
+
+ return res;
+
+err_path:
+ brick_string_free(res->brick_name);
+ brick_string_free(res->brick_path);
+ brick_mem_free(res);
+err_name:
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(xio_make_brick);
+
+int xio_kill_brick(struct xio_brick *brick)
+{
+ struct mars_global *global;
+ int status = -EINVAL;
+
+ CHECK_PTR(brick, done);
+ global = brick->global;
+
+ XIO_DBG("===> killing brick %s path = '%s' name = '%s'\n",
+ brick->type ? SAFE_STR(brick->type->type_name) : "undef",
+ SAFE_STR(brick->brick_path),
+ SAFE_STR(brick->brick_name));
+
+ if (unlikely(brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected)) {
+ XIO_ERR("sorry, output is in use '%s'\n", SAFE_STR(brick->brick_path));
+ goto done;
+ }
+
+ if (global) {
+ down_write(&global->brick_mutex);
+ list_del_init(&brick->global_brick_link);
+ list_del_init(&brick->dent_brick_link);
+ up_write(&global->brick_mutex);
+ }
+
+ if (brick->show_status)
+ brick->show_status(brick, true);
+
+ /* start shutdown */
+ set_button_wait((void *)brick, false, true, 0);
+
+ if (likely(brick->power.off_led)) {
+ int max_inputs = 0;
+ int i;
+
+ if (likely(brick->type)) {
+ max_inputs = brick->type->max_inputs;
+ } else {
+ XIO_ERR("uninitialized brick '%s' '%s'\n",
+ SAFE_STR(brick->brick_name),
+ SAFE_STR(brick->brick_path));
+ }
+
+ XIO_DBG("---> freeing '%s' '%s'\n", SAFE_STR(brick->brick_name), SAFE_STR(brick->brick_path));
+
+ if (brick->kill_ptr)
+ *brick->kill_ptr = NULL;
+
+ for (i = 0; i < max_inputs; i++) {
+ struct generic_input *input = (void *)brick->inputs[i];
+
+ if (!input)
+ continue;
+ status = generic_disconnect(input);
+ if (unlikely(status < 0)) {
+ XIO_ERR("brick '%s' '%s' disconnect %d failed, status = %d\n",
+ SAFE_STR(brick->brick_name),
+ SAFE_STR(brick->brick_path),
+ i,
+ status);
+ goto done;
+ }
+ }
+ if (likely(brick->free)) {
+ status = brick->free(brick);
+ if (unlikely(status < 0)) {
+ XIO_ERR("freeing '%s' '%s' failed, status = %d\n",
+ SAFE_STR(brick->brick_name),
+ SAFE_STR(brick->brick_path),
+ status);
+ goto done;
+ }
+ } else {
+ XIO_ERR("brick '%s' '%s' has no destructor\n",
+ SAFE_STR(brick->brick_name),
+ SAFE_STR(brick->brick_path));
+ }
+ status = 0;
+ } else {
+ XIO_ERR("brick '%s' '%s' is not off\n", SAFE_STR(brick->brick_name), SAFE_STR(brick->brick_path));
+ status = -EIO;
+ }
+
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_kill_brick);
+
+int xio_kill_brick_all(struct mars_global *global, struct list_head *anchor, bool use_dent_link)
+{
+ int status = 0;
+
+ if (!anchor || !anchor->next)
+ goto done;
+ if (global)
+ down_write(&global->brick_mutex);
+ while (!list_empty(anchor)) {
+ struct list_head *tmp = anchor->next;
+ struct xio_brick *brick;
+
+ if (use_dent_link)
+ brick = container_of(tmp, struct xio_brick, dent_brick_link);
+ else
+ brick = container_of(tmp, struct xio_brick, global_brick_link);
+ list_del_init(tmp);
+ if (global)
+ up_write(&global->brick_mutex);
+ status |= xio_kill_brick(brick);
+ if (global)
+ down_write(&global->brick_mutex);
+ }
+ if (global)
+ up_write(&global->brick_mutex);
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_kill_brick_all);
+
+int xio_kill_brick_when_possible(struct mars_global *global,
+ struct list_head *anchor,
+ bool use_dent_link,
+ const struct xio_brick_type *type,
+ bool even_on)
+{
+ int return_status = 0;
+ struct list_head *tmp;
+
+restart:
+ if (global)
+ down_write(&global->brick_mutex);
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ struct xio_brick *brick;
+ int status;
+
+ if (use_dent_link)
+ brick = container_of(tmp, struct xio_brick, dent_brick_link);
+ else
+ brick = container_of(tmp, struct xio_brick, global_brick_link);
+ /* only kill the right brick types */
+ if (type && brick->type != type)
+ continue;
+ /* only kill marked bricks */
+ if (!brick->killme)
+ continue;
+ /* only kill unconnected bricks */
+ if (brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected > 0)
+ continue;
+ if (!even_on && (brick->power.button || !brick->power.off_led))
+ continue;
+ /* Workaround FIXME:
+ * only kill bricks which have not been touched during the current mars_dent_work() round.
+ * some bricks like aio seem to have races between startup and termination of threads.
+ * disable this for stress-testing the allocation/deallocation logic.
+ * OTOH, frequently doing useless starts/stops is not a good idea.
+ * CHECK: how to avoid too frequent switching by other means?
+ */
+ if (brick->kill_round++ < 1)
+ continue;
+
+ list_del_init(tmp);
+ if (global)
+ up_write(&global->brick_mutex);
+
+ XIO_DBG("KILLING '%s'\n", brick->brick_name);
+ status = xio_kill_brick(brick);
+
+ if (status >= 0)
+ return_status++;
+ else
+ return status;
+ /* The list may have changed in unpredictable ways
+ * while the lock was dropped.
+ */
+ goto restart;
+ }
+ if (global)
+ up_write(&global->brick_mutex);
+ return return_status;
+}
+EXPORT_SYMBOL_GPL(xio_kill_brick_when_possible);
+
+/*******************************************************************/
+
+/* mid-level brick instantiation (identity is based on path strings) */
+
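+/* printf()-style string construction: measure the needed length with a
+ * vsnprintf() into a throwaway buffer (on a copy of the va_list), then
+ * allocate a brick string of the right size and format into it.
+ */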
+char *_vpath_make(int line, const char *fmt, va_list *args)
+{
+ va_list copy_args;
+ char dummy[2];
+ int len;
+ char *res;
+
+ memcpy(&copy_args, args, sizeof(copy_args));
+ len = vsnprintf(dummy, sizeof(dummy), fmt, copy_args);
+ len = _length_paranoia(len, line);
+ res = _brick_string_alloc(len + 2, line);
+
+ vsnprintf(res, len + 1, fmt, *args);
+
+ return res;
+}
+EXPORT_SYMBOL_GPL(_vpath_make);
+
+char *_path_make(int line, const char *fmt, ...)
+{
+ va_list args;
+ char *res;
+
+ va_start(args, fmt);
+ res = _vpath_make(line, fmt, &args);
+ va_end(args);
+ return res;
+}
+EXPORT_SYMBOL_GPL(_path_make);
+
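+/* Skip back to the last '/' in path (and, when delim != '/', forward
+ * again to the next occurrence of delim), then write the formatted text
+ * at that position. With insert == true, the old tail (minus its first
+ * character) is appended after the formatted text instead of being
+ * discarded.
+ */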
+char *_backskip_replace(int line, const char *path, char delim, bool insert, const char *fmt, ...)
+{
+ int path_len = strlen(path);
+ int fmt_len;
+ int total_len;
+ char *res;
+ va_list args;
+ int pos = path_len;
+ int plus;
+ char dummy[2];
+
+ va_start(args, fmt);
+ fmt_len = vsnprintf(dummy, sizeof(dummy), fmt, args);
+ va_end(args);
+ fmt_len = _length_paranoia(fmt_len, line);
+
+ total_len = fmt_len + path_len;
+ total_len = _length_paranoia(total_len, line);
+
+ res = _brick_string_alloc(total_len + 2, line);
+
+ while (pos > 0 && path[pos] != '/')
+ pos--;
+ if (delim != '/') {
+ while (pos < path_len && path[pos] != delim)
+ pos++;
+ }
+ memcpy(res, path, pos);
+
+ va_start(args, fmt);
+ plus = vscnprintf(res + pos, total_len - pos, fmt, args);
+ va_end(args);
+
+ if (insert)
+ strncpy(res + pos + plus, path + pos + 1, total_len - pos - plus);
+ return res;
+}
+EXPORT_SYMBOL_GPL(_backskip_replace);
+
+struct xio_brick *path_find_brick(struct mars_global *global, const void *brick_type, const char *fmt, ...)
+{
+ va_list args;
+ char *fullpath;
+ struct xio_brick *res;
+
+ va_start(args, fmt);
+ fullpath = vpath_make(fmt, &args);
+ va_end(args);
+
+ if (!fullpath)
+ return NULL;
+ res = mars_find_brick(global, brick_type, fullpath);
+ brick_string_free(fullpath);
+ return res;
+}
+EXPORT_SYMBOL_GPL(path_find_brick);
+
+const struct generic_brick_type *_client_brick_type = NULL;
+EXPORT_SYMBOL_GPL(_client_brick_type);
+const struct generic_brick_type *_bio_brick_type = NULL;
+EXPORT_SYMBOL_GPL(_bio_brick_type);
+const struct generic_brick_type *_aio_brick_type = NULL;
+EXPORT_SYMBOL_GPL(_aio_brick_type);
+const struct generic_brick_type *_sio_brick_type = NULL;
+EXPORT_SYMBOL_GPL(_sio_brick_type);
+
+struct xio_brick *make_brick_all(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ int (*setup_fn)(struct xio_brick *brick, void *private),
+ void *private,
+ const char *new_name,
+ const struct generic_brick_type *new_brick_type,
+ const struct generic_brick_type *prev_brick_type[],
+/* -1 = off, 0 = leave in current state, +1 = create when necessary, +2 = create + switch on */
+ int switch_override,
+ const char *new_fmt,
+ const char *prev_fmt[],
+ int prev_count,
+ ...
+ )
+{
+ va_list args;
+ const char *new_path;
+ char *_new_path = NULL;
+ struct xio_brick *brick = NULL;
+ char *paths[prev_count];
+ struct xio_brick *prev[prev_count];
+ bool switch_state;
+ int i;
+ int status;
+
+ /* process the variable arguments */
+ va_start(args, prev_count);
+ if (new_fmt)
+ new_path = _new_path = vpath_make(new_fmt, &args);
+ else
+ new_path = new_name;
+ for (i = 0; i < prev_count; i++)
+ paths[i] = vpath_make(prev_fmt[i], &args);
+ va_end(args);
+
+ if (!new_path) {
+ XIO_ERR("could not create new path\n");
+ goto err;
+ }
+
+ /* get old switch state */
+ brick = mars_find_brick(global, NULL, new_path);
+ switch_state = false;
+ if (brick)
+ switch_state = brick->power.button;
+ /* override? */
+ if (switch_override > 1)
+ switch_state = true;
+ else if (switch_override < 0)
+ switch_state = false;
+ /* even higher override */
+ if (global && !global->global_power.button)
+ switch_state = false;
+
+ /* brick already existing? */
+ if (brick) {
+ /* just switch the power state */
+ XIO_DBG("found existing brick '%s'\n", new_path);
+ /* highest general override */
+ if (xio_check_outputs(brick)) {
+ if (!switch_state)
+ XIO_DBG("brick '%s' override 0 -> 1\n", new_path);
+ switch_state = true;
+ }
+ goto do_switch;
+ }
+
+ /* brick not existing => check whether to create it */
+ if (switch_override < 1) { /* don't create */
+ XIO_DBG("no need for brick '%s'\n", new_path);
+ goto done;
+ }
+ XIO_DBG("make new brick '%s'\n", new_path);
+ if (!new_name)
+ new_name = new_path;
+
+ XIO_DBG("----> new brick type = '%s' path = '%s' name = '%s'\n",
+ new_brick_type->type_name,
+ new_path,
+ new_name);
+
+ /* get all predecessor bricks */
+ for (i = 0; i < prev_count; i++) {
+ char *path = paths[i];
+
+ if (!path) {
+ XIO_ERR("could not build path %d\n", i);
+ goto err;
+ }
+
+ prev[i] = mars_find_brick(global, NULL, path);
+
+ if (!prev[i]) {
+ XIO_WRN("prev brick '%s' does not exist\n", path);
+ goto err;
+ }
+ XIO_DBG("------> predecessor %d path = '%s'\n", i, path);
+ if (!prev[i]->power.on_led) {
+ switch_state = false;
+ XIO_DBG("predecessor power is not on\n");
+ }
+ }
+
+ /* some generic brick replacements (better performance / network functionality) */
+ brick = NULL;
+ if ((new_brick_type == _bio_brick_type || new_brick_type == _aio_brick_type)
+ && _client_brick_type != NULL) {
+ char *remote = strchr(new_name, '@');
+
+ if (remote) {
+ remote++;
+ XIO_DBG("substitute by remote brick '%s' on peer '%s'\n", new_name, remote);
+
+ brick = xio_make_brick(global, belongs, _client_brick_type, new_path, new_name);
+ if (brick) {
+ struct client_brick *_brick = (void *)brick;
+
+ _brick->max_flying = 10000;
+ }
+ }
+ }
+ if (!brick && new_brick_type == _bio_brick_type && _aio_brick_type) {
+ struct kstat test = {};
+ int status = mars_stat(new_path, &test, false);
+
+ if (SKIP_BIO || status < 0 || !S_ISBLK(test.mode)) {
+ new_brick_type = _aio_brick_type;
+ XIO_DBG("substitute bio by aio\n");
+ }
+ }
+#ifdef CONFIG_MARS_PREFER_SIO
+ if (!brick && new_brick_type == _aio_brick_type && _sio_brick_type) {
+ new_brick_type = _sio_brick_type;
+ XIO_DBG("substitute aio by sio\n");
+ }
+#endif
+
+ /* create it... */
+ if (!brick)
+ brick = xio_make_brick(global, belongs, new_brick_type, new_path, new_name);
+ if (unlikely(!brick)) {
+ XIO_ERR("creation failed '%s' '%s'\n", new_path, new_name);
+ goto err;
+ }
+ if (unlikely(brick->nr_inputs < prev_count)) {
+ XIO_ERR("'%s' wrong number of arguments: %d < %d\n", new_path, brick->nr_inputs, prev_count);
+ goto err;
+ }
+
+ /* connect the wires */
+ for (i = 0; i < prev_count; i++) {
+ int status;
+
+ status = generic_connect((void *)brick->inputs[i], (void *)prev[i]->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR("'%s' '%s' cannot connect input %d\n", new_path, new_name, i);
+ goto err;
+ }
+ }
+
+do_switch:
+ /* call setup function */
+ if (setup_fn) {
+ int setup_status = setup_fn(brick, private);
+
+ if (setup_status <= 0)
+ switch_state = 0;
+ }
+
+ /* switch on/off (may fail silently, but responsibility is at the workers) */
+ status = mars_power_button((void *)brick, switch_state, false);
+ XIO_DBG("switch '%s' to %d status = %d\n", new_path, switch_state, status);
+ goto done;
+
+err:
+ if (brick)
+ xio_kill_brick(brick);
+ brick = NULL;
+done:
+ for (i = 0; i < prev_count; i++) {
+ if (paths[i])
+ brick_string_free(paths[i]);
+ }
+ if (_new_path)
+ brick_string_free(_new_path);
+
+ return brick;
+}
+EXPORT_SYMBOL_GPL(make_brick_all);
+
+/***********************************************************************/
+
+/* statistics */
+
+int global_show_statist = 0;
+EXPORT_SYMBOL_GPL(global_show_statist);
+module_param_named(show_statist, global_show_statist, int, 0);
+
+static
+void _show_one(struct xio_brick *test, int *brick_count)
+{
+ int i;
+
+ if (*brick_count)
+ XIO_DBG("---------\n");
+ XIO_DBG("BRICK type = %s path = '%s' name = '%s' size_hint=%d aios_alloc = %d aios_aspect_alloc = %d total_aios_alloc = %d total_aios_aspects = %d button = %d off = %d on = %d\n",
+ SAFE_STR(test->type->type_name),
+ SAFE_STR(test->brick_path),
+ SAFE_STR(test->brick_name),
+ test->aio_object_layout.size_hint,
+ atomic_read(&test->aio_object_layout.alloc_count),
+ atomic_read(&test->aio_object_layout.aspect_count),
+ atomic_read(&test->aio_object_layout.total_alloc_count),
+ atomic_read(&test->aio_object_layout.total_aspect_count),
+ test->power.button,
+ test->power.off_led,
+ test->power.on_led);
+ (*brick_count)++;
+ if (test->ops && test->ops->brick_statistics) {
+ char *info = test->ops->brick_statistics(test, 0);
+
+ if (info) {
+ XIO_DBG(" %s", info);
+ brick_string_free(info);
+ }
+ }
+ for (i = 0; i < test->type->max_inputs; i++) {
+ struct xio_input *input = test->inputs[i];
+ struct xio_output *output = input ? input->connect : NULL;
+
+ if (output) {
+ XIO_DBG(" input %d connected with %s path = '%s' name = '%s'\n",
+ i,
+ SAFE_STR(output->brick->type->type_name),
+ SAFE_STR(output->brick->brick_path),
+ SAFE_STR(output->brick->brick_name));
+ } else {
+ XIO_DBG(" input %d not connected\n", i);
+ }
+ }
+ for (i = 0; i < test->type->max_outputs; i++) {
+ struct xio_output *output = test->outputs[i];
+
+ if (output)
+ XIO_DBG(" output %d nr_connected = %d\n", i, output->nr_connected);
+ }
+}
+
+void show_statistics(struct mars_global *global, const char *class)
+{
+ struct list_head *tmp;
+ int dent_count = 0;
+ int brick_count = 0;
+
+ if (!global_show_statist)
+ return; /* silently */
+
+ brick_mem_statistics(false);
+
+ down_read(&global->brick_mutex);
+ XIO_DBG("================================== %s bricks:\n", class);
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ _show_one(test, &brick_count);
+ }
+ up_read(&global->brick_mutex);
+
+ XIO_DBG("================================== %s dents:\n", class);
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *dent;
+ struct list_head *sub;
+
+ dent = container_of(tmp, struct mars_dent, dent_link);
+ XIO_DBG("dent %d '%s' '%s' stamp=%ld.%09ld\n",
+ dent->d_class,
+ SAFE_STR(dent->d_path),
+ SAFE_STR(dent->link_val),
+ dent->stat_val.mtime.tv_sec,
+ dent->stat_val.mtime.tv_nsec);
+ dent_count++;
+ for (sub = dent->brick_list.next; sub != &dent->brick_list; sub = sub->next) {
+ struct xio_brick *test;
+
+ test = container_of(sub, struct xio_brick, dent_brick_link);
+ XIO_DBG(" owner of brick '%s'\n", SAFE_STR(test->brick_path));
+ }
+ }
+ up_read(&global->dent_mutex);
+
+ XIO_DBG("==================== %s STATISTICS: %d dents, %d bricks, %lld KB free\n",
+ class,
+ dent_count,
+ brick_count,
+ global_remaining_space);
+}
+EXPORT_SYMBOL_GPL(show_statistics);
+
+/*******************************************************************/
+
+/* power led handling */
+
+void xio_set_power_on_led(struct xio_brick *brick, bool val)
+{
+ bool oldval = brick->power.on_led;
+
+ if (val != oldval) {
+ set_on_led(&brick->power, val);
+ local_trigger();
+ }
+}
+EXPORT_SYMBOL_GPL(xio_set_power_on_led);
+
+void xio_set_power_off_led(struct xio_brick *brick, bool val)
+{
+ bool oldval = brick->power.off_led;
+
+ if (val != oldval) {
+ set_off_led(&brick->power, val);
+ local_trigger();
+ }
+}
+EXPORT_SYMBOL_GPL(xio_set_power_off_led);
+
+/*******************************************************************/
+
+/* init stuff */
+
+int __init init_sy(void)
+{
+ XIO_INF("init_sy()\n");
+
+ _local_trigger = __local_trigger;
+
+ return 0;
+}
+
+void exit_sy(void)
+{
+ _local_trigger = NULL;
+
+ XIO_INF("exit_sy()\n");
+}
--
2.0.0

2014-07-01 21:51:31

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 38/50] mars: add new file drivers/block/mars/xio_bricks/xio_trans_logger.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_trans_logger.c | 3295 ++++++++++++++++++++++
1 file changed, 3295 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_trans_logger.c

diff --git a/drivers/block/mars/xio_bricks/xio_trans_logger.c b/drivers/block/mars/xio_bricks/xio_trans_logger.c
new file mode 100644
index 0000000..489299b
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_trans_logger.c
@@ -0,0 +1,3295 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Trans_Logger brick */
+
+#define XIO_DEBUGGING
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/bio.h>
+
+#include <linux/xio.h>
+#include <linux/brick/lib_rank.h>
+#include <linux/brick/lib_limiter.h>
+
+#include <linux/xio/xio_trans_logger.h>
+
+/* variants */
+#define KEEP_UNIQUE
+#define DELAY_CALLERS /* this is _needed_ for production systems */
+/* When possible, queue 1 executes phase3_startio() directly, without
+ * intermediate queueing into queue 3 => may be surprising, but gives
+ * better performance. NOTICE: if the IO scheduling ever needs to differ
+ * between queue 1 and queue 3, you MUST disable this in order to
+ * distinguish between them!
+ */
+#define SHORTCUT_1_to_3
+
+/* commenting this out is dangerous for data integrity! use only for testing! */
+#define USE_MEMCPY
+#define DO_WRITEBACK /* otherwise FAKE IO */
+#define REPLAY_DATA
+
+/* tuning */
+#ifdef BRICK_DEBUG_MEM
+#define CONF_TRANS_CHUNKSIZE (128 * 1024 - PAGE_SIZE * 2)
+#else
+#define CONF_TRANS_CHUNKSIZE (128 * 1024)
+#endif
+#define CONF_TRANS_MAX_AIO_SIZE PAGE_SIZE
+#define CONF_TRANS_ALIGN 0
+
+#define XIO_RPL(_args...) /*empty*/
+
+struct trans_logger_hash_anchor {
+ struct rw_semaphore hash_mutex;
+ struct list_head hash_anchor;
+};
+
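+/* The hash table is a two-level structure: an array of NR_HASH_PAGES
+ * pages, each holding HASH_PER_PAGE anchors (one rw_semaphore plus list
+ * head per bucket), for HASH_TOTAL buckets in total.
+ */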
+#define NR_HASH_PAGES 64
+
+#define MAX_HASH_PAGES (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor *))
+#define HASH_PER_PAGE (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor))
+#define HASH_TOTAL (NR_HASH_PAGES * HASH_PER_PAGE)
+
+/************************ global tuning ***********************/
+
+int trans_logger_completion_semantics = 1;
+EXPORT_SYMBOL_GPL(trans_logger_completion_semantics);
+
+int trans_logger_do_crc =
+#ifdef CONFIG_MARS_DEBUG
+ true;
+#else
+ false;
+#endif
+EXPORT_SYMBOL_GPL(trans_logger_do_crc);
+
+int trans_logger_mem_usage; /* in KB */
+EXPORT_SYMBOL_GPL(trans_logger_mem_usage);
+
+int trans_logger_max_interleave = -1;
+EXPORT_SYMBOL_GPL(trans_logger_max_interleave);
+
+int trans_logger_resume = 1;
+EXPORT_SYMBOL_GPL(trans_logger_resume);
+
+int trans_logger_replay_timeout = 1; /* in s */
+EXPORT_SYMBOL_GPL(trans_logger_replay_timeout);
+
+struct writeback_group global_writeback = {
+ .lock = __RW_LOCK_UNLOCKED(global_writeback.lock),
+ .group_anchor = LIST_HEAD_INIT(global_writeback.group_anchor),
+ .until_percent = 30,
+};
+EXPORT_SYMBOL_GPL(global_writeback);
+
+static
+void add_to_group(struct writeback_group *gr, struct trans_logger_brick *brick)
+{
+ write_lock(&gr->lock);
+ list_add_tail(&brick->group_head, &gr->group_anchor);
+ write_unlock(&gr->lock);
+}
+
+static
+void remove_from_group(struct writeback_group *gr, struct trans_logger_brick *brick)
+{
+ write_lock(&gr->lock);
+ list_del_init(&brick->group_head);
+ gr->leader = NULL;
+ write_unlock(&gr->lock);
+}
+
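+/* Elect the group member with the most shadow memory in use as
+ * writeback leader. The old leader is retained as long as its usage
+ * stays above until_percent of the biggest known usage, which avoids
+ * leadership thrashing.
+ */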
+static
+struct trans_logger_brick *elect_leader(struct writeback_group *gr)
+{
+ struct trans_logger_brick *res = gr->leader;
+ struct list_head *tmp;
+
+ if (res && gr->until_percent >= 0) {
+ loff_t used = atomic64_read(&res->shadow_mem_used);
+
+ if (used > gr->biggest * gr->until_percent / 100)
+ goto done;
+ }
+
+ read_lock(&gr->lock);
+ for (tmp = gr->group_anchor.next; tmp != &gr->group_anchor; tmp = tmp->next) {
+ struct trans_logger_brick *test = container_of(tmp, struct trans_logger_brick, group_head);
+ loff_t new_used = atomic64_read(&test->shadow_mem_used);
+
+ if (!res || new_used > atomic64_read(&res->shadow_mem_used)) {
+ res = test;
+ gr->biggest = new_used;
+ }
+ }
+ read_unlock(&gr->lock);
+
+ gr->leader = res;
+
+done:
+ return res;
+}
+
+/************************ own type definitions ***********************/
+
+static inline
+int lh_cmp(loff_t *a, loff_t *b)
+{
+ if (*a < *b)
+ return -1;
+ if (*a > *b)
+ return 1;
+ return 0;
+}
+
+static inline
+int tr_cmp(struct pairing_heap_logger *_a, struct pairing_heap_logger *_b)
+{
+ struct logger_head *a = container_of(_a, struct logger_head, ph);
+ struct logger_head *b = container_of(_b, struct logger_head, ph);
+
+ return lh_cmp(a->lh_pos, b->lh_pos);
+}
+
+_PAIRING_HEAP_FUNCTIONS(static, logger, tr_cmp);
+
+static inline
+loff_t *lh_get(struct logger_head *th)
+{
+ return th->lh_pos;
+}
+
+QUEUE_FUNCTIONS(logger, struct logger_head, lh_head, lh_get, lh_cmp, logger);
+
+/************************* logger queue handling ***********************/
+
+static inline
+void qq_init(struct logger_queue *q, struct trans_logger_brick *brick)
+{
+ q_logger_init(q);
+ q->q_event = &brick->worker_event;
+ q->q_brick = brick;
+}
+
+static inline
+void qq_inc_flying(struct logger_queue *q)
+{
+ q_logger_inc_flying(q);
+}
+
+static inline
+void qq_dec_flying(struct logger_queue *q)
+{
+ q_logger_dec_flying(q);
+}
+
+static inline
+void qq_aio_insert(struct logger_queue *q, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&q->q_brick->inner_balance_count);
+
+ q_logger_insert(q, &aio_a->lh);
+}
+
+static inline
+void qq_wb_insert(struct logger_queue *q, struct writeback_info *wb)
+{
+ q_logger_insert(q, &wb->w_lh);
+}
+
+static inline
+void qq_aio_pushback(struct logger_queue *q, struct trans_logger_aio_aspect *aio_a)
+{
+ obj_check(aio_a->object);
+
+ q->pushback_count++;
+
+ q_logger_pushback(q, &aio_a->lh);
+}
+
+static inline
+void qq_wb_pushback(struct logger_queue *q, struct writeback_info *wb)
+{
+ q->pushback_count++;
+ q_logger_pushback(q, &wb->w_lh);
+}
+
+static inline
+struct trans_logger_aio_aspect *qq_aio_fetch(struct logger_queue *q)
+{
+ struct logger_head *test;
+ struct trans_logger_aio_aspect *aio_a = NULL;
+
+ test = q_logger_fetch(q);
+
+ if (test) {
+ aio_a = container_of(test, struct trans_logger_aio_aspect, lh);
+ obj_check(aio_a->object);
+ }
+ return aio_a;
+}
+
+static inline
+struct writeback_info *qq_wb_fetch(struct logger_queue *q)
+{
+ struct logger_head *test;
+ struct writeback_info *res = NULL;
+
+ test = q_logger_fetch(q);
+
+ if (test)
+ res = container_of(test, struct writeback_info, w_lh);
+ return res;
+}
+
+/************************ own helper functions ***********************/
+
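+/* Map a logical position onto a hash bucket: positions are grouped into
+ * regions of REGION_SIZE bytes, and a small skew term (growing by one
+ * every 7 * HASH_TOTAL regions) perturbs the plain modulo distribution
+ * for large positions.
+ */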
+static inline
+int hash_fn(loff_t pos)
+{
+ /* simple and stupid */
+ long base_index = pos >> REGION_SIZE_BITS;
+
+ base_index += base_index / HASH_TOTAL / 7;
+ return base_index % HASH_TOTAL;
+}
+
+static inline
+struct trans_logger_aio_aspect *_hash_find(struct list_head *start,
+ loff_t pos,
+ int *max_len,
+ bool use_collect_head,
+ bool find_unstable)
+{
+ struct list_head *tmp;
+ struct trans_logger_aio_aspect *res = NULL;
+ int len = *max_len;
+
+ /* The lists are always sorted according to age (newest first).
+ * Caution: there may be duplicates in the list, some of them
+ * overlapping with the search area in many different ways.
+ */
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+ int diff;
+
+ if (use_collect_head)
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ else
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ obj_check(test);
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* searching for unstable elements (only in special cases) */
+ if (find_unstable && test_a->is_stable)
+ break;
+
+ diff = test->io_pos - pos;
+ if (diff <= 0) {
+ int restlen = test->io_len + diff;
+
+ res = test_a;
+ if (restlen < len)
+ len = restlen;
+ break;
+ }
+ if (diff < len)
+ len = diff;
+ }
+
+ *max_len = len;
+ return res;
+}
+
+static
+struct trans_logger_aio_aspect *hash_find(struct trans_logger_brick *brick,
+ loff_t pos,
+ int *max_len,
+ bool find_unstable)
+{
+
+ int hash = hash_fn(pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+ struct trans_logger_aio_aspect *res;
+
+ atomic_inc(&brick->total_hash_find_count);
+
+ down_read(&start->hash_mutex);
+
+ res = _hash_find(&start->hash_anchor, pos, max_len, false, find_unstable);
+
+ /* Ensure the found aio can't go away...
+ */
+ if (res && res->object)
+ obj_get(res->object);
+
+ up_read(&start->hash_mutex);
+
+ return res;
+}
+
+static
+void hash_insert(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *elem_a)
+{
+ int hash = hash_fn(elem_a->object->io_pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+
+ CHECK_HEAD_EMPTY(&elem_a->hash_head);
+ obj_check(elem_a->object);
+
+ /* only for statistics: */
+ atomic_inc(&brick->hash_count);
+ atomic_inc(&brick->total_hash_insert_count);
+
+ down_write(&start->hash_mutex);
+
+ list_add(&elem_a->hash_head, &start->hash_anchor);
+ elem_a->is_hashed = true;
+
+ up_write(&start->hash_mutex);
+}
+
+/* Find the transitive closure of overlapping requests
+ * and collect them into a list.
+ */
+static
+void hash_extend(struct trans_logger_brick *brick, loff_t *_pos, int *_len, struct list_head *collect_list)
+{
+ loff_t pos = *_pos;
+ int len = *_len;
+ int hash = hash_fn(pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+ struct list_head *tmp;
+ bool extended;
+
+ if (collect_list)
+ CHECK_HEAD_EMPTY(collect_list);
+
+ atomic_inc(&brick->total_hash_extend_count);
+
+ down_read(&start->hash_mutex);
+
+ do {
+ extended = false;
+
+ for (tmp = start->hash_anchor.next; tmp != &start->hash_anchor; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+ loff_t diff;
+
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ obj_check(test);
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* collision detection */
+ if (test_a->is_collected)
+ goto collision;
+
+ /* no writeback of non-persistent data */
+ if (!(test_a->is_persistent && test_a->is_completed))
+ goto collision;
+
+ /* extend the search region when necessary */
+ diff = pos - test->io_pos;
+ if (diff > 0) {
+ len += diff;
+ pos = test->io_pos;
+ extended = true;
+ }
+ diff = (test->io_pos + test->io_len) - (pos + len);
+ if (diff > 0) {
+ len += diff;
+ extended = true;
+ }
+ }
+ } while (extended); /* start over for transitive closure */
+
+ *_pos = pos;
+ *_len = len;
+
+ for (tmp = start->hash_anchor.next; tmp != &start->hash_anchor; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* collect */
+ CHECK_HEAD_EMPTY(&test_a->collect_head);
+ if (unlikely(test_a->is_collected))
+ XIO_ERR("collision detection did not work\n");
+ test_a->is_collected = true;
+ obj_check(test);
+ list_add_tail(&test_a->collect_head, collect_list);
+ }
+
+collision:
+ up_read(&start->hash_mutex);
+}
+
+/* Atomically put all elements from the list.
+ * All elements must reside in the same collision list.
+ */
+static inline
+void hash_put_all(struct trans_logger_brick *brick, struct list_head *list)
+{
+ struct list_head *tmp;
+ struct trans_logger_hash_anchor *start = NULL;
+ int first_hash = -1;
+
+ for (tmp = list->next; tmp != list; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *elem_a;
+ struct aio_object *elem;
+ int hash;
+
+ elem_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ elem = elem_a->object;
+ CHECK_PTR(elem, err);
+ obj_check(elem);
+
+ hash = hash_fn(elem->io_pos);
+ if (!start) {
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+
+ start = &sub_table[hash % HASH_PER_PAGE];
+ first_hash = hash;
+ down_write(&start->hash_mutex);
+ } else if (unlikely(hash != first_hash)) {
+ XIO_ERR("oops, different hashes: %d != %d\n", hash, first_hash);
+ }
+
+ if (!elem_a->is_hashed)
+ continue;
+
+ list_del_init(&elem_a->hash_head);
+ elem_a->is_hashed = false;
+ atomic_dec(&brick->hash_count);
+ }
+
+err:
+ if (start)
+ up_write(&start->hash_mutex);
+}
+
+static inline
+void hash_ensure_stableness(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ if (!aio_a->is_stable) {
+ struct aio_object *aio = aio_a->object;
+ int hash = hash_fn(aio->io_pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+
+ down_write(&start->hash_mutex);
+
+ aio_a->is_stable = true;
+
+ up_write(&start->hash_mutex);
+ }
+}
+
+static
+void _inf_callback(struct trans_logger_input *input, bool force)
+{
+ if (!force &&
+ input->inf_last_jiffies &&
+ input->inf_last_jiffies + 4 * HZ > (long long)jiffies)
+ goto out_return;
+ if (input->inf.inf_callback && input->is_operating) {
+ input->inf_last_jiffies = jiffies;
+
+ input->inf.inf_callback(&input->inf);
+
+ input->inf_last_jiffies = jiffies;
+ } else {
+ XIO_DBG("%p skipped callback, callback = %p is_operating = %d\n",
+ input,
+ input->inf.inf_callback,
+ input->is_operating);
+ }
+out_return:;
+}
+
+static inline
+int _congested(struct trans_logger_brick *brick)
+{
+ return atomic_read(&brick->q_phase[0].q_queued)
+ || atomic_read(&brick->q_phase[0].q_flying)
+ || atomic_read(&brick->q_phase[1].q_queued)
+ || atomic_read(&brick->q_phase[1].q_flying)
+ || atomic_read(&brick->q_phase[2].q_queued)
+ || atomic_read(&brick->q_phase[2].q_flying)
+ || atomic_read(&brick->q_phase[3].q_queued)
+ || atomic_read(&brick->q_phase[3].q_flying);
+}
+
+/***************** own brick * input * output operations *****************/
+
+atomic_t global_mshadow_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(global_mshadow_count);
+atomic64_t global_mshadow_used = ATOMIC64_INIT(0);
+EXPORT_SYMBOL_GPL(global_mshadow_used);
+
+static
+int trans_logger_get_info(struct trans_logger_output *output, struct xio_info *info)
+{
+ struct trans_logger_input *input = output->brick->inputs[TL_INPUT_READ];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
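+/* Attach an aio as a "slave shadow" to an existing master shadow which
+ * already covers its position: the slave shares the master's shadow
+ * buffer at the proper offset instead of allocating its own memory.
+ */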
+static
+int _make_sshadow(struct trans_logger_output *output,
+ struct trans_logger_aio_aspect *aio_a,
+ struct trans_logger_aio_aspect *mshadow_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ struct aio_object *mshadow;
+ int diff;
+
+ mshadow = mshadow_a->object;
+ if (unlikely(aio->io_len > mshadow->io_len)) {
+ XIO_ERR("oops %d -> %d\n", aio->io_len, mshadow->io_len);
+ aio->io_len = mshadow->io_len;
+ }
+ if (unlikely(mshadow_a == aio_a)) {
+ XIO_ERR("oops %p == %p\n", mshadow_a, aio_a);
+ return -EINVAL;
+ }
+
+ diff = aio->io_pos - mshadow->io_pos;
+ if (unlikely(diff < 0)) {
+ XIO_ERR("oops diff = %d\n", diff);
+ return -EINVAL;
+ }
+
+ /* Attach aio to the existing shadow ("slave shadow").
+ */
+ aio_a->shadow_data = mshadow_a->shadow_data + diff;
+ aio_a->do_dealloc = false;
+ if (!aio->io_data) { /* buffered IO */
+ aio->io_data = aio_a->shadow_data;
+ aio_a->do_buffered = true;
+ atomic_inc(&brick->total_sshadow_buffered_count);
+ }
+ aio->io_flags = mshadow->io_flags;
+ aio_a->shadow_aio = mshadow_a;
+ aio_a->my_brick = brick;
+
+ /* Get an ordinary internal reference
+ */
+ obj_get_first(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+
+ /* The internal reference from slave to master is already
+ * present due to hash_find(),
+ * such that the master cannot go away before the slave.
+ * It is compensated by master transition in __trans_logger_io_put()
+ */
+ atomic_inc(&brick->inner_balance_count);
+
+ atomic_inc(&brick->sshadow_count);
+ atomic_inc(&brick->total_sshadow_count);
+
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("oops, len = %d\n", aio->io_len);
+ return -EINVAL;
+ }
+
+ return aio->io_len;
+}
+
+static
+int _read_io_get(struct trans_logger_output *output, struct trans_logger_aio_aspect *aio_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ struct trans_logger_input *input = brick->inputs[TL_INPUT_READ];
+ struct trans_logger_aio_aspect *mshadow_a;
+
+ /* Check whether a newer version is in flight, shadowing
+ * the old one.
+ * When a shadow is found, use it as the buffer for the aio.
+ */
+ mshadow_a = hash_find(brick, aio->io_pos, &aio->io_len, false);
+ if (!mshadow_a)
+ return GENERIC_INPUT_CALL(input, aio_get, aio);
+
+ return _make_sshadow(output, aio_a, mshadow_a);
+}
+
+static
+int _write_io_get(struct trans_logger_output *output, struct trans_logger_aio_aspect *aio_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ void *data;
+
+#ifdef KEEP_UNIQUE
+ struct trans_logger_aio_aspect *mshadow_a;
+
+#endif
+
+#ifdef CONFIG_MARS_DEBUG
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("oops, io_len = %d\n", aio->io_len);
+ return -EINVAL;
+ }
+#endif
+
+#ifdef KEEP_UNIQUE
+ mshadow_a = hash_find(brick, aio->io_pos, &aio->io_len, true);
+ if (mshadow_a)
+ return _make_sshadow(output, aio_a, mshadow_a);
+#endif
+
+#ifdef DELAY_CALLERS
+ /* delay in case of too many master shadows / memory shortage */
+ wait_event_interruptible_timeout(brick->caller_event,
+ !brick->delay_callers &&
+ (brick_global_memlimit < 1024 || atomic64_read(&global_mshadow_used) / 1024 < brick_global_memlimit),
+ HZ / 2);
+#endif
+
+ /* create a new master shadow */
+ data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+ atomic64_add(aio->io_len, &brick->shadow_mem_used);
+#ifdef CONFIG_MARS_DEBUG
+ memset(data, 0x11, aio->io_len);
+#endif
+ aio_a->shadow_data = data;
+ aio_a->do_dealloc = true;
+ if (!aio->io_data) { /* buffered IO */
+ aio->io_data = data;
+ aio_a->do_buffered = true;
+ atomic_inc(&brick->total_mshadow_buffered_count);
+ }
+ aio_a->my_brick = brick;
+ aio->io_flags = 0;
+ aio_a->shadow_aio = aio_a; /* cyclic self-reference => indicates master shadow */
+
+ atomic_inc(&brick->mshadow_count);
+ atomic_inc(&brick->total_mshadow_count);
+ atomic_inc(&global_mshadow_count);
+ atomic64_add(aio->io_len, &global_mshadow_used);
+
+ atomic_inc(&brick->inner_balance_count);
+ obj_get_first(aio); /* must be paired with __trans_logger_io_put() */
+
+ return aio->io_len;
+}
+
+static
+int trans_logger_io_get(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick;
+ struct trans_logger_aio_aspect *aio_a;
+ loff_t base_offset;
+
+ CHECK_PTR(output, err);
+ brick = output->brick;
+ CHECK_PTR(brick, err);
+ CHECK_PTR(aio, err);
+
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ atomic_inc(&brick->outer_balance_count);
+
+ if (aio->obj_initialized) { /* setup already performed */
+ obj_check(aio);
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ return aio->io_len;
+ }
+
+ get_lamport(&aio_a->stamp);
+
+ if (aio->io_len > CONF_TRANS_MAX_AIO_SIZE && CONF_TRANS_MAX_AIO_SIZE > 0)
+ aio->io_len = CONF_TRANS_MAX_AIO_SIZE;
+
+ /* ensure that REGION_SIZE boundaries are obeyed by hashing */
+ base_offset = aio->io_pos & (loff_t)(REGION_SIZE - 1);
+ if (aio->io_len > REGION_SIZE - base_offset)
+ aio->io_len = REGION_SIZE - base_offset;
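+ /* Worked example (purely illustrative, assuming REGION_SIZE were
+ * 64 KiB): io_pos = 0x1f000 and io_len = 0x3000 yield
+ * base_offset = 0xf000, so io_len is clamped to 0x1000 and the
+ * aio ends exactly at the region boundary.
+ */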
+
+ if (aio->io_may_write == READ)
+ return _read_io_get(output, aio_a);
+
+ if (unlikely(brick->stopped_logging)) { /* only in EMERGENCY mode */
+ aio_a->is_emergency = true;
+ /* Wait until writeback has finished.
+ * We have to do this because writeback is out-of-order.
+ * Otherwise consistency could be violated for some time.
+ */
+ while (_congested(brick)) {
+ /* in case of emergency, busy-wait should be acceptable */
+ brick_msleep(HZ / 10);
+ }
+ return _read_io_get(output, aio_a);
+ }
+
+ /* FIXME: THIS IS PROVISIONAL (use event instead)
+ */
+ while (unlikely(!brick->power.on_led))
+ brick_msleep(HZ / 10);
+
+ return _write_io_get(output, aio_a);
+
+err:
+ return -EINVAL;
+}
+
+static void pos_complete(struct trans_logger_aio_aspect *orig_aio_a);
+
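+/* Drop one inner reference. When the last reference on a slave shadow
+ * is dropped, the slave is freed and the put restarts on its master
+ * shadow, so the master can only disappear after all of its slaves.
+ */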
+static
+void __trans_logger_io_put(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio;
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_input *input;
+
+restart:
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+
+ obj_check(aio);
+
+ /* are we a shadow (whether master or slave)? */
+ shadow_a = aio_a->shadow_aio;
+ if (shadow_a) {
+ bool finished;
+
+ CHECK_PTR(shadow_a, err);
+ CHECK_PTR(shadow_a->object, err);
+ obj_check(shadow_a->object);
+
+ finished = obj_put(aio);
+ atomic_dec(&brick->inner_balance_count);
+ if (unlikely(finished && aio_a->is_hashed)) {
+ XIO_ERR("trying to put a hashed aio, pos = %lld len = %d\n", aio->io_pos, aio->io_len);
+ finished = false; /* leaves a memleak */
+ }
+
+ if (!finished)
+ goto out_return;
+
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ CHECK_HEAD_EMPTY(&aio_a->replay_head);
+ CHECK_HEAD_EMPTY(&aio_a->collect_head);
+ CHECK_HEAD_EMPTY(&aio_a->sub_list);
+ CHECK_HEAD_EMPTY(&aio_a->sub_head);
+
+ if (aio_a->is_collected && likely(aio_a->wb_error >= 0))
+ pos_complete(aio_a);
+
+ CHECK_HEAD_EMPTY(&aio_a->pos_head);
+
+ if (shadow_a != aio_a) { /* we are a slave shadow */
+ /* XIO_DBG("slave\n"); */
+ atomic_dec(&brick->sshadow_count);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ obj_free(aio);
+ /* now put the master shadow */
+ aio_a = shadow_a;
+ goto restart;
+ }
+ /* we are a master shadow */
+ CHECK_PTR(aio_a->shadow_data, err);
+ if (aio_a->do_dealloc) {
+ brick_block_free(aio_a->shadow_data, aio_a->alloc_len);
+ atomic64_sub(aio->io_len, &brick->shadow_mem_used);
+ aio_a->shadow_data = NULL;
+ aio_a->do_dealloc = false;
+ }
+ if (aio_a->do_buffered)
+ aio->io_data = NULL;
+ atomic_dec(&brick->mshadow_count);
+ atomic_dec(&global_mshadow_count);
+ atomic64_sub(aio->io_len, &global_mshadow_used);
+ obj_free(aio);
+ goto out_return;
+ }
+
+ /* only READ is allowed on non-shadow buffers */
+ if (unlikely(aio->io_rw != READ && !aio_a->is_emergency))
+ XIO_FAT("bad operation %d on non-shadow\n", aio->io_rw);
+
+ /* no shadow => call through */
+ input = brick->inputs[TL_INPUT_READ];
+ CHECK_PTR(input, err);
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+
+err:;
+out_return:;
+}
+
+static
+void _trans_logger_io_put(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_aio_aspect *aio_a;
+
+ aio_a = trans_logger_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ __trans_logger_io_put(output->brick, aio_a);
+ goto out_return;
+err:
+ XIO_FAT("giving up...\n");
+out_return:;
+}
+
+static
+void trans_logger_io_put(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick = output->brick;
+
+ atomic_dec(&brick->outer_balance_count);
+ _trans_logger_io_put(output, aio);
+}
+
+static
+void _trans_logger_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *aio_a;
+ struct trans_logger_brick *brick;
+
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ if (unlikely(&aio_a->cb != cb)) {
+ XIO_FAT("bad callback -- hanging up\n");
+ goto err;
+ }
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ NEXT_CHECKED_CALLBACK(cb, err);
+
+ atomic_dec(&brick->any_fly_count);
+ atomic_inc(&brick->total_cb_count);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback\n");
+out_return:;
+}
+
+static
+void trans_logger_io_io(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct trans_logger_aio_aspect *aio_a;
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_input *input;
+
+ obj_check(aio);
+
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ /* statistics */
+ if (aio->io_rw)
+ atomic_inc(&brick->total_write_count);
+ else
+ atomic_inc(&brick->total_read_count);
+ /* is this a shadow buffer? */
+ shadow_a = aio_a->shadow_aio;
+ if (shadow_a) {
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ CHECK_HEAD_EMPTY(&aio_a->pos_head);
+
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+
+ qq_aio_insert(&brick->q_phase[0], aio_a);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+ }
+
+ /* only READ is allowed on non-shadow buffers */
+ if (unlikely(aio->io_rw != READ && !aio_a->is_emergency))
+ XIO_FAT("bad operation %d on non-shadow\n", aio->io_rw);
+
+ atomic_inc(&brick->any_fly_count);
+
+ aio_a->my_brick = brick;
+
+ INSERT_CALLBACK(aio, &aio_a->cb, _trans_logger_endio, aio_a);
+
+ input = output->brick->inputs[TL_INPUT_READ];
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle IO\n");
+out_return:;
+}
+
+/***************************** writeback info *****************************/
+
+/* save final completion status when necessary
+ */
+static
+void pos_complete(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct trans_logger_brick *brick = orig_aio_a->my_brick;
+ struct trans_logger_input *log_input = orig_aio_a->log_input;
+ loff_t finished;
+ struct list_head *tmp;
+
+ CHECK_PTR(brick, err);
+ CHECK_PTR(log_input, err);
+
+ atomic_inc(&brick->total_writeback_count);
+
+ tmp = &orig_aio_a->pos_head;
+
+ down(&log_input->inf_mutex);
+
+ finished = orig_aio_a->log_pos;
+ /* am I the first member? (means "youngest" list entry) */
+ if (tmp == log_input->pos_list.next) {
+ if (unlikely(finished <= log_input->inf.inf_min_pos))
+ XIO_ERR("backskip in log writeback: %lld -> %lld\n", log_input->inf.inf_min_pos, finished);
+ if (unlikely(finished > log_input->inf.inf_max_pos))
+ XIO_ERR("min_pos > max_pos: %lld > %lld\n", finished, log_input->inf.inf_max_pos);
+ log_input->inf.inf_min_pos = finished;
+ get_lamport(&log_input->inf.inf_min_pos_stamp);
+ _inf_callback(log_input, false);
+ } else {
+ struct trans_logger_aio_aspect *prev_aio_a;
+
+ prev_aio_a = container_of(tmp->prev, struct trans_logger_aio_aspect, pos_head);
+ if (unlikely(finished <= prev_aio_a->log_pos)) {
+ XIO_ERR("backskip: %lld -> %lld\n", finished, prev_aio_a->log_pos);
+ } else {
+ /* Transitively transfer log_pos to the predecessor
+ * to correctly reflect the committed region.
+ */
+ prev_aio_a->log_pos = finished;
+ }
+ }
+
+ list_del_init(tmp);
+ atomic_dec(&log_input->pos_count);
+
+ up(&log_input->inf_mutex);
+err:;
+}
+
+static
+void free_writeback(struct writeback_info *wb)
+{
+ struct list_head *tmp;
+
+ if (unlikely(wb->w_error < 0)) {
+ XIO_ERR("writeback error = %d at pos = %lld len = %d, writeback is incomplete\n",
+ wb->w_error,
+ wb->w_pos,
+ wb->w_len);
+ }
+
+ /* Now complete the original requests.
+ */
+ while ((tmp = wb->w_collect_list.next) != &wb->w_collect_list) {
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct aio_object *orig_aio;
+
+ list_del_init(tmp);
+
+ orig_aio_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ orig_aio = orig_aio_a->object;
+
+ obj_check(orig_aio);
+ if (unlikely(!orig_aio_a->is_collected)) {
+ XIO_ERR("request %lld (len = %d) was not collected\n",
+ orig_aio->io_pos,
+ orig_aio->io_len);
+ }
+ if (unlikely(wb->w_error < 0))
+ orig_aio_a->wb_error = wb->w_error;
+
+ __trans_logger_io_put(orig_aio_a->my_brick, orig_aio_a);
+ }
+
+ brick_mem_free(wb);
+}
+
+/* Generic endio() for writeback_info
+ */
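+/* It is shared by the read sub-requests (phase 1) and the write
+ * sub-requests (phase 3); the last completing sub-request of each kind
+ * fires the corresponding read_endio resp. write_endio exactly once.
+ */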
+static
+void wb_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb;
+ atomic_t *dec;
+ int rw;
+
+ /* make checkpatch.pl happy with a blank line - is this a false positive? */
+
+ void (**_endio)(struct generic_callback *cb);
+ void (*endio)(struct generic_callback *cb);
+
+ LAST_CALLBACK(cb);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ sub_aio = sub_aio_a->object;
+ CHECK_PTR(sub_aio, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (cb->cb_error < 0)
+ wb->w_error = cb->cb_error;
+
+ atomic_dec(&brick->wb_balance_count);
+
+ rw = sub_aio_a->orig_rw;
+ dec = rw ? &wb->w_sub_write_count : &wb->w_sub_read_count;
+ CHECK_ATOMIC(dec, 1);
+ if (!atomic_dec_and_test(dec))
+ goto done;
+
+ _endio = rw ? &wb->write_endio : &wb->read_endio;
+ endio = *_endio;
+ *_endio = NULL;
+ if (likely(endio))
+ endio(cb);
+ else
+ XIO_ERR("internal: no endio defined\n");
+done:
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+/* Atomically create writeback info, based on "snapshot" of current hash
+ * state.
+ * Notice that the hash can change during writeback IO, thus we need
+ * struct writeback_info to precisely catch that information at a single
+ * point in time.
+ */
+static
+struct writeback_info *make_writeback(struct trans_logger_brick *brick, loff_t pos, int len)
+{
+ struct writeback_info *wb;
+ struct trans_logger_input *read_input;
+ struct trans_logger_input *write_input;
+ int write_input_nr;
+
+ /* Allocate structure representing a bunch of adjacent writebacks
+ */
+ wb = brick_zmem_alloc(sizeof(struct writeback_info));
+ if (unlikely(len < 0))
+ XIO_ERR("len = %d\n", len);
+
+ wb->w_brick = brick;
+ wb->w_pos = pos;
+ wb->w_len = len;
+ wb->w_lh.lh_pos = &wb->w_pos;
+ INIT_LIST_HEAD(&wb->w_lh.lh_head);
+ INIT_LIST_HEAD(&wb->w_collect_list);
+ INIT_LIST_HEAD(&wb->w_sub_read_list);
+ INIT_LIST_HEAD(&wb->w_sub_write_list);
+
+ /* Atomically fetch transitive closure on all requests
+ * overlapping with the current search region.
+ */
+ hash_extend(brick, &wb->w_pos, &wb->w_len, &wb->w_collect_list);
+
+ if (list_empty(&wb->w_collect_list))
+ goto collision;
+
+ pos = wb->w_pos;
+ len = wb->w_len;
+
+ if (unlikely(len < 0))
+ XIO_ERR("len = %d\n", len);
+
+ /* Determine the "channels" we want to operate on
+ */
+ read_input = brick->inputs[TL_INPUT_READ];
+ write_input_nr = TL_INPUT_WRITEBACK;
+ write_input = brick->inputs[write_input_nr];
+ if (!write_input->connect) {
+ write_input_nr = TL_INPUT_READ;
+ write_input = read_input;
+ }
+
+ /* Create sub_aios for read of old disk version (phase1)
+ */
+ if (brick->log_reads) {
+ while (len > 0) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *log_input;
+ int this_len;
+ int status;
+
+ sub_aio = trans_logger_alloc_aio(brick);
+
+ sub_aio->io_pos = pos;
+ sub_aio->io_len = len;
+ sub_aio->io_may_write = READ;
+ sub_aio->io_rw = READ;
+ sub_aio->io_data = NULL;
+
+ sub_aio_a = trans_logger_aio_get_aspect(brick, sub_aio);
+ CHECK_PTR(sub_aio_a, err);
+ CHECK_ASPECT(sub_aio_a, sub_aio, err);
+
+ sub_aio_a->my_input = read_input;
+ log_input = brick->inputs[brick->log_input_nr];
+ sub_aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ sub_aio_a->my_brick = brick;
+ sub_aio_a->orig_rw = READ;
+ sub_aio_a->wb = wb;
+
+ status = GENERIC_INPUT_CALL(read_input, aio_get, sub_aio);
+ if (unlikely(status < 0)) {
+ XIO_FAT("cannot get sub_aio, status = %d\n", status);
+ goto err;
+ }
+
+ list_add_tail(&sub_aio_a->sub_head, &wb->w_sub_read_list);
+ atomic_inc(&wb->w_sub_read_count);
+ atomic_inc(&brick->wb_balance_count);
+
+ this_len = sub_aio->io_len;
+ pos += this_len;
+ len -= this_len;
+ }
+ /* Re-init for starting over
+ */
+ pos = wb->w_pos;
+ len = wb->w_len;
+ }
+
+ /* Always create sub_aios for writeback (phase3)
+ */
+ while (len > 0) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct aio_object *orig_aio;
+ struct trans_logger_input *log_input;
+ void *data;
+ int this_len = len;
+ int diff;
+ int status;
+
+ atomic_inc(&brick->total_hash_find_count);
+
+ orig_aio_a = _hash_find(&wb->w_collect_list, pos, &this_len, true, false);
+ if (unlikely(!orig_aio_a)) {
+ XIO_FAT("could not find data\n");
+ goto err;
+ }
+
+ orig_aio = orig_aio_a->object;
+ diff = pos - orig_aio->io_pos;
+ if (unlikely(diff < 0)) {
+ XIO_FAT("bad diff %d\n", diff);
+ goto err;
+ }
+ data = orig_aio_a->shadow_data + diff;
+
+ sub_aio = trans_logger_alloc_aio(brick);
+
+ sub_aio->io_pos = pos;
+ sub_aio->io_len = this_len;
+ sub_aio->io_may_write = WRITE;
+ sub_aio->io_rw = WRITE;
+ sub_aio->io_data = data;
+
+ sub_aio_a = trans_logger_aio_get_aspect(brick, sub_aio);
+ CHECK_PTR(sub_aio_a, err);
+ CHECK_ASPECT(sub_aio_a, sub_aio, err);
+
+ sub_aio_a->orig_aio_a = orig_aio_a;
+ sub_aio_a->my_input = write_input;
+ log_input = orig_aio_a->log_input;
+ sub_aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ sub_aio_a->my_brick = brick;
+ sub_aio_a->orig_rw = WRITE;
+ sub_aio_a->wb = wb;
+
+ status = GENERIC_INPUT_CALL(write_input, aio_get, sub_aio);
+ if (unlikely(status < 0)) {
+ XIO_FAT("cannot get sub_aio, status = %d\n", status);
+ wb->w_error = status;
+ goto err;
+ }
+
+ list_add_tail(&sub_aio_a->sub_head, &wb->w_sub_write_list);
+ atomic_inc(&wb->w_sub_write_count);
+ atomic_inc(&brick->wb_balance_count);
+
+ this_len = sub_aio->io_len;
+ pos += this_len;
+ len -= this_len;
+ }
+
+ return wb;
+
+err:
+ XIO_ERR("cleaning up...\n");
+collision:
+ if (wb)
+ free_writeback(wb);
+ return NULL;
+}
+
+static inline
+void _fire_one(struct list_head *tmp, bool do_update)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *sub_input;
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+
+ if (unlikely(sub_aio_a->is_fired)) {
+ XIO_ERR("trying to fire twice\n");
+ goto out_return;
+ }
+ sub_aio_a->is_fired = true;
+
+ SETUP_CALLBACK(sub_aio, wb_endio, sub_aio_a);
+
+ sub_input = sub_aio_a->my_input;
+
+#ifdef DO_WRITEBACK
+ GENERIC_INPUT_CALL(sub_input, aio_io, sub_aio);
+#else
+ SIMPLE_CALLBACK(sub_aio, 0);
+#endif
+ if (do_update) { /* CHECK: shouldn't we do this always? */
+ GENERIC_INPUT_CALL(sub_input, aio_put, sub_aio);
+ }
+out_return:;
+}
+
+static inline
+void fire_writeback(struct list_head *start, bool do_update)
+{
+ struct list_head *tmp;
+
+ /* Caution! The wb structure may get deallocated
+ * during _fire_one() in some cases (e.g. when the
+ * callback is directly called by the aio_io operation).
+ * Ensure that no ptr dereferencing can take
+ * place after working on the last list member.
+ */
+ tmp = start->next;
+ while (tmp != start) {
+ struct list_head *next = tmp->next;
+
+ list_del_init(tmp);
+ _fire_one(tmp, do_update);
+ tmp = next;
+ }
+}
+
+static inline
+void update_max_pos(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ loff_t max_pos = orig_aio_a->log_pos;
+ struct trans_logger_input *log_input = orig_aio_a->log_input;
+
+ CHECK_PTR(log_input, done);
+
+ down(&log_input->inf_mutex);
+
+ if (unlikely(max_pos < log_input->inf.inf_min_pos))
+ XIO_ERR("new max_pos < min_pos: %lld < %lld\n", max_pos, log_input->inf.inf_min_pos);
+ if (log_input->inf.inf_max_pos < max_pos) {
+ log_input->inf.inf_max_pos = max_pos;
+ get_lamport(&log_input->inf.inf_max_pos_stamp);
+ _inf_callback(log_input, false);
+ }
+
+ up(&log_input->inf_mutex);
+done:;
+}
+
+static inline
+void update_writeback_info(struct writeback_info *wb)
+{
+ struct list_head *start = &wb->w_collect_list;
+ struct list_head *tmp;
+
+ /* Notice: in case of log rotation, each list member
+ * may belong to a different log_input.
+ */
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *orig_aio_a;
+
+ orig_aio_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ update_max_pos(orig_aio_a);
+ }
+}
+
+/***************************** worker thread *****************************/
+
+/*********************************************************************
+ * Phase 0: write transaction log entry for the original write request.
+ */
+
+static
+void _complete(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *orig_aio_a, int error, bool pre_io)
+{
+ struct aio_object *orig_aio;
+
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+
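+ /* trans_logger_completion_semantics controls when a write is
+ * signalled to the upper layer: >= 2 always defers completion
+ * until the log write has finished, >= 1 defers it only for
+ * requests demanding a sync (io_skip_sync not set), while 0
+ * allows completion already at pre-IO time.
+ */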
+ if (orig_aio_a->is_completed ||
+ (pre_io &&
+ (trans_logger_completion_semantics >= 2 ||
+ (trans_logger_completion_semantics >= 1 && !orig_aio->io_skip_sync)))) {
+ goto done;
+ }
+
+ if (cmpxchg(&orig_aio_a->is_completed, false, true))
+ goto done;
+
+ atomic_dec(&brick->log_fly_count);
+
+ if (likely(error >= 0)) {
+ aio_checksum(orig_aio);
+ orig_aio->io_flags &= ~AIO_WRITING;
+ orig_aio->io_flags |= AIO_UPTODATE;
+ }
+ CHECKED_CALLBACK(orig_aio, error, err);
+
+done:
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
+static
+void phase0_preio(void *private)
+{
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct trans_logger_brick *brick;
+
+ orig_aio_a = private;
+ CHECK_PTR(orig_aio_a, err);
+ CHECK_PTR(orig_aio_a->object, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ /* signal completion to the upper layer */
+
+ obj_check(orig_aio_a->object);
+ _complete(brick, orig_aio_a, 0, true);
+ obj_check(orig_aio_a->object);
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
+static
+void phase0_endio(void *private, int error)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct trans_logger_brick *brick;
+
+ orig_aio_a = private;
+ CHECK_PTR(orig_aio_a, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+
+ orig_aio_a->is_persistent = true;
+ qq_dec_flying(&brick->q_phase[0]);
+
+ /* Pin aio->obj_count so it can't go away
+ * after _complete().
+ */
+ _CHECK(orig_aio_a->shadow_aio, err);
+ obj_get(orig_aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+
+ /* signal completion to the upper layer */
+ _complete(brick, orig_aio_a, error, false);
+
+ /* Queue up for the next phase.
+ */
+ qq_aio_insert(&brick->q_phase[1], orig_aio_a);
+
+ /* Undo the above pinning
+ */
+ __trans_logger_io_put(brick, orig_aio_a);
+
+ banning_reset(&brick->q_phase[0].q_banning);
+
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
+static
+bool phase0_startio(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_brick *brick;
+ struct trans_logger_input *input;
+ struct log_status *logst;
+ loff_t log_pos;
+ void *data;
+ bool ok;
+
+ CHECK_PTR(orig_aio_a, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+ input = orig_aio_a->log_input;
+ CHECK_PTR(input, err);
+ logst = &input->logst;
+ logst->do_crc = trans_logger_do_crc;
+
+ {
+ struct log_header l = {
+ .l_stamp = orig_aio_a->stamp,
+ .l_pos = orig_aio->io_pos,
+ .l_len = orig_aio->io_len,
+ .l_code = CODE_WRITE_NEW,
+ };
+ data = log_reserve(logst, &l);
+ }
+ if (unlikely(!data))
+ goto err;
+
+ hash_ensure_stableness(brick, orig_aio_a);
+
+ memcpy(data, orig_aio_a->shadow_data, orig_aio->io_len);
+
+ atomic_inc(&brick->log_fly_count);
+
+ ok = log_finalize(logst, orig_aio->io_len, phase0_endio, orig_aio_a);
+ if (unlikely(!ok)) {
+ atomic_dec(&brick->log_fly_count);
+ goto err;
+ }
+ log_pos = logst->log_pos + logst->offset;
+ orig_aio_a->log_pos = log_pos;
+
+ /* update new log_pos in the symlinks */
+ down(&input->inf_mutex);
+ input->inf.inf_log_pos = log_pos;
+ memcpy(&input->inf.inf_log_pos_stamp, &logst->log_pos_stamp, sizeof(input->inf.inf_log_pos_stamp));
+ _inf_callback(input, false);
+
+#ifdef CONFIG_MARS_DEBUG
+ if (!list_empty(&input->pos_list)) {
+ struct trans_logger_aio_aspect *last_aio_a;
+
+ last_aio_a = container_of(input->pos_list.prev, struct trans_logger_aio_aspect, pos_head);
+ if (last_aio_a->log_pos >= orig_aio_a->log_pos)
+ XIO_ERR("backskip in pos_list, %lld >= %lld\n", last_aio_a->log_pos, orig_aio_a->log_pos);
+ }
+#endif
+ list_add_tail(&orig_aio_a->pos_head, &input->pos_list);
+ atomic_inc(&input->pos_count);
+ up(&input->inf_mutex);
+
+ qq_inc_flying(&brick->q_phase[0]);
+
+ phase0_preio(orig_aio_a);
+
+ return true;
+
+err:
+ return false;
+}
+
+static
+bool prep_phase_startio(struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(aio, err);
+ shadow_a = aio_a->shadow_aio;
+ CHECK_PTR(shadow_a, err);
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (aio->io_rw == READ) {
+ /* nothing to do: directly signal success. */
+ struct aio_object *shadow = shadow_a->object;
+
+ if (unlikely(shadow == aio))
+ XIO_ERR("oops, we should be a slave shadow, but are a master one\n");
+#ifdef USE_MEMCPY
+ if (aio_a->shadow_data != aio->io_data) {
+ if (unlikely(aio->io_len <= 0 || aio->io_len > PAGE_SIZE))
+ XIO_ERR("implausible io_len = %d\n", aio->io_len);
+ memcpy(aio->io_data, aio_a->shadow_data, aio->io_len);
+ }
+#endif
+ aio->io_flags |= AIO_UPTODATE;
+
+ CHECKED_CALLBACK(aio, 0, err);
+
+ __trans_logger_io_put(brick, aio_a);
+
+ return true;
+ }
+ /* else WRITE */
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ if (unlikely(aio->io_flags & (AIO_READING | AIO_WRITING)))
+ XIO_ERR("bad flags %d\n", aio->io_flags);
+ /* In case of non-buffered IO, the buffer is
+ * under control of the user. In particular, he
+ * may change it without telling us.
+ * Therefore we make a copy (or "snapshot") here.
+ */
+ aio->io_flags |= AIO_WRITING;
+#ifdef USE_MEMCPY
+ if (aio_a->shadow_data != aio->io_data) {
+ if (unlikely(aio->io_len <= 0 || aio->io_len > PAGE_SIZE))
+ XIO_ERR("implausible io_len = %d\n", aio->io_len);
+ memcpy(aio_a->shadow_data, aio->io_data, aio->io_len);
+ }
+#endif
+ aio_a->is_dirty = true;
+ aio_a->shadow_aio->is_dirty = true;
+#ifndef KEEP_UNIQUE
+ if (unlikely(aio_a->shadow_aio != aio_a))
+ XIO_ERR("something is wrong: %p != %p\n", aio_a->shadow_aio, aio_a);
+#endif
+ if (likely(!aio_a->is_hashed)) {
+ struct trans_logger_input *log_input;
+
+ log_input = brick->inputs[brick->log_input_nr];
+ aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ hash_insert(brick, aio_a);
+ } else {
+ XIO_ERR("tried to hash twice\n");
+ }
+ return phase0_startio(aio_a);
+
+err:
+ XIO_ERR("cannot work\n");
+ brick_msleep(1000);
+ return false;
+}
+
+/*********************************************************************
+ * Phase 1: read original version of data.
+ * This happens _after_ phase 0, deliberately.
+ * We are explicitly dealing with old and new versions.
+ * The new version is hashed in memory all the time (such that parallel
+ * READs will see it), so we have plenty of time for getting the
+ * old version from disk sometime later, e.g. when IO contention is low.
+ */
+
+static
+void phase1_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct writeback_info *wb;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(cb, err);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ XIO_FAT("IO error %d\n", cb->cb_error);
+ goto err;
+ }
+
+ qq_dec_flying(&brick->q_phase[1]);
+
+ banning_reset(&brick->q_phase[1].q_banning);
+
+ /* queue up for the next phase */
+ qq_wb_insert(&brick->q_phase[2], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static void phase3_endio(struct generic_callback *cb);
+
+static bool phase3_startio(struct writeback_info *wb);
+
+static
+bool phase1_startio(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb = NULL;
+
+ CHECK_PTR(orig_aio_a, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (orig_aio_a->is_collected)
+ goto done;
+ if (!orig_aio_a->is_hashed)
+ goto done;
+
+ wb = make_writeback(brick, orig_aio->io_pos, orig_aio->io_len);
+ if (unlikely(!wb))
+ goto collision;
+
+ if (unlikely(list_empty(&wb->w_sub_write_list))) {
+ XIO_ERR("sub_write_list is empty, orig pos = %lld len = %d (collected=%d), extended pos = %lld len = %d\n",
+ orig_aio->io_pos,
+ orig_aio->io_len,
+ (int)orig_aio_a->is_collected,
+ wb->w_pos,
+ wb->w_len);
+ goto err;
+ }
+
+ wb->read_endio = phase1_endio;
+ wb->write_endio = phase3_endio;
+ atomic_set(&wb->w_sub_log_count, atomic_read(&wb->w_sub_read_count));
+
+ if (brick->log_reads) {
+ qq_inc_flying(&brick->q_phase[1]);
+ fire_writeback(&wb->w_sub_read_list, false);
+ } else { /* shortcut */
+#ifndef SHORTCUT_1_to_3
+ qq_wb_insert(&brick->q_phase[3], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+#else
+ return phase3_startio(wb);
+#endif
+ }
+
+done:
+ return true;
+
+err:
+ if (wb)
+ free_writeback(wb);
+collision:
+ return false;
+}
+
+/*********************************************************************
+ * Phase 2: log the old disk version.
+ */
+
+static inline
+void _phase2_endio(struct writeback_info *wb)
+{
+ struct trans_logger_brick *brick = wb->w_brick;
+
+ /* queue up for the next phase */
+ qq_wb_insert(&brick->q_phase[3], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+out_return:;
+}
+
+static
+void phase2_endio(void *private, int error)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb;
+
+ sub_aio_a = private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ qq_dec_flying(&brick->q_phase[2]);
+
+ if (unlikely(error < 0)) {
+ XIO_FAT("IO error %d\n", error);
+ goto err; /* FIXME: this leads to hanging requests. do better. */
+ }
+
+ CHECK_ATOMIC(&wb->w_sub_log_count, 1);
+ if (atomic_dec_and_test(&wb->w_sub_log_count)) {
+ banning_reset(&brick->q_phase[2].q_banning);
+ _phase2_endio(wb);
+ }
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static
+bool _phase2_startio(struct trans_logger_aio_aspect *sub_aio_a)
+{
+ struct aio_object *sub_aio = NULL;
+ struct writeback_info *wb;
+ struct trans_logger_input *input;
+ struct trans_logger_brick *brick;
+ struct log_status *logst;
+ void *data;
+ bool ok;
+
+ CHECK_PTR(sub_aio_a, err);
+ sub_aio = sub_aio_a->object;
+ CHECK_PTR(sub_aio, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+ input = sub_aio_a->log_input;
+ CHECK_PTR(input, err);
+ logst = &input->logst;
+ logst->do_crc = trans_logger_do_crc;
+
+ {
+ struct log_header l = {
+ .l_stamp = sub_aio_a->stamp,
+ .l_pos = sub_aio->io_pos,
+ .l_len = sub_aio->io_len,
+ .l_code = CODE_WRITE_OLD,
+ };
+ data = log_reserve(logst, &l);
+ }
+
+ if (unlikely(!data))
+ goto err;
+
+ memcpy(data, sub_aio->io_data, sub_aio->io_len);
+
+ ok = log_finalize(logst, sub_aio->io_len, phase2_endio, sub_aio_a);
+ if (unlikely(!ok))
+ goto err;
+
+ qq_inc_flying(&brick->q_phase[2]);
+
+ return true;
+
+err:
+ XIO_FAT("cannot log old data, pos = %lld len = %d\n",
+ sub_aio ? sub_aio->io_pos : 0,
+ sub_aio ? sub_aio->io_len : 0);
+ return false;
+}
+
+static
+bool phase2_startio(struct writeback_info *wb)
+{
+ struct trans_logger_brick *brick;
+ bool ok = true;
+
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (brick->log_reads && atomic_read(&wb->w_sub_log_count) > 0) {
+ struct list_head *start;
+ struct list_head *tmp;
+
+ start = &wb->w_sub_read_list;
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+
+ if (!_phase2_startio(sub_aio_a))
+ ok = false;
+ }
+ wake_up_interruptible_all(&brick->worker_event);
+ } else {
+ _phase2_endio(wb);
+ }
+ return ok;
+err:
+ return false;
+}
+
+/*********************************************************************
+ * Phase 3: overwrite old disk version with new version.
+ */
+
+static
+void phase3_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct writeback_info *wb;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(cb, err);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ XIO_FAT("IO error %d\n", cb->cb_error);
+ goto err;
+ }
+
+ hash_put_all(brick, &wb->w_collect_list);
+
+ qq_dec_flying(&brick->q_phase[3]);
+ atomic_inc(&brick->total_writeback_cluster_count);
+
+ free_writeback(wb);
+
+ banning_reset(&brick->q_phase[3].q_banning);
+
+ wake_up_interruptible_all(&brick->worker_event);
+
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static
+bool phase3_startio(struct writeback_info *wb)
+{
+ struct list_head *start = &wb->w_sub_read_list;
+ struct list_head *tmp;
+
+ /* Cleanup read requests (if they exist from previous phases)
+ */
+ while ((tmp = start->next) != start) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *sub_input;
+
+ list_del_init(tmp);
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+ sub_input = sub_aio_a->my_input;
+
+ GENERIC_INPUT_CALL(sub_input, aio_put, sub_aio);
+ }
+
+ update_writeback_info(wb);
+
+ /* Start writeback IO
+ */
+ qq_inc_flying(&wb->w_brick->q_phase[3]);
+ fire_writeback(&wb->w_sub_write_list, true);
+ return true;
+}
+
+/*********************************************************************
+ * The logger thread.
+ * There is only a single instance, dealing with all requests in parallel.
+ */
+
+static
+int run_aio_queue(struct logger_queue *q,
+ bool (*startio)(struct trans_logger_aio_aspect *sub_aio_a),
+ int max,
+ bool do_limit)
+{
+ struct trans_logger_brick *brick = q->q_brick;
+ int total_len = 0;
+ bool found = false;
+ bool ok;
+ int res = 0;
+
+ do {
+ struct trans_logger_aio_aspect *aio_a;
+
+ aio_a = qq_aio_fetch(q);
+ if (!aio_a)
+ goto done;
+
+ if (do_limit && likely(aio_a->object))
+ total_len += aio_a->object->io_len;
+
+ ok = startio(aio_a);
+ if (unlikely(!ok)) {
+ qq_aio_pushback(q, aio_a);
+ goto done;
+ }
+ res++;
+ found = true;
+ __trans_logger_io_put(aio_a->my_brick, aio_a);
+ } while (--max > 0);
+
+done:
+ if (found) {
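+ /* account the submitted amount (rounded up to whole KiB)
+ * at the global writeback rate limiter
+ */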
+ xio_limit(&global_writeback.limiter, (total_len - 1) / 1024 + 1);
+ wake_up_interruptible_all(&brick->worker_event);
+ }
+ return res;
+}
+
+static
+int run_wb_queue(struct logger_queue *q, bool (*startio)(struct writeback_info *wb), int max)
+{
+ struct trans_logger_brick *brick = q->q_brick;
+ int total_len = 0;
+ bool found = false;
+ bool ok;
+ int res = 0;
+
+ do {
+ struct writeback_info *wb;
+
+ wb = qq_wb_fetch(q);
+ if (!wb)
+ goto done;
+
+ total_len += wb->w_len;
+
+ ok = startio(wb);
+ if (unlikely(!ok)) {
+ qq_wb_pushback(q, wb);
+ goto done;
+ }
+ res++;
+ found = true;
+ } while (--max > 0);
+
+done:
+ if (found) {
+ xio_limit(&global_writeback.limiter, (total_len - 1) / 1024 + 1);
+ wake_up_interruptible_all(&brick->worker_event);
+ }
+ return res;
+}
+
+/* Ranking tables.
+ */
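+/* Each table is a list of { value, points } pairs terminated by
+ * RKI_DUMMY; ranking_compute() maps the current queue length resp.
+ * number of flying requests to a score contribution, and negative
+ * points de-prioritize a queue. The "float" tables apply while shadow
+ * memory is plentiful, the "nofloat" tables otherwise (see
+ * _do_ranking() below).
+ */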
+static
+struct rank_info float_queue_rank_log[] = {
+ { 0, 0 },
+ { 1, 100 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_queue_rank_io[] = {
+ { 0, 0 },
+ { 1, 1 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_fly_rank_log[] = {
+ { 0, 0 },
+ { 1, 1 },
+ { 32, 10 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_fly_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 2, -10 },
+ { 10000, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info nofloat_queue_rank_log[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info nofloat_queue_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 100, 100 },
+ { RKI_DUMMY }
+};
+
+#define nofloat_fly_rank_log float_fly_rank_log
+
+static
+struct rank_info nofloat_fly_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 128, 8 },
+ { 129, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info *queue_ranks[2][LOGGER_QUEUES] = {
+ [0] = {
+ [0] = float_queue_rank_log,
+ [1] = float_queue_rank_io,
+ [2] = float_queue_rank_io,
+ [3] = float_queue_rank_io,
+ },
+ [1] = {
+ [0] = nofloat_queue_rank_log,
+ [1] = nofloat_queue_rank_io,
+ [2] = nofloat_queue_rank_io,
+ [3] = nofloat_queue_rank_io,
+ },
+};
+static
+struct rank_info *fly_ranks[2][LOGGER_QUEUES] = {
+ [0] = {
+ [0] = float_fly_rank_log,
+ [1] = float_fly_rank_io,
+ [2] = float_fly_rank_io,
+ [3] = float_fly_rank_io,
+ },
+ [1] = {
+ [0] = nofloat_fly_rank_log,
+ [1] = nofloat_fly_rank_io,
+ [2] = nofloat_fly_rank_io,
+ [3] = nofloat_fly_rank_io,
+ },
+};
+
+static
+struct rank_info extra_rank_aio_flying[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 16, 30 },
+ { 31, 0 },
+ { 32, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info global_rank_aio_flying[] = {
+ { 0, 0 },
+ { 63, 0 },
+ { 64, -200 },
+ { RKI_DUMMY }
+};
+
+static
+int _do_ranking(struct trans_logger_brick *brick, struct rank_data rkd[])
+{
+ int res;
+ int i;
+ int floating_mode;
+ int aio_flying;
+ bool delay_callers;
+
+ ranking_start(rkd, LOGGER_QUEUES);
+
+ /* check the memory situation... */
+ delay_callers = false;
+ floating_mode = 1;
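+ /* Above the configured limit, new writers are stalled in
+ * _write_io_get() via delay_callers until writeback has freed
+ * shadow memory; above half of the limit, floating_mode switches
+ * from the "float" to the "nofloat" ranking tables.
+ */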
+ if (brick_global_memlimit >= 1024) {
+ int global_mem_used = atomic64_read(&global_mshadow_used) / 1024;
+
+ trans_logger_mem_usage = global_mem_used;
+
+ floating_mode = (global_mem_used < brick_global_memlimit / 2) ? 0 : 1;
+
+ if (global_mem_used >= brick_global_memlimit)
+ delay_callers = true;
+
+ } else if (brick->shadow_mem_limit >= 8) {
+ int local_mem_used = atomic64_read(&brick->shadow_mem_used) / 1024;
+
+ floating_mode = (local_mem_used < brick->shadow_mem_limit / 2) ? 0 : 1;
+
+ if (local_mem_used >= brick->shadow_mem_limit)
+ delay_callers = true;
+ }
+ if (delay_callers) {
+ if (!brick->delay_callers) {
+ brick->delay_callers = true;
+ atomic_inc(&brick->total_delay_count);
+ }
+ } else if (brick->delay_callers) {
+ brick->delay_callers = false;
+ wake_up_interruptible(&brick->caller_event);
+ }
+
+ /* global limit for flying aios */
+ ranking_compute(&rkd[0], global_rank_aio_flying, atomic_read(&global_aio_flying));
+
+ /* local limit for flying aios */
+ aio_flying = 0;
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+
+ aio_flying += atomic_read(&input->logst.aio_flying);
+ }
+
+ /* obey the basic rules... */
+ for (i = 0; i < LOGGER_QUEUES; i++) {
+ int queued = atomic_read(&brick->q_phase[i].q_queued);
+ int flying;
+
+ /* This must come first.
+ * When a queue is empty, you must not credit any positive points.
+ * Otherwise, (almost) infinite selection of untreatable
+ * queues may occur.
+ */
+ if (queued <= 0)
+ continue;
+
+ if (banning_is_hit(&brick->q_phase[i].q_banning))
+ break;
+
+ if (i == 0) {
+ /* limit aio IO parallelism on transaction log */
+ ranking_compute(&rkd[0], extra_rank_aio_flying, aio_flying);
+ } else if (i == 1 && !floating_mode) {
+ struct trans_logger_brick *leader;
+ int lim;
+
+ if (!aio_flying && atomic_read(&brick->q_phase[0].q_queued) > 0)
+ break;
+
+ leader = elect_leader(&global_writeback);
+ if (leader != brick)
+ break;
+
+ if (banning_is_hit(&xio_global_ban))
+ break;
+
+ lim = xio_limit(&global_writeback.limiter, 0);
+ if (lim > 0)
+ break;
+ }
+
+ ranking_compute(&rkd[i], queue_ranks[floating_mode][i], queued);
+
+ flying = atomic_read(&brick->q_phase[i].q_flying);
+
+ ranking_compute(&rkd[i], fly_ranks[floating_mode][i], flying);
+ }
+
+ /* finalize it */
+ ranking_stop(rkd, LOGGER_QUEUES);
+
+ res = ranking_select(rkd, LOGGER_QUEUES);
+
+ return res;
+}
+
+static
+void _init_input(struct trans_logger_input *input, loff_t start_pos, loff_t end_pos)
+{
+ struct trans_logger_brick *brick = input->brick;
+ struct log_status *logst = &input->logst;
+
+ init_logst(logst, (void *)input, start_pos, end_pos);
+ logst->signal_event = &brick->worker_event;
+ logst->align_size = CONF_TRANS_ALIGN;
+ logst->chunk_size = CONF_TRANS_CHUNKSIZE;
+ logst->max_size = CONF_TRANS_MAX_AIO_SIZE;
+
+ input->inf.inf_min_pos = start_pos;
+ input->inf.inf_max_pos = end_pos;
+ get_lamport(&input->inf.inf_max_pos_stamp);
+ memcpy(&input->inf.inf_min_pos_stamp, &input->inf.inf_max_pos_stamp, sizeof(input->inf.inf_min_pos_stamp));
+
+ logst->log_pos = start_pos;
+ input->inf.inf_log_pos = start_pos;
+ input->inf_last_jiffies = jiffies;
+ input->inf.inf_is_replaying = false;
+ input->inf.inf_is_logging = false;
+
+ input->is_operating = true;
+}
+
+static
+void _init_inputs(struct trans_logger_brick *brick, bool is_first)
+{
+ struct trans_logger_input *input;
+ int old_nr = brick->old_input_nr;
+ int log_nr = brick->log_input_nr;
+ int new_nr = brick->new_input_nr;
+
+ if (!is_first &&
+ (new_nr == log_nr ||
+ log_nr != old_nr)) {
+ goto done;
+ }
+ if (unlikely(new_nr < TL_INPUT_LOG1 || new_nr > TL_INPUT_LOG2)) {
+ XIO_ERR("bad new_input_nr = %d\n", new_nr);
+ goto done;
+ }
+
+ input = brick->inputs[new_nr];
+ CHECK_PTR(input, done);
+
+ if (input->is_operating || !input->connect)
+ goto done;
+
+ down(&input->inf_mutex);
+
+ _init_input(input, 0, 0);
+ input->inf.inf_is_logging = is_first;
+
+ /* from now on, new requests should go to the new input */
+ brick->log_input_nr = new_nr;
+ XIO_INF("switched over to new logfile %d (old = %d)\n", new_nr, old_nr);
+
+ /* Flush the old log buffer and update its symlinks.
+ * Notice: for some short time, _both_ logfiles may grow
+ * due to (harmless) races with log_flush().
+ */
+ if (likely(!is_first)) {
+ struct trans_logger_input *other_input = brick->inputs[old_nr];
+
+ down(&other_input->inf_mutex);
+ log_flush(&other_input->logst);
+ _inf_callback(other_input, true);
+ up(&other_input->inf_mutex);
+ }
+
+ _inf_callback(input, true);
+
+ up(&input->inf_mutex);
+done:;
+}
+
+static
+int _nr_flying_inputs(struct trans_logger_brick *brick)
+{
+ int count = 0;
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating)
+ count += logst->count;
+ }
+ return count;
+}
+
+static
+void _flush_inputs(struct trans_logger_brick *brick)
+{
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating && logst->count > 0) {
+ atomic_inc(&brick->total_flush_count);
+ log_flush(logst);
+ }
+ }
+}
+
+static
+void _exit_inputs(struct trans_logger_brick *brick, bool force)
+{
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating &&
+ (force || !input->connect)) {
+ bool old_replaying = input->inf.inf_is_replaying;
+ bool old_logging = input->inf.inf_is_logging;
+
+ XIO_DBG("cleaning up input %d (log = %d old = %d), old_replaying = %d old_logging = %d\n",
+ i,
+ brick->log_input_nr,
+ brick->old_input_nr,
+ old_replaying,
+ old_logging);
+ exit_logst(logst);
+ /* no locking here: we should be the only thread doing this. */
+ _inf_callback(input, true);
+ input->inf_last_jiffies = 0;
+ input->inf.inf_is_replaying = false;
+ input->inf.inf_is_logging = false;
+ input->is_operating = false;
+ if (i == brick->old_input_nr && i != brick->log_input_nr) {
+ struct trans_logger_input *other_input = brick->inputs[brick->log_input_nr];
+
+ down(&other_input->inf_mutex);
+ brick->old_input_nr = brick->log_input_nr;
+ other_input->inf.inf_is_replaying = old_replaying;
+ other_input->inf.inf_is_logging = old_logging;
+ _inf_callback(other_input, true);
+ up(&other_input->inf_mutex);
+ }
+ }
+ }
+}
+
+/* Performance-critical:
+ * Calling log_flush() too often may result in
+ * increased overhead (and thus in lower throughput).
+ * Call it only when the IO scheduler need not do anything else.
+ * OTOH, calling it too seldom may hold back
+ * IO completion for the end user for too long.
+ *
+ * Be careful to flush any leftovers in the log buffer, at least after
+ * some short delay.
+ *
+ * Description of flush_mode:
+ * 0 = flush unconditionally
+ * 1 = flush only when nothing can be appended to the transaction log
+ * 2 = see 1 && flush only when the user is waiting for an answer
+ * 3 = see 1 && not 2 && flush only when there is no other activity (background mode)
+ * Notice: 3 only makes sense for leftovers where the user is _not_ waiting for an answer.
+ */
+static inline
+void flush_inputs(struct trans_logger_brick *brick, int flush_mode)
+{
+ if (flush_mode < 1 ||
+ /* there is nothing to append any more */
+ (atomic_read(&brick->q_phase[0].q_queued) <= 0 &&
+ /* and the user is waiting for an answer */
+ (flush_mode < 2 ||
+ atomic_read(&brick->log_fly_count) > 0 ||
+ /* else flush any leftovers in background, when there is no writeback activity */
+ (flush_mode == 3 &&
+ atomic_read(&brick->q_phase[1].q_flying) + atomic_read(&brick->q_phase[3].q_flying) <= 0))))
+ _flush_inputs(brick);
+}
+
+static
+void trans_logger_log(struct trans_logger_brick *brick)
+{
+ struct rank_data rkd[LOGGER_QUEUES] = {};
+ long long old_jiffies = jiffies;
+ long long work_jiffies = jiffies;
+ int interleave = 0;
+ int nr_flying;
+
+ brick->replay_code = 0; /* indicates "running" */
+ brick->disk_io_error = 0;
+
+ _init_inputs(brick, true);
+
+ xio_set_power_on_led((void *)brick, true);
+
+ while (!brick_thread_should_stop() || _congested(brick)) {
+ int winner;
+ int nr;
+
+ wait_event_interruptible_timeout(
+ brick->worker_event,
+ ({
+ winner = _do_ranking(brick, rkd);
+ if (winner < 0) { /* no more work to do */
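+ /* Flush urgency decays with idle time:
+ * mode 2 within the first 2 s of idleness,
+ * mode 1 up to 4 s, afterwards <= 0 which
+ * flushes unconditionally.
+ */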
+ int flush_mode = 2 - ((int)(jiffies - work_jiffies)) / (HZ * 2);
+
+ flush_inputs(brick, flush_mode);
+ interleave = 0;
+ } else { /* reset the timer whenever something is to do */
+ work_jiffies = jiffies;
+ }
+ winner >= 0;
+ }),
+ HZ / 10);
+
+ atomic_inc(&brick->total_round_count);
+
+ if (brick->cease_logging)
+ brick->stopped_logging = true;
+ else if (brick->stopped_logging && !_congested(brick))
+ brick->stopped_logging = false;
+
+ _init_inputs(brick, false);
+
+ switch (winner) {
+ case 0:
+ interleave = 0;
+ nr = run_aio_queue(&brick->q_phase[0],
+ prep_phase_startio,
+ brick->q_phase[0].q_batchlen,
+ true);
+ goto done;
+ case 1:
+ if (interleave >= trans_logger_max_interleave && trans_logger_max_interleave >= 0) {
+ interleave = 0;
+ flush_inputs(brick, 3);
+ }
+ nr = run_aio_queue(&brick->q_phase[1], phase1_startio, brick->q_phase[1].q_batchlen, true);
+ interleave += nr;
+ goto done;
+ case 2:
+ interleave = 0;
+ nr = run_wb_queue(&brick->q_phase[2], phase2_startio, brick->q_phase[2].q_batchlen);
+ goto done;
+ case 3:
+ if (interleave >= trans_logger_max_interleave && trans_logger_max_interleave >= 0) {
+ interleave = 0;
+ flush_inputs(brick, 3);
+ }
+ nr = run_wb_queue(&brick->q_phase[3], phase3_startio, brick->q_phase[3].q_batchlen);
+ interleave += nr;
+done:
+ if (unlikely(nr <= 0)) {
+ /* This should not happen!
+ * However, in error situations, the ranking
+ * algorithm cannot foresee anything.
+ */
+ brick->q_phase[winner].no_progress_count++;
+ banning_hit(&brick->q_phase[winner].q_banning, 10000);
+ flush_inputs(brick, 0);
+ }
+ ranking_select_done(rkd, winner, nr);
+ break;
+
+ default:
+ break;
+ }
+
+ /* Update symlinks even during pauses.
+ */
+ if (winner < 0 && ((long long)jiffies) - old_jiffies >= HZ) {
+ int i;
+
+ old_jiffies = jiffies;
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+
+ down(&input->inf_mutex);
+ _inf_callback(input, false);
+ up(&input->inf_mutex);
+ }
+ }
+
+ _exit_inputs(brick, false);
+ }
+
+ for (;;) {
+ _exit_inputs(brick, true);
+ nr_flying = _nr_flying_inputs(brick);
+ if (nr_flying <= 0)
+ break;
+ XIO_INF("%d inputs are operating\n", nr_flying);
+ brick_msleep(1000);
+ }
+}
+
+/***************************** log replay *****************************/
+
+static
+void replay_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *aio_a = cb->cb_private;
+ struct trans_logger_brick *brick;
+ bool ok;
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(aio_a, err);
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ brick->disk_io_error = cb->cb_error;
+ XIO_ERR("IO error = %d\n", cb->cb_error);
+ }
+
+ spin_lock(&brick->replay_lock);
+ ok = !list_empty(&aio_a->replay_head);
+ list_del_init(&aio_a->replay_head);
+ spin_unlock(&brick->replay_lock);
+
+ if (likely(ok))
+ atomic_dec(&brick->replay_count);
+ else
+ XIO_ERR("callback with empty replay_head (replay_count=%d)\n", atomic_read(&brick->replay_count));
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle replay IO\n");
+out_return:;
+}
+
+static
+bool _has_conflict(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ struct list_head *tmp;
+ bool res = false;
+
+ spin_lock(&brick->replay_lock);
+
+ for (tmp = brick->replay_list.next; tmp != &brick->replay_list; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *tmp_a;
+ struct aio_object *tmp_aio;
+
+ tmp_a = container_of(tmp, struct trans_logger_aio_aspect, replay_head);
+ tmp_aio = tmp_a->object;
+ if (tmp_aio->io_pos + tmp_aio->io_len > aio->io_pos && tmp_aio->io_pos < aio->io_pos + aio->io_len) {
+ res = true;
+ break;
+ }
+ }
+
+ spin_unlock(&brick->replay_lock);
+ return res;
+}
+
+static
+void wait_replay(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ const int max = 512; /* limit parallelism somewhat */
+ int conflicts = 0;
+ bool ok = false;
+ bool was_empty;
+
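+ /* The comma expression below counts every wakeup that found an
+ * overlapping replay request still in flight; the wait may only
+ * terminate once no conflict remains (ok becomes true) and the
+ * parallelism limit is respected.
+ */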
+ wait_event_interruptible_timeout(brick->worker_event,
+ atomic_read(&brick->replay_count) < max
+ && (_has_conflict(brick, aio_a) ? conflicts++ : (ok = true), ok),
+ 60 * HZ);
+
+ atomic_inc(&brick->total_replay_count);
+ if (conflicts)
+ atomic_inc(&brick->total_replay_conflict_count);
+
+ spin_lock(&brick->replay_lock);
+ was_empty = !!list_empty(&aio_a->replay_head);
+ if (likely(was_empty))
+ atomic_inc(&brick->replay_count);
+ else
+ list_del(&aio_a->replay_head);
+ list_add(&aio_a->replay_head, &brick->replay_list);
+ spin_unlock(&brick->replay_lock);
+
+ if (unlikely(!was_empty)) {
+ XIO_ERR("replay_head was already used (ok=%d, conflicts=%d, replay_count=%d)\n",
+ ok,
+ conflicts,
+ atomic_read(&brick->replay_count));
+ }
+}
+
+static
+int replay_data(struct trans_logger_brick *brick, loff_t pos, void *buf, int len)
+{
+ struct trans_logger_input *input = brick->inputs[TL_INPUT_WRITEBACK];
+ int status;
+
+ if (!input->connect)
+ input = brick->inputs[TL_INPUT_READ];
+
+ /* TODO for better efficiency:
+ * Instead of starting IO here, just put the data into the hashes
+ * and queues such that ordinary IO will be corrected.
+ * Writeback will be lazy then.
+ * The switch infrastructure must be changed before this
+ * becomes possible.
+ */
+#ifdef REPLAY_DATA
+ while (len > 0) {
+ struct aio_object *aio;
+ struct trans_logger_aio_aspect *aio_a;
+
+ status = -ENOMEM;
+ aio = trans_logger_alloc_aio(brick);
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, done);
+ CHECK_ASPECT(aio_a, aio, done);
+
+ aio->io_pos = pos;
+ aio->io_data = NULL;
+ aio->io_len = len;
+ aio->io_may_write = WRITE;
+ aio->io_rw = WRITE;
+
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot get aio, status = %d\n", status);
+ goto done;
+ }
+ if (unlikely(!aio->io_data)) {
+ status = -ENOMEM;
+ XIO_ERR("cannot get aio, status = %d\n", status);
+ goto done;
+ }
+ if (unlikely(aio->io_len <= 0 || aio->io_len > len)) {
+ status = -EINVAL;
+ XIO_ERR("bad aio len = %d (requested = %d)\n", aio->io_len, len);
+ goto done;
+ }
+
+ wait_replay(brick, aio_a);
+
+ memcpy(aio->io_data, buf, aio->io_len);
+
+ SETUP_CALLBACK(aio, replay_endio, aio_a);
+ aio_a->my_brick = brick;
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+
+ if (unlikely(aio->io_len <= 0)) {
+ status = -EINVAL;
+ XIO_ERR("bad aio len = %d (requested = %d)\n", aio->io_len, len);
+ goto done;
+ }
+
+ pos += aio->io_len;
+ buf += aio->io_len;
+ len -= aio->io_len;
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ }
+#endif
+ status = 0;
+done:
+ return status;
+}
+
+static
+void trans_logger_replay(struct trans_logger_brick *brick)
+{
+ struct trans_logger_input *input = brick->inputs[brick->log_input_nr];
+ struct log_header lh = {};
+ loff_t start_pos;
+ loff_t end_pos;
+ loff_t finished_pos = -1;
+ loff_t new_finished_pos = -1;
+ long long old_jiffies = jiffies;
+ int nr_flying;
+ int backoff = 0;
+ int status = 0;
+
+ brick->replay_code = 0; /* indicates "running" */
+ brick->disk_io_error = 0;
+
+ start_pos = brick->replay_start_pos;
+ end_pos = brick->replay_end_pos;
+ brick->replay_current_pos = start_pos;
+
+ _init_input(input, start_pos, end_pos);
+
+ input->inf.inf_min_pos = start_pos;
+ input->inf.inf_max_pos = end_pos;
+ input->inf.inf_log_pos = end_pos;
+ input->inf.inf_is_replaying = true;
+ input->inf.inf_is_logging = false;
+
+ XIO_INF("starting replay from %lld to %lld\n", start_pos, end_pos);
+
+ xio_set_power_on_led((void *)brick, true);
+
+ for (;;) {
+ void *buf = NULL;
+ int len = 0;
+
+ if (brick_thread_should_stop() ||
+ (!brick->continuous_replay_mode && finished_pos >= brick->replay_end_pos)) {
+ status = 0; /* treat as EOF */
+ break;
+ }
+
+ status = log_read(&input->logst, false, &lh, &buf, &len);
+
+ new_finished_pos = input->logst.log_pos + input->logst.offset;
+ XIO_RPL("read %lld %lld\n", finished_pos, new_finished_pos);
+
+ if (status == -EAGAIN) {
+ loff_t remaining = brick->replay_end_pos - new_finished_pos;
+
+ XIO_DBG("got -EAGAIN, remaining = %lld\n", remaining);
+ if (brick->replay_tolerance > 0 && remaining < brick->replay_tolerance) {
+ XIO_WRN("logfile is truncated at position %lld (end_pos = %lld, remaining = %lld, tolerance = %d)\n",
+ new_finished_pos,
+ brick->replay_end_pos,
+ remaining,
+ brick->replay_tolerance);
+ finished_pos = new_finished_pos;
+ brick->replay_code = status;
+ break;
+ }
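+ /* linear backoff: sleep 100 ms longer on each
+ * further -EAGAIN until the configured replay
+ * timeout (in seconds) is exceeded
+ */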
+ brick_msleep(backoff);
+ if (backoff < trans_logger_replay_timeout * 1000) {
+ backoff += 100;
+ } else {
+ XIO_WRN("logfile replay not possible at position %lld (end_pos = %lld, remaining = %lld), please check/repair your logfile in userspace by some tool!\n",
+ new_finished_pos,
+ brick->replay_end_pos,
+ remaining);
+ brick->replay_code = status;
+ break;
+ }
+ continue;
+ }
+ if (unlikely(status < 0)) {
+ brick->replay_code = status;
+ XIO_WRN("cannot read logfile data, status = %d\n", status);
+ break;
+ }
+
+ if ((!status && len <= 0) ||
+ new_finished_pos > brick->replay_end_pos) { /* EOF -> wait until brick_thread_should_stop() */
+ XIO_DBG("EOF at %lld (old = %lld, end_pos = %lld)\n",
+ new_finished_pos,
+ finished_pos,
+ brick->replay_end_pos);
+ if (!brick->continuous_replay_mode) {
+ /* notice: finished_pos remains at old value here! */
+ break;
+ }
+ brick_msleep(1000);
+ continue;
+ }
+
+ if (lh.l_code != CODE_WRITE_NEW) {
+ /* ignore other records silently */;
+ } else if (unlikely(brick->disk_io_error)) {
+ status = brick->disk_io_error;
+ brick->replay_code = status;
+ XIO_ERR("IO error %d\n", status);
+ break;
+ } else if (likely(buf && len)) {
+ if (brick->replay_limiter)
+ xio_limit_sleep(brick->replay_limiter, (len - 1) / 1024 + 1);
+ status = replay_data(brick, lh.l_pos, buf, len);
+ XIO_RPL("replay %lld %lld (pos=%lld status=%d)\n",
+ finished_pos,
+ new_finished_pos,
+ lh.l_pos,
+ status);
+ if (unlikely(status < 0)) {
+ brick->replay_code = status;
+ XIO_ERR("cannot replay data at pos = %lld len = %d, status = %d\n",
+ lh.l_pos,
+ len,
+ status);
+ break;
+ } else {
+ finished_pos = new_finished_pos;
+ }
+ }
+
+ /* do this _after_ any opportunities for errors... */
+ if ((atomic_read(&brick->replay_count) <= 0 ||
+ ((long long)jiffies) - old_jiffies >= HZ * 3) &&
+ finished_pos >= 0) {
+ /* for safety, wait until the IO queue has drained. */
+ wait_event_interruptible_timeout(brick->worker_event,
+ atomic_read(&brick->replay_count) <= 0,
+ 30 * HZ);
+
+ if (unlikely(brick->disk_io_error)) {
+ status = brick->disk_io_error;
+ brick->replay_code = status;
+ XIO_ERR("IO error %d\n", status);
+ break;
+ }
+
+ down(&input->inf_mutex);
+ input->inf.inf_min_pos = finished_pos;
+ get_lamport(&input->inf.inf_min_pos_stamp);
+ old_jiffies = jiffies;
+ _inf_callback(input, false);
+ up(&input->inf_mutex);
+ }
+ _exit_inputs(brick, false);
+ }
+
+ XIO_INF("waiting for finish...\n");
+
+ wait_event_interruptible_timeout(brick->worker_event, atomic_read(&brick->replay_count) <= 0, 60 * HZ);
+
+ if (unlikely(finished_pos > brick->replay_end_pos)) {
+ XIO_ERR("finished_pos too large: %lld + %d = %lld > %lld\n",
+ input->logst.log_pos,
+ input->logst.offset,
+ finished_pos,
+ brick->replay_end_pos);
+ }
+
+ if (finished_pos >= 0 && !brick->disk_io_error) {
+ input->inf.inf_min_pos = finished_pos;
+ brick->replay_current_pos = finished_pos;
+ }
+
+ get_lamport(&input->inf.inf_min_pos_stamp);
+
+ if (status >= 0 && finished_pos == brick->replay_end_pos) {
+ XIO_INF("replay finished at %lld\n", finished_pos);
+ brick->replay_code = 1;
+ } else if (status == -EAGAIN && finished_pos + brick->replay_tolerance > brick->replay_end_pos) {
+ XIO_INF("TOLERANCE: logfile is incomplete at %lld (of %lld)\n", finished_pos, brick->replay_end_pos);
+ brick->replay_code = 2;
+ } else if (status < 0) {
+ if (finished_pos < 0)
+ finished_pos = new_finished_pos;
+ if (finished_pos + brick->replay_tolerance > brick->replay_end_pos) {
+ XIO_INF("TOLERANCE: logfile is incomplete at %lld (of %lld), status = %d\n",
+ finished_pos,
+ brick->replay_end_pos,
+ status);
+ } else {
+ XIO_ERR("replay error %d at %lld (of %lld)\n", status, finished_pos, brick->replay_end_pos);
+ }
+ brick->replay_code = status;
+ } else {
+ XIO_INF("replay stopped prematurely at %lld (of %lld)\n", finished_pos, brick->replay_end_pos);
+ brick->replay_code = 2;
+ }
+
+ for (;;) {
+ _exit_inputs(brick, true);
+ nr_flying = _nr_flying_inputs(brick);
+ if (nr_flying <= 0)
+ break;
+ XIO_INF("%d inputs are operating\n", nr_flying);
+ brick_msleep(1000);
+ }
+
+ local_trigger();
+
+ while (!brick_thread_should_stop())
+ brick_msleep(500);
+}
+
+/************************ logger thread * switching ************************/
+
+static
+int trans_logger_thread(void *data)
+{
+ struct trans_logger_output *output = data;
+ struct trans_logger_brick *brick = output->brick;
+
+ XIO_INF("........... logger has started.\n");
+
+ if (brick->replay_mode)
+ trans_logger_replay(brick);
+ else
+ trans_logger_log(brick);
+ XIO_INF("........... logger has stopped.\n");
+ xio_set_power_on_led((void *)brick, false);
+ xio_set_power_off_led((void *)brick, true);
+ return 0;
+}
+
+static
+int trans_logger_switch(struct trans_logger_brick *brick)
+{
+ static int index;
+ struct trans_logger_output *output = brick->outputs[0];
+
+ if (brick->power.button) {
+ if (!brick->thread && brick->power.off_led) {
+ xio_set_power_off_led((void *)brick, false);
+
+ brick->thread = brick_thread_create(trans_logger_thread, output, "xio_logger%d", index++);
+ if (unlikely(!brick->thread)) {
+ XIO_ERR("cannot create logger thread\n");
+ return -ENOENT;
+ }
+ }
+ } else {
+ xio_set_power_on_led((void *)brick, false);
+ if (brick->thread) {
+ XIO_INF("stopping thread...\n");
+ brick_thread_stop(brick->thread);
+ brick->thread = NULL;
+ }
+ }
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *trans_logger_statistics(struct trans_logger_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(1024);
+
+ snprintf(res, 1023,
+ "mode replay=%d continuous=%d replay_code=%d disk_io_error=%d log_reads=%d | cease_logging=%d stopped_logging=%d congested=%d | replay_start_pos = %lld replay_end_pos = %lld | new_input_nr = %d log_input_nr = %d (old = %d) inf_min_pos1 = %lld inf_max_pos1 = %lld inf_min_pos2 = %lld inf_max_pos2 = %lld | total hash_insert=%d hash_find=%d hash_extend=%d replay=%d replay_conflict=%d (%d%%) callbacks=%d reads=%d writes=%d flushes=%d (%d%%) wb_clusters=%d writebacks=%d (%d%%) shortcut=%d (%d%%) mshadow=%d sshadow=%d mshadow_buffered=%d sshadow_buffered=%d rounds=%d restarts=%d delays=%d phase0=%d phase1=%d phase2=%d phase3=%d | current #aios = %d shadow_mem_used=%ld/%lld replay_count=%d mshadow=%d/%d sshadow=%d hash_count=%d balance=%d/%d/%d/%d pos_count1=%d pos_count2=%d log_aios1=%d log_aios2=%d any_fly=%d log_fly=%d aio_flying1=%d aio_flying2=%d phase0=%d+%d <%d/%d> phase1=%d+%d <%d/%d> phase2=%d+%d <%d/%d> phase3=%d+%d <%d/%d>\n",
+ brick->replay_mode,
+ brick->continuous_replay_mode,
+ brick->replay_code,
+ brick->disk_io_error,
+ brick->log_reads,
+ brick->cease_logging,
+ brick->stopped_logging,
+ _congested(brick),
+ brick->replay_start_pos,
+ brick->replay_end_pos,
+ brick->new_input_nr,
+ brick->log_input_nr,
+ brick->old_input_nr,
+ brick->inputs[TL_INPUT_LOG1]->inf.inf_min_pos,
+ brick->inputs[TL_INPUT_LOG1]->inf.inf_max_pos,
+ brick->inputs[TL_INPUT_LOG2]->inf.inf_min_pos,
+ brick->inputs[TL_INPUT_LOG2]->inf.inf_max_pos,
+ atomic_read(&brick->total_hash_insert_count),
+ atomic_read(&brick->total_hash_find_count),
+ atomic_read(&brick->total_hash_extend_count),
+ atomic_read(&brick->total_replay_count),
+ atomic_read(&brick->total_replay_conflict_count),
+ atomic_read(&brick->total_replay_count) ? atomic_read(&brick->total_replay_conflict_count) * 100 / atomic_read(&brick->total_replay_count) : 0,
+ atomic_read(&brick->total_cb_count),
+ atomic_read(&brick->total_read_count),
+ atomic_read(&brick->total_write_count),
+ atomic_read(&brick->total_flush_count),
+ atomic_read(&brick->total_write_count) ? atomic_read(&brick->total_flush_count) * 100 / atomic_read(&brick->total_write_count) : 0,
+ atomic_read(&brick->total_writeback_cluster_count),
+ atomic_read(&brick->total_writeback_count),
+ atomic_read(&brick->total_writeback_cluster_count) ? atomic_read(&brick->total_writeback_count) * 100 / atomic_read(&brick->total_writeback_cluster_count) : 0,
+ atomic_read(&brick->total_shortcut_count),
+ atomic_read(&brick->total_writeback_count) ? atomic_read(&brick->total_shortcut_count) * 100 / atomic_read(&brick->total_writeback_count) : 0,
+ atomic_read(&brick->total_mshadow_count),
+ atomic_read(&brick->total_sshadow_count),
+ atomic_read(&brick->total_mshadow_buffered_count),
+ atomic_read(&brick->total_sshadow_buffered_count),
+ atomic_read(&brick->total_round_count),
+ atomic_read(&brick->total_restart_count),
+ atomic_read(&brick->total_delay_count),
+ atomic_read(&brick->q_phase[0].q_total),
+ atomic_read(&brick->q_phase[1].q_total),
+ atomic_read(&brick->q_phase[2].q_total),
+ atomic_read(&brick->q_phase[3].q_total),
+ atomic_read(&brick->aio_object_layout.alloc_count),
+ atomic64_read(&brick->shadow_mem_used) / 1024,
+ brick_global_memlimit,
+ atomic_read(&brick->replay_count),
+ atomic_read(&brick->mshadow_count),
+ brick->shadow_mem_limit,
+ atomic_read(&brick->sshadow_count),
+ atomic_read(&brick->hash_count),
+ atomic_read(&brick->sub_balance_count),
+ atomic_read(&brick->inner_balance_count),
+ atomic_read(&brick->outer_balance_count),
+ atomic_read(&brick->wb_balance_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->pos_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->pos_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->log_obj_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->log_obj_count),
+ atomic_read(&brick->any_fly_count),
+ atomic_read(&brick->log_fly_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->logst.aio_flying),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->logst.aio_flying),
+ atomic_read(&brick->q_phase[0].q_queued),
+ atomic_read(&brick->q_phase[0].q_flying),
+ brick->q_phase[0].pushback_count,
+ brick->q_phase[0].no_progress_count,
+ atomic_read(&brick->q_phase[1].q_queued),
+ atomic_read(&brick->q_phase[1].q_flying),
+ brick->q_phase[1].pushback_count,
+ brick->q_phase[1].no_progress_count,
+ atomic_read(&brick->q_phase[2].q_queued),
+ atomic_read(&brick->q_phase[2].q_flying),
+ brick->q_phase[2].pushback_count,
+ brick->q_phase[2].no_progress_count,
+ atomic_read(&brick->q_phase[3].q_queued),
+ atomic_read(&brick->q_phase[3].q_flying),
+ brick->q_phase[3].pushback_count,
+ brick->q_phase[3].no_progress_count);
+ return res;
+}
+
+static
+void trans_logger_reset_statistics(struct trans_logger_brick *brick)
+{
+ atomic_set(&brick->total_hash_insert_count, 0);
+ atomic_set(&brick->total_hash_find_count, 0);
+ atomic_set(&brick->total_hash_extend_count, 0);
+ atomic_set(&brick->total_replay_count, 0);
+ atomic_set(&brick->total_replay_conflict_count, 0);
+ atomic_set(&brick->total_cb_count, 0);
+ atomic_set(&brick->total_read_count, 0);
+ atomic_set(&brick->total_write_count, 0);
+ atomic_set(&brick->total_flush_count, 0);
+ atomic_set(&brick->total_writeback_count, 0);
+ atomic_set(&brick->total_writeback_cluster_count, 0);
+ atomic_set(&brick->total_shortcut_count, 0);
+ atomic_set(&brick->total_mshadow_count, 0);
+ atomic_set(&brick->total_sshadow_count, 0);
+ atomic_set(&brick->total_mshadow_buffered_count, 0);
+ atomic_set(&brick->total_sshadow_buffered_count, 0);
+ atomic_set(&brick->total_round_count, 0);
+ atomic_set(&brick->total_restart_count, 0);
+ atomic_set(&brick->total_delay_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static
+int trans_logger_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct trans_logger_aio_aspect *ini = (void *)_ini;
+
+ ini->lh.lh_pos = &ini->object->io_pos;
+ INIT_LIST_HEAD(&ini->lh.lh_head);
+ INIT_LIST_HEAD(&ini->hash_head);
+ INIT_LIST_HEAD(&ini->pos_head);
+ INIT_LIST_HEAD(&ini->replay_head);
+ INIT_LIST_HEAD(&ini->collect_head);
+ INIT_LIST_HEAD(&ini->sub_list);
+ INIT_LIST_HEAD(&ini->sub_head);
+ return 0;
+}
+
+static
+void trans_logger_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct trans_logger_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->lh.lh_head);
+ CHECK_HEAD_EMPTY(&ini->hash_head);
+ CHECK_HEAD_EMPTY(&ini->pos_head);
+ CHECK_HEAD_EMPTY(&ini->replay_head);
+ CHECK_HEAD_EMPTY(&ini->collect_head);
+ CHECK_HEAD_EMPTY(&ini->sub_list);
+ CHECK_HEAD_EMPTY(&ini->sub_head);
+ if (ini->log_input)
+ atomic_dec(&ini->log_input->log_obj_count);
+}
+
+XIO_MAKE_STATICS(trans_logger);
+
+/********************* brick constructors * destructors *******************/
+
+static
+void _free_pages(struct trans_logger_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < NR_HASH_PAGES; i++) {
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[i];
+ int j;
+
+ if (!sub_table)
+ continue;
+ for (j = 0; j < HASH_PER_PAGE; j++) {
+ struct trans_logger_hash_anchor *start = &sub_table[j];
+
+ CHECK_HEAD_EMPTY(&start->hash_anchor);
+ }
+ brick_block_free(sub_table, PAGE_SIZE);
+ }
+ brick_block_free(brick->hash_table, PAGE_SIZE);
+}
+
+static
+int trans_logger_brick_construct(struct trans_logger_brick *brick)
+{
+ int i;
+
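+ /* Two-level hash table: one page of pointers to NR_HASH_PAGES sub-pages, each holding HASH_PER_PAGE anchors. */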
+ brick->hash_table = brick_block_alloc(0, PAGE_SIZE);
+ memset(brick->hash_table, 0, PAGE_SIZE);
+
+ for (i = 0; i < NR_HASH_PAGES; i++) {
+ struct trans_logger_hash_anchor *sub_table;
+ int j;
+
+ /* this should usually be optimized away as dead code */
+ if (unlikely(i >= MAX_HASH_PAGES)) {
+ XIO_ERR("sorry, subtable index %d is too large.\n", i);
+ _free_pages(brick);
+ return -EINVAL;
+ }
+
+ sub_table = brick_block_alloc(0, PAGE_SIZE);
+ brick->hash_table[i] = sub_table;
+
+ memset(sub_table, 0, PAGE_SIZE);
+ for (j = 0; j < HASH_PER_PAGE; j++) {
+ struct trans_logger_hash_anchor *start = &sub_table[j];
+
+ init_rwsem(&start->hash_mutex);
+ INIT_LIST_HEAD(&start->hash_anchor);
+ }
+ }
+
+ atomic_set(&brick->hash_count, 0);
+ spin_lock_init(&brick->replay_lock);
+ INIT_LIST_HEAD(&brick->replay_list);
+ INIT_LIST_HEAD(&brick->group_head);
+ init_waitqueue_head(&brick->worker_event);
+ init_waitqueue_head(&brick->caller_event);
+ qq_init(&brick->q_phase[0], brick);
+ qq_init(&brick->q_phase[1], brick);
+ qq_init(&brick->q_phase[2], brick);
+ qq_init(&brick->q_phase[3], brick);
+ brick->q_phase[0].q_insert_info = "q0_ins";
+ brick->q_phase[0].q_pushback_info = "q0_push";
+ brick->q_phase[0].q_fetch_info = "q0_fetch";
+ brick->q_phase[1].q_insert_info = "q1_ins";
+ brick->q_phase[1].q_pushback_info = "q1_push";
+ brick->q_phase[1].q_fetch_info = "q1_fetch";
+ brick->q_phase[2].q_insert_info = "q2_ins";
+ brick->q_phase[2].q_pushback_info = "q2_push";
+ brick->q_phase[2].q_fetch_info = "q2_fetch";
+ brick->q_phase[3].q_insert_info = "q3_ins";
+ brick->q_phase[3].q_pushback_info = "q3_push";
+ brick->q_phase[3].q_fetch_info = "q3_fetch";
+ brick->new_input_nr = TL_INPUT_LOG1;
+ brick->log_input_nr = TL_INPUT_LOG1;
+ brick->old_input_nr = TL_INPUT_LOG1;
+ add_to_group(&global_writeback, brick);
+ return 0;
+}
+
+static
+int trans_logger_brick_destruct(struct trans_logger_brick *brick)
+{
+ _free_pages(brick);
+ CHECK_HEAD_EMPTY(&brick->replay_list);
+ remove_from_group(&global_writeback, brick);
+ return 0;
+}
+
+static
+int trans_logger_output_construct(struct trans_logger_output *output)
+{
+ return 0;
+}
+
+static
+int trans_logger_input_construct(struct trans_logger_input *input)
+{
+ INIT_LIST_HEAD(&input->pos_list);
+ sema_init(&input->inf_mutex, 1);
+ return 0;
+}
+
+static
+int trans_logger_input_destruct(struct trans_logger_input *input)
+{
+ CHECK_HEAD_EMPTY(&input->pos_list);
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct trans_logger_brick_ops trans_logger_brick_ops = {
+ .brick_switch = trans_logger_switch,
+ .brick_statistics = trans_logger_statistics,
+ .reset_statistics = trans_logger_reset_statistics,
+};
+
+static struct trans_logger_output_ops trans_logger_output_ops = {
+ .xio_get_info = trans_logger_get_info,
+ .aio_get = trans_logger_io_get,
+ .aio_put = trans_logger_io_put,
+ .aio_io = trans_logger_io_io,
+};
+
+const struct trans_logger_input_type trans_logger_input_type = {
+ .type_name = "trans_logger_input",
+ .input_size = sizeof(struct trans_logger_input),
+ .input_construct = &trans_logger_input_construct,
+ .input_destruct = &trans_logger_input_destruct,
+};
+
+static const struct trans_logger_input_type *trans_logger_input_types[] = {
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+};
+
+const struct trans_logger_output_type trans_logger_output_type = {
+ .type_name = "trans_logger_output",
+ .output_size = sizeof(struct trans_logger_output),
+ .master_ops = &trans_logger_output_ops,
+ .output_construct = &trans_logger_output_construct,
+};
+
+static const struct trans_logger_output_type *trans_logger_output_types[] = {
+ &trans_logger_output_type,
+};
+
+const struct trans_logger_brick_type trans_logger_brick_type = {
+ .type_name = "trans_logger_brick",
+ .brick_size = sizeof(struct trans_logger_brick),
+ .max_inputs = TL_INPUT_NR,
+ .max_outputs = 1,
+ .master_ops = &trans_logger_brick_ops,
+ .aspect_types = trans_logger_aspect_types,
+ .default_input_types = trans_logger_input_types,
+ .default_output_types = trans_logger_output_types,
+ .brick_construct = &trans_logger_brick_construct,
+ .brick_destruct = &trans_logger_brick_destruct,
+};
+EXPORT_SYMBOL_GPL(trans_logger_brick_type);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_trans_logger(void)
+{
+ XIO_INF("init_trans_logger()\n");
+ return trans_logger_register_brick_type();
+}
+
+void exit_xio_trans_logger(void)
+{
+ XIO_INF("exit_trans_logger()\n");
+ trans_logger_unregister_brick_type();
+}
--
2.0.0

2014-07-01 21:51:29

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 39/50] mars: add new file include/linux/mars_light/light_strategy.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/mars_light/light_strategy.h | 224 ++++++++++++++++++++++++++++++
1 file changed, 224 insertions(+)
create mode 100644 include/linux/mars_light/light_strategy.h

diff --git a/include/linux/mars_light/light_strategy.h b/include/linux/mars_light/light_strategy.h
new file mode 100644
index 0000000..b483381
--- /dev/null
+++ b/include/linux/mars_light/light_strategy.h
@@ -0,0 +1,224 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+/* OLD CODE => will disappear! */
+#ifndef _OLD_STRATEGY
+#define _OLD_STRATEGY
+
+#define _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+#include <linux/xio.h>
+
+#define MARS_ARGV_MAX 4
+
+extern loff_t global_total_space;
+extern loff_t global_remaining_space;
+
+extern int global_logrot_auto;
+extern int global_free_space_0;
+extern int global_free_space_1;
+extern int global_free_space_2;
+extern int global_free_space_3;
+extern int global_free_space_4;
+extern int global_sync_want;
+extern int global_sync_nr;
+extern int global_sync_limit;
+extern int mars_rollover_interval;
+extern int mars_scan_interval;
+extern int mars_propagate_interval;
+extern int mars_sync_flip_interval;
+extern int mars_peer_abort;
+extern int mars_emergency_mode;
+extern int mars_reset_emergency;
+extern int mars_keep_msg;
+
+extern int mars_fast_fullsync;
+
+extern char *my_id(void);
+
+#define MARS_DENT(TYPE) \
+ struct list_head dent_link; \
+ struct list_head brick_list; \
+ struct TYPE *d_parent; \
+ char *d_argv[MARS_ARGV_MAX]; /* for internal use, will be automatically deallocated */ \
+ char *d_args; /* ditto, uninterpreted */ \
+ char *d_name; /* current path component */ \
+ char *d_rest; /* some "meaningful" rest of d_name */ \
+ char *d_path; /* full absolute path */ \
+ struct say_channel *d_say_channel; /* for messages */ \
+ loff_t d_corr_A; /* logical size correction */ \
+ loff_t d_corr_B; /* logical size correction */ \
+ int d_depth; \
+ unsigned int d_type; /* from readdir() => often DT_UNKNOWN => don't rely on it, use stat_val.mode instead */\
+ int d_class; /* for pre-grouping order */ \
+ int d_serial; /* for pre-grouping order */ \
+ int d_version; /* dynamic programming per call of mars_ent_work() */\
+ int d_child_count; \
+ bool d_killme; \
+ bool d_use_channel; \
+ struct kstat stat_val; \
+ char *link_val; \
+ struct mars_global *d_global; \
+ void (*d_private_destruct)(void *private); \
+ void *d_private; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct mars_dent {
+ MARS_DENT(mars_dent);
+};
+
+extern const struct meta mars_kstat_meta[];
+extern const struct meta mars_dent_meta[];
+
+struct mars_global {
+ struct rw_semaphore dent_mutex;
+ struct rw_semaphore brick_mutex;
+ struct generic_switch global_power;
+ struct list_head dent_anchor;
+ struct list_head brick_anchor;
+
+ wait_queue_head_t main_event;
+ int global_version;
+ int deleted_my_border;
+ int deleted_border;
+ int deleted_min;
+ bool main_trigger;
+};
+
+extern void bind_to_dent(struct mars_dent *dent, struct say_channel **ch);
+
+typedef int (*mars_dent_checker_fn)(struct mars_dent *parent,
+ const char *name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel);
+
+typedef int (*mars_dent_worker_fn)(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction);
+
+extern int mars_dent_work(struct mars_global *global,
+ char *dirname,
+ int allocsize,
+
+ mars_dent_checker_fn checker,
+ mars_dent_worker_fn worker,
+ void *buf,
+ int maxdepth);
+
+extern struct mars_dent *_mars_find_dent(struct mars_global *global, const char *path);
+extern struct mars_dent *mars_find_dent(struct mars_global *global, const char *path);
+extern int mars_find_dent_all(struct mars_global *global, char *prefix, struct mars_dent ***table);
+extern void xio_kill_dent(struct mars_dent *dent);
+extern void xio_free_dent(struct mars_dent *dent);
+extern void xio_free_dent_all(struct mars_global *global, struct list_head *anchor);
+
+/* low-level brick instantiation */
+
+extern struct xio_brick *mars_find_brick(struct mars_global *global, const void *brick_type, const char *path);
+extern struct xio_brick *xio_make_brick(struct mars_global *global,
+ struct mars_dent *belongs,
+ const void *_brick_type,
+ const char *path,
+ const char *name);
+
+extern int xio_free_brick(struct xio_brick *brick);
+extern int xio_kill_brick(struct xio_brick *brick);
+extern int xio_kill_brick_all(struct mars_global *global, struct list_head *anchor, bool use_dent_link);
+extern int xio_kill_brick_when_possible(struct mars_global *global,
+ struct list_head *anchor,
+ bool use_dent_link,
+ const struct xio_brick_type *type,
+ bool even_on);
+
+/* mid-level brick instantiation (identity is based on path strings) */
+
+extern char *_vpath_make(int line, const char *fmt, va_list *args);
+extern char *_path_make(int line, const char *fmt, ...);
+extern char *_backskip_replace(int line, const char *path, char delim, bool insert, const char *fmt, ...);
+
+#define vpath_make(_fmt, _args) \
+ _vpath_make(__LINE__, _fmt, _args)
+#define path_make(_fmt, _args...) \
+ _path_make(__LINE__, _fmt, ##_args)
+#define backskip_replace(_path, _delim, _insert, _fmt, _args...) \
+ _backskip_replace(__LINE__, _path, _delim, _insert, _fmt, ##_args)
+
+extern struct xio_brick *path_find_brick(struct mars_global *global, const void *brick_type, const char *fmt, ...);
+
+/* Create a new brick and connect its inputs to a set of predecessors.
+ * When @timeout > 0, switch on the brick as well as its predecessors.
+ */
+extern struct xio_brick *make_brick_all(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ int (*setup_fn)(struct xio_brick *brick, void *private),
+ void *private,
+ const char *new_name,
+ const struct generic_brick_type *new_brick_type,
+ const struct generic_brick_type *prev_brick_type[],
+/* -1 = off, 0 = leave in current state, +1 = create when necessary, +2 = create + switch on */
+ int switch_override,
+ const char *new_fmt,
+ const char *prev_fmt[],
+ int prev_count,
+ ...
+ );
+
+/* general MARS infrastructure */
+
+/* General fs wrappers (for abstraction)
+ */
+extern int mars_stat(const char *path, struct kstat *stat, bool use_lstat);
+extern int mars_mkdir(const char *path);
+extern int mars_rmdir(const char *path);
+extern int mars_unlink(const char *path);
+extern int mars_symlink(const char *oldpath, const char *newpath, const struct timespec *stamp, uid_t uid);
+extern char *mars_readlink(const char *newpath);
+extern int mars_rename(const char *oldpath, const char *newpath);
+extern int mars_chmod(const char *path, mode_t mode);
+extern int mars_lchown(const char *path, uid_t uid);
+extern void mars_remaining_space(const char *fspath, loff_t *total, loff_t *remaining);
+
+/***********************************************************************/
+
+extern struct mars_global *mars_global;
+
+extern bool xio_check_inputs(struct xio_brick *brick);
+extern bool xio_check_outputs(struct xio_brick *brick);
+
+extern int mars_power_button(struct xio_brick *brick, bool val, bool force_off);
+
+/***********************************************************************/
+
+/* statistics */
+
+extern int global_show_statist;
+
+void show_statistics(struct mars_global *global, const char *class);
+
+/***********************************************************************/
+
+/* quirk */
+
+extern int mars_mem_percent;
+
+extern int external_checker(struct mars_dent *parent,
+ const char *_name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel);
+
+void from_remote_trigger(void);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_sy(void);
+extern void exit_sy(void);
+
+extern int init_sy_net(void);
+extern void exit_sy_net(void);
+
+#endif
--
2.0.0

2014-07-01 21:51:27

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 45/50] mars: add new file drivers/block/mars/mars_light/mars_proc.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/mars_light/mars_proc.c | 349 ++++++++++++++++++++++++++++++
1 file changed, 349 insertions(+)
create mode 100644 drivers/block/mars/mars_light/mars_proc.c

diff --git a/drivers/block/mars/mars_light/mars_proc.c b/drivers/block/mars/mars_light/mars_proc.c
new file mode 100644
index 0000000..fefa470
--- /dev/null
+++ b/drivers/block/mars/mars_light/mars_proc.c
@@ -0,0 +1,349 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/sysctl.h>
+#include <linux/uaccess.h>
+
+#include <linux/mars_light/light_strategy.h>
+#include <linux/mars_light/mars_proc.h>
+#include <linux/lib_mapfree.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_aio.h>
+#include <linux/xio/xio_if.h>
+#include <linux/xio/xio_copy.h>
+#include <linux/xio/xio_client.h>
+#include <linux/xio/xio_server.h>
+#include <linux/xio/xio_trans_logger.h>
+
+xio_info_fn xio_info = NULL;
+
+static
+int trigger_sysctl_handler(
+ ctl_table * table, /* checkpatch.pl insists on a space after "*" */
+ int write,
+ void __user *buffer,
+ size_t *length,
+ loff_t *ppos)
+{
+ ssize_t res = 0;
+ size_t len = *length;
+
+ XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+ if (!len || *ppos > 0)
+ goto done;
+
+ if (write) {
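+ /* A written value > 0 triggers a local scan; a value > 1 additionally notifies remote peers. */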
+ char tmp[8] = {};
+
+ res = len; /* fake consumption of all data */
+
+ if (len > 7)
+ len = 7;
+ if (!copy_from_user(tmp, buffer, len)) {
+ int code = 0;
+ int status = kstrtoint(tmp, 0, &code);
+
+ /* the return value from kstrtoint() does not matter */
+ (void)status;
+ if (code > 0)
+ local_trigger();
+ if (code > 1)
+ remote_trigger();
+ }
+ } else {
+ char *answer = "MARS module not operational\n";
+ char *tmp = NULL;
+ int mylen;
+
+ if (xio_info) {
+ answer = "internal error while determining xio_info\n";
+ tmp = xio_info();
+ if (tmp)
+ answer = tmp;
+ }
+
+ mylen = strlen(answer);
+ if (len > mylen)
+ len = mylen;
+ res = len;
+ if (copy_to_user(buffer, answer, len)) {
+ XIO_ERR("write %ld bytes at %p failed\n", len, buffer);
+ res = -EFAULT;
+ }
+ brick_string_free(tmp);
+ }
+
+done:
+ XIO_DBG("res = %ld\n", res);
+ *length = res;
+ if (res >= 0) {
+ *ppos += res;
+ return 0;
+ }
+ return res;
+}
+
+static
+int lamport_sysctl_handler(
+ ctl_table * table, /* checkpatch.pl insists on a space after "*" */
+ int write,
+ void __user *buffer,
+ size_t *length,
+ loff_t *ppos)
+{
+ ssize_t res = 0;
+ size_t len = *length;
+
+ XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+ if (!len || *ppos > 0)
+ goto done;
+
+ if (write) {
+ return -EINVAL;
+ } else {
+ int my_len = 128;
+ char *tmp = brick_string_alloc(my_len);
+ struct timespec know = CURRENT_TIME;
+ struct timespec lnow;
+
+ get_lamport(&lnow);
+
+ res = scnprintf(tmp, my_len,
+ "CURRENT_TIME=%ld.%09ld\nlamport_now=%ld.%09ld\n",
+ know.tv_sec, know.tv_nsec,
+ lnow.tv_sec, lnow.tv_nsec
+ );
+
+ if (copy_to_user(buffer, tmp, res)) {
+ XIO_ERR("write %ld bytes at %p failed\n", res, buffer);
+ res = -EFAULT;
+ }
+ brick_string_free(tmp);
+ }
+
+done:
+ XIO_DBG("res = %ld\n", res);
+ *length = res;
+ if (res >= 0) {
+ *ppos += res;
+ return 0;
+ }
+ return res;
+}
+
+#ifdef CTL_UNNUMBERED
+#define _CTL_NAME .ctl_name = CTL_UNNUMBERED,
+#define _CTL_STRATEGY(handler) .strategy = &handler,
+#else
+#define _CTL_NAME /*empty*/
+#define _CTL_STRATEGY(handler) /*empty*/
+#endif
+
+#define VEC_ENTRY(NAME, VAR, MODE, COUNT) \
+ { \
+ _CTL_NAME \
+ .procname = NAME, \
+ .data = &(VAR), \
+ .maxlen = sizeof(int) * (COUNT), \
+ .mode = MODE, \
+ .proc_handler = &proc_dointvec, \
+ _CTL_STRATEGY(sysctl_intvec) \
+ }
+
+#define INT_ENTRY(NAME, VAR, MODE) \
+ VEC_ENTRY(NAME, VAR, MODE, 1)
+
+/* checkpatch.pl: no, these complex values cannot be easily enclosed
+ * in parentheses. If { ... } were used inside the macro body, it would
+ * no longer be possible to add additional fields externally.
+ * I could inject further fields externally via parameters, but
+ * that would make it less understandable.
+ */
+#define LIMITER_ENTRIES(VAR, PREFIX, SUFFIX) \
+ INT_ENTRY(PREFIX "_ratelimit_" SUFFIX, (VAR)->lim_max_rate, 0600),\
+ INT_ENTRY(PREFIX "_maxdelay_ms", (VAR)->lim_max_delay, 0600), \
+ INT_ENTRY(PREFIX "_minwindow_ms", (VAR)->lim_min_window, 0600),\
+ INT_ENTRY(PREFIX "_maxwindow_ms", (VAR)->lim_max_window, 0600),\
+ INT_ENTRY(PREFIX "_cumul_" SUFFIX, (VAR)->lim_cumul, 0600), \
+ INT_ENTRY(PREFIX "_count_ops", (VAR)->lim_count, 0600), \
+ INT_ENTRY(PREFIX "_rate_" SUFFIX, (VAR)->lim_rate, 0400) \
+
+#define THRESHOLD_ENTRIES(VAR, PREFIX) \
+ INT_ENTRY(PREFIX "_threshold_us", (VAR)->thr_limit, 0600), \
+ INT_ENTRY(PREFIX "_factor_percent", (VAR)->thr_factor, 0600), \
+ INT_ENTRY(PREFIX "_plus_us", (VAR)->thr_plus, 0600), \
+ INT_ENTRY(PREFIX "_triggered", (VAR)->thr_triggered, 0400),\
+ INT_ENTRY(PREFIX "_true_hit", (VAR)->thr_true_hit, 0400) \
+
+static
+ctl_table traffic_tuning_table[] = {
+ LIMITER_ENTRIES(&client_limiter, "client_role_traffic", "kb"),
+ LIMITER_ENTRIES(&server_limiter, "server_role_traffic", "kb"),
+ {}
+};
+
+static
+ctl_table io_tuning_table[] = {
+ LIMITER_ENTRIES(&global_writeback.limiter, "writeback", "kb"),
+ INT_ENTRY("writeback_until_percent", global_writeback.until_percent, 0600),
+ THRESHOLD_ENTRIES(&bio_submit_threshold, "bio_submit"),
+ THRESHOLD_ENTRIES(&bio_io_threshold[0], "bio_io_r"),
+ THRESHOLD_ENTRIES(&bio_io_threshold[1], "bio_io_w"),
+ THRESHOLD_ENTRIES(&aio_submit_threshold, "aio_submit"),
+ THRESHOLD_ENTRIES(&aio_io_threshold[0], "aio_io_r"),
+ THRESHOLD_ENTRIES(&aio_io_threshold[1], "aio_io_w"),
+ THRESHOLD_ENTRIES(&aio_sync_threshold, "aio_sync"),
+ {}
+};
+
+static
+ctl_table tcp_tuning_table[] = {
+ INT_ENTRY("ip_tos", default_tcp_params.ip_tos, 0600),
+ INT_ENTRY("tcp_window_size", default_tcp_params.tcp_window_size, 0600),
+ INT_ENTRY("tcp_nodelay", default_tcp_params.tcp_nodelay, 0600),
+ INT_ENTRY("tcp_timeout", default_tcp_params.tcp_timeout, 0600),
+ INT_ENTRY("tcp_keepcnt", default_tcp_params.tcp_keepcnt, 0600),
+ INT_ENTRY("tcp_keepintvl", default_tcp_params.tcp_keepintvl, 0600),
+ INT_ENTRY("tcp_keepidle", default_tcp_params.tcp_keepidle, 0600),
+ {}
+};
+
+static
+ctl_table mars_table[] = {
+ {
+ _CTL_NAME
+ .procname = "trigger",
+ .mode = 0200,
+ .proc_handler = &trigger_sysctl_handler,
+ },
+ {
+ _CTL_NAME
+ .procname = "info",
+ .mode = 0400,
+ .proc_handler = &trigger_sysctl_handler,
+ },
+ {
+ _CTL_NAME
+ .procname = "lamport_clock",
+ .mode = 0400,
+ .proc_handler = &lamport_sysctl_handler,
+ },
+ INT_ENTRY("show_log_messages", brick_say_logging, 0600),
+ INT_ENTRY("show_debug_messages", brick_say_debug, 0600),
+ INT_ENTRY("show_statistics_global", global_show_statist, 0600),
+ INT_ENTRY("show_statistics_server", server_show_statist, 0600),
+ INT_ENTRY("aio_sync_mode", aio_sync_mode, 0600),
+ INT_ENTRY("logger_completion_semantics", trans_logger_completion_semantics, 0600),
+ INT_ENTRY("logger_do_crc", trans_logger_do_crc, 0600),
+ INT_ENTRY("syslog_min_class", brick_say_syslog_min, 0600),
+ INT_ENTRY("syslog_max_class", brick_say_syslog_max, 0600),
+ INT_ENTRY("syslog_flood_class", brick_say_syslog_flood_class, 0600),
+ INT_ENTRY("syslog_flood_limit", brick_say_syslog_flood_limit, 0600),
+ INT_ENTRY("syslog_flood_recovery_s", brick_say_syslog_flood_recovery, 0600),
+ INT_ENTRY("delay_say_on_overflow", delay_say_on_overflow, 0600),
+ INT_ENTRY("mapfree_period_sec", mapfree_period_sec, 0600),
+ INT_ENTRY("mapfree_grace_keep_mb", mapfree_grace_keep_mb, 0600),
+ INT_ENTRY("logger_max_interleave", trans_logger_max_interleave, 0600),
+ INT_ENTRY("logger_resume", trans_logger_resume, 0600),
+ INT_ENTRY("logger_replay_timeout_sec", trans_logger_replay_timeout, 0600),
+ INT_ENTRY("mem_limit_percent", mars_mem_percent, 0600),
+ INT_ENTRY("logger_mem_used_kb", trans_logger_mem_usage, 0400),
+ INT_ENTRY("mem_used_raw_kb", brick_global_block_used, 0400),
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ INT_ENTRY("mem_allow_freelist", brick_allow_freelist, 0600),
+ VEC_ENTRY("mem_freelist_max", brick_mem_freelist_max, 0600, BRICK_MAX_ORDER+1),
+ VEC_ENTRY("mem_alloc_count", brick_mem_alloc_count, 0400, BRICK_MAX_ORDER+1),
+ VEC_ENTRY("mem_alloc_max", brick_mem_alloc_count, 0600, BRICK_MAX_ORDER+1),
+#endif
+ INT_ENTRY("io_flying_count", xio_global_io_flying, 0400),
+ INT_ENTRY("copy_overlap", xio_copy_overlap, 0600),
+ INT_ENTRY("copy_read_prio", xio_copy_read_prio, 0600),
+ INT_ENTRY("copy_write_prio", xio_copy_write_prio, 0600),
+ INT_ENTRY("copy_read_max_fly", xio_copy_read_max_fly, 0600),
+ INT_ENTRY("copy_write_max_fly", xio_copy_write_max_fly, 0600),
+ INT_ENTRY("statusfiles_rollover_sec", mars_rollover_interval, 0600),
+ INT_ENTRY("scan_interval_sec", mars_scan_interval, 0600),
+ INT_ENTRY("propagate_interval_sec", mars_propagate_interval, 0600),
+ INT_ENTRY("sync_flip_interval_sec", mars_sync_flip_interval, 0600),
+ INT_ENTRY("peer_abort", mars_peer_abort, 0600),
+ INT_ENTRY("client_abort", xio_client_abort, 0600),
+ INT_ENTRY("do_fast_fullsync", mars_fast_fullsync, 0600),
+ INT_ENTRY("logrot_auto_gb", global_logrot_auto, 0600),
+ INT_ENTRY("remaining_space_kb", global_remaining_space, 0400),
+ INT_ENTRY("required_total_space_0_gb", global_free_space_0, 0600),
+ INT_ENTRY("required_free_space_1_gb", global_free_space_1, 0600),
+ INT_ENTRY("required_free_space_2_gb", global_free_space_2, 0600),
+ INT_ENTRY("required_free_space_3_gb", global_free_space_3, 0600),
+ INT_ENTRY("required_free_space_4_gb", global_free_space_4, 0600),
+ INT_ENTRY("sync_want", global_sync_want, 0400),
+ INT_ENTRY("sync_nr", global_sync_nr, 0400),
+ INT_ENTRY("sync_limit", global_sync_limit, 0600),
+ INT_ENTRY("mars_emergency_mode", mars_emergency_mode, 0600),
+ INT_ENTRY("mars_reset_emergency", mars_reset_emergency, 0600),
+ INT_ENTRY("mars_keep_msg_s", mars_keep_msg, 0600),
+ INT_ENTRY("write_throttle_start_percent", xio_throttle_start, 0600),
+ INT_ENTRY("write_throttle_end_percent", xio_throttle_end, 0600),
+ INT_ENTRY("write_throttle_size_threshold_kb", if_throttle_start_size, 0400),
+ LIMITER_ENTRIES(&if_throttle, "write_throttle", "kb"),
+ /* changing makes no sense because the server will immediately start upon modprobe */
+ INT_ENTRY("xio_port", xio_net_default_port, 0400),
+ INT_ENTRY("network_io_timeout", global_net_io_timeout, 0600),
+ {
+ _CTL_NAME
+ .procname = "traffic_tuning",
+ .mode = 0500,
+ .child = traffic_tuning_table,
+ },
+ {
+ _CTL_NAME
+ .procname = "io_tuning",
+ .mode = 0500,
+ .child = io_tuning_table,
+ },
+ {
+ _CTL_NAME
+ .procname = "tcp_tuning",
+ .mode = 0500,
+ .child = tcp_tuning_table,
+ },
+ {}
+};
+
+static
+ctl_table mars_root_table[] = {
+ {
+ _CTL_NAME
+ .procname = "mars",
+ .mode = 0500,
+ .child = mars_table,
+ },
+ {}
+};
+
+/***************** module init stuff ************************/
+
+static struct ctl_table_header *header;
+
+int __init init_xio_proc(void)
+{
+
+ XIO_INF("init_proc()\n");
+
+ header = register_sysctl_table(mars_root_table);
+
+ return 0;
+}
+
+void exit_xio_proc(void)
+{
+ XIO_INF("exit_proc()\n");
+ if (header) {
+ unregister_sysctl_table(header);
+ header = NULL;
+ }
+}
--
2.0.0

2014-07-01 21:52:41

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 28/50] mars: add new file drivers/block/mars/xio_bricks/xio_bio.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_bio.c | 810 ++++++++++++++++++++++++++++++++
1 file changed, 810 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_bio.c

diff --git a/drivers/block/mars/xio_bricks/xio_bio.c b/drivers/block/mars/xio_bricks/xio_bio.c
new file mode 100644
index 0000000..2fc3922
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_bio.c
@@ -0,0 +1,810 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Bio brick (interface to blkdev IO via kernel bios) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/bio.h>
+
+#include <linux/xio.h>
+#include <linux/brick/lib_timing.h>
+#include <linux/lib_mapfree.h>
+
+#include <linux/xio/xio_bio.h>
+static struct timing_stats timings[2];
+
+struct threshold bio_submit_threshold = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = BIO_SUBMIT_MAX_LATENCY,
+ .thr_factor = 100,
+ .thr_plus = 0,
+};
+EXPORT_SYMBOL_GPL(bio_submit_threshold);
+
+struct threshold bio_io_threshold[2] = {
+ [0] = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = BIO_IO_R_MAX_LATENCY,
+ .thr_factor = 10,
+ .thr_plus = 10000,
+ },
+ [1] = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = BIO_IO_W_MAX_LATENCY,
+ .thr_factor = 10,
+ .thr_plus = 10000,
+ },
+};
+EXPORT_SYMBOL_GPL(bio_io_threshold);
+
+/************************ own type definitions ***********************/
+
+/************************ own helper functions ***********************/
+
+/* This is called from the kernel bio layer.
+ */
+static
+void bio_callback(struct bio *bio, int code)
+{
+ struct bio_aio_aspect *aio_a = bio->bi_private;
+ struct bio_brick *brick;
+ unsigned long flags;
+
+ CHECK_PTR(aio_a, err);
+ CHECK_PTR(aio_a->output, err);
+ brick = aio_a->output->brick;
+ CHECK_PTR(brick, err);
+
+ aio_a->status_code = code;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_del(&aio_a->io_head);
+ list_add_tail(&aio_a->io_head, &brick->completed_list);
+ atomic_inc(&brick->completed_count);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ wake_up_interruptible(&brick->response_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle bio callback\n");
+out_return:;
+}
+
+/* Map from kernel address/length to struct page (if not already known),
+ * check alignment constraints, create bio from it.
+ * Return the length (may be smaller than requested).
+ */
+static
+int make_bio(struct bio_brick *brick,
+ void *data,
+ int len,
+ loff_t pos,
+ struct bio_aio_aspect *private,
+ struct bio **_bio)
+{
+ unsigned long long sector;
+ int sector_offset;
+ int data_offset;
+ int page_offset;
+ int page_len;
+ int bvec_count;
+ int rest_len = len;
+ int result_len = 0;
+ int status;
+ int i;
+ struct bio *bio = NULL;
+ struct block_device *bdev;
+
+ status = -EINVAL;
+ CHECK_PTR(brick, out);
+ bdev = brick->bdev;
+ CHECK_PTR(bdev, out);
+
+ if (unlikely(rest_len <= 0)) {
+ XIO_ERR("bad bio len %d\n", rest_len);
+ goto out;
+ }
+
+ sector = pos >> 9; /* TODO: make dynamic */
+ sector_offset = pos & ((1 << 9) - 1); /* TODO: make dynamic */
+ data_offset = ((unsigned long)data) & ((1 << 9) - 1); /* TODO: make dynamic */
+
+ if (unlikely(sector_offset > 0)) {
+ XIO_ERR("odd sector offset %d\n", sector_offset);
+ goto out;
+ }
+ if (unlikely(sector_offset != data_offset)) {
+ XIO_ERR("bad alignment: sector_offset %d != data_offet %d\n", sector_offset, data_offset);
+ goto out;
+ }
+ if (unlikely(rest_len & ((1 << 9) - 1))) {
+ XIO_ERR("odd length %d\n", rest_len);
+ goto out;
+ }
+
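+ /* Compute how many pages the buffer spans; bvec_max may clip the bio to fewer bytes than requested. */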
+ page_offset = ((unsigned long)data) & (PAGE_SIZE-1);
+ page_len = rest_len + page_offset;
+ bvec_count = (page_len - 1) / PAGE_SIZE + 1;
+ if (bvec_count > brick->bvec_max)
+ bvec_count = brick->bvec_max;
+
+ bio = bio_alloc(GFP_BRICK, bvec_count);
+ status = -ENOMEM;
+ if (unlikely(!bio))
+ goto out;
+
+ for (i = 0; i < bvec_count && rest_len > 0; i++) {
+ struct page *page;
+ int this_rest = PAGE_SIZE - page_offset;
+ int this_len = rest_len;
+
+ if (this_len > this_rest)
+ this_len = this_rest;
+
+ page = brick_iomap(data, &page_offset, &this_len);
+ if (unlikely(!page)) {
+ XIO_ERR("cannot iomap() kernel address %p\n", data);
+ status = -EINVAL;
+ goto out;
+ }
+
+ bio->bi_io_vec[i].bv_page = page;
+ bio->bi_io_vec[i].bv_len = this_len;
+ bio->bi_io_vec[i].bv_offset = page_offset;
+
+ data += this_len;
+ rest_len -= this_len;
+ result_len += this_len;
+ page_offset = 0;
+ }
+
+ if (unlikely(rest_len != 0)) {
+ XIO_ERR("computation of bvec_count %d was wrong, diff=%d\n", bvec_count, rest_len);
+ status = -EIO;
+ goto out;
+ }
+
+ bio->bi_vcnt = i;
+ bio->bi_iter.bi_idx = 0;
+ bio->bi_iter.bi_size = result_len;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_bdev = bdev;
+ bio->bi_private = private;
+ bio->bi_end_io = bio_callback;
+ bio->bi_rw = 0; /* must be filled in later */
+ status = result_len;
+
+out:
+ if (unlikely(status < 0)) {
+ XIO_ERR("error %d\n", status);
+ if (bio) {
+ bio_put(bio);
+ bio = NULL;
+ }
+ }
+ *_bio = bio;
+ return status;
+}
+
+/***************** own brick * input * output operations *****************/
+
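+/* Map io_prio (assumed range XIO_PRIO_HIGH..XIO_PRIO_LOW = -1..1) to array index 0..2 */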
+#define PRIO_INDEX(aio) ((aio)->io_prio + 1)
+
+static int bio_get_info(struct bio_output *output, struct xio_info *info)
+{
+ struct bio_brick *brick = output->brick;
+ struct inode *inode;
+ int status = -ENOENT;
+
+ if (unlikely(!brick->mf ||
+ !brick->mf->mf_filp ||
+ !brick->mf->mf_filp->f_mapping)) {
+ goto done;
+ }
+ inode = brick->mf->mf_filp->f_mapping->host;
+ if (unlikely(!inode))
+ goto done;
+
+ info->tf_align = 512;
+ info->tf_min_size = 512;
+ brick->total_size = i_size_read(inode);
+ info->current_size = brick->total_size;
+ XIO_DBG("determined device size = %lld\n", info->current_size);
+ status = 0;
+
+done:
+ return status;
+}
+
+static int bio_io_get(struct bio_output *output, struct aio_object *aio)
+{
+ struct bio_aio_aspect *aio_a;
+ int status = -EINVAL;
+
+ CHECK_PTR(output, done);
+ CHECK_PTR(output->brick, done);
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ aio_a = bio_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, done);
+ aio_a->output = output;
+ aio_a->bio = NULL;
+
+ if (!aio->io_data) { /* buffered IO. */
+ status = -ENOMEM;
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+ aio_a->do_dealloc = true;
+ }
+
+ status = make_bio(output->brick, aio->io_data, aio->io_len, aio->io_pos, aio_a, &aio_a->bio);
+ if (unlikely(status < 0 || !aio_a->bio)) {
+ XIO_ERR("could not create bio, status = %d\n", status);
+ goto done;
+ }
+
+ if (unlikely(aio->io_prio < XIO_PRIO_HIGH))
+ aio->io_prio = XIO_PRIO_HIGH;
+ else if (unlikely(aio->io_prio > XIO_PRIO_LOW))
+ aio->io_prio = XIO_PRIO_LOW;
+
+ aio->io_len = status;
+ obj_get_first(aio);
+ status = 0;
+
+done:
+ return status;
+}
+
+static
+void _bio_io_put(struct bio_output *output, struct aio_object *aio)
+{
+ struct bio_aio_aspect *aio_a;
+
+ aio->io_total_size = output->brick->total_size;
+
+ aio_a = bio_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, err);
+
+ if (likely(aio_a->bio)) {
+ bio_put(aio_a->bio);
+ aio_a->bio = NULL;
+ }
+ if (aio_a->do_dealloc) {
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ aio->io_data = NULL;
+ }
+ obj_free(aio);
+
+ goto out_return;
+err:
+ XIO_FAT("cannot work\n");
+out_return:;
+}
+
+#define BIO_AIO_PUT(output, aio) \
+ ({ \
+ if (obj_put(aio)) { \
+ _bio_io_put(output, aio); \
+ } \
+ })
+
+static
+void bio_io_put(struct bio_output *output, struct aio_object *aio)
+{
+ BIO_AIO_PUT(output, aio);
+}
+
+static
+void _bio_io_io(struct bio_output *output, struct aio_object *aio, bool cork)
+{
+ struct bio_brick *brick = output->brick;
+ struct bio_aio_aspect *aio_a = bio_aio_get_aspect(output->brick, aio);
+ struct bio *bio;
+ unsigned long long latency;
+ unsigned long flags;
+ int rw;
+ int status = -EINVAL;
+
+ CHECK_PTR(aio_a, err);
+ bio = aio_a->bio;
+ CHECK_PTR(bio, err);
+
+ obj_get(aio);
+ atomic_inc(&brick->fly_count[PRIO_INDEX(aio)]);
+
+ bio_get(bio);
+
+ rw = aio->io_rw & 1;
+ if (brick->do_noidle && !cork)
+ rw |= REQ_NOIDLE;
+ if (!aio->io_skip_sync) {
+ if (brick->do_sync)
+ rw |= REQ_SYNC;
+ }
+
+ aio_a->start_stamp = cpu_clock(raw_smp_processor_id());
+ spin_lock_irqsave(&brick->lock, flags);
+ list_add_tail(&aio_a->io_head, &brick->submitted_list[rw & 1]);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ bio->bi_rw = rw;
+ latency = TIME_STATS(
+ &timings[rw & 1],
+ submit_bio(rw, bio)
+ );
+
+ threshold_check(&bio_submit_threshold, latency);
+
+ status = 0;
+ if (unlikely(bio_flagged(bio, BIO_EOPNOTSUPP)))
+ status = -EOPNOTSUPP;
+
+ if (likely(status >= 0))
+ goto done;
+
+ bio_put(bio);
+ atomic_dec(&brick->fly_count[PRIO_INDEX(aio)]);
+
+err:
+ XIO_ERR("IO error %d\n", status);
+ CHECKED_CALLBACK(aio, status, done);
+ atomic_dec(&xio_global_io_flying);
+
+done:;
+}
+
+static
+void bio_io_io(struct bio_output *output, struct aio_object *aio)
+{
+ CHECK_PTR(aio, fatal);
+
+ obj_get(aio);
+ atomic_inc(&xio_global_io_flying);
+
+ if (aio->io_prio == XIO_PRIO_LOW ||
+ (aio->io_prio == XIO_PRIO_NORMAL && aio->io_rw)) {
+ struct bio_aio_aspect *aio_a = bio_aio_get_aspect(output->brick, aio);
+ struct bio_brick *brick = output->brick;
+ unsigned long flags;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_add_tail(&aio_a->io_head, &brick->queue_list[PRIO_INDEX(aio)]);
+ atomic_inc(&brick->queue_count[PRIO_INDEX(aio)]);
+ spin_unlock_irqrestore(&brick->lock, flags);
+ brick->submitted = true;
+
+ wake_up_interruptible(&brick->submit_event);
+ goto out_return;
+ }
+
+ /* realtime IO: start immediately */
+ _bio_io_io(output, aio, false);
+ BIO_AIO_PUT(output, aio);
+ goto out_return;
+fatal:
+ XIO_FAT("cannot handle aio %p on output %p\n", aio, output);
+out_return:;
+}
+
+static
+int bio_response_thread(void *data)
+{
+ struct bio_brick *brick = data;
+
+ XIO_INF("bio response thread has started on '%s'.\n", brick->brick_path);
+
+ for (;;) {
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+ int thr_limit;
+ int sleeptime;
+ int count;
+ int i;
+
+ thr_limit = bio_io_threshold[0].thr_limit;
+ if (bio_io_threshold[1].thr_limit < thr_limit)
+ thr_limit = bio_io_threshold[1].thr_limit;
+
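+ /* thr_limit is in microseconds; poll at least twice per threshold period so overdue requests are noticed early. */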
+ sleeptime = HZ / 10;
+ if (thr_limit > 0) {
+ sleeptime = thr_limit / (1000000 * 2 / HZ);
+ if (unlikely(sleeptime < 2))
+ sleeptime = 2;
+ }
+
+ wait_event_interruptible_timeout(
+ brick->response_event,
+ atomic_read(&brick->completed_count) > 0,
+ sleeptime);
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_replace_init(&brick->completed_list, &tmp_list);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ count = 0;
+ for (;;) {
+ struct list_head *tmp;
+ struct bio_aio_aspect *aio_a;
+ struct aio_object *aio;
+ unsigned long long latency;
+ int code;
+
+ if (list_empty(&tmp_list)) {
+ if (brick_thread_should_stop())
+ goto done;
+ break;
+ }
+
+ tmp = tmp_list.next;
+ list_del_init(tmp);
+ atomic_dec(&brick->completed_count);
+
+ aio_a = container_of(tmp, struct bio_aio_aspect, io_head);
+ aio = aio_a->object;
+
+ latency = cpu_clock(raw_smp_processor_id()) - aio_a->start_stamp;
+ threshold_check(&bio_io_threshold[aio->io_rw & 1], latency);
+
+ code = aio_a->status_code;
+
+ if (code < 0) {
+ XIO_ERR("IO error %d\n", code);
+ } else {
+ aio_checksum(aio);
+ aio->io_flags |= AIO_UPTODATE;
+ }
+
+ SIMPLE_CALLBACK(aio, code);
+
+ atomic_dec(&brick->fly_count[PRIO_INDEX(aio)]);
+ atomic_inc(&brick->total_completed_count[PRIO_INDEX(aio)]);
+ count++;
+
+ if (likely(aio_a->bio))
+ bio_put(aio_a->bio);
+ BIO_AIO_PUT(aio_a->output, aio);
+
+ atomic_dec(&xio_global_io_flying);
+ }
+
+ /* Try to detect slow requests as early as possible,
+ * even before they have completed.
+ */
+ for (i = 0; i < 2; i++) {
+ unsigned long long eldest = 0;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ if (!list_empty(&brick->submitted_list[i])) {
+ struct bio_aio_aspect *aio_a;
+
+ aio_a = container_of(brick->submitted_list[i].next, struct bio_aio_aspect, io_head);
+ eldest = aio_a->start_stamp;
+ }
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ if (eldest)
+ threshold_check(&bio_io_threshold[i], cpu_clock(raw_smp_processor_id()) - eldest);
+ }
+
+ if (count) {
+ brick->submitted = true;
+ wake_up_interruptible(&brick->submit_event);
+ }
+ }
+done:
+ XIO_INF("bio response thread has stopped.\n");
+ return 0;
+}
+
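+/* Admit background (lowest-priority) requests only while foreground flight stays within bg_threshold and at most bg_maxfly background requests are in flight. */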
+static
+bool _bg_should_run(struct bio_brick *brick)
+{
+ return (atomic_read(&brick->queue_count[2]) > 0 &&
+ atomic_read(&brick->fly_count[0]) + atomic_read(&brick->fly_count[1]) <= brick->bg_threshold &&
+ (brick->bg_maxfly <= 0 || atomic_read(&brick->fly_count[2]) < brick->bg_maxfly));
+}
+
+static
+int bio_submit_thread(void *data)
+{
+ struct bio_brick *brick = data;
+
+ XIO_INF("bio submit thread has started on '%s'.\n", brick->brick_path);
+
+ while (!brick_thread_should_stop()) {
+ int prio;
+
+ wait_event_interruptible_timeout(
+ brick->submit_event,
+ brick->submitted,
+ HZ / 2);
+
+ brick->submitted = false;
+
+ for (prio = 0; prio < XIO_PRIO_NR; prio++) {
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+
+ if (prio == XIO_PRIO_NR-1 && !_bg_should_run(brick))
+ break;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_replace_init(&brick->queue_list[prio], &tmp_list);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ while (!list_empty(&tmp_list)) {
+ struct list_head *tmp = tmp_list.next;
+ struct bio_aio_aspect *aio_a;
+ struct aio_object *aio;
+ bool cork;
+
+ list_del_init(tmp);
+
+ aio_a = container_of(tmp, struct bio_aio_aspect, io_head);
+ aio = aio_a->object;
+ if (unlikely(!aio)) {
+ XIO_ERR("invalid aio\n");
+ continue;
+ }
+
+ atomic_dec(&brick->queue_count[PRIO_INDEX(aio)]);
+ cork = atomic_read(&brick->queue_count[PRIO_INDEX(aio)]) > 0;
+
+ _bio_io_io(aio_a->output, aio, cork);
+
+ BIO_AIO_PUT(aio_a->output, aio);
+ }
+ }
+ }
+
+ XIO_INF("bio submit thread has stopped.\n");
+ return 0;
+}
+
+static int bio_switch(struct bio_brick *brick)
+{
+ int status = 0;
+
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+
+ xio_set_power_off_led((void *)brick, false);
+
+ if (!brick->bdev) {
+ static int index;
+ const char *path = brick->brick_path;
+ int flags = O_RDWR | O_EXCL | O_LARGEFILE;
+ struct address_space *mapping;
+ struct inode *inode = NULL;
+ struct request_queue *q;
+
+ brick->mf = mapfree_get(path, flags);
+ if (unlikely(!brick->mf || !brick->mf->mf_filp)) {
+ status = -ENOENT;
+ XIO_ERR("cannot open file '%s'\n", path);
+ goto done;
+ }
+ mapping = brick->mf->mf_filp->f_mapping;
+ if (likely(mapping))
+ inode = mapping->host;
+ if (unlikely(!mapping || !inode)) {
+ XIO_ERR("internal problem with '%s'\n", path);
+ status = -EINVAL;
+ goto done;
+ }
+ if (unlikely(!S_ISBLK(inode->i_mode) || !inode->i_bdev)) {
+ XIO_ERR("sorry, '%s' is not a block device\n", path);
+ status = -ENODEV;
+ goto done;
+ }
+
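+ /* Prevent recursion into FS/IO paths when allocating pages for this mapping. */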
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ q = bdev_get_queue(inode->i_bdev);
+ if (unlikely(!q)) {
+ XIO_ERR("internal queue '%s' does not exist\n", path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ XIO_INF("'%s' ra_pages OLD=%lu NEW=%d\n",
+ path,
+ q->backing_dev_info.ra_pages,
+ brick->ra_pages);
+ q->backing_dev_info.ra_pages = brick->ra_pages;
+
+ brick->bvec_max = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
+ brick->total_size = i_size_read(inode);
+
+ brick->response_thread = brick_thread_create(bio_response_thread,
+ brick,
+ "xio_bio_r%d",
+ index);
+ brick->submit_thread = brick_thread_create(bio_submit_thread, brick, "xio_bio_s%d", index);
+ status = -ENOMEM;
+ if (likely(brick->submit_thread && brick->response_thread)) {
+ brick->bdev = inode->i_bdev;
+ index++;
+ status = 0;
+ }
+ }
+ }
+
+ xio_set_power_on_led((void *)brick, brick->power.button && brick->bdev != NULL);
+
+done:
+ if (status < 0 || !brick->power.button) {
+ if (brick->mf) {
+ mapfree_put(brick->mf);
+ brick->mf = NULL;
+ }
+ if (brick->submit_thread) {
+ brick_thread_stop(brick->submit_thread);
+ brick->submit_thread = NULL;
+ }
+ if (brick->response_thread)
+ brick_thread_stop(brick->response_thread);
+ brick->bdev = NULL;
+ if (!brick->power.button) {
+ xio_set_power_off_led((void *)brick, true);
+ brick->total_size = 0;
+ }
+ }
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static noinline
+char *bio_statistics(struct bio_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(4096);
+ int pos = 0;
+
+ pos += report_timing(&timings[0], res + pos, 4096 - pos);
+ pos += report_timing(&timings[1], res + pos, 4096 - pos);
+
+ snprintf(res + pos, 4096 - pos,
+ "total completed[0] = %d completed[1] = %d completed[2] = %d | queued[0] = %d queued[1] = %d queued[2] = %d flying[0] = %d flying[1] = %d flying[2] = %d completing = %d\n",
+ atomic_read(&brick->total_completed_count[0]),
+ atomic_read(&brick->total_completed_count[1]),
+ atomic_read(&brick->total_completed_count[2]),
+ atomic_read(&brick->fly_count[0]),
+ atomic_read(&brick->queue_count[0]),
+ atomic_read(&brick->queue_count[1]),
+ atomic_read(&brick->queue_count[2]),
+ atomic_read(&brick->fly_count[1]),
+ atomic_read(&brick->fly_count[2]),
+ atomic_read(&brick->completed_count));
+
+ return res;
+}
+
+static noinline
+void bio_reset_statistics(struct bio_brick *brick)
+{
+ atomic_set(&brick->total_completed_count[0], 0);
+ atomic_set(&brick->total_completed_count[1], 0);
+ atomic_set(&brick->total_completed_count[2], 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int bio_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct bio_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ return 0;
+}
+
+static void bio_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct bio_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+}
+
+XIO_MAKE_STATICS(bio);
+
+/********************* brick constructors * destructors *******************/
+
+static int bio_brick_construct(struct bio_brick *brick)
+{
+ spin_lock_init(&brick->lock);
+ INIT_LIST_HEAD(&brick->queue_list[0]);
+ INIT_LIST_HEAD(&brick->queue_list[1]);
+ INIT_LIST_HEAD(&brick->queue_list[2]);
+ INIT_LIST_HEAD(&brick->submitted_list[0]);
+ INIT_LIST_HEAD(&brick->submitted_list[1]);
+ INIT_LIST_HEAD(&brick->completed_list);
+ init_waitqueue_head(&brick->submit_event);
+ init_waitqueue_head(&brick->response_event);
+ return 0;
+}
+
+static int bio_brick_destruct(struct bio_brick *brick)
+{
+ return 0;
+}
+
+static int bio_output_construct(struct bio_output *output)
+{
+ return 0;
+}
+
+static int bio_output_destruct(struct bio_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct bio_brick_ops bio_brick_ops = {
+ .brick_switch = bio_switch,
+ .brick_statistics = bio_statistics,
+ .reset_statistics = bio_reset_statistics,
+};
+
+static struct bio_output_ops bio_output_ops = {
+ .xio_get_info = bio_get_info,
+ .aio_get = bio_io_get,
+ .aio_put = bio_io_put,
+ .aio_io = bio_io_io,
+};
+
+const struct bio_input_type bio_input_type = {
+ .type_name = "bio_input",
+ .input_size = sizeof(struct bio_input),
+};
+
+static const struct bio_input_type *bio_input_types[] = {
+ &bio_input_type,
+};
+
+const struct bio_output_type bio_output_type = {
+ .type_name = "bio_output",
+ .output_size = sizeof(struct bio_output),
+ .master_ops = &bio_output_ops,
+ .output_construct = &bio_output_construct,
+ .output_destruct = &bio_output_destruct,
+};
+
+static const struct bio_output_type *bio_output_types[] = {
+ &bio_output_type,
+};
+
+const struct bio_brick_type bio_brick_type = {
+ .type_name = "bio_brick",
+ .brick_size = sizeof(struct bio_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &bio_brick_ops,
+ .aspect_types = bio_aspect_types,
+ .default_input_types = bio_input_types,
+ .default_output_types = bio_output_types,
+ .brick_construct = &bio_brick_construct,
+ .brick_destruct = &bio_brick_destruct,
+};
+EXPORT_SYMBOL_GPL(bio_brick_type);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_bio(void)
+{
+ XIO_INF("init_bio()\n");
+ _bio_brick_type = (void *)&bio_brick_type;
+ return bio_register_brick_type();
+}
+
+void exit_xio_bio(void)
+{
+ XIO_INF("exit_bio()\n");
+ bio_unregister_brick_type();
+}
--
2.0.0

2014-07-01 21:52:40

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 41/50] mars: add new file drivers/block/mars/mars_light/light_net.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/mars_light/light_net.c | 99 +++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
create mode 100644 drivers/block/mars/mars_light/light_net.c

diff --git a/drivers/block/mars/mars_light/light_net.c b/drivers/block/mars/mars_light/light_net.c
new file mode 100644
index 0000000..170e51b
--- /dev/null
+++ b/drivers/block/mars/mars_light/light_net.c
@@ -0,0 +1,99 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/mars_light/light_strategy.h>
+#include <linux/xio_net.h>
+
+static
+char *_xio_translate_hostname(const char *name)
+{
+ struct mars_global *global = mars_global;
+ char *res = brick_strdup(name);
+ struct mars_dent *test;
+ char *tmp;
+
+ if (unlikely(!global))
+ goto done;
+
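+ /* Strip an optional ":port" suffix from the peer name. */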
+ for (tmp = res; *tmp; tmp++) {
+ if (*tmp == ':') {
+ *tmp = '\0';
+ break;
+ }
+ }
+
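+ /* Consult the /mars/ips/ip-<host> symlink; when present, its target overrides the plain hostname. */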
+ tmp = path_make("/mars/ips/ip-%s", res);
+ if (unlikely(!tmp))
+ goto done;
+
+ test = mars_find_dent(global, tmp);
+ if (test && test->link_val) {
+ XIO_DBG("'%s' => '%s'\n", tmp, test->link_val);
+ brick_string_free(res);
+ res = brick_strdup(test->link_val);
+ } else {
+ XIO_DBG("no translation for '%s'\n", tmp);
+ }
+ brick_string_free(tmp);
+
+done:
+ return res;
+}
+
+int xio_send_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+ struct list_head *tmp;
+ struct mars_dent *dent;
+ int status = 0;
+
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ dent = container_of(tmp, struct mars_dent, dent_link);
+ status = xio_send_struct(sock, dent, mars_dent_meta);
+ if (status < 0)
+ break;
+ }
+ if (status >= 0) { /* send EOR */
+ status = xio_send_struct(sock, NULL, mars_dent_meta);
+ }
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_send_dent_list);
+
+int xio_recv_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+ int status;
+
+ for (;;) {
+ struct mars_dent *dent = brick_zmem_alloc(sizeof(struct mars_dent));
+
+ INIT_LIST_HEAD(&dent->dent_link);
+ INIT_LIST_HEAD(&dent->brick_list);
+
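+ /* A non-positive status signals either the EOR record (see xio_send_dent_list()) or an error. */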
+ status = xio_recv_struct(sock, dent, mars_dent_meta);
+ if (status <= 0) {
+ xio_free_dent(dent);
+ goto done;
+ }
+ list_add_tail(&dent->dent_link, anchor);
+ }
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_recv_dent_list);
+
+/***************** module init stuff ************************/
+
+int __init init_sy_net(void)
+{
+ XIO_INF("init_sy_net()\n");
+ xio_translate_hostname = _xio_translate_hostname;
+ return 0;
+}
+
+void exit_sy_net(void)
+{
+ XIO_INF("exit_sy_net()\n");
+}
--
2.0.0

2014-07-01 21:52:38

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 50/50] mars: activate MARS in drivers/block/mars/

From: Thomas Schoebel-Theuer <[email protected]>

---
drivers/block/Kconfig | 2 ++
drivers/block/Makefile | 1 +
2 files changed, 3 insertions(+)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..8646956 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -283,6 +283,8 @@ config BLK_DEV_CRYPTOLOOP

source "drivers/block/drbd/Kconfig"

+source "drivers/block/mars/Kconfig"
+
config BLK_DEV_NBD
tristate "Network block device support"
depends on NET
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..b0f1e81 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_BLK_DEV_HD) += hd.o
obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_XEN_BLKDEV_BACKEND) += xen-blkback/
obj-$(CONFIG_BLK_DEV_DRBD) += drbd/
+obj-$(CONFIG_MARS) += mars/
obj-$(CONFIG_BLK_DEV_RBD) += rbd.o
obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/

--
2.0.0

2014-07-01 21:52:37

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 36/50] mars: add new file drivers/block/mars/xio_bricks/xio_copy.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_copy.c | 976 +++++++++++++++++++++++++++++++
1 file changed, 976 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_copy.c

diff --git a/drivers/block/mars/xio_bricks/xio_copy.c b/drivers/block/mars/xio_bricks/xio_copy.c
new file mode 100644
index 0000000..b5a5001
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_copy.c
@@ -0,0 +1,976 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Copy brick (just for demonstration) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/xio.h>
+#include <linux/brick/lib_limiter.h>
+
+#ifndef READ
+#define READ 0
+#define WRITE 1
+#endif
+
+#define COPY_CHUNK (PAGE_SIZE)
+#define NR_COPY_REQUESTS (32 * 1024 * 1024 / COPY_CHUNK)
+
+#define STATES_PER_PAGE (PAGE_SIZE / sizeof(struct copy_state))
+#define MAX_SUB_TABLES (NR_COPY_REQUESTS / STATES_PER_PAGE + (NR_COPY_REQUESTS % STATES_PER_PAGE ? 1 : 0)\
+)
+#define MAX_COPY_REQUESTS (PAGE_SIZE / sizeof(struct copy_state *) * STATES_PER_PAGE)
+
+#define GET_STATE(brick, index) \
+ ((brick)->st[(index) / STATES_PER_PAGE][(index) % STATES_PER_PAGE])
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_copy.h>
+
+int xio_copy_overlap = 1;
+EXPORT_SYMBOL_GPL(xio_copy_overlap);
+
+int xio_copy_read_prio = XIO_PRIO_NORMAL;
+EXPORT_SYMBOL_GPL(xio_copy_read_prio);
+
+int xio_copy_write_prio = XIO_PRIO_NORMAL;
+EXPORT_SYMBOL_GPL(xio_copy_write_prio);
+
+int xio_copy_read_max_fly = 0;
+EXPORT_SYMBOL_GPL(xio_copy_read_max_fly);
+
+int xio_copy_write_max_fly = 0;
+EXPORT_SYMBOL_GPL(xio_copy_write_max_fly);
+
+#define is_read_limited(brick) \
+ (xio_copy_read_max_fly > 0 && atomic_read(&(brick)->copy_read_flight) >= xio_copy_read_max_fly)
+
+#define is_write_limited(brick) \
+ (xio_copy_write_max_fly > 0 && atomic_read(&(brick)->copy_write_flight) >= xio_copy_write_max_fly)
+
+/************************ own helper functions ***********************/
+
+/* TODO:
+ * The clash logic is untested / in alpha stage (Feb. 2011).
+ *
+ * For now, the output is never used, so this cannot do harm.
+ *
+ * In order to get the output really working / enterprise grade,
+ * some larger test effort should be invested.
+ */
+static inline
+void _clash(struct copy_brick *brick)
+{
+ brick->trigger = true;
+ set_bit(0, &brick->clash);
+ atomic_inc(&brick->total_clash_count);
+ wake_up_interruptible(&brick->event);
+}
+
+static inline
+int _clear_clash(struct copy_brick *brick)
+{
+ int old;
+
+ old = test_and_clear_bit(0, &brick->clash);
+ return old;
+}
+
+/* Current semantics:
+ *
+ * All writes are always going to the original input A. They are _not_
+ * replicated to B.
+ *
+ * In order to get B really up to date, you have to replay the right
+ * transaction logs there (at the right time).
+ * [If you had no writes on A at all during the copy, of course
+ * this is not necessary]
+ *
+ * When utilize_mode is on, reads can utilize the already copied
+ * region from B, but only as long as this region has not been
+ * invalidated by writes (indicated by low_dirty).
+ *
+ * TODO: implement replicated writes, together with transaction
+ * replay logic that applies the logs _only_ after crashes which
+ * left the replicas inconsistent due to partially replicated writes.
+ */
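+/* Illustration of the resulting routing (derived from the code below):
+ * with utilize_mode set and no dirty writes, a read lying entirely below
+ * copy_start (the already-copied region) is served from input B; all
+ * writes, reads overlapping the copy window, and everything after
+ * low_dirty has been set go to input A.
+ */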
+static
+int _determine_input(struct copy_brick *brick, struct aio_object *aio)
+{
+ int rw;
+ int below;
+ int behind;
+ loff_t io_end;
+
+ if (!brick->utilize_mode || brick->low_dirty)
+ return INPUT_A_IO;
+
+ io_end = aio->io_pos + aio->io_len;
+ below = io_end <= brick->copy_start;
+ behind = !brick->copy_end || aio->io_pos >= brick->copy_end;
+ rw = aio->io_may_write | aio->io_rw;
+ if (rw) {
+ if (!behind) {
+ brick->low_dirty = true;
+ if (!below) {
+ _clash(brick);
+ wake_up_interruptible(&brick->event);
+ }
+ }
+ return INPUT_A_IO;
+ }
+
+ if (below)
+ return INPUT_B_IO;
+
+ return INPUT_A_IO;
+}
+
+#define GET_INDEX(pos) (((pos) / COPY_CHUNK) % NR_COPY_REQUESTS)
+#define GET_OFFSET(pos) ((pos) % COPY_CHUNK)
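+
+/* Worked example (assuming PAGE_SIZE == 4096, hence COPY_CHUNK == 4096
+ * and NR_COPY_REQUESTS == 8192): pos = 10000 yields
+ * GET_INDEX(10000) == (10000 / 4096) % 8192 == 2 and
+ * GET_OFFSET(10000) == 10000 % 4096 == 1808.
+ * GET_STATE(brick, 2) then picks slot 2 % STATES_PER_PAGE within
+ * sub-table 2 / STATES_PER_PAGE of the two-level state table.
+ */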
+
+static
+void __clear_aio(struct copy_brick *brick, struct aio_object *aio, int queue)
+{
+ struct copy_input *input;
+
+ input = queue ? brick->inputs[INPUT_B_COPY] : brick->inputs[INPUT_A_COPY];
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+}
+
+static
+void _clear_aio(struct copy_brick *brick, int index, int queue)
+{
+ struct copy_state *st = &GET_STATE(brick, index);
+ struct aio_object *aio = st->table[queue];
+
+ if (aio) {
+ if (unlikely(st->active[queue])) {
+ XIO_ERR("clearing active aio, index = %d queue = %d\n", index, queue);
+ st->active[queue] = false;
+ }
+ __clear_aio(brick, aio, queue);
+ st->table[queue] = NULL;
+ }
+}
+
+static
+void _clear_all_aio(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < NR_COPY_REQUESTS; i++) {
+ GET_STATE(brick, i).state = COPY_STATE_START;
+ _clear_aio(brick, i, 0);
+ _clear_aio(brick, i, 1);
+ }
+}
+
+static
+void _clear_state_table(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table = brick->st[i];
+
+ memset(sub_table, 0, PAGE_SIZE);
+ }
+}
+
+static
+void copy_endio(struct generic_callback *cb)
+{
+ struct copy_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct copy_brick *brick;
+ struct copy_state *st;
+ int index;
+ int queue;
+ int error = 0;
+
+ LAST_CALLBACK(cb);
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+ brick = aio_a->brick;
+ CHECK_PTR(brick, err);
+
+ queue = aio_a->queue;
+ index = GET_INDEX(aio->io_pos);
+ st = &GET_STATE(brick, index);
+
+ if (unlikely(queue < 0 || queue >= 2)) {
+ XIO_ERR("bad queue %d\n", queue);
+ error = -EINVAL;
+ goto exit;
+ }
+ st->active[queue] = false;
+ if (unlikely(st->table[queue])) {
+ XIO_ERR("table corruption at %d %d (%p => %p)\n", index, queue, st->table[queue], aio);
+ error = -EEXIST;
+ goto exit;
+ }
+ if (unlikely(cb->cb_error < 0)) {
+ error = cb->cb_error;
+ __clear_aio(brick, aio, queue);
+ /* This is racy, but does no harm.
+ * Worst case just produces more error output.
+ */
+ if (!brick->copy_error_count++)
+ XIO_WRN("IO error %d on index %d, old state = %d\n", cb->cb_error, index, st->state);
+ } else {
+ if (unlikely(st->table[queue])) {
+ XIO_ERR("overwriting index %d, state = %d\n", index, st->state);
+ _clear_aio(brick, index, queue);
+ }
+ st->table[queue] = aio;
+ }
+
+exit:
+ if (unlikely(error < 0)) {
+ st->error = error;
+ _clash(brick);
+ }
+ if (aio->io_rw)
+ atomic_dec(&brick->copy_write_flight);
+ else
+ atomic_dec(&brick->copy_read_flight);
+ brick->trigger = true;
+ wake_up_interruptible(&brick->event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback\n");
+out_return:;
+}
+
+static
+int _make_aio(struct copy_brick *brick,
+ int index,
+ int queue,
+ void *data,
+ loff_t pos,
+ loff_t end_pos,
+ int rw,
+ int cs_mode)
+{
+ struct aio_object *aio;
+ struct copy_aio_aspect *aio_a;
+ struct copy_input *input;
+ int offset;
+ int len;
+ int status = -EAGAIN;
+
+ if (brick->clash || end_pos <= 0)
+ goto done;
+
+ aio = copy_alloc_aio(brick);
+ status = -ENOMEM;
+
+ aio_a = copy_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a)) {
+ XIO_FAT("cannot get own apsect\n");
+ goto done;
+ }
+
+ aio_a->brick = brick;
+ aio_a->queue = queue;
+ aio->io_may_write = rw;
+ aio->io_rw = rw;
+ aio->io_data = data;
+ aio->io_pos = pos;
+ aio->io_cs_mode = cs_mode;
+ offset = GET_OFFSET(pos);
+ len = COPY_CHUNK - offset;
+ if (pos + len > end_pos)
+ len = end_pos - pos;
+ aio->io_len = len;
+ aio->io_prio = rw ?
+ xio_copy_write_prio :
+ xio_copy_read_prio;
+ if (aio->io_prio < XIO_PRIO_HIGH || aio->io_prio > XIO_PRIO_LOW)
+ aio->io_prio = brick->io_prio;
+
+ SETUP_CALLBACK(aio, copy_endio, aio_a);
+
+ input = queue ? brick->inputs[INPUT_B_COPY] : brick->inputs[INPUT_A_COPY];
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_ERR("status = %d\n", status);
+ obj_free(aio);
+ goto done;
+ }
+ if (unlikely(aio->io_len < len))
+ XIO_DBG("shorten len %d < %d\n", aio->io_len, len);
+ if (queue == 0) {
+ GET_STATE(brick, index).len = aio->io_len;
+ } else if (unlikely(aio->io_len < GET_STATE(brick, index).len)) {
+ XIO_DBG("shorten len %d < %d at index %d\n", aio->io_len, GET_STATE(brick, index).len, index);
+ GET_STATE(brick, index).len = aio->io_len;
+ }
+
+ GET_STATE(brick, index).active[queue] = true;
+ if (rw)
+ atomic_inc(&brick->copy_write_flight);
+ else
+ atomic_inc(&brick->copy_read_flight);
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+
+done:
+ return status;
+}
+
+static
+void _update_percent(struct copy_brick *brick)
+{
+ if (brick->copy_last > brick->copy_start + 8 * 1024 * 1024
+ || time_is_before_jiffies(brick->last_jiffies + 5 * HZ)
+ || (brick->copy_last == brick->copy_end && brick->copy_end > 0)) {
+ brick->copy_start = brick->copy_last;
+ brick->last_jiffies = jiffies;
+ brick->power.percent_done = brick->copy_end > 0 ? brick->copy_start * 100 / brick->copy_end : 0;
+ XIO_INF("'%s' copied %lld / %lld bytes (%d%%)\n",
+ brick->brick_path,
+ brick->copy_last,
+ brick->copy_end,
+ brick->power.percent_done);
+ }
+}
+
+/* The heart of this brick.
+ * State transition function of the finite automaton.
+ * In case no progress is possible (e.g. preconditions not
+ * yet true), the state is left as is (idempotence property:
+ * calling this too often does no harm, just costs performance).
+ */
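+/* For orientation (as derived from the switch below): a plain copy pass
+ * walks START -> READ1 -> WRITE -> WRITTEN -> CLEANUP -> FINISHED per
+ * chunk, while verify_mode inserts START2/READ2 for the second read and
+ * READ3 for an optional re-read with data; RESET is entered only after
+ * errors or in restarting situations.
+ */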
+static
+int _next_state(struct copy_brick *brick, int index, loff_t pos)
+{
+ struct aio_object *aio0;
+ struct aio_object *aio1;
+ struct copy_state *st;
+ char state;
+ char next_state;
+ bool do_restart = false;
+ int progress = 0;
+ int status;
+
+ st = &GET_STATE(brick, index);
+ next_state = st->state;
+
+restart:
+ state = next_state;
+
+ do_restart = false;
+
+ switch (state) {
+ case COPY_STATE_RESET:
+ /* This state is only entered after errors or
+ * in restarting situations.
+ */
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ next_state = COPY_STATE_START;
+ /* fallthrough */
+ case COPY_STATE_START:
+ /* This is the regular starting state.
+ * It must be zero, automatically entered via memset().
+ */
+ if (st->table[0] || st->table[1]) {
+ XIO_ERR("index %d not startable\n", index);
+ progress = -EPROTO;
+ goto idle;
+ }
+
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ st->writeout = false;
+ st->error = 0;
+
+ if (brick->is_aborting ||
+ is_read_limited(brick))
+ goto idle;
+
+ status = _make_aio(brick, index, 0, NULL, pos, brick->copy_end, READ, brick->verify_mode ? 2 : 0);
+ if (unlikely(status < 0)) {
+ XIO_WRN("status = %d\n", status);
+ progress = status;
+ break;
+ }
+
+ next_state = COPY_STATE_READ1;
+ if (!brick->verify_mode)
+ break;
+
+ next_state = COPY_STATE_START2;
+ /* fallthrough */
+ case COPY_STATE_START2:
+ status = _make_aio(brick, index, 1, NULL, pos, brick->copy_end, READ, 2);
+ if (unlikely(status < 0)) {
+ XIO_WRN("status = %d\n", status);
+ progress = status;
+ break;
+ }
+ next_state = COPY_STATE_READ2;
+ /* fallthrough */
+ case COPY_STATE_READ2:
+ aio1 = st->table[1];
+ if (!aio1) { /* idempotence: wait by keeping the state unchanged */
+ goto idle;
+ }
+ /* fallthrough => wait for both aios to appear */
+ case COPY_STATE_READ1:
+ case COPY_STATE_READ3:
+ aio0 = st->table[0];
+ if (!aio0) { /* idempotence: wait by keeping the state unchanged */
+ goto idle;
+ }
+ if (brick->copy_limiter) {
+ int amount = (aio0->io_len - 1) / 1024 + 1;
+
+ xio_limit_sleep(brick->copy_limiter, amount);
+ }
+ /* on append mode: increase the end pointer dynamically */
+ if (brick->append_mode > 0 && aio0->io_total_size && aio0->io_total_size > brick->copy_end)
+ brick->copy_end = aio0->io_total_size;
+ /* do verify (when applicable) */
+ aio1 = st->table[1];
+ if (aio1 && state != COPY_STATE_READ3) {
+ int len = aio0->io_len;
+ bool ok;
+
+ if (len != aio1->io_len) {
+ ok = false;
+ } else if (aio0->io_cs_mode) {
+ static unsigned char null[sizeof(aio0->io_checksum)];
+
+ ok = !memcmp(aio0->io_checksum, aio1->io_checksum, sizeof(aio0->io_checksum));
+ if (ok)
+ ok = memcmp(aio0->io_checksum, null, sizeof(aio0->io_checksum)) != 0;
+ } else if (!aio0->io_data || !aio1->io_data) {
+ ok = false;
+ } else {
+ ok = !memcmp(aio0->io_data, aio1->io_data, len);
+ }
+
+ _clear_aio(brick, index, 1);
+
+ if (ok)
+ brick->verify_ok_count++;
+ else
+ brick->verify_error_count++;
+
+ if (ok || !brick->repair_mode) {
+ /* skip start of writing, goto final treatment of writeout */
+ next_state = COPY_STATE_CLEANUP;
+ break;
+ }
+ }
+
+ if (aio0->io_cs_mode > 1) { /* re-read, this time with data */
+ _clear_aio(brick, index, 0);
+ status = _make_aio(brick, index, 0, NULL, pos, brick->copy_end, READ, 0);
+ if (unlikely(status < 0)) {
+ XIO_WRN("status = %d\n", status);
+ progress = status;
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ next_state = COPY_STATE_READ3;
+ break;
+ }
+ next_state = COPY_STATE_WRITE;
+ /* fallthrough */
+ case COPY_STATE_WRITE:
+ if (is_write_limited(brick))
+ goto idle;
+ /* Obey ordering to get a strict "append" behaviour.
+ * We assume that we don't need to wait for completion
+ * of the previous write to avoid a sparse result file
+ * under all circumstances, i.e. we only ensure that
+ * _starting_ the writes is in order.
+ * This is only correct when all lower bricks obey the
+ * order of io_io() operations.
+ * Currently, bio and aio obey this. Be careful when
+ * implementing new IO bricks!
+ */
+ if (st->prev >= 0 && !GET_STATE(brick, st->prev).writeout)
+ goto idle;
+ aio0 = st->table[0];
+ if (unlikely(!aio0 || !aio0->io_data)) {
+ XIO_ERR("src buffer for write does not exist, state %d at index %d\n", state, index);
+ progress = -EILSEQ;
+ break;
+ }
+ if (unlikely(brick->is_aborting)) {
+ progress = -EINTR;
+ break;
+ }
+ /* start writeout */
+ status = _make_aio(brick, index, 1, aio0->io_data, pos, pos + aio0->io_len, WRITE, 0);
+ if (unlikely(status < 0)) {
+ XIO_WRN("status = %d\n", status);
+ progress = status;
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ /* Attention! overlapped IO behind EOF could
+ * lead to a temporarily inconsistent state of the
+ * file, because the write order may differ from
+ * strict O_APPEND behaviour.
+ */
+ if (xio_copy_overlap)
+ st->writeout = true;
+ next_state = COPY_STATE_WRITTEN;
+ /* fallthrough */
+ case COPY_STATE_WRITTEN:
+ aio1 = st->table[1];
+ if (!aio1) { /* idempotence: wait by keeping the state unchanged */
+ goto idle;
+ }
+ st->writeout = true;
+ /* rechecking means to start over again.
+ * ATTENTION! this may lead to infinite request
+ * submission loops, intentionally.
+ * TODO: implement some timeout mechanism.
+ */
+ if (brick->recheck_mode && brick->repair_mode) {
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ next_state = COPY_STATE_CLEANUP;
+ /* fallthrough */
+ case COPY_STATE_CLEANUP:
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ next_state = COPY_STATE_FINISHED;
+ /* fallthrough */
+ case COPY_STATE_FINISHED:
+ /* Indicate successful completion by remaining in this state.
+ * Restart of the finite automaton must be done externally.
+ */
+ goto idle;
+ default:
+ XIO_ERR("illegal state %d at index %d\n", state, index);
+ _clash(brick);
+ progress = -EILSEQ;
+ }
+
+ do_restart = (state != next_state);
+
+idle:
+ if (unlikely(progress < 0)) {
+ st->error = progress;
+ XIO_WRN("progress = %d\n", progress);
+ progress = 0;
+ _clash(brick);
+ } else if (do_restart) {
+ goto restart;
+ } else if (st->state != next_state) {
+ progress++;
+ }
+
+ /* save the resulting state */
+ st->state = next_state;
+ return progress;
+}
+
+static
+int _run_copy(struct copy_brick *brick)
+{
+ int max;
+ loff_t pos;
+ loff_t limit = -1;
+
+ short prev;
+ int progress;
+
+ if (unlikely(_clear_clash(brick))) {
+ XIO_DBG("clash\n");
+ if (atomic_read(&brick->copy_read_flight) + atomic_read(&brick->copy_write_flight) > 0) {
+ /* wait until all pending copy IO has finished
+ */
+ _clash(brick);
+ XIO_DBG("re-clash\n");
+ brick_msleep(100);
+ return 0;
+ }
+ _clear_all_aio(brick);
+ _clear_state_table(brick);
+ }
+
+ /* Do at most max iterations in the below loop
+ */
+ max = NR_COPY_REQUESTS - atomic_read(&brick->io_flight) * 2;
+
+ prev = -1;
+ progress = 0;
+ for (pos = brick->copy_last; pos < brick->copy_end || brick->append_mode > 1; pos = ((pos / COPY_CHUNK) + 1) * COPY_CHUNK) {
+ int index = GET_INDEX(pos);
+ struct copy_state *st = &GET_STATE(brick, index);
+
+ if (max-- <= 0)
+ break;
+ st->prev = prev;
+ prev = index;
+ /* call the finite state automaton */
+ if (!(st->active[0] | st->active[1])) {
+ progress += _next_state(brick, index, pos);
+ limit = pos;
+ }
+ }
+
+ /* check the resulting state: can we advance the copy_last pointer? */
+ if (likely(progress && !brick->clash)) {
+ int count = 0;
+
+ for (pos = brick->copy_last; pos <= limit; pos = ((pos / COPY_CHUNK) + 1) * COPY_CHUNK) {
+ int index = GET_INDEX(pos);
+ struct copy_state *st = &GET_STATE(brick, index);
+
+ if (st->state != COPY_STATE_FINISHED)
+ break;
+ if (unlikely(st->error < 0)) {
+ if (!brick->copy_error) {
+ brick->copy_error = st->error;
+ XIO_WRN("IO error = %d\n", st->error);
+ }
+ if (brick->abort_mode)
+ brick->is_aborting = true;
+ break;
+ }
+ /* rollover */
+ st->state = COPY_STATE_START;
+ count += st->len;
+ /* check contiguity */
+ if (unlikely(GET_OFFSET(pos) + st->len != COPY_CHUNK))
+ break;
+ }
+ if (count > 0) {
+ brick->copy_last += count;
+ get_lamport(&brick->copy_last_stamp);
+ _update_percent(brick);
+ }
+ }
+ return progress;
+}
+
+static
+bool _is_done(struct copy_brick *brick)
+{
+ if (brick_thread_should_stop())
+ brick->is_aborting = true;
+ return brick->is_aborting &&
+ atomic_read(&brick->copy_read_flight) + atomic_read(&brick->copy_write_flight) <= 0;
+}
+
+static int _copy_thread(void *data)
+{
+ struct copy_brick *brick = data;
+ int rounds = 0;
+
+ XIO_DBG("--------------- copy_thread %p starting\n", brick);
+ brick->copy_error = 0;
+ brick->copy_error_count = 0;
+ brick->verify_ok_count = 0;
+ brick->verify_error_count = 0;
+ xio_set_power_on_led((void *)brick, true);
+ brick->trigger = true;
+
+ while (!_is_done(brick)) {
+ loff_t old_start = brick->copy_start;
+ loff_t old_end = brick->copy_end;
+ int progress = 0;
+
+ if (old_end > 0) {
+ progress = _run_copy(brick);
+ if (!progress || ++rounds > 1000)
+ rounds = 0;
+ }
+
+ wait_event_interruptible_timeout(brick->event,
+ progress > 0 ||
+ brick->trigger ||
+ brick->copy_start != old_start ||
+ brick->copy_end != old_end ||
+ _is_done(brick),
+ 1 * HZ);
+ brick->trigger = false;
+ }
+
+ XIO_DBG("--------------- copy_thread terminating (%d read requests / %d write requests flying, copy_start = %lld copy_end = %lld)\n",
+ atomic_read(&brick->copy_read_flight),
+ atomic_read(&brick->copy_write_flight),
+ brick->copy_start,
+ brick->copy_end);
+
+ _clear_all_aio(brick);
+ xio_set_power_off_led((void *)brick, true);
+ XIO_DBG("--------------- copy_thread done.\n");
+ return 0;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static int copy_get_info(struct copy_output *output, struct xio_info *info)
+{
+ struct copy_input *input = output->brick->inputs[INPUT_B_IO];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
+static int copy_io_get(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+ int status;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (status >= 0)
+ atomic_inc(&output->brick->io_flight);
+ return status;
+}
+
+static void copy_io_put(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ if (atomic_dec_and_test(&output->brick->io_flight)) {
+ output->brick->trigger = true;
+ wake_up_interruptible(&output->brick->event);
+ }
+}
+
+static void copy_io_io(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+}
+
+static int copy_switch(struct copy_brick *brick)
+{
+ static int version;
+
+ XIO_DBG("power.button = %d\n", brick->power.button);
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+ xio_set_power_off_led((void *)brick, false);
+ brick->is_aborting = false;
+ if (!brick->thread) {
+ brick->copy_last = brick->copy_start;
+ get_lamport(&brick->copy_last_stamp);
+ brick->thread = brick_thread_create(_copy_thread, brick, "xio_copy%d", version++);
+ if (brick->thread) {
+ brick->trigger = true;
+ } else {
+ xio_set_power_off_led((void *)brick, true);
+ XIO_ERR("could not start copy thread\n");
+ }
+ }
+ } else {
+ if (brick->power.off_led)
+ goto done;
+ xio_set_power_on_led((void *)brick, false);
+ if (brick->thread) {
+ XIO_INF("stopping thread...\n");
+ brick_thread_stop(brick->thread);
+ }
+ }
+ _update_percent(brick);
+done:
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *copy_statistics(struct copy_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(1024);
+
+ snprintf(res, 1024,
+ "copy_start = %lld copy_last = %lld copy_end = %lld copy_error = %d copy_error_count = %d verify_ok_count = %d verify_error_count = %d low_dirty = %d is_aborting = %d clash = %lu | total clash_count = %d | io_flight = %d copy_read_flight = %d copy_write_flight = %d\n",
+ brick->copy_start,
+ brick->copy_last,
+ brick->copy_end,
+ brick->copy_error,
+ brick->copy_error_count,
+ brick->verify_ok_count,
+ brick->verify_error_count,
+ brick->low_dirty,
+ brick->is_aborting,
+ brick->clash,
+ atomic_read(&brick->total_clash_count),
+ atomic_read(&brick->io_flight),
+ atomic_read(&brick->copy_read_flight),
+ atomic_read(&brick->copy_write_flight));
+
+ return res;
+}
+
+static
+void copy_reset_statistics(struct copy_brick *brick)
+{
+ atomic_set(&brick->total_clash_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int copy_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct copy_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+ return 0;
+}
+
+static void copy_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct copy_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+}
+
+XIO_MAKE_STATICS(copy);
+
+/********************* brick constructors * destructors *******************/
+
+static
+void _free_pages(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table = brick->st[i];
+
+ if (!sub_table)
+ continue;
+
+ brick_block_free(sub_table, PAGE_SIZE);
+ }
+ brick_block_free(brick->st, PAGE_SIZE);
+}
+
+static int copy_brick_construct(struct copy_brick *brick)
+{
+ int i;
+
+ brick->st = brick_block_alloc(0, PAGE_SIZE);
+ memset(brick->st, 0, PAGE_SIZE);
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table;
+
+ /* this should usually be optimized away as dead code */
+ if (unlikely(i >= MAX_SUB_TABLES)) {
+ XIO_ERR("sorry, subtable index %d is too large.\n", i);
+ _free_pages(brick);
+ return -EINVAL;
+ }
+
+ sub_table = brick_block_alloc(0, PAGE_SIZE);
+ brick->st[i] = sub_table;
+ memset(sub_table, 0, PAGE_SIZE);
+ }
+
+ init_waitqueue_head(&brick->event);
+ sema_init(&brick->mutex, 1);
+ return 0;
+}
+
+static int copy_brick_destruct(struct copy_brick *brick)
+{
+ _free_pages(brick);
+ return 0;
+}
+
+static int copy_output_construct(struct copy_output *output)
+{
+ return 0;
+}
+
+static int copy_output_destruct(struct copy_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct copy_brick_ops copy_brick_ops = {
+ .brick_switch = copy_switch,
+ .brick_statistics = copy_statistics,
+ .reset_statistics = copy_reset_statistics,
+};
+
+static struct copy_output_ops copy_output_ops = {
+ .xio_get_info = copy_get_info,
+ .aio_get = copy_io_get,
+ .aio_put = copy_io_put,
+ .aio_io = copy_io_io,
+};
+
+const struct copy_input_type copy_input_type = {
+ .type_name = "copy_input",
+ .input_size = sizeof(struct copy_input),
+};
+
+static const struct copy_input_type *copy_input_types[] = {
+ &copy_input_type,
+ &copy_input_type,
+ &copy_input_type,
+ &copy_input_type,
+};
+
+const struct copy_output_type copy_output_type = {
+ .type_name = "copy_output",
+ .output_size = sizeof(struct copy_output),
+ .master_ops = &copy_output_ops,
+ .output_construct = &copy_output_construct,
+ .output_destruct = &copy_output_destruct,
+};
+
+static const struct copy_output_type *copy_output_types[] = {
+ &copy_output_type,
+};
+
+const struct copy_brick_type copy_brick_type = {
+ .type_name = "copy_brick",
+ .brick_size = sizeof(struct copy_brick),
+ .max_inputs = 4,
+ .max_outputs = 1,
+ .master_ops = &copy_brick_ops,
+ .aspect_types = copy_aspect_types,
+ .default_input_types = copy_input_types,
+ .default_output_types = copy_output_types,
+ .brick_construct = &copy_brick_construct,
+ .brick_destruct = &copy_brick_destruct,
+};
+EXPORT_SYMBOL_GPL(copy_brick_type);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_copy(void)
+{
+ XIO_INF("init_copy()\n");
+ return copy_register_brick_type();
+}
+
+void exit_xio_copy(void)
+{
+ XIO_INF("exit_copy()\n");
+ copy_unregister_brick_type();
+}
--
2.0.0

2014-07-01 21:53:40

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 35/50] mars: add new file include/linux/xio/xio_copy.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_copy.h | 99 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
create mode 100644 include/linux/xio/xio_copy.h

diff --git a/include/linux/xio/xio_copy.h b/include/linux/xio/xio_copy.h
new file mode 100644
index 0000000..d487d52
--- /dev/null
+++ b/include/linux/xio/xio_copy.h
@@ -0,0 +1,99 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_COPY_H
+#define XIO_COPY_H
+
+#include <linux/wait.h>
+#include <linux/semaphore.h>
+
+#define INPUT_A_IO 0
+#define INPUT_A_COPY 1
+#define INPUT_B_IO 2
+#define INPUT_B_COPY 3
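+
+/* The copy brick drives four inputs: A and B each appear twice, once for
+ * passing ordinary IO through (INPUT_*_IO) and once for the internal copy
+ * traffic (INPUT_*_COPY); see _determine_input() in xio_copy.c for how
+ * external requests are routed between A and B.
+ */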
+
+extern int xio_copy_overlap;
+extern int xio_copy_read_prio;
+extern int xio_copy_write_prio;
+extern int xio_copy_read_max_fly;
+extern int xio_copy_write_max_fly;
+
+enum {
+ COPY_STATE_RESET = -1,
+ COPY_STATE_START = 0, /* don't change this, it _must_ be zero */
+ COPY_STATE_START2,
+ COPY_STATE_READ1,
+ COPY_STATE_READ2,
+ COPY_STATE_READ3,
+ COPY_STATE_WRITE,
+ COPY_STATE_WRITTEN,
+ COPY_STATE_CLEANUP,
+ COPY_STATE_FINISHED,
+};
+
+struct copy_state {
+ struct aio_object *table[2];
+ bool active[2];
+ char state;
+ bool writeout;
+
+ short prev;
+ short len;
+ short error;
+};
+
+struct copy_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct copy_brick *brick;
+ int queue;
+};
+
+struct copy_brick {
+ XIO_BRICK(copy);
+ /* parameters */
+ struct xio_limiter *copy_limiter;
+ loff_t copy_start;
+
+ loff_t copy_end; /* stop working if == 0 */
+ int io_prio;
+
+ int append_mode; /* 1 = passively, 2 = actively */
+ bool verify_mode; /* 0 = copy, 1 = checksum+compare */
+ bool repair_mode; /* whether to repair in case of verify errors */
+ bool recheck_mode; /* whether to re-check after repairs (costs performance) */
+ bool utilize_mode; /* utilize already copied data */
+ bool abort_mode; /* abort on IO error (default is retry forever) */
+ /* readonly from outside */
+ loff_t copy_last; /* current working position */
+ struct timespec copy_last_stamp;
+ int copy_error;
+ int copy_error_count;
+ int verify_ok_count;
+ int verify_error_count;
+ bool low_dirty;
+ bool is_aborting;
+
+ /* internal */
+ bool trigger;
+ unsigned long clash;
+ atomic_t total_clash_count;
+ atomic_t io_flight;
+ atomic_t copy_read_flight;
+ atomic_t copy_write_flight;
+ unsigned long last_jiffies;
+
+ wait_queue_head_t event;
+ struct semaphore mutex;
+ struct task_struct *thread;
+ struct copy_state **st;
+};
+
+struct copy_input {
+ XIO_INPUT(copy);
+};
+
+struct copy_output {
+ XIO_OUTPUT(copy);
+};
+
+XIO_TYPES(copy);
+
+#endif
--
2.0.0

2014-07-01 21:53:43

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 14/50] mars: add new file drivers/block/mars/lib_rank.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/lib_rank.c | 73 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
create mode 100644 drivers/block/mars/lib_rank.c

diff --git a/drivers/block/mars/lib_rank.c b/drivers/block/mars/lib_rank.c
new file mode 100644
index 0000000..384cb61
--- /dev/null
+++ b/drivers/block/mars/lib_rank.c
@@ -0,0 +1,73 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/brick/lib_rank.h>
+
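+/* ranking_compute() accumulates points by piecewise linear interpolation
+ * over the (rki_x, rki_y) table. Worked example (illustrative values):
+ * with rki = { { .rki_x = 0, .rki_y = 0 }, { .rki_x = 100, .rki_y = 50 },
+ * { .rki_x = RKI_DUMMY } } and x = 40, the segment [0, 100] applies, so
+ * points = (40 - 0) * (50 - 0) / (100 - 0) + 0 = 20.
+ * Below the first entry no points are added; beyond the last real entry
+ * the final rki_y is used unchanged.
+ */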
+void ranking_compute(struct rank_data *rkd, const struct rank_info rki[], int x)
+{
+ int points = 0;
+ int i;
+
+ for (i = 0; ; i++) {
+ int x0;
+ int x1;
+ int y0;
+ int y1;
+
+ x0 = rki[i].rki_x;
+ if (x < x0)
+ break;
+
+ x1 = rki[i+1].rki_x;
+
+ if (unlikely(x1 == RKI_DUMMY)) {
+ points = rki[i].rki_y;
+ break;
+ }
+
+ if (x > x1)
+ continue;
+
+ y0 = rki[i].rki_y;
+ y1 = rki[i+1].rki_y;
+
+ /* linear interpolation */
+ points = ((long long)(x - x0) * (long long)(y1 - y0)) / (x1 - x0) + y0;
+ break;
+ }
+ rkd->rkd_tmp += points;
+}
+EXPORT_SYMBOL_GPL(ranking_compute);
+
+int ranking_select(struct rank_data rkd[], int rkd_count)
+{
+ int res = -1;
+ long long max = LLONG_MIN / 2;
+ int i;
+
+ for (i = 0; i < rkd_count; i++) {
+ struct rank_data *tmp = &rkd[i];
+ long long rest = tmp->rkd_current_points;
+
+ if (rest <= 0)
+ continue;
+ /* rest -= tmp->rkd_got; */
+ if (rest > max) {
+ max = rest;
+ res = i;
+ }
+ }
+ /* Prevent underflow in the long term
+ * and reset the "clocks" after each round of
+ * weighted round-robin selection.
+ */
+ if (max < 0 && res >= 0) {
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_got += max;
+ }
+ return res;
+}
+EXPORT_SYMBOL_GPL(ranking_select);
--
2.0.0

2014-07-01 21:53:48

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 37/50] mars: add new file include/linux/xio/xio_trans_logger.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_trans_logger.h | 249 +++++++++++++++++++++++++++++++++++
1 file changed, 249 insertions(+)
create mode 100644 include/linux/xio/xio_trans_logger.h

diff --git a/include/linux/xio/xio_trans_logger.h b/include/linux/xio/xio_trans_logger.h
new file mode 100644
index 0000000..f499f42
--- /dev/null
+++ b/include/linux/xio/xio_trans_logger.h
@@ -0,0 +1,249 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_TRANS_LOGGER_H
+#define XIO_TRANS_LOGGER_H
+
+#define REGION_SIZE_BITS (PAGE_SHIFT + 4)
+#define REGION_SIZE (1 << REGION_SIZE_BITS)
+#define LOGGER_QUEUES 4
+
+#include <linux/time.h>
+
+#include <linux/xio.h>
+#include <linux/lib_log.h>
+#include <linux/brick/lib_pairing_heap.h>
+#include <linux/brick/lib_queue.h>
+#include <linux/brick/lib_timing.h>
+
+/************************ global tuning ***********************/
+
+/* 0 = early completion of all writes
+ * 1 = early completion of non-sync
+ * 2 = late completion
+ */
+extern int trans_logger_completion_semantics;
+extern int trans_logger_do_crc;
+extern int trans_logger_mem_usage; /* in KB */
+extern int trans_logger_max_interleave;
+extern int trans_logger_resume;
+extern int trans_logger_replay_timeout; /* in s */
+extern atomic_t global_mshadow_count;
+extern atomic64_t global_mshadow_used;
+
+struct writeback_group {
+ rwlock_t lock;
+ struct trans_logger_brick *leader;
+ loff_t biggest;
+ struct list_head group_anchor;
+
+ /* tuning */
+ struct xio_limiter limiter;
+ int until_percent;
+};
+
+extern struct writeback_group global_writeback;
+
+/******************************************************************/
+
+_PAIRING_HEAP_TYPEDEF(logger, /*empty*/)
+
+struct logger_queue {
+ QUEUE_ANCHOR(logger, loff_t, logger);
+ struct trans_logger_brick *q_brick;
+ const char *q_insert_info;
+ const char *q_pushback_info;
+ const char *q_fetch_info;
+ struct banning q_banning;
+ int no_progress_count;
+ int pushback_count;
+};
+
+struct logger_head {
+ struct list_head lh_head;
+ loff_t *lh_pos;
+ struct pairing_heap_logger ph;
+};
+
+/******************************************************************/
+
+#define TL_INPUT_READ 0
+#define TL_INPUT_WRITEBACK 0
+#define TL_INPUT_LOG1 1
+#define TL_INPUT_LOG2 2
+#define TL_INPUT_NR 3
+
+struct writeback_info {
+ struct trans_logger_brick *w_brick;
+ struct logger_head w_lh;
+ loff_t w_pos;
+ int w_len;
+ int w_error;
+
+ struct list_head w_collect_list; /* list of collected orig requests */
+ struct list_head w_sub_read_list; /* for saving the old data before overwrite */
+ struct list_head w_sub_write_list; /* for overwriting */
+ atomic_t w_sub_read_count;
+ atomic_t w_sub_write_count;
+ atomic_t w_sub_log_count;
+
+ /* make checkpatch.pl happy with a blank line - is this a false positive? */
+
+ void (*read_endio)(struct generic_callback *cb);
+ void (*write_endio)(struct generic_callback *cb);
+};
+
+struct trans_logger_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct trans_logger_brick *my_brick;
+ struct trans_logger_input *my_input;
+ struct trans_logger_input *log_input;
+ struct logger_head lh;
+ struct list_head hash_head;
+ struct list_head pos_head;
+ struct list_head replay_head;
+ struct list_head collect_head;
+ struct pairing_heap_logger ph;
+ struct trans_logger_aio_aspect *shadow_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ void *shadow_data;
+ int orig_rw;
+ int wb_error;
+ bool do_dealloc;
+ bool do_buffered;
+ bool is_hashed;
+ bool is_stable;
+ bool is_dirty;
+ bool is_collected;
+ bool is_fired;
+ bool is_completed;
+ bool is_persistent;
+ bool is_emergency;
+ struct timespec stamp;
+ loff_t log_pos;
+ struct generic_callback cb;
+ struct writeback_info *wb;
+ struct list_head sub_list;
+ struct list_head sub_head;
+ int total_sub_count;
+ int alloc_len;
+ atomic_t current_sub_count;
+};
+
+struct trans_logger_hash_anchor;
+
+struct trans_logger_brick {
+ XIO_BRICK(trans_logger);
+ /* parameters */
+ struct xio_limiter *replay_limiter;
+
+ int shadow_mem_limit; /* max # master shadows */
+ bool replay_mode; /* mode of operation */
+ bool continuous_replay_mode; /* mode of operation */
+ bool log_reads; /* additionally log pre-images */
+ bool cease_logging; /* direct IO without logging (only in case of EMERGENCY) */
+ loff_t replay_start_pos; /* where to start replay */
+ loff_t replay_end_pos; /* end of replay */
+ int new_input_nr; /* where we should switch over to ASAP */
+ int replay_tolerance; /* how many bytes to ignore in truncated logfiles */
+ /* readonly from outside */
+ loff_t replay_current_pos; /* end of replay */
+ int log_input_nr; /* where we are currently logging to */
+ int old_input_nr; /* where old IO requests may be on the fly */
+ int replay_code; /* replay errors (if any) */
+ bool stopped_logging; /* direct IO without logging (only in case of EMERGENCY) */
+ /* private */
+ int disk_io_error; /* replay errors from callbacks */
+ struct trans_logger_hash_anchor **hash_table;
+ struct list_head group_head;
+ loff_t old_margin;
+ spinlock_t replay_lock;
+ struct list_head replay_list;
+ struct task_struct *thread;
+
+ wait_queue_head_t worker_event;
+ wait_queue_head_t caller_event;
+ /* statistics */
+ atomic64_t shadow_mem_used;
+ atomic_t replay_count;
+ atomic_t any_fly_count;
+ atomic_t log_fly_count;
+ atomic_t hash_count;
+ atomic_t mshadow_count;
+ atomic_t sshadow_count;
+ atomic_t outer_balance_count;
+ atomic_t inner_balance_count;
+ atomic_t sub_balance_count;
+ atomic_t wb_balance_count;
+ atomic_t total_hash_insert_count;
+ atomic_t total_hash_find_count;
+ atomic_t total_hash_extend_count;
+ atomic_t total_replay_count;
+ atomic_t total_replay_conflict_count;
+ atomic_t total_cb_count;
+ atomic_t total_read_count;
+ atomic_t total_write_count;
+ atomic_t total_flush_count;
+ atomic_t total_writeback_count;
+ atomic_t total_writeback_cluster_count;
+ atomic_t total_shortcut_count;
+ atomic_t total_mshadow_count;
+ atomic_t total_sshadow_count;
+ atomic_t total_mshadow_buffered_count;
+ atomic_t total_sshadow_buffered_count;
+ atomic_t total_round_count;
+ atomic_t total_restart_count;
+ atomic_t total_delay_count;
+
+ /* queues */
+ struct logger_queue q_phase[LOGGER_QUEUES];
+ bool delay_callers;
+};
+
+struct trans_logger_output {
+ XIO_OUTPUT(trans_logger);
+};
+
+#define MAX_HOST_LEN 32
+
+struct trans_logger_info {
+ /* to be maintained / initialized from outside */
+ void (*inf_callback)(struct trans_logger_info *inf);
+ void *inf_private;
+ char inf_host[MAX_HOST_LEN];
+
+ int inf_sequence; /* logfile sequence number */
+
+ /* maintained by trans_logger */
+ loff_t inf_min_pos; /* current replay position (both in replay mode and in logging mode) */
+ loff_t inf_max_pos; /* ditto, indicating the "dirty" area which could potentially be "inconsistent" */
+ loff_t inf_log_pos; /* position of transaction logging (may be ahead of replay position) */
+ struct timespec inf_min_pos_stamp; /* when the data has been _successfully_ overwritten */
+/* when the data has _started_ overwrite (maybe "trashed" in case of errors / aborts) */
+ struct timespec inf_max_pos_stamp;
+
+ struct timespec inf_log_pos_stamp; /* stamp from transaction log */
+ bool inf_is_replaying;
+ bool inf_is_logging;
+};
+
+struct trans_logger_input {
+ XIO_INPUT(trans_logger);
+ /* parameters */
+ /* informational */
+ struct trans_logger_info inf;
+
+ /* readonly from outside */
+ atomic_t log_obj_count;
+ atomic_t pos_count;
+ bool is_operating;
+ long long last_jiffies;
+
+ /* private */
+ struct log_status logst;
+ struct list_head pos_list;
+ long long inf_last_jiffies;
+ struct semaphore inf_mutex;
+};
+
+XIO_TYPES(trans_logger);
+
+#endif
--
2.0.0

2014-07-01 21:54:07

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 19/50] mars: add new file include/linux/xio.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio.h | 273 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 273 insertions(+)
create mode 100644 include/linux/xio.h

diff --git a/include/linux/xio.h b/include/linux/xio.h
new file mode 100644
index 0000000..33f125d
--- /dev/null
+++ b/include/linux/xio.h
@@ -0,0 +1,273 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_H
+#define XIO_H
+
+#include <linux/semaphore.h>
+#include <linux/rwsem.h>
+
+/***********************************************************************/
+
+/* include the generic brick infrastructure */
+
+#define OBJ_TYPE_AIO 0
+#define OBJ_TYPE_MAX 1
+
+#include <linux/brick/brick.h>
+#include <linux/brick/brick_mem.h>
+#include <linux/brick/lamport.h>
+#include <linux/brick/lib_timing.h>
+
+/***********************************************************************/
+
+/* XIO-specific debugging helpers */
+
+#define _XIO_MSG(_class, _dump, _fmt, _args...) \
+ brick_say(_class, _dump, "XIO", __BASE_FILE__, __LINE__, __func__, _fmt, ##_args)
+
+#define XIO_FAT(_fmt, _args...) _XIO_MSG(SAY_FATAL, true, _fmt, ##_args)
+#define XIO_ERR(_fmt, _args...) _XIO_MSG(SAY_ERROR, false, _fmt, ##_args)
+#define XIO_WRN(_fmt, _args...) _XIO_MSG(SAY_WARN, false, _fmt, ##_args)
+#define XIO_INF(_fmt, _args...) _XIO_MSG(SAY_INFO, false, _fmt, ##_args)
+
+#ifdef XIO_DEBUGGING
+#define XIO_DBG(_fmt, _args...) _XIO_MSG(SAY_DEBUG, false, _fmt, ##_args)
+#else
+#define XIO_DBG(_args...) /**/
+#endif
+
+/***********************************************************************/
+
+/* XIO-specific definitions */
+
+#define XIO_PRIO_HIGH -1
+#define XIO_PRIO_NORMAL 0 /* this is automatically used by memset() */
+#define XIO_PRIO_LOW 1
+#define XIO_PRIO_NR 3
+
+/* object stuff */
+
+/* aio */
+
+#define AIO_UPTODATE 1
+#define AIO_READING 2
+#define AIO_WRITING 4
+
+extern const struct generic_object_type aio_type;
+
+#define XIO_CHECKSUM_SIZE 16
+
+#define AIO_OBJECT(OBJTYPE) \
+ CALLBACK_OBJECT(OBJTYPE); \
+ /* supplied by caller */ \
+ void *io_data; /* preset to NULL for buffered IO */ \
+ loff_t io_pos; \
+ int io_len; \
+ int io_may_write; \
+ int io_prio; \
+ int io_timeout; \
+ int io_cs_mode; /* 0 = off, 1 = checksum + data, 2 = checksum only */\
+ /* maintained by the aio implementation, readable for callers */\
+ loff_t io_total_size; /* just for info, need not be implemented */\
+ unsigned char io_checksum[XIO_CHECKSUM_SIZE]; \
+ int io_flags; \
+ int io_rw; \
+ int io_id; /* not mandatory; may be used for identification */\
+ bool io_skip_sync; /* skip sync for this particular aio */ \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct aio_object {
+ AIO_OBJECT(aio);
+};
+
+/* internal helper structs */
+
+struct xio_info {
+ loff_t current_size;
+
+ int tf_align; /* transfer alignment constraint */
+ int tf_min_size; /* transfer is only possible in multiples of this */
+};
+
+/* brick stuff */
+
+#define XIO_BRICK(BRITYPE) \
+ GENERIC_BRICK(BRITYPE); \
+ struct generic_object_layout aio_object_layout; \
+ struct list_head global_brick_link; \
+ struct list_head dent_brick_link; \
+ const char *brick_name; \
+ const char *brick_path; \
+ struct mars_global *global; \
+ void **kill_ptr; \
+ int kill_round; \
+ bool killme; \
+ void (*show_status)(struct xio_brick *brick, bool shutdown); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct xio_brick {
+ XIO_BRICK(xio);
+};
+
+#define XIO_INPUT(BRITYPE) \
+ GENERIC_INPUT(BRITYPE); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct xio_input {
+ XIO_INPUT(xio);
+};
+
+#define XIO_OUTPUT(BRITYPE) \
+ GENERIC_OUTPUT(BRITYPE); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+struct xio_output {
+ XIO_OUTPUT(xio);
+};
+
+#define XIO_BRICK_OPS(BRITYPE) \
+ GENERIC_BRICK_OPS(BRITYPE); \
+ char *(*brick_statistics)(struct BRITYPE##_brick *brick, int verbose);\
+ void (*reset_statistics)(struct BRITYPE##_brick *brick); \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+#define XIO_OUTPUT_OPS(BRITYPE) \
+ GENERIC_OUTPUT_OPS(BRITYPE); \
+ int (*xio_get_info)(struct BRITYPE##_output *output, struct xio_info *info);\
+ /* aio */ \
+ int (*aio_get)(struct BRITYPE##_output *output, struct aio_object *aio);\
+ void (*aio_io)(struct BRITYPE##_output *output, struct aio_object *aio);\
+ void (*aio_put)(struct BRITYPE##_output *output, struct aio_object *aio);\
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+/* all non-extendable types */
+
+#define _XIO_TYPES(BRITYPE) \
+ \
+struct BRITYPE##_brick_ops { \
+ XIO_BRICK_OPS(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_output_ops { \
+ XIO_OUTPUT_OPS(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_brick_type { \
+ GENERIC_BRICK_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_input_type { \
+ GENERIC_INPUT_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_output_type { \
+ GENERIC_OUTPUT_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_callback { \
+ GENERIC_CALLBACK(BRITYPE); \
+}; \
+ \
+DECLARE_BRICK_FUNCTIONS(BRITYPE); \
+/* this comment is for keeping TRAILING_SEMICOLON happy */
+
+#define XIO_TYPES(BRITYPE) \
+ \
+_XIO_TYPES(BRITYPE) \
+ \
+DECLARE_ASPECT_FUNCTIONS(BRITYPE, aio); \
+extern int init_xio_##BRITYPE(void); \
+extern void exit_xio_##BRITYPE(void); \
+/* this comment is for keeping TRAILING_SEMICOLON happy */
+
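+/* Usage sketch (hypothetical brick "foo", following the pattern of the
+ * real bricks such as xio_copy): a header declares
+ *
+ * struct foo_brick { XIO_BRICK(foo); ... };
+ * struct foo_input { XIO_INPUT(foo); };
+ * struct foo_output { XIO_OUTPUT(foo); };
+ * XIO_TYPES(foo);
+ *
+ * while the .c file instantiates the boilerplate via XIO_MAKE_STATICS(foo)
+ * plus the static ops/type structs (see xio_copy.c for a complete
+ * instance).
+ */
+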
+/* instantiate pseudo base-classes */
+
+DECLARE_OBJECT_FUNCTIONS(aio);
+_XIO_TYPES(xio);
+DECLARE_ASPECT_FUNCTIONS(xio, aio);
+
+/***********************************************************************/
+
+/* XIO-specific helpers */
+
+#define XIO_MAKE_STATICS(BRITYPE) \
+ \
+/* checkpatch.pl: the EXPORT_SYMBOL warning appears to be false positive */\
+int BRITYPE##_brick_nr = -EEXIST; \
+EXPORT_SYMBOL_GPL(BRITYPE##_brick_nr); \
+ \
+static const struct generic_aspect_type BRITYPE##_aio_aspect_type = { \
+ .aspect_type_name = #BRITYPE "_aio_aspect_type", \
+ .object_type = &aio_type, \
+ .aspect_size = sizeof(struct BRITYPE##_aio_aspect), \
+ .init_fn = BRITYPE##_aio_aspect_init_fn, \
+ .exit_fn = BRITYPE##_aio_aspect_exit_fn, \
+}; \
+ \
+static const struct generic_aspect_type *BRITYPE##_aspect_types[OBJ_TYPE_MAX] = {\
+ [OBJ_TYPE_AIO] = &BRITYPE##_aio_aspect_type, \
+}; \
+/* this comment is for keeping TRAILING_SEMICOLON happy */
+
+extern const struct meta xio_info_meta[];
+extern const struct meta xio_aio_meta[];
+extern const struct meta xio_timespec_meta[];
+
+/***********************************************************************/
+
+/* Some minimal upcalls from the generic IO layer to the strategy layer.
+ * TODO: abstract away.
+ */
+
+extern void xio_set_power_on_led(struct xio_brick *brick, bool val);
+extern void xio_set_power_off_led(struct xio_brick *brick, bool val);
+
+/* this should disappear!
+ */
+extern void (*_local_trigger)(void);
+extern void (*_remote_trigger)(void);
+#define local_trigger() do { if (_local_trigger) { XIO_DBG("trigger...\n"); _local_trigger(); } } while (0)
+#define remote_trigger() \
+do { if (_remote_trigger) { XIO_DBG("remote_trigger...\n"); _remote_trigger(); } } while (0)
+
+/***********************************************************************/
+
+/* Some global stuff.
+ */
+
+extern struct banning xio_global_ban;
+
+extern atomic_t xio_global_io_flying;
+
+extern int xio_throttle_start;
+extern int xio_throttle_end;
+
+/***********************************************************************/
+
+/* Some special brick types for avoidance of cyclic references.
+ *
+ * The client/server network bricks use this for independent instantiation
+ * from the main instantiation logic (separate modprobe for xio_server
+ * is possible).
+ */
+extern const struct generic_brick_type *_client_brick_type;
+extern const struct generic_brick_type *_bio_brick_type;
+extern const struct generic_brick_type *_aio_brick_type;
+extern const struct generic_brick_type *_sio_brick_type;
+
+/***********************************************************************/
+
+/* Crypto stuff
+ */
+
+extern int xio_digest_size;
+extern void xio_digest(unsigned char *digest, void *data, int len);
+extern void aio_checksum(struct aio_object *aio);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio(void);
+extern void exit_xio(void);
+
+#endif
--
2.0.0

2014-07-01 21:54:11

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 46/50] mars: add new file drivers/block/mars/mars_light/mars_light.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/mars_light/mars_light.c | 5675 ++++++++++++++++++++++++++++
1 file changed, 5675 insertions(+)
create mode 100644 drivers/block/mars/mars_light/mars_light.c

diff --git a/drivers/block/mars/mars_light/mars_light.c b/drivers/block/mars/mars_light/mars_light.c
new file mode 100644
index 0000000..bb89949
--- /dev/null
+++ b/drivers/block/mars/mars_light/mars_light.c
@@ -0,0 +1,5675 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#define XIO_DEBUGGING
+
+/* This MUST be updated whenever INCOMPATIBLE changes are made to the
+ * symlink tree in /mars/ .
+ *
+ * Just adding a new symlink is usually not "incompatible", if
+ * other tools like marsadm just ignore it.
+ *
+ * "incompatible" means that something may BREAK.
+ */
+#define SYMLINK_TREE_VERSION "0.1"
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/major.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+#include <linux/mars_light/light_strategy.h>
+
+#include <linux/wait.h>
+
+#include <linux/lib_mapfree.h>
+
+/* used brick types */
+#include <linux/xio/xio_server.h>
+#include <linux/xio/xio_client.h>
+#include <linux/xio/xio_copy.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_aio.h>
+#include <linux/xio/xio_trans_logger.h>
+#include <linux/xio/xio_if.h>
+#include <linux/mars_light/mars_proc.h>
+
+#define REPLAY_TOLERANCE (PAGE_SIZE + OVERHEAD)
+
+/* TODO: add human-readable timestamps */
+#define XIO_INF_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_INFO, "%s: " fmt, say_class[SAY_INFO], ##args);\
+ XIO_INF(fmt, ##args); \
+ })
+
+#define XIO_WRN_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_WARN, "%s: " fmt, say_class[SAY_WARN], ##args);\
+ XIO_WRN(fmt, ##args); \
+ })
+
+#define XIO_ERR_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_ERROR, "%s: " fmt, say_class[SAY_ERROR], ##args);\
+ XIO_ERR(fmt, ##args); \
+ })
+
+loff_t raw_total_space = 0;
+loff_t global_total_space = 0;
+EXPORT_SYMBOL_GPL(global_total_space);
+
+loff_t raw_remaining_space = 0;
+loff_t global_remaining_space = 0;
+EXPORT_SYMBOL_GPL(global_remaining_space);
+
+int global_logrot_auto = CONFIG_MARS_LOGROT_AUTO;
+EXPORT_SYMBOL_GPL(global_logrot_auto);
+
+int global_free_space_0 = CONFIG_MARS_MIN_SPACE_0;
+EXPORT_SYMBOL_GPL(global_free_space_0);
+
+int global_free_space_1 = CONFIG_MARS_MIN_SPACE_1;
+EXPORT_SYMBOL_GPL(global_free_space_1);
+
+int global_free_space_2 = CONFIG_MARS_MIN_SPACE_2;
+EXPORT_SYMBOL_GPL(global_free_space_2);
+
+int global_free_space_3 = CONFIG_MARS_MIN_SPACE_3;
+EXPORT_SYMBOL_GPL(global_free_space_3);
+
+int global_free_space_4 = CONFIG_MARS_MIN_SPACE_4;
+EXPORT_SYMBOL_GPL(global_free_space_4);
+
+int _global_sync_want = 0;
+int global_sync_want = 0;
+EXPORT_SYMBOL_GPL(global_sync_want);
+
+int global_sync_nr = 0;
+EXPORT_SYMBOL_GPL(global_sync_nr);
+
+int global_sync_limit = 0;
+EXPORT_SYMBOL_GPL(global_sync_limit);
+
+int mars_rollover_interval = CONFIG_MARS_ROLLOVER_INTERVAL;
+EXPORT_SYMBOL_GPL(mars_rollover_interval);
+
+int mars_scan_interval = CONFIG_MARS_SCAN_INTERVAL;
+EXPORT_SYMBOL_GPL(mars_scan_interval);
+
+int mars_propagate_interval = CONFIG_MARS_PROPAGATE_INTERVAL;
+EXPORT_SYMBOL_GPL(mars_propagate_interval);
+
+int mars_sync_flip_interval = CONFIG_MARS_SYNC_FLIP_INTERVAL;
+EXPORT_SYMBOL_GPL(mars_sync_flip_interval);
+
+int mars_peer_abort = 7;
+EXPORT_SYMBOL_GPL(mars_peer_abort);
+
+int mars_fast_fullsync =
+#ifdef CONFIG_MARS_FAST_FULLSYNC
+ 1
+#else
+ 0
+#endif
+ ;
+EXPORT_SYMBOL_GPL(mars_fast_fullsync);
+
+int xio_throttle_start = 60;
+EXPORT_SYMBOL_GPL(xio_throttle_start);
+
+int xio_throttle_end = 90;
+EXPORT_SYMBOL_GPL(xio_throttle_end);
+
+int mars_emergency_mode = 0;
+EXPORT_SYMBOL_GPL(mars_emergency_mode);
+
+int mars_reset_emergency = 1;
+EXPORT_SYMBOL_GPL(mars_reset_emergency);
+
+int mars_keep_msg = 10;
+EXPORT_SYMBOL_GPL(mars_keep_msg);
+
+#define MARS_SYMLINK_MAX 1023
+
+struct key_value_pair {
+ const char *key;
+ char *val;
+ char *old_val;
+ unsigned long last_jiffies;
+ struct timespec system_stamp;
+ struct timespec lamport_stamp;
+};
+
+static inline
+void clear_vals(struct key_value_pair *start)
+{
+ while (start->key) {
+ brick_string_free(start->val);
+ start->val = NULL;
+ brick_string_free(start->old_val);
+ start->old_val = NULL;
+ start++;
+ }
+}
+
+static
+void show_vals(struct key_value_pair *start, const char *path, const char *add)
+{
+ while (start->key) {
+ char *dst = path_make("%s/actual-%s/msg-%s%s", path, my_id(), add, start->key);
+
+ /* show the old message for some keep_time if no new one is available */
+ if (!start->val && start->old_val &&
+ (long long)start->last_jiffies + mars_keep_msg * HZ <= (long long)jiffies) {
+ start->val = start->old_val;
+ start->old_val = NULL;
+ }
+ if (start->val) {
+ char *src = path_make("%ld.%09ld %ld.%09ld %s",
+ start->system_stamp.tv_sec, start->system_stamp.tv_nsec,
+ start->lamport_stamp.tv_sec, start->lamport_stamp.tv_nsec,
+ start->val);
+ mars_symlink(src, dst, NULL, 0);
+ brick_string_free(src);
+ brick_string_free(start->old_val);
+ start->old_val = start->val;
+ start->val = NULL;
+ } else {
+ mars_symlink("OK", dst, NULL, 0);
+ memset(&start->system_stamp, 0, sizeof(start->system_stamp));
+ memset(&start->lamport_stamp, 0, sizeof(start->lamport_stamp));
+ brick_string_free(start->old_val);
+ start->old_val = NULL;
+ }
+ brick_string_free(dst);
+ start++;
+ }
+}
+
+static inline
+void assign_keys(struct key_value_pair *start, const char **keys)
+{
+ while (*keys) {
+ start->key = *keys;
+ start++;
+ keys++;
+ }
+}
+
+static inline
+struct key_value_pair *find_key(struct key_value_pair *start, const char *key)
+{
+ while (start->key) {
+ if (!strcmp(start->key, key))
+ return start;
+ start++;
+ }
+ XIO_ERR("cannot find key '%s'\n", key);
+ return NULL;
+}
+
+static
+void _make_msg(int line, struct key_value_pair *pair, const char *fmt, ...) __printf(3, 4);
+
+static
+void _make_msg(int line, struct key_value_pair *pair, const char *fmt, ...)
+{
+ int len;
+ va_list args;
+
+ if (unlikely(!pair || !pair->key)) {
+ XIO_ERR("bad pointer %p at line %d\n", pair, line);
+ goto out_return;
+ }
+ pair->last_jiffies = jiffies;
+ if (!pair->val) {
+ pair->val = brick_string_alloc(MARS_SYMLINK_MAX + 1);
+ len = 0;
+ if (!pair->system_stamp.tv_sec) {
+ pair->system_stamp = CURRENT_TIME;
+ get_lamport(&pair->lamport_stamp);
+ }
+ } else {
+ len = strnlen(pair->val, MARS_SYMLINK_MAX);
+ if (unlikely(len >= MARS_SYMLINK_MAX - 48))
+ goto out_return;
+ pair->val[len++] = ',';
+ }
+
+ va_start(args, fmt);
+ vsnprintf(pair->val + len, MARS_SYMLINK_MAX - 1 - len, fmt, args);
+ va_end(args);
+out_return:;
+}
+
+#define make_msg(pair, fmt, args...) \
+ _make_msg(__LINE__, pair, fmt, ##args)
+
+static
+struct key_value_pair gbl_pairs[] = {
+ { NULL }
+};
+
+#define make_gbl_msg(key, fmt, args...) \
+ make_msg(find_key(gbl_pairs, key), fmt, ##args)
+
+static
+const char *rot_keys[] = {
+ /* from _update_version_link() */
+ "err-versionlink-skip",
+ /* from _update_info() */
+ "err-sequence-trash",
+ /* from _is_switchover_possible() */
+ "inf-versionlink-not-yet-exist",
+ "inf-versionlink-not-equal",
+ "inf-replay-not-yet-finished",
+ "err-bad-log-name",
+ "err-log-not-contiguous",
+ "err-versionlink-not-readable",
+ "err-replaylink-not-readable",
+ "err-splitbrain-detected",
+ /* from _update_file() */
+ "inf-fetch",
+ /* from make_sync() */
+ "inf-sync",
+ /* from make_log_step() */
+ "wrn-log-consecutive",
+ /* from make_log_finalize() */
+ "inf-replay-start",
+ "wrn-space-low",
+ "err-space-low",
+ "err-emergency",
+ "err-replay-stop",
+ /* from _check_logging_status() */
+ "inf-replay-tolerance",
+ "err-replay-size",
+ NULL,
+};
+
+#define make_rot_msg(rot, key, fmt, args...) \
+ make_msg(find_key(&(rot)->msgs[0], key), fmt, ##args)
+
+#define IS_EXHAUSTED() (mars_emergency_mode > 0)
+#define IS_EMERGENCY_SECONDARY() (mars_emergency_mode > 1)
+#define IS_EMERGENCY_PRIMARY() (mars_emergency_mode > 2)
+#define IS_JAMMED() (mars_emergency_mode > 3)
+
+static
+void _make_alivelink_str(const char *name, const char *src)
+{
+ char *dst = path_make("/mars/%s-%s", name, my_id());
+
+ if (!src || !dst) {
+ XIO_ERR("cannot make alivelink paths\n");
+ goto err;
+ }
+ XIO_DBG("'%s' -> '%s'\n", src, dst);
+ mars_symlink(src, dst, NULL, 0);
+err:
+ brick_string_free(dst);
+}
+
+static
+void _make_alivelink(const char *name, loff_t val)
+{
+ char *src = path_make("%lld", val);
+
+ _make_alivelink_str(name, src);
+ brick_string_free(src);
+}
+
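+/* Example (illustrative): _make_alivelink("emergency", 2) publishes the
+ * symlink /mars/emergency-<my_id> pointing to "2", where userspace tools
+ * like marsadm can pick it up.
+ */
+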
+static
+int compute_emergency_mode(void)
+{
+ loff_t rest;
+ loff_t present;
+ loff_t limit = 0;
+ int mode = 4;
+ int this_mode = 0;
+
+ mars_remaining_space("/mars", &raw_total_space, &raw_remaining_space);
+ rest = raw_remaining_space;
+
+#define CHECK_LIMIT(LIMIT_VAR) \
+do { \
+ if (LIMIT_VAR > 0) \
+ limit += (loff_t)LIMIT_VAR * 1024 * 1024; \
+ if (rest < limit && !this_mode) { \
+ this_mode = mode; \
+ } \
+ mode--; \
+} while (0)
+
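+	/* The configured reserves are cumulative: the most severe mode is
+	 * checked first against the smallest cumulative limit, so the
+	 * highest mode whose limit undercuts the remaining space wins.
+	 */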
+ CHECK_LIMIT(global_free_space_4);
+ CHECK_LIMIT(global_free_space_3);
+ CHECK_LIMIT(global_free_space_2);
+ CHECK_LIMIT(global_free_space_1);
+
+	/* Decrease the emergency mode only in single steps.
+	 */
+ if (mars_reset_emergency && mars_emergency_mode > 0 && mars_emergency_mode > this_mode)
+ mars_emergency_mode--;
+ else
+ mars_emergency_mode = this_mode;
+ _make_alivelink("emergency", mars_emergency_mode);
+
+ rest -= limit;
+ if (rest < 0)
+ rest = 0;
+ global_remaining_space = rest;
+ _make_alivelink("rest-space", rest / (1024 * 1024));
+
+ present = raw_total_space - limit;
+ global_total_space = present;
+
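+	/* Interpolate the if_brick throttling threshold linearly between
+	 * xio_throttle_start and xio_throttle_end percent space usage:
+	 * below the start percentage throttling is off (0), at/above the
+	 * end percentage every request is throttled (threshold 1).
+	 */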
+ if (xio_throttle_start > 0 &&
+ xio_throttle_end > xio_throttle_start &&
+ present > 0) {
+ loff_t percent_used = 100 - (rest * 100 / present);
+
+ if (percent_used < xio_throttle_start)
+ if_throttle_start_size = 0;
+ else if (percent_used >= xio_throttle_end)
+ if_throttle_start_size = 1;
+ else
+ if_throttle_start_size = (xio_throttle_end - percent_used) * 1024 / (xio_throttle_end - xio_throttle_start) + 1;
+ }
+
+ if (unlikely(present < global_free_space_0))
+ return -ENOSPC;
+ return 0;
+}
+
+/*****************************************************************/
+
+static struct task_struct *main_thread;
+
+typedef int (*light_worker_fn)(void *buf, struct mars_dent *dent);
+
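+/* Table-driven tree scanning: each light_class entry describes one
+ * object class in the /mars symlink tree (name pattern, type, context
+ * requirements) plus worker functions that are called when a matching
+ * dent is prepared, built up (forward) or torn down (backward).
+ */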
+struct light_class {
+ char *cl_name;
+ int cl_len;
+ char cl_type;
+ bool cl_hostcontext;
+ bool cl_serial;
+ bool cl_use_channel;
+ int cl_father;
+
+ light_worker_fn cl_prepare;
+ light_worker_fn cl_forward;
+ light_worker_fn cl_backward;
+};
+
+/* the order is important! */
+enum {
+ /* root element: this must have index 0 */
+ CL_ROOT,
+ /* global ID */
+ CL_UUID,
+ /* global userspace */
+ CL_GLOBAL_USERSPACE,
+ CL_GLOBAL_USERSPACE_ITEMS,
+ /* global todos */
+ CL_GLOBAL_TODO,
+ CL_GLOBAL_TODO_DELETE,
+ CL_GLOBAL_TODO_DELETED,
+ CL_DEFAULTS0,
+ CL_DEFAULTS,
+ CL_DEFAULTS_ITEMS0,
+ CL_DEFAULTS_ITEMS,
+ /* replacement for DNS in kernelspace */
+ CL_IPS,
+ CL_PEERS,
+ CL_GBL_ACTUAL,
+ CL_GBL_ACTUAL_ITEMS,
+ CL_ALIVE,
+ CL_TIME,
+ CL_TREE,
+ CL_EMERGENCY,
+ CL_REST_SPACE,
+ /* resource definitions */
+ CL_RESOURCE,
+ CL_RESOURCE_USERSPACE,
+ CL_RESOURCE_USERSPACE_ITEMS,
+ CL_RES_DEFAULTS0,
+ CL_RES_DEFAULTS,
+ CL_RES_DEFAULTS_ITEMS0,
+ CL_RES_DEFAULTS_ITEMS,
+ CL_TODO,
+ CL_TODO_ITEMS,
+ CL_ACTUAL,
+ CL_ACTUAL_ITEMS,
+ CL_DATA,
+ CL_SIZE,
+ CL_ACTSIZE,
+ CL_PRIMARY,
+ CL_CONNECT,
+ CL_TRANSFER,
+ CL_SYNC,
+ CL_VERIF,
+ CL_SYNCPOS,
+ CL_VERSION,
+ CL_LOG,
+ CL_REPLAYSTATUS,
+ CL_DEVICE,
+};
+
+/*********************************************************************/
+
+/* needed for logfile rotation */
+
+#define MAX_INFOS 4
+
+struct mars_rotate {
+ struct list_head rot_head;
+ struct mars_global *global;
+ struct copy_brick *sync_brick;
+ struct mars_dent *replay_link;
+ struct xio_brick *bio_brick;
+ struct mars_dent *aio_dent;
+ struct aio_brick *aio_brick;
+ struct xio_info aio_info;
+ struct trans_logger_brick *trans_brick;
+ struct mars_dent *first_log;
+ struct mars_dent *relevant_log;
+ struct xio_brick *relevant_brick;
+ struct mars_dent *next_relevant_log;
+ struct xio_brick *next_relevant_brick;
+ struct mars_dent *next_next_relevant_log;
+ struct mars_dent *prev_log;
+ struct mars_dent *next_log;
+ struct mars_dent *syncstatus_dent;
+ struct if_brick *if_brick;
+ const char *fetch_path;
+ const char *fetch_peer;
+ const char *preferred_peer;
+ const char *parent_path;
+ const char *parent_rest;
+ const char *fetch_next_origin;
+ struct say_channel *log_say;
+ struct copy_brick *fetch_brick;
+ struct xio_limiter replay_limiter;
+ struct xio_limiter sync_limiter;
+ struct xio_limiter fetch_limiter;
+ int inf_prev_sequence;
+ long long flip_start;
+ loff_t dev_size;
+ loff_t start_pos;
+ loff_t end_pos;
+ int max_sequence;
+ int fetch_round;
+ int fetch_serial;
+ int fetch_next_serial;
+ int split_brain_serial;
+ int split_brain_round;
+ int fetch_next_is_available;
+ int relevant_serial;
+ bool has_symlinks;
+ bool res_shutdown;
+ bool has_error;
+ bool allow_update;
+ bool forbid_replay;
+ bool replay_mode;
+ bool todo_primary;
+ bool is_primary;
+ bool old_is_primary;
+ bool created_hole;
+ bool is_log_damaged;
+ bool has_emergency;
+ bool wants_sync;
+ bool gets_sync;
+ spinlock_t inf_lock;
+ bool infs_is_dirty[MAX_INFOS];
+ struct trans_logger_info infs[MAX_INFOS];
+ struct key_value_pair msgs[sizeof(rot_keys) / sizeof(char *)];
+};
+
+static LIST_HEAD(rot_anchor);
+
+/*********************************************************************/
+
+/* TUNING */
+
+int mars_mem_percent = 20;
+EXPORT_SYMBOL_GPL(mars_mem_percent);
+
+#define CONF_TRANS_SHADOW_LIMIT (1024 * 128) /* don't fill the hashtable too much */
+
+#define CONF_TRANS_BATCHLEN 64
+#define CONF_TRANS_PRIO XIO_PRIO_HIGH
+#define CONF_TRANS_LOG_READS false
+
+#define CONF_ALL_BATCHLEN 1
+#define CONF_ALL_PRIO XIO_PRIO_NORMAL
+
+#define IF_SKIP_SYNC true
+
+#define IF_MAX_PLUGGED 10000
+#define IF_READAHEAD 0
+
+#define BIO_READAHEAD 0
+#define BIO_NOIDLE true
+#define BIO_SYNC true
+#define BIO_UNPLUG true
+
+#define COPY_APPEND_MODE 0
+#define COPY_PRIO XIO_PRIO_LOW
+
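+/* Parameter callbacks for make_brick_all(): they (re)configure a
+ * freshly created brick and return 1 on success (for the copy brick,
+ * 1/0 also encodes the desired switch state) or a negative error code.
+ */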
+static
+int _set_trans_params(struct xio_brick *_brick, void *private)
+{
+ struct trans_logger_brick *trans_brick = (void *)_brick;
+
+ if (_brick->type != (void *)&trans_logger_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ if (!trans_brick->q_phase[1].q_ordering) {
+ trans_brick->q_phase[0].q_batchlen = CONF_TRANS_BATCHLEN;
+ trans_brick->q_phase[1].q_batchlen = CONF_ALL_BATCHLEN;
+ trans_brick->q_phase[2].q_batchlen = CONF_ALL_BATCHLEN;
+ trans_brick->q_phase[3].q_batchlen = CONF_ALL_BATCHLEN;
+
+ trans_brick->q_phase[0].q_io_prio = CONF_TRANS_PRIO;
+ trans_brick->q_phase[1].q_io_prio = CONF_ALL_PRIO;
+ trans_brick->q_phase[2].q_io_prio = CONF_ALL_PRIO;
+ trans_brick->q_phase[3].q_io_prio = CONF_ALL_PRIO;
+
+ trans_brick->q_phase[1].q_ordering = true;
+ trans_brick->q_phase[3].q_ordering = true;
+
+ trans_brick->shadow_mem_limit = CONF_TRANS_SHADOW_LIMIT;
+ trans_brick->log_reads = CONF_TRANS_LOG_READS;
+ }
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+struct client_cookie {
+ bool limit_mode;
+ bool create_mode;
+};
+
+static
+int _set_client_params(struct xio_brick *_brick, void *private)
+{
+ struct client_brick *client_brick = (void *)_brick;
+ struct client_cookie *clc = private;
+
+ client_brick->io_timeout = 0;
+ client_brick->limit_mode = clc ? clc->limit_mode : false;
+ client_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_aio_params(struct xio_brick *_brick, void *private)
+{
+ struct aio_brick *aio_brick = (void *)_brick;
+ struct client_cookie *clc = private;
+
+ if (_brick->type == (void *)&client_brick_type)
+ return _set_client_params(_brick, private);
+ if (_brick->type != (void *)&aio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ aio_brick->o_creat = clc && clc->create_mode;
+ aio_brick->o_direct = false; /* important! */
+ aio_brick->o_fdsync = true;
+ aio_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_bio_params(struct xio_brick *_brick, void *private)
+{
+ struct bio_brick *bio_brick;
+
+ if (_brick->type == (void *)&client_brick_type)
+ return _set_client_params(_brick, private);
+ if (_brick->type == (void *)&aio_brick_type)
+ return _set_aio_params(_brick, private);
+ if (_brick->type != (void *)&bio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ bio_brick = (void *)_brick;
+ bio_brick->ra_pages = BIO_READAHEAD;
+ bio_brick->do_noidle = BIO_NOIDLE;
+ bio_brick->do_sync = BIO_SYNC;
+ bio_brick->do_unplug = BIO_UNPLUG;
+ bio_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_if_params(struct xio_brick *_brick, void *private)
+{
+ struct if_brick *if_brick = (void *)_brick;
+ struct mars_rotate *rot = private;
+
+ if (_brick->type != (void *)&if_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ if (!rot) {
+ XIO_ERR("too early\n");
+ return -EINVAL;
+ }
+ if (rot->dev_size <= 0) {
+ XIO_ERR("dev_size = %lld\n", rot->dev_size);
+ return -EINVAL;
+ }
+ if (if_brick->dev_size > 0 && rot->dev_size < if_brick->dev_size) {
+ XIO_ERR("new dev size = %lld < old dev_size = %lld\n", rot->dev_size, if_brick->dev_size);
+ return -EINVAL;
+ }
+ if_brick->dev_size = rot->dev_size;
+ if_brick->max_plugged = IF_MAX_PLUGGED;
+ if_brick->readahead = IF_READAHEAD;
+ if_brick->skip_sync = IF_SKIP_SYNC;
+ XIO_INF("name = '%s' path = '%s' size = %lld\n", _brick->brick_name, _brick->brick_path, if_brick->dev_size);
+ return 1;
+}
+
+struct copy_cookie {
+ const char *argv[2];
+ const char *copy_path;
+ loff_t start_pos;
+ loff_t end_pos;
+ bool verify_mode;
+
+ const char *fullpath[2];
+ struct xio_output *output[2];
+ struct xio_info info[2];
+};
+
+static
+int _set_copy_params(struct xio_brick *_brick, void *private)
+{
+ struct copy_brick *copy_brick = (void *)_brick;
+ struct copy_cookie *cc = private;
+ int status = 1;
+
+ if (_brick->type != (void *)&copy_brick_type) {
+ XIO_ERR("bad brick type\n");
+ status = -EINVAL;
+ goto done;
+ }
+ copy_brick->append_mode = COPY_APPEND_MODE;
+ copy_brick->io_prio = COPY_PRIO;
+ copy_brick->verify_mode = cc->verify_mode;
+ copy_brick->repair_mode = true;
+ copy_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+
+ /* Determine the copy area, switch on/off when necessary
+ */
+ if (!copy_brick->power.button && copy_brick->power.off_led) {
+ int i;
+
+ copy_brick->copy_last = 0;
+ for (i = 0; i < 2; i++) {
+ status = cc->output[i]->ops->xio_get_info(cc->output[i], &cc->info[i]);
+ if (status < 0) {
+ XIO_WRN("cannot determine current size of '%s'\n", cc->argv[i]);
+ goto done;
+ }
+ XIO_DBG("%d '%s' current_size = %lld\n", i, cc->fullpath[i], cc->info[i].current_size);
+ }
+ copy_brick->copy_start = cc->info[1].current_size;
+ if (cc->start_pos != -1) {
+ copy_brick->copy_start = cc->start_pos;
+ if (unlikely(cc->start_pos > cc->info[0].current_size)) {
+ XIO_ERR("bad start position %lld is larger than actual size %lld on '%s'\n",
+ cc->start_pos,
+ cc->info[0].current_size,
+ cc->copy_path);
+ status = -EINVAL;
+ goto done;
+ }
+ }
+ XIO_DBG("copy_start = %lld\n", copy_brick->copy_start);
+ copy_brick->copy_end = cc->info[0].current_size;
+ if (cc->end_pos != -1) {
+ if (unlikely(cc->end_pos > copy_brick->copy_end)) {
+ XIO_ERR("target size %lld is larger than actual size %lld on source\n",
+ cc->end_pos,
+ copy_brick->copy_end);
+ status = -EINVAL;
+ goto done;
+ }
+ copy_brick->copy_end = cc->end_pos;
+ if (unlikely(cc->end_pos > cc->info[1].current_size)) {
+ XIO_ERR("bad end position %lld is larger than actual size %lld on target\n",
+ cc->end_pos,
+ cc->info[1].current_size);
+ status = -EINVAL;
+ goto done;
+ }
+ }
+ XIO_DBG("copy_end = %lld\n", copy_brick->copy_end);
+ if (copy_brick->copy_start < copy_brick->copy_end) {
+ status = 1;
+ XIO_DBG("copy switch on\n");
+ }
+ } else if (copy_brick->power.button && copy_brick->power.on_led && copy_brick->copy_last == copy_brick->copy_end && copy_brick->copy_end > 0) {
+ status = 0;
+ XIO_DBG("copy switch off\n");
+ }
+
+done:
+ return status;
+}
+
+/*********************************************************************/
+
+/* internal helpers */
+
+#define MARS_DELIM ','
+
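+/* Split a MARS_DELIM-separated argument string into exactly @count
+ * parts stored in dent->d_argv[]; the last part takes the rest of the
+ * string. Returns 0 on success, -EINVAL on bad syntax.
+ */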
+static int _parse_args(struct mars_dent *dent, char *str, int count)
+{
+ int i;
+ int status = -EINVAL;
+
+ if (!str)
+ goto done;
+ if (!dent->d_args)
+ dent->d_args = brick_strdup(str);
+ for (i = 0; i < count; i++) {
+ char *tmp;
+ int len;
+
+ if (!*str)
+ goto done;
+		if (i == count-1) {
+			len = strlen(str);
+		} else {
+			char *delim = strchr(str, MARS_DELIM);
+
+			if (!delim)
+				goto done;
+			len = (delim - str);
+		}
+ brick_string_free(dent->d_argv[i]);
+ tmp = brick_string_alloc(len + 1);
+ dent->d_argv[i] = tmp;
+ strncpy(dent->d_argv[i], str, len);
+ dent->d_argv[i][len] = '\0';
+
+ str += len;
+ if (i != count-1)
+ str++;
+ }
+ status = 0;
+done:
+ if (status < 0) {
+ XIO_ERR("bad syntax '%s' (should have %d args), status = %d\n",
+ dent->d_args ? dent->d_args : "",
+ count,
+ status);
+ }
+ return status;
+}
+
+static
+int _check_switch(struct mars_global *global, const char *path)
+{
+ int status;
+ int res = 0;
+ struct mars_dent *allow_dent;
+
+ allow_dent = mars_find_dent(global, path);
+ if (!allow_dent || !allow_dent->link_val)
+ goto done;
+ status = kstrtoint(allow_dent->link_val, 0, &res);
+ (void)status; /* treat errors as if the switch were set to 0 */
+ XIO_DBG("'%s' -> %d\n", path, res);
+
+done:
+ return res;
+}
+
+static
+int _check_allow(struct mars_global *global, struct mars_dent *parent, const char *name)
+{
+ int res = 0;
+ char *path = path_make("%s/todo-%s/%s", parent->d_path, my_id(), name);
+
+ if (!path)
+ goto done;
+
+ res = _check_switch(global, path);
+
+done:
+ brick_string_free(path);
+ return res;
+}
+
+#define skip_part(s) _skip_part(s, ',', ':')
+#define skip_sect(s) _skip_part(s, ':', 0)
+static inline
+int _skip_part(const char *str, const char del1, const char del2)
+{
+ int len = 0;
+
+ while (str[len] && str[len] != del1 && (!del2 || str[len] != del2))
+ len++;
+ return len;
+}
+
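+/* return the index just behind the last '/' in @str (0 if none) */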
+static inline
+int skip_dir(const char *str)
+{
+ int len = 0;
+ int res = 0;
+
+ for (len = 0; str[len]; len++)
+ if (str[len] == '/')
+ res = len + 1;
+ return res;
+}
+
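+/* Parse a logfile name of the form "log-<sequence>-<host>".
+ * Returns the parsed length and sets *seq; *host is allocated and
+ * must be freed by the caller. Returns 0 on parse errors.
+ */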
+static
+int parse_logfile_name(const char *str, int *seq, const char **host)
+{
+ char *_host;
+ int count;
+ int len = 0;
+ int len_host;
+
+ *seq = 0;
+ *host = NULL;
+
+ count = sscanf(str, "log-%d-%n", seq, &len);
+ if (unlikely(count != 1)) {
+ XIO_ERR("bad logfile name '%s', count=%d, len=%d\n", str, count, len);
+ return 0;
+ }
+
+ _host = brick_strdup(str + len);
+
+ len_host = skip_part(_host);
+ _host[len_host] = '\0';
+ *host = _host;
+ len += len_host;
+
+ return len;
+}
+
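+/* Compare the replay links of two hosts like a comparator:
+ * -1/0/+1 ordered first by logfile sequence number, then by replay
+ * offset; -2 when one of the links cannot be read.
+ */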
+static
+int compare_replaylinks(struct mars_rotate *rot, const char *hosta, const char *hostb)
+{
+ const char *linka = path_make("%s/replay-%s", rot->parent_path, hosta);
+ const char *linkb = path_make("%s/replay-%s", rot->parent_path, hostb);
+ const char *a = NULL;
+ const char *b = NULL;
+ int seqa;
+ int seqb;
+ int posa;
+ int posb;
+ loff_t offa = 0;
+ loff_t offb = -1;
+ loff_t taila = 0;
+ loff_t tailb = -1;
+ int count;
+ int res = -2;
+
+ if (unlikely(!linka || !linkb)) {
+		XIO_ERR("no MEM\n");
+ goto done;
+ }
+
+ a = mars_readlink(linka);
+ if (unlikely(!a || !a[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read replaylink '%s'\n", linka);
+ goto done;
+ }
+ b = mars_readlink(linkb);
+ if (unlikely(!b || !b[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read replaylink '%s'\n", linkb);
+ goto done;
+ }
+
+ count = sscanf(a, "log-%d-%n", &seqa, &posa);
+ if (unlikely(count != 1))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ count = sscanf(b, "log-%d-%n", &seqb, &posb);
+ if (unlikely(count != 1))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+ if (seqa < seqb) {
+ res = -1;
+ goto done;
+ } else if (seqa > seqb) {
+ res = 1;
+ goto done;
+ }
+
+ posa += skip_part(a + posa);
+ posb += skip_part(b + posb);
+ if (unlikely(!a[posa++]))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ if (unlikely(!b[posb++]))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+ count = sscanf(a + posa, "%lld,%lld", &offa, &taila);
+ if (unlikely(count != 2))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ count = sscanf(b + posb, "%lld,%lld", &offb, &tailb);
+ if (unlikely(count != 2))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+	if (offa < offb)
+		res = -1;
+	else if (offa > offb)
+		res = 1;
+	else
+		res = 0;
+
+done:
+ brick_string_free(a);
+ brick_string_free(b);
+ brick_string_free(linka);
+ brick_string_free(linkb);
+ return res;
+}
+
+/*********************************************************************/
+
+/* status display */
+
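+/* (Re-)write a symlink only when its content has really changed, to
+ * avoid useless timestamp updates. Returns 1 when the link was
+ * (re-)written, 0 when unchanged or on error.
+ */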
+static
+int _update_link_when_necessary(struct mars_rotate *rot, const char *type, const char *old, const char *new)
+{
+ char *check = NULL;
+ int status = -EIO;
+	int res = 0;
+
+ if (unlikely(!old || !new))
+ goto out;
+
+ /* Check whether something really has changed (avoid
+ * useless/disturbing timestamp updates)
+ */
+ check = mars_readlink(new);
+ if (check && !strcmp(check, old)) {
+ XIO_DBG("%s symlink '%s' -> '%s' has not changed\n", type, old, new);
+ res = 0;
+ goto out;
+ }
+
+ status = mars_symlink(old, new, NULL, 0);
+ if (unlikely(status < 0)) {
+ XIO_ERR_TO(rot->log_say,
+ "cannot create %s symlink '%s' -> '%s' status = %d\n",
+ type,
+ old,
+ new,
+ status);
+ } else {
+ res = 1;
+ XIO_DBG("made %s symlink '%s' -> '%s' status = %d\n", type, old, new, status);
+ }
+
+out:
+ brick_string_free(check);
+ return res;
+}
+
+static
+int _update_replay_link(struct mars_rotate *rot, struct trans_logger_info *inf)
+{
+ char *old = NULL;
+ char *new = NULL;
+ int res = 0;
+
+ old = path_make("log-%09d-%s,%lld,%lld",
+ inf->inf_sequence,
+ inf->inf_host,
+ inf->inf_min_pos,
+ inf->inf_max_pos - inf->inf_min_pos);
+ if (!old)
+ goto out;
+ new = path_make("%s/replay-%s", rot->parent_path, my_id());
+ if (!new)
+ goto out;
+
+ res = _update_link_when_necessary(rot, "replay", old, new);
+
+out:
+ brick_string_free(new);
+ brick_string_free(old);
+ return res;
+}
+
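+/* Build the version link for a logfile: a digest over
+ * (sequence, host, log position, previous version link). Chaining in
+ * the predecessor makes divergent histories (split brain) detectable
+ * by comparing version links across hosts.
+ */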
+static
+int _update_version_link(struct mars_rotate *rot, struct trans_logger_info *inf)
+{
+ char *data = brick_string_alloc(0);
+ char *old = brick_string_alloc(0);
+ char *new = NULL;
+ unsigned char *digest = brick_string_alloc(0);
+ char *prev = NULL;
+ char *prev_link = NULL;
+ char *prev_digest = NULL;
+ int len;
+ int i;
+ int res = 0;
+
+ if (likely(inf->inf_sequence > 1)) {
+ if (unlikely((inf->inf_sequence < rot->inf_prev_sequence ||
+ inf->inf_sequence > rot->inf_prev_sequence + 1) &&
+ rot->inf_prev_sequence != 0)) {
+ char *skip_path = path_make("%s/skip-check-%s", rot->parent_path, my_id());
+ char *skip_link = mars_readlink(skip_path);
+ char *msg = "";
+ int skip_nr = -1;
+ int nr_char = 0;
+
+ if (likely(skip_link && skip_link[0])) {
+ int status = sscanf(skip_link, "%d%n", &skip_nr, &nr_char);
+
+ (void)status; /* keep msg empty in case of errors */
+ msg = skip_link + nr_char;
+ }
+ brick_string_free(skip_path);
+ if (likely(skip_nr != inf->inf_sequence)) {
+ XIO_ERR_TO(rot->log_say,
+ "SKIP in sequence numbers detected: %d != %d + 1\n",
+ inf->inf_sequence,
+ rot->inf_prev_sequence);
+ make_rot_msg(rot,
+ "err-versionlink-skip",
+ "SKIP in sequence numbers detected: %d != %d + 1",
+ inf->inf_sequence,
+ rot->inf_prev_sequence);
+ brick_string_free(skip_link);
+ goto out;
+ }
+ XIO_WRN_TO(rot->log_say,
+ "you explicitly requested to SKIP sequence numbers from %d to %d%s\n",
+ rot->inf_prev_sequence, inf->inf_sequence, msg);
+ brick_string_free(skip_link);
+ }
+ prev = path_make("%s/version-%09d-%s", rot->parent_path, inf->inf_sequence - 1, my_id());
+ if (unlikely(!prev)) {
+ XIO_ERR("no MEM\n");
+ goto out;
+ }
+ prev_link = mars_readlink(prev);
+ rot->inf_prev_sequence = inf->inf_sequence;
+ }
+
+ len = sprintf(data,
+ "%d,%s,%lld:%s",
+ inf->inf_sequence,
+ inf->inf_host,
+ inf->inf_log_pos,
+ prev_link ? prev_link : "");
+
+ XIO_DBG("data = '%s' len = %d\n", data, len);
+
+ xio_digest(digest, data, len);
+
+ len = 0;
+ for (i = 0; i < xio_digest_size; i++)
+ len += sprintf(old + len, "%02x", digest[i]);
+
+ if (likely(prev_link && prev_link[0])) {
+ char *tmp;
+
+ prev_digest = brick_strdup(prev_link);
+ /* take the part before ':' */
+ for (tmp = prev_digest; *tmp; tmp++)
+ if (*tmp == ':')
+ break;
+ *tmp = '\0';
+ }
+
+ len += sprintf(old + len,
+ ",log-%09d-%s,%lld:%s",
+ inf->inf_sequence,
+ inf->inf_host,
+ inf->inf_log_pos,
+ prev_digest ? prev_digest : "");
+
+ new = path_make("%s/version-%09d-%s", rot->parent_path, inf->inf_sequence, my_id());
+ if (!new) {
+ XIO_ERR("no MEM\n");
+ goto out;
+ }
+
+ res = _update_link_when_necessary(rot, "version", old, new);
+
+out:
+ brick_string_free(new);
+ brick_string_free(prev);
+ brick_string_free(data);
+ brick_string_free(digest);
+ brick_string_free(old);
+ brick_string_free(prev_link);
+ brick_string_free(prev_digest);
+ return res;
+}
+
+static
+void _update_info(struct trans_logger_info *inf)
+{
+ struct mars_rotate *rot = inf->inf_private;
+ int hash;
+
+ if (unlikely(!rot)) {
+ XIO_ERR("rot is NULL\n");
+ goto done;
+ }
+
+ XIO_DBG("inf = %p '%s' seq = %d min_pos = %lld max_pos = %lld log_pos = %lld is_replaying = %d is_logging = %d\n",
+ inf,
+ SAFE_STR(inf->inf_host),
+ inf->inf_sequence,
+ inf->inf_min_pos,
+ inf->inf_max_pos,
+ inf->inf_log_pos,
+ inf->inf_is_replaying,
+ inf->inf_is_logging);
+
+ hash = inf->inf_sequence % MAX_INFOS;
+ if (unlikely(rot->infs_is_dirty[hash])) {
+ if (unlikely(rot->infs[hash].inf_sequence != inf->inf_sequence)) {
+ XIO_ERR_TO(rot->log_say,
+				"buffer %d: sequence trash %d -> %d. is the mars_light thread hanging?\n",
+ hash,
+ rot->infs[hash].inf_sequence,
+ inf->inf_sequence);
+ make_rot_msg(rot,
+ "err-sequence-trash",
+ "buffer %d: sequence trash %d -> %d",
+ hash,
+ rot->infs[hash].inf_sequence,
+ inf->inf_sequence);
+ } else {
+ XIO_DBG("buffer %d is overwritten (sequence=%d)\n", hash, inf->inf_sequence);
+ }
+ }
+
+ spin_lock(&rot->inf_lock);
+ memcpy(&rot->infs[hash], inf, sizeof(struct trans_logger_info));
+ rot->infs_is_dirty[hash] = true;
+ spin_unlock(&rot->inf_lock);
+
+ local_trigger();
+done:;
+}
+
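+/* Flush all dirty logger info buffers to their replay/version
+ * symlinks, in ascending sequence order, and trigger the peers when
+ * something has changed.
+ */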
+static
+void write_info_links(struct mars_rotate *rot)
+{
+ struct trans_logger_info inf;
+ int count = 0;
+
+ for (;;) {
+ int hash = -1;
+ int min = 0;
+ int i;
+
+ spin_lock(&rot->inf_lock);
+ for (i = 0; i < MAX_INFOS; i++) {
+ if (!rot->infs_is_dirty[i])
+ continue;
+ if (!min || min > rot->infs[i].inf_sequence) {
+ min = rot->infs[i].inf_sequence;
+ hash = i;
+ }
+ }
+
+ if (hash < 0) {
+ spin_unlock(&rot->inf_lock);
+ break;
+ }
+
+ rot->infs_is_dirty[hash] = false;
+ memcpy(&inf, &rot->infs[hash], sizeof(struct trans_logger_info));
+ spin_unlock(&rot->inf_lock);
+
+ XIO_DBG("seq = %d min_pos = %lld max_pos = %lld log_pos = %lld is_replaying = %d is_logging = %d\n",
+ inf.inf_sequence,
+ inf.inf_min_pos,
+ inf.inf_max_pos,
+ inf.inf_log_pos,
+ inf.inf_is_replaying,
+ inf.inf_is_logging);
+
+		if (inf.inf_is_logging || inf.inf_is_replaying) {
+			count += _update_replay_link(rot, &inf);
+			count += _update_version_link(rot, &inf);
+		}
+ }
+ if (count) {
+ if (inf.inf_min_pos == inf.inf_max_pos)
+ local_trigger();
+ remote_trigger();
+ }
+}
+
+static
+void _make_new_replaylink(struct mars_rotate *rot, char *new_host, int new_sequence, loff_t end_pos)
+{
+ struct trans_logger_info inf = {
+ .inf_private = rot,
+ .inf_sequence = new_sequence,
+ .inf_min_pos = 0,
+ .inf_max_pos = 0,
+ .inf_log_pos = end_pos,
+ .inf_is_replaying = true,
+ };
+ strncpy(inf.inf_host, new_host, sizeof(inf.inf_host));
+
+ XIO_DBG("new_host = '%s' new_sequence = %d end_pos = %lld\n", new_host, new_sequence, end_pos);
+
+ _update_replay_link(rot, &inf);
+ _update_version_link(rot, &inf);
+
+ local_trigger();
+ remote_trigger();
+}
+
+static
+int __show_actual(const char *path, const char *name, int val)
+{
+ char *src;
+ char *dst = NULL;
+ int status = -EINVAL;
+
+ src = path_make("%d", val);
+ dst = path_make("%s/actual-%s/%s", path, my_id(), name);
+ status = -ENOMEM;
+ if (!dst)
+ goto done;
+
+ XIO_DBG("symlink '%s' -> '%s'\n", dst, src);
+ status = mars_symlink(src, dst, NULL, 0);
+
+done:
+ brick_string_free(src);
+ brick_string_free(dst);
+ return status;
+}
+
+static inline
+int _show_actual(const char *path, const char *name, bool val)
+{
+ return __show_actual(path, name, val ? 1 : 0);
+}
+
+static
+void _show_primary(struct mars_rotate *rot, struct mars_dent *parent)
+{
+ int status;
+
+ if (!rot || !parent)
+ goto out_return;
+ status = _show_actual(parent->d_path, "is-primary", rot->is_primary);
+ if (rot->is_primary != rot->old_is_primary) {
+ rot->old_is_primary = rot->is_primary;
+ remote_trigger();
+ }
+out_return:;
+}
+
+static
+void _show_brick_status(struct xio_brick *test, bool shutdown)
+{
+ const char *path;
+ char *src;
+ char *dst;
+ int status;
+
+ path = test->brick_path;
+ if (!path) {
+ XIO_WRN("bad path\n");
+ goto out_return;
+ }
+ if (*path != '/') {
+ XIO_WRN("bogus path '%s'\n", path);
+ goto out_return;
+ }
+
+ src = (test->power.on_led && !shutdown) ? "1" : "0";
+ dst = backskip_replace(path, '/', true, "/actual-%s/", my_id());
+ if (!dst)
+ goto out_return;
+
+ status = mars_symlink(src, dst, NULL, 0);
+ XIO_DBG("status symlink '%s' -> '%s' status = %d\n", dst, src, status);
+ brick_string_free(dst);
+out_return:;
+}
+
+static
+void _show_status_all(struct mars_global *global)
+{
+ struct list_head *tmp;
+
+ down_read(&global->brick_mutex);
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ if (!test->show_status)
+ continue;
+ _show_brick_status(test, false);
+ }
+ up_read(&global->brick_mutex);
+}
+
+static
+void _show_rate(struct mars_rotate *rot, struct xio_limiter *limiter, bool running, const char *name)
+{
+ int rate = limiter->lim_rate;
+
+ __show_actual(rot->parent_path, name, rate);
+ if (!running)
+ xio_limit(limiter, 0);
+}
+
+/*********************************************************************/
+
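+/* Create or reconfigure a copy brick between argv[0] (source) and
+ * argv[1] (target), including its two predecessor bio/aio bricks.
+ * start_pos/end_pos of -1 mean EOF of source/target, as determined
+ * via xio_get_info() in _set_copy_params().
+ */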
+static
+int __make_copy(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ const char *switch_path,
+ const char *copy_path,
+ const char *parent,
+ const char *argv[],
+ struct key_value_pair *msg_pair,
+ loff_t start_pos, /* -1 means at EOF of source */
+ loff_t end_pos, /* -1 means at EOF of target */
+ bool verify_mode,
+ bool limit_mode,
+ struct copy_brick **__copy)
+{
+ struct xio_brick *copy;
+ struct copy_cookie cc = {};
+
+ struct client_cookie clc[2] = {
+ {
+ .limit_mode = limit_mode,
+ },
+ {
+ .limit_mode = limit_mode,
+ .create_mode = true,
+ },
+ };
+ int i;
+ bool switch_copy;
+ int status = -EINVAL;
+
+ if (!switch_path || !global)
+ goto done;
+
+ /* don't generate empty aio files if copy does not yet exist */
+ switch_copy = _check_switch(global, switch_path);
+ copy = mars_find_brick(global, &copy_brick_type, copy_path);
+ if (!copy && !switch_copy)
+ goto done;
+
+ /* create/find predecessor aio bricks */
+ for (i = 0; i < 2; i++) {
+ struct xio_brick *aio;
+
+ cc.argv[i] = argv[i];
+ if (parent) {
+ cc.fullpath[i] = path_make("%s/%s", parent, argv[i]);
+ if (!cc.fullpath[i]) {
+ XIO_ERR("cannot make path '%s/%s'\n", parent, argv[i]);
+ goto done;
+ }
+ } else {
+ cc.fullpath[i] = argv[i];
+ }
+
+ aio =
+ make_brick_all(global,
+ NULL,
+ _set_bio_params,
+ &clc[i],
+ NULL,
+ (const struct generic_brick_type *)&bio_brick_type,
+ (const struct generic_brick_type*[]){},
+ switch_copy ? 2 : -1,
+ cc.fullpath[i],
+ (const char *[]){},
+ 0);
+ if (!aio) {
+ XIO_DBG("cannot instantiate '%s'\n", cc.fullpath[i]);
+ make_msg(msg_pair, "cannot instantiate '%s'", cc.fullpath[i]);
+ goto done;
+ }
+ cc.output[i] = aio->outputs[0];
+ }
+
+ cc.copy_path = copy_path;
+ cc.start_pos = start_pos;
+ cc.end_pos = end_pos;
+ cc.verify_mode = verify_mode;
+
+ copy =
+ make_brick_all(global,
+ belongs,
+ _set_copy_params,
+ &cc,
+ cc.fullpath[1],
+ (const struct generic_brick_type *)&copy_brick_type,
+ (const struct generic_brick_type*[]){NULL, NULL, NULL, NULL},
+ (!switch_copy || IS_EXHAUSTED()) ? -1 : 2,
+ "%s",
+ (const char *[]){"%s", "%s", "%s", "%s"},
+ 4,
+ copy_path,
+ cc.fullpath[0],
+ cc.fullpath[0],
+ cc.fullpath[1],
+ cc.fullpath[1]);
+ if (copy) {
+ struct copy_brick *_copy = (void *)copy;
+
+ copy->show_status = _show_brick_status;
+ make_msg(msg_pair,
+ "from = '%s' to = '%s' on = %d start_pos = %lld end_pos = %lld actual_pos = %lld actual_stamp = %ld.%09ld rate = %d read_fly = %d write_fly = %d error_code = %d nr_errors = %d",
+ argv[0],
+ argv[1],
+ _copy->power.on_led,
+ _copy->copy_start,
+ _copy->copy_end,
+ _copy->copy_last,
+ _copy->copy_last_stamp.tv_sec, _copy->copy_last_stamp.tv_nsec,
+ _copy->copy_limiter ? _copy->copy_limiter->lim_rate : 0,
+ atomic_read(&_copy->copy_read_flight),
+ atomic_read(&_copy->copy_write_flight),
+ _copy->copy_error,
+ _copy->copy_error_count);
+ }
+ if (__copy)
+ *__copy = (void *)copy;
+
+ status = 0;
+
+done:
+ XIO_DBG("status = %d\n", status);
+ for (i = 0; i < 2; i++) {
+ if (cc.fullpath[i] && cc.fullpath[i] != argv[i])
+ brick_string_free(cc.fullpath[i]);
+ }
+ return status;
+}
+
+/*********************************************************************/
+
+/* remote workers */
+
+static
+rwlock_t peer_lock = __RW_LOCK_UNLOCKED(&peer_lock);
+
+static
+struct list_head peer_anchor = LIST_HEAD_INIT(peer_anchor);
+
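+/* Per-peer state: one instance and one peer_thread per known cluster
+ * node. The peer thread only fetches the remote dent list into
+ * remote_dent_list; the main thread consumes it via run_bones().
+ */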
+struct mars_peerinfo {
+ struct mars_global *global;
+ char *peer;
+ char *path;
+ struct xio_socket socket;
+ struct task_struct *peer_thread;
+ spinlock_t lock;
+ struct list_head peer_head;
+ struct list_head remote_dent_list;
+ unsigned long last_remote_jiffies;
+ int maxdepth;
+ bool to_remote_trigger;
+ bool from_remote_trigger;
+};
+
+static
+struct mars_peerinfo *find_peer(const char *peer_name)
+{
+ struct list_head *tmp;
+ struct mars_peerinfo *res = NULL;
+
+ read_lock(&peer_lock);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ if (!strcmp(peer->peer, peer_name)) {
+ res = peer;
+ break;
+ }
+ }
+ read_unlock(&peer_lock);
+
+ return res;
+}
+
+static
+bool _is_usable_dir(const char *name)
+{
+ if (!strncmp(name, "resource-", 9)
+ || !strncmp(name, "todo-", 5)
+ || !strncmp(name, "actual-", 7)
+ || !strncmp(name, "defaults", 8)
+ ) {
+ return true;
+ }
+ return false;
+}
+
+static
+bool _is_peer_logfile(const char *name, const char *id)
+{
+ int len = strlen(name);
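+	/* fallback idlen: strlen("log-") + nine sequence digits + '-' */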
+ int idlen = id ? strlen(id) : 4 + 9 + 1;
+
+ if (len <= idlen ||
+ strncmp(name, "log-", 4) != 0) {
+ XIO_DBG("not a logfile at all: '%s'\n", name);
+ return false;
+ }
+ if (id &&
+ name[len - idlen - 1] == '-' &&
+ strncmp(name + len - idlen, id, idlen) == 0) {
+ XIO_DBG("not a peer logfile: '%s'\n", name);
+ return false;
+ }
+ XIO_DBG("found peer logfile: '%s'\n", name);
+ return true;
+}
+
+static
+int _update_file(struct mars_dent *parent,
+ const char *switch_path,
+ const char *copy_path,
+ const char *file,
+ const char *peer,
+ loff_t end_pos)
+{
+ struct mars_rotate *rot = parent->d_private;
+ struct mars_global *global = rot->global;
+
+#ifdef CONFIG_MARS_SEPARATE_PORTS
+ const char *tmp = path_make("%s@%s:%d", file, peer, xio_net_default_port + 1);
+
+#else
+ const char *tmp = path_make("%s@%s", file, peer);
+
+#endif
+ const char *argv[2] = { tmp, file };
+ struct copy_brick *copy = NULL;
+ struct key_value_pair *msg_pair = find_key(rot->msgs, "inf-fetch");
+ bool do_start = true;
+ int status = -ENOMEM;
+
+ if (unlikely(!tmp || !global))
+ goto done;
+
+ rot->fetch_round = 0;
+
+	if (rot->todo_primary || rot->is_primary) {
+ XIO_DBG("disallowing fetch, todo_primary=%d is_primary=%d\n", rot->todo_primary, rot->is_primary);
+ make_msg(msg_pair,
+ "disallowing fetch (todo_primary=%d is_primary=%d)",
+ rot->todo_primary,
+ rot->is_primary);
+ do_start = false;
+ }
+ if (do_start && !_check_allow(global, parent, "attach")) {
+ XIO_DBG("disabling fetch due to detach\n");
+ make_msg(msg_pair, "disabling fetch due to detach");
+ do_start = false;
+ }
+
+ XIO_DBG("src = '%s' dst = '%s'\n", tmp, file);
+ status = __make_copy(global,
+ NULL,
+ do_start ? switch_path : "",
+ copy_path,
+ NULL,
+ argv,
+ msg_pair,
+ -1,
+ -1,
+ false,
+ false,
+ &copy);
+ if (status >= 0 && copy) {
+ copy->copy_limiter = &rot->fetch_limiter;
+ /* FIXME: code is dead */
+ if (copy->append_mode && copy->power.on_led &&
+ end_pos > copy->copy_end) {
+ XIO_DBG("appending to '%s' %lld => %lld\n", copy_path, copy->copy_end, end_pos);
+ /* FIXME: use corrected length from xio_get_info() / see _set_copy_params() */
+ copy->copy_end = end_pos;
+ }
+ }
+
+done:
+ brick_string_free(tmp);
+ return status;
+}
+
+static
+int check_logfile(const char *peer,
+ struct mars_dent *remote_dent,
+ struct mars_dent *local_dent,
+ struct mars_dent *parent,
+ loff_t dst_size)
+{
+ loff_t src_size = remote_dent->stat_val.size;
+ struct mars_rotate *rot;
+ const char *switch_path = NULL;
+ struct copy_brick *fetch_brick;
+ int status = 0;
+
+ /* correct the remote size when necessary */
+ if (remote_dent->d_corr_B > 0 && remote_dent->d_corr_B < src_size) {
+ XIO_DBG("logfile '%s' correcting src_size from %lld to %lld\n",
+ remote_dent->d_path,
+ src_size,
+ remote_dent->d_corr_B);
+ src_size = remote_dent->d_corr_B;
+ }
+
+ /* plausibility checks */
+ if (unlikely(dst_size > src_size)) {
+ XIO_WRN("my local copy is larger than the remote one, ignoring\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* check whether we are participating in that resource */
+ rot = parent->d_private;
+ if (!rot) {
+ XIO_WRN("parent has no rot info\n");
+ status = -EINVAL;
+ goto done;
+ }
+ if (!rot->fetch_path) {
+ XIO_WRN("parent has no fetch_path\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* bookkeeping for serialization of logfile updates */
+ if (remote_dent->d_serial > rot->fetch_serial) {
+ rot->fetch_next_is_available++;
+ if (!rot->fetch_next_serial || !rot->fetch_next_origin) {
+ rot->fetch_next_serial = remote_dent->d_serial;
+ rot->fetch_next_origin = brick_strdup(remote_dent->d_rest);
+ } else if (rot->fetch_next_serial == remote_dent->d_serial && strcmp(rot->fetch_next_origin,
+ remote_dent->d_rest)) {
+ rot->split_brain_round = 0;
+ rot->split_brain_serial = remote_dent->d_serial;
+ XIO_WRN("SPLIT BRAIN (logfiles from '%s' and '%s' with same serial number %d) detected!\n",
+ rot->fetch_next_origin, remote_dent->d_rest, rot->split_brain_serial);
+ }
+ }
+
+ /* check whether connection is allowed */
+ switch_path = path_make("%s/todo-%s/connect", parent->d_path, my_id());
+
+ /* check whether copy is necessary */
+ fetch_brick = rot->fetch_brick;
+ XIO_DBG("fetch_brick = %p (remote '%s' %d) fetch_serial = %d\n",
+ fetch_brick,
+ remote_dent->d_path,
+ remote_dent->d_serial,
+ rot->fetch_serial);
+ if (fetch_brick) {
+ if (remote_dent->d_serial == rot->fetch_serial && rot->fetch_peer && !strcmp(peer, rot->fetch_peer)) {
+ /* treat copy brick instance underway */
+ status = _update_file(parent,
+ switch_path,
+ rot->fetch_path,
+ remote_dent->d_path,
+ peer,
+ src_size);
+ XIO_DBG("re-update '%s' from peer '%s' status = %d\n", remote_dent->d_path, peer, status);
+ }
+ } else if (!rot->fetch_serial && rot->allow_update &&
+ !rot->is_primary && !rot->old_is_primary &&
+ (!rot->preferred_peer || !strcmp(rot->preferred_peer, peer)) &&
+ (!rot->split_brain_serial || remote_dent->d_serial < rot->split_brain_serial) &&
+ (dst_size < src_size || !local_dent)) {
+ /* start copy brick instance */
+ status = _update_file(parent, switch_path, rot->fetch_path, remote_dent->d_path, peer, src_size);
+ XIO_DBG("update '%s' from peer '%s' status = %d\n", remote_dent->d_path, peer, status);
+ if (likely(status >= 0)) {
+ rot->fetch_serial = remote_dent->d_serial;
+ rot->fetch_next_is_available = 0;
+ brick_string_free(rot->fetch_peer);
+ rot->fetch_peer = brick_strdup(peer);
+ }
+ } else {
+ XIO_DBG("allow_update = %d src_size = %lld dst_size = %lld local_dent = %p\n",
+ rot->allow_update,
+ src_size,
+ dst_size,
+ local_dent);
+ }
+
+done:
+ brick_string_free(switch_path);
+ return status;
+}
+
+static
+int run_bone(struct mars_peerinfo *peer, struct mars_dent *remote_dent)
+{
+ int status = 0;
+ struct kstat local_stat = {};
+ const char *marker_path = NULL;
+ bool stat_ok;
+ bool update_mtime = true;
+ bool update_ctime = true;
+ bool run_trigger = false;
+
+ if (!strncmp(remote_dent->d_name, ".tmp", 4))
+ goto done;
+ if (!strncmp(remote_dent->d_name, ".deleted-", 9))
+ goto done;
+ if (!strncmp(remote_dent->d_name, "ignore", 6))
+ goto done;
+
+ /* create / check markers (prevent concurrent updates) */
+ if (remote_dent->link_val && !strncmp(remote_dent->d_path, "/mars/todo-global/delete-", 25)) {
+ marker_path = backskip_replace(remote_dent->link_val, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &local_stat, true) < 0 ||
+ timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) > 0) {
+ XIO_DBG("creating / updating marker '%s' mtime=%lu.%09lu\n",
+ marker_path, remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec);
+ mars_symlink("1", marker_path, &remote_dent->stat_val.mtime, 0);
+ }
+ if (remote_dent->d_serial < peer->global->deleted_my_border) {
+ XIO_DBG("ignoring deletion '%s' at border %d\n",
+ remote_dent->d_path,
+ peer->global->deleted_my_border);
+ goto done;
+ }
+ } else {
+ /* check marker preventing concurrent updates from remote hosts when deletes are in progress */
+ marker_path = backskip_replace(remote_dent->d_path, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &local_stat, true) >= 0) {
+ if (timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) <= 0) {
+ XIO_DBG("marker '%s' exists, ignoring '%s' (new mtime=%lu.%09lu, marker mtime=%lu.%09lu)\n",
+ marker_path, remote_dent->d_path,
+ remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec,
+ local_stat.mtime.tv_sec, local_stat.mtime.tv_nsec);
+ goto done;
+ } else {
+ XIO_DBG("marker '%s' exists, overwriting '%s' (new mtime=%lu.%09lu, marker mtime=%lu.%09lu)\n",
+ marker_path, remote_dent->d_path,
+ remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec,
+ local_stat.mtime.tv_sec, local_stat.mtime.tv_nsec);
+ }
+ }
+ }
+
+ status = mars_stat(remote_dent->d_path, &local_stat, true);
+ stat_ok = (status >= 0);
+
+ if (stat_ok) {
+ update_mtime = timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) > 0;
+ update_ctime = timespec_compare(&remote_dent->stat_val.ctime, &local_stat.ctime) > 0;
+
+ if ((remote_dent->stat_val.mode & S_IRWXU) !=
+ (local_stat.mode & S_IRWXU) &&
+ update_ctime) {
+ mode_t newmode = local_stat.mode;
+
+ XIO_DBG("chmod '%s' 0x%xd -> 0x%xd\n",
+ remote_dent->d_path,
+ newmode & S_IRWXU,
+ remote_dent->stat_val.mode & S_IRWXU);
+ newmode &= ~S_IRWXU;
+ newmode |= (remote_dent->stat_val.mode & S_IRWXU);
+ mars_chmod(remote_dent->d_path, newmode);
+ run_trigger = true;
+ }
+
+ if (__kuid_val(remote_dent->stat_val.uid) != __kuid_val(local_stat.uid) && update_ctime) {
+ XIO_DBG("lchown '%s' %d -> %d\n",
+ remote_dent->d_path,
+ __kuid_val(local_stat.uid),
+ __kuid_val(remote_dent->stat_val.uid));
+ mars_lchown(remote_dent->d_path, __kuid_val(remote_dent->stat_val.uid));
+ run_trigger = true;
+ }
+ }
+
+ if (S_ISDIR(remote_dent->stat_val.mode)) {
+ if (!_is_usable_dir(remote_dent->d_name)) {
+ XIO_DBG("ignoring directory '%s'\n", remote_dent->d_path);
+ goto done;
+ }
+ if (!stat_ok) {
+ status = mars_mkdir(remote_dent->d_path);
+ XIO_DBG("create directory '%s' status = %d\n", remote_dent->d_path, status);
+ if (status >= 0) {
+ mars_chmod(remote_dent->d_path, remote_dent->stat_val.mode);
+ mars_lchown(remote_dent->d_path, __kuid_val(remote_dent->stat_val.uid));
+ }
+ }
+ } else if (S_ISLNK(remote_dent->stat_val.mode) && remote_dent->link_val) {
+ if (!stat_ok || update_mtime) {
+ status = mars_symlink(remote_dent->link_val,
+ remote_dent->d_path,
+ &remote_dent->stat_val.mtime,
+ __kuid_val(remote_dent->stat_val.uid));
+ XIO_DBG("create symlink '%s' -> '%s' status = %d\n",
+ remote_dent->d_path,
+ remote_dent->link_val,
+ status);
+ run_trigger = true;
+ }
+ } else if (S_ISREG(remote_dent->stat_val.mode) && _is_peer_logfile(remote_dent->d_name, my_id())) {
+ const char *parent_path = backskip_replace(remote_dent->d_path, '/', false, "");
+
+ if (likely(parent_path)) {
+ struct mars_dent *parent = mars_find_dent(peer->global, parent_path);
+
+ if (unlikely(!parent)) {
+ XIO_DBG("ignoring non-existing local resource '%s'\n", parent_path);
+ /* don't copy old / outdated logfiles */
+ } else {
+ struct mars_rotate *rot;
+
+ rot = parent->d_private;
+ if (rot && rot->relevant_serial > remote_dent->d_serial) {
+ XIO_DBG("ignoring outdated remote logfile '%s' (behind %d)\n",
+ remote_dent->d_path, rot->relevant_serial);
+ } else {
+ struct mars_dent *local_dent;
+
+ local_dent = mars_find_dent(peer->global, remote_dent->d_path);
+ status = check_logfile(peer->peer,
+ remote_dent,
+ local_dent,
+ parent,
+ local_stat.size);
+ }
+ }
+ brick_string_free(parent_path);
+ }
+ } else {
+ XIO_DBG("ignoring '%s'\n", remote_dent->d_path);
+ }
+
+done:
+ brick_string_free(marker_path);
+ if (status >= 0)
+ status = run_trigger ? 1 : 0;
+ return status;
+}
+
+static
+int run_bones(struct mars_peerinfo *peer)
+{
+ LIST_HEAD(tmp_list);
+ struct list_head *tmp;
+ bool run_trigger = false;
+ int status = 0;
+
+ spin_lock(&peer->lock);
+ list_replace_init(&peer->remote_dent_list, &tmp_list);
+ spin_unlock(&peer->lock);
+
+ XIO_DBG("remote_dent_list list_empty = %d\n", list_empty(&tmp_list));
+
+ for (tmp = tmp_list.next; tmp != &tmp_list; tmp = tmp->next) {
+ struct mars_dent *remote_dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!remote_dent->d_path || !remote_dent->d_name) {
+ XIO_DBG("NULL\n");
+ continue;
+ }
+ status = run_bone(peer, remote_dent);
+ if (status > 0)
+ run_trigger = true;
+ /* XIO_DBG("path = '%s' worker status = %d\n", remote_dent->d_path, status); */
+ }
+
+ xio_free_dent_all(NULL, &tmp_list);
+
+ if (run_trigger)
+ local_trigger();
+ return status;
+}
+
+/*********************************************************************/
+
+/* remote working infrastructure */
+
+static
+void _peer_cleanup(struct mars_peerinfo *peer)
+{
+ XIO_DBG("cleanup\n");
+ if (xio_socket_is_alive(&peer->socket)) {
+ XIO_DBG("really shutdown socket\n");
+ xio_shutdown_socket(&peer->socket);
+ }
+ xio_put_socket(&peer->socket);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(remote_event);
+
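+/* Worker thread, one per peer: (re)establish the connection, send a
+ * CMD_NOTIFY trigger when requested, fetch the remote dent list via
+ * CMD_GETENTS, and hand it over to the main thread. Between rounds it
+ * sleeps up to mars_propagate_interval seconds or until triggered.
+ */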
+static
+int peer_thread(void *data)
+{
+ struct mars_peerinfo *peer = data;
+ char *real_peer;
+ struct sockaddr_storage sockaddr = {};
+
+ struct key_value_pair peer_pairs[] = {
+		{ peer ? peer->peer : NULL },
+ { NULL }
+ };
+ int pause_time = 0;
+ bool do_kill = false;
+ int status;
+
+ if (!peer)
+ return -1;
+
+ real_peer = xio_translate_hostname(peer->peer);
+ XIO_INF("-------- peer thread starting on peer '%s' (%s)\n", peer->peer, real_peer);
+
+ status = xio_create_sockaddr(&sockaddr, real_peer);
+ if (unlikely(status < 0)) {
+ XIO_ERR("unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ goto done;
+ }
+
+ while (!brick_thread_should_stop()) {
+ LIST_HEAD(tmp_list);
+ LIST_HEAD(old_list);
+
+ struct xio_cmd cmd = {
+ .cmd_str1 = peer->path,
+ .cmd_int1 = peer->maxdepth,
+ };
+
+ show_vals(peer_pairs, "/mars", "connection-from-");
+
+ if (!xio_socket_is_alive(&peer->socket)) {
+ make_msg(peer_pairs, "connection to '%s' (%s) is dead", peer->peer, real_peer);
+ brick_string_free(real_peer);
+ real_peer = xio_translate_hostname(peer->peer);
+ status = xio_create_sockaddr(&sockaddr, real_peer);
+ if (unlikely(status < 0)) {
+ XIO_ERR("unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ make_msg(peer_pairs, "unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ brick_msleep(1000);
+ continue;
+ }
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ brick_msleep(1000);
+ continue;
+ }
+ if (!xio_net_is_alive) {
+ brick_msleep(1000);
+ continue;
+ }
+
+ status = xio_create_socket(&peer->socket, &sockaddr, false);
+ if (unlikely(status < 0)) {
+ XIO_INF("no connection to mars module on '%s' (%s) status = %d\n",
+ peer->peer,
+ real_peer,
+ status);
+ make_msg(peer_pairs,
+ "connection to '%s' (%s) could not be established: status = %d",
+ peer->peer,
+ real_peer,
+ status);
+ brick_msleep(2000);
+ continue;
+ }
+ do_kill = true;
+ peer->socket.s_shutdown_on_err = true;
+ peer->socket.s_send_abort = mars_peer_abort;
+ peer->socket.s_recv_abort = mars_peer_abort;
+ XIO_DBG("successfully opened socket to '%s'\n", real_peer);
+ brick_msleep(100);
+ continue;
+ }
+
+ make_msg(peer_pairs, "CONNECTED %s(%s)", peer->peer, real_peer);
+
+ if (peer->from_remote_trigger) {
+ pause_time = 0;
+ peer->from_remote_trigger = false;
+ XIO_DBG("got notify from peer.\n");
+ }
+
+ status = 0;
+ if (peer->to_remote_trigger) {
+ pause_time = 0;
+ peer->to_remote_trigger = false;
+ XIO_DBG("sending notify to peer...\n");
+ cmd.cmd_code = CMD_NOTIFY;
+ status = xio_send_struct(&peer->socket, &cmd, xio_cmd_meta);
+ }
+
+ if (likely(status >= 0)) {
+ cmd.cmd_code = CMD_GETENTS;
+ status = xio_send_struct(&peer->socket, &cmd, xio_cmd_meta);
+ }
+ if (unlikely(status < 0)) {
+ XIO_WRN("communication error on send, status = %d\n", status);
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ }
+ brick_msleep(1000);
+ continue;
+ }
+
+ XIO_DBG("fetching remote dentry list\n");
+ status = xio_recv_dent_list(&peer->socket, &tmp_list);
+ if (unlikely(status < 0)) {
+ XIO_WRN("communication error on receive, status = %d\n", status);
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ }
+ xio_free_dent_all(NULL, &tmp_list);
+ brick_msleep(2000);
+ continue;
+ }
+
+ if (likely(!list_empty(&tmp_list))) {
+			XIO_DBG("got remote dentries\n");
+
+ spin_lock(&peer->lock);
+
+ list_replace_init(&peer->remote_dent_list, &old_list);
+ list_replace_init(&tmp_list, &peer->remote_dent_list);
+
+ spin_unlock(&peer->lock);
+
+ peer->last_remote_jiffies = jiffies;
+
+ local_trigger();
+
+ xio_free_dent_all(NULL, &old_list);
+ }
+
+ brick_msleep(100);
+ if (!brick_thread_should_stop()) {
+ if (pause_time < mars_propagate_interval)
+ pause_time++;
+ wait_event_interruptible_timeout(remote_event,
+				(peer->to_remote_trigger || peer->from_remote_trigger) ||
+ (mars_global && mars_global->main_trigger),
+ pause_time * HZ);
+ }
+ }
+
+ XIO_INF("-------- peer thread terminating\n");
+
+ make_msg(peer_pairs, "NOT connected %s(%s)", peer->peer, real_peer);
+ show_vals(peer_pairs, "/mars", "connection-from-");
+
+ if (do_kill)
+ _peer_cleanup(peer);
+
+done:
+ clear_vals(peer_pairs);
+ brick_string_free(real_peer);
+ return 0;
+}
+
+static
+void _make_alive(void)
+{
+ struct timespec now;
+ char *tmp;
+
+ get_lamport(&now);
+ tmp = path_make("%ld.%09ld", now.tv_sec, now.tv_nsec);
+ if (likely(tmp)) {
+ _make_alivelink_str("time", tmp);
+ brick_string_free(tmp);
+ }
+ _make_alivelink("alive", mars_global && mars_global->global_power.button ? 1 : 0);
+ _make_alivelink_str("tree", SYMLINK_TREE_VERSION);
+}
+
+void from_remote_trigger(void)
+{
+ struct list_head *tmp;
+ int count = 0;
+
+ _make_alive();
+
+ read_lock(&peer_lock);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ peer->from_remote_trigger = true;
+ count++;
+ }
+ read_unlock(&peer_lock);
+
+ XIO_DBG("got trigger for %d peers\n", count);
+ wake_up_interruptible_all(&remote_event);
+}
+EXPORT_SYMBOL_GPL(from_remote_trigger);
+
+static
+void __remote_trigger(void)
+{
+ struct list_head *tmp;
+ int count = 0;
+
+ read_lock(&peer_lock);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ peer->to_remote_trigger = true;
+ count++;
+ }
+ read_unlock(&peer_lock);
+
+ XIO_DBG("triggered %d peers\n", count);
+ wake_up_interruptible_all(&remote_event);
+}
+
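+/* Delay global shutdown while shadow buffers are still in use;
+ * additionally require the count of flying IO requests to stay at
+ * zero for a few check rounds before giving the go-ahead.
+ */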
+static
+bool is_shutdown(void)
+{
+ bool res = false;
+ int used = atomic_read(&global_mshadow_count);
+
+ if (used > 0) {
+ XIO_INF("global shutdown delayed: there are %d buffers in use, occupying %ld bytes\n",
+ used,
+ atomic64_read(&global_mshadow_used));
+ } else {
+ int rounds = 3;
+
+ while ((used = atomic_read(&xio_global_io_flying)) <= 0) {
+ if (--rounds <= 0) {
+ res = true;
+ break;
+ }
+ brick_msleep(30);
+ }
+ if (!res)
+ XIO_INF("global shutdown delayed: there are %d IO requests flying\n", used);
+ }
+ return res;
+}
+
+/*********************************************************************/
+
+/* helpers for worker functions */
+
+static int _kill_peer(void *buf, struct mars_dent *dent)
+{
+ LIST_HEAD(tmp_list);
+ struct mars_global *global = buf;
+ struct mars_peerinfo *peer = dent->d_private;
+
+ if (global->global_power.button)
+ return 0;
+ if (!peer)
+ return 0;
+
+ write_lock(&peer_lock);
+ list_del_init(&peer->peer_head);
+ write_unlock(&peer_lock);
+
+ XIO_INF("stopping peer thread...\n");
+ if (peer->peer_thread)
+ brick_thread_stop(peer->peer_thread);
+ spin_lock(&peer->lock);
+ list_replace_init(&peer->remote_dent_list, &tmp_list);
+ spin_unlock(&peer->lock);
+ xio_free_dent_all(NULL, &tmp_list);
+ brick_string_free(peer->peer);
+ brick_string_free(peer->path);
+ dent->d_private = NULL;
+ brick_mem_free(peer);
+ return 0;
+}
+
+static int _make_peer(struct mars_global *global, struct mars_dent *dent, char *path)
+{
+ static int serial;
+ struct mars_peerinfo *peer;
+ char *mypeer;
+ char *parent_path;
+ int status = 0;
+
+ if (unlikely(!global || !global->global_power.button ||
+ !dent || !dent->link_val || !dent->d_parent)) {
+ XIO_DBG("cannot work\n");
+ return 0;
+ }
+ parent_path = dent->d_parent->d_path;
+ if (unlikely(!parent_path)) {
+ XIO_DBG("cannot work\n");
+ return 0;
+ }
+ mypeer = dent->d_rest;
+ if (!mypeer) {
+ status = _parse_args(dent, dent->link_val, 1);
+ if (status < 0)
+ goto done;
+ mypeer = dent->d_argv[0];
+ }
+
+ XIO_DBG("peer '%s'\n", mypeer);
+ if (!dent->d_private) {
+ dent->d_private = brick_zmem_alloc(sizeof(struct mars_peerinfo));
+ peer = dent->d_private;
+ peer->global = global;
+ peer->peer = brick_strdup(mypeer);
+ peer->path = brick_strdup(path);
+ peer->maxdepth = 2;
+ spin_lock_init(&peer->lock);
+ INIT_LIST_HEAD(&peer->peer_head);
+ INIT_LIST_HEAD(&peer->remote_dent_list);
+
+ write_lock(&peer_lock);
+ list_add_tail(&peer->peer_head, &peer_anchor);
+ write_unlock(&peer_lock);
+ }
+
+ peer = dent->d_private;
+ if (!peer->peer_thread) {
+ peer->peer_thread = brick_thread_create(peer_thread, peer, "mars_peer%d", serial++);
+ if (unlikely(!peer->peer_thread)) {
+ XIO_ERR("cannot start peer thread\n");
+ return -1;
+ }
+ XIO_DBG("started peer thread\n");
+ }
+
+ /* This must be called by the main thread in order to
+ * avoid nasty races.
+	 * The peer thread does nothing but fetch the dent list.
+ */
+ status = run_bones(peer);
+
+done:
+ return status;
+}
+
+static int kill_scan(void *buf, struct mars_dent *dent)
+{
+ return _kill_peer(buf, dent);
+}
+
+static int make_scan(void *buf, struct mars_dent *dent)
+{
+ XIO_DBG("path = '%s' peer = '%s'\n", dent->d_path, dent->d_rest);
+ /* don't connect to myself */
+ if (!strcmp(dent->d_rest, my_id()))
+ return 0;
+ return _make_peer(buf, dent, "/mars");
+}
+
+static
+int kill_any(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct list_head *tmp;
+
+ if (global->global_power.button || !is_shutdown())
+ return 0;
+
+ for (tmp = dent->brick_list.next; tmp != &dent->brick_list; tmp = tmp->next) {
+ struct xio_brick *brick = container_of(tmp, struct xio_brick, dent_brick_link);
+
+ if (brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected) {
+ XIO_DBG("cannot kill dent '%s' because brick '%s' is wired\n",
+ dent->d_path,
+ brick->brick_path);
+ return 0;
+ }
+ }
+
+ XIO_DBG("killing dent = '%s'\n", dent->d_path);
+ xio_kill_dent(dent);
+ return 1;
+}
+
+/*********************************************************************/
+
+/* handlers / helpers for logfile rotation */
+
+static
+void _create_new_logfile(const char *path)
+{
+ struct file *f;
+ const int flags = O_RDWR | O_CREAT | O_EXCL;
+ const int prot = 0600;
+
+ mm_segment_t oldfs;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ f = filp_open(path, flags, prot);
+ set_fs(oldfs);
+ if (IS_ERR(f)) {
+ int err = PTR_ERR(f);
+
+ if (err == -EEXIST)
+ XIO_INF("logfile '%s' already exists\n", path);
+ else
+ XIO_ERR("could not create logfile '%s' status = %d\n", path, err);
+ } else {
+ XIO_DBG("created empty logfile '%s'\n", path);
+ filp_close(f, NULL);
+ local_trigger();
+ }
+}
+
+static
+const char *get_replaylink(const char *parent_path, const char *host, const char **linkpath)
+{
+ const char *_linkpath = path_make("%s/replay-%s", parent_path, host);
+
+ *linkpath = _linkpath;
+ if (unlikely(!_linkpath)) {
+ XIO_ERR("no MEM\n");
+ return NULL;
+ }
+ return mars_readlink(_linkpath);
+}
+
+static
+const char *get_versionlink(const char *parent_path, int seq, const char *host, const char **linkpath)
+{
+ const char *_linkpath = path_make("%s/version-%09d-%s", parent_path, seq, host);
+
+ *linkpath = _linkpath;
+ if (unlikely(!_linkpath)) {
+ XIO_ERR("no MEM\n");
+ return NULL;
+ }
+ return mars_readlink(_linkpath);
+}
+
+static inline
+int _get_tolerance(struct mars_rotate *rot)
+{
+ if (rot->is_log_damaged)
+ return REPLAY_TOLERANCE;
+ return 0;
+}
+
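+/* Check whether switchover from old_log_path to new_log_path is safe:
+ * the sequence numbers must be contiguous, our own versionlink must
+ * equal the old primary's (old logfile fully transferred), and our
+ * replay position must have caught up with our versionlink, up to
+ * replay_tolerance.
+ */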
+static
+bool is_switchover_possible(struct mars_rotate *rot,
+ const char *old_log_path,
+ const char *new_log_path,
+ int replay_tolerance,
+ bool skip_new)
+{
+ const char *old_log_name = old_log_path + skip_dir(old_log_path);
+ const char *new_log_name = new_log_path + skip_dir(new_log_path);
+ const char *old_host = NULL;
+ const char *new_host = NULL;
+ const char *own_versionlink_path = NULL;
+ const char *old_versionlink_path = NULL;
+ const char *new_versionlink_path = NULL;
+ const char *own_versionlink = NULL;
+ const char *old_versionlink = NULL;
+ const char *new_versionlink = NULL;
+ const char *own_replaylink_path = NULL;
+ const char *own_replaylink = NULL;
+ loff_t own_r_val;
+ loff_t own_v_val;
+ loff_t own_r_tail;
+ int old_log_seq;
+ int new_log_seq;
+ int own_r_offset;
+ int own_v_offset;
+ int own_r_len;
+ int own_v_len;
+ int len1;
+ int len2;
+ int offs2;
+ char dummy = 0;
+
+ bool res = false;
+
+ XIO_DBG("old_log = '%s' new_log = '%s' toler = %d skip_new = %d\n",
+ old_log_path, new_log_path, replay_tolerance, skip_new);
+
+ if (unlikely(!parse_logfile_name(old_log_name, &old_log_seq, &old_host))) {
+ make_rot_msg(rot, "err-bad-log-name", "logfile name '%s' cannot be parsed", old_log_name);
+ goto done;
+ }
+ if (unlikely(!parse_logfile_name(new_log_name, &new_log_seq, &new_host))) {
+ make_rot_msg(rot, "err-bad-log-name", "logfile name '%s' cannot be parsed", new_log_name);
+ goto done;
+ }
+
+ /* check: are the sequence numbers contiguous? */
+ if (unlikely(new_log_seq != old_log_seq + 1)) {
+ XIO_ERR_TO(rot->log_say,
+ "logfile sequence numbers are not contiguous (%d != %d + 1), old_log_path='%s' new_log_path='%s'\n",
+ new_log_seq,
+ old_log_seq,
+ old_log_path,
+ new_log_path);
+ make_rot_msg(rot,
+ "err-log-not-contiguous",
+ "logfile sequence numbers are not contiguous (%d != %d + 1) old_log_path='%s' new_log_path='%s'",
+ new_log_seq,
+ old_log_seq,
+ old_log_path,
+ new_log_path);
+ goto done;
+ }
+
+ /* fetch all the versionlinks and test for their existence. */
+ own_versionlink = get_versionlink(rot->parent_path, old_log_seq, my_id(), &own_versionlink_path);
+ if (unlikely(!own_versionlink || !own_versionlink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read my own versionlink '%s'\n", SAFE_STR(own_versionlink_path));
+ make_rot_msg(rot,
+ "err-versionlink-not-readable",
+ "cannot read my own versionlink '%s'",
+ SAFE_STR(own_versionlink_path));
+ goto done;
+ }
+ old_versionlink = get_versionlink(rot->parent_path, old_log_seq, old_host, &old_versionlink_path);
+ if (unlikely(!old_versionlink || !old_versionlink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read old versionlink '%s'\n", SAFE_STR(old_versionlink_path));
+ make_rot_msg(rot,
+ "err-versionlink-not-readable",
+ "cannot read old versionlink '%s'",
+ SAFE_STR(old_versionlink_path));
+ goto done;
+ }
+ if (!skip_new) {
+ new_versionlink = get_versionlink(rot->parent_path, new_log_seq, new_host, &new_versionlink_path);
+ if (unlikely(!new_versionlink || !new_versionlink[0])) {
+ XIO_INF_TO(rot->log_say,
+ "new versionlink '%s' does not yet exist, we must wait for it.\n",
+ SAFE_STR(new_versionlink_path));
+ make_rot_msg(rot,
+ "inf-versionlink-not-yet-exist",
+ "we must wait for new versionlink '%s'",
+ SAFE_STR(new_versionlink_path));
+ goto done;
+ }
+ }
+
+ /* check: are the versionlinks correct? */
+ if (unlikely(strcmp(own_versionlink, old_versionlink))) {
+ XIO_INF_TO(rot->log_say,
+ "old logfile is not yet completeley transferred, own_versionlink '%s' -> '%s' != old_versionlink '%s' -> '%s'\n",
+ own_versionlink_path,
+ own_versionlink,
+ old_versionlink_path,
+ old_versionlink);
+ make_rot_msg(rot,
+ "inf-versionlink-not-equal",
+ "old logfile is not yet completeley transferred (own_versionlink '%s' -> '%s' != old_versionlink '%s' -> '%s')",
+ own_versionlink_path,
+ own_versionlink,
+ old_versionlink_path,
+ old_versionlink);
+ goto done;
+ }
+
+ /* check: did I fully replay my old logfile data? */
+ own_replaylink = get_replaylink(rot->parent_path, my_id(), &own_replaylink_path);
+ if (unlikely(!own_replaylink || !own_replaylink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read my own replaylink '%s'\n", SAFE_STR(own_replaylink_path));
+ goto done;
+ }
+ own_r_len = skip_part(own_replaylink);
+ own_v_offset = skip_part(own_versionlink);
+ if (unlikely(!own_versionlink[own_v_offset++])) {
+ XIO_ERR_TO(rot->log_say,
+ "own version link '%s' -> '%s' is malformed\n",
+ own_versionlink_path,
+ own_versionlink);
+ make_rot_msg(rot,
+ "err-replaylink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ own_v_len = skip_part(own_versionlink + own_v_offset);
+ if (unlikely(own_r_len != own_v_len ||
+ strncmp(own_replaylink, own_versionlink + own_v_offset, own_r_len))) {
+ XIO_ERR_TO(rot->log_say,
+ "internal problem: logfile name mismatch between '%s' and '%s'\n",
+ own_replaylink,
+ own_versionlink);
+ make_rot_msg(rot,
+ "err-bad-log-name",
+ "internal problem: logfile name mismatch between '%s' and '%s'",
+ own_replaylink,
+ own_versionlink);
+ goto done;
+ }
+ if (unlikely(!own_replaylink[own_r_len])) {
+ XIO_ERR_TO(rot->log_say,
+ "own replay link '%s' -> '%s' is malformed\n",
+ own_replaylink_path,
+ own_replaylink);
+ make_rot_msg(rot,
+ "err-replaylink-not-readable",
+ "own replay link '%s' -> '%s' is malformed",
+ own_replaylink_path,
+ own_replaylink);
+ goto done;
+ }
+ own_r_offset = own_r_len + 1;
+ if (unlikely(!own_versionlink[own_v_len])) {
+ XIO_ERR_TO(rot->log_say,
+ "own version link '%s' -> '%s' is malformed\n",
+ own_versionlink_path,
+ own_versionlink);
+ make_rot_msg(rot,
+ "err-versionlink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ own_v_offset += own_r_len + 1;
+ own_r_len = skip_part(own_replaylink + own_r_offset);
+ own_v_len = skip_part(own_versionlink + own_v_offset);
+ own_r_val = own_v_val = 0;
+ own_r_tail = 0;
+ if (sscanf(own_replaylink + own_r_offset, "%lld,%lld", &own_r_val, &own_r_tail) != 2) {
+ XIO_ERR_TO(rot->log_say,
+ "own replay link '%s' -> '%s' is malformed\n",
+ own_replaylink_path,
+ own_replaylink);
+ make_rot_msg(rot,
+ "err-replaylink-not-readable",
+ "own replay link '%s' -> '%s' is malformed",
+ own_replaylink_path,
+ own_replaylink);
+ goto done;
+ }
+ /* SSCANF_TO_KSTRTO: kstrtos64 does not work because of the next char */
+ if (sscanf(own_versionlink + own_v_offset, "%lld%c", &own_v_val, &dummy) != 2) {
+ XIO_ERR_TO(rot->log_say,
+ "own version link '%s' -> '%s' is malformed\n",
+ own_versionlink_path,
+ own_versionlink);
+ make_rot_msg(rot,
+ "err-versionlink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ if (unlikely(own_r_val > own_v_val || own_r_val + replay_tolerance < own_v_val)) {
+ XIO_INF_TO(rot->log_say,
+ "log replay is not yet finished: '%s' and '%s' are reporting different positions.\n",
+ own_replaylink,
+ own_versionlink);
+ make_rot_msg(rot,
+ "inf-replay-not-yet-finished",
+ "log replay is not yet finished: '%s' and '%s' are reporting different positions",
+ own_replaylink,
+ own_versionlink);
+ goto done;
+ }
+
+ /* last check: is the new versionlink based on the old one? */
+ if (!skip_new) {
+ len1 = skip_sect(own_versionlink);
+ offs2 = skip_sect(new_versionlink);
+ if (unlikely(!new_versionlink[offs2++])) {
+ XIO_ERR_TO(rot->log_say,
+ "new version link '%s' -> '%s' is malformed\n",
+ new_versionlink_path,
+ new_versionlink);
+ make_rot_msg(rot,
+ "err-versionlink-not-readable",
+ "new version link '%s' -> '%s' is malformed",
+ new_versionlink_path,
+ new_versionlink);
+ goto done;
+ }
+ len2 = skip_sect(new_versionlink + offs2);
+ if (unlikely(len1 != len2 ||
+ strncmp(own_versionlink, new_versionlink + offs2, len1))) {
+ XIO_WRN_TO(rot->log_say,
+ "VERSION MISMATCH old '%s' -> '%s' new '%s' -> '%s' ==(%d,%d) ===> check for SPLIT BRAIN!\n",
+ own_versionlink_path,
+ own_versionlink,
+ new_versionlink_path,
+ new_versionlink,
+ len1,
+ len2);
+ make_rot_msg(rot,
+ "err-splitbrain-detected",
+ "VERSION MISMATCH old '%s' -> '%s' new '%s' -> '%s' ==(%d,%d) ===> check for SPLIT BRAIN",
+ own_versionlink_path,
+ own_versionlink,
+ new_versionlink_path,
+ new_versionlink,
+ len1,
+ len2);
+ goto done;
+ }
+ }
+
+ /* report success */
+ res = true;
+ XIO_DBG("VERSION OK '%s' -> '%s'\n", own_versionlink_path, own_versionlink);
+
+done:
+ brick_string_free(old_host);
+ brick_string_free(new_host);
+ brick_string_free(own_versionlink_path);
+ brick_string_free(old_versionlink_path);
+ brick_string_free(new_versionlink_path);
+ brick_string_free(own_versionlink);
+ brick_string_free(old_versionlink);
+ brick_string_free(new_versionlink);
+ brick_string_free(own_replaylink_path);
+ brick_string_free(own_replaylink);
+ return res;
+}
+
+static
+void rot_destruct(void *_rot)
+{
+ struct mars_rotate *rot = _rot;
+
+ if (likely(rot)) {
+ list_del_init(&rot->rot_head);
+ write_info_links(rot);
+ del_channel(rot->log_say);
+ rot->log_say = NULL;
+ brick_string_free(rot->fetch_path);
+ brick_string_free(rot->fetch_peer);
+ brick_string_free(rot->preferred_peer);
+ brick_string_free(rot->parent_path);
+ brick_string_free(rot->parent_rest);
+ brick_string_free(rot->fetch_next_origin);
+ rot->fetch_path = NULL;
+ rot->fetch_peer = NULL;
+ rot->preferred_peer = NULL;
+ rot->parent_path = NULL;
+ rot->parent_rest = NULL;
+ rot->fetch_next_origin = NULL;
+ clear_vals(rot->msgs);
+ }
+}
+
+/* This must be called once at every round of logfile checking.
+ */
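+/* Hedged outline of one round, as implemented below: allocate the
+ * per-resource rotation state on first use, reset all per-round fields,
+ * read the replay-<host> status symlink (it must exist and controls
+ * everything), open an aio brick on the designated logfile, and
+ * instantiate (but do not yet connect) the transaction logger.
+ */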
+static
+int make_log_init(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct xio_brick *bio_brick;
+ struct xio_brick *aio_brick;
+ struct xio_brick *trans_brick;
+ struct mars_rotate *rot = parent->d_private;
+ struct mars_dent *replay_link;
+ struct mars_dent *aio_dent;
+ struct xio_output *output;
+ const char *parent_path;
+ const char *replay_path = NULL;
+ const char *aio_path = NULL;
+ bool switch_on;
+ int status = 0;
+
+ if (!global->global_power.button)
+ goto done;
+ status = -EINVAL;
+ CHECK_PTR(parent, done);
+ parent_path = parent->d_path;
+ CHECK_PTR(parent_path, done);
+
+ if (!rot) {
+ const char *fetch_path;
+
+ rot = brick_zmem_alloc(sizeof(struct mars_rotate));
+ spin_lock_init(&rot->inf_lock);
+ fetch_path = path_make("%s/logfile-update", parent_path);
+ if (unlikely(!fetch_path)) {
+ XIO_ERR("cannot create fetch_path\n");
+ brick_mem_free(rot);
+ status = -ENOMEM;
+ goto done;
+ }
+ rot->fetch_path = fetch_path;
+ rot->global = global;
+ parent->d_private = rot;
+ parent->d_private_destruct = rot_destruct;
+ list_add_tail(&rot->rot_head, &rot_anchor);
+ assign_keys(rot->msgs, rot_keys);
+ }
+
+ rot->replay_link = NULL;
+ rot->aio_dent = NULL;
+ rot->aio_brick = NULL;
+ rot->first_log = NULL;
+ rot->relevant_log = NULL;
+ rot->relevant_serial = 0;
+ rot->relevant_brick = NULL;
+ rot->next_relevant_log = NULL;
+ rot->next_next_relevant_log = NULL;
+ rot->prev_log = NULL;
+ rot->next_log = NULL;
+ brick_string_free(rot->fetch_next_origin);
+ rot->fetch_next_origin = NULL;
+ rot->max_sequence = 0;
+ /* reset the split brain detector only after conflicts have been absent for a number of rounds */
+ if (rot->split_brain_serial && rot->split_brain_round++ > 3)
+ rot->split_brain_serial = 0;
+ rot->fetch_next_serial = 0;
+ rot->has_error = false;
+ rot->wants_sync = false;
+ rot->has_symlinks = true;
+ brick_string_free(rot->preferred_peer);
+ rot->preferred_peer = NULL;
+
+ if (dent->link_val) {
+ int status = kstrtos64(dent->link_val, 0, &rot->dev_size);
+
+ (void)status; /* leave as before in case of errors */
+ }
+ if (!rot->parent_path) {
+ rot->parent_path = brick_strdup(parent_path);
+ rot->parent_rest = brick_strdup(parent->d_rest);
+ }
+
+ if (unlikely(!rot->log_say)) {
+ char *name = path_make("%s/logstatus-%s", parent_path, my_id());
+
+ if (likely(name)) {
+ rot->log_say = make_channel(name, false);
+ brick_string_free(name);
+ }
+ }
+
+ write_info_links(rot);
+
+ /* Fetch the replay status symlink.
+ * It must exist, and its value will control everything.
+ */
+ replay_path = path_make("%s/replay-%s", parent_path, my_id());
+ if (unlikely(!replay_path)) {
+ XIO_ERR("cannot make path\n");
+ status = -ENOMEM;
+ goto done;
+ }
+
+ replay_link = (void *)mars_find_dent(global, replay_path);
+ if (unlikely(!replay_link || !replay_link->link_val)) {
+ XIO_DBG("replay status symlink '%s' does not exist (%p)\n", replay_path, replay_link);
+ rot->allow_update = false;
+ status = -ENOENT;
+ goto done;
+ }
+
+ status = _parse_args(replay_link, replay_link->link_val, 3);
+ if (unlikely(status < 0))
+ goto done;
+ rot->replay_link = replay_link;
+
+ /* Fetch AIO dentry of the logfile.
+ */
+ if (rot->trans_brick) {
+ struct trans_logger_input *trans_input = rot->trans_brick->inputs[rot->trans_brick->old_input_nr];
+
+ if (trans_input && trans_input->is_operating) {
+ aio_path = path_make("%s/log-%09d-%s",
+ parent_path,
+ trans_input->inf.inf_sequence,
+ trans_input->inf.inf_host);
+ XIO_DBG("using logfile '%s' from trans_input %d (new=%d)\n",
+ SAFE_STR(aio_path),
+ rot->trans_brick->old_input_nr,
+ rot->trans_brick->log_input_nr);
+ }
+ }
+ if (!aio_path) {
+ aio_path = path_make("%s/%s", parent_path, replay_link->d_argv[0]);
+ XIO_DBG("using logfile '%s' from replay symlink\n", SAFE_STR(aio_path));
+ }
+ if (unlikely(!aio_path)) {
+ XIO_ERR("cannot make path\n");
+ status = -ENOMEM;
+ goto done;
+ }
+
+ aio_dent = (void *)mars_find_dent(global, aio_path);
+ if (unlikely(!aio_dent)) {
+ XIO_DBG("logfile '%s' does not exist\n", aio_path);
+ status = -ENOENT;
+ if (rot->todo_primary && !rot->is_primary && !rot->old_is_primary) {
+ int offset = strlen(aio_path) - strlen(my_id());
+
+ if (offset > 0 && aio_path[offset-1] == '-' && !strcmp(aio_path + offset, my_id())) {
+ /* try to create an empty logfile */
+ _create_new_logfile(aio_path);
+ }
+ }
+ goto done;
+ }
+ rot->aio_dent = aio_dent;
+
+ /* check whether attach is allowed */
+ switch_on = _check_allow(global, parent, "attach");
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot start transaction logger: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ /* Fetch / make the AIO brick instance
+ */
+ aio_brick =
+ make_brick_all(global,
+ aio_dent,
+ _set_aio_params,
+ NULL,
+ aio_path,
+ (const struct generic_brick_type *)&aio_brick_type,
+ (const struct generic_brick_type *[]){},
+ rot->trans_brick || switch_on ? 2 : -1,
+ "%s",
+ (const char *[]){},
+ 0,
+ aio_path);
+ rot->aio_brick = (void *)aio_brick;
+ status = 0;
+ if (unlikely(!aio_brick || !aio_brick->power.on_led))
+ goto done; /* this may happen in case of detach */
+ bio_brick = rot->bio_brick;
+ if (unlikely(!bio_brick || !bio_brick->power.on_led))
+ goto done; /* this may happen in case of detach */
+
+ /* Fetch the actual logfile size
+ */
+ output = aio_brick->outputs[0];
+ status = output->ops->xio_get_info(output, &rot->aio_info);
+ if (status < 0) {
+ XIO_ERR("cannot get info on '%s'\n", aio_path);
+ goto done;
+ }
+ XIO_DBG("logfile '%s' size = %lld\n", aio_path, rot->aio_info.current_size);
+
+ if (rot->is_primary &&
+ global_logrot_auto > 0 &&
+ unlikely(rot->aio_info.current_size >= (loff_t)global_logrot_auto * 1024 * 1024 * 1024)) {
+ char *new_path = path_make("%s/log-%09d-%s", parent_path, aio_dent->d_serial + 1, my_id());
+
+ if (likely(new_path && !mars_find_dent(global, new_path))) {
+ XIO_INF("old logfile size = %lld, creating new logfile '%s'\n",
+ rot->aio_info.current_size,
+ new_path);
+ _create_new_logfile(new_path);
+ }
+ brick_string_free(new_path);
+ }
+
+ /* Fetch / make the transaction logger.
+ * We deliberately "forget" to connect the log input here.
+ * Will be carried out later in make_log_step().
+ * The final switch-on will be started in make_log_finalize().
+ */
+ trans_brick =
+ make_brick_all(global,
+ replay_link,
+ _set_trans_params,
+ NULL,
+ aio_path,
+ (const struct generic_brick_type *)&trans_logger_brick_type,
+ (const struct generic_brick_type *[]){NULL},
+ 1, /* create when necessary, but leave in current state otherwise */
+ "%s/replay-%s",
+ (const char *[]){"%s/data-%s"},
+ 1,
+ parent_path,
+ my_id(),
+ parent_path,
+ my_id());
+ rot->trans_brick = (void *)trans_brick;
+ status = -ENOENT;
+ if (!trans_brick)
+ goto done;
+ rot->trans_brick->kill_ptr = (void **)&rot->trans_brick;
+ rot->trans_brick->replay_limiter = &rot->replay_limiter;
+ /* For safety, default is to try an (unnecessary) replay in case
+ * something goes wrong later.
+ */
+ rot->replay_mode = true;
+
+ status = 0;
+
+done:
+ brick_string_free(aio_path);
+ brick_string_free(replay_path);
+ return status;
+}
+
+static
+bool _next_is_acceptable(struct mars_rotate *rot, struct mars_dent *old_dent, struct mars_dent *new_dent)
+{
+ /* Primaries are never allowed to consider logfiles not belonging to them.
+ * Secondaries need this for replay, unfortunately.
+ */
+ if ((rot->is_primary | rot->old_is_primary) ||
+ (rot->trans_brick && rot->trans_brick->power.on_led && !rot->trans_brick->replay_mode)) {
+ if (new_dent->stat_val.size) {
+ XIO_WRN("logrotate impossible, '%s' size = %lld\n",
+ new_dent->d_rest,
+ new_dent->stat_val.size);
+ return false;
+ }
+ if (strcmp(new_dent->d_rest, my_id())) {
+ XIO_WRN("logrotate impossible, '%s'\n", new_dent->d_rest);
+ return false;
+ }
+ } else {
+ /* Only secondaries should check for contiguity,
+ * primaries sometimes need holes for emergency mode.
+ */
+ if (new_dent->d_serial != old_dent->d_serial + 1)
+ return false;
+ }
+ return true;
+}
+
+/* Note: this is strictly called in d_serial order.
+ * This is important!
+ */
+static
+int make_log_step(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot;
+ struct trans_logger_brick *trans_brick;
+ struct mars_dent *prev_log;
+ int status = -EINVAL;
+
+ CHECK_PTR(parent, err);
+ rot = parent->d_private;
+ if (!rot)
+ goto err;
+ CHECK_PTR(rot, err);
+
+ status = 0;
+ trans_brick = rot->trans_brick;
+ if (!global->global_power.button || !dent->d_parent || !trans_brick || rot->has_error) {
+ XIO_DBG("nothing to do rot_error = %d\n", rot->has_error);
+ goto done;
+ }
+
+ /* Check for consecutiveness of logfiles
+ */
+ prev_log = rot->next_log;
+ if (prev_log && prev_log->d_serial + 1 != dent->d_serial) {
+ XIO_WRN_TO(rot->log_say,
+ "transaction logs are not consecutive at '%s' (%d ~> %d)\n",
+ dent->d_path,
+ prev_log->d_serial,
+ dent->d_serial);
+ make_rot_msg(rot,
+ "wrn-log-consecutive",
+ "transaction logs are not consecutive at '%s' (%d ~> %d)\n",
+ dent->d_path,
+ prev_log->d_serial,
+ dent->d_serial);
+ }
+
+ if (dent->d_serial > rot->max_sequence)
+ rot->max_sequence = dent->d_serial;
+
+ if (!rot->first_log)
+ rot->first_log = dent;
+
+ /* Skip any logfiles after the relevant one.
+ * This should happen only when replaying multiple logfiles
+ * in sequence, or when starting a new logfile for writing.
+ */
+ status = 0;
+ if (rot->relevant_log) {
+ if (!rot->next_relevant_log) {
+ if (_next_is_acceptable(rot, rot->relevant_log, dent))
+ rot->next_relevant_log = dent;
+ } else if (!rot->next_next_relevant_log) {
+ if (_next_is_acceptable(rot, rot->next_relevant_log, dent))
+ rot->next_next_relevant_log = dent;
+ }
+ XIO_DBG("next_relevant_log = %p next_next_relevant_log = %p\n",
+ rot->next_relevant_log,
+ rot->next_next_relevant_log);
+ goto ok;
+ }
+
+ /* Preconditions
+ */
+ if (!rot->replay_link || !rot->aio_dent || !rot->aio_brick) {
+ XIO_DBG("nothing to do on '%s'\n", dent->d_path);
+ goto ok;
+ }
+
+ /* Remember the relevant log.
+ */
+ if (rot->aio_dent->d_serial == dent->d_serial) {
+ rot->relevant_serial = dent->d_serial;
+ rot->relevant_log = dent;
+ }
+
+ok:
+ /* All ok: switch over the indicators.
+ */
+ XIO_DBG("next_log = '%s'\n", dent->d_path);
+ rot->prev_log = rot->next_log;
+ rot->next_log = dent;
+
+done:
+ if (status < 0) {
+ XIO_DBG("rot_error status = %d\n", status);
+ rot->has_error = true;
+ }
+err:
+ return status;
+}
+
+/* Internal helper. Return codes:
+ * ret < 0 : error
+ * ret == 0 : not relevant
+ * ret == 1 : relevant, no transaction replay, switch to the next
+ * ret == 2 : relevant for transaction replay
+ * ret == 3 : relevant for appending
+ *
+ * These codes are dispatched by the switch statement in
+ * _make_logging_status() below.
+ */
+static
+int _check_logging_status(struct mars_rotate *rot,
+ int *log_nr,
+ long long *oldpos_start,
+ long long *oldpos_end,
+ long long *newpos)
+{
+ struct mars_dent *dent = rot->relevant_log;
+ struct mars_dent *parent;
+ struct mars_global *global = NULL;
+ int status = 0;
+
+ if (!dent)
+ goto done;
+
+ status = -EINVAL;
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ global = rot->global;
+ CHECK_PTR_NULL(global, done);
+ CHECK_PTR(rot->replay_link, done);
+ CHECK_PTR(rot->aio_brick, done);
+ CHECK_PTR(rot->aio_dent, done);
+
+ XIO_DBG(" dent = '%s'\n", dent->d_path);
+ XIO_DBG("aio_dent = '%s'\n", rot->aio_dent->d_path);
+ if (unlikely(strcmp(dent->d_path, rot->aio_dent->d_path)))
+ goto done;
+
+ if (sscanf(rot->replay_link->d_argv[0], "log-%d", log_nr) != 1) {
+ XIO_ERR_TO(rot->log_say,
+ "replay link has malformed logfile number '%s'\n",
+ rot->replay_link->d_argv[0]);
+ goto done;
+ }
+ if (kstrtos64(rot->replay_link->d_argv[1], 0, oldpos_start)) {
+ XIO_ERR_TO(rot->log_say,
+ "replay link has bad start position argument '%s'\n",
+ rot->replay_link->d_argv[1]);
+ goto done;
+ }
+ if (kstrtos64(rot->replay_link->d_argv[2], 0, oldpos_end)) {
+ XIO_ERR_TO(rot->log_say,
+ "replay link has bad end position argument '%s'\n",
+ rot->replay_link->d_argv[2]);
+ goto done;
+ }
+ *oldpos_end += *oldpos_start;
+ if (unlikely(*oldpos_end < *oldpos_start)) {
+ XIO_ERR_TO(rot->log_say, "replay link end_pos %lld < start_pos %lld\n", *oldpos_end, *oldpos_start);
+ /* safety: use the smaller value, it does not hurt */
+ *oldpos_start = *oldpos_end;
+ if (unlikely(*oldpos_start < 0))
+ *oldpos_start = 0;
+ }
+
+ *newpos = rot->aio_info.current_size;
+
+ if (unlikely(rot->aio_info.current_size < *oldpos_start)) {
+ XIO_ERR_TO(rot->log_say,
+ "oops, bad replay position attempted at logfile '%s' (file length %lld should never be smaller than requested position %lld, is your filesystem corrupted?) => please repair this by hand\n",
+ rot->aio_dent->d_path,
+ rot->aio_info.current_size,
+ *oldpos_start);
+ make_rot_msg(rot,
+ "err-replay-size",
+ "oops, bad replay position attempted at logfile '%s' (file length %lld should never be smaller than requested position %lld, is your filesystem corrupted?) => please repair this by hand",
+ rot->aio_dent->d_path,
+ rot->aio_info.current_size,
+ *oldpos_start);
+ status = -EBADF;
+ goto done;
+ }
+
+ status = 0;
+ if (rot->aio_info.current_size > *oldpos_start) {
+ if (rot->aio_info.current_size - *oldpos_start < REPLAY_TOLERANCE &&
+ (rot->todo_primary ||
+ (rot->relevant_log &&
+ rot->next_relevant_log &&
+ is_switchover_possible(rot,
+ rot->relevant_log->d_path,
+ rot->next_relevant_log->d_path,
+ _get_tolerance(rot),
+ false)))) {
+ XIO_INF_TO(rot->log_say,
+ "TOLERANCE: transaction log '%s' is treated as fully applied\n",
+ rot->aio_dent->d_path);
+ make_rot_msg(rot,
+ "inf-replay-tolerance",
+ "TOLERANCE: transaction log '%s' is treated as fully applied",
+ rot->aio_dent->d_path);
+ status = 1;
+ } else {
+ XIO_INF_TO(rot->log_say,
+ "transaction log replay is necessary on '%s' from %lld to %lld (dirty region ends at %lld)\n",
+ rot->aio_dent->d_path,
+ *oldpos_start,
+ rot->aio_info.current_size,
+ *oldpos_end);
+ status = 2;
+ }
+ } else if (rot->next_relevant_log) {
+ XIO_INF_TO(rot->log_say,
+ "transaction log '%s' is already applied, and the next one is available for switching\n",
+ rot->aio_dent->d_path);
+ status = 1;
+ } else if (rot->todo_primary) {
+ if (rot->aio_info.current_size > 0 || strcmp(dent->d_rest, my_id()) != 0) {
+ XIO_INF_TO(rot->log_say,
+ "transaction log '%s' is already applied (would be usable for appending at position %lld, but a fresh logfile will be used for safety reasons)\n",
+ rot->aio_dent->d_path,
+ *oldpos_end);
+ status = 1;
+ } else {
+ XIO_INF_TO(rot->log_say,
+ "empty transaction log '%s' is usable for me as a primary node\n",
+ rot->aio_dent->d_path);
+ status = 3;
+ }
+ } else {
+ XIO_DBG("transaction log '%s' is the last one, currently fully applied\n", rot->aio_dent->d_path);
+ status = 0;
+ }
+
+done:
+ return status;
+}
+
+static
+int _make_logging_status(struct mars_rotate *rot)
+{
+ struct mars_dent *dent = rot->relevant_log;
+ struct mars_dent *parent;
+ struct mars_global *global = NULL;
+ struct trans_logger_brick *trans_brick;
+ int log_nr = 0;
+ loff_t start_pos = 0;
+ loff_t dirty_pos = 0;
+ loff_t end_pos = 0;
+ int status = 0;
+
+ if (!dent)
+ goto done;
+
+ status = -EINVAL;
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ global = rot->global;
+ CHECK_PTR_NULL(global, done);
+
+ status = 0;
+ trans_brick = rot->trans_brick;
+ if (!global->global_power.button || !trans_brick || rot->has_error) {
+ XIO_DBG("nothing to do rot_error = %d\n", rot->has_error);
+ goto done;
+ }
+
+ /* Find current logging status.
+ */
+ status = _check_logging_status(rot, &log_nr, &start_pos, &dirty_pos, &end_pos);
+ XIO_DBG("case = %d (todo_primary=%d is_primary=%d old_is_primary=%d)\n",
+ status,
+ rot->todo_primary,
+ rot->is_primary,
+ rot->old_is_primary);
+ if (status < 0)
+ goto done;
+ if (unlikely(start_pos < 0 || dirty_pos < start_pos || end_pos < dirty_pos)) {
+ XIO_ERR_TO(rot->log_say,
+ "replay symlink has implausible values: start_pos = %lld dirty_pos = %lld end_pos = %lld\n",
+ start_pos,
+ dirty_pos,
+ end_pos);
+ }
+ /* Relevant or not?
+ */
+ switch (status) {
+ case 0: /* not relevant */
+ goto ok;
+ case 1: /* Relevant, and transaction replay already finished.
+ * Allow switching over to a new logfile.
+ */
+ if (!trans_brick->power.button && !trans_brick->power.on_led && trans_brick->power.off_led) {
+ if (rot->next_relevant_log) {
+ int replay_tolerance = _get_tolerance(rot);
+ bool skip_new = !rot->next_next_relevant_log && rot->todo_primary;
+
+ XIO_DBG("check switchover from '%s' to '%s' (size = %lld, next_next = %p, skip_new = %d, replay_tolerance = %d)\n",
+ dent->d_path,
+ rot->next_relevant_log->d_path,
+ rot->next_relevant_log->stat_val.size,
+ rot->next_next_relevant_log,
+ skip_new,
+ replay_tolerance);
+ if (is_switchover_possible(rot,
+ dent->d_path,
+ rot->next_relevant_log->d_path,
+ replay_tolerance,
+ skip_new)) {
+ XIO_INF_TO(rot->log_say,
+ "start switchover from transaction log '%s' to '%s'\n",
+ dent->d_path,
+ rot->next_relevant_log->d_path);
+ _make_new_replaylink(rot,
+ rot->next_relevant_log->d_rest,
+ rot->next_relevant_log->d_serial,
+ rot->next_relevant_log->stat_val.size);
+ }
+ } else if (rot->todo_primary) {
+ if (dent->d_serial > log_nr)
+ log_nr = dent->d_serial;
+ XIO_INF_TO(rot->log_say,
+ "preparing new transaction log, number moves from %d to %d\n",
+ dent->d_serial,
+ log_nr + 1);
+ _make_new_replaylink(rot, my_id(), log_nr + 1, 0);
+ } else {
+ XIO_DBG("nothing to do on last transaction log '%s'\n", dent->d_path);
+ }
+ }
+ status = -EAGAIN;
+ goto done;
+ case 2: /* relevant for transaction replay */
+ XIO_INF_TO(rot->log_say,
+ "replaying transaction log '%s' from position %lld to %lld\n",
+ dent->d_path,
+ start_pos,
+ end_pos);
+ rot->replay_mode = true;
+ rot->start_pos = start_pos;
+ rot->end_pos = end_pos;
+ break;
+ case 3: /* relevant for appending */
+ XIO_INF_TO(rot->log_say, "appending to transaction log '%s'\n", dent->d_path);
+ rot->replay_mode = false;
+ rot->start_pos = 0;
+ rot->end_pos = 0;
+ break;
+ default:
+ XIO_ERR_TO(rot->log_say, "bad internal status %d\n", status);
+ status = -EINVAL;
+ goto done;
+ }
+
+ok:
+ /* All ok: switch over the indicators.
+ */
+ rot->prev_log = rot->next_log;
+ rot->next_log = dent;
+
+done:
+ if (status < 0) {
+ XIO_DBG("rot_error status = %d\n", status);
+ rot->has_error = true;
+ }
+ return status;
+}
+
+static
+void _init_trans_input(struct trans_logger_input *trans_input, struct mars_dent *log_dent, struct mars_rotate *rot)
+{
+ if (unlikely(trans_input->connect || trans_input->is_operating)) {
+ XIO_ERR("this should not happen\n");
+ goto out_return;
+ }
+
+ memset(&trans_input->inf, 0, sizeof(trans_input->inf));
+
+ strncpy(trans_input->inf.inf_host, log_dent->d_rest, sizeof(trans_input->inf.inf_host));
+ trans_input->inf.inf_sequence = log_dent->d_serial;
+ trans_input->inf.inf_private = rot;
+ trans_input->inf.inf_callback = _update_info;
+ XIO_DBG("initialized '%s' %d\n", trans_input->inf.inf_host, trans_input->inf.inf_sequence);
+out_return:;
+}
+
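+/* Pick the currently unused one of the two log input slots.
+ * Illustrative: when log_input_nr == TL_INPUT_LOG1, the modulo formula
+ * below yields TL_INPUT_LOG2, and vice versa; the candidate is only
+ * usable when it is neither operating nor still connected.
+ */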
+static
+int _get_free_input(struct trans_logger_brick *trans_brick)
+{
+ int nr = (((trans_brick->log_input_nr - TL_INPUT_LOG1) + 1) % 2) + TL_INPUT_LOG1;
+ struct trans_logger_input *candidate;
+
+ candidate = trans_brick->inputs[nr];
+ if (unlikely(!candidate)) {
+ XIO_ERR("input nr = %d is corrupted!\n", nr);
+ return -EEXIST;
+ }
+ if (unlikely(candidate->is_operating || candidate->connect)) {
+ XIO_DBG("nr = %d unusable! is_operating = %d connect = %p\n",
+ nr,
+ candidate->is_operating,
+ candidate->connect);
+ return -EEXIST;
+ }
+ XIO_DBG("got nr = %d\n", nr);
+ return nr;
+}
+
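+/* Sketch of the logrotate steps below: (a) when the old log input is
+ * fully replayed (min_pos == max_pos, no pending positions or log
+ * objects), disconnect it; (b) otherwise, when the next logfile with the
+ * following sequence number is already known, open it and connect it to
+ * the free input slot, so logging switches over without stopping the
+ * transaction logger brick.
+ */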
+static
+void _rotate_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+ int old_nr = trans_brick->old_input_nr;
+ int log_nr = trans_brick->log_input_nr;
+ int next_nr;
+
+ XIO_DBG("log_input_nr = %d old_input_nr = %d next_relevant_log = %p\n",
+ log_nr,
+ old_nr,
+ rot->next_relevant_log);
+
+ /* try to cleanup old log */
+ if (log_nr != old_nr) {
+ struct trans_logger_input *trans_input = trans_brick->inputs[old_nr];
+ struct trans_logger_input *new_input = trans_brick->inputs[log_nr];
+
+ if (!trans_input->connect) {
+ XIO_DBG("ignoring unused old input %d\n", old_nr);
+ } else if (!new_input->is_operating) {
+ XIO_DBG("ignoring uninitialized new input %d\n", log_nr);
+ } else if (trans_input->is_operating &&
+ trans_input->inf.inf_min_pos == trans_input->inf.inf_max_pos &&
+ list_empty(&trans_input->pos_list) &&
+ atomic_read(&trans_input->log_obj_count) <= 0) {
+ int status;
+
+ XIO_INF("cleanup old transaction log (%d -> %d)\n", old_nr, log_nr);
+ status = generic_disconnect((void *)trans_input);
+ if (unlikely(status < 0))
+ XIO_ERR("disconnect failed\n");
+ else
+ remote_trigger();
+ } else {
+ XIO_DBG("old transaction replay not yet finished: is_operating = %d pos %lld != %lld\n",
+ trans_input->is_operating,
+ trans_input->inf.inf_min_pos,
+ trans_input->inf.inf_max_pos);
+ }
+ } else
+ /* try to setup new log */
+ if (log_nr == trans_brick->new_input_nr &&
+ rot->next_relevant_log &&
+ (rot->next_relevant_log->d_serial == trans_brick->inputs[log_nr]->inf.inf_sequence + 1 ||
+ trans_brick->cease_logging)) {
+ struct trans_logger_input *trans_input;
+ int status;
+
+ next_nr = _get_free_input(trans_brick);
+ if (unlikely(next_nr < 0)) {
+ XIO_ERR_TO(rot->log_say, "no free input\n");
+ goto done;
+ }
+
+ XIO_DBG("start switchover %d -> %d\n", old_nr, next_nr);
+
+ rot->next_relevant_brick =
+ make_brick_all(rot->global,
+ rot->next_relevant_log,
+ _set_aio_params,
+ NULL,
+ rot->next_relevant_log->d_path,
+ (const struct generic_brick_type *)&aio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* create + activate */
+ rot->next_relevant_log->d_path,
+ (const char *[]){},
+ 0);
+ if (unlikely(!rot->next_relevant_brick)) {
+ XIO_ERR_TO(rot->log_say,
+ "could not open next transaction log '%s'\n",
+ rot->next_relevant_log->d_path);
+ goto done;
+ }
+ trans_input = trans_brick->inputs[next_nr];
+ if (unlikely(!trans_input)) {
+ XIO_ERR_TO(rot->log_say, "internal log input does not exist\n");
+ goto done;
+ }
+
+ _init_trans_input(trans_input, rot->next_relevant_log, rot);
+
+ status = generic_connect((void *)trans_input, (void *)rot->next_relevant_brick->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR_TO(rot->log_say, "internal connect failed\n");
+ goto done;
+ }
+ trans_brick->new_input_nr = next_nr;
+ XIO_INF_TO(rot->log_say,
+ "started logrotate switchover from '%s' to '%s'\n",
+ rot->relevant_log->d_path,
+ rot->next_relevant_log->d_path);
+ }
+done:;
+}
+
+static
+void _change_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+
+ XIO_DBG("replay_mode = %d start_pos = %lld end_pos = %lld\n",
+ trans_brick->replay_mode,
+ rot->start_pos,
+ rot->end_pos);
+
+ if (trans_brick->replay_mode) {
+ trans_brick->replay_start_pos = rot->start_pos;
+ trans_brick->replay_end_pos = rot->end_pos;
+ } else {
+ _rotate_trans(rot);
+ }
+}
+
+static
+int _start_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick;
+ struct trans_logger_input *trans_input;
+ int nr;
+ int status;
+
+ /* Internal safety checks
+ */
+ status = -EINVAL;
+ if (unlikely(!rot)) {
+ XIO_ERR("rot is NULL\n");
+ goto done;
+ }
+ if (unlikely(!rot->aio_brick || !rot->relevant_log)) {
+ XIO_ERR("aio %p or relevant log %p is missing, this should not happen\n",
+ rot->aio_brick,
+ rot->relevant_log);
+ goto done;
+ }
+ trans_brick = rot->trans_brick;
+ if (unlikely(!trans_brick)) {
+ XIO_ERR("logger instance does not exist\n");
+ goto done;
+ }
+
+ /* Update status when already working
+ */
+ if (trans_brick->power.button || !trans_brick->power.off_led) {
+ _change_trans(rot);
+ status = 0;
+ goto done;
+ }
+
+ /* Further safety checks.
+ */
+ if (unlikely(rot->relevant_brick)) {
+ XIO_ERR("log aio brick already present, this should not happen\n");
+ goto done;
+ }
+ if (unlikely(trans_brick->inputs[TL_INPUT_LOG1]->is_operating || trans_brick->inputs[TL_INPUT_LOG2]->is_operating)) {
+ XIO_ERR("some input is operating, this should not happen\n");
+ goto done;
+ }
+
+ /* Allocate new input slot
+ */
+ nr = _get_free_input(trans_brick);
+ if (unlikely(nr < TL_INPUT_LOG1 || nr > TL_INPUT_LOG2)) {
+ XIO_ERR("bad new_input_nr = %d\n", nr);
+ goto done;
+ }
+ trans_brick->new_input_nr = nr;
+ trans_brick->old_input_nr = nr;
+ trans_brick->log_input_nr = nr;
+ trans_input = trans_brick->inputs[nr];
+ if (unlikely(!trans_input)) {
+ XIO_ERR("log input %d does not exist\n", nr);
+ goto done;
+ }
+
+ /* Open new transaction log
+ */
+ rot->relevant_brick =
+ make_brick_all(rot->global,
+ rot->relevant_log,
+ _set_aio_params,
+ NULL,
+ rot->relevant_log->d_path,
+ (const struct generic_brick_type *)&aio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* start always */
+ rot->relevant_log->d_path,
+ (const char *[]){},
+ 0);
+ if (unlikely(!rot->relevant_brick)) {
+ XIO_ERR("log aio brick '%s' not open\n", rot->relevant_log->d_path);
+ goto done;
+ }
+
+ /* Supply all relevant parameters
+ */
+ trans_brick->replay_mode = rot->replay_mode;
+ trans_brick->replay_tolerance = REPLAY_TOLERANCE;
+ _init_trans_input(trans_input, rot->relevant_log, rot);
+
+ /* Connect to new transaction log
+ */
+ status = generic_connect((void *)trans_input, (void *)rot->relevant_brick->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR("initial connect failed\n");
+ goto done;
+ }
+
+ _change_trans(rot);
+
+ /* Switch on....
+ */
+ status = mars_power_button((void *)trans_brick, true, false);
+ XIO_DBG("status = %d\n", status);
+
+done:
+ return status;
+}
+
+static
+int _stop_trans(struct mars_rotate *rot, const char *parent_path)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+ int status = 0;
+
+ if (!trans_brick)
+ goto done;
+
+ /* Switch off temporarily....
+ */
+ status = mars_power_button((void *)trans_brick, false, false);
+ XIO_DBG("status = %d\n", status);
+ if (status < 0)
+ goto done;
+
+ /* Disconnect old connection(s)
+ */
+ if (trans_brick->power.off_led) {
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *trans_input;
+
+ trans_input = trans_brick->inputs[i];
+ if (trans_input && !trans_input->is_operating) {
+ if (trans_input->connect)
+ (void)generic_disconnect((void *)trans_input);
+ }
+ }
+ }
+
+done:
+ return status;
+}
+
+static
+int make_log_finalize(struct mars_global *global, struct mars_dent *dent)
+{
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot;
+ struct trans_logger_brick *trans_brick;
+ struct copy_brick *fetch_brick;
+ bool is_attached;
+ bool is_stopped;
+ int status = -EINVAL;
+
+ CHECK_PTR(parent, err);
+ rot = parent->d_private;
+ if (!rot)
+ goto err;
+ CHECK_PTR(rot, err);
+ rot->has_symlinks = true;
+ trans_brick = rot->trans_brick;
+ status = 0;
+ if (!trans_brick) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ /* Handle jamming (a very exceptional state)
+ */
+ if (IS_JAMMED()) {
+#ifndef CONFIG_MARS_DEBUG
+ brick_say_logging = 0;
+#endif
+ rot->has_emergency = true;
+ XIO_ERR_TO(rot->log_say, "DISK SPACE IS EXTREMELY LOW on %s\n", rot->parent_path);
+ make_rot_msg(rot, "err-space-low", "DISK SPACE IS EXTREMELY LOW");
+ } else {
+ int limit = _check_allow(global, parent, "emergency-limit");
+
+ rot->has_emergency = (limit > 0 && global_remaining_space * 100 / global_total_space < limit);
+ XIO_DBG("has_emergency=%d limit=%d remaining_space=%lld total_space=%lld\n",
+ rot->has_emergency, limit, global_remaining_space, global_total_space);
+ }
+ _show_actual(parent->d_path, "has-emergency", rot->has_emergency);
+ if (rot->has_emergency) {
+ if (rot->todo_primary || rot->is_primary) {
+ trans_brick->cease_logging = true;
+ rot->inf_prev_sequence = 0; /* disable checking */
+ }
+ } else {
+ if (!trans_logger_resume) {
+ XIO_INF_TO(rot->log_say,
+ "emergency mode on %s could be turned off now, but /proc/sys/mars/logger_resume inhibits it.\n",
+ rot->parent_path);
+ } else {
+ trans_brick->cease_logging = false;
+ XIO_INF_TO(rot->log_say, "emergency mode on %s will be turned off again\n", rot->parent_path);
+ }
+ }
+ is_stopped = trans_brick->cease_logging | trans_brick->stopped_logging;
+ _show_actual(parent->d_path, "is-emergency", is_stopped);
+ if (is_stopped) {
+ XIO_ERR_TO(rot->log_say,
+ "EMERGENCY MODE on %s: stopped transaction logging, and created a hole in the logfile sequence nubers.\n",
+ rot->parent_path);
+ make_rot_msg(rot,
+ "err-emergency",
+ "EMERGENCY MODE on %s: stopped transaction logging, and created a hole in the logfile sequence nubers.\n",
+ rot->parent_path);
+ /* Create a hole in the sequence of logfile numbers.
+ * The secondaries will later stumble over it.
+ */
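+ /* Illustrative example: with max_sequence == 42 this creates the
+ * empty logfile log-000000052-<my_id>, leaving serials 43..51 unused;
+ * secondaries detect the non-contiguous numbering (cf. the
+ * wrn-log-consecutive message in make_log_step()).
+ */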
+ if (!rot->created_hole) {
+ char *new_path = path_make("%s/log-%09d-%s",
+ rot->parent_path,
+ rot->max_sequence + 10,
+ my_id());
+ if (likely(new_path && !mars_find_dent(global, new_path))) {
+ XIO_INF_TO(rot->log_say, "EMERGENCY: creating new logfile '%s'\n", new_path);
+ _create_new_logfile(new_path);
+ rot->created_hole = true;
+ }
+ brick_string_free(new_path);
+ }
+ } else {
+ rot->created_hole = false;
+ }
+
+ if (IS_EMERGENCY_PRIMARY() || (!rot->todo_primary && IS_EMERGENCY_SECONDARY())) {
+ XIO_WRN_TO(rot->log_say, "EMERGENCY: the space on /mars/ is very low. Expect some problems!\n");
+ if (rot->first_log && rot->first_log != rot->relevant_log) {
+ XIO_WRN_TO(rot->log_say,
+ "EMERGENCY: ruthlessly freeing old logfile '%s', don't cry on any ramifications.\n",
+ rot->first_log->d_path);
+ make_rot_msg(rot,
+ "wrn-space-low",
+ "EMERGENCY: ruthlessly freeing old logfile '%s'",
+ rot->first_log->d_path);
+ mars_unlink(rot->first_log->d_path);
+ rot->first_log->d_killme = true;
+ /* give it a chance to cease deleting next time */
+ compute_emergency_mode();
+ } else {
+ make_rot_msg(rot,
+ "wrn-space-low",
+ "EMERGENCY: the space on /mars/ is very low. Expect some problems!");
+ }
+ } else if (IS_EXHAUSTED()) {
+ XIO_WRN_TO(rot->log_say,
+ "EMERGENCY: the space on /mars/ is becoming low. Stopping all fetches of logfiles for secondary resources.\n");
+ make_rot_msg(rot,
+ "wrn-space-low",
+ "EMERGENCY: the space on /mars/ is becoming low. Stopping all fetches of logfiles for secondary resources.");
+ }
+
+ if (trans_brick->replay_mode) {
+ if (trans_brick->replay_code > 0) {
+ XIO_INF_TO(rot->log_say,
+ "logfile replay ended successfully at position %lld\n",
+ trans_brick->replay_current_pos);
+ } else if (trans_brick->replay_code == -EAGAIN ||
+ trans_brick->replay_end_pos - trans_brick->replay_current_pos < trans_brick->replay_tolerance) {
+ XIO_INF_TO(rot->log_say,
+ "logfile replay stopped intermediately at position %lld\n",
+ trans_brick->replay_current_pos);
+ } else if (trans_brick->replay_code < 0) {
+ XIO_ERR_TO(rot->log_say,
+ "logfile replay stopped with error = %d at position %lld\n",
+ trans_brick->replay_code,
+ trans_brick->replay_current_pos);
+ make_rot_msg(rot,
+ "err-replay-stop",
+ "logfile replay stopped with error = %d at position %lld",
+ trans_brick->replay_code,
+ trans_brick->replay_current_pos);
+ }
+ }
+
+ /* Stopping is also possible in case of errors
+ */
+ if (trans_brick->power.button && trans_brick->power.on_led && !trans_brick->power.off_led) {
+ bool do_stop = true;
+
+ if (trans_brick->replay_mode) {
+ rot->is_log_damaged =
+ trans_brick->replay_code == -EAGAIN &&
+ trans_brick->replay_end_pos - trans_brick->replay_current_pos < trans_brick->replay_tolerance;
+ do_stop = trans_brick->replay_code != 0 ||
+ !global->global_power.button ||
+ !_check_allow(global, parent, "allow-replay") ||
+ !_check_allow(global, parent, "attach");
+ } else {
+ do_stop =
+ !rot->if_brick &&
+ !rot->is_primary &&
+ (!rot->todo_primary ||
+ !_check_allow(global, parent, "attach"));
+ }
+
+ XIO_DBG("replay_mode = %d replay_code = %d is_primary = %d do_stop = %d\n",
+ trans_brick->replay_mode,
+ trans_brick->replay_code,
+ rot->is_primary,
+ (int)do_stop);
+
+ if (do_stop)
+ status = _stop_trans(rot, parent->d_path);
+ else
+ _change_trans(rot);
+ goto done;
+ }
+
+ /* Starting is only possible when no error occurred.
+ */
+ if (!rot->relevant_log || rot->has_error) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ /* Start when necessary
+ */
+ if (!trans_brick->power.button && !trans_brick->power.on_led && trans_brick->power.off_led) {
+ bool do_start;
+
+ status = _make_logging_status(rot);
+ if (status <= 0)
+ goto done;
+
+ rot->is_log_damaged = false;
+
+ do_start = (!rot->replay_mode ||
+ (rot->start_pos != rot->end_pos &&
+ _check_allow(global, parent, "allow-replay")));
+
+ if (do_start && rot->forbid_replay) {
+ XIO_INF("cannot start replay because sync wants to start\n");
+ make_rot_msg(rot, "inf-replay-start", "cannot start replay because sync wants to star");
+ do_start = false;
+ }
+
+ if (do_start && rot->sync_brick && !rot->sync_brick->power.off_led) {
+ XIO_INF("cannot start replay because sync is running\n");
+ make_rot_msg(rot, "inf-replay-start", "cannot start replay because sync is running");
+ do_start = false;
+ }
+
+ XIO_DBG("rot->replay_mode = %d rot->start_pos = %lld rot->end_pos = %lld | do_start = %d\n",
+ rot->replay_mode,
+ rot->start_pos,
+ rot->end_pos,
+ do_start);
+
+ if (do_start)
+ status = _start_trans(rot);
+ }
+
+done:
+ /* check whether some copy has finished */
+ fetch_brick = (struct copy_brick *)mars_find_brick(global, &copy_brick_type, rot->fetch_path);
+ XIO_DBG("fetch_path = '%s' fetch_brick = %p\n", rot->fetch_path, fetch_brick);
+ if (fetch_brick &&
+ (fetch_brick->power.off_led ||
+ !global->global_power.button ||
+ !_check_allow(global, parent, "connect") ||
+ !_check_allow(global, parent, "attach") ||
+ (fetch_brick->copy_last == fetch_brick->copy_end &&
+ (rot->fetch_next_is_available > 0 ||
+ rot->fetch_round++ > 3)))) {
+ status = xio_kill_brick((void *)fetch_brick);
+ if (status < 0)
+ XIO_ERR("could not kill fetch_brick, status = %d\n", status);
+ else
+ fetch_brick = NULL;
+ local_trigger();
+ }
+ rot->fetch_next_is_available = 0;
+ rot->fetch_brick = fetch_brick;
+ if (fetch_brick)
+ fetch_brick->kill_ptr = (void **)&rot->fetch_brick;
+ else
+ rot->fetch_serial = 0;
+ /* remove trans_logger (when possible) upon detach */
+ is_attached = !!rot->trans_brick;
+ _show_actual(rot->parent_path, "is-attached", is_attached);
+
+ if (rot->trans_brick && rot->trans_brick->power.off_led && !rot->trans_brick->outputs[0]->nr_connected) {
+ bool do_attach = _check_allow(global, parent, "attach");
+
+ XIO_DBG("do_attach = %d\n", do_attach);
+ if (!do_attach) {
+ rot->trans_brick->killme = true;
+ rot->trans_brick = NULL;
+ }
+ }
+
+ _show_actual(rot->parent_path,
+ "is-replaying",
+ rot->trans_brick && rot->trans_brick->replay_mode && !rot->trans_brick->power.off_led);
+ _show_rate(rot, &rot->replay_limiter, rot->trans_brick && rot->trans_brick->power.on_led, "replay_rate");
+ _show_actual(rot->parent_path, "is-copying", rot->fetch_brick && !rot->fetch_brick->power.off_led);
+ _show_rate(rot, &rot->fetch_limiter, rot->fetch_brick && rot->fetch_brick->power.on_led, "file_rate");
+ _show_actual(rot->parent_path, "is-syncing", rot->sync_brick && !rot->sync_brick->power.off_led);
+ _show_rate(rot, &rot->sync_limiter, rot->sync_brick && rot->sync_brick->power.on_led, "sync_rate");
+err:
+ return status;
+}
+
+/*********************************************************************/
+
+/* specific handlers */
+
+static
+int make_primary(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent;
+ struct mars_rotate *rot;
+ int status = -EINVAL;
+
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ rot = parent->d_private;
+ if (!rot)
+ goto done;
+ CHECK_PTR(rot, done);
+
+ rot->has_symlinks = true;
+
+ rot->todo_primary =
+ global->global_power.button && dent->link_val && !strcmp(dent->link_val, my_id());
+ XIO_DBG("todo_primary = %d is_primary = %d\n", rot->todo_primary, rot->is_primary);
+ status = 0;
+
+done:
+ return status;
+}
+
+static
+int make_bio(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_rotate *rot;
+ struct xio_brick *brick;
+ bool switch_on;
+ int status = 0;
+
+ if (!global || !global->global_power.button || !dent->d_parent)
+ goto done;
+ rot = dent->d_parent->d_private;
+ if (!rot)
+ goto done;
+
+ rot->has_symlinks = true;
+
+ switch_on = _check_allow(global, dent->d_parent, "attach");
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot access disk: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ brick =
+ make_brick_all(global,
+ dent,
+ _set_bio_params,
+ NULL,
+ dent->d_path,
+ (const struct generic_brick_type *)&bio_brick_type,
+ (const struct generic_brick_type *[]){},
+ switch_on ? 2 : -1,
+ dent->d_path,
+ (const char *[]){},
+ 0);
+ rot->bio_brick = brick;
+ if (unlikely(!brick)) {
+ status = -ENXIO;
+ goto done;
+ }
+
+ /* Report the actual size of the device.
+ * It may be larger than the global size.
+ */
+ if (brick && brick->power.on_led) {
+ struct xio_info info = {};
+ struct xio_output *output;
+ char *src = NULL;
+ char *dst = NULL;
+
+ output = brick->outputs[0];
+ status = output->ops->xio_get_info(output, &info);
+ if (status < 0) {
+ XIO_ERR("cannot get info on '%s'\n", dent->d_path);
+ goto done;
+ }
+ src = path_make("%lld", info.current_size);
+ dst = path_make("%s/actsize-%s", dent->d_parent->d_path, my_id());
+ if (src && dst)
+ (void)mars_symlink(src, dst, NULL, 0);
+ brick_string_free(src);
+ brick_string_free(dst);
+ }
+
+done:
+ return status;
+}
+
+static int make_replay(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ int status = 0;
+
+ if (!global->global_power.button || !parent || !dent->link_val) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ status = make_log_finalize(global, dent);
+ if (status < 0) {
+ XIO_DBG("logger not initialized\n");
+ goto done;
+ }
+
+done:
+ return status;
+}
+
+static
+int make_dev(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot = NULL;
+ struct xio_brick *dev_brick;
+ struct if_brick *_dev_brick;
+ char *dev_name = NULL;
+ bool switch_on;
+ int open_count = 0;
+ int status = 0;
+
+ if (!parent || !dent->link_val) {
+ XIO_ERR("nothing to do\n");
+ return -EINVAL;
+ }
+ rot = parent->d_private;
+ if (!rot || !rot->parent_path) {
+ XIO_DBG("nothing to do\n");
+ goto err;
+ }
+ rot->has_symlinks = true;
+ if (!rot->trans_brick) {
+ XIO_DBG("transaction logger does not exist\n");
+ goto done;
+ }
+ if (!global->global_power.button &&
+ (!rot->if_brick || rot->if_brick->power.off_led)) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+ if (rot->dev_size <= 0) {
+ XIO_WRN("trying to create device '%s' with zero size\n", dent->d_path);
+ goto done;
+ }
+
+ status = _parse_args(dent, dent->link_val, 1);
+ if (status < 0) {
+ XIO_DBG("fail\n");
+ goto done;
+ }
+
+ dev_name = path_make("mars/%s", dent->d_argv[0]);
+
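+ /* Hedged summary of the condition below: keep the device as long as
+ * somebody holds it open; otherwise only show it when this node shall
+ * become primary, logging (not replay) is active, and attach is allowed.
+ */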
+ switch_on =
+ (rot->if_brick && atomic_read(&rot->if_brick->open_count) > 0) ||
+ (rot->todo_primary &&
+ !rot->trans_brick->replay_mode &&
+ rot->trans_brick->power.on_led &&
+ _check_allow(global, dent->d_parent, "attach"));
+ if (!global->global_power.button)
+ switch_on = false;
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot create device: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ dev_brick =
+ make_brick_all(global,
+ dent,
+ _set_if_params,
+ rot,
+ dev_name,
+ (const struct generic_brick_type *)&if_brick_type,
+ (const struct generic_brick_type *[]){(const struct generic_brick_type *)&trans_logger_brick_type},
+ switch_on ? 2 : -1,
+ "%s/device-%s",
+ (const char *[]){"%s/replay-%s"},
+ 1,
+ parent->d_path,
+ my_id(),
+ parent->d_path,
+ my_id());
+ rot->if_brick = (void *)dev_brick;
+ if (!dev_brick) {
+ XIO_DBG("device not shown\n");
+ goto done;
+ }
+ if (!switch_on) {
+ XIO_DBG("setting killme on if_brick\n");
+ dev_brick->killme = true;
+ }
+ dev_brick->kill_ptr = (void **)&rot->if_brick;
+ dev_brick->show_status = _show_brick_status;
+ _dev_brick = (void *)dev_brick;
+ open_count = atomic_read(&_dev_brick->open_count);
+
+done:
+ __show_actual(rot->parent_path, "open-count", open_count);
+ rot->is_primary =
+ rot->if_brick && !rot->if_brick->power.off_led;
+ _show_primary(rot, parent);
+
+err:
+ brick_string_free(dev_name);
+ return status;
+}
+
+static
+int kill_dev(void *buf, struct mars_dent *dent)
+{
+ struct mars_dent *parent = dent->d_parent;
+ int status = kill_any(buf, dent);
+
+ if (status > 0 && parent) {
+ struct mars_rotate *rot = parent->d_private;
+
+ if (rot)
+ rot->if_brick = NULL;
+ }
+ return status;
+}
+
+static
+int _update_syncstatus(struct mars_rotate *rot, struct copy_brick *copy, char *peer)
+{
+ const char *src = NULL;
+ const char *dst = NULL;
+ int status = -ENOMEM;
+
+ src = path_make("%lld", copy->copy_last);
+ dst = path_make("%s/syncstatus-%s", rot->parent_path, my_id());
+ if (unlikely(!src || !dst))
+ goto done;
+
+ status = _update_link_when_necessary(rot, "syncstatus", src, dst);
+
+ brick_string_free(src);
+ brick_string_free(dst);
+ src = path_make("%lld,%lld", copy->verify_ok_count, copy->verify_error_count);
+ dst = path_make("%s/verifystatus-%s", rot->parent_path, my_id());
+ if (unlikely(!src || !dst))
+ goto done;
+
+ (void)_update_link_when_necessary(rot, "verifystatus", src, dst);
+
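+ /* Hedged reading of the mtime checks below: the syncpos link snapshots
+ * the peer's replay position at sync completion; it is taken only from
+ * a peer replay link at least as new as our own syncstatus, and only
+ * refreshed when the existing syncpos is older than that syncstatus.
+ */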
+ if (copy->copy_last == copy->copy_end && status >= 0) { /* create syncpos symlink */
+ const char *syncpos_path = path_make("%s/syncpos-%s", rot->parent_path, my_id());
+ const char *peer_replay_path = path_make("%s/replay-%s", rot->parent_path, peer);
+ char *peer_replay_link = NULL;
+ struct kstat syncpos_stat = {};
+ struct kstat syncstatus_stat = {};
+ struct kstat peer_replay_stat = {};
+
+ if (syncpos_path &&
+ peer_replay_path &&
+ mars_stat(dst, &syncstatus_stat, true) >= 0 &&
+ mars_stat(peer_replay_path, &peer_replay_stat, true) >= 0 &&
+ timespec_compare(&syncstatus_stat.mtime, &peer_replay_stat.mtime) <= 0) {
+ peer_replay_link = mars_readlink(peer_replay_path);
+ if (peer_replay_link && peer_replay_link[0] &&
+ (mars_stat(syncpos_path, &syncpos_stat, true) < 0 ||
+ timespec_compare(&syncpos_stat.mtime, &syncstatus_stat.mtime) < 0)) {
+ _update_link_when_necessary(rot, "syncpos", peer_replay_link, syncpos_path);
+ }
+ }
+ brick_string_free(peer_replay_link);
+ brick_string_free(peer_replay_path);
+ brick_string_free(syncpos_path);
+ }
+
+done:
+ brick_string_free(src);
+ brick_string_free(dst);
+ return status;
+}
+
+static int make_sync(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_rotate *rot;
+ loff_t start_pos = 0;
+ loff_t end_pos = 0;
+ struct mars_dent *size_dent;
+ struct mars_dent *primary_dent;
+ char *peer;
+ struct copy_brick *copy = NULL;
+ char *tmp = NULL;
+ const char *switch_path = NULL;
+ const char *copy_path = NULL;
+ const char *src = NULL;
+ const char *dst = NULL;
+ bool do_start;
+ int status;
+
+ if (!global->global_power.button || !dent->d_parent || !dent->link_val)
+ return 0;
+
+ do_start = _check_allow(global, dent->d_parent, "attach");
+
+ /* Analyze replay position
+ */
+ status = kstrtos64(dent->link_val, 0, &start_pos);
+ if (unlikely(status)) {
+ XIO_ERR("bad syncstatus symlink syntax '%s' (%s)\n", dent->link_val, dent->d_path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ rot = dent->d_parent->d_private;
+ status = -ENOENT;
+ CHECK_PTR(rot, done);
+
+ rot->forbid_replay = false;
+ rot->has_symlinks = true;
+ rot->allow_update = true;
+ rot->syncstatus_dent = dent;
+
+ /* Sync necessary?
+ */
+ tmp = path_make("%s/size", dent->d_parent->d_path);
+ status = -ENOMEM;
+ if (unlikely(!tmp))
+ goto done;
+ size_dent = (void *)mars_find_dent(global, tmp);
+ if (!size_dent || !size_dent->link_val) {
+ XIO_ERR("cannot determine size '%s'\n", tmp);
+ status = -ENOENT;
+ goto done;
+ }
+ status = kstrtos64(size_dent->link_val, 0, &end_pos);
+ if (unlikely(status)) {
+ XIO_ERR("bad size symlink syntax '%s' (%s)\n", size_dent->link_val, tmp);
+ status = -EINVAL;
+ goto done;
+ }
+ if (start_pos >= end_pos) {
+ XIO_DBG("no data sync necessary, size = %lld\n", start_pos);
+ do_start = false;
+ }
+ brick_string_free(tmp);
+
+ /* Determine peer
+ */
+ tmp = path_make("%s/primary", dent->d_parent->d_path);
+ status = -ENOMEM;
+ if (unlikely(!tmp))
+ goto done;
+ primary_dent = (void *)mars_find_dent(global, tmp);
+ if (!primary_dent || !primary_dent->link_val) {
+ XIO_ERR("cannot determine primary, symlink '%s'\n", tmp);
+ status = 0;
+ goto done;
+ }
+ peer = primary_dent->link_val;
+ if (!strcmp(peer, "(none)")) {
+ XIO_INF("cannot start sync, no primary is designated\n");
+ status = 0;
+ goto done;
+ }
+
+ /* Don't try syncing detached resources
+ */
+ if (do_start && !_check_allow(global, dent->d_parent, "attach"))
+ do_start = false;
+
+ /* Disallow contemporary sync & logfile_replay
+ */
+ if (do_start &&
+ rot->trans_brick &&
+ !rot->trans_brick->power.off_led) {
+ XIO_INF("cannot start sync because logger is working\n");
+ do_start = false;
+ }
+
+ /* Disallow overwrite of newer data
+ */
+ if (do_start && compare_replaylinks(rot, peer, my_id()) < 0) {
+ XIO_INF("cannot start sync because my data is newer than the remote one at '%s'!\n", peer);
+ do_start = false;
+ rot->forbid_replay = true;
+ }
+
+ /* Flip between replay and sync
+ */
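+ /* (hedged reading: a long-running initial sync is periodically paused
+ * after CONFIG_MARS_SYNC_FLIP_INTERVAL seconds so that pending logfile
+ * replay cannot starve indefinitely)
+ */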
+ if (do_start && rot->replay_mode && rot->end_pos > rot->start_pos &&
+ mars_sync_flip_interval >= 8) {
+ if (!rot->flip_start) {
+ rot->flip_start = jiffies;
+ } else if ((long long)jiffies - rot->flip_start > CONFIG_MARS_SYNC_FLIP_INTERVAL * HZ) {
+ do_start = false;
+ rot->flip_start = jiffies + mars_sync_flip_interval * HZ;
+ }
+ } else {
+ rot->flip_start = 0;
+ }
+
+ /* Start copy
+ */
+#ifdef CONFIG_MARS_SEPARATE_PORTS
+ src = path_make("data-%s@%s:%d", peer, peer, xio_net_default_port + 2);
+#else
+ src = path_make("data-%s@%s", peer, peer);
+#endif
+ dst = path_make("data-%s", my_id());
+ copy_path = backskip_replace(dent->d_path, '/', true, "/copy-");
+
+ /* check whether connection is allowed */
+ switch_path = path_make("%s/todo-%s/sync", dent->d_parent->d_path, my_id());
+
+ status = -ENOMEM;
+ if (unlikely(!src || !dst || !copy_path || !switch_path))
+ goto done;
+
+ XIO_DBG("initial sync '%s' => '%s' do_start = %d\n", src, dst, do_start);
+
+ rot->wants_sync = (do_start != 0);
+ if (rot->wants_sync && global_sync_limit > 0) {
+ do_start = rot->gets_sync;
+ if (!rot->gets_sync) {
+ XIO_INF_TO(rot->log_say,
+ "won't start sync because of parallelism limit %d\n",
+ global_sync_limit);
+ }
+ }
+
+ {
+ const char *argv[2] = { src, dst };
+
+ status = __make_copy(global,
+ dent,
+ do_start ? switch_path : "",
+ copy_path,
+ dent->d_parent->d_path,
+ argv,
+ find_key(rot->msgs,
+ "inf-sync"),
+ start_pos,
+ end_pos,
+ mars_fast_fullsync > 0,
+ true,
+ &copy);
+ if (copy) {
+ copy->kill_ptr = (void **)&rot->sync_brick;
+ copy->copy_limiter = &rot->sync_limiter;
+ }
+ rot->sync_brick = copy;
+ }
+
+ /* Update syncstatus symlink
+ */
+ if (status >= 0 && copy &&
+ ((copy->power.button && copy->power.on_led) ||
+ (copy->copy_last == copy->copy_end && copy->copy_end > 0))) {
+ status = _update_syncstatus(rot, copy, peer);
+ }
+
+done:
+ XIO_DBG("status = %d\n", status);
+ brick_string_free(tmp);
+ brick_string_free(src);
+ brick_string_free(dst);
+ brick_string_free(copy_path);
+ brick_string_free(switch_path);
+ return status;
+}
+
+static
+bool remember_peer(struct mars_rotate *rot, struct mars_peerinfo *peer)
+{
+ if (!peer || !rot || rot->preferred_peer)
+ return false;
+
+ if ((long long)peer->last_remote_jiffies + mars_scan_interval * HZ * 2 < (long long)jiffies)
+ return false;
+
+ rot->preferred_peer = brick_strdup(peer->peer);
+ return true;
+}
+
+static
+int make_connect(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot;
+ struct mars_peerinfo *peer;
+ char *names;
+ char *this_name;
+ char *tmp;
+
+ if (unlikely(!dent->d_parent || !dent->link_val))
+ goto done;
+ rot = dent->d_parent->d_private;
+ if (unlikely(!rot))
+ goto done;
+
+ names = brick_strdup(dent->link_val);
+ for (tmp = this_name = names; *tmp; tmp++) {
+ if (*tmp == MARS_DELIM) {
+ *tmp = '\0';
+ peer = find_peer(this_name);
+ if (remember_peer(rot, peer))
+ goto found;
+ this_name = tmp + 1;
+ }
+ }
+ peer = find_peer(this_name);
+ remember_peer(rot, peer);
+
+found:
+ brick_string_free(names);
+done:
+ return 0;
+}
+
+static int prepare_delete(void *buf, struct mars_dent *dent)
+{
+ struct kstat stat;
+ struct kstat *to_delete = NULL;
+ struct mars_global *global = buf;
+ struct mars_dent *target;
+ struct mars_dent *response;
+ const char *marker_path = NULL;
+ const char *response_path = NULL;
+ struct xio_brick *brick;
+ int max_serial = 0;
+ int status;
+
+ if (!global || !dent || !dent->link_val || !dent->d_path)
+ goto err;
+
+ /* create a marker which prevents concurrent updates from remote hosts */
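+ /* (illustrative: for a deletion target "/mars/resA/foo" the marker
+ * presumably becomes "/mars/resA/.deleted-foo"; its mtime mirrors the
+ * deletion link so updates can be ordered across hosts)
+ */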
+ marker_path = backskip_replace(dent->link_val, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &stat, true) < 0 ||
+ timespec_compare(&dent->stat_val.mtime, &stat.mtime) > 0) {
+ XIO_DBG("creating / updating marker '%s' mtime=%lu.%09lu\n",
+ marker_path, dent->stat_val.mtime.tv_sec, dent->stat_val.mtime.tv_nsec);
+ mars_symlink("1", marker_path, &dent->stat_val.mtime, 0);
+ }
+
+ brick = mars_find_brick(global, NULL, dent->link_val);
+ if (brick &&
+ unlikely((brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected) ||
+ (brick->type == (void *)&if_brick_type && !brick->power.off_led))) {
+ XIO_WRN("target '%s' cannot be deleted, its brick '%s' in use\n",
+ dent->link_val,
+ SAFE_STR(brick->brick_name));
+ goto done;
+ }
+
+ status = 0;
+ target = mars_find_dent(global, dent->link_val);
+ if (target) {
+ if (timespec_compare(&target->stat_val.mtime, &dent->stat_val.mtime) > 0) {
+ XIO_WRN("target '%s' has newer timestamp than deletion link, ignoring\n", dent->link_val);
+ status = -EAGAIN;
+ goto ok;
+ }
+ if (target->d_child_count) {
+ XIO_WRN("target '%s' has %d children, cannot kill\n", dent->link_val, target->d_child_count);
+ goto done;
+ }
+ target->d_killme = true;
+ XIO_DBG("target '%s' marked for removal\n", dent->link_val);
+ to_delete = &target->stat_val;
+ } else if (mars_stat(dent->link_val, &stat, true) >= 0) {
+ if (timespec_compare(&stat.mtime, &dent->stat_val.mtime) > 0) {
+ XIO_WRN("target '%s' has newer timestamp than deletion link, ignoring\n", dent->link_val);
+ status = -EAGAIN;
+ goto ok;
+ }
+ to_delete = &stat;
+ } else {
+ status = -EAGAIN;
+ XIO_DBG("target '%s' does no longer exist\n", dent->link_val);
+ }
+ if (to_delete) {
+ if (S_ISDIR(to_delete->mode)) {
+ status = mars_rmdir(dent->link_val);
+ XIO_DBG("rmdir '%s', status = %d\n", dent->link_val, status);
+ } else {
+ status = mars_unlink(dent->link_val);
+ XIO_DBG("unlink '%s', status = %d\n", dent->link_val, status);
+ }
+ }
+
+ok:
+ if (status < 0) {
+ XIO_DBG("deletion '%s' to target '%s' is accomplished\n",
+ dent->d_path, dent->link_val);
+ if (dent->d_serial <= global->deleted_border) {
+ XIO_DBG("removing deletion symlink '%s'\n", dent->d_path);
+ dent->d_killme = true;
+ mars_unlink(dent->d_path);
+ XIO_DBG("removing marker '%s'\n", marker_path);
+ mars_unlink(marker_path);
+ }
+ }
+
+done:
+ /* tell the world that we have seen this deletion... (even when not yet accomplished) */
+ response_path = path_make("/mars/todo-global/deleted-%s", my_id());
+ response = mars_find_dent(global, response_path);
+ if (response && response->link_val) {
+ int status = kstrtoint(response->link_val, 0, &max_serial);
+
+ (void)status; /* leave untouched in case of errors */
+ }
+ if (dent->d_serial > max_serial) {
+ char response_val[16];
+
+ max_serial = dent->d_serial;
+ global->deleted_my_border = max_serial;
+ snprintf(response_val, sizeof(response_val), "%09d", max_serial);
+ mars_symlink(response_val, response_path, NULL, 0);
+ }
+
+err:
+ brick_string_free(marker_path);
+ brick_string_free(response_path);
+ return 0;
+}
+
+static int check_deleted(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ int serial = 0;
+ int status;
+
+ if (!global || !dent || !dent->link_val)
+ goto done;
+
+ status = kstrtoint(dent->link_val, 0, &serial);
+ if (unlikely(status || serial <= 0)) {
+ XIO_WRN("cannot parse symlink '%s' -> '%s'\n", dent->d_path, dent->link_val);
+ goto done;
+ }
+
+ if (!strcmp(dent->d_rest, my_id()))
+ global->deleted_my_border = serial;
+
+ /* Compute the minimum of the deletion progress among
+ * the resource members.
+ */
+ if (serial < global->deleted_min || !global->deleted_min)
+ global->deleted_min = serial;
+
+done:
+ return 0;
+}
+
+static
+int make_res(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot = dent->d_private;
+
+ if (!rot) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ rot->has_symlinks = false;
+
+done:
+ return 0;
+}
+
+static
+int kill_res(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot = dent->d_private;
+
+ if (unlikely(!rot || !rot->parent_path)) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ show_vals(rot->msgs, rot->parent_path, "");
+
+ if (unlikely(!rot->global || !rot->global->global_power.button)) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+ if (rot->has_symlinks) {
+ XIO_DBG("symlinks were present, nothing to kill.\n");
+ goto done;
+ }
+
+ /* this code is only executed in case of forced deletion of symlinks */
+ if (rot->if_brick || rot->sync_brick || rot->fetch_brick || rot->trans_brick) {
+ rot->res_shutdown = true;
+ XIO_WRN("resource '%s' has no symlinks, shutting down.\n", rot->parent_path);
+ }
+ if (rot->if_brick) {
+ if (atomic_read(&rot->if_brick->open_count) > 0) {
+ XIO_ERR("cannot destroy resource '%s': device is is use!\n", rot->parent_path);
+ goto done;
+ }
+ rot->if_brick->killme = true;
+ if (!rot->if_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->if_brick, false, false);
+
+ XIO_INF("switching off resource '%s', device status = %d\n", rot->parent_path, status);
+ } else {
+ xio_kill_brick((void *)rot->if_brick);
+ rot->if_brick = NULL;
+ }
+ }
+ if (rot->sync_brick) {
+ rot->sync_brick->killme = true;
+ if (!rot->sync_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->sync_brick, false, false);
+
+ XIO_INF("switching off resource '%s', sync status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (rot->fetch_brick) {
+ rot->fetch_brick->killme = true;
+ if (!rot->fetch_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->fetch_brick, false, false);
+
+ XIO_INF("switching off resource '%s', fetch status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (rot->trans_brick) {
+ struct trans_logger_output *output = rot->trans_brick->outputs[0];
+
+ if (!output || output->nr_connected) {
+ XIO_ERR("cannot destroy resource '%s': trans_logger is is use!\n", rot->parent_path);
+ goto done;
+ }
+ rot->trans_brick->killme = true;
+ if (!rot->trans_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->trans_brick, false, false);
+
+ XIO_INF("switching off resource '%s', logger status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (!rot->if_brick && !rot->sync_brick && !rot->fetch_brick && !rot->trans_brick)
+ rot->res_shutdown = false;
+
+done:
+ return 0;
+}
+
+static
+int make_defaults(void *buf, struct mars_dent *dent)
+{
+ if (!dent->link_val)
+ goto done;
+
+ XIO_DBG("name = '%s' value = '%s'\n", dent->d_name, dent->link_val);
+
+ if (!strcmp(dent->d_name, "sync-limit")) {
+ int status = kstrtoint(dent->link_val, 0, &global_sync_limit);
+
+ (void)status; /* leave untouched in case of errors */
+ } else if (!strcmp(dent->d_name, "sync-pref-list")) {
+ const char *start;
+ struct list_head *tmp;
+ int len;
+ int want_count = 0;
+ int get_count = 0;
+
+ for (tmp = rot_anchor.next; tmp != &rot_anchor; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync)
+ want_count++;
+ else
+ rot->gets_sync = false;
+ if (rot->sync_brick && rot->sync_brick->power.on_led)
+ get_count++;
+ }
+ global_sync_want = want_count;
+ global_sync_nr = get_count;
+
+ /* prefer mentioned resources in the right order */
+ for (start = dent->link_val; *start && get_count < global_sync_limit; start += len) {
+ len = 1;
+ while (start[len] && start[len] != ',')
+ len++;
+ for (tmp = rot_anchor.next; tmp != &rot_anchor; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync && rot->parent_rest && !strncmp(start, rot->parent_rest, len)) {
+ rot->gets_sync = true;
+ get_count++;
+ XIO_DBG("new get_count = %d res = '%s' wants_sync = %d gets_sync = %d\n",
+ get_count, rot->parent_rest, rot->wants_sync, rot->gets_sync);
+ break;
+ }
+ }
+ if (start[len])
+ len++;
+ }
+ /* fill up with unmentioned resources */
+ for (tmp = rot_anchor.next; tmp != &rot_anchor && get_count < global_sync_limit; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync && !rot->gets_sync) {
+ rot->gets_sync = true;
+ get_count++;
+ }
+ XIO_DBG("new get_count = %d res = '%s' wants_sync = %d gets_sync = %d\n",
+ get_count, rot->parent_rest, rot->wants_sync, rot->gets_sync);
+ }
+ XIO_DBG("final want_count = %d get_count = %d\n", want_count, get_count);
+ } else {
+ XIO_DBG("unimplemented default '%s'\n", dent->d_name);
+ }
+done:
+ return 0;
+}
+
+/*********************************************************************/
+
+/* Please keep the order the same as in the enum.
+ */
+static const struct light_class light_classes[] = {
+ /* Placeholder for root node /mars/
+ */
+ [CL_ROOT] = {
+ },
+
+ /* UUID, identifying the whole cluster.
+ */
+ [CL_UUID] = {
+ .cl_name = "uuid",
+ .cl_len = 4,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+
+ /* Subdirectory for global userspace items...
+ */
+ [CL_GLOBAL_USERSPACE] = {
+ .cl_name = "userspace",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ [CL_GLOBAL_USERSPACE_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_GLOBAL_USERSPACE,
+ },
+
+ /* Subdirectory for defaults...
+ */
+ [CL_DEFAULTS0] = {
+ .cl_name = "defaults",
+ .cl_len = 8,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ [CL_DEFAULTS] = {
+ .cl_name = "defaults-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = true,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_DEFAULTS_ITEMS0] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_DEFAULTS0,
+ },
+ [CL_DEFAULTS_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_DEFAULTS,
+ .cl_forward = make_defaults,
+ },
+
+ /* Subdirectory for global controlling items...
+ */
+ [CL_GLOBAL_TODO] = {
+ .cl_name = "todo-global",
+ .cl_len = 11,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_GLOBAL_TODO_DELETE] = {
+ .cl_name = "delete-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_serial = true,
+ .cl_hostcontext = false, /* ignore context, although present */
+ .cl_father = CL_GLOBAL_TODO,
+ .cl_prepare = prepare_delete,
+ },
+ [CL_GLOBAL_TODO_DELETED] = {
+ .cl_name = "deleted-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_father = CL_GLOBAL_TODO,
+ .cl_prepare = check_deleted,
+ },
+
+ /* Directory containing the addresses of all peers
+ */
+ [CL_IPS] = {
+ .cl_name = "ips",
+ .cl_len = 3,
+ .cl_type = 'd',
+ .cl_father = CL_ROOT,
+ },
+ /* Anyone participating in a MARS cluster must
+ * be named here (symlink pointing to the IP address).
+ * We have no DNS in kernel space.
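+ * Illustrative example: /mars/ips/ip-myhost -> 192.168.1.1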
+ */
+ [CL_PEERS] = {
+ .cl_name = "ip-",
+ .cl_len = 3,
+ .cl_type = 'l',
+ .cl_father = CL_IPS,
+ .cl_forward = make_scan,
+ .cl_backward = kill_scan,
+ },
+ /* Subdirectory for actual state
+ */
+ [CL_GBL_ACTUAL] = {
+ .cl_name = "actual-",
+ .cl_len = 7,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_GBL_ACTUAL_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_GBL_ACTUAL,
+ },
+ /* Indicate aliveness of all cluster participants
+ * by the timestamp of this link.
+ */
+ [CL_ALIVE] = {
+ .cl_name = "alive-",
+ .cl_len = 6,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ [CL_TIME] = {
+ .cl_name = "time-",
+ .cl_len = 5,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Show version indication for symlink tree.
+ */
+ [CL_TREE] = {
+ .cl_name = "tree-",
+ .cl_len = 5,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Indicate whether the filesystem is full
+ */
+ [CL_EMERGENCY] = {
+ .cl_name = "emergency-",
+ .cl_len = 10,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Ditto, as a percentage
+ */
+ [CL_REST_SPACE] = {
+ .cl_name = "rest-space-",
+ .cl_len = 11,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+
+ /* Directory containing all items of a resource
+ */
+ [CL_RESOURCE] = {
+ .cl_name = "resource-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_use_channel = true,
+ .cl_father = CL_ROOT,
+ .cl_forward = make_res,
+ .cl_backward = kill_res,
+ },
+
+ /* Subdirectory for resource-specific userspace items...
+ */
+ [CL_RESOURCE_USERSPACE] = {
+ .cl_name = "userspace",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ [CL_RESOURCE_USERSPACE_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RESOURCE_USERSPACE,
+ },
+
+ /* Subdirectory for defaults...
+ */
+ [CL_RES_DEFAULTS0] = {
+ .cl_name = "defaults",
+ .cl_len = 8,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ [CL_RES_DEFAULTS] = {
+ .cl_name = "defaults-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_RES_DEFAULTS_ITEMS0] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RES_DEFAULTS0,
+ },
+ [CL_RES_DEFAULTS_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RES_DEFAULTS,
+ },
+
+ /* Subdirectory for controlling items...
+ */
+ [CL_TODO] = {
+ .cl_name = "todo-",
+ .cl_len = 5,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_TODO_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_TODO,
+ },
+
+ /* Subdirectory for actual state
+ */
+ [CL_ACTUAL] = {
+ .cl_name = "actual-",
+ .cl_len = 7,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_ACTUAL_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_ACTUAL,
+ },
+
+ /* File or symlink to the real device / real (sparse) file.
+ * When the hostcontext part is missing, the corresponding peer
+ * will not participate in that resource.
+ */
+ [CL_DATA] = {
+ .cl_name = "data-",
+ .cl_len = 5,
+ .cl_type = 'F',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_bio,
+ .cl_backward = kill_any,
+ },
+ /* Symlink indicating the (common) size of the resource
+ */
+ [CL_SIZE] = {
+ .cl_name = "size",
+ .cl_len = 4,
+ .cl_type = 'l',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_log_init,
+ .cl_backward = kill_any,
+ },
+ /* Ditto for each individual size
+ */
+ [CL_ACTSIZE] = {
+ .cl_name = "actsize-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Symlink pointing to the name of the primary node
+ */
+ [CL_PRIMARY] = {
+ .cl_name = "primary",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_primary,
+ .cl_backward = NULL,
+ },
+ /* Symlink for connection preferences
+ */
+ [CL_CONNECT] = {
+ .cl_name = "connect-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_connect,
+ },
+ /* Informational symlink indicating the current
+ * status / start / pos / end of logfile transfers.
+ */
+ [CL_TRANSFER] = {
+ .cl_name = "transferstatus-",
+ .cl_len = 15,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Symlink indicating the current status / end
+ * of initial data sync.
+ */
+ [CL_SYNC] = {
+ .cl_name = "syncstatus-",
+ .cl_len = 11,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_sync,
+ .cl_backward = kill_any,
+ },
+ /* Informational symlink for verify status
+ * of initial data sync.
+ */
+ [CL_VERIF] = {
+ .cl_name = "verifystatus-",
+ .cl_len = 13,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Informational symlink: after sync has finished,
+ * keep a copy of the replay symlink from the primary.
+ * By comparing our own replay symlink against this,
+ * we can determine whether we are consistent.
+ */
+ [CL_SYNCPOS] = {
+ .cl_name = "syncpos-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Passive symlink indicating the split-brain crypto hash
+ */
+ [CL_VERSION] = {
+ .cl_name = "version-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_serial = true,
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Logfiles for transaction logger
+ */
+ [CL_LOG] = {
+ .cl_name = "log-",
+ .cl_len = 4,
+ .cl_type = 'F',
+ .cl_serial = true,
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_log_step,
+ .cl_backward = kill_any,
+ },
+ /* Symlink indicating the last state of
+ * transaction log replay.
+ */
+ [CL_REPLAYSTATUS] = {
+ .cl_name = "replay-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_replay,
+ .cl_backward = kill_any,
+ },
+
+ /* Name of the device appearing at the primary
+ */
+ [CL_DEVICE] = {
+ .cl_name = "device-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_dev,
+ .cl_backward = kill_dev,
+ },
+ {}
+};
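+
+/* Illustrative example of a resulting symlink tree on host "A"
+ * with a single resource "r0" (not exhaustive; the names follow
+ * the cl_name prefixes above, exact formats are illustrative):
+ *
+ * /mars/uuid
+ * /mars/ips/ip-A
+ * /mars/alive-A
+ * /mars/resource-r0/data-A
+ * /mars/resource-r0/primary
+ * /mars/resource-r0/log-000000001-A
+ * /mars/resource-r0/replay-A
+ * /mars/resource-r0/device-A
+ */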
+
+/* Helper routine to pre-determine the relevance of a name from the filesystem.
+ */
+static
+int _checker(struct mars_dent *parent,
+ const char *_name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel,
+ bool external_mode)
+{
+ int class;
+ int status = -2;
+
+#ifdef XIO_DEBUGGING
+ const char *name = brick_strndup(_name, namlen);
+
+#else
+ const char *name = _name;
+
+#endif
+
+ /* XIO_DBG("trying '%s' '%s'\n", path, name); */
+ for (class = CL_ROOT + 1; ; class++) {
+ const struct light_class *test = &light_classes[class];
+ int len = test->cl_len;
+
+ if (!test->cl_name) { /* end of table */
+ break;
+ }
+
+ /* XIO_DBG(" testing class '%s'\n", test->cl_name); */
+
+#ifdef XIO_DEBUGGING
+ if (len != strlen(test->cl_name)) {
+ XIO_ERR("internal table '%s' mismatch: %d != %d\n",
+ test->cl_name,
+ len,
+ (int)strlen(test->cl_name));
+ len = strlen(test->cl_name);
+ }
+#endif
+
+ if (test->cl_father &&
+ (!parent || parent->d_class != test->cl_father)) {
+ continue;
+ }
+
+ if (len > 0 &&
+ (namlen < len || memcmp(name, test->cl_name, len))) {
+ continue;
+ }
+
+ /* XIO_DBG("path '%s/%s' matches class %d '%s'\n", path, name, class, test->cl_name); */
+
+ /* check special contexts */
+ if (test->cl_serial) {
+ int plus = 0;
+ int count;
+
+ count = sscanf(name+len, "%d%n", serial, &plus);
+ if (count < 1) {
+ /* XIO_DBG("'%s' serial number mismatch at '%s'\n", name, name+len); */
+ continue;
+ }
+ /* XIO_DBG("'%s' serial number = %d\n", name, *serial); */
+ len += plus;
+ if (name[len] == '-')
+ len++;
+ }
+ if (prefix)
+ *prefix = len;
+ if (test->cl_hostcontext && !external_mode) {
+ if (memcmp(name+len, my_id(), namlen-len)) {
+ /* XIO_DBG("context mismatch '%s' at '%s'\n", name, name+len); */
+ continue;
+ }
+ }
+
+ /* all ok */
+ status = class;
+ *use_channel = test->cl_use_channel;
+ }
+
+#ifdef XIO_DEBUGGING
+ brick_string_free(name);
+#endif
+ return status;
+}
+
+static
+int light_checker(struct mars_dent *parent,
+ const char *_name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel)
+{
+ return _checker(parent, _name, namlen, d_type, prefix, serial, use_channel, false);
+}
+
+int external_checker(struct mars_dent *parent,
+ const char *_name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel)
+{
+ return _checker(parent, _name, namlen, d_type, prefix, serial, use_channel, true);
+}
+
+/* Do some syntactic checks, then delegate work to the real worker functions
+ * from the light_classes[] table.
+ */
+static int light_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+ light_worker_fn worker;
+ int class = dent->d_class;
+
+ if (class < 0 || class >= ARRAY_SIZE(light_classes)) {
+ XIO_ERR("bad internal class %d of '%s'\n", class, dent->d_path);
+ return -EINVAL;
+ }
+ switch (light_classes[class].cl_type) {
+ case 'd':
+ if (!S_ISDIR(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a directory, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'f':
+ if (!S_ISREG(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a regular file, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'F':
+ if (!S_ISREG(dent->stat_val.mode) && !S_ISLNK(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a regular file or a symlink, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'l':
+ if (!S_ISLNK(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a symlink, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ }
+ if (likely(class > CL_ROOT)) {
+ int father = light_classes[class].cl_father;
+
+ if (father == CL_ROOT) {
+ if (unlikely(dent->d_parent)) {
+ XIO_ERR("'%s' class %d is not at the root of the hierarchy\n", dent->d_path, class);
+ return -EINVAL;
+ }
+ } else if (unlikely(!dent->d_parent || dent->d_parent->d_class != father)) {
+ XIO_ERR("last component '%s' from '%s' is at the wrong position in the hierarchy (class = %d, parent_class = %d, parent = '%s')\n",
+ dent->d_name,
+ dent->d_path,
+ father,
+ dent->d_parent ? dent->d_parent->d_class : -9999,
+ dent->d_parent ? dent->d_parent->d_path : "");
+ return -EINVAL;
+ }
+ }
+ if (prepare)
+ worker = light_classes[class].cl_prepare;
+ else if (direction)
+ worker = light_classes[class].cl_backward;
+ else
+ worker = light_classes[class].cl_forward;
+ if (worker) {
+ int status;
+
+ if (!direction)
+ XIO_DBG("--- start working forward on '%s' rest='%s'\n",
+ dent->d_path,
+ dent->d_rest);
+ status = worker(global, (void *)dent);
+ XIO_DBG("--- done, worked %s on '%s', status = %d\n",
+ direction ? "backward" : "forward",
+ dent->d_path,
+ status);
+ return status;
+ }
+ return 0;
+}
+
+static struct mars_global _global = {
+ .dent_anchor = LIST_HEAD_INIT(_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .dent_mutex = __RWSEM_INITIALIZER(_global.dent_mutex),
+ .brick_mutex = __RWSEM_INITIALIZER(_global.brick_mutex),
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(_global.main_event),
+};
+
+static int light_thread(void *data)
+{
+ long long last_rollover = jiffies;
+ char *id = my_id();
+ int status = 0;
+
+ mars_global = &_global;
+
+ if (!id || strlen(id) < 2) {
+ XIO_ERR("invalid hostname\n");
+ status = -EFAULT;
+ goto done;
+ }
+
+ XIO_INF("-------- starting as host '%s' ----------\n", id);
+
+ while (_global.global_power.button || !list_empty(&_global.brick_anchor)) {
+ int status;
+
+ XIO_DBG("-------- NEW ROUND ---------\n");
+
+ if (mars_mem_percent < 0)
+ mars_mem_percent = 0;
+ if (mars_mem_percent > 70)
+ mars_mem_percent = 70;
+ brick_global_memlimit = (long long)brick_global_memavail * mars_mem_percent / 100;
+
+ brick_msleep(100);
+
+ if (brick_thread_should_stop()) {
+ _global.global_power.button = false;
+ xio_net_is_alive = false;
+ }
+
+ _make_alive();
+
+ compute_emergency_mode();
+
+ XIO_DBG("-------- start worker ---------\n");
+ _global.deleted_min = 0;
+ status = mars_dent_work(&_global,
+ "/mars",
+ sizeof(struct mars_dent),
+ light_checker,
+ light_worker,
+ &_global,
+ 3);
+ _global.deleted_border = _global.deleted_min;
+ XIO_DBG("-------- worker deleted_min = %d status = %d\n", _global.deleted_min, status);
+
+ if (!_global.global_power.button) {
+ status = xio_kill_brick_when_possible(&_global,
+ &_global.brick_anchor,
+ false,
+ (void *)&copy_brick_type,
+ true);
+ XIO_DBG("kill copy bricks (when possible) = %d\n", status);
+ }
+
+ status = xio_kill_brick_when_possible(&_global, &_global.brick_anchor, false, NULL, false);
+ XIO_DBG("kill main bricks (when possible) = %d\n", status);
+
+ status = xio_kill_brick_when_possible(&_global,
+ &_global.brick_anchor,
+ false,
+ (void *)&client_brick_type,
+ true);
+ XIO_DBG("kill client bricks (when possible) = %d\n", status);
+ status = xio_kill_brick_when_possible(&_global,
+ &_global.brick_anchor,
+ false,
+ (void *)&aio_brick_type,
+ true);
+ XIO_DBG("kill aio bricks (when possible) = %d\n", status);
+ status = xio_kill_brick_when_possible(&_global,
+ &_global.brick_anchor,
+ false,
+ (void *)&bio_brick_type,
+ true);
+ XIO_DBG("kill bio bricks (when possible) = %d\n", status);
+
+ if (last_rollover + mars_rollover_interval * HZ <= (long long)jiffies) {
+ last_rollover = jiffies;
+ rollover_all();
+ }
+
+ _show_status_all(&_global);
+ show_vals(gbl_pairs, "/mars", "");
+ show_statistics(&_global, "main");
+
+ XIO_DBG("ban_count = %d ban_renew_count = %d\n",
+ xio_global_ban.ban_count,
+ xio_global_ban.ban_renew_count);
+
+ brick_msleep(500);
+
+ wait_event_interruptible_timeout(_global.main_event, _global.main_trigger, mars_scan_interval * HZ);
+
+ _global.main_trigger = false;
+ }
+
+done:
+ XIO_INF("-------- cleaning up ----------\n");
+ remote_trigger();
+ brick_msleep(1000);
+
+ xio_free_dent_all(&_global, &_global.dent_anchor);
+ xio_kill_brick_all(&_global, &_global.brick_anchor, false);
+
+ _show_status_all(&_global);
+ show_vals(gbl_pairs, "/mars", "");
+ show_statistics(&_global, "main");
+
+ mars_global = NULL;
+
+ XIO_INF("-------- done status = %d ----------\n", status);
+ /* cleanup_mm(); */
+ return status;
+}
+
+static
+char *_xio_info(void)
+{
+ int max = PAGE_SIZE - 64;
+ char *txt;
+ struct list_head *tmp;
+ int dent_count = 0;
+ int brick_count = 0;
+ int pos = 0;
+
+ if (unlikely(!mars_global))
+ return NULL;
+
+ txt = brick_string_alloc(max);
+
+ txt[--max] = '\0'; /* safeguard */
+
+ down_read(&mars_global->brick_mutex);
+ for (tmp = mars_global->brick_anchor.next; tmp != &mars_global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ brick_count++;
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ pos += scnprintf(
+ txt + pos, max - pos,
+ "brick button=%d off=%d on=%d path='%s'\n",
+ test->power.button,
+ test->power.off_led,
+ test->power.on_led,
+ test->brick_path
+ );
+ }
+ up_read(&mars_global->brick_mutex);
+
+ pos += scnprintf(
+ txt + pos, max - pos,
+ "SUMMARY: brick_count=%d dent_count=%d\n",
+ brick_count,
+ dent_count
+ );
+
+ return txt;
+}
+
+#define INIT_MAX 32
+static char *exit_names[INIT_MAX];
+static void (*exit_fn[INIT_MAX])(void);
+static int exit_fn_nr;
+
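+/* DO_INIT(name) calls init_##name() and, on success, records
+ * exit_##name() in exit_fn[] so that exit_light() can later unwind
+ * all successfully started modules in reverse order.
+ */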
+#define DO_INIT(name) \
+ do { \
+ XIO_DBG("=== starting module " #name "...\n"); \
+ status = init_##name(); \
+ if (status < 0) \
+ goto done; \
+ exit_names[exit_fn_nr] = #name; \
+ exit_fn[exit_fn_nr++] = exit_##name; \
+ } while (0)
+
+void (*_remote_trigger)(void);
+EXPORT_SYMBOL_GPL(_remote_trigger);
+
+static void exit_light(void)
+{
+ XIO_DBG("====================== stopping everything...\n");
+ /* TODO: make this thread-safe. */
+ if (main_thread) {
+ XIO_DBG("=== stopping light thread...\n");
+ local_trigger();
+ XIO_INF("stopping main thread...\n");
+ brick_thread_stop(main_thread);
+ }
+
+ xio_info = NULL;
+ _remote_trigger = NULL;
+
+ while (exit_fn_nr > 0) {
+ XIO_DBG("=== stopping module %s ...\n", exit_names[exit_fn_nr - 1]);
+ exit_fn[--exit_fn_nr]();
+ }
+ XIO_DBG("====================== stopped everything.\n");
+ exit_say();
+ /* checkpatch.pl: dev_info() and friends cannot be used
+ * because MARS handles many devices dynamically at runtime.
+ * At secondary nodes, no device may be present at all.
+ */
+ printk(KERN_INFO "stopped MARS\n");
+ /* Workaround for nasty race: some kernel threads have not yet
+ * really finished even _after_ kthread_stop() and may execute
+ * some code which will disappear right after return from this
+ * function.
+ * A correct solution would probably need the help of the kernel
+ * scheduler.
+ */
+ brick_msleep(1000);
+}
+
+static int __init init_light(void)
+{
+ int new_limit = 4096;
+ int status = 0;
+
+ /* bump the min_free limit */
+ if (min_free_kbytes < new_limit)
+ min_free_kbytes = new_limit;
+
+ /* checkpatch.pl: dev_info() and friends cannot be used
+ * (see also the above sister comment)
+ */
+ printk(KERN_INFO "loading MARS, tree_version=%s\n", SYMLINK_TREE_VERSION);
+
+ init_say(); /* this must come first */
+
+ /* be careful: order is important!
+ */
+ DO_INIT(brick_mem);
+ DO_INIT(brick);
+ DO_INIT(xio);
+ DO_INIT(xio_mapfree);
+ DO_INIT(xio_net);
+ DO_INIT(xio_client);
+ DO_INIT(xio_aio);
+ DO_INIT(xio_bio);
+ DO_INIT(xio_server);
+ DO_INIT(xio_copy);
+ DO_INIT(log_format);
+ DO_INIT(xio_trans_logger);
+ DO_INIT(xio_if);
+
+ DO_INIT(sy);
+ DO_INIT(sy_net);
+ DO_INIT(xio_proc);
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ brick_pre_reserve[5] = 64;
+ brick_mem_reserve();
+#endif
+
+ status = compute_emergency_mode();
+ if (unlikely(status < 0)) {
+ XIO_ERR("Sorry, your /mars/ filesystem is too small!\n");
+ goto done;
+ }
+
+ main_thread = brick_thread_create(light_thread, NULL, "mars_light");
+ if (unlikely(!main_thread)) {
+ status = -ENOENT;
+ goto done;
+ }
+
+done:
+ if (status < 0) {
+ XIO_ERR("module init failed with status = %d, exiting.\n", status);
+ exit_light();
+ }
+ _remote_trigger = __remote_trigger;
+ xio_info = _xio_info;
+ return status;
+}
+
+/* force module loading */
+const void *dummy1 = &client_brick_type;
+const void *dummy2 = &server_brick_type;
+
+MODULE_DESCRIPTION("MARS Light");
+MODULE_AUTHOR("Thomas Schoebel-Theuer <[email protected]>");
+MODULE_VERSION(SYMLINK_TREE_VERSION);
+MODULE_LICENSE("GPL");
+
+#ifndef CONFIG_MARS_DEBUG
+MODULE_INFO(debug, "production");
+#else
+MODULE_INFO(debug, "DEBUG");
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM
+MODULE_INFO(io, "BAD_PERFORMANCE");
+#endif
+#ifdef CONFIG_MARS_DEBUG_ORDER0
+MODULE_INFO(memory, "EVIL_PERFORMANCE");
+#endif
+
+module_init(init_light);
+module_exit(exit_light);
--
2.0.0

2014-07-01 21:54:04

by Thomas Schoebel-Theuer

Subject: [PATCH 26/50] mars: add new file drivers/block/mars/xio_bricks/lib_log.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/lib_log.c | 500 ++++++++++++++++++++++++++++++++
1 file changed, 500 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/lib_log.c

diff --git a/drivers/block/mars/xio_bricks/lib_log.c b/drivers/block/mars/xio_bricks/lib_log.c
new file mode 100644
index 0000000..e068551
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/lib_log.c
@@ -0,0 +1,500 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bio.h>
+
+#include <linux/lib_log.h>
+
+atomic_t global_aio_flying = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(global_aio_flying);
+
+void exit_logst(struct log_status *logst)
+{
+ int count;
+
+ log_flush(logst);
+
+ /* TODO: replace by event */
+ count = 0;
+ while (atomic_read(&logst->aio_flying) > 0) {
+ if (!count++)
+ XIO_DBG("waiting for IO terminating...");
+ brick_msleep(500);
+ }
+ if (logst->read_aio) {
+ XIO_DBG("putting read_aio\n");
+ GENERIC_INPUT_CALL(logst->input, aio_put, logst->read_aio);
+ logst->read_aio = NULL;
+ }
+ if (logst->log_aio) {
+ XIO_DBG("putting log_aio\n");
+ GENERIC_INPUT_CALL(logst->input, aio_put, logst->log_aio);
+ logst->log_aio = NULL;
+ }
+}
+EXPORT_SYMBOL_GPL(exit_logst);
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos)
+{
+ exit_logst(logst);
+
+ memset(logst, 0, sizeof(struct log_status));
+
+ logst->input = input;
+ logst->brick = input->brick;
+ logst->start_pos = start_pos;
+ logst->log_pos = start_pos;
+ logst->end_pos = end_pos;
+ init_waitqueue_head(&logst->event);
+}
+EXPORT_SYMBOL_GPL(init_logst);
+
+#define XIO_LOG_CB_MAX 32
+
+struct log_cb_info {
+ struct aio_object *aio;
+ struct log_status *logst;
+ struct semaphore mutex;
+ atomic_t refcount;
+ int nr_cb;
+
+ /* blank line to keep checkpatch.pl happy - possibly a false positive */
+
+ void (*endios[XIO_LOG_CB_MAX])(void *private, int error);
+ void *privates[XIO_LOG_CB_MAX];
+};
+
+static
+void put_log_cb_info(struct log_cb_info *cb_info)
+{
+ if (atomic_dec_and_test(&cb_info->refcount))
+ brick_mem_free(cb_info);
+}
+
+static
+void _do_callbacks(struct log_cb_info *cb_info, int error)
+{
+ int i;
+
+ down(&cb_info->mutex);
+ for (i = 0; i < cb_info->nr_cb; i++) {
+ void (*end_fn)(void *private, int error);
+
+ end_fn = cb_info->endios[i];
+ cb_info->endios[i] = NULL;
+ if (end_fn)
+ end_fn(cb_info->privates[i], error);
+ }
+ up(&cb_info->mutex);
+}
+
+static
+void log_write_endio(struct generic_callback *cb)
+{
+ struct log_cb_info *cb_info = cb->cb_private;
+ struct log_status *logst;
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(cb_info, err);
+
+ logst = cb_info->logst;
+ CHECK_PTR(logst, done);
+
+ _do_callbacks(cb_info, cb->cb_error);
+
+done:
+ put_log_cb_info(cb_info);
+ atomic_dec(&logst->aio_flying);
+ atomic_dec(&global_aio_flying);
+ if (logst->signal_event)
+ wake_up_interruptible(logst->signal_event);
+
+ goto out_return;
+err:
+ XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+void log_flush(struct log_status *logst)
+{
+ struct aio_object *aio = logst->log_aio;
+ struct log_cb_info *cb_info;
+ int align_size;
+ int gap;
+
+ if (!aio || !logst->count)
+ goto out_return;
+ gap = 0;
+ align_size = (logst->align_size / PAGE_SIZE) * PAGE_SIZE;
+ if (align_size > 0) {
+ /* round up to next alignment border */
+ int align_offset = logst->offset & (align_size-1);
+
+ if (align_offset > 0) {
+ int restlen = aio->io_len - logst->offset;
+
+ gap = align_size - align_offset;
+ if (unlikely(gap > restlen))
+ gap = restlen;
+ }
+ }
+ if (gap > 0) {
+ /* don't leak information from kernel space */
+ memset(aio->io_data + logst->offset, 0, gap);
+ logst->offset += gap;
+ }
+ aio->io_len = logst->offset;
+ memcpy(&logst->log_pos_stamp, &logst->tmp_pos_stamp, sizeof(logst->log_pos_stamp));
+
+ cb_info = logst->private;
+ logst->private = NULL;
+ SETUP_CALLBACK(aio, log_write_endio, cb_info);
+ cb_info->logst = logst;
+ aio->io_rw = 1;
+
+ atomic_inc(&logst->aio_flying);
+ atomic_inc(&global_aio_flying);
+
+ GENERIC_INPUT_CALL(logst->input, aio_io, aio);
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ logst->count = 0;
+ logst->log_aio = NULL;
+
+ put_log_cb_info(cb_info);
+out_return:;
+}
+EXPORT_SYMBOL_GPL(log_flush);
+
+void *log_reserve(struct log_status *logst, struct log_header *lh)
+{
+ struct log_cb_info *cb_info = logst->private;
+ struct aio_object *aio;
+ void *data;
+
+ short total_len = lh->l_len + OVERHEAD;
+ int offset;
+ int status;
+
+ if (unlikely(lh->l_len <= 0 || lh->l_len > logst->max_size)) {
+ XIO_ERR("trying to write %d bytes, max allowed = %d\n", lh->l_len, logst->max_size);
+ goto err;
+ }
+
+ aio = logst->log_aio;
+ if ((aio && total_len > aio->io_len - logst->offset) ||
+ !cb_info || cb_info->nr_cb >= XIO_LOG_CB_MAX) {
+ log_flush(logst);
+ }
+
+ aio = logst->log_aio;
+ if (!aio) {
+ if (unlikely(logst->private)) {
+ XIO_ERR("oops\n");
+ brick_mem_free(logst->private);
+ }
+ logst->private = brick_zmem_alloc(sizeof(struct log_cb_info));
+ cb_info = logst->private;
+ sema_init(&cb_info->mutex, 1);
+ atomic_set(&cb_info->refcount, 2);
+
+ aio = xio_alloc_aio(logst->brick);
+ cb_info->aio = aio;
+
+ aio->io_pos = logst->log_pos;
+ aio->io_len = logst->chunk_size ? logst->chunk_size : total_len;
+ aio->io_may_write = WRITE;
+ aio->io_prio = logst->io_prio;
+
+ for (;;) {
+ status = GENERIC_INPUT_CALL(logst->input, aio_get, aio);
+ if (likely(status >= 0))
+ break;
+ if (status != -ENOMEM && status != -EAGAIN) {
+ XIO_ERR("aio_get() failed, status = %d\n", status);
+ goto err_free;
+ }
+ brick_msleep(100);
+ }
+
+ if (unlikely(aio->io_len < total_len)) {
+ XIO_ERR("io_len = %d total_len = %d\n", aio->io_len, total_len);
+ goto put;
+ }
+
+ logst->offset = 0;
+ logst->log_aio = aio;
+ }
+
+ offset = logst->offset;
+ data = aio->io_data;
+ DATA_PUT(data, offset, START_MAGIC);
+ DATA_PUT(data, offset, (char)FORMAT_VERSION);
+ logst->validflag_offset = offset;
+ DATA_PUT(data, offset, (char)0); /* valid_flag */
+ DATA_PUT(data, offset, total_len); /* start of next header */
+ DATA_PUT(data, offset, lh->l_stamp.tv_sec);
+ DATA_PUT(data, offset, lh->l_stamp.tv_nsec);
+ DATA_PUT(data, offset, lh->l_pos);
+ logst->reallen_offset = offset;
+ DATA_PUT(data, offset, lh->l_len);
+ DATA_PUT(data, offset, (short)0); /* spare */
+ DATA_PUT(data, offset, (int)0); /* spare */
+ DATA_PUT(data, offset, lh->l_code);
+ DATA_PUT(data, offset, (short)0); /* spare */
+
+ /* remember the last timestamp */
+ memcpy(&logst->tmp_pos_stamp, &lh->l_stamp, sizeof(logst->tmp_pos_stamp));
+
+ logst->payload_offset = offset;
+ logst->payload_len = lh->l_len;
+
+ return data + offset;
+
+put:
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->log_aio = NULL;
+ return NULL;
+
+err_free:
+ obj_free(aio);
+ if (logst->private) {
+ /* TODO: if callbacks are already registered, call them here with some error code */
+ brick_mem_free(logst->private);
+ logst->private = NULL;
+ }
+err:
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(log_reserve);
+
+bool log_finalize(struct log_status *logst, int len, void (*endio)(void *private, int error), void *private)
+{
+ struct aio_object *aio = logst->log_aio;
+ struct log_cb_info *cb_info = logst->private;
+ struct timespec now;
+ void *data;
+ int offset;
+ int restlen;
+ int nr_cb;
+ int crc;
+ bool ok = false;
+
+ CHECK_PTR(aio, err);
+
+ if (unlikely(len > logst->payload_len)) {
+ XIO_ERR("trying to write more than reserved (%d > %d)\n", len, logst->payload_len);
+ goto err;
+ }
+ restlen = aio->io_len - logst->offset;
+ if (unlikely(len + END_OVERHEAD > restlen)) {
+ XIO_ERR("trying to write more than available (%d > %d)\n", len, (int)(restlen - END_OVERHEAD));
+ goto err;
+ }
+ if (unlikely(!cb_info || cb_info->nr_cb >= XIO_LOG_CB_MAX)) {
+ XIO_ERR("too many endio() calls\n");
+ goto err;
+ }
+
+ data = aio->io_data;
+
+ crc = 0;
+ if (logst->do_crc) {
+ unsigned char checksum[xio_digest_size];
+
+ xio_digest(checksum, data + logst->payload_offset, len);
+ crc = *(int *)checksum;
+ }
+
+ /* Correct the length in the header.
+ */
+ offset = logst->reallen_offset;
+ DATA_PUT(data, offset, (short)len);
+
+ /* Write the trailer.
+ */
+ offset = logst->payload_offset + len;
+ DATA_PUT(data, offset, END_MAGIC);
+ DATA_PUT(data, offset, crc);
+ DATA_PUT(data, offset, (char)1); /* valid_flag copy */
+ DATA_PUT(data, offset, (char)0); /* spare */
+ DATA_PUT(data, offset, (short)0); /* spare */
+ DATA_PUT(data, offset, logst->seq_nr + 1);
+ get_lamport(&now); /* when the log entry was ready. */
+ DATA_PUT(data, offset, now.tv_sec);
+ DATA_PUT(data, offset, now.tv_nsec);
+
+ if (unlikely(offset > aio->io_len)) {
+ XIO_FAT("length calculation was wrong: %d > %d\n", offset, aio->io_len);
+ goto err;
+ }
+ logst->offset = offset;
+
+ /* This must come last. In case of incomplete
+ * or even overlapping disk transfers, this indicates
+ * the completeness / integrity of the payload at
+ * the time of starting the transfer.
+ */
+ offset = logst->validflag_offset;
+ DATA_PUT(data, offset, (char)1);
+
+ nr_cb = cb_info->nr_cb++;
+ cb_info->endios[nr_cb] = endio;
+ cb_info->privates[nr_cb] = private;
+
+ /* report success */
+ logst->seq_nr++;
+ logst->count++;
+ ok = true;
+
+err:
+ return ok;
+}
+EXPORT_SYMBOL_GPL(log_finalize);
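+
+/* Typical write path (illustrative sketch only; "payload", now, pos,
+ * len, code, my_endio() and my_private are placeholders):
+ *
+ * struct log_header lh = {
+ * .l_stamp = now,
+ * .l_pos = pos,
+ * .l_len = len,
+ * .l_code = code,
+ * };
+ * void *data = log_reserve(logst, &lh);
+ *
+ * if (data) {
+ * memcpy(data, payload, len);
+ * if (!log_finalize(logst, len, my_endio, my_private))
+ * handle_error();
+ * }
+ *
+ * The registered endio callbacks are batched and fire after the
+ * buffer has been submitted by log_flush().
+ */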
+
+static
+void log_read_endio(struct generic_callback *cb)
+{
+ struct log_status *logst = cb->cb_private;
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(logst, err);
+ logst->error_code = cb->cb_error;
+ logst->got = true;
+ wake_up_interruptible(&logst->event);
+ goto out_return;
+err:
+ XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+int log_read(struct log_status *logst, bool sloppy, struct log_header *lh, void **payload, int *payload_len)
+{
+ struct aio_object *aio;
+ int old_offset;
+ int status;
+
+restart:
+ status = 0;
+ aio = logst->read_aio;
+ if (!aio || logst->do_free) {
+ loff_t this_len;
+
+ if (aio) {
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->read_aio = NULL;
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ }
+
+ this_len = logst->end_pos - logst->log_pos;
+ if (this_len > logst->chunk_size) {
+ this_len = logst->chunk_size;
+ } else if (unlikely(this_len <= 0)) {
+ XIO_ERR("tried bad IO len %lld, start_pos = %lld log_pos = %lld end_pos = %lld\n",
+ this_len,
+ logst->start_pos,
+ logst->log_pos,
+ logst->end_pos);
+ status = -EOVERFLOW;
+ goto done;
+ }
+
+ aio = xio_alloc_aio(logst->brick);
+ aio->io_pos = logst->log_pos;
+ aio->io_len = this_len;
+ aio->io_prio = logst->io_prio;
+
+ status = GENERIC_INPUT_CALL(logst->input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ if (status != -ENODATA)
+ XIO_ERR("aio_get() failed, status = %d\n", status);
+ goto done_free;
+ }
+ if (unlikely(aio->io_len <= OVERHEAD)) { /* EOF */
+ status = 0;
+ goto done_put;
+ }
+
+ SETUP_CALLBACK(aio, log_read_endio, logst);
+ aio->io_rw = READ;
+ logst->offset = 0;
+ logst->got = false;
+ logst->do_free = false;
+
+ GENERIC_INPUT_CALL(logst->input, aio_io, aio);
+
+ wait_event_interruptible_timeout(logst->event, logst->got, 60 * HZ);
+ status = -EIO;
+ if (!logst->got)
+ goto done_put;
+ status = logst->error_code;
+ if (status < 0)
+ goto done_put;
+ logst->read_aio = aio;
+ }
+
+ status = log_scan(aio->io_data + logst->offset,
+ aio->io_len - logst->offset,
+ aio->io_pos,
+ logst->offset,
+ sloppy,
+ lh,
+ payload,
+ payload_len,
+ &logst->seq_nr);
+
+ if (unlikely(status == 0)) {
+ XIO_ERR("bad logfile scan\n");
+ status = -EINVAL;
+ }
+ if (unlikely(status < 0))
+ goto done_put;
+
+ /* memoize success */
+ logst->offset += status;
+ if (logst->offset + (logst->max_size + OVERHEAD) * 2 >= aio->io_len)
+ logst->do_free = true;
+
+done:
+ if (status == -ENODATA) {
+ /* indicate EOF */
+ status = 0;
+ }
+ return status;
+
+done_put:
+ old_offset = logst->offset;
+ if (aio) {
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->read_aio = NULL;
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ }
+ if (status == -EAGAIN && old_offset > 0)
+ goto restart;
+ goto done;
+
+done_free:
+ obj_free(aio);
+ logst->read_aio = NULL;
+ goto done;
+}
+EXPORT_SYMBOL_GPL(log_read);
+
+/***************** module init stuff ************************/
+
+int __init init_log_format(void)
+{
+ XIO_INF("init_log_format()\n");
+ return 0;
+}
+
+void exit_log_format(void)
+{
+ XIO_INF("exit_log_format()\n");
+}
--
2.0.0

2014-07-01 21:54:02

by Thomas Schoebel-Theuer

Subject: [PATCH 11/50] mars: add new file include/linux/brick/lib_pairing_heap.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_pairing_heap.h | 94 ++++++++++++++++++++++++++++++++++
1 file changed, 94 insertions(+)
create mode 100644 include/linux/brick/lib_pairing_heap.h

diff --git a/include/linux/brick/lib_pairing_heap.h b/include/linux/brick/lib_pairing_heap.h
new file mode 100644
index 0000000..f60d4e7
--- /dev/null
+++ b/include/linux/brick/lib_pairing_heap.h
@@ -0,0 +1,94 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef PAIRING_HEAP_H
+#define PAIRING_HEAP_H
+
+/* Algorithm: see http://en.wikipedia.org/wiki/Pairing_heap
+ * This is just an efficient translation from recursive to iterative form.
+ *
+ * Note: find_min() is so trivial that we don't implement it.
+ */
+
+/* generic version: KEYDEF is kept separate, allowing you to
+ * embed this structure into other container structures already
+ * possessing some key (just provide an empty KEYDEF in this case).
+ */
+#define _PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYDEF) \
+ \
+struct pairing_heap_##KEYTYPE { \
+ KEYDEF \
+ struct pairing_heap_##KEYTYPE *next; \
+ struct pairing_heap_##KEYTYPE *subheaps; \
+}; \
+/* this comment is for keeping TRAILING_SEMICOLON happy */
+
+/* less generic version: define the key inside.
+ */
+#define PAIRING_HEAP_TYPEDEF(KEYTYPE) \
+ _PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYTYPE key;)
+
+/* generic methods: allow arbitrary CMP() functions.
+ */
+#define _PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE, CMP) \
+ \
+_STATIC \
+struct pairing_heap_##KEYTYPE *_ph_merge_##KEYTYPE(struct pairing_heap_##KEYTYPE *heap1,\
+ struct pairing_heap_##KEYTYPE *heap2) \
+{ \
+ if (!heap1) \
+ return heap2; \
+ if (!heap2) \
+ return heap1; \
+ if (CMP(heap1, heap2) < 0) { \
+ heap2->next = heap1->subheaps; \
+ heap1->subheaps = heap2; \
+ return heap1; \
+ } \
+ heap1->next = heap2->subheaps; \
+ heap2->subheaps = heap1; \
+ return heap2; \
+} \
+ \
+_STATIC \
+void ph_insert_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap, struct pairing_heap_##KEYTYPE *new)\
+{ \
+ new->next = NULL; \
+ new->subheaps = NULL; \
+ *heap = _ph_merge_##KEYTYPE(*heap, new); \
+} \
+ \
+_STATIC \
+void ph_delete_min_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap) \
+{ \
+ struct pairing_heap_##KEYTYPE *tmplist = NULL; \
+ struct pairing_heap_##KEYTYPE *ptr; \
+ struct pairing_heap_##KEYTYPE *next; \
+ struct pairing_heap_##KEYTYPE *res; \
+ if (!*heap) { \
+ return; \
+ } \
+ for (ptr = (*heap)->subheaps; ptr; ptr = next) { \
+ struct pairing_heap_##KEYTYPE *p2 = ptr->next; \
+ next = p2; \
+ if (p2) { \
+ next = p2->next; \
+ ptr = _ph_merge_##KEYTYPE(ptr, p2); \
+ } \
+ ptr->next = tmplist; \
+ tmplist = ptr; \
+ } \
+ res = NULL; \
+ for (ptr = tmplist; ptr; ptr = next) { \
+ next = ptr->next; \
+ res = _ph_merge_##KEYTYPE(res, ptr); \
+ } \
+ *heap = res; \
+}
+
+/* some default CMP() function */
+#define PAIRING_HEAP_COMPARE(a, b) ((a)->key < (b)->key ? -1 : ((a)->key > (b)->key ? 1 : 0))
+
+/* less generic version: use the default CMP() function */
+#define PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE) \
+ _PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE, PAIRING_HEAP_COMPARE)
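+
+/* Illustrative usage sketch:
+ *
+ * PAIRING_HEAP_TYPEDEF(int);
+ * PAIRING_HEAP_FUNCTIONS(static, int);
+ *
+ * yields struct pairing_heap_int carrying an int key, plus static
+ * ph_insert_int() / ph_delete_min_int(). After
+ *
+ * struct pairing_heap_int *heap = NULL;
+ * struct pairing_heap_int elem = { .key = 42 };
+ *
+ * ph_insert_int(&heap, &elem);
+ *
+ * *heap points to the minimum element (this is why find_min() is not
+ * implemented), and ph_delete_min_int(&heap) removes it.
+ */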
+
+#endif
--
2.0.0

2014-07-01 21:55:01

by Thomas Schoebel-Theuer

Subject: [PATCH 15/50] mars: add new file include/linux/brick/lib_timing.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_timing.h | 156 +++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)
create mode 100644 include/linux/brick/lib_timing.h

diff --git a/include/linux/brick/lib_timing.h b/include/linux/brick/lib_timing.h
new file mode 100644
index 0000000..af15f1a
--- /dev/null
+++ b/include/linux/brick/lib_timing.h
@@ -0,0 +1,156 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_LIB_TIMING_H
+#define BRICK_LIB_TIMING_H
+
+#include <linux/brick/brick.h>
+
+#include <linux/sched.h>
+
+/* Simple infrastructure for timing of arbitrary operations and creation
+ * of some simple histogram statistics.
+ */
+
+#define TIMING_MAX 24
+
+struct timing_stats {
+#ifdef CONFIG_MARS_DEBUG
+ int tim_count[TIMING_MAX];
+
+#endif
+};
+
+#define _TIME_THIS(_stamp1, _stamp2, _CODE) \
+ ({ \
+ (_stamp1) = cpu_clock(raw_smp_processor_id()); \
+ \
+ _CODE; \
+ \
+ (_stamp2) = cpu_clock(raw_smp_processor_id()); \
+ (_stamp2) - (_stamp1); \
+ })
+
+#define TIME_THIS(_CODE) \
+ ({ \
+ unsigned long long _stamp1; \
+ unsigned long long _stamp2; \
+ _TIME_THIS(_stamp1, _stamp2, _CODE); \
+ })
+
+#ifdef CONFIG_MARS_DEBUG
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+ ({ \
+ unsigned long long _time; \
+ unsigned long _tmp; \
+ int _i; \
+ \
+ _time = _TIME_THIS(_stamp1, _stamp2, _CODE); \
+ \
+ _tmp = _time / 1000; /* convert to us */ \
+ _i = 0; \
+ while (_tmp > 0 && _i < TIMING_MAX - 1) { \
+ _tmp >>= 1; \
+ _i++; \
+ } \
+ (_timing)->tim_count[_i]++; \
+ _time; \
+ })
+
+#define TIME_STATS(_timing, _CODE) \
+ ({ \
+ unsigned long long _stamp1; \
+ unsigned long long _stamp2; \
+ _TIME_STATS(_timing, _stamp1, _stamp2, _CODE); \
+ })
+
+extern int report_timing(struct timing_stats *tim, char *str, int maxlen);
+
+#else /* CONFIG_MARS_DEBUG */
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+ ((void)_timing, (_stamp1) = (_stamp2) = cpu_clock(raw_smp_processor_id()), _CODE, 0)
+
+#define TIME_STATS(_timing, _CODE) \
+ ((void)_timing, _CODE, 0)
+
+#define report_timing(tim, str, maxlen) ((void)tim, 0)
+
+#endif /* CONFIG_MARS_DEBUG */
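+
+/* Illustrative usage sketch (do_some_io() is a placeholder):
+ *
+ * static struct timing_stats my_timing;
+ * unsigned long long elapsed_ns;
+ *
+ * elapsed_ns = TIME_STATS(&my_timing, do_some_io());
+ *
+ * Under CONFIG_MARS_DEBUG this also increments the histogram bucket
+ * roughly corresponding to log2() of the elapsed microseconds;
+ * otherwise it just executes the code and yields 0.
+ */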
+
+/* A banning represents some overloaded resource.
+ *
+ * Whenever overload is detected, you should call banning_hit()
+ * telling that the overload is assumed / estimated to continue
+ * for some duration in time.
+ *
+ * ATTENTION! These operations are deliberately raceful.
+ * They are meant to deliver _hints_ (e.g. for IO scheduling
+ * decisions etc), not hard facts!
+ *
+ * If you need locking, just surround these operations
+ * with locking by yourself.
+ */
+struct banning {
+ long long ban_last_hit;
+
+ /* statistical */
+ int ban_renew_count;
+ int ban_count;
+};
+
+extern inline
+bool banning_hit(struct banning *ban, long long duration)
+{
+ long long now = cpu_clock(raw_smp_processor_id());
+ bool hit = ban->ban_last_hit >= now;
+ long long new_hit = now + duration;
+
+ ban->ban_renew_count++;
+ if (!ban->ban_last_hit || ban->ban_last_hit < new_hit) {
+ ban->ban_last_hit = new_hit;
+ ban->ban_count++;
+ }
+ return hit;
+}
+
+extern inline
+bool banning_is_hit(struct banning *ban)
+{
+ long long now = cpu_clock(raw_smp_processor_id());
+
+ return (ban->ban_last_hit && ban->ban_last_hit >= now);
+}
+
+extern inline
+void banning_reset(struct banning *ban)
+{
+ ban->ban_last_hit = 0;
+}
+
+/* Threshold: trigger a banning whenever some latency threshold
+ * is exceeded.
+ */
+struct threshold {
+ struct banning *thr_ban;
+
+ /* tunables */
+ int thr_limit; /* in us */
+ int thr_factor; /* in % */
+ int thr_plus; /* in us */
+ /* statistical */
+ int thr_triggered;
+ int thr_true_hit;
+};
+
+extern inline
+void threshold_check(struct threshold *thr, long long latency)
+{
+ if (thr->thr_limit &&
+ latency > (long long)thr->thr_limit * 1000) {
+ thr->thr_triggered++;
+ if (!banning_hit(thr->thr_ban, latency * thr->thr_factor / 100 + thr->thr_plus * 1000))
+ thr->thr_true_hit++;
+ }
+}
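+
+/* Illustrative wiring sketch (io_ban, io_thr, latency_ns and
+ * back_off() are placeholders):
+ *
+ * static struct banning io_ban;
+ * static struct threshold io_thr = {
+ * .thr_ban = &io_ban,
+ * .thr_limit = 10000,
+ * .thr_factor = 100,
+ * };
+ *
+ * threshold_check(&io_thr, latency_ns);
+ * if (banning_is_hit(&io_ban))
+ * back_off();
+ *
+ * thr_limit is in us, so 10000 triggers above 10 ms; the resulting
+ * ban lasts about latency * thr_factor / 100 + thr_plus * 1000 ns.
+ */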
+
+#endif
--
2.0.0

2014-07-01 21:55:35

by Thomas Schoebel-Theuer

Subject: [PATCH 31/50] mars: add new file include/linux/xio/xio_client.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_client.h | 70 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)
create mode 100644 include/linux/xio/xio_client.h

diff --git a/include/linux/xio/xio_client.h b/include/linux/xio/xio_client.h
new file mode 100644
index 0000000..86f3eac
--- /dev/null
+++ b/include/linux/xio/xio_client.h
@@ -0,0 +1,70 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_CLIENT_H
+#define XIO_CLIENT_H
+
+#include <linux/xio_net.h>
+#include <linux/brick/lib_limiter.h>
+
+extern struct xio_limiter client_limiter;
+extern int global_net_io_timeout;
+extern int xio_client_abort;
+
+struct client_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ struct list_head hash_head;
+ struct list_head tmp_head;
+ unsigned long submit_jiffies;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct client_brick {
+ XIO_BRICK(client);
+ /* tunables */
+ int max_flying; /* limit on parallelism */
+ int io_timeout; /* > 0: report IO errors after timeout (in seconds) */
+ bool limit_mode;
+
+ /* readonly from outside */
+ int connection_state; /* 0 = switched off, 1 = not connected, 2 = connected */
+};
+
+struct client_input {
+ XIO_INPUT(client);
+};
+
+struct client_threadinfo {
+ struct task_struct *thread;
+
+ wait_queue_head_t run_event;
+ int restart_count;
+};
+
+struct client_output {
+ XIO_OUTPUT(client);
+ atomic_t fly_count;
+ atomic_t timeout_count;
+ spinlock_t lock;
+ struct list_head aio_list;
+ struct list_head wait_list;
+
+ wait_queue_head_t event;
+ int last_id;
+ int recv_error;
+ struct xio_socket socket;
+ char *host;
+ char *path;
+ struct client_threadinfo sender;
+ struct client_threadinfo receiver;
+ struct xio_info info;
+
+ wait_queue_head_t info_event;
+ bool get_info;
+ bool got_info;
+ struct list_head *hash_table;
+};
+
+XIO_TYPES(client);
+
+#endif
--
2.0.0

2014-07-01 21:53:59

by Thomas Schoebel-Theuer

Subject: [PATCH 32/50] mars: add new file drivers/block/mars/xio_bricks/xio_client.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_client.c | 739 +++++++++++++++++++++++++++++
1 file changed, 739 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_client.c

diff --git a/drivers/block/mars/xio_bricks/xio_client.c b/drivers/block/mars/xio_bricks/xio_client.c
new file mode 100644
index 0000000..ee357e3
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_client.c
@@ -0,0 +1,739 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Client brick (just for demonstration) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/xio.h>
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_client.h>
+
+#define CLIENT_HASH_MAX (PAGE_SIZE / sizeof(struct list_head))
+
+int xio_client_abort = 10;
+EXPORT_SYMBOL_GPL(xio_client_abort);
+
+/************************ own helper functions ***********************/
+
+static int thread_count;
+
+static void _kill_thread(struct client_threadinfo *ti, const char *name)
+{
+ if (ti->thread) {
+ XIO_DBG("stopping %s thread\n", name);
+ brick_thread_stop(ti->thread);
+ ti->thread = NULL;
+ }
+}
+
+static void _kill_socket(struct client_output *output)
+{
+ output->brick->connection_state = 1;
+ if (xio_socket_is_alive(&output->socket)) {
+ XIO_DBG("shutdown socket\n");
+ xio_shutdown_socket(&output->socket);
+ }
+ _kill_thread(&output->receiver, "receiver");
+ output->recv_error = 0;
+ XIO_DBG("close socket\n");
+ xio_put_socket(&output->socket);
+}
+
+static int _request_info(struct client_output *output)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_GETINFO,
+ };
+ int status;
+
+ XIO_DBG("\n");
+ status = xio_send_struct(&output->socket, &cmd, xio_cmd_meta);
+ if (unlikely(status < 0))
+ XIO_DBG("send of getinfo failed, status = %d\n", status);
+ return status;
+}
+
+static int receiver_thread(void *data);
+
+static int _connect(struct client_output *output, const char *str)
+{
+ struct sockaddr_storage sockaddr = {};
+ int status;
+
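+ /* str is expected in the form "<remote_path>@<host>", e.g.
+ * "/mars/resource-r0@peer1" (illustrative; the host part is
+ * resolved by xio_create_sockaddr()).
+ */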
+ if (unlikely(!output->path)) {
+ output->path = brick_strdup(str);
+ status = -EINVAL;
+ output->host = strchr(output->path, '@');
+ if (!output->host) {
+ brick_string_free(output->path);
+ output->path = NULL;
+ XIO_ERR("parameter string '%s' contains no remote specifier with '@'-syntax\n", str);
+ goto done;
+ }
+ *output->host++ = '\0';
+ }
+
+ if (unlikely(output->receiver.thread)) {
+ XIO_WRN("receiver thread unexpectedly not dead\n");
+ _kill_thread(&output->receiver, "receiver");
+ }
+
+ status = xio_create_sockaddr(&sockaddr, output->host);
+ if (unlikely(status < 0)) {
+ XIO_DBG("no sockaddr, status = %d\n", status);
+ goto done;
+ }
+
+ status = xio_create_socket(&output->socket, &sockaddr, false);
+ if (unlikely(status < 0)) {
+ XIO_DBG("no socket, status = %d\n", status);
+ goto really_done;
+ }
+ output->socket.s_shutdown_on_err = true;
+ output->socket.s_send_abort = xio_client_abort;
+ output->socket.s_recv_abort = xio_client_abort;
+
+ output->receiver.thread = brick_thread_create(receiver_thread, output, "xio_receiver%d", thread_count++);
+ if (unlikely(!output->receiver.thread)) {
+ status = -ENOENT;
+ XIO_ERR("cannot start receiver thread, status = %d\n", status);
+ goto done;
+ }
+
+ {
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_CONNECT,
+ .cmd_str1 = output->path,
+ };
+
+ status = xio_send_struct(&output->socket, &cmd, xio_cmd_meta);
+ if (unlikely(status < 0)) {
+ XIO_DBG("send of connect failed, status = %d\n", status);
+ goto done;
+ }
+ }
+ if (status >= 0)
+ status = _request_info(output);
+
+done:
+ if (status < 0) {
+ XIO_INF("cannot connect to remote host '%s' (status = %d) -- retrying\n",
+ output->host ? output->host : "NULL",
+ status);
+ _kill_socket(output);
+ }
+really_done:
+ return status;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static int client_get_info(struct client_output *output, struct xio_info *info)
+{
+ int status;
+
+ output->got_info = false;
+ output->get_info = true;
+ wake_up_interruptible(&output->event);
+
+ wait_event_interruptible_timeout(output->info_event, output->got_info, 60 * HZ);
+ status = -EIO;
+ if (output->got_info && info) {
+ memcpy(info, &output->info, sizeof(*info));
+ status = 0;
+ }
+
+ return status;
+}
+
+static int client_io_get(struct client_output *output, struct aio_object *aio)
+{
+ int maxlen;
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ /* Limit transfers to page boundaries.
+ * Currently, this is more restrictive than necessary.
+ * TODO: improve performance by doing better when possible.
+ * This needs help from the server in some efficient way.
+ */
+ maxlen = PAGE_SIZE - (aio->io_pos & (PAGE_SIZE-1));
+ if (aio->io_len > maxlen)
+ aio->io_len = maxlen;
+
+ if (!aio->io_data) { /* buffered IO */
+ struct client_aio_aspect *aio_a = client_aio_get_aspect(output->brick, aio);
+
+ if (!aio_a)
+ return -EILSEQ;
+
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+
+ aio_a->do_dealloc = true;
+ aio->io_flags = 0;
+ }
+
+ obj_get_first(aio);
+ return 0;
+}
+
+static void client_io_put(struct client_output *output, struct aio_object *aio)
+{
+ struct client_aio_aspect *aio_a;
+
+ if (!obj_put(aio))
+ goto out_return;
+ aio_a = client_aio_get_aspect(output->brick, aio);
+ if (aio_a && aio_a->do_dealloc)
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ obj_free(aio);
+out_return:;
+}
+
+static
+void _hash_insert(struct client_output *output, struct client_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ int hash_index;
+
+ spin_lock(&output->lock);
+ list_del(&aio_a->io_head);
+ list_add_tail(&aio_a->io_head, &output->aio_list);
+ list_del(&aio_a->hash_head);
+ aio->io_id = ++output->last_id;
+ hash_index = aio->io_id % CLIENT_HASH_MAX;
+ list_add_tail(&aio_a->hash_head, &output->hash_table[hash_index]);
+ spin_unlock(&output->lock);
+}
+
+static void client_io_io(struct client_output *output, struct aio_object *aio)
+{
+ struct client_aio_aspect *aio_a;
+ int error = -EINVAL;
+
+ aio_a = client_aio_get_aspect(output->brick, aio);
+ if (unlikely(!aio_a))
+ goto error;
+
+ while (output->brick->max_flying > 0 && atomic_read(&output->fly_count) > output->brick->max_flying)
+ brick_msleep(1000 * 2 / HZ);
+
+ atomic_inc(&xio_global_io_flying);
+ atomic_inc(&output->fly_count);
+ obj_get(aio);
+
+ aio_a->submit_jiffies = jiffies;
+ _hash_insert(output, aio_a);
+
+ wake_up_interruptible(&output->event);
+
+ goto out_return;
+error:
+ XIO_ERR("IO error = %d\n", error);
+ SIMPLE_CALLBACK(aio, error);
+ client_io_put(output, aio);
+out_return:;
+}
+
+static
+int receiver_thread(void *data)
+{
+ struct client_output *output = data;
+ int status = 0;
+
+ while (!brick_thread_should_stop()) {
+ struct xio_cmd cmd = {};
+ struct list_head *tmp;
+ struct client_aio_aspect *aio_a = NULL;
+ struct aio_object *aio = NULL;
+
+ status = xio_recv_struct(&output->socket, &cmd, xio_cmd_meta);
+ if (status < 0)
+ goto done;
+
+ switch (cmd.cmd_code & CMD_FLAG_MASK) {
+ case CMD_NOTIFY:
+ local_trigger();
+ break;
+ case CMD_CONNECT:
+ if (cmd.cmd_int1 < 0) {
+ status = cmd.cmd_int1;
+ XIO_ERR("at remote side: brick connect failed, remote status = %d\n", status);
+ goto done;
+ }
+ break;
+ case CMD_CB:
+ {
+ int hash_index = cmd.cmd_int1 % CLIENT_HASH_MAX;
+
+ spin_lock(&output->lock);
+ for (tmp = output->hash_table[hash_index].next; tmp != &output->hash_table[hash_index]; tmp = tmp->next) {
+ struct aio_object *tmp_aio;
+
+ aio_a = container_of(tmp, struct client_aio_aspect, hash_head);
+ tmp_aio = aio_a->object;
+ if (unlikely(!tmp_aio)) {
+ spin_unlock(&output->lock);
+ XIO_ERR("bad internal aio pointer\n");
+ status = -EBADR;
+ goto done;
+ }
+ if (tmp_aio->io_id == cmd.cmd_int1) {
+ aio = tmp_aio;
+ list_del_init(&aio_a->hash_head);
+ list_del_init(&aio_a->io_head);
+ break;
+ }
+ }
+ spin_unlock(&output->lock);
+
+ if (unlikely(!aio)) {
+ XIO_WRN("got unknown id = %d for callback\n", cmd.cmd_int1);
+ status = -EBADR;
+ goto done;
+ }
+
+ status = xio_recv_cb(&output->socket, aio, &cmd);
+ if (unlikely(status < 0)) {
+ XIO_WRN("interrupted data transfer during callback, status = %d\n", status);
+ _hash_insert(output, aio_a);
+ goto done;
+ }
+
+ SIMPLE_CALLBACK(aio, aio->_object_cb.cb_error);
+
+ client_io_put(output, aio);
+
+ atomic_dec(&output->fly_count);
+ atomic_dec(&xio_global_io_flying);
+ break;
+ }
+ case CMD_GETINFO:
+ status = xio_recv_struct(&output->socket, &output->info, xio_info_meta);
+ if (status < 0) {
+ XIO_WRN("got bad info from remote side, status = %d\n", status);
+ goto done;
+ }
+ output->got_info = true;
+ wake_up_interruptible(&output->info_event);
+ break;
+ default:
+ XIO_ERR("got bad command %d from remote side, terminating.\n", cmd.cmd_code);
+ status = -EBADR;
+ goto done;
+ }
+done:
+ brick_string_free(cmd.cmd_str1);
+ if (unlikely(status < 0)) {
+ if (!output->recv_error) {
+ XIO_DBG("signalling status = %d\n", status);
+ output->recv_error = status;
+ }
+ wake_up_interruptible(&output->event);
+ brick_msleep(100);
+ }
+ }
+
+ if (status < 0)
+ XIO_WRN("receiver thread terminated with status = %d, recv_error = %d\n", status, output->recv_error);
+
+ xio_shutdown_socket(&output->socket);
+ wake_up_interruptible(&output->receiver.run_event);
+ return status;
+}
+
+static
+void _do_resubmit(struct client_output *output)
+{
+ spin_lock(&output->lock);
+ if (!list_empty(&output->wait_list)) {
+ struct list_head *first = output->wait_list.next;
+ struct list_head *last = output->wait_list.prev;
+ struct list_head *old_start = output->aio_list.next;
+
+#define list_connect __list_del /* the original routine has a misleading name: in reality it is more general */
+ list_connect(&output->aio_list, first);
+ list_connect(last, old_start);
+ INIT_LIST_HEAD(&output->wait_list);
+ }
+ spin_unlock(&output->lock);
+}
+
+static
+void _do_timeout(struct client_output *output, struct list_head *anchor, bool force)
+{
+ struct client_brick *brick = output->brick;
+ struct list_head *tmp;
+ struct list_head *next;
+ LIST_HEAD(tmp_list);
+ int rounds = 0;
+ long io_timeout = brick->io_timeout;
+
+ if (io_timeout <= 0)
+ io_timeout = global_net_io_timeout;
+
+ if (!xio_net_is_alive)
+ force = true;
+
+ if (!force && io_timeout <= 0)
+ goto out_return;
+ io_timeout *= HZ;
+
+ spin_lock(&output->lock);
+ for (tmp = anchor->next, next = tmp->next; tmp != anchor; tmp = next, next = tmp->next) {
+ struct client_aio_aspect *aio_a;
+
+ aio_a = container_of(tmp, struct client_aio_aspect, io_head);
+
+ if (!force &&
+ !time_is_before_jiffies(aio_a->submit_jiffies + io_timeout)) {
+ continue;
+ }
+
+ list_del_init(&aio_a->hash_head);
+ list_del_init(&aio_a->io_head);
+ list_add_tail(&aio_a->tmp_head, &tmp_list);
+ }
+ spin_unlock(&output->lock);
+
+ while (!list_empty(&tmp_list)) {
+ struct client_aio_aspect *aio_a;
+ struct aio_object *aio;
+
+ tmp = tmp_list.next;
+ list_del_init(tmp);
+ aio_a = container_of(tmp, struct client_aio_aspect, tmp_head);
+ aio = aio_a->object;
+
+ if (!rounds++) {
+ XIO_WRN("timeout after %ld: signalling IO error at pos = %lld len = %d\n",
+ io_timeout,
+ aio->io_pos,
+ aio->io_len);
+ }
+
+ atomic_inc(&output->timeout_count);
+
+ SIMPLE_CALLBACK(aio, -ENOTCONN);
+
+ client_io_put(output, aio);
+
+ atomic_dec(&output->fly_count);
+ atomic_dec(&xio_global_io_flying);
+ }
+out_return:;
+}
+
+static int sender_thread(void *data)
+{
+ struct client_output *output = data;
+ struct client_brick *brick = output->brick;
+ bool do_kill = false;
+ int status = 0;
+
+ output->receiver.restart_count = 0;
+
+ while (!brick_thread_should_stop()) {
+ struct list_head *tmp = NULL;
+ struct client_aio_aspect *aio_a;
+ struct aio_object *aio;
+
+ if (unlikely(output->recv_error != 0 || !xio_socket_is_alive(&output->socket))) {
+ XIO_DBG("recv_error = %d do_kill = %d\n", output->recv_error, do_kill);
+ if (do_kill) {
+ do_kill = false;
+ _kill_socket(output);
+ brick_msleep(3000);
+ }
+
+ status = _connect(output, brick->brick_name);
+ if (unlikely(status < 0)) {
+ brick_msleep(3000);
+ _do_timeout(output, &output->wait_list, false);
+ _do_timeout(output, &output->aio_list, false);
+ continue;
+ }
+ brick->connection_state = 2;
+ do_kill = true;
+ /* Re-Submit any waiting requests
+ */
+ _do_resubmit(output);
+ }
+
+ wait_event_interruptible_timeout(output->event,
+ !list_empty(&output->aio_list) ||
+ output->get_info ||
+ output->recv_error != 0 ||
+ brick_thread_should_stop(),
+ 1 * HZ);
+
+ if (unlikely(output->recv_error != 0)) {
+ XIO_DBG("recv_error = %d\n", output->recv_error);
+ brick_msleep(1000);
+ continue;
+ }
+
+ if (output->get_info) {
+ status = _request_info(output);
+ if (status >= 0) {
+ output->get_info = false;
+ } else {
+ XIO_WRN("cannot get info, status = %d\n", status);
+ brick_msleep(1000);
+ }
+ }
+
+ /* Grab the next aio from the queue
+ */
+ spin_lock(&output->lock);
+ if (list_empty(&output->aio_list)) {
+ spin_unlock(&output->lock);
+ continue;
+ }
+ tmp = output->aio_list.next;
+ list_del(tmp);
+ list_add(tmp, &output->wait_list);
+ aio_a = container_of(tmp, struct client_aio_aspect, io_head);
+ spin_unlock(&output->lock);
+
+ aio = aio_a->object;
+
+ if (brick->limit_mode) {
+ int amount = 0;
+
+ if (aio->io_cs_mode < 2)
+ amount = (aio->io_len - 1) / 1024 + 1;
+ xio_limit_sleep(&client_limiter, amount);
+ }
+
+ status = xio_send_aio(&output->socket, aio);
+ if (unlikely(status < 0)) {
+ /* retry submission on next occasion.. */
+ XIO_WRN("sending failed, status = %d\n", status);
+
+ if (do_kill) {
+ do_kill = false;
+ _kill_socket(output);
+ }
+ _hash_insert(output, aio_a);
+ brick_msleep(1000);
+ continue;
+ }
+ }
+
+ if (status < 0)
+ XIO_WRN("sender thread terminated with status = %d\n", status);
+ if (do_kill)
+ _kill_socket(output);
+
+ /* Signal error on all pending IO requests.
+ * We have no other chance (except probably delaying
+ * this until destruction which is probably not what
+ * we want).
+ */
+ _do_timeout(output, &output->wait_list, true);
+ _do_timeout(output, &output->aio_list, true);
+
+ wake_up_interruptible(&output->sender.run_event);
+ XIO_DBG("sender terminated\n");
+ return status;
+}
+
+static int client_switch(struct client_brick *brick)
+{
+ struct client_output *output = brick->outputs[0];
+ int status = 0;
+
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+ xio_set_power_off_led((void *)brick, false);
+ if (!output->sender.thread) {
+ brick->connection_state = 1;
+ output->sender.thread = brick_thread_create(sender_thread,
+ output,
+ "xio_sender%d",
+ thread_count++);
+ if (unlikely(!output->sender.thread)) {
+ XIO_ERR("cannot start sender thread\n");
+ status = -ENOENT;
+ goto done;
+ }
+ }
+ if (output->sender.thread)
+ xio_set_power_on_led((void *)brick, true);
+ } else {
+ if (brick->power.off_led)
+ goto done;
+ xio_set_power_on_led((void *)brick, false);
+ _kill_thread(&output->sender, "sender");
+ brick->connection_state = 0;
+ if (!output->sender.thread)
+ xio_set_power_off_led((void *)brick, !output->sender.thread);
+ }
+done:
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *client_statistics(struct client_brick *brick, int verbose)
+{
+ struct client_output *output = brick->outputs[0];
+ char *res = brick_string_alloc(1024);
+
+ snprintf(res, 1024,
+ "#%d socket max_flying = %d io_timeout = %d | timeout_count = %d fly_count = %d\n",
+ output->socket.s_debug_nr,
+ brick->max_flying,
+ brick->io_timeout,
+ atomic_read(&output->timeout_count),
+ atomic_read(&output->fly_count));
+
+ return res;
+}
+
+static
+void client_reset_statistics(struct client_brick *brick)
+{
+ struct client_output *output = brick->outputs[0];
+
+ atomic_set(&output->timeout_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int client_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct client_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ INIT_LIST_HEAD(&ini->hash_head);
+ INIT_LIST_HEAD(&ini->tmp_head);
+ return 0;
+}
+
+static void client_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct client_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->io_head);
+ CHECK_HEAD_EMPTY(&ini->hash_head);
+}
+
+XIO_MAKE_STATICS(client);
+
+/********************* brick constructors * destructors *******************/
+
+static int client_brick_construct(struct client_brick *brick)
+{
+ return 0;
+}
+
+static int client_output_construct(struct client_output *output)
+{
+ int i;
+
+ output->hash_table = brick_block_alloc(0, PAGE_SIZE);
+
+ for (i = 0; i < CLIENT_HASH_MAX; i++)
+ INIT_LIST_HEAD(&output->hash_table[i]);
+ spin_lock_init(&output->lock);
+ INIT_LIST_HEAD(&output->aio_list);
+ INIT_LIST_HEAD(&output->wait_list);
+ init_waitqueue_head(&output->event);
+ init_waitqueue_head(&output->sender.run_event);
+ init_waitqueue_head(&output->receiver.run_event);
+ init_waitqueue_head(&output->info_event);
+ return 0;
+}
+
+static int client_output_destruct(struct client_output *output)
+{
+ if (output->path) {
+ brick_string_free(output->path);
+ output->path = NULL;
+ }
+ brick_block_free(output->hash_table, PAGE_SIZE);
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct client_brick_ops client_brick_ops = {
+ .brick_switch = client_switch,
+ .brick_statistics = client_statistics,
+ .reset_statistics = client_reset_statistics,
+};
+
+static struct client_output_ops client_output_ops = {
+ .xio_get_info = client_get_info,
+ .aio_get = client_io_get,
+ .aio_put = client_io_put,
+ .aio_io = client_io_io,
+};
+
+const struct client_input_type client_input_type = {
+ .type_name = "client_input",
+ .input_size = sizeof(struct client_input),
+};
+
+static const struct client_input_type *client_input_types[] = {
+ &client_input_type,
+};
+
+const struct client_output_type client_output_type = {
+ .type_name = "client_output",
+ .output_size = sizeof(struct client_output),
+ .master_ops = &client_output_ops,
+ .output_construct = &client_output_construct,
+ .output_destruct = &client_output_destruct,
+};
+
+static const struct client_output_type *client_output_types[] = {
+ &client_output_type,
+};
+
+const struct client_brick_type client_brick_type = {
+ .type_name = "client_brick",
+ .brick_size = sizeof(struct client_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &client_brick_ops,
+ .aspect_types = client_aspect_types,
+ .default_input_types = client_input_types,
+ .default_output_types = client_output_types,
+ .brick_construct = &client_brick_construct,
+};
+EXPORT_SYMBOL_GPL(client_brick_type);
+
+/***************** module init stuff ************************/
+
+struct xio_limiter client_limiter = {
+ .lim_max_rate = 0,
+};
+EXPORT_SYMBOL_GPL(client_limiter);
+
+int global_net_io_timeout = CONFIG_MARS_NETIO_TIMEOUT;
+EXPORT_SYMBOL_GPL(global_net_io_timeout);
+
+int __init init_xio_client(void)
+{
+ XIO_INF("init_client()\n");
+ _client_brick_type = (void *)&client_brick_type;
+ return client_register_brick_type();
+}
+
+void exit_xio_client(void)
+{
+ XIO_INF("exit_client()\n");
+ client_unregister_brick_type();
+}
--
2.0.0

2014-07-01 21:55:54

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 49/50] mars: generic pre-patch for mars

From: Thomas Schoebel-Theuer <[email protected]>

Mostly introduces missing EXPORT_SYMBOL() statements.
It should have no impact on the kernel.

This is the generic version which exports all sys_*() system
calls. This should not introduce any additional maintenance pain,
because those interfaces have to be stable anyway due to POSIX etc.
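
As a purely illustrative (hypothetical) sketch of what these exports
enable: an out-of-tree module such as MARS can then call the stable
syscall entry points directly from kernel context. Note that the
sys_*() functions expect __user pointers, so a kernel-space caller
must temporarily widen the address limit; none of the names below
exist in the patches, they are invented for this example:

	#include <linux/module.h>
	#include <linux/syscalls.h>
	#include <linux/fcntl.h>
	#include <linux/uaccess.h>

	/* hypothetical helper, for illustration only */
	static int demo_touch(const char *path)
	{
		mm_segment_t oldfs = get_fs();
		long fd;

		set_fs(KERNEL_DS);	/* allow kernel pointers in sys_*() */
		fd = sys_open((const char __user *)path,
			      O_RDWR | O_CREAT, 0600);
		if (fd >= 0)
			sys_close(fd);
		set_fs(oldfs);
		return fd < 0 ? (int)fd : 0;
	}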

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
fs/open.c | 1 -
fs/utimes.c | 2 ++
include/linux/syscalls.h | 3 +++
include/uapi/linux/major.h | 1 +
mm/page_alloc.c | 3 +++
5 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 36662d0..3b21b76 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1064,7 +1064,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)

return retval;
}
-EXPORT_SYMBOL(sys_close);

/*
* This routine simulates a hangup on the tty, to arrange that users
diff --git a/fs/utimes.c b/fs/utimes.c
index aa138d6..4a1f4a8 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -1,3 +1,4 @@
+#include <linux/module.h>
#include <linux/compiler.h>
#include <linux/file.h>
#include <linux/fs.h>
@@ -181,6 +182,7 @@ retry:
out:
return error;
}
+EXPORT_SYMBOL(do_utimes);

SYSCALL_DEFINE4(utimensat, int, dfd, const char __user *, filename,
struct timespec __user *, utimes, int, flags)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..c674309 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -75,6 +75,7 @@ struct sigaltstack;
#include <linux/sem.h>
#include <asm/siginfo.h>
#include <linux/unistd.h>
+#include <linux/export.h>
#include <linux/quota.h>
#include <linux/key.h>
#include <trace/syscall.h>
@@ -176,6 +177,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;

#define SYSCALL_DEFINE0(sname) \
SYSCALL_METADATA(_##sname, 0); \
+ EXPORT_SYMBOL(sys_##sname); \
asmlinkage long sys_##sname(void)

#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
@@ -202,6 +204,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
return ret; \
} \
+ EXPORT_SYMBOL(sys##name); \
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

asmlinkage long sys32_quotactl(unsigned int cmd, const char __user *special,
diff --git a/include/uapi/linux/major.h b/include/uapi/linux/major.h
index 620252e..61a665c 100644
--- a/include/uapi/linux/major.h
+++ b/include/uapi/linux/major.h
@@ -148,6 +148,7 @@
#define UNIX98_PTY_SLAVE_MAJOR (UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)

#define DRBD_MAJOR 147
+#define MARS_MAJOR 148
#define RTF_MAJOR 150
#define RAW_MAJOR 162

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..e55e7c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -205,6 +205,8 @@ static char * const zone_names[MAX_NR_ZONES] = {
};

int min_free_kbytes = 1024;
+EXPORT_SYMBOL(min_free_kbytes);
+
int user_min_free_kbytes = -1;

static unsigned long __meminitdata nr_kernel_pages;
@@ -5692,6 +5694,7 @@ static void __setup_per_zone_wmarks(void)
/* update totalreserve_pages */
calculate_totalreserve_pages();
}
+EXPORT_SYMBOL(setup_per_zone_wmarks);

/**
* setup_per_zone_wmarks - called when min_free_kbytes changes
--
2.0.0

2014-07-01 21:53:57

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 44/50] mars: add new file include/linux/mars_light/mars_proc.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/mars_light/mars_proc.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
create mode 100644 include/linux/mars_light/mars_proc.h

diff --git a/include/linux/mars_light/mars_proc.h b/include/linux/mars_light/mars_proc.h
new file mode 100644
index 0000000..cd76701
--- /dev/null
+++ b/include/linux/mars_light/mars_proc.h
@@ -0,0 +1,18 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef MARS_PROC_H
+#define MARS_PROC_H
+
+typedef char * (*xio_info_fn)(void);
+
+extern xio_info_fn xio_info;
+
+extern int min_free_kbytes;
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio_proc(void);
+extern void exit_xio_proc(void);
+
+#endif
--
2.0.0

2014-07-01 21:56:35

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 42/50] mars: add new file include/linux/xio/xio_server.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_server.h | 48 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
create mode 100644 include/linux/xio/xio_server.h

diff --git a/include/linux/xio/xio_server.h b/include/linux/xio/xio_server.h
new file mode 100644
index 0000000..98ad994
--- /dev/null
+++ b/include/linux/xio/xio_server.h
@@ -0,0 +1,48 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_SERVER_H
+#define XIO_SERVER_H
+
+#include <linux/wait.h>
+
+#include <linux/xio_net.h>
+#include <linux/brick/lib_limiter.h>
+
+extern int server_show_statist;
+
+extern struct xio_limiter server_limiter;
+
+struct server_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct server_brick *brick;
+ struct list_head cb_head;
+ bool do_put;
+};
+
+struct server_output {
+ XIO_OUTPUT(server);
+};
+
+struct server_brick {
+ XIO_BRICK(server);
+ atomic_t in_flight;
+ struct semaphore socket_sem;
+ struct xio_socket handler_socket;
+ struct task_struct *handler_thread;
+ struct task_struct *cb_thread;
+
+ wait_queue_head_t startup_event;
+ wait_queue_head_t cb_event;
+ spinlock_t cb_lock;
+ struct list_head cb_read_list;
+ struct list_head cb_write_list;
+ bool cb_running;
+ bool handler_running;
+};
+
+struct server_input {
+ XIO_INPUT(server);
+};
+
+XIO_TYPES(server);
+
+#endif
--
2.0.0

2014-07-01 21:56:50

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 33/50] mars: add new file include/linux/xio/xio_if.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_if.h | 93 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)
create mode 100644 include/linux/xio/xio_if.h

diff --git a/include/linux/xio/xio_if.h b/include/linux/xio/xio_if.h
new file mode 100644
index 0000000..ec7f226
--- /dev/null
+++ b/include/linux/xio/xio_if.h
@@ -0,0 +1,93 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_IF_H
+#define XIO_IF_H
+
+#include <linux/semaphore.h>
+
+#define HT_SHIFT 6 /* ???? */
+#define XIO_MAX_SEGMENT_SIZE (1U << (9+HT_SHIFT))
+
+#define MAX_BIO 32
+
+/************************ global tuning ***********************/
+
+extern int if_throttle_start_size; /* in kb */
+extern struct xio_limiter if_throttle;
+
+/***********************************************/
+
+/* I don't want to enhance or intrude into struct bio, for compatibility
+ * reasons (support for a variety of kernel versions).
+ * The following is just a silly workaround which could be removed again.
+ */
+struct bio_wrapper {
+ struct bio *bio;
+ atomic_t bi_comp_cnt;
+ unsigned long start_time;
+};
+
+struct if_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head plug_head;
+ struct list_head hash_head;
+ int hash_index;
+ int bio_count;
+ int current_len;
+ int max_len;
+ struct page *orig_page;
+ struct bio_wrapper *orig_biow[MAX_BIO];
+ struct if_input *input;
+};
+
+struct if_hash_anchor;
+
+struct if_input {
+ XIO_INPUT(if);
+ /* TODO: move this to if_brick (better systematics) */
+ struct list_head plug_anchor;
+ struct request_queue *q;
+ struct gendisk *disk;
+ struct block_device *bdev;
+ loff_t capacity;
+ atomic_t plugged_count;
+ atomic_t flying_count;
+
+ /* only for statistics */
+ atomic_t read_flying_count;
+ atomic_t write_flying_count;
+ atomic_t total_reada_count;
+ atomic_t total_read_count;
+ atomic_t total_write_count;
+ atomic_t total_empty_count;
+ atomic_t total_fire_count;
+ atomic_t total_skip_sync_count;
+ atomic_t total_aio_read_count;
+ atomic_t total_aio_write_count;
+ spinlock_t req_lock;
+ struct semaphore kick_sem;
+ struct if_hash_anchor *hash_table;
+};
+
+struct if_output {
+ XIO_OUTPUT(if);
+};
+
+struct if_brick {
+ XIO_BRICK(if);
+ /* parameters */
+ loff_t dev_size;
+ int max_plugged;
+ int readahead;
+ bool skip_sync;
+
+ /* inspectable */
+ atomic_t open_count;
+
+ /* private */
+ struct semaphore switch_sem;
+ struct say_channel *say_channel;
+};
+
+XIO_TYPES(if);
+
+#endif
--
2.0.0

2014-07-01 21:53:53

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 34/50] mars: add new file drivers/block/mars/xio_bricks/xio_if.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_if.c | 1037 ++++++++++++++++++++++++++++++++
1 file changed, 1037 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_if.c

diff --git a/drivers/block/mars/xio_bricks/xio_if.c b/drivers/block/mars/xio_bricks/xio_if.c
new file mode 100644
index 0000000..2c2fa42
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_if.c
@@ -0,0 +1,1037 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Interface to a Linux device.
+ * 1 Input, 0 Outputs.
+ */
+
+#define REQUEST_MERGING
+#define ALWAYS_UNPLUG true
+#define PREFETCH_LEN PAGE_SIZE
+
+/* low-level device parameters */
+#define USE_MAX_SECTORS (XIO_MAX_SEGMENT_SIZE >> 9)
+#define USE_MAX_PHYS_SEGMENTS (XIO_MAX_SEGMENT_SIZE >> 9)
+#define USE_MAX_SEGMENT_SIZE XIO_MAX_SEGMENT_SIZE
+#define USE_LOGICAL_BLOCK_SIZE 512
+#define USE_SEGMENT_BOUNDARY (PAGE_SIZE-1)
+
+#define USE_CONGESTED_FN
+#define USE_MERGE_BVEC
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/bio.h>
+#include <linux/major.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+#include <linux/xio.h>
+#include <linux/brick/lib_limiter.h>
+
+#ifndef XIO_MAJOR /* remove this later: fallback to old prepatch */
+#define XIO_MAJOR MARS_MAJOR
+#endif
+
+/************************ global tuning ***********************/
+
+int if_throttle_start_size = 0; /* in kb */
+EXPORT_SYMBOL_GPL(if_throttle_start_size);
+
+struct xio_limiter if_throttle = {
+ .lim_max_rate = 5000,
+};
+EXPORT_SYMBOL_GPL(if_throttle);
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_if.h>
+
+#define IF_HASH_MAX (PAGE_SIZE / sizeof(struct if_hash_anchor))
+#define IF_HASH_CHUNK (PAGE_SIZE * 32)
+
+struct if_hash_anchor {
+ spinlock_t hash_lock;
+ struct list_head hash_anchor;
+};
+
+/************************ own static definitions ***********************/
+
+/* TODO: check bounds, ensure that free minor numbers are recycled */
+static int device_minor;
+
+/*************** object * aspect constructors * destructors **************/
+
+/************************ linux operations ***********************/
+
+#ifdef part_stat_lock
+static
+void _if_start_io_acct(struct if_input *input, struct bio_wrapper *biow)
+{
+ struct bio *bio = biow->bio;
+ const int rw = bio_data_dir(bio);
+ const int cpu = part_stat_lock();
+
+ (void)cpu;
+ part_round_stats(cpu, &input->disk->part0);
+ part_stat_inc(cpu, &input->disk->part0, ios[rw]);
+ part_stat_add(cpu, &input->disk->part0, sectors[rw], bio->bi_iter.bi_size >> 9);
+ part_inc_in_flight(&input->disk->part0, rw);
+ part_stat_unlock();
+ biow->start_time = jiffies;
+}
+
+static
+void _if_end_io_acct(struct if_input *input, struct bio_wrapper *biow)
+{
+ unsigned long duration = jiffies - biow->start_time;
+ struct bio *bio = biow->bio;
+ const int rw = bio_data_dir(bio);
+ const int cpu = part_stat_lock();
+
+ (void)cpu;
+ part_stat_add(cpu, &input->disk->part0, ticks[rw], duration);
+ part_round_stats(cpu, &input->disk->part0);
+ part_dec_in_flight(&input->disk->part0, rw);
+ part_stat_unlock();
+}
+
+#else /* part_stat_lock */
+#define _if_start_io_acct(...) do {} while (0)
+#define _if_end_io_acct(...) do {} while (0)
+#endif
+
+/* callback
+ */
+static
+void if_endio(struct generic_callback *cb)
+{
+ struct if_aio_aspect *aio_a = cb->cb_private;
+ struct if_input *input;
+ int k;
+ int rw;
+ int error;
+
+ LAST_CALLBACK(cb);
+ if (unlikely(!aio_a || !aio_a->object)) {
+ XIO_FAT("aio_a = %p aio = %p, something is very wrong here!\n", aio_a, aio_a->object);
+ goto out_return;
+ }
+ input = aio_a->input;
+ CHECK_PTR(input, err);
+
+ rw = aio_a->object->io_rw;
+
+ for (k = 0; k < aio_a->bio_count; k++) {
+ struct bio_wrapper *biow;
+ struct bio *bio;
+
+ biow = aio_a->orig_biow[k];
+ aio_a->orig_biow[k] = NULL;
+ CHECK_PTR(biow, err);
+
+ CHECK_ATOMIC(&biow->bi_comp_cnt, 1);
+ if (!atomic_dec_and_test(&biow->bi_comp_cnt))
+ continue;
+
+ bio = biow->bio;
+ CHECK_PTR_NULL(bio, err);
+
+ _if_end_io_acct(input, biow);
+
+ error = CALLBACK_ERROR(aio_a->object);
+ if (unlikely(error < 0)) {
+ int bi_size = bio->bi_iter.bi_size;
+
+ XIO_ERR("NYI: error=%d RETRY LOGIC %u\n", error, bi_size);
+ } else { /* bio conventions are slightly different... */
+ error = 0;
+ bio->bi_iter.bi_size = 0;
+ }
+ bio_endio(bio, error);
+ bio_put(bio);
+ brick_mem_free(biow);
+ }
+ atomic_dec(&input->flying_count);
+ if (rw)
+ atomic_dec(&input->write_flying_count);
+ else
+ atomic_dec(&input->read_flying_count);
+ goto out_return;
+err:
+ XIO_FAT("error in callback, giving up\n");
+out_return:;
+}
+
+/* Kick off plugged aios
+ */
+static
+void _if_unplug(struct if_input *input)
+{
+ /* struct if_brick *brick = input->brick; */
+ LIST_HEAD(tmp_list);
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+ down(&input->kick_sem);
+ spin_lock(&input->req_lock);
+ if (!list_empty(&input->plug_anchor)) {
+ /* move over the whole list */
+ list_replace_init(&input->plug_anchor, &tmp_list);
+ atomic_set(&input->plugged_count, 0);
+ }
+ spin_unlock(&input->req_lock);
+ up(&input->kick_sem);
+
+ while (!list_empty(&tmp_list)) {
+ struct if_aio_aspect *aio_a;
+ struct aio_object *aio;
+ int hash_index;
+
+ aio_a = container_of(tmp_list.next, struct if_aio_aspect, plug_head);
+ list_del_init(&aio_a->plug_head);
+
+ hash_index = aio_a->hash_index;
+ spin_lock(&input->hash_table[hash_index].hash_lock);
+ list_del_init(&aio_a->hash_head);
+ spin_unlock(&input->hash_table[hash_index].hash_lock);
+
+ aio = aio_a->object;
+
+ if (unlikely(aio_a->current_len > aio_a->max_len))
+ XIO_ERR("request len %d > %d\n", aio_a->current_len, aio_a->max_len);
+ aio->io_len = aio_a->current_len;
+
+ atomic_inc(&input->flying_count);
+ atomic_inc(&input->total_fire_count);
+ if (aio->io_rw)
+ atomic_inc(&input->write_flying_count);
+ else
+ atomic_inc(&input->read_flying_count);
+ if (aio->io_skip_sync)
+ atomic_inc(&input->total_skip_sync_count);
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ }
+}
+
+/* accept a linux bio, convert to aio and call buf_io() on it.
+ */
+static
+#ifdef BIO_CPU_AFFINE
+int
+#else
+void
+#endif
+if_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct if_input *input = q->queuedata;
+ struct if_brick *brick = input->brick;
+
+ /* Original flags of the source bio
+ */
+ const int rw = bio_data_dir(bio);
+ const int sectors = bio_sectors(bio);
+
+/* adapt to different kernel versions (TBD: improve) */
+#if defined(BIO_RW_RQ_MASK) || defined(BIO_FLUSH)
+ const bool ahead = bio_rw_flagged(bio, BIO_RW_AHEAD) && rw == READ;
+ const bool barrier = bio_rw_flagged(bio, BIO_RW_BARRIER);
+ const bool syncio = bio_rw_flagged(bio, BIO_RW_SYNCIO);
+ const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
+ const bool meta = bio_rw_flagged(bio, BIO_RW_META);
+ const bool discard = bio_rw_flagged(bio, BIO_RW_DISCARD);
+ const bool noidle = bio_rw_flagged(bio, BIO_RW_NOIDLE);
+
+#elif defined(REQ_FLUSH) && defined(REQ_SYNC)
+#define _flagged(x) (bio->bi_rw & (x))
+ const bool ahead = _flagged(REQ_RAHEAD) && rw == READ;
+ const bool barrier = _flagged(REQ_FLUSH);
+ const bool syncio = _flagged(REQ_SYNC);
+ const bool unplug = false;
+ const bool meta = _flagged(REQ_META);
+ const bool discard = _flagged(REQ_DISCARD);
+ const bool noidle = _flagged(REQ_THROTTLED);
+
+#else
+#error Cannot decode the bio flags
+#endif
+ const int prio = bio_prio(bio);
+
+ /* Transform into XIO flags
+ */
+ const int io_prio =
+ (prio == IOPRIO_CLASS_RT || (meta | syncio)) ?
+ XIO_PRIO_HIGH :
+ (prio == IOPRIO_CLASS_IDLE) ?
+ XIO_PRIO_LOW :
+ XIO_PRIO_NORMAL;
+ const bool do_unplug = ALWAYS_UNPLUG | unplug | noidle;
+ const bool do_skip_sync = brick->skip_sync && !(barrier | syncio);
+
+ struct bio_wrapper *biow;
+ struct aio_object *aio = NULL;
+ struct if_aio_aspect *aio_a;
+
+ struct bio_vec bvec;
+ struct bvec_iter i;
+
+ loff_t pos = ((loff_t)bio->bi_iter.bi_sector) << 9; /* TODO: make dynamic */
+ int total_len = bio->bi_iter.bi_size;
+
+ bool assigned = false;
+ int error = -ENOSYS;
+
+ bind_to_channel(brick->say_channel, current);
+
+ might_sleep();
+
+ if (unlikely(!sectors)) {
+ _if_unplug(input);
+ /* THINK: usually this happens only at write barriers.
+ * We have no "barrier" operation in XIO, since
+ * callback semantics should always denote
+ * "writethrough accomplished".
+ * In case of exceptional semantics, we need to do
+ * something here. For now, we simply do nothing.
+ */
+ bio_endio(bio, 0);
+ error = 0;
+ goto done;
+ }
+
+ /* throttling of too big write requests */
+ if (rw && if_throttle_start_size > 0) {
+ int kb = (total_len + 512) / 1024;
+
+ if (kb >= if_throttle_start_size)
+ xio_limit_sleep(&if_throttle, kb);
+ }
+
+ (void)ahead; /* shut up gcc */
+ if (unlikely(discard)) { /* NYI */
+ bio_endio(bio, 0);
+ error = 0;
+ goto done;
+ }
+
+ biow = brick_mem_alloc(sizeof(struct bio_wrapper));
+ biow->bio = bio;
+ atomic_set(&biow->bi_comp_cnt, 0);
+
+ if (rw)
+ atomic_inc(&input->total_write_count);
+ else
+ atomic_inc(&input->total_read_count);
+ _if_start_io_acct(input, biow);
+
+ /* Get a reference to the bio.
+ * Will be released after bio_endio().
+ */
+ atomic_inc(&bio->bi_cnt);
+
+ /* FIXME: THIS IS PROVISIONAL (use event instead)
+ */
+ while (unlikely(!brick->power.on_led))
+ brick_msleep(100);
+
+ down(&input->kick_sem);
+
+ bio_for_each_segment(bvec, bio, i) {
+ struct page *page = bvec.bv_page;
+ int bv_len = bvec.bv_len;
+ int offset = bvec.bv_offset;
+
+ void *data;
+
+#ifdef ARCH_HAS_KMAP
+#error FIXME: the current infrastructure cannot deal with HIGHMEM / kmap()
+#endif
+ data = page_address(page);
+ error = -EINVAL;
+ if (unlikely(!data))
+ break;
+
+ data += offset;
+
+ while (bv_len > 0) {
+ struct list_head *tmp;
+ int hash_index;
+ int this_len = 0;
+
+ aio = NULL;
+ aio_a = NULL;
+
+ hash_index = (pos / IF_HASH_CHUNK) % IF_HASH_MAX;
+
+#ifdef REQUEST_MERGING
+ spin_lock(&input->hash_table[hash_index].hash_lock);
+ for (tmp = input->hash_table[hash_index].hash_anchor.next; tmp != &input->hash_table[hash_index].hash_anchor; tmp = tmp->next) {
+ struct if_aio_aspect *tmp_a;
+ struct aio_object *tmp_aio;
+ int i;
+
+ tmp_a = container_of(tmp, struct if_aio_aspect, hash_head);
+ tmp_aio = tmp_a->object;
+ if (tmp_a->orig_page != page || tmp_aio->io_rw != rw || tmp_a->bio_count >= MAX_BIO || tmp_a->current_len + bv_len > tmp_a->max_len)
+ continue;
+
+ if (tmp_aio->io_data + tmp_a->current_len == data)
+ goto merge_end;
+ continue;
+
+merge_end:
+ tmp_a->current_len += bv_len;
+ aio = tmp_aio;
+ aio_a = tmp_a;
+ this_len = bv_len;
+ if (!do_skip_sync)
+ aio->io_skip_sync = false;
+
+ for (i = 0; i < aio_a->bio_count; i++) {
+ if (aio_a->orig_biow[i]->bio == bio)
+ goto unlock;
+ }
+
+ CHECK_ATOMIC(&biow->bi_comp_cnt, 0);
+ atomic_inc(&biow->bi_comp_cnt);
+ aio_a->orig_biow[aio_a->bio_count++] = biow;
+ assigned = true;
+ goto unlock;
+ } /* foreach hash collision list member */
+
+unlock:
+ spin_unlock(&input->hash_table[hash_index].hash_lock);
+#endif
+ if (!aio) {
+ int prefetch_len;
+
+ error = -ENOMEM;
+ aio = if_alloc_aio(brick);
+ aio_a = if_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a)) {
+ up(&input->kick_sem);
+ goto err;
+ }
+
+#ifdef PREFETCH_LEN
+ prefetch_len = PREFETCH_LEN - offset;
+/**/
+ if (prefetch_len > total_len)
+ prefetch_len = total_len;
+ if (pos + prefetch_len > brick->dev_size)
+ prefetch_len = brick->dev_size - pos;
+ if (prefetch_len < bv_len)
+ prefetch_len = bv_len;
+#else
+ prefetch_len = bv_len;
+#endif
+
+ SETUP_CALLBACK(aio, if_endio, aio_a);
+
+ aio_a->input = input;
+ aio->io_rw = aio->io_may_write = rw;
+ aio->io_pos = pos;
+ aio->io_len = prefetch_len;
+ aio->io_data = data; /* direct IO */
+ aio->io_prio = io_prio;
+ aio_a->orig_page = page;
+
+ error = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(error < 0)) {
+ up(&input->kick_sem);
+ goto err;
+ }
+
+ this_len = aio->io_len; /* now may be shorter than originally requested. */
+ aio_a->max_len = this_len;
+ if (this_len > bv_len)
+ this_len = bv_len;
+ aio_a->current_len = this_len;
+ if (rw)
+ atomic_inc(&input->total_aio_write_count);
+ else
+ atomic_inc(&input->total_aio_read_count);
+ CHECK_ATOMIC(&biow->bi_comp_cnt, 0);
+ atomic_inc(&biow->bi_comp_cnt);
+ aio_a->orig_biow[0] = biow;
+ aio_a->bio_count = 1;
+ assigned = true;
+
+ /* When a bio with multiple biovecs is split into
+ * multiple aios, only the last one should be
+ * working in synchronous writethrough mode.
+ */
+ aio->io_skip_sync = true;
+ if (!do_skip_sync && i.bi_idx + 1 >= bio->bi_iter.bi_idx)
+ aio->io_skip_sync = false;
+
+ atomic_inc(&input->plugged_count);
+
+ aio_a->hash_index = hash_index;
+ spin_lock(&input->hash_table[hash_index].hash_lock);
+ list_add_tail(&aio_a->hash_head, &input->hash_table[hash_index].hash_anchor);
+ spin_unlock(&input->hash_table[hash_index].hash_lock);
+
+ spin_lock(&input->req_lock);
+ list_add_tail(&aio_a->plug_head, &input->plug_anchor);
+ spin_unlock(&input->req_lock);
+ } /* !aio */
+
+ pos += this_len;
+ data += this_len;
+ bv_len -= this_len;
+ total_len -= this_len;
+ } /* while bv_len > 0 */
+ } /* foreach bvec */
+
+ up(&input->kick_sem);
+
+ if (likely(!total_len))
+ error = 0;
+ else
+ XIO_ERR("bad rest len = %d\n", total_len);
+err:
+ if (error < 0) {
+ XIO_ERR("cannot submit request from bio, status=%d\n", error);
+ if (!assigned)
+ bio_endio(bio, error);
+ }
+
+ if (do_unplug ||
+ (brick && brick->max_plugged > 0 && atomic_read(&input->plugged_count) > brick->max_plugged)) {
+ _if_unplug(input);
+ }
+
+done:
+ remove_binding_from(brick->say_channel, current);
+
+#ifdef BIO_CPU_AFFINE
+ return error;
+#else
+ goto out_return;
+#endif
+out_return:;
+}
+
+#ifndef BLK_MAX_REQUEST_COUNT
+/* static */
+void if_unplug(struct request_queue *q)
+{
+ struct if_input *input = q->queuedata;
+
+ spin_lock_irq(q->queue_lock);
+ (void)blk_remove_plug(q); /* the plug state is not needed here */
+ spin_unlock_irq(q->queue_lock);
+
+ _if_unplug(input);
+}
+#endif
+
+/* static */
+int xio_congested(void *data, int bdi_bits)
+{
+ struct if_input *input = data;
+ int ret = 0;
+
+ if (bdi_bits & (1 << BDI_sync_congested) &&
+ atomic_read(&input->read_flying_count) > 0) {
+ ret |= (1 << BDI_sync_congested);
+ }
+ if (bdi_bits & (1 << BDI_async_congested) &&
+ atomic_read(&input->write_flying_count) > 0) {
+ ret |= (1 << BDI_async_congested);
+ }
+ return ret;
+}
+
+static
+int xio_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
+{
+ unsigned int bio_size = bvm->bi_size;
+
+ if (!bio_size)
+ return bvec->bv_len;
+ return 128;
+}
+
+static
+loff_t if_get_capacity(struct if_brick *brick)
+{
+ /* Don't always read; read only when unknown.
+ * brick->dev_size may be different from underlying sizes,
+ * e.g. when the size symlink indicates a logically smaller
+ * device than physically.
+ */
+ if (brick->dev_size <= 0) {
+ struct xio_info info = {};
+ struct if_input *input = brick->inputs[0];
+ int status;
+
+ status = GENERIC_INPUT_CALL(input, xio_get_info, &info);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot get device info, status=%d\n", status);
+ return 0;
+ }
+ XIO_INF("determined default capacity: %lld bytes\n", info.current_size);
+ brick->dev_size = info.current_size;
+ }
+ return brick->dev_size;
+}
+
+static
+void if_set_capacity(struct if_input *input, loff_t capacity)
+{
+ CHECK_PTR(input->disk, done);
+ CHECK_PTR(input->disk->disk_name, done);
+ XIO_INF("new capacity of '%s': %lld bytes\n", input->disk->disk_name, capacity);
+ input->capacity = capacity;
+ set_capacity(input->disk, capacity >> 9);
+ if (likely(input->bdev && input->bdev->bd_inode))
+ i_size_write(input->bdev->bd_inode, capacity);
+done:;
+}
+
+static const struct block_device_operations if_blkdev_ops;
+
+static int if_switch(struct if_brick *brick)
+{
+ struct if_input *input = brick->inputs[0];
+ struct request_queue *q;
+ struct gendisk *disk;
+ int minor;
+ int status = 0;
+
+ down(&brick->switch_sem);
+
+ /* brick is in operation */
+ if (brick->power.button && brick->power.on_led) {
+ loff_t capacity;
+
+ capacity = if_get_capacity(brick);
+ if (capacity > 0 && capacity != input->capacity) {
+ XIO_INF("changing capacity from %lld to %lld\n",
+ (long long)input->capacity,
+ (long long)capacity);
+ if_set_capacity(input, capacity);
+ }
+ }
+
+ /* brick should be switched on */
+ if (brick->power.button && brick->power.off_led) {
+ loff_t capacity;
+
+ xio_set_power_off_led((void *)brick, false);
+ brick->say_channel = get_binding(current);
+
+ status = -ENOMEM;
+ q = blk_alloc_queue(GFP_BRICK);
+ if (!q) {
+ XIO_ERR("cannot allocate device request queue\n");
+ goto is_down;
+ }
+ q->queuedata = input;
+ input->q = q;
+
+ disk = alloc_disk(1);
+ if (!disk) {
+ XIO_ERR("cannot allocate gendisk\n");
+ goto is_down;
+ }
+
+ minor = device_minor++; /* TODO: protect against races (e.g. atomic_t) */
+ set_disk_ro(disk, true);
+
+ disk->queue = q;
+ disk->major = XIO_MAJOR; /* TODO: make this dynamic for >256 devices */
+ disk->first_minor = minor;
+ disk->fops = &if_blkdev_ops;
+ snprintf(disk->disk_name, sizeof(disk->disk_name), "%s", brick->brick_name);
+ disk->private_data = input;
+ input->disk = disk;
+ capacity = if_get_capacity(brick);
+ XIO_DBG("created device name %s, capacity=%lld\n", disk->disk_name, capacity);
+ if_set_capacity(input, capacity);
+
+ blk_queue_make_request(q, if_make_request);
+#ifdef USE_MAX_SECTORS
+#ifdef MAX_SEGMENT_SIZE
+ XIO_DBG("blk_queue_max_sectors()\n");
+ blk_queue_max_sectors(q, USE_MAX_SECTORS);
+#else
+ XIO_DBG("blk_queue_max_hw_sectors()\n");
+ blk_queue_max_hw_sectors(q, USE_MAX_SECTORS);
+#endif
+#endif
+#ifdef USE_MAX_PHYS_SEGMENTS
+#ifdef MAX_SEGMENT_SIZE
+ XIO_DBG("blk_queue_max_phys_segments()\n");
+ blk_queue_max_phys_segments(q, USE_MAX_PHYS_SEGMENTS);
+#else
+ XIO_DBG("blk_queue_max_segments()\n");
+ blk_queue_max_segments(q, USE_MAX_PHYS_SEGMENTS);
+#endif
+#endif
+#ifdef USE_MAX_HW_SEGMENTS
+ XIO_DBG("blk_queue_max_hw_segments()\n");
+ blk_queue_max_hw_segments(q, USE_MAX_HW_SEGMENTS);
+#endif
+#ifdef USE_MAX_SEGMENT_SIZE
+ XIO_DBG("blk_queue_max_segment_size()\n");
+ blk_queue_max_segment_size(q, USE_MAX_SEGMENT_SIZE);
+#endif
+#ifdef USE_LOGICAL_BLOCK_SIZE
+ XIO_DBG("blk_queue_logical_block_size()\n");
+ blk_queue_logical_block_size(q, USE_LOGICAL_BLOCK_SIZE);
+#endif
+#ifdef USE_SEGMENT_BOUNDARY
+ XIO_DBG("blk_queue_segment_boundary()\n");
+ blk_queue_segment_boundary(q, USE_SEGMENT_BOUNDARY);
+#endif
+#ifdef QUEUE_ORDERED_DRAIN
+ XIO_DBG("blk_queue_ordered()\n");
+ blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
+#endif
+ XIO_DBG("blk_queue_bounce_limit()\n");
+ blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+#ifndef BLK_MAX_REQUEST_COUNT
+ XIO_DBG("unplug_fn\n");
+ q->unplug_fn = if_unplug;
+#endif
+ XIO_DBG("queue_lock\n");
+ q->queue_lock = &input->req_lock; /* needed! */
+
+ input->bdev = bdget(MKDEV(disk->major, minor));
+ /* we have no partitions. we contain only ourselves. */
+ input->bdev->bd_contains = input->bdev;
+
+#ifdef USE_CONGESTED_FN
+ XIO_DBG("congested_fn\n");
+ q->backing_dev_info.congested_fn = xio_congested;
+ q->backing_dev_info.congested_data = input;
+#endif
+#ifdef USE_MERGE_BVEC
+ XIO_DBG("blk_queue_merge_bvec()\n");
+ blk_queue_merge_bvec(q, xio_merge_bvec);
+#endif
+
+ /* point of no return */
+ XIO_DBG("add_disk()\n");
+ add_disk(disk);
+ set_disk_ro(disk, false);
+
+ /* report success */
+ xio_set_power_on_led((void *)brick, true);
+ status = 0;
+ }
+
+ /* brick should be switched off */
+ if (!brick->power.button && !brick->power.off_led) {
+ int flying;
+
+ xio_set_power_on_led((void *)brick, false);
+ disk = input->disk;
+ if (!disk)
+ goto is_down;
+
+ if (atomic_read(&brick->open_count) > 0) {
+ XIO_INF("device '%s' is open %d times, cannot shutdown\n",
+ disk->disk_name,
+ atomic_read(&brick->open_count));
+ status = -EBUSY;
+ goto done; /* don't indicate "off" status */
+ }
+ flying = atomic_read(&input->flying_count);
+ if (flying > 0) {
+ XIO_INF("device '%s' has %d flying requests, cannot shutdown\n", disk->disk_name, flying);
+ status = -EBUSY;
+ goto done; /* don't indicate "off" status */
+ }
+ if (input->bdev) {
+ XIO_DBG("calling bdput()\n");
+ bdput(input->bdev);
+ input->bdev = NULL;
+ }
+ XIO_DBG("calling del_gendisk()\n");
+ del_gendisk(input->disk);
+ XIO_DBG("calling put_disk()\n");
+ put_disk(input->disk);
+ input->disk = NULL;
+ status = 0;
+is_down:
+ xio_set_power_off_led((void *)brick, true);
+ }
+
+done:
+ up(&brick->switch_sem);
+ return status;
+}
+
+/*************** interface to the outer world (kernel) **************/
+
+static int if_open(struct block_device *bdev, fmode_t mode)
+{
+ struct if_input *input;
+ struct if_brick *brick;
+
+ if (unlikely(!bdev || !bdev->bd_disk)) {
+ XIO_ERR("----------------------- INVAL ------------------------------\n");
+ return -EINVAL;
+ }
+
+ input = bdev->bd_disk->private_data;
+
+ if (unlikely(!input || !input->brick)) {
+ XIO_ERR("----------------------- BAD IF SETUP ------------------------------\n");
+ return -EINVAL;
+ }
+ brick = input->brick;
+
+ down(&brick->switch_sem);
+
+ if (unlikely(!brick->power.on_led)) {
+ XIO_INF("----------------------- BUSY %d ------------------------------\n",
+ atomic_read(&brick->open_count));
+ up(&brick->switch_sem);
+ return -EBUSY;
+ }
+
+ atomic_inc(&brick->open_count);
+
+ XIO_INF("----------------------- OPEN %d ------------------------------\n", atomic_read(&brick->open_count));
+
+ up(&brick->switch_sem);
+ return 0;
+}
+
+static
+void
+if_release(struct gendisk *gd, fmode_t mode)
+{
+ struct if_input *input = gd->private_data;
+ struct if_brick *brick = input->brick;
+ int nr;
+
+ XIO_INF("----------------------- CLOSE %d ------------------------------\n", atomic_read(&brick->open_count));
+
+ if (atomic_dec_and_test(&brick->open_count)) {
+ while ((nr = atomic_read(&input->flying_count)) > 0) {
+ XIO_INF("%d IO requests not yet completed\n", nr);
+ brick_msleep(1000);
+ }
+
+ XIO_DBG("status button=%d on_led=%d off_led=%d\n",
+ brick->power.button,
+ brick->power.on_led,
+ brick->power.off_led);
+ local_trigger();
+ }
+}
+
+static const struct block_device_operations if_blkdev_ops = {
+ .owner = THIS_MODULE,
+ .open = if_open,
+ .release = if_release,
+
+};
+
+/*************** informational * statistics **************/
+
+static
+char *if_statistics(struct if_brick *brick, int verbose)
+{
+ struct if_input *input = brick->inputs[0];
+ char *res = brick_string_alloc(512);
+ int tmp0 = atomic_read(&input->total_reada_count);
+ int tmp1 = atomic_read(&input->total_read_count);
+ int tmp2 = atomic_read(&input->total_aio_read_count);
+ int tmp3 = atomic_read(&input->total_write_count);
+ int tmp4 = atomic_read(&input->total_aio_write_count);
+
+ snprintf(res, 512,
+ "total reada = %d reads = %d aio_reads = %d (%d%%) writes = %d aio_writes = %d (%d%%) empty = %d fired = %d skip_sync = %d | plugged = %d flying = %d (reads = %d writes = %d)\n",
+ tmp0,
+ tmp1,
+ tmp2,
+ tmp1 ? tmp2 * 100 / tmp1 : 0,
+ tmp3,
+ tmp4,
+ tmp3 ? tmp4 * 100 / tmp3 : 0,
+ atomic_read(&input->total_empty_count),
+ atomic_read(&input->total_fire_count),
+ atomic_read(&input->total_skip_sync_count),
+ atomic_read(&input->plugged_count),
+ atomic_read(&input->flying_count),
+ atomic_read(&input->read_flying_count),
+ atomic_read(&input->write_flying_count));
+ return res;
+}
+
+static
+void if_reset_statistics(struct if_brick *brick)
+{
+ struct if_input *input = brick->inputs[0];
+
+ atomic_set(&input->total_read_count, 0);
+ atomic_set(&input->total_write_count, 0);
+ atomic_set(&input->total_empty_count, 0);
+ atomic_set(&input->total_fire_count, 0);
+ atomic_set(&input->total_skip_sync_count, 0);
+ atomic_set(&input->total_aio_read_count, 0);
+ atomic_set(&input->total_aio_write_count, 0);
+}
+
+/***************** own brick * input * output operations *****************/
+
+/* none */
+
+/*************** object * aspect constructors * destructors **************/
+
+static int if_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct if_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->plug_head);
+ INIT_LIST_HEAD(&ini->hash_head);
+ return 0;
+}
+
+static void if_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct if_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->plug_head);
+ CHECK_HEAD_EMPTY(&ini->hash_head);
+}
+
+XIO_MAKE_STATICS(if);
+
+/*********************** constructors * destructors ***********************/
+
+static int if_brick_construct(struct if_brick *brick)
+{
+ sema_init(&brick->switch_sem, 1);
+ atomic_set(&brick->open_count, 0);
+ return 0;
+}
+
+static int if_brick_destruct(struct if_brick *brick)
+{
+ return 0;
+}
+
+static int if_input_construct(struct if_input *input)
+{
+ int i;
+
+ input->hash_table = brick_block_alloc(0, PAGE_SIZE);
+ for (i = 0; i < IF_HASH_MAX; i++) {
+ spin_lock_init(&input->hash_table[i].hash_lock);
+ INIT_LIST_HEAD(&input->hash_table[i].hash_anchor);
+ }
+ INIT_LIST_HEAD(&input->plug_anchor);
+ sema_init(&input->kick_sem, 1);
+ spin_lock_init(&input->req_lock);
+ atomic_set(&input->flying_count, 0);
+ atomic_set(&input->read_flying_count, 0);
+ atomic_set(&input->write_flying_count, 0);
+ atomic_set(&input->plugged_count, 0);
+ return 0;
+}
+
+static int if_input_destruct(struct if_input *input)
+{
+ int i;
+
+ for (i = 0; i < IF_HASH_MAX; i++)
+ CHECK_HEAD_EMPTY(&input->hash_table[i].hash_anchor);
+ CHECK_HEAD_EMPTY(&input->plug_anchor);
+ brick_block_free(input->hash_table, PAGE_SIZE);
+ return 0;
+}
+
+static int if_output_construct(struct if_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct if_brick_ops if_brick_ops = {
+ .brick_switch = if_switch,
+ .brick_statistics = if_statistics,
+ .reset_statistics = if_reset_statistics,
+};
+
+static struct if_output_ops if_output_ops;
+
+const struct if_input_type if_input_type = {
+ .type_name = "if_input",
+ .input_size = sizeof(struct if_input),
+ .input_construct = &if_input_construct,
+ .input_destruct = &if_input_destruct,
+};
+
+static const struct if_input_type *if_input_types[] = {
+ &if_input_type,
+};
+
+const struct if_output_type if_output_type = {
+ .type_name = "if_output",
+ .output_size = sizeof(struct if_output),
+ .master_ops = &if_output_ops,
+ .output_construct = &if_output_construct,
+};
+
+static const struct if_output_type *if_output_types[] = {
+ &if_output_type,
+};
+
+const struct if_brick_type if_brick_type = {
+ .type_name = "if_brick",
+ .brick_size = sizeof(struct if_brick),
+ .max_inputs = 1,
+ .max_outputs = 0,
+ .master_ops = &if_brick_ops,
+ .aspect_types = if_aspect_types,
+ .default_input_types = if_input_types,
+ .default_output_types = if_output_types,
+ .brick_construct = &if_brick_construct,
+ .brick_destruct = &if_brick_destruct,
+};
+EXPORT_SYMBOL_GPL(if_brick_type);
+
+/***************** module init stuff ************************/
+
+void exit_xio_if(void)
+{
+ int status;
+
+ XIO_INF("exit_if()\n");
+ status = if_unregister_brick_type();
+ if (status)
+ XIO_WRN("if_unregister_brick_type() status=%d\n", status);
+ unregister_blkdev(XIO_MAJOR, "xio");
+}
+
+int __init init_xio_if(void)
+{
+ int status;
+
+ (void)if_aspect_types; /* not used, shut up gcc */
+
+ XIO_INF("init_if()\n");
+ status = register_blkdev(XIO_MAJOR, "xio");
+ if (status)
+ return status;
+ status = if_register_brick_type();
+ if (status)
+ goto err_device;
+ return status;
+err_device:
+ XIO_ERR("init_if() status=%d\n", status);
+ exit_xio_if();
+ return status;
+}
--
2.0.0

2014-07-01 21:57:37

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 08/50] mars: add new file include/linux/brick/meta.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/meta.h | 90 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)
create mode 100644 include/linux/brick/meta.h

diff --git a/include/linux/brick/meta.h b/include/linux/brick/meta.h
new file mode 100644
index 0000000..44104fd
--- /dev/null
+++ b/include/linux/brick/meta.h
@@ -0,0 +1,90 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef META_H
+#define META_H
+
+/***********************************************************************/
+
+/* metadata descriptions */
+
+/* The idea is to describe your C structures in such a way that
+ * transfers to disk or over a network become self-describing.
+ *
+ * In essence, this is a kind of version-independent marshalling.
+ *
+ * Advantage:
+ * When you extend your original C struct (and of course update the
+ * corresponding meta structure), old data on disk (or network peers
+ * running an old version of your program) will remain valid.
+ * Upon read, newly added fields missing in the old version will simply
+ * not be filled in and therefore remain zeroed (if you don't forget to
+ * initially clear your structures via memset() / initializers / etc).
+ * Note that this works only if you never rename or remove existing
+ * fields; you should only add new ones.
+ * [TODO: add macros for description of ignored / renamed fields to
+ * overcome this limitation]
+ * You may increase the size of integers, for example from 32bit to 64bit
+ * or even higher; sign extension will be automatically carried out
+ * when necessary.
+ * Also, you may change the order of fields, because the metadata interpreter
+ * will check each field individually; field offsets are automatically
+ * maintained.
+ *
+ * Disadvantage: this adds some (small) overhead.
+ */
+
+enum field_type {
+ FIELD_DONE,
+ FIELD_REF,
+ FIELD_SUB,
+ FIELD_STRING,
+ FIELD_RAW,
+ FIELD_INT,
+ FIELD_UINT,
+};
+
+struct meta {
+ /* char field_name[MAX_FIELD_LEN]; */
+ char *field_name;
+
+ short field_type;
+ short field_data_size;
+ short field_transfer_size;
+ int field_offset;
+ const struct meta *field_ref;
+};
+
+#define _META_INI(NAME, STRUCT, TYPE, TSIZE) \
+ .field_name = #NAME, \
+ .field_type = TYPE, \
+ .field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+ .field_transfer_size = (TSIZE), \
+ .field_offset = offsetof(STRUCT, NAME) \
+
+#define META_INI_TRANSFER(NAME, STRUCT, TYPE, TSIZE) \
+ { _META_INI(NAME, STRUCT, TYPE, TSIZE) }
+
+#define META_INI(NAME, STRUCT, TYPE) \
+ { _META_INI(NAME, STRUCT, TYPE, 0) }
+
+#define _META_INI_AIO(NAME, STRUCT, AIO) \
+ .field_name = #NAME, \
+ .field_type = FIELD_REF, \
+ .field_data_size = sizeof(*(((STRUCT *)NULL)->NAME)), \
+ .field_offset = offsetof(STRUCT, NAME), \
+ .field_ref = AIO
+
+#define META_INI_AIO(NAME, STRUCT, AIO) { _META_INI_AIO(NAME, STRUCT, AIO) }
+
+#define _META_INI_SUB(NAME, STRUCT, SUB) \
+ .field_name = #NAME, \
+ .field_type = FIELD_SUB, \
+ .field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+ .field_offset = offsetof(STRUCT, NAME), \
+ .field_ref = SUB
+
+#define META_INI_SUB(NAME, STRUCT, SUB) { _META_INI_SUB(NAME, STRUCT, SUB) }
+
+extern const struct meta *find_meta(const struct meta *meta, const char *field_name);
+/* extern void free_meta(void *data, const struct meta *meta); */
+
+#endif
--
2.0.0
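
To illustrate the meta descriptors above with a (hypothetical)
example: a record structure and its description could look as
follows; transfer routines like xio_send_struct() / xio_recv_struct()
(as used by the client brick) then marshal the struct field by field,
guided by the description. All names here are invented for
illustration only:

	struct demo_record {
		int magic;
		long long pos;	/* was 32bit in an older version */
		char *name;
	};

	const struct meta demo_record_meta[] = {
		META_INI(magic, struct demo_record, FIELD_INT),
		META_INI(pos, struct demo_record, FIELD_INT),
		META_INI(name, struct demo_record, FIELD_STRING),
		{}	/* zeroed entry = FIELD_DONE terminates the list */
	};

An old peer which does not know pos will simply skip it, while a new
peer reading old data leaves pos zeroed (with sign extension carried
out automatically where integer sizes differ).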

2014-07-01 21:57:35

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 18/50] mars: add new file drivers/block/mars/lib_limiter.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/lib_limiter.c | 103 +++++++++++++++++++++++++++++++++++++++
1 file changed, 103 insertions(+)
create mode 100644 drivers/block/mars/lib_limiter.c

diff --git a/drivers/block/mars/lib_limiter.c b/drivers/block/mars/lib_limiter.c
new file mode 100644
index 0000000..e8984b7
--- /dev/null
+++ b/drivers/block/mars/lib_limiter.c
@@ -0,0 +1,103 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/brick/lib_limiter.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#define LIMITER_TIME_RESOLUTION NSEC_PER_SEC
+
+int xio_limit(struct xio_limiter *lim, int amount)
+{
+ int delay = 0;
+ long long now;
+
+ now = cpu_clock(raw_smp_processor_id());
+
+ /* Compute the maximum delay along the path
+ * down to the root of the hierarchy tree.
+ */
+ while (lim != NULL) {
+ long long window = now - lim->lim_stamp;
+
+ /* Sometimes, raw CPU clocks may do weird things...
+ * Windows smaller than 1s in the denominator could fake unrealistically high rates.
+ */
+ if (unlikely(lim->lim_min_window <= 0))
+ lim->lim_min_window = 1000;
+ if (unlikely(lim->lim_max_window <= lim->lim_min_window))
+ lim->lim_max_window = lim->lim_min_window + 8000;
+ if (unlikely(window < (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000)))
+ window = (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+
+ /* Only use incremental accumulation at repeated calls, but
+ * never after longer pauses.
+ */
+ if (likely(lim->lim_stamp &&
+ window < (long long)lim->lim_max_window * (LIMITER_TIME_RESOLUTION / 1000))) {
+ long long rate_raw;
+ int rate;
+
+ /* Races are possible, but taken into account.
+ * There is no real harm from rarely lost updates.
+ */
+ if (likely(amount > 0)) {
+ lim->lim_accu += amount;
+ lim->lim_cumul += amount;
+ lim->lim_count++;
+ }
+
+ rate_raw = lim->lim_accu * LIMITER_TIME_RESOLUTION / window;
+ rate = rate_raw;
+ if (unlikely(rate_raw > INT_MAX))
+ rate = INT_MAX;
+ lim->lim_rate = rate;
+
+ /* limit exceeded? */
+ if (lim->lim_max_rate > 0 && rate > lim->lim_max_rate) {
+ int this_delay = (window * rate / lim->lim_max_rate - window) / (LIMITER_TIME_RESOLUTION / 1000);
+
+ /* compute maximum */
+ if (this_delay > delay && this_delay > 0)
+ delay = this_delay;
+ }
+
+ /* Try to keep the next window below min_window
+ */
+ window -= lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+ if (window > 0) {
+ long long used_up = (long long)lim->lim_rate * window / LIMITER_TIME_RESOLUTION;
+
+ if (used_up > 0) {
+ lim->lim_stamp += window;
+ lim->lim_accu -= used_up;
+ if (unlikely(lim->lim_accu < 0))
+ lim->lim_accu = 0;
+ }
+ }
+ } else { /* reset, start over with new measurement cycle */
+ if (unlikely(amount < 0))
+ amount = 0;
+ lim->lim_accu = amount;
+ lim->lim_stamp = now - lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+ lim->lim_rate = 0;
+ }
+ lim = lim->lim_father;
+ }
+ return delay;
+}
+EXPORT_SYMBOL_GPL(xio_limit);
+
+void xio_limit_sleep(struct xio_limiter *lim, int amount)
+{
+ int sleep = xio_limit(lim, amount);
+
+ if (sleep > 0) {
+ if (unlikely(lim->lim_max_delay <= 0))
+ lim->lim_max_delay = 1000;
+ if (sleep > lim->lim_max_delay)
+ sleep = lim->lim_max_delay;
+ brick_msleep(sleep);
+ }
+}
+EXPORT_SYMBOL_GPL(xio_limit_sleep);
--
2.0.0
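
For illustration, a (hypothetical) caller could throttle a transfer
loop as sketched below. lim_max_rate is interpreted in units of the
amount argument (KB in the existing callers, cf. the xio_limit_sleep()
call in xio_if.c); limiters may additionally be chained via lim_father,
in which case the maximum delay along the path to the root is applied.
All names below are invented for this example:

	static struct xio_limiter demo_limiter = {
		/* ~10 MB/s when the accounted amount is in KB */
		.lim_max_rate = 10000,
	};

	static void demo_transfer(int chunk_kb)
	{
		/* accounts chunk_kb into the current measurement window,
		 * recomputes the observed rate, and sleeps (bounded by
		 * lim_max_delay) whenever lim_max_rate would be exceeded
		 */
		xio_limit_sleep(&demo_limiter, chunk_kb);
	}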

2014-07-01 21:57:33

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 17/50] mars: add new file include/linux/brick/lib_limiter.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_limiter.h | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 include/linux/brick/lib_limiter.h

diff --git a/include/linux/brick/lib_limiter.h b/include/linux/brick/lib_limiter.h
new file mode 100644
index 0000000..87db968
--- /dev/null
+++ b/include/linux/brick/lib_limiter.h
@@ -0,0 +1,33 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_LIB_LIMITER_H
+#define BRICK_LIB_LIMITER_H
+
+#include <linux/brick/brick.h>
+
+#include <linux/utsname.h>
+
+struct xio_limiter {
+ /* hierarchy tree */
+ struct xio_limiter *lim_father;
+
+ /* tunables */
+ int lim_max_rate;
+ int lim_max_delay;
+ int lim_min_window;
+ int lim_max_window;
+
+ /* readable */
+ int lim_rate;
+ int lim_cumul;
+ int lim_count;
+ long long lim_stamp;
+
+ /* internal */
+ long long lim_accu;
+};
+
+extern int xio_limit(struct xio_limiter *lim, int amount);
+
+extern void xio_limit_sleep(struct xio_limiter *lim, int amount);
+
+#endif
--
2.0.0

2014-07-01 21:58:30

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 05/50] mars: add new file include/linux/brick/brick_mem.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/brick_mem.h | 202 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 202 insertions(+)
create mode 100644 include/linux/brick/brick_mem.h

diff --git a/include/linux/brick/brick_mem.h b/include/linux/brick/brick_mem.h
new file mode 100644
index 0000000..198bb05
--- /dev/null
+++ b/include/linux/brick/brick_mem.h
@@ -0,0 +1,202 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef BRICK_MEM_H
+#define BRICK_MEM_H
+
+#include <linux/mm_types.h>
+
+#define BRICK_DEBUG_MEM 4096
+
+#ifndef CONFIG_MARS_DEBUG_MEM
+#undef BRICK_DEBUG_MEM
+#endif
+#ifdef CONFIG_MARS_DEBUG_ORDER0
+#define BRICK_DEBUG_ORDER0
+#endif
+
+#define CONFIG_MARS_MEM_PREALLOC /* this is VITAL - disable only for experiments! */
+
+#define GFP_BRICK GFP_NOIO
+
+extern long long brick_global_memavail;
+extern long long brick_global_memlimit;
+extern atomic64_t brick_global_block_used;
+
+/* All brick memory allocations are guaranteed to succeed.
+ * In case of low memory, they will just retry (forever).
+ *
+ * We always prefer threads for concurrency.
+ * Therefore, in_interrupt() code does not occur, and we can
+ * always sleep in case of memory pressure.
+ *
+ * Resource deadlocks are avoided by the above memory limits.
+ * When exceeded, new memory is simply not allocated any more
+ * (except for vital memory, such as IO memory for which a
+ * low_mem_reserve must always exist, anyway).
+ */
+
+/***********************************************************************/
+
+/* compiler tweaking */
+
+/* Some functions are known to return non-null pointer values,
+ * at least under some Kconfig conditions.
+ *
+ * In code like...
+ *
+ * void *ptr = myfunction();
+ * if (unlikely(!ptr)) {
+ * printk("ERROR: this should not happen\n");
+ * goto fail;
+ * }
+ *
+ * ... the dead code elimination of gcc will not remove the if clause
+ * because the function might return a NULL value, even if a human
+ * would know that myfunction() never returns NULL.
+ *
+ * Unfortunately, the __attribute__((nonnull)) can only be applied
+ * to input parameters, but not to the return value.
+ *
+ * More unfortunately, a small inline wrapper does not help:
+ * when the wrapper itself gets eliminated by inlining, its
+ * nonnull attribute seems to be eliminated altogether.
+ * I don't know whether this is a bug or a feature (or just a weakness).
+ *
+ * Following is a small hack which solves the problem at least for gcc 4.7.
+ *
+ * In order to be useful, the -fdelete-null-pointer-checks option must be set.
+ * Since XIO is superuser-only anyway, enabling this for MARS should not
+ * be a security risk
+ * (c.f. upstream kernel commit a3ca86aea507904148870946d599e07a340b39bf)
+ */
+extern inline
+void *brick_mark_nonnull(void *_ptr)
+{
+ char *ptr = _ptr;
+
+	/* fool gcc into believing that the pointer was dereferenced... */
+ asm("" : : "X" (*ptr));
+ return ptr;
+}
+
+/***********************************************************************/
+
+/* small memory allocation (use this only for len < PAGE_SIZE) */
+
+#define brick_mem_alloc(_len_) \
+ ({ \
+ void *_res_ = _brick_mem_alloc(_len_, __LINE__); \
+ brick_mark_nonnull(_res_); \
+ })
+
+#define brick_zmem_alloc(_len_) \
+ ({ \
+ void *_res_ = _brick_mem_alloc(_len_, __LINE__); \
+ _res_ = brick_mark_nonnull(_res_); \
+ memset(_res_, 0, _len_); \
+ _res_; \
+ })
+
+#define brick_mem_free(_data_) \
+ do { \
+ if (_data_) { \
+ _brick_mem_free(_data_, __LINE__); \
+ } \
+ } while (0)
+
+/* don't use the following directly */
+extern void *_brick_mem_alloc(int len, int line) __attribute__((malloc)) __attribute__((alloc_size(1)));
+extern void _brick_mem_free(void *data, int line);
+
+/***********************************************************************/
+
+/* string memory allocation */
+
+#define BRICK_STRING_LEN 1024 /* default value when len == 0 */
+
+#define brick_string_alloc(_len_) \
+ ({ \
+ char *_res_ = _brick_string_alloc((_len_), __LINE__); \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_strndup(_orig_, _len_) \
+ ({ \
+ char *_res_ = _brick_string_alloc((_len_) + 1, __LINE__);\
+ _res_ = brick_mark_nonnull(_res_); \
+ strncpy(_res_, (_orig_), (_len_) + 1); \
+ /* always null-terminate for safety */ \
+ _res_[_len_] = '\0'; \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_strdup(_orig_) \
+ ({ \
+ int _len_ = strlen(_orig_); \
+ char *_res_ = _brick_string_alloc((_len_) + 1, __LINE__);\
+ _res_ = brick_mark_nonnull(_res_); \
+ strncpy(_res_, (_orig_), (_len_) + 1); \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_string_free(_data_) \
+ do { \
+ if (_data_) { \
+ _brick_string_free(_data_, __LINE__); \
+ } \
+ } while (0)
+
+/* don't use the following directly */
+extern char *_brick_string_alloc(int len, int line) __attribute__((malloc));
+extern void _brick_string_free(const char *data, int line);
+
+/***********************************************************************/
+
+/* block memory allocation (for aligned multiples of 512 or PAGE_SIZE, respectively) */
+
+#define brick_block_alloc(_pos_, _len_) \
+ ({ \
+ void *_res_ = _brick_block_alloc((_pos_), (_len_), __LINE__);\
+ brick_mark_nonnull(_res_); \
+ })
+
+#define brick_block_free(_data_, _len_) \
+ do { \
+ if (_data_) { \
+ _brick_block_free((_data_), (_len_), __LINE__); \
+ } \
+ } while (0)
+
+extern struct page *brick_iomap(void *data, int *offset, int *len);
+
+/* don't use the following directly */
+extern void *_brick_block_alloc(loff_t pos, int len, int line) __attribute__((malloc)) __attribute__((alloc_size(2)));
+extern void _brick_block_free(void *data, int len, int cline);
+
+/***********************************************************************/
+
+/* reservations / preallocation */
+
+#define BRICK_MAX_ORDER 11
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+extern int brick_allow_freelist;
+
+extern int brick_pre_reserve[BRICK_MAX_ORDER+1];
+extern int brick_mem_freelist_max[BRICK_MAX_ORDER+1];
+extern int brick_mem_alloc_count[BRICK_MAX_ORDER+1];
+extern int brick_mem_alloc_max[BRICK_MAX_ORDER+1];
+
+extern int brick_mem_reserve(void);
+
+#endif
+
+extern void brick_mem_statistics(bool final);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_brick_mem(void);
+extern void exit_brick_mem(void);
+
+#endif
--
2.0.0
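
To see what the nonnull trick above buys in practice, here is a
hypothetical caller sketch; only brick_string_alloc() and
brick_string_free() are taken from this patch, everything else is
invented:

/* Because brick allocations retry forever and are marked nonnull,
 * the caller needs no NULL check, and gcc can prune dead error paths.
 */
static char *make_path(const char *dir, const char *name)
{
	int len = strlen(dir) + strlen(name) + 2;
	char *res = brick_string_alloc(len);	/* guaranteed non-NULL */

	snprintf(res, len, "%s/%s", dir, name);
	return res;
}

static void example(void)
{
	char *path = make_path("/mars", "some-resource");

	/* ... use path ... */
	brick_string_free(path);	/* NULL-tolerant free */
}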

2014-07-01 21:58:46

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 02/50] mars: add new file drivers/block/mars/lamport.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/lamport.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
create mode 100644 drivers/block/mars/lamport.c

diff --git a/drivers/block/mars/lamport.c b/drivers/block/mars/lamport.c
new file mode 100644
index 0000000..67484c5
--- /dev/null
+++ b/drivers/block/mars/lamport.c
@@ -0,0 +1,48 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/semaphore.h>
+
+#include <linux/brick/lamport.h>
+
+/* TODO: replace with spinlock if possible (first check) */
+struct semaphore lamport_sem = __SEMAPHORE_INITIALIZER(lamport_sem, 1);
+struct timespec lamport_now = {};
+
+void get_lamport(struct timespec *now)
+{
+ int diff;
+
+ down(&lamport_sem);
+
+ *now = CURRENT_TIME;
+ diff = timespec_compare(now, &lamport_now);
+ if (diff >= 0) {
+ timespec_add_ns(now, 1);
+ memcpy(&lamport_now, now, sizeof(lamport_now));
+ timespec_add_ns(&lamport_now, 1);
+ } else {
+ timespec_add_ns(&lamport_now, 1);
+ memcpy(now, &lamport_now, sizeof(*now));
+ }
+
+ up(&lamport_sem);
+}
+EXPORT_SYMBOL_GPL(get_lamport);
+
+void set_lamport(struct timespec *old)
+{
+ int diff;
+
+ down(&lamport_sem);
+
+ diff = timespec_compare(old, &lamport_now);
+ if (diff >= 0) {
+ memcpy(&lamport_now, old, sizeof(lamport_now));
+ timespec_add_ns(&lamport_now, 1);
+ }
+
+ up(&lamport_sem);
+}
+EXPORT_SYMBOL_GPL(set_lamport);
--
2.0.0
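
The intended protocol behind get_lamport() / set_lamport() is the
classical Lamport clock rule: stamp on send, fast-forward on receive.
A hypothetical sketch (the message struct and the net_* helpers are
invented):

struct my_msg {
	struct timespec stamp;
	/* ... payload ... */
};

static void on_send(struct my_msg *m)
{
	/* yields a timestamp strictly greater than anything seen so far */
	get_lamport(&m->stamp);
	net_send(m);			/* hypothetical */
}

static void on_receive(struct my_msg *m)
{
	net_recv(m);			/* hypothetical */
	/* advance the local clock past the sender's timestamp, so that
	 * causally later events get strictly greater timestamps */
	set_lamport(&m->stamp);
}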

2014-07-01 21:59:05

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 13/50] mars: add new file include/linux/brick/lib_rank.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_rank.h | 119 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)
create mode 100644 include/linux/brick/lib_rank.h

diff --git a/include/linux/brick/lib_rank.h b/include/linux/brick/lib_rank.h
new file mode 100644
index 0000000..f0ba222
--- /dev/null
+++ b/include/linux/brick/lib_rank.h
@@ -0,0 +1,119 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#ifndef LIB_RANK_H
+#define LIB_RANK_H
+
+/* Generic round-robin scheduler based on ranking information.
+ */
+
+#define RKI_DUMMY INT_MIN
+
+struct rank_info {
+ int rki_x;
+ int rki_y;
+};
+
+struct rank_data {
+ /* public readonly */
+ long long rkd_current_points;
+
+ /* private */
+ long long rkd_tmp;
+ long long rkd_got;
+};
+
+/* Ranking phase.
+ *
+ * Calls should follow the following usage pattern:
+ *
+ * ranking_start(...);
+ * for (...) {
+ * ranking_compute(&rkd[this_time], ...);
+ * // usually you need at least 1 call for each rkd[] element,
+ * // but you can call more often to include ranking information
+ * // from many different sources.
+ * // Note: instead / additionally, you may also use
+ * // ranking_add() or ranking_override().
+ * }
+ * ranking_stop(...);
+ *
+ * => now the new ranking values are computed and already active
+ * for the round-robin ranking_select() mechanism described below.
+ *
+ * Important: the rki[] array describes a ranking function at some
+ * sample points (x_i, y_i), which must be ordered by ascending x_i.
+ * And, of course, you need to supply at least
+ * two sample points (otherwise a linear function cannot
+ * be described).
+ * The array _must_ always end with a dummy record whose x_i has the
+ * value RKI_DUMMY.
+ */
+
+extern inline
+void ranking_start(struct rank_data rkd[], int rkd_count)
+{
+ int i;
+
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_tmp = 0;
+}
+
+extern void ranking_compute(struct rank_data *rkd, const struct rank_info rki[], int x);
+
+/* This may be used to (exceptionally) add some extra salt...
+ */
+extern inline
+void ranking_add(struct rank_data *rkd, int y)
+{
+ rkd->rkd_tmp += y;
+}
+
+/* This may be used to (exceptionally) override certain ranking values.
+ */
+extern inline
+void ranking_override(struct rank_data *rkd, int y)
+{
+ rkd->rkd_tmp = y;
+}
+
+extern inline
+void ranking_stop(struct rank_data rkd[], int rkd_count)
+{
+ int i;
+
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_current_points = rkd[i].rkd_tmp;
+}
+
+/* This is a round-robin scheduler taking its weights
+ * from the previous ranking phase (the more ranking points,
+ * the more frequently a candidate will be selected).
+ *
+ * Typical usage pattern (independent from the above ranking phase
+ * usage pattern):
+ *
+ * while (__there_is_work_to_be_done(...)) {
+ * int winner = ranking_select(...);
+ * if (winner >= 0) {
+ * __do_something(winner);
+ * ranking_select_done(..., winner, 1); // or higher, winpoints >= 1 must hold
+ * }
+ * ...
+ * }
+ *
+ */
+
+extern int ranking_select(struct rank_data rkd[], int rkd_count);
+
+extern inline
+void ranking_select_done(struct rank_data rkd[], int winner, int win_points)
+{
+ if (winner >= 0) {
+ if (win_points < 1)
+ win_points = 1;
+ rkd[winner].rkd_got += win_points;
+ }
+}
+
+#endif
--
2.0.0
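
Putting the two usage patterns from the header together, a hypothetical
scheduler for two queues could look like this (queue_fill_percent(),
work_pending() and serve_one_request() are invented):

static const struct rank_info fill_rank[] = {
	{ 0, 0 },		/* empty queue scores 0 points */
	{ 100, 1000 },		/* full queue scores 1000 points */
	{ RKI_DUMMY, 0 },	/* mandatory terminator */
};

static struct rank_data rkd[2];

static void scheduler_round(void)
{
	int i;

	/* ranking phase: interpolate points from the current fill level */
	ranking_start(rkd, 2);
	for (i = 0; i < 2; i++)
		ranking_compute(&rkd[i], fill_rank, queue_fill_percent(i));
	ranking_stop(rkd, 2);

	/* selection phase: weighted round-robin */
	while (work_pending()) {
		int winner = ranking_select(rkd, 2);

		if (winner < 0)
			break;
		serve_one_request(winner);
		ranking_select_done(rkd, winner, 1);
	}
}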

2014-07-01 21:59:31

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 30/50] mars: add new file drivers/block/mars/xio_bricks/xio_aio.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_aio.c | 1224 +++++++++++++++++++++++++++++++
1 file changed, 1224 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_aio.c

diff --git a/drivers/block/mars/xio_bricks/xio_aio.c b/drivers/block/mars/xio_bricks/xio_aio.c
new file mode 100644
index 0000000..5356fdb
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_aio.c
@@ -0,0 +1,1224 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#define XIO_DEBUGGING
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/file.h>
+
+#include <linux/xio.h>
+#include <linux/brick/lib_timing.h>
+#include <linux/lib_mapfree.h>
+
+#include <linux/xio/xio_aio.h>
+
+#define XIO_MAX_AIO 1024
+#define XIO_MAX_AIO_READ 32
+
+static struct timing_stats timings[3];
+
+struct threshold aio_submit_threshold = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = AIO_SUBMIT_MAX_LATENCY,
+ .thr_factor = 10,
+ .thr_plus = 10000,
+};
+EXPORT_SYMBOL_GPL(aio_submit_threshold);
+
+struct threshold aio_io_threshold[2] = {
+ [0] = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = AIO_IO_R_MAX_LATENCY,
+ .thr_factor = 100,
+ .thr_plus = 0,
+ },
+ [1] = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = AIO_IO_W_MAX_LATENCY,
+ .thr_factor = 100,
+ .thr_plus = 0,
+ },
+};
+EXPORT_SYMBOL_GPL(aio_io_threshold);
+
+struct threshold aio_sync_threshold = {
+ .thr_ban = &xio_global_ban,
+ .thr_limit = AIO_SYNC_MAX_LATENCY,
+ .thr_factor = 100,
+ .thr_plus = 0,
+};
+EXPORT_SYMBOL_GPL(aio_sync_threshold);
+
+int aio_sync_mode = 2;
+EXPORT_SYMBOL_GPL(aio_sync_mode);
+
+/************************ mmu faking (provisional) ***********************/
+
+/* Kludge: our kernel threads will have no mm context, but need one
+ * for stuff like ioctx_alloc() / aio_setup_ring() etc
+ * which expect userspace resources.
+ * We fake one.
+ * TODO: factor out the userspace stuff from AIO such that
+ * this fake is no longer necessary.
+ * Even better: replace do_mmap() in the AIO code with something
+ * more friendly to kernelspace apps.
+ */
+#include <linux/mmu_context.h>
+
+struct mm_struct *mm_fake = NULL;
+struct task_struct *mm_fake_task = NULL;
+atomic_t mm_fake_count = ATOMIC_INIT(0);
+
+static inline void set_fake(void)
+{
+ mm_fake = current->mm;
+ if (mm_fake) {
+ XIO_DBG("initialized fake\n");
+ mm_fake_task = current;
+ get_task_struct(current); /* paired with put_task_struct() */
+ atomic_inc(&mm_fake->mm_count); /* paired with mmdrop() */
+ atomic_inc(&mm_fake->mm_users); /* paired with mmput() */
+ }
+}
+
+static inline void put_fake(void)
+{
+ int count = 0;
+
+ while (mm_fake && mm_fake_task) {
+ int remain = atomic_read(&mm_fake_count);
+
+ if (unlikely(remain != 0)) {
+ if (count++ < 10) {
+ XIO_WRN("cannot cleanup fake, remain = %d\n", remain);
+ brick_msleep(1000);
+ continue;
+ }
+ XIO_ERR("cannot cleanup fake, remain = %d\n", remain);
+ break;
+ } else {
+ XIO_DBG("cleaning up fake\n");
+ mmput(mm_fake);
+ mmdrop(mm_fake);
+ mm_fake = NULL;
+ put_task_struct(mm_fake_task);
+ mm_fake_task = NULL;
+ }
+ }
+}
+
+static inline void use_fake_mm(void)
+{
+ if (!current->mm && mm_fake) {
+ atomic_inc(&mm_fake_count);
+ XIO_DBG("using fake, count=%d\n", atomic_read(&mm_fake_count));
+ use_mm(mm_fake);
+ }
+}
+
+/* Cleanup faked mm, otherwise do_exit() will crash
+ */
+static inline void unuse_fake_mm(void)
+{
+ if (current->mm == mm_fake && mm_fake) {
+ XIO_DBG("unusing fake, count=%d\n", atomic_read(&mm_fake_count));
+ atomic_dec(&mm_fake_count);
+ unuse_mm(mm_fake);
+ current->mm = NULL;
+ }
+}
+
+/************************ own type definitions ***********************/
+
+/***************** some helpers *****************/
+
+static inline
+void _enqueue(struct aio_threadinfo *tinfo, struct aio_aio_aspect *aio_a, int prio, bool at_end)
+{
+ prio++;
+ if (unlikely(prio < 0))
+ prio = 0;
+ else if (unlikely(prio >= XIO_PRIO_NR))
+ prio = XIO_PRIO_NR - 1;
+
+ aio_a->enqueue_stamp = cpu_clock(raw_smp_processor_id());
+
+ spin_lock(&tinfo->lock);
+
+ if (at_end)
+ list_add_tail(&aio_a->io_head, &tinfo->aio_list[prio]);
+ else
+ list_add(&aio_a->io_head, &tinfo->aio_list[prio]);
+ tinfo->queued[prio]++;
+ atomic_inc(&tinfo->queued_sum);
+
+ spin_unlock(&tinfo->lock);
+
+ atomic_inc(&tinfo->total_enqueue_count);
+
+ wake_up_interruptible_all(&tinfo->event);
+}
+
+static inline
+struct aio_aio_aspect *_dequeue(struct aio_threadinfo *tinfo)
+{
+ struct aio_aio_aspect *aio_a = NULL;
+ int prio;
+
+ spin_lock(&tinfo->lock);
+
+ for (prio = 0; prio < XIO_PRIO_NR; prio++) {
+ struct list_head *start = &tinfo->aio_list[prio];
+ struct list_head *tmp = start->next;
+
+ if (tmp != start) {
+ list_del_init(tmp);
+ tinfo->queued[prio]--;
+ atomic_dec(&tinfo->queued_sum);
+ aio_a = container_of(tmp, struct aio_aio_aspect, io_head);
+ goto done;
+ }
+ }
+
+done:
+ spin_unlock(&tinfo->lock);
+
+ if (likely(aio_a && aio_a->object)) {
+ unsigned long long latency;
+
+ latency = cpu_clock(raw_smp_processor_id()) - aio_a->enqueue_stamp;
+ threshold_check(&aio_io_threshold[aio_a->object->io_rw & 1], latency);
+ }
+ return aio_a;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static
+loff_t get_total_size(struct aio_output *output)
+{
+ struct file *file;
+ struct inode *inode;
+ loff_t min;
+
+ file = output->mf->mf_filp;
+ if (unlikely(!file)) {
+ XIO_ERR("file is not open\n");
+ return -EILSEQ;
+ }
+ if (unlikely(!file->f_mapping)) {
+ XIO_ERR("file %p has no mapping\n", file);
+ return -EILSEQ;
+ }
+ inode = file->f_mapping->host;
+ if (unlikely(!inode)) {
+ XIO_ERR("file %p has no inode\n", file);
+ return -EILSEQ;
+ }
+
+ min = i_size_read(inode);
+
+ /* Workaround for races in the page cache.
+	 * In some very rare cases, concurrent reads and writes seem to
+	 * result in inconsistent reads, due to races.
+	 * Sometimes, the inode claims that the file has already been
+	 * appended by a write operation, but the data has not actually hit
+	 * the page cache, such that a concurrent read gets NULL blocks.
+ */
+ if (!output->brick->is_static_device) {
+ loff_t max = 0;
+
+ mf_get_dirty(output->mf, &min, &max, 0, 99);
+ }
+
+ return min;
+}
+
+static int aio_io_get(struct aio_output *output, struct aio_object *aio)
+{
+ loff_t total_size;
+
+ if (unlikely(!output->mf)) {
+ XIO_ERR("brick is not switched on\n");
+ return -EILSEQ;
+ }
+
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("bad io_len=%d\n", aio->io_len);
+ return -EILSEQ;
+ }
+
+ total_size = get_total_size(output);
+ if (unlikely(total_size < 0))
+ return total_size;
+ aio->io_total_size = total_size;
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ /* Buffered IO.
+ */
+ if (!aio->io_data) {
+ struct aio_aio_aspect *aio_a = aio_aio_get_aspect(output->brick, aio);
+
+ if (unlikely(!aio_a)) {
+ XIO_ERR("bad aio_a\n");
+ return -EILSEQ;
+ }
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("bad io_len = %d\n", aio->io_len);
+ return -ENOMEM;
+ }
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+ aio_a->do_dealloc = true;
+ atomic_inc(&output->total_alloc_count);
+ atomic_inc(&output->alloc_count);
+ }
+
+ obj_get_first(aio);
+ return aio->io_len;
+}
+
+static void aio_io_put(struct aio_output *output, struct aio_object *aio)
+{
+ struct file *file;
+ struct aio_aio_aspect *aio_a;
+
+ if (!obj_put(aio))
+ goto done;
+
+ if (likely(output->mf)) {
+ file = output->mf->mf_filp;
+ if (likely(file && file->f_mapping && file->f_mapping->host))
+ aio->io_total_size = get_total_size(output);
+ }
+
+ aio_a = aio_aio_get_aspect(output->brick, aio);
+ if (aio_a && aio_a->do_dealloc) {
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ atomic_dec(&output->alloc_count);
+ }
+ obj_free(aio);
+done:;
+}
+
+static
+void _complete(struct aio_output *output, struct aio_aio_aspect *aio_a, int err)
+{
+ struct aio_object *aio;
+
+ CHECK_PTR(aio_a, fatal);
+ aio = aio_a->object;
+ CHECK_PTR(aio, fatal);
+
+ if (err < 0) {
+ XIO_ERR("IO error %d at pos=%lld len=%d (aio=%p io_data=%p)\n",
+ err,
+ aio->io_pos,
+ aio->io_len,
+ aio,
+ aio->io_data);
+ } else {
+ aio_checksum(aio);
+ aio->io_flags |= AIO_UPTODATE;
+ }
+
+ CHECKED_CALLBACK(aio, err, err_found);
+
+done:
+ if (aio->io_rw)
+ atomic_dec(&output->write_count);
+ else
+ atomic_dec(&output->read_count);
+ mf_remove_dirty(output->mf, &aio_a->di);
+
+ aio_io_put(output, aio);
+ atomic_dec(&xio_global_io_flying);
+ goto out_return;
+err_found:
+ XIO_FAT("giving up...\n");
+ goto done;
+
+fatal:
+ XIO_FAT("bad pointer, giving up...\n");
+out_return:;
+}
+
+static
+void _complete_aio(struct aio_output *output, struct aio_object *aio, int err)
+{
+ struct aio_aio_aspect *aio_a;
+
+ obj_check(aio);
+ aio_a = aio_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, fatal);
+ _complete(output, aio_a, err);
+ goto out_return;
+fatal:
+ XIO_FAT("bad pointer, giving up...\n");
+out_return:;
+}
+
+static
+void _complete_all(struct list_head *tmp_list, struct aio_output *output, int err)
+{
+ while (!list_empty(tmp_list)) {
+ struct list_head *tmp = tmp_list->next;
+ struct aio_aio_aspect *aio_a = container_of(tmp, struct aio_aio_aspect, io_head);
+
+ list_del_init(tmp);
+ aio_a->di.dirty_stage = 3;
+ _complete(output, aio_a, err);
+ }
+}
+
+static void aio_io_io(struct aio_output *output, struct aio_object *aio)
+{
+ struct aio_threadinfo *tinfo = &output->tinfo[0];
+ struct aio_aio_aspect *aio_a;
+ int err = -EINVAL;
+
+ obj_get(aio);
+ atomic_inc(&xio_global_io_flying);
+
+ /* statistics */
+ if (aio->io_rw) {
+ atomic_inc(&output->total_write_count);
+ atomic_inc(&output->write_count);
+ } else {
+ atomic_inc(&output->total_read_count);
+ atomic_inc(&output->read_count);
+ }
+
+ if (unlikely(!output->mf || !output->mf->mf_filp))
+ goto done;
+
+ mapfree_set(output->mf, aio->io_pos, -1);
+
+ aio_a = aio_aio_get_aspect(output->brick, aio);
+ if (unlikely(!aio_a))
+ goto done;
+
+ _enqueue(tinfo, aio_a, aio->io_prio, true);
+ goto out_return;
+done:
+ _complete_aio(output, aio, err);
+out_return:;
+}
+
+static int aio_submit(struct aio_output *output, struct aio_aio_aspect *aio_a, bool use_fdsync)
+{
+ struct aio_object *aio = aio_a->object;
+
+ mm_segment_t oldfs;
+ int res;
+
+ struct iocb iocb = {
+ .aio_data = (__u64)aio_a,
+ .aio_lio_opcode = use_fdsync ? IOCB_CMD_FDSYNC : (aio->io_rw != 0 ? IOCB_CMD_PWRITE : IOCB_CMD_PREAD),
+ .aio_fildes = output->fd,
+ .aio_buf = (unsigned long)aio->io_data,
+ .aio_nbytes = aio->io_len,
+ .aio_offset = aio->io_pos,
+ /* .aio_reqprio = something(aio->io_prio) field exists, but not yet implemented in kernelspace :( */
+ };
+ struct iocb *iocbp = &iocb;
+ unsigned long long latency;
+
+ if (unlikely(output->fd < 0)) {
+ XIO_ERR("bad fd = %d\n", output->fd);
+ res = -EBADF;
+ goto done;
+ }
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ latency = TIME_STATS(&timings[aio->io_rw & 1], res = sys_io_submit(output->ctxp, 1, &iocbp));
+ set_fs(oldfs);
+
+ threshold_check(&aio_submit_threshold, latency);
+
+ atomic_inc(&output->total_submit_count);
+
+ if (likely(res >= 0))
+ atomic_inc(&output->submit_count);
+ else if (likely(res == -EAGAIN))
+ atomic_inc(&output->total_again_count);
+ else
+ XIO_ERR("error = %d\n", res);
+
+done:
+ return res;
+}
+
+static int aio_submit_dummy(struct aio_output *output)
+{
+ mm_segment_t oldfs;
+ int res;
+ int dummy;
+
+ struct iocb iocb = {
+ .aio_buf = (__u64)&dummy,
+ };
+ struct iocb *iocbp = &iocb;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ res = sys_io_submit(output->ctxp, 1, &iocbp);
+ set_fs(oldfs);
+
+ if (likely(res >= 0))
+ atomic_inc(&output->submit_count);
+ return res;
+}
+
+static
+int aio_start_thread(
+ struct aio_output *output,
+ struct aio_threadinfo *tinfo,
+ int (*fn)(void *),
+ char class)
+{
+ int j;
+
+ for (j = 0; j < XIO_PRIO_NR; j++)
+ INIT_LIST_HEAD(&tinfo->aio_list[j]);
+ tinfo->output = output;
+ spin_lock_init(&tinfo->lock);
+ init_waitqueue_head(&tinfo->event);
+ init_waitqueue_head(&tinfo->terminate_event);
+ tinfo->terminated = false;
+ tinfo->thread = brick_thread_create(fn, tinfo, "xio_aio_%c%d", class, output->index);
+ if (unlikely(!tinfo->thread)) {
+ XIO_ERR("cannot create thread\n");
+ return -ENOENT;
+ }
+ return 0;
+}
+
+static
+void aio_stop_thread(struct aio_output *output, int i, bool do_submit_dummy)
+{
+ struct aio_threadinfo *tinfo = &output->tinfo[i];
+
+ if (tinfo->thread) {
+ XIO_DBG("stopping thread %d ...\n", i);
+ brick_thread_stop_nowait(tinfo->thread);
+
+		/* wake up the event thread so it can notice the stop request */
+ if (do_submit_dummy) {
+ XIO_DBG("submitting dummy for wakeup %d...\n", i);
+ use_fake_mm();
+ aio_submit_dummy(output);
+ if (likely(current->mm))
+ unuse_fake_mm();
+ }
+
+ /* wait for termination */
+ XIO_DBG("waiting for thread %d ...\n", i);
+ wait_event_interruptible_timeout(
+ tinfo->terminate_event,
+ tinfo->terminated,
+ (60 - i * 2) * HZ);
+ if (likely(tinfo->terminated))
+ brick_thread_stop(tinfo->thread);
+ else
+ XIO_ERR("thread %d did not terminate - leaving a zombie\n", i);
+ }
+}
+
+static
+int aio_sync(struct file *file)
+{
+ int err;
+
+ switch (aio_sync_mode) {
+ case 1:
+#if defined(S_BIAS) || (defined(RHEL_MAJOR) && (RHEL_MAJOR < 7))
+ err = vfs_fsync_range(file, file->f_path.dentry, 0, LLONG_MAX, 1);
+#else
+ err = vfs_fsync_range(file, 0, LLONG_MAX, 1);
+#endif
+ break;
+ case 2:
+#if defined(S_BIAS) || (defined(RHEL_MAJOR) && (RHEL_MAJOR < 7))
+ err = vfs_fsync_range(file, file->f_path.dentry, 0, LLONG_MAX, 0);
+#else
+ err = vfs_fsync_range(file, 0, LLONG_MAX, 0);
+#endif
+ break;
+ default:
+ err = filemap_write_and_wait_range(file->f_mapping, 0, LLONG_MAX);
+ }
+
+ return err;
+}
+
+static
+void aio_sync_all(struct aio_output *output, struct list_head *tmp_list)
+{
+ unsigned long long latency;
+ int err;
+
+ output->fdsync_active = true;
+ atomic_inc(&output->total_fdsync_count);
+
+ latency = TIME_STATS(
+ &timings[2],
+ err = aio_sync(output->mf->mf_filp)
+ );
+
+ threshold_check(&aio_sync_threshold, latency);
+
+ output->fdsync_active = false;
+ wake_up_interruptible_all(&output->fdsync_event);
+ if (err < 0)
+ XIO_ERR("FDSYNC error %d\n", err);
+
+ /* Signal completion for the whole list.
+ * No locking needed, it's on the stack.
+ */
+ _complete_all(tmp_list, output, err);
+}
+
+/* Workaround for non-implemented aio_fsync()
+ */
+static
+int aio_sync_thread(void *data)
+{
+ struct aio_threadinfo *tinfo = data;
+ struct aio_output *output = tinfo->output;
+
+ XIO_DBG("sync thread has started on '%s'.\n", output->brick->brick_path);
+ /* set_user_nice(current, -20); */
+
+ while (!brick_thread_should_stop() || atomic_read(&tinfo->queued_sum) > 0) {
+ LIST_HEAD(tmp_list);
+ int i;
+
+ output->fdsync_active = false;
+ wake_up_interruptible_all(&output->fdsync_event);
+
+ wait_event_interruptible_timeout(
+ tinfo->event,
+ atomic_read(&tinfo->queued_sum) > 0,
+ HZ / 4);
+
+ spin_lock(&tinfo->lock);
+ for (i = 0; i < XIO_PRIO_NR; i++) {
+ struct list_head *start = &tinfo->aio_list[i];
+
+ if (!list_empty(start)) {
+ /* move over the whole list */
+ list_replace_init(start, &tmp_list);
+ atomic_sub(tinfo->queued[i], &tinfo->queued_sum);
+ tinfo->queued[i] = 0;
+ break;
+ }
+ }
+ spin_unlock(&tinfo->lock);
+
+ if (!list_empty(&tmp_list))
+ aio_sync_all(output, &tmp_list);
+ }
+
+ XIO_DBG("sync thread has stopped.\n");
+ tinfo->terminated = true;
+ wake_up_interruptible_all(&tinfo->terminate_event);
+ return 0;
+}
+
+static int aio_event_thread(void *data)
+{
+ struct aio_threadinfo *tinfo = data;
+ struct aio_output *output = tinfo->output;
+ struct aio_threadinfo *other = &output->tinfo[2];
+ struct io_event *events;
+ int err = -ENOMEM;
+
+ events = brick_mem_alloc(sizeof(struct io_event) * XIO_MAX_AIO_READ);
+
+ XIO_DBG("event thread has started.\n");
+
+ use_fake_mm();
+ if (!current->mm)
+ goto err;
+
+ err = aio_start_thread(output, &output->tinfo[2], aio_sync_thread, 'y');
+ if (unlikely(err < 0))
+ goto err;
+
+ while (!brick_thread_should_stop() || atomic_read(&tinfo->queued_sum) > 0) {
+ mm_segment_t oldfs;
+ int count;
+ int i;
+
+ struct timespec timeout = {
+ .tv_sec = 1,
+ };
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ /* TODO: don't timeout upon termination.
+ * Probably we should submit a dummy request.
+ */
+ count = sys_io_getevents(output->ctxp, 1, XIO_MAX_AIO_READ, events, &timeout);
+ set_fs(oldfs);
+
+ if (likely(count > 0))
+ atomic_sub(count, &output->submit_count);
+
+ for (i = 0; i < count; i++) {
+ struct aio_aio_aspect *aio_a = (void *)events[i].data;
+ struct aio_object *aio;
+ int err = events[i].res;
+
+ if (!aio_a)
+ continue; /* this was a dummy request */
+
+ aio_a->di.dirty_stage = 2;
+ aio = aio_a->object;
+
+ mapfree_set(output->mf, aio->io_pos, aio->io_pos + aio->io_len);
+
+ if (output->brick->o_fdsync
+ && err >= 0
+ && aio->io_rw != READ
+ && !aio->io_skip_sync
+ && !aio_a->resubmit++) {
+ /* workaround for non-implemented AIO FSYNC operation */
+ if (output->mf &&
+ output->mf->mf_filp &&
+ output->mf->mf_filp->f_op &&
+ !output->mf->mf_filp->f_op->aio_fsync) {
+ _enqueue(other, aio_a, aio->io_prio, true);
+ continue;
+ }
+ err = aio_submit(output, aio_a, true);
+ if (likely(err >= 0))
+ continue;
+ }
+
+ aio_a->di.dirty_stage = 3;
+ _complete(output, aio_a, err);
+
+ }
+ }
+ err = 0;
+
+err:
+ XIO_DBG("event thread has stopped, err = %d\n", err);
+
+ aio_stop_thread(output, 2, false);
+
+ unuse_fake_mm();
+
+ tinfo->terminated = true;
+ wake_up_interruptible_all(&tinfo->terminate_event);
+ brick_mem_free(events);
+ return err;
+}
+
+#if 1
+/* This should go to fs/open.c (as long as vfs_submit() is not implemented)
+ */
+#include <linux/fdtable.h>
+void fd_uninstall(unsigned int fd)
+{
+ struct files_struct *files = current->files;
+ struct fdtable *fdt;
+
+ XIO_DBG("fd = %d\n", fd);
+ if (unlikely(fd < 0)) {
+ XIO_ERR("bad fd = %d\n", fd);
+ goto out_return;
+ }
+ spin_lock(&files->file_lock);
+ fdt = files_fdtable(files);
+ rcu_assign_pointer(fdt->fd[fd], NULL);
+ spin_unlock(&files->file_lock);
+out_return:;
+}
+EXPORT_SYMBOL(fd_uninstall);
+#endif
+
+static
+atomic_t ioctx_count = ATOMIC_INIT(0);
+
+static
+void _destroy_ioctx(struct aio_output *output)
+{
+ if (unlikely(!output))
+ goto done;
+
+ aio_stop_thread(output, 1, true);
+
+ use_fake_mm();
+
+ if (likely(output->ctxp)) {
+ mm_segment_t oldfs;
+ int err;
+
+ XIO_DBG("ioctx count = %d destroying %p\n", atomic_read(&ioctx_count), (void *)output->ctxp);
+ oldfs = get_fs();
+ set_fs(get_ds());
+ err = sys_io_destroy(output->ctxp);
+ set_fs(oldfs);
+ atomic_dec(&ioctx_count);
+ XIO_DBG("ioctx count = %d status = %d\n", atomic_read(&ioctx_count), err);
+ output->ctxp = 0;
+ }
+
+ if (likely(output->fd >= 0)) {
+ XIO_DBG("destroying fd %d\n", output->fd);
+ fd_uninstall(output->fd);
+ put_unused_fd(output->fd);
+ output->fd = -1;
+ }
+
+done:
+ if (likely(current->mm))
+ unuse_fake_mm();
+}
+
+static
+int _create_ioctx(struct aio_output *output)
+{
+ struct file *file;
+
+ mm_segment_t oldfs;
+ int err = -EINVAL;
+
+ CHECK_PTR_NULL(output, done);
+ CHECK_PTR_NULL(output->mf, done);
+ file = output->mf->mf_filp;
+ CHECK_PTR_NULL(file, done);
+
+	/* TODO: this is provisional. We only need it for sys_io_submit()
+	 * which uses userspace concepts like file handles.
+	 * This should be accompanied by a future kernelspace vfs_submit() or
+	 * do_submit() which currently does not exist :(
+ */
+ err = get_unused_fd();
+ XIO_DBG("file %p '%s' new fd = %d\n", file, output->mf->mf_name, err);
+ if (unlikely(err < 0)) {
+ XIO_ERR("cannot get fd, err=%d\n", err);
+ goto done;
+ }
+ output->fd = err;
+ fd_install(err, file);
+
+ use_fake_mm();
+
+ err = -ENOMEM;
+ if (unlikely(!current->mm)) {
+ XIO_ERR("cannot fake mm\n");
+ goto done;
+ }
+
+ XIO_DBG("ioctx count = %d old = %p\n", atomic_read(&ioctx_count), (void *)output->ctxp);
+ output->ctxp = 0;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ err = sys_io_setup(XIO_MAX_AIO, &output->ctxp);
+ set_fs(oldfs);
+ if (likely(output->ctxp))
+ atomic_inc(&ioctx_count);
+ XIO_DBG("ioctx count = %d new = %p status = %d\n", atomic_read(&ioctx_count), (void *)output->ctxp, err);
+ if (unlikely(err < 0)) {
+ XIO_ERR("io_setup failed, err=%d\n", err);
+ goto done;
+ }
+
+ err = aio_start_thread(output, &output->tinfo[1], aio_event_thread, 'e');
+ if (unlikely(err < 0)) {
+ XIO_ERR("could not start event thread\n");
+ goto done;
+ }
+
+done:
+ if (likely(current->mm))
+ unuse_fake_mm();
+ return err;
+}
+
+static int aio_submit_thread(void *data)
+{
+ struct aio_threadinfo *tinfo = data;
+ struct aio_output *output = tinfo->output;
+ struct file *file;
+ int err = -EINVAL;
+
+ XIO_DBG("submit thread has started.\n");
+
+ file = output->mf->mf_filp;
+
+ use_fake_mm();
+
+ while (!brick_thread_should_stop() || atomic_read(&output->read_count) + atomic_read(&output->write_count) + atomic_read(&tinfo->queued_sum) > 0) {
+ struct aio_aio_aspect *aio_a;
+ struct aio_object *aio;
+ int sleeptime;
+ int status;
+
+ wait_event_interruptible_timeout(
+ tinfo->event,
+ atomic_read(&tinfo->queued_sum) > 0,
+ HZ / 4);
+
+ aio_a = _dequeue(tinfo);
+ if (!aio_a)
+ continue;
+
+ aio = aio_a->object;
+ status = -EINVAL;
+ CHECK_PTR(aio, error);
+
+ mapfree_set(output->mf, aio->io_pos, -1);
+
+ aio_a->di.dirty_stage = 0;
+ if (aio->io_rw)
+ mf_insert_dirty(output->mf, &aio_a->di);
+
+ aio->io_total_size = get_total_size(output);
+
+ /* check for reads crossing the EOF boundary (special case) */
+ if (aio->io_timeout > 0 &&
+ !aio->io_rw &&
+ aio->io_pos + aio->io_len > aio->io_total_size) {
+ loff_t len = aio->io_total_size - aio->io_pos;
+
+ if (len > 0) {
+ if (aio->io_len > len)
+ aio->io_len = len;
+ } else {
+ if (!aio_a->start_jiffies)
+ aio_a->start_jiffies = jiffies;
+ if ((long long)jiffies - aio_a->start_jiffies <= aio->io_timeout) {
+ if (atomic_read(&tinfo->queued_sum) <= 0) {
+ atomic_inc(&output->total_msleep_count);
+ brick_msleep(1000 * 4 / HZ);
+ }
+ _enqueue(tinfo, aio_a, XIO_PRIO_LOW, true);
+ continue;
+ }
+ XIO_DBG("ENODATA %lld\n", len);
+ _complete(output, aio_a, -ENODATA);
+ continue;
+ }
+ }
+
+ sleeptime = 1;
+ for (;;) {
+ aio_a->di.dirty_stage = 1;
+ status = aio_submit(output, aio_a, false);
+
+ if (likely(status != -EAGAIN))
+ break;
+ aio_a->di.dirty_stage = 0;
+ atomic_inc(&output->total_delay_count);
+ brick_msleep(sleeptime);
+ if (sleeptime < 100)
+ sleeptime++;
+ }
+
+error:
+ if (unlikely(status < 0))
+ _complete_aio(output, aio, status);
+ }
+
+ XIO_DBG("submit thread has stopped, status = %d.\n", err);
+
+ if (likely(current->mm))
+ unuse_fake_mm();
+
+ tinfo->terminated = true;
+ wake_up_interruptible_all(&tinfo->terminate_event);
+ return err;
+}
+
+static int aio_get_info(struct aio_output *output, struct xio_info *info)
+{
+ struct file *file;
+
+ if (unlikely(!output || !output->mf))
+ return -EINVAL;
+ file = output->mf->mf_filp;
+ if (unlikely(!file || !file->f_mapping || !file->f_mapping->host))
+ return -EINVAL;
+
+ info->tf_align = 1;
+ info->tf_min_size = 1;
+ info->current_size = get_total_size(output);
+
+ XIO_DBG("determined file size = %lld\n", info->current_size);
+
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static noinline
+char *aio_statistics(struct aio_brick *brick, int verbose)
+{
+ struct aio_output *output = brick->outputs[0];
+ char *res = brick_string_alloc(4096);
+ char *sync = NULL;
+ int pos = 0;
+
+ pos += report_timing(&timings[0], res + pos, 4096 - pos);
+ pos += report_timing(&timings[1], res + pos, 4096 - pos);
+ pos += report_timing(&timings[2], res + pos, 4096 - pos);
+
+ snprintf(res + pos, 4096 - pos,
+ "total reads = %d writes = %d allocs = %d submits = %d again = %d delays = %d msleeps = %d fdsyncs = %d fdsync_waits = %d map_free = %d | flying reads = %d writes = %d allocs = %d submits = %d q0 = %d q1 = %d q2 = %d | total q0 = %d q1 = %d q2 = %d %s\n",
+ atomic_read(&output->total_read_count),
+ atomic_read(&output->total_write_count),
+ atomic_read(&output->total_alloc_count),
+ atomic_read(&output->total_submit_count),
+ atomic_read(&output->total_again_count),
+ atomic_read(&output->total_delay_count),
+ atomic_read(&output->total_msleep_count),
+ atomic_read(&output->total_fdsync_count),
+ atomic_read(&output->total_fdsync_wait_count),
+ atomic_read(&output->total_mapfree_count),
+ atomic_read(&output->read_count),
+ atomic_read(&output->write_count),
+ atomic_read(&output->alloc_count),
+ atomic_read(&output->submit_count),
+ atomic_read(&output->tinfo[0].queued_sum),
+ atomic_read(&output->tinfo[1].queued_sum),
+ atomic_read(&output->tinfo[2].queued_sum),
+ atomic_read(&output->tinfo[0].total_enqueue_count),
+ atomic_read(&output->tinfo[1].total_enqueue_count),
+ atomic_read(&output->tinfo[2].total_enqueue_count),
+ sync ? sync : "");
+
+ if (sync)
+ brick_string_free(sync);
+
+ return res;
+}
+
+static noinline
+void aio_reset_statistics(struct aio_brick *brick)
+{
+ struct aio_output *output = brick->outputs[0];
+ int i;
+
+ atomic_set(&output->total_read_count, 0);
+ atomic_set(&output->total_write_count, 0);
+ atomic_set(&output->total_alloc_count, 0);
+ atomic_set(&output->total_submit_count, 0);
+ atomic_set(&output->total_again_count, 0);
+ atomic_set(&output->total_delay_count, 0);
+ atomic_set(&output->total_msleep_count, 0);
+ atomic_set(&output->total_fdsync_count, 0);
+ atomic_set(&output->total_fdsync_wait_count, 0);
+ atomic_set(&output->total_mapfree_count, 0);
+ for (i = 0; i < 3; i++) {
+ struct aio_threadinfo *tinfo = &output->tinfo[i];
+
+ atomic_set(&tinfo->total_enqueue_count, 0);
+ }
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int aio_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct aio_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ INIT_LIST_HEAD(&ini->di.dirty_head);
+ ini->di.dirty_aio = ini->object;
+ return 0;
+}
+
+static void aio_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct aio_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->di.dirty_head);
+ CHECK_HEAD_EMPTY(&ini->io_head);
+}
+
+XIO_MAKE_STATICS(aio);
+
+/********************* brick constructors * destructors *******************/
+
+static int aio_brick_construct(struct aio_brick *brick)
+{
+ return 0;
+}
+
+static int aio_switch(struct aio_brick *brick)
+{
+ static int index;
+ struct aio_output *output = brick->outputs[0];
+ const char *path = output->brick->brick_path;
+ int flags = O_RDWR | O_LARGEFILE;
+ int status = 0;
+
+ XIO_DBG("power.button = %d\n", brick->power.button);
+ if (!brick->power.button)
+ goto cleanup;
+
+ if (brick->power.on_led || output->mf)
+ goto done;
+
+ xio_set_power_off_led((void *)brick, false);
+
+ if (brick->o_creat) {
+ flags |= O_CREAT;
+ XIO_DBG("using O_CREAT on %s\n", path);
+ }
+ if (brick->o_direct) {
+ flags |= O_DIRECT;
+ XIO_DBG("using O_DIRECT on %s\n", path);
+ }
+
+ output->mf = mapfree_get(path, flags);
+ if (unlikely(!output->mf)) {
+ XIO_ERR("could not open file = '%s' flags = %d\n", path, flags);
+ status = -ENOENT;
+ goto err;
+ }
+
+ output->index = ++index;
+
+ status = _create_ioctx(output);
+ if (unlikely(status < 0)) {
+ XIO_ERR("could not create ioctx, status = %d\n", status);
+ goto err;
+ }
+
+ status = aio_start_thread(output, &output->tinfo[0], aio_submit_thread, 's');
+ if (unlikely(status < 0)) {
+ XIO_ERR("could not start theads, status = %d\n", status);
+ goto err;
+ }
+
+ XIO_DBG("opened file '%s'\n", path);
+ xio_set_power_on_led((void *)brick, true);
+
+done:
+ return 0;
+
+err:
+ XIO_ERR("status = %d\n", status);
+cleanup:
+ if (brick->power.off_led)
+ goto done;
+
+ xio_set_power_on_led((void *)brick, false);
+
+ aio_stop_thread(output, 0, false);
+
+ _destroy_ioctx(output);
+
+ xio_set_power_off_led((void *)brick,
+ (output->tinfo[0].thread == NULL &&
+ output->tinfo[1].thread == NULL &&
+ output->tinfo[2].thread == NULL));
+
+ XIO_DBG("switch off off_led = %d status = %d\n", brick->power.off_led, status);
+ if (brick->power.off_led) {
+ if (output->mf) {
+ XIO_DBG("closing file = '%s'\n", output->mf->mf_name);
+ mapfree_put(output->mf);
+ output->mf = NULL;
+ }
+ }
+ return status;
+}
+
+static int aio_output_construct(struct aio_output *output)
+{
+ init_waitqueue_head(&output->fdsync_event);
+ output->fd = -1;
+ return 0;
+}
+
+static int aio_output_destruct(struct aio_output *output)
+{
+ if (unlikely(output->fd >= 0))
+ XIO_ERR("active fd = %d detected\n", output->fd);
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct aio_brick_ops aio_brick_ops = {
+ .brick_switch = aio_switch,
+ .brick_statistics = aio_statistics,
+ .reset_statistics = aio_reset_statistics,
+};
+
+static struct aio_output_ops aio_output_ops = {
+ .aio_get = aio_io_get,
+ .aio_put = aio_io_put,
+ .aio_io = aio_io_io,
+ .xio_get_info = aio_get_info,
+};
+
+const struct aio_input_type aio_input_type = {
+ .type_name = "aio_input",
+ .input_size = sizeof(struct aio_input),
+};
+
+static const struct aio_input_type *aio_input_types[] = {
+ &aio_input_type,
+};
+
+const struct aio_output_type aio_output_type = {
+ .type_name = "aio_output",
+ .output_size = sizeof(struct aio_output),
+ .master_ops = &aio_output_ops,
+ .output_construct = &aio_output_construct,
+ .output_destruct = &aio_output_destruct,
+};
+
+static const struct aio_output_type *aio_output_types[] = {
+ &aio_output_type,
+};
+
+const struct aio_brick_type aio_brick_type = {
+ .type_name = "aio_brick",
+ .brick_size = sizeof(struct aio_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &aio_brick_ops,
+ .aspect_types = aio_aspect_types,
+ .default_input_types = aio_input_types,
+ .default_output_types = aio_output_types,
+ .brick_construct = &aio_brick_construct,
+};
+EXPORT_SYMBOL_GPL(aio_brick_type);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_aio(void)
+{
+ XIO_DBG("init_aio()\n");
+ _aio_brick_type = (void *)&aio_brick_type;
+ set_fake();
+ return aio_register_brick_type();
+}
+
+void exit_xio_aio(void)
+{
+ XIO_DBG("exit_aio()\n");
+ put_fake();
+ aio_unregister_brick_type();
+}
--
2.0.0
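
For context, the caller side of the aio_get / aio_io / aio_put cycle
implemented above looks roughly as follows. This is only a sketch under
simplifying assumptions: alloc_my_aio() is invented, the synchronization
with the completion callback is omitted, and SETUP_CALLBACK() /
GENERIC_INPUT_CALL() come from the brick framework (the server brick in
a later patch is a real user of this pattern):

static void my_endio(struct generic_callback *cb)
{
	/* invoked by _complete() via the CHECKED_CALLBACK() chain;
	 * cb->cb_error carries the IO status */
}

static int read_block(struct aio_input *input, loff_t pos, int len)
{
	struct aio_object *aio = alloc_my_aio(input->brick);	/* hypothetical */
	int status;

	aio->io_pos = pos;
	aio->io_len = len;
	aio->io_rw = READ;
	SETUP_CALLBACK(aio, my_endio, NULL);

	status = GENERIC_INPUT_CALL(input, aio_get, aio);	/* allocates io_data */
	if (status < 0)
		return status;

	GENERIC_INPUT_CALL(input, aio_io, aio);	/* enqueued for the submit thread */
	/* after my_endio() has run, drop the reference again: */
	GENERIC_INPUT_CALL(input, aio_put, aio);
	return 0;
}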

2014-07-01 21:59:59

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 43/50] mars: add new file drivers/block/mars/xio_bricks/xio_server.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_server.c | 801 +++++++++++++++++++++++++++++
1 file changed, 801 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_server.c

diff --git a/drivers/block/mars/xio_bricks/xio_server.c b/drivers/block/mars/xio_bricks/xio_server.c
new file mode 100644
index 0000000..0b87f9f
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_server.c
@@ -0,0 +1,801 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Server brick (just for demonstration) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#define _STRATEGY
+#include <linux/brick/brick.h>
+#include <linux/xio.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_aio.h>
+
+#include <linux/mars_light/light_strategy.h>
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_server.h>
+
+#define NR_SOCKETS 3
+
+static struct xio_socket server_socket[NR_SOCKETS];
+static struct task_struct *server_thread[NR_SOCKETS];
+
+/************************ own helper functions ***********************/
+
+static
+int cb_thread(void *data)
+{
+ struct server_brick *brick = data;
+ struct xio_socket *sock = &brick->handler_socket;
+ bool aborted = false;
+ bool ok = xio_get_socket(sock);
+ int status = -EINVAL;
+
+ XIO_DBG("--------------- cb_thread starting on socket #%d, ok = %d\n", sock->s_debug_nr, ok);
+ if (!ok)
+ goto done;
+
+ brick->cb_running = true;
+ wake_up_interruptible(&brick->startup_event);
+
+ while (!brick_thread_should_stop() || !list_empty(&brick->cb_read_list) || !list_empty(&brick->cb_write_list) || atomic_read(&brick->in_flight) > 0) {
+ struct server_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct list_head *tmp;
+
+ wait_event_interruptible_timeout(
+ brick->cb_event,
+ !list_empty(&brick->cb_read_list) ||
+ !list_empty(&brick->cb_write_list),
+ 1 * HZ);
+
+ spin_lock(&brick->cb_lock);
+ tmp = brick->cb_write_list.next;
+ if (tmp == &brick->cb_write_list) {
+ tmp = brick->cb_read_list.next;
+ if (tmp == &brick->cb_read_list) {
+ spin_unlock(&brick->cb_lock);
+ brick_msleep(1000 / HZ);
+ continue;
+ }
+ }
+ list_del_init(tmp);
+ spin_unlock(&brick->cb_lock);
+
+ aio_a = container_of(tmp, struct server_aio_aspect, cb_head);
+ aio = aio_a->object;
+ status = -EINVAL;
+ CHECK_PTR(aio, err);
+
+ status = 0;
+ if (!aborted) {
+ down(&brick->socket_sem);
+ status = xio_send_cb(sock, aio);
+ up(&brick->socket_sem);
+ }
+
+err:
+ if (unlikely(status < 0) && !aborted) {
+ aborted = true;
+ XIO_WRN("cannot send response, status = %d\n", status);
+ /* Just shutdown the socket and forget all pending
+ * requests.
+ * The _client_ is responsible for resending
+ * any lost operations.
+ */
+ xio_shutdown_socket(sock);
+ }
+
+ if (aio_a->do_put) {
+ GENERIC_INPUT_CALL(brick->inputs[0], aio_put, aio);
+ atomic_dec(&brick->in_flight);
+ } else {
+ obj_free(aio);
+ }
+ }
+
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+
+done:
+ XIO_DBG("---------- cb_thread terminating, status = %d\n", status);
+ wake_up_interruptible(&brick->startup_event);
+ return status;
+}
+
+static
+void server_endio(struct generic_callback *cb)
+{
+ struct server_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct server_brick *brick;
+ int rw;
+
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+ LAST_CALLBACK(cb);
+ if (unlikely(cb != &aio->_object_cb))
+ XIO_ERR("bad cb pointer %p != %p\n", cb, &aio->_object_cb);
+
+ brick = aio_a->brick;
+ if (unlikely(!brick)) {
+ XIO_WRN("late IO callback -- cannot do anything\n");
+ goto out_return;
+ }
+
+ rw = aio->io_rw;
+
+ spin_lock(&brick->cb_lock);
+ if (rw)
+ list_add_tail(&aio_a->cb_head, &brick->cb_write_list);
+ else
+ list_add_tail(&aio_a->cb_head, &brick->cb_read_list);
+ spin_unlock(&brick->cb_lock);
+
+ wake_up_interruptible(&brick->cb_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback - giving up\n");
+out_return:;
+}
+
+static
+int server_io(struct server_brick *brick, struct xio_socket *sock, struct xio_cmd *cmd)
+{
+ struct aio_object *aio;
+ struct server_aio_aspect *aio_a;
+ int amount;
+ int status = -ENOTRECOVERABLE;
+
+ if (!brick->cb_running || !brick->handler_running || !xio_socket_is_alive(sock))
+ goto done;
+
+ aio = server_alloc_aio(brick);
+ status = -ENOMEM;
+ aio_a = server_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a)) {
+ obj_free(aio);
+ goto done;
+ }
+
+ status = xio_recv_aio(sock, aio, cmd);
+ if (status < 0) {
+ obj_free(aio);
+ goto done;
+ }
+
+ aio_a->brick = brick;
+ SETUP_CALLBACK(aio, server_endio, aio_a);
+
+ amount = 0;
+	if (aio->io_cs_mode < 2)
+ amount = (aio->io_len - 1) / 1024 + 1;
+ xio_limit_sleep(&server_limiter, amount);
+
+ status = GENERIC_INPUT_CALL(brick->inputs[0], aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_WRN("aio_get execution error = %d\n", status);
+ SIMPLE_CALLBACK(aio, status);
+ status = 0; /* continue serving requests */
+ goto done;
+ }
+ aio_a->do_put = true;
+ atomic_inc(&brick->in_flight);
+ GENERIC_INPUT_CALL(brick->inputs[0], aio_io, aio);
+
+done:
+ return status;
+}
+
+static
+int _set_server_aio_params(struct xio_brick *_brick, void *private)
+{
+ struct aio_brick *aio_brick = (void *)_brick;
+
+ if (_brick->type != (void *)_aio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ aio_brick->o_creat = false;
+ aio_brick->o_direct = false;
+ aio_brick->o_fdsync = false;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_server_bio_params(struct xio_brick *_brick, void *private)
+{
+ struct bio_brick *bio_brick;
+
+ if (_brick->type == (void *)_aio_brick_type)
+ return _set_server_aio_params(_brick, private);
+ if (_brick->type != (void *)_bio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ bio_brick = (void *)_brick;
+ bio_brick->ra_pages = 0;
+ bio_brick->do_noidle = true;
+ bio_brick->do_sync = true;
+ bio_brick->do_unplug = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int dummy_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+ return 0;
+}
+
+static
+int handler_thread(void *data)
+{
+ struct server_brick *brick = data;
+ struct xio_socket *sock = &brick->handler_socket;
+ bool ok = xio_get_socket(sock);
+ int debug_nr;
+ int status = -EINVAL;
+
+ XIO_DBG("#%d --------------- handler_thread starting on socket %p\n", sock->s_debug_nr, sock);
+ if (!ok)
+ goto done;
+
+ brick->handler_running = true;
+ wake_up_interruptible(&brick->startup_event);
+
+ while (!brick_thread_should_stop() && xio_socket_is_alive(sock)) {
+ struct xio_cmd cmd = {};
+
+ status = -EINTR;
+ if (unlikely(!mars_global || !mars_global->global_power.button)) {
+ XIO_DBG("system is not alive\n");
+ break;
+ }
+
+ status = xio_recv_struct(sock, &cmd, xio_cmd_meta);
+ if (unlikely(status < 0)) {
+ XIO_WRN("#%d recv cmd status = %d\n", sock->s_debug_nr, status);
+ goto clean;
+ }
+ if (unlikely(!xio_socket_is_alive(sock))) {
+ XIO_WRN("#%d is dead\n", sock->s_debug_nr);
+ status = -EINTR;
+ goto clean;
+ }
+
+ status = -EPROTO;
+ switch (cmd.cmd_code & CMD_FLAG_MASK) {
+ case CMD_NOP:
+ status = 0;
+ XIO_DBG("#%d got NOP operation\n", sock->s_debug_nr);
+ break;
+ case CMD_NOTIFY:
+ status = 0;
+ from_remote_trigger();
+ break;
+ case CMD_GETINFO:
+ {
+ struct xio_info info = {};
+
+ status = GENERIC_INPUT_CALL(brick->inputs[0], xio_get_info, &info);
+ if (status < 0)
+ break;
+ down(&brick->socket_sem);
+ status = xio_send_struct(sock, &cmd, xio_cmd_meta);
+ if (status >= 0)
+ status = xio_send_struct(sock, &info, xio_info_meta);
+ up(&brick->socket_sem);
+ break;
+ }
+ case CMD_GETENTS:
+ {
+ struct mars_global local = {
+ .dent_anchor = LIST_HEAD_INIT(local.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(local.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(local.main_event),
+ };
+
+ status = -EINVAL;
+ if (unlikely(!cmd.cmd_str1))
+ break;
+
+ init_rwsem(&local.dent_mutex);
+ init_rwsem(&local.brick_mutex);
+
+ status = mars_dent_work(&local,
+ "/mars",
+ sizeof(struct mars_dent),
+ external_checker,
+ dummy_worker,
+ &local,
+ 3);
+
+ down(&brick->socket_sem);
+ status = xio_send_dent_list(sock, &local.dent_anchor);
+ up(&brick->socket_sem);
+
+ if (status < 0) {
+ XIO_WRN("#%d could not send dentry information, status = %d\n",
+ sock->s_debug_nr,
+ status);
+ }
+
+ xio_free_dent_all(&local, &local.dent_anchor);
+ break;
+ }
+ case CMD_CONNECT:
+ {
+ struct xio_brick *prev;
+ const char *path = cmd.cmd_str1;
+
+ status = -EINVAL;
+ CHECK_PTR(path, err);
+ CHECK_PTR_NULL(_bio_brick_type, err);
+
+ if (!brick->global || !mars_global || !mars_global->global_power.button) {
+ XIO_WRN("#%d system is not alive\n", sock->s_debug_nr);
+ goto err;
+ }
+
+ prev = make_brick_all(
+ brick->global,
+ NULL,
+ _set_server_bio_params,
+ NULL,
+ path,
+ (const struct generic_brick_type *)_bio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* start always */
+ path,
+ (const char *[]){},
+ 0);
+ if (likely(prev)) {
+ status = generic_connect((void *)brick->inputs[0], (void *)prev->outputs[0]);
+ if (unlikely(status < 0))
+ XIO_ERR("#%d cannot connect to '%s'\n", sock->s_debug_nr, path);
+ prev->killme = true;
+ } else {
+ XIO_ERR("#%d cannot find brick '%s'\n", sock->s_debug_nr, path);
+ }
+
+err:
+ cmd.cmd_int1 = status;
+ down(&brick->socket_sem);
+ status = xio_send_struct(sock, &cmd, xio_cmd_meta);
+ up(&brick->socket_sem);
+ break;
+ }
+ case CMD_AIO:
+ {
+ status = server_io(brick, sock, &cmd);
+ break;
+ }
+ case CMD_CB:
+ XIO_ERR("#%d oops, as a server I should never get CMD_CB; something is wrong here - attack attempt??\n",
+ sock->s_debug_nr);
+ break;
+ default:
+ XIO_ERR("#%d unknown command %d\n", sock->s_debug_nr, cmd.cmd_code);
+ }
+clean:
+ brick_string_free(cmd.cmd_str1);
+ if (status < 0)
+ break;
+ }
+
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+
+done:
+ XIO_DBG("#%d handler_thread terminating, status = %d\n", sock->s_debug_nr, status);
+
+ debug_nr = sock->s_debug_nr;
+
+ XIO_DBG("#%d done.\n", debug_nr);
+ brick->killme = true;
+ return status;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static int server_get_info(struct server_output *output, struct xio_info *info)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
+static int server_io_get(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ return GENERIC_INPUT_CALL(input, aio_get, aio);
+}
+
+static void server_io_put(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+}
+
+static void server_io_io(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+}
+
+static int server_switch(struct server_brick *brick)
+{
+ struct xio_socket *sock = &brick->handler_socket;
+ int status = 0;
+
+ if (brick->power.button) {
+ static int version;
+ bool ok;
+
+ if (brick->power.on_led)
+ goto done;
+
+ ok = xio_get_socket(sock);
+ if (unlikely(!ok)) {
+ status = -ENOENT;
+ goto err;
+ }
+
+ xio_set_power_off_led((void *)brick, false);
+
+ brick->cb_thread = brick_thread_create(cb_thread, brick, "xio_cb%d", version);
+ if (unlikely(!brick->cb_thread)) {
+ XIO_ERR("cannot create cb thread\n");
+ status = -ENOENT;
+ goto err;
+ }
+
+ brick->handler_thread = brick_thread_create(handler_thread, brick, "xio_handler%d", version++);
+ if (unlikely(!brick->handler_thread)) {
+ XIO_ERR("cannot create handler thread\n");
+ brick_thread_stop(brick->cb_thread);
+ brick->cb_thread = NULL;
+ status = -ENOENT;
+ goto err;
+ }
+
+ xio_set_power_on_led((void *)brick, true);
+ } else if (!brick->power.off_led) {
+ struct task_struct *thread;
+
+ xio_set_power_on_led((void *)brick, false);
+
+ xio_shutdown_socket(sock);
+
+ thread = brick->handler_thread;
+ if (thread) {
+ brick->handler_thread = NULL;
+ brick->handler_running = false;
+ XIO_DBG("#%d stopping handler thread....\n", sock->s_debug_nr);
+ brick_thread_stop(thread);
+ }
+ thread = brick->cb_thread;
+ if (thread) {
+ brick->cb_thread = NULL;
+ brick->cb_running = false;
+ XIO_DBG("#%d stopping callback thread....\n", sock->s_debug_nr);
+ brick_thread_stop(thread);
+ }
+
+ xio_put_socket(sock);
+ XIO_DBG("#%d socket s_count = %d\n", sock->s_debug_nr, atomic_read(&sock->s_count));
+
+ xio_set_power_off_led((void *)brick, true);
+ }
+err:
+ if (unlikely(status < 0)) {
+ xio_set_power_off_led((void *)brick, true);
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+ }
+done:
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *server_statistics(struct server_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(1024);
+
+ snprintf(res, 1024,
+ "cb_running = %d handler_running = %d in_flight = %d\n",
+ brick->cb_running,
+ brick->handler_running,
+ atomic_read(&brick->in_flight));
+
+ return res;
+}
+
+static
+void server_reset_statistics(struct server_brick *brick)
+{
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int server_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct server_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->cb_head);
+ return 0;
+}
+
+static void server_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct server_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->cb_head);
+}
+
+XIO_MAKE_STATICS(server);
+
+/********************* brick constructors * destructors *******************/
+
+static int server_brick_construct(struct server_brick *brick)
+{
+ init_waitqueue_head(&brick->startup_event);
+ init_waitqueue_head(&brick->cb_event);
+ sema_init(&brick->socket_sem, 1);
+ spin_lock_init(&brick->cb_lock);
+ INIT_LIST_HEAD(&brick->cb_read_list);
+ INIT_LIST_HEAD(&brick->cb_write_list);
+ return 0;
+}
+
+static int server_brick_destruct(struct server_brick *brick)
+{
+ CHECK_HEAD_EMPTY(&brick->cb_read_list);
+ CHECK_HEAD_EMPTY(&brick->cb_write_list);
+ return 0;
+}
+
+static int server_output_construct(struct server_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct server_brick_ops server_brick_ops = {
+ .brick_switch = server_switch,
+ .brick_statistics = server_statistics,
+ .reset_statistics = server_reset_statistics,
+};
+
+static struct server_output_ops server_output_ops = {
+ .xio_get_info = server_get_info,
+ .aio_get = server_io_get,
+ .aio_put = server_io_put,
+ .aio_io = server_io_io,
+};
+
+const struct server_input_type server_input_type = {
+ .type_name = "server_input",
+ .input_size = sizeof(struct server_input),
+};
+
+static const struct server_input_type *server_input_types[] = {
+ &server_input_type,
+};
+
+const struct server_output_type server_output_type = {
+ .type_name = "server_output",
+ .output_size = sizeof(struct server_output),
+ .master_ops = &server_output_ops,
+ .output_construct = &server_output_construct,
+};
+
+static const struct server_output_type *server_output_types[] = {
+ &server_output_type,
+};
+
+const struct server_brick_type server_brick_type = {
+ .type_name = "server_brick",
+ .brick_size = sizeof(struct server_brick),
+ .max_inputs = 1,
+ .max_outputs = 0,
+ .master_ops = &server_brick_ops,
+ .aspect_types = server_aspect_types,
+ .default_input_types = server_input_types,
+ .default_output_types = server_output_types,
+ .brick_construct = &server_brick_construct,
+ .brick_destruct = &server_brick_destruct,
+};
+EXPORT_SYMBOL_GPL(server_brick_type);
+
+/*********************************************************************/
+
+/* strategy layer */
+
+int server_show_statist = 0;
+EXPORT_SYMBOL_GPL(server_show_statist);
+
+static int _server_thread(void *data)
+{
+ struct mars_global server_global = {
+ .dent_anchor = LIST_HEAD_INIT(server_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(server_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(server_global.main_event),
+ };
+ struct xio_socket *my_socket = data;
+ char *id = my_id();
+ int status = 0;
+
+ init_rwsem(&server_global.dent_mutex);
+ init_rwsem(&server_global.brick_mutex);
+
+ XIO_INF("-------- server starting on host '%s' ----------\n", id);
+
+ while (!brick_thread_should_stop() &&
+ (!mars_global || !mars_global->global_power.button)) {
+ XIO_DBG("system did not start up\n");
+ brick_msleep(5000);
+ }
+
+ XIO_INF("-------- server now working on host '%s' ----------\n", id);
+
+ while (!brick_thread_should_stop() || !list_empty(&server_global.brick_anchor)) {
+ struct server_brick *brick = NULL;
+ struct xio_socket handler_socket = {};
+
+ server_global.global_version++;
+
+ if (server_show_statist)
+ show_statistics(&server_global, "server");
+
+ status = xio_kill_brick_when_possible(&server_global, &server_global.brick_anchor, false, NULL, true);
+ XIO_DBG("kill server bricks (when possible) = %d\n", status);
+
+ if (!mars_global || !mars_global->global_power.button) {
+ brick_msleep(1000);
+ continue;
+ }
+
+ status = xio_accept_socket(&handler_socket, my_socket);
+ if (unlikely(status < 0 || !xio_socket_is_alive(&handler_socket))) {
+ brick_msleep(500);
+ if (status == -EAGAIN)
+ continue; /* without error message */
+ XIO_WRN("accept status = %d\n", status);
+ brick_msleep(1000);
+ continue;
+ }
+ handler_socket.s_shutdown_on_err = true;
+
+ XIO_DBG("got new connection #%d\n", handler_socket.s_debug_nr);
+
+ brick = (void *)xio_make_brick(&server_global, NULL, &server_brick_type, "handler", "handler");
+ if (!brick) {
+ XIO_ERR("cannot create server instance\n");
+ xio_shutdown_socket(&handler_socket);
+ xio_put_socket(&handler_socket);
+ brick_msleep(2000);
+ continue;
+ }
+ memcpy(&brick->handler_socket, &handler_socket, sizeof(struct xio_socket));
+
+ /* TODO: check authorization.
+ */
+
+ brick->power.button = true;
+ status = server_switch(brick);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot switch on server brick, status = %d\n", status);
+ goto err;
+ }
+
+ /* further references are usually held by the threads */
+ xio_put_socket(&brick->handler_socket);
+
+ /* fire and forget....
+ * the new instance is now responsible for itself.
+ */
+ brick = NULL;
+ brick_msleep(100);
+ continue;
+
+err:
+ if (brick) {
+ xio_shutdown_socket(&brick->handler_socket);
+ xio_put_socket(&brick->handler_socket);
+ status = xio_kill_brick((void *)brick);
+ if (status < 0)
+ BRICK_ERR("kill status = %d, giving up\n", status);
+ brick = NULL;
+ }
+ brick_msleep(2000);
+ }
+
+ XIO_INF("-------- cleaning up ----------\n");
+
+ xio_kill_brick_all(&server_global, &server_global.brick_anchor, false);
+
+ /* cleanup_mm(); */
+
+ XIO_INF("-------- done status = %d ----------\n", status);
+ return status;
+}
+
+/***************** module init stuff ************************/
+
+struct xio_limiter server_limiter = {
+ .lim_max_rate = 0,
+};
+EXPORT_SYMBOL_GPL(server_limiter);
+
+void exit_xio_server(void)
+{
+ int i;
+
+ XIO_INF("exit_server()\n");
+ server_unregister_brick_type();
+
+ for (i = 0; i < NR_SOCKETS; i++) {
+ if (server_thread[i]) {
+ XIO_INF("stopping server thread %d...\n", i);
+ brick_thread_stop(server_thread[i]);
+ }
+ XIO_INF("closing server socket %d...\n", i);
+ xio_put_socket(&server_socket[i]);
+ }
+}
+
+int __init init_xio_server(void)
+{
+ int i;
+
+ XIO_INF("init_server()\n");
+
+ for (i = 0; i < NR_SOCKETS; i++) {
+ struct sockaddr_storage sockaddr = {};
+ char tmp[16];
+ int status;
+
+ sprintf(tmp, ":%d", xio_net_default_port + i);
+ status = xio_create_sockaddr(&sockaddr, tmp);
+ if (unlikely(status < 0)) {
+ exit_xio_server();
+ return status;
+ }
+
+ status = xio_create_socket(&server_socket[i], &sockaddr, true);
+ if (unlikely(status < 0)) {
+ XIO_ERR("could not create server socket %d, status = %d\n", i, status);
+ exit_xio_server();
+ return status;
+ }
+
+ server_thread[i] = brick_thread_create(_server_thread, &server_socket[i], "xio_server_%d", i);
+ if (unlikely(!server_thread[i] || IS_ERR(server_thread[i]))) {
+ XIO_ERR("could not create server thread %d\n", i);
+ exit_xio_server();
+ return -ENOENT;
+ }
+ }
+
+ return server_register_brick_type();
+}
--
2.0.0

2014-07-01 21:59:57

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 06/50] mars: add new file drivers/block/mars/brick_mem.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/brick_mem.c | 1081 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 1081 insertions(+)
create mode 100644 drivers/block/mars/brick_mem.c

diff --git a/drivers/block/mars/brick_mem.c b/drivers/block/mars/brick_mem.c
new file mode 100644
index 0000000..ce682fc
--- /dev/null
+++ b/drivers/block/mars/brick_mem.c
@@ -0,0 +1,1081 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <linux/atomic.h>
+
+#include <linux/brick/brick_mem.h>
+#include <linux/brick/brick_say.h>
+#include <linux/brick/lamport.h>
+
+#define USE_KERNEL_PAGES /* currently mandatory (vmalloc does not work) */
+
+#define MAGIC_BLOCK ((int)0x8B395D7B)
+#define MAGIC_BEND ((int)0x8B395D7C)
+#define MAGIC_MEM1 ((int)0x8B395D7D)
+#define MAGIC_MEM2 ((int)0x9B395D8D)
+#define MAGIC_MEND1 ((int)0x8B395D7E)
+#define MAGIC_MEND2 ((int)0x9B395D8E)
+#define MAGIC_STR ((int)0x8B395D7F)
+#define MAGIC_SEND ((int)0x9B395D8F)
+
+#define INT_ACCESS(ptr, offset) (*(int *)(((char *)(ptr)) + (offset)))
+
+#define _BRICK_FMT(_fmt, _class) \
+ "%ld.%09ld %ld.%09ld MEM_%-5s %s[%d] %s:%d %s(): " \
+ _fmt, \
+ _s_now.tv_sec, _s_now.tv_nsec, \
+ _l_now.tv_sec, _l_now.tv_nsec, \
+ say_class[_class], \
+ current->comm, (int)smp_processor_id(), \
+ __BASE_FILE__, \
+ __LINE__, \
+ __func__
+
+#define _BRICK_MSG(_class, _dump, _fmt, _args...) \
+ do { \
+ struct timespec _s_now = CURRENT_TIME; \
+ struct timespec _l_now; \
+ get_lamport(&_l_now); \
+ say(_class, _BRICK_FMT(_fmt, _class), ##_args); \
+ if (_dump) \
+ dump_stack(); \
+ } while (0)
+
+#define BRICK_ERR(_fmt, _args...) _BRICK_MSG(SAY_ERROR, true, _fmt, ##_args)
+#define BRICK_WRN(_fmt, _args...) _BRICK_MSG(SAY_WARN, false, _fmt, ##_args)
+#define BRICK_INF(_fmt, _args...) _BRICK_MSG(SAY_INFO, false, _fmt, ##_args)
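+
+/* Usage sketch (hypothetical message, for illustration only):
+ *
+ * BRICK_WRN("len = %d\n", len);
+ *
+ * This prints the system time and the Lamport time, the message class,
+ * current->comm, the CPU number, source file, line and function before
+ * the formatted text; BRICK_ERR() additionally dumps the stack.
+ */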
+
+/***********************************************************************/
+
+/* limit handling */
+
+#include <linux/swap.h>
+
+long long brick_global_memavail = 0;
+EXPORT_SYMBOL_GPL(brick_global_memavail);
+long long brick_global_memlimit = 0;
+EXPORT_SYMBOL_GPL(brick_global_memlimit);
+atomic64_t brick_global_block_used = ATOMIC64_INIT(0);
+EXPORT_SYMBOL_GPL(brick_global_block_used);
+
+void get_total_ram(void)
+{
+ struct sysinfo i = {};
+
+ si_meminfo(&i);
+ /* si_swapinfo(&i); */
+ brick_global_memavail = (long long)i.totalram * (PAGE_SIZE / 1024);
+ BRICK_INF("total RAM = %lld [KiB]\n", brick_global_memavail);
+}
+
+/***********************************************************************/
+
+/* small memory allocation (use this only for len < PAGE_SIZE) */
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_mem_alloc = ATOMIC_INIT(0);
+static atomic_t mem_redirect_alloc = ATOMIC_INIT(0);
+static atomic_t mem_count[BRICK_DEBUG_MEM];
+static atomic_t mem_free[BRICK_DEBUG_MEM];
+static int mem_len[BRICK_DEBUG_MEM];
+
+#define PLUS_SIZE (6 * sizeof(int))
+#else
+#define PLUS_SIZE (2 * sizeof(int))
+#endif
+
+static inline
+void *__brick_mem_alloc(int len)
+{
+ void *res;
+
+ if (len >= PAGE_SIZE) {
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&mem_redirect_alloc);
+#endif
+ res = _brick_block_alloc(0, len, 0);
+ } else {
+ for (;;) {
+ res = kmalloc(len, GFP_BRICK);
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&phys_mem_alloc);
+#endif
+ }
+ return res;
+}
+
+static inline
+void __brick_mem_free(void *data, int len)
+{
+ if (len >= PAGE_SIZE) {
+ _brick_block_free(data, len, 0);
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&mem_redirect_alloc);
+#endif
+ } else {
+ kfree(data);
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&phys_mem_alloc);
+#endif
+ }
+}
+
+void *_brick_mem_alloc(int len, int line)
+{
+ void *res;
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+ res = __brick_mem_alloc(len + PLUS_SIZE);
+
+#ifdef BRICK_DEBUG_MEM
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ INT_ACCESS(res, 0 * sizeof(int)) = MAGIC_MEM1;
+ INT_ACCESS(res, 1 * sizeof(int)) = len;
+ INT_ACCESS(res, 2 * sizeof(int)) = line;
+ INT_ACCESS(res, 3 * sizeof(int)) = MAGIC_MEM2;
+ res += 4 * sizeof(int);
+ INT_ACCESS(res, len + 0 * sizeof(int)) = MAGIC_MEND1;
+ INT_ACCESS(res, len + 1 * sizeof(int)) = MAGIC_MEND2;
+ atomic_inc(&mem_count[line]);
+ mem_len[line] = len;
+#else
+ INT_ACCESS(res, 0 * sizeof(int)) = len;
+ res += PLUS_SIZE;
+#endif
+ return res;
+}
+EXPORT_SYMBOL_GPL(_brick_mem_alloc);
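+
+/* Resulting layout in the BRICK_DEBUG_MEM case (one cell per int):
+ *
+ * [MAGIC_MEM1][len][line][MAGIC_MEM2][payload: len bytes][MAGIC_MEND1][MAGIC_MEND2]
+ *
+ * The returned pointer addresses the payload; _brick_mem_free() below
+ * steps back 4 ints and re-checks both magic pairs to detect overruns
+ * and underruns.
+ */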
+
+void _brick_mem_free(void *data, int cline)
+{
+#ifdef BRICK_DEBUG_MEM
+ void *test = data - 4 * sizeof(int);
+ int magic1 = INT_ACCESS(test, 0 * sizeof(int));
+ int len = INT_ACCESS(test, 1 * sizeof(int));
+ int line = INT_ACCESS(test, 2 * sizeof(int));
+ int magic2 = INT_ACCESS(test, 3 * sizeof(int));
+
+ if (unlikely(magic1 != MAGIC_MEM1)) {
+ BRICK_ERR("line %d memory corruption: magix1 %08x != %08x, len = %d\n",
+ cline,
+ magic1,
+ MAGIC_MEM1,
+ len);
+ goto _out_return;
+ }
+ if (unlikely(magic2 != MAGIC_MEM2)) {
+ BRICK_ERR("line %d memory corruption: magix2 %08x != %08x, len = %d\n",
+ cline,
+ magic2,
+ MAGIC_MEM2,
+ len);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("line %d memory corruption: alloc line = %d, len = %d\n", cline, line, len);
+ goto _out_return;
+ }
+ INT_ACCESS(test, 0) = 0xffffffff;
+ magic1 = INT_ACCESS(data, len + 0 * sizeof(int));
+ if (unlikely(magic1 != MAGIC_MEND1)) {
+ BRICK_ERR("line %d memory corruption: magix1 %08x != %08x, len = %d\n",
+ cline,
+ magic1,
+ MAGIC_MEND1,
+ len);
+ goto _out_return;
+ }
+ magic2 = INT_ACCESS(data, len + 1 * sizeof(int));
+ if (unlikely(magic2 != MAGIC_MEND2)) {
+ BRICK_ERR("line %d memory corruption: magix2 %08x != %08x, len = %d\n",
+ cline,
+ magic2,
+ MAGIC_MEND2,
+ len);
+ goto _out_return;
+ }
+ INT_ACCESS(data, len) = 0xffffffff;
+ atomic_dec(&mem_count[line]);
+ atomic_inc(&mem_free[line]);
+#else
+ void *test = data - PLUS_SIZE;
+ int len = INT_ACCESS(test, 0 * sizeof(int));
+
+#endif
+ data = test;
+ __brick_mem_free(data, len + PLUS_SIZE);
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+EXPORT_SYMBOL_GPL(_brick_mem_free);
+
+/***********************************************************************/
+
+/* string memory allocation */
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+# define STRING_CANARY \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ " FILE = " __FILE__ \
+ " DATE = " __DATE__ \
+ " TIME = " __TIME__ \
+ " VERSION = " __VERSION__ \
+ " xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx STRING_error xxx\n"
+# define STRING_PLUS (sizeof(int) * 3 + sizeof(STRING_CANARY))
+#elif defined(BRICK_DEBUG_MEM)
+# define STRING_PLUS (sizeof(int) * 4)
+#else
+# define STRING_PLUS 0
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_string_alloc = ATOMIC_INIT(0);
+static atomic_t string_count[BRICK_DEBUG_MEM];
+static atomic_t string_free[BRICK_DEBUG_MEM];
+
+#endif
+
+char *_brick_string_alloc(int len, int line)
+{
+ char *res;
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+ if (unlikely(len > PAGE_SIZE))
+ BRICK_WRN("line = %d string too long: len = %d\n", line, len);
+#endif
+ if (len <= 0)
+ len = BRICK_STRING_LEN;
+
+ for (;;) {
+ res = kzalloc(len + STRING_PLUS, GFP_BRICK);
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+
+#ifdef BRICK_DEBUG_MEM
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ memset(res + 1, '?', len - 1);
+#endif
+ atomic_inc(&phys_string_alloc);
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ INT_ACCESS(res, 0) = MAGIC_STR;
+ INT_ACCESS(res, sizeof(int)) = len;
+ INT_ACCESS(res, sizeof(int) * 2) = line;
+ res += sizeof(int) * 3;
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ strcpy(res + len, STRING_CANARY);
+#else
+ INT_ACCESS(res, len) = MAGIC_SEND;
+#endif
+ atomic_inc(&string_count[line]);
+#endif
+ return res;
+}
+EXPORT_SYMBOL_GPL(_brick_string_alloc);
+
+void _brick_string_free(const char *data, int cline)
+{
+#ifdef BRICK_DEBUG_MEM
+ int magic;
+ int len;
+ int line;
+ char *orig = (void *)data;
+
+ data -= sizeof(int) * 3;
+ magic = INT_ACCESS(data, 0);
+ if (unlikely(magic != MAGIC_STR)) {
+ BRICK_ERR("cline %d stringmem corruption: magix %08x != %08x\n", cline, magic, MAGIC_STR);
+ goto _out_return;
+ }
+ len = INT_ACCESS(data, sizeof(int));
+ line = INT_ACCESS(data, sizeof(int) * 2);
+ if (unlikely(len <= 0)) {
+ BRICK_ERR("cline %d stringmem corruption: line = %d len = %d\n", cline, line, len);
+ goto _out_return;
+ }
+ if (unlikely(len > PAGE_SIZE))
+ BRICK_ERR("cline %d string too long: line = %d len = %d string='%s'\n", cline, line, len, orig);
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("cline %d stringmem corruption: line = %d (len = %d)\n", cline, line, len);
+ goto _out_return;
+ }
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ if (unlikely(strcmp(orig + len, STRING_CANARY))) {
+ BRICK_ERR("cline %d stringmem corruption: bad canary '%s', line = %d len = %d\n",
+ cline, STRING_CANARY, line, len);
+ goto _out_return;
+ }
+ orig[len]--;
+ memset(orig, '!', len);
+#else
+ magic = INT_ACCESS(orig, len);
+ if (unlikely(magic != MAGIC_SEND)) {
+ BRICK_ERR("cline %d stringmem corruption: end_magix %08x != %08x, line = %d len = %d\n",
+ cline, magic, MAGIC_SEND, line, len);
+ goto _out_return;
+ }
+ INT_ACCESS(orig, len) = 0xffffffff;
+#endif
+ atomic_dec(&string_count[line]);
+ atomic_inc(&string_free[line]);
+ atomic_dec(&phys_string_alloc);
+#endif
+ kfree(data);
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+EXPORT_SYMBOL_GPL(_brick_string_free);
+
+/***********************************************************************/
+
+/* block memory allocation */
+
+static
+int len2order(int len)
+{
+ int order = 0;
+
+ if (unlikely(len <= 0)) {
+ BRICK_ERR("trying to use %d bytes\n", len);
+ return 0;
+ }
+
+ while ((PAGE_SIZE << order) < len)
+ order++;
+
+ if (unlikely(order > BRICK_MAX_ORDER)) {
+ BRICK_ERR("trying to use %d bytes (oder = %d, max = %d)\n", len, order, BRICK_MAX_ORDER);
+ return BRICK_MAX_ORDER;
+ }
+ return order;
+}
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+static atomic_t _alloc_count[BRICK_MAX_ORDER+1];
+int brick_mem_alloc_count[BRICK_MAX_ORDER+1] = {};
+EXPORT_SYMBOL_GPL(brick_mem_alloc_count);
+int brick_mem_alloc_max[BRICK_MAX_ORDER+1] = {};
+EXPORT_SYMBOL_GPL(brick_mem_alloc_max);
+int brick_mem_freelist_max[BRICK_MAX_ORDER+1] = {};
+EXPORT_SYMBOL_GPL(brick_mem_freelist_max);
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_block_alloc = ATOMIC_INIT(0);
+
+/* indexed by line */
+static atomic_t block_count[BRICK_DEBUG_MEM];
+static atomic_t block_free[BRICK_DEBUG_MEM];
+static int block_len[BRICK_DEBUG_MEM];
+
+/* indexed by order */
+static atomic_t op_count[BRICK_MAX_ORDER+1];
+static atomic_t raw_count[BRICK_MAX_ORDER+1];
+static int alloc_line[BRICK_MAX_ORDER+1];
+static int alloc_len[BRICK_MAX_ORDER+1];
+
+#endif
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+
+#define MAX_INFO_LISTS 1024
+
+#define INFO_LIST_HASH(addr) ((unsigned long)(addr) / (PAGE_SIZE * 2) % MAX_INFO_LISTS)
+
+struct mem_block_info {
+ struct list_head inf_head;
+ void *inf_data;
+ int inf_len;
+ int inf_line;
+ bool inf_used;
+};
+
+static struct list_head inf_anchor[MAX_INFO_LISTS];
+static rwlock_t inf_lock[MAX_INFO_LISTS];
+
+static
+void _new_block_info(void *data, int len, int cline)
+{
+ struct mem_block_info *inf;
+ int hash;
+
+ for (;;) {
+ inf = kmalloc(sizeof(struct mem_block_info), GFP_BRICK);
+ if (likely(inf))
+ break;
+ msleep(1000);
+ }
+ inf->inf_data = data;
+ inf->inf_len = len;
+ inf->inf_line = cline;
+ inf->inf_used = true;
+
+ hash = INFO_LIST_HASH(data);
+
+ write_lock(&inf_lock[hash]);
+ list_add(&inf->inf_head, &inf_anchor[hash]);
+ write_unlock(&inf_lock[hash]);
+}
+
+static
+struct mem_block_info *_find_block_info(void *data, bool remove)
+{
+ struct mem_block_info *res = NULL;
+ struct list_head *tmp;
+ int hash = INFO_LIST_HASH(data);
+
+ if (remove)
+ write_lock(&inf_lock[hash]);
+ else
+ read_lock(&inf_lock[hash]);
+ for (tmp = inf_anchor[hash].next; tmp != &inf_anchor[hash]; tmp = tmp->next) {
+ struct mem_block_info *inf = container_of(tmp, struct mem_block_info, inf_head);
+
+ if (inf->inf_data != data)
+ continue;
+ if (remove)
+ list_del_init(tmp);
+ res = inf;
+ break;
+ }
+ if (remove)
+ write_unlock(&inf_lock[hash]);
+ else
+ read_unlock(&inf_lock[hash]);
+ return res;
+}
+
+#endif /* CONFIG_MARS_DEBUG_MEM_STRONG */
+
+static inline
+void *__brick_block_alloc(gfp_t gfp, int order, int cline)
+{
+ void *res;
+
+ for (;;) {
+#ifdef USE_KERNEL_PAGES
+ res = (void *)__get_free_pages(gfp, order);
+#else
+ res = __vmalloc(PAGE_SIZE << order, gfp, PAGE_KERNEL_IO);
+#endif
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ _new_block_info(res, PAGE_SIZE << order, cline);
+#endif
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&phys_block_alloc);
+ atomic_inc(&raw_count[order]);
+#endif
+ atomic64_add((PAGE_SIZE/1024) << order, &brick_global_block_used);
+
+ return res;
+}
+
+static inline
+void __brick_block_free(void *data, int order, int cline)
+{
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ struct mem_block_info *inf = _find_block_info(data, true);
+
+ if (likely(inf)) {
+ int inf_len = inf->inf_len;
+ int inf_line = inf->inf_line;
+
+ kfree(inf);
+ if (unlikely(inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR("line %d: address %p: bad freeing size %d (correct should be %d, previous line = %d)\n",
+ cline,
+ data,
+ (int)(PAGE_SIZE << order),
+ inf_len,
+ inf_line);
+ goto err;
+ }
+ } else {
+ BRICK_ERR("line %d: trying to free non-existent address %p (order = %d)\n", cline, data, order);
+ goto err;
+ }
+#endif
+#ifdef USE_KERNEL_PAGES
+ __free_pages(virt_to_page((unsigned long)data), order);
+#else
+ vfree(data);
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+err:
+#endif
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&phys_block_alloc);
+ atomic_dec(&raw_count[order]);
+#endif
+ atomic64_sub((PAGE_SIZE/1024) << order, &brick_global_block_used);
+}
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+int brick_allow_freelist = 1;
+EXPORT_SYMBOL_GPL(brick_allow_freelist);
+
+int brick_pre_reserve[BRICK_MAX_ORDER+1] = {};
+EXPORT_SYMBOL_GPL(brick_pre_reserve);
+
+/* Note: we have no separate lists per CPU.
+ * This should not hurt because the freelists are only used
+ * for higher-order pages, which are allocated rather infrequently.
+ */
+static spinlock_t freelist_lock[BRICK_MAX_ORDER+1];
+static void *brick_freelist[BRICK_MAX_ORDER+1];
+static atomic_t freelist_count[BRICK_MAX_ORDER+1];
+
+static
+void *_get_free(int order, int cline)
+{
+ void *data;
+
+ spin_lock(&freelist_lock[order]);
+ data = brick_freelist[order];
+ if (likely(data)) {
+ void *next = *(void **)data;
+
+#ifdef BRICK_DEBUG_MEM /* check for corruptions */
+ long pattern = *(((long *)data)+1);
+ void *copy = *(((void **)data)+2);
+
+ if (unlikely(pattern != 0xf0f0f0f0f0f0f0f0 || next != copy)) { /* found a corruption */
+ /* prevent further trouble by leaving a memleak */
+ brick_freelist[order] = NULL;
+ spin_unlock(&freelist_lock[order]);
+ BRICK_ERR("line %d:freelist corruption at %p (pattern = %lx next %p != %p, murdered = %d), order = %d\n",
+ cline, data, pattern, next, copy, atomic_read(&freelist_count[order]), order);
+ return NULL;
+ }
+#endif
+ brick_freelist[order] = next;
+ atomic_dec(&freelist_count[order]);
+ }
+ spin_unlock(&freelist_lock[order]);
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ if (data) {
+ struct mem_block_info *inf = _find_block_info(data, false);
+
+ if (likely(inf)) {
+ if (unlikely(inf->inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR("line %d: address %p: bad freelist size %d (correct should be %d, previous line = %d)\n",
+ cline, data, (int)(PAGE_SIZE << order), inf->inf_len, inf->inf_line);
+ }
+ inf->inf_line = cline;
+ inf->inf_used = true;
+ } else {
+ BRICK_ERR("line %d: freelist address %p is invalid (order = %d)\n", cline, data, order);
+ }
+ }
+#endif
+ return data;
+}
+
+static
+void _put_free(void *data, int order)
+{
+ void *next;
+
+#ifdef BRICK_DEBUG_MEM /* fill with pattern */
+ memset(data, 0xf0, PAGE_SIZE << order);
+#endif
+
+ spin_lock(&freelist_lock[order]);
+ next = brick_freelist[order];
+ *(void **)data = next;
+#ifdef BRICK_DEBUG_MEM /* insert redundant copy for checking */
+ *(((void **)data)+2) = next;
+#endif
+ brick_freelist[order] = data;
+ spin_unlock(&freelist_lock[order]);
+ atomic_inc(&freelist_count[order]);
+}
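+
+/* Layout of a block parked on the freelist (BRICK_DEBUG_MEM case):
+ * word 0 holds the next pointer, word 2 a redundant copy of it, and
+ * everything else keeps the 0xf0 fill pattern. _get_free() cross-checks
+ * word 1 and the copy so that stray writes into parked blocks are
+ * detected as freelist corruption.
+ */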
+
+static
+void _free_all(void)
+{
+ int order;
+
+ for (order = BRICK_MAX_ORDER; order >= 0; order--) {
+ for (;;) {
+ void *data = _get_free(order, __LINE__);
+
+ if (!data)
+ break;
+ __brick_block_free(data, order, __LINE__);
+ }
+ }
+}
+
+int brick_mem_reserve(void)
+{
+ int order;
+ int status = 0;
+
+ for (order = BRICK_MAX_ORDER; order >= 0; order--) {
+ int max = brick_pre_reserve[order];
+ int i;
+
+ brick_mem_freelist_max[order] += max;
+ BRICK_INF("preallocating %d at order %d (new maxlevel = %d)\n",
+ max,
+ order,
+ brick_mem_freelist_max[order]);
+
+ max = brick_mem_freelist_max[order] - atomic_read(&freelist_count[order]);
+ if (max >= 0) {
+ for (i = 0; i < max; i++) {
+ void *data = __brick_block_alloc(GFP_KERNEL, order, __LINE__);
+
+ if (likely(data))
+ _put_free(data, order);
+ else
+ status = -ENOMEM;
+ }
+ } else {
+ for (i = 0; i < -max; i++) {
+ void *data = _get_free(order, __LINE__);
+
+ if (likely(data))
+ __brick_block_free(data, order, __LINE__);
+ }
+ }
+ }
+ return status;
+}
+#else
+int brick_mem_reserve(void)
+{
+ BRICK_INF("preallocation is not compiled in\n");
+ return 0;
+}
+#endif
+EXPORT_SYMBOL_GPL(brick_mem_reserve);
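+
+/* Usage sketch (hypothetical values, for illustration only):
+ *
+ * brick_pre_reserve[4] = 64;
+ * status = brick_mem_reserve();
+ *
+ * Each call adds brick_pre_reserve[] to brick_mem_freelist_max[] and
+ * then grows or shrinks the per-order freelists towards the new level.
+ */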
+
+void *_brick_block_alloc(loff_t pos, int len, int line)
+{
+ void *data;
+ int count;
+
+#ifdef BRICK_DEBUG_MEM
+#ifdef BRICK_DEBUG_ORDER0
+ const int plus0 = PAGE_SIZE;
+
+#else
+ const int plus0 = 0;
+
+#endif
+ const int plus = len <= PAGE_SIZE ? plus0 : PAGE_SIZE * 2;
+
+#else
+ const int plus = 0;
+
+#endif
+ int order = len2order(len + plus);
+
+ if (unlikely(order < 0)) {
+ BRICK_ERR("trying to allocate %d bytes (max = %d)\n", len, (int)(PAGE_SIZE << order));
+ return NULL;
+ }
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ count = atomic_add_return(1, &_alloc_count[order]);
+ brick_mem_alloc_count[order] = count;
+ if (count > brick_mem_alloc_max[order])
+ brick_mem_alloc_max[order] = count;
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&op_count[order]);
+ /* statistics */
+ alloc_line[order] = line;
+ alloc_len[order] = len;
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ /* Dynamic increase of limits, in order to reduce
+ * fragmentation on higher-order pages.
+ * This comes at the cost of higher memory usage.
+ */
+ if (order > 0 && count > brick_mem_freelist_max[order])
+ brick_mem_freelist_max[order] = count;
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ data = _get_free(order, line);
+ if (!data)
+#endif
+ data = __brick_block_alloc(GFP_BRICK, order, line);
+
+#ifdef BRICK_DEBUG_MEM
+ if (order > 0) {
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ atomic_inc(&block_count[line]);
+ block_len[line] = len;
+ if (order > 1) {
+ INT_ACCESS(data, 0 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, 1 * sizeof(int)) = line;
+ INT_ACCESS(data, 2 * sizeof(int)) = len;
+ data += PAGE_SIZE;
+ INT_ACCESS(data, -1 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, len) = MAGIC_BEND;
+ } else if (order == 1) {
+ INT_ACCESS(data, PAGE_SIZE + 0 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, PAGE_SIZE + 1 * sizeof(int)) = line;
+ INT_ACCESS(data, PAGE_SIZE + 2 * sizeof(int)) = len;
+ }
+ }
+#endif
+ return data;
+}
+EXPORT_SYMBOL_GPL(_brick_block_alloc);
+
+void _brick_block_free(void *data, int len, int cline)
+{
+ int order;
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ struct mem_block_info *inf;
+ char *real_data;
+
+#endif
+#ifdef BRICK_DEBUG_MEM
+ int prev_line = 0;
+
+#ifdef BRICK_DEBUG_ORDER0
+ const int plus0 = PAGE_SIZE;
+
+#else
+ const int plus0 = 0;
+
+#endif
+ const int plus = len <= PAGE_SIZE ? plus0 : PAGE_SIZE * 2;
+
+#else
+ const int plus = 0;
+
+#endif
+
+ order = len2order(len + plus);
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ real_data = data;
+ if (order > 1)
+ real_data -= PAGE_SIZE;
+ inf = _find_block_info(real_data, false);
+ if (likely(inf)) {
+ prev_line = inf->inf_line;
+ if (unlikely(inf->inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR("line %d: address %p: bad freeing size %d (correct should be %d, previous line = %d)\n",
+ cline, data, (int)(PAGE_SIZE << order), inf->inf_len, prev_line);
+ goto _out_return;
+ }
+ if (unlikely(!inf->inf_used)) {
+ BRICK_ERR("line %d: address %p: double freeing (previous line = %d)\n",
+ cline,
+ data,
+ prev_line);
+ goto _out_return;
+ }
+ inf->inf_line = cline;
+ inf->inf_used = false;
+ } else {
+ BRICK_ERR("line %d: trying to free non-existent address %p (order = %d)\n", cline, data, order);
+ goto _out_return;
+ }
+#endif
+#ifdef BRICK_DEBUG_MEM
+ if (order > 1) {
+ void *test = data - PAGE_SIZE;
+ int magic = INT_ACCESS(test, 0);
+ int line = INT_ACCESS(test, sizeof(int));
+ int oldlen = INT_ACCESS(test, sizeof(int)*2);
+ int magic1 = INT_ACCESS(data, -1 * sizeof(int));
+ int magic2;
+
+ if (unlikely(magic1 != MAGIC_BLOCK)) {
+ BRICK_ERR("line %d memory corruption: %p magix1 %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic1,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(magic != MAGIC_BLOCK)) {
+ BRICK_ERR("line %d memory corruption: %p magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("line %d memory corruption %p: alloc line = %d (previous line = %d)\n",
+ cline,
+ data,
+ line,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(oldlen != len)) {
+ BRICK_ERR("line %d memory corruption %p: len != oldlen (%d != %d, previous line = %d))\n",
+ cline,
+ data,
+ len,
+ oldlen,
+ prev_line);
+ goto _out_return;
+ }
+ magic2 = INT_ACCESS(data, len);
+ if (unlikely(magic2 != MAGIC_BEND)) {
+ BRICK_ERR("line %d memory corruption %p: magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BEND,
+ prev_line);
+ goto _out_return;
+ }
+ INT_ACCESS(test, 0) = 0xffffffff;
+ INT_ACCESS(data, len) = 0xffffffff;
+ data = test;
+ atomic_dec(&block_count[line]);
+ atomic_inc(&block_free[line]);
+ } else if (order == 1) {
+ void *test = data + PAGE_SIZE;
+ int magic = INT_ACCESS(test, 0 * sizeof(int));
+ int line = INT_ACCESS(test, 1 * sizeof(int));
+ int oldlen = INT_ACCESS(test, 2 * sizeof(int));
+
+ if (unlikely(magic != MAGIC_BLOCK)) {
+ BRICK_ERR("line %d memory corruption %p: magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("line %d memory corruption %p: alloc line = %d (previous line = %d)\n",
+ cline,
+ data,
+ line,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(oldlen != len)) {
+ BRICK_ERR("line %d memory corruption %p: len != oldlen (%d != %d, previous line = %d))\n",
+ cline,
+ data,
+ len,
+ oldlen,
+ prev_line);
+ goto _out_return;
+ }
+ atomic_dec(&block_count[line]);
+ atomic_inc(&block_free[line]);
+ }
+#endif
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ if (order > 0 && brick_allow_freelist && atomic_read(&freelist_count[order]) <= brick_mem_freelist_max[order]) {
+ _put_free(data, order);
+ } else
+#endif
+ __brick_block_free(data, order, cline);
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ brick_mem_alloc_count[order] = atomic_dec_return(&_alloc_count[order]);
+#endif
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+EXPORT_SYMBOL_GPL(_brick_block_free);
+
+struct page *brick_iomap(void *data, int *offset, int *len)
+{
+ int _offset = ((unsigned long)data) & (PAGE_SIZE-1);
+ struct page *page;
+
+ *offset = _offset;
+ if (*len > PAGE_SIZE - _offset)
+ *len = PAGE_SIZE - _offset;
+ if (is_vmalloc_addr(data))
+ page = vmalloc_to_page(data);
+ else
+ page = virt_to_page(data);
+ return page;
+}
+EXPORT_SYMBOL_GPL(brick_iomap);
+
+/***********************************************************************/
+
+/* module */
+
+void brick_mem_statistics(bool final)
+{
+#ifdef BRICK_DEBUG_MEM
+ int i;
+ int count = 0;
+ int places = 0;
+
+ BRICK_INF("======== page allocation:\n");
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ for (i = 0; i <= BRICK_MAX_ORDER; i++) {
+ BRICK_INF("pages order = %2d operations = %9d freelist_count = %4d / %3d raw_count = %5d alloc_count = %5d alloc_len = %5d line = %5d max_count = %5d\n",
+ i,
+ atomic_read(&op_count[i]),
+ atomic_read(&freelist_count[i]),
+ brick_mem_freelist_max[i],
+ atomic_read(&raw_count[i]),
+ brick_mem_alloc_count[i],
+ alloc_len[i],
+ alloc_line[i],
+ brick_mem_alloc_max[i]);
+ }
+#endif
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&block_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (last size = %4d, freed = %6d)\n",
+ i,
+ val,
+ block_len[i],
+ atomic_read(&block_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d block allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_block_alloc));
+ } else {
+ BRICK_ERR("======== %d block allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_block_alloc));
+ }
+ count = places = 0;
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&mem_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (last size = %4d, freed = %6d)\n",
+ i,
+ val,
+ mem_len[i],
+ atomic_read(&mem_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d memory allocations in %d places (phys=%d,redirect=%d)\n",
+ count, places,
+ atomic_read(&phys_mem_alloc), atomic_read(&mem_redirect_alloc));
+ } else {
+ BRICK_ERR("======== %d memory allocations in %d places (phys=%d,redirect=%d)\n",
+ count, places,
+ atomic_read(&phys_mem_alloc), atomic_read(&mem_redirect_alloc));
+ }
+ count = places = 0;
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&string_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (freed = %6d)\n",
+ i,
+ val,
+ atomic_read(&string_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d string allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_string_alloc));
+ } else {
+ BRICK_ERR("======== %d string allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_string_alloc));
+ }
+#endif
+}
+EXPORT_SYMBOL_GPL(brick_mem_statistics);
+
+/* module init stuff */
+
+int __init init_brick_mem(void)
+{
+ int i;
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ for (i = BRICK_MAX_ORDER; i >= 0; i--)
+ spin_lock_init(&freelist_lock[i]);
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ for (i = 0; i < MAX_INFO_LISTS; i++) {
+ INIT_LIST_HEAD(&inf_anchor[i]);
+ rwlock_init(&inf_lock[i]);
+ }
+#else
+ (void)i;
+#endif
+
+ get_total_ram();
+
+ return 0;
+}
+
+void exit_brick_mem(void)
+{
+ BRICK_INF("deallocating memory...\n");
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ _free_all();
+#endif
+
+ brick_mem_statistics(true);
+}
--
2.0.0

2014-07-01 21:59:56

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 12/50] mars: add new file include/linux/brick/lib_queue.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_queue.h | 146 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)
create mode 100644 include/linux/brick/lib_queue.h

diff --git a/include/linux/brick/lib_queue.h b/include/linux/brick/lib_queue.h
new file mode 100644
index 0000000..dcf4e1b
--- /dev/null
+++ b/include/linux/brick/lib_queue.h
@@ -0,0 +1,146 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#ifndef LIB_QUEUE_H
+#define LIB_QUEUE_H
+
+#define QUEUE_ANCHOR(PREFIX, KEYTYPE, HEAPTYPE) \
+ /* parameters */ \
+ /* readonly from outside */ \
+ atomic_t q_queued; \
+ atomic_t q_flying; \
+ atomic_t q_total; \
+ /* tunables */ \
+ int q_batchlen; \
+ int q_io_prio; \
+ bool q_ordering; \
+ /* private */ \
+ wait_queue_head_t *q_event; \
+ spinlock_t q_lock; \
+ struct list_head q_anchor; \
+ struct pairing_heap_##HEAPTYPE *heap_high; \
+ struct pairing_heap_##HEAPTYPE *heap_low; \
+ long long q_last_insert; /* jiffies */ \
+ KEYTYPE heap_margin; \
+ KEYTYPE last_pos; \
+ /* this comment is for keeping TRAILING_SEMICOLON happy */
+
+#define QUEUE_FUNCTIONS(PREFIX, ELEM_TYPE, HEAD, KEYFN, KEYCMP, HEAPTYPE)\
+ \
+static inline \
+void q_##PREFIX##_trigger(struct PREFIX##_queue *q) \
+{ \
+ if (q->q_event) { \
+ wake_up_interruptible(q->q_event); \
+ } \
+} \
+ \
+static inline \
+void q_##PREFIX##_init(struct PREFIX##_queue *q) \
+{ \
+ INIT_LIST_HEAD(&q->q_anchor); \
+ q->heap_low = NULL; \
+ q->heap_high = NULL; \
+ spin_lock_init(&q->q_lock); \
+ atomic_set(&q->q_queued, 0); \
+ atomic_set(&q->q_flying, 0); \
+} \
+ \
+static inline \
+void q_##PREFIX##_insert(struct PREFIX##_queue *q, ELEM_TYPE * elem) \
+{ \
+ spin_lock(&q->q_lock); \
+ \
+ if (q->q_ordering) { \
+ struct pairing_heap_##HEAPTYPE **use = &q->heap_high; \
+ if (KEYCMP(KEYFN(elem), &q->heap_margin) <= 0) { \
+ use = &q->heap_low; \
+ } \
+ ph_insert_##HEAPTYPE(use, &elem->ph); \
+ } else { \
+ list_add_tail(&elem->HEAD, &q->q_anchor); \
+ } \
+ atomic_inc(&q->q_queued); \
+ atomic_inc(&q->q_total); \
+ q->q_last_insert = jiffies; \
+ \
+ spin_unlock(&q->q_lock); \
+ \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+static inline \
+void q_##PREFIX##_pushback(struct PREFIX##_queue *q, ELEM_TYPE * elem) \
+{ \
+ if (q->q_ordering) { \
+ atomic_dec(&q->q_total); \
+ q_##PREFIX##_insert(q, elem); \
+ return; \
+ } \
+ \
+ spin_lock(&q->q_lock); \
+ \
+ list_add(&elem->HEAD, &q->q_anchor); \
+ atomic_inc(&q->q_queued); \
+ \
+ spin_unlock(&q->q_lock); \
+} \
+ \
+static inline \
+ELEM_TYPE *q_##PREFIX##_fetch(struct PREFIX##_queue *q) \
+{ \
+ ELEM_TYPE *elem = NULL; \
+ \
+ spin_lock(&q->q_lock); \
+ \
+ if (q->q_ordering) { \
+ if (!q->heap_high) { \
+ q->heap_high = q->heap_low; \
+ q->heap_low = NULL; \
+ q->heap_margin = 0; \
+ q->last_pos = 0; \
+ } \
+ if (q->heap_high) { \
+ elem = container_of(q->heap_high, ELEM_TYPE, ph);\
+ \
+ if (unlikely(KEYCMP(KEYFN(elem), &q->last_pos) < 0)) {\
+ XIO_ERR("backskip pos %lld -> %lld\n", \
+ (long long)q->last_pos, (long long)*KEYFN(elem));\
+ } \
+ memcpy(&q->last_pos, KEYFN(elem), sizeof(q->last_pos));\
+ \
+ if (KEYCMP(KEYFN(elem), &q->heap_margin) > 0) { \
+ memcpy(&q->heap_margin, KEYFN(elem), sizeof(q->heap_margin));\
+ } \
+ ph_delete_min_##HEAPTYPE(&q->heap_high); \
+ atomic_dec(&q->q_queued); \
+ } \
+ } else if (!list_empty(&q->q_anchor)) { \
+ struct list_head *next = q->q_anchor.next; \
+ list_del_init(next); \
+ atomic_dec(&q->q_queued); \
+ elem = container_of(next, ELEM_TYPE, HEAD); \
+ } \
+ \
+ spin_unlock(&q->q_lock); \
+ \
+ q_##PREFIX##_trigger(q); \
+ \
+ return elem; \
+} \
+ \
+static inline \
+void q_##PREFIX##_inc_flying(struct PREFIX##_queue *q) \
+{ \
+ atomic_inc(&q->q_flying); \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+static inline \
+void q_##PREFIX##_dec_flying(struct PREFIX##_queue *q) \
+{ \
+ atomic_dec(&q->q_flying); \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+
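+/* Instantiation sketch (hypothetical names, for illustration only;
+ * assumes a pairing_heap_logpos type generated elsewhere):
+ *
+ * struct demo_elem {
+ * struct list_head demo_head;
+ * struct pairing_heap_logpos ph;
+ * loff_t demo_pos;
+ * };
+ * struct demo_queue {
+ * QUEUE_ANCHOR(demo, loff_t, logpos)
+ * };
+ * #define DEMO_KEYFN(e) (&(e)->demo_pos)
+ * #define DEMO_KEYCMP(a, b) (*(a) < *(b) ? -1 : (*(a) > *(b) ? 1 : 0))
+ * QUEUE_FUNCTIONS(demo, struct demo_elem, demo_head, DEMO_KEYFN, DEMO_KEYCMP, logpos)
+ *
+ * This generates q_demo_init(), q_demo_insert(), q_demo_fetch() etc.
+ * operating in FIFO order, or in ascending key order via the two
+ * pairing heaps when q_ordering is set.
+ */
+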
+#endif
--
2.0.0

2014-07-01 22:02:01

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 25/50] mars: add new file include/linux/lib_log.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/lib_log.h | 314 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 314 insertions(+)
create mode 100644 include/linux/lib_log.h

diff --git a/include/linux/lib_log.h b/include/linux/lib_log.h
new file mode 100644
index 0000000..ead2a72
--- /dev/null
+++ b/include/linux/lib_log.h
@@ -0,0 +1,314 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+/* Definitions for logfile format.
+ *
+ * This is meant for sharing between different transaction logger variants,
+ * and/or for sharing with userspace tools (e.g. logfile analyzers).
+ * TODO: factor out some remaining kernelspace issues.
+ */
+
+#ifndef LIB_LOG_H
+#define LIB_LOG_H
+
+#ifdef __KERNEL__
+#include <linux/xio.h>
+
+extern atomic_t global_aio_flying;
+#endif
+
+/* The following structure is memory-only.
+ * Transfers to disk are indirectly via the
+ * format conversion functions below.
+ * The advantage is that even newer disk formats can be parsed
+ * by old code (of course, not all information / features will be
+ * available then).
+ */
+#define log_header log_header_v1
+
+struct log_header_v1 {
+ struct timespec l_stamp;
+ struct timespec l_written;
+ loff_t l_pos;
+
+ short l_len;
+ short l_code;
+ unsigned int l_seq_nr;
+ int l_crc;
+};
+
+#define FORMAT_VERSION 1 /* version of disk format, currently there is no other one */
+
+#define CODE_UNKNOWN 0
+#define CODE_WRITE_NEW 1
+#define CODE_WRITE_OLD 2
+
+#define START_MAGIC 0xa8f7e908d9177957ll
+#define END_MAGIC 0x74941fb74ab5726dll
+
+#define START_OVERHEAD \
+ ( \
+ sizeof(START_MAGIC) + \
+ sizeof(char) + \
+ sizeof(char) + \
+ sizeof(short) + \
+ sizeof(struct timespec) + \
+ sizeof(loff_t) + \
+ sizeof(int) + \
+ sizeof(int) + \
+ sizeof(short) + \
+ sizeof(short) + \
+ 0 \
+ )
+
+#define END_OVERHEAD \
+ ( \
+ sizeof(END_MAGIC) + \
+ sizeof(int) + \
+ sizeof(char) + \
+ 3 + 4 /*spare*/ + \
+ sizeof(struct timespec) + \
+ 0 \
+ )
+
+#define OVERHEAD (START_OVERHEAD + END_OVERHEAD)
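+
+/* Resulting on-disk record layout, as parsed by log_scan() below:
+ *
+ * START_MAGIC, format_version, valid_flag, total_len,
+ * l_stamp, l_pos, l_len, spares, l_code, spare,
+ * <payload of l_len bytes>,
+ * END_MAGIC, l_crc, valid_copy, spares, l_seq_nr, l_written
+ */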
+
+/* TODO: make this bytesex-aware. */
+#define DATA_PUT(data, offset, val) \
+ do { \
+ *((typeof(val) *)((data)+offset)) = val; \
+ offset += sizeof(val); \
+ } while (0)
+
+#define DATA_GET(data, offset, val) \
+ do { \
+ val = *((typeof(val) *)((data)+offset)); \
+ offset += sizeof(val); \
+ } while (0)
+
+#define SCAN_TXT \
+"at file_pos = %lld file_offset = %d scan_offset = %d (%lld) test_offset = %d (%lld) restlen = %d: "
+#define SCAN_PAR \
+file_pos, file_offset, offset, file_pos + file_offset + offset, i, file_pos + file_offset + i, restlen
+
+static inline
+int log_scan(void *buf,
+ int len,
+ loff_t file_pos,
+ int file_offset,
+ bool sloppy,
+ struct log_header *lh,
+ void **payload,
+ int *payload_len,
+ unsigned int *seq_nr)
+{
+ bool dirty = false;
+ int offset;
+ int i;
+
+ *payload = NULL;
+ *payload_len = 0;
+
+ for (i = 0; i < len && i <= len - OVERHEAD; i += sizeof(long)) {
+ long long start_magic;
+ char format_version;
+ char valid_flag;
+
+ short total_len;
+ long long end_magic;
+ char valid_copy;
+
+ int restlen = 0;
+ int found_offset;
+
+ offset = i;
+ if (unlikely(i > 0 && !sloppy)) {
+ XIO_ERR(SCAN_TXT "detected a hole / bad data\n", SCAN_PAR);
+ return -EBADMSG;
+ }
+
+ DATA_GET(buf, offset, start_magic);
+ if (unlikely(start_magic != START_MAGIC)) {
+ if (start_magic != 0)
+ dirty = true;
+ continue;
+ }
+
+ restlen = len - i;
+ if (unlikely(restlen < START_OVERHEAD)) {
+ XIO_WRN(SCAN_TXT "magic found, but restlen is too small\n", SCAN_PAR);
+ return -EAGAIN;
+ }
+
+ DATA_GET(buf, offset, format_version);
+ if (unlikely(format_version != FORMAT_VERSION)) {
+ XIO_ERR(SCAN_TXT "found unknown data format %d\n", SCAN_PAR, (int)format_version);
+ return -EBADMSG;
+ }
+ DATA_GET(buf, offset, valid_flag);
+ if (unlikely(!valid_flag)) {
+ XIO_WRN(SCAN_TXT "data is explicitly marked invalid (was there a short write?)\n", SCAN_PAR);
+ continue;
+ }
+ DATA_GET(buf, offset, total_len);
+ if (unlikely(total_len > restlen)) {
+ XIO_WRN(SCAN_TXT "total_len = %d but available data restlen = %d. Was the logfile truncated?\n",
+ SCAN_PAR,
+ total_len,
+ restlen);
+ return -EAGAIN;
+ }
+
+ memset(lh, 0, sizeof(struct log_header));
+
+ DATA_GET(buf, offset, lh->l_stamp.tv_sec);
+ DATA_GET(buf, offset, lh->l_stamp.tv_nsec);
+ DATA_GET(buf, offset, lh->l_pos);
+ DATA_GET(buf, offset, lh->l_len);
+ offset += 2; /* skip spare */
+ offset += 4; /* skip spare */
+ DATA_GET(buf, offset, lh->l_code);
+ offset += 2; /* skip spare */
+
+ found_offset = offset;
+ offset += lh->l_len;
+
+ restlen = len - offset;
+ if (unlikely(restlen < END_OVERHEAD)) {
+ XIO_WRN(SCAN_TXT "restlen %d is too small\n", SCAN_PAR, restlen);
+ return -EAGAIN;
+ }
+
+ DATA_GET(buf, offset, end_magic);
+ if (unlikely(end_magic != END_MAGIC)) {
+ XIO_WRN(SCAN_TXT "bad end_magic 0x%llx, is the logfile truncated?\n", SCAN_PAR, end_magic);
+ return -EBADMSG;
+ }
+ DATA_GET(buf, offset, lh->l_crc);
+ DATA_GET(buf, offset, valid_copy);
+
+ if (unlikely(valid_copy != 1)) {
+ XIO_WRN(SCAN_TXT "found data marked as uncompleted / invalid, len = %d, valid_flag = %d\n",
+ SCAN_PAR,
+ lh->l_len,
+ (int)valid_copy);
+ return -EBADMSG;
+ }
+
+ /* skip spares */
+ offset += 3;
+
+ DATA_GET(buf, offset, lh->l_seq_nr);
+ DATA_GET(buf, offset, lh->l_written.tv_sec);
+ DATA_GET(buf, offset, lh->l_written.tv_nsec);
+
+ if (unlikely(lh->l_seq_nr > *seq_nr + 1 && lh->l_seq_nr && *seq_nr)) {
+ XIO_ERR(SCAN_TXT "record sequence number %u mismatch, expected was %u\n",
+ SCAN_PAR,
+ lh->l_seq_nr,
+ *seq_nr + 1);
+ return -EBADMSG;
+ } else if (unlikely(lh->l_seq_nr != *seq_nr + 1 && lh->l_seq_nr && *seq_nr)) {
+ XIO_WRN(SCAN_TXT "record sequence number %u mismatch, expected was %u\n",
+ SCAN_PAR,
+ lh->l_seq_nr,
+ *seq_nr + 1);
+ }
+ *seq_nr = lh->l_seq_nr;
+
+ if (lh->l_crc) {
+ unsigned char checksum[xio_digest_size];
+
+ xio_digest(checksum, buf + found_offset, lh->l_len);
+ if (unlikely(*(int *)checksum != lh->l_crc)) {
+ XIO_ERR(SCAN_TXT "data checksumming mismatch, length = %d\n", SCAN_PAR, lh->l_len);
+ return -EBADMSG;
+ }
+ }
+
+ /* last check */
+ if (unlikely(total_len != offset - i)) {
+ XIO_ERR(SCAN_TXT "internal size mismatch: %d != %d\n", SCAN_PAR, total_len, offset - i);
+ return -EBADMSG;
+ }
+
+ /* Success... */
+ *payload = buf + found_offset;
+ *payload_len = lh->l_len;
+
+ /* don't cry when nullbytes have been skipped */
+ if (i > 0 && dirty)
+ XIO_WRN(SCAN_TXT "skipped %d dirty bytes to find valid data\n", SCAN_PAR, i);
+
+ return offset;
+ }
+
+ XIO_ERR("could not find any useful data within len=%d bytes\n", len);
+ return -EAGAIN;
+}
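+
+/* Caller sketch (hypothetical buffer management, for illustration only):
+ *
+ * struct log_header lh;
+ * void *payload;
+ * int payload_len;
+ * unsigned int seq_nr = 0;
+ * int status = log_scan(buf, len, file_pos, 0, false,
+ * &lh, &payload, &payload_len, &seq_nr);
+ *
+ * A positive status is the offset just behind the parsed record
+ * (payload then points into buf); -EAGAIN asks the caller to provide
+ * more data, while -EBADMSG indicates real corruption.
+ */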
+
+/**************************************************************************/
+
+#ifdef __KERNEL__
+
+/* Bookkeeping status between calls
+ */
+struct log_status {
+ /* interfacing */
+ wait_queue_head_t *signal_event;
+ /* tunables */
+ loff_t start_pos;
+ loff_t end_pos;
+
+ int align_size; /* alignment between requests */
+ int chunk_size; /* must be at least 8K (better 64k) */
+ int max_size; /* max payload length */
+ int io_prio;
+ bool do_crc;
+
+ /* informational */
+ atomic_t aio_flying;
+ int count;
+ loff_t log_pos;
+ struct timespec log_pos_stamp;
+
+ /* internal */
+ struct timespec tmp_pos_stamp;
+ struct xio_input *input;
+ struct xio_brick *brick;
+ struct xio_info info;
+ int offset;
+ int validflag_offset;
+ int reallen_offset;
+ int payload_offset;
+ int payload_len;
+ unsigned int seq_nr;
+ struct aio_object *log_aio;
+ struct aio_object *read_aio;
+
+ wait_queue_head_t event;
+ int error_code;
+ bool got;
+ bool do_free;
+ void *private;
+};
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos);
+void exit_logst(struct log_status *logst);
+
+void log_flush(struct log_status *logst);
+
+void *log_reserve(struct log_status *logst, struct log_header *lh);
+
+bool log_finalize(struct log_status *logst, int len, void (*endio)(void *private, int error), void *private);
+
+int log_read(struct log_status *logst, bool sloppy, struct log_header *lh, void **payload, int *payload_len);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_log_format(void);
+extern void exit_log_format(void);
+
+#endif
+#endif
--
2.0.0

2014-07-01 22:02:32

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 24/50] mars: add new file drivers/block/mars/xio_bricks/lib_mapfree.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/lib_mapfree.c | 356 ++++++++++++++++++++++++++++
1 file changed, 356 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/lib_mapfree.c

diff --git a/drivers/block/mars/xio_bricks/lib_mapfree.c b/drivers/block/mars/xio_bricks/lib_mapfree.c
new file mode 100644
index 0000000..99a5ce9
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/lib_mapfree.c
@@ -0,0 +1,356 @@
+/* (c) 2012 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/lib_mapfree.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/file.h>
+
+/* time to wait between background mapfree operations */
+int mapfree_period_sec = 10;
+EXPORT_SYMBOL_GPL(mapfree_period_sec);
+
+/* some grace space where no regular cleanup should occur */
+int mapfree_grace_keep_mb = 16;
+EXPORT_SYMBOL_GPL(mapfree_grace_keep_mb);
+
+static
+DECLARE_RWSEM(mapfree_mutex);
+
+static
+LIST_HEAD(mapfree_list);
+
+static
+void mapfree_pages(struct mapfree_info *mf, int grace_keep)
+{
+ struct address_space *mapping;
+ pgoff_t start;
+ pgoff_t end;
+
+ if (unlikely(!mf->mf_filp))
+ goto done;
+
+ mapping = mf->mf_filp->f_mapping;
+ if (unlikely(!mapping))
+ goto done;
+
+ if (grace_keep < 0) { /* force full flush */
+ start = 0;
+ end = -1;
+ } else {
+ loff_t tmp;
+ loff_t min;
+
+ spin_lock(&mf->mf_lock);
+
+ min = tmp = mf->mf_min[0];
+ if (likely(mf->mf_min[1] < min))
+ min = mf->mf_min[1];
+ if (tmp) {
+ mf->mf_min[1] = tmp;
+ mf->mf_min[0] = 0;
+ }
+
+ spin_unlock(&mf->mf_lock);
+
+ min -= (loff_t)grace_keep * (1024 * 1024); /* megabytes */
+ end = 0;
+
+ if (min > 0 || mf->mf_last) {
+ start = mf->mf_last / PAGE_SIZE;
+ /* add some grace overlapping */
+ if (likely(start > 0))
+ start--;
+ mf->mf_last = min;
+ end = min / PAGE_SIZE;
+ } else { /* there was no progress for at least 2 rounds */
+ start = 0;
+ if (!grace_keep) /* also flush thoroughly */
+ end = -1;
+ }
+
+ XIO_DBG("file = '%s' start = %lu end = %lu\n", SAFE_STR(mf->mf_name), start, end);
+ }
+
+ if (end > start || end == -1)
+ invalidate_mapping_pages(mapping, start, end);
+
+done:;
+}
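+
+/* The two mf_min[] slots above implement a simple two-round aging
+ * scheme: writers record their smallest completed position in
+ * mf_min[0] (via mapfree_set() below), which is shifted to mf_min[1]
+ * once per cleanup round. Only pages below the older minimum, minus
+ * the grace area, are dropped from the page cache.
+ */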
+
+static
+void _mapfree_put(struct mapfree_info *mf)
+{
+ if (atomic_dec_and_test(&mf->mf_count)) {
+ XIO_DBG("closing file '%s' filp = %p\n", mf->mf_name, mf->mf_filp);
+ list_del_init(&mf->mf_head);
+ CHECK_HEAD_EMPTY(&mf->mf_dirty_anchor);
+ if (likely(mf->mf_filp)) {
+ mapfree_pages(mf, -1);
+ filp_close(mf->mf_filp, NULL);
+ }
+ brick_string_free(mf->mf_name);
+ brick_mem_free(mf);
+ }
+}
+
+void mapfree_put(struct mapfree_info *mf)
+{
+ down_write(&mapfree_mutex);
+ _mapfree_put(mf);
+ up_write(&mapfree_mutex);
+}
+EXPORT_SYMBOL_GPL(mapfree_put);
+
+struct mapfree_info *mapfree_get(const char *name, int flags)
+{
+ struct mapfree_info *mf = NULL;
+ struct list_head *tmp;
+
+ if (!(flags & O_DIRECT)) {
+ down_read(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (_mf->mf_flags == flags && !strcmp(_mf->mf_name, name)) {
+ mf = _mf;
+ atomic_inc(&mf->mf_count);
+ break;
+ }
+ }
+ up_read(&mapfree_mutex);
+
+ if (mf)
+ goto done;
+ }
+
+ for (;;) {
+ struct address_space *mapping;
+ struct inode *inode = NULL;
+ int ra = 1;
+ int prot = 0600;
+
+ mm_segment_t oldfs;
+
+ mf = brick_zmem_alloc(sizeof(struct mapfree_info));
+
+ mf->mf_name = brick_strdup(name);
+
+ mf->mf_flags = flags;
+ INIT_LIST_HEAD(&mf->mf_head);
+ INIT_LIST_HEAD(&mf->mf_dirty_anchor);
+ atomic_set(&mf->mf_count, 1);
+ spin_lock_init(&mf->mf_lock);
+ mf->mf_max = -1;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ mf->mf_filp = filp_open(name, flags, prot);
+ set_fs(oldfs);
+
+ XIO_DBG("file '%s' flags = %d prot = %d filp = %p\n", name, flags, prot, mf->mf_filp);
+
+ if (unlikely(!mf->mf_filp || IS_ERR(mf->mf_filp))) {
+ int err = PTR_ERR(mf->mf_filp);
+
+ XIO_ERR("can't open file '%s' status=%d\n", name, err);
+ mf->mf_filp = NULL;
+ _mapfree_put(mf);
+ mf = NULL;
+ break;
+ }
+
+ mapping = mf->mf_filp->f_mapping;
+ if (likely(mapping))
+ inode = mapping->host;
+ if (unlikely(!mapping || !inode)) {
+ XIO_ERR("file '%s' has no mapping\n", name);
+ mf->mf_filp = NULL;
+ _mapfree_put(mf);
+ mf = NULL;
+ break;
+ }
+
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ mf->mf_max = i_size_read(inode);
+
+ if (S_ISBLK(inode->i_mode)) {
+ XIO_INF("changing blkdev readahead from %lu to %d\n",
+ inode->i_bdev->bd_disk->queue->backing_dev_info.ra_pages,
+ ra);
+ inode->i_bdev->bd_disk->queue->backing_dev_info.ra_pages = ra;
+ }
+
+ if (flags & O_DIRECT) { /* never share them */
+ break;
+ }
+
+ /* maintain global list of all open files */
+ down_write(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (unlikely(_mf->mf_flags == flags && !strcmp(_mf->mf_name, name))) {
+ XIO_WRN("race on creation of '%s' detected\n", name);
+ _mapfree_put(mf);
+ mf = _mf;
+ atomic_inc(&mf->mf_count);
+ goto leave;
+ }
+ }
+ list_add_tail(&mf->mf_head, &mapfree_list);
+leave:
+ up_write(&mapfree_mutex);
+ break;
+ }
+done:
+ return mf;
+}
+EXPORT_SYMBOL_GPL(mapfree_get);
+
+void mapfree_set(struct mapfree_info *mf, loff_t min, loff_t max)
+{
+ spin_lock(&mf->mf_lock);
+ if (!mf->mf_min[0] || mf->mf_min[0] > min)
+ mf->mf_min[0] = min;
+ if (max >= 0 && mf->mf_max < max)
+ mf->mf_max = max;
+ spin_unlock(&mf->mf_lock);
+}
+EXPORT_SYMBOL_GPL(mapfree_set);
+
+static
+int mapfree_thread(void *data)
+{
+ while (!brick_thread_should_stop()) {
+ struct mapfree_info *mf = NULL;
+ struct list_head *tmp;
+ long long eldest = 0;
+
+ brick_msleep(500);
+
+ if (mapfree_period_sec <= 0)
+ continue;
+
+ down_read(&mapfree_mutex);
+
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (unlikely(!_mf->mf_jiffies)) {
+ _mf->mf_jiffies = jiffies;
+ continue;
+ }
+ if ((long long)jiffies - _mf->mf_jiffies > mapfree_period_sec * HZ &&
+ (!mf || _mf->mf_jiffies < eldest)) {
+ mf = _mf;
+ eldest = _mf->mf_jiffies;
+ }
+ }
+ if (mf)
+ atomic_inc(&mf->mf_count);
+
+ up_read(&mapfree_mutex);
+
+ if (!mf)
+ continue;
+
+ mapfree_pages(mf, mapfree_grace_keep_mb);
+
+ mf->mf_jiffies = jiffies;
+ mapfree_put(mf);
+ }
+ return 0;
+}
+
+/***************** dirty IOs on the fly *****************/
+
+void mf_insert_dirty(struct mapfree_info *mf, struct dirty_info *di)
+{
+ if (likely(di->dirty_aio)) {
+ spin_lock(&mf->mf_lock);
+ list_del(&di->dirty_head);
+ list_add(&di->dirty_head, &mf->mf_dirty_anchor);
+ spin_unlock(&mf->mf_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(mf_insert_dirty);
+
+void mf_remove_dirty(struct mapfree_info *mf, struct dirty_info *di)
+{
+ if (!list_empty(&di->dirty_head)) {
+ spin_lock(&mf->mf_lock);
+ list_del_init(&di->dirty_head);
+ spin_unlock(&mf->mf_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(mf_remove_dirty);
+
+void mf_get_dirty(struct mapfree_info *mf, loff_t *min, loff_t *max, int min_stage, int max_stage)
+{
+ struct list_head *tmp;
+
+ spin_lock(&mf->mf_lock);
+ for (tmp = mf->mf_dirty_anchor.next; tmp != &mf->mf_dirty_anchor; tmp = tmp->next) {
+ struct dirty_info *di = container_of(tmp, struct dirty_info, dirty_head);
+ struct aio_object *aio = di->dirty_aio;
+
+ if (unlikely(!aio))
+ continue;
+ if (di->dirty_stage < min_stage || di->dirty_stage > max_stage)
+ continue;
+ if (aio->io_pos < *min)
+ *min = aio->io_pos;
+ if (aio->io_pos + aio->io_len > *max)
+ *max = aio->io_pos + aio->io_len;
+ }
+ spin_unlock(&mf->mf_lock);
+}
+EXPORT_SYMBOL_GPL(mf_get_dirty);
+
+void mf_get_any_dirty(const char *filename, loff_t *min, loff_t *max, int min_stage, int max_stage)
+{
+ struct list_head *tmp;
+
+ down_read(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (!strcmp(mf->mf_name, filename))
+ mf_get_dirty(mf, min, max, min_stage, max_stage);
+ }
+ up_read(&mapfree_mutex);
+}
+EXPORT_SYMBOL_GPL(mf_get_any_dirty);
+
+/***************** module init stuff ************************/
+
+static
+struct task_struct *mf_thread = NULL;
+
+int __init init_xio_mapfree(void)
+{
+ XIO_DBG("init_mapfree()\n");
+ mf_thread = brick_thread_create(mapfree_thread, NULL, "xio_mapfree");
+ if (unlikely(!mf_thread)) {
+ XIO_ERR("could not create mapfree thread\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void exit_xio_mapfree(void)
+{
+ XIO_DBG("exit_mapfree()\n");
+ if (likely(mf_thread)) {
+ brick_thread_stop(mf_thread);
+ mf_thread = NULL;
+ }
+}
--
2.0.0

2014-07-01 22:02:34

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 22/50] mars: add new file drivers/block/mars/xio_bricks/xio_net.c

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/block/mars/xio_bricks/xio_net.c | 1445 +++++++++++++++++++++++++++++++
1 file changed, 1445 insertions(+)
create mode 100644 drivers/block/mars/xio_bricks/xio_net.c

diff --git a/drivers/block/mars/xio_bricks/xio_net.c b/drivers/block/mars/xio_bricks/xio_net.c
new file mode 100644
index 0000000..4e31a78
--- /dev/null
+++ b/drivers/block/mars/xio_bricks/xio_net.c
@@ -0,0 +1,1445 @@
+/* (c) 2011 Thomas Schoebel-Theuer / 1&1 Internet AG */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/moduleparam.h>
+
+#include <linux/xio.h>
+#include <linux/xio_net.h>
+
+#define USE_BUFFERING
+
+#define SEND_PROTO_VERSION 1
+
+/******************************************************************/
+
+/* Internal data structures for low-level transfer of C structures
+ * described by struct meta.
+ * Only these low-level fields need to have a fixed size like s64.
+ * The size and bytesex of the higher-level C structures is converted
+ * automatically; therefore classical "int" or "long long" etc is viable.
+ */
+
+#define MAX_FIELD_LEN (32 + 16)
+
+struct xio_desc_cache {
+ u64 cache_sender_cookie;
+ u64 cache_recver_cookie;
+ u16 cache_sender_proto;
+ u16 cache_recver_proto;
+ s16 cache_items;
+ s8 cache_is_bigendian;
+ s8 cache_spare1;
+ s32 cache_spare2;
+ s32 cache_spare3;
+ u64 cache_spare4[4];
+};
+
+struct xio_desc_item {
+ char field_name[MAX_FIELD_LEN];
+
+ s8 field_type;
+ s8 field_spare0;
+ s16 field_data_size;
+ s16 field_sender_size;
+ s16 field_sender_offset;
+ s16 field_recver_size;
+ s16 field_recver_offset;
+ s32 field_spare;
+};
+
+/* This must not be mirror symmetric between big and little endian
+ */
+#define XIO_DESC_MAGIC 0x73D0A2EC6148F48Ell
+
+struct xio_desc_header {
+ u64 h_magic;
+ u64 h_cookie;
+ s16 h_meta_len;
+ s16 h_index;
+ u32 h_spare1;
+ u64 h_spare2;
+};
+
+#define MAX_INT_TRANSFER 16
+
+/******************************************************************/
+
+/* Bytesex conversion / sign extension
+ */
+
+#ifdef __LITTLE_ENDIAN
+static const bool myself_is_bigendian;
+
+#endif
+#ifdef __BIG_ENDIAN
+static const bool myself_is_bigendian = true;
+
+#endif
+
+static inline
+void swap_bytes(void *data, int len)
+{
+ char *a = data;
+ char *b = data + len - 1;
+
+ while (a < b) {
+ char tmp = *a;
+
+ *a = *b;
+ *b = tmp;
+ a++;
+ b--;
+ }
+}
+
+#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x))
+
+static inline
+void swap_mc(struct xio_desc_cache *mc, int len)
+{
+ struct xio_desc_item *mi;
+
+ SWAP_FIELD(mc->cache_sender_cookie);
+ SWAP_FIELD(mc->cache_recver_cookie);
+ SWAP_FIELD(mc->cache_items);
+
+ len -= sizeof(*mc);
+
+ for (mi = (void *)(mc + 1); len > 0; mi++, len -= sizeof(*mi)) {
+ SWAP_FIELD(mi->field_type);
+ SWAP_FIELD(mi->field_data_size);
+ SWAP_FIELD(mi->field_sender_size);
+ SWAP_FIELD(mi->field_sender_offset);
+ SWAP_FIELD(mi->field_recver_size);
+ SWAP_FIELD(mi->field_recver_offset);
+ }
+}
+
+static inline
+char get_sign(const void *data, int len, bool is_bigendian, bool is_signed)
+{
+ if (is_signed) {
+ char x = is_bigendian ?
+ ((const char *)data)[0] :
+ ((const char *)data)[len - 1];
+ if (x < 0)
+ return -1;
+ }
+ return 0;
+}
+
+/******************************************************************/
+
+/* Low-level network traffic
+ */
+
+int xio_net_default_port = CONFIG_MARS_DEFAULT_PORT;
+EXPORT_SYMBOL_GPL(xio_net_default_port);
+module_param_named(xio_port, xio_net_default_port, int, 0);
+
+/* TODO: allow binding to specific source addresses instead of catch-all.
+ * TODO: make all the socket options configurable.
+ * TODO: implement signal handling.
+ * TODO: add authentication.
+ * TODO: add compression / encryption.
+ */
+
+struct xio_tcp_params default_tcp_params = {
+ .ip_tos = IPTOS_LOWDELAY,
+ .tcp_window_size = 8 * 1024 * 1024, /* for long distance replications */
+ .tcp_nodelay = 0,
+ .tcp_timeout = 2,
+ .tcp_keepcnt = 3,
+ .tcp_keepintvl = 3, /* keepalive ping time */
+ .tcp_keepidle = 4,
+};
+EXPORT_SYMBOL(default_tcp_params);
+
+static
+void __setsockopt(struct socket *sock, int level, int optname, char *optval, int optsize)
+{
+ int status = kernel_setsockopt(sock, level, optname, optval, optsize);
+
+ if (status < 0) {
+ XIO_WRN("cannot set %d socket option %d to value %d, status = %d\n",
+ level, optname, *(int *)optval, status);
+ }
+}
+
+#define _setsockopt(sock, level, optname, val) __setsockopt(sock, level, optname, (char *)&(val), sizeof(val))
+
+int xio_create_sockaddr(struct sockaddr_storage *addr, const char *spec)
+{
+ struct sockaddr_in *sockaddr = (void *)addr;
+ const char *new_spec;
+ const char *tmp_spec;
+ int status = 0;
+
+ memset(addr, 0, sizeof(*addr));
+ sockaddr->sin_family = AF_INET;
+ sockaddr->sin_port = htons(xio_net_default_port);
+
+ /* Try to translate hostnames to IPs if possible.
+ */
+ if (xio_translate_hostname)
+ new_spec = xio_translate_hostname(spec);
+ else
+ new_spec = brick_strdup(spec);
+ tmp_spec = new_spec;
+
+ /* This is PROVISIONAL!
+ * TODO: add IPV6 syntax and many more features :)
+ */
+ if (!*tmp_spec)
+ goto done;
+ if (*tmp_spec != ':') {
+ unsigned char u0 = 0, u1 = 0, u2 = 0, u3 = 0;
+
+ status = sscanf(tmp_spec, "%hhu.%hhu.%hhu.%hhu", &u0, &u1, &u2, &u3);
+ if (status != 4) {
+ XIO_ERR("invalid sockaddr IP syntax '%s', status = %d\n", tmp_spec, status);
+ status = -EINVAL;
+ goto done;
+ }
+ XIO_DBG("decoded IP = %u.%u.%u.%u\n", u0, u1, u2, u3);
+ sockaddr->sin_addr.s_addr = htonl((u32)u0 << 24 | (u32)u1 << 16 | (u32)u2 << 8 | (u32)u3);
+ }
+ /* decode port number (when present) */
+ tmp_spec = spec;
+ while (*tmp_spec && *tmp_spec++ != ':')
+ /*empty*/;
+ if (*tmp_spec) {
+ int port = 0;
+
+ status = kstrtoint(tmp_spec, 0, &port);
+ if (unlikely(status)) {
+ XIO_ERR("invalid sockaddr PORT syntax '%s', status = %d\n", tmp_spec, status);
+ status = -EINVAL;
+ goto done;
+ }
+ XIO_DBG("decoded PORT = %d\n", port);
+ sockaddr->sin_port = htons(port);
+ }
+ status = 0;
+done:
+ brick_string_free(new_spec);
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_create_sockaddr);
+
+static int current_debug_nr; /* no locking, just for debugging */
+
+static
+void _set_socketopts(struct socket *sock)
+{
+ struct timeval t = {
+ .tv_sec = default_tcp_params.tcp_timeout,
+ };
+ int x_true = 1;
+
+ /* TODO: improve this by a table-driven approach
+ */
+ sock->sk->sk_rcvtimeo = sock->sk->sk_sndtimeo = default_tcp_params.tcp_timeout * HZ;
+ sock->sk->sk_reuse = 1;
+ _setsockopt(sock, SOL_SOCKET, SO_SNDBUFFORCE, default_tcp_params.tcp_window_size);
+ _setsockopt(sock, SOL_SOCKET, SO_RCVBUFFORCE, default_tcp_params.tcp_window_size);
+ _setsockopt(sock, SOL_IP, SO_PRIORITY, default_tcp_params.ip_tos);
+ _setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, default_tcp_params.tcp_nodelay);
+ _setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, x_true);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, default_tcp_params.tcp_keepcnt);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, default_tcp_params.tcp_keepintvl);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, default_tcp_params.tcp_keepidle);
+ _setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, t);
+ _setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, t);
+
+ if (sock->file) { /* switch back to blocking mode */
+ sock->file->f_flags &= ~O_NONBLOCK;
+ }
+}
+
+int xio_proto_exchange(struct xio_socket *msock, const char *msg)
+{
+ int status;
+
+ msock->s_send_proto = SEND_PROTO_VERSION;
+ status = xio_send_raw(msock, &msock->s_send_proto, 1, false);
+ if (unlikely(status < 0)) {
+ XIO_DBG("#%d protocol exchange on %s failed at sending, status = %d\n",
+ msock->s_debug_nr,
+ msg,
+ status);
+ goto done;
+ }
+ status = xio_recv_raw(msock, &msock->s_recv_proto, 1, 1);
+ if (unlikely(status < 0)) {
+ XIO_DBG("#%d protocol exchange on %s failed at receiving, status = %d\n",
+ msock->s_debug_nr,
+ msg,
+ status);
+ goto done;
+ }
+ /* take the minimum of both protocol versions */
+ if (msock->s_send_proto > msock->s_recv_proto)
+ msock->s_send_proto = msock->s_recv_proto;
+done:
+ return status;
+}
+
+int xio_create_socket(struct xio_socket *msock, struct sockaddr_storage *addr, bool is_server)
+{
+ struct socket *sock;
+ struct sockaddr *sockaddr = (void *)addr;
+ int status = -EEXIST;
+
+ if (unlikely(atomic_read(&msock->s_count))) {
+ XIO_ERR("#%d socket already in use\n", msock->s_debug_nr);
+ goto final;
+ }
+ if (unlikely(msock->s_socket)) {
+ XIO_ERR("#%d socket already open\n", msock->s_debug_nr);
+ goto final;
+ }
+ atomic_set(&msock->s_count, 1);
+
+ status = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &msock->s_socket);
+ if (unlikely(status < 0 || !msock->s_socket)) {
+ msock->s_socket = NULL;
+ XIO_WRN("cannot create socket, status = %d\n", status);
+ goto final;
+ }
+ msock->s_debug_nr = ++current_debug_nr;
+ sock = msock->s_socket;
+ CHECK_PTR(sock, done);
+ msock->s_alive = true;
+
+ _set_socketopts(sock);
+
+ if (is_server) {
+ status = kernel_bind(sock, sockaddr, sizeof(*sockaddr));
+ if (unlikely(status < 0)) {
+ XIO_WRN("#%d bind failed, status = %d\n", msock->s_debug_nr, status);
+ goto done;
+ }
+ status = kernel_listen(sock, 16);
+ if (status < 0)
+ XIO_WRN("#%d listen failed, status = %d\n", msock->s_debug_nr, status);
+ } else {
+ status = kernel_connect(sock, sockaddr, sizeof(*sockaddr), 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG("#%d connect failed, status = %d\n", msock->s_debug_nr, status);
+ goto done;
+ }
+ status = xio_proto_exchange(msock, "connect");
+ }
+
+done:
+ if (status < 0)
+ xio_put_socket(msock);
+ else
+ XIO_DBG("successfully created socket #%d\n", msock->s_debug_nr);
+final:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_create_socket);
+
+int xio_accept_socket(struct xio_socket *new_msock, struct xio_socket *old_msock)
+{
+ int status = -ENOENT;
+ struct socket *new_socket = NULL;
+ bool ok;
+
+ ok = xio_get_socket(old_msock);
+ if (likely(ok)) {
+ struct socket *sock = old_msock->s_socket;
+
+ if (unlikely(!sock))
+ goto err;
+
+ status = kernel_accept(sock, &new_socket, O_NONBLOCK);
+ if (unlikely(status < 0))
+ goto err;
+ if (unlikely(!new_socket)) {
+ status = -EBADF;
+ goto err;
+ }
+
+ _set_socketopts(new_socket);
+
+ memset(new_msock, 0, sizeof(struct xio_socket));
+ new_msock->s_socket = new_socket;
+ atomic_set(&new_msock->s_count, 1);
+ new_msock->s_alive = true;
+ new_msock->s_debug_nr = ++current_debug_nr;
+ XIO_DBG("#%d successfully accepted socket #%d\n", old_msock->s_debug_nr, new_msock->s_debug_nr);
+
+ status = xio_proto_exchange(new_msock, "accept");
+err:
+ xio_put_socket(old_msock);
+ }
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_accept_socket);
+
+bool xio_get_socket(struct xio_socket *msock)
+{
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p\n", msock->s_debug_nr, msock);
+ return false;
+ }
+
+ atomic_inc(&msock->s_count);
+
+ if (unlikely(!msock->s_socket || !msock->s_alive)) {
+ xio_put_socket(msock);
+ return false;
+ }
+ return true;
+}
+EXPORT_SYMBOL_GPL(xio_get_socket);
+
+void xio_put_socket(struct xio_socket *msock)
+{
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p sock = %p\n", msock->s_debug_nr, msock, msock->s_socket);
+ } else if (atomic_dec_and_test(&msock->s_count)) {
+ struct socket *sock = msock->s_socket;
+ int i;
+
+ XIO_DBG("#%d closing socket %p\n", msock->s_debug_nr, sock);
+ if (likely(sock && cmpxchg(&msock->s_alive, true, false)))
+ kernel_sock_shutdown(sock, SHUT_WR);
+ if (likely(sock && !msock->s_alive)) {
+ XIO_DBG("#%d releasing socket %p\n", msock->s_debug_nr, sock);
+ sock_release(sock);
+ }
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ if (msock->s_desc_send[i])
+ brick_block_free(msock->s_desc_send[i], PAGE_SIZE);
+ if (msock->s_desc_recv[i])
+ brick_block_free(msock->s_desc_recv[i], PAGE_SIZE);
+ }
+ brick_block_free(msock->s_buffer, PAGE_SIZE);
+ memset(msock, 0, sizeof(struct xio_socket));
+ }
+}
+EXPORT_SYMBOL_GPL(xio_put_socket);
+
+void xio_shutdown_socket(struct xio_socket *msock)
+{
+ if (msock->s_socket) {
+ bool ok = xio_get_socket(msock);
+
+ if (likely(ok)) {
+ struct socket *sock = msock->s_socket;
+
+ if (likely(sock && cmpxchg(&msock->s_alive, true, false))) {
+ XIO_DBG("#%d shutdown socket %p\n", msock->s_debug_nr, sock);
+ kernel_sock_shutdown(sock, SHUT_WR);
+ }
+ xio_put_socket(msock);
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(xio_shutdown_socket);
+
+bool xio_socket_is_alive(struct xio_socket *msock)
+{
+ bool res = false;
+
+ if (!msock->s_socket || !msock->s_alive)
+ goto done;
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p sock = %p\n", msock->s_debug_nr, msock, msock->s_socket);
+ goto done;
+ }
+ res = true;
+done:
+ return res;
+}
+EXPORT_SYMBOL_GPL(xio_socket_is_alive);
+
+static
+int _xio_send_raw(struct xio_socket *msock, const void *buf, int len)
+{
+ int sleeptime = 1000 / HZ;
+ int sent = 0;
+ int status = 0;
+
+ msock->s_send_cnt = 0;
+ while (len > 0) {
+ int this_len = len;
+ struct socket *sock = msock->s_socket;
+
+ if (unlikely(!sock || !xio_net_is_alive || brick_thread_should_stop())) {
+ XIO_WRN("interrupting, sent = %d\n", sent);
+ status = -EIDRM;
+ break;
+ }
+
+ {
+ struct kvec iov = {
+ .iov_base = (void *)buf,
+ .iov_len = this_len,
+ };
+ struct msghdr msg = {
+ .msg_iov = (struct iovec *)&iov,
+ .msg_flags = 0 | MSG_NOSIGNAL,
+ };
+ status = kernel_sendmsg(sock, &msg, &iov, 1, this_len);
+ }
+
+ if (status == -EAGAIN) {
+ if (msock->s_send_abort > 0 && ++msock->s_send_cnt > msock->s_send_abort) {
+ XIO_WRN("#%d reached send abort %d\n", msock->s_debug_nr, msock->s_send_abort);
+ status = -EINTR;
+ break;
+ }
+ brick_msleep(sleeptime);
+ /* linearly increasing backoff */
+ if (sleeptime < 100)
+ sleeptime += 1000 / HZ;
+ continue;
+ }
+ msock->s_send_cnt = 0;
+ if (unlikely(status == -EINTR)) { /* ignore it */
+ flush_signals(current);
+ brick_msleep(50);
+ continue;
+ }
+ if (unlikely(!status)) {
+ XIO_WRN("#%d EOF from socket upon send_page()\n", msock->s_debug_nr);
+ brick_msleep(50);
+ status = -ECOMM;
+ break;
+ }
+ if (unlikely(status < 0)) {
+ XIO_WRN("#%d bad socket sendmsg, len=%d, this_len=%d, sent=%d, status = %d\n",
+ msock->s_debug_nr,
+ len,
+ this_len,
+ sent,
+ status);
+ break;
+ }
+
+ len -= status;
+ buf += status;
+ sent += status;
+ sleeptime = 1000 / HZ;
+ }
+
+ if (status >= 0)
+ status = sent;
+
+ return status;
+}
+
+int xio_send_raw(struct xio_socket *msock, const void *buf, int len, bool cork)
+{
+#ifdef USE_BUFFERING
+ int sent = 0;
+ int rest = len;
+
+#endif
+ int status = -EINVAL;
+
+ if (!xio_get_socket(msock))
+ goto final;
+
+#ifdef USE_BUFFERING
+restart:
+ if (!msock->s_buffer) {
+ msock->s_pos = 0;
+ msock->s_buffer = brick_block_alloc(0, PAGE_SIZE);
+ }
+
+ if (msock->s_pos + rest < PAGE_SIZE) {
+ memcpy(msock->s_buffer + msock->s_pos, buf, rest);
+ msock->s_pos += rest;
+ sent += rest;
+ rest = 0;
+ status = sent;
+ if (cork)
+ goto done;
+ }
+
+ if (msock->s_pos > 0) {
+ status = _xio_send_raw(msock, msock->s_buffer, msock->s_pos);
+ if (status < 0)
+ goto done;
+
+ brick_block_free(msock->s_buffer, PAGE_SIZE);
+ msock->s_buffer = NULL;
+ msock->s_pos = 0;
+ }
+
+ if (rest >= PAGE_SIZE) {
+ status = _xio_send_raw(msock, buf, rest);
+ goto done;
+ } else if (rest > 0) {
+ goto restart;
+ }
+ status = sent;
+
+done:
+#else
+ status = _xio_send_raw(msock, buf, len);
+#endif
+ if (status < 0 && msock->s_shutdown_on_err)
+ xio_shutdown_socket(msock);
+
+ xio_put_socket(msock);
+
+final:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_send_raw);
+
+/* Note: buf may be NULL. In this case, the data is simply consumed,
+ * like /dev/null
+ */
+int xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen)
+{
+ void *dummy = NULL;
+ int sleeptime = 1000 / HZ;
+ int status = -EIDRM;
+ int done = 0;
+
+ if (!buf)
+ buf = dummy = brick_block_alloc(0, maxlen);
+
+ if (!xio_get_socket(msock))
+ goto final;
+
+ msock->s_recv_cnt = 0;
+
+ while (done < minlen) {
+ struct kvec iov = {
+ .iov_base = buf + done,
+ .iov_len = maxlen - done,
+ };
+ struct msghdr msg = {
+ .msg_iovlen = 1,
+ .msg_iov = (struct iovec *)&iov,
+ .msg_flags = MSG_NOSIGNAL,
+ };
+ struct socket *sock = msock->s_socket;
+
+ if (unlikely(!sock)) {
+ XIO_WRN("#%d socket has disappeared\n", msock->s_debug_nr);
+ status = -EIDRM;
+ goto err;
+ }
+
+ if (!xio_net_is_alive || brick_thread_should_stop()) {
+ XIO_WRN("#%d interrupting, done = %d\n", msock->s_debug_nr, done);
+ if (done > 0)
+ status = -EIDRM;
+ goto err;
+ }
+
+ status = kernel_recvmsg(sock, &msg, &iov, 1, maxlen-done, msg.msg_flags);
+
+ if (!xio_net_is_alive || brick_thread_should_stop()) {
+ XIO_WRN("#%d interrupting, done = %d\n", msock->s_debug_nr, done);
+ if (done > 0)
+ status = -EIDRM;
+ goto err;
+ }
+
+ if (status == -EAGAIN) {
+ if (msock->s_recv_abort > 0 && ++msock->s_recv_cnt > msock->s_recv_abort) {
+ XIO_WRN("#%d reached recv abort %d\n", msock->s_debug_nr, msock->s_recv_abort);
+ status = -EINTR;
+ goto err;
+ }
+ brick_msleep(sleeptime);
+ /* linearly increasing backoff */
+ if (sleeptime < 100)
+ sleeptime += 1000 / HZ;
+ continue;
+ }
+ msock->s_recv_cnt = 0;
+ if (!status) { /* EOF */
+ XIO_WRN("#%d got EOF from socket (done=%d, req_size=%d)\n",
+ msock->s_debug_nr,
+ done,
+ maxlen - done);
+ status = -EPIPE;
+ goto err;
+ }
+ if (status < 0) {
+ XIO_WRN("#%d bad recvmsg, status = %d\n", msock->s_debug_nr, status);
+ goto err;
+ }
+ done += status;
+ sleeptime = 1000 / HZ;
+ }
+ status = done;
+
+err:
+ if (status < 0 && msock->s_shutdown_on_err)
+ xio_shutdown_socket(msock);
+ xio_put_socket(msock);
+final:
+ if (dummy)
+ brick_block_free(dummy, maxlen);
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_recv_raw);
+
+/*********************************************************************/
+
+/* Mid-level field data exchange
+ */
+
+static
+int _add_fields(struct xio_desc_item *mi, const struct meta *meta, int offset, const char *prefix, int maxlen)
+{
+ int count = 0;
+
+ for (; meta->field_name != NULL; meta++) {
+ const char *new_prefix;
+ int new_offset;
+ int len;
+
+ short this_size;
+
+ new_prefix = mi->field_name;
+ new_offset = offset + meta->field_offset;
+
+ if (unlikely(maxlen < sizeof(struct xio_desc_item))) {
+ XIO_ERR("desc cache item overflow\n");
+ count = -1;
+ goto done;
+ }
+
+ len = scnprintf(mi->field_name, MAX_FIELD_LEN, "%s.%s", prefix, meta->field_name);
+ if (unlikely(len >= MAX_FIELD_LEN)) {
+ XIO_ERR("field len overflow on '%s.%s'\n", prefix, meta->field_name);
+ count = -1;
+ goto done;
+ }
+ mi->field_type = meta->field_type;
+ this_size = meta->field_data_size;
+ mi->field_data_size = this_size;
+ mi->field_sender_size = this_size;
+ this_size = meta->field_transfer_size;
+ if (this_size > 0)
+ mi->field_sender_size = this_size;
+ mi->field_sender_offset = new_offset;
+ mi->field_recver_offset = -1;
+
+ mi++;
+ maxlen -= sizeof(struct xio_desc_item);
+ count++;
+
+ if (meta->field_type == FIELD_SUB) {
+ int sub_count;
+
+ sub_count = _add_fields(mi, meta->field_ref, new_offset, new_prefix, maxlen);
+ if (sub_count < 0)
+ return sub_count;
+
+ mi += sub_count;
+ count += sub_count;
+ maxlen -= sub_count * sizeof(struct xio_desc_item);
+ }
+ }
+done:
+ return count;
+}
+
+static
+struct xio_desc_cache *make_sender_cache(struct xio_socket *msock, const struct meta *meta, int *cache_index)
+{
+ int orig_len = PAGE_SIZE;
+ int maxlen = orig_len;
+ struct xio_desc_cache *mc;
+ struct xio_desc_item *mi;
+ int i;
+ int status;
+
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ mc = msock->s_desc_send[i];
+ if (!mc)
+ break;
+ if (mc->cache_sender_cookie == (u64)meta)
+ goto done;
+ }
+
+ if (unlikely(i >= MAX_DESC_CACHE - 1)) {
+ XIO_ERR("#%d desc cache overflow\n", msock->s_debug_nr);
+ return NULL;
+ }
+
+ mc = brick_block_alloc(0, maxlen);
+
+ memset(mc, 0, maxlen);
+ mc->cache_sender_cookie = (u64)meta;
+ /* further bits may be used in future */
+ mc->cache_sender_proto = msock->s_send_proto;
+ mc->cache_recver_proto = msock->s_recv_proto;
+
+ maxlen -= sizeof(struct xio_desc_cache);
+ mi = (void *)(mc + 1);
+
+ status = _add_fields(mi, meta, 0, "", maxlen);
+
+ if (likely(status > 0)) {
+ mc->cache_items = status;
+ mc->cache_is_bigendian = myself_is_bigendian;
+ msock->s_desc_send[i] = mc;
+ *cache_index = i;
+ } else {
+ brick_block_free(mc, orig_len);
+ mc = NULL;
+ }
+
+done:
+ return mc;
+}
+
+static
+void _make_recver_cache(struct xio_desc_cache *mc, const struct meta *meta, int offset, const char *prefix)
+{
+ char *tmp = brick_string_alloc(MAX_FIELD_LEN);
+ int i;
+
+ for (; meta->field_name != NULL; meta++) {
+ snprintf(tmp, MAX_FIELD_LEN, "%s.%s", prefix, meta->field_name);
+ for (i = 0; i < mc->cache_items; i++) {
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + i;
+
+ if (meta->field_type == mi->field_type &&
+ !strcmp(tmp, mi->field_name)) {
+ mi->field_recver_size = meta->field_data_size;
+ mi->field_recver_offset = offset + meta->field_offset;
+ if (meta->field_type == FIELD_SUB)
+ _make_recver_cache(mc, meta->field_ref, mi->field_recver_offset, tmp);
+ goto found;
+ }
+ }
+ XIO_WRN("field '%s' is missing\n", meta->field_name);
+found:;
+ }
+ brick_string_free(tmp);
+}
+
+static
+void make_recver_cache(struct xio_desc_cache *mc, const struct meta *meta)
+{
+ int i;
+
+ _make_recver_cache(mc, meta, 0, "");
+
+ for (i = 0; i < mc->cache_items; i++) {
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + i;
+
+ if (unlikely(mi->field_recver_offset < 0))
+ XIO_WRN("field '%s' is not transferred\n", mi->field_name);
+ }
+}
+
+#define _CHECK_STATUS(_txt_) \
+do { \
+ if (unlikely(status < 0)) { \
+ XIO_DBG("%s status = %d\n", _txt_, status); \
+ goto err; \
+ } \
+} while (0)
+
+static
+int _desc_send_item(struct xio_socket *msock,
+ const void *data,
+ const struct xio_desc_cache *mc,
+ int index,
+ bool cork)
+{
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + index;
+ const void *item = data + mi->field_sender_offset;
+
+ s16 data_len = mi->field_data_size;
+ s16 transfer_len = mi->field_sender_size;
+ int status;
+ bool is_signed = false;
+ int res = -1;
+
+ switch (mi->field_type) {
+ case FIELD_REF:
+ XIO_ERR("field '%s' NYI type = %d\n", mi->field_name, mi->field_type);
+ goto err;
+ case FIELD_SUB:
+ /* skip this */
+ res = 0;
+ break;
+ case FIELD_INT:
+ is_signed = true;
+ /* fallthrough */
+ case FIELD_UINT:
+ if (unlikely(data_len <= 0 || data_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (unlikely(transfer_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ }
+
+ if (likely(data_len == transfer_len))
+ goto raw;
+
+ if (transfer_len > data_len) {
+ int diff = transfer_len - data_len;
+ char empty[diff];
+ char sign;
+
+ sign = get_sign(item, data_len, myself_is_bigendian, is_signed);
+ memset(empty, sign, diff);
+
+ if (myself_is_bigendian) {
+ status = xio_send_raw(msock, empty, diff, true);
+ _CHECK_STATUS("send_diff");
+ status = xio_send_raw(msock, item, data_len, cork);
+ _CHECK_STATUS("send_item");
+
+ } else {
+ status = xio_send_raw(msock, item, data_len, true);
+ _CHECK_STATUS("send_item");
+ status = xio_send_raw(msock, empty, diff, cork);
+ _CHECK_STATUS("send_diff");
+ }
+
+ res = data_len;
+ break;
+ } else if (unlikely(transfer_len <= 0)) {
+ XIO_ERR("bad transfer_len = %d\n", transfer_len);
+ goto err;
+ } else { /* transfer_len < data_len */
+ char check = get_sign(item, data_len, myself_is_bigendian, is_signed);
+ int start;
+ int end;
+ int i;
+
+ if (is_signed &&
+ unlikely(get_sign(item, transfer_len, myself_is_bigendian, true) != check)) {
+ XIO_ERR("cannot sign-reduce signed integer from %d to %d bytes, byte %d !~ %d\n",
+ data_len,
+ transfer_len,
+ ((char *)item)[transfer_len - 1],
+ check);
+ goto err;
+ }
+
+ if (myself_is_bigendian) {
+ start = 0;
+ end = data_len - transfer_len;
+ } else {
+ start = transfer_len;
+ end = data_len;
+ }
+
+ for (i = start; i < end; i++) {
+ if (unlikely(((char *)item)[i] != check)) {
+ XIO_ERR("cannot sign-reduce %ssigned integer from %d to %d bytes at pos %d, byte %d != %d\n",
+ is_signed ? "" : "un",
+ data_len,
+ transfer_len,
+ i,
+ ((char *)item)[i],
+ check);
+ goto err;
+ }
+ }
+
+ /* just omit the higher/lower bytes */
+ data_len = transfer_len;
+ if (myself_is_bigendian)
+ item += end;
+ goto raw;
+ }
+ case FIELD_STRING:
+ item = *(void **)item;
+ data_len = 0;
+ if (item)
+ data_len = strlen(item) + 1;
+
+ status = xio_send_raw(msock, &data_len, sizeof(data_len), true);
+ _CHECK_STATUS("send_string_len");
+ /* fallthrough */
+ case FIELD_RAW:
+raw:
+ if (unlikely(data_len < 0)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ status = xio_send_raw(msock, item, data_len, cork);
+ _CHECK_STATUS("send_raw");
+ res = data_len;
+ break;
+ default:
+ XIO_ERR("field '%s' unknown type = %d\n", mi->field_name, mi->field_type);
+ }
+err:
+ return res;
+}
+
+static
+int _desc_recv_item(struct xio_socket *msock, void *data, const struct xio_desc_cache *mc, int index, int line)
+{
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + index;
+ void *item = NULL;
+
+ s16 data_len = mi->field_recver_size;
+ s16 transfer_len = mi->field_sender_size;
+ int status;
+ bool is_signed = false;
+ int res = -1;
+
+ if (likely(data && data_len > 0 && mi->field_recver_offset >= 0))
+ item = data + mi->field_recver_offset;
+
+ switch (mi->field_type) {
+ case FIELD_REF:
+ XIO_ERR("field '%s' NYI type = %d\n", mi->field_name, mi->field_type);
+ goto err;
+ case FIELD_SUB:
+ /* skip this */
+ res = 0;
+ break;
+ case FIELD_INT:
+ is_signed = true;
+ /* fallthrough */
+ case FIELD_UINT:
+ if (unlikely(data_len <= 0 || data_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (unlikely(transfer_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ }
+
+ if (likely(data_len == transfer_len))
+ goto raw;
+
+ if (transfer_len > data_len) {
+ int diff = transfer_len - data_len;
+ char empty[diff];
+ char check;
+
+ memset(empty, 0, diff);
+
+ if (myself_is_bigendian) {
+ status = xio_recv_raw(msock, empty, diff, diff);
+ _CHECK_STATUS("recv_diff");
+ }
+
+ status = xio_recv_raw(msock, item, data_len, data_len);
+ _CHECK_STATUS("recv_item");
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian && item))
+ swap_bytes(item, data_len);
+
+ if (!myself_is_bigendian) {
+ status = xio_recv_raw(msock, empty, diff, diff);
+ _CHECK_STATUS("recv_diff");
+ }
+
+ /* check that sign extension did no harm */
+ check = get_sign(empty, diff, mc->cache_is_bigendian, is_signed);
+ while (--diff >= 0) {
+ if (unlikely(empty[diff] != check)) {
+ XIO_ERR("field '%s' %sSIGNED INTEGER OVERFLOW on size reduction from %d to %d, byte %d != %d\n",
+ mi->field_name,
+ is_signed ? "" : "UN",
+ transfer_len,
+ data_len,
+ empty[diff],
+ check);
+ goto err;
+ }
+ }
+ if (is_signed && item &&
+ unlikely(get_sign(item, data_len, myself_is_bigendian, true) != check)) {
+ XIO_ERR("field '%s' SIGNED INTEGER OVERLOW on reduction from size %d to %d, byte %d !~ %d\n",
+ mi->field_name,
+ transfer_len,
+ data_len,
+ ((char *)item)[data_len - 1],
+ check);
+ goto err;
+ }
+
+ res = data_len;
+ break;
+ } else if (unlikely(transfer_len <= 0)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ } else if (unlikely(!item)) { /* shortcut without checks */
+ data_len = transfer_len;
+ goto raw;
+ } else { /* transfer_len < data_len */
+ int diff = data_len - transfer_len;
+ char *transfer_ptr = item;
+ char sign;
+
+ if (myself_is_bigendian)
+ transfer_ptr += diff;
+
+ status = xio_recv_raw(msock, transfer_ptr, transfer_len, transfer_len);
+ _CHECK_STATUS("recv_transfer");
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian))
+ swap_bytes(transfer_ptr, transfer_len);
+
+ /* sign-extend from transfer_len to data_len */
+ sign = get_sign(transfer_ptr, transfer_len, myself_is_bigendian, is_signed);
+ if (myself_is_bigendian)
+ memset(item, sign, diff);
+ else
+ memset(item + transfer_len, sign, diff);
+ res = data_len;
+ break;
+ }
+ case FIELD_STRING:
+ data_len = 0;
+ status = xio_recv_raw(msock, &data_len, sizeof(data_len), sizeof(data_len));
+ _CHECK_STATUS("recv_string_len");
+
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian))
+ swap_bytes(&data_len, sizeof(data_len));
+
+ if (data_len > 0 && item) {
+ char *str = _brick_string_alloc(data_len, line);
+
+ *(void **)item = str;
+ item = str;
+ }
+
+ transfer_len = data_len;
+ /* fallthrough */
+ case FIELD_RAW:
+raw:
+ if (unlikely(data_len < 0)) {
+ XIO_ERR("field = '%s' implausible data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (likely(data_len > 0)) {
+ if (unlikely(transfer_len != data_len)) {
+ XIO_ERR("cannot handle generic mismatch in transfer sizes, field = '%s', %d != %d\n",
+ mi->field_name,
+ transfer_len,
+ data_len);
+ goto err;
+ }
+ status = xio_recv_raw(msock, item, data_len, data_len);
+ _CHECK_STATUS("recv_raw");
+ }
+ res = data_len;
+ break;
+ default:
+ XIO_ERR("field '%s' unknown type = %d\n", mi->field_name, mi->field_type);
+ }
+err:
+ return res;
+}
+
+static inline
+int _desc_send_struct(struct xio_socket *msock, int cache_index, const void *data, int h_meta_len, bool cork)
+{
+ const struct xio_desc_cache *mc = msock->s_desc_send[cache_index];
+
+ struct xio_desc_header header = {
+ .h_magic = XIO_DESC_MAGIC,
+ .h_cookie = mc->cache_sender_cookie,
+ .h_meta_len = h_meta_len,
+ .h_index = data ? cache_index : -1,
+ };
+ int index;
+ int count = 0;
+ int status = 0;
+
+ status = xio_send_raw(msock, &header, sizeof(header), cork || data);
+ _CHECK_STATUS("send_header");
+
+ if (unlikely(h_meta_len > 0)) {
+ status = xio_send_raw(msock, mc, h_meta_len, true);
+ _CHECK_STATUS("send_meta");
+ }
+
+ if (likely(data)) {
+ for (index = 0; index < mc->cache_items; index++) {
+ status = _desc_send_item(msock, data, mc, index, cork || index < mc->cache_items-1);
+ _CHECK_STATUS("send_cache_item");
+ count++;
+ }
+ }
+
+ if (status >= 0)
+ status = count;
+err:
+ return status;
+}
+
+static
+int desc_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta, bool cork)
+{
+ struct xio_desc_cache *mc;
+ int i;
+ int h_meta_len = 0;
+ int status = -EINVAL;
+
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ mc = msock->s_desc_send[i];
+ if (!mc)
+ break;
+ if (mc->cache_sender_cookie == (u64)meta)
+ goto found;
+ }
+
+ mc = make_sender_cache(msock, meta, &i);
+ if (unlikely(!mc))
+ goto done;
+
+ h_meta_len = mc->cache_items * sizeof(struct xio_desc_item) + sizeof(struct xio_desc_cache);
+
+found:
+ status = _desc_send_struct(msock, i, data, h_meta_len, cork);
+
+done:
+ return status;
+}
+
+static
+int desc_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line)
+{
+ struct xio_desc_header header = {};
+ struct xio_desc_cache *mc;
+ int cache_index;
+ int index;
+ int count = 0;
+ int status = 0;
+ bool need_swap = false;
+
+ status = xio_recv_raw(msock, &header, sizeof(header), sizeof(header));
+ _CHECK_STATUS("recv_header");
+
+ if (unlikely(header.h_magic != XIO_DESC_MAGIC)) {
+ need_swap = true;
+ SWAP_FIELD(header.h_magic);
+ if (unlikely(header.h_magic != XIO_DESC_MAGIC)) {
+ XIO_WRN("#%d called from line %d bad packet header magic = %llx\n",
+ msock->s_debug_nr,
+ line,
+ header.h_magic);
+ status = -ENOMSG;
+ goto err;
+ }
+ SWAP_FIELD(header.h_cookie);
+ SWAP_FIELD(header.h_meta_len);
+ SWAP_FIELD(header.h_index);
+ }
+
+ cache_index = header.h_index;
+ if (cache_index < 0) { /* EOR */
+ goto done;
+ }
+ if (unlikely(cache_index >= MAX_DESC_CACHE - 1)) {
+ XIO_WRN("#%d called from line %d bad cache index %d\n", msock->s_debug_nr, line, cache_index);
+ status = -EBADF;
+ goto err;
+ }
+
+ mc = msock->s_desc_recv[cache_index];
+ if (unlikely(!mc)) {
+ if (unlikely(header.h_meta_len <= 0)) {
+ XIO_WRN("#%d called from line %d missing meta information\n", msock->s_debug_nr, line);
+ status = -ENOMSG;
+ goto err;
+ }
+
+ mc = _brick_block_alloc(0, PAGE_SIZE, line);
+
+ status = xio_recv_raw(msock, mc, header.h_meta_len, header.h_meta_len);
+ if (unlikely(status < 0))
+ brick_block_free(mc, PAGE_SIZE);
+ _CHECK_STATUS("recv_meta");
+
+ if (unlikely(need_swap))
+ swap_mc(mc, header.h_meta_len);
+
+ make_recver_cache(mc, meta);
+
+ msock->s_desc_recv[cache_index] = mc;
+ } else if (unlikely(header.h_meta_len > 0)) {
+ XIO_WRN("#%d called from line %d has %d unexpected meta bytes\n",
+ msock->s_debug_nr,
+ line,
+ header.h_meta_len);
+ }
+
+ for (index = 0; index < mc->cache_items; index++) {
+ status = _desc_recv_item(msock, data, mc, index, line);
+ _CHECK_STATUS("recv_cache_item");
+ count++;
+ }
+
+done:
+ if (status >= 0)
+ status = count;
+err:
+ return status;
+}
+
+int xio_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta)
+{
+ return desc_send_struct(msock, data, meta, false);
+}
+EXPORT_SYMBOL_GPL(xio_send_struct);
+
+int _xio_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line)
+{
+ return desc_recv_struct(msock, data, meta, line);
+}
+EXPORT_SYMBOL_GPL(_xio_recv_struct);
+
+/*********************************************************************/
+
+/* High-level transport of xio structures
+ */
+
+const struct meta xio_cmd_meta[] = {
+ META_INI_SUB(cmd_stamp, struct xio_cmd, xio_timespec_meta),
+ META_INI(cmd_code, struct xio_cmd, FIELD_INT),
+ META_INI(cmd_int1, struct xio_cmd, FIELD_INT),
+ META_INI(cmd_str1, struct xio_cmd, FIELD_STRING),
+ {}
+};
+EXPORT_SYMBOL_GPL(xio_cmd_meta);
+
+int xio_send_aio(struct xio_socket *msock, struct aio_object *aio)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_AIO,
+ .cmd_int1 = aio->io_id,
+ };
+ int seq = 0;
+ int status;
+
+ if (aio->io_rw != 0 && aio->io_data && aio->io_cs_mode < 2)
+ cmd.cmd_code |= CMD_FLAG_HAS_DATA;
+
+ get_lamport(&cmd.cmd_stamp);
+
+ status = desc_send_struct(msock, &cmd, xio_cmd_meta, true);
+ if (status < 0)
+ goto done;
+
+ seq = 0;
+ status = desc_send_struct(msock, aio, xio_aio_meta, cmd.cmd_code & CMD_FLAG_HAS_DATA);
+ if (status < 0)
+ goto done;
+
+ if (cmd.cmd_code & CMD_FLAG_HAS_DATA)
+ status = xio_send_raw(msock, aio->io_data, aio->io_len, false);
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_send_aio);
+
+int xio_recv_aio(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd)
+{
+ int status;
+
+ status = desc_recv_struct(msock, aio, xio_aio_meta, __LINE__);
+ if (status < 0)
+ goto done;
+
+ set_lamport(&cmd->cmd_stamp);
+
+ if (cmd->cmd_code & CMD_FLAG_HAS_DATA) {
+ if (!aio->io_data)
+ aio->io_data = brick_zmem_alloc(aio->io_len);
+ status = xio_recv_raw(msock, aio->io_data, aio->io_len, aio->io_len);
+ if (status < 0)
+ XIO_WRN("#%d aio_len = %d, status = %d\n", msock->s_debug_nr, aio->io_len, status);
+ }
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_recv_aio);
+
+int xio_send_cb(struct xio_socket *msock, struct aio_object *aio)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_CB,
+ .cmd_int1 = aio->io_id,
+ };
+ int seq = 0;
+ int status;
+
+ if (aio->io_rw == 0 && aio->io_data && aio->io_cs_mode < 2)
+ cmd.cmd_code |= CMD_FLAG_HAS_DATA;
+
+ get_lamport(&cmd.cmd_stamp);
+
+ status = desc_send_struct(msock, &cmd, xio_cmd_meta, true);
+ if (status < 0)
+ goto done;
+
+ seq = 0;
+ status = desc_send_struct(msock, aio, xio_aio_meta, cmd.cmd_code & CMD_FLAG_HAS_DATA);
+ if (status < 0)
+ goto done;
+
+ if (cmd.cmd_code & CMD_FLAG_HAS_DATA)
+ status = xio_send_raw(msock, aio->io_data, aio->io_len, false);
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_send_cb);
+
+int xio_recv_cb(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd)
+{
+ int status;
+
+ status = desc_recv_struct(msock, aio, xio_aio_meta, __LINE__);
+ if (status < 0)
+ goto done;
+
+ set_lamport(&cmd->cmd_stamp);
+
+ if (cmd->cmd_code & CMD_FLAG_HAS_DATA) {
+ if (!aio->io_data) {
+ XIO_WRN("#%d no internal buffer available\n", msock->s_debug_nr);
+ status = -EINVAL;
+ goto done;
+ }
+ status = xio_recv_raw(msock, aio->io_data, aio->io_len, aio->io_len);
+ }
+done:
+ return status;
+}
+EXPORT_SYMBOL_GPL(xio_recv_cb);
+
+/***************** module init stuff ************************/
+
+char *(*xio_translate_hostname)(const char *name) = NULL;
+EXPORT_SYMBOL_GPL(xio_translate_hostname);
+
+bool xio_net_is_alive = false;
+EXPORT_SYMBOL_GPL(xio_net_is_alive);
+
+int __init init_xio_net(void)
+{
+ XIO_INF("init_net()\n");
+ xio_net_is_alive = true;
+ return 0;
+}
+
+void exit_xio_net(void)
+{
+ xio_net_is_alive = false;
+ XIO_INF("exit_net()\n");
+}
--
2.0.0
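
To illustrate how the descriptor machinery above is meant to be used: a C
struct is described once by a struct meta table (as xio_cmd_meta does for
struct xio_cmd); the first transfer over a socket sends that description,
the peer caches it per socket, and subsequent transfers carry only the
field payloads with automatic size and bytesex conversion. A sketch with
an invented struct (demo_record and its meta table are illustrative only,
and xio_recv_struct is assumed to be the header-side wrapper around
_xio_recv_struct()):

	struct demo_record {
		int rec_code;
		long long rec_pos;
		char *rec_name;
	};

	static const struct meta demo_record_meta[] = {
		META_INI(rec_code, struct demo_record, FIELD_INT),
		META_INI(rec_pos, struct demo_record, FIELD_INT),
		META_INI(rec_name, struct demo_record, FIELD_STRING),
		{}
	};

	/* sender side */
	struct demo_record out = {
		.rec_code = 1,
		.rec_pos = 4096,
		.rec_name = "hello",
	};
	int status = xio_send_struct(msock, &out, demo_record_meta);

	/* receiver side: the cached description drives per-field size
	 * adaptation and byte swapping; strings are allocated on demand
	 */
	struct demo_record in = {};
	status = xio_recv_struct(msock, &in, demo_record_meta);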

2014-07-01 22:02:30

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [PATCH 29/50] mars: add new file include/linux/xio/xio_aio.h

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/xio/xio_aio.h | 96 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)
create mode 100644 include/linux/xio/xio_aio.h

diff --git a/include/linux/xio/xio_aio.h b/include/linux/xio/xio_aio.h
new file mode 100644
index 0000000..776a28c
--- /dev/null
+++ b/include/linux/xio/xio_aio.h
@@ -0,0 +1,96 @@
+/* (c) 2010 Thomas Schoebel-Theuer / 1&1 Internet AG */
+#ifndef XIO_AIO_H
+#define XIO_AIO_H
+
+#include <linux/aio.h>
+#include <linux/syscalls.h>
+
+#include <linux/lib_mapfree.h>
+
+#define AIO_SUBMIT_MAX_LATENCY 1000 /* 1 ms */
+#define AIO_IO_R_MAX_LATENCY 50000 /* 50 ms */
+#define AIO_IO_W_MAX_LATENCY 150000 /* 150 ms */
+#define AIO_SYNC_MAX_LATENCY 150000 /* 150 ms */
+
+extern struct threshold aio_submit_threshold;
+extern struct threshold aio_io_threshold[2];
+extern struct threshold aio_sync_threshold;
+
+/* aio_sync_mode:
+ * 0 = filemap_write_and_wait_range()
+ * 1 = fdatasync()
+ * 2 = fsync()
+ */
+extern int aio_sync_mode;
+
+struct aio_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ struct dirty_info di;
+ unsigned long long enqueue_stamp;
+ long long start_jiffies;
+ int resubmit;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct aio_brick {
+ XIO_BRICK(aio);
+ /* parameters */
+ bool o_creat;
+ bool o_direct;
+ bool o_fdsync;
+ bool is_static_device;
+};
+
+struct aio_input {
+ XIO_INPUT(aio);
+};
+
+struct aio_threadinfo {
+ struct list_head aio_list[XIO_PRIO_NR];
+ struct aio_output *output;
+ struct task_struct *thread;
+
+ wait_queue_head_t event;
+ wait_queue_head_t terminate_event;
+ spinlock_t lock;
+ int queued[XIO_PRIO_NR];
+ atomic_t queued_sum;
+ atomic_t total_enqueue_count;
+ bool terminated;
+};
+
+struct aio_output {
+ XIO_OUTPUT(aio);
+ /* private */
+ struct mapfree_info *mf;
+
+ int fd; /* FIXME: remove this! */
+ struct aio_threadinfo tinfo[3];
+
+ aio_context_t ctxp;
+ wait_queue_head_t fdsync_event;
+ bool fdsync_active;
+
+ /* statistics */
+ int index;
+ atomic_t total_read_count;
+ atomic_t total_write_count;
+ atomic_t total_alloc_count;
+ atomic_t total_submit_count;
+ atomic_t total_again_count;
+ atomic_t total_delay_count;
+ atomic_t total_msleep_count;
+ atomic_t total_fdsync_count;
+ atomic_t total_fdsync_wait_count;
+ atomic_t total_mapfree_count;
+ atomic_t read_count;
+ atomic_t write_count;
+ atomic_t alloc_count;
+ atomic_t submit_count;
+};
+
+XIO_TYPES(aio);
+
+#endif
--
2.0.0
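
The aio_sync_mode comment above maps onto standard kernel primitives;
presumably the implementation dispatches roughly as follows (a hedged
sketch, not code from the patchset):

	switch (aio_sync_mode) {
	case 0:
		err = filemap_write_and_wait_range(file->f_mapping, pos, end);
		break;
	case 1:
		err = vfs_fsync(file, 1); /* fdatasync() semantics */
		break;
	case 2:
		err = vfs_fsync(file, 0); /* full fsync() */
		break;
	}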

2014-07-01 22:43:38

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

On Tue, Jul 01, 2014 at 11:47:29PM +0200, Thomas Schoebel-Theuer wrote:
> From: Thomas Schoebel-Theuer <[email protected]>
>
> Mostly introduces missing EXPORT_SYMBOL().
> Should have no impact on the kernel.
>
> This is the generic version which exports all sys_*() system
> calls. This should not introduce any additional maintenance pain
> because those interfaces have to be stable anyway due to POSIX etc.
>
> Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
> ---
> fs/open.c | 1 -
> fs/utimes.c | 2 ++
> include/linux/syscalls.h | 3 +++
> include/uapi/linux/major.h | 1 +
> mm/page_alloc.c | 3 +++
> 5 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 36662d0..3b21b76 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1064,7 +1064,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
>
> return retval;
> }
> -EXPORT_SYMBOL(sys_close);
>
> /*
> * This routine simulates a hangup on the tty, to arrange that users
> diff --git a/fs/utimes.c b/fs/utimes.c
> index aa138d6..4a1f4a8 100644
> --- a/fs/utimes.c
> +++ b/fs/utimes.c
> @@ -1,3 +1,4 @@
> +#include <linux/module.h>
> #include <linux/compiler.h>
> #include <linux/file.h>
> #include <linux/fs.h>
> @@ -181,6 +182,7 @@ retry:
> out:
> return error;
> }
> +EXPORT_SYMBOL(do_utimes);
>
> SYSCALL_DEFINE4(utimensat, int, dfd, const char __user *, filename,
> struct timespec __user *, utimes, int, flags)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..c674309 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -75,6 +75,7 @@ struct sigaltstack;
> #include <linux/sem.h>
> #include <asm/siginfo.h>
> #include <linux/unistd.h>
> +#include <linux/export.h>
> #include <linux/quota.h>
> #include <linux/key.h>
> #include <trace/syscall.h>
> @@ -176,6 +177,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
>
> #define SYSCALL_DEFINE0(sname) \
> SYSCALL_METADATA(_##sname, 0); \
> + EXPORT_SYMBOL(sys_##sname); \
> asmlinkage long sys_##sname(void)
>
> #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
> @@ -202,6 +204,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
> __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
> return ret; \
> } \
> + EXPORT_SYMBOL(sys##name); \
> static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
>

Heh, nice try, but no, we aren't going to export all syscalls, that's
crazy. And wrong on many levels, sorry.

Be explicit with your exports, and justify _why_ you need them.

thanks,

greg k-h

2014-07-02 07:19:35

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

I will prepare a new version ASAP.

Is the asmlinkage one of the reasons?

If so, some additional wrappers like vfs_* or kernel_* would have to be
written. This would also complicate portability between the out-of-tree
and the in-tree version of MARS (which I have to maintain in parallel at
least for some years).

It would be great if I could just make the exports of sys_* explicit.

Thanks and cheers,

Thomas

On 07/02/2014 12:36 AM, Greg KH wrote:
> Heh, nice try, but no, we aren't going to export all syscalls, that's
> crazy. And wrong on many levels, sorry.
>
> Be explicit with your exports, and justify _why_ you need them.

2014-07-02 08:24:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars


A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Wed, Jul 02, 2014 at 09:19:31AM +0200, Thomas Schoebel-Theuer wrote:
> I will prepare a new version ASAP.
>
> Is the asmlinkage one of the reasons?

No. The wholesale exporting of all 300+ syscall functions with no
apparent rationale is the reason.

> If so, some additional wrappers like vfs_* or kernel_* would have to be
> written. This would also complicate portability between the out-of-tree
> and the in-tree version of MARS (which I have to maintain in parallel at
> least for some years).

Maintaining out of tree code is not our problem, sorry.

thanks,

greg k-h

2014-07-02 09:02:17

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

> Maintaining out of tree code is not our problem, sorry. thanks, greg k-h

OK, I just noticed that in the meantime many vfs_*() functions have
become available which were missing when I started the project on very
old kernels (or maybe I missed something, sorry for any potential mistakes).

So I will happily make a new version which has the lowest possible
footprint in the kernel, but nevertheless is _internally_ portable (just
for me and the needs of 1&1 users).

This will incur many changes in many places in the patchset. Is it OK to
re-submit the _whole_ patchset again after that?

I will have to re-run the full testsuite on the new version in order to
be sure that no bad things happen, so this will take at least 24h (if
not several days).

Cheers,

Thomas

2014-07-02 13:27:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

On Wed, Jul 02, 2014 at 01:24:00AM -0700, Greg KH wrote:
> > I will prepare a new version ASAP.
> >
> > Is the asmlinkage one of the reasons?
>
> No. The wholesale exporting of all 300+ syscall functions for no
> apparent reasoning is the reason.

Using syscalls or syscall-like functionality from kernel code generally
is a bad idea. Every single use needs a clear rationale in the patch
description. Such rationales do not seem to exist at all for the series,
so in its current form it'll go straight to the trash bin anyway.

2014-07-02 14:46:12

by Thomas Schöbel-Theuer

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

> Using syscalls or syscall-like functionality from kernel code
> generally is a bad idea. Every single use needs a clear rationale in
> the patch description. Which do not seem to exist at all for the
> series, so in it's current form it'll go straight to the trash bin anyway.

Thanks, Christoph, for the explanation. I have been away from the
community for so long that I no longer remember all the conventions /
preferences.

Please, could you kindly advise me in the following dilemma:

Now there exists a vfs_rmdir() which I would gladly prefer in order to
avoid EXPORT_SYMBOL(sys_rmdir).

However, I probably would have to "borrow" large parts of the
sys_rmdir() implementation from fs/namei.c (the only real difference, as
far as I can see, is that the pathname is not in userspace).

So, what is worse: copying relatively large pieces of code, or using
sys_rmdir()?
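
For illustration, an untested sketch of what I would expect the vfs_*
route to look like, modeled on what devtmpfs does for in-kernel directory
removal (and assuming kern_path_locked() were exported to modules):

	static int mars_rmdir(const char *name)
	{
		struct path parent;
		struct dentry *dentry;
		int err = -ENOENT;

		/* returns with the parent's i_mutex held and the
		 * victim dentry referenced
		 */
		dentry = kern_path_locked(name, &parent);
		if (IS_ERR(dentry))
			return PTR_ERR(dentry);
		if (dentry->d_inode)
			err = vfs_rmdir(parent.dentry->d_inode, dentry);
		dput(dentry);
		mutex_unlock(&parent.dentry->d_inode->i_mutex);
		path_put(&parent);
		return err;
	}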

Thanks,

Thomas

2014-07-02 14:50:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

On Wed, Jul 02, 2014 at 04:36:42PM +0200, Thomas Sch?bel-Theuer wrote:
> However, I probably would have to "borrow" large parts of the
> sys_rmdir() implementation from fs/namei.c (the only real difference
> appears to me that the pathname is not in userspace).
>
> So, what is worse: copying relatively large pieces of code, or using
> sys_rmdir()?

Most likely you shouldn't do either. A block driver really should not
be removing directories from the filesystem namespace.

2014-07-02 16:20:29

by Thomas Schöbel-Theuer

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

> Most likely you shouldn't do either. A block driver really should not
> be removing directories from the filesystem namespace.

Please take into account that MARS Light is not just a "device driver",
but a long-distance distributed system dealing with huge masses of state
information.

The /mars/ filesystem (which is a reserved directory) is a means for
storage of transaction logfiles. This has one big advantage: it is a
_shared_ storage space for _multiple_ resources.

The current setup at 1&1 datacenters is dealing with at least 7
resources in parallel; the number is likely to grow in future.

IMHO, replacing the /mars/ filesystem by block device storage would
likely mean having a rather static, ring-buffer-like structure for
_each_ of multiple resources.

When taking your argument (which I can understand) to the extreme, I
would not even be allowed to use files as a dynamic storage for
transaction logfiles at all.

In essence, I fear that such a requirement would mean re-implementing
the core functionality of a classical filesystem, namely (a) creating
multiple instances of _dynamic_ storage (aka files) out of a single
static storage (aka block device), and (b) synchronizing parallel access
to that. My idea was to re-use that functionality already implemented by
filesystems, where fundamental problems such as fragmentation are
already solved.

There is a second reason for using a filesystem:

As explained in slide 10 of the LinuxTag presentation, the symlink tree
residing in /mars/ is also used for storage and propagation of metadata
information.

Unfortunately, the slides don't contain my oral explanations (both at
LCA2014 and LinuxTag2014) about the symlink tree. It is used for three
purposes at the same time:

1) as a _persistent_ key->value store (a small sketch follows below).

2) as the _only_ means of high-level metadata communication in the
long-distance cluster.

3) as interface between userspace and kernelspace (no longer any binary
interfaces).
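
A tiny illustration of 1), with invented names: a key->value pair is
simply a symlink whose target string carries the value; the target is
never dereferenced as a path. Seen from userspace (the kernel side does
the equivalent vfs operations on the same tree):

	#include <unistd.h>
	#include <stdio.h>

	int main(void)
	{
		char value[256];
		ssize_t len;

		/* key = symlink path, value = symlink target
		 * (error handling and atomic updates omitted)
		 */
		symlink("primary=nodeA", "/mars/resource-r1/state");

		len = readlink("/mars/resource-r1/state", value,
			       sizeof(value) - 1);
		if (len >= 0) {
			value[len] = 0;
			printf("state = %s\n", value);
		}
		return 0;
	}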

IMHO, if I had to avoid filesystem operations at least regarding
symlinks, the complete MARS Light design would have to be redesigned /
re-implemented in a very different way, essentially as a completely
different implementation, forcing me to throw away the effort of years.

Notice that MARS Full is planned to abstract away from concrete storage
formats such as filesystems, so your objective could probably be met in
the future (if that turns out to have advantages).

Please permit me to use /mars/ for MARS Light as a reserved space where
not only transaction logfiles for multiple resources are residing, but
also the symlink tree is allowed to reside.

Cheers,

Thomas

2014-07-02 16:36:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

On Wed, Jul 02, 2014 at 11:02:09AM +0200, Thomas Schoebel-Theuer wrote:
> > Maintaining out of tree code is not our problem, sorry. thanks, greg k-h
>
> OK, I just noticed that in the meantime many vfs_*() are present now
> which were missing when I started the project on very old kernels (or
> maybe I missed something, sorry for any potential mistakes).
>
> So I will happily make a new version which has the lowest possible
> footprint in the kernel, but nevertheless is _internally_ portable (just
> for me and the needs of 1&1 users).
>
> This will incur many changes in many places in the patchset. Is it OK to
> re-submit the _whole_ patchset again after that?

Given that your patchset really isn't even in a reviewable format (you
just add one file per-patch) resending it doesn't bother anyone :)

Hint: you will have to work on a better way to submit this code in a
format that is mergeable. It's a non-trivial task, but it is going to be
required if you wish for this to ever be accepted.

Best of luck,

greg k-h

2014-07-02 18:41:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

On Wed, Jul 02, 2014 at 06:20:20PM +0200, Thomas Sch?bel-Theuer wrote:
> Please take into account that MARS Light is not just a "device driver",
> but a long-distance distributed system dealing with huge masses of state
> information.

Which doesn't matter. No kernel driver has any business messing with the
filesystem namespace, especially not maintaining magic symlink farms.

2014-07-03 06:11:10

by Thomas Schöbel-Theuer

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

>> Please take into account that MARS Light is not just a "device driver",
>> but a long-distance distributed system dealing with huge masses of state
>> information.
> Which doesn't matter. No kernel driver has a business messing with the
> filesystem namespace. Especially no maintaining magic symlink farms.
>

I see the following alternative solutions for this:

a) MARS Light cannot go upstream because it is viewed as a driver
violating the rules. In this case, I could retry submission some years
later when (subsets of) MARS Full strictly obey some hierarchy rules
which I have to know completely in advance in order to no longer make
any mistakes. However, I am not sure whether the limits in the solution
space are really appropriate for the problem space; I would have to make
new prototype implementations and compare them to other solutions. Only
after that, I would be able to decide whether to go upstream at all, or
to continue an out-of-tree project.

or

b) you permit me exceptions from the rules, justified by the fact that a
_distributed_ "driver" for long-distance replication of terabytes of
data (constructed for long-distance replication of _whole_ _datacenters_
) needs dynamically growing storage space for bulk data. In order not to
re-invent the wheel, you permit me to store both data and metadata in a
reserved filesystem instance. Storing metadata _together_ with the
_corresponding_ (!) bulk data is justified by the fact that separate
storage spaces would need some additional means for ensuring
_consistency_ between them in case of node failures etc.

or

c) MARS Light is viewed as a distributed system, similar to a cluster
filesystem, which just happens to export a block device to userspace (at
the current stage of development; notice that the generic brick framework is
not limited to the block device layer).

or

d) both MARS Light and the future Full is viewed as a new generic
framework. At the current stage, only some parts dealing with block devices
are implemented. In later stages of MARS Full, the future hierarchy
rules will be automatically checked / established by some future
strategy bricks, according to IOP design principles as proposed in my
papers / in my monograph (a detailed discussion is probably out of
scope of this discussion).

Possibly there are further alternatives.

At least in cases c) and d), the source code should IMHO not go to
drivers/block/ but somewhere else (please give me some suggestions; this
is why ./rework-mars-for-upstream.pl is easily configurable).

Cheers,

Thomas

2014-07-03 10:41:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 49/50] mars: generic pre-patch for mars

Hi Thomas,

On Thu, Jul 03, 2014 at 08:10:33AM +0200, Thomas Sch?bel-Theuer wrote:
> I see the following alternative solutions for this:

All your alternatives really miss the point. In general we don't want
drivers messing with the filesystem namespace, creating their own symlink
farms or similar. Maybe there is a good explanation why your driver
is special, but you haven't provided it. In fact you've provided very
little explanation at all. You've submitted a series with a very
high-level introduction, and the 50 patches with no explanation at all.

The exported syscalls were the first major issue to stick out, but
without even looking at the code I bet there will be various other
roadblocks if someone actually bothers to review the rest of the code.

So before you come up with various take-it-or-leave-it alternatives I'd
suggest you figure out

a) what a driver submission should look like
b) how to explain and "sell" your design so that people get interested
   in it, and will start to actually review it and discuss the design
   tradeoffs with you.

An attitude of offering a few options to choose from isn't really going
to get you very far.

2014-07-03 15:01:13

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH 06/50] mars: add new file drivers/block/mars/brick_mem.c

On 2014-07-01 23:46, Thomas Schoebel-Theuer wrote:
> +#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
> +# define STRING_CANARY \
> + "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
> + "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
> + "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
> + "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
> + "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
> + "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
> + "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
> + "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
> + "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
> + " FILE = " __FILE__ \
> + " DATE = " __DATE__ \
> + " TIME = " __TIME__ \

The kernel is built with -Werror=date-time nowadays, so this is not
going to work.

Michal

2014-07-03 19:59:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 30/50] mars: add new file drivers/block/mars/xio_bricks/xio_aio.c

Thomas Schoebel-Theuer <[email protected]> writes:
> +
> +/************************ mmu faking (provisionary) ***********************/
> +
> +/* Kludge: our kernel threads will have no mm context, but need one
> + * for stuff like ioctx_alloc() / aio_setup_ring() etc
> + * which expect userspace resources.
> + * We fake one.
> + * TODO: factor out the userspace stuff from AIO such that
> + * this fake is no longer necessary.
> + * Even better: replace do_mmap() in AIO stuff by something
> + * more friendly to kernelspace apps.
> + */

That obviously has to be done before taking this patchkit any further.

AFAIK the kernel already has totally usable internal AIO interfaces.

-Andi

--
[email protected] -- Speaking for myself only