2016-12-30 23:01:32

by Thomas Schoebel-Theuer

Subject: [RFC 00/32] State of MARS Geo-Redundancy Module

Hi all,

here is my traditional annual status report on the development of MARS [1].

In the meantime, the out-of-tree MARS has replaced DRBD as the backbone
of the 1&1 geo-redundancy feature as publicly advertised for 1&1
Shared Hosting Linux (ShaHoLin). MARS is also running on several other
1&1 clusters, and some people elsewhere in the world have seemingly
started to use it as well.

At 1&1, MARS is now running on more than 2000 servers and on more
than 2 * 8 petabytes of data (8 PB stored in two replicas), and it has
accumulated more than 20 million operating hours.

The slides [1] explain why the sharding architecture supported
by MARS has no problem scaling out to such numbers, while some other
non-MARS and non-blocklevel 1&1 clusters (architecturally called
"Big Clusters" in the slides, although they hold less than 1 PB of data)
have seemingly reached their _practical_ scaling limits at their current
dimensioning and workload. They were originally advertised as scaling
almost "unlimited" in theory, and some people seem to have _believed_
this in the past. Some of the reasons for the massive differences in
scalability are explained in the slides; more explanations will
hopefully follow in 2017.

During 2016, I published several bugfix releases on the stable branch,
as well as some portability improvements for newer kernels in the
out-of-tree (OOT) version of MARS at the github repo [2].

There is also a prototype of a prepatch-less WIP compatibility branch
which has not yet been merged into master.

Some minor developments have also started: there is a lab prototype for
md5 checksumming of 4k blocks on the underlying disk devices. This was
motivated by the observation that most operational incidents are due to
hardware defects, and we want to catch them as early as possible.
Conversely, this also implies that MARS is considered more stable than
the hardware, but of course this is _expected_ from an HA solution ;)
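
For illustration, here is a minimal userspace sketch of the checksumming
idea (my own hypothetical code using OpenSSL's MD5(); the actual lab
prototype works in the kernel on the underlying disk devices): digest
each 4k block and keep the digests so that silently corrupted blocks
can be detected on re-read.

/* hypothetical sketch, not the actual lab prototype; link with -lcrypto */
#include <openssl/md5.h>
#include <stdio.h>

#define BLK_SIZE 4096

int main(int argc, char **argv)
{
	unsigned char buf[BLK_SIZE];
	unsigned char digest[MD5_DIGEST_LENGTH];
	unsigned long long blk_nr = 0;
	FILE *dev;

	if (argc < 2 || !(dev = fopen(argv[1], "rb")))
		return 1;
	while (fread(buf, 1, BLK_SIZE, dev) == BLK_SIZE) {
		MD5(buf, BLK_SIZE, digest);
		/* a real implementation would persist the digest and
		 * compare it on re-read to catch hardware defects early */
		printf("%llu %02x%02x%02x%02x...\n", blk_nr++,
		       digest[0], digest[1], digest[2], digest[3]);
	}
	fclose(dev);
	return 0;
}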

In January 2016, I took a few days of holiday to improve the upstream
version of MARS a little, mainly addressing some checkpatch issues. After
that, I had to do much other work at 1&1, so unfortunately the development
of this part got stuck at that point. Sorry. The attached code is more or
less just for your information that this part is not dead, and that
development will continue.

This autumn, I got a new boss and some new objectives for 2017.

One of the new objectives will involve MARS.

The current ShaHoLin sharding architecture has been directly migrated
from the former DRBD hardware setup to MARS: it just consists of about
1000 _pairs_ of hard-iron iSCSI storage servers plus some hard-iron
standalone servers with local hardware RAIDs. They host about
500 MARS resources (originally DRBD resources) just for the web servers;
there are even more resources on (already virtualized) database servers.

The former should be virtualized during 2017 to reduce the amount of
server iron, likely using LXC and/or KVM.

The resulting future system should increase flexibility by allowing
MARS resource data migration across the _whole_ pool. This means that
MARS will give up the traditional DRBD-like pairing in favour of a new
feature: treating all of the existing storage like one big "virtual
LVM-like storage pool".

Notice that MARS's internal _architecture_ can already do this: it
allows for k > 2 replicas, and it already has dynamic join-cluster,
join-resource and leave-resource operations which can easily be used for
runtime data migration during operation (while resources are loaded),
even for very big resources. After adding a new operation "merge-cluster",
which checks that all resource names are disjoint, this _would_ even work
with the current version of MARS, at least in theory.
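
To make that check concrete, here is a minimal hypothetical sketch
(my own, not MARS code) of the disjointness test which merge-cluster
would have to perform, assuming each cluster can deliver a sorted
array of its resource names:

/* hypothetical sketch: linear merge-style scan over two sorted
 * name arrays; returns true when no name occurs in both clusters */
#include <stdbool.h>
#include <string.h>

static bool resources_disjoint(const char **a, int na,
			       const char **b, int nb)
{
	int i = 0, j = 0;

	while (i < na && j < nb) {
		int cmp = strcmp(a[i], b[j]);

		if (cmp == 0)
			return false;	/* same name in both clusters */
		if (cmp < 0)
			i++;
		else
			j++;
	}
	return true;
}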

However, the current limitation lies in the internal _metadata_ updates
(not in the IO data paths): currently all cluster members exchange
all _metadata_ with all other nodes, leading to O(n^2) _metadata_
communications. A future version of MARS will reduce this to the
necessary scale (only among the nodes participating in a resource),
and it will communicate the remaining metadata less frequently and no
longer full-mesh.
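
To put rough numbers on this (a back-of-the-envelope estimate of my own,
not from the slides): with n = 2000 nodes, full-mesh metadata exchange
maintains n * (n - 1) / 2 = 1,999,000 node pairs, while restricting the
exchange to resource members with k = 2 replicas needs only on the order
of one peer relation per resource, i.e. a few thousand in total, roughly
three orders of magnitude less.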

As a side note: hopefully I will also get the necessary time to replace
the current symlink tree with metadata files, which should also improve
the metadata scaling properties.

The goal will be a very low number of MARS clusters (one for the US, one
for the EU, and both probably split into web hosting versus databases),
each consisting of several thousand nodes. The realtime-critical data
IO paths will stay with the sharding principle, retaining its excellent
scalability. Only the _metadata_ updates will follow the "big cluster"
architectural approach, and only as far as necessary; they are not
time-critical anyway.

As always, regarding the opensource community part of my work: it would
be nice if some other kernel hackers joined the MARS development in 2017,
at least to help me get it upstream.

I would be excited to be invited to the next kernel summit
or a similar meeting.

A happy new year from your devoted

Thomas


[1] https://github.com/schoebel/mars/blob/master/docu/MARS_GUUG2016.pdf

[2] https://github.com/schoebel/mars


Thomas Schoebel-Theuer (32):
mars: add new module lamport
mars: add new module brick_say
mars: add new module brick_mem
mars: add new module brick_checking
mars: add new module meta
mars: add new module brick
mars: add new module lib_pairing_heap
mars: add new module lib_queue
mars: add new module lib_rank
mars: add new module lib_limiter
mars: add new module lib_timing
mars: add new module vfs_compat
mars: add new module xio
mars: add new module xio_net
mars: add new module lib_mapfree
mars: add new module lib_log
mars: add new module xio_bio
mars: add new module xio_sio
mars: add new module xio_client
mars: add new module xio_if
mars: add new module xio_copy
mars: add new module xio_trans_logger
mars: add new module xio_server
mars: add new module strategy
mars: add new module main_strategy
mars: add new module net
mars: add new module server_strategy
mars: add new module mars_proc
mars: add new module mars_main
mars: add new module Makefile
mars: add new module Kconfig
mars: activate build

drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/mars/Kconfig | 266 +
drivers/staging/mars/Makefile | 96 +
drivers/staging/mars/brick.c | 723 +++
drivers/staging/mars/brick_mem.c | 1080 ++++
drivers/staging/mars/brick_say.c | 920 +++
drivers/staging/mars/lamport.c | 61 +
drivers/staging/mars/lib/lib_limiter.c | 163 +
drivers/staging/mars/lib/lib_rank.c | 87 +
drivers/staging/mars/lib/lib_timing.c | 68 +
drivers/staging/mars/mars/main_strategy.c | 2135 +++++++
drivers/staging/mars/mars/mars_main.c | 6160 ++++++++++++++++++++
drivers/staging/mars/mars/mars_proc.c | 389 ++
drivers/staging/mars/mars/mars_proc.h | 34 +
drivers/staging/mars/mars/net.c | 109 +
drivers/staging/mars/mars/server_strategy.c | 436 ++
drivers/staging/mars/mars/strategy.h | 239 +
drivers/staging/mars/xio_bricks/lib_log.c | 506 ++
drivers/staging/mars/xio_bricks/lib_mapfree.c | 382 ++
drivers/staging/mars/xio_bricks/xio.c | 227 +
drivers/staging/mars/xio_bricks/xio_bio.c | 845 +++
drivers/staging/mars/xio_bricks/xio_client.c | 1083 ++++
drivers/staging/mars/xio_bricks/xio_copy.c | 1005 ++++
drivers/staging/mars/xio_bricks/xio_if.c | 892 +++
drivers/staging/mars/xio_bricks/xio_net.c | 1849 ++++++
drivers/staging/mars/xio_bricks/xio_server.c | 493 ++
drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++
drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 +++++++++++
include/linux/brick/brick.h | 620 ++
include/linux/brick/brick_checking.h | 107 +
include/linux/brick/brick_mem.h | 218 +
include/linux/brick/brick_say.h | 89 +
include/linux/brick/lamport.h | 26 +
include/linux/brick/lib_limiter.h | 52 +
include/linux/brick/lib_pairing_heap.h | 109 +
include/linux/brick/lib_queue.h | 165 +
include/linux/brick/lib_rank.h | 136 +
include/linux/brick/lib_timing.h | 182 +
include/linux/brick/meta.h | 106 +
include/linux/brick/vfs_compat.h | 48 +
include/linux/xio/lib_log.h | 333 ++
include/linux/xio/lib_mapfree.h | 84 +
include/linux/xio/xio.h | 319 +
include/linux/xio/xio_bio.h | 85 +
include/linux/xio/xio_client.h | 105 +
include/linux/xio/xio_copy.h | 115 +
include/linux/xio/xio_if.h | 109 +
include/linux/xio/xio_net.h | 177 +
include/linux/xio/xio_server.h | 91 +
include/linux/xio/xio_sio.h | 68 +
include/linux/xio/xio_trans_logger.h | 271 +
52 files changed, 27854 insertions(+)
create mode 100644 drivers/staging/mars/Kconfig
create mode 100644 drivers/staging/mars/Makefile
create mode 100644 drivers/staging/mars/brick.c
create mode 100644 drivers/staging/mars/brick_mem.c
create mode 100644 drivers/staging/mars/brick_say.c
create mode 100644 drivers/staging/mars/lamport.c
create mode 100644 drivers/staging/mars/lib/lib_limiter.c
create mode 100644 drivers/staging/mars/lib/lib_rank.c
create mode 100644 drivers/staging/mars/lib/lib_timing.c
create mode 100644 drivers/staging/mars/mars/main_strategy.c
create mode 100644 drivers/staging/mars/mars/mars_main.c
create mode 100644 drivers/staging/mars/mars/mars_proc.c
create mode 100644 drivers/staging/mars/mars/mars_proc.h
create mode 100644 drivers/staging/mars/mars/net.c
create mode 100644 drivers/staging/mars/mars/server_strategy.c
create mode 100644 drivers/staging/mars/mars/strategy.h
create mode 100644 drivers/staging/mars/xio_bricks/lib_log.c
create mode 100644 drivers/staging/mars/xio_bricks/lib_mapfree.c
create mode 100644 drivers/staging/mars/xio_bricks/xio.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_client.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_copy.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c
create mode 100644 drivers/staging/mars/xio_bricks/xio_trans_logger.c
create mode 100644 include/linux/brick/brick.h
create mode 100644 include/linux/brick/brick_checking.h
create mode 100644 include/linux/brick/brick_mem.h
create mode 100644 include/linux/brick/brick_say.h
create mode 100644 include/linux/brick/lamport.h
create mode 100644 include/linux/brick/lib_limiter.h
create mode 100644 include/linux/brick/lib_pairing_heap.h
create mode 100644 include/linux/brick/lib_queue.h
create mode 100644 include/linux/brick/lib_rank.h
create mode 100644 include/linux/brick/lib_timing.h
create mode 100644 include/linux/brick/meta.h
create mode 100644 include/linux/brick/vfs_compat.h
create mode 100644 include/linux/xio/lib_log.h
create mode 100644 include/linux/xio/lib_mapfree.h
create mode 100644 include/linux/xio/xio.h
create mode 100644 include/linux/xio/xio_bio.h
create mode 100644 include/linux/xio/xio_client.h
create mode 100644 include/linux/xio/xio_copy.h
create mode 100644 include/linux/xio/xio_if.h
create mode 100644 include/linux/xio/xio_net.h
create mode 100644 include/linux/xio/xio_server.h
create mode 100644 include/linux/xio/xio_sio.h
create mode 100644 include/linux/xio/xio_trans_logger.h

--
2.11.0


2016-12-30 22:58:04

by Thomas Schoebel-Theuer

Subject: [RFC 01/32] mars: add new module lamport

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/lamport.c | 61 ++++++++++++++++++++++++++++++++++++++++++
include/linux/brick/lamport.h | 26 ++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100644 drivers/staging/mars/lamport.c
create mode 100644 include/linux/brick/lamport.h

diff --git a/drivers/staging/mars/lamport.c b/drivers/staging/mars/lamport.c
new file mode 100644
index 000000000000..373093f6e35f
--- /dev/null
+++ b/drivers/staging/mars/lamport.c
@@ -0,0 +1,61 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/semaphore.h>
+
+#include <linux/brick/lamport.h>
+
+/* TODO: replace with spinlock if possible (first check) */
+struct semaphore lamport_sem = __SEMAPHORE_INITIALIZER(lamport_sem, 1);
+struct timespec lamport_now = {};
+
+void get_lamport(struct timespec *now)
+{
+ int diff;
+
+ down(&lamport_sem);
+
+ *now = CURRENT_TIME;
+ diff = timespec_compare(now, &lamport_now);
+ if (diff >= 0) {
+ timespec_add_ns(now, 1);
+ memcpy(&lamport_now, now, sizeof(lamport_now));
+ } else {
+ timespec_add_ns(&lamport_now, 1);
+ memcpy(now, &lamport_now, sizeof(*now));
+ }
+
+ up(&lamport_sem);
+}
+
+void set_lamport(struct timespec *old)
+{
+ int diff;
+
+ down(&lamport_sem);
+
+ diff = timespec_compare(old, &lamport_now);
+ if (diff >= 0) {
+ memcpy(&lamport_now, old, sizeof(lamport_now));
+ timespec_add_ns(&lamport_now, 1);
+ }
+
+ up(&lamport_sem);
+}
diff --git a/include/linux/brick/lamport.h b/include/linux/brick/lamport.h
new file mode 100644
index 000000000000..9aac0ce01bb4
--- /dev/null
+++ b/include/linux/brick/lamport.h
@@ -0,0 +1,26 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LAMPORT_H
+#define LAMPORT_H
+
+#include <linux/time.h>
+
+extern void get_lamport(struct timespec *now);
+extern void set_lamport(struct timespec *old);
+
+#endif
--
2.11.0
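
The module above couples a classic Lamport clock to physical time:
timestamps handed out locally are strictly monotonic, and set_lamport()
folds in timestamps received from remote nodes so that causal ordering
survives clock skew between cluster members. A minimal userspace sketch
of the same idea (my own simplification to a scalar nanosecond counter,
without the locking):

static long long lamport_now;	/* protected by lamport_sem in the kernel code */

/* stamp a local event: advance past both the wall clock and
 * everything seen so far */
static long long get_lamport(long long wall_ns)
{
	if (wall_ns > lamport_now)
		lamport_now = wall_ns;
	return ++lamport_now;
}

/* account for a timestamp received from a peer */
static void set_lamport(long long received_ns)
{
	if (received_ns > lamport_now)
		lamport_now = received_ns;
}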

2016-12-30 22:58:07

by Thomas Schoebel-Theuer

Subject: [RFC 11/32] mars: add new module lib_timing

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/lib/lib_timing.c | 68 +++++++++++++
include/linux/brick/lib_timing.h | 182 ++++++++++++++++++++++++++++++++++
2 files changed, 250 insertions(+)
create mode 100644 drivers/staging/mars/lib/lib_timing.c
create mode 100644 include/linux/brick/lib_timing.h

diff --git a/drivers/staging/mars/lib/lib_timing.c b/drivers/staging/mars/lib/lib_timing.c
new file mode 100644
index 000000000000..1996052cb647
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_timing.c
@@ -0,0 +1,68 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/brick/lib_timing.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+int report_timing(struct timing_stats *tim, char *str, int maxlen)
+{
+ int len = 0;
+ int time = 1;
+ int resol = 1;
+
+ static const char * const units[] = {
+ "us",
+ "ms",
+ "s",
+ "ERROR"
+ };
+ const char *unit = units[0];
+ int unit_index = 0;
+ int i;
+
+ for (i = 0; i < TIMING_MAX; i++) {
+ int this_len = scnprintf(str, maxlen, "<%d%s = %d (%lld) ",
+ resol, unit, tim->tim_count[i],
+ (long long)tim->tim_count[i] * time);
+
+ str += this_len;
+ len += this_len;
+ maxlen -= this_len;
+ if (maxlen <= 1)
+ break;
+ resol <<= 1;
+ time <<= 1;
+ if (resol >= 1000) {
+ resol = 1;
+ unit = units[++unit_index];
+ }
+ }
+ return len;
+}
+
+#endif /* CONFIG_DEBUG_KERNEL */
+
+struct threshold global_io_threshold = {
+ .thr_limit = 30 * 1000000, /* 30 seconds */
+ .thr_factor = 100,
+ .thr_plus = 0,
+};
diff --git a/include/linux/brick/lib_timing.h b/include/linux/brick/lib_timing.h
new file mode 100644
index 000000000000..7081d984a2ce
--- /dev/null
+++ b/include/linux/brick/lib_timing.h
@@ -0,0 +1,182 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_TIMING_H
+#define LIB_TIMING_H
+
+#include <linux/sched.h>
+
+/* Simple infrastructure for timing of arbitrary operations and creation
+ * of some simple histogram statistics.
+ */
+
+#define TIMING_MAX 24
+
+struct timing_stats {
+#ifdef CONFIG_DEBUG_KERNEL
+ int tim_count[TIMING_MAX];
+
+#endif
+};
+
+#define _TIME_THIS(_stamp1, _stamp2, _CODE) \
+ ({ \
+ (_stamp1) = cpu_clock(raw_smp_processor_id()); \
+ \
+ _CODE; \
+ \
+ (_stamp2) = cpu_clock(raw_smp_processor_id()); \
+ (_stamp2) - (_stamp1); \
+ })
+
+#define TIME_THIS(_CODE) \
+ ({ \
+ unsigned long long _stamp1; \
+ unsigned long long _stamp2; \
+ _TIME_THIS(_stamp1, _stamp2, _CODE); \
+ })
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+ ({ \
+ unsigned long long _time; \
+ unsigned long _tmp; \
+ int _i; \
+ \
+ _time = _TIME_THIS(_stamp1, _stamp2, _CODE); \
+ \
+ _tmp = _time / 1000; /* convert to us */ \
+ _i = 0; \
+ while (_tmp > 0 && _i < TIMING_MAX - 1) { \
+ _tmp >>= 1; \
+ _i++; \
+ } \
+ (_timing)->tim_count[_i]++; \
+ _time; \
+ })
+
+#define TIME_STATS(_timing, _CODE) \
+ ({ \
+ unsigned long long _stamp1; \
+ unsigned long long _stamp2; \
+ _TIME_STATS(_timing, _stamp1, _stamp2, _CODE); \
+ })
+
+extern int report_timing(struct timing_stats *tim, char *str, int maxlen);
+
+#else /* CONFIG_DEBUG_KERNEL */
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+ ((void)_timing, (_stamp1) = (_stamp2) = cpu_clock(raw_smp_processor_id()), _CODE, 0)
+
+#define TIME_STATS(_timing, _CODE) \
+ ((void)_timing, _CODE, 0)
+
+#define report_timing(tim, str, maxlen) ((void)tim, 0)
+
+#endif /* CONFIG_DEBUG_KERNEL */
+
+/* A banning represents some overloaded resource.
+ *
+ * Whenever overload is detected, you should call banning_hit()
+ * telling that the overload is assumed / estimated to continue
+ * for some duration in time.
+ *
+ * ATTENTION! These operations are deliberately raceful.
+ * They are meant to deliver _hints_ (e.g. for IO scheduling
+ * decisions etc), not hard facts!
+ *
+ * If you need locking, just surround these operations
+ * with locking by yourself.
+ */
+struct banning {
+ long long ban_last_hit;
+
+ /* statistical */
+ int ban_renew_count;
+ int ban_count;
+};
+
+static inline
+bool banning_hit(struct banning *ban, long long duration)
+{
+ long long now = cpu_clock(raw_smp_processor_id());
+ bool hit = ban->ban_last_hit >= now;
+ long long new_hit = now + duration;
+
+ ban->ban_renew_count++;
+ if (!ban->ban_last_hit || ban->ban_last_hit < new_hit) {
+ ban->ban_last_hit = new_hit;
+ ban->ban_count++;
+ }
+ return hit;
+}
+
+static inline
+bool banning_is_hit(struct banning *ban)
+{
+ long long now = cpu_clock(raw_smp_processor_id());
+
+ return (ban->ban_last_hit && ban->ban_last_hit >= now);
+}
+
+extern inline
+void banning_reset(struct banning *ban)
+{
+ ban->ban_last_hit = 0;
+}
+
+/* Threshold: trigger a banning whenever some latency threshold
+ * is exceeded.
+ */
+struct threshold {
+ struct banning *thr_ban;
+
+ struct threshold *thr_parent; /* support hierarchies */
+ /* tunables */
+ int thr_limit; /* in us */
+ int thr_factor; /* in % */
+ int thr_plus; /* in us */
+ /* statistical */
+ int thr_max; /* in ms */
+ int thr_triggered;
+ int thr_true_hit;
+};
+
+static inline
+void threshold_check(struct threshold *thr, long long latency)
+{
+ int ms = latency >> 20; /* ns -> ms; ignore small rounding error */
+
+ while (thr) {
+ if (ms > thr->thr_max)
+ thr->thr_max = ms;
+ if (thr->thr_limit &&
+ latency > (long long)thr->thr_limit * 1000) {
+ thr->thr_triggered++;
+ if (thr->thr_ban &&
+ !banning_hit(thr->thr_ban, latency * thr->thr_factor / 100 + thr->thr_plus * 1000))
+ thr->thr_true_hit++;
+ }
+ thr = thr->thr_parent;
+ }
+}
+
+extern struct threshold global_io_threshold;
+
+#endif
--
2.11.0
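
A hypothetical usage sketch for the histogram infrastructure above
(my own, not part of the patch): TIME_STATS() times an arbitrary
statement and buckets the latency into power-of-two bins, and
report_timing() renders the buckets into a human-readable string:

#include <linux/delay.h>
#include <linux/brick/lib_timing.h>

static struct timing_stats demo_tim;

static void demo(void)
{
	char buf[256];

	/* the second macro argument can be any statement;
	 * a 1 ms sleep serves as a stand-in here */
	TIME_STATS(&demo_tim, msleep(1));

	report_timing(&demo_tim, buf, sizeof(buf));
	pr_info("demo timing: %s\n", buf);
}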

2016-12-30 22:58:06

by Thomas Schoebel-Theuer

Subject: [RFC 04/32] mars: add new module brick_checking

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/brick_checking.h | 107 +++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
create mode 100644 include/linux/brick/brick_checking.h

diff --git a/include/linux/brick/brick_checking.h b/include/linux/brick/brick_checking.h
new file mode 100644
index 000000000000..957bd5227db9
--- /dev/null
+++ b/include/linux/brick/brick_checking.h
@@ -0,0 +1,107 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_CHECKING_H
+#define BRICK_CHECKING_H
+
+/***********************************************************************/
+
+/* checking */
+
+#if defined(CONFIG_MARS_DEBUG) || defined(CONFIG_MARS_CHECKS)
+#define BRICK_CHECKING true
+#else
+#define BRICK_CHECKING false
+#endif
+
+#define _CHECK_ATOMIC(atom, OP, minval) \
+do { \
+ if (BRICK_CHECKING) { \
+ int __test = atomic_read(atom); \
+ if (unlikely(__test OP(minval))) { \
+ atomic_set(atom, minval); \
+ BRICK_ERR("%d: atomic " #atom " " #OP " " #minval " (%d)\n", __LINE__, __test);\
+ } \
+ } \
+} while (0)
+
+#define CHECK_ATOMIC(atom, minval) \
+ _CHECK_ATOMIC(atom, <, minval)
+
+#define CHECK_HEAD_EMPTY(head) \
+do { \
+ if (BRICK_CHECKING && unlikely(!list_empty(head) && (head)->next)) {\
+ list_del_init(head); \
+ BRICK_ERR("%d: list_head " #head " (%p) not empty\n", __LINE__, head);\
+ } \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG_MEM
+#define CHECK_PTR_DEAD(ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely((ptr) == (void *)0x5a5a5a5a5a5a5a5a)) {\
+ BRICK_FAT("%d: pointer '" #ptr "' is DEAD\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
+#else
+#define CHECK_PTR_DEAD(ptr, label) /*empty*/
+#endif
+
+#define CHECK_PTR_NULL(ptr, label) \
+do { \
+ CHECK_PTR_DEAD(ptr, label); \
+ if (BRICK_CHECKING && unlikely(!(ptr))) { \
+ BRICK_FAT("%d: pointer '" #ptr "' is NULL\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG
+#define CHECK_PTR(ptr, label) \
+do { \
+ CHECK_PTR_NULL(ptr, label); \
+ if (BRICK_CHECKING && unlikely(!virt_addr_valid(ptr))) { \
+ BRICK_FAT("%d: pointer '" #ptr "' (%p) is no valid virtual KERNEL address\n", __LINE__, ptr);\
+ goto label; \
+ } \
+} while (0)
+#else
+#define CHECK_PTR(ptr, label) CHECK_PTR_NULL(ptr, label)
+#endif
+
+#define CHECK_ASPECT(a_ptr, o_ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely((a_ptr)->object != o_ptr)) { \
+ BRICK_FAT("%d: aspect pointer '" #a_ptr "' (%p) belongs to object %p, not to " #o_ptr " (%p)\n", \
+ __LINE__, a_ptr, (a_ptr)->object, o_ptr); \
+ goto label; \
+ } \
+} while (0)
+
+#define _CHECK(ptr, label) \
+do { \
+ if (BRICK_CHECKING && unlikely(!(ptr))) { \
+ BRICK_FAT("%d: condition '" #ptr "' is VIOLATED\n", __LINE__);\
+ goto label; \
+ } \
+} while (0)
+
+#endif
--
2.11.0
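
A hypothetical usage sketch for the checking macros above (my own, not
part of the patch; it assumes the BRICK_*() logging macros from brick_say
and a made-up DEMO_MAGIC constant). The macros validate pointers and
conditions, jump to the given error label on failure, and compile down
to nothing when neither CONFIG_MARS_DEBUG nor CONFIG_MARS_CHECKS is set:

static int demo_check(struct generic_callback *cb)
{
	struct demo_private *priv;	/* hypothetical type */

	CHECK_PTR(cb, err);		/* NULL / poison / virt addr check */
	priv = cb->cb_private;
	CHECK_PTR(priv, err);
	_CHECK(priv->magic == DEMO_MAGIC, err);	/* arbitrary condition */

	return 0;
err:
	return -EINVAL;
}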

2016-12-30 22:58:09

by Thomas Schoebel-Theuer

Subject: [RFC 12/32] mars: add new module vfs_compat

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/vfs_compat.h | 48 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
create mode 100644 include/linux/brick/vfs_compat.h

diff --git a/include/linux/brick/vfs_compat.h b/include/linux/brick/vfs_compat.h
new file mode 100644
index 000000000000..68d082b70b43
--- /dev/null
+++ b/include/linux/brick/vfs_compat.h
@@ -0,0 +1,48 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _MARS_COMPAT
+#define _MARS_COMPAT
+
+/* TRANSITIONAL compatibility to BOTH the old prepatch
+ * and the new wrappers around vfs_*().
+ */
+#ifndef MARS_MAJOR
+#define __USE_COMPAT
+#endif
+
+#ifdef __USE_COMPAT
+
+int _compat_symlink(const char __user *oldname,
+ const char __user *newname,
+ struct timespec *mtime);
+
+int _compat_mkdir(const char __user *pathname, int mode);
+
+int _compat_rename(const char __user *oldname,
+ const char __user *newname);
+
+int _compat_unlink(const char __user *pathname);
+
+#else
+#include <linux/syscalls.h>
+#endif
+#endif
--
2.11.0
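
For illustration, a rough hypothetical sketch (my own, not the actual
wrapper implementation) of what such a vfs_*() wrapper could look like
on a ~4.x kernel. Note the __user annotations: MARS passes kernelspace
strings here, which the historic out-of-tree code handled via set_fs(),
so a real implementation must resolve this mismatch instead of simply
casting as done below:

#include <linux/fs.h>
#include <linux/namei.h>

int _compat_mkdir(const char __user *pathname, int mode)
{
	struct dentry *dentry;
	struct path path;
	int err;

	/* cast drops the __user annotation; see the caveat above */
	dentry = kern_path_create(AT_FDCWD, (const char *)pathname,
				  &path, LOOKUP_DIRECTORY);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
	err = vfs_mkdir(d_inode(path.dentry), dentry, mode);
	done_path_create(&path, dentry);
	return err;
}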

2016-12-30 22:58:11

by Thomas Schoebel-Theuer

Subject: [RFC 16/32] mars: add new module lib_log

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/lib_log.c | 506 ++++++++++++++++++++++++++++++
include/linux/xio/lib_log.h | 333 ++++++++++++++++++++
2 files changed, 839 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/lib_log.c
create mode 100644 include/linux/xio/lib_log.h

diff --git a/drivers/staging/mars/xio_bricks/lib_log.c b/drivers/staging/mars/xio_bricks/lib_log.c
new file mode 100644
index 000000000000..e0d086a0981f
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_log.c
@@ -0,0 +1,506 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bio.h>
+
+#include <linux/xio/lib_log.h>
+
+atomic_t global_aio_flying = ATOMIC_INIT(0);
+
+void exit_logst(struct log_status *logst)
+{
+ int count;
+
+ log_flush(logst);
+
+ /* TODO: replace by event */
+ count = 0;
+ while (atomic_read(&logst->aio_flying) > 0) {
+ if (!count++)
+ XIO_DBG("waiting for IO terminating...");
+ brick_msleep(500);
+ }
+ if (logst->read_aio) {
+ XIO_DBG("putting read_aio\n");
+ GENERIC_INPUT_CALL(logst->input, aio_put, logst->read_aio);
+ logst->read_aio = NULL;
+ }
+ if (logst->log_aio) {
+ XIO_DBG("putting log_aio\n");
+ GENERIC_INPUT_CALL(logst->input, aio_put, logst->log_aio);
+ logst->log_aio = NULL;
+ }
+}
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos)
+{
+ exit_logst(logst);
+
+ memset(logst, 0, sizeof(struct log_status));
+
+ logst->input = input;
+ logst->brick = input->brick;
+ logst->start_pos = start_pos;
+ logst->log_pos = start_pos;
+ logst->end_pos = end_pos;
+ init_waitqueue_head(&logst->event);
+}
+
+#define XIO_LOG_CB_MAX 32
+
+struct log_cb_info {
+ struct aio_object *aio;
+ struct log_status *logst;
+ struct semaphore mutex;
+ atomic_t refcount;
+ int nr_cb;
+ void (*endios[XIO_LOG_CB_MAX])(void *private, int error);
+ void *privates[XIO_LOG_CB_MAX];
+};
+
+static
+void put_log_cb_info(struct log_cb_info *cb_info)
+{
+ if (atomic_dec_and_test(&cb_info->refcount))
+ brick_mem_free(cb_info);
+}
+
+static
+void _do_callbacks(struct log_cb_info *cb_info, int error)
+{
+ int i;
+
+ down(&cb_info->mutex);
+ for (i = 0; i < cb_info->nr_cb; i++) {
+ void (*end_fn)(void *private, int error);
+
+ end_fn = cb_info->endios[i];
+ cb_info->endios[i] = NULL;
+ if (end_fn)
+ end_fn(cb_info->privates[i], error);
+ }
+ up(&cb_info->mutex);
+}
+
+static
+void log_write_endio(struct generic_callback *cb)
+{
+ struct log_cb_info *cb_info = cb->cb_private;
+ struct log_status *logst;
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(cb_info, err);
+
+ logst = cb_info->logst;
+ CHECK_PTR(logst, done);
+
+ _do_callbacks(cb_info, cb->cb_error);
+
+done:
+ put_log_cb_info(cb_info);
+ atomic_dec(&logst->aio_flying);
+ atomic_dec(&global_aio_flying);
+ if (logst->signal_event)
+ wake_up_interruptible(logst->signal_event);
+
+ goto out_return;
+err:
+ XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+void log_flush(struct log_status *logst)
+{
+ struct aio_object *aio = logst->log_aio;
+ struct log_cb_info *cb_info;
+ int align_size;
+ int gap;
+
+ if (!aio || !logst->count)
+ goto out_return;
+ gap = 0;
+ align_size = (logst->align_size / PAGE_SIZE) * PAGE_SIZE;
+ if (align_size > 0) {
+ /* round up to next alignment border */
+ int align_offset = logst->offset & (align_size - 1);
+
+ if (align_offset > 0) {
+ int restlen = aio->io_len - logst->offset;
+
+ gap = align_size - align_offset;
+ if (unlikely(gap > restlen))
+ gap = restlen;
+ }
+ }
+ if (gap > 0) {
+ /* don't leak information from kernelspace */
+ memset(aio->io_data + logst->offset, 0, gap);
+ logst->offset += gap;
+ }
+ aio->io_len = logst->offset;
+ memcpy(&logst->log_pos_stamp, &logst->tmp_pos_stamp, sizeof(logst->log_pos_stamp));
+
+ cb_info = logst->private;
+ logst->private = NULL;
+ SETUP_CALLBACK(aio, log_write_endio, cb_info);
+ cb_info->logst = logst;
+ aio->io_rw = 1;
+
+ atomic_inc(&logst->aio_flying);
+ atomic_inc(&global_aio_flying);
+
+ GENERIC_INPUT_CALL(logst->input, aio_io, aio);
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ logst->count = 0;
+ logst->log_aio = NULL;
+
+ put_log_cb_info(cb_info);
+out_return:;
+}
+
+void *log_reserve(struct log_status *logst, struct log_header *lh)
+{
+ struct log_cb_info *cb_info = logst->private;
+ struct aio_object *aio;
+ void *data;
+
+ short total_len = lh->l_len + OVERHEAD;
+ int offset;
+ int status;
+
+ if (unlikely(lh->l_len <= 0 || lh->l_len > logst->max_size)) {
+ XIO_ERR("trying to write %d bytes, max allowed = %d\n", lh->l_len, logst->max_size);
+ goto err;
+ }
+
+ aio = logst->log_aio;
+ if ((aio && total_len > aio->io_len - logst->offset) ||
+ !cb_info || cb_info->nr_cb >= XIO_LOG_CB_MAX) {
+ log_flush(logst);
+ }
+
+ aio = logst->log_aio;
+ if (!aio) {
+ if (unlikely(logst->private)) {
+ XIO_ERR("oops\n");
+ brick_mem_free(logst->private);
+ }
+ logst->private = brick_zmem_alloc(sizeof(struct log_cb_info));
+ cb_info = logst->private;
+ sema_init(&cb_info->mutex, 1);
+ atomic_set(&cb_info->refcount, 2);
+
+ aio = xio_alloc_aio(logst->brick);
+ cb_info->aio = aio;
+
+ aio->io_pos = logst->log_pos;
+ aio->io_len = logst->chunk_size ? logst->chunk_size : total_len;
+ aio->io_may_write = WRITE;
+ aio->io_prio = logst->io_prio;
+
+ for (;;) {
+ status = GENERIC_INPUT_CALL(logst->input, aio_get, aio);
+ if (likely(status >= 0))
+ break;
+ if (status != -ENOMEM && status != -EAGAIN) {
+ XIO_ERR("aio_get() failed, status = %d\n", status);
+ goto err_free;
+ }
+ brick_msleep(100);
+ }
+
+ if (unlikely(aio->io_len < total_len)) {
+ XIO_ERR("io_len = %d total_len = %d\n", aio->io_len, total_len);
+ goto put;
+ }
+
+ logst->offset = 0;
+ logst->log_aio = aio;
+ }
+
+ offset = logst->offset;
+ data = aio->io_data;
+ DATA_PUT(data, offset, START_MAGIC);
+ DATA_PUT(data, offset, (char)FORMAT_VERSION);
+ logst->validflag_offset = offset;
+ DATA_PUT(data, offset, (char)0); /* valid_flag */
+ DATA_PUT(data, offset, total_len); /* start of next header */
+ DATA_PUT(data, offset, lh->l_stamp.tv_sec);
+ DATA_PUT(data, offset, lh->l_stamp.tv_nsec);
+ DATA_PUT(data, offset, lh->l_pos);
+ logst->reallen_offset = offset;
+ DATA_PUT(data, offset, lh->l_len);
+ DATA_PUT(data, offset, (short)0); /* spare */
+ DATA_PUT(data, offset, 0); /* spare */
+ DATA_PUT(data, offset, lh->l_code);
+ DATA_PUT(data, offset, (short)0); /* spare */
+
+ /* remember the last timestamp */
+ memcpy(&logst->tmp_pos_stamp, &lh->l_stamp, sizeof(logst->tmp_pos_stamp));
+
+ logst->payload_offset = offset;
+ logst->payload_len = lh->l_len;
+
+ return data + offset;
+
+put:
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->log_aio = NULL;
+ return NULL;
+
+err_free:
+ obj_free(aio);
+ if (logst->private) {
+ /* TODO: if callbacks are already registered, call them here with some error code */
+ brick_mem_free(logst->private);
+ logst->private = NULL;
+ }
+err:
+ return NULL;
+}
+
+bool log_finalize(struct log_status *logst, int len, void (*endio)(void *private, int error), void *private)
+{
+ struct aio_object *aio = logst->log_aio;
+ struct log_cb_info *cb_info = logst->private;
+ struct timespec now;
+ void *data;
+ int offset;
+ int restlen;
+ int nr_cb;
+ int crc;
+ bool ok = false;
+
+ CHECK_PTR(aio, err);
+
+ if (unlikely(len > logst->payload_len)) {
+ XIO_ERR("trying to write more than reserved (%d > %d)\n", len, logst->payload_len);
+ goto err;
+ }
+ restlen = aio->io_len - logst->offset;
+ if (unlikely(len + END_OVERHEAD > restlen)) {
+ XIO_ERR("trying to write more than available (%d > %d)\n", len, (int)(restlen - END_OVERHEAD));
+ goto err;
+ }
+ if (unlikely(!cb_info || cb_info->nr_cb >= XIO_LOG_CB_MAX)) {
+ XIO_ERR("too many endio() calls\n");
+ goto err;
+ }
+
+ data = aio->io_data;
+
+ crc = 0;
+ if (logst->do_crc) {
+ unsigned char checksum[xio_digest_size];
+
+ xio_digest(checksum, data + logst->payload_offset, len);
+ crc = *(int *)checksum;
+ }
+
+ /* Correct the length in the header.
+ */
+ offset = logst->reallen_offset;
+ DATA_PUT(data, offset, (short)len);
+
+ /* Write the trailer.
+ */
+ offset = logst->payload_offset + len;
+ DATA_PUT(data, offset, END_MAGIC);
+ DATA_PUT(data, offset, crc);
+ DATA_PUT(data, offset, (char)1); /* valid_flag copy */
+ DATA_PUT(data, offset, (char)0); /* spare */
+ DATA_PUT(data, offset, (short)0); /* spare */
+ DATA_PUT(data, offset, logst->seq_nr + 1);
+ get_lamport(&now); /* when the log entry was ready. */
+ DATA_PUT(data, offset, now.tv_sec);
+ DATA_PUT(data, offset, now.tv_nsec);
+
+ if (unlikely(offset > aio->io_len)) {
+ XIO_FAT("length calculation was wrong: %d > %d\n", offset, aio->io_len);
+ goto err;
+ }
+ logst->offset = offset;
+
+ /* This must come last. In case of incomplete
+ * or even overlapping disk transfers, this indicates
+ * the completeness / integrity of the payload at
+ * the time of starting the transfer.
+ */
+ offset = logst->validflag_offset;
+ DATA_PUT(data, offset, (char)1);
+
+ nr_cb = cb_info->nr_cb++;
+ cb_info->endios[nr_cb] = endio;
+ cb_info->privates[nr_cb] = private;
+
+ /* report success */
+ logst->seq_nr++;
+ logst->count++;
+ ok = true;
+
+err:
+ return ok;
+}
+
+static
+void log_read_endio(struct generic_callback *cb)
+{
+ struct log_status *logst = cb->cb_private;
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(logst, err);
+ logst->error_code = cb->cb_error;
+ logst->got = true;
+ wake_up_interruptible(&logst->event);
+ goto out_return;
+err:
+ XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+int log_read(struct log_status *logst, bool sloppy, struct log_header *lh, void **payload, int *payload_len)
+{
+ struct aio_object *aio;
+ int old_offset;
+ int status;
+
+restart:
+ status = 0;
+ aio = logst->read_aio;
+ if (!aio || logst->do_free) {
+ loff_t this_len;
+
+ if (aio) {
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->read_aio = NULL;
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ }
+
+ this_len = logst->end_pos - logst->log_pos;
+ if (this_len > logst->chunk_size) {
+ this_len = logst->chunk_size;
+ } else if (unlikely(this_len <= 0)) {
+ XIO_ERR(
+ "tried bad IO len %lld, start_pos = %lld log_pos = %lld end_pos = %lld\n",
+ this_len,
+ logst->start_pos,
+ logst->log_pos,
+ logst->end_pos);
+ status = -EOVERFLOW;
+ goto done;
+ }
+
+ aio = xio_alloc_aio(logst->brick);
+ aio->io_pos = logst->log_pos;
+ aio->io_len = this_len;
+ aio->io_prio = logst->io_prio;
+
+ status = GENERIC_INPUT_CALL(logst->input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ if (status != -ENODATA)
+ XIO_ERR("aio_get() failed, status = %d\n", status);
+ goto done_free;
+ }
+ if (unlikely(aio->io_len <= OVERHEAD)) { /* EOF */
+ status = 0;
+ goto done_put;
+ }
+
+ SETUP_CALLBACK(aio, log_read_endio, logst);
+ aio->io_rw = READ;
+ logst->offset = 0;
+ logst->got = false;
+ logst->do_free = false;
+
+ GENERIC_INPUT_CALL(logst->input, aio_io, aio);
+
+ wait_event_interruptible_timeout(logst->event, logst->got, 60 * HZ);
+ status = -ETIME;
+ if (!logst->got)
+ goto done_put;
+ status = logst->error_code;
+ if (status < 0)
+ goto done_put;
+ logst->read_aio = aio;
+ }
+
+ status = log_scan(
+ aio->io_data + logst->offset,
+ aio->io_len - logst->offset,
+ aio->io_pos,
+ logst->offset,
+ sloppy,
+ lh,
+ payload,
+ payload_len,
+ &logst->seq_nr);
+
+ if (unlikely(status == 0)) {
+ XIO_ERR("bad logfile scan\n");
+ status = -EINVAL;
+ }
+ if (unlikely(status < 0))
+ goto done_put;
+
+ /* memoize success */
+ logst->offset += status;
+ if (logst->offset + (logst->max_size + OVERHEAD) * 2 >= aio->io_len)
+ logst->do_free = true;
+
+done:
+ if (status == -ENODATA) {
+ /* indicate EOF */
+ status = 0;
+ }
+ return status;
+
+done_put:
+ old_offset = logst->offset;
+ if (aio) {
+ GENERIC_INPUT_CALL(logst->input, aio_put, aio);
+ logst->read_aio = NULL;
+ logst->log_pos += logst->offset;
+ logst->offset = 0;
+ }
+ if (status == -EAGAIN && old_offset > 0)
+ goto restart;
+ goto done;
+
+done_free:
+ obj_free(aio);
+ logst->read_aio = NULL;
+ goto done;
+}
+
+/***************** module init stuff ************************/
+
+int __init init_log_format(void)
+{
+ XIO_INF("init_log_format()\n");
+ return 0;
+}
+
+void exit_log_format(void)
+{
+ XIO_INF("exit_log_format()\n");
+}
diff --git a/include/linux/xio/lib_log.h b/include/linux/xio/lib_log.h
new file mode 100644
index 000000000000..39de5af559f5
--- /dev/null
+++ b/include/linux/xio/lib_log.h
@@ -0,0 +1,333 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Definitions for logfile format.
+ *
+ * This is meant for sharing between different transaction logger variants,
+ * and/or for sharing with userspace tools (e.g. logfile analyzers).
+ * TODO: factor out some remaining kernelspace issues.
+ */
+
+#ifndef LIB_LOG_H
+#define LIB_LOG_H
+
+#ifdef __KERNEL__
+#include <linux/xio/xio.h>
+
+extern atomic_t global_aio_flying;
+#endif
+
+/* The following structure is memory-only.
+ * Transfers to disk are indirectly via the
+ * format conversion functions below.
+ * The advantage is that even newer disk formats can be parsed
+ * by old code (of course, not all information / features will be
+ * available then).
+ */
+#define log_header log_header_v1
+
+struct log_header_v1 {
+ struct timespec l_stamp;
+ struct timespec l_written;
+ loff_t l_pos;
+
+ short l_len;
+ short l_code;
+ unsigned int l_seq_nr;
+ int l_crc;
+};
+
+#define FORMAT_VERSION 1 /* version of disk format, currently there is no other one */
+
+#define CODE_UNKNOWN 0
+#define CODE_WRITE_NEW 1
+#define CODE_WRITE_OLD 2
+
+#define START_MAGIC 0xa8f7e908d9177957ll
+#define END_MAGIC 0x74941fb74ab5726dll
+
+#define START_OVERHEAD \
+ ( \
+ sizeof(START_MAGIC) + \
+ sizeof(char) + \
+ sizeof(char) + \
+ sizeof(short) + \
+ sizeof(struct timespec) + \
+ sizeof(loff_t) + \
+ sizeof(int) + \
+ sizeof(int) + \
+ sizeof(short) + \
+ sizeof(short) + \
+ 0 \
+ )
+
+#define END_OVERHEAD \
+ ( \
+ sizeof(END_MAGIC) + \
+ sizeof(int) + \
+ sizeof(char) + \
+ 3 + 4 /*spare*/ + \
+ sizeof(struct timespec) + \
+ 0 \
+ )
+
+#define OVERHEAD (START_OVERHEAD + END_OVERHEAD)
+
+/* TODO: make this bytesex-aware. */
+#define DATA_PUT(data, offset, val) \
+ do { \
+ *((typeof(val) *)((data)+offset)) = val; \
+ offset += sizeof(val); \
+ } while (0)
+
+#define DATA_GET(data, offset, val) \
+ do { \
+ val = *((typeof(val) *)((data)+offset)); \
+ offset += sizeof(val); \
+ } while (0)
+
+#define SCAN_TXT \
+"at file_pos = %lld file_offset = %d scan_offset = %d (%lld) test_offset = %d (%lld) restlen = %d: "
+#define SCAN_PAR \
+file_pos, file_offset, offset, file_pos + file_offset + offset, i, file_pos + file_offset + i, restlen
+
+static inline
+int log_scan(void *buf,
+ int len,
+ loff_t file_pos,
+ int file_offset,
+ bool sloppy,
+ struct log_header *lh,
+ void **payload,
+ int *payload_len,
+ unsigned int *seq_nr)
+{
+ bool dirty = false;
+ int offset;
+ int i;
+
+ *payload = NULL;
+ *payload_len = 0;
+
+ for (i = 0; i < len && i <= len - OVERHEAD; i += sizeof(long)) {
+ long long start_magic;
+ char format_version;
+ char valid_flag;
+
+ short total_len;
+ long long end_magic;
+ char valid_copy;
+
+ int restlen = 0;
+ int found_offset;
+
+ offset = i;
+ if (unlikely(i > 0 && !sloppy)) {
+ XIO_ERR(SCAN_TXT "detected a hole / bad data\n", SCAN_PAR);
+ return -EBADMSG;
+ }
+
+ DATA_GET(buf, offset, start_magic);
+ if (unlikely(start_magic != START_MAGIC)) {
+ if (start_magic != 0)
+ dirty = true;
+ continue;
+ }
+
+ restlen = len - i;
+ if (unlikely(restlen < START_OVERHEAD)) {
+ XIO_WRN(SCAN_TXT "magic found, but restlen is too small\n", SCAN_PAR);
+ return -EAGAIN;
+ }
+
+ DATA_GET(buf, offset, format_version);
+ if (unlikely(format_version != FORMAT_VERSION)) {
+ XIO_ERR(SCAN_TXT "found unknown data format %d\n", SCAN_PAR, (int)format_version);
+ return -EBADMSG;
+ }
+ DATA_GET(buf, offset, valid_flag);
+ if (unlikely(!valid_flag)) {
+ XIO_WRN(SCAN_TXT "data is explicitly marked invalid (was there a short write?)\n", SCAN_PAR);
+ continue;
+ }
+ DATA_GET(buf, offset, total_len);
+ if (unlikely(total_len > restlen)) {
+ XIO_WRN(
+ SCAN_TXT "total_len = %d but available data restlen = %d. Was the logfile truncated?\n",
+ SCAN_PAR,
+ total_len,
+ restlen);
+ return -EAGAIN;
+ }
+
+ memset(lh, 0, sizeof(struct log_header));
+
+ DATA_GET(buf, offset, lh->l_stamp.tv_sec);
+ DATA_GET(buf, offset, lh->l_stamp.tv_nsec);
+ DATA_GET(buf, offset, lh->l_pos);
+ DATA_GET(buf, offset, lh->l_len);
+ offset += 2; /* skip spare */
+ offset += 4; /* skip spare */
+ DATA_GET(buf, offset, lh->l_code);
+ offset += 2; /* skip spare */
+
+ found_offset = offset;
+ offset += lh->l_len;
+
+ restlen = len - offset;
+ if (unlikely(restlen < END_OVERHEAD)) {
+ XIO_WRN(SCAN_TXT "restlen %d is too small\n", SCAN_PAR, restlen);
+ return -EAGAIN;
+ }
+
+ DATA_GET(buf, offset, end_magic);
+ if (unlikely(end_magic != END_MAGIC)) {
+ XIO_WRN(SCAN_TXT "bad end_magic 0x%llx, is the logfile truncated?\n", SCAN_PAR, end_magic);
+ return -EBADMSG;
+ }
+ DATA_GET(buf, offset, lh->l_crc);
+ DATA_GET(buf, offset, valid_copy);
+
+ if (unlikely(valid_copy != 1)) {
+ XIO_WRN(SCAN_TXT "found data marked as uncompleted / invalid, len = %d, valid_flag = %d\n",
+ SCAN_PAR, lh->l_len, (int)valid_copy);
+ return -EBADMSG;
+ }
+
+ /* skip spares */
+ offset += 3;
+
+ DATA_GET(buf, offset, lh->l_seq_nr);
+ DATA_GET(buf, offset, lh->l_written.tv_sec);
+ DATA_GET(buf, offset, lh->l_written.tv_nsec);
+
+ if (unlikely(lh->l_seq_nr > *seq_nr + 1 && lh->l_seq_nr && *seq_nr)) {
+ XIO_ERR(
+ SCAN_TXT "record sequence number %u mismatch, expected was %u\n",
+ SCAN_PAR,
+ lh->l_seq_nr,
+ *seq_nr + 1);
+ return -EBADMSG;
+ } else if (unlikely(lh->l_seq_nr != *seq_nr + 1 && lh->l_seq_nr && *seq_nr)) {
+ XIO_WRN(
+ SCAN_TXT "record sequence number %u mismatch, expected was %u\n",
+ SCAN_PAR,
+ lh->l_seq_nr,
+ *seq_nr + 1);
+ }
+ *seq_nr = lh->l_seq_nr;
+
+ if (lh->l_crc) {
+ unsigned char checksum[xio_digest_size];
+
+ xio_digest(checksum, buf + found_offset, lh->l_len);
+ if (unlikely(*(int *)checksum != lh->l_crc)) {
+ XIO_ERR(SCAN_TXT "data checksumming mismatch, length = %d\n", SCAN_PAR, lh->l_len);
+ return -EBADMSG;
+ }
+ }
+
+ /* last check */
+ if (unlikely(total_len != offset - i)) {
+ XIO_ERR(SCAN_TXT "internal size mismatch: %d != %d\n", SCAN_PAR, total_len, offset - i);
+ return -EBADMSG;
+ }
+
+ /* Success... */
+ *payload = buf + found_offset;
+ *payload_len = lh->l_len;
+
+ /* don't cry when nullbytes have been skipped */
+ if (i > 0 && dirty)
+ XIO_WRN(SCAN_TXT "skipped %d dirty bytes to find valid data\n", SCAN_PAR, i);
+
+ return offset;
+ }
+
+ XIO_ERR("could not find any useful data within len=%d bytes\n", len);
+ return -EAGAIN;
+}
+
+/**************************************************************************/
+
+#ifdef __KERNEL__
+
+/* Bookkeeping status between calls
+ */
+struct log_status {
+ /* interfacing */
+ wait_queue_head_t *signal_event;
+ /* tunables */
+ loff_t start_pos;
+ loff_t end_pos;
+
+ int align_size; /* alignment between requests */
+ int chunk_size; /* must be at least 8K (better 64k) */
+ int max_size; /* max payload length */
+ int io_prio;
+ bool do_crc;
+
+ /* informational */
+ atomic_t aio_flying;
+ int count;
+ loff_t log_pos;
+ struct timespec log_pos_stamp;
+
+ /* internal */
+ struct timespec tmp_pos_stamp;
+ struct xio_input *input;
+ struct xio_brick *brick;
+ struct xio_info info;
+ int offset;
+ int validflag_offset;
+ int reallen_offset;
+ int payload_offset;
+ int payload_len;
+ unsigned int seq_nr;
+ struct aio_object *log_aio;
+ struct aio_object *read_aio;
+
+ wait_queue_head_t event;
+ int error_code;
+ bool got;
+ bool do_free;
+ void *private;
+};
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos);
+void exit_logst(struct log_status *logst);
+
+void log_flush(struct log_status *logst);
+
+void *log_reserve(struct log_status *logst, struct log_header *lh);
+
+bool log_finalize(struct log_status *logst, int len, void (*endio)(void *private, int error), void *private);
+
+int log_read(struct log_status *logst, bool sloppy, struct log_header *lh, void **payload, int *payload_len);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_log_format(void);
+extern void exit_log_format(void);
+
+#endif
+#endif
--
2.11.0
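
Reconstructed from the DATA_PUT() sequences in log_reserve() and
log_finalize() above (my own annotation, assuming a 64-bit kernel where
sizeof(struct timespec) == 16 and sizeof(loff_t) == 8), an on-disk log
record is laid out as follows, with START_OVERHEAD = 48 and
END_OVERHEAD = 36:

offset  size  field
     0     8  START_MAGIC
     8     1  format_version
     9     1  valid_flag (written last by log_finalize)
    10     2  total_len (= distance to the next header)
    12    16  l_stamp (tv_sec, tv_nsec)
    28     8  l_pos
    36     2  l_len (corrected to the real payload length)
    38     6  spares
    44     2  l_code
    46     2  spare
    48     n  payload (n = l_len)
  48+n     8  END_MAGIC
  56+n     4  l_crc (over the payload, when do_crc is set)
  60+n     1  valid_copy
  61+n     3  spares
  64+n     4  l_seq_nr
  68+n    16  l_written (tv_sec, tv_nsec)

The valid_flag at offset 9 is deliberately written last, so an
incomplete or overlapping disk transfer leaves the record marked
invalid, which log_scan() then detects.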

2016-12-30 22:58:23

by Thomas Schoebel-Theuer

Subject: [RFC 24/32] mars: add new module strategy

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/strategy.h | 239 +++++++++++++++++++++++++++++++++++
1 file changed, 239 insertions(+)
create mode 100644 drivers/staging/mars/mars/strategy.h

diff --git a/drivers/staging/mars/mars/strategy.h b/drivers/staging/mars/mars/strategy.h
new file mode 100644
index 000000000000..d570772847c2
--- /dev/null
+++ b/drivers/staging/mars/mars/strategy.h
@@ -0,0 +1,239 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* OLD CODE => will disappear! */
+#ifndef _OLD_STRATEGY
+#define _OLD_STRATEGY
+
+#define _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+#include <linux/xio/xio.h>
+
+#define MARS_ARGV_MAX 4
+
+extern loff_t global_total_space;
+extern loff_t global_remaining_space;
+
+extern int global_logrot_auto;
+extern int global_free_space_0;
+extern int global_free_space_1;
+extern int global_free_space_2;
+extern int global_free_space_3;
+extern int global_free_space_4;
+extern int global_sync_want;
+extern int global_sync_nr;
+extern int global_sync_limit;
+extern int mars_rollover_interval;
+extern int mars_scan_interval;
+extern int mars_propagate_interval;
+extern int mars_sync_flip_interval;
+extern int mars_peer_abort;
+extern int mars_emergency_mode;
+extern int mars_reset_emergency;
+extern int mars_keep_msg;
+
+extern int mars_fast_fullsync;
+
+#define MARS_DENT(TYPE) \
+ struct list_head dent_link; \
+ struct list_head brick_list; \
+ struct TYPE *d_parent; \
+ char *d_argv[MARS_ARGV_MAX]; /* for internal use, will be automatically deallocated*/\
+ char *d_args; /* ditto uninterpreted */ \
+ char *d_name; /* current path component */ \
+ char *d_rest; /* some "meaningful" rest of d_name*/ \
+ char *d_path; /* full absolute path */ \
+ struct say_channel *d_say_channel; /* for messages */ \
+ loff_t d_corr_A; /* logical size correction */ \
+ loff_t d_corr_B; /* logical size correction */ \
+ int d_depth; \
+ /* from readdir() => often DT_UNKNOWN */ \
+ /* don't rely on it - use stat_val.mode instead */ \
+ unsigned int d_type; \
+ int d_class; /* for pre-grouping order */ \
+ int d_serial; /* for pre-grouping order */ \
+ int d_version; /* dynamic programming per call of mars_ent_work() */\
+ int d_child_count; \
+ bool d_killme; \
+ bool d_use_channel; \
+ struct kstat stat_val; \
+ char *link_val; \
+ struct mars_global *d_global; \
+ void (*d_private_destruct)(void *private); \
+ void *d_private
+
+struct mars_dent {
+ MARS_DENT(mars_dent);
+};
+
+extern const struct meta mars_kstat_meta[];
+extern const struct meta mars_dent_meta[];
+
+struct mars_global {
+ struct rw_semaphore dent_mutex;
+ struct rw_semaphore brick_mutex;
+ struct generic_switch global_power;
+ struct list_head dent_anchor;
+ struct list_head brick_anchor;
+
+ wait_queue_head_t main_event;
+ int global_version;
+ int deleted_my_border;
+ int deleted_border;
+ int deleted_min;
+ bool main_trigger;
+};
+
+extern void bind_to_dent(struct mars_dent *dent, struct say_channel **ch);
+
+typedef int (*mars_dent_checker_fn)(struct mars_dent *parent,
+ const char *name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel);
+
+typedef int (*mars_dent_worker_fn)(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction);
+
+extern int mars_dent_work(struct mars_global *global,
+ char *dirname,
+ int allocsize,
+ mars_dent_checker_fn checker,
+ mars_dent_worker_fn worker,
+ void *buf,
+ int maxdepth);
+
+extern struct mars_dent *_mars_find_dent(struct mars_global *global, const char *path);
+extern struct mars_dent *mars_find_dent(struct mars_global *global, const char *path);
+extern int mars_find_dent_all(struct mars_global *global, char *prefix, struct mars_dent ***table);
+extern void xio_kill_dent(struct mars_dent *dent);
+extern void xio_free_dent(struct mars_dent *dent);
+extern void xio_free_dent_all(struct mars_global *global, struct list_head *anchor);
+
+/* low-level brick instantiation */
+
+extern struct xio_brick *mars_find_brick(struct mars_global *global, const void *brick_type, const char *path);
+extern struct xio_brick *xio_make_brick(struct mars_global *global,
+ struct mars_dent *belongs,
+ const void *_brick_type,
+ const char *path,
+ const char *name);
+
+extern int xio_free_brick(struct xio_brick *brick);
+extern int xio_kill_brick(struct xio_brick *brick);
+extern int xio_kill_brick_all(struct mars_global *global, struct list_head *anchor, bool use_dent_link);
+extern int xio_kill_brick_when_possible(struct mars_global *global,
+ struct list_head *anchor,
+ bool use_dent_link,
+ const struct xio_brick_type *type,
+ bool even_on);
+
+/* mid-level brick instantiation (identity is based on path strings) */
+
+extern char *_vpath_make(int line, const char *fmt, va_list *args);
+extern char *_path_make(int line, const char *fmt, ...);
+extern char *_backskip_replace(int line, const char *path, char delim, bool insert, const char *fmt, ...);
+
+#define vpath_make(_fmt, _args) \
+ _vpath_make(__LINE__, _fmt, _args)
+#define path_make(_fmt, _args...) \
+ _path_make(__LINE__, _fmt, ##_args)
+#define backskip_replace(_path, _delim, _insert, _fmt, _args...) \
+ _backskip_replace(__LINE__, _path, _delim, _insert, _fmt, ##_args)
+
+extern struct xio_brick *path_find_brick(struct mars_global *global, const void *brick_type, const char *fmt, ...);
+
+/* Create a new brick and connect its inputs to a set of predecessors.
+ * When @timeout > 0, switch on the brick as well as its predecessors.
+ */
+extern struct xio_brick *make_brick_all(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ int (*setup_fn)(struct xio_brick *brick, void *private),
+ void *private,
+ const char *new_name,
+ const struct generic_brick_type *new_brick_type,
+ const struct generic_brick_type *prev_brick_type[],
+/* -1 = off, 0 = leave in current state, +1 = create when necessary, +2 = create + switch on */
+ int switch_override,
+ const char *new_fmt,
+ const char *prev_fmt[],
+ int prev_count,
+ ...
+ );
+
+/* general MARS infrastructure */
+
+/* General fs wrappers (for abstraction)
+ */
+extern int mars_stat(const char *path, struct kstat *stat, bool use_lstat);
+extern void mars_sync(void);
+extern int mars_mkdir(const char *path);
+extern int mars_unlink(const char *path);
+extern int mars_symlink(const char *oldpath, const char *newpath, const struct timespec *stamp, uid_t uid);
+extern char *mars_readlink(const char *newpath);
+extern int mars_rename(const char *oldpath, const char *newpath);
+extern void mars_remaining_space(const char *fspath, loff_t *total, loff_t *remaining);
+
+/***********************************************************************/
+
+extern struct mars_global *mars_global;
+
+extern bool xio_check_inputs(struct xio_brick *brick);
+extern bool xio_check_outputs(struct xio_brick *brick);
+
+extern int mars_power_button(struct xio_brick *brick, bool val, bool force_off);
+
+/***********************************************************************/
+
+/* statistics */
+
+extern int global_show_statist;
+
+void show_statistics(struct mars_global *global, const char *class);
+
+/***********************************************************************/
+
+/* quirk */
+
+extern int mars_mem_percent;
+
+extern int main_checker(struct mars_dent *parent,
+ const char *_name,
+ int namlen,
+ unsigned int d_type,
+ int *prefix,
+ int *serial,
+ bool *use_channel);
+
+void from_remote_trigger(void);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_sy(void);
+extern void exit_sy(void);
+
+extern int init_sy_net(void);
+extern void exit_sy_net(void);
+
+#endif
--
2.11.0

2016-12-30 22:58:37

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 25/32] mars: add new module main_strategy

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/main_strategy.c | 2135 +++++++++++++++++++++++++++++
1 file changed, 2135 insertions(+)
create mode 100644 drivers/staging/mars/mars/main_strategy.c

diff --git a/drivers/staging/mars/mars/main_strategy.c b/drivers/staging/mars/mars/main_strategy.c
new file mode 100644
index 000000000000..7929b566d645
--- /dev/null
+++ b/drivers/staging/mars/mars/main_strategy.c
@@ -0,0 +1,2135 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define XIO_DEBUGGING
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+
+#include "strategy.h"
+
+#include <linux/xio/lib_mapfree.h>
+#include <linux/xio/xio_client.h>
+
+#include <linux/brick/vfs_compat.h>
+#include <linux/namei.h>
+#include <linux/kthread.h>
+#include <linux/statfs.h>
+
+#define SKIP_BIO false
+
+/*******************************************************************/
+
+/* meta descriptions */
+
+const struct meta mars_kstat_meta[] = {
+ META_INI(ino, struct kstat, FIELD_UINT),
+ META_INI(mode, struct kstat, FIELD_UINT),
+ META_INI(size, struct kstat, FIELD_INT),
+ META_INI_SUB(atime, struct kstat, xio_timespec_meta),
+ META_INI_SUB(mtime, struct kstat, xio_timespec_meta),
+ META_INI_SUB(ctime, struct kstat, xio_timespec_meta),
+ META_INI_TRANSFER(blksize, struct kstat, FIELD_UINT, 4),
+ {}
+};
+
+const struct meta mars_dent_meta[] = {
+ META_INI(d_name, struct mars_dent, FIELD_STRING),
+ META_INI(d_rest, struct mars_dent, FIELD_STRING),
+ META_INI(d_path, struct mars_dent, FIELD_STRING),
+ META_INI(d_type, struct mars_dent, FIELD_UINT),
+ META_INI(d_class, struct mars_dent, FIELD_INT),
+ META_INI(d_serial, struct mars_dent, FIELD_INT),
+ META_INI(d_corr_A, struct mars_dent, FIELD_INT),
+ META_INI(d_corr_B, struct mars_dent, FIELD_INT),
+ META_INI_SUB(stat_val, struct mars_dent, mars_kstat_meta),
+ META_INI(link_val, struct mars_dent, FIELD_STRING),
+ META_INI(d_args, struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[0], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[1], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[2], struct mars_dent, FIELD_STRING),
+ META_INI(d_argv[3], struct mars_dent, FIELD_STRING),
+ {}
+};
+
+/*******************************************************************/
+
+/* The _compat_*() functions are needed by the out-of-tree version
+ * of MARS for adaptation to different kernel versions.
+ */
+
+/* Hack because of 8bcb77fabd7cbabcad49f58750be8683febee92b
+ */
+static int __path_parent(const char *name, struct path *path, unsigned flags)
+{
+ char *tmp;
+ int len;
+ int error;
+
+ len = strlen(name);
+ while (len > 0 && name[len] != '/')
+ len--;
+ if (unlikely(!len))
+ return -EINVAL;
+
+ tmp = brick_string_alloc(len + 1);
+ strncpy(tmp, name, len);
+ tmp[len] = '\0';
+
+ error = kern_path(tmp, flags | LOOKUP_DIRECTORY | LOOKUP_FOLLOW, path);
+
+ brick_string_free(tmp);
+ return error;
+}
+
+/* code is blindly stolen from symlinkat()
+ * and later adapted to various kernels
+ */
+int _compat_symlink(
+	const char __user *oldname,
+	const char __user *newname,
+	struct timespec *mtime)
+{
+ const int newdfd = AT_FDCWD;
+ int error;
+ char *from;
+ struct dentry *dentry;
+ struct path path;
+ unsigned int lookup_flags = 0;
+
+ from = (char *)oldname;
+
+retry:
+ dentry = user_path_create(newdfd, newname, &path, lookup_flags);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out_putname;
+
+ error = vfs_symlink(path.dentry->d_inode, dentry, from);
+ if (error >= 0 && mtime) {
+ struct iattr iattr = {
+ .ia_valid = ATTR_MTIME | ATTR_MTIME_SET | ATTR_TIMES_SET,
+ .ia_mtime.tv_sec = mtime->tv_sec,
+ .ia_mtime.tv_nsec = mtime->tv_nsec,
+ };
+
+ mutex_lock(&dentry->d_inode->i_mutex);
+ error = notify_change(dentry, &iattr, NULL);
+ mutex_unlock(&dentry->d_inode->i_mutex);
+ }
+ done_path_create(&path, dentry);
+ if (retry_estale(error, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+out_putname:
+ return error;
+}
+
+/* code is stolen from mkdirat()
+ */
+int _compat_mkdir(
+	const char __user *pathname,
+	int mode)
+{
+ const int dfd = AT_FDCWD;
+ struct dentry *dentry;
+ struct path path;
+ int error;
+ unsigned int lookup_flags = LOOKUP_DIRECTORY;
+
+retry:
+ dentry = user_path_create(dfd, pathname, &path, lookup_flags);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ if (!IS_POSIXACL(path.dentry->d_inode))
+ mode &= ~current_umask();
+ error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
+ done_path_create(&path, dentry);
+ if (retry_estale(error, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+ return error;
+}
+
+/* This has some restrictions:
+ * - oldname and newname must reside in the same directory
+ * - standard case, no mountpoints in between
+ * - no security checks (we are called from kernel code anyway)
+ */
+int _compat_rename(
+	const char *oldname,
+	const char *newname)
+{
+ struct path oldpath;
+ struct path newpath;
+ struct dentry *old_dir;
+ struct dentry *new_dir;
+ struct dentry *old_dentry;
+ struct dentry *new_dentry;
+ struct dentry *trap;
+ const char *old_one;
+ const char *new_one;
+ const char *tmp;
+ unsigned int lookup_flags = 0;
+
+ bool should_retry = false;
+
+ int error;
+
+retry:
+ error = __path_parent(oldname, &oldpath, lookup_flags);
+ if (unlikely(error))
+ goto exit;
+ old_dir = oldpath.dentry;
+
+ error = __path_parent(newname, &newpath, lookup_flags);
+ if (unlikely(error))
+ goto exit1;
+ new_dir = newpath.dentry;
+
+ old_one = oldname;
+ for (;;) {
+ for (tmp = old_one; *tmp && *tmp != '/'; tmp++)
+ ; /* empty */
+ if (!*tmp)
+ break;
+ old_one = tmp + 1;
+ }
+
+ new_one = newname;
+ for (;;) {
+ for (tmp = new_one; *tmp && *tmp != '/'; tmp++)
+ ; /* empty */
+ if (!*tmp)
+ break;
+ new_one = tmp + 1;
+ }
+
+ error = mnt_want_write(oldpath.mnt);
+ if (unlikely(error))
+ goto exit2;
+ trap = lock_rename(new_dir, old_dir);
+
+ old_dentry = lookup_one_len(old_one, old_dir, strlen(old_one));
+ error = PTR_ERR(old_dentry);
+ if (unlikely(IS_ERR(old_dentry)))
+ goto out_unlock_rename;
+ error = -ENOENT;
+ if (unlikely(d_is_negative(old_dentry)))
+ goto out_dput_old;
+ error = -EINVAL;
+ if (unlikely(old_dentry == trap))
+ goto out_dput_old;
+
+ new_dentry = lookup_one_len(new_one, new_dir, strlen(new_one));
+ error = PTR_ERR(new_dentry);
+ if (unlikely(IS_ERR(new_dentry)))
+ goto out_dput_old;
+ error = -ENOTEMPTY;
+ if (unlikely(new_dentry == trap))
+ goto out_dput_new;
+
+ error = vfs_rename(
+ old_dir->d_inode, old_dentry,
+ new_dir->d_inode, new_dentry, NULL, 0);
+
+out_dput_new:
+ dput(new_dentry);
+
+out_dput_old:
+ dput(old_dentry);
+
+out_unlock_rename:
+ unlock_rename(new_dir, old_dir);
+ mnt_drop_write(oldpath.mnt);
+exit2:
+ if (retry_estale(error, lookup_flags))
+ should_retry = true;
+ path_put(&newpath);
+exit1:
+ path_put(&oldpath);
+ if (should_retry) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+exit:
+ return error;
+}
+
+/* This has some restrictions:
+ * - standard case, no mountpoints in between
+ * - no security checks (we are called from kernel code anyway)
+ */
+int _compat_unlink(const char *pathname)
+{
+ struct path path;
+ struct dentry *parent;
+ struct dentry *dentry;
+ struct inode *inode = NULL;
+ const char *one;
+ const char *tmp;
+ int error;
+ unsigned int lookup_flags = 0;
+
+retry:
+ error = __path_parent(pathname, &path, lookup_flags);
+ if (unlikely(error))
+ goto exit;
+
+ parent = path.dentry;
+ if (unlikely(d_is_negative(parent)))
+ goto exit1;
+
+ one = pathname;
+ for (;;) {
+ for (tmp = one; *tmp && *tmp != '/'; tmp++)
+ ; /* empty */
+ if (!*tmp)
+ break;
+ one = tmp + 1;
+ }
+
+ error = mnt_want_write(path.mnt);
+ if (error)
+ goto exit1;
+ mutex_lock_nested(&parent->d_inode->i_mutex, I_MUTEX_PARENT);
+
+ dentry = lookup_one_len(one, parent, strlen(one));
+ error = PTR_ERR(dentry);
+ if (unlikely(IS_ERR(dentry)))
+ goto exit2;
+ error = -ENOENT;
+ if (unlikely(d_is_negative(dentry)))
+ goto exit3;
+
+ inode = dentry->d_inode;
+ ihold(inode);
+
+ error = vfs_unlink(parent->d_inode, dentry, NULL);
+
+exit3:
+ dput(dentry);
+exit2:
+ mutex_unlock(&parent->d_inode->i_mutex);
+ if (inode)
+ iput(inode);
+ mnt_drop_write(path.mnt);
+exit1:
+ path_put(&path);
+exit:
+ if (retry_estale(error, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ inode = NULL;
+ goto retry;
+ }
+ return error;
+}
+
+/*******************************************************************/
+
+/* some helpers */
+
+static inline
+int _length_paranoia(int len, int line)
+{
+ if (unlikely(len < 0)) {
+ XIO_ERR("implausible string length %d (line=%d)\n", len, line);
+ len = PAGE_SIZE - 2;
+ } else if (unlikely(len > PAGE_SIZE - 2)) {
+ XIO_WRN(
+ "string length %d will be truncated to %d, line=%d\n",
+ len, (int)PAGE_SIZE - 2, line);
+ len = PAGE_SIZE - 2;
+ }
+ return len;
+}
+
+int mars_stat(const char *path, struct kstat *stat, bool use_lstat)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ if (use_lstat)
+ status = vfs_lstat((char *)path, stat);
+ else
+ status = vfs_stat((char *)path, stat);
+ set_fs(oldfs);
+
+ if (likely(status >= 0))
+ set_lamport(&stat->mtime);
+
+ return status;
+}
+
+void mars_sync(void)
+{
+ struct file *f;
+
+ mm_segment_t oldfs;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ f = filp_open("/mars", O_DIRECTORY | O_RDONLY, 0);
+ set_fs(oldfs);
+ if (unlikely(IS_ERR(f)))
+ goto out_return;
+ if (likely(f->f_mapping)) {
+ struct inode *inode = f->f_mapping->host;
+
+ if (likely(inode && inode->i_sb)) {
+ struct super_block *sb = inode->i_sb;
+
+ down_read(&sb->s_umount);
+ sync_filesystem(sb);
+ up_read(&sb->s_umount);
+ }
+ }
+
+ filp_close(f, NULL);
+out_return:;
+}
+
+int mars_mkdir(const char *path)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = _compat_mkdir(path, 0700);
+ set_fs(oldfs);
+
+ return status;
+}
+
+int mars_unlink(const char *path)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = _compat_unlink(path);
+ set_fs(oldfs);
+
+ return status;
+}
+
+int mars_symlink(const char *oldpath, const char *newpath, const struct timespec *stamp, uid_t uid)
+{
+ char *tmp = backskip_replace(newpath, '/', true, "/.tmp-");
+
+ mm_segment_t oldfs;
+ struct kstat stat = {};
+ struct timespec times[2];
+ int status = -ENOMEM;
+
+ if (unlikely(!tmp))
+ goto done;
+
+ if (stamp)
+ memcpy(&times[0], stamp, sizeof(times[0]));
+ else
+ get_lamport(&times[0]);
+
+#ifdef CONFIG_MARS_DEBUG
+ while (mars_hang_mode & 4)
+ brick_msleep(100);
+#endif
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ /* Some filesystems have only full second resolution.
+ * Thus it may happen that the new timestamp is not
+	 * truly moving forward when called twice in quick succession.
+	 * This is a _workaround_, to be replaced by a better
+	 * method at some point.
+ */
+ status = vfs_lstat((char *)newpath, &stat);
+ if (status >= 0 &&
+ !stamp &&
+ !stat.mtime.tv_nsec &&
+ times[0].tv_sec == stat.mtime.tv_sec) {
+ XIO_DBG("workaround timestamp tv_sec=%ld\n", stat.mtime.tv_sec);
+ times[0].tv_sec = stat.mtime.tv_sec + 1;
+		/* Setting tv_nsec to 1 prevents us from unnecessarily
+		 * re-entering this workaround if the original tv_nsec
+		 * happened to be 0 or the workaround had already fired.
+ */
+ times[0].tv_nsec = 1;
+ }
+
+ (void)_compat_unlink(tmp);
+ status = _compat_symlink(oldpath, tmp, &times[0]);
+ if (status >= 0) {
+ set_lamport(&times[0]);
+ status = mars_rename(tmp, newpath);
+ }
+ set_fs(oldfs);
+ brick_string_free(tmp);
+
+done:
+ return status;
+}
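+
+/* Example of the update protocol above (a sketch; the path is
+ * hypothetical): updating the symlink /mars/resource-a/actual first
+ * writes /mars/resource-a/.tmp-actual with the adjusted timestamp,
+ * then renames it over the target. Readers never observe a
+ * half-written link, and the Lamport clock keeps moving strictly
+ * forward even on filesystems with full-second timestamp resolution.
+ */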
+
+char *mars_readlink(const char *newpath)
+{
+ char *res = NULL;
+ struct path path = {};
+
+ mm_segment_t oldfs;
+ struct inode *inode;
+ int len;
+ int status = -ENOMEM;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ status = user_path_at(AT_FDCWD, newpath, 0, &path);
+ if (unlikely(status < 0)) {
+ XIO_DBG("link '%s' does not exist, status = %d\n", newpath, status);
+ goto done_fs;
+ }
+
+ inode = path.dentry->d_inode;
+ if (unlikely(!inode || !S_ISLNK(inode->i_mode))) {
+ XIO_ERR("link '%s' has invalid inode\n", newpath);
+ status = -EINVAL;
+ goto done_put;
+ }
+
+ len = i_size_read(inode);
+ if (unlikely(len <= 0 || len > PAGE_SIZE)) {
+ XIO_ERR("link '%s' invalid length = %d\n", newpath, len);
+ status = -EINVAL;
+ goto done_put;
+ }
+ res = brick_string_alloc(len + 2);
+
+ status = inode->i_op->readlink(path.dentry, res, len + 1);
+ if (unlikely(status < 0))
+ XIO_ERR("cannot read link '%s', status = %d\n", newpath, status);
+ else
+ set_lamport(&inode->i_mtime);
+done_put:
+ path_put(&path);
+
+done_fs:
+ set_fs(oldfs);
+ if (unlikely(status < 0)) {
+ if (unlikely(!res))
+ res = brick_string_alloc(1);
+ res[0] = '\0';
+ }
+ return res;
+}
+
+int mars_rename(const char *oldpath, const char *newpath)
+{
+ mm_segment_t oldfs;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = _compat_rename(oldpath, newpath);
+ set_fs(oldfs);
+
+ return status;
+}
+
+loff_t _compute_space(struct kstatfs *kstatfs, loff_t raw_val)
+{
+ int fsize = kstatfs->f_frsize;
+
+ if (fsize <= 0)
+ fsize = kstatfs->f_bsize;
+
+ XIO_INF("fsize = %d raw_val = %lld\n", fsize, raw_val);
+ /* illegal values? cannot do anything.... */
+ if (fsize <= 0)
+ return 0;
+
+ /* prevent intermediate integer overflows */
+ if (fsize <= 1024)
+ return raw_val / (1024 / fsize);
+
+ return raw_val * (fsize / 1024);
+}
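+
+/* Worked example (hypothetical statfs values): with f_frsize = 4096
+ * the result is raw_val * 4, i.e. 4 KiB blocks expressed in KiB;
+ * with f_frsize = 512 it is raw_val / 2. Scaling by the precomputed
+ * ratio instead of computing raw_val * fsize / 1024 avoids an
+ * intermediate 64-bit overflow on very large filesystems.
+ */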
+
+void mars_remaining_space(const char *fspath, loff_t *total, loff_t *remaining)
+{
+ struct path path = {};
+ struct kstatfs kstatfs = {};
+
+ mm_segment_t oldfs;
+ int res;
+
+ *total = *remaining = 0;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ res = user_path_at(AT_FDCWD, fspath, 0, &path);
+
+ set_fs(oldfs);
+
+ if (unlikely(res < 0)) {
+ XIO_ERR("cannot get fspath '%s', err = %d\n\n", fspath, res);
+ goto err;
+ }
+ if (unlikely(!path.dentry)) {
+ XIO_ERR("bad dentry for fspath '%s'\n", fspath);
+ res = -ENXIO;
+ goto done;
+ }
+
+ res = vfs_statfs(&path, &kstatfs);
+ if (unlikely(res < 0))
+ goto done;
+
+ *total = _compute_space(&kstatfs, kstatfs.f_blocks);
+ *remaining = _compute_space(&kstatfs, kstatfs.f_bavail);
+
+done:
+ path_put(&path);
+err:;
+}
+
+/************************************************************/
+
+/* thread binding */
+
+void bind_to_dent(struct mars_dent *dent, struct say_channel **ch)
+{
+ if (!dent) {
+ if (*ch) {
+ remove_binding_from(*ch, current);
+ *ch = NULL;
+ }
+ goto out_return;
+ }
+ /* Memoize the channel. This is executed only once for each dent. */
+ if (unlikely(!dent->d_say_channel)) {
+ struct mars_dent *test = dent->d_parent;
+
+ for (;;) {
+ if (!test) {
+ dent->d_say_channel = default_channel;
+ break;
+ }
+ if (test->d_use_channel && test->d_path) {
+ dent->d_say_channel = make_channel(test->d_path, true);
+ break;
+ }
+ test = test->d_parent;
+ }
+ }
+ if (dent->d_say_channel != *ch) {
+ if (*ch)
+ remove_binding_from(*ch, current);
+ *ch = dent->d_say_channel;
+ if (*ch)
+ bind_to_channel(*ch, current);
+ }
+out_return:;
+}
+
+/************************************************************/
+
+/* infrastructure */
+
+struct mars_global *mars_global;
+
+void (*_local_trigger)(void) = NULL;
+
+static
+void __local_trigger(void)
+{
+ if (mars_global) {
+ mars_global->main_trigger = true;
+ wake_up_interruptible_all(&mars_global->main_event);
+ }
+}
+
+bool xio_check_inputs(struct xio_brick *brick)
+{
+ int max_inputs;
+ int i;
+
+ if (likely(brick->type)) {
+ max_inputs = brick->type->max_inputs;
+ } else {
+ XIO_ERR("uninitialized brick '%s' '%s'\n", brick->brick_name, brick->brick_path);
+ return true;
+ }
+ for (i = 0; i < max_inputs; i++) {
+ struct xio_input *input = brick->inputs[i];
+ struct xio_output *prev_output;
+ struct xio_brick *prev_brick;
+
+ if (!input)
+ continue;
+ prev_output = input->connect;
+ if (!prev_output)
+ continue;
+ prev_brick = prev_output->brick;
+ CHECK_PTR(prev_brick, done);
+ if (prev_brick->power.on_led)
+ continue;
+done:
+ return true;
+ }
+ return false;
+}
+
+bool xio_check_outputs(struct xio_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct xio_output *output = brick->outputs[i];
+
+ if (!output || !output->nr_connected)
+ continue;
+ return true;
+ }
+ return false;
+}
+
+int mars_power_button(struct xio_brick *brick, bool val, bool force_off)
+{
+ int status = 0;
+ bool oldval = brick->power.button;
+
+ if (force_off && !val)
+ brick->power.force_off = true;
+
+ if (brick->power.force_off)
+ val = false;
+
+ if (val != oldval) {
+ /* check whether switching is possible */
+ status = -EINVAL;
+ if (val) { /* check all inputs */
+ if (unlikely(xio_check_inputs(brick))) {
+ XIO_ERR(
+ "CANNOT SWITCH ON: brick '%s' '%s' has a turned-off predecessor\n",
+ brick->brick_name,
+ brick->brick_path);
+ goto done;
+ }
+ } else { /* check all outputs */
+ if (unlikely(xio_check_outputs(brick))) {
+ /* For now, we have a strong rule:
+ * Switching off is only allowed when no successor brick
+ * exists at all. This could be relaxed to checking
+ * whether all successor bricks are actually switched off.
+			 * Probably it is a good idea to retain the stronger rule
+ * as long as nobody needs the relaxed one.
+ */
+ XIO_ERR(
+ "CANNOT SWITCH OFF: brick '%s' '%s' has a successor\n",
+ brick->brick_name,
+ brick->brick_path);
+ goto done;
+ }
+ }
+
+ XIO_DBG(
+ "brick '%s' '%s' type '%s' power button %d -> %d\n",
+ brick->brick_name,
+ brick->brick_path,
+ brick->type->type_name,
+ oldval,
+ val);
+
+ set_button(&brick->power, val, false);
+ }
+
+ if (unlikely(!brick->ops)) {
+ XIO_ERR("brick '%s' '%s' has no brick_switch() method\n", brick->brick_name, brick->brick_path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* Always call the switch function, even if nothing changes.
+ * The implementations must be idempotent.
+ * They may exploit the regular calls for some maintenance operations
+ * (e.g. changing disk capacity etc).
+ */
+ status = brick->ops->brick_switch(brick);
+
+ if (val != oldval)
+ local_trigger();
+
+done:
+ return status;
+}
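+
+/* Usage sketch (the brick pointer is hypothetical):
+ *
+ *	status = mars_power_button(brick, true, false);  // switch on
+ *	status = mars_power_button(brick, false, true);  // force off for good
+ *
+ * Switching on fails with -EINVAL while any predecessor is off;
+ * switching off fails while any output is still connected. Calls
+ * which do not change the state are harmless because brick_switch()
+ * implementations must be idempotent.
+ */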
+
+/*******************************************************************/
+
+/* strategy layer */
+
+struct mars_cookie {
+ struct mars_global *global;
+
+ mars_dent_checker_fn checker;
+ char *path;
+ struct mars_dent *parent;
+ int allocsize;
+ int depth;
+ bool hit;
+};
+
+static
+int get_inode(char *newpath, struct mars_dent *dent)
+{
+ mm_segment_t oldfs;
+ int status;
+ struct kstat tmp = {};
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+
+ status = vfs_lstat(newpath, &tmp);
+ if (status < 0) {
+ XIO_WRN("cannot stat '%s', status = %d\n", newpath, status);
+ goto done;
+ }
+
+ memcpy(&dent->stat_val, &tmp, sizeof(dent->stat_val));
+
+ if (S_ISLNK(dent->stat_val.mode)) {
+ struct path path = {};
+ int len = dent->stat_val.size;
+ struct inode *inode;
+ char *link;
+
+ if (unlikely(len <= 0)) {
+ XIO_ERR("symlink '%s' bad len = %d\n", newpath, len);
+ status = -EINVAL;
+ goto done;
+ }
+ len = _length_paranoia(len, __LINE__);
+
+ status = user_path_at(AT_FDCWD, newpath, 0, &path);
+ if (unlikely(status < 0)) {
+ XIO_WRN("cannot read link '%s'\n", newpath);
+ goto done;
+ }
+
+ inode = path.dentry->d_inode;
+
+ status = -ENOMEM;
+ link = brick_string_alloc(len + 2);
+ status = inode->i_op->readlink(path.dentry, link, len + 1);
+ link[len] = '\0';
+ if (status < 0 ||
+ (dent->link_val && !strcmp(dent->link_val, link))) {
+ brick_string_free(link);
+ } else {
+ brick_string_free(dent->link_val);
+ dent->link_val = link;
+ }
+ path_put(&path);
+ } else if (S_ISREG(dent->stat_val.mode) && dent->d_name && !strncmp(dent->d_name, "log-", 4)) {
+ loff_t min = dent->stat_val.size;
+ loff_t max = 0;
+
+ dent->d_corr_A = 0;
+ dent->d_corr_B = 0;
+ mf_get_any_dirty(newpath, &min, &max, 0, 2);
+ if (min < dent->stat_val.size) {
+ XIO_DBG("file '%s' A size=%lld min=%lld max=%lld\n", newpath, dent->stat_val.size, min, max);
+ dent->d_corr_A = min;
+ }
+ mf_get_any_dirty(newpath, &min, &max, 0, 3);
+ if (min < dent->stat_val.size) {
+ XIO_DBG("file '%s' B size=%lld min=%lld max=%lld\n", newpath, dent->stat_val.size, min, max);
+ dent->d_corr_B = min;
+ }
+ }
+
+done:
+ set_fs(oldfs);
+ return status;
+}
+
+static
+int dent_compare(struct mars_dent *a, struct mars_dent *b)
+{
+ if (a->d_class < b->d_class)
+ return -1;
+ if (a->d_class > b->d_class)
+ return 1;
+ if (a->d_serial < b->d_serial)
+ return -1;
+ if (a->d_serial > b->d_serial)
+ return 1;
+ return strcmp(a->d_path, b->d_path);
+}
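+
+/* Ordering example (hypothetical dents): the list is kept sorted by
+ * class, then serial, then path, so
+ *	(class=2, serial=7, "/mars/x") < (class=2, serial=9, "/mars/x")
+ *	< (class=3, serial=0, "/mars/x").
+ * mars_filler() below relies on this total order both for duplicate
+ * detection and for sorted insertion.
+ */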
+
+struct mars_dir_context {
+ struct dir_context ctx;
+ struct mars_cookie *cookie;
+};
+
+int mars_filler(
+	struct dir_context *__buf, const char *name, int namlen,
+	loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct mars_dir_context *buf = (void *)__buf;
+ struct mars_cookie *cookie = buf->cookie;
+
+ struct mars_global *global = cookie->global;
+ struct list_head *anchor = &global->dent_anchor;
+ struct list_head *start = anchor;
+ struct mars_dent *dent;
+ struct list_head *tmp;
+ char *newpath;
+ int prefix = 0;
+ int pathlen;
+ int class;
+ int serial = 0;
+ bool use_channel = false;
+
+ cookie->hit = true;
+
+ if (name[0] == '.')
+ return 0;
+
+ class = cookie->checker(cookie->parent, name, namlen, d_type, &prefix, &serial, &use_channel);
+ if (class < 0)
+ return 0;
+
+ pathlen = strlen(cookie->path);
+ newpath = brick_string_alloc(pathlen + namlen + 2);
+ memcpy(newpath, cookie->path, pathlen);
+ newpath[pathlen++] = '/';
+ memcpy(newpath + pathlen, name, namlen);
+ pathlen += namlen;
+ newpath[pathlen] = '\0';
+
+ dent = brick_zmem_alloc(cookie->allocsize);
+
+ dent->d_class = class;
+ dent->d_serial = serial;
+ dent->d_path = newpath;
+
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ struct mars_dent *test = container_of(tmp, struct mars_dent, dent_link);
+ int cmp = dent_compare(test, dent);
+
+ if (!cmp) {
+ brick_mem_free(dent);
+ dent = test;
+ goto found;
+ }
+ /* keep the list sorted. find the next smallest member. */
+ if (cmp > 0)
+ break;
+ start = tmp;
+ }
+
+ dent->d_name = brick_string_alloc(namlen + 1);
+ memcpy(dent->d_name, name, namlen);
+ dent->d_name[namlen] = '\0';
+ dent->d_rest = brick_strdup(dent->d_name + prefix);
+
+ newpath = NULL;
+
+ INIT_LIST_HEAD(&dent->dent_link);
+ INIT_LIST_HEAD(&dent->brick_list);
+
+ list_add(&dent->dent_link, start);
+
+found:
+ dent->d_type = d_type;
+ dent->d_class = class;
+ dent->d_serial = serial;
+ if (dent->d_parent)
+ dent->d_parent->d_child_count--;
+ dent->d_parent = cookie->parent;
+ if (dent->d_parent)
+ dent->d_parent->d_child_count++;
+ dent->d_depth = cookie->depth;
+ dent->d_global = global;
+ dent->d_killme = false;
+ dent->d_use_channel = use_channel;
+ brick_string_free(newpath);
+ return 0;
+}
+
+static int _mars_readdir(struct mars_cookie *cookie)
+{
+ struct file *f;
+ struct address_space *mapping;
+
+ mm_segment_t oldfs;
+ int status = 0;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ f = filp_open(cookie->path, O_DIRECTORY | O_RDONLY, 0);
+ set_fs(oldfs);
+ if (unlikely(IS_ERR(f)))
+ return PTR_ERR(f);
+ mapping = f->f_mapping;
+ if (mapping)
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ for (;;) {
+ struct mars_dir_context buf = {
+ .ctx.actor = mars_filler,
+ .cookie = cookie,
+ };
+
+ cookie->hit = false;
+ status = iterate_dir(f, &buf.ctx);
+ if (!cookie->hit)
+ break;
+ if (unlikely(status < 0)) {
+ XIO_ERR("readdir() on path='%s' status=%d\n", cookie->path, status);
+ break;
+ }
+ }
+
+ filp_close(f, NULL);
+ return status;
+}
+
+int mars_dent_work(
+	struct mars_global *global,
+	char *dirname,
+	int allocsize,
+	mars_dent_checker_fn checker,
+	mars_dent_worker_fn worker,
+	void *buf,
+	int maxdepth)
+{
+ static int version;
+
+ struct mars_cookie cookie = {
+ .global = global,
+ .checker = checker,
+ .path = dirname,
+ .parent = NULL,
+ .allocsize = allocsize,
+ .depth = 0,
+ };
+ struct say_channel *say_channel = NULL;
+ struct list_head *tmp;
+ struct list_head *next;
+ int rounds = 0;
+ int status;
+ int total_status = 0;
+ bool found_dir;
+
+ /* Initialize the flat dent list
+ */
+ version++;
+ global->global_version = version;
+ total_status = _mars_readdir(&cookie);
+
+ if (total_status || !worker)
+ goto done;
+
+ down_write(&global->dent_mutex);
+
+restart:
+ found_dir = false;
+
+ /* First, get all the inode information in a separate pass
+ * before starting work.
+ * The separate pass is necessary because some dents may
+	 * forward-reference other dents, whose inode information must
+	 * already be available and up to date.
+ */
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ /* treat any member only once during this invocation */
+ if (dent->d_version == version)
+ continue;
+ dent->d_version = version;
+
+ bind_to_dent(dent, &say_channel);
+
+ status = get_inode(dent->d_path, dent);
+ total_status |= status;
+
+ /* mark gone dents for removal */
+ if (unlikely(status < 0) && list_empty(&dent->brick_list))
+ dent->d_killme = true;
+
+ /* recurse into subdirectories by inserting into the flat list */
+ if (S_ISDIR(dent->stat_val.mode) && dent->d_depth <= maxdepth) {
+ struct mars_cookie sub_cookie = {
+ .global = global,
+ .checker = checker,
+ .path = dent->d_path,
+ .allocsize = allocsize,
+ .parent = dent,
+ .depth = dent->d_depth + 1,
+ };
+ found_dir = true;
+ status = _mars_readdir(&sub_cookie);
+ total_status |= status;
+ if (status < 0)
+ XIO_INF("forward: status %d on '%s'\n", status, dent->d_path);
+ }
+ }
+ bind_to_dent(NULL, &say_channel);
+
+ if (found_dir && ++rounds < 10) {
+ brick_yield();
+ goto restart;
+ }
+
+ up_write(&global->dent_mutex);
+
+ /* Preparation pass.
+ * Here is a chance to mark some dents for removal
+ * (or other types of non-destructive operations)
+ */
+ down_read(&global->dent_mutex);
+	for (tmp = global->dent_anchor.next, next = tmp->next;
+	     tmp != &global->dent_anchor;
+	     tmp = next, next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, true, false);
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ }
+ up_read(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+ /* Remove all dents marked for removal.
+ */
+ down_write(&global->dent_mutex);
+	for (tmp = global->dent_anchor.next, next = tmp->next;
+	     tmp != &global->dent_anchor;
+	     tmp = next, next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!dent->d_killme)
+ continue;
+
+ bind_to_dent(dent, &say_channel);
+
+ XIO_DBG("killing dent '%s'\n", dent->d_path);
+ list_del_init(tmp);
+ xio_free_dent(dent);
+ }
+ up_write(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+ /* Forward pass.
+ */
+ down_read(&global->dent_mutex);
+	for (tmp = global->dent_anchor.next, next = tmp->next;
+	     tmp != &global->dent_anchor;
+	     tmp = next, next = next->next) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, false, false);
+
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ }
+ bind_to_dent(NULL, &say_channel);
+
+ /* Backward pass.
+ */
+	for (tmp = global->dent_anchor.prev, next = tmp->prev;
+	     tmp != &global->dent_anchor;
+	     tmp = next, next = next->prev) {
+ struct mars_dent *dent = container_of(tmp, struct mars_dent, dent_link);
+
+ up_read(&global->dent_mutex);
+
+ brick_yield();
+
+ bind_to_dent(dent, &say_channel);
+
+ status = worker(buf, dent, false, true);
+
+ down_read(&global->dent_mutex);
+ total_status |= status;
+ if (status < 0)
+ XIO_INF("backwards: status %d on '%s'\n", status, dent->d_path);
+ }
+ up_read(&global->dent_mutex);
+
+ bind_to_dent(NULL, &say_channel);
+
+done:
+ return total_status;
+}
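+
+/* Pass structure of mars_dent_work(), as implemented above:
+ *  1. (re)scan the directory tree into the sorted flat dent list,
+ *     repeated while new subdirectories keep appearing (max 10 rounds);
+ *  2. preparation pass: worker(buf, dent, true, false) may mark dents;
+ *  3. removal of all dents marked d_killme;
+ *  4. forward pass: worker(buf, dent, false, false) in list order;
+ *  5. backward pass: worker(buf, dent, false, true) in reverse order.
+ */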
+
+struct mars_dent *_mars_find_dent(struct mars_global *global, const char *path)
+{
+ struct mars_dent *res = NULL;
+ struct list_head *tmp;
+
+ if (!rwsem_is_locked(&global->dent_mutex))
+ XIO_ERR("dent_mutex not held!\n");
+
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *tmp_dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!strcmp(tmp_dent->d_path, path)) {
+ res = tmp_dent;
+ break;
+ }
+ }
+
+ return res;
+}
+
+struct mars_dent *mars_find_dent(struct mars_global *global, const char *path)
+{
+ struct mars_dent *res;
+
+ if (!global)
+ return NULL;
+ down_read(&global->dent_mutex);
+ res = _mars_find_dent(global, path);
+ up_read(&global->dent_mutex);
+ return res;
+}
+
+int mars_find_dent_all(struct mars_global *global, char *prefix, struct mars_dent ***table)
+{
+	int max = 1024; /* provisional */
+ int count = 0;
+ struct list_head *tmp;
+ struct mars_dent **res;
+ int prefix_len = strlen(prefix);
+
+ if (unlikely(!global))
+ goto done;
+
+ res = brick_zmem_alloc(max * sizeof(void *));
+ *table = res;
+
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *tmp_dent = container_of(tmp, struct mars_dent, dent_link);
+ int this_len;
+
+ if (!tmp_dent->d_path)
+ continue;
+ this_len = strlen(tmp_dent->d_path);
+ if (this_len < prefix_len || strncmp(tmp_dent->d_path, prefix, prefix_len))
+ continue;
+ res[count++] = tmp_dent;
+ if (count >= max)
+ break;
+ }
+ up_read(&global->dent_mutex);
+
+done:
+ return count;
+}
+
+void xio_kill_dent(struct mars_dent *dent)
+{
+ dent->d_killme = true;
+ xio_kill_brick_all(NULL, &dent->brick_list, true);
+}
+
+void xio_free_dent(struct mars_dent *dent)
+{
+ int i;
+
+ xio_kill_dent(dent);
+
+ CHECK_HEAD_EMPTY(&dent->dent_link);
+ CHECK_HEAD_EMPTY(&dent->brick_list);
+
+ for (i = 0; i < MARS_ARGV_MAX; i++)
+ brick_string_free(dent->d_argv[i]);
+ brick_string_free(dent->d_args);
+ brick_string_free(dent->d_name);
+ brick_string_free(dent->d_rest);
+ brick_string_free(dent->d_path);
+ brick_string_free(dent->link_val);
+ if (likely(dent->d_parent))
+ dent->d_parent->d_child_count--;
+ if (dent->d_private_destruct)
+ dent->d_private_destruct(dent->d_private);
+ brick_mem_free(dent->d_private);
+ brick_mem_free(dent);
+}
+
+void xio_free_dent_all(struct mars_global *global, struct list_head *anchor)
+{
+ LIST_HEAD(tmp_list);
+
+ if (global)
+ down_write(&global->dent_mutex);
+ list_replace_init(anchor, &tmp_list);
+ if (global)
+ up_write(&global->dent_mutex);
+ XIO_DBG("is_empty=%d\n", list_empty(&tmp_list));
+ while (!list_empty(&tmp_list)) {
+ struct mars_dent *dent;
+
+ dent = container_of(tmp_list.prev, struct mars_dent, dent_link);
+ list_del_init(&dent->dent_link);
+ xio_free_dent(dent);
+ }
+}
+
+/*******************************************************************/
+
+/* low-level brick instantiation */
+
+struct xio_brick *mars_find_brick(struct mars_global *global, const void *brick_type, const char *path)
+{
+ struct list_head *tmp;
+
+ if (!global || !path)
+ return NULL;
+
+ down_read(&global->brick_mutex);
+
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test = container_of(tmp, struct xio_brick, global_brick_link);
+
+ if (!strcmp(test->brick_path, path)) {
+ up_read(&global->brick_mutex);
+ if (brick_type && test->type != brick_type) {
+ XIO_ERR("bad brick type\n");
+ return NULL;
+ }
+ return test;
+ }
+ }
+
+ up_read(&global->brick_mutex);
+
+ return NULL;
+}
+
+int xio_free_brick(struct xio_brick *brick)
+{
+ struct mars_global *global;
+ int i;
+ int count;
+ int status;
+ int sleeptime;
+ int maxsleep;
+
+ if (!brick) {
+ XIO_ERR("bad brick parameter\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ if (!brick->power.force_off || !brick->power.off_led) {
+ XIO_WRN("brick '%s' is not freeable\n", brick->brick_path);
+ status = -ETXTBSY;
+ goto done;
+ }
+
+ /* first check whether the brick is in use somewhere */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct xio_output *output = brick->outputs[i];
+
+ if (output && output->nr_connected > 0) {
+ XIO_WRN("brick '%s' not freeable, output %i is used\n", brick->brick_path, i);
+ status = -EEXIST;
+ goto done;
+ }
+ }
+
+ /* Should not happen, but workaround: wait until flying IO has vanished */
+ maxsleep = 20000;
+ sleeptime = 1000;
+ for (;;) {
+ count = atomic_read(&brick->aio_object_layout.alloc_count);
+ if (likely(!count))
+ break;
+ if (maxsleep > 0) {
+			XIO_WRN(
+				"MEMLEAK: brick '%s' has %d aios allocated (total = %d, maxsleep = %d)\n",
+				brick->brick_path, count,
+				atomic_read(&brick->aio_object_layout.total_alloc_count),
+				maxsleep);
+		} else {
+			XIO_ERR(
+				"MEMLEAK: brick '%s' has %d aios allocated (total = %d)\n",
+				brick->brick_path, count,
+				atomic_read(&brick->aio_object_layout.total_alloc_count));
+ break;
+ }
+ brick_msleep(sleeptime);
+ maxsleep -= sleeptime;
+ }
+
+ XIO_DBG("===> freeing brick name = '%s' path = '%s'\n", brick->brick_name, brick->brick_path);
+
+ global = brick->private_ptr;
+ if (global) {
+ down_write(&global->brick_mutex);
+ list_del_init(&brick->global_brick_link);
+ list_del_init(&brick->dent_brick_link);
+ up_write(&global->brick_mutex);
+ }
+
+ for (i = 0; i < brick->type->max_inputs; i++) {
+ struct xio_input *input = brick->inputs[i];
+
+ if (input) {
+ XIO_DBG("disconnecting input %i\n", i);
+ generic_disconnect((void *)input);
+ }
+ }
+
+ XIO_DBG("deallocate name = '%s' path = '%s'\n", brick->brick_name, brick->brick_path);
+ brick_string_free(brick->brick_name);
+ brick_string_free(brick->brick_path);
+
+ status = generic_brick_exit_full((void *)brick);
+
+ if (status >= 0) {
+ brick_mem_free(brick);
+ local_trigger();
+ } else {
+ XIO_ERR("error freeing brick, status = %d\n", status);
+ }
+
+done:
+ return status;
+}
+
+struct xio_brick *xio_make_brick(
+	struct mars_global *global, struct mars_dent *belongs,
+	const void *_brick_type, const char *path, const char *name)
+{
+ const struct generic_brick_type *brick_type = _brick_type;
+ const struct generic_input_type **input_types;
+ const struct generic_output_type **output_types;
+ struct xio_brick *res;
+ int size;
+ int i;
+ int status;
+
+ size = brick_type->brick_size +
+ (brick_type->max_inputs + brick_type->max_outputs) * sizeof(void *);
+ input_types = brick_type->default_input_types;
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ const struct generic_input_type *type = *input_types++;
+
+ if (unlikely(!type)) {
+ XIO_ERR("input_type %d is missing\n", i);
+ goto err_name;
+ }
+ if (unlikely(type->input_size <= 0)) {
+ XIO_ERR("bad input_size at %d\n", i);
+ goto err_name;
+ }
+ size += type->input_size;
+ }
+ output_types = brick_type->default_output_types;
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ const struct generic_output_type *type = *output_types++;
+
+ if (unlikely(!type)) {
+ XIO_ERR("output_type %d is missing\n", i);
+ goto err_name;
+ }
+ if (unlikely(type->output_size <= 0)) {
+ XIO_ERR("bad output_size at %d\n", i);
+ goto err_name;
+ }
+ size += type->output_size;
+ }
+
+ res = brick_zmem_alloc(size);
+ res->private_ptr = global;
+ INIT_LIST_HEAD(&res->dent_brick_link);
+ res->brick_name = brick_strdup(name);
+ res->brick_path = brick_strdup(path);
+
+ status = generic_brick_init_full(res, size, brick_type, NULL, NULL);
+ XIO_DBG("brick '%s' init '%s' '%s' (status=%d)\n", brick_type->type_name, path, name, status);
+ if (status < 0) {
+ XIO_ERR("cannot init brick %s\n", brick_type->type_name);
+ goto err_path;
+ }
+ res->free = xio_free_brick;
+
+ /* Immediately make it visible, regardless of internal state.
+	 * Switching it on etc. must be done separately.
+ */
+ down_write(&global->brick_mutex);
+ list_add(&res->global_brick_link, &global->brick_anchor);
+ if (belongs)
+ list_add_tail(&res->dent_brick_link, &belongs->brick_list);
+ up_write(&global->brick_mutex);
+
+ return res;
+
+err_path:
+ brick_string_free(res->brick_name);
+ brick_string_free(res->brick_path);
+ brick_mem_free(res);
+err_name:
+ return NULL;
+}
+
+int xio_kill_brick(struct xio_brick *brick)
+{
+ struct mars_global *global;
+ int status = -EINVAL;
+
+ CHECK_PTR(brick, done);
+ global = brick->private_ptr;
+
+ XIO_DBG(
+ "===> killing brick %s path = '%s' name = '%s'\n",
+ brick->type ? brick->type->type_name : "undef",
+ brick->brick_path,
+ brick->brick_name);
+
+ if (unlikely(brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected)) {
+ XIO_ERR("sorry, output is in use '%s'\n", brick->brick_path);
+ goto done;
+ }
+
+ if (global) {
+ down_write(&global->brick_mutex);
+ list_del_init(&brick->global_brick_link);
+ list_del_init(&brick->dent_brick_link);
+ up_write(&global->brick_mutex);
+ }
+
+ if (brick->show_status)
+ brick->show_status(brick, true);
+
+ /* start shutdown */
+ set_button_wait((void *)brick, false, true, 0);
+
+ if (likely(brick->power.off_led)) {
+ int max_inputs = 0;
+ int i;
+
+ if (likely(brick->type))
+ max_inputs = brick->type->max_inputs;
+ else
+ XIO_ERR("uninitialized brick '%s' '%s'\n", brick->brick_name, brick->brick_path);
+ XIO_DBG("---> freeing '%s' '%s'\n", brick->brick_name, brick->brick_path);
+
+ if (brick->kill_ptr)
+ *brick->kill_ptr = NULL;
+
+ for (i = 0; i < max_inputs; i++) {
+ struct generic_input *input = (void *)brick->inputs[i];
+
+ if (!input)
+ continue;
+ status = generic_disconnect(input);
+ if (unlikely(status < 0)) {
+ XIO_ERR(
+ "brick '%s' '%s' disconnect %d failed, status = %d\n",
+ brick->brick_name,
+ brick->brick_path,
+ i,
+ status);
+ goto done;
+ }
+ }
+ if (likely(brick->free)) {
+ status = brick->free(brick);
+ if (unlikely(status < 0)) {
+ XIO_ERR(
+ "freeing '%s' '%s' failed, status = %d\n",
+ brick->brick_name,
+ brick->brick_path,
+ status);
+ goto done;
+ }
+ } else {
+ XIO_ERR("brick '%s' '%s' has no destructor\n", brick->brick_name, brick->brick_path);
+ }
+ status = 0;
+ } else {
+ XIO_ERR("brick '%s' '%s' is not off\n", brick->brick_name, brick->brick_path);
+ status = -EUCLEAN;
+ }
+
+done:
+ return status;
+}
+
+int xio_kill_brick_all(struct mars_global *global, struct list_head *anchor, bool use_dent_link)
+{
+ int status = 0;
+
+ if (!anchor || !anchor->next)
+ goto done;
+ if (global)
+ down_write(&global->brick_mutex);
+ while (!list_empty(anchor)) {
+ struct list_head *tmp = anchor->next;
+ struct xio_brick *brick;
+
+ if (use_dent_link)
+ brick = container_of(tmp, struct xio_brick, dent_brick_link);
+ else
+ brick = container_of(tmp, struct xio_brick, global_brick_link);
+ list_del_init(tmp);
+ if (global)
+ up_write(&global->brick_mutex);
+ status |= xio_kill_brick(brick);
+ if (global)
+ down_write(&global->brick_mutex);
+ }
+ if (global)
+ up_write(&global->brick_mutex);
+done:
+ return status;
+}
+
+int xio_kill_brick_when_possible(
+	struct mars_global *global,
+	struct list_head *anchor,
+	bool use_dent_link,
+	const struct xio_brick_type *type,
+	bool even_on)
+{
+ int return_status = 0;
+ struct list_head *tmp;
+
+restart:
+ if (global)
+ down_write(&global->brick_mutex);
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ struct xio_brick *brick;
+ int count;
+ int status;
+
+ if (use_dent_link)
+ brick = container_of(tmp, struct xio_brick, dent_brick_link);
+ else
+ brick = container_of(tmp, struct xio_brick, global_brick_link);
+ /* only kill the right brick types */
+ if (type && brick->type != type)
+ continue;
+ /* only kill marked bricks */
+ if (!brick->killme)
+ continue;
+ /* only kill unconnected bricks */
+ if (brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected > 0)
+ continue;
+ if (!even_on && (brick->power.button || !brick->power.off_led))
+ continue;
+ /* only kill bricks which have no resources allocated */
+ count = atomic_read(&brick->aio_object_layout.alloc_count);
+ if (count > 0)
+ continue;
+		/* Workaround FIXME:
+		 * Only kill bricks which have not been touched during the
+		 * current mars_dent_work() round. Some bricks like aio seem
+		 * to have races between startup and termination of threads.
+		 * Disable this for stress-testing the allocation/deallocation
+		 * logic. OTOH, frequent useless starts/stops are not a good
+		 * idea either.
+		 * CHECK: how to avoid too frequent switching by other means?
+		 */
+ if (brick->kill_round++ < 1)
+ continue;
+
+ list_del_init(tmp);
+ if (global)
+ up_write(&global->brick_mutex);
+
+ XIO_DBG("KILLING '%s'\n", brick->brick_name);
+ status = xio_kill_brick(brick);
+
+ if (status >= 0)
+ return_status++;
+ else
+ return status;
+		/* The list may have changed in unpredictable ways
+		 * while the lock was dropped.
+ */
+ goto restart;
+ }
+ if (global)
+ up_write(&global->brick_mutex);
+ return return_status;
+}
+
+/*******************************************************************/
+
+/* mid-level brick instantiation (identity is based on path strings) */
+
+char *_vpath_make(int line, const char *fmt, va_list *args)
+{
+ va_list copy_args;
+ char dummy[2];
+ int len;
+ char *res;
+
+ memcpy(&copy_args, args, sizeof(copy_args));
+ len = vsnprintf(dummy, sizeof(dummy), fmt, copy_args);
+ len = _length_paranoia(len, line);
+ res = _brick_string_alloc(len + 2, line);
+
+ vsnprintf(res, len + 1, fmt, *args);
+
+ return res;
+}
+
+char *_path_make(int line, const char *fmt, ...)
+{
+ va_list args;
+ char *res;
+
+ va_start(args, fmt);
+ res = _vpath_make(line, fmt, &args);
+ va_end(args);
+ return res;
+}
+
+char *_backskip_replace(int line, const char *path, char delim, bool insert, const char *fmt, ...)
+{
+ int path_len = strlen(path);
+ int fmt_len;
+ int total_len;
+ char *res;
+ va_list args;
+ int pos = path_len;
+ int plus;
+ char dummy[2];
+
+ va_start(args, fmt);
+ fmt_len = vsnprintf(dummy, sizeof(dummy), fmt, args);
+ va_end(args);
+ fmt_len = _length_paranoia(fmt_len, line);
+
+ total_len = fmt_len + path_len;
+ total_len = _length_paranoia(total_len, line);
+
+ res = _brick_string_alloc(total_len + 2, line);
+
+ while (pos > 0 && path[pos] != '/')
+ pos--;
+ if (delim != '/') {
+ while (pos < path_len && path[pos] != delim)
+ pos++;
+ }
+ memcpy(res, path, pos);
+
+ va_start(args, fmt);
+ plus = vscnprintf(res + pos, total_len - pos, fmt, args);
+ va_end(args);
+
+ if (insert)
+ strncpy(res + pos + plus, path + pos + 1, total_len - pos - plus);
+ return res;
+}
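+
+/* Example (a sketch with a hypothetical path): with delim == '/' and
+ * insert == true,
+ *	backskip_replace("/mars/resource-a/actual", '/', true, "/.tmp-")
+ * yields "/mars/resource-a/.tmp-actual": the prefix up to the last '/'
+ * is kept, the formatted text is appended, and the old basename is
+ * re-attached behind it. mars_symlink() uses exactly this pattern to
+ * derive its temporary sibling names.
+ */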
+
+struct xio_brick *path_find_brick(struct mars_global *global, const void *brick_type, const char *fmt, ...)
+{
+ va_list args;
+ char *fullpath;
+ struct xio_brick *res;
+
+ va_start(args, fmt);
+ fullpath = vpath_make(fmt, &args);
+ va_end(args);
+
+ if (!fullpath)
+ return NULL;
+ res = mars_find_brick(global, brick_type, fullpath);
+ brick_string_free(fullpath);
+ return res;
+}
+
+const struct generic_brick_type *_client_brick_type;
+const struct generic_brick_type *_bio_brick_type;
+
+const struct generic_brick_type *_sio_brick_type;
+
+struct xio_brick *make_brick_all(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ int (*setup_fn)(struct xio_brick *brick, void *private),
+ void *private,
+ const char *new_name,
+ const struct generic_brick_type *new_brick_type,
+ const struct generic_brick_type *prev_brick_type[],
+/* -1 = off, 0 = leave in current state, +1 = create when necessary, +2 = create + switch on */
+ int switch_override,
+ const char *new_fmt,
+ const char *prev_fmt[],
+ int prev_count,
+ ...
+ )
+{
+ va_list args;
+ const char *new_path;
+ char *_new_path = NULL;
+ struct xio_brick *brick = NULL;
+ char *paths[prev_count];
+ struct xio_brick *prev[prev_count];
+ bool switch_state;
+ int i;
+ int status;
+
+ /* treat variable arguments */
+ va_start(args, prev_count);
+ if (new_fmt) {
+ _new_path = vpath_make(new_fmt, &args);
+ new_path = _new_path;
+ } else {
+ new_path = new_name;
+ }
+ for (i = 0; i < prev_count; i++)
+ paths[i] = vpath_make(prev_fmt[i], &args);
+ va_end(args);
+
+ if (!new_path) {
+ XIO_ERR("could not create new path\n");
+ goto err;
+ }
+
+ /* get old switch state */
+ brick = mars_find_brick(global, NULL, new_path);
+ switch_state = false;
+ if (brick)
+ switch_state = brick->power.button;
+ /* override? */
+ if (switch_override > 1)
+ switch_state = true;
+ else if (switch_override < 0)
+ switch_state = false;
+ /* even higher override */
+ if (global && !global->global_power.button)
+ switch_state = false;
+
+ /* brick already existing? */
+ if (brick) {
+ /* just switch the power state */
+ XIO_DBG("found existing brick '%s'\n", new_path);
+ /* highest general override */
+ if (xio_check_outputs(brick)) {
+ if (!switch_state)
+ XIO_DBG("brick '%s' override 0 -> 1\n", new_path);
+ switch_state = true;
+ }
+ goto do_switch;
+ }
+
+	/* brick does not exist => check whether to create it */
+ if (switch_override < 1) { /* don't create */
+ XIO_DBG("no need for brick '%s'\n", new_path);
+ goto done;
+ }
+ XIO_DBG("make new brick '%s'\n", new_path);
+ if (!new_name)
+ new_name = new_path;
+
+ XIO_DBG(
+ "----> new brick type = '%s' path = '%s' name = '%s'\n", new_brick_type->type_name, new_path, new_name);
+
+ /* get all predecessor bricks */
+ for (i = 0; i < prev_count; i++) {
+ char *path = paths[i];
+
+ if (!path) {
+ XIO_ERR("could not build path %d\n", i);
+ goto err;
+ }
+
+ prev[i] = mars_find_brick(global, NULL, path);
+
+ if (!prev[i]) {
+ XIO_WRN("prev brick '%s' does not exist\n", path);
+ goto err;
+ }
+ XIO_DBG("------> predecessor %d path = '%s'\n", i, path);
+ if (!prev[i]->power.on_led) {
+ switch_state = false;
+ XIO_DBG("predecessor power is not on\n");
+ }
+ }
+
+ /* some generic brick replacements (better performance / network functionality) */
+ brick = NULL;
+ if ((new_brick_type == _bio_brick_type ||
+ new_brick_type == _sio_brick_type) &&
+ _client_brick_type) {
+ char *remote = strchr(new_name, '@');
+
+ if (remote) {
+ remote++;
+ XIO_DBG("substitute by remote brick '%s' on peer '%s'\n", new_name, remote);
+
+ brick = xio_make_brick(global, belongs, _client_brick_type, new_path, new_name);
+ if (brick) {
+ struct client_brick *_brick = (void *)brick;
+
+ _brick->max_flying = 10000;
+ }
+ }
+ }
+ if (!brick &&
+ new_brick_type == _bio_brick_type &&
+ _sio_brick_type) {
+ struct kstat test = {};
+ int status = mars_stat(new_path, &test, false);
+
+ if (SKIP_BIO || status < 0 || !S_ISBLK(test.mode)) {
+ new_brick_type = _sio_brick_type;
+ XIO_DBG("substitute bio by sio\n");
+ }
+ }
+
+ /* create it... */
+ if (!brick)
+ brick = xio_make_brick(global, belongs, new_brick_type, new_path, new_name);
+ if (unlikely(!brick)) {
+ XIO_ERR("creation failed '%s' '%s'\n", new_path, new_name);
+ goto err;
+ }
+ if (unlikely(brick->nr_inputs < prev_count)) {
+ XIO_ERR("'%s' wrong number of arguments: %d < %d\n", new_path, brick->nr_inputs, prev_count);
+ goto err;
+ }
+
+ /* connect the wires */
+ for (i = 0; i < prev_count; i++) {
+ int status;
+
+ status = generic_connect((void *)brick->inputs[i], (void *)prev[i]->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR("'%s' '%s' cannot connect input %d\n", new_path, new_name, i);
+ goto err;
+ }
+ }
+
+do_switch:
+ /* call setup function */
+ if (setup_fn) {
+ int setup_status = setup_fn(brick, private);
+
+ if (setup_status <= 0)
+			switch_state = false;
+ }
+
+ /* switch on/off (may fail silently, but responsibility is at the workers) */
+ status = mars_power_button((void *)brick, switch_state, false);
+ XIO_DBG("switch '%s' to %d status = %d\n", new_path, switch_state, status);
+ goto done;
+
+err:
+ if (brick)
+ xio_kill_brick(brick);
+ brick = NULL;
+done:
+ for (i = 0; i < prev_count; i++) {
+ if (paths[i])
+ brick_string_free(paths[i]);
+ }
+ if (_new_path)
+ brick_string_free(_new_path);
+
+ return brick;
+}
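+
+/* Call sketch (names and paths are hypothetical, not taken from any
+ * worker code):
+ *
+ *	brick = make_brick_all(global, dent, NULL, NULL,
+ *			       "data-a",
+ *			       _bio_brick_type,
+ *			       NULL,
+ *			       1,		// +1 = create when necessary
+ *			       "/mars/%s",	// new_fmt
+ *			       NULL, 0,		// no predecessors
+ *			       "data-a");	// vararg consumed by new_fmt
+ *
+ * Since identity is based on the resulting path string, repeated calls
+ * find the existing brick and merely adjust its power state.
+ */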
+
+/***********************************************************************/
+
+/* statistics */
+
+int global_show_statist;
+
+module_param_named(show_statist, global_show_statist, int, 0);
+
+static
+void _show_one(struct xio_brick *test, int *brick_count)
+{
+ int i;
+
+ if (*brick_count)
+ XIO_DBG("---------\n");
+ XIO_DBG(
+ "BRICK type = %s path = '%s' name = '%s' size_hint=%d aios_alloc = %d aios_apsect_alloc = %d total_aios_alloc = %d total_aios_aspects = %d button = %d off = %d on = %d\n",
+ test->type->type_name,
+ test->brick_path,
+ test->brick_name,
+ test->aio_object_layout.size_hint,
+ atomic_read(&test->aio_object_layout.alloc_count),
+ atomic_read(&test->aio_object_layout.aspect_count),
+ atomic_read(&test->aio_object_layout.total_alloc_count),
+ atomic_read(&test->aio_object_layout.total_aspect_count),
+ test->power.button,
+ test->power.off_led,
+ test->power.on_led);
+ (*brick_count)++;
+ if (test->ops && test->ops->brick_statistics) {
+ char *info = test->ops->brick_statistics(test, 0);
+
+ if (info) {
+ XIO_DBG(" %s", info);
+ brick_string_free(info);
+ }
+ }
+ for (i = 0; i < test->type->max_inputs; i++) {
+ struct xio_input *input = test->inputs[i];
+ struct xio_output *output = input ? input->connect : NULL;
+
+ if (output) {
+ XIO_DBG(
+ " input %d connected with %s path = '%s' name = '%s'\n",
+ i,
+ output->brick->type->type_name,
+ output->brick->brick_path,
+ output->brick->brick_name);
+ } else {
+ XIO_DBG(" input %d not connected\n", i);
+ }
+ }
+ for (i = 0; i < test->type->max_outputs; i++) {
+ struct xio_output *output = test->outputs[i];
+
+ if (output)
+ XIO_DBG(" output %d nr_connected = %d\n", i, output->nr_connected);
+ }
+}
+
+void show_statistics(struct mars_global *global, const char *class)
+{
+ struct list_head *tmp;
+ int dent_count = 0;
+ int brick_count = 0;
+
+ if (!global_show_statist)
+ return; /* silently */
+
+ brick_mem_statistics(false);
+
+ down_read(&global->brick_mutex);
+ XIO_DBG("================================== %s bricks:\n", class);
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ _show_one(test, &brick_count);
+ }
+ up_read(&global->brick_mutex);
+
+ XIO_DBG("================================== %s dents:\n", class);
+ down_read(&global->dent_mutex);
+ for (tmp = global->dent_anchor.next; tmp != &global->dent_anchor; tmp = tmp->next) {
+ struct mars_dent *dent;
+ struct list_head *sub;
+
+ dent = container_of(tmp, struct mars_dent, dent_link);
+ XIO_DBG(
+ "dent %d '%s' '%s' stamp=%ld.%09ld\n",
+ dent->d_class,
+ dent->d_path,
+ dent->link_val,
+ dent->stat_val.mtime.tv_sec,
+ dent->stat_val.mtime.tv_nsec);
+ dent_count++;
+ for (sub = dent->brick_list.next; sub != &dent->brick_list; sub = sub->next) {
+ struct xio_brick *test;
+
+ test = container_of(sub, struct xio_brick, dent_brick_link);
+ XIO_DBG(" owner of brick '%s'\n", test->brick_path);
+ }
+ }
+ up_read(&global->dent_mutex);
+
+ XIO_DBG(
+ "==================== %s STATISTICS: %d dents, %d bricks, %lld KB free\n",
+ class,
+ dent_count,
+ brick_count,
+ global_remaining_space);
+}
+
+/*******************************************************************/
+
+/* power led handling */
+
+void xio_set_power_on_led(struct xio_brick *brick, bool val)
+{
+ bool oldval = brick->power.on_led;
+
+ if (val != oldval) {
+ set_on_led(&brick->power, val);
+ local_trigger();
+ }
+}
+
+void xio_set_power_off_led(struct xio_brick *brick, bool val)
+{
+ bool oldval = brick->power.off_led;
+
+ if (val != oldval) {
+ set_off_led(&brick->power, val);
+ local_trigger();
+ }
+}
+
+/*******************************************************************/
+
+/* init stuff */
+
+int __init init_sy(void)
+{
+ XIO_INF("init_sy()\n");
+
+ _local_trigger = __local_trigger;
+
+ return 0;
+}
+
+void exit_sy(void)
+{
+ _local_trigger = NULL;
+
+ XIO_INF("exit_sy()\n");
+}
--
2.11.0

2016-12-30 22:58:18

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 15/32] mars: add new module lib_mapfree

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/lib_mapfree.c | 382 ++++++++++++++++++++++++++
include/linux/xio/lib_mapfree.h | 84 ++++++
2 files changed, 466 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/lib_mapfree.c
create mode 100644 include/linux/xio/lib_mapfree.h

diff --git a/drivers/staging/mars/xio_bricks/lib_mapfree.c b/drivers/staging/mars/xio_bricks/lib_mapfree.c
new file mode 100644
index 000000000000..fc7c057fc993
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_mapfree.c
@@ -0,0 +1,382 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/xio/lib_mapfree.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/file.h>
+
+/* time to wait between background mapfree operations */
+int mapfree_period_sec = 10;
+
+/* some grace space where no regular cleanup should occur */
+int mapfree_grace_keep_mb = 16;
+
+static
+DECLARE_RWSEM(mapfree_mutex);
+
+static
+LIST_HEAD(mapfree_list);
+
+void mapfree_pages(struct mapfree_info *mf, int grace_keep)
+{
+ struct address_space *mapping;
+ pgoff_t start;
+ pgoff_t end;
+
+ if (unlikely(!mf))
+ goto done;
+ if (unlikely(!mf->mf_filp))
+ goto done;
+
+ mapping = mf->mf_filp->f_mapping;
+ if (unlikely(!mapping))
+ goto done;
+
+ if (grace_keep < 0) { /* force full flush */
+ start = 0;
+ end = -1;
+ } else {
+ unsigned long flags;
+ loff_t tmp;
+ loff_t min;
+
+ spin_lock_irqsave(&mf->mf_lock, flags);
+
+ tmp = mf->mf_min[0];
+ min = tmp;
+ if (likely(mf->mf_min[1] < min))
+ min = mf->mf_min[1];
+ if (tmp) {
+ mf->mf_min[1] = tmp;
+ mf->mf_min[0] = 0;
+ }
+
+ spin_unlock_irqrestore(&mf->mf_lock, flags);
+
+ min -= (loff_t)grace_keep * (1024 * 1024); /* megabytes */
+ end = 0;
+
+ if (min > 0 || mf->mf_last) {
+ start = mf->mf_last / PAGE_SIZE;
+ /* add some grace overlapping */
+ if (likely(start > 0))
+ start--;
+ mf->mf_last = min;
+ end = min / PAGE_SIZE;
+ } else { /* there was no progress for at least 2 rounds */
+ start = 0;
+ if (!grace_keep) /* also flush thoroughly */
+ end = -1;
+ }
+
+ XIO_DBG("file = '%s' start = %lu end = %lu\n", mf->mf_name, start, end);
+ }
+
+ if (end > start || end == -1)
+ invalidate_mapping_pages(mapping, start, end);
+
+done:;
+}
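+
+/* Example of the grace window (hypothetical numbers): with
+ * grace_keep = 16 and a double-buffered write minimum around 80 MiB,
+ * pages from just below the previous flush position up to
+ * (80 - 16) MiB are invalidated, so pages near the current working
+ * position stay cached while older mappings are released.
+ */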
+
+static
+void _mapfree_put(struct mapfree_info *mf)
+{
+ if (atomic_dec_and_test(&mf->mf_count)) {
+ XIO_DBG("closing file '%s' filp = %p\n", mf->mf_name, mf->mf_filp);
+ list_del_init(&mf->mf_head);
+ CHECK_HEAD_EMPTY(&mf->mf_dirty_anchor);
+ if (likely(mf->mf_filp)) {
+ mapfree_pages(mf, -1);
+ filp_close(mf->mf_filp, NULL);
+ }
+ brick_string_free(mf->mf_name);
+ brick_mem_free(mf);
+ }
+}
+
+void mapfree_put(struct mapfree_info *mf)
+{
+ if (likely(mf)) {
+ down_write(&mapfree_mutex);
+ _mapfree_put(mf);
+ up_write(&mapfree_mutex);
+ }
+}
+
+struct mapfree_info *mapfree_get(const char *name, int flags)
+{
+ struct mapfree_info *mf = NULL;
+ struct list_head *tmp;
+
+ if (!(flags & O_DIRECT)) {
+ down_read(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (_mf->mf_flags == flags && !strcmp(_mf->mf_name, name)) {
+ mf = _mf;
+ atomic_inc(&mf->mf_count);
+ break;
+ }
+ }
+ up_read(&mapfree_mutex);
+
+ if (mf)
+ goto done;
+ }
+
+ for (;;) {
+ struct address_space *mapping;
+ struct inode *inode = NULL;
+ int ra = 1;
+ int prot = 0600;
+
+ mm_segment_t oldfs;
+
+ mf = brick_zmem_alloc(sizeof(struct mapfree_info));
+
+ mf->mf_name = brick_strdup(name);
+
+ mf->mf_flags = flags;
+ INIT_LIST_HEAD(&mf->mf_head);
+ INIT_LIST_HEAD(&mf->mf_dirty_anchor);
+ atomic_set(&mf->mf_count, 1);
+ spin_lock_init(&mf->mf_lock);
+ mf->mf_max = -1;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ mf->mf_filp = filp_open(name, flags, prot);
+ set_fs(oldfs);
+
+ XIO_DBG("file '%s' flags = %d prot = %d filp = %p\n", name, flags, prot, mf->mf_filp);
+
+ if (unlikely(!mf->mf_filp || IS_ERR(mf->mf_filp))) {
+ int err = PTR_ERR(mf->mf_filp);
+
+ XIO_ERR("can't open file '%s' status=%d\n", name, err);
+ mf->mf_filp = NULL;
+ _mapfree_put(mf);
+ mf = NULL;
+ break;
+ }
+
+ mapping = mf->mf_filp->f_mapping;
+ if (likely(mapping))
+ inode = mapping->host;
+ if (unlikely(!mapping || !inode)) {
+ XIO_ERR("file '%s' has no mapping\n", name);
+ mf->mf_filp = NULL;
+ _mapfree_put(mf);
+ mf = NULL;
+ break;
+ }
+
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ mf->mf_max = i_size_read(inode);
+
+ if (S_ISBLK(inode->i_mode)) {
+ XIO_INF(
+ "changing blkdev readahead from %lu to %d\n",
+ inode->i_bdev->bd_disk->queue->backing_dev_info.ra_pages,
+ ra);
+ inode->i_bdev->bd_disk->queue->backing_dev_info.ra_pages = ra;
+ }
+
+ if (flags & O_DIRECT) { /* never share them */
+ break;
+ }
+
+ /* maintain global list of all open files */
+ down_write(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (unlikely(_mf->mf_flags == flags && !strcmp(_mf->mf_name, name))) {
+ XIO_WRN("race on creation of '%s' detected\n", name);
+ _mapfree_put(mf);
+ mf = _mf;
+ atomic_inc(&mf->mf_count);
+ goto leave;
+ }
+ }
+ list_add_tail(&mf->mf_head, &mapfree_list);
+leave:
+ up_write(&mapfree_mutex);
+ break;
+ }
+done:
+ return mf;
+}
+
+void mapfree_set(struct mapfree_info *mf, loff_t min, loff_t max)
+{
+ unsigned long flags;
+
+ if (likely(mf)) {
+ spin_lock_irqsave(&mf->mf_lock, flags);
+ if (!mf->mf_min[0] || mf->mf_min[0] > min)
+ mf->mf_min[0] = min;
+ if (max >= 0 && mf->mf_max < max)
+ mf->mf_max = max;
+ spin_unlock_irqrestore(&mf->mf_lock, flags);
+ }
+}
+
+static
+int mapfree_thread(void *data)
+{
+ while (!brick_thread_should_stop()) {
+ struct mapfree_info *mf = NULL;
+ struct list_head *tmp;
+ long long eldest = 0;
+
+ brick_msleep(500);
+
+ if (mapfree_period_sec <= 0)
+ continue;
+
+ down_read(&mapfree_mutex);
+
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (unlikely(!_mf->mf_jiffies)) {
+ _mf->mf_jiffies = jiffies;
+ continue;
+ }
+ if ((long long)jiffies - _mf->mf_jiffies > mapfree_period_sec * HZ &&
+ (!mf || _mf->mf_jiffies < eldest)) {
+ mf = _mf;
+ eldest = _mf->mf_jiffies;
+ }
+ }
+ if (mf)
+ atomic_inc(&mf->mf_count);
+
+ up_read(&mapfree_mutex);
+
+ if (!mf)
+ continue;
+
+ mapfree_pages(mf, mapfree_grace_keep_mb);
+
+ mf->mf_jiffies = jiffies;
+ mapfree_put(mf);
+ }
+ return 0;
+}
+
+/***************** dirty IOs on the fly *****************/
+
+void mf_insert_dirty(struct mapfree_info *mf, struct dirty_info *di)
+{
+ unsigned long flags;
+
+ if (likely(di->dirty_aio && mf)) {
+ spin_lock_irqsave(&mf->mf_lock, flags);
+ list_del(&di->dirty_head);
+ list_add(&di->dirty_head, &mf->mf_dirty_anchor);
+ spin_unlock_irqrestore(&mf->mf_lock, flags);
+ }
+}
+
+void mf_remove_dirty(struct mapfree_info *mf, struct dirty_info *di)
+{
+ unsigned long flags;
+
+ if (!list_empty(&di->dirty_head) && mf) {
+ spin_lock_irqsave(&mf->mf_lock, flags);
+ list_del_init(&di->dirty_head);
+ spin_unlock_irqrestore(&mf->mf_lock, flags);
+ }
+}
+
+void mf_get_dirty(struct mapfree_info *mf, loff_t *min, loff_t *max, int min_stage, int max_stage)
+{
+ unsigned long flags;
+
+ struct list_head *tmp;
+
+ if (unlikely(!mf))
+ goto done;
+
+ spin_lock_irqsave(&mf->mf_lock, flags);
+ for (tmp = mf->mf_dirty_anchor.next; tmp != &mf->mf_dirty_anchor; tmp = tmp->next) {
+ struct dirty_info *di = container_of(tmp, struct dirty_info, dirty_head);
+ struct aio_object *aio = di->dirty_aio;
+
+ if (unlikely(!aio))
+ continue;
+ if (di->dirty_stage < min_stage || di->dirty_stage > max_stage)
+ continue;
+ if (aio->io_pos < *min)
+ *min = aio->io_pos;
+ if (aio->io_pos + aio->io_len > *max)
+ *max = aio->io_pos + aio->io_len;
+ }
+ spin_unlock_irqrestore(&mf->mf_lock, flags);
+done:;
+}
+
+void mf_get_any_dirty(const char *filename, loff_t *min, loff_t *max, int min_stage, int max_stage)
+{
+ struct list_head *tmp;
+
+ down_read(&mapfree_mutex);
+ for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+ struct mapfree_info *mf = container_of(tmp, struct mapfree_info, mf_head);
+
+ if (!strcmp(mf->mf_name, filename))
+ mf_get_dirty(mf, min, max, min_stage, max_stage);
+ }
+ up_read(&mapfree_mutex);
+}
+
+/***************** module init stuff ************************/
+
+static
+struct task_struct *mf_thread;
+
+int __init init_xio_mapfree(void)
+{
+ XIO_DBG("init_mapfree()\n");
+ mf_thread = brick_thread_create(mapfree_thread, NULL, "xio_mapfree");
+ if (unlikely(!mf_thread)) {
+ XIO_ERR("could not create mapfree thread\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void exit_xio_mapfree(void)
+{
+ XIO_DBG("exit_mapfree()\n");
+ if (likely(mf_thread)) {
+ brick_thread_stop(mf_thread);
+ mf_thread = NULL;
+ }
+}
diff --git a/include/linux/xio/lib_mapfree.h b/include/linux/xio/lib_mapfree.h
new file mode 100644
index 000000000000..e7594e125150
--- /dev/null
+++ b/include/linux/xio/lib_mapfree.h
@@ -0,0 +1,84 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_LIB_MAPFREE_H
+#define XIO_LIB_MAPFREE_H
+
+/* Mapfree infrastructure.
+ *
+ * Purposes:
+ *
+ * 1) Open files only once when possible, do ref-counting on struct mapfree_info
+ *
+ * 2) Automatically call invalidate_mapping_pages() in the background on
+ * "unused" areas to free resources.
+ * Used areas can be indicated by calling mapfree_set() frequently.
+ * Usage model: tailored to sequential logfiles.
+ *
+ * 3) Do it all in a completely decoupled manner, in order to prevent resource deadlocks.
+ *
+ * 4) Also to prevent deadlocks: always set mapping_set_gfp_mask() accordingly.
+ */
+
+#include <linux/xio/xio.h>
+
+extern int mapfree_period_sec;
+extern int mapfree_grace_keep_mb;
+
+struct mapfree_info {
+ struct list_head mf_head;
+ struct list_head mf_dirty_anchor;
+ char *mf_name;
+ struct file *mf_filp;
+ int mf_flags;
+ int mf_mode;
+ atomic_t mf_count;
+ spinlock_t mf_lock;
+ loff_t mf_min[2];
+ loff_t mf_last;
+ loff_t mf_max;
+ long long mf_jiffies;
+};
+
+struct dirty_info {
+ struct list_head dirty_head;
+ struct aio_object *dirty_aio;
+ int dirty_stage;
+};
+
+struct mapfree_info *mapfree_get(const char *filename, int flags);
+
+void mapfree_put(struct mapfree_info *mf);
+
+void mapfree_set(struct mapfree_info *mf, loff_t min, loff_t max);
+
+void mapfree_pages(struct mapfree_info *mf, int grace_keep);
+
+/***************** dirty IOs on the fly *****************/
+
+void mf_insert_dirty(struct mapfree_info *mf, struct dirty_info *di);
+void mf_remove_dirty(struct mapfree_info *mf, struct dirty_info *di);
+void mf_get_dirty(struct mapfree_info *mf, loff_t *min, loff_t *max, int min_stage, int max_stage);
+void mf_get_any_dirty(const char *filename, loff_t *min, loff_t *max, int min_stage, int max_stage);
+
+/***************** module init stuff ************************/
+
+int __init init_xio_mapfree(void);
+
+void exit_xio_mapfree(void);
+
+#endif
--
2.11.0
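
The intended call pattern for this API follows the "sequential logfile" usage model from the header: open via mapfree_get(), report the current write position via mapfree_set() after each append, and drop the reference via mapfree_put(); the background thread then invalidates pages sufficiently far behind the reported position. A minimal kernel-side sketch, assuming the headers from this patch are applied (the file name is invented for illustration, error handling is elided):

/* sketch only, not part of the patch */
#include <linux/fcntl.h>
#include <linux/xio/lib_mapfree.h>

static void example_logfile_writer(loff_t pos)
{
	struct mapfree_info *mf;

	/* open (or share) the file; O_DIRECT files are never shared */
	mf = mapfree_get("/mars/example-log", O_RDWR);
	if (!mf)
		return;

	/* ... append records sequentially, advancing pos ... */

	/* report progress: pages well behind pos become candidates
	 * for background invalidate_mapping_pages()
	 */
	mapfree_set(mf, pos, -1);

	mapfree_put(mf);
}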

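The subtlest part of mapfree_pages() is the two-generation minimum tracking: mf_min[0] collects the lowest position reported in the current round, mf_min[1] holds the previous round's value, and only the smaller of the two (minus the grace window) is invalidated, so a page has to stay unreported for two full rounds before it is dropped. A hypothetical standalone model of that window computation (plain userspace C; PAGE_SIZE and the example positions are assumptions):

#include <stdio.h>

#define PAGE_SIZE 4096LL

static long long mf_min[2]; /* [0]: current round, [1]: previous round */
static long long mf_last;   /* end of the last invalidated window */

static void model_mapfree_set(long long min)
{
	if (!mf_min[0] || mf_min[0] > min)
		mf_min[0] = min;
}

static void model_window(int grace_keep_mb, long long *start, long long *end)
{
	long long tmp = mf_min[0];
	long long min = tmp;

	if (mf_min[1] < min)
		min = mf_min[1];
	if (tmp) { /* age generation 0 into generation 1 */
		mf_min[1] = tmp;
		mf_min[0] = 0;
	}

	min -= (long long)grace_keep_mb * 1024 * 1024;
	*end = 0;
	if (min > 0 || mf_last) {
		*start = mf_last / PAGE_SIZE;
		if (*start > 0)
			(*start)--; /* grace overlap */
		mf_last = min;
		*end = min / PAGE_SIZE;
	} else { /* no progress for at least 2 rounds */
		*start = 0;
		if (!grace_keep_mb)
			*end = -1; /* flush thoroughly */
	}
}

int main(void)
{
	long long start, end;

	model_mapfree_set(64LL << 20); /* round 1: writer reports 64 MiB */
	model_window(16, &start, &end);
	printf("round 1: pages [%lld..%lld] (end <= start: nothing freed yet)\n",
	       start, end);

	model_mapfree_set(128LL << 20); /* round 2: writer reports 128 MiB */
	model_window(16, &start, &end);
	printf("round 2: pages [%lld..%lld] (64 MiB - 16 MiB grace = 48 MiB)\n",
	       start, end);
	return 0;
}
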
2016-12-30 22:58:30

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 27/32] mars: add new module server_strategy

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/server_strategy.c | 436 ++++++++++++++++++++++++++++
1 file changed, 436 insertions(+)
create mode 100644 drivers/staging/mars/mars/server_strategy.c

diff --git a/drivers/staging/mars/mars/server_strategy.c b/drivers/staging/mars/mars/server_strategy.c
new file mode 100644
index 000000000000..3b880c10be49
--- /dev/null
+++ b/drivers/staging/mars/mars/server_strategy.c
@@ -0,0 +1,436 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2016 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2016 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* MARS Light specific parts of xio_server
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#define _STRATEGY
+#include <linux/brick/brick.h>
+#include <linux/xio/xio.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_sio.h>
+
+#include "strategy.h"
+
+#include <linux/xio/xio_server.h>
+#include <linux/xio/xio_trans_logger.h>
+
+static
+int dummy_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+ return 0;
+}
+
+static
+int _set_server_sio_params(struct xio_brick *_brick, void *private)
+{
+ struct sio_brick *sio_brick = (void *)_brick;
+
+ if (_brick->type != (void *)_sio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ sio_brick->o_direct = false;
+ sio_brick->o_fdsync = false;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_server_bio_params(struct xio_brick *_brick, void *private)
+{
+ struct bio_brick *bio_brick;
+
+ if (_brick->type == (void *)_sio_brick_type)
+ return _set_server_sio_params(_brick, private);
+ if (_brick->type != (void *)_bio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ bio_brick = (void *)_brick;
+ bio_brick->ra_pages = 0;
+ bio_brick->do_noidle = true;
+ bio_brick->do_sync = true;
+ bio_brick->do_unplug = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+int handler_thread(void *data)
+{
+ struct mars_global handler_global = {
+ .dent_anchor = LIST_HEAD_INIT(handler_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(handler_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(handler_global.main_event),
+ };
+ struct task_struct *thread = NULL;
+ struct server_brick *brick = data;
+ struct xio_socket *sock = &brick->handler_socket;
+ bool ok = xio_get_socket(sock);
+ unsigned long statist_jiffies = jiffies;
+ int debug_nr;
+ int status = -EINVAL;
+
+ init_rwsem(&handler_global.dent_mutex);
+ init_rwsem(&handler_global.brick_mutex);
+
+ XIO_DBG("#%d --------------- handler_thread starting on socket %p\n", sock->s_debug_nr, sock);
+ if (!ok)
+ goto done;
+
+ thread = brick_thread_create(cb_thread, brick, "xio_cb%d", brick->version);
+ if (unlikely(!thread)) {
+ XIO_ERR("cannot create cb thread\n");
+ status = -ENOENT;
+ goto done;
+ }
+ brick->cb_thread = thread;
+
+ brick->handler_running = true;
+ wake_up_interruptible(&brick->startup_event);
+
+ while (!list_empty(&handler_global.brick_anchor) ||
+ xio_socket_is_alive(sock)) {
+ struct xio_cmd cmd = {};
+
+ handler_global.global_version++;
+
+ if (!list_empty(&handler_global.brick_anchor)) {
+ if (server_show_statist && !time_is_before_jiffies(statist_jiffies + 10 * HZ)) {
+ show_statistics(&handler_global, "handler");
+ statist_jiffies = jiffies;
+ }
+ if (!xio_socket_is_alive(sock) &&
+ atomic_read(&brick->in_flight) <= 0 &&
+ brick->conn_brick) {
+ if (generic_disconnect((void *)brick->inputs[0]) >= 0)
+ brick->conn_brick = NULL;
+ }
+
+ status = xio_kill_brick_when_possible(
+ &handler_global, &handler_global.brick_anchor, false, NULL, true);
+ XIO_DBG("kill handler bricks (when possible) = %d\n", status);
+ }
+
+ status = -EINTR;
+ if (unlikely(!mars_global || !mars_global->global_power.button)) {
+ XIO_DBG("system is not alive\n");
+ goto clean;
+ }
+ if (unlikely(brick_thread_should_stop()))
+ goto clean;
+ if (unlikely(!xio_socket_is_alive(sock))) {
+ /* Don't read any data anymore, the protocol
+ * may be screwed up completely.
+ */
+ XIO_DBG("#%d is dead\n", sock->s_debug_nr);
+ goto clean;
+ }
+
+ status = xio_recv_struct(sock, &cmd, xio_cmd_meta);
+ if (unlikely(status < 0)) {
+ XIO_WRN("#%d recv cmd status = %d\n", sock->s_debug_nr, status);
+ goto clean;
+ }
+
+ if (unlikely(!brick->private_ptr || !mars_global || !mars_global->global_power.button)) {
+ XIO_WRN("#%d system is not alive\n", sock->s_debug_nr);
+ status = -EINTR;
+ goto clean;
+ }
+
+ status = -EPROTO;
+ switch (cmd.cmd_code & CMD_FLAG_MASK) {
+ case CMD_NOP:
+ status = 0;
+ XIO_DBG("#%d got NOP operation\n", sock->s_debug_nr);
+ break;
+ case CMD_NOTIFY:
+ status = 0;
+ from_remote_trigger();
+ break;
+ case CMD_GETINFO:
+ {
+ struct xio_info info = {};
+
+ status = GENERIC_INPUT_CALL(brick->inputs[0], xio_get_info, &info);
+ if (status < 0)
+ break;
+ down(&brick->socket_sem);
+ status = xio_send_struct(sock, &cmd, xio_cmd_meta);
+ if (status >= 0)
+ status = xio_send_struct(sock, &info, xio_info_meta);
+ up(&brick->socket_sem);
+ break;
+ }
+ case CMD_GETENTS:
+ {
+ status = -EINVAL;
+ if (unlikely(!cmd.cmd_str1))
+ break;
+
+ status = mars_dent_work(
+ &handler_global, "/mars", sizeof(struct mars_dent),
+ main_checker, dummy_worker, &handler_global, 3);
+
+ down(&brick->socket_sem);
+ status = xio_send_dent_list(sock, &handler_global.dent_anchor);
+ up(&brick->socket_sem);
+
+ if (status < 0) {
+ XIO_WRN(
+ "#%d could not send dentry information, status = %d\n", sock->s_debug_nr, status);
+ }
+
+ xio_free_dent_all(&handler_global, &handler_global.dent_anchor);
+ break;
+ }
+ case CMD_CONNECT:
+ {
+ struct xio_brick *prev;
+ const char *path = cmd.cmd_str1;
+
+ status = -EINVAL;
+ CHECK_PTR(path, err);
+ CHECK_PTR_NULL(_bio_brick_type, err);
+
+ prev = make_brick_all(
+ &handler_global,
+ NULL,
+ _set_server_bio_params,
+ NULL,
+ path,
+ (const struct generic_brick_type *)_bio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* start always */
+ path,
+ (const char *[]){},
+ 0);
+ if (likely(prev)) {
+ status = generic_connect((void *)brick->inputs[0], (void *)prev->outputs[0]);
+ if (unlikely(status < 0))
+ XIO_ERR("#%d cannot connect to '%s'\n", sock->s_debug_nr, path);
+ prev->killme = true;
+ brick->conn_brick = prev;
+ } else {
+ XIO_ERR("#%d cannot find brick '%s'\n", sock->s_debug_nr, path);
+ }
+
+err:
+ cmd.cmd_int1 = status;
+ down(&brick->socket_sem);
+ status = xio_send_struct(sock, &cmd, xio_cmd_meta);
+ up(&brick->socket_sem);
+ break;
+ }
+ case CMD_AIO:
+ {
+ status = server_io(brick, sock, &cmd);
+ break;
+ }
+ case CMD_CB:
+ XIO_ERR(
+ "#%d oops, as a server I should never get CMD_CB; something is wrong here - attack attempt??\n",
+ sock->s_debug_nr);
+ break;
+ case CMD_CONNECT_LOGGER:
+ {
+ struct sockaddr peer_addr = {};
+ int peer_addr_len = sizeof(peer_addr);
+ struct mars_global *global = mars_global;
+ struct trans_logger_brick *prev;
+ const char *path = cmd.cmd_str1;
+
+ prev = (void *)mars_find_brick(
+ global, (const struct generic_brick_type *)&trans_logger_brick_type, path);
+ status = -ENOENT;
+ if (!prev) {
+ XIO_WRN("not found '%s'\n", path);
+ break;
+ }
+ if (prev->killme) {
+ XIO_WRN("dead '%s'\n", path);
+ break;
+ }
+ status = kernel_getpeername(sock->s_socket, &peer_addr, &peer_addr_len);
+ ((struct sockaddr_in *)&peer_addr)->sin_port = 0;
+ if (prev->outputs[0]->nr_connected &&
+ (status < 0 || memcmp(&prev->peer_addr, &peer_addr, peer_addr_len))) {
+ XIO_WRN("invalid additional connect to '%s' from a different address\n", path);
+ status = -EBUSY;
+ break;
+ }
+ memset(&prev->peer_addr, 0, sizeof(prev->peer_addr));
+ if (status >= 0)
+ memcpy(&prev->peer_addr, &peer_addr, peer_addr_len);
+
+ status = generic_connect((void *)brick->inputs[0], (void *)prev->outputs[0]);
+ if (unlikely(status < 0))
+ XIO_ERR("#%d cannot connect to '%s'\n", sock->s_debug_nr, path);
+ brick->conn_brick = (void *)prev;
+ break;
+ }
+ default:
+ XIO_ERR("#%d unknown command %d\n", sock->s_debug_nr, cmd.cmd_code);
+ }
+clean:
+ brick_string_free(cmd.cmd_str1);
+ if (unlikely(status < 0)) {
+ xio_shutdown_socket(sock);
+ brick_msleep(1000);
+ }
+ }
+
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+
+done:
+ XIO_DBG("#%d handler_thread terminating, status = %d\n", sock->s_debug_nr, status);
+
+ xio_kill_brick_all(&handler_global, &handler_global.brick_anchor, false);
+
+ if (thread) {
+ brick->cb_thread = NULL;
+ brick->cb_running = false;
+ XIO_DBG("#%d stopping callback thread....\n", sock->s_debug_nr);
+ brick_thread_stop(thread);
+ }
+
+ debug_nr = sock->s_debug_nr;
+
+ XIO_DBG("#%d done.\n", debug_nr);
+ brick->killme = true;
+ return status;
+}
+
+int server_thread(void *data)
+{
+ struct mars_global server_global = {
+ .dent_anchor = LIST_HEAD_INIT(server_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(server_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(server_global.main_event),
+ };
+ struct xio_socket *my_socket = data;
+ char *id = my_id();
+ int status = 0;
+
+ init_rwsem(&server_global.dent_mutex);
+ init_rwsem(&server_global.brick_mutex);
+
+ XIO_INF("-------- server starting on host '%s' ----------\n", id);
+
+ while (!brick_thread_should_stop() &&
+ (!mars_global || !mars_global->global_power.button)) {
+ XIO_DBG("system did not start up\n");
+ brick_msleep(5000);
+ }
+
+ XIO_INF("-------- server now working on host '%s' ----------\n", id);
+
+ while (!brick_thread_should_stop() || !list_empty(&server_global.brick_anchor)) {
+ struct server_brick *brick = NULL;
+ struct xio_socket handler_socket = {};
+
+ server_global.global_version++;
+
+ if (server_show_statist)
+ show_statistics(&server_global, "server");
+
+ status = xio_kill_brick_when_possible(&server_global, &server_global.brick_anchor, false, NULL, true);
+ XIO_DBG("kill server bricks (when possible) = %d\n", status);
+
+ if (!mars_global || !mars_global->global_power.button) {
+ brick_msleep(1000);
+ continue;
+ }
+
+ status = xio_accept_socket(&handler_socket, my_socket, &device_tcp_params);
+ if (unlikely(status < 0 || !xio_socket_is_alive(&handler_socket))) {
+ brick_msleep(500);
+ if (status == -EAGAIN)
+ continue; /* without error message */
+ XIO_WRN("accept status = %d\n", status);
+ brick_msleep(1000);
+ continue;
+ }
+ handler_socket.s_shutdown_on_err = true;
+
+ XIO_DBG("got new connection #%d\n", handler_socket.s_debug_nr);
+
+ brick = (void *)xio_make_brick(&server_global, NULL, &server_brick_type, "handler", "handler");
+ if (!brick) {
+ XIO_ERR("cannot create server instance\n");
+ xio_shutdown_socket(&handler_socket);
+ xio_put_socket(&handler_socket);
+ brick_msleep(2000);
+ continue;
+ }
+ memcpy(&brick->handler_socket, &handler_socket, sizeof(struct xio_socket));
+
+ /* TODO: check authorization.
+ */
+
+ brick->power.button = true;
+ status = server_switch(brick);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot switch on server brick, status = %d\n", status);
+ goto err;
+ }
+
+ /* further references are usually held by the threads */
+ xio_put_socket(&brick->handler_socket);
+
+ /* fire and forget....
+ * the new instance is now responsible for itself.
+ */
+ brick = NULL;
+ brick_msleep(100);
+ continue;
+
+err:
+ if (brick) {
+ xio_shutdown_socket(&brick->handler_socket);
+ xio_put_socket(&brick->handler_socket);
+ status = xio_kill_brick((void *)brick);
+ if (status < 0)
+ BRICK_ERR("kill status = %d, giving up\n", status);
+ brick = NULL;
+ }
+ brick_msleep(2000);
+ }
+
+ XIO_INF("-------- cleaning up ----------\n");
+
+ xio_kill_brick_all(&server_global, &server_global.brick_anchor, false);
+
+ /* cleanup_mm(); */
+
+ XIO_INF("-------- done status = %d ----------\n", status);
+ return status;
+}
--
2.11.0
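
The CMD_CONNECT_LOGGER branch enforces a single-writer rule on the trans_logger output: once a connection exists, an additional connect is only accepted from the same peer address, compared with the TCP port zeroed out so that a reconnect from a new source port still matches. A hypothetical userspace model of just that comparison (the addresses are invented):

/* model: compare peers with the port masked out */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int same_peer(struct sockaddr_in a, struct sockaddr_in b)
{
	a.sin_port = 0; /* the handler zeroes sin_port before comparing */
	b.sin_port = 0;
	return memcmp(&a, &b, sizeof(a)) == 0;
}

int main(void)
{
	struct sockaddr_in first = { .sin_family = AF_INET, .sin_port = htons(7777) };
	struct sockaddr_in again = { .sin_family = AF_INET, .sin_port = htons(7778) };
	struct sockaddr_in other = { .sin_family = AF_INET, .sin_port = htons(7777) };

	inet_pton(AF_INET, "10.0.0.1", &first.sin_addr);
	inet_pton(AF_INET, "10.0.0.1", &again.sin_addr);
	inet_pton(AF_INET, "10.0.0.2", &other.sin_addr);

	printf("reconnect from same host: %s\n",
	       same_peer(first, again) ? "accepted" : "rejected (-EBUSY)");
	printf("connect from other host:  %s\n",
	       same_peer(first, other) ? "accepted" : "rejected (-EBUSY)");
	return 0;
}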

2016-12-30 22:58:45

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 30/32] mars: add new module Makefile

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/Makefile | 96 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)
create mode 100644 drivers/staging/mars/Makefile

diff --git a/drivers/staging/mars/Makefile b/drivers/staging/mars/Makefile
new file mode 100644
index 000000000000..5e94c3c692c2
--- /dev/null
+++ b/drivers/staging/mars/Makefile
@@ -0,0 +1,96 @@
+#
+# Makefile for MARS
+#
+
+# remove_this
+#
+# TST: this was required by some sysadmins some years ago for
+# very 1&1-specific OOT Debian build methods.
+# Not tested in other environments. Might need some tweaks, or could
+# be removed in the long term.
+#
+ifndef CONFIG_MARS
+# mars_config.h is generated by a simple Kconfig parser (gen_config.pl)
+# at build time.
+# It does not respect any Kconfig dependencies.
+# Therefore, it is unsafe. Use at your own risk!
+# It is ONLY used for out-of-tree builds.
+#
+CONFIG_MARS_BIGMODULE := m
+CONFIG_MARS_NET_COMPAT := y
+obj-$(CONFIG_MARS_BIGMODULE) += mars.o
+extra-y += mars_config.h
+GEN_CONFIG_SCRIPT := $(src)/../scripts/gen_config.pl
+$(obj)/mars_config.h: $(obj)/buildtag.h
+$(obj)/mars_config.h: $(src)/Kconfig $(GEN_CONFIG_SCRIPT)
+ $(Q)$(kecho) "MARS: using compiler $($(CC) --version | head -1)"
+ $(CC) -v
+ $(Q)$(kecho) "MARS: Generating $@"
+ $(Q)set -e; \
+ if [ ! -x $(GEN_CONFIG_SCRIPT) ]; then \
+ $(kecho) "MARS: cannot execute script $(GEN_CONFIG_SCRIPT)"; \
+ /bin/false; \
+ fi; \
+ cat $< | $(GEN_CONFIG_SCRIPT) > $@;
+ cat $@;
+endif
+# end_remove_this
+
+obj-$(CONFIG_MARS) += mars.o
+
+KBUILD_CFLAGS += -fdelete-null-pointer-checks
+
+# remove_this
+# The following is 1&1 specific. Don't use anywhere else.
+ifneq ($(KBUILD_EXTMOD),)
+ CONFIG_MARS := m
+# mars_config.h is generated by a simple Kconfig parser (gen_config.pl)
+# at build time.
+# It does not respect any Kconfig dependencies.
+# Therefore, it is unsafe. Use at your own risk!
+# It is ONLY used for out-of-tree builds.
+#
+extra-y += mars_config.h
+GEN_CONFIG_SCRIPT := $(src)/../scripts/gen_config.pl
+$(obj)/mars_config.h: $(obj)/buildtag.h
+$(obj)/mars_config.h: $(src)/Kconfig $(GEN_CONFIG_SCRIPT)
+ $(Q)$(kecho) "MARS: using compiler $($(CC) --version | head -1)"
+ $(CC) -v
+ $(Q)$(kecho) "MARS: Generating $@"
+ $(Q)set -e; \
+ if [ ! -x $(GEN_CONFIG_SCRIPT) ]; then \
+ $(kecho) "MARS: cannot execute script $(GEN_CONFIG_SCRIPT)"; \
+ /bin/false; \
+ fi; \
+ cat $< | $(GEN_CONFIG_SCRIPT) > $@;
+ cat $@;
+endif
+# end_remove_this
+
+obj-$(CONFIG_MARS) += mars.o
+
+mars-objs := \
+ lamport.o \
+ brick_say.o \
+ brick_mem.o \
+ brick.o \
+ xio_bricks/xio.o \
+ xio_bricks/lib_log.o \
+ lib/lib_rank.o \
+ lib/lib_limiter.o \
+ lib/lib_timing.o \
+ xio_bricks/lib_mapfree.o \
+ xio_bricks/xio_net.o \
+ mars/server_strategy.o \
+ xio_bricks/xio_server.o \
+ xio_bricks/xio_client.o \
+ xio_bricks/xio_sio.o \
+ xio_bricks/xio_bio.o \
+ xio_bricks/xio_if.o \
+ xio_bricks/xio_copy.o \
+ xio_bricks/xio_trans_logger.o \
+ mars/main_strategy.o \
+ mars/net.o \
+ mars/mars_proc.o \
+ mars/mars_main.o
+
--
2.11.0

2016-12-30 22:58:43

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 21/32] mars: add new module xio_copy

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_copy.c | 1005 ++++++++++++++++++++++++++++
include/linux/xio/xio_copy.h | 115 ++++
2 files changed, 1120 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_copy.c
create mode 100644 include/linux/xio/xio_copy.h

diff --git a/drivers/staging/mars/xio_bricks/xio_copy.c b/drivers/staging/mars/xio_bricks/xio_copy.c
new file mode 100644
index 000000000000..56b60f2f837e
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_copy.c
@@ -0,0 +1,1005 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Copy brick (just for demonstration) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/xio/xio.h>
+#include <linux/brick/lib_limiter.h>
+
+#ifndef READ
+#define READ 0
+#define WRITE 1
+#endif
+
+#define COPY_CHUNK (PAGE_SIZE)
+#define NR_COPY_REQUESTS (32 * 1024 * 1024 / COPY_CHUNK)
+
+#define STATES_PER_PAGE (PAGE_SIZE / sizeof(struct copy_state))
+#define MAX_SUB_TABLES \
+ (NR_COPY_REQUESTS / STATES_PER_PAGE + (NR_COPY_REQUESTS % STATES_PER_PAGE ? 1 : 0))
+#define MAX_COPY_REQUESTS (PAGE_SIZE / sizeof(struct copy_state *) * STATES_PER_PAGE)
+
+#define GET_STATE(brick, index) \
+ ((brick)->st[(index) / STATES_PER_PAGE][(index) % STATES_PER_PAGE])
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_copy.h>
+
+int xio_copy_overlap = 1;
+
+int xio_copy_read_prio = XIO_PRIO_NORMAL;
+
+int xio_copy_write_prio = XIO_PRIO_NORMAL;
+
+int xio_copy_read_max_fly;
+
+int xio_copy_write_max_fly;
+
+#define is_read_limited(brick) \
+ (xio_copy_read_max_fly > 0 && atomic_read(&(brick)->copy_read_flight) >= xio_copy_read_max_fly)
+
+#define is_write_limited(brick) \
+ (xio_copy_write_max_fly > 0 && atomic_read(&(brick)->copy_write_flight) >= xio_copy_write_max_fly)
+
+/************************ own helper functions ***********************/
+
+/* TODO:
+ * The clash logic is untested / alpha stage (Feb. 2011).
+ *
+ * For now, the output is never used, so this cannot do harm.
+ *
+ * In order to get the output really working / enterprise grade,
+ * some larger test effort should be invested.
+ */
+static inline
+void _clash(struct copy_brick *brick)
+{
+ brick->trigger = true;
+ set_bit(0, &brick->clash);
+ atomic_inc(&brick->total_clash_count);
+ wake_up_interruptible(&brick->event);
+}
+
+static inline
+int _clear_clash(struct copy_brick *brick)
+{
+ int old;
+
+ old = test_and_clear_bit(0, &brick->clash);
+ return old;
+}
+
+/* Current semantics:
+ *
+ * All writes are always going to the original input A. They are _not_
+ * replicated to B.
+ *
+ * In order to get B really uptodate, you have to replay the right
+ * transaction logs there (at the right time).
+ * [If you had no writes on A at all during the copy, of course
+ * this is not necessary]
+ *
+ * When utilize_mode is on, reads can utilize the already copied
+ * region from B, but only as long as this region has not been
+ * invalidated by writes (indicated by low_dirty).
+ *
+ * TODO: implement replicated writes, together with some transaction
+ * replay logic applying the transaction logs _only_ after
+ * crashes during inconsistency caused by partial replication of writes.
+ */
+static
+int _determine_input(struct copy_brick *brick, struct aio_object *aio)
+{
+ int rw;
+ int below;
+ int behind;
+ loff_t io_end;
+
+ if (!brick->utilize_mode || brick->low_dirty)
+ return INPUT_A_IO;
+
+ io_end = aio->io_pos + aio->io_len;
+ below = io_end <= brick->copy_start;
+ behind = !brick->copy_end || aio->io_pos >= brick->copy_end;
+ rw = aio->io_may_write | aio->io_rw;
+ if (rw) {
+ if (!behind) {
+ brick->low_dirty = true;
+ if (!below) {
+ _clash(brick);
+ wake_up_interruptible(&brick->event);
+ }
+ }
+ return INPUT_A_IO;
+ }
+
+ if (below)
+ return INPUT_B_IO;
+
+ return INPUT_A_IO;
+}
+
+#define GET_INDEX(pos) (((pos) / COPY_CHUNK) % NR_COPY_REQUESTS)
+#define GET_OFFSET(pos) ((pos) % COPY_CHUNK)
+
+static
+void __clear_aio(struct copy_brick *brick, struct aio_object *aio, int queue)
+{
+ struct copy_input *input;
+
+ input = queue ? brick->inputs[INPUT_B_COPY] : brick->inputs[INPUT_A_COPY];
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+}
+
+static
+void _clear_aio(struct copy_brick *brick, int index, int queue)
+{
+ struct copy_state *st = &GET_STATE(brick, index);
+ struct aio_object *aio = st->table[queue];
+
+ if (aio) {
+ if (unlikely(st->active[queue])) {
+ XIO_ERR("clearing active aio, index = %d queue = %d\n", index, queue);
+ st->active[queue] = false;
+ }
+ __clear_aio(brick, aio, queue);
+ st->table[queue] = NULL;
+ }
+}
+
+static
+void _clear_all_aio(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < NR_COPY_REQUESTS; i++) {
+ GET_STATE(brick, i).state = COPY_STATE_START;
+ _clear_aio(brick, i, 0);
+ _clear_aio(brick, i, 1);
+ }
+}
+
+static
+void _clear_state_table(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table = brick->st[i];
+
+ memset(sub_table, 0, PAGE_SIZE);
+ }
+}
+
+static
+void copy_endio(struct generic_callback *cb)
+{
+ struct copy_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct copy_brick *brick;
+ struct copy_state *st;
+ int index;
+ int queue;
+ int error = 0;
+
+ LAST_CALLBACK(cb);
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+ brick = aio_a->brick;
+ CHECK_PTR(brick, err);
+
+ queue = aio_a->queue;
+ index = GET_INDEX(aio->io_pos);
+ st = &GET_STATE(brick, index);
+
+ if (unlikely(queue < 0 || queue >= 2)) {
+ XIO_ERR("bad queue %d\n", queue);
+ error = -EINVAL;
+ goto exit;
+ }
+ st->active[queue] = false;
+ if (unlikely(st->table[queue])) {
+ XIO_ERR("table corruption at %d %d (%p => %p)\n", index, queue, st->table[queue], aio);
+ error = -EEXIST;
+ goto exit;
+ }
+ if (unlikely(cb->cb_error < 0)) {
+ error = cb->cb_error;
+ __clear_aio(brick, aio, queue);
+ /* This is racy, but does no harm.
+ * Worst case just produces more error output.
+ */
+ if (!brick->copy_error_count++)
+ XIO_WRN("IO error %d on index %d, old state = %d\n", cb->cb_error, index, st->state);
+ } else {
+ if (unlikely(st->table[queue])) {
+ XIO_ERR("overwriting index %d, state = %d\n", index, st->state);
+ _clear_aio(brick, index, queue);
+ }
+ st->table[queue] = aio;
+ }
+
+exit:
+ if (unlikely(error < 0)) {
+ st->error = error;
+ _clash(brick);
+ }
+ if (aio->io_rw)
+ atomic_dec(&brick->copy_write_flight);
+ else
+ atomic_dec(&brick->copy_read_flight);
+ brick->trigger = true;
+ wake_up_interruptible(&brick->event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback\n");
+out_return:;
+}
+
+static
+int _make_aio(
+struct copy_brick *brick, int index, int queue, void *data, loff_t pos, loff_t end_pos, int rw, int cs_mode)
+{
+ struct aio_object *aio;
+ struct copy_aio_aspect *aio_a;
+ struct copy_input *input;
+ int offset;
+ int len;
+ int status = -EAGAIN;
+
+ if (brick->clash || end_pos <= 0)
+ goto done;
+
+ aio = copy_alloc_aio(brick);
+ status = -ENOMEM;
+
+ aio_a = copy_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a)) {
+ XIO_FAT("cannot get own apsect\n");
+ goto done;
+ }
+
+ aio_a->brick = brick;
+ aio_a->queue = queue;
+ aio->io_may_write = rw;
+ aio->io_rw = rw;
+ aio->io_data = data;
+ aio->io_pos = pos;
+ aio->io_cs_mode = cs_mode;
+ offset = GET_OFFSET(pos);
+ len = COPY_CHUNK - offset;
+ if (pos + len > end_pos)
+ len = end_pos - pos;
+ aio->io_len = len;
+ aio->io_prio = rw ?
+ xio_copy_write_prio :
+ xio_copy_read_prio;
+ if (aio->io_prio < XIO_PRIO_HIGH || aio->io_prio > XIO_PRIO_LOW)
+ aio->io_prio = brick->io_prio;
+
+ SETUP_CALLBACK(aio, copy_endio, aio_a);
+
+ input = queue ? brick->inputs[INPUT_B_COPY] : brick->inputs[INPUT_A_COPY];
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_ERR("status = %d\n", status);
+ obj_free(aio);
+ goto done;
+ }
+ if (unlikely(aio->io_len < len))
+ XIO_DBG("shorten len %d < %d\n", aio->io_len, len);
+ if (queue == 0) {
+ GET_STATE(brick, index).len = aio->io_len;
+ } else if (unlikely(aio->io_len < GET_STATE(brick, index).len)) {
+ XIO_DBG("shorten len %d < %d at index %d\n", aio->io_len, GET_STATE(brick, index).len, index);
+ GET_STATE(brick, index).len = aio->io_len;
+ }
+
+ GET_STATE(brick, index).active[queue] = true;
+ if (rw)
+ atomic_inc(&brick->copy_write_flight);
+ else
+ atomic_inc(&brick->copy_read_flight);
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+
+done:
+ return status;
+}
+
+static
+void _update_percent(struct copy_brick *brick, bool force)
+{
+ if (force ||
+ brick->copy_last > brick->copy_start + 8 * 1024 * 1024 ||
+ time_is_before_jiffies(brick->last_jiffies + 5 * HZ) ||
+ (brick->copy_last == brick->copy_end && brick->copy_end > 0)) {
+ brick->copy_start = brick->copy_last;
+ brick->last_jiffies = jiffies;
+ brick->power.percent_done = brick->copy_end > 0 ? brick->copy_start * 100 / brick->copy_end : 0;
+ XIO_INF(
+ "'%s' copied %lld / %lld bytes (%d%%)\n",
+ brick->brick_path,
+ brick->copy_last,
+ brick->copy_end,
+ brick->power.percent_done);
+ }
+}
+
+/* The heart of this brick.
+ * State transition function of the finite automaton.
+ * In case no progress is possible (e.g. preconditions not
+ * yet true), the state is left as is (idempotence property:
+ * calling this too often does no harm, just costs performance).
+ */
+static
+int _next_state(struct copy_brick *brick, int index, loff_t pos)
+{
+ struct aio_object *aio0;
+ struct aio_object *aio1;
+ struct copy_state *st;
+ char state;
+ char next_state;
+ bool do_restart = false;
+ int progress = 0;
+ int status;
+
+ st = &GET_STATE(brick, index);
+ next_state = st->state;
+
+restart:
+ state = next_state;
+
+ do_restart = false;
+
+ switch (state) {
+ case COPY_STATE_RESET:
+ /* This state is only entered after errors or
+ * in restarting situations.
+ */
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ next_state = COPY_STATE_START;
+ /* fallthrough */
+ case COPY_STATE_START:
+ /* This is the regular starting state.
+ * It must be zero, automatically entered via memset()
+ */
+ if (st->table[0] || st->table[1]) {
+ XIO_ERR("index %d not startable\n", index);
+ progress = -EPROTO;
+ goto idle;
+ }
+
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ st->writeout = false;
+ st->error = 0;
+
+ if (brick->is_aborting ||
+ is_read_limited(brick))
+ goto idle;
+
+ status = _make_aio(brick, index, 0, NULL, pos, brick->copy_end, READ, brick->verify_mode ? 2 : 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG("status = %d\n", status);
+ progress = status;
+ break;
+ }
+
+ next_state = COPY_STATE_READ1;
+ if (!brick->verify_mode)
+ break;
+
+ next_state = COPY_STATE_START2;
+ /* fallthrough */
+ case COPY_STATE_START2:
+ status = _make_aio(brick, index, 1, NULL, pos, brick->copy_end, READ, 2);
+ if (unlikely(status < 0)) {
+ XIO_DBG("status = %d\n", status);
+ progress = status;
+ break;
+ }
+ next_state = COPY_STATE_READ2;
+ /* fallthrough */
+ case COPY_STATE_READ2:
+ aio1 = st->table[1];
+ if (!aio1) { /* idempotence: wait by unchanged state */
+ goto idle;
+ }
+ /* fallthrough = > wait for both aios to appear */
+ case COPY_STATE_READ1:
+ case COPY_STATE_READ3:
+ aio0 = st->table[0];
+ if (!aio0) { /* idempotence: wait by unchanged state */
+ goto idle;
+ }
+ if (brick->copy_limiter) {
+ int amount = (aio0->io_len - 1) / 1024 + 1;
+
+ rate_limit_sleep(brick->copy_limiter, amount);
+ }
+ /* on append mode: increase the end pointer dynamically */
+ if (brick->append_mode > 0 && aio0->io_total_size && aio0->io_total_size > brick->copy_end)
+ brick->copy_end = aio0->io_total_size;
+ /* do verify (when applicable) */
+ aio1 = st->table[1];
+ if (aio1 && state != COPY_STATE_READ3) {
+ int len = aio0->io_len;
+ bool ok;
+
+ if (len != aio1->io_len) {
+ ok = false;
+ } else if (aio0->io_cs_mode) {
+ static unsigned char null[sizeof(aio0->io_checksum)];
+
+ ok = !memcmp(aio0->io_checksum, aio1->io_checksum, sizeof(aio0->io_checksum));
+ if (ok)
+ ok = memcmp(aio0->io_checksum, null, sizeof(aio0->io_checksum)) != 0;
+ } else if (!aio0->io_data || !aio1->io_data) {
+ ok = false;
+ } else {
+ ok = !memcmp(aio0->io_data, aio1->io_data, len);
+ }
+
+ _clear_aio(brick, index, 1);
+
+ if (ok)
+ brick->verify_ok_count++;
+ else
+ brick->verify_error_count++;
+
+ if (ok || !brick->repair_mode) {
+ /* skip start of writing, goto final treatment of writeout */
+ next_state = COPY_STATE_CLEANUP;
+ break;
+ }
+ }
+
+ if (aio0->io_cs_mode > 1) { /* re-read, this time with data */
+ _clear_aio(brick, index, 0);
+ status = _make_aio(brick, index, 0, NULL, pos, brick->copy_end, READ, 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG("status = %d\n", status);
+ progress = status;
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ next_state = COPY_STATE_READ3;
+ break;
+ }
+ next_state = COPY_STATE_WRITE;
+ /* fallthrough */
+ case COPY_STATE_WRITE:
+ if (is_write_limited(brick))
+ goto idle;
+ /* Obey ordering to get a strict "append" behaviour.
+ * We assume that we don't need to wait for completion
+ * of the previous write to avoid a sparse result file
+ * under all circumstances, i.e. we only assure that
+ * _starting_ the writes is in order.
+ * This is only correct when all lower bricks obey the
+ * order of io_io() operations.
+ * Currently, bio and aio are obeying this. Be careful when
+ * implementing new IO bricks!
+ */
+ if (st->prev >= 0 && !GET_STATE(brick, st->prev).writeout)
+ goto idle;
+ aio0 = st->table[0];
+ if (unlikely(!aio0 || !aio0->io_data)) {
+ XIO_ERR("src buffer for write does not exist, state %d at index %d\n", state, index);
+ progress = -EILSEQ;
+ break;
+ }
+ if (unlikely(brick->is_aborting)) {
+ progress = -EINTR;
+ break;
+ }
+ /* start writeout */
+ status = _make_aio(brick, index, 1, aio0->io_data, pos, pos + aio0->io_len, WRITE, 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG("status = %d\n", status);
+ progress = status;
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ /* Attention! overlapped IO behind EOF could
+ * lead to a temporarily inconsistent state of the
+ * file, because the write order may differ from
+ * strict O_APPEND behaviour.
+ */
+ if (xio_copy_overlap)
+ st->writeout = true;
+ next_state = COPY_STATE_WRITTEN;
+ /* fallthrough */
+ case COPY_STATE_WRITTEN:
+ aio1 = st->table[1];
+ if (!aio1) { /* idempotence: wait by unchanged state */
+ goto idle;
+ }
+ st->writeout = true;
+ /* rechecking means to start over again.
+ * ATTENTION! this may lead to infinite request
+ * submission loops, intentionally.
+ * TODO: implement some timeout means.
+ */
+ if (brick->recheck_mode && brick->repair_mode) {
+ next_state = COPY_STATE_RESET;
+ break;
+ }
+ next_state = COPY_STATE_CLEANUP;
+ /* fallthrough */
+ case COPY_STATE_CLEANUP:
+ _clear_aio(brick, index, 1);
+ _clear_aio(brick, index, 0);
+ next_state = COPY_STATE_FINISHED;
+ /* fallthrough */
+ case COPY_STATE_FINISHED:
+ /* Indicate successful completion by remaining in this state.
+ * Restart of the finite automaton must be done externally.
+ */
+ goto idle;
+ default:
+ XIO_ERR("illegal state %d at index %d\n", state, index);
+ _clash(brick);
+ progress = -EILSEQ;
+ }
+
+ do_restart = (state != next_state);
+
+idle:
+ if (unlikely(progress < 0)) {
+ if (st->error >= 0)
+ st->error = progress;
+ XIO_DBG("progress = %d\n", progress);
+ progress = 0;
+ _clash(brick);
+ } else if (do_restart) {
+ goto restart;
+ } else if (st->state != next_state) {
+ progress++;
+ }
+
+ /* save the resulting state */
+ st->state = next_state;
+ return progress;
+}
+
+static
+int _run_copy(struct copy_brick *brick)
+{
+ int max;
+ loff_t pos;
+ loff_t limit = -1;
+
+ short prev;
+ int progress;
+
+ if (unlikely(_clear_clash(brick))) {
+ XIO_DBG("clash\n");
+ if (atomic_read(&brick->copy_read_flight) + atomic_read(&brick->copy_write_flight) > 0) {
+ /* wait until all pending copy IO has finished
+ */
+ _clash(brick);
+ XIO_DBG("re-clash\n");
+ brick_msleep(100);
+ return 0;
+ }
+ _clear_all_aio(brick);
+ _clear_state_table(brick);
+ }
+
+ /* Do at most max iterations in the below loop
+ */
+ max = NR_COPY_REQUESTS - atomic_read(&brick->io_flight) * 2;
+
+ prev = -1;
+ progress = 0;
+ for (pos = brick->copy_last;
+ pos < brick->copy_end || brick->append_mode > 1;
+ pos = ((pos / COPY_CHUNK) + 1) * COPY_CHUNK) {
+ int index = GET_INDEX(pos);
+ struct copy_state *st = &GET_STATE(brick, index);
+
+ if (max-- <= 0)
+ break;
+ st->prev = prev;
+ prev = index;
+ /* call the finite state automaton */
+ if (!(st->active[0] | st->active[1])) {
+ progress += _next_state(brick, index, pos);
+ limit = pos;
+ }
+ }
+
+ /* check the resulting state: can we advance the copy_last pointer? */
+ if (likely(progress && !brick->clash)) {
+ int count = 0;
+
+ for (pos = brick->copy_last; pos <= limit; pos = ((pos / COPY_CHUNK) + 1) * COPY_CHUNK) {
+ int index = GET_INDEX(pos);
+ struct copy_state *st = &GET_STATE(brick, index);
+
+ if (st->state != COPY_STATE_FINISHED)
+ break;
+ if (unlikely(st->error < 0)) {
+ /* check for fatal consistency errors */
+ if (st->error == -EMEDIUMTYPE) {
+ brick->copy_error = st->error;
+ brick->abort_mode = true;
+ XIO_WRN("Consistency is violated\n");
+ }
+ if (!brick->copy_error) {
+ brick->copy_error = st->error;
+ XIO_WRN("IO error = %d\n", st->error);
+ }
+ if (brick->abort_mode)
+ brick->is_aborting = true;
+ break;
+ }
+ /* rollover */
+ st->state = COPY_STATE_START;
+ count += st->len;
+ /* check contiguity */
+ if (unlikely(GET_OFFSET(pos) + st->len != COPY_CHUNK))
+ break;
+ }
+ if (count > 0) {
+ brick->copy_last += count;
+ get_lamport(&brick->copy_last_stamp);
+ _update_percent(brick, false);
+ }
+ }
+ return progress;
+}
+
+static
+bool _is_done(struct copy_brick *brick)
+{
+ if (brick_thread_should_stop())
+ brick->is_aborting = true;
+ return brick->is_aborting &&
+ atomic_read(&brick->copy_read_flight) + atomic_read(&brick->copy_write_flight) <= 0;
+}
+
+static int _copy_thread(void *data)
+{
+ struct copy_brick *brick = data;
+ int rounds = 0;
+
+ XIO_DBG("--------------- copy_thread %p starting\n", brick);
+ brick->copy_error = 0;
+ brick->copy_error_count = 0;
+ brick->verify_ok_count = 0;
+ brick->verify_error_count = 0;
+
+ _update_percent(brick, true);
+
+ xio_set_power_on_led((void *)brick, true);
+ brick->trigger = true;
+
+ while (!_is_done(brick)) {
+ loff_t old_start = brick->copy_start;
+ loff_t old_end = brick->copy_end;
+ int progress = 0;
+
+ if (old_end > 0) {
+ progress = _run_copy(brick);
+ if (!progress || ++rounds > 1000)
+ rounds = 0;
+ }
+
+ wait_event_interruptible_timeout(
+ brick->event,
+ progress > 0 ||
+ brick->trigger ||
+ brick->copy_start != old_start ||
+ brick->copy_end != old_end ||
+ _is_done(brick),
+ 1 * HZ);
+ brick->trigger = false;
+ }
+
+ /* check for fatal consistency errors */
+ if (brick->copy_error == -EMEDIUMTYPE) {
+ /* reset the whole area */
+ brick->copy_start = 0;
+ brick->copy_last = 0;
+ XIO_WRN("resetting the full copy area\n");
+ }
+ _update_percent(brick, true);
+
+ XIO_DBG(
+ "--------------- copy_thread terminating (%d read requests / %d write requests flying, copy_start = %lld copy_end = %lld)\n",
+ atomic_read(&brick->copy_read_flight),
+ atomic_read(&brick->copy_write_flight),
+ brick->copy_start,
+ brick->copy_end);
+
+ _clear_all_aio(brick);
+ xio_set_power_off_led((void *)brick, true);
+ XIO_DBG("--------------- copy_thread done.\n");
+ return 0;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static int copy_get_info(struct copy_output *output, struct xio_info *info)
+{
+ struct copy_input *input = output->brick->inputs[INPUT_B_IO];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
+static int copy_io_get(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+ int status;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (status >= 0)
+ atomic_inc(&output->brick->io_flight);
+ return status;
+}
+
+static void copy_io_put(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ if (atomic_dec_and_test(&output->brick->io_flight)) {
+ output->brick->trigger = true;
+ wake_up_interruptible(&output->brick->event);
+ }
+}
+
+static void copy_io_io(struct copy_output *output, struct aio_object *aio)
+{
+ struct copy_input *input;
+ int index;
+
+ index = _determine_input(output->brick, aio);
+ input = output->brick->inputs[index];
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+}
+
+static int copy_switch(struct copy_brick *brick)
+{
+ static int version;
+
+ XIO_DBG("power.button = %d\n", brick->power.button);
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+ xio_set_power_off_led((void *)brick, false);
+ brick->is_aborting = false;
+ if (!brick->thread) {
+ brick->copy_last = brick->copy_start;
+ get_lamport(&brick->copy_last_stamp);
+ brick->thread = brick_thread_create(_copy_thread, brick, "xio_copy%d", version++);
+ if (brick->thread) {
+ brick->trigger = true;
+ } else {
+ xio_set_power_off_led((void *)brick, true);
+ XIO_ERR("could not start copy thread\n");
+ }
+ }
+ } else {
+ if (brick->power.off_led)
+ goto done;
+ xio_set_power_on_led((void *)brick, false);
+ if (brick->thread) {
+ XIO_INF("stopping thread...\n");
+ brick_thread_stop(brick->thread);
+ }
+ }
+done:
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *copy_statistics(struct copy_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(1024);
+
+ snprintf(
+ res, 1024,
+ "copy_start = %lld copy_last = %lld copy_end = %lld copy_error = %d copy_error_count = %d verify_ok_count = %d verify_error_count = %d low_dirty = %d is_aborting = %d clash = %lu | total clash_count = %d | io_flight = %d copy_read_flight = %d copy_write_flight = %d\n",
+ brick->copy_start,
+ brick->copy_last,
+ brick->copy_end,
+ brick->copy_error,
+ brick->copy_error_count,
+ brick->verify_ok_count,
+ brick->verify_error_count,
+ brick->low_dirty,
+ brick->is_aborting,
+ brick->clash,
+ atomic_read(&brick->total_clash_count),
+ atomic_read(&brick->io_flight),
+ atomic_read(&brick->copy_read_flight),
+ atomic_read(&brick->copy_write_flight));
+
+ return res;
+}
+
+static
+void copy_reset_statistics(struct copy_brick *brick)
+{
+ atomic_set(&brick->total_clash_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int copy_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct copy_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+ return 0;
+}
+
+static void copy_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct copy_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+}
+
+XIO_MAKE_STATICS(copy);
+
+/********************* brick constructors * destructors *******************/
+
+static
+void _free_pages(struct copy_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table = brick->st[i];
+
+ if (!sub_table)
+ continue;
+
+ brick_block_free(sub_table, PAGE_SIZE);
+ }
+ brick_block_free(brick->st, PAGE_SIZE);
+}
+
+static int copy_brick_construct(struct copy_brick *brick)
+{
+ int i;
+
+ brick->st = brick_block_alloc(0, PAGE_SIZE);
+ memset(brick->st, 0, PAGE_SIZE);
+
+ for (i = 0; i < MAX_SUB_TABLES; i++) {
+ struct copy_state *sub_table;
+
+ /* this should usually be optimized away as dead code */
+ if (unlikely(i >= MAX_SUB_TABLES)) {
+ XIO_ERR("sorry, subtable index %d is too large.\n", i);
+ _free_pages(brick);
+ return -EINVAL;
+ }
+
+ sub_table = brick_block_alloc(0, PAGE_SIZE);
+ brick->st[i] = sub_table;
+ memset(sub_table, 0, PAGE_SIZE);
+ }
+
+ init_waitqueue_head(&brick->event);
+ sema_init(&brick->mutex, 1);
+ return 0;
+}
+
+static int copy_brick_destruct(struct copy_brick *brick)
+{
+ _free_pages(brick);
+ return 0;
+}
+
+static int copy_output_construct(struct copy_output *output)
+{
+ return 0;
+}
+
+static int copy_output_destruct(struct copy_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct copy_brick_ops copy_brick_ops = {
+ .brick_switch = copy_switch,
+ .brick_statistics = copy_statistics,
+ .reset_statistics = copy_reset_statistics,
+};
+
+static struct copy_output_ops copy_output_ops = {
+ .xio_get_info = copy_get_info,
+ .aio_get = copy_io_get,
+ .aio_put = copy_io_put,
+ .aio_io = copy_io_io,
+};
+
+const struct copy_input_type copy_input_type = {
+ .type_name = "copy_input",
+ .input_size = sizeof(struct copy_input),
+};
+
+static const struct copy_input_type *copy_input_types[] = {
+ &copy_input_type,
+ &copy_input_type,
+ &copy_input_type,
+ &copy_input_type,
+};
+
+const struct copy_output_type copy_output_type = {
+ .type_name = "copy_output",
+ .output_size = sizeof(struct copy_output),
+ .master_ops = &copy_output_ops,
+ .output_construct = &copy_output_construct,
+ .output_destruct = &copy_output_destruct,
+};
+
+static const struct copy_output_type *copy_output_types[] = {
+ &copy_output_type,
+};
+
+const struct copy_brick_type copy_brick_type = {
+ .type_name = "copy_brick",
+ .brick_size = sizeof(struct copy_brick),
+ .max_inputs = 4,
+ .max_outputs = 1,
+ .master_ops = &copy_brick_ops,
+ .aspect_types = copy_aspect_types,
+ .default_input_types = copy_input_types,
+ .default_output_types = copy_output_types,
+ .brick_construct = &copy_brick_construct,
+ .brick_destruct = &copy_brick_destruct,
+};
+
+/***************** module init stuff ************************/
+
+int __init init_xio_copy(void)
+{
+ XIO_INF("init_copy()\n");
+ return copy_register_brick_type();
+}
+
+void exit_xio_copy(void)
+{
+ XIO_INF("exit_copy()\n");
+ copy_unregister_brick_type();
+}
diff --git a/include/linux/xio/xio_copy.h b/include/linux/xio/xio_copy.h
new file mode 100644
index 000000000000..f92e898419f1
--- /dev/null
+++ b/include/linux/xio/xio_copy.h
@@ -0,0 +1,115 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_COPY_H
+#define XIO_COPY_H
+
+#include <linux/wait.h>
+#include <linux/semaphore.h>
+
+#define INPUT_A_IO 0
+#define INPUT_A_COPY 1
+#define INPUT_B_IO 2
+#define INPUT_B_COPY 3
+
+extern int xio_copy_overlap;
+extern int xio_copy_read_prio;
+extern int xio_copy_write_prio;
+extern int xio_copy_read_max_fly;
+extern int xio_copy_write_max_fly;
+
+enum {
+ COPY_STATE_RESET = -1,
+ COPY_STATE_START = 0, /* don't change this, it _must_ be zero */
+ COPY_STATE_START2,
+ COPY_STATE_READ1,
+ COPY_STATE_READ2,
+ COPY_STATE_READ3,
+ COPY_STATE_WRITE,
+ COPY_STATE_WRITTEN,
+ COPY_STATE_CLEANUP,
+ COPY_STATE_FINISHED,
+};
+
+struct copy_state {
+ struct aio_object *table[2];
+ bool active[2];
+ char state;
+ bool writeout;
+
+ short prev;
+ short len;
+ short error;
+};
+
+struct copy_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct copy_brick *brick;
+ int queue;
+};
+
+struct copy_brick {
+ XIO_BRICK(copy);
+ /* parameters */
+ struct rate_limiter *copy_limiter;
+ loff_t copy_start;
+
+ loff_t copy_end; /* stop working if == 0 */
+ int io_prio;
+
+ int append_mode; /* 1 = passively, 2 = actively */
+ bool verify_mode; /* 0 = copy, 1 = checksum+compare */
+ bool repair_mode; /* whether to repair in case of verify errors */
+ bool recheck_mode; /* whether to re-check after repairs (costs performance) */
+ bool utilize_mode; /* utilize already copied data */
+ bool abort_mode; /* abort on IO error (default is retry forever) */
+ /* readonly from outside */
+ loff_t copy_last; /* current working position */
+ struct timespec copy_last_stamp;
+ int copy_error;
+ int copy_error_count;
+ int verify_ok_count;
+ int verify_error_count;
+ bool low_dirty;
+ bool is_aborting;
+
+ /* internal */
+ bool trigger;
+ unsigned long clash;
+ atomic_t total_clash_count;
+ atomic_t io_flight;
+ atomic_t copy_read_flight;
+ atomic_t copy_write_flight;
+ unsigned long last_jiffies;
+
+ wait_queue_head_t event;
+ struct semaphore mutex;
+ struct task_struct *thread;
+ struct copy_state **st;
+};
+
+struct copy_input {
+ XIO_INPUT(copy);
+};
+
+struct copy_output {
+ XIO_OUTPUT(copy);
+};
+
+XIO_TYPES(copy);
+
+#endif
--
2.11.0
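
GET_INDEX() maps a byte position onto one of NR_COPY_REQUESTS state slots, i.e. onto a ring covering 32 MiB worth of COPY_CHUNK-sized chunks; positions exactly one window apart collide in the same slot, which is why the state machine only advances a bounded window beyond copy_last. A hypothetical standalone demo of the indexing (userspace C, PAGE_SIZE assumed to be 4 KiB):

#include <stdio.h>

#define COPY_CHUNK 4096LL /* PAGE_SIZE assumed 4 KiB */
#define NR_COPY_REQUESTS (32 * 1024 * 1024 / COPY_CHUNK) /* 8192 slots */

#define GET_INDEX(pos) (((pos) / COPY_CHUNK) % NR_COPY_REQUESTS)
#define GET_OFFSET(pos) ((pos) % COPY_CHUNK)

int main(void)
{
	long long pos = 5LL * 1024 * 1024 * 1024 + 123; /* a bit past 5 GiB */

	printf("pos=%lld -> slot %lld, offset %lld\n",
	       pos, GET_INDEX(pos), GET_OFFSET(pos));
	/* one full window (32 MiB) later, the same slot is hit again,
	 * so that chunk must have been finished and rolled over first
	 */
	printf("pos+32MiB -> slot %lld (collision)\n",
	       GET_INDEX(pos + NR_COPY_REQUESTS * COPY_CHUNK));
	return 0;
}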

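The input-selection semantics documented at _determine_input() (writes always go to input A; reads may utilize the already-copied region of B until it is dirtied) can also be modeled standalone. A hypothetical sketch (userspace C; the window bounds in main() are invented, and the clash/wakeup side effects of the original are omitted):

#include <stdbool.h>
#include <stdio.h>

#define INPUT_A_IO 0
#define INPUT_B_IO 2

struct model_brick {
	long long copy_start; /* everything below is already copied to B */
	long long copy_end;   /* stop working if == 0 */
	bool utilize_mode;
	bool low_dirty;
};

static int determine_input(struct model_brick *b, long long pos,
			   long long len, bool write)
{
	bool below, behind;

	if (!b->utilize_mode || b->low_dirty)
		return INPUT_A_IO;
	below = pos + len <= b->copy_start;
	behind = !b->copy_end || pos >= b->copy_end;
	if (write) {
		if (!behind)
			b->low_dirty = true; /* writes into the window dirty side B */
		return INPUT_A_IO;
	}
	return below ? INPUT_B_IO : INPUT_A_IO;
}

int main(void)
{
	struct model_brick b = {
		.copy_start = 1 << 20, /* 1 MiB already copied */
		.copy_end = 1 << 30,
		.utilize_mode = true,
	};

	printf("read below copy_start -> input %d (B)\n",
	       determine_input(&b, 0, 4096, false));
	printf("write into the window -> input %d (A, dirties B)\n",
	       determine_input(&b, 4096, 4096, true));
	printf("read after that write -> input %d (A, low_dirty)\n",
	       determine_input(&b, 0, 4096, false));
	return 0;
}
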
2016-12-30 22:58:53

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 32/32] mars: activate build

From: Thomas Schoebel-Theuer <[email protected]>

---
drivers/staging/Kconfig | 2 ++
drivers/staging/Makefile | 1 +
2 files changed, 3 insertions(+)

diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 5d3b86a33857..bbccc4f0ebbe 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -56,6 +56,8 @@ source "drivers/staging/vt6656/Kconfig"

source "drivers/staging/iio/Kconfig"

+source "drivers/staging/mars/Kconfig"
+
source "drivers/staging/sm750fb/Kconfig"

source "drivers/staging/xgifb/Kconfig"
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 30918edef5e3..01732bd65542 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_VT6655) += vt6655/
obj-$(CONFIG_VT6656) += vt6656/
obj-$(CONFIG_VME_BUS) += vme/
obj-$(CONFIG_IIO) += iio/
+obj-$(CONFIG_MARS) += mars/
obj-$(CONFIG_FB_SM750) += sm750fb/
obj-$(CONFIG_FB_XGI) += xgifb/
obj-$(CONFIG_USB_EMXX) += emxx_udc/
--
2.11.0

2016-12-30 22:59:21

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 29/32] mars: add new module mars_main

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/mars_main.c | 6160 +++++++++++++++++++++++++++++++++
1 file changed, 6160 insertions(+)
create mode 100644 drivers/staging/mars/mars/mars_main.c

diff --git a/drivers/staging/mars/mars/mars_main.c b/drivers/staging/mars/mars/mars_main.c
new file mode 100644
index 000000000000..346e0fbeb9b2
--- /dev/null
+++ b/drivers/staging/mars/mars/mars_main.c
@@ -0,0 +1,6160 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define XIO_DEBUGGING
+
+/* This MUST be updated whenever INCOMPATIBLE changes are made to the
+ * symlink tree in /mars/ .
+ *
+ * Just adding a new symlink is usually not "incompatible", if
+ * other tools like marsadm just ignore it.
+ *
+ * "incompatible" means that something may BREAK.
+ */
+#define SYMLINK_TREE_VERSION "0.1"
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+#include "strategy.h"
+
+#include <linux/wait.h>
+
+#include <linux/xio/lib_mapfree.h>
+
+/* used brick types */
+#include <linux/xio/xio_server.h>
+#include <linux/xio/xio_client.h>
+#include <linux/xio/xio_copy.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_sio.h>
+#include <linux/xio/xio_trans_logger.h>
+#include <linux/xio/xio_if.h>
+#include "mars_proc.h"
+
+#define REPLAY_TOLERANCE (PAGE_SIZE + OVERHEAD)
+
+/* TODO: add human-readable timestamps */
+#define XIO_INF_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_INFO, "%s: " fmt, say_class[SAY_INFO], ##args);\
+ XIO_INF(fmt, ##args); \
+ })
+
+#define XIO_WRN_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_WARN, "%s: " fmt, say_class[SAY_WARN], ##args);\
+ XIO_WRN(fmt, ##args); \
+ })
+
+#define XIO_ERR_TO(channel, fmt, args...) \
+ ({ \
+ say_to(channel, SAY_ERROR, "%s: " fmt, say_class[SAY_ERROR], ##args);\
+ XIO_ERR(fmt, ##args); \
+ })
+
+loff_t raw_total_space;
+loff_t global_total_space;
+
+loff_t raw_remaining_space;
+loff_t global_remaining_space;
+
+int check_mars_space = 1;
+
+module_param_named(check_mars_space, check_mars_space, int, 0);
+
+int global_logrot_auto = 32;
+
+module_param_named(logrot_auto, global_logrot_auto, int, 0);
+
+int global_free_space_0 = CONFIG_MARS_MIN_SPACE_0;
+
+int global_free_space_1 = CONFIG_MARS_MIN_SPACE_1;
+
+int global_free_space_2 = CONFIG_MARS_MIN_SPACE_2;
+
+int global_free_space_3 = CONFIG_MARS_MIN_SPACE_3;
+
+int global_free_space_4 = CONFIG_MARS_MIN_SPACE_4;
+
+int _global_sync_want;
+int global_sync_want;
+
+int global_sync_nr;
+
+int global_sync_limit;
+
+int mars_rollover_interval = 3;
+
+module_param_named(mars_rollover_interval, mars_rollover_interval, int, 0);
+
+int mars_scan_interval = 5;
+
+module_param_named(mars_scan_interval, mars_scan_interval, int, 0);
+
+int mars_propagate_interval = 5;
+
+module_param_named(mars_propagate_interval, mars_propagate_interval, int, 0);
+
+int mars_sync_flip_interval = 60;
+
+module_param_named(mars_sync_flip_interval, mars_sync_flip_interval, int, 0);
+
+int mars_peer_abort = 7;
+
+int mars_fast_fullsync = 1;
+
+module_param_named(mars_fast_fullsync, mars_fast_fullsync, int, 0);
+
+int xio_throttle_start = 60;
+
+int xio_throttle_end = 90;
+
+int mars_emergency_mode;
+
+int mars_reset_emergency = 1;
+
+int mars_keep_msg = 10;
+
+#ifdef CONFIG_MARS_DEBUG
+#include <linux/reboot.h>
+
+int mars_crash_mode;
+int mars_hang_mode;
+
+void _crashme(int mode, bool do_sync)
+{
+ if (mode == mars_crash_mode) {
+ if (do_sync)
+ mars_sync();
+ emergency_restart();
+ }
+}
+
+#endif
+
+#define MARS_SYMLINK_MAX 1023
+
+struct key_value_pair {
+ const char *key;
+ char *val;
+ char *old_val;
+ unsigned long last_jiffies;
+ struct timespec system_stamp;
+ struct timespec lamport_stamp;
+};
+
+static inline
+void clear_vals(struct key_value_pair *start)
+{
+ while (start->key) {
+ brick_string_free(start->val);
+ start->val = NULL;
+ brick_string_free(start->old_val);
+ start->old_val = NULL;
+ start++;
+ }
+}
+
+static
+void show_vals(struct key_value_pair *start, const char *path, const char *add)
+{
+ while (start->key) {
+ char *dst = path_make("%s/actual-%s/msg-%s%s", path, my_id(), add, start->key);
+
+ /* show the old message for some keep_time if no new one is available */
+ if (!start->val && start->old_val &&
+ (long long)start->last_jiffies + mars_keep_msg * HZ > (long long)jiffies) {
+ start->val = start->old_val;
+ start->old_val = NULL;
+ }
+ if (start->val) {
+ char *src = path_make(
+ "%ld.%09ld %ld.%09ld %s",
+ start->system_stamp.tv_sec, start->system_stamp.tv_nsec,
+ start->lamport_stamp.tv_sec, start->lamport_stamp.tv_nsec,
+ start->val);
+ mars_symlink(src, dst, NULL, 0);
+ brick_string_free(src);
+ brick_string_free(start->old_val);
+ start->old_val = start->val;
+ start->val = NULL;
+ } else {
+ mars_symlink("OK", dst, NULL, 0);
+ memset(&start->system_stamp, 0, sizeof(start->system_stamp));
+ memset(&start->lamport_stamp, 0, sizeof(start->lamport_stamp));
+ brick_string_free(start->old_val);
+ start->old_val = NULL;
+ }
+ brick_string_free(dst);
+ start++;
+ }
+}
+
+static inline
+void assign_keys(struct key_value_pair *start, const char **keys)
+{
+ while (*keys) {
+ start->key = *keys;
+ start++;
+ keys++;
+ }
+}
+
+static inline
+struct key_value_pair *find_key(struct key_value_pair *start, const char *key)
+{
+ while (start->key) {
+ if (!strcmp(start->key, key))
+ return start;
+ start++;
+ }
+ XIO_ERR("cannot find key '%s'\n", key);
+ return NULL;
+}
+
+static
+void _make_msg(int line, struct key_value_pair *pair, const char *fmt, ...) __printf(3, 4);
+
+static
+void _make_msg(int line, struct key_value_pair *pair, const char *fmt, ...)
+{
+ int len;
+ va_list args;
+
+ if (unlikely(!pair || !pair->key)) {
+ XIO_ERR("bad pointer %p at line %d\n", pair, line);
+ goto out_return;
+ }
+ pair->last_jiffies = jiffies;
+ if (!pair->val) {
+ pair->val = brick_string_alloc(MARS_SYMLINK_MAX + 1);
+ len = 0;
+ if (!pair->system_stamp.tv_sec) {
+ pair->system_stamp = CURRENT_TIME;
+ get_lamport(&pair->lamport_stamp);
+ }
+ } else {
+ len = strnlen(pair->val, MARS_SYMLINK_MAX);
+ if (unlikely(len >= MARS_SYMLINK_MAX - 48))
+ goto out_return;
+ pair->val[len++] = ',';
+ }
+
+ va_start(args, fmt);
+ vsnprintf(pair->val + len, MARS_SYMLINK_MAX - 1 - len, fmt, args);
+ va_end(args);
+out_return:;
+}
+
+#define make_msg(pair, fmt, args...) \
+ _make_msg(__LINE__, pair, fmt, ##args)
+
+static
+struct key_value_pair gbl_pairs[] = {
+ { NULL }
+};
+
+#define make_gbl_msg(key, fmt, args...) \
+ make_msg(find_key(gbl_pairs, key), fmt, ##args)
+
+static
+const char *rot_keys[] = {
+ /* from _update_version_link() */
+ "err-versionlink-skip",
+ /* from _update_info() */
+ "err-sequence-trash",
+ /* from _is_switchover_possible() */
+ "inf-versionlink-not-yet-exist",
+ "inf-versionlink-not-equal",
+ "inf-replay-not-yet-finished",
+ "err-bad-log-name",
+ "err-log-not-contiguous",
+ "err-versionlink-not-readable",
+ "err-replaylink-not-readable",
+ "err-splitbrain-detected",
+ /* from _update_file() */
+ "inf-fetch",
+ /* from make_sync() */
+ "inf-sync",
+ /* from make_log_step() */
+ "wrn-log-consecutive",
+ /* from make_log_finalize() */
+ "inf-replay-start",
+ "wrn-space-low",
+ "err-space-low",
+ "err-emergency",
+ "err-replay-stop",
+ /* from _check_logging_status() */
+ "inf-replay-tolerance",
+ "err-replay-size",
+ NULL,
+};
+
+#define make_rot_msg(rot, key, fmt, args...) \
+ make_msg(find_key(&(rot)->msgs[0], key), fmt, ##args)
+
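+/* Graduated emergency levels, derived from the remaining space on /mars/
+ * (computed by compute_emergency_mode() below). Each higher level
+ * triggers more drastic countermeasures.
+ */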
+#define IS_EXHAUSTED() (mars_emergency_mode > 0)
+#define IS_EMERGENCY_SECONDARY() (mars_emergency_mode > 1)
+#define IS_EMERGENCY_PRIMARY() (mars_emergency_mode > 2)
+#define IS_JAMMED() (mars_emergency_mode > 3)
+
+static
+void _make_alivelink_str(const char *name, const char *src)
+{
+ char *dst = path_make("/mars/%s-%s", name, my_id());
+
+ if (!src || !dst) {
+ XIO_ERR("cannot make alivelink paths\n");
+ goto err;
+ }
+ XIO_DBG("'%s' -> '%s'\n", src, dst);
+ mars_symlink(src, dst, NULL, 0);
+err:
+ brick_string_free(dst);
+}
+
+static
+void _make_alivelink(const char *name, loff_t val)
+{
+ char *src = path_make("%lld", val);
+
+ _make_alivelink_str(name, src);
+ brick_string_free(src);
+}
+
+static
+int compute_emergency_mode(void)
+{
+ loff_t rest;
+ loff_t present;
+ loff_t limit = 0;
+ int mode = 4;
+ int this_mode = 0;
+
+ mars_remaining_space("/mars", &raw_total_space, &raw_remaining_space);
+ rest = raw_remaining_space;
+
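+ /* The free-space limits accumulate: each CHECK_LIMIT adds its
+ * threshold (configured in MiB) to the previous ones. Checking
+ * proceeds from the most severe level (4) downwards; the first level
+ * whose cumulative limit is undercut determines the emergency mode.
+ */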
+#define CHECK_LIMIT(LIMIT_VAR) \
+do { \
+ if (LIMIT_VAR > 0) \
+ limit += (loff_t)LIMIT_VAR * 1024 * 1024; \
+ if (rest < limit && !this_mode) { \
+ this_mode = mode; \
+ } \
+ mode--; \
+} while (0)
+
+ CHECK_LIMIT(global_free_space_4);
+ CHECK_LIMIT(global_free_space_3);
+ CHECK_LIMIT(global_free_space_2);
+ CHECK_LIMIT(global_free_space_1);
+
+ /* Decrease the emergency mode only in single steps.
+ */
+ if (mars_reset_emergency && mars_emergency_mode > 0 && mars_emergency_mode > this_mode)
+ mars_emergency_mode--;
+ else
+ mars_emergency_mode = this_mode;
+ _make_alivelink("emergency", mars_emergency_mode);
+
+ rest -= limit;
+ if (rest < 0)
+ rest = 0;
+ global_remaining_space = rest;
+ _make_alivelink("rest-space", rest / (1024 * 1024));
+
+ present = raw_total_space - limit;
+ global_total_space = present;
+
+ if (xio_throttle_start > 0 &&
+ xio_throttle_end > xio_throttle_start &&
+ present > 0) {
+ loff_t percent_used = 100 - (rest * 100 / present);
+
+ if (percent_used < xio_throttle_start)
+ if_throttle_start_size = 0;
+ else if (percent_used >= xio_throttle_end)
+ if_throttle_start_size = 1;
+ else
+ if_throttle_start_size = (
+ xio_throttle_end - percent_used) * 1024 / (xio_throttle_end - xio_throttle_start) + 1;
+ }
+
+ if (unlikely(present < global_free_space_0))
+ return -ENOSPC;
+ return 0;
+}
+
+/*****************************************************************/
+
+static struct task_struct *main_thread;
+
+typedef int (*main_worker_fn)(void *buf, struct mars_dent *dent);
+
+struct main_class {
+ char *cl_name;
+ int cl_len;
+ char cl_type;
+ bool cl_hostcontext;
+ bool cl_serial;
+ bool cl_use_channel;
+ int cl_father;
+
+ main_worker_fn cl_prepare;
+ main_worker_fn cl_forward;
+ main_worker_fn cl_backward;
+};
+
+/* the order is important! */
+enum {
+ /* root element: this must have index 0 */
+ CL_ROOT,
+ /* global ID */
+ CL_UUID,
+ /* global userspace */
+ CL_GLOBAL_USERSPACE,
+ CL_GLOBAL_USERSPACE_ITEMS,
+ /* global todos */
+ CL_GLOBAL_TODO,
+ CL_GLOBAL_TODO_DELETE,
+ CL_GLOBAL_TODO_DELETED,
+ CL_DEFAULTS0,
+ CL_DEFAULTS,
+ CL_DEFAULTS_ITEMS0,
+ CL_DEFAULTS_ITEMS,
+ /* replacement for DNS in kernelspace */
+ CL_IPS,
+ CL_PEERS,
+ CL_GBL_ACTUAL,
+ CL_GBL_ACTUAL_ITEMS,
+ CL_ALIVE,
+ CL_TIME,
+ CL_TREE,
+ CL_EMERGENCY,
+ CL_REST_SPACE,
+ /* resource definitions */
+ CL_RESOURCE,
+ CL_RESOURCE_USERSPACE,
+ CL_RESOURCE_USERSPACE_ITEMS,
+ CL_RES_DEFAULTS0,
+ CL_RES_DEFAULTS,
+ CL_RES_DEFAULTS_ITEMS0,
+ CL_RES_DEFAULTS_ITEMS,
+ CL_TODO,
+ CL_TODO_ITEMS,
+ CL_ACTUAL,
+ CL_ACTUAL_ITEMS,
+ CL_DATA,
+ CL_SIZE,
+ CL_ACTSIZE,
+ CL_PRIMARY,
+ CL_CONNECT,
+ CL_TRANSFER,
+ CL_SYNC,
+ CL_VERIF,
+ CL_SYNCPOS,
+ CL_VERSION,
+ CL_LOG,
+ CL_REPLAYSTATUS,
+ CL_DEVICE,
+ CL_MAXNR,
+};
+
+/*********************************************************************/
+
+/* needed for logfile rotation */
+
+#define MAX_INFOS 4
+
+struct mars_rotate {
+ struct list_head rot_head;
+ struct mars_global *global;
+ struct copy_brick *sync_brick;
+ struct mars_dent *replay_link;
+ struct xio_brick *bio_brick;
+ struct mars_dent *aio_dent;
+ struct xio_brick *aio_brick;
+ struct xio_info aio_info;
+ struct trans_logger_brick *trans_brick;
+ struct mars_dent *first_log;
+ struct mars_dent *relevant_log;
+ struct xio_brick *relevant_brick;
+ struct mars_dent *next_relevant_log;
+ struct xio_brick *next_relevant_brick;
+ struct mars_dent *prev_log;
+ struct mars_dent *next_log;
+ struct mars_dent *syncstatus_dent;
+ struct timespec sync_finish_stamp;
+ struct if_brick *if_brick;
+ struct client_brick *remote_brick;
+ const char *fetch_path;
+ const char *fetch_peer;
+ const char *preferred_peer;
+ const char *parent_path;
+ const char *parent_rest;
+ const char *fetch_next_origin;
+ struct say_channel *log_say;
+ struct copy_brick *fetch_brick;
+ struct rate_limiter replay_limiter;
+ struct rate_limiter sync_limiter;
+ struct rate_limiter fetch_limiter;
+ int inf_prev_sequence;
+ int inf_old_sequence;
+ long long flip_start;
+ loff_t dev_size;
+ loff_t start_pos;
+ loff_t end_pos;
+ int max_sequence;
+ int fetch_round;
+ int fetch_serial;
+ int fetch_next_serial;
+ int split_brain_serial;
+ int split_brain_round;
+ int fetch_next_is_available;
+ int relevant_serial;
+ int replay_code;
+ bool has_symlinks;
+ bool res_shutdown;
+ bool has_error;
+ bool has_double_logfile;
+ bool has_hole_logfile;
+ bool allow_update;
+ bool forbid_replay;
+ bool replay_mode;
+ bool todo_primary;
+ bool is_primary;
+ bool old_is_primary;
+ bool created_hole;
+ bool is_log_damaged;
+ bool has_emergency;
+ bool wants_sync;
+ bool gets_sync;
+ bool log_is_really_damaged;
+
+ /* protect the infs array against concurrent read/write */
+ spinlock_t inf_lock;
+ bool infs_is_dirty[MAX_INFOS];
+ struct trans_logger_info infs[MAX_INFOS];
+ struct key_value_pair msgs[sizeof(rot_keys) / sizeof(char *)];
+};
+
+static LIST_HEAD(rot_anchor);
+
+/*********************************************************************/
+
+/* TUNING */
+
+int mars_mem_percent = 20;
+
+#define CONF_TRANS_SHADOW_LIMIT (1024 * 128) /* don't fill the hashtable too much */
+
+#define CONF_TRANS_BATCHLEN 64
+#define CONF_TRANS_PRIO XIO_PRIO_HIGH
+#define CONF_TRANS_LOG_READS false
+
+#define CONF_ALL_BATCHLEN 1
+#define CONF_ALL_PRIO XIO_PRIO_NORMAL
+
+#define IF_SKIP_SYNC true
+
+#define IF_MAX_PLUGGED 10000
+#define IF_READAHEAD 0
+
+#define BIO_READAHEAD 0
+#define BIO_NOIDLE true
+#define BIO_SYNC true
+#define BIO_UNPLUG true
+
+#define COPY_APPEND_MODE 0
+#define COPY_PRIO XIO_PRIO_LOW
+
+static
+int _set_trans_params(struct xio_brick *_brick, void *private)
+{
+ struct trans_logger_brick *trans_brick = (void *)_brick;
+
+ if (_brick->type != (void *)&trans_logger_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ if (!trans_brick->q_phase[1].q_ordering) {
+ trans_brick->q_phase[0].q_batchlen = CONF_TRANS_BATCHLEN;
+ trans_brick->q_phase[1].q_batchlen = CONF_ALL_BATCHLEN;
+ trans_brick->q_phase[2].q_batchlen = CONF_ALL_BATCHLEN;
+ trans_brick->q_phase[3].q_batchlen = CONF_ALL_BATCHLEN;
+
+ trans_brick->q_phase[0].q_io_prio = CONF_TRANS_PRIO;
+ trans_brick->q_phase[1].q_io_prio = CONF_ALL_PRIO;
+ trans_brick->q_phase[2].q_io_prio = CONF_ALL_PRIO;
+ trans_brick->q_phase[3].q_io_prio = CONF_ALL_PRIO;
+
+ trans_brick->q_phase[1].q_ordering = true;
+ trans_brick->q_phase[3].q_ordering = true;
+
+ trans_brick->shadow_mem_limit = CONF_TRANS_SHADOW_LIMIT;
+ trans_brick->log_reads = CONF_TRANS_LOG_READS;
+ }
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+struct client_cookie {
+ bool limit_mode;
+ bool create_mode;
+};
+
+static
+int _set_client_params(struct xio_brick *_brick, void *private)
+{
+ struct client_brick *client_brick = (void *)_brick;
+ struct client_cookie *clc = private;
+
+ client_brick->limit_mode = clc ? clc->limit_mode : false;
+ client_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_sio_params(struct xio_brick *_brick, void *private)
+{
+ struct sio_brick *sio_brick = (void *)_brick;
+
+ if (_brick->type == (void *)&client_brick_type)
+ return _set_client_params(_brick, private);
+ if (_brick->type != (void *)&sio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ sio_brick->o_direct = false; /* important! */
+ sio_brick->o_fdsync = true;
+ sio_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_bio_params(struct xio_brick *_brick, void *private)
+{
+ struct bio_brick *bio_brick;
+
+ if (_brick->type == (void *)&client_brick_type)
+ return _set_client_params(_brick, private);
+ if (_brick->type == (void *)&sio_brick_type)
+ return _set_sio_params(_brick, private);
+ if (_brick->type != (void *)&bio_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ bio_brick = (void *)_brick;
+ bio_brick->ra_pages = BIO_READAHEAD;
+ bio_brick->do_noidle = BIO_NOIDLE;
+ bio_brick->do_sync = BIO_SYNC;
+ bio_brick->do_unplug = BIO_UNPLUG;
+ bio_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+ return 1;
+}
+
+static
+int _set_if_params(struct xio_brick *_brick, void *private)
+{
+ struct if_brick *if_brick = (void *)_brick;
+ struct mars_rotate *rot = private;
+
+ if (_brick->type != (void *)&if_brick_type) {
+ XIO_ERR("bad brick type\n");
+ return -EINVAL;
+ }
+ if (likely(rot))
+ if_brick->max_size = rot->dev_size;
+ if_brick->max_plugged = IF_MAX_PLUGGED;
+ if_brick->readahead = IF_READAHEAD;
+ if_brick->skip_sync = IF_SKIP_SYNC;
+ XIO_INF("name = '%s' path = '%s' size = %lld\n", _brick->brick_name, _brick->brick_path, if_brick->dev_size);
+ return 1;
+}
+
+struct copy_cookie {
+ const char *argv[2];
+ const char *copy_path;
+ loff_t start_pos;
+ loff_t end_pos;
+ bool keep_running;
+ bool verify_mode;
+
+ const char *fullpath[2];
+ struct xio_output *output[2];
+ struct xio_info info[2];
+};
+
+static
+int _set_copy_params(struct xio_brick *_brick, void *private)
+{
+ struct copy_brick *copy_brick = (void *)_brick;
+ struct copy_cookie *cc = private;
+ int status = 1;
+
+ if (_brick->type != (void *)&copy_brick_type) {
+ XIO_ERR("bad brick type\n");
+ status = -EINVAL;
+ goto done;
+ }
+ copy_brick->append_mode = COPY_APPEND_MODE;
+ copy_brick->io_prio = COPY_PRIO;
+ copy_brick->verify_mode = cc->verify_mode;
+ copy_brick->repair_mode = true;
+ copy_brick->killme = true;
+ XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+
+ /* Determine the copy area, switch on/off when necessary
+ */
+ if (!copy_brick->power.button && copy_brick->power.off_led) {
+ int i;
+
+ copy_brick->copy_last = 0;
+ for (i = 0; i < 2; i++) {
+ status = cc->output[i]->ops->xio_get_info(cc->output[i], &cc->info[i]);
+ if (status < 0) {
+ XIO_WRN("cannot determine current size of '%s'\n", cc->argv[i]);
+ goto done;
+ }
+ XIO_DBG("%d '%s' current_size = %lld\n", i, cc->fullpath[i], cc->info[i].current_size);
+ }
+ copy_brick->copy_start = cc->info[1].current_size;
+ if (cc->start_pos != -1) {
+ copy_brick->copy_start = cc->start_pos;
+ if (unlikely(cc->start_pos > cc->info[0].current_size)) {
+ XIO_ERR(
+ "bad start position %lld is larger than actual size %lld on '%s'\n",
+ cc->start_pos,
+ cc->info[0].current_size,
+ cc->copy_path);
+ status = -EINVAL;
+ goto done;
+ }
+ }
+ XIO_DBG("copy_start = %lld\n", copy_brick->copy_start);
+ copy_brick->copy_end = cc->info[0].current_size;
+ if (cc->end_pos != -1) {
+ if (unlikely(cc->end_pos > copy_brick->copy_end)) {
+ XIO_ERR(
+ "target size %lld is larger than actual size %lld on source\n",
+ cc->end_pos,
+ copy_brick->copy_end);
+ status = -EINVAL;
+ goto done;
+ }
+ copy_brick->copy_end = cc->end_pos;
+ if (unlikely(cc->end_pos > cc->info[1].current_size)) {
+ XIO_ERR(
+ "bad end position %lld is larger than actual size %lld on target\n",
+ cc->end_pos,
+ cc->info[1].current_size);
+ status = -EINVAL;
+ goto done;
+ }
+ }
+ XIO_DBG("copy_end = %lld\n", copy_brick->copy_end);
+ if (copy_brick->copy_start < copy_brick->copy_end) {
+ status = 1;
+ XIO_DBG("copy switch on\n");
+ }
+ } else if (copy_brick->power.button && copy_brick->power.on_led &&
+ !cc->keep_running &&
+ copy_brick->copy_last == copy_brick->copy_end && copy_brick->copy_end > 0) {
+ status = 0;
+ XIO_DBG("copy switch off\n");
+ }
+
+done:
+ return status;
+}
+
+/*********************************************************************/
+
+/* internal helpers */
+
+#define MARS_DELIM ','
+
+static int _parse_args(struct mars_dent *dent, char *str, int count)
+{
+ int i;
+ int status = -EINVAL;
+
+ if (!str)
+ goto done;
+ if (!dent->d_args)
+ dent->d_args = brick_strdup(str);
+ for (i = 0; i < count; i++) {
+ char *tmp;
+ int len;
+
+ if (!*str)
+ goto done;
+ if (i == count - 1) {
+ len = strlen(str);
+ } else {
+ char *delim = strchr(str, MARS_DELIM);
+
+ if (!delim)
+ goto done;
+ len = (delim - str);
+ }
+ brick_string_free(dent->d_argv[i]);
+ tmp = brick_string_alloc(len + 1);
+ dent->d_argv[i] = tmp;
+ strncpy(dent->d_argv[i], str, len);
+ dent->d_argv[i][len] = '\0';
+
+ str += len;
+ if (i != count - 1)
+ str++;
+ }
+ status = 0;
+done:
+ if (status < 0) {
+ XIO_ERR(
+ "bad syntax '%s' (should have %d args), status = %d\n",
+ dent->d_args ? dent->d_args : "",
+ count,
+ status);
+ }
+ return status;
+}
+
+static
+int _check_switch(struct mars_global *global, const char *path)
+{
+ int status;
+ int res = 0;
+ struct mars_dent *allow_dent;
+
+ /* Upon shutdown, treat all switches as "off"
+ */
+ if (!global->global_power.button)
+ goto done;
+
+ allow_dent = mars_find_dent(global, path);
+ if (!allow_dent || !allow_dent->link_val)
+ goto done;
+ status = kstrtoint(allow_dent->link_val, 10, &res);
+ (void)status; /* treat errors as if the switch were set to 0 */
+ XIO_DBG("'%s' -> %d\n", path, res);
+
+done:
+ return res;
+}
+
+static
+int __check_allow(struct mars_global *global, struct mars_dent *parent, const char *name, const char *peer)
+{
+ int res = 0;
+ char *path = path_make("%s/todo-%s/%s", parent->d_path, peer, name);
+
+ if (!path)
+ goto done;
+
+ res = _check_switch(global, path);
+
+done:
+ brick_string_free(path);
+ return res;
+}
+
+static inline
+int _check_allow(struct mars_global *global, struct mars_dent *parent, const char *name)
+{
+ return __check_allow(global, parent, name, my_id());
+}
+
+#define skip_part(s) _skip_part(s, ',', ':')
+#define skip_sect(s) _skip_part(s, ':', 0)
+static inline
+int _skip_part(const char *str, const char del1, const char del2)
+{
+ int len = 0;
+
+ while (str[len] && str[len] != del1 && (!del2 || str[len] != del2))
+ len++;
+ return len;
+}
+
+static inline
+int skip_dir(const char *str)
+{
+ int len = 0;
+ int res = 0;
+
+ for (len = 0; str[len]; len++)
+ if (str[len] == '/')
+ res = len + 1;
+ return res;
+}
+
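+/* Parse a logfile name of the form "log-<sequence>-<host>".
+ * Returns the parsed length (0 on malformed names) and fills in
+ * *seq and *host; the caller must free *host.
+ */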
+static
+int parse_logfile_name(const char *str, int *seq, const char **host)
+{
+ char *_host;
+ int count;
+ int len = 0;
+ int len_host;
+
+ *seq = 0;
+ *host = NULL;
+
+ count = sscanf(str, "log-%d-%n", seq, &len);
+ if (unlikely(count != 1)) {
+ XIO_ERR("bad logfile name '%s', count=%d, len=%d\n", str, count, len);
+ return 0;
+ }
+
+ _host = brick_strdup(str + len);
+
+ len_host = skip_part(_host);
+ _host[len_host] = '\0';
+ *host = _host;
+ len += len_host;
+
+ return len;
+}
+
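+/* Compare the replay links of two hosts, first by logfile sequence
+ * number, then by replay offset. Returns -1 / 0 / 1 like a comparator,
+ * or -2 on errors.
+ */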
+static
+int compare_replaylinks(struct mars_rotate *rot, const char *hosta, const char *hostb)
+{
+ const char *linka = path_make("%s/replay-%s", rot->parent_path, hosta);
+ const char *linkb = path_make("%s/replay-%s", rot->parent_path, hostb);
+ const char *a = NULL;
+ const char *b = NULL;
+ int seqa;
+ int seqb;
+ int posa;
+ int posb;
+ loff_t offa = 0;
+ loff_t offb = -1;
+ loff_t taila = 0;
+ loff_t tailb = -1;
+ int count;
+ int res = -2;
+
+ if (unlikely(!linka || !linkb)) {
+ XIO_ERR("nen MEM");
+ goto done;
+ }
+
+ a = mars_readlink(linka);
+ if (unlikely(!a || !a[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read replaylink '%s'\n", linka);
+ goto done;
+ }
+ b = mars_readlink(linkb);
+ if (unlikely(!b || !b[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read replaylink '%s'\n", linkb);
+ goto done;
+ }
+
+ count = sscanf(a, "log-%d-%n", &seqa, &posa);
+ if (unlikely(count != 1))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ count = sscanf(b, "log-%d-%n", &seqb, &posb);
+ if (unlikely(count != 1))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+ if (seqa < seqb) {
+ res = -1;
+ goto done;
+ } else if (seqa > seqb) {
+ res = 1;
+ goto done;
+ }
+
+ posa += skip_part(a + posa);
+ posb += skip_part(b + posb);
+ if (unlikely(!a[posa++]))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ if (unlikely(!b[posb++]))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+ count = sscanf(a + posa, "%lld,%lld", &offa, &taila);
+ if (unlikely(count != 2))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linka, a);
+ count = sscanf(b + posb, "%lld,%lld", &offb, &tailb);
+ if (unlikely(count != 2))
+ XIO_ERR_TO(rot->log_say, "replay link '%s' -> '%s' is malformed\n", linkb, b);
+
+ if (offa < offb)
+ res = -1;
+ else if (offa > offb)
+ res = 1;
+ else
+ res = 0;
+
+done:
+ brick_string_free(a);
+ brick_string_free(b);
+ brick_string_free(linka);
+ brick_string_free(linkb);
+ return res;
+}
+
+/*********************************************************************/
+
+/* status display */
+
+static
+int _update_link_when_necessary(struct mars_rotate *rot, const char *type, const char *old, const char *new)
+{
+ char *check = NULL;
+ int status = -EINVAL;
+ bool res = false;
+
+ if (unlikely(!old || !new))
+ goto out;
+
+ /* Check whether something really has changed (avoid
+ * useless/disturbing timestamp updates)
+ */
+ check = mars_readlink(new);
+ if (check && !strcmp(check, old)) {
+ XIO_DBG("%s symlink '%s' -> '%s' has not changed\n", type, old, new);
+ res = 0;
+ goto out;
+ }
+
+ status = mars_symlink(old, new, NULL, 0);
+ if (unlikely(status < 0)) {
+ XIO_ERR_TO(
+ rot->log_say, "cannot create %s symlink '%s' -> '%s' status = %d\n", type, old, new, status);
+ } else {
+ res = 1;
+ XIO_DBG("made %s symlink '%s' -> '%s' status = %d\n", type, old, new, status);
+ }
+
+out:
+ brick_string_free(check);
+ return res;
+}
+
+static
+int _update_replay_link(struct mars_rotate *rot, struct trans_logger_info *inf)
+{
+ char *old = NULL;
+ char *new = NULL;
+ int res = 0;
+
+ old = path_make(
+ "log-%09d-%s,%lld,%lld",
+ inf->inf_sequence,
+ inf->inf_host,
+ inf->inf_min_pos,
+ inf->inf_max_pos - inf->inf_min_pos);
+ if (!old)
+ goto out;
+ new = path_make("%s/replay-%s", rot->parent_path, my_id());
+ if (!new)
+ goto out;
+
+ _crashme(1, true);
+
+ res = _update_link_when_necessary(rot, "replay", old, new);
+
+out:
+ brick_string_free(new);
+ brick_string_free(old);
+ return res;
+}
+
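+/* Version links form a chain: each one records a digest over
+ * (sequence, host, log position, contents of the predecessor link).
+ * Comparing version links between hosts thus detects diverging
+ * histories (split brain).
+ */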
+static
+int _update_version_link(struct mars_rotate *rot, struct trans_logger_info *inf)
+{
+ char *data = brick_string_alloc(0);
+ char *old = brick_string_alloc(0);
+ char *new = NULL;
+ unsigned char *digest = brick_string_alloc(0);
+ char *prev = NULL;
+ char *prev_link = NULL;
+ char *prev_digest = NULL;
+ int len;
+ int i;
+ int res = 0;
+
+ if (likely(inf->inf_sequence > 1)) {
+ if (unlikely((inf->inf_sequence < rot->inf_prev_sequence ||
+ inf->inf_sequence > rot->inf_prev_sequence + 1) &&
+ rot->inf_prev_sequence != 0)) {
+ char *skip_path = path_make("%s/skip-check-%s", rot->parent_path, my_id());
+ char *skip_link = mars_readlink(skip_path);
+ char *msg = "";
+ int skip_nr = -1;
+ int nr_char = 0;
+
+ if (likely(skip_link && skip_link[0])) {
+ int status = sscanf(skip_link, "%d%n", &skip_nr, &nr_char);
+
+ (void)status; /* keep msg empty in case of errors */
+ msg = skip_link + nr_char;
+ }
+ brick_string_free(skip_path);
+ if (likely(skip_nr != inf->inf_sequence)) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "SKIP in sequence numbers detected: %d != %d + 1\n",
+ inf->inf_sequence,
+ rot->inf_prev_sequence);
+ make_rot_msg(
+ rot,
+ "err-versionlink-skip",
+ "SKIP in sequence numbers detected: %d != %d + 1",
+ inf->inf_sequence,
+ rot->inf_prev_sequence);
+ brick_string_free(skip_link);
+ goto out;
+ }
+ XIO_WRN_TO(
+ rot->log_say,
+ "you explicitly requested to SKIP sequence numbers from %d to %d%s\n",
+ rot->inf_prev_sequence, inf->inf_sequence, msg);
+ brick_string_free(skip_link);
+ }
+ prev = path_make("%s/version-%09d-%s", rot->parent_path, inf->inf_sequence - 1, my_id());
+ if (unlikely(!prev)) {
+ XIO_ERR("no MEM\n");
+ goto out;
+ }
+ prev_link = mars_readlink(prev);
+ rot->inf_prev_sequence = inf->inf_sequence;
+ }
+
+ len = sprintf(
+ data, "%d,%s,%lld:%s", inf->inf_sequence, inf->inf_host, inf->inf_log_pos, prev_link ? prev_link : "");
+
+ XIO_DBG("data = '%s' len = %d\n", data, len);
+
+ xio_digest(digest, data, len);
+
+ len = 0;
+ for (i = 0; i < xio_digest_size; i++)
+ len += sprintf(old + len, "%02x", digest[i]);
+
+ if (likely(prev_link && prev_link[0])) {
+ char *tmp;
+
+ prev_digest = brick_strdup(prev_link);
+ /* take the part before ':' */
+ for (tmp = prev_digest; *tmp; tmp++)
+ if (*tmp == ':')
+ break;
+ *tmp = '\0';
+ }
+
+ len += sprintf(
+ old + len,
+ ",log-%09d-%s,%lld:%s",
+ inf->inf_sequence,
+ inf->inf_host,
+ inf->inf_log_pos,
+ prev_digest ? prev_digest : "");
+
+ new = path_make("%s/version-%09d-%s", rot->parent_path, inf->inf_sequence, my_id());
+ if (!new) {
+ XIO_ERR("no MEM\n");
+ goto out;
+ }
+
+ _crashme(2, true);
+
+ res = _update_link_when_necessary(rot, "version", old, new);
+
+out:
+ brick_string_free(new);
+ brick_string_free(prev);
+ brick_string_free(data);
+ brick_string_free(digest);
+ brick_string_free(old);
+ brick_string_free(prev_link);
+ brick_string_free(prev_digest);
+ return res;
+}
+
+static
+void _update_info(struct trans_logger_info *inf)
+{
+ struct mars_rotate *rot = inf->inf_private;
+ int hash;
+ unsigned long flags;
+
+ if (unlikely(!rot)) {
+ XIO_ERR("rot is NULL\n");
+ goto done;
+ }
+
+ XIO_DBG(
+ "inf = %p '%s' seq = %d min_pos = %lld max_pos = %lld log_pos = %lld is_replaying = %d is_logging = %d\n",
+ inf,
+ inf->inf_host,
+ inf->inf_sequence,
+ inf->inf_min_pos,
+ inf->inf_max_pos,
+ inf->inf_log_pos,
+ inf->inf_is_replaying,
+ inf->inf_is_logging);
+
+ hash = inf->inf_sequence % MAX_INFOS;
+ if (unlikely(rot->infs_is_dirty[hash])) {
+ if (unlikely(rot->infs[hash].inf_sequence != inf->inf_sequence)) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "buffer %d: sequence trash %d -> %d. is the mars_main thread hanging?\n",
+ hash,
+ rot->infs[hash].inf_sequence,
+ inf->inf_sequence);
+ make_rot_msg(
+ rot,
+ "err-sequence-trash",
+ "buffer %d: sequence trash %d -> %d",
+ hash,
+ rot->infs[hash].inf_sequence,
+ inf->inf_sequence);
+ } else {
+ XIO_DBG("buffer %d is overwritten (sequence=%d)\n", hash, inf->inf_sequence);
+ }
+ }
+
+ spin_lock_irqsave(&rot->inf_lock, flags);
+ memcpy(&rot->infs[hash], inf, sizeof(struct trans_logger_info));
+ rot->infs_is_dirty[hash] = true;
+ spin_unlock_irqrestore(&rot->inf_lock, flags);
+
+ local_trigger();
+done:;
+}
+
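+/* Flush all dirty trans_logger_info buffers into replay/version
+ * symlinks, always picking the lowest pending sequence number first so
+ * that updates happen in order.
+ */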
+static
+void write_info_links(struct mars_rotate *rot)
+{
+ struct trans_logger_info inf;
+ int count = 0;
+
+ for (;;) {
+ unsigned long flags;
+ int hash = -1;
+ int min = 0;
+ int i;
+
+ spin_lock_irqsave(&rot->inf_lock, flags);
+ for (i = 0; i < MAX_INFOS; i++) {
+ if (!rot->infs_is_dirty[i])
+ continue;
+ if (!min || min > rot->infs[i].inf_sequence) {
+ min = rot->infs[i].inf_sequence;
+ hash = i;
+ }
+ }
+
+ if (hash < 0) {
+ spin_unlock_irqrestore(&rot->inf_lock, flags);
+ break;
+ }
+
+ rot->infs_is_dirty[hash] = false;
+ memcpy(&inf, &rot->infs[hash], sizeof(struct trans_logger_info));
+ spin_unlock_irqrestore(&rot->inf_lock, flags);
+
+ XIO_DBG(
+ "seq = %d min_pos = %lld max_pos = %lld log_pos = %lld is_replaying = %d is_logging = %d\n",
+ inf.inf_sequence,
+ inf.inf_min_pos,
+ inf.inf_max_pos,
+ inf.inf_log_pos,
+ inf.inf_is_replaying,
+ inf.inf_is_logging);
+
+ if (inf.inf_is_logging || inf.inf_is_replaying) {
+ count += _update_replay_link(rot, &inf);
+ count += _update_version_link(rot, &inf);
+ if (min > rot->inf_old_sequence) {
+ mars_sync();
+ rot->inf_old_sequence = min;
+ }
+ }
+ }
+ if (count) {
+ if (inf.inf_min_pos == inf.inf_max_pos)
+ local_trigger();
+ remote_trigger();
+ }
+}
+
+static
+void _make_new_replaylink(struct mars_rotate *rot, char *new_host, int new_sequence, loff_t end_pos)
+{
+ struct trans_logger_info inf = {
+ .inf_private = rot,
+ .inf_sequence = new_sequence,
+ .inf_min_pos = 0,
+ .inf_max_pos = 0,
+ .inf_log_pos = end_pos,
+ .inf_is_replaying = true,
+ };
+ strncpy(inf.inf_host, new_host, sizeof(inf.inf_host));
+
+ XIO_DBG("new_host = '%s' new_sequence = %d end_pos = %lld\n", new_host, new_sequence, end_pos);
+
+ _update_replay_link(rot, &inf);
+ _update_version_link(rot, &inf);
+
+ local_trigger();
+ remote_trigger();
+}
+
+static
+int __show_actual(const char *path, const char *name, int val)
+{
+ char *src;
+ char *dst = NULL;
+ int status = -EINVAL;
+
+ src = path_make("%d", val);
+ dst = path_make("%s/actual-%s/%s", path, my_id(), name);
+ status = -ENOMEM;
+ if (!src || !dst)
+ goto done;
+
+ XIO_DBG("symlink '%s' -> '%s'\n", dst, src);
+ status = mars_symlink(src, dst, NULL, 0);
+
+done:
+ brick_string_free(src);
+ brick_string_free(dst);
+ return status;
+}
+
+static inline
+int _show_actual(const char *path, const char *name, bool val)
+{
+ return __show_actual(path, name, val ? 1 : 0);
+}
+
+static
+void _show_primary(struct mars_rotate *rot, struct mars_dent *parent)
+{
+ if (!rot || !parent)
+ goto out_return;
+ _show_actual(parent->d_path, "is-primary", rot->is_primary);
+ if (rot->is_primary != rot->old_is_primary) {
+ rot->old_is_primary = rot->is_primary;
+ remote_trigger();
+ }
+out_return:;
+}
+
+static
+void _show_brick_status(struct xio_brick *test, bool shutdown)
+{
+ const char *path;
+ char *src;
+ char *dst;
+ int status;
+
+ path = test->brick_path;
+ if (!path) {
+ XIO_WRN("bad path\n");
+ goto out_return;
+ }
+ if (*path != '/') {
+ XIO_WRN("bogus path '%s'\n", path);
+ goto out_return;
+ }
+
+ src = (test->power.on_led && !shutdown) ? "1" : "0";
+ dst = backskip_replace(path, '/', true, "/actual-%s/", my_id());
+ if (!dst)
+ goto out_return;
+
+ status = mars_symlink(src, dst, NULL, 0);
+ XIO_DBG("status symlink '%s' -> '%s' status = %d\n", dst, src, status);
+ brick_string_free(dst);
+out_return:;
+}
+
+static
+void _show_status_all(struct mars_global *global)
+{
+ struct list_head *tmp;
+
+ down_read(&global->brick_mutex);
+ for (tmp = global->brick_anchor.next; tmp != &global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ if (!test->show_status)
+ continue;
+ _show_brick_status(test, false);
+ }
+ up_read(&global->brick_mutex);
+}
+
+static
+void _show_rate(struct mars_rotate *rot, struct rate_limiter *limiter, const char *basename)
+{
+ char *name;
+
+ rate_limit(limiter, 0);
+
+ name = path_make("ops-%s", basename);
+ __show_actual(rot->parent_path, name, limiter->lim_ops_rate);
+ brick_string_free(name);
+
+ name = path_make("amount-%s", basename);
+ __show_actual(rot->parent_path, name, limiter->lim_amount_rate);
+ brick_string_free(name);
+}
+
+/*********************************************************************/
+
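+/* Common helper for all copy jobs (e.g. logfile fetching): create or
+ * find the two underlying bio/aio bricks, then create or switch the
+ * copy brick between them, reporting progress via msg_pair.
+ */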
+static
+int __make_copy(
+ struct mars_global *global,
+ struct mars_dent *belongs,
+ const char *switch_path,
+ const char *copy_path,
+ const char *parent,
+ const char *argv[],
+ struct key_value_pair *msg_pair,
+ loff_t start_pos, /* -1 means at EOF of source */
+ loff_t end_pos, /* -1 means at EOF of target */
+ bool keep_running,
+ bool verify_mode,
+ bool limit_mode,
+ bool space_using_mode,
+ struct copy_brick **__copy)
+{
+ struct xio_brick *copy;
+ struct copy_cookie cc = {};
+
+ struct client_cookie clc[2] = {
+ {
+ .limit_mode = limit_mode,
+ },
+ {
+ .limit_mode = limit_mode,
+ .create_mode = true,
+ },
+ };
+ int i;
+ bool switch_copy;
+ int status = -EINVAL;
+
+ if (!switch_path || !global)
+ goto done;
+
+ /* don't generate empty aio files if copy does not yet exist */
+ switch_copy = _check_switch(global, switch_path);
+ copy = mars_find_brick(global, &copy_brick_type, copy_path);
+ if (!copy && !switch_copy)
+ goto done;
+
+ /* create/find predecessor aio bricks */
+ for (i = 0; i < 2; i++) {
+ struct xio_brick *aio;
+
+ cc.argv[i] = argv[i];
+ if (parent) {
+ cc.fullpath[i] = path_make("%s/%s", parent, argv[i]);
+ if (!cc.fullpath[i]) {
+ XIO_ERR("cannot make path '%s/%s'\n", parent, argv[i]);
+ goto done;
+ }
+ } else {
+ cc.fullpath[i] = argv[i];
+ }
+
+ aio =
+ make_brick_all(
+ global,
+ NULL,
+ _set_bio_params,
+ &clc[i],
+ NULL,
+ (const struct generic_brick_type *)&bio_brick_type,
+ (const struct generic_brick_type*[]){},
+ switch_copy || (copy && !copy->power.off_led) ? 2 : -1,
+ cc.fullpath[i],
+ (const char *[]){},
+ 0);
+ if (!aio) {
+ XIO_DBG("cannot instantiate '%s'\n", cc.fullpath[i]);
+ make_msg(msg_pair, "cannot instantiate '%s'", cc.fullpath[i]);
+ goto done;
+ }
+ cc.output[i] = aio->outputs[0];
+ /* When switching off, use a short timeout for aborting.
+ * Important on very slow networks (since a large number
+ * of requests may be pending).
+ */
+ aio->power.io_timeout = switch_copy ? 0 : 1;
+ }
+
+ cc.copy_path = copy_path;
+ cc.start_pos = start_pos;
+ cc.end_pos = end_pos;
+ cc.keep_running = keep_running;
+ cc.verify_mode = verify_mode;
+
+ copy =
+ make_brick_all(
+ global,
+ belongs,
+ _set_copy_params,
+ &cc,
+ cc.fullpath[1],
+ (const struct generic_brick_type *)&copy_brick_type,
+ (const struct generic_brick_type*[]){NULL, NULL, NULL, NULL},
+ (!switch_copy || (IS_EMERGENCY_PRIMARY() && !space_using_mode)) ? -1 : 2,
+ "%s",
+ (const char *[]){"%s", "%s", "%s", "%s"},
+ 4,
+ copy_path,
+ cc.fullpath[0],
+ cc.fullpath[0],
+ cc.fullpath[1],
+ cc.fullpath[1]);
+ if (copy) {
+ struct copy_brick *_copy = (void *)copy;
+
+ copy->show_status = _show_brick_status;
+ make_msg(
+ msg_pair,
+ "from = '%s' to = '%s' on = %d start_pos = %lld end_pos = %lld actual_pos = %lld actual_stamp = %ld.%09ld ops_rate = %d amount_rate = %d read_fly = %d write_fly = %d error_code = %d nr_errors = %d",
+ argv[0],
+ argv[1],
+ _copy->power.on_led,
+ _copy->copy_start,
+ _copy->copy_end,
+ _copy->copy_last,
+ _copy->copy_last_stamp.tv_sec, _copy->copy_last_stamp.tv_nsec,
+ _copy->copy_limiter ? _copy->copy_limiter->lim_ops_rate : 0,
+ _copy->copy_limiter ? _copy->copy_limiter->lim_amount_rate : 0,
+ atomic_read(&_copy->copy_read_flight),
+ atomic_read(&_copy->copy_write_flight),
+ _copy->copy_error,
+ _copy->copy_error_count);
+ }
+ if (__copy)
+ *__copy = (void *)copy;
+
+ status = 0;
+
+done:
+ XIO_DBG("status = %d\n", status);
+ for (i = 0; i < 2; i++) {
+ if (cc.fullpath[i] && cc.fullpath[i] != argv[i])
+ brick_string_free(cc.fullpath[i]);
+ }
+ return status;
+}
+
+/*********************************************************************/
+
+/* remote workers */
+
+static
+rwlock_t peer_lock = __RW_LOCK_UNLOCKED(&peer_lock);
+
+static
+struct list_head peer_anchor = LIST_HEAD_INIT(peer_anchor);
+
+struct mars_peerinfo {
+ struct mars_global *global;
+ char *peer;
+ char *path;
+ struct xio_socket socket;
+ struct task_struct *peer_thread;
+
+ /* protect the following lists against concurrent read/write */
+ spinlock_t lock;
+ struct list_head peer_head;
+ struct list_head remote_dent_list;
+ unsigned long last_remote_jiffies;
+ int maxdepth;
+ bool to_remote_trigger;
+ bool from_remote_trigger;
+};
+
+static
+struct mars_peerinfo *find_peer(const char *peer_name)
+{
+ struct list_head *tmp;
+ struct mars_peerinfo *res = NULL;
+ unsigned long flags;
+
+ read_lock_irqsave(&peer_lock, flags);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ if (!strcmp(peer->peer, peer_name)) {
+ res = peer;
+ break;
+ }
+ }
+ read_unlock_irqrestore(&peer_lock, flags);
+
+ return res;
+}
+
+static
+bool _is_usable_dir(const char *name)
+{
+ if (!strncmp(name, "resource-", 9) ||
+ !strncmp(name, "todo-", 5) ||
+ !strncmp(name, "actual-", 7) ||
+ !strncmp(name, "defaults", 8)) {
+ return true;
+ }
+ return false;
+}
+
+static
+bool _is_peer_logfile(const char *name, const char *id)
+{
+ int len = strlen(name);
+ int idlen = id ? strlen(id) : 4 + 9 + 1;
+
+ if (len <= idlen ||
+ strncmp(name, "log-", 4) != 0) {
+ XIO_DBG("not a logfile at all: '%s'\n", name);
+ return false;
+ }
+ if (id &&
+ name[len - idlen - 1] == '-' &&
+ strncmp(name + len - idlen, id, idlen) == 0) {
+ XIO_DBG("not a peer logfile: '%s'\n", name);
+ return false;
+ }
+ XIO_DBG("found peer logfile: '%s'\n", name);
+ return true;
+}
+
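+/* Start or re-parametrize a copy brick fetching a logfile from the
+ * given peer, unless fetching is currently disallowed (designated
+ * primary role, rmmod in progress, or connect switched off).
+ */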
+static
+int _update_file(
+struct mars_dent *parent,
+const char *switch_path,
+const char *copy_path,
+const char *file,
+const char *peer,
+loff_t end_pos)
+{
+ struct mars_rotate *rot = parent->d_private;
+ struct mars_global *global = rot->global;
+ const char *tmp = path_make("%s@%s:%d", file, peer, xio_net_default_port + 1);
+ const char *argv[2] = { tmp, file };
+ struct copy_brick *copy = NULL;
+ struct key_value_pair *msg_pair = find_key(rot->msgs, "inf-fetch");
+ bool do_start = true;
+ int status = -ENOMEM;
+
+ if (unlikely(!tmp || !global))
+ goto done;
+
+ rot->fetch_round = 0;
+
+ if (rot->todo_primary | rot->is_primary) {
+ XIO_DBG("disallowing fetch, todo_primary=%d is_primary=%d\n", rot->todo_primary, rot->is_primary);
+ make_msg(
+ msg_pair, "disallowing fetch (todo_primary=%d is_primary=%d)", rot->todo_primary, rot->is_primary);
+ do_start = false;
+ }
+ if (do_start && !strcmp(peer, "(none)")) {
+ XIO_DBG("disabling fetch from unspecified peer / no primary designated\n");
+ make_msg(msg_pair, "disabling fetch from unspecified peer / no primary designated");
+ do_start = false;
+ }
+ if (do_start && !global->global_power.button) {
+ XIO_DBG("disabling fetch due to rmmod\n");
+ make_msg(msg_pair, "disabling fetch due to rmmod");
+ do_start = false;
+ }
+#if 0
+ /* Disabled for now. Re-enable this code after a new feature has been
+ * implemented: when pause-replay is given on a secondary,
+ * /dev/mars/mydata should appear in _readonly_ form.
+ * You may then draw a backup from the readonly device without losing
+ * redundancy, because the transactions logs will continue updating.
+ * Until the new feature is implemented, use
+ * "marsadm pause-replay $res; marsadm detach $res; mount -o ro /dev/lv/$res"
+ * as a workaround. It is important that "fetch" remains enabled.
+ *
+ * Hint: "marsadm down" disables _all_ switches, including fetch,
+ * thus it can / must be used for pausing everything.
+ */
+ if (do_start && !_check_allow(global, parent, "attach")) {
+ XIO_DBG("disabling fetch due to detach\n");
+ make_msg(msg_pair, "disabling fetch due to detach");
+ do_start = false;
+ }
+#endif
+ if (do_start && !_check_allow(global, parent, "connect")) {
+ XIO_DBG("disabling fetch due to disconnect\n");
+ make_msg(msg_pair, "disabling fetch due to disconnect");
+ do_start = false;
+ }
+
+ XIO_DBG("src = '%s' dst = '%s'\n", tmp, file);
+ status = __make_copy(
+ global,
+ NULL,
+ do_start ? switch_path : "",
+ copy_path,
+ NULL,
+ argv,
+ msg_pair,
+ -1,
+ -1,
+ false,
+ false,
+ false,
+ true,
+ &copy);
+ if (status >= 0 && copy) {
+ copy->copy_limiter = &rot->fetch_limiter;
+ /* FIXME: code is dead */
+ if (copy->append_mode && copy->power.on_led &&
+ end_pos > copy->copy_end) {
+ XIO_DBG("appending to '%s' %lld => %lld\n", copy_path, copy->copy_end, end_pos);
+ /* FIXME: use corrected length from xio_get_info() / see _set_copy_params() */
+ copy->copy_end = end_pos;
+ }
+ }
+
+done:
+ brick_string_free(tmp);
+ return status;
+}
+
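+/* Decide whether a remote logfile needs to be fetched: run plausibility
+ * checks, detect split brain (logfiles with the same serial number but
+ * different origins), and start or update the fetch copy brick.
+ */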
+static
+int check_logfile(
+const char *peer,
+struct mars_dent *remote_dent,
+struct mars_dent *local_dent,
+struct mars_dent *parent,
+loff_t dst_size)
+{
+ loff_t src_size = remote_dent->stat_val.size;
+ struct mars_rotate *rot;
+ const char *switch_path = NULL;
+ struct copy_brick *fetch_brick;
+ int status = 0;
+
+ /* correct the remote size when necessary */
+ if (remote_dent->d_corr_B > 0 && remote_dent->d_corr_B < src_size) {
+ XIO_DBG(
+ "logfile '%s' correcting src_size from %lld to %lld\n",
+ remote_dent->d_path,
+ src_size,
+ remote_dent->d_corr_B);
+ src_size = remote_dent->d_corr_B;
+ }
+
+ /* plausibility checks */
+ if (unlikely(dst_size > src_size)) {
+ XIO_WRN("my local copy is larger than the remote one, ignoring\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* check whether we are participating in that resource */
+ rot = parent->d_private;
+ if (!rot) {
+ XIO_WRN("parent has no rot info\n");
+ status = -EINVAL;
+ goto done;
+ }
+ if (!rot->fetch_path) {
+ XIO_WRN("parent has no fetch_path\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* bookkeeping for serialization of logfile updates */
+ if (remote_dent->d_serial > rot->fetch_serial) {
+ rot->fetch_next_is_available++;
+ if (!rot->fetch_next_serial || !rot->fetch_next_origin) {
+ rot->fetch_next_serial = remote_dent->d_serial;
+ rot->fetch_next_origin = brick_strdup(remote_dent->d_rest);
+ } else if (
+ rot->fetch_next_serial == remote_dent->d_serial && strcmp(
+ rot->fetch_next_origin, remote_dent->d_rest)) {
+ rot->split_brain_round = 0;
+ rot->split_brain_serial = remote_dent->d_serial;
+ XIO_WRN(
+ "SPLIT BRAIN logfiles from '%s' and '%s' with same serial number %d detected!\n",
+ rot->fetch_next_origin, remote_dent->d_rest, rot->split_brain_serial);
+ }
+ }
+
+ /* check whether connection is allowed */
+ switch_path = path_make("%s/todo-%s/connect", parent->d_path, my_id());
+
+ /* check whether copy is necessary */
+ fetch_brick = rot->fetch_brick;
+ XIO_DBG(
+ "fetch_brick = %p (remote '%s' %d) fetch_serial = %d\n",
+ fetch_brick,
+ remote_dent->d_path,
+ remote_dent->d_serial,
+ rot->fetch_serial);
+ if (fetch_brick) {
+ if (remote_dent->d_serial == rot->fetch_serial && rot->fetch_peer && !strcmp(peer, rot->fetch_peer)) {
+ /* treat copy brick instance underway */
+ status = _update_file(
+ parent, switch_path, rot->fetch_path, remote_dent->d_path, peer, src_size);
+ XIO_DBG("re-update '%s' from peer '%s' status = %d\n", remote_dent->d_path, peer, status);
+ }
+ } else if (!rot->fetch_serial && rot->allow_update &&
+ !rot->is_primary && !rot->old_is_primary &&
+ (!rot->preferred_peer || !strcmp(rot->preferred_peer, peer)) &&
+ (!rot->split_brain_serial || remote_dent->d_serial < rot->split_brain_serial) &&
+ (dst_size < src_size || !local_dent)) {
+ /* start copy brick instance */
+ status = _update_file(parent, switch_path, rot->fetch_path, remote_dent->d_path, peer, src_size);
+ XIO_DBG("update '%s' from peer '%s' status = %d\n", remote_dent->d_path, peer, status);
+ if (likely(status >= 0)) {
+ rot->fetch_serial = remote_dent->d_serial;
+ rot->fetch_next_is_available = 0;
+ brick_string_free(rot->fetch_peer);
+ rot->fetch_peer = brick_strdup(peer);
+ }
+ } else {
+ XIO_DBG(
+ "allow_update = %d src_size = %lld dst_size = %lld local_dent = %p\n",
+ rot->allow_update,
+ src_size,
+ dst_size,
+ local_dent);
+ }
+
+done:
+ brick_string_free(switch_path);
+ return status;
+}
+
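+/* Replicate a single remote dentry into the local /mars/ tree:
+ * directories are created, symlinks are copied when the remote ones
+ * are newer, and peer logfiles are handed over to check_logfile().
+ * Deletion markers prevent races with concurrent delete operations.
+ */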
+static
+int run_bone(struct mars_peerinfo *peer, struct mars_dent *remote_dent)
+{
+ int status = 0;
+ struct kstat local_stat = {};
+ const char *marker_path = NULL;
+ bool stat_ok;
+ bool update_mtime = true;
+ bool update_ctime = true;
+ bool run_trigger = false;
+
+ if (!strncmp(remote_dent->d_name, ".tmp", 4))
+ goto done;
+ if (!strncmp(remote_dent->d_name, ".deleted-", 9))
+ goto done;
+ if (!strncmp(remote_dent->d_name, "ignore", 6))
+ goto done;
+
+ /* create / check markers (prevent concurrent updates) */
+ if (remote_dent->link_val && !strncmp(remote_dent->d_path, "/mars/todo-global/delete-", 25)) {
+ marker_path = backskip_replace(remote_dent->link_val, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &local_stat, true) < 0 ||
+ timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) > 0) {
+ XIO_DBG(
+ "creating / updating marker '%s' mtime=%lu.%09lu\n",
+ marker_path, remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec);
+ mars_symlink("1", marker_path, &remote_dent->stat_val.mtime, 0);
+ }
+ if (remote_dent->d_serial < peer->global->deleted_my_border) {
+ XIO_DBG(
+ "ignoring deletion '%s' at border %d\n", remote_dent->d_path, peer->global->deleted_my_border);
+ goto done;
+ }
+ } else {
+ /* check marker preventing concurrent updates from remote hosts when deletes are in progress */
+ marker_path = backskip_replace(remote_dent->d_path, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &local_stat, true) >= 0) {
+ if (timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) <= 0) {
+ XIO_DBG(
+ "marker '%s' exists, ignoring '%s' (new mtime=%lu.%09lu, marker mtime=%lu.%09lu)\n",
+ marker_path, remote_dent->d_path,
+ remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec,
+ local_stat.mtime.tv_sec, local_stat.mtime.tv_nsec);
+ goto done;
+ } else {
+ XIO_DBG(
+ "marker '%s' exists, overwriting '%s' (new mtime=%lu.%09lu, marker mtime=%lu.%09lu)\n",
+ marker_path, remote_dent->d_path,
+ remote_dent->stat_val.mtime.tv_sec, remote_dent->stat_val.mtime.tv_nsec,
+ local_stat.mtime.tv_sec, local_stat.mtime.tv_nsec);
+ }
+ }
+ }
+
+ status = mars_stat(remote_dent->d_path, &local_stat, true);
+ stat_ok = (status >= 0);
+
+ if (stat_ok) {
+ update_mtime = timespec_compare(&remote_dent->stat_val.mtime, &local_stat.mtime) > 0;
+ update_ctime = timespec_compare(&remote_dent->stat_val.ctime, &local_stat.ctime) > 0;
+ }
+
+ if (S_ISDIR(remote_dent->stat_val.mode)) {
+ if (!_is_usable_dir(remote_dent->d_name)) {
+ XIO_DBG("ignoring directory '%s'\n", remote_dent->d_path);
+ goto done;
+ }
+ if (!stat_ok) {
+ status = mars_mkdir(remote_dent->d_path);
+ XIO_DBG("create directory '%s' status = %d\n", remote_dent->d_path, status);
+ }
+ } else if (S_ISLNK(remote_dent->stat_val.mode) && remote_dent->link_val) {
+ if (!stat_ok || update_mtime) {
+ status = mars_symlink(
+ remote_dent->link_val, remote_dent->d_path, &remote_dent->stat_val.mtime, __kuid_val(
+ remote_dent->stat_val.uid));
+ XIO_DBG(
+ "create symlink '%s' -> '%s' status = %d\n",
+ remote_dent->d_path,
+ remote_dent->link_val,
+ status);
+ run_trigger = true;
+ }
+ } else if (S_ISREG(remote_dent->stat_val.mode) && _is_peer_logfile(remote_dent->d_name, my_id())) {
+ const char *parent_path = backskip_replace(remote_dent->d_path, '/', false, "");
+
+ if (likely(parent_path)) {
+ struct mars_dent *parent = mars_find_dent(peer->global, parent_path);
+
+ if (unlikely(!parent)) {
+ XIO_DBG("ignoring non-existing local resource '%s'\n", parent_path);
+ /* don't copy old / outdated logfiles */
+ } else {
+ struct mars_rotate *rot;
+
+ rot = parent->d_private;
+ if (rot && rot->relevant_serial > remote_dent->d_serial) {
+ XIO_DBG(
+ "ignoring outdated remote logfile '%s' behind %d\n",
+ remote_dent->d_path, rot->relevant_serial);
+ } else {
+ struct mars_dent *local_dent;
+
+ local_dent = mars_find_dent(peer->global, remote_dent->d_path);
+ status = check_logfile(
+ peer->peer, remote_dent, local_dent, parent, local_stat.size);
+ }
+ }
+ brick_string_free(parent_path);
+ }
+ } else {
+ XIO_DBG("ignoring '%s'\n", remote_dent->d_path);
+ }
+
+done:
+ brick_string_free(marker_path);
+ if (status >= 0)
+ status = run_trigger ? 1 : 0;
+ return status;
+}
+
+static
+int run_bones(struct mars_peerinfo *peer)
+{
+ LIST_HEAD(tmp_list);
+ struct list_head *tmp;
+ unsigned long flags;
+ bool run_trigger = false;
+ int status = 0;
+
+ spin_lock_irqsave(&peer->lock, flags);
+ list_replace_init(&peer->remote_dent_list, &tmp_list);
+ spin_unlock_irqrestore(&peer->lock, flags);
+
+ XIO_DBG("remote_dent_list list_empty = %d\n", list_empty(&tmp_list));
+
+ for (tmp = tmp_list.next; tmp != &tmp_list; tmp = tmp->next) {
+ struct mars_dent *remote_dent = container_of(tmp, struct mars_dent, dent_link);
+
+ if (!remote_dent->d_path || !remote_dent->d_name) {
+ XIO_DBG("NULL\n");
+ continue;
+ }
+ status = run_bone(peer, remote_dent);
+ if (status > 0)
+ run_trigger = true;
+ /* XIO_DBG("path = '%s' worker status = %d\n", remote_dent->d_path, status); */
+ }
+
+ xio_free_dent_all(NULL, &tmp_list);
+
+ if (run_trigger)
+ local_trigger();
+ return status;
+}
+
+/*********************************************************************/
+
+/* remote working infrastructure */
+
+static
+void _peer_cleanup(struct mars_peerinfo *peer)
+{
+ XIO_DBG("cleanup\n");
+ if (xio_socket_is_alive(&peer->socket)) {
+ XIO_DBG("really shutdown socket\n");
+ xio_shutdown_socket(&peer->socket);
+ }
+ xio_put_socket(&peer->socket);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(remote_event);
+
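+/* One peer thread runs per known cluster node. Each cycle it
+ * (re)establishes the socket when necessary, sends CMD_NOTIFY when a
+ * local change must be propagated, fetches the remote dentry list via
+ * CMD_GETENTS, verifies that both sides belong to the same cluster
+ * (UUID check), and publishes the received dentries for run_bones().
+ */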
+static
+int peer_thread(void *data)
+{
+ struct mars_peerinfo *peer = data;
+ const char *real_peer;
+ struct sockaddr_storage src_sockaddr;
+ struct sockaddr_storage dst_sockaddr;
+
+ struct key_value_pair peer_pairs[] = {
+ { peer ? peer->peer : NULL },
+ { NULL }
+ };
+ int pause_time = 0;
+ bool do_kill = false;
+ int status;
+
+ if (!peer)
+ return -1;
+
+ real_peer = xio_translate_hostname(peer->peer);
+ XIO_INF("-------- peer thread starting on peer '%s' (%s)\n", peer->peer, real_peer);
+
+ status = xio_create_sockaddr(&src_sockaddr, my_id());
+ if (unlikely(status < 0)) {
+ XIO_ERR("unusable local address '%s' (%s)\n", real_peer, peer->peer);
+ goto done;
+ }
+
+ status = xio_create_sockaddr(&dst_sockaddr, real_peer);
+ if (unlikely(status < 0)) {
+ XIO_ERR("unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ goto done;
+ }
+
+ while (!brick_thread_should_stop()) {
+ struct mars_global tmp_global = {
+ .dent_anchor = LIST_HEAD_INIT(tmp_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(tmp_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(tmp_global.main_event),
+ };
+ LIST_HEAD(old_list);
+ unsigned long flags;
+
+ struct xio_cmd cmd = {
+ .cmd_str1 = peer->path,
+ .cmd_int1 = peer->maxdepth,
+ };
+
+ init_rwsem(&tmp_global.dent_mutex);
+ init_rwsem(&tmp_global.brick_mutex);
+
+ show_vals(peer_pairs, "/mars", "connection-from-");
+
+ if (!xio_socket_is_alive(&peer->socket)) {
+ make_msg(peer_pairs, "connection to '%s' (%s) is dead", peer->peer, real_peer);
+ brick_string_free(real_peer);
+ real_peer = xio_translate_hostname(peer->peer);
+ status = xio_create_sockaddr(&dst_sockaddr, real_peer);
+ if (unlikely(status < 0)) {
+ XIO_ERR("unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ make_msg(peer_pairs, "unusable remote address '%s' (%s)\n", real_peer, peer->peer);
+ brick_msleep(1000);
+ continue;
+ }
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ brick_msleep(1000);
+ continue;
+ }
+ if (!xio_net_is_alive) {
+ brick_msleep(1000);
+ continue;
+ }
+
+ status = xio_create_socket(&peer->socket, &src_sockaddr, &dst_sockaddr, &repl_tcp_params);
+ if (unlikely(status < 0)) {
+ XIO_INF(
+ "no connection to mars module on '%s' (%s) status = %d\n",
+ peer->peer,
+ real_peer,
+ status);
+ make_msg(
+ peer_pairs,
+ "connection to '%s' (%s) could not be established: status = %d",
+ peer->peer,
+ real_peer,
+ status);
+ brick_msleep(2000);
+ continue;
+ }
+ do_kill = true;
+ peer->socket.s_shutdown_on_err = true;
+ peer->socket.s_send_abort = mars_peer_abort;
+ peer->socket.s_recv_abort = mars_peer_abort;
+ XIO_DBG("successfully opened socket to '%s'\n", real_peer);
+ brick_msleep(100);
+ continue;
+ } else {
+ const char *new_peer;
+
+ /* check whether IP assignment has changed */
+ new_peer = xio_translate_hostname(peer->peer);
+ XIO_INF(
+ "socket alive = %d, resolved peer address '%s' (was '%s')\n",
+ xio_socket_is_alive(&peer->socket),
+ new_peer, real_peer);
+ if (new_peer && real_peer && strcmp(new_peer, real_peer))
+ xio_shutdown_socket(&peer->socket);
+ brick_string_free(new_peer);
+ }
+
+ if (peer->from_remote_trigger) {
+ pause_time = 0;
+ peer->from_remote_trigger = false;
+ XIO_DBG("got notify from peer.\n");
+ }
+
+ status = 0;
+ if (peer->to_remote_trigger) {
+ pause_time = 0;
+ peer->to_remote_trigger = false;
+ XIO_DBG("sending notify to peer...\n");
+ cmd.cmd_code = CMD_NOTIFY;
+ status = xio_send_struct(&peer->socket, &cmd, xio_cmd_meta);
+ }
+
+ if (likely(status >= 0)) {
+ cmd.cmd_code = CMD_GETENTS;
+ status = xio_send_struct(&peer->socket, &cmd, xio_cmd_meta);
+ }
+ if (unlikely(status < 0)) {
+ XIO_WRN("communication error on send, status = %d\n", status);
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ }
+ brick_msleep(1000);
+ continue;
+ }
+
+ XIO_DBG("fetching remote dentry list\n");
+ status = xio_recv_dent_list(&peer->socket, &tmp_global.dent_anchor);
+ if (unlikely(status < 0)) {
+ XIO_WRN("communication error on receive, status = %d\n", status);
+ if (do_kill) {
+ do_kill = false;
+ _peer_cleanup(peer);
+ }
+ goto free_and_restart;
+ }
+
+ if (likely(!list_empty(&tmp_global.dent_anchor))) {
+ struct mars_dent *peer_uuid;
+ struct mars_dent *my_uuid;
+
+ XIO_DBG("got remote denties\n");
+
+ peer_uuid = mars_find_dent(&tmp_global, "/mars/uuid");
+ if (unlikely(!peer_uuid || !peer_uuid->link_val)) {
+ XIO_ERR("peer %s has no uuid\n", peer->peer);
+ make_msg(peer_pairs, "peer has no UUID");
+ goto free_and_restart;
+ }
+ my_uuid = mars_find_dent(mars_global, "/mars/uuid");
+ if (unlikely(!my_uuid || !my_uuid->link_val)) {
+ XIO_ERR("cannot determine my own uuid for peer %s\n", peer->peer);
+ make_msg(peer_pairs, "cannot determine my own uuid");
+ goto free_and_restart;
+ }
+ if (unlikely(strcmp(peer_uuid->link_val, my_uuid->link_val))) {
+ XIO_ERR(
+ "UUID mismatch for peer %s, you are trying to communicate with a foreign cluster!\n",
+ peer->peer);
+ make_msg(
+ peer_pairs,
+ "UUID mismatch, own cluster '%s' is trying to communicate with a foreign cluster '%s'",
+ my_uuid->link_val, peer_uuid->link_val);
+ goto free_and_restart;
+ }
+
+ make_msg(peer_pairs, "CONNECTED %s(%s)", peer->peer, real_peer);
+
+ spin_lock_irqsave(&peer->lock, flags);
+
+ list_replace_init(&peer->remote_dent_list, &old_list);
+ list_replace_init(&tmp_global.dent_anchor, &peer->remote_dent_list);
+
+ spin_unlock_irqrestore(&peer->lock, flags);
+
+ peer->last_remote_jiffies = jiffies;
+
+ local_trigger();
+
+ xio_free_dent_all(NULL, &old_list);
+ }
+
+ brick_msleep(100);
+ if (!brick_thread_should_stop()) {
+ if (pause_time < mars_propagate_interval)
+ pause_time++;
+ wait_event_interruptible_timeout(
+ remote_event,
+ (peer->to_remote_trigger | peer->from_remote_trigger) ||
+ (mars_global && mars_global->main_trigger),
+ pause_time * HZ);
+ }
+ continue;
+
+free_and_restart:
+ xio_free_dent_all(NULL, &tmp_global.dent_anchor);
+ brick_msleep(2000);
+ }
+
+ XIO_INF("-------- peer thread terminating\n");
+
+ make_msg(peer_pairs, "NOT connected %s(%s)", peer->peer, real_peer);
+ show_vals(peer_pairs, "/mars", "connection-from-");
+
+ if (do_kill)
+ _peer_cleanup(peer);
+
+done:
+ clear_vals(peer_pairs);
+ brick_string_free(real_peer);
+ return 0;
+}
+
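+/* Publish this node's liveness in the symlink tree:
+ * the current Lamport clock goes into the "time" alivelink,
+ * the global power button state into "alive", and the
+ * symlink tree version into "tree".
+ */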
+static
+void _make_alive(void)
+{
+ struct timespec now;
+ char *tmp;
+
+ get_lamport(&now);
+ tmp = path_make("%ld.%09ld", now.tv_sec, now.tv_nsec);
+ if (likely(tmp)) {
+ _make_alivelink_str("time", tmp);
+ brick_string_free(tmp);
+ }
+ _make_alivelink("alive", mars_global && mars_global->global_power.button ? 1 : 0);
+ _make_alivelink_str("tree", SYMLINK_TREE_VERSION);
+}
+
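+/* Mark all known peers as remotely triggered and wake up the
+ * peer threads, so the next communication round starts without
+ * waiting for the full propagation interval.
+ */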
+void from_remote_trigger(void)
+{
+ struct list_head *tmp;
+ int count = 0;
+ unsigned long flags;
+
+ _make_alive();
+
+ read_lock_irqsave(&peer_lock, flags);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ peer->from_remote_trigger = true;
+ count++;
+ }
+ read_unlock_irqrestore(&peer_lock, flags);
+
+ XIO_DBG("got trigger for %d peers\n", count);
+ wake_up_interruptible_all(&remote_event);
+}
+
+static
+void __remote_trigger(void)
+{
+ struct list_head *tmp;
+ int count = 0;
+ unsigned long flags;
+
+ read_lock_irqsave(&peer_lock, flags);
+ for (tmp = peer_anchor.next; tmp != &peer_anchor; tmp = tmp->next) {
+ struct mars_peerinfo *peer = container_of(tmp, struct mars_peerinfo, peer_head);
+
+ peer->to_remote_trigger = true;
+ count++;
+ }
+ read_unlock_irqrestore(&peer_lock, flags);
+
+ XIO_DBG("triggered %d peers\n", count);
+ wake_up_interruptible_all(&remote_event);
+}
+
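+/* Global shutdown is only admissible when no shadow buffers are
+ * in use and no IO requests have been flying for three samples
+ * taken 30ms apart; otherwise the delay reason is reported.
+ */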
+static
+bool is_shutdown(void)
+{
+ bool res = false;
+ int used = atomic_read(&global_mshadow_count);
+
+ if (used > 0) {
+ XIO_INF(
+ "global shutdown delayed: there are %d buffers in use, occupying %ld bytes\n",
+ used,
+ atomic64_read(&global_mshadow_used));
+ } else {
+ int rounds = 3;
+
+ while ((used = atomic_read(&xio_global_io_flying)) <= 0) {
+ if (--rounds <= 0) {
+ res = true;
+ break;
+ }
+ brick_msleep(30);
+ }
+ if (!res)
+ XIO_INF("global shutdown delayed: there are %d IO requests flying\n", used);
+ }
+ return res;
+}
+
+/*********************************************************************/
+
+/* helpers for worker functions */
+
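+/* Tear down a peer: unhook it from the global peer list, stop
+ * its thread, and free the remote dent list as well as the
+ * peer and path strings.
+ */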
+static int _kill_peer(struct mars_global *global, struct mars_peerinfo *peer)
+{
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+
+ if (!peer)
+ return 0;
+
+ write_lock_irqsave(&peer_lock, flags);
+ list_del_init(&peer->peer_head);
+ write_unlock_irqrestore(&peer_lock, flags);
+
+ XIO_INF("stopping peer thread...\n");
+ if (peer->peer_thread) {
+ brick_thread_stop(peer->peer_thread);
+ peer->peer_thread = NULL;
+ }
+ spin_lock_irqsave(&peer->lock, flags);
+ list_replace_init(&peer->remote_dent_list, &tmp_list);
+ spin_unlock_irqrestore(&peer->lock, flags);
+ xio_free_dent_all(NULL, &tmp_list);
+ brick_string_free(peer->peer);
+ brick_string_free(peer->path);
+ return 0;
+}
+
+static
+void peer_destruct(void *_peer)
+{
+ struct mars_peerinfo *peer = _peer;
+
+ if (likely(peer))
+ _kill_peer(peer->global, peer);
+}
+
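+/* Lazily instantiate a peer from its dent on first sight and
+ * start the corresponding peer thread when necessary. The dent
+ * list fetched by that thread is then processed here in the
+ * main thread via run_bones().
+ */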
+static int _make_peer(struct mars_global *global, struct mars_dent *dent, char *path)
+{
+ static int serial;
+ struct mars_peerinfo *peer;
+ char *mypeer;
+ char *parent_path;
+ int status = 0;
+
+ if (unlikely(!global ||
+ !dent || !dent->link_val || !dent->d_parent)) {
+ XIO_DBG("cannot work\n");
+ return 0;
+ }
+ parent_path = dent->d_parent->d_path;
+ if (unlikely(!parent_path)) {
+ XIO_DBG("cannot work\n");
+ return 0;
+ }
+ mypeer = dent->d_rest;
+ if (!mypeer) {
+ status = _parse_args(dent, dent->link_val, 1);
+ if (status < 0)
+ goto done;
+ mypeer = dent->d_argv[0];
+ }
+
+ XIO_DBG("peer '%s'\n", mypeer);
+ if (!dent->d_private) {
+ unsigned long flags;
+
+ dent->d_private = brick_zmem_alloc(sizeof(struct mars_peerinfo));
+ dent->d_private_destruct = peer_destruct;
+ peer = dent->d_private;
+ peer->global = global;
+ peer->peer = brick_strdup(mypeer);
+ peer->path = brick_strdup(path);
+ peer->maxdepth = 2;
+ spin_lock_init(&peer->lock);
+ INIT_LIST_HEAD(&peer->peer_head);
+ INIT_LIST_HEAD(&peer->remote_dent_list);
+
+ write_lock_irqsave(&peer_lock, flags);
+ list_add_tail(&peer->peer_head, &peer_anchor);
+ write_unlock_irqrestore(&peer_lock, flags);
+ }
+
+ peer = dent->d_private;
+ if (!peer->peer_thread) {
+ peer->peer_thread = brick_thread_create(peer_thread, peer, "mars_peer%d", serial++);
+ if (unlikely(!peer->peer_thread)) {
+ XIO_ERR("cannot start peer thread\n");
+ return -1;
+ }
+ XIO_DBG("started peer thread\n");
+ }
+
+ /* This must be called by the main thread in order to
+ * avoid nasty races.
+ * The peer thread does nothing but fetching the dent list.
+ */
+ status = run_bones(peer);
+
+done:
+ return status;
+}
+
+static int kill_scan(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_peerinfo *peer = dent->d_private;
+ int res;
+
+ if (!global || global->global_power.button || !peer)
+ return 0;
+ dent->d_private = NULL;
+ res = _kill_peer(global, peer);
+ brick_mem_free(peer);
+ return res;
+}
+
+static int make_scan(void *buf, struct mars_dent *dent)
+{
+ XIO_DBG("path = '%s' peer = '%s'\n", dent->d_path, dent->d_rest);
+ /* don't connect to myself */
+ if (!strcmp(dent->d_rest, my_id()))
+ return 0;
+ return _make_peer(buf, dent, "/mars");
+}
+
+static
+int kill_any(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct list_head *tmp;
+
+ if (global->global_power.button || !is_shutdown())
+ return 0;
+
+ for (tmp = dent->brick_list.next; tmp != &dent->brick_list; tmp = tmp->next) {
+ struct xio_brick *brick = container_of(tmp, struct xio_brick, dent_brick_link);
+
+ if (brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected) {
+ XIO_DBG(
+ "cannot kill dent '%s' because brick '%s' is wired\n", dent->d_path, brick->brick_path);
+ return 0;
+ }
+ }
+
+ XIO_DBG("killing dent = '%s'\n", dent->d_path);
+ xio_kill_dent(dent);
+ return 1;
+}
+
+/*********************************************************************/
+
+/* handlers / helpers for logfile rotation */
+
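+/* Create an empty logfile. Thanks to O_EXCL, concurrent
+ * creation attempts are harmless: an already existing file is
+ * only reported, not treated as an error.
+ */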
+static
+void _create_new_logfile(const char *path)
+{
+ struct file *f;
+ const int flags = O_RDWR | O_CREAT | O_EXCL;
+ const int prot = 0600;
+
+ mm_segment_t oldfs;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ f = filp_open(path, flags, prot);
+ set_fs(oldfs);
+ if (IS_ERR(f)) {
+ int err = PTR_ERR(f);
+
+ if (err == -EEXIST)
+ XIO_INF("logfile '%s' already exists\n", path);
+ else
+ XIO_ERR("could not create logfile '%s' status = %d\n", path, err);
+ } else {
+ XIO_DBG("created empty logfile '%s'\n", path);
+ mars_sync();
+ _crashme(10, false);
+ filp_close(f, NULL);
+ local_trigger();
+ }
+}
+
+static
+const char *__get_link_path(const char *_linkpath, const char **linkpath)
+{
+ const char *res = mars_readlink(_linkpath);
+
+ if (linkpath)
+ *linkpath = _linkpath;
+ else
+ brick_string_free(_linkpath);
+ return res;
+}
+
+static
+const char *get_replaylink(const char *parent_path, const char *host, const char **linkpath)
+{
+ const char * _linkpath = path_make("%s/replay-%s", parent_path, host);
+
+ return __get_link_path(_linkpath, linkpath);
+}
+
+static
+const char *get_versionlink(const char *parent_path, int seq, const char *host, const char **linkpath)
+{
+ const char * _linkpath = path_make("%s/version-%09d-%s", parent_path, seq, host);
+
+ return __get_link_path(_linkpath, linkpath);
+}
+
+static inline
+int _get_tolerance(struct mars_rotate *rot)
+{
+ if (rot->is_log_damaged)
+ return REPLAY_TOLERANCE;
+ return 0;
+}
+
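+/* Check whether logrotate switchover from old_log_path to
+ * new_log_path is safe. Version links have the form
+ * <checksum>,<logfile-name>,<position> (cf. the emergency link
+ * created in make_log_finalize()). The checks are:
+ * (a) no split brain has been detected so far,
+ * (b) both logfile names parse and their sequence numbers
+ * are contiguous,
+ * (c) my own versionlink equals the old one, i.e. the old
+ * logfile has been fully transferred,
+ * (d) my own replaylink agrees with my versionlink within
+ * replay_tolerance, i.e. replay has finished,
+ * (e) unless skip_new: the new versionlink is based on the
+ * old one; any mismatch indicates potential split brain.
+ */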
+static
+bool is_switchover_possible(
+struct mars_rotate *rot, const char *old_log_path, const char *new_log_path, int replay_tolerance, bool skip_new)
+{
+ const char *old_log_name = old_log_path + skip_dir(old_log_path);
+ const char *new_log_name = new_log_path + skip_dir(new_log_path);
+ const char *old_host = NULL;
+ const char *new_host = NULL;
+ const char *own_versionlink_path = NULL;
+ const char *old_versionlink_path = NULL;
+ const char *new_versionlink_path = NULL;
+ const char *own_versionlink = NULL;
+ const char *old_versionlink = NULL;
+ const char *new_versionlink = NULL;
+ const char *own_replaylink_path = NULL;
+ const char *own_replaylink = NULL;
+ loff_t own_r_val;
+ loff_t own_v_val;
+ loff_t own_r_tail;
+ int old_log_seq;
+ int new_log_seq;
+ int own_r_offset;
+ int own_v_offset;
+ int own_r_len;
+ int own_v_len;
+ int len1;
+ int len2;
+ int offs2;
+ char dummy = 0;
+
+ bool res = false;
+
+ XIO_DBG(
+ "old_log = '%s' new_log = '%s' toler = %d skip_new = %d\n",
+ old_log_path, new_log_path, replay_tolerance, skip_new);
+
+ /* check precondition: is split brain already for sure? */
+ if (unlikely(rot->has_double_logfile)) {
+ XIO_WRN_TO(
+ rot->log_say,
+ "SPLIT BRAIN detected: multiple logfiles with sequence number %d exist\n",
+ rot->next_relevant_log->d_serial);
+ make_rot_msg(
+ rot,
+ "err-splitbrain-detected",
+ "SPLIT BRAIN detected: multiple logfiles with sequence number %d exist\n",
+ rot->next_relevant_log->d_serial);
+ goto done;
+ }
+
+ /* parse the names */
+ if (unlikely(!parse_logfile_name(old_log_name, &old_log_seq, &old_host))) {
+ make_rot_msg(rot, "err-bad-log-name", "logfile name '%s' cannot be parsed", old_log_name);
+ goto done;
+ }
+ if (unlikely(!parse_logfile_name(new_log_name, &new_log_seq, &new_host))) {
+ make_rot_msg(rot, "err-bad-log-name", "logfile name '%s' cannot be parsed", new_log_name);
+ goto done;
+ }
+
+ /* check: are the sequence numbers contiguous? */
+ if (unlikely(new_log_seq != old_log_seq + 1)) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "logfile sequence numbers are not contiguous (%d != %d + 1), old_log_path='%s' new_log_path='%s'\n",
+ new_log_seq,
+ old_log_seq,
+ old_log_path,
+ new_log_path);
+ make_rot_msg(
+ rot,
+ "err-log-not-contiguous",
+ "logfile sequence numbers are not contiguous (%d != %d + 1) old_log_path='%s' new_log_path='%s'",
+ new_log_seq,
+ old_log_seq,
+ old_log_path,
+ new_log_path);
+ goto done;
+ }
+
+ /* fetch all the versionlinks and test for their existence. */
+ own_versionlink = get_versionlink(rot->parent_path, old_log_seq, my_id(), &own_versionlink_path);
+ if (unlikely(!own_versionlink || !own_versionlink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read my own versionlink '%s'\n", own_versionlink_path);
+ make_rot_msg(
+ rot, "err-versionlink-not-readable", "cannot read my own versionlink '%s'", own_versionlink_path);
+ goto done;
+ }
+ old_versionlink = get_versionlink(rot->parent_path, old_log_seq, old_host, &old_versionlink_path);
+ if (unlikely(!old_versionlink || !old_versionlink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read old versionlink '%s'\n", old_versionlink_path);
+ make_rot_msg(
+ rot, "err-versionlink-not-readable", "cannot read old versionlink '%s'", old_versionlink_path);
+ goto done;
+ }
+ if (!skip_new && strcmp(new_host, my_id())) {
+ new_versionlink = get_versionlink(rot->parent_path, new_log_seq, new_host, &new_versionlink_path);
+ if (unlikely(!new_versionlink || !new_versionlink[0])) {
+ XIO_INF_TO(
+ rot->log_say,
+ "new versionlink '%s' does not yet exist, we must wait for it.\n",
+ new_versionlink_path);
+ make_rot_msg(
+ rot,
+ "inf-versionlink-not-yet-exist",
+ "we must wait for new versionlink '%s'",
+ new_versionlink_path);
+ goto done;
+ }
+ }
+
+ /* check: are the versionlinks correct? */
+ if (unlikely(strcmp(own_versionlink, old_versionlink))) {
+ XIO_INF_TO(
+ rot->log_say,
+ "old logfile is not yet completeley transferred, own_versionlink '%s' -> '%s' != old_versionlink '%s' -> '%s'\n",
+ own_versionlink_path,
+ own_versionlink,
+ old_versionlink_path,
+ old_versionlink);
+ make_rot_msg(
+ rot,
+ "inf-versionlink-not-equal",
+ "old logfile is not yet completeley transferred (own_versionlink '%s' -> '%s' != old_versionlink '%s' -> '%s')",
+ own_versionlink_path,
+ own_versionlink,
+ old_versionlink_path,
+ old_versionlink);
+ goto done;
+ }
+
+ /* check: did I fully replay my old logfile data? */
+ own_replaylink = get_replaylink(rot->parent_path, my_id(), &own_replaylink_path);
+ if (unlikely(!own_replaylink || !own_replaylink[0])) {
+ XIO_ERR_TO(rot->log_say, "cannot read my own replaylink '%s'\n", own_replaylink_path);
+ goto done;
+ }
+ own_r_len = skip_part(own_replaylink);
+ own_v_offset = skip_part(own_versionlink);
+ if (unlikely(!own_versionlink[own_v_offset++])) {
+ XIO_ERR_TO(
+ rot->log_say, "own version link '%s' -> '%s' is malformed\n", own_versionlink_path, own_versionlink);
+ make_rot_msg(
+ rot,
+ "err-replaylink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ own_v_len = skip_part(own_versionlink + own_v_offset);
+ if (unlikely(own_r_len != own_v_len ||
+ strncmp(own_replaylink, own_versionlink + own_v_offset, own_r_len))) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "internal problem: logfile name mismatch between '%s' and '%s'\n",
+ own_replaylink,
+ own_versionlink);
+ make_rot_msg(
+ rot,
+ "err-bad-log-name",
+ "internal problem: logfile name mismatch between '%s' and '%s'",
+ own_replaylink,
+ own_versionlink);
+ goto done;
+ }
+ if (unlikely(!own_replaylink[own_r_len])) {
+ XIO_ERR_TO(
+ rot->log_say, "own replay link '%s' -> '%s' is malformed\n", own_replaylink_path, own_replaylink);
+ make_rot_msg(
+ rot,
+ "err-replaylink-not-readable",
+ "own replay link '%s' -> '%s' is malformed",
+ own_replaylink_path,
+ own_replaylink);
+ goto done;
+ }
+ own_r_offset = own_r_len + 1;
+ if (unlikely(!own_versionlink[own_v_len])) {
+ XIO_ERR_TO(
+ rot->log_say, "own version link '%s' -> '%s' is malformed\n", own_versionlink_path, own_versionlink);
+ make_rot_msg(
+ rot,
+ "err-versionlink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ own_v_offset += own_r_len + 1;
+ own_r_len = skip_part(own_replaylink + own_r_offset);
+ own_v_len = skip_part(own_versionlink + own_v_offset);
+ own_r_val = 0;
+ own_v_val = 0;
+ own_r_tail = 0;
+ if (sscanf(own_replaylink + own_r_offset, "%lld,%lld", &own_r_val, &own_r_tail) != 2) {
+ XIO_ERR_TO(
+ rot->log_say, "own replay link '%s' -> '%s' is malformed\n", own_replaylink_path, own_replaylink);
+ make_rot_msg(
+ rot,
+ "err-replaylink-not-readable",
+ "own replay link '%s' -> '%s' is malformed",
+ own_replaylink_path,
+ own_replaylink);
+ goto done;
+ }
+ /* SSCANF_TO_KSTRTO: kstrtos64 does not work because of the next char */
+ if (sscanf(own_versionlink + own_v_offset, "%lld%c", &own_v_val, &dummy) != 2) {
+ XIO_ERR_TO(
+ rot->log_say, "own version link '%s' -> '%s' is malformed\n", own_versionlink_path, own_versionlink);
+ make_rot_msg(
+ rot,
+ "err-versionlink-not-readable",
+ "own version link '%s' -> '%s' is malformed",
+ own_versionlink_path,
+ own_versionlink);
+ goto done;
+ }
+ if (unlikely(own_r_val > own_v_val || own_r_val + replay_tolerance < own_v_val)) {
+ XIO_INF_TO(
+ rot->log_say,
+ "log replay is not yet finished: '%s' and '%s' are reporting different positions.\n",
+ own_replaylink,
+ own_versionlink);
+ make_rot_msg(
+ rot,
+ "inf-replay-not-yet-finished",
+ "log replay is not yet finished: '%s' and '%s' are reporting different positions",
+ own_replaylink,
+ own_versionlink);
+ goto done;
+ }
+
+ /* last check: is the new versionlink based on the old one? */
+ if (new_versionlink) {
+ len1 = skip_sect(own_versionlink);
+ offs2 = skip_sect(new_versionlink);
+ if (unlikely(!new_versionlink[offs2++])) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "new version link '%s' -> '%s' is malformed\n",
+ new_versionlink_path,
+ new_versionlink);
+ make_rot_msg(
+ rot,
+ "err-versionlink-not-readable",
+ "new version link '%s' -> '%s' is malformed",
+ new_versionlink_path,
+ new_versionlink);
+ goto done;
+ }
+ len2 = skip_sect(new_versionlink + offs2);
+ if (unlikely(len1 != len2 ||
+ strncmp(own_versionlink, new_versionlink + offs2, len1))) {
+ XIO_WRN_TO(
+ rot->log_say,
+ "VERSION MISMATCH old '%s' -> '%s' new '%s' -> '%s' ==(%d,%d) ===> check for SPLIT BRAIN!\n",
+ own_versionlink_path,
+ own_versionlink,
+ new_versionlink_path,
+ new_versionlink,
+ len1,
+ len2);
+ make_rot_msg(
+ rot,
+ "err-splitbrain-detected",
+ "VERSION MISMATCH old '%s' -> '%s' new '%s' -> '%s' ==(%d,%d) ===> check for SPLIT BRAIN",
+ own_versionlink_path,
+ own_versionlink,
+ new_versionlink_path,
+ new_versionlink,
+ len1,
+ len2);
+ goto done;
+ }
+ }
+
+ /* report success */
+ res = true;
+ XIO_DBG("VERSION OK '%s' -> '%s'\n", own_versionlink_path, own_versionlink);
+
+done:
+ brick_string_free(old_host);
+ brick_string_free(new_host);
+ brick_string_free(own_versionlink_path);
+ brick_string_free(old_versionlink_path);
+ brick_string_free(new_versionlink_path);
+ brick_string_free(own_versionlink);
+ brick_string_free(old_versionlink);
+ brick_string_free(new_versionlink);
+ brick_string_free(own_replaylink_path);
+ brick_string_free(own_replaylink);
+ return res;
+}
+
+static
+void rot_destruct(void *_rot)
+{
+ struct mars_rotate *rot = _rot;
+
+ if (likely(rot)) {
+ list_del_init(&rot->rot_head);
+ write_info_links(rot);
+ del_channel(rot->log_say);
+ rot->log_say = NULL;
+ brick_string_free(rot->fetch_path);
+ brick_string_free(rot->fetch_peer);
+ brick_string_free(rot->preferred_peer);
+ brick_string_free(rot->parent_path);
+ brick_string_free(rot->parent_rest);
+ brick_string_free(rot->fetch_next_origin);
+ rot->fetch_path = NULL;
+ rot->fetch_peer = NULL;
+ rot->preferred_peer = NULL;
+ rot->parent_path = NULL;
+ rot->parent_rest = NULL;
+ rot->fetch_next_origin = NULL;
+ clear_vals(rot->msgs);
+ }
+}
+
+/* This must be called once at every round of logfile checking.
+ */
+static
+int make_log_init(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct xio_brick *bio_brick;
+ struct xio_brick *aio_brick;
+ struct xio_brick *trans_brick;
+ struct mars_rotate *rot = parent->d_private;
+ struct mars_dent *replay_link;
+ struct mars_dent *aio_dent;
+ struct xio_output *output;
+ const char *parent_path;
+ const char *replay_path = NULL;
+ const char *aio_path = NULL;
+ bool switch_on;
+ int status = 0;
+
+ if (!global->global_power.button)
+ goto done;
+ status = -EINVAL;
+ CHECK_PTR(parent, done);
+ parent_path = parent->d_path;
+ CHECK_PTR(parent_path, done);
+
+ if (!rot) {
+ const char *fetch_path;
+
+ rot = brick_zmem_alloc(sizeof(struct mars_rotate));
+ spin_lock_init(&rot->inf_lock);
+ fetch_path = path_make("%s/logfile-update", parent_path);
+ if (unlikely(!fetch_path)) {
+ XIO_ERR("cannot create fetch_path\n");
+ brick_mem_free(rot);
+ status = -ENOMEM;
+ goto done;
+ }
+ rot->fetch_path = fetch_path;
+ rot->global = global;
+ parent->d_private = rot;
+ parent->d_private_destruct = rot_destruct;
+ list_add_tail(&rot->rot_head, &rot_anchor);
+ assign_keys(rot->msgs, rot_keys);
+ }
+
+ rot->replay_link = NULL;
+ rot->aio_dent = NULL;
+ rot->aio_brick = NULL;
+ rot->first_log = NULL;
+ rot->relevant_log = NULL;
+ rot->relevant_serial = 0;
+ rot->relevant_brick = NULL;
+ rot->next_relevant_log = NULL;
+ rot->prev_log = NULL;
+ rot->next_log = NULL;
+ brick_string_free(rot->fetch_next_origin);
+ rot->fetch_next_origin = NULL;
+ rot->max_sequence = 0;
+ /* reset the split brain detector only when no conflicts have appeared for a number of rounds */
+ if (rot->split_brain_serial && rot->split_brain_round++ > 3)
+ rot->split_brain_serial = 0;
+ rot->fetch_next_serial = 0;
+ rot->has_error = false;
+ rot->wants_sync = false;
+ rot->has_symlinks = true;
+ brick_string_free(rot->preferred_peer);
+ rot->preferred_peer = NULL;
+
+ if (dent->link_val) {
+ int status = kstrtos64(dent->link_val, 10, &rot->dev_size);
+
+ (void)status; /* leave as before in case of errors */
+ }
+ if (!rot->parent_path) {
+ rot->parent_path = brick_strdup(parent_path);
+ rot->parent_rest = brick_strdup(parent->d_rest);
+ }
+
+ if (unlikely(!rot->log_say)) {
+ char *name = path_make("%s/logstatus-%s", parent_path, my_id());
+
+ if (likely(name)) {
+ rot->log_say = make_channel(name, false);
+ brick_string_free(name);
+ }
+ }
+
+ write_info_links(rot);
+
+ /* Fetch the replay status symlink.
+ * It must exist, and its value will control everything.
+ */
+ replay_path = path_make("%s/replay-%s", parent_path, my_id());
+ if (unlikely(!replay_path)) {
+ XIO_ERR("cannot make path\n");
+ status = -ENOMEM;
+ goto done;
+ }
+
+ replay_link = (void *)mars_find_dent(global, replay_path);
+ if (unlikely(!replay_link || !replay_link->link_val)) {
+ XIO_DBG("replay status symlink '%s' does not exist (%p)\n", replay_path, replay_link);
+ rot->allow_update = false;
+ status = -ENOENT;
+ goto done;
+ }
+
+ status = _parse_args(replay_link, replay_link->link_val, 3);
+ if (unlikely(status < 0))
+ goto done;
+ rot->replay_link = replay_link;
+
+ /* Fetch AIO dentry of the logfile.
+ */
+ if (rot->trans_brick) {
+ struct trans_logger_input *trans_input = rot->trans_brick->inputs[rot->trans_brick->old_input_nr];
+
+ if (trans_input && trans_input->is_operating) {
+ aio_path = path_make(
+ "%s/log-%09d-%s", parent_path, trans_input->inf.inf_sequence, trans_input->inf.inf_host);
+ XIO_DBG(
+ "using logfile '%s' from trans_input %d (new=%d)\n",
+ aio_path,
+ rot->trans_brick->old_input_nr,
+ rot->trans_brick->log_input_nr);
+ }
+ }
+ if (!aio_path) {
+ aio_path = path_make("%s/%s", parent_path, replay_link->d_argv[0]);
+ XIO_DBG("using logfile '%s' from replay symlink\n", aio_path);
+ }
+ if (unlikely(!aio_path)) {
+ XIO_ERR("cannot make path\n");
+ status = -ENOMEM;
+ goto done;
+ }
+
+ aio_dent = (void *)mars_find_dent(global, aio_path);
+ if (unlikely(!aio_dent)) {
+ XIO_DBG("logfile '%s' does not exist\n", aio_path);
+ status = -ENOENT;
+ if (rot->todo_primary && !rot->is_primary && !rot->old_is_primary) {
+ int offset = strlen(aio_path) - strlen(my_id());
+
+ if (offset > 0 && aio_path[offset - 1] == '-' && !strcmp(aio_path + offset, my_id())) {
+ /* try to create an empty logfile */
+ _create_new_logfile(aio_path);
+ }
+ }
+ goto done;
+ }
+ rot->aio_dent = aio_dent;
+
+ /* check whether attach is allowed */
+ switch_on = _check_allow(global, parent, "attach");
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot start transaction logger: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ /* Fetch / make the AIO brick instance
+ */
+ aio_brick =
+ make_brick_all(
+ global,
+ aio_dent,
+ _set_sio_params,
+ NULL,
+ aio_path,
+ (const struct generic_brick_type *)&sio_brick_type,
+ (const struct generic_brick_type*[]){},
+ rot->trans_brick || switch_on ? 2 : -1,
+ "%s",
+ (const char *[]){},
+ 0,
+ aio_path);
+ rot->aio_brick = aio_brick;
+ status = 0;
+ if (unlikely(!aio_brick || !aio_brick->power.on_led))
+ goto done; /* this may happen in case of detach */
+ bio_brick = rot->bio_brick;
+ if (unlikely(!bio_brick || !bio_brick->power.on_led))
+ goto done; /* this may happen in case of detach */
+
+ /* Fetch the actual logfile size
+ */
+ output = aio_brick->outputs[0];
+ status = output->ops->xio_get_info(output, &rot->aio_info);
+ if (status < 0) {
+ XIO_ERR("cannot get info on '%s'\n", aio_path);
+ goto done;
+ }
+ XIO_DBG("logfile '%s' size = %lld\n", aio_path, rot->aio_info.current_size);
+
+ if (rot->is_primary &&
+ global_logrot_auto > 0 &&
+ unlikely(rot->aio_info.current_size >= (loff_t)global_logrot_auto * 1024 * 1024 * 1024)) {
+ char *new_path = path_make("%s/log-%09d-%s", parent_path, aio_dent->d_serial + 1, my_id());
+
+ if (likely(new_path && !mars_find_dent(global, new_path))) {
+ XIO_INF(
+ "old logfile size = %lld, creating new logfile '%s'\n", rot->aio_info.current_size, new_path);
+ _create_new_logfile(new_path);
+ }
+ brick_string_free(new_path);
+ }
+
+ /* Fetch / make the transaction logger.
+ * We deliberately "forget" to connect the log input here.
+ * Will be carried out later in make_log_step().
+ * The final switch-on will be started in make_log_finalize().
+ */
+ trans_brick =
+ make_brick_all(
+ global,
+ replay_link,
+ _set_trans_params,
+ NULL,
+ aio_path,
+ (const struct generic_brick_type *)&trans_logger_brick_type,
+ (const struct generic_brick_type *[]){NULL},
+ 1, /* create when necessary, but leave in current state otherwise */
+ "%s/replay-%s",
+ (const char *[]){"%s/data-%s"},
+ 1,
+ parent_path,
+ my_id(),
+ parent_path,
+ my_id());
+ rot->trans_brick = (void *)trans_brick;
+ status = -ENOENT;
+ if (!trans_brick)
+ goto done;
+ rot->trans_brick->kill_ptr = (void **)&rot->trans_brick;
+ rot->trans_brick->replay_limiter = &rot->replay_limiter;
+ /* For safety, default is to try an (unnecessary) replay in case
+ * something goes wrong later.
+ */
+ rot->replay_mode = true;
+
+ status = 0;
+
+done:
+ brick_string_free(aio_path);
+ brick_string_free(replay_path);
+ return status;
+}
+
+static
+bool _next_is_acceptable(struct mars_rotate *rot, struct mars_dent *old_dent, struct mars_dent *new_dent)
+{
+ /* Primaries are never allowed to consider logfiles not belonging to them.
+ * Secondaries need this for replay, unfortunately.
+ */
+ if ((rot->is_primary | rot->old_is_primary) ||
+ (rot->trans_brick && rot->trans_brick->power.on_led && !rot->trans_brick->replay_mode)) {
+ if (new_dent->stat_val.size) {
+ XIO_WRN(
+ "logrotate impossible, '%s' size = %lld\n", new_dent->d_rest, new_dent->stat_val.size);
+ return false;
+ }
+ if (strcmp(new_dent->d_rest, my_id())) {
+ XIO_WRN("logrotate impossible, '%s'\n", new_dent->d_rest);
+ return false;
+ }
+ } else {
+ /* Only secondaries should check for contiguity,
+ * primaries sometimes need holes for emergency mode.
+ */
+ if (new_dent->d_serial != old_dent->d_serial + 1)
+ return false;
+ }
+ return true;
+}
+
+/* Note: this is strictly called in d_serial order.
+ * This is important!
+ */
+static
+int make_log_step(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot;
+ struct trans_logger_brick *trans_brick;
+ struct mars_dent *prev_log;
+ int replay_log_nr = 0;
+ int status = -EINVAL;
+
+ CHECK_PTR(parent, err);
+ rot = parent->d_private;
+ if (!rot)
+ goto err;
+ CHECK_PTR(rot, err);
+
+ status = 0;
+ trans_brick = rot->trans_brick;
+ if (!global->global_power.button || !dent->d_parent || !trans_brick || rot->has_error) {
+ XIO_DBG("nothing to do rot_error = %d\n", rot->has_error);
+ goto done;
+ }
+
+ /* Check for consecutiveness of logfiles
+ */
+ prev_log = rot->next_log;
+ if (prev_log && prev_log->d_serial + 1 != dent->d_serial &&
+ (!rot->replay_link || !rot->replay_link->d_argv[0] ||
+ sscanf(rot->replay_link->d_argv[0], "log-%d", &replay_log_nr) != 1 ||
+ dent->d_serial > replay_log_nr)) {
+ XIO_WRN_TO(
+ rot->log_say,
+ "transaction logs are not consecutive at '%s' (%d ~> %d)\n",
+ dent->d_path,
+ prev_log->d_serial,
+ dent->d_serial);
+ make_rot_msg(
+ rot,
+ "wrn-log-consecutive",
+ "transaction logs are not consecutive at '%s' (%d ~> %d)\n",
+ dent->d_path,
+ prev_log->d_serial,
+ dent->d_serial);
+ }
+
+ if (dent->d_serial > rot->max_sequence)
+ rot->max_sequence = dent->d_serial;
+
+ if (!rot->first_log)
+ rot->first_log = dent;
+
+ /* Skip any logfiles after the relevant one.
+ * This should happen only when replaying multiple logfiles
+ * in sequence, or when starting a new logfile for writing.
+ */
+ status = 0;
+ if (rot->relevant_log) {
+ if (!rot->next_relevant_log) {
+ if (unlikely(dent->d_serial == rot->relevant_log->d_serial)) {
+ /* always prefer the one created by myself */
+ if (!strcmp(rot->relevant_log->d_rest, my_id())) {
+ XIO_WRN(
+ "PREFER LOGFILE '%s' in front of '%s'\n",
+ rot->relevant_log->d_path, dent->d_path);
+ } else if (!strcmp(dent->d_rest, my_id())) {
+ XIO_WRN(
+ "PREFER LOGFILE '%s' in front of '%s'\n",
+ dent->d_path, rot->relevant_log->d_path);
+ rot->relevant_log = dent;
+ } else {
+ rot->has_double_logfile = true;
+ XIO_ERR(
+ "DOUBLE LOGFILES '%s' '%s'\n",
+ dent->d_path, rot->relevant_log->d_path);
+ }
+ } else if (_next_is_acceptable(rot, rot->relevant_log, dent)) {
+ rot->next_relevant_log = dent;
+ } else if (dent->d_serial > rot->relevant_log->d_serial + 5) {
+ rot->has_hole_logfile = true;
+ }
+ } else { /* check for double logfiles => split brain */
+ if (unlikely(dent->d_serial == rot->next_relevant_log->d_serial)) {
+ /* always prefer the one created by myself */
+ if (!strcmp(rot->next_relevant_log->d_rest, my_id())) {
+ XIO_WRN(
+ "PREFER LOGFILE '%s' in front of '%s'\n",
+ rot->next_relevant_log->d_path,
+ dent->d_path);
+ } else if (!strcmp(dent->d_rest, my_id())) {
+ XIO_WRN(
+ "PREFER LOGFILE '%s' in front of '%s'\n",
+ dent->d_path,
+ rot->next_relevant_log->d_path);
+ rot->next_relevant_log = dent;
+ } else {
+ rot->has_double_logfile = true;
+ XIO_ERR(
+ "DOUBLE LOGFILES '%s' '%s'\n", dent->d_path, rot->next_relevant_log->d_path);
+ }
+ } else if (dent->d_serial > rot->next_relevant_log->d_serial + 5) {
+ rot->has_hole_logfile = true;
+ }
+ }
+ XIO_DBG("next_relevant_log = %p\n", rot->next_relevant_log);
+ goto ok;
+ }
+
+ /* Preconditions
+ */
+ if (!rot->replay_link || !rot->aio_dent || !rot->aio_brick) {
+ XIO_DBG("nothing to do on '%s'\n", dent->d_path);
+ goto ok;
+ }
+
+ /* Remember the relevant log.
+ */
+ if (!rot->relevant_log && rot->aio_dent->d_serial == dent->d_serial) {
+ rot->relevant_serial = dent->d_serial;
+ rot->relevant_log = dent;
+ rot->has_double_logfile = false;
+ rot->has_hole_logfile = false;
+ }
+
+ok:
+ /* All ok: switch over the indicators.
+ */
+ XIO_DBG("next_log = '%s'\n", dent->d_path);
+ rot->prev_log = rot->next_log;
+ rot->next_log = dent;
+
+done:
+ if (status < 0) {
+ XIO_DBG("rot_error status = %d\n", status);
+ rot->has_error = true;
+ }
+err:
+ return status;
+}
+
+/* Internal helper. Return codes:
+ * ret < 0 : error
+ * ret == 0 : not relevant
+ * ret == 1 : relevant, no transaction replay, switch to the next
+ * ret == 2 : relevant for transaction replay
+ * ret == 3 : relevant for appending
+ */
+static
+int _check_logging_status(
+struct mars_rotate *rot, int *log_nr, long long *oldpos_start, long long *oldpos_end, long long *newpos)
+{
+ struct mars_dent *dent = rot->relevant_log;
+ struct mars_dent *parent;
+ struct mars_global *global = NULL;
+ const char *vers_link = NULL;
+ int status = 0;
+
+ if (!dent)
+ goto done;
+
+ status = -EINVAL;
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ global = rot->global;
+ CHECK_PTR_NULL(global, done);
+ CHECK_PTR(rot->replay_link, done);
+ CHECK_PTR(rot->aio_brick, done);
+ CHECK_PTR(rot->aio_dent, done);
+
+ XIO_DBG(" dent = '%s'\n", dent->d_path);
+ XIO_DBG("aio_dent = '%s'\n", rot->aio_dent->d_path);
+ if (unlikely(strcmp(dent->d_path, rot->aio_dent->d_path)))
+ goto done;
+
+ if (sscanf(rot->replay_link->d_argv[0], "log-%d", log_nr) != 1) {
+ XIO_ERR_TO(
+ rot->log_say, "replay link has malformed logfile number '%s'\n", rot->replay_link->d_argv[0]);
+ goto done;
+ }
+ if (kstrtos64(rot->replay_link->d_argv[1], 10, oldpos_start)) {
+ XIO_ERR_TO(
+ rot->log_say, "replay link has bad start position argument '%s'\n", rot->replay_link->d_argv[1]);
+ goto done;
+ }
+ if (kstrtos64(rot->replay_link->d_argv[2], 10, oldpos_end)) {
+ XIO_ERR_TO(
+ rot->log_say, "replay link has bad end position argument '%s'\n", rot->replay_link->d_argv[2]);
+ goto done;
+ }
+ *oldpos_end += *oldpos_start;
+ if (unlikely(*oldpos_end < *oldpos_start)) {
+ XIO_ERR_TO(rot->log_say, "replay link end_pos %lld < start_pos %lld\n", *oldpos_end, *oldpos_start);
+ /* safety: use the smaller value, it does not hurt */
+ *oldpos_start = *oldpos_end;
+ if (unlikely(*oldpos_start < 0))
+ *oldpos_start = 0;
+ }
+
+ vers_link = get_versionlink(rot->parent_path, *log_nr, my_id(), NULL);
+ if (vers_link && vers_link[0]) {
+ long long vers_pos = 0;
+ int offset = 0;
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ offset += skip_part(vers_link + offset);
+ if (unlikely(!vers_link[offset++])) {
+ XIO_ERR_TO(rot->log_say, "version link '%s' is malformed\n", vers_link);
+ goto check_pos;
+ }
+ }
+
+ sscanf(vers_link + offset, "%lld", &vers_pos);
+ if (vers_pos < *oldpos_start) {
+ XIO_WRN(
+ "versionlink has smaller startpos %lld < %lld\n",
+ vers_pos, *oldpos_start);
+ /* for safety, take the minimum of both */
+ *oldpos_start = vers_pos;
+ } else if (vers_pos > *oldpos_start) {
+ XIO_WRN(
+ "versionlink has greater startpos %lld > %lld\n",
+ vers_pos, *oldpos_start);
+ }
+ }
+check_pos:
+ *newpos = rot->aio_info.current_size;
+
+ if (unlikely(rot->aio_info.current_size < *oldpos_start)) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "oops, bad replay position attempted at logfile '%s' (file length %lld should never be smaller than requested position %lld, is your filesystem corrupted?) => please repair this by hand\n",
+ rot->aio_dent->d_path,
+ rot->aio_info.current_size,
+ *oldpos_start);
+ make_rot_msg(
+ rot,
+ "err-replay-size",
+ "oops, bad replay position attempted at logfile '%s' (file length %lld should never be smaller than requested position %lld, is your filesystem corrupted?) => please repair this by hand",
+ rot->aio_dent->d_path,
+ rot->aio_info.current_size,
+ *oldpos_start);
+ status = -EBADF;
+ goto done;
+ }
+
+ status = 0;
+ if (rot->aio_info.current_size > *oldpos_start) {
+ if ((rot->aio_info.current_size - *oldpos_start < REPLAY_TOLERANCE ||
+ (rot->log_is_really_damaged &&
+ rot->todo_primary &&
+ rot->relevant_log &&
+ strcmp(rot->relevant_log->d_rest, my_id()))) &&
+ (rot->todo_primary ||
+ (rot->relevant_log &&
+ rot->next_relevant_log &&
+ is_switchover_possible(
+ rot, rot->relevant_log->d_path, rot->next_relevant_log->d_path, _get_tolerance(
+ rot), false)))) {
+ XIO_INF_TO(
+ rot->log_say,
+ "TOLERANCE: transaction log '%s' is treated as fully applied\n",
+ rot->aio_dent->d_path);
+ make_rot_msg(
+ rot,
+ "inf-replay-tolerance",
+ "TOLERANCE: transaction log '%s' is treated as fully applied",
+ rot->aio_dent->d_path);
+ status = 1;
+ } else {
+ XIO_INF_TO(
+ rot->log_say,
+ "transaction log replay is necessary on '%s' from %lld to %lld (dirty region ends at %lld)\n",
+ rot->aio_dent->d_path,
+ *oldpos_start,
+ rot->aio_info.current_size,
+ *oldpos_end);
+ status = 2;
+ }
+ } else if (rot->next_relevant_log) {
+ XIO_INF_TO(
+ rot->log_say,
+ "transaction log '%s' is already applied, and the next one is available for switching\n",
+ rot->aio_dent->d_path);
+ status = 1;
+ } else if (rot->todo_primary) {
+ if (rot->aio_info.current_size > 0 || strcmp(dent->d_rest, my_id()) != 0) {
+ XIO_INF_TO(
+ rot->log_say,
+ "transaction log '%s' is already applied (would be usable for appending at position %lld, but a fresh logfile will be used for safety reasons)\n",
+ rot->aio_dent->d_path,
+ *oldpos_end);
+ status = 1;
+ } else {
+ XIO_INF_TO(
+ rot->log_say,
+ "empty transaction log '%s' is usable for me as a primary node\n",
+ rot->aio_dent->d_path);
+ status = 3;
+ }
+ } else {
+ XIO_DBG("transaction log '%s' is the last one, currently fully applied\n", rot->aio_dent->d_path);
+ status = 0;
+ }
+
+done:
+ brick_string_free(vers_link);
+ return status;
+}
+
+static
+int _make_logging_status(struct mars_rotate *rot)
+{
+ struct mars_dent *dent = rot->relevant_log;
+ struct mars_dent *parent;
+ struct mars_global *global = NULL;
+ struct trans_logger_brick *trans_brick;
+ int log_nr = 0;
+ loff_t start_pos = 0;
+ loff_t dirty_pos = 0;
+ loff_t end_pos = 0;
+ int status = 0;
+
+ if (!dent)
+ goto done;
+
+ status = -EINVAL;
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ global = rot->global;
+ CHECK_PTR_NULL(global, done);
+
+ status = 0;
+ trans_brick = rot->trans_brick;
+ if (!global->global_power.button || !trans_brick || rot->has_error) {
+ XIO_DBG("nothing to do rot_error = %d\n", rot->has_error);
+ goto done;
+ }
+
+ /* Find current logging status.
+ */
+ status = _check_logging_status(rot, &log_nr, &start_pos, &dirty_pos, &end_pos);
+ XIO_DBG(
+ "case = %d (todo_primary=%d is_primary=%d old_is_primary=%d)\n",
+ status,
+ rot->todo_primary,
+ rot->is_primary,
+ rot->old_is_primary);
+ if (status < 0)
+ goto done;
+ if (unlikely(start_pos < 0 || dirty_pos < start_pos || end_pos < dirty_pos)) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "replay symlink has implausible values: start_pos = %lld dirty_pos = %lld end_pos = %lld\n",
+ start_pos,
+ dirty_pos,
+ end_pos);
+ }
+ /* Relevant or not?
+ */
+ switch (status) {
+ case 0: /* not relevant */
+ goto ok;
+ case 1: /* Relevant, and transaction replay already finished.
+ * Allow switching over to a new logfile.
+ */
+ if (!trans_brick->power.button && !trans_brick->power.on_led && trans_brick->power.off_led) {
+ if (rot->next_relevant_log && !rot->log_is_really_damaged) {
+ int replay_tolerance = _get_tolerance(rot);
+ bool skip_new = !!rot->todo_primary;
+
+ XIO_DBG(
+ "check switchover from '%s' to '%s' (size = %lld, skip_new = %d, replay_tolerance = %d)\n",
+ dent->d_path,
+ rot->next_relevant_log->d_path,
+ rot->next_relevant_log->stat_val.size,
+ skip_new,
+ replay_tolerance);
+ if (
+ is_switchover_possible(
+ rot, dent->d_path, rot->next_relevant_log->d_path, replay_tolerance, skip_new) ||
+ (skip_new && !_check_allow(global, parent, "connect"))) {
+ XIO_INF_TO(
+ rot->log_say,
+ "start switchover from transaction log '%s' to '%s'\n",
+ dent->d_path,
+ rot->next_relevant_log->d_path);
+ _make_new_replaylink(
+ rot,
+ rot->next_relevant_log->d_rest,
+ rot->next_relevant_log->d_serial,
+ rot->next_relevant_log->stat_val.size);
+ }
+ } else if (rot->todo_primary) {
+ if (dent->d_serial > log_nr)
+ log_nr = dent->d_serial;
+ XIO_INF_TO(
+ rot->log_say,
+ "preparing new transaction log, number moves from %d to %d\n",
+ dent->d_serial,
+ log_nr + 1);
+ _make_new_replaylink(rot, my_id(), log_nr + 1, 0);
+ } else {
+ XIO_DBG("nothing to do on last transaction log '%s'\n", dent->d_path);
+ }
+ }
+ status = -EAGAIN;
+ goto done;
+ case 2: /* relevant for transaction replay */
+ XIO_INF_TO(
+ rot->log_say,
+ "replaying transaction log '%s' from position %lld to %lld\n",
+ dent->d_path,
+ start_pos,
+ end_pos);
+ rot->replay_mode = true;
+ rot->start_pos = start_pos;
+ rot->end_pos = end_pos;
+ break;
+ case 3: /* relevant for appending */
+ XIO_INF_TO(rot->log_say, "appending to transaction log '%s'\n", dent->d_path);
+ rot->replay_mode = false;
+ rot->start_pos = 0;
+ rot->end_pos = 0;
+ break;
+ default:
+ XIO_ERR_TO(rot->log_say, "bad internal status %d\n", status);
+ status = -EINVAL;
+ goto done;
+ }
+
+ok:
+ /* All ok: switch over the indicators.
+ */
+ rot->prev_log = rot->next_log;
+ rot->next_log = dent;
+
+done:
+ if (status < 0) {
+ XIO_DBG("rot_error status = %d\n", status);
+ rot->has_error = true;
+ }
+ return status;
+}
+
+static
+void _init_trans_input(struct trans_logger_input *trans_input, struct mars_dent *log_dent, struct mars_rotate *rot)
+{
+ if (unlikely(trans_input->connect || trans_input->is_operating)) {
+ XIO_ERR("this should not happen\n");
+ goto out_return;
+ }
+
+ memset(&trans_input->inf, 0, sizeof(trans_input->inf));
+
+ strncpy(trans_input->inf.inf_host, log_dent->d_rest, sizeof(trans_input->inf.inf_host));
+ trans_input->inf.inf_sequence = log_dent->d_serial;
+ trans_input->inf.inf_private = rot;
+ trans_input->inf.inf_callback = _update_info;
+ XIO_DBG("initialized '%s' %d\n", trans_input->inf.inf_host, trans_input->inf.inf_sequence);
+out_return:;
+}
+
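+/* The trans_logger has two log inputs (TL_INPUT_LOG1 and
+ * TL_INPUT_LOG2) which are used alternately during logrotate.
+ * Return the slot not holding the current log, provided it is
+ * neither operating nor connected.
+ */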
+static
+int _get_free_input(struct trans_logger_brick *trans_brick)
+{
+ int nr = (((trans_brick->log_input_nr - TL_INPUT_LOG1) + 1) % 2) + TL_INPUT_LOG1;
+ struct trans_logger_input *candidate;
+
+ candidate = trans_brick->inputs[nr];
+ if (unlikely(!candidate)) {
+ XIO_ERR("input nr = %d is corrupted!\n", nr);
+ return -EEXIST;
+ }
+ if (unlikely(candidate->is_operating || candidate->connect)) {
+ XIO_DBG(
+ "nr = %d unusable! is_operating = %d connect = %p\n", nr, candidate->is_operating, candidate->connect);
+ return -EEXIST;
+ }
+ XIO_DBG("got nr = %d\n", nr);
+ return nr;
+}
+
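+/* Carry out one logrotate step at the trans_logger: either
+ * disconnect the old log input once it is fully applied
+ * (min_pos == max_pos, no log objects in flight), or wire the
+ * next relevant logfile into the free input slot.
+ */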
+static
+void _rotate_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+ int old_nr = trans_brick->old_input_nr;
+ int log_nr = trans_brick->log_input_nr;
+ int next_nr;
+
+ XIO_DBG(
+ "log_input_nr = %d old_input_nr = %d next_relevant_log = %p\n", log_nr, old_nr, rot->next_relevant_log);
+
+ /* try to cleanup old log */
+ if (log_nr != old_nr) {
+ struct trans_logger_input *trans_input = trans_brick->inputs[old_nr];
+ struct trans_logger_input *new_input = trans_brick->inputs[log_nr];
+
+ if (!trans_input->connect) {
+ XIO_DBG("ignoring unused old input %d\n", old_nr);
+ } else if (!new_input->is_operating) {
+ XIO_DBG("ignoring uninitialized new input %d\n", log_nr);
+ } else if (trans_input->is_operating &&
+ trans_input->inf.inf_min_pos == trans_input->inf.inf_max_pos &&
+ list_empty(&trans_input->pos_list) &&
+ atomic_read(&trans_input->log_obj_count) <= 0) {
+ int status;
+
+ XIO_INF("cleanup old transaction log (%d -> %d)\n", old_nr, log_nr);
+ status = generic_disconnect((void *)trans_input);
+ if (unlikely(status < 0))
+ XIO_ERR("disconnect failed\n");
+ else
+ remote_trigger();
+ } else {
+ XIO_DBG(
+ "old transaction replay not yet finished: is_operating = %d pos %lld != %lld\n",
+ trans_input->is_operating,
+ trans_input->inf.inf_min_pos,
+ trans_input->inf.inf_max_pos);
+ }
+ } else
+ /* try to setup new log */
+ if (log_nr == trans_brick->new_input_nr &&
+ rot->next_relevant_log &&
+ (rot->next_relevant_log->d_serial == trans_brick->inputs[log_nr]->inf.inf_sequence + 1 ||
+ trans_brick->cease_logging)) {
+ struct trans_logger_input *trans_input;
+ int status;
+
+ next_nr = _get_free_input(trans_brick);
+ if (unlikely(next_nr < 0)) {
+ XIO_ERR_TO(rot->log_say, "no free input\n");
+ goto done;
+ }
+
+ XIO_DBG("start switchover %d -> %d\n", old_nr, next_nr);
+
+ rot->next_relevant_brick =
+ make_brick_all(
+ rot->global,
+ rot->next_relevant_log,
+ _set_sio_params,
+ NULL,
+ rot->next_relevant_log->d_path,
+ (const struct generic_brick_type *)&sio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* create + activate */
+ rot->next_relevant_log->d_path,
+ (const char *[]){},
+ 0);
+ if (unlikely(!rot->next_relevant_brick)) {
+ XIO_ERR_TO(
+ rot->log_say, "could not open next transaction log '%s'\n", rot->next_relevant_log->d_path);
+ goto done;
+ }
+ trans_input = trans_brick->inputs[next_nr];
+ if (unlikely(!trans_input)) {
+ XIO_ERR_TO(rot->log_say, "internal log input does not exist\n");
+ goto done;
+ }
+
+ _init_trans_input(trans_input, rot->next_relevant_log, rot);
+
+ status = generic_connect((void *)trans_input, (void *)rot->next_relevant_brick->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR_TO(rot->log_say, "internal connect failed\n");
+ goto done;
+ }
+ trans_brick->new_input_nr = next_nr;
+ XIO_INF_TO(
+ rot->log_say,
+ "started logrotate switchover from '%s' to '%s'\n",
+ rot->relevant_log->d_path,
+ rot->next_relevant_log->d_path);
+ rot->replay_code = 0;
+ }
+done:;
+}
+
+static
+void _change_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+
+ XIO_DBG(
+ "replay_mode = %d start_pos = %lld end_pos = %lld\n", trans_brick->replay_mode, rot->start_pos, rot->end_pos);
+
+ if (trans_brick->replay_mode) {
+ trans_brick->replay_start_pos = rot->start_pos;
+ trans_brick->replay_end_pos = rot->end_pos;
+ } else {
+ _rotate_trans(rot);
+ }
+}
+
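+/* Start (or update) transaction logging resp. replay. When the
+ * logger is already powered, only the parameters are updated
+ * via _change_trans(). Otherwise a free input slot is
+ * allocated, the relevant logfile is opened via a new sio
+ * brick, and the logger is finally switched on.
+ */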
+static
+int _start_trans(struct mars_rotate *rot)
+{
+ struct trans_logger_brick *trans_brick;
+ struct trans_logger_input *trans_input;
+ int nr;
+ int status;
+
+ /* Internal safety checks
+ */
+ status = -EINVAL;
+ if (unlikely(!rot)) {
+ XIO_ERR("rot is NULL\n");
+ goto done;
+ }
+ if (unlikely(!rot->aio_brick || !rot->relevant_log)) {
+ XIO_ERR(
+ "aio %p or relevant log %p is missing, this should not happen\n", rot->aio_brick, rot->relevant_log);
+ goto done;
+ }
+ trans_brick = rot->trans_brick;
+ if (unlikely(!trans_brick)) {
+ XIO_ERR("logger instance does not exist\n");
+ goto done;
+ }
+
+ /* Update status when already working
+ */
+ if (trans_brick->power.button || !trans_brick->power.off_led) {
+ _change_trans(rot);
+ status = 0;
+ goto done;
+ }
+
+ /* Further safety checks.
+ */
+ if (unlikely(rot->relevant_brick)) {
+ XIO_ERR("log aio brick already present, this should not happen\n");
+ goto done;
+ }
+ if (
+ unlikely(
+ trans_brick->inputs[TL_INPUT_LOG1]->is_operating || trans_brick->inputs[TL_INPUT_LOG2]->is_operating)) {
+ XIO_ERR("some input is operating, this should not happen\n");
+ goto done;
+ }
+
+ /* Allocate new input slot
+ */
+ nr = _get_free_input(trans_brick);
+ if (unlikely(nr < TL_INPUT_LOG1 || nr > TL_INPUT_LOG2)) {
+ XIO_ERR("bad new_input_nr = %d\n", nr);
+ goto done;
+ }
+ trans_brick->new_input_nr = nr;
+ trans_brick->old_input_nr = nr;
+ trans_brick->log_input_nr = nr;
+ trans_input = trans_brick->inputs[nr];
+ if (unlikely(!trans_input)) {
+ XIO_ERR("log input %d does not exist\n", nr);
+ goto done;
+ }
+
+ /* Open new transaction log
+ */
+ rot->relevant_brick =
+ make_brick_all(
+ rot->global,
+ rot->relevant_log,
+ _set_sio_params,
+ NULL,
+ rot->relevant_log->d_path,
+ (const struct generic_brick_type *)&sio_brick_type,
+ (const struct generic_brick_type *[]){},
+ 2, /* start always */
+ rot->relevant_log->d_path,
+ (const char *[]){},
+ 0);
+ if (unlikely(!rot->relevant_brick)) {
+ XIO_ERR("log aio brick '%s' not open\n", rot->relevant_log->d_path);
+ goto done;
+ }
+
+ /* Supply all relevant parameters
+ */
+ trans_brick->replay_mode = rot->replay_mode;
+ trans_brick->replay_tolerance = REPLAY_TOLERANCE;
+ _init_trans_input(trans_input, rot->relevant_log, rot);
+ rot->replay_code = 0;
+
+ /* Connect to new transaction log
+ */
+ status = generic_connect((void *)trans_input, (void *)rot->relevant_brick->outputs[0]);
+ if (unlikely(status < 0)) {
+ XIO_ERR("initial connect failed\n");
+ goto done;
+ }
+
+ _change_trans(rot);
+
+ /* Switch on....
+ */
+ status = mars_power_button((void *)trans_brick, true, false);
+ XIO_DBG("status = %d\n", status);
+
+done:
+ return status;
+}
+
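+/* Switch the trans_logger off and, as soon as the off_led is
+ * reached, disconnect any log inputs which are no longer
+ * operating. The current replay positions are persisted via
+ * write_info_links().
+ */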
+static
+int _stop_trans(struct mars_rotate *rot, const char *parent_path)
+{
+ struct trans_logger_brick *trans_brick = rot->trans_brick;
+ int status = 0;
+
+ if (!trans_brick)
+ goto done;
+
+ /* Switch off temporarily....
+ */
+ status = mars_power_button((void *)trans_brick, false, false);
+ XIO_DBG("status = %d\n", status);
+ if (status < 0)
+ goto done;
+
+ /* Disconnect old connection(s)
+ */
+ if (trans_brick->power.off_led) {
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *trans_input;
+
+ trans_input = trans_brick->inputs[i];
+ if (trans_input && !trans_input->is_operating) {
+ if (trans_input->connect)
+ (void)generic_disconnect((void *)trans_input);
+ }
+ }
+ }
+ write_info_links(rot);
+
+done:
+ return status;
+}
+
+static
+int make_log_finalize(struct mars_global *global, struct mars_dent *dent)
+{
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot;
+ struct trans_logger_brick *trans_brick;
+ struct copy_brick *fetch_brick;
+ bool is_attached;
+ bool is_stopped;
+ int status = -EINVAL;
+
+ CHECK_PTR(parent, err);
+ rot = parent->d_private;
+ if (!rot)
+ goto err;
+ CHECK_PTR(rot, err);
+ rot->has_symlinks = true;
+ trans_brick = rot->trans_brick;
+ status = 0;
+ if (!trans_brick) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ /* Handle jamming (a very exceptional state)
+ */
+ if (IS_JAMMED()) {
+#ifndef CONFIG_MARS_DEBUG
+ brick_say_logging = 0;
+#endif
+ rot->has_emergency = true;
+ /* Report remote errors to clients when they
+ * try to sync during emergency mode.
+ */
+ if (rot->bio_brick && rot->bio_brick->mode_ptr)
+ *rot->bio_brick->mode_ptr = -EMEDIUMTYPE;
+ XIO_ERR_TO(rot->log_say, "DISK SPACE IS EXTREMELY LOW on %s\n", rot->parent_path);
+ make_rot_msg(rot, "err-space-low", "DISK SPACE IS EXTREMELY LOW");
+ } else if (IS_EXHAUSTED() && rot->has_emergency) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "EMEGENCY MODE HYSTERESIS on %s: you need to free more space for recovery.\n",
+ rot->parent_path);
+ make_rot_msg(
+ rot, "err-space-low", "EMEGENCY MODE HYSTERESIS: you need to free more space for recovery.");
+ } else {
+ int limit = _check_allow(global, parent, "emergency-limit");
+
+ rot->has_emergency = (limit > 0 && global_remaining_space * 100 / global_total_space < limit);
+ XIO_DBG(
+ "has_emergency=%d limit=%d remaining_space=%lld total_space=%lld\n",
+ rot->has_emergency, limit, global_remaining_space, global_total_space);
+ if (!rot->has_emergency && rot->bio_brick && rot->bio_brick->mode_ptr)
+ *rot->bio_brick->mode_ptr = 0;
+ }
+ _show_actual(parent->d_path, "has-emergency", rot->has_emergency);
+ if (rot->has_emergency) {
+ if (rot->todo_primary || rot->is_primary) {
+ trans_brick->cease_logging = true;
+ rot->inf_prev_sequence = 0; /* disable checking */
+ }
+ } else {
+ if (!trans_logger_resume) {
+ XIO_INF_TO(
+ rot->log_say,
+ "emergency mode on %s could be turned off now, but /proc/sys/mars/logger_resume inhibits it.\n",
+ rot->parent_path);
+ } else {
+ trans_brick->cease_logging = false;
+ XIO_INF_TO(rot->log_say, "emergency mode on %s will be turned off again\n", rot->parent_path);
+ }
+ }
+ is_stopped = trans_brick->cease_logging | trans_brick->stopped_logging;
+ _show_actual(parent->d_path, "is-emergency", is_stopped);
+ if (is_stopped) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "EMERGENCY MODE on %s: stopped transaction logging, and created a hole in the logfile sequence nubers.\n",
+ rot->parent_path);
+ make_rot_msg(
+ rot,
+ "err-emergency",
+ "EMERGENCY MODE on %s: stopped transaction logging, and created a hole in the logfile sequence nubers.\n",
+ rot->parent_path);
+ /* Create a hole in the sequence of logfile numbers.
+ * The secondaries will later stumble over it.
+ */
+ if (!rot->created_hole) {
+ int new_sequence = rot->max_sequence + 10;
+ char *new_vers = path_make("%s/version-%09d-%s", rot->parent_path, new_sequence, my_id());
+
+ char *new_vval = path_make(
+ "00000000000000000000000000000000,log-%09d-%s,0:", new_sequence, my_id());
+ char *new_path = path_make("%s/log-%09d-%s", rot->parent_path, new_sequence + 1, my_id());
+
+ if (likely(new_vers && new_vval && new_path &&
+ !mars_find_dent(global, new_path))) {
+ XIO_INF_TO(rot->log_say, "EMERGENCY: creating new logfile '%s'\n", new_path);
+ mars_symlink(new_vval, new_vers, NULL, 0);
+ _create_new_logfile(new_path);
+ rot->created_hole = true;
+ }
+ brick_string_free(new_vers);
+ brick_string_free(new_vval);
+ brick_string_free(new_path);
+ }
+ } else {
+ rot->created_hole = false;
+ }
+
+ if (IS_EMERGENCY_SECONDARY()) {
+ if (!rot->todo_primary && rot->first_log && rot->first_log != rot->relevant_log) {
+ XIO_WRN_TO(
+ rot->log_say,
+ "EMERGENCY: ruthlessly freeing old logfile '%s', don't cry on any ramifications.\n",
+ rot->first_log->d_path);
+ make_rot_msg(
+ rot,
+ "wrn-space-low",
+ "EMERGENCY: ruthlessly freeing old logfile '%s'",
+ rot->first_log->d_path);
+ mars_unlink(rot->first_log->d_path);
+ rot->first_log->d_killme = true;
+ /* give it a chance to cease deleting next time */
+ compute_emergency_mode();
+ } else if (IS_EMERGENCY_PRIMARY()) {
+ XIO_WRN_TO(rot->log_say, "EMERGENCY: the space on /mars/ is VERY low.\n");
+ make_rot_msg(rot, "wrn-space-low", "EMERGENCY: the space on /mars/ is VERY low.");
+ } else {
+ XIO_WRN_TO(rot->log_say, "EMERGENCY: the space on /mars/ is low.\n");
+ make_rot_msg(rot, "wrn-space-low", "EMERGENCY: the space on /mars/ is low.");
+ }
+ } else if (IS_EXHAUSTED()) {
+ XIO_WRN_TO(rot->log_say, "EMERGENCY: the space on /mars/ is becoming low.\n");
+ make_rot_msg(rot, "wrn-space-low", "EMERGENCY: the space on /mars/ is becoming low.");
+ }
+
+ rot->log_is_really_damaged = false;
+ if (trans_brick->replay_mode) {
+ if (trans_brick->replay_code > 0) {
+ XIO_INF_TO(
+ rot->log_say,
+ "logfile replay ended successfully at position %lld\n",
+ trans_brick->replay_current_pos);
+ if (rot->replay_code >= 0)
+ rot->replay_code = trans_brick->replay_code;
+ } else if (trans_brick->replay_code == -EAGAIN ||
+ trans_brick->replay_end_pos - trans_brick->replay_current_pos < trans_brick->replay_tolerance) {
+ XIO_INF_TO(
+ rot->log_say,
+ "logfile replay stopped intermediately at position %lld\n",
+ trans_brick->replay_current_pos);
+ } else if (trans_brick->replay_code < 0) {
+ XIO_ERR_TO(
+ rot->log_say,
+ "logfile replay stopped with error = %d at position %lld\n",
+ trans_brick->replay_code,
+ trans_brick->replay_current_pos);
+ make_rot_msg(
+ rot,
+ "err-replay-stop",
+ "logfile replay stopped with error = %d at position %lld",
+ trans_brick->replay_code,
+ trans_brick->replay_current_pos);
+ rot->replay_code = trans_brick->replay_code;
+ rot->log_is_really_damaged = true;
+ } else if (rot->replay_code >= 0) {
+ rot->replay_code = trans_brick->replay_code;
+ }
+ } else {
+ rot->replay_code = 0;
+ }
+ __show_actual(parent->d_path, "replay-code", rot->replay_code);
+
+ /* Stopping is also possible in case of errors
+ */
+ if (trans_brick->power.button && trans_brick->power.on_led && !trans_brick->power.off_led) {
+ bool do_stop = true;
+
+ if (trans_brick->replay_mode) {
+ rot->is_log_damaged =
+ trans_brick->replay_code == -EAGAIN &&
+ trans_brick->replay_end_pos - trans_brick->replay_current_pos < trans_brick->replay_tolerance;
+ do_stop = trans_brick->replay_code != 0 ||
+ !global->global_power.button ||
+ !_check_allow(global, parent, "allow-replay") ||
+ !_check_allow(global, parent, "attach");
+ } else {
+ do_stop =
+ trans_brick->outputs[0]->nr_connected <= 0 &&
+ (!rot->todo_primary ||
+ !_check_allow(global, parent, "attach"));
+ }
+
+ XIO_DBG(
+ "replay_mode = %d replay_code = %d is_primary = %d nr_connected = %d do_stop = %d\n",
+ trans_brick->replay_mode, trans_brick->replay_code,
+ rot->is_primary,
+ trans_brick->outputs[0]->nr_connected,
+ (int)do_stop);
+
+ if (do_stop)
+ status = _stop_trans(rot, parent->d_path);
+ else
+ _change_trans(rot);
+ goto done;
+ }
+
+ /* Starting is only possible when no error occurred.
+ */
+ if (!rot->relevant_log || rot->has_error) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ /* Start when necessary
+ */
+ if (!trans_brick->power.button && !trans_brick->power.on_led && trans_brick->power.off_led) {
+ bool do_start;
+
+ status = _make_logging_status(rot);
+ if (status <= 0)
+ goto done;
+
+ rot->is_log_damaged = false;
+
+ do_start = (!rot->replay_mode ||
+ (rot->start_pos != rot->end_pos &&
+ _check_allow(global, parent, "allow-replay")));
+
+ if (do_start && rot->forbid_replay) {
+ XIO_INF("cannot start replay because sync wants to start\n");
+ make_rot_msg(rot, "inf-replay-start", "cannot start replay because sync wants to star");
+ do_start = false;
+ }
+
+ if (do_start && rot->sync_brick && !rot->sync_brick->power.off_led) {
+ XIO_INF("cannot start replay because sync is running\n");
+ make_rot_msg(rot, "inf-replay-start", "cannot start replay because sync is running");
+ do_start = false;
+ }
+
+ XIO_DBG(
+ "rot->replay_mode = %d rot->start_pos = %lld rot->end_pos = %lld | do_start = %d\n",
+ rot->replay_mode,
+ rot->start_pos,
+ rot->end_pos,
+ do_start);
+
+ if (do_start)
+ status = _start_trans(rot);
+ }
+
+done:
+ /* check whether some copy has finished */
+ fetch_brick = (struct copy_brick *)mars_find_brick(global, &copy_brick_type, rot->fetch_path);
+ XIO_DBG("fetch_path = '%s' fetch_brick = %p\n", rot->fetch_path, fetch_brick);
+ if (fetch_brick &&
+ (fetch_brick->power.off_led ||
+ !global->global_power.button ||
+ !_check_allow(global, parent, "connect") ||
+ !_check_allow(global, parent, "attach") ||
+ (fetch_brick->copy_last == fetch_brick->copy_end &&
+ (rot->fetch_next_is_available > 0 ||
+ rot->fetch_round++ > 3)))) {
+ int i;
+
+ for (i = 0; i < 4; i++) {
+ if (fetch_brick->inputs[i] && fetch_brick->inputs[i]->brick)
+ fetch_brick->inputs[i]->brick->power.io_timeout = 1;
+ }
+ status = xio_kill_brick((void *)fetch_brick);
+ if (status < 0)
+ XIO_ERR("could not kill fetch_brick, status = %d\n", status);
+ else
+ fetch_brick = NULL;
+ local_trigger();
+ }
+ rot->fetch_next_is_available = 0;
+ rot->fetch_brick = fetch_brick;
+ if (fetch_brick)
+ fetch_brick->kill_ptr = (void **)&rot->fetch_brick;
+ else
+ rot->fetch_serial = 0;
+ /* remove trans_logger (when possible) upon detach */
+ is_attached = !!rot->trans_brick;
+ _show_actual(rot->parent_path, "is-attached", is_attached);
+
+ if (rot->trans_brick && rot->trans_brick->power.off_led && !rot->trans_brick->outputs[0]->nr_connected) {
+ bool do_attach = _check_allow(global, parent, "attach");
+
+ XIO_DBG("do_attach = %d\n", do_attach);
+ if (!do_attach) {
+ rot->trans_brick->killme = true;
+ rot->trans_brick = NULL;
+ }
+ }
+
+ _show_actual(
+ rot->parent_path,
+ "is-replaying",
+ rot->trans_brick && rot->trans_brick->replay_mode && !rot->trans_brick->power.off_led);
+ _show_rate(rot, &rot->replay_limiter, "replay_rate");
+ _show_actual(rot->parent_path, "is-copying", rot->fetch_brick && !rot->fetch_brick->power.off_led);
+ _show_rate(rot, &rot->fetch_limiter, "file_rate");
+ _show_actual(rot->parent_path, "is-syncing", rot->sync_brick && !rot->sync_brick->power.off_led);
+ _show_rate(rot, &rot->sync_limiter, "sync_rate");
+err:
+ return status;
+}
+
+/*********************************************************************/
+
+/* specific handlers */
+
+static
+int make_primary(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent;
+ struct mars_rotate *rot;
+ int status = -EINVAL;
+
+ parent = dent->d_parent;
+ CHECK_PTR(parent, done);
+ rot = parent->d_private;
+ if (!rot)
+ goto done;
+ CHECK_PTR(rot, done);
+
+ rot->has_symlinks = true;
+
+ rot->todo_primary =
+ global->global_power.button && dent->link_val && !strcmp(dent->link_val, my_id());
+ XIO_DBG("todo_primary = %d is_primary = %d\n", rot->todo_primary, rot->is_primary);
+ status = 0;
+
+done:
+ return status;
+}
+
+static
+int make_bio(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_rotate *rot;
+ struct xio_brick *brick;
+ bool switch_on;
+ int status = 0;
+
+ if (!global || !dent->d_parent)
+ goto done;
+ rot = dent->d_parent->d_private;
+ if (!rot)
+ goto done;
+
+ rot->has_symlinks = true;
+
+ switch_on = _check_allow(global, dent->d_parent, "attach");
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot access disk: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ brick =
+ make_brick_all(
+ global,
+ dent,
+ _set_bio_params,
+ NULL,
+ dent->d_path,
+ (const struct generic_brick_type *)&bio_brick_type,
+ (const struct generic_brick_type *[]){},
+ switch_on ? 2 : -1,
+ dent->d_path,
+ (const char *[]){},
+ 0);
+ rot->bio_brick = brick;
+ if (unlikely(!brick)) {
+ status = -ENXIO;
+ goto done;
+ }
+
+ /* Report the actual size of the device.
+ * It may be larger than the global size.
+ */
+ if (brick && brick->power.on_led) {
+ struct xio_info info = {};
+ struct xio_output *output;
+ char *src = NULL;
+ char *dst = NULL;
+
+ output = brick->outputs[0];
+ status = output->ops->xio_get_info(output, &info);
+ if (status < 0) {
+ XIO_ERR("cannot get info on '%s'\n", dent->d_path);
+ goto done;
+ }
+ src = path_make("%lld", info.current_size);
+ dst = path_make("%s/actsize-%s", dent->d_parent->d_path, my_id());
+ if (src && dst)
+ (void)mars_symlink(src, dst, NULL, 0);
+ brick_string_free(src);
+ brick_string_free(dst);
+ }
+
+done:
+ return status;
+}
+
+static int make_replay(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ int status = 0;
+
+ if (!global->global_power.button || !parent || !dent->link_val) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ status = make_log_finalize(global, dent);
+ if (status < 0) {
+ XIO_DBG("logger not initialized\n");
+ goto done;
+ }
+
+done:
+ return status;
+}
+
+static
+int make_dev_remote(struct mars_global *global, struct mars_dent *dent, struct mars_rotate *rot)
+{
+ struct mars_dent *parent = dent->d_parent;
+ char *primary;
+ char *status_path = NULL;
+ char *status_val = NULL;
+ char *client_path = NULL;
+ struct xio_brick *remote_brick;
+ struct xio_brick *dev_brick;
+ int switch_on = 0;
+ int status = -EINVAL;
+
+ primary = dent->d_argv[1];
+ if (!primary)
+ goto setup;
+
+ if (!global->global_power.button)
+ goto setup;
+
+ /* Check both the local and the remote attach switch.
+ */
+ if (!_check_allow(global, dent->d_parent, "rattach"))
+ goto setup;
+ if (!__check_allow(global, dent->d_parent, "attach", primary))
+ goto setup;
+
+ /* Check whether designated primary is the correct one
+ */
+ status_path = path_make(
+ "%s/primary",
+ parent->d_path);
+ status_val = mars_readlink(status_path);
+ if (!status_val || strcmp(status_val, primary))
+ goto setup;
+
+ brick_string_free(status_path);
+ brick_string_free(status_val);
+
+ /* In addition, check actual primary
+ */
+ status_path = path_make(
+ "%s/actual-%s/is-primary",
+ parent->d_path, primary);
+ status_val = mars_readlink(status_path);
+ if (!status_val)
+ goto setup;
+ status = kstrtoint(status_val, 10, &switch_on);
+ if (unlikely(status)) {
+ switch_on = 0;
+ goto setup;
+ }
+
+ switch_on = 1;
+
+setup:
+
+ client_path = path_make(
+ "%s/replay-%s@%s",
+ parent->d_path, primary, primary);
+
+ remote_brick =
+ make_brick_all(
+ global,
+ dent,
+ _set_client_params,
+ NULL,
+ client_path,
+ (const struct generic_brick_type *)&client_brick_type,
+ (const struct generic_brick_type *[]){},
+ switch_on || rot->if_brick ? 2 : -1,
+ "%s",
+ (const char *[]){},
+ 0,
+ client_path);
+ rot->remote_brick = (void *)remote_brick;
+ if (remote_brick) {
+ remote_brick->kill_ptr = (void **)&rot->remote_brick;
+ /* When on, set the timeout to infinite.
+ * This prevents IO errors from being reported to filesystems
+ * like XFS. Some filesystems run into trouble when requests
+ * succeed again after earlier ones have timed out, e.g. they
+ * may then need an xfs_repair (which is worse than hanging or
+ * a simple power loss).
+ */
+ if (switch_on) {
+ remote_brick->power.io_timeout = -1;
+ } else {
+ remote_brick->power.io_timeout = 1;
+ remote_brick->killme = true;
+ }
+ } else {
+ switch_on = 0;
+ }
+
+ dev_brick =
+ make_brick_all(
+ global,
+ dent,
+ _set_if_params,
+ rot,
+ dent->d_argv[0],
+ (const struct generic_brick_type *)&if_brick_type,
+ (
+ const struct generic_brick_type *[]){(
+ const struct generic_brick_type *)&client_brick_type},
+ switch_on || (rot->if_brick && atomic_read(&rot->if_brick->open_count) > 0) ? 2 : -1,
+ "%s/device-%s",
+ (const char *[]){client_path},
+ 1,
+ parent->d_path,
+ my_id());
+ rot->if_brick = (void *)dev_brick;
+ if (dev_brick) {
+ dev_brick->kill_ptr = (void **)&rot->if_brick;
+ if (!switch_on)
+ dev_brick->killme = true;
+ }
+
+ brick_string_free(status_path);
+ brick_string_free(status_val);
+ brick_string_free(client_path);
+ return status;
+}
+
+static
+int make_dev(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_dent *parent = dent->d_parent;
+ struct mars_rotate *rot = NULL;
+ struct xio_brick *dev_brick;
+ char *remote;
+ char *dev_name = NULL;
+ bool switch_on;
+ int open_count;
+ int status = 0;
+
+ if (!parent || !dent->link_val) {
+ XIO_ERR("nothing to do\n");
+ return -EINVAL;
+ }
+ rot = parent->d_private;
+ if (!rot || !rot->parent_path) {
+ XIO_DBG("nothing to do\n");
+ goto err;
+ }
+ if (strcmp(dent->d_rest, my_id())) {
+ XIO_DBG("nothing to do\n");
+ goto err;
+ }
+ rot->has_symlinks = true;
+ status = _parse_args(dent, dent->link_val, 1);
+ if (unlikely(status < 0))
+ goto done;
+
+ remote = strstr(dent->d_argv[0], "@");
+ if (remote) {
+ brick_string_free(dent->d_argv[1]);
+ dent->d_argv[1] = brick_strdup(remote + 1);
+ *remote = '\0';
+ status = make_dev_remote(global, dent, rot);
+ goto done;
+ }
+
+ if (!rot->trans_brick) {
+ XIO_DBG("transaction logger does not exist\n");
+ goto done;
+ }
+ if (rot->dev_size <= 0) {
+ XIO_WRN("trying to create device '%s' with zero size\n", dent->d_path);
+ goto done;
+ }
+
+ dev_name = path_make("mars/%s", dent->d_argv[0]);
+
+ switch_on =
+ (rot->if_brick && atomic_read(&rot->if_brick->open_count) > 0) ||
+ (rot->todo_primary &&
+ !rot->trans_brick->replay_mode &&
+ rot->trans_brick->power.on_led &&
+ strcmp(dent->d_argv[0], "(none)") &&
+ _check_allow(global, dent->d_parent, "attach"));
+ if (!global->global_power.button)
+ switch_on = false;
+ if (switch_on && rot->res_shutdown) {
+ XIO_ERR("cannot create device: resource shutdown mode is currently active\n");
+ switch_on = false;
+ }
+
+ dev_brick =
+ make_brick_all(
+ global,
+ dent,
+ _set_if_params,
+ rot,
+ dev_name,
+ (const struct generic_brick_type *)&if_brick_type,
+ (
+ const struct generic_brick_type *[]){(
+ const struct generic_brick_type *)&trans_logger_brick_type},
+ switch_on ? 2 : -1,
+ "%s/device-%s",
+ (const char *[]){"%s/replay-%s"},
+ 1,
+ parent->d_path,
+ my_id(),
+ parent->d_path,
+ my_id());
+ rot->if_brick = (void *)dev_brick;
+ if (!dev_brick) {
+ XIO_DBG("device not shown\n");
+ goto done;
+ }
+ if (!switch_on) {
+ XIO_DBG("setting killme on if_brick\n");
+ dev_brick->killme = true;
+ }
+ dev_brick->kill_ptr = (void **)&rot->if_brick;
+ dev_brick->show_status = _show_brick_status;
+
+done:
+ open_count = 0;
+ if (rot->if_brick) {
+ _show_rate(rot, &rot->if_brick->io_limiter, "if_rate");
+ open_count = atomic_read(&rot->if_brick->open_count);
+ }
+ __show_actual(rot->parent_path, "open-count", open_count);
+ rot->is_primary =
+ rot->trans_brick &&
+ rot->trans_brick->power.on_led &&
+ !rot->trans_brick->replay_mode;
+ _show_primary(rot, parent);
+
+err:
+ brick_string_free(dev_name);
+ return status;
+}
+
+static
+int kill_dev(void *buf, struct mars_dent *dent)
+{
+ struct mars_dent *parent = dent->d_parent;
+ int status = kill_any(buf, dent);
+
+ if (status > 0 && parent) {
+ struct mars_rotate *rot = parent->d_private;
+
+ if (rot)
+ rot->if_brick = NULL;
+ }
+ return status;
+}
+
+static
+int _update_syncstatus(struct mars_rotate *rot, struct copy_brick *copy, char *peer)
+{
+ const char *src = NULL;
+ const char *dst = NULL;
+ const char *syncpos_path = NULL;
+ const char *peer_replay_path = NULL;
+ const char *peer_replay_link = NULL;
+ const char *peer_time_path = NULL;
+ int status = -EINVAL;
+
+ /* create syncpos symlink when necessary */
+ if (copy->copy_last == copy->copy_end && !rot->sync_finish_stamp.tv_sec) {
+ get_lamport(&rot->sync_finish_stamp);
+ XIO_DBG(
+ "sync finished at timestamp %lu\n",
+ rot->sync_finish_stamp.tv_sec);
+ /* Give the remote replay position a chance to become
+ * recent enough.
+ */
+ remote_trigger();
+ status = -EAGAIN;
+ goto done;
+ }
+ if (rot->sync_finish_stamp.tv_sec) {
+ struct kstat peer_time_stat = {};
+
+ peer_time_path = path_make("/mars/tree-%s", peer);
+ status = mars_stat(peer_time_path, &peer_time_stat, true);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot stat '%s'\n", peer_time_path);
+ goto done;
+ }
+
+ /* The syncpos tells us the replay position at the primary
+ * which was effective at the moment when the local sync was done.
+ * It is used to guarantee consistency:
+ * the underlying disk is only _really_ consistent when not only
+ * the sync has finished, but additionally the local replay has
+ * caught up (at least) to the position where the primary stood
+ * at that moment.
+ * Therefore, we have to remember the replay position of
+ * the primary at that moment.
+ * Because of network delays, we must also ensure that the
+ * remote version we read is recent enough.
+ */
+ syncpos_path = path_make("%s/syncpos-%s", rot->parent_path, my_id());
+ peer_replay_path = path_make("%s/replay-%s", rot->parent_path, peer);
+ peer_replay_link = mars_readlink(peer_replay_path);
+ if (unlikely(!peer_replay_link || !peer_replay_link[0])) {
+ XIO_ERR("cannot read peer replay link '%s'\n", peer_replay_path);
+ goto done;
+ }
+
+ _crashme(3, true);
+
+ status = _update_link_when_necessary(rot, "syncpos", peer_replay_link, syncpos_path);
+ /* Sync is only marked as finished when the syncpos
+ * production was successful and timestamps are recent enough.
+ */
+ if (unlikely(status < 0))
+ goto done;
+ if (timespec_compare(&peer_time_stat.mtime, &rot->sync_finish_stamp) < 0) {
+ XIO_INF(
+ "peer replay link '%s' is not recent enough, %lu < %lu\n",
+ peer_replay_path,
+ peer_time_stat.mtime.tv_sec,
+ rot->sync_finish_stamp.tv_sec);
+ remote_trigger();
+ status = -EAGAIN;
+ goto done;
+ }
+ }
+
+ src = path_make("%lld", copy->copy_last);
+ dst = path_make("%s/syncstatus-%s", rot->parent_path, my_id());
+
+ _crashme(4, true);
+
+ status = _update_link_when_necessary(rot, "syncstatus", src, dst);
+
+ brick_string_free(src);
+ brick_string_free(dst);
+ src = path_make("%lld,%lld", copy->verify_ok_count, copy->verify_error_count);
+ dst = path_make("%s/verifystatus-%s", rot->parent_path, my_id());
+
+ _crashme(5, true);
+
+ (void)_update_link_when_necessary(rot, "verifystatus", src, dst);
+
+ memset(&rot->sync_finish_stamp, 0, sizeof(rot->sync_finish_stamp));
+done:
+ brick_string_free(src);
+ brick_string_free(dst);
+ brick_string_free(peer_replay_link);
+ brick_string_free(peer_replay_path);
+ brick_string_free(syncpos_path);
+ brick_string_free(peer_time_path);
+ return status;
+}
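
As an aside to the comment in _update_syncstatus() above: the consistency
rule can be boiled down to a few lines of plain C. This is a minimal
illustrative sketch (not part of the patch); the helper name and the plain
integer positions are hypothetical, since the real positions live in the
replay-*, syncpos-* and syncstatus-* symlinks:

#include <stdbool.h>

/* A secondary's disk is only _really_ consistent when (a) the initial
 * sync has finished and (b) the local log replay has caught up to the
 * primary's replay position that was remembered in syncpos at
 * sync-finish time.
 */
static bool disk_really_consistent(bool sync_finished,
				   long long local_replay_pos,
				   long long syncpos)
{
	return sync_finished && local_replay_pos >= syncpos;
}
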
+
+static int make_sync(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ struct mars_rotate *rot;
+ loff_t start_pos = 0;
+ loff_t end_pos = 0;
+ struct mars_dent *size_dent;
+ struct mars_dent *primary_dent;
+ struct mars_dent *syncfrom_dent;
+ char *peer;
+ struct copy_brick *copy = NULL;
+ char *tmp = NULL;
+ const char *switch_path = NULL;
+ const char *copy_path = NULL;
+ const char *src = NULL;
+ const char *dst = NULL;
+ bool do_start;
+ int status;
+
+ if (!dent->d_parent || !dent->link_val)
+ return 0;
+
+ /* Determine peer
+ */
+ tmp = path_make("%s/primary", dent->d_parent->d_path);
+ primary_dent = (void *)mars_find_dent(global, tmp);
+ if (!primary_dent || !primary_dent->link_val) {
+ XIO_ERR("cannot determine primary, symlink '%s'\n", tmp);
+ status = 0;
+ goto done;
+ }
+ peer = primary_dent->link_val;
+
+ do_start = _check_allow(global, dent->d_parent, "attach");
+
+ /* Analyze replay position
+ */
+ status = kstrtos64(dent->link_val, 10, &start_pos);
+ if (unlikely(status)) {
+ XIO_ERR("bad syncstatus symlink syntax '%s' (%s)\n", dent->link_val, dent->d_path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ rot = dent->d_parent->d_private;
+ status = -ENOENT;
+ CHECK_PTR(rot, done);
+
+ rot->has_symlinks = true;
+ rot->allow_update = true;
+ rot->syncstatus_dent = dent;
+
+ /* Sync necessary?
+ */
+ brick_string_free(tmp);
+ tmp = path_make("%s/size", dent->d_parent->d_path);
+ status = -ENOMEM;
+ if (unlikely(!tmp))
+ goto done;
+ size_dent = (void *)mars_find_dent(global, tmp);
+ if (!size_dent || !size_dent->link_val) {
+ XIO_ERR("cannot determine size '%s'\n", tmp);
+ status = -ENOENT;
+ goto done;
+ }
+ status = kstrtos64(size_dent->link_val, 10, &end_pos);
+ if (unlikely(status)) {
+ XIO_ERR("bad size symlink syntax '%s' (%s)\n", size_dent->link_val, tmp);
+ status = -EINVAL;
+ goto done;
+ }
+
+ /* Is sync necessary at all?
+ */
+ if (start_pos >= end_pos) {
+ XIO_DBG("no data sync necessary, size = %lld\n", start_pos);
+ do_start = false;
+ }
+
+ /* Handle final waiting step when finished
+ */
+ if (rot->sync_finish_stamp.tv_sec && do_start)
+ goto shortcut;
+
+ /* Don't sync when logfiles are discontiguous
+ */
+ if (do_start && (rot->has_double_logfile || rot->has_hole_logfile)) {
+ XIO_WRN(
+ "no sync possible due to discontiguous logfiles %d ~!~ %d\n",
+ rot->has_double_logfile, rot->has_hole_logfile);
+ if (do_start)
+ start_pos = 0;
+ do_start = false;
+ }
+
+ /* stop sync when primary is unknown
+ */
+ if (!strcmp(peer, "(none)")) {
+ XIO_INF("cannot start sync, no primary is designated\n");
+ if (do_start)
+ start_pos = 0;
+ do_start = false;
+ }
+
+ /* Check syncfrom link (when existing)
+ */
+ brick_string_free(tmp);
+ tmp = path_make("%s/syncfrom-%s", dent->d_parent->d_path, my_id());
+ syncfrom_dent = (void *)mars_find_dent(global, tmp);
+ if (do_start && syncfrom_dent && syncfrom_dent->link_val &&
+ strcmp(syncfrom_dent->link_val, peer)) {
+ XIO_WRN(
+ "cannot start sync, primary has changed: '%s' != '%s'\n",
+ syncfrom_dent->link_val, peer);
+ if (do_start)
+ start_pos = 0;
+ do_start = false;
+ }
+
+ /* Disallow contemporary sync & logfile_replay
+ */
+ if (do_start &&
+ rot->trans_brick &&
+ !rot->trans_brick->power.off_led) {
+ XIO_INF("cannot start sync because logger is working\n");
+ do_start = false;
+ }
+
+ /* Disallow overwrite of newer data
+ */
+ if (do_start)
+ write_info_links(rot);
+ rot->forbid_replay = (do_start && compare_replaylinks(rot, peer, my_id()) < 0);
+ if (rot->forbid_replay) {
+ XIO_INF("cannot start sync because my data is newer than the remote one at '%s'!\n", peer);
+ do_start = false;
+ }
+
+ /* Flip between replay and sync
+ */
+ if (do_start && rot->replay_mode && rot->end_pos > rot->start_pos &&
+ mars_sync_flip_interval >= 8) {
+ if (!rot->flip_start) {
+ rot->flip_start = jiffies;
+ } else if ((long long)jiffies - rot->flip_start > mars_sync_flip_interval * HZ) {
+ do_start = false;
+ rot->flip_start = jiffies + mars_sync_flip_interval * HZ;
+ }
+ } else {
+ rot->flip_start = 0;
+ }
+
+ XIO_DBG("initial sync '%s' => '%s' do_start = %d\n", src, dst, do_start);
+ /* Obey global sync limit
+ */
+ rot->wants_sync = (do_start != 0);
+ if (rot->wants_sync && global_sync_limit > 0) {
+ do_start = rot->gets_sync;
+ if (!rot->gets_sync) {
+ XIO_INF_TO(
+ rot->log_say, "won't start sync because of parallelism limit %d\n", global_sync_limit);
+ }
+ }
+
+shortcut:
+ /* Start copy
+ */
+ src = path_make("data-%s@%s:%d", peer, peer, xio_net_default_port + 2);
+ dst = path_make("data-%s", my_id());
+ copy_path = backskip_replace(dent->d_path, '/', true, "/copy-");
+
+ /* check whether connection is allowed */
+ switch_path = path_make("%s/todo-%s/sync", dent->d_parent->d_path, my_id());
+
+ status = -ENOMEM;
+ if (unlikely(!src || !dst || !copy_path || !switch_path))
+ goto done;
+
+ /* Informational
+ */
+ XIO_DBG(
+ "start_pos = %lld end_pos = %lld sync_finish_stamp=%lu do_start=%d\n",
+ start_pos, end_pos, rot->sync_finish_stamp.tv_sec, do_start);
+
+ if (!do_start)
+ memset(&rot->sync_finish_stamp, 0, sizeof(rot->sync_finish_stamp));
+
+ /* Now do it....
+ */
+ {
+ const char *argv[2] = { src, dst };
+
+ status = __make_copy(
+ global, dent,
+ do_start ? switch_path : "",
+ copy_path, dent->d_parent->d_path, argv, find_key(rot->msgs, "inf-sync"),
+ start_pos, end_pos,
+ true,
+ mars_fast_fullsync > 0,
+ true, false, &copy);
+ if (copy) {
+ copy->kill_ptr = (void **)&rot->sync_brick;
+ copy->copy_limiter = &rot->sync_limiter;
+ }
+ rot->sync_brick = copy;
+ }
+
+ /* Update syncstatus symlink
+ */
+ if (status >= 0 && copy &&
+ ((copy->power.button && copy->power.on_led) ||
+ !copy->copy_start ||
+ (copy->copy_last == copy->copy_end && copy->copy_end > 0))) {
+ status = _update_syncstatus(rot, copy, peer);
+ }
+
+done:
+ XIO_DBG("status = %d\n", status);
+ brick_string_free(tmp);
+ brick_string_free(src);
+ brick_string_free(dst);
+ brick_string_free(copy_path);
+ brick_string_free(switch_path);
+ return status;
+}
+
+static
+bool remember_peer(struct mars_rotate *rot, struct mars_peerinfo *peer)
+{
+ if (!peer || !rot || rot->preferred_peer)
+ return false;
+
+ if ((long long)peer->last_remote_jiffies + mars_scan_interval * HZ * 2 < (long long)jiffies)
+ return false;
+
+ rot->preferred_peer = brick_strdup(peer->peer);
+ return true;
+}
+
+static
+int make_connect(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot;
+ struct mars_peerinfo *peer;
+ char *names;
+ char *this_name;
+ char *tmp;
+
+ if (unlikely(!dent->d_parent || !dent->link_val))
+ goto done;
+ rot = dent->d_parent->d_private;
+ if (unlikely(!rot))
+ goto done;
+
+ names = brick_strdup(dent->link_val);
+ for (tmp = this_name = names; *tmp; tmp++) {
+ if (*tmp == MARS_DELIM) {
+ *tmp = '\0';
+ peer = find_peer(this_name);
+ if (remember_peer(rot, peer))
+ goto found;
+ this_name = tmp + 1;
+ }
+ }
+ peer = find_peer(this_name);
+ remember_peer(rot, peer);
+
+found:
+ brick_string_free(names);
+done:
+ return 0;
+}
+
+static int prepare_delete(void *buf, struct mars_dent *dent)
+{
+ struct kstat stat;
+ struct kstat *to_delete = NULL;
+ struct mars_global *global = buf;
+ struct mars_dent *target;
+ struct mars_dent *response;
+ const char *marker_path = NULL;
+ const char *response_path = NULL;
+ struct xio_brick *brick;
+ int max_serial = 0;
+ int status;
+
+ if (!global || !dent || !dent->link_val || !dent->d_path)
+ goto err;
+
+ /* create a marker which prevents concurrent updates from remote hosts */
+ marker_path = backskip_replace(dent->link_val, '/', true, "/.deleted-");
+ if (mars_stat(marker_path, &stat, true) < 0 ||
+ timespec_compare(&dent->stat_val.mtime, &stat.mtime) > 0) {
+ XIO_DBG(
+ "creating / updating marker '%s' mtime=%lu.%09lu\n",
+ marker_path, dent->stat_val.mtime.tv_sec, dent->stat_val.mtime.tv_nsec);
+ mars_symlink("1", marker_path, &dent->stat_val.mtime, 0);
+ }
+
+ brick = mars_find_brick(global, NULL, dent->link_val);
+ if (brick &&
+ unlikely((brick->nr_outputs > 0 && brick->outputs[0] && brick->outputs[0]->nr_connected) ||
+ (brick->type == (void *)&if_brick_type && !brick->power.off_led))) {
+ XIO_WRN("target '%s' cannot be deleted, its brick '%s' in use\n", dent->link_val, brick->brick_name);
+ goto done;
+ }
+
+ status = 0;
+ target = mars_find_dent(global, dent->link_val);
+ if (target) {
+ if (timespec_compare(&target->stat_val.mtime, &dent->stat_val.mtime) > 0) {
+ XIO_WRN("target '%s' has newer timestamp than deletion link, ignoring\n", dent->link_val);
+ status = -EAGAIN;
+ goto ok;
+ }
+ if (target->d_child_count) {
+ XIO_WRN("target '%s' has %d children, cannot kill\n", dent->link_val, target->d_child_count);
+ goto done;
+ }
+ target->d_killme = true;
+ XIO_DBG("target '%s' marked for removal\n", dent->link_val);
+ to_delete = &target->stat_val;
+ } else if (mars_stat(dent->link_val, &stat, true) >= 0) {
+ if (timespec_compare(&stat.mtime, &dent->stat_val.mtime) > 0) {
+ XIO_WRN("target '%s' has newer timestamp than deletion link, ignoring\n", dent->link_val);
+ status = -EAGAIN;
+ goto ok;
+ }
+ to_delete = &stat;
+ } else {
+ status = -EAGAIN;
+ XIO_DBG("target '%s' does no longer exist\n", dent->link_val);
+ }
+ if (to_delete) {
+ status = mars_unlink(dent->link_val);
+ XIO_DBG("unlink '%s', status = %d\n", dent->link_val, status);
+ }
+
+ok:
+ if (status < 0) {
+ XIO_DBG(
+ "deletion '%s' of target '%s' is accomplished\n",
+ dent->d_path, dent->link_val);
+ if (dent->d_serial <= global->deleted_border) {
+ XIO_DBG("removing deletion symlink '%s'\n", dent->d_path);
+ dent->d_killme = true;
+ mars_unlink(dent->d_path);
+ XIO_DBG("removing marker '%s'\n", marker_path);
+ mars_unlink(marker_path);
+ }
+ }
+
+done:
+ /* tell the world that we have seen this deletion... (even when not yet accomplished) */
+ response_path = path_make("/mars/todo-global/deleted-%s", my_id());
+ response = mars_find_dent(global, response_path);
+ if (response && response->link_val) {
+ int status = kstrtoint(response->link_val, 10, &max_serial);
+
+ (void)status; /* leave untouched in case of errors */
+ }
+ if (dent->d_serial > max_serial) {
+ char response_val[16];
+
+ max_serial = dent->d_serial;
+ global->deleted_my_border = max_serial;
+ snprintf(response_val, sizeof(response_val), "%09d", max_serial);
+ mars_symlink(response_val, response_path, NULL, 0);
+ }
+
+err:
+ brick_string_free(marker_path);
+ brick_string_free(response_path);
+ return 0;
+}
+
+static int check_deleted(void *buf, struct mars_dent *dent)
+{
+ struct mars_global *global = buf;
+ int serial = 0;
+ int status;
+
+ if (!global || !dent || !dent->link_val)
+ goto done;
+
+ status = kstrtoint(dent->link_val, 10, &serial);
+ if (unlikely(status || serial <= 0)) {
+ XIO_WRN("cannot parse symlink '%s' -> '%s'\n", dent->d_path, dent->link_val);
+ goto done;
+ }
+
+ if (!strcmp(dent->d_rest, my_id()))
+ global->deleted_my_border = serial;
+
+ /* Compute the minimum of the deletion progress among
+ * the resource members.
+ */
+ if (serial < global->deleted_min || !global->deleted_min)
+ global->deleted_min = serial;
+
+done:
+ return 0;
+}
+
+static
+int make_res(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot = dent->d_private;
+
+ if (!rot) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ rot->has_symlinks = false;
+
+done:
+ return 0;
+}
+
+static
+int kill_res(void *buf, struct mars_dent *dent)
+{
+ struct mars_rotate *rot = dent->d_private;
+
+ if (unlikely(!rot || !rot->parent_path)) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+
+ show_vals(rot->msgs, rot->parent_path, "");
+
+ if (unlikely(!rot->global)) {
+ XIO_DBG("nothing to do\n");
+ goto done;
+ }
+ if (rot->has_symlinks) {
+ XIO_DBG("symlinks were present, nothing to kill.\n");
+ goto done;
+ }
+
+ /* this code is only executed in case of forced deletion of symlinks */
+ if (rot->if_brick || rot->sync_brick || rot->fetch_brick || rot->trans_brick) {
+ rot->res_shutdown = true;
+ XIO_WRN("resource '%s' has no symlinks, shutting down.\n", rot->parent_path);
+ }
+ if (rot->if_brick) {
+ if (atomic_read(&rot->if_brick->open_count) > 0) {
+ XIO_ERR("cannot destroy resource '%s': device is is use!\n", rot->parent_path);
+ goto done;
+ }
+ rot->if_brick->killme = true;
+ if (!rot->if_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->if_brick, false, false);
+
+ XIO_INF("switching off resource '%s', device status = %d\n", rot->parent_path, status);
+ } else {
+ xio_kill_brick((void *)rot->if_brick);
+ rot->if_brick = NULL;
+ }
+ }
+ if (rot->sync_brick) {
+ rot->sync_brick->killme = true;
+ if (!rot->sync_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->sync_brick, false, false);
+
+ XIO_INF("switching off resource '%s', sync status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (rot->fetch_brick) {
+ rot->fetch_brick->killme = true;
+ if (!rot->fetch_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->fetch_brick, false, false);
+
+ XIO_INF("switching off resource '%s', fetch status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (rot->trans_brick) {
+ struct trans_logger_output *output = rot->trans_brick->outputs[0];
+
+ if (!output || output->nr_connected) {
+ XIO_ERR("cannot destroy resource '%s': trans_logger is is use!\n", rot->parent_path);
+ goto done;
+ }
+ rot->trans_brick->killme = true;
+ if (!rot->trans_brick->power.off_led) {
+ int status = mars_power_button((void *)rot->trans_brick, false, false);
+
+ XIO_INF("switching off resource '%s', logger status = %d\n", rot->parent_path, status);
+ }
+ }
+ if (!rot->if_brick && !rot->sync_brick && !rot->fetch_brick && !rot->trans_brick)
+ rot->res_shutdown = false;
+
+done:
+ return 0;
+}
+
+static
+int make_defaults(void *buf, struct mars_dent *dent)
+{
+ if (!dent->link_val)
+ goto done;
+
+ XIO_DBG("name = '%s' value = '%s'\n", dent->d_name, dent->link_val);
+
+ if (!strcmp(dent->d_name, "sync-limit")) {
+ int status = kstrtoint(dent->link_val, 10, &global_sync_limit);
+
+ (void)status; /* leave untouched in case of errors */
+ } else if (!strcmp(dent->d_name, "sync-pref-list")) {
+ const char *start;
+ struct list_head *tmp;
+ int len;
+ int want_count = 0;
+ int get_count = 0;
+
+ for (tmp = rot_anchor.next; tmp != &rot_anchor; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync)
+ want_count++;
+ else
+ rot->gets_sync = false;
+ if (rot->sync_brick && rot->sync_brick->power.on_led)
+ get_count++;
+ }
+ global_sync_want = want_count;
+ global_sync_nr = get_count;
+
+ /* prefer mentioned resources in the right order */
+ for (start = dent->link_val; *start && get_count < global_sync_limit; start += len) {
+ len = 1;
+ while (start[len] && start[len] != ',')
+ len++;
+ for (tmp = rot_anchor.next; tmp != &rot_anchor; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync && rot->parent_rest && !strncmp(start, rot->parent_rest, len)) {
+ rot->gets_sync = true;
+ get_count++;
+ XIO_DBG(
+ "new get_count = %d res = '%s' wants_sync = %d gets_sync = %d\n",
+ get_count, rot->parent_rest, rot->wants_sync, rot->gets_sync);
+ break;
+ }
+ }
+ if (start[len])
+ len++;
+ }
+ /* fill up with unmentioned resources */
+ for (tmp = rot_anchor.next; tmp != &rot_anchor && get_count < global_sync_limit; tmp = tmp->next) {
+ struct mars_rotate *rot = container_of(tmp, struct mars_rotate, rot_head);
+
+ if (rot->wants_sync && !rot->gets_sync) {
+ rot->gets_sync = true;
+ get_count++;
+ }
+ XIO_DBG(
+ "new get_count = %d res = '%s' wants_sync = %d gets_sync = %d\n",
+ get_count, rot->parent_rest, rot->wants_sync, rot->gets_sync);
+ }
+ XIO_DBG("final want_count = %d get_count = %d\n", want_count, get_count);
+ } else {
+ XIO_DBG("unimplemented default '%s'\n", dent->d_name);
+ }
+done:
+ return 0;
+}
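
The sync-pref-list handling above implements a small two-pass allocation:
first grant sync slots to the explicitly preferred resources in their
listed order, then fill up with any remaining candidates, never exceeding
global_sync_limit. A minimal userspace sketch of the same strategy (not
part of the patch; all names and values are hypothetical):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *wants[] = { "res1", "res2", "res3" }; /* all want sync */
	int granted[3] = { 0, 0, 0 };
	const char *pref[] = { "res3", "res1" }; /* preference order */
	int limit = 2, count = 0;
	int i, j;

	/* pass 1: preferred resources, in the given order */
	for (i = 0; i < 2 && count < limit; i++)
		for (j = 0; j < 3; j++)
			if (!granted[j] && !strcmp(pref[i], wants[j])) {
				granted[j] = 1;
				count++;
				break;
			}
	/* pass 2: fill up with unmentioned resources */
	for (j = 0; j < 3 && count < limit; j++)
		if (!granted[j]) {
			granted[j] = 1;
			count++;
		}
	for (j = 0; j < 3; j++)
		printf("%s gets_sync=%d\n", wants[j], granted[j]);
	return 0;
}
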
+
+/*********************************************************************/
+
+/* Please keep the order the same as in the enum.
+ */
+static const struct main_class main_classes[] = {
+ /* Placeholder for root node /mars/
+ */
+ [CL_ROOT] = {
+ },
+
+ /* UUID, identifying the whole cluster.
+ */
+ [CL_UUID] = {
+ .cl_name = "uuid",
+ .cl_len = 4,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+
+ /* Subdirectory for global userspace items...
+ */
+ [CL_GLOBAL_USERSPACE] = {
+ .cl_name = "userspace",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ [CL_GLOBAL_USERSPACE_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_GLOBAL_USERSPACE,
+ },
+
+ /* Subdirectory for defaults...
+ */
+ [CL_DEFAULTS0] = {
+ .cl_name = "defaults",
+ .cl_len = 8,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ [CL_DEFAULTS] = {
+ .cl_name = "defaults-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = true,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_DEFAULTS_ITEMS0] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_DEFAULTS0,
+ },
+ [CL_DEFAULTS_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_DEFAULTS,
+ .cl_forward = make_defaults,
+ },
+
+ /* Subdirectory for global controlling items...
+ */
+ [CL_GLOBAL_TODO] = {
+ .cl_name = "todo-global",
+ .cl_len = 11,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_GLOBAL_TODO_DELETE] = {
+ .cl_name = "delete-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_serial = true,
+ .cl_hostcontext = false, /* ignore context, although present */
+ .cl_father = CL_GLOBAL_TODO,
+ .cl_prepare = prepare_delete,
+ },
+ [CL_GLOBAL_TODO_DELETED] = {
+ .cl_name = "deleted-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_father = CL_GLOBAL_TODO,
+ .cl_prepare = check_deleted,
+ },
+
+ /* Directory containing the addresses of all peers
+ */
+ [CL_IPS] = {
+ .cl_name = "ips",
+ .cl_len = 3,
+ .cl_type = 'd',
+ .cl_father = CL_ROOT,
+ },
+ /* Anyone participating in a MARS cluster must
+ * be named here (symlink pointing to the IP address).
+ * We have no DNS in kernel space.
+ */
+ [CL_PEERS] = {
+ .cl_name = "ip-",
+ .cl_len = 3,
+ .cl_type = 'l',
+ .cl_father = CL_IPS,
+ .cl_forward = make_scan,
+ .cl_backward = kill_scan,
+ },
+ /* Subdirectory for actual state
+ */
+ [CL_GBL_ACTUAL] = {
+ .cl_name = "actual-",
+ .cl_len = 7,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_ROOT,
+ },
+ /* ... and its contents
+ */
+ [CL_GBL_ACTUAL_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_GBL_ACTUAL,
+ },
+ /* Indicate aliveness of all cluster participants
+ * by the timestamp of this link.
+ */
+ [CL_ALIVE] = {
+ .cl_name = "alive-",
+ .cl_len = 6,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ [CL_TIME] = {
+ .cl_name = "time-",
+ .cl_len = 5,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Show version indication for symlink tree.
+ */
+ [CL_TREE] = {
+ .cl_name = "tree-",
+ .cl_len = 5,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Indicate whether filesystem is full
+ */
+ [CL_EMERGENCY] = {
+ .cl_name = "emergency-",
+ .cl_len = 10,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+ /* Ditto, as a percentage
+ */
+ [CL_REST_SPACE] = {
+ .cl_name = "rest-space-",
+ .cl_len = 11,
+ .cl_type = 'l',
+ .cl_father = CL_ROOT,
+ },
+
+ /* Directory containing all items of a resource
+ */
+ [CL_RESOURCE] = {
+ .cl_name = "resource-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_use_channel = true,
+ .cl_father = CL_ROOT,
+ .cl_forward = make_res,
+ .cl_backward = kill_res,
+ },
+
+ /* Subdirectory for resource-specific userspace items...
+ */
+ [CL_RESOURCE_USERSPACE] = {
+ .cl_name = "userspace",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ [CL_RESOURCE_USERSPACE_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RESOURCE_USERSPACE,
+ },
+
+ /* Subdirectory for defaults...
+ */
+ [CL_RES_DEFAULTS0] = {
+ .cl_name = "defaults",
+ .cl_len = 8,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ [CL_RES_DEFAULTS] = {
+ .cl_name = "defaults-",
+ .cl_len = 9,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_RES_DEFAULTS_ITEMS0] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RES_DEFAULTS0,
+ },
+ [CL_RES_DEFAULTS_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_RES_DEFAULTS,
+ },
+
+ /* Subdirectory for controlling items...
+ */
+ [CL_TODO] = {
+ .cl_name = "todo-",
+ .cl_len = 5,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_TODO_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_TODO,
+ },
+
+ /* Subdirectory for actual state
+ */
+ [CL_ACTUAL] = {
+ .cl_name = "actual-",
+ .cl_len = 7,
+ .cl_type = 'd',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* ... and its contents
+ */
+ [CL_ACTUAL_ITEMS] = {
+ .cl_name = "",
+ .cl_len = 0, /* catch any */
+ .cl_type = 'l',
+ .cl_father = CL_ACTUAL,
+ },
+
+ /* File or symlink to the real device / real (sparse) file
+ * when hostcontext is missing, the corresponding peer will
+ * not participate in that resource.
+ */
+ [CL_DATA] = {
+ .cl_name = "data-",
+ .cl_len = 5,
+ .cl_type = 'F',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_bio,
+ .cl_backward = kill_any,
+ },
+ /* Symlink indicating the (common) size of the resource
+ */
+ [CL_SIZE] = {
+ .cl_name = "size",
+ .cl_len = 4,
+ .cl_type = 'l',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_log_init,
+ .cl_backward = kill_any,
+ },
+ /* Ditto for each individual size
+ */
+ [CL_ACTSIZE] = {
+ .cl_name = "actsize-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Symlink pointing to the name of the primary node
+ */
+ [CL_PRIMARY] = {
+ .cl_name = "primary",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_primary,
+ .cl_backward = NULL,
+ },
+ /* Symlink for connection preferences
+ */
+ [CL_CONNECT] = {
+ .cl_name = "connect-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_connect,
+ },
+ /* informational symlink indicating the current
+ * status / start / pos / end of logfile transfers.
+ */
+ [CL_TRANSFER] = {
+ .cl_name = "transferstatus-",
+ .cl_len = 15,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* symlink indicating the current status / end
+ * of initial data sync.
+ */
+ [CL_SYNC] = {
+ .cl_name = "syncstatus-",
+ .cl_len = 11,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_sync,
+ .cl_backward = kill_any,
+ },
+ /* informational symlink for verify status
+ * of initial data sync.
+ */
+ [CL_VERIF] = {
+ .cl_name = "verifystatus-",
+ .cl_len = 13,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* informational symlink: after sync has finished,
+ * keep a copy of the replay symlink from the primary.
+ * when comparing the own replay symlink against this,
+ * we can determine whether we are consistent.
+ */
+ [CL_SYNCPOS] = {
+ .cl_name = "syncpos-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Passive symlink indicating the split-brain crypto hash
+ */
+ [CL_VERSION] = {
+ .cl_name = "version-",
+ .cl_len = 8,
+ .cl_type = 'l',
+ .cl_serial = true,
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ },
+ /* Logfiles for transaction logger
+ */
+ [CL_LOG] = {
+ .cl_name = "log-",
+ .cl_len = 4,
+ .cl_type = 'F',
+ .cl_serial = true,
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_log_step,
+ .cl_backward = kill_any,
+ },
+ /* Symlink indicating the last state of
+ * transaction log replay.
+ */
+ [CL_REPLAYSTATUS] = {
+ .cl_name = "replay-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = true,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_replay,
+ .cl_backward = kill_any,
+ },
+
+ /* Name of the device appearing at the primary
+ */
+ [CL_DEVICE] = {
+ .cl_name = "device-",
+ .cl_len = 7,
+ .cl_type = 'l',
+ .cl_hostcontext = false,
+ .cl_father = CL_RESOURCE,
+ .cl_forward = make_dev,
+ .cl_backward = kill_dev,
+ },
+
+ /* Quirk: when dead resources are recreated during a network partition,
+ * this is used to avoid version number clashes in the
+ * partitioned cluster.
+ */
+ [CL_MAXNR] = {
+ .cl_name = "maxnr",
+ .cl_len = 5,
+ .cl_type = 'l',
+ .cl_father = CL_RESOURCE,
+ },
+ {}
+};
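
The main_classes[] table above drives a purely table-based classification
of names found under /mars/: each entry matches by prefix and by the class
of the parent directory. The following standalone sketch (not part of the
patch; the reduced two-entry table is hypothetical) shows the core of that
matching scheme, omitting the serial-number and host-context checks that
main_checker() below performs in addition:

#include <stdio.h>
#include <string.h>

struct tiny_class {
	const char *prefix; /* like cl_name */
	int father;         /* required parent class, -1 = root */
};

static const struct tiny_class tiny_classes[] = {
	{ "resource-", -1 },  /* class 0: directly under the root */
	{ "syncstatus-", 0 }, /* class 1: only inside a resource */
};

static int tiny_check(const char *name, int parent_class)
{
	int class;

	for (class = 0; class < 2; class++) {
		const struct tiny_class *test = &tiny_classes[class];

		if (test->father != parent_class)
			continue;
		if (strncmp(name, test->prefix, strlen(test->prefix)))
			continue;
		return class;
	}
	return -2; /* same "no match" convention as main_checker() */
}

int main(void)
{
	/* "syncstatus-hostA" only matches inside a resource dir */
	printf("%d\n", tiny_check("syncstatus-hostA", 0));  /* 1 */
	printf("%d\n", tiny_check("syncstatus-hostA", -1)); /* -2 */
	return 0;
}
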
+
+/* Helper routine to pre-determine the relevance of a name from the filesystem.
+ */
+int main_checker(
+struct mars_dent *parent,
+const char *_name,
+int namlen,
+unsigned int d_type,
+int *prefix,
+int *serial,
+bool *use_channel)
+{
+ int class;
+ int status = -2;
+
+#ifdef XIO_DEBUGGING
+ const char *name = brick_strndup(_name, namlen);
+
+#else
+ const char *name = _name;
+
+#endif
+
+ /* XIO_DBG("trying '%s' '%s'\n", path, name); */
+ for (class = CL_ROOT + 1; ; class++) {
+ const struct main_class *test = &main_classes[class];
+ int len = test->cl_len;
+
+ if (!test->cl_name) { /* end of table */
+ break;
+ }
+
+ /* XIO_DBG(" testing class '%s'\n", test->cl_name); */
+
+#ifdef XIO_DEBUGGING
+ if (len != strlen(test->cl_name)) {
+ XIO_ERR(
+ "internal table '%s' mismatch: %d != %d\n", test->cl_name, len, (int)strlen(test->cl_name));
+ len = strlen(test->cl_name);
+ }
+#endif
+
+ if (test->cl_father &&
+ (!parent || parent->d_class != test->cl_father)) {
+ continue;
+ }
+
+ if (len > 0 &&
+ (namlen < len || memcmp(name, test->cl_name, len))) {
+ continue;
+ }
+
+ /* XIO_DBG("path '%s/%s' matches class %d '%s'\n", path, name, class, test->cl_name); */
+
+ /* check special contexts */
+ if (test->cl_serial) {
+ int plus = 0;
+ int count;
+
+ count = sscanf(name + len, "%d%n", serial, &plus);
+ if (count < 1) {
+ /* XIO_DBG("'%s' serial number mismatch at '%s'\n", name, name + len); */
+ continue;
+ }
+ /* XIO_DBG("'%s' serial number = %d\n", name, *serial); */
+ len += plus;
+ if (name[len] == '-')
+ len++;
+ }
+ if (prefix)
+ *prefix = len;
+ if (test->cl_hostcontext) {
+ if (memcmp(name + len, my_id(), namlen - len)) {
+ /* XIO_DBG("context mismatch '%s' at '%s'\n", name, name + len); */
+ continue;
+ }
+ }
+
+ /* all ok */
+ status = class;
+ *use_channel = test->cl_use_channel;
+ }
+
+#ifdef XIO_DEBUGGING
+ brick_string_free(name);
+#endif
+ return status;
+}
+
+/* Do some syntactic checks, then delegate work to the real worker functions
+ * from the main_classes[] table.
+ */
+static int main_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+ main_worker_fn worker;
+ int class = dent->d_class;
+
+ if (class < 0 || class >= sizeof(main_classes) / sizeof(struct main_class)) {
+ XIO_ERR("bad internal class %d of '%s'\n", class, dent->d_path);
+ return -EINVAL;
+ }
+ switch (main_classes[class].cl_type) {
+ case 'd':
+ if (!S_ISDIR(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a directory, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'f':
+ if (!S_ISREG(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a regular file, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'F':
+ if (!S_ISREG(dent->stat_val.mode) && !S_ISLNK(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a regular file or a symlink, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ case 'l':
+ if (!S_ISLNK(dent->stat_val.mode)) {
+ XIO_ERR("'%s' should be a symlink, but is something else\n", dent->d_path);
+ return -EINVAL;
+ }
+ break;
+ }
+ if (likely(class > CL_ROOT)) {
+ int father = main_classes[class].cl_father;
+
+ if (father == CL_ROOT) {
+ if (unlikely(dent->d_parent)) {
+ XIO_ERR("'%s' class %d is not at the root of the hierarchy\n", dent->d_path, class);
+ return -EINVAL;
+ }
+ } else if (unlikely(!dent->d_parent || dent->d_parent->d_class != father)) {
+ XIO_ERR(
+ "last component '%s' from '%s' is at the wrong position in the hierarchy (class = %d, parent_class = %d, parent = '%s')\n",
+ dent->d_name,
+ dent->d_path,
+ father,
+ dent->d_parent ? dent->d_parent->d_class : -9999,
+ dent->d_parent ? dent->d_parent->d_path : "");
+ return -EINVAL;
+ }
+ }
+ if (prepare)
+ worker = main_classes[class].cl_prepare;
+ else if (direction)
+ worker = main_classes[class].cl_backward;
+ else
+ worker = main_classes[class].cl_forward;
+ if (worker) {
+ int status;
+
+ if (!direction)
+ XIO_DBG(
+ "--- start working %s on '%s' rest='%s'\n",
+ direction ? "backward" : "forward",
+ dent->d_path,
+ dent->d_rest);
+ status = worker(global, (void *)dent);
+ XIO_DBG(
+ "--- done, worked %s on '%s', status = %d\n",
+ direction ? "backward" : "forward",
+ dent->d_path,
+ status);
+ return status;
+ }
+ return 0;
+}
+
+static struct mars_global _global = {
+ .dent_anchor = LIST_HEAD_INIT(_global.dent_anchor),
+ .brick_anchor = LIST_HEAD_INIT(_global.brick_anchor),
+ .global_power = {
+ .button = true,
+ },
+ .main_event = __WAIT_QUEUE_HEAD_INITIALIZER(_global.main_event),
+};
+
+static int _main_thread(void *data)
+{
+ long long last_rollover = jiffies;
+ char *id = my_id();
+ int status = 0;
+
+ init_rwsem(&_global.dent_mutex);
+ init_rwsem(&_global.brick_mutex);
+
+ mars_global = &_global;
+
+ if (!id || strlen(id) < 2) {
+ XIO_ERR("invalid hostname\n");
+ status = -EFAULT;
+ goto done;
+ }
+
+ XIO_INF("-------- starting as host '%s' ----------\n", id);
+
+ while (_global.global_power.button || !list_empty(&_global.brick_anchor)) {
+ int status;
+
+ XIO_DBG("-------- NEW ROUND ---------\n");
+
+ if (mars_mem_percent < 0)
+ mars_mem_percent = 0;
+ if (mars_mem_percent > 70)
+ mars_mem_percent = 70;
+ brick_global_memlimit = (long long)brick_global_memavail * mars_mem_percent / 100;
+
+ brick_msleep(100);
+
+ if (brick_thread_should_stop()) {
+ _global.global_power.button = false;
+ xio_net_is_alive = false;
+ }
+
+ _make_alive();
+
+ compute_emergency_mode();
+
+ XIO_DBG("-------- start worker ---------\n");
+ _global.deleted_min = 0;
+ status = mars_dent_work(
+ &_global, "/mars", sizeof(struct mars_dent), main_checker, main_worker, &_global, 3);
+ _global.deleted_border = _global.deleted_min;
+ XIO_DBG("-------- worker deleted_min = %d status = %d\n", _global.deleted_min, status);
+
+ if (!_global.global_power.button) {
+ status = xio_kill_brick_when_possible(
+ &_global, &_global.brick_anchor, false, (void *)&copy_brick_type, true);
+ XIO_DBG("kill copy bricks (when possible) = %d\n", status);
+ }
+
+ status = xio_kill_brick_when_possible(&_global, &_global.brick_anchor, false, NULL, false);
+ XIO_DBG("kill main bricks (when possible) = %d\n", status);
+
+ status = xio_kill_brick_when_possible(
+ &_global, &_global.brick_anchor, false, (void *)&client_brick_type, true);
+ XIO_DBG("kill client bricks (when possible) = %d\n", status);
+ status = xio_kill_brick_when_possible(
+ &_global, &_global.brick_anchor, false, (void *)&sio_brick_type, true);
+ XIO_DBG("kill sio bricks (when possible) = %d\n", status);
+ status = xio_kill_brick_when_possible(
+ &_global, &_global.brick_anchor, false, (void *)&bio_brick_type, true);
+ XIO_DBG("kill bio bricks (when possible) = %d\n", status);
+
+ if ((long long)jiffies - last_rollover >= mars_rollover_interval * HZ) {
+ last_rollover = jiffies;
+ rollover_all();
+ }
+
+ _show_status_all(&_global);
+ show_vals(gbl_pairs, "/mars", "");
+ show_statistics(&_global, "main");
+
+ XIO_DBG(
+ "ban_count = %d ban_renew_count = %d\n", xio_global_ban.ban_count, xio_global_ban.ban_renew_count);
+
+ brick_msleep(500);
+
+ wait_event_interruptible_timeout(_global.main_event, _global.main_trigger, mars_scan_interval * HZ);
+
+ _global.main_trigger = false;
+ }
+
+done:
+ XIO_INF("-------- cleaning up ----------\n");
+ remote_trigger();
+ brick_msleep(1000);
+
+ xio_free_dent_all(&_global, &_global.dent_anchor);
+ xio_kill_brick_all(&_global, &_global.brick_anchor, false);
+
+ _show_status_all(&_global);
+ show_vals(gbl_pairs, "/mars", "");
+ show_statistics(&_global, "main");
+
+ mars_global = NULL;
+
+ XIO_INF("-------- done status = %d ----------\n", status);
+ /* cleanup_mm(); */
+ return status;
+}
+
+static
+char *_xio_info(void)
+{
+ int max = PAGE_SIZE - 64;
+ char *txt;
+ struct list_head *tmp;
+ int dent_count = 0;
+ int brick_count = 0;
+ int pos = 0;
+
+ if (unlikely(!mars_global))
+ return NULL;
+
+ txt = brick_string_alloc(max);
+
+ txt[--max] = '\0'; /* safeguard */
+
+ down_read(&mars_global->brick_mutex);
+ for (tmp = mars_global->brick_anchor.next; tmp != &mars_global->brick_anchor; tmp = tmp->next) {
+ struct xio_brick *test;
+
+ brick_count++;
+ test = container_of(tmp, struct xio_brick, global_brick_link);
+ pos += scnprintf(
+ txt + pos, max - pos,
+ "brick button=%d off=%d on=%d path='%s'\n",
+ test->power.button,
+ test->power.off_led,
+ test->power.on_led,
+ test->brick_path
+ );
+ }
+ up_read(&mars_global->brick_mutex);
+
+ pos += scnprintf(
+ txt + pos, max - pos,
+ "SUMMARY: brick_count=%d dent_count=%d\n",
+ brick_count,
+ dent_count
+ );
+
+ return txt;
+}
+
+#define INIT_MAX 32
+static char *exit_names[INIT_MAX];
+static void (*exit_fn[INIT_MAX])(void);
+static int exit_fn_nr;
+
+#define DO_INIT(name) \
+ do { \
+ XIO_DBG("=== starting module " #name "...\n"); \
+ status = init_##name(); \
+ if (status < 0) \
+ goto done; \
+ exit_names[exit_fn_nr] = #name; \
+ exit_fn[exit_fn_nr++] = exit_##name; \
+ } while (0)
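
DO_INIT() above records the matching exit_*() function for every
successfully initialized module, so that exit_main() can unwind them in
reverse (LIFO) order, both on init failure and on module unload. A minimal
userspace sketch of this pattern (not part of the patch; the module names
a and b are hypothetical):

#include <stdio.h>

static void (*exit_stack[8])(void);
static int exit_nr;

static int init_a(void) { puts("init a"); return 0; }
static void exit_a(void) { puts("exit a"); }
static int init_b(void) { puts("init b"); return 0; }
static void exit_b(void) { puts("exit b"); }

#define TRY_INIT(name)                                  \
	do {                                            \
		if (init_##name() < 0)                  \
			goto fail;                      \
		exit_stack[exit_nr++] = exit_##name;    \
	} while (0)

int main(void)
{
	TRY_INIT(a);
	TRY_INIT(b);
fail:
	while (exit_nr > 0) /* LIFO teardown, like exit_main() */
		exit_stack[--exit_nr]();
	return 0;
}
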
+
+void (*_remote_trigger)(void);
+
+static void exit_main(void)
+{
+ XIO_DBG("====================== stopping everything...\n");
+ /* TODO: make this thread-safe. */
+ if (main_thread) {
+ XIO_DBG("=== stopping main thread...\n");
+ local_trigger();
+ XIO_INF("stopping main thread...\n");
+ brick_thread_stop(main_thread);
+ }
+
+ xio_info = NULL;
+ _remote_trigger = NULL;
+
+ while (exit_fn_nr > 0) {
+ XIO_DBG("=== stopping module %s ...\n", exit_names[exit_fn_nr - 1]);
+ exit_fn[--exit_fn_nr]();
+ }
+
+ XIO_DBG("====================== stopped everything.\n");
+ exit_say();
+ printk(KERN_INFO "stopped MARS\n");
+ /* Workaround for nasty race: some kernel threads have not yet
+ * really finished even _after_ kthread_stop() and may execute
+ * some code which will disappear right after return from this
+ * function.
+ * A correct solution would probably need the help of the kernel
+ * scheduler.
+ */
+ brick_msleep(1000);
+}
+
+static int __init init_main(void)
+{
+ struct kstat dummy;
+ int status = mars_stat("/mars/uuid", &dummy, true);
+
+ if (unlikely(status < 0)) {
+ printk(
+ KERN_ERR "cannot load MARS: cluster UUID is missing. Mount /mars/, and/or use {create,join}-cluster first.\n");
+ return -ENOENT;
+ }
+
+ printk(KERN_INFO "loading MARS, tree_version=%s\n", SYMLINK_TREE_VERSION);
+
+ init_say(); /* this must come first */
+
+ /* be careful: order is important!
+ */
+ DO_INIT(brick_mem);
+ DO_INIT(brick);
+ DO_INIT(xio);
+ DO_INIT(xio_mapfree);
+ DO_INIT(xio_net);
+ DO_INIT(xio_client);
+ DO_INIT(xio_sio);
+ DO_INIT(xio_bio);
+ DO_INIT(xio_copy);
+ DO_INIT(log_format);
+ DO_INIT(xio_trans_logger);
+ DO_INIT(xio_if);
+
+ DO_INIT(sy);
+ DO_INIT(sy_net);
+ DO_INIT(xio_proc);
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ brick_pre_reserve[5] = 64;
+ brick_mem_reserve();
+#endif
+
+ DO_INIT(xio_server);
+
+ status = compute_emergency_mode();
+ if (check_mars_space && unlikely(status < 0)) {
+ XIO_ERR("Sorry, your /mars/ filesystem is too small!\n");
+ goto done;
+ }
+ status = 0;
+
+ main_thread = brick_thread_create(_main_thread, NULL, "mars_main");
+ if (unlikely(!main_thread)) {
+ status = -ENOENT;
+ goto done;
+ }
+
+done:
+ if (status < 0) {
+ XIO_ERR("module init failed with status = %d, exiting.\n", status);
+ exit_main();
+ }
+ _remote_trigger = __remote_trigger;
+ xio_info = _xio_info;
+ return status;
+}
+
+/* force module loading */
+const void *dummy1 = &client_brick_type;
+const void *dummy2 = &server_brick_type;
+
+MODULE_DESCRIPTION("XIO");
+MODULE_AUTHOR("Thomas Schoebel-Theuer <tst@{schoebel-theuer,1und1}.de>");
+MODULE_VERSION(SYMLINK_TREE_VERSION);
+MODULE_LICENSE("GPL");
+
+#ifndef CONFIG_MARS_DEBUG
+MODULE_INFO(debug, "production");
+#else
+MODULE_INFO(debug, "DEBUG");
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM
+MODULE_INFO(io, "BAD_PERFORMANCE");
+#endif
+#ifdef CONFIG_MARS_DEBUG_ORDER0
+MODULE_INFO(memory, "EVIL_PERFORMANCE");
+#endif
+
+module_init(init_main);
+module_exit(exit_main);
--
2.11.0

2016-12-30 22:59:45

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 31/32] mars: add new module Kconfig

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/Kconfig | 266 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 266 insertions(+)
create mode 100644 drivers/staging/mars/Kconfig

diff --git a/drivers/staging/mars/Kconfig b/drivers/staging/mars/Kconfig
new file mode 100644
index 000000000000..836185e9509c
--- /dev/null
+++ b/drivers/staging/mars/Kconfig
@@ -0,0 +1,266 @@
+#
+# MARS configuration
+#
+
+config MARS
+ tristate "storage system MARS (EXPERIMENTAL)"
+ depends on BLOCK && PROC_SYSCTL && HIGH_RES_TIMERS && !DEBUG_SLAB && !DEBUG_SG
+ default n
+ ---help---
+ MARS is a long-distance replication system for generic block devices.
+ It works asynchronously and tolerates network bottlenecks.
+ Please read the full documentation at
+ https://github.com/schoebel/mars/blob/master/docu/mars-manual.pdf?raw=true
+ Always compile MARS as a module!
+
+config MARS_CHECKS
+ bool "enable simple runtime checks in MARS"
+ depends on MARS
+ default y
+ ---help---
+ These checks should be rather lightweight. Use them
+ for beta testing and for production systems where
+ safety is more important than performance.
+ In case of bugs in the reference counting, an automatic repair
+ is attempted, which lowers the risk of memory corruptions.
+ Disable only if you need the absolutely last grain of
+ performance.
+ If unsure, say Y here.
+
+config MARS_DEBUG
+ bool "enable full runtime checks and some tracing in MARS"
+ depends on MARS
+ default n
+ ---help---
+ Some of these checks and some additional error tracing may
+ consume noticeable amounts of memory. However, this is extremely
+ valuable for finding bugs, even in production systems.
+
+ OFF for production systems. ON for testing!
+
+ If you encounter bugs in production systems, you
+ may / should use this also in production if you carefully
+ monitor your systems.
+
+config MARS_DEBUG_MEM
+ bool "debug memory operations"
+ depends on MARS_DEBUG
+ default n
+ ---help---
+ This adds considerable space and time overhead, but catches
+ many errors (including some that are not caught by kmemleak).
+
+ OFF for production systems. ON for testing!
+ Use only for development and thorough testing!
+
+config MARS_DEBUG_MEM_STRONG
+ bool "intensified debugging of memory operations"
+ depends on MARS_DEBUG_MEM
+ default y
+ ---help---
+ Trace all block allocations, find more errors.
+ Adds some overhead.
+
+ Use for debugging of new bricks or for intensified
+ regression testing.
+
+config MARS_DEBUG_ORDER0
+ bool "also debug order0 operations"
+ depends on MARS_DEBUG_MEM
+ default n
+ ---help---
+ Turn even order 0 allocations into order 1 ones and provoke
+ heavy memory fragmentation problems from the buddy allocator,
+ but catch some additional memory problems.
+ Use only if you know what you are doing!
+ Normally OFF.
+
+config MARS_DEFAULT_PORT
+ int "port number where MARS is listening"
+ depends on MARS
+ default 7777
+ ---help---
+ Best practice is to uniformly use the same port number
+ in a cluster. Therefore, this is a compiletime constant.
+ You may override this at insmod time via the mars_port= parameter.
+
+config MARS_NET_COMPAT
+ bool "compatibility to 0.1 series network protocol"
+ depends on MARS
+ default y
+ ---help---
+ TRANSITIONAL: this is only needed for _mixed_ operation of the
+ MARS Light 0.1 and 0.2 kernel modules.
+ Typically, you will need this only during upgrades, for minimizing
+ downtime (e.g. first upgrade the secondary side, then hand over,
+ and finally upgrade the former primary side).
+ This option will be removed in the 0.3 and later stable
+ series, since you will no longer need it.
+
+config MARS_LOGDIR
+ string "absolute path to the logging directory"
+ depends on MARS
+ default "/mars"
+ ---help---
+ Path to the directory where all MARS messages will reside.
+ Usually this is equal to the global /mars directory.
+
+ Logfiles and status files obey the following naming conventions:
+ 0.debug.log
+ 1.info.log
+ 2.warn.log
+ 3.error.log
+ 4.fatal.log
+ 5.total.log
+ Logfiles must already exist in order to be appended.
+ Logfiles can be rotated by renaming them and creating
+ a new empty file in place of the old one.
+
+ Status files follow the same rules, but .log is replaced
+ by .status, and they are created automatically. Their content
+ is however limited to a few seconds or minutes.
+
+ Leave this at the default unless you know what you are doing.
+
+config MARS_MIN_SPACE_4
+ int "absolutely necessary free space in /mars/ (hard limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ HARDEST EMERGENCY LIMIT
+
+ When free space in /mars/ drops under this limit,
+ transaction logging to /mars/ will stop completely,
+ even at all primary resources. All IO will directly go to the
+ underlying raw devices. The transaction logfile sequence numbers
+ will be disrupted, deliberately leaving holes in the sequence.
+
+ This is a last-resort desperate action of the kernel.
+
+ As a consequence, all secondaries will have no chance to
+ replay across that gap, even if they got the logfiles.
+ The secondaries will stop at the gap, left in an outdated,
+ but logically consistent state.
+
+ After the problem has been fixed, the secondaries must
+ start a full-sync in order to continue replication at the
+ recent state.
+
+ This is the hardest measure the kernel can take in order
+ to TRY to continue undisrupted operation at the primary side.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_3
+ int "free space in /mars/ for primary logfiles (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ MEDIUM EMERGENCY LIMIT
+
+ When free space in /mars/ drops under
+ MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3,
+ older transaction logfiles will be deleted at primary resources.
+
+ As a consequence, the secondaries may no longer be able to
+ get a consecutive series of logfile copies.
+ As a result, they may get stuck somewhere in between, at an
+ outdated, but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed, some secondaries may need to
+ start a full-sync in order to continue replication at the
+ recent state.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_2
+ int "free space in /mars/ for secondary logfiles (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ MEDIUM EMERGENCY LIMIT
+
+ When free space in /mars/ drops under
+ MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2,
+ older transaction logfiles will be deleted at secondary resources.
+
+ As a consequence, some local secondary resources
+ may get stuck somewhere in between, at an
+ outdated, but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed and the free space becomes
+ larger than MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1, the secondary tries to fetch the missing
+ logfiles from the primary again.
+
+ However, if the necessary logfiles have been deleted at the
+ primary side in the meantime, this may fail.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_1
+ int "free space in /mars/ for replication (additional limit in GB)"
+ depends on MARS
+ default 2
+ ---help---
+ LOWEST EMERGENCY LIMIT
+
+ When free space in /mars/ drops under MARS_MIN_SPACE_4
+ + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2 + MARS_MIN_SPACE_1,
+ fetching of transaction logfiles will stop at local secondary
+ resources.
+
+ As a consequence, some local secondary resources
+ may get stuck somewhere in between, at an
+ outdated, but logically consistent state.
+
+ This is a desperate action of the kernel.
+
+ After the problem has been fixed and the free space becomes
+ larger than MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1, the secondary will continue fetching its
+ copy of logfiles from the primary side.
+
+ In general, you should avoid such situations at the admin level.
+
+ Please implement your own monitoring at the admin level,
+ which warns you and/or takes appropriate countermeasures
+ much earlier.
+
+ Never rely on this emergency feature!
+
+config MARS_MIN_SPACE_0
+ int "total space needed in /mars/ for (additional limit in GB)"
+ depends on MARS
+ default 12
+ ---help---
+ Operational prerequisite.
+
+ In order to use MARS, the total space available in /mars/ must
+ be at least MARS_MIN_SPACE_4 + MARS_MIN_SPACE_3 + MARS_MIN_SPACE_2
+ + MARS_MIN_SPACE_1 + MARS_MIN_SPACE_0.
+
+ If you cannot afford that amount of storage space, please use
+ DRBD in place of MARS.
--
2.11.0
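
For illustration, the five limits above are cumulative: each one adds on
top of the harder ones. Here is a minimal userspace sketch (not part of
the patch; all variable names are mine) that prints the effective
thresholds resulting from the default values:

#include <stdio.h>

int main(void)
{
	/* Kconfig defaults from above, in GB */
	int space_4 = 2;	/* MARS_MIN_SPACE_4: stop transaction logging */
	int space_3 = 2;	/* MARS_MIN_SPACE_3: delete primary logfiles */
	int space_2 = 2;	/* MARS_MIN_SPACE_2: delete secondary logfiles */
	int space_1 = 2;	/* MARS_MIN_SPACE_1: stop fetching at secondaries */
	int space_0 = 12;	/* MARS_MIN_SPACE_0: operational reserve */

	printf("stop logging below          %d GB\n", space_4);
	printf("delete primary logs below   %d GB\n", space_4 + space_3);
	printf("delete secondary logs below %d GB\n",
	       space_4 + space_3 + space_2);
	printf("stop fetching below         %d GB\n",
	       space_4 + space_3 + space_2 + space_1);
	printf("minimum size of /mars/:     %d GB\n",
	       space_4 + space_3 + space_2 + space_1 + space_0);
	return 0;
}

With the defaults this yields 2, 4, 6, 8 and 20 GB respectively.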

2016-12-30 22:58:28

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 18/32] mars: add new module xio_sio

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++++++++++++++++++++++++++++++
include/linux/xio/xio_sio.h | 68 ++++
2 files changed, 646 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c
create mode 100644 include/linux/xio/xio_sio.h

diff --git a/drivers/staging/mars/xio_bricks/xio_sio.c b/drivers/staging/mars/xio_bricks/xio_sio.c
new file mode 100644
index 000000000000..c910cbda2ae5
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_sio.c
@@ -0,0 +1,578 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/highmem.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+
+#include <linux/xio/xio.h>
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_sio.h>
+
+/***************** own brick * input * output operations *****************/
+
+static int sio_io_get(struct sio_output *output, struct aio_object *aio)
+{
+ struct file *file;
+
+ if (unlikely(!output->brick->power.on_led))
+ return -EBADFD;
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ file = output->mf->mf_filp;
+ if (file) {
+ loff_t total_size = i_size_read(file->f_mapping->host);
+
+ aio->io_total_size = total_size;
+ /* Only check reads.
+ * Writes behind EOF are always allowed (sparse files)
+ */
+ if (!aio->io_may_write) {
+ loff_t len = total_size - aio->io_pos;
+
+ if (unlikely(len <= 0)) {
+ /* Special case: allow reads starting _exactly_ at EOF when a timeout is specified.
+ */
+ if (len < 0 || aio->io_timeout <= 0) {
+ XIO_DBG("ENODATA %lld\n", len);
+ return -ENODATA;
+ }
+ }
+ /* Shorten below EOF, but allow special case */
+ if (aio->io_len > len && len > 0)
+ aio->io_len = len;
+ }
+ }
+
+ /* Buffered IO.
+ */
+ if (!aio->io_data) {
+ struct sio_aio_aspect *aio_a = sio_aio_get_aspect(output->brick, aio);
+
+ if (unlikely(!aio_a))
+ return -EILSEQ;
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("bad io_len = %d\n", aio->io_len);
+ return -ENOMEM;
+ }
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+ aio_a->do_dealloc = true;
+ /* atomic_inc(&output->total_alloc_count); */
+ /* atomic_inc(&output->alloc_count); */
+ }
+
+ obj_get_first(aio);
+ return aio->io_len;
+}
+
+static void sio_io_put(struct sio_output *output, struct aio_object *aio)
+{
+ struct file *file;
+ struct sio_aio_aspect *aio_a;
+
+ if (!obj_put(aio))
+ goto out_return;
+ file = output->mf->mf_filp;
+ aio->io_total_size = i_size_read(file->f_mapping->host);
+
+ aio_a = sio_aio_get_aspect(output->brick, aio);
+ if (aio_a && aio_a->do_dealloc) {
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ /* atomic_dec(&output->alloc_count); */
+ }
+
+ obj_free(aio);
+out_return:;
+}
+
+static
+int write_aops(struct sio_output *output, struct aio_object *aio)
+{
+ struct file *file = output->mf->mf_filp;
+ loff_t pos = aio->io_pos;
+ void *data = aio->io_data;
+ int len = aio->io_len;
+ int ret = 0;
+
+ mm_segment_t oldfs;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ ret = vfs_write(file, data, len, &pos);
+ set_fs(oldfs);
+ return ret;
+}
+
+static
+int read_aops(struct sio_output *output, struct aio_object *aio)
+{
+ loff_t pos = aio->io_pos;
+ int len = aio->io_len;
+ int ret;
+
+ mm_segment_t oldfs;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ ret = vfs_read(output->mf->mf_filp, aio->io_data, len, &pos);
+ set_fs(oldfs);
+
+ if (unlikely(ret < 0))
+ XIO_ERR("%p %p status=%d\n", output, aio, ret);
+ return ret;
+}
+
+static void sync_file(struct sio_output *output)
+{
+ struct file *file = output->mf->mf_filp;
+ int ret;
+
+#if defined(S_BIAS) || (defined(RHEL_MAJOR) && (RHEL_MAJOR < 7))
+ ret = vfs_fsync(file, file->f_path.dentry, 1);
+#else
+ ret = vfs_fsync(file, 1);
+#endif
+ if (unlikely(ret))
+ XIO_ERR("syncing pages failed: %d\n", ret);
+ goto out_return;
+out_return:;
+}
+
+static
+void _complete(struct sio_output *output, struct aio_object *aio, int err)
+{
+ obj_check(aio);
+
+ if (err < 0) {
+ XIO_ERR(
+ "IO error %d at pos=%lld len=%d (aio=%p io_data=%p)\n",
+ err,
+ aio->io_pos,
+ aio->io_len,
+ aio,
+ aio->io_data);
+ } else {
+ aio_checksum(aio);
+ aio->io_flags |= AIO_UPTODATE;
+ }
+
+#ifdef CONFIG_MARS_DEBUG
+ while (mars_hang_mode & 1)
+ brick_msleep(100);
+#endif
+
+ CHECKED_CALLBACK(aio, err, err_found);
+
+done:
+ sio_io_put(output, aio);
+
+ atomic_dec(&output->work_count);
+ atomic_dec(&xio_global_io_flying);
+ goto out_return;
+err_found:
+ XIO_FAT("giving up...\n");
+ goto done;
+out_return:;
+}
+
+/* This is called by the threads
+ */
+static
+void _sio_io_io(struct sio_threadinfo *tinfo, struct aio_object *aio)
+{
+ struct sio_output *output = tinfo->output;
+ bool barrier = false;
+ int status;
+
+ obj_check(aio);
+
+ atomic_inc(&tinfo->fly_count);
+
+ if (unlikely(!output->mf || !output->mf->mf_filp)) {
+ status = -EINVAL;
+ goto done;
+ }
+
+ if (barrier) {
+ XIO_INF("got barrier request\n");
+ sync_file(output);
+ }
+
+ if (aio->io_rw == READ) {
+ status = read_aops(output, aio);
+ } else {
+ status = write_aops(output, aio);
+ if (barrier || output->brick->o_fdsync)
+ sync_file(output);
+ }
+
+ mapfree_set(output->mf, aio->io_pos, aio->io_pos + aio->io_len);
+
+done:
+ _complete(output, aio, status);
+
+ atomic_dec(&tinfo->fly_count);
+}
+
+/* This is called from outside
+ */
+static
+void sio_io_io(struct sio_output *output, struct aio_object *aio)
+{
+ int index;
+ struct sio_threadinfo *tinfo;
+ struct sio_aio_aspect *aio_a;
+ unsigned long flags;
+
+ obj_check(aio);
+
+ aio_a = sio_aio_get_aspect(output->brick, aio);
+ if (unlikely(!aio_a)) {
+ XIO_FAT("cannot get aspect\n");
+ SIMPLE_CALLBACK(aio, -EINVAL);
+ goto out_return;
+ }
+
+ if (unlikely(!output->brick->power.on_led)) {
+ SIMPLE_CALLBACK(aio, -EBADFD);
+ goto out_return;
+ }
+
+ atomic_inc(&xio_global_io_flying);
+ atomic_inc(&output->work_count);
+ obj_get(aio);
+
+ mapfree_set(output->mf, aio->io_pos, -1);
+
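+ /* Writes are always handled by thread 0, while reads are spread
+ * round-robin over the reader threads 1..WITH_THREAD.
+ */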
+ index = 0;
+ if (aio->io_rw == READ) {
+ spin_lock_irqsave(&output->g_lock, flags);
+ index = output->index++;
+ spin_unlock_irqrestore(&output->g_lock, flags);
+ index = (index % WITH_THREAD) + 1;
+ }
+
+ tinfo = &output->tinfo[index];
+
+ atomic_inc(&tinfo->total_count);
+ atomic_inc(&tinfo->queue_count);
+
+ spin_lock_irqsave(&tinfo->lock, flags);
+ list_add_tail(&aio_a->io_head, &tinfo->aio_list);
+ spin_unlock_irqrestore(&tinfo->lock, flags);
+
+ wake_up_interruptible(&tinfo->event);
+out_return:;
+}
+
+static int sio_thread(void *data)
+{
+ struct sio_threadinfo *tinfo = data;
+
+ XIO_INF("sio thread has started.\n");
+ /* set_user_nice(current, -20); */
+
+ while (!brick_thread_should_stop()) {
+ struct list_head *tmp = NULL;
+ struct aio_object *aio;
+ struct sio_aio_aspect *aio_a;
+ unsigned long flags;
+
+ wait_event_interruptible_timeout(
+ tinfo->event,
+ !list_empty(&tinfo->aio_list) || brick_thread_should_stop(),
+ HZ);
+
+ tinfo->last_jiffies = jiffies;
+
+ spin_lock_irqsave(&tinfo->lock, flags);
+
+ if (!list_empty(&tinfo->aio_list)) {
+ tmp = tinfo->aio_list.next;
+ list_del_init(tmp);
+ atomic_dec(&tinfo->queue_count);
+ }
+
+ spin_unlock_irqrestore(&tinfo->lock, flags);
+
+ if (!tmp)
+ continue;
+
+ aio_a = container_of(tmp, struct sio_aio_aspect, io_head);
+ aio = aio_a->object;
+ _sio_io_io(tinfo, aio);
+ }
+
+ XIO_INF("sio thread has stopped.\n");
+ return 0;
+}
+
+static int sio_get_info(struct sio_output *output, struct xio_info *info)
+{
+ struct file *file = output->mf->mf_filp;
+
+ if (unlikely(!file || !file->f_mapping || !file->f_mapping->host))
+ return -EINVAL;
+
+ info->tf_align = 1;
+ info->tf_min_size = 1;
+ info->current_size = i_size_read(file->f_mapping->host);
+ XIO_DBG("determined file size = %lld\n", info->current_size);
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static noinline
+char *sio_statistics(struct sio_brick *brick, int verbose)
+{
+ struct sio_output *output = brick->outputs[0];
+ char *res = brick_string_alloc(1024);
+ int queue_sum = 0;
+ int fly_sum = 0;
+ int total_sum = 0;
+ int i;
+
+ for (i = 1; i <= WITH_THREAD; i++) {
+ struct sio_threadinfo *tinfo = &output->tinfo[i];
+
+ queue_sum += atomic_read(&tinfo->queue_count);
+ fly_sum += atomic_read(&tinfo->fly_count);
+ total_sum += atomic_read(&tinfo->total_count);
+ }
+
+ snprintf(
+ res, 1024,
+ "queued read = %d write = %d flying read = %d write = %d total read = %d write = %d\n",
+ queue_sum, atomic_read(&output->tinfo[0].queue_count),
+ fly_sum, atomic_read(&output->tinfo[0].fly_count),
+ total_sum, atomic_read(&output->tinfo[0].total_count)
+ );
+ return res;
+}
+
+static noinline
+void sio_reset_statistics(struct sio_brick *brick)
+{
+ struct sio_output *output = brick->outputs[0];
+ int i;
+
+ for (i = 0; i <= WITH_THREAD; i++) {
+ struct sio_threadinfo *tinfo = &output->tinfo[i];
+
+ atomic_set(&tinfo->total_count, 0);
+ }
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int sio_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct sio_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ return 0;
+}
+
+static void sio_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct sio_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+ CHECK_HEAD_EMPTY(&ini->io_head);
+}
+
+XIO_MAKE_STATICS(sio);
+
+/********************* brick constructors * destructors *******************/
+
+static int sio_brick_construct(struct sio_brick *brick)
+{
+ return 0;
+}
+
+static int sio_switch(struct sio_brick *brick)
+{
+ static int sio_nr;
+ struct sio_output *output = brick->outputs[0];
+ const char *path = output->brick->brick_path;
+ int status = 0;
+
+ if (brick->power.button) {
+ int flags = O_CREAT | O_RDWR | O_LARGEFILE;
+ int index;
+
+ if (brick->power.on_led)
+ goto done;
+
+ if (brick->o_direct) {
+ flags |= O_DIRECT;
+ XIO_INF("using O_DIRECT on %s\n", path);
+ }
+
+ xio_set_power_off_led((void *)brick, false);
+
+ output->mf = mapfree_get(path, flags);
+ if (unlikely(IS_ERR(output->mf))) {
+ XIO_ERR("could not open file = '%s' flags = %d\n", path, flags);
+ status = -ENOENT;
+ goto done;
+ }
+
+ output->index = 0;
+ for (index = 0; index <= WITH_THREAD; index++) {
+ struct sio_threadinfo *tinfo = &output->tinfo[index];
+
+ tinfo->last_jiffies = jiffies;
+ tinfo->thread = brick_thread_create(sio_thread, tinfo, "xio_sio%d", sio_nr++);
+ if (unlikely(!tinfo->thread)) {
+ XIO_ERR("cannot create thread\n");
+ status = -ENOENT;
+ goto done;
+ }
+ }
+ xio_set_power_on_led((void *)brick, true);
+ }
+done:
+ if (unlikely(status < 0) || !brick->power.button) {
+ int index;
+ int count;
+
+ xio_set_power_on_led((void *)brick, false);
+ for (;;) {
+ count = atomic_read(&output->work_count);
+ if (count <= 0)
+ break;
+ XIO_DBG("working on %d requests\n", count);
+ brick_msleep(1000);
+ }
+ for (index = 0; index <= WITH_THREAD; index++) {
+ struct sio_threadinfo *tinfo = &output->tinfo[index];
+
+ if (!tinfo->thread)
+ continue;
+ XIO_DBG("stopping thread %d\n", index);
+ brick_thread_stop(tinfo->thread);
+ tinfo->thread = NULL;
+ }
+ if (output->mf) {
+ XIO_DBG("closing file\n");
+ mapfree_put(output->mf);
+ output->mf = NULL;
+ }
+ xio_set_power_off_led((void *)brick, true);
+ }
+ return status;
+}
+
+static int sio_output_construct(struct sio_output *output)
+{
+ int index;
+
+ spin_lock_init(&output->g_lock);
+ for (index = 0; index <= WITH_THREAD; index++) {
+ struct sio_threadinfo *tinfo = &output->tinfo[index];
+
+ tinfo->output = output;
+ spin_lock_init(&tinfo->lock);
+ init_waitqueue_head(&tinfo->event);
+ INIT_LIST_HEAD(&tinfo->aio_list);
+ }
+
+ return 0;
+}
+
+static int sio_output_destruct(struct sio_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct sio_brick_ops sio_brick_ops = {
+ .brick_switch = sio_switch,
+ .brick_statistics = sio_statistics,
+ .reset_statistics = sio_reset_statistics,
+};
+
+static struct sio_output_ops sio_output_ops = {
+ .aio_get = sio_io_get,
+ .aio_put = sio_io_put,
+ .aio_io = sio_io_io,
+ .xio_get_info = sio_get_info,
+};
+
+const struct sio_input_type sio_input_type = {
+ .type_name = "sio_input",
+ .input_size = sizeof(struct sio_input),
+};
+
+static const struct sio_input_type *sio_input_types[] = {
+ &sio_input_type,
+};
+
+const struct sio_output_type sio_output_type = {
+ .type_name = "sio_output",
+ .output_size = sizeof(struct sio_output),
+ .master_ops = &sio_output_ops,
+ .output_construct = &sio_output_construct,
+ .output_destruct = &sio_output_destruct,
+};
+
+static const struct sio_output_type *sio_output_types[] = {
+ &sio_output_type,
+};
+
+const struct sio_brick_type sio_brick_type = {
+ .type_name = "sio_brick",
+ .brick_size = sizeof(struct sio_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &sio_brick_ops,
+ .aspect_types = sio_aspect_types,
+ .default_input_types = sio_input_types,
+ .default_output_types = sio_output_types,
+ .brick_construct = &sio_brick_construct,
+};
+
+/***************** module init stuff ************************/
+
+int __init init_xio_sio(void)
+{
+ XIO_INF("init_sio()\n");
+ _sio_brick_type = (void *)&sio_brick_type;
+ return sio_register_brick_type();
+}
+
+void exit_xio_sio(void)
+{
+ XIO_INF("exit_sio()\n");
+ sio_unregister_brick_type();
+}
diff --git a/include/linux/xio/xio_sio.h b/include/linux/xio/xio_sio.h
new file mode 100644
index 000000000000..170733f2ea27
--- /dev/null
+++ b/include/linux/xio/xio_sio.h
@@ -0,0 +1,68 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_SIO_H
+#define XIO_SIO_H
+
+#include <linux/xio/lib_mapfree.h>
+
+#define WITH_THREAD 16
+
+struct sio_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct sio_brick {
+ XIO_BRICK(sio);
+ /* parameters */
+ bool o_direct;
+ bool o_fdsync;
+};
+
+struct sio_input {
+ XIO_INPUT(sio);
+};
+
+struct sio_threadinfo {
+ struct sio_output *output;
+ struct list_head aio_list;
+ struct task_struct *thread;
+
+ wait_queue_head_t event;
+ spinlock_t lock;
+ atomic_t queue_count;
+ atomic_t fly_count;
+ atomic_t total_count;
+ unsigned long last_jiffies;
+};
+
+struct sio_output {
+ XIO_OUTPUT(sio);
+ /* private */
+ struct mapfree_info *mf;
+ struct sio_threadinfo tinfo[WITH_THREAD+1];
+ spinlock_t g_lock;
+ atomic_t work_count;
+ int index;
+};
+
+XIO_TYPES(sio);
+
+#endif
--
2.11.0
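
As an aside, the read/write dispatch of xio_sio boils down to one simple
rule: writes are serialized on a dedicated thread, reads are spread
round-robin over the remaining ones. A minimal userspace sketch of just
that rule (illustrative only; the kernel code additionally protects the
round-robin counter with a spinlock):

#include <stdio.h>

#define WITH_THREAD 16

static int rr_index;

/* slot 0 serializes writes; reads rotate over slots 1..WITH_THREAD */
static int pick_slot(int is_write)
{
	if (is_write)
		return 0;
	return (rr_index++ % WITH_THREAD) + 1;
}

int main(void)
{
	printf("write -> slot %d\n", pick_slot(1)); /* 0 */
	printf("read  -> slot %d\n", pick_slot(0)); /* 1 */
	printf("read  -> slot %d\n", pick_slot(0)); /* 2 */
	return 0;
}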

2016-12-30 23:00:01

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 26/32] mars: add new module net

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/net.c | 109 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 109 insertions(+)
create mode 100644 drivers/staging/mars/mars/net.c

diff --git a/drivers/staging/mars/mars/net.c b/drivers/staging/mars/mars/net.c
new file mode 100644
index 000000000000..d1b9715c0a93
--- /dev/null
+++ b/drivers/staging/mars/mars/net.c
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include "strategy.h"
+#include <linux/xio/xio_net.h>
+
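+/* Translate a host name (an optional ":port" suffix is stripped first)
+ * via the /mars/ips/ip-<host> symlink, when present; otherwise the
+ * plain host name is returned unchanged.
+ */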
+static
+char *_xio_translate_hostname(const char *name)
+{
+ char *res = brick_strdup(name);
+ char *test;
+ char *tmp;
+
+ for (tmp = res; *tmp; tmp++) {
+ if (*tmp == ':') {
+ *tmp = '\0';
+ break;
+ }
+ }
+
+ tmp = path_make("/mars/ips/ip-%s", res);
+ if (unlikely(!tmp))
+ goto done;
+
+ test = mars_readlink(tmp);
+ if (test && test[0]) {
+ XIO_DBG("'%s' => '%s'\n", tmp, test);
+ brick_string_free(res);
+ res = test;
+ } else {
+ brick_string_free(test);
+ XIO_WRN("no hostname translation for '%s'\n", tmp);
+ }
+ brick_string_free(tmp);
+
+done:
+ return res;
+}
+
+int xio_send_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+ struct list_head *tmp;
+ struct mars_dent *dent;
+ int status = 0;
+
+ for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+ dent = container_of(tmp, struct mars_dent, dent_link);
+ status = xio_send_struct(sock, dent, mars_dent_meta);
+ if (status < 0)
+ break;
+ }
+ if (status >= 0) { /* send EOR */
+ status = xio_send_struct(sock, NULL, mars_dent_meta);
+ }
+ return status;
+}
+
+int xio_recv_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+ int status;
+
+ for (;;) {
+ struct mars_dent *dent = brick_zmem_alloc(sizeof(struct mars_dent));
+
+ INIT_LIST_HEAD(&dent->dent_link);
+ INIT_LIST_HEAD(&dent->brick_list);
+
+ status = xio_recv_struct(sock, dent, mars_dent_meta);
+ if (status <= 0) {
+ xio_free_dent(dent);
+ goto done;
+ }
+ list_add_tail(&dent->dent_link, anchor);
+ }
+done:
+ return status;
+}
+
+/***************** module init stuff ************************/
+
+int __init init_sy_net(void)
+{
+ XIO_INF("init_sy_net()\n");
+ xio_translate_hostname = _xio_translate_hostname;
+ return 0;
+}
+
+void exit_sy_net(void)
+{
+ XIO_INF("exit_sy_net()\n");
+}
--
2.11.0
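
The hostname translation above is driven purely by symlinks under
/mars/ips/. A hypothetical admin-side sketch (illustrative only; the
host name and IP address are made up) that pre-seeds such a translation:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* make MARS contact "storage01" via a dedicated replication IP;
	 * assumes the /mars/ips directory already exists
	 */
	if (symlink("192.168.1.101", "/mars/ips/ip-storage01") != 0)
		perror("symlink");
	return 0;
}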

2016-12-30 23:00:18

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 20/32] mars: add new module xio_if

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_if.c | 892 +++++++++++++++++++++++++++++++
include/linux/xio/xio_if.h | 109 ++++
2 files changed, 1001 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c
create mode 100644 include/linux/xio/xio_if.h

diff --git a/drivers/staging/mars/xio_bricks/xio_if.c b/drivers/staging/mars/xio_bricks/xio_if.c
new file mode 100644
index 000000000000..97e0cd541c5c
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_if.c
@@ -0,0 +1,892 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Interface to a Linux device.
+ * 1 Input, 0 Outputs.
+ */
+
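+/* When ALWAYS_UNPLUG is set, every accepted bio immediately kicks off
+ * the plugged aio queue instead of waiting for an explicit unplug hint.
+ */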
+#define ALWAYS_UNPLUG true
+#define PREFETCH_LEN PAGE_SIZE
+
+/* low-level device parameters */
+#define USE_MAX_SECTORS (PAGE_SIZE >> 9)
+#define USE_MAX_PHYS_SEGMENTS (PAGE_SIZE >> 9)
+#define USE_MAX_SEGMENT_SIZE PAGE_SIZE
+#define USE_LOGICAL_BLOCK_SIZE 512
+#define USE_SEGMENT_BOUNDARY (PAGE_SIZE - 1)
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/bio.h>
+#include <linux/major.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+#include <linux/xio/xio.h>
+#include <linux/xio/xio_if.h>
+
+#ifndef XIO_MAJOR
+#define XIO_MAJOR (DRBD_MAJOR + 1)
+#endif
+
+/************************ global tuning ***********************/
+
+int if_throttle_start_size;
+
+struct rate_limiter if_throttle = {
+ .lim_max_amount_rate = 5000,
+};
+
+/************************ own type definitions ***********************/
+
+/************************ own static definitions ***********************/
+
+/* TODO: check bounds, ensure that free minor numbers are recycled */
+static int device_minor;
+
+/*************** object * aspect constructors * destructors **************/
+
+/************************ linux operations ***********************/
+
+static
+void _if_start_io_acct(struct if_input *input, struct bio_wrapper *biow)
+{
+ struct bio *bio = biow->bio;
+ const int rw = bio_data_dir(bio);
+ const int cpu = part_stat_lock();
+
+ (void)cpu;
+ part_round_stats(cpu, &input->disk->part0);
+ part_stat_inc(cpu, &input->disk->part0, ios[rw]);
+ part_stat_add(cpu, &input->disk->part0, sectors[rw], bio->bi_iter.bi_size >> 9);
+ part_inc_in_flight(&input->disk->part0, rw);
+ part_stat_unlock();
+ biow->start_time = jiffies;
+}
+
+static
+void _if_end_io_acct(struct if_input *input, struct bio_wrapper *biow)
+{
+ unsigned long duration = jiffies - biow->start_time;
+ struct bio *bio = biow->bio;
+ const int rw = bio_data_dir(bio);
+ const int cpu = part_stat_lock();
+
+ (void)cpu;
+ part_stat_add(cpu, &input->disk->part0, ticks[rw], duration);
+ part_round_stats(cpu, &input->disk->part0);
+ part_dec_in_flight(&input->disk->part0, rw);
+ part_stat_unlock();
+}
+
+/* callback
+ */
+static
+void if_endio(struct generic_callback *cb)
+{
+ struct if_aio_aspect *aio_a = cb->cb_private;
+ struct if_input *input;
+ int k;
+ int rw;
+ int error;
+
+ LAST_CALLBACK(cb);
+ if (unlikely(!aio_a || !aio_a->object)) {
+ XIO_FAT("aio_a = %p aio = %p, something is very wrong here!\n", aio_a, aio_a->object);
+ goto out_return;
+ }
+ input = aio_a->input;
+ CHECK_PTR(input, err);
+
+ rw = aio_a->object->io_rw;
+
+ for (k = 0; k < aio_a->bio_count; k++) {
+ struct bio_wrapper *biow;
+ struct bio *bio;
+
+ biow = aio_a->orig_biow[k];
+ aio_a->orig_biow[k] = NULL;
+ CHECK_PTR(biow, err);
+
+ CHECK_ATOMIC(&biow->bi_comp_cnt, 1);
+ if (!atomic_dec_and_test(&biow->bi_comp_cnt))
+ continue;
+
+ bio = biow->bio;
+ CHECK_PTR_NULL(bio, err);
+
+ _if_end_io_acct(input, biow);
+
+ error = CALLBACK_ERROR(aio_a->object);
+ if (unlikely(error < 0)) {
+ int bi_size = bio->bi_iter.bi_size;
+
+ XIO_ERR("NYI: error=%d RETRY LOGIC %u\n", error, bi_size);
+ } else { /* bio conventions are slightly different... */
+ error = 0;
+ bio->bi_iter.bi_size = 0;
+ }
+ bio->bi_error = error;
+ bio_endio(bio);
+ bio_put(bio);
+ brick_mem_free(biow);
+ }
+ atomic_dec(&input->flying_count);
+ if (rw)
+ atomic_dec(&input->write_flying_count);
+ else
+ atomic_dec(&input->read_flying_count);
+ goto out_return;
+err:
+ XIO_FAT("error in callback, giving up\n");
+out_return:;
+}
+
+/* Kick off plugged aios
+ */
+static
+void _if_unplug(struct if_input *input)
+{
+ /* struct if_brick *brick = input->brick; */
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+ spin_lock_irqsave(&input->req_lock, flags);
+ if (!list_empty(&input->plug_anchor)) {
+ /* move over the whole list */
+ list_replace_init(&input->plug_anchor, &tmp_list);
+ atomic_set(&input->plugged_count, 0);
+ }
+ spin_unlock_irqrestore(&input->req_lock, flags);
+
+ while (!list_empty(&tmp_list)) {
+ struct if_aio_aspect *aio_a;
+ struct aio_object *aio;
+
+ aio_a = container_of(tmp_list.next, struct if_aio_aspect, plug_head);
+ list_del_init(&aio_a->plug_head);
+
+ aio = aio_a->object;
+
+ if (unlikely(aio_a->current_len > aio_a->max_len))
+ XIO_ERR("request len %d > %d\n", aio_a->current_len, aio_a->max_len);
+ aio->io_len = aio_a->current_len;
+
+ atomic_inc(&input->flying_count);
+ atomic_inc(&input->total_fire_count);
+ if (aio->io_rw)
+ atomic_inc(&input->write_flying_count);
+ else
+ atomic_inc(&input->read_flying_count);
+ if (aio->io_skip_sync)
+ atomic_inc(&input->total_skip_sync_count);
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ }
+}
+
+/* accept a linux bio, convert to aio and call buf_io() on it.
+ */
+static
+blk_qc_t if_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct if_input *input = q->queuedata;
+ struct if_brick *brick = input->brick;
+
+ /* Original flags of the source bio
+ */
+ const int rw = bio_data_dir(bio);
+ const int sectors = bio_sectors(bio);
+
+/* adapt to different kernel versions (TBD: improve) */
+#if defined(BIO_RW_RQ_MASK) || defined(BIO_FLUSH)
+ const bool ahead = bio_rw_flagged(bio, BIO_RW_AHEAD) && rw == READ;
+ const bool barrier = bio_rw_flagged(bio, BIO_RW_BARRIER);
+ const bool syncio = bio_rw_flagged(bio, BIO_RW_SYNCIO);
+ const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
+ const bool meta = bio_rw_flagged(bio, BIO_RW_META);
+ const bool discard = bio_rw_flagged(bio, BIO_RW_DISCARD);
+ const bool noidle = bio_rw_flagged(bio, BIO_RW_NOIDLE);
+
+#elif defined(REQ_FLUSH) && defined(REQ_SYNC)
+#define _flagged(x) (bio->bi_rw & (x))
+ const bool ahead = _flagged(REQ_RAHEAD) && rw == READ;
+ const bool barrier = _flagged(REQ_FLUSH);
+ const bool syncio = _flagged(REQ_SYNC);
+ const bool unplug = false;
+ const bool meta = _flagged(REQ_META);
+ const bool discard = _flagged(REQ_DISCARD);
+ const bool noidle = _flagged(REQ_THROTTLED);
+
+#else
+#error Cannot decode the bio flags
+#endif
+ const int prio = bio_prio(bio);
+
+ /* Transform into XIO flags
+ */
+ const int io_prio =
+ (prio == IOPRIO_CLASS_RT || (meta | syncio)) ?
+ XIO_PRIO_HIGH :
+ (prio == IOPRIO_CLASS_IDLE) ?
+ XIO_PRIO_LOW :
+ XIO_PRIO_NORMAL;
+ const bool do_unplug = ALWAYS_UNPLUG | unplug | noidle;
+ const bool do_skip_sync = brick->skip_sync && !(barrier | syncio);
+
+ struct bio_wrapper *biow;
+ struct aio_object *aio = NULL;
+ struct if_aio_aspect *aio_a;
+
+ struct bio_vec bvec;
+ struct bvec_iter i;
+
+ loff_t pos = ((loff_t)bio->bi_iter.bi_sector) << 9; /* TODO: make dynamic */
+ int total_len = bio->bi_iter.bi_size;
+
+ bool assigned = false;
+ int error = -EINVAL;
+
+ bind_to_channel(brick->say_channel, current);
+
+ might_sleep();
+
+ if (unlikely(!sectors)) {
+ _if_unplug(input);
+ /* THINK: usually this happens only at write barriers.
+ * We have no "barrier" operation in XIO, since
+ * callback semantics should always denote
+ * "writethrough accomplished".
+ * In case of exceptional semantics, we need to do
+ * something here. For now, we do just nothing.
+ */
+ error = 0;
+ bio->bi_error = error;
+ bio_endio(bio);
+ goto done;
+ }
+
+ /* throttling of too big write requests */
+ if (rw && if_throttle_start_size > 0) {
+ int kb = (total_len + 512) / 1024;
+
+ if (kb >= if_throttle_start_size)
+ rate_limit_sleep(&if_throttle, kb);
+ }
+
+ (void)ahead; /* shut up gcc */
+ if (unlikely(discard)) { /* NYI */
+ error = 0;
+ bio->bi_error = error;
+ bio_endio(bio);
+ goto done;
+ }
+
+ biow = brick_mem_alloc(sizeof(struct bio_wrapper));
+ biow->bio = bio;
+ atomic_set(&biow->bi_comp_cnt, 0);
+
+ if (rw)
+ atomic_inc(&input->total_write_count);
+ else
+ atomic_inc(&input->total_read_count);
+ _if_start_io_acct(input, biow);
+
+ /* Get a reference to the bio.
+ * Will be released after bio_endio().
+ */
+ bio_get(bio);
+
+ /* FIXME: THIS IS PROVISIONAL (use an event instead)
+ */
+ while (unlikely(!brick->power.on_led))
+ brick_msleep(100);
+
+ bio_for_each_segment(bvec, bio, i) {
+ struct page *page = bvec.bv_page;
+ int bv_len = bvec.bv_len;
+ int offset = bvec.bv_offset;
+
+ void *data;
+
+ /* gather statistics on IOPS etc */
+ rate_limit(&brick->io_limiter, bv_len);
+
+#ifdef ARCH_HAS_KMAP
+#error FIXME/TODO: the current infrastructure cannot deal with HIGHMEM / kmap()
+#error HINT: XIO is supposed to run on big 64bit (storage) servers.
+#endif
+ data = page_address(page);
+ error = -EINVAL;
+ if (unlikely(!data))
+ break;
+
+ data += offset;
+
+ while (bv_len > 0) {
+ int this_len = 0;
+ unsigned long flags;
+
+ aio = NULL;
+ aio_a = NULL;
+
+ if (!aio) {
+ int prefetch_len;
+
+ error = -ENOMEM;
+ aio = if_alloc_aio(brick);
+ aio_a = if_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a))
+ goto err;
+
+#ifdef PREFETCH_LEN
+ prefetch_len = PREFETCH_LEN - offset;
+/**/
+ if (prefetch_len > total_len)
+ prefetch_len = total_len;
+ if (pos + prefetch_len > brick->dev_size)
+ prefetch_len = brick->dev_size - pos;
+ if (prefetch_len < bv_len)
+ prefetch_len = bv_len;
+#else
+ prefetch_len = bv_len;
+#endif
+
+ SETUP_CALLBACK(aio, if_endio, aio_a);
+
+ aio_a->input = input;
+ aio->io_rw = rw;
+ aio->io_may_write = rw;
+ aio->io_pos = pos;
+ aio->io_len = prefetch_len;
+ aio->io_data = data; /* direct IO */
+ aio->io_prio = io_prio;
+ aio_a->orig_page = page;
+
+ error = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(error < 0))
+ goto err;
+
+ this_len = aio->io_len; /* now may be shorter than originally requested. */
+ aio_a->max_len = this_len;
+ if (this_len > bv_len)
+ this_len = bv_len;
+ aio_a->current_len = this_len;
+ if (rw)
+ atomic_inc(&input->total_aio_write_count);
+ else
+ atomic_inc(&input->total_aio_read_count);
+ CHECK_ATOMIC(&biow->bi_comp_cnt, 0);
+ atomic_inc(&biow->bi_comp_cnt);
+ aio_a->orig_biow[0] = biow;
+ aio_a->bio_count = 1;
+ assigned = true;
+
+ /* When a bio with multiple biovecs is split into
+ * multiple aios, only the last one should be
+ * working in synchronous writethrough mode.
+ */
+ aio->io_skip_sync = true;
+ if (!do_skip_sync && i.bi_idx + 1 >= bio->bi_iter.bi_idx)
+ aio->io_skip_sync = false;
+
+ atomic_inc(&input->plugged_count);
+
+ spin_lock_irqsave(&input->req_lock, flags);
+ list_add_tail(&aio_a->plug_head, &input->plug_anchor);
+ spin_unlock_irqrestore(&input->req_lock, flags);
+ } /* !aio */
+
+ pos += this_len;
+ data += this_len;
+ bv_len -= this_len;
+ total_len -= this_len;
+ } /* while bv_len > 0 */
+ } /* foreach bvec */
+
+ if (likely(!total_len))
+ error = 0;
+ else
+ XIO_ERR("bad rest len = %d\n", total_len);
+err:
+ if (error < 0) {
+ XIO_ERR("cannot submit request from bio, status=%d\n", error);
+ if (!assigned) {
+ bio->bi_error = error;
+ bio_endio(bio);
+ }
+ }
+
+ if (do_unplug ||
+ (brick && brick->max_plugged > 0 && atomic_read(&input->plugged_count) > brick->max_plugged)) {
+ _if_unplug(input);
+ }
+
+done:
+ remove_binding_from(brick->say_channel, current);
+
+ return BLK_QC_T_NONE;
+}
+
+static
+loff_t if_get_capacity(struct if_brick *brick)
+{
+ /* Don't read always, read only when unknown.
+ * brick->dev_size may be different from underlying sizes,
+ * e.g. when the size symlink indicates a logically smaller
+ * device than physically.
+ */
+ if (brick->real_size <= 0 || brick->max_size != brick->old_max_size) {
+ struct xio_info info = {};
+ struct if_input *input = brick->inputs[0];
+ int status;
+
+ status = GENERIC_INPUT_CALL(input, xio_get_info, &info);
+ if (unlikely(status < 0)) {
+ XIO_WRN("cannot get device info, status=%d\n", status);
+ return 0;
+ }
+ XIO_INF("determined default capacity: %lld bytes\n", info.current_size);
+ brick->real_size = info.current_size;
+ }
+ if (brick->max_size > 0 && brick->real_size > brick->max_size)
+ brick->dev_size = brick->max_size;
+ else
+ brick->dev_size = brick->real_size;
+ brick->old_max_size = brick->max_size;
+ return brick->dev_size;
+}
+
+static
+void if_set_capacity(struct if_input *input, loff_t capacity)
+{
+ CHECK_PTR(input->disk, done);
+ CHECK_PTR(input->disk->disk_name, done);
+ XIO_INF("new capacity of '%s': %lld bytes\n", input->disk->disk_name, capacity);
+ set_capacity(input->disk, capacity >> 9);
+ if (likely(input->bdev && input->bdev->bd_inode))
+ i_size_write(input->bdev->bd_inode, capacity);
+done:;
+}
+
+static const struct block_device_operations if_blkdev_ops;
+
+static int if_switch(struct if_brick *brick)
+{
+ struct if_input *input = brick->inputs[0];
+ struct request_queue *q;
+ struct gendisk *disk;
+ int minor;
+ int status = 0;
+
+ down(&brick->switch_sem);
+
+ /* brick is in operation */
+ if (brick->power.button && brick->power.on_led) {
+ loff_t capacity;
+
+ capacity = if_get_capacity(brick);
+ if (capacity > 0 && capacity != input->capacity) {
+ XIO_INF(
+ "changing capacity from %lld to %lld\n", (long long)input->capacity, (long long)capacity);
+ input->capacity = capacity;
+ if_set_capacity(input, capacity);
+ }
+ }
+
+ /* brick should be switched on */
+ if (brick->power.button && brick->power.off_led) {
+ loff_t capacity;
+
+ brick->say_channel = get_binding(current);
+
+ capacity = if_get_capacity(brick);
+ XIO_INF("capacity is %lld\n", (long long)capacity);
+ if (capacity > 0) {
+ input->capacity = capacity;
+ xio_set_power_off_led((void *)brick, false);
+ }
+ }
+ if (brick->power.button && !brick->power.on_led && !brick->power.off_led) {
+ status = -ENOMEM;
+ q = blk_alloc_queue(GFP_BRICK);
+ if (!q) {
+ XIO_ERR("cannot allocate device request queue\n");
+ goto is_down;
+ }
+ q->queuedata = input;
+ input->q = q;
+
+ disk = alloc_disk(1);
+ if (!disk) {
+ XIO_ERR("cannot allocate gendisk\n");
+ goto is_down;
+ }
+
+ minor = device_minor++; /* TODO: protect against races (e.g. atomic_t) */
+ set_disk_ro(disk, true);
+
+ disk->queue = q;
+ disk->major = XIO_MAJOR; /* TODO: make this dynamic for >256 devices */
+ disk->first_minor = minor;
+ disk->fops = &if_blkdev_ops;
+ snprintf(disk->disk_name, sizeof(disk->disk_name), "%s", brick->brick_name);
+ disk->private_data = input;
+ input->disk = disk;
+ XIO_DBG(
+ "created device name %s, capacity=%lld\n",
+ disk->disk_name, input->capacity);
+ if_set_capacity(input, input->capacity);
+
+ blk_queue_make_request(q, if_make_request);
+ blk_queue_max_hw_sectors(q, USE_MAX_SECTORS);
+ blk_queue_max_segments(q, USE_MAX_PHYS_SEGMENTS);
+ blk_queue_max_segment_size(q, USE_MAX_SEGMENT_SIZE);
+ blk_queue_logical_block_size(q, USE_LOGICAL_BLOCK_SIZE);
+ blk_queue_segment_boundary(q, USE_SEGMENT_BOUNDARY);
+ blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+ q->queue_lock = &input->req_lock; /* needed! */
+
+ input->bdev = bdget(MKDEV(disk->major, minor));
+ /* we have no partitions. we contain only ourselves. */
+ input->bdev->bd_contains = input->bdev;
+
+ /* point of no return */
+ XIO_DBG("add_disk()\n");
+ add_disk(disk);
+ set_disk_ro(disk, false);
+
+ /* report success */
+ xio_set_power_on_led((void *)brick, true);
+ status = 0;
+ }
+
+ /* brick should be switched off */
+ if (!brick->power.button && !brick->power.off_led) {
+ int opened;
+ int plugged;
+ int flying;
+
+ xio_set_power_on_led((void *)brick, false);
+ disk = input->disk;
+ if (!disk)
+ goto is_down;
+
+ opened = atomic_read(&brick->open_count);
+ if (unlikely(opened > 0)) {
+ XIO_INF("device '%s' is open %d times, cannot shutdown\n", disk->disk_name, opened);
+ status = -EBUSY;
+ goto done; /* don't indicate "off" status */
+ }
+ plugged = atomic_read(&input->plugged_count);
+ if (unlikely(plugged > 0)) {
+ XIO_INF("device '%s' has %d plugged requests, cannot shutdown\n", disk->disk_name, plugged);
+ status = -EBUSY;
+ goto done; /* don't indicate "off" status */
+ }
+ flying = atomic_read(&input->flying_count);
+ if (unlikely(flying > 0)) {
+ XIO_INF("device '%s' has %d flying requests, cannot shutdown\n", disk->disk_name, flying);
+ status = -EBUSY;
+ goto done; /* don't indicate "off" status */
+ }
+ XIO_DBG("calling del_gendisk()\n");
+ del_gendisk(input->disk);
+ /* There might be subtle races */
+ while (atomic_read(&input->flying_count) > 0) {
+ XIO_WRN("device '%s' unexpectedly has %d flying requests\n", disk->disk_name, flying);
+ brick_msleep(1000);
+ }
+ if (input->bdev) {
+ XIO_DBG("calling bdput()\n");
+ bdput(input->bdev);
+ input->bdev = NULL;
+ }
+ XIO_DBG("calling put_disk()\n");
+ put_disk(input->disk);
+ input->disk = NULL;
+ q = input->q;
+ if (q) {
+ blk_cleanup_queue(q);
+ input->q = NULL;
+ }
+ status = 0;
+is_down:
+ xio_set_power_off_led((void *)brick, true);
+ }
+
+done:
+ up(&brick->switch_sem);
+ return status;
+}
+
+/*************** interface to the outer world (kernel) **************/
+
+static int if_open(struct block_device *bdev, fmode_t mode)
+{
+ struct if_input *input;
+ struct if_brick *brick;
+
+ if (unlikely(!bdev || !bdev->bd_disk)) {
+ XIO_ERR("----------------------- INVAL ------------------------------\n");
+ return -EINVAL;
+ }
+
+ input = bdev->bd_disk->private_data;
+
+ if (unlikely(!input || !input->brick)) {
+ XIO_ERR("----------------------- BAD IF SETUP ------------------------------\n");
+ return -EINVAL;
+ }
+ brick = input->brick;
+
+ down(&brick->switch_sem);
+
+ if (unlikely(!brick->power.on_led)) {
+ XIO_INF(
+ "----------------------- BUSY %d ------------------------------\n", atomic_read(&brick->open_count));
+ up(&brick->switch_sem);
+ return -EBUSY;
+ }
+
+ atomic_inc(&brick->open_count);
+
+ XIO_INF("----------------------- OPEN %d ------------------------------\n", atomic_read(&brick->open_count));
+
+ up(&brick->switch_sem);
+ return 0;
+}
+
+static
+void
+if_release(struct gendisk *gd, fmode_t mode)
+{
+ struct if_input *input = gd->private_data;
+ struct if_brick *brick = input->brick;
+ int nr;
+
+ XIO_INF("----------------------- CLOSE %d ------------------------------\n", atomic_read(&brick->open_count));
+
+ if (atomic_dec_and_test(&brick->open_count)) {
+ while ((nr = atomic_read(&input->flying_count)) > 0) {
+ XIO_INF("%d IO requests not yet completed\n", nr);
+ brick_msleep(1000);
+ }
+
+ XIO_DBG(
+ "status button=%d on_led=%d off_led=%d\n",
+ brick->power.button,
+ brick->power.on_led,
+ brick->power.off_led);
+ local_trigger();
+ }
+}
+
+static const struct block_device_operations if_blkdev_ops = {
+ .owner = THIS_MODULE,
+ .open = if_open,
+ .release = if_release,
+};
+
+/*************** informational * statistics **************/
+
+static
+char *if_statistics(struct if_brick *brick, int verbose)
+{
+ struct if_input *input = brick->inputs[0];
+ char *res = brick_string_alloc(512);
+ int tmp0 = atomic_read(&input->total_reada_count);
+ int tmp1 = atomic_read(&input->total_read_count);
+ int tmp2 = atomic_read(&input->total_aio_read_count);
+ int tmp3 = atomic_read(&input->total_write_count);
+ int tmp4 = atomic_read(&input->total_aio_write_count);
+
+ snprintf(
+ res, 512,
+ "total reada = %d reads = %d aio_reads = %d (%d%%) writes = %d aio_writes = %d (%d%%) empty = %d fired = %d skip_sync = %d | plugged = %d flying = %d (reads = %d writes = %d)\n",
+ tmp0,
+ tmp1,
+ tmp2,
+ tmp1 ? tmp2 * 100 / tmp1 : 0,
+ tmp3,
+ tmp4,
+ tmp3 ? tmp4 * 100 / tmp3 : 0,
+ atomic_read(&input->total_empty_count),
+ atomic_read(&input->total_fire_count),
+ atomic_read(&input->total_skip_sync_count),
+ atomic_read(&input->plugged_count),
+ atomic_read(&input->flying_count),
+ atomic_read(&input->read_flying_count),
+ atomic_read(&input->write_flying_count));
+ return res;
+}
+
+static
+void if_reset_statistics(struct if_brick *brick)
+{
+ struct if_input *input = brick->inputs[0];
+
+ atomic_set(&input->total_read_count, 0);
+ atomic_set(&input->total_write_count, 0);
+ atomic_set(&input->total_empty_count, 0);
+ atomic_set(&input->total_fire_count, 0);
+ atomic_set(&input->total_skip_sync_count, 0);
+ atomic_set(&input->total_aio_read_count, 0);
+ atomic_set(&input->total_aio_write_count, 0);
+}
+
+/***************** own brick * input * output operations *****************/
+
+/* none */
+
+/*************** object * aspect constructors * destructors **************/
+
+static int if_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct if_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->plug_head);
+ return 0;
+}
+
+static void if_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct if_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->plug_head);
+}
+
+XIO_MAKE_STATICS(if);
+
+/*********************** constructors * destructors ***********************/
+
+static int if_brick_construct(struct if_brick *brick)
+{
+ sema_init(&brick->switch_sem, 1);
+ atomic_set(&brick->open_count, 0);
+ return 0;
+}
+
+static int if_brick_destruct(struct if_brick *brick)
+{
+ return 0;
+}
+
+static int if_input_construct(struct if_input *input)
+{
+ INIT_LIST_HEAD(&input->plug_anchor);
+ spin_lock_init(&input->req_lock);
+ atomic_set(&input->flying_count, 0);
+ atomic_set(&input->read_flying_count, 0);
+ atomic_set(&input->write_flying_count, 0);
+ atomic_set(&input->plugged_count, 0);
+ return 0;
+}
+
+static int if_input_destruct(struct if_input *input)
+{
+ CHECK_HEAD_EMPTY(&input->plug_anchor);
+ return 0;
+}
+
+static int if_output_construct(struct if_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct if_brick_ops if_brick_ops = {
+ .brick_switch = if_switch,
+ .brick_statistics = if_statistics,
+ .reset_statistics = if_reset_statistics,
+};
+
+static struct if_output_ops if_output_ops;
+
+const struct if_input_type if_input_type = {
+ .type_name = "if_input",
+ .input_size = sizeof(struct if_input),
+ .input_construct = &if_input_construct,
+ .input_destruct = &if_input_destruct,
+};
+
+static const struct if_input_type *if_input_types[] = {
+ &if_input_type,
+};
+
+const struct if_output_type if_output_type = {
+ .type_name = "if_output",
+ .output_size = sizeof(struct if_output),
+ .master_ops = &if_output_ops,
+ .output_construct = &if_output_construct,
+};
+
+static const struct if_output_type *if_output_types[] = {
+ &if_output_type,
+};
+
+const struct if_brick_type if_brick_type = {
+ .type_name = "if_brick",
+ .brick_size = sizeof(struct if_brick),
+ .max_inputs = 1,
+ .max_outputs = 0,
+ .master_ops = &if_brick_ops,
+ .aspect_types = if_aspect_types,
+ .default_input_types = if_input_types,
+ .default_output_types = if_output_types,
+ .brick_construct = &if_brick_construct,
+ .brick_destruct = &if_brick_destruct,
+};
+
+/***************** module init stuff ************************/
+
+void exit_xio_if(void)
+{
+ int status;
+
+ XIO_INF("exit_if()\n");
+ status = if_unregister_brick_type();
+ unregister_blkdev(XIO_MAJOR, "xio");
+}
+
+int __init init_xio_if(void)
+{
+ int status;
+
+ (void)if_aspect_types; /* not used, shut up gcc */
+
+ XIO_INF("init_if()\n");
+ status = register_blkdev(XIO_MAJOR, "xio");
+ if (status)
+ return status;
+ status = if_register_brick_type();
+ if (status)
+ goto err_device;
+ return status;
+err_device:
+ XIO_ERR("init_if() status=%d\n", status);
+ exit_xio_if();
+ return status;
+}
diff --git a/include/linux/xio/xio_if.h b/include/linux/xio/xio_if.h
new file mode 100644
index 000000000000..10a36b8c486b
--- /dev/null
+++ b/include/linux/xio/xio_if.h
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_IF_H
+#define XIO_IF_H
+
+#include <linux/brick/lib_limiter.h>
+
+#include <linux/semaphore.h>
+
+#define HT_SHIFT 6 /* ???? */
+#define XIO_MAX_SEGMENT_SIZE (1U << (9+HT_SHIFT))
+
+#define MAX_BIO 32
+
+/************************ global tuning ***********************/
+
+extern int if_throttle_start_size; /* in kb */
+extern struct rate_limiter if_throttle;
+
+/***********************************************/
+
+/* I don't want to enhance / intrude into struct bio for compatibility reasons
+ * (support for a variety of kernel versions).
+ * The following is just a silly workaround which could be removed again.
+ */
+struct bio_wrapper {
+ struct bio *bio;
+ atomic_t bi_comp_cnt;
+ unsigned long start_time;
+};
+
+struct if_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head plug_head;
+ int bio_count;
+ int current_len;
+ int max_len;
+ struct page *orig_page;
+ struct bio_wrapper *orig_biow[MAX_BIO];
+ struct if_input *input;
+};
+
+struct if_input {
+ XIO_INPUT(if);
+ /* TODO: move this to if_brick (better systematics) */
+ struct list_head plug_anchor;
+ struct request_queue *q;
+ struct gendisk *disk;
+ struct block_device *bdev;
+ loff_t capacity;
+ atomic_t plugged_count;
+ atomic_t flying_count;
+
+ /* only for statistics */
+ atomic_t read_flying_count;
+ atomic_t write_flying_count;
+ atomic_t total_reada_count;
+ atomic_t total_read_count;
+ atomic_t total_write_count;
+ atomic_t total_empty_count;
+ atomic_t total_fire_count;
+ atomic_t total_skip_sync_count;
+ atomic_t total_aio_read_count;
+ atomic_t total_aio_write_count;
+ spinlock_t req_lock;
+};
+
+struct if_output {
+ XIO_OUTPUT(if);
+};
+
+struct if_brick {
+ XIO_BRICK(if);
+ /* parameters */
+ loff_t real_size;
+ loff_t max_size;
+ loff_t dev_size;
+ int max_plugged;
+ int readahead;
+ bool skip_sync;
+
+ /* inspectable */
+ atomic_t open_count;
+ struct rate_limiter io_limiter;
+
+ /* private */
+ struct semaphore switch_sem;
+ struct say_channel *say_channel;
+ loff_t old_max_size;
+};
+
+XIO_TYPES(if);
+
+#endif
--
2.11.0
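
The capacity handling in if_get_capacity() follows one rule: the exported
device size is the physical size, optionally clamped to a logically
smaller max_size. A minimal sketch of that rule (illustrative only, not
taken from MARS):

#include <stdio.h>

/* exported size = physical size, optionally clamped by a smaller max_size */
static long long effective_size(long long real_size, long long max_size)
{
	if (max_size > 0 && real_size > max_size)
		return max_size;
	return real_size;
}

int main(void)
{
	printf("%lld\n", effective_size(1000, 0));   /* 1000: no clamping */
	printf("%lld\n", effective_size(1000, 512)); /* 512: clamped      */
	return 0;
}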

2016-12-30 23:00:45

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 23/32] mars: add new module xio_server

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_server.c | 493 +++++++++++++++++++++++++++
include/linux/xio/xio_server.h | 91 +++++
2 files changed, 584 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c
create mode 100644 include/linux/xio/xio_server.h

diff --git a/drivers/staging/mars/xio_bricks/xio_server.c b/drivers/staging/mars/xio_bricks/xio_server.c
new file mode 100644
index 000000000000..28944d15a7bf
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_server.c
@@ -0,0 +1,493 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Server brick (just for demonstration) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/brick/brick.h>
+#include <linux/xio/xio.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_sio.h>
+#include <linux/xio/xio_trans_logger.h>
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_server.h>
+
+static struct xio_socket server_socket[NR_SERVER_SOCKETS];
+static struct task_struct *server_threads[NR_SERVER_SOCKETS];
+
+/************************ own helper functions ***********************/
+
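+/* Completion thread: drains the pending callback lists (writes are
+ * given priority over reads) and sends the results back over the
+ * handler socket.
+ */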
+int cb_thread(void *data)
+{
+ struct server_brick *brick = data;
+ struct xio_socket *sock = &brick->handler_socket;
+ bool aborted = false;
+ bool ok = xio_get_socket(sock);
+ int status = -EINVAL;
+
+ XIO_DBG("--------------- cb_thread starting on socket #%d, ok = %d\n", sock->s_debug_nr, ok);
+ if (!ok)
+ goto done;
+
+ brick->cb_running = true;
+ wake_up_interruptible(&brick->startup_event);
+
+ while (!brick_thread_should_stop() ||
+ !list_empty(&brick->cb_read_list) ||
+ !list_empty(&brick->cb_write_list) ||
+ atomic_read(&brick->in_flight) > 0) {
+ struct server_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct list_head *tmp;
+ unsigned long flags;
+
+ wait_event_interruptible_timeout(
+ brick->cb_event,
+ !list_empty(&brick->cb_read_list) ||
+ !list_empty(&brick->cb_write_list),
+ 1 * HZ);
+
+ spin_lock_irqsave(&brick->cb_lock, flags);
+ tmp = brick->cb_write_list.next;
+ if (tmp == &brick->cb_write_list) {
+ tmp = brick->cb_read_list.next;
+ if (tmp == &brick->cb_read_list) {
+ spin_unlock_irqrestore(&brick->cb_lock, flags);
+ brick_msleep(1000 / HZ);
+ continue;
+ }
+ }
+ list_del_init(tmp);
+ spin_unlock_irqrestore(&brick->cb_lock, flags);
+
+ aio_a = container_of(tmp, struct server_aio_aspect, cb_head);
+ aio = aio_a->object;
+ status = -EINVAL;
+ CHECK_PTR(aio, err);
+
+ status = 0;
+ /* Report a remote error when consistency cannot be guaranteed,
+ * e.g. emergency mode during sync.
+ */
+ if (brick->conn_brick &&
+ brick->conn_brick->mode_ptr &&
+ *brick->conn_brick->mode_ptr < 0 &&
+ aio->object_cb)
+ aio->object_cb->cb_error = *brick->conn_brick->mode_ptr;
+ if (!aborted) {
+ down(&brick->socket_sem);
+ status = xio_send_cb(sock, aio);
+ up(&brick->socket_sem);
+ }
+
+err:
+ if (unlikely(status < 0) && !aborted) {
+ aborted = true;
+ XIO_WRN("cannot send response, status = %d\n", status);
+ /* Just shutdown the socket and forget all pending
+ * requests.
+ * The _client_ is responsible for resending
+ * any lost operations.
+ */
+ xio_shutdown_socket(sock);
+ }
+
+ if (aio_a->data) {
+ brick_block_free(aio_a->data, aio_a->len);
+ aio->io_data = NULL;
+ }
+ if (aio_a->do_put) {
+ GENERIC_INPUT_CALL(brick->inputs[0], aio_put, aio);
+ atomic_dec(&brick->in_flight);
+ } else {
+ obj_free(aio);
+ }
+ }
+
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+
+done:
+ XIO_DBG("---------- cb_thread terminating, status = %d\n", status);
+ wake_up_interruptible(&brick->startup_event);
+ return status;
+}
+
+static
+void server_endio(struct generic_callback *cb)
+{
+ struct server_aio_aspect *aio_a;
+ struct aio_object *aio;
+ struct server_brick *brick;
+ int rw;
+ unsigned long flags;
+
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+ LAST_CALLBACK(cb);
+ if (unlikely(cb != &aio->_object_cb))
+ XIO_ERR("bad cb pointer %p != %p\n", cb, &aio->_object_cb);
+
+ brick = aio_a->brick;
+ if (unlikely(!brick)) {
+ XIO_WRN("late IO callback -- cannot do anything\n");
+ goto out_return;
+ }
+
+ rw = aio->io_rw;
+
+ spin_lock_irqsave(&brick->cb_lock, flags);
+ if (rw)
+ list_add_tail(&aio_a->cb_head, &brick->cb_write_list);
+ else
+ list_add_tail(&aio_a->cb_head, &brick->cb_read_list);
+ spin_unlock_irqrestore(&brick->cb_lock, flags);
+
+ wake_up_interruptible(&brick->cb_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback - giving up\n");
+out_return:;
+}
+
+int server_io(struct server_brick *brick, struct xio_socket *sock, struct xio_cmd *cmd)
+{
+ struct aio_object *aio;
+ struct server_aio_aspect *aio_a;
+ int amount;
+ int status = -ENOTRECOVERABLE;
+
+ if (!brick->cb_running || !brick->handler_running || !xio_socket_is_alive(sock))
+ goto done;
+
+ aio = server_alloc_aio(brick);
+ status = -ENOMEM;
+ aio_a = server_aio_get_aspect(brick, aio);
+ if (unlikely(!aio_a)) {
+ obj_free(aio);
+ goto done;
+ }
+
+ status = xio_recv_aio(sock, aio, cmd);
+ if (status < 0) {
+ obj_free(aio);
+ goto done;
+ }
+
+ aio_a->brick = brick;
+ aio_a->data = aio->io_data;
+ aio_a->len = aio->io_len;
+ SETUP_CALLBACK(aio, server_endio, aio_a);
+
+ amount = 0;
+ if (aio->io_cs_mode < 2)
+ amount = (aio->io_len - 1) / 1024 + 1;
+ rate_limit_sleep(&server_limiter, amount);
+
+ status = GENERIC_INPUT_CALL(brick->inputs[0], aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_WRN("aio_get execution error = %d\n", status);
+ SIMPLE_CALLBACK(aio, status);
+ status = 0; /* continue serving requests */
+ goto done;
+ }
+ aio_a->do_put = true;
+ atomic_inc(&brick->in_flight);
+ GENERIC_INPUT_CALL(brick->inputs[0], aio_io, aio);
+
+done:
+ return status;
+}
+
+/***************** own brick * input * output operations *****************/
+
+static int server_get_info(struct server_output *output, struct xio_info *info)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
+static int server_io_get(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ return GENERIC_INPUT_CALL(input, aio_get, aio);
+}
+
+static void server_io_put(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+}
+
+static void server_io_io(struct server_output *output, struct aio_object *aio)
+{
+ struct server_input *input = output->brick->inputs[0];
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+}
+
+int server_switch(struct server_brick *brick)
+{
+ struct xio_socket *sock = &brick->handler_socket;
+ int status = 0;
+
+ if (brick->power.button) {
+ static int version;
+ bool ok;
+
+ if (brick->power.on_led)
+ goto done;
+
+ ok = xio_get_socket(sock);
+ if (unlikely(!ok)) {
+ status = -ENOENT;
+ goto err;
+ }
+
+ xio_set_power_off_led((void *)brick, false);
+
+ brick->version = version++;
+ brick->handler_thread = brick_thread_create(handler_thread, brick, "xio_handler%d", brick->version);
+ if (unlikely(!brick->handler_thread)) {
+ XIO_ERR("cannot create handler thread\n");
+ status = -ENOENT;
+ goto err;
+ }
+
+ xio_set_power_on_led((void *)brick, true);
+ } else if (!brick->power.off_led) {
+ struct task_struct *thread;
+
+ xio_set_power_on_led((void *)brick, false);
+
+ xio_shutdown_socket(sock);
+
+ thread = brick->handler_thread;
+ if (thread) {
+ brick->handler_thread = NULL;
+ brick->handler_running = false;
+ XIO_DBG("#%d stopping handler thread....\n", sock->s_debug_nr);
+ brick_thread_stop(thread);
+ }
+
+ xio_put_socket(sock);
+ XIO_DBG("#%d socket s_count = %d\n", sock->s_debug_nr, atomic_read(&sock->s_count));
+
+ xio_set_power_off_led((void *)brick, true);
+ }
+err:
+ if (unlikely(status < 0)) {
+ xio_set_power_off_led((void *)brick, true);
+ xio_shutdown_socket(sock);
+ xio_put_socket(sock);
+ }
+done:
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *server_statistics(struct server_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(1024);
+
+ snprintf(
+ res, 1024,
+ "cb_running = %d handler_running = %d in_flight = %d\n",
+ brick->cb_running,
+ brick->handler_running,
+ atomic_read(&brick->in_flight));
+
+ return res;
+}
+
+static
+void server_reset_statistics(struct server_brick *brick)
+{
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int server_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct server_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->cb_head);
+ return 0;
+}
+
+static void server_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct server_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->cb_head);
+}
+
+XIO_MAKE_STATICS(server);
+
+/********************* brick constructors * destructors *******************/
+
+static int server_brick_construct(struct server_brick *brick)
+{
+ init_waitqueue_head(&brick->startup_event);
+ init_waitqueue_head(&brick->cb_event);
+ sema_init(&brick->socket_sem, 1);
+ spin_lock_init(&brick->cb_lock);
+ INIT_LIST_HEAD(&brick->cb_read_list);
+ INIT_LIST_HEAD(&brick->cb_write_list);
+ return 0;
+}
+
+static int server_brick_destruct(struct server_brick *brick)
+{
+ CHECK_HEAD_EMPTY(&brick->cb_read_list);
+ CHECK_HEAD_EMPTY(&brick->cb_write_list);
+ return 0;
+}
+
+static int server_output_construct(struct server_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct server_brick_ops server_brick_ops = {
+ .brick_switch = server_switch,
+ .brick_statistics = server_statistics,
+ .reset_statistics = server_reset_statistics,
+};
+
+static struct server_output_ops server_output_ops = {
+ .xio_get_info = server_get_info,
+ .aio_get = server_io_get,
+ .aio_put = server_io_put,
+ .aio_io = server_io_io,
+};
+
+const struct server_input_type server_input_type = {
+ .type_name = "server_input",
+ .input_size = sizeof(struct server_input),
+};
+
+static const struct server_input_type *server_input_types[] = {
+ &server_input_type,
+};
+
+const struct server_output_type server_output_type = {
+ .type_name = "server_output",
+ .output_size = sizeof(struct server_output),
+ .master_ops = &server_output_ops,
+ .output_construct = &server_output_construct,
+};
+
+static const struct server_output_type *server_output_types[] = {
+ &server_output_type,
+};
+
+const struct server_brick_type server_brick_type = {
+ .type_name = "server_brick",
+ .brick_size = sizeof(struct server_brick),
+ .max_inputs = 1,
+ .max_outputs = 0,
+ .master_ops = &server_brick_ops,
+ .aspect_types = server_aspect_types,
+ .default_input_types = server_input_types,
+ .default_output_types = server_output_types,
+ .brick_construct = &server_brick_construct,
+ .brick_destruct = &server_brick_destruct,
+};
+
+/*********************************************************************/
+
+/* strategy layer */
+
+int server_show_statist;
+
+/***************** module init stuff ************************/
+
+struct rate_limiter server_limiter = {
+ /* Let all be zero */
+};
+
+void exit_xio_server(void)
+{
+ int i;
+
+ XIO_INF("exit_server()\n");
+ server_unregister_brick_type();
+
+ for (i = 0; i < NR_SERVER_SOCKETS; i++) {
+ if (server_threads[i]) {
+ XIO_INF("stopping server thread %d...\n", i);
+ brick_thread_stop(server_threads[i]);
+ }
+ XIO_INF("closing server socket %d...\n", i);
+ xio_put_socket(&server_socket[i]);
+ }
+}
+
+int __init init_xio_server(void)
+{
+ int i;
+
+ XIO_INF("init_server()\n");
+
+ for (i = 0; i < NR_SERVER_SOCKETS; i++) {
+ struct sockaddr_storage sockaddr = {};
+ char tmp[64];
+ int status;
+
+ if (xio_translate_hostname)
+ snprintf(tmp, sizeof(tmp), "%s:%d", my_id(), xio_net_default_port + i);
+ else
+ snprintf(tmp, sizeof(tmp), ":%d", xio_net_default_port + i);
+
+ status = xio_create_sockaddr(&sockaddr, tmp);
+ if (unlikely(status < 0)) {
+ exit_xio_server();
+ return status;
+ }
+
+ status = xio_create_socket(&server_socket[i], &sockaddr, NULL, &device_tcp_params);
+ if (unlikely(status < 0)) {
+ XIO_ERR("could not create server socket %d, status = %d\n", i, status);
+ exit_xio_server();
+ return status;
+ }
+
+ server_threads[i] = brick_thread_create(server_thread, &server_socket[i], "xio_server_%d", i);
+ if (unlikely(!server_threads[i] || IS_ERR(server_threads[i]))) {
+ XIO_ERR("could not create server thread %d\n", i);
+ exit_xio_server();
+ return -ENOENT;
+ }
+ }
+
+ return server_register_brick_type();
+}
diff --git a/include/linux/xio/xio_server.h b/include/linux/xio/xio_server.h
new file mode 100644
index 000000000000..2c13f263e6c8
--- /dev/null
+++ b/include/linux/xio/xio_server.h
@@ -0,0 +1,91 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_SERVER_H
+#define XIO_SERVER_H
+
+#include <linux/wait.h>
+
+#include <linux/xio/xio_net.h>
+#include <linux/brick/lib_limiter.h>
+
+#define NR_SERVER_SOCKETS 3
+
+extern int server_show_statist;
+
+extern struct rate_limiter server_limiter;
+
+struct server_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct server_brick *brick;
+ struct list_head cb_head;
+ void *data;
+ int len;
+ bool do_put;
+};
+
+struct server_output {
+ XIO_OUTPUT(server);
+};
+
+struct server_brick {
+ XIO_BRICK(server);
+ struct semaphore socket_sem;
+ struct xio_socket handler_socket;
+ struct xio_brick *conn_brick;
+ struct task_struct *handler_thread;
+ struct task_struct *cb_thread;
+
+ wait_queue_head_t startup_event;
+ wait_queue_head_t cb_event;
+ spinlock_t cb_lock;
+ struct list_head cb_read_list;
+ struct list_head cb_write_list;
+ atomic_t in_flight;
+ int version;
+ bool cb_running;
+ bool handler_running;
+};
+
+struct server_input {
+ XIO_INPUT(server);
+};
+
+XIO_TYPES(server);
+
+/* Internal interface to specific implementations.
+ * This is used for a rough separation of the strategy layer
+ * from the ordinary XIO layer.
+ * Currently, separation is at linker level.
+ * TODO: implement a dynamic separation later.
+ */
+
+/* Implemented separately, used by generic part */
+
+extern int server_thread(void *data);
+
+extern int handler_thread(void *data);
+
+extern int cb_thread(void *data);
+
+extern int server_io(struct server_brick *brick, struct xio_socket *sock, struct xio_cmd *cmd);
+
+/* Implemented by generic part, used by specific part */
+
+extern int server_switch(struct server_brick *brick);
+
+#endif
--
2.11.0

2016-12-30 23:01:08

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 17/32] mars: add new module xio_bio

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_bio.c | 845 ++++++++++++++++++++++++++++++
include/linux/xio/xio_bio.h | 85 +++
2 files changed, 930 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c
create mode 100644 include/linux/xio/xio_bio.h

diff --git a/drivers/staging/mars/xio_bricks/xio_bio.c b/drivers/staging/mars/xio_bricks/xio_bio.c
new file mode 100644
index 000000000000..97bc4fc46f3e
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_bio.c
@@ -0,0 +1,845 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Bio brick (interface to blkdev IO via kernel bios) */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/bio.h>
+
+#include <linux/xio/xio.h>
+#include <linux/brick/lib_timing.h>
+#include <linux/xio/lib_mapfree.h>
+
+#include <linux/xio/xio_bio.h>
+static struct timing_stats timings[2];
+
+struct threshold bio_submit_threshold = {
+ .thr_ban = &xio_global_ban,
+ .thr_parent = &global_io_threshold,
+ .thr_limit = BIO_SUBMIT_MAX_LATENCY,
+ .thr_factor = 100,
+ .thr_plus = 0,
+};
+
+struct threshold bio_io_threshold[2] = {
+ [0] = {
+ .thr_ban = &xio_global_ban,
+ .thr_parent = &global_io_threshold,
+ .thr_limit = BIO_IO_R_MAX_LATENCY,
+ .thr_factor = 10,
+ .thr_plus = 10000,
+ },
+ [1] = {
+ .thr_ban = &xio_global_ban,
+ .thr_parent = &global_io_threshold,
+ .thr_limit = BIO_IO_W_MAX_LATENCY,
+ .thr_factor = 10,
+ .thr_plus = 10000,
+ },
+};
+
+/************************ own type definitions ***********************/
+
+/************************ own helper functions ***********************/
+
+/* This is called from the kernel bio layer.
+ */
+static
+void bio_callback(struct bio *bio)
+{
+ struct bio_aio_aspect *aio_a = bio->bi_private;
+ struct bio_brick *brick;
+ unsigned long flags;
+
+ CHECK_PTR(aio_a, err);
+ CHECK_PTR(aio_a->output, err);
+ brick = aio_a->output->brick;
+ CHECK_PTR(brick, err);
+
+ aio_a->status_code = bio->bi_error;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_del(&aio_a->io_head);
+ list_add_tail(&aio_a->io_head, &brick->completed_list);
+ atomic_inc(&brick->completed_count);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ wake_up_interruptible(&brick->response_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle bio callback\n");
+out_return:;
+}
+
+/* Map from kernel address/length to struct page (if not already known),
+ * check alignment constraints, create bio from it.
+ * Return the length (may be smaller than requested).
+ */
+static
+int make_bio(
+struct bio_brick *brick, void *data, int len, loff_t pos, struct bio_aio_aspect *private, struct bio **_bio)
+{
+ unsigned long long sector;
+ int sector_offset;
+ int data_offset;
+ int page_offset;
+ int page_len;
+ int bvec_count;
+ int rest_len = len;
+ int result_len = 0;
+ int status;
+ int i;
+ struct bio *bio = NULL;
+ struct block_device *bdev;
+
+ status = -EINVAL;
+ CHECK_PTR(brick, out);
+ bdev = brick->bdev;
+ CHECK_PTR(bdev, out);
+
+ if (unlikely(rest_len <= 0)) {
+ XIO_ERR("bad bio len %d\n", rest_len);
+ goto out;
+ }
+
+ sector = pos >> 9; /* TODO: make dynamic */
+ sector_offset = pos & ((1 << 9) - 1); /* TODO: make dynamic */
+ data_offset = ((unsigned long)data) & ((1 << 9) - 1); /* TODO: make dynamic */
+
+ if (unlikely(sector_offset > 0)) {
+ XIO_ERR("odd sector offset %d\n", sector_offset);
+ goto out;
+ }
+ if (unlikely(sector_offset != data_offset)) {
+ XIO_ERR("bad alignment: sector_offset %d != data_offset %d\n", sector_offset, data_offset);
+ goto out;
+ }
+ if (unlikely(rest_len & ((1 << 9) - 1))) {
+ XIO_ERR("odd length %d\n", rest_len);
+ goto out;
+ }
+
+ page_offset = ((unsigned long)data) & (PAGE_SIZE - 1);
+ page_len = rest_len + page_offset;
+ bvec_count = (page_len - 1) / PAGE_SIZE + 1;
+ if (bvec_count > brick->bvec_max) {
+ bvec_count = brick->bvec_max;
+ } else if (unlikely(bvec_count <= 0)) {
+ XIO_WRN("bvec_count=%d\n", bvec_count);
+ bvec_count = 1;
+ }
+
+ bio = bio_alloc(GFP_BRICK, bvec_count);
+ status = -ENOMEM;
+ if (unlikely(!bio))
+ goto out;
+
+ for (i = 0; i < bvec_count && rest_len > 0; i++) {
+ struct page *page;
+ int this_rest = PAGE_SIZE - page_offset;
+ int this_len = rest_len;
+
+ if (this_len > this_rest)
+ this_len = this_rest;
+
+ page = brick_iomap(data, &page_offset, &this_len);
+ if (unlikely(!page)) {
+ XIO_ERR("cannot iomap() kernel address %p\n", data);
+ status = -EINVAL;
+ goto out;
+ }
+
+ bio->bi_io_vec[i].bv_page = page;
+ bio->bi_io_vec[i].bv_len = this_len;
+ bio->bi_io_vec[i].bv_offset = page_offset;
+
+ data += this_len;
+ rest_len -= this_len;
+ result_len += this_len;
+ page_offset = 0;
+ }
+
+ if (unlikely(rest_len != 0)) {
+ XIO_ERR("computation of bvec_count %d was wrong, diff=%d\n", bvec_count, rest_len);
+ status = -EINVAL;
+ goto out;
+ }
+
+ bio->bi_vcnt = i;
+ bio->bi_iter.bi_idx = 0;
+ bio->bi_iter.bi_size = result_len;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_bdev = bdev;
+ bio->bi_private = private;
+ bio->bi_end_io = bio_callback;
+ bio->bi_rw = 0; /* must be filled in later */
+ status = result_len;
+
+out:
+ if (unlikely(status < 0)) {
+ XIO_ERR("error %d\n", status);
+ if (bio) {
+ bio_put(bio);
+ bio = NULL;
+ }
+ }
+ *_bio = bio;
+ return status;
+}
+
+/***************** own brick * input * output operations *****************/
+
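+/* io_prio ranges from XIO_PRIO_HIGH to XIO_PRIO_LOW; shift it by one
+ * to index the per-priority arrays of size XIO_PRIO_NR.
+ */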
+#define PRIO_INDEX(aio) ((aio)->io_prio + 1)
+
+static int bio_get_info(struct bio_output *output, struct xio_info *info)
+{
+ struct bio_brick *brick = output->brick;
+ struct inode *inode;
+ int status = -ENOENT;
+
+ if (unlikely(!brick->mf ||
+ !brick->mf->mf_filp ||
+ !brick->mf->mf_filp->f_mapping)) {
+ goto done;
+ }
+ inode = brick->mf->mf_filp->f_mapping->host;
+ if (unlikely(!inode))
+ goto done;
+
+ info->tf_align = 512;
+ info->tf_min_size = 512;
+ brick->total_size = i_size_read(inode);
+ info->current_size = brick->total_size;
+ XIO_DBG("determined device size = %lld\n", info->current_size);
+ status = 0;
+
+done:
+ return status;
+}
+
+static int bio_io_get(struct bio_output *output, struct aio_object *aio)
+{
+ struct bio_aio_aspect *aio_a;
+ int status = -EINVAL;
+
+ CHECK_PTR(output, done);
+ CHECK_PTR(output->brick, done);
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ aio_a = bio_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, done);
+ aio_a->output = output;
+ aio_a->bio = NULL;
+
+ if (!aio->io_data) { /* buffered IO. */
+ if (unlikely(aio->io_len <= 0))
+ goto done;
+ status = -ENOMEM;
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+ aio_a->do_dealloc = true;
+ }
+
+ status = make_bio(output->brick, aio->io_data, aio->io_len, aio->io_pos, aio_a, &aio_a->bio);
+ if (unlikely(status < 0 || !aio_a->bio)) {
+ XIO_ERR("could not create bio, status = %d\n", status);
+ goto done;
+ }
+
+ if (unlikely(aio->io_prio < XIO_PRIO_HIGH))
+ aio->io_prio = XIO_PRIO_HIGH;
+ else if (unlikely(aio->io_prio > XIO_PRIO_LOW))
+ aio->io_prio = XIO_PRIO_LOW;
+
+ aio->io_len = status;
+ obj_get_first(aio);
+ status = 0;
+
+done:
+ return status;
+}
+
+static
+void _bio_io_put(struct bio_output *output, struct aio_object *aio)
+{
+ struct bio_aio_aspect *aio_a;
+
+ aio->io_total_size = output->brick->total_size;
+
+ aio_a = bio_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, err);
+
+ if (likely(aio_a->bio)) {
+ bio_put(aio_a->bio);
+ aio_a->bio = NULL;
+ }
+ if (aio_a->do_dealloc) {
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ aio->io_data = NULL;
+ }
+ obj_free(aio);
+
+ goto out_return;
+err:
+ XIO_FAT("cannot work\n");
+out_return:;
+}
+
+#define BIO_AIO_PUT(output, aio) \
+ ({ \
+ if (obj_put(aio)) { \
+ _bio_io_put(output, aio); \
+ } \
+ })
+
+static
+void bio_io_put(struct bio_output *output, struct aio_object *aio)
+{
+ BIO_AIO_PUT(output, aio);
+}
+
+static
+void _bio_io_io(struct bio_output *output, struct aio_object *aio, bool cork)
+{
+ struct bio_brick *brick = output->brick;
+ struct bio_aio_aspect *aio_a = bio_aio_get_aspect(output->brick, aio);
+ struct bio *bio;
+ unsigned long long latency;
+ unsigned long flags;
+ int rw;
+ int status = -EINVAL;
+
+ CHECK_PTR(aio_a, err);
+ bio = aio_a->bio;
+ CHECK_PTR(bio, err);
+
+ obj_get(aio);
+ atomic_inc(&brick->fly_count[PRIO_INDEX(aio)]);
+
+ bio_get(bio);
+
+ rw = aio->io_rw & 1;
+ if (brick->do_noidle && !cork)
+ rw |= REQ_NOIDLE;
+ if (!aio->io_skip_sync) {
+ if (brick->do_sync)
+ rw |= REQ_SYNC;
+ }
+
+ aio_a->start_stamp = cpu_clock(raw_smp_processor_id());
+ spin_lock_irqsave(&brick->lock, flags);
+ list_add_tail(&aio_a->io_head, &brick->submitted_list[rw & 1]);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ bio->bi_rw = rw;
+ latency = TIME_STATS(
+ &timings[rw & 1],
+ submit_bio(rw, bio)
+ );
+
+ threshold_check(&bio_submit_threshold, latency);
+
+ status = 0;
+#ifdef BIO_EOPNOTSUPP /* missing since b25de9d6da49b1a8760a89672283128aa8c78345 */
+ if (unlikely(bio_flagged(bio, BIO_EOPNOTSUPP)))
+ status = -EOPNOTSUPP;
+#endif
+
+ if (likely(status >= 0))
+ goto done;
+
+ bio_put(bio);
+ atomic_dec(&brick->fly_count[PRIO_INDEX(aio)]);
+
+err:
+ XIO_ERR("IO error %d\n", status);
+ CHECKED_CALLBACK(aio, status, done);
+ atomic_dec(&xio_global_io_flying);
+
+done:;
+}
+
+static
+void bio_io_io(struct bio_output *output, struct aio_object *aio)
+{
+ CHECK_PTR(aio, fatal);
+
+ obj_get(aio);
+ atomic_inc(&xio_global_io_flying);
+
+ if (aio->io_prio == XIO_PRIO_LOW ||
+ (aio->io_prio == XIO_PRIO_NORMAL && aio->io_rw)) {
+ struct bio_aio_aspect *aio_a = bio_aio_get_aspect(output->brick, aio);
+ struct bio_brick *brick = output->brick;
+ unsigned long flags;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_add_tail(&aio_a->io_head, &brick->queue_list[PRIO_INDEX(aio)]);
+ atomic_inc(&brick->queue_count[PRIO_INDEX(aio)]);
+ spin_unlock_irqrestore(&brick->lock, flags);
+ brick->submitted = true;
+
+ wake_up_interruptible(&brick->submit_event);
+ goto out_return;
+ }
+
+ /* realtime IO: start immediately */
+ _bio_io_io(output, aio, false);
+ BIO_AIO_PUT(output, aio);
+ goto out_return;
+fatal:
+ XIO_FAT("cannot handle aio %p on output %p\n", aio, output);
+out_return:;
+}
+
+static
+int bio_response_thread(void *data)
+{
+ struct bio_brick *brick = data;
+
+ XIO_INF("bio response thread has started on '%s'.\n", brick->brick_path);
+
+ for (;;) {
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+ int thr_limit;
+ int sleeptime;
+ int count;
+ int i;
+
+ thr_limit = bio_io_threshold[0].thr_limit;
+ if (bio_io_threshold[1].thr_limit < thr_limit)
+ thr_limit = bio_io_threshold[1].thr_limit;
+
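+ /* Sleep at most about half of the smallest latency threshold
+ * (thr_limit is in us, sleeptime in jiffies), defaulting to HZ/10.
+ */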
+ sleeptime = HZ / 10;
+ if (thr_limit > 0) {
+ sleeptime = thr_limit / (1000000 * 2 / HZ);
+ if (unlikely(sleeptime < 2))
+ sleeptime = 2;
+ }
+
+ wait_event_interruptible_timeout(
+ brick->response_event,
+ atomic_read(&brick->completed_count) > 0,
+ sleeptime);
+
+#ifdef CONFIG_MARS_DEBUG
+ if (mars_hang_mode & 2) {
+ brick_msleep(100);
+ continue;
+ }
+#endif
+ spin_lock_irqsave(&brick->lock, flags);
+ list_replace_init(&brick->completed_list, &tmp_list);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ count = 0;
+ for (;;) {
+ struct list_head *tmp;
+ struct bio_aio_aspect *aio_a;
+ struct aio_object *aio;
+ unsigned long long latency;
+ int code;
+
+ if (list_empty(&tmp_list)) {
+ if (brick_thread_should_stop() &&
+ atomic_read(&brick->fly_count[0]) +
+ atomic_read(&brick->fly_count[1]) +
+ atomic_read(&brick->fly_count[2]) <= 0)
+ goto done;
+ break;
+ }
+
+ tmp = tmp_list.next;
+ list_del_init(tmp);
+ atomic_dec(&brick->completed_count);
+
+ aio_a = container_of(tmp, struct bio_aio_aspect, io_head);
+ aio = aio_a->object;
+
+ latency = cpu_clock(raw_smp_processor_id()) - aio_a->start_stamp;
+ threshold_check(&bio_io_threshold[aio->io_rw & 1], latency);
+
+ code = aio_a->status_code;
+
+ if (code < 0) {
+ XIO_ERR("IO error %d\n", code);
+ } else {
+ aio_checksum(aio);
+ aio->io_flags |= AIO_UPTODATE;
+ }
+
+ SIMPLE_CALLBACK(aio, code);
+
+ atomic_dec(&brick->fly_count[PRIO_INDEX(aio)]);
+ atomic_inc(&brick->total_completed_count[PRIO_INDEX(aio)]);
+ count++;
+
+ if (likely(aio_a->bio))
+ bio_put(aio_a->bio);
+ BIO_AIO_PUT(aio_a->output, aio);
+
+ atomic_dec(&xio_global_io_flying);
+ }
+
+ /* Try to detect slow requests as early as possible,
+ * even before they have completed.
+ */
+ for (i = 0; i < 2; i++) {
+ unsigned long long eldest = 0;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ if (!list_empty(&brick->submitted_list[i])) {
+ struct bio_aio_aspect *aio_a;
+
+ aio_a = container_of(brick->submitted_list[i].next, struct bio_aio_aspect, io_head);
+ eldest = aio_a->start_stamp;
+ }
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ if (eldest)
+ threshold_check(&bio_io_threshold[i], cpu_clock(raw_smp_processor_id()) - eldest);
+ }
+
+ if (count) {
+ brick->submitted = true;
+ wake_up_interruptible(&brick->submit_event);
+ }
+ }
+done:
+ XIO_INF("bio response thread has stopped.\n");
+ return 0;
+}
+
+static
+bool _bg_should_run(struct bio_brick *brick)
+{
+ return (atomic_read(&brick->queue_count[2]) > 0 &&
+ atomic_read(&brick->fly_count[0]) + atomic_read(&brick->fly_count[1]) <= brick->bg_threshold &&
+ (brick->bg_maxfly <= 0 || atomic_read(&brick->fly_count[2]) < brick->bg_maxfly));
+}
+
+static
+int bio_submit_thread(void *data)
+{
+ struct bio_brick *brick = data;
+
+ XIO_INF("bio submit thread has started on '%s'.\n", brick->brick_path);
+
+ while (!brick_thread_should_stop()) {
+ int prio;
+
+ wait_event_interruptible_timeout(
+ brick->submit_event,
+ brick->submitted,
+ HZ / 2);
+
+ brick->submitted = false;
+
+ for (prio = 0; prio < XIO_PRIO_NR; prio++) {
+ LIST_HEAD(tmp_list);
+ unsigned long flags;
+
+ if (prio == XIO_PRIO_NR - 1 && !_bg_should_run(brick))
+ break;
+
+ spin_lock_irqsave(&brick->lock, flags);
+ list_replace_init(&brick->queue_list[prio], &tmp_list);
+ spin_unlock_irqrestore(&brick->lock, flags);
+
+ while (!list_empty(&tmp_list)) {
+ struct list_head *tmp = tmp_list.next;
+ struct bio_aio_aspect *aio_a;
+ struct aio_object *aio;
+ bool cork;
+
+ list_del_init(tmp);
+
+ aio_a = container_of(tmp, struct bio_aio_aspect, io_head);
+ aio = aio_a->object;
+ if (unlikely(!aio)) {
+ XIO_ERR("invalid aio\n");
+ continue;
+ }
+
+ atomic_dec(&brick->queue_count[PRIO_INDEX(aio)]);
+ cork = atomic_read(&brick->queue_count[PRIO_INDEX(aio)]) > 0;
+
+ _bio_io_io(aio_a->output, aio, cork);
+
+ BIO_AIO_PUT(aio_a->output, aio);
+ }
+ }
+ }
+
+ XIO_INF("bio submit thread has stopped.\n");
+ return 0;
+}
+
+static int bio_switch(struct bio_brick *brick)
+{
+ int status = 0;
+
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+
+ xio_set_power_off_led((void *)brick, false);
+
+ if (!brick->bdev) {
+ static int index;
+ const char *path = brick->brick_path;
+ int flags = O_RDWR | O_EXCL | O_LARGEFILE;
+ struct address_space *mapping;
+ struct inode *inode = NULL;
+ struct request_queue *q;
+
+ brick->mf = mapfree_get(path, flags);
+ if (unlikely(!brick->mf || !brick->mf->mf_filp)) {
+ status = -ENOENT;
+ XIO_ERR("cannot open file '%s'\n", path);
+ goto done;
+ }
+ mapfree_pages(brick->mf, -1);
+ mapping = brick->mf->mf_filp->f_mapping;
+ if (likely(mapping))
+ inode = mapping->host;
+ if (unlikely(!mapping || !inode)) {
+ XIO_ERR("internal problem with '%s'\n", path);
+ status = -EINVAL;
+ goto done;
+ }
+ if (unlikely(!S_ISBLK(inode->i_mode) || !inode->i_bdev)) {
+ XIO_ERR("sorry, '%s' is not a block device\n", path);
+ status = -ENODEV;
+ goto done;
+ }
+
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+
+ q = bdev_get_queue(inode->i_bdev);
+ if (unlikely(!q)) {
+ XIO_ERR("internal queue '%s' does not exist\n", path);
+ status = -EINVAL;
+ goto done;
+ }
+
+ XIO_INF(
+ "'%s' ra_pages OLD=%lu NEW=%d\n", path, q->backing_dev_info.ra_pages, brick->ra_pages);
+ q->backing_dev_info.ra_pages = brick->ra_pages;
+
+ brick->bvec_max = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
+ if (brick->bvec_max > BIO_MAX_PAGES)
+ brick->bvec_max = BIO_MAX_PAGES;
+ else if (brick->bvec_max <= 1)
+ brick->bvec_max = 1;
+ brick->total_size = i_size_read(inode);
+ XIO_INF(
+ "'%s' size=%lld bvec_max=%d\n",
+ path, brick->total_size, brick->bvec_max);
+
+ brick->response_thread = brick_thread_create(
+ bio_response_thread, brick, "xio_bio_r%d", index);
+ brick->submit_thread = brick_thread_create(bio_submit_thread, brick, "xio_bio_s%d", index);
+ status = -ENOMEM;
+ if (likely(brick->submit_thread && brick->response_thread)) {
+ brick->bdev = inode->i_bdev;
+ brick->mode_ptr = &brick->mf->mf_mode;
+ index++;
+ status = 0;
+ }
+ }
+ }
+
+ xio_set_power_on_led((void *)brick, brick->power.button && brick->bdev);
+
+done:
+ if (status < 0 || !brick->power.button) {
+ if (brick->submit_thread) {
+ brick_thread_stop(brick->submit_thread);
+ brick->submit_thread = NULL;
+ }
+ if (brick->response_thread) {
+ brick_thread_stop(brick->response_thread);
+ brick->response_thread = NULL;
+ }
+ if (brick->mf) {
+ mapfree_put(brick->mf);
+ brick->mf = NULL;
+ }
+ brick->mode_ptr = NULL;
+ brick->bdev = NULL;
+ if (!brick->power.button) {
+ xio_set_power_off_led((void *)brick, true);
+ brick->total_size = 0;
+ }
+ }
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static noinline
+char *bio_statistics(struct bio_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(4096);
+ int pos = 0;
+
+ pos += report_timing(&timings[0], res + pos, 4096 - pos);
+ pos += report_timing(&timings[1], res + pos, 4096 - pos);
+
+ snprintf(
+ res + pos, 4096 - pos,
+ "total completed[0] = %d completed[1] = %d completed[2] = %d | queued[0] = %d queued[1] = %d queued[2] = %d flying[0] = %d flying[1] = %d flying[2] = %d completing = %d\n",
+ atomic_read(&brick->total_completed_count[0]),
+ atomic_read(&brick->total_completed_count[1]),
+ atomic_read(&brick->total_completed_count[2]),
+ atomic_read(&brick->queue_count[0]),
+ atomic_read(&brick->queue_count[1]),
+ atomic_read(&brick->queue_count[2]),
+ atomic_read(&brick->fly_count[0]),
+ atomic_read(&brick->fly_count[1]),
+ atomic_read(&brick->fly_count[2]),
+ atomic_read(&brick->completed_count));
+
+ return res;
+}
+
+static noinline
+void bio_reset_statistics(struct bio_brick *brick)
+{
+ atomic_set(&brick->total_completed_count[0], 0);
+ atomic_set(&brick->total_completed_count[1], 0);
+ atomic_set(&brick->total_completed_count[2], 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int bio_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct bio_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ return 0;
+}
+
+static void bio_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct bio_aio_aspect *ini = (void *)_ini;
+
+ (void)ini;
+}
+
+XIO_MAKE_STATICS(bio);
+
+/********************* brick constructors * destructors *******************/
+
+static int bio_brick_construct(struct bio_brick *brick)
+{
+ spin_lock_init(&brick->lock);
+ INIT_LIST_HEAD(&brick->queue_list[0]);
+ INIT_LIST_HEAD(&brick->queue_list[1]);
+ INIT_LIST_HEAD(&brick->queue_list[2]);
+ INIT_LIST_HEAD(&brick->submitted_list[0]);
+ INIT_LIST_HEAD(&brick->submitted_list[1]);
+ INIT_LIST_HEAD(&brick->completed_list);
+ init_waitqueue_head(&brick->submit_event);
+ init_waitqueue_head(&brick->response_event);
+ return 0;
+}
+
+static int bio_brick_destruct(struct bio_brick *brick)
+{
+ return 0;
+}
+
+static int bio_output_construct(struct bio_output *output)
+{
+ return 0;
+}
+
+static int bio_output_destruct(struct bio_output *output)
+{
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct bio_brick_ops bio_brick_ops = {
+ .brick_switch = bio_switch,
+ .brick_statistics = bio_statistics,
+ .reset_statistics = bio_reset_statistics,
+};
+
+static struct bio_output_ops bio_output_ops = {
+ .xio_get_info = bio_get_info,
+ .aio_get = bio_io_get,
+ .aio_put = bio_io_put,
+ .aio_io = bio_io_io,
+};
+
+const struct bio_input_type bio_input_type = {
+ .type_name = "bio_input",
+ .input_size = sizeof(struct bio_input),
+};
+
+static const struct bio_input_type *bio_input_types[] = {
+ &bio_input_type,
+};
+
+const struct bio_output_type bio_output_type = {
+ .type_name = "bio_output",
+ .output_size = sizeof(struct bio_output),
+ .master_ops = &bio_output_ops,
+ .output_construct = &bio_output_construct,
+ .output_destruct = &bio_output_destruct,
+};
+
+static const struct bio_output_type *bio_output_types[] = {
+ &bio_output_type,
+};
+
+const struct bio_brick_type bio_brick_type = {
+ .type_name = "bio_brick",
+ .brick_size = sizeof(struct bio_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &bio_brick_ops,
+ .aspect_types = bio_aspect_types,
+ .default_input_types = bio_input_types,
+ .default_output_types = bio_output_types,
+ .brick_construct = &bio_brick_construct,
+ .brick_destruct = &bio_brick_destruct,
+};
+
+/***************** module init stuff ************************/
+
+int __init init_xio_bio(void)
+{
+ XIO_INF("init_bio()\n");
+ _bio_brick_type = (void *)&bio_brick_type;
+ return bio_register_brick_type();
+}
+
+void exit_xio_bio(void)
+{
+ XIO_INF("exit_bio()\n");
+ bio_unregister_brick_type();
+}
diff --git a/include/linux/xio/xio_bio.h b/include/linux/xio/xio_bio.h
new file mode 100644
index 000000000000..a0d98bed63b5
--- /dev/null
+++ b/include/linux/xio/xio_bio.h
@@ -0,0 +1,85 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_BIO_H
+#define XIO_BIO_H
+
+#define BIO_SUBMIT_MAX_LATENCY 250 /* 250 us */
+#define BIO_IO_R_MAX_LATENCY 40000 /* 40 ms */
+#define BIO_IO_W_MAX_LATENCY 100000 /* 100 ms */
+
+extern struct threshold bio_submit_threshold;
+extern struct threshold bio_io_threshold[2];
+
+#include <linux/blkdev.h>
+
+struct bio_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ struct bio *bio;
+ struct bio_output *output;
+ unsigned long long start_stamp;
+ int status_code;
+ int hash_pos;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct bio_brick {
+ XIO_BRICK(bio);
+ /* tunables */
+ int ra_pages;
+ int bg_threshold;
+ int bg_maxfly;
+ bool do_noidle;
+ bool do_sync;
+ bool do_unplug;
+
+ /* readonly */
+ loff_t total_size;
+ atomic_t fly_count[XIO_PRIO_NR];
+ atomic_t queue_count[XIO_PRIO_NR];
+ atomic_t completed_count;
+ atomic_t total_completed_count[XIO_PRIO_NR];
+
+ /* private */
+ spinlock_t lock;
+ struct list_head queue_list[XIO_PRIO_NR];
+ struct list_head submitted_list[2];
+ struct list_head completed_list;
+
+ wait_queue_head_t submit_event;
+ wait_queue_head_t response_event;
+ struct mapfree_info *mf;
+ struct block_device *bdev;
+ struct task_struct *submit_thread;
+ struct task_struct *response_thread;
+ int bvec_max;
+ bool submitted;
+};
+
+struct bio_input {
+ XIO_INPUT(bio);
+};
+
+struct bio_output {
+ XIO_OUTPUT(bio);
+};
+
+XIO_TYPES(bio);
+
+#endif
--
2.11.0

2016-12-30 23:01:20

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 08/32] mars: add new module lib_queue

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_queue.h | 165 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 165 insertions(+)
create mode 100644 include/linux/brick/lib_queue.h

diff --git a/include/linux/brick/lib_queue.h b/include/linux/brick/lib_queue.h
new file mode 100644
index 000000000000..72cd0a2710c2
--- /dev/null
+++ b/include/linux/brick/lib_queue.h
@@ -0,0 +1,165 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_QUEUE_H
+#define LIB_QUEUE_H
+
+#define QUEUE_ANCHOR(PREFIX, KEYTYPE, HEAPTYPE) \
+ /* parameters */ \
+ /* readonly from outside */ \
+ atomic_t q_queued; \
+ atomic_t q_flying; \
+ atomic_t q_total; \
+ /* tunables */ \
+ int q_batchlen; \
+ int q_io_prio; \
+ bool q_ordering; \
+ /* private */ \
+ wait_queue_head_t *q_event; \
+ spinlock_t q_lock; \
+ struct list_head q_anchor; \
+ struct pairing_heap_##HEAPTYPE *heap_high; \
+ struct pairing_heap_##HEAPTYPE *heap_low; \
+ long long q_last_insert; /* jiffies */ \
+ KEYTYPE heap_margin; \
+ KEYTYPE last_pos
+
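+/* Illustrative instantiation (a sketch only; the trans_logger brick in
+ * this series follows the same pattern with PREFIX=logger):
+ *
+ *	struct logger_queue {
+ *		QUEUE_ANCHOR(logger, loff_t, logger);
+ *	};
+ *
+ *	QUEUE_FUNCTIONS(logger, struct logger_head, lh_head,
+ *			lh_get, lh_cmp, logger);
+ *
+ * ELEM_TYPE must embed a struct list_head named HEAD and a
+ * struct pairing_heap_HEAPTYPE named ph.
+ */
+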
+#define QUEUE_FUNCTIONS(PREFIX, ELEM_TYPE, HEAD, KEYFN, KEYCMP, HEAPTYPE)\
+ \
+static inline \
+void q_##PREFIX##_trigger(struct PREFIX##_queue *q) \
+{ \
+ if (q->q_event) { \
+ wake_up_interruptible(q->q_event); \
+ } \
+} \
+ \
+static inline \
+void q_##PREFIX##_init(struct PREFIX##_queue *q) \
+{ \
+ INIT_LIST_HEAD(&q->q_anchor); \
+ q->heap_low = NULL; \
+ q->heap_high = NULL; \
+ spin_lock_init(&q->q_lock); \
+ atomic_set(&q->q_queued, 0); \
+ atomic_set(&q->q_flying, 0); \
+} \
+ \
+static inline \
+void q_##PREFIX##_insert(struct PREFIX##_queue *q, ELEM_TYPE * elem) \
+{ \
+ unsigned long flags; \
+ \
+ spin_lock_irqsave(&q->q_lock, flags); \
+ \
+ if (q->q_ordering) { \
+ struct pairing_heap_##HEAPTYPE **use = &q->heap_high; \
+ if (KEYCMP(KEYFN(elem), &q->heap_margin) <= 0) { \
+ use = &q->heap_low; \
+ } \
+ ph_insert_##HEAPTYPE(use, &elem->ph); \
+ } else { \
+ list_add_tail(&elem->HEAD, &q->q_anchor); \
+ } \
+ atomic_inc(&q->q_queued); \
+ atomic_inc(&q->q_total); \
+ q->q_last_insert = jiffies; \
+ \
+ spin_unlock_irqrestore(&q->q_lock, flags); \
+ \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+static inline \
+void q_##PREFIX##_pushback(struct PREFIX##_queue *q, ELEM_TYPE * elem) \
+{ \
+ unsigned long flags; \
+ \
+ if (q->q_ordering) { \
+ atomic_dec(&q->q_total); \
+ q_##PREFIX##_insert(q, elem); \
+ return; \
+ } \
+ \
+ spin_lock_irqsave(&q->q_lock, flags); \
+ \
+ list_add(&elem->HEAD, &q->q_anchor); \
+ atomic_inc(&q->q_queued); \
+ \
+ spin_unlock_irqrestore(&q->q_lock, flags); \
+} \
+ \
+static inline \
+ELEM_TYPE *q_##PREFIX##_fetch(struct PREFIX##_queue *q) \
+{ \
+ ELEM_TYPE *elem = NULL; \
+ unsigned long flags; \
+ \
+ spin_lock_irqsave(&q->q_lock, flags); \
+ \
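+ /* Two-heap elevator: inserts whose key lies behind the current \
+ * sweep position land in heap_low; when heap_high runs empty, \
+ * the heaps are swapped and a new ascending sweep starts. \
+ */ \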
+ if (q->q_ordering) { \
+ if (!q->heap_high) { \
+ q->heap_high = q->heap_low; \
+ q->heap_low = NULL; \
+ q->heap_margin = 0; \
+ q->last_pos = 0; \
+ } \
+ if (q->heap_high) { \
+ elem = container_of(q->heap_high, ELEM_TYPE, ph);\
+ \
+ if (unlikely(KEYCMP(KEYFN(elem), &q->last_pos) < 0)) {\
+ printk(KERN_ERR \
+ "backskip pos %lld -> %lld\n", (long long)q->last_pos, (long long)KEYFN(elem));\
+ } \
+ memcpy(&q->last_pos, KEYFN(elem), sizeof(q->last_pos));\
+ \
+ if (KEYCMP(KEYFN(elem), &q->heap_margin) > 0) { \
+ memcpy(&q->heap_margin, KEYFN(elem), sizeof(q->heap_margin));\
+ } \
+ ph_delete_min_##HEAPTYPE(&q->heap_high); \
+ atomic_dec(&q->q_queued); \
+ } \
+ } else if (!list_empty(&q->q_anchor)) { \
+ struct list_head *next = q->q_anchor.next; \
+ list_del_init(next); \
+ atomic_dec(&q->q_queued); \
+ elem = container_of(next, ELEM_TYPE, HEAD); \
+ } \
+ \
+ spin_unlock_irqrestore(&q->q_lock, flags); \
+ \
+ q_##PREFIX##_trigger(q); \
+ \
+ return elem; \
+} \
+ \
+static inline \
+void q_##PREFIX##_inc_flying(struct PREFIX##_queue *q) \
+{ \
+ atomic_inc(&q->q_flying); \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+static inline \
+void q_##PREFIX##_dec_flying(struct PREFIX##_queue *q) \
+{ \
+ atomic_dec(&q->q_flying); \
+ q_##PREFIX##_trigger(q); \
+} \
+ \
+
+#endif
--
2.11.0

2016-12-30 23:01:16

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 10/32] mars: add new module lib_limiter

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/lib/lib_limiter.c | 163 +++++++++++++++++++++++++++++++++
include/linux/brick/lib_limiter.h | 52 +++++++++++
2 files changed, 215 insertions(+)
create mode 100644 drivers/staging/mars/lib/lib_limiter.c
create mode 100644 include/linux/brick/lib_limiter.h

diff --git a/drivers/staging/mars/lib/lib_limiter.c b/drivers/staging/mars/lib/lib_limiter.c
new file mode 100644
index 000000000000..e77b74a0eae7
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_limiter.c
@@ -0,0 +1,163 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/brick/lib_limiter.h>
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+
+#define LIMITER_TIME_RESOLUTION NSEC_PER_SEC
+
+int rate_limit(struct rate_limiter *lim, int amount)
+{
+ int delay = 0;
+ long long now;
+
+ now = cpu_clock(raw_smp_processor_id());
+
+ /* Compute the maximum delay along the path
+ * down to the root of the hierarchy tree.
+ */
+ while (lim) {
+ long long window = now - lim->lim_stamp;
+
+ /* Sometimes, raw CPU clocks may do weird things...
+ * Windows smaller than 1s in the denominator could fake unrealistic rates.
+ */
+ if (unlikely(lim->lim_min_window <= 0))
+ lim->lim_min_window = 1000;
+ if (unlikely(lim->lim_max_window <= lim->lim_min_window))
+ lim->lim_max_window = lim->lim_min_window + 8000;
+ if (unlikely(window < (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000)))
+ window = (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+
+ /* Update total statistics.
+ * They will intentionally wrap around.
+ * Userspace must take care of that.
+ */
+ if (likely(amount > 0)) {
+ lim->lim_total_amount += amount;
+ lim->lim_total_ops++;
+ }
+
+ /* Only use incremental accumulation at repeated calls, but
+ * never after longer pauses.
+ */
+ if (likely(lim->lim_stamp &&
+ window < (long long)lim->lim_max_window * (LIMITER_TIME_RESOLUTION / 1000))) {
+ long long rate_raw;
+ int rate;
+ int max_rate;
+
+ /* Races are possible, but taken into account.
+ * There is no real harm from rarely lost updates.
+ */
+ if (likely(amount > 0)) {
+ lim->lim_amount_accu += amount;
+ lim->lim_amount_cumul += amount;
+ lim->lim_ops_accu++;
+ lim->lim_ops_cumul++;
+ }
+
+ /* compute amount values */
+ rate_raw = lim->lim_amount_accu * LIMITER_TIME_RESOLUTION / window;
+ rate = rate_raw;
+ if (unlikely(rate_raw > INT_MAX))
+ rate = INT_MAX;
+ lim->lim_amount_rate = rate;
+
+ /* amount limit exceeded? */
+ max_rate = lim->lim_max_amount_rate;
+ if (max_rate > 0 && rate > max_rate) {
+ int this_delay = (window * rate / max_rate - window) / (LIMITER_TIME_RESOLUTION / 1000);
+ /* compute maximum */
+ if (this_delay > delay && this_delay > 0)
+ delay = this_delay;
+ }
+
+ /* compute ops values */
+ rate_raw = lim->lim_ops_accu * LIMITER_TIME_RESOLUTION / window;
+ rate = rate_raw;
+ if (unlikely(rate_raw > INT_MAX))
+ rate = INT_MAX;
+ lim->lim_ops_rate = rate;
+
+ /* ops limit exceeded? */
+ max_rate = lim->lim_max_ops_rate;
+ if (max_rate > 0 && rate > max_rate) {
+ int this_delay = (window * rate / max_rate - window) / (LIMITER_TIME_RESOLUTION / 1000);
+ /* compute maximum */
+ if (this_delay > delay && this_delay > 0)
+ delay = this_delay;
+ }
+
+ /* Try to keep the next window below min_window
+ */
+ window -= lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+ if (window > 0) {
+ long long used_up = (long long)lim->lim_amount_rate * window / LIMITER_TIME_RESOLUTION;
+
+ if (used_up > 0) {
+ lim->lim_stamp += window;
+ lim->lim_amount_accu -= used_up;
+ if (unlikely(lim->lim_amount_accu < 0))
+ lim->lim_amount_accu = 0;
+ }
+ used_up = (long long)lim->lim_ops_rate * window / LIMITER_TIME_RESOLUTION;
+ if (used_up > 0) {
+ lim->lim_stamp += window;
+ lim->lim_ops_accu -= used_up;
+ if (unlikely(lim->lim_ops_accu < 0))
+ lim->lim_ops_accu = 0;
+ }
+ }
+ } else { /* reset, start over with new measurement cycle */
+ if (unlikely(amount < 0))
+ amount = 0;
+ lim->lim_ops_accu = 1;
+ lim->lim_amount_accu = amount;
+ lim->lim_stamp = now - lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000);
+ lim->lim_ops_rate = 0;
+ lim->lim_amount_rate = 0;
+ }
+ lim = lim->lim_father;
+ }
+ return delay;
+}
+
+void rate_limit_sleep(struct rate_limiter *lim, int amount)
+{
+ int sleep = rate_limit(lim, amount);
+
+ if (sleep > 0) {
+ unsigned long timeout;
+
+ if (unlikely(lim->lim_max_delay <= 0))
+ lim->lim_max_delay = 1000;
+ if (sleep > lim->lim_max_delay)
+ sleep = lim->lim_max_delay;
+ timeout = msecs_to_jiffies(sleep);
+ while ((long)timeout > 0)
+ timeout = schedule_timeout_uninterruptible(timeout);
+ }
+}
diff --git a/include/linux/brick/lib_limiter.h b/include/linux/brick/lib_limiter.h
new file mode 100644
index 000000000000..fab0c0e1858f
--- /dev/null
+++ b/include/linux/brick/lib_limiter.h
@@ -0,0 +1,52 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_LIMITER_H
+#define LIB_LIMITER_H
+
+#include <linux/utsname.h>
+
+struct rate_limiter {
+ /* hierarchy tree */
+ struct rate_limiter *lim_father;
+
+ /* tunables */
+ int lim_max_ops_rate;
+ int lim_max_amount_rate;
+ int lim_max_delay;
+ int lim_min_window;
+ int lim_max_window;
+
+ /* readable */
+ int lim_ops_rate;
+ int lim_amount_rate;
+ int lim_ops_cumul;
+ int lim_amount_cumul;
+ int lim_total_ops;
+ int lim_total_amount;
+ long long lim_stamp;
+
+ /* internal */
+ long long lim_ops_accu;
+ long long lim_amount_accu;
+};
+
+extern int rate_limit(struct rate_limiter *lim, int amount);
+
+extern void rate_limit_sleep(struct rate_limiter *lim, int amount);
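+
+/* Typical usage (a sketch; the value is hypothetical): amounts are in
+ * whatever unit the caller accounts, e.g. KiB in the xio server, and
+ * all-zero tunables mean "unlimited":
+ *
+ *	static struct rate_limiter my_limiter = {
+ *		.lim_max_amount_rate = 10240,
+ *	};
+ *
+ *	rate_limit_sleep(&my_limiter, (len - 1) / 1024 + 1);
+ */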
+
+#endif
--
2.11.0

2016-12-30 23:01:25

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 05/32] mars: add new module meta

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/meta.h | 106 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)
create mode 100644 include/linux/brick/meta.h

diff --git a/include/linux/brick/meta.h b/include/linux/brick/meta.h
new file mode 100644
index 000000000000..a92b2b649c1f
--- /dev/null
+++ b/include/linux/brick/meta.h
@@ -0,0 +1,106 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef META_H
+#define META_H
+
+/***********************************************************************/
+
+/* metadata descriptions */
+
+/* The idea is to describe your C structures in such a way that
+ * transfers to disk or over a network become self-describing.
+ *
+ * In essence, this is a kind of version-independent marshalling.
+ *
+ * Advantage:
+ * When you extend your original C struct (and of course update the
+ * corresponding meta structure), old data on disk (or network peers
+ * running an old version of your program) will remain valid.
+ * Upon read, newly added fields missing in the old version will simply
+ * not be filled in and will therefore remain zeroed (provided you don't
+ * forget to initially clear your structures via memset() / initializers / etc.).
+ * Note that this works only if you never rename or remove existing
+ * fields; you should only add new ones.
+ * [TODO: add macros for description of ignored / renamed fields to
+ * overcome this limitation]
+ * You may increase the size of integers, for example from 32bit to 64bit
+ * or even higher; sign extension will be automatically carried out
+ * when necessary.
+ * Also, you may change the order of fields, because the metadata interpreter
+ * will check each field individually; field offsets are automatically
+ * maintained.
+ *
+ * Disadvantage: this adds some (small) overhead.
+ */
+
+enum field_type {
+ FIELD_DONE,
+ FIELD_REF,
+ FIELD_SUB,
+ FIELD_STRING,
+ FIELD_RAW,
+ FIELD_INT,
+ FIELD_UINT,
+};
+
+struct meta {
+ /* char field_name[MAX_FIELD_LEN]; */
+ char *field_name;
+
+ short field_type;
+ short field_data_size;
+ short field_transfer_size;
+ int field_offset;
+ const struct meta *field_ref;
+};
+
+#define _META_INI(NAME, STRUCT, TYPE, TSIZE) \
+ .field_name = #NAME, \
+ .field_type = TYPE, \
+ .field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+ .field_transfer_size = (TSIZE), \
+ .field_offset = offsetof(STRUCT, NAME) \
+
+#define META_INI_TRANSFER(NAME, STRUCT, TYPE, TSIZE) \
+ { _META_INI(NAME, STRUCT, TYPE, TSIZE) }
+
+#define META_INI(NAME, STRUCT, TYPE) \
+ { _META_INI(NAME, STRUCT, TYPE, 0) }
+
+#define _META_INI_AIO(NAME, STRUCT, AIO) \
+ .field_name = #NAME, \
+ .field_type = FIELD_REF, \
+ .field_data_size = sizeof(*(((STRUCT *)NULL)->NAME)), \
+ .field_offset = offsetof(STRUCT, NAME), \
+ .field_ref = AIO
+
+#define META_INI_AIO(NAME, STRUCT, AIO) { _META_INI_AIO(NAME, STRUCT, AIO) }
+
+#define _META_INI_SUB(NAME, STRUCT, SUB) \
+ .field_name = #NAME, \
+ .field_type = FIELD_SUB, \
+ .field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+ .field_offset = offsetof(STRUCT, NAME), \
+ .field_ref = SUB
+
+#define META_INI_SUB(NAME, STRUCT, SUB) { _META_INI_SUB(NAME, STRUCT, SUB) }
+
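+/* Illustrative usage (a sketch only; the struct and field names are
+ * hypothetical):
+ *
+ *	struct my_record {
+ *		int seq;
+ *		long long stamp;
+ *		char *name;
+ *	};
+ *
+ *	static const struct meta my_record_meta[] = {
+ *		META_INI(seq, struct my_record, FIELD_INT),
+ *		META_INI(stamp, struct my_record, FIELD_INT),
+ *		META_INI(name, struct my_record, FIELD_STRING),
+ *		{}
+ *	};
+ *
+ * Widening stamp or appending new fields later keeps old on-disk data
+ * readable, as described above.
+ */
+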
+extern const struct meta *find_meta(const struct meta *meta, const char *field_name);
+/* extern void free_meta(void *data, const struct meta *meta); */
+
+#endif
--
2.11.0

2016-12-30 23:01:36

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 22/32] mars: add new module xio_trans_logger

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 ++++++++++++++++++++
include/linux/xio/xio_trans_logger.h | 271 ++
2 files changed, 3681 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_trans_logger.c
create mode 100644 include/linux/xio/xio_trans_logger.h

diff --git a/drivers/staging/mars/xio_bricks/xio_trans_logger.c b/drivers/staging/mars/xio_bricks/xio_trans_logger.c
new file mode 100644
index 000000000000..f82e9075ac5a
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_trans_logger.c
@@ -0,0 +1,3410 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Trans_Logger brick */
+
+#define XIO_DEBUGGING
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/bio.h>
+
+#include <linux/xio/xio.h>
+#include <linux/brick/lib_limiter.h>
+
+#include <linux/xio/xio_trans_logger.h>
+
+/* variants */
+#define KEEP_UNIQUE
+#define DELAY_CALLERS /* this is _needed_ for production systems */
+/* When possible, queue 1 executes phase3_startio() directly, without
+ * intermediate queueing into queue 3 => this may be irritating, but it
+ * gives better performance. NOTICE: should the IO scheduling ever
+ * differ between queues 1 and 3, you MUST disable this in order
+ * to distinguish between them!
+ */
+#define SHORTCUT_1_to_3
+
+/* commenting this out is dangerous for data integrity! use only for testing! */
+#define USE_MEMCPY
+#define DO_WRITEBACK /* otherwise FAKE IO */
+#define REPLAY_DATA
+
+/* tuning */
+#ifdef BRICK_DEBUG_MEM
+#define CONF_TRANS_CHUNKSIZE (128 * 1024 - PAGE_SIZE * 2)
+#else
+#define CONF_TRANS_CHUNKSIZE (128 * 1024)
+#endif
+#define CONF_TRANS_MAX_AIO_SIZE PAGE_SIZE
+#define CONF_TRANS_ALIGN 0
+
+#define XIO_RPL(_args...) /*empty*/
+
+struct trans_logger_hash_anchor {
+ struct rw_semaphore hash_mutex;
+ struct list_head hash_anchor;
+};
+
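+/* The hash table is two-level: an array of pages, each holding
+ * HASH_PER_PAGE anchors, indexed via hash / HASH_PER_PAGE and
+ * hash % HASH_PER_PAGE (see hash_find() etc. below).
+ */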
+#define NR_HASH_PAGES 64
+
+#define MAX_HASH_PAGES (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor *))
+#define HASH_PER_PAGE (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor))
+#define HASH_TOTAL (NR_HASH_PAGES * HASH_PER_PAGE)
+
+#define STATIST_SIZE 2048
+
+/************************ global tuning ***********************/
+
+int trans_logger_completion_semantics = 1;
+
+int trans_logger_do_crc =
+#ifdef CONFIG_MARS_DEBUG
+ true;
+#else
+ false;
+#endif
+
+int trans_logger_mem_usage; /* in KB */
+
+int trans_logger_max_interleave = -1;
+
+int trans_logger_resume = 1;
+
+int trans_logger_replay_timeout = 1; /* in s */
+
+struct writeback_group global_writeback = {
+ .lock = __RW_LOCK_UNLOCKED(global_writeback.lock),
+ .group_anchor = LIST_HEAD_INIT(global_writeback.group_anchor),
+ .until_percent = 30,
+};
+
+static
+void add_to_group(struct writeback_group *gr, struct trans_logger_brick *brick)
+{
+ unsigned long flags;
+
+ write_lock_irqsave(&gr->lock, flags);
+ list_add_tail(&brick->group_head, &gr->group_anchor);
+ write_unlock_irqrestore(&gr->lock, flags);
+}
+
+static
+void remove_from_group(struct writeback_group *gr, struct trans_logger_brick *brick)
+{
+ unsigned long flags;
+
+ write_lock_irqsave(&gr->lock, flags);
+ list_del_init(&brick->group_head);
+ gr->leader = NULL;
+ write_unlock_irqrestore(&gr->lock, flags);
+}
+
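+/* Elect the brick with the most shadow memory as writeback leader.
+ * A leader stays elected as long as its usage remains above
+ * until_percent of the biggest usage seen so far.
+ */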
+static
+struct trans_logger_brick *elect_leader(struct writeback_group *gr)
+{
+ struct trans_logger_brick *res = gr->leader;
+ struct list_head *tmp;
+ unsigned long flags;
+
+ if (res && gr->until_percent >= 0) {
+ loff_t used = atomic64_read(&res->shadow_mem_used);
+
+ if (used > gr->biggest * gr->until_percent / 100)
+ goto done;
+ }
+
+ read_lock_irqsave(&gr->lock, flags);
+ for (tmp = gr->group_anchor.next; tmp != &gr->group_anchor; tmp = tmp->next) {
+ struct trans_logger_brick *test = container_of(tmp, struct trans_logger_brick, group_head);
+ loff_t new_used = atomic64_read(&test->shadow_mem_used);
+
+ if (!res || new_used > atomic64_read(&res->shadow_mem_used)) {
+ res = test;
+ gr->biggest = new_used;
+ }
+ }
+ read_unlock_irqrestore(&gr->lock, flags);
+
+ gr->leader = res;
+
+done:
+ return res;
+}
+
+/************************ own type definitions ***********************/
+
+static inline
+int lh_cmp(loff_t *a, loff_t *b)
+{
+ if (*a < *b)
+ return -1;
+ if (*a > *b)
+ return 1;
+ return 0;
+}
+
+static inline
+int tr_cmp(struct pairing_heap_logger *_a, struct pairing_heap_logger *_b)
+{
+ struct logger_head *a = container_of(_a, struct logger_head, ph);
+ struct logger_head *b = container_of(_b, struct logger_head, ph);
+
+ return lh_cmp(a->lh_pos, b->lh_pos);
+}
+
+_PAIRING_HEAP_FUNCTIONS(static, logger, tr_cmp);
+
+static inline
+loff_t *lh_get(struct logger_head *th)
+{
+ return th->lh_pos;
+}
+
+QUEUE_FUNCTIONS(logger, struct logger_head, lh_head, lh_get, lh_cmp, logger);
+
+/************************* logger queue handling ***********************/
+
+static inline
+void qq_init(struct logger_queue *q, struct trans_logger_brick *brick)
+{
+ q_logger_init(q);
+ q->q_event = &brick->worker_event;
+ q->q_brick = brick;
+}
+
+static inline
+void qq_inc_flying(struct logger_queue *q)
+{
+ q_logger_inc_flying(q);
+}
+
+static inline
+void qq_dec_flying(struct logger_queue *q)
+{
+ q_logger_dec_flying(q);
+}
+
+static inline
+void qq_aio_insert(struct logger_queue *q, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&q->q_brick->inner_balance_count);
+
+ q_logger_insert(q, &aio_a->lh);
+}
+
+static inline
+void qq_wb_insert(struct logger_queue *q, struct writeback_info *wb)
+{
+ q_logger_insert(q, &wb->w_lh);
+}
+
+static inline
+void qq_aio_pushback(struct logger_queue *q, struct trans_logger_aio_aspect *aio_a)
+{
+ obj_check(aio_a->object);
+
+ q->pushback_count++;
+
+ q_logger_pushback(q, &aio_a->lh);
+}
+
+static inline
+void qq_wb_pushback(struct logger_queue *q, struct writeback_info *wb)
+{
+ q->pushback_count++;
+ q_logger_pushback(q, &wb->w_lh);
+}
+
+static inline
+struct trans_logger_aio_aspect *qq_aio_fetch(struct logger_queue *q)
+{
+ struct logger_head *test;
+ struct trans_logger_aio_aspect *aio_a = NULL;
+
+ test = q_logger_fetch(q);
+
+ if (test) {
+ aio_a = container_of(test, struct trans_logger_aio_aspect, lh);
+ obj_check(aio_a->object);
+ }
+ return aio_a;
+}
+
+static inline
+struct writeback_info *qq_wb_fetch(struct logger_queue *q)
+{
+ struct logger_head *test;
+ struct writeback_info *res = NULL;
+
+ test = q_logger_fetch(q);
+
+ if (test)
+ res = container_of(test, struct writeback_info, w_lh);
+ return res;
+}
+
+/************************ own helper functions ***********************/
+
+static inline
+int hash_fn(loff_t pos)
+{
+ /* simple and stupid: bucket by REGION_SIZE-aligned windows */
+ long base_index = pos >> REGION_SIZE_BITS;
+
+ base_index += base_index / HASH_TOTAL / 7;
+ return base_index % HASH_TOTAL;
+}
+
+static inline
+struct trans_logger_aio_aspect *_hash_find(
+struct list_head *start, loff_t pos, int *max_len, bool use_collect_head, bool find_unstable)
+{
+ struct list_head *tmp;
+ struct trans_logger_aio_aspect *res = NULL;
+ int len = *max_len;
+
+ /* The lists are always sorted according to age (newest first).
+ * Caution: there may be duplicates in the list, some of them
+ * overlapping with the search area in many different ways.
+ */
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+ int diff;
+
+ if (use_collect_head)
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ else
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ obj_check(test);
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* searching for unstable elements (only in special cases) */
+ if (find_unstable && test_a->is_stable)
+ break;
+
+ diff = test->io_pos - pos;
+ if (diff <= 0) {
+ int restlen = test->io_len + diff;
+
+ res = test_a;
+ if (restlen < len)
+ len = restlen;
+ break;
+ }
+ if (diff < len)
+ len = diff;
+ }
+
+ *max_len = len;
+ return res;
+}
+
+static
+struct trans_logger_aio_aspect *hash_find(
+struct trans_logger_brick *brick, loff_t pos, int *max_len, bool find_unstable)
+{
+ int hash = hash_fn(pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+ struct trans_logger_aio_aspect *res;
+
+ atomic_inc(&brick->total_hash_find_count);
+
+ down_read(&start->hash_mutex);
+
+ res = _hash_find(&start->hash_anchor, pos, max_len, false, find_unstable);
+
+ /* Ensure the found aio can't go away...
+ */
+ if (res && res->object)
+ obj_get(res->object);
+
+ up_read(&start->hash_mutex);
+
+ return res;
+}
+
+static
+void hash_insert(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *elem_a)
+{
+ int hash = hash_fn(elem_a->object->io_pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+
+ CHECK_HEAD_EMPTY(&elem_a->hash_head);
+ obj_check(elem_a->object);
+
+ /* only for statistics: */
+ atomic_inc(&brick->hash_count);
+ atomic_inc(&brick->total_hash_insert_count);
+
+ down_write(&start->hash_mutex);
+
+ list_add(&elem_a->hash_head, &start->hash_anchor);
+ elem_a->is_hashed = true;
+
+ up_write(&start->hash_mutex);
+}
+
+/* Find the transitive closure of overlapping requests
+ * and collect them into a list.
+ */
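+/* Example: with requests [0k..4k) and [3k..8k) already hashed, extending
+ * the region [2k..3k) first pulls in [0k..4k), then transitively [3k..8k),
+ * resulting in [0k..8k).
+ * Note: since trans_logger_io_get() clips all requests at REGION_SIZE
+ * boundaries, the whole closure stays within a single hash bucket, so
+ * taking only start->hash_mutex is sufficient.
+ */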
+static
+void hash_extend(struct trans_logger_brick *brick, loff_t *_pos, int *_len, struct list_head *collect_list)
+{
+ loff_t pos = *_pos;
+ int len = *_len;
+ int hash = hash_fn(pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+ struct list_head *tmp;
+ bool extended;
+
+ if (collect_list)
+ CHECK_HEAD_EMPTY(collect_list);
+
+ atomic_inc(&brick->total_hash_extend_count);
+
+ down_read(&start->hash_mutex);
+
+ do {
+ extended = false;
+
+ for (tmp = start->hash_anchor.next; tmp != &start->hash_anchor; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+ loff_t diff;
+
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ obj_check(test);
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* collision detection */
+ if (test_a->is_collected)
+ goto collision;
+
+ /* no writeback of non-persistent data */
+ if (!(test_a->is_persistent && test_a->is_completed))
+ goto collision;
+
+ /* extend the search region when necessary */
+ diff = pos - test->io_pos;
+ if (diff > 0) {
+ len += diff;
+ pos = test->io_pos;
+ extended = true;
+ }
+ diff = (test->io_pos + test->io_len) - (pos + len);
+ if (diff > 0) {
+ len += diff;
+ extended = true;
+ }
+ }
+ } while (extended); /* start over for transitive closure */
+
+ *_pos = pos;
+ *_len = len;
+
+ for (tmp = start->hash_anchor.next; tmp != &start->hash_anchor; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *test_a;
+ struct aio_object *test;
+
+ test_a = container_of(tmp, struct trans_logger_aio_aspect, hash_head);
+ test = test_a->object;
+
+ /* are the regions overlapping? */
+ if (pos >= test->io_pos + test->io_len || pos + len <= test->io_pos)
+ continue; /* not relevant */
+
+ /* collect */
+ CHECK_HEAD_EMPTY(&test_a->collect_head);
+ if (unlikely(test_a->is_collected))
+ XIO_ERR("collision detection did not work\n");
+ test_a->is_collected = true;
+ obj_check(test);
+ list_add_tail(&test_a->collect_head, collect_list);
+ }
+
+collision:
+ up_read(&start->hash_mutex);
+}
+
+/* Atomically put all elements from the list.
+ * All elements must reside in the same collision list.
+ */
+static inline
+void hash_put_all(struct trans_logger_brick *brick, struct list_head *list)
+{
+ struct list_head *tmp;
+ struct trans_logger_hash_anchor *start = NULL;
+ int first_hash = -1;
+
+ for (tmp = list->next; tmp != list; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *elem_a;
+ struct aio_object *elem;
+ int hash;
+
+ elem_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ elem = elem_a->object;
+ CHECK_PTR(elem, err);
+ obj_check(elem);
+
+ hash = hash_fn(elem->io_pos);
+ if (!start) {
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+
+ start = &sub_table[hash % HASH_PER_PAGE];
+ first_hash = hash;
+ down_write(&start->hash_mutex);
+ } else if (unlikely(hash != first_hash)) {
+ XIO_ERR("oops, different hashes: %d != %d\n", hash, first_hash);
+ }
+
+ if (!elem_a->is_hashed)
+ continue;
+
+ list_del_init(&elem_a->hash_head);
+ elem_a->is_hashed = false;
+ atomic_dec(&brick->hash_count);
+ }
+
+err:
+ if (start)
+ up_write(&start->hash_mutex);
+}
+
+static inline
+void hash_ensure_stableness(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ if (!aio_a->is_stable) {
+ struct aio_object *aio = aio_a->object;
+ int hash = hash_fn(aio->io_pos);
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[hash / HASH_PER_PAGE];
+ struct trans_logger_hash_anchor *start = &sub_table[hash % HASH_PER_PAGE];
+
+ down_write(&start->hash_mutex);
+
+ aio_a->is_stable = true;
+
+ up_write(&start->hash_mutex);
+ }
+}
+
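+/* Propagate the current logfile positions upwards (e.g. to the symlink
+ * updates). Unless forced, calls are rate-limited to one per 4 seconds
+ * in order to keep the metadata update overhead low.
+ */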
+static
+void _inf_callback(struct trans_logger_input *input, bool force)
+{
+ if (!force &&
+ input->inf_last_jiffies &&
+ input->inf_last_jiffies + 4 * HZ > (long long)jiffies)
+ goto out_return;
+ if (input->inf.inf_callback && input->is_operating) {
+ input->inf_last_jiffies = jiffies;
+
+ input->inf.inf_callback(&input->inf);
+
+ input->inf_last_jiffies = jiffies;
+ } else {
+ XIO_DBG(
+ "%p skipped callback, callback = %p is_operating = %d\n",
+ input,
+ input->inf.inf_callback,
+ input->is_operating);
+ }
+out_return:;
+}
+
+static inline
+int _congested(struct trans_logger_brick *brick, int nr_queues)
+{
+ int i;
+
+ for (i = 0; i < nr_queues; i++)
+ if (atomic_read(&brick->q_phase[i].q_queued) ||
+ atomic_read(&brick->q_phase[i].q_flying))
+ return 1;
+ return 0;
+}
+
+/***************** own brick * input * output operations *****************/
+
+atomic_t global_mshadow_count = ATOMIC_INIT(0);
+
+atomic64_t global_mshadow_used = ATOMIC64_INIT(0);
+
+static
+int trans_logger_get_info(struct trans_logger_output *output, struct xio_info *info)
+{
+ struct trans_logger_input *input = output->brick->inputs[TL_INPUT_READ];
+
+ return GENERIC_INPUT_CALL(input, xio_get_info, info);
+}
+
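+/* Attach a request to an already existing master shadow as a "slave
+ * shadow": instead of allocating new memory, the slave reuses the
+ * master's shadow buffer at the appropriate offset. This keeps
+ * concurrent overlapping writes (and re-reads of data not yet written
+ * back) consistent in memory.
+ */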
+static
+int _make_sshadow(
+struct trans_logger_output *output,
+struct trans_logger_aio_aspect *aio_a,
+struct trans_logger_aio_aspect *mshadow_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ struct aio_object *mshadow;
+ int diff;
+
+ mshadow = mshadow_a->object;
+ if (unlikely(aio->io_len > mshadow->io_len)) {
+ XIO_ERR("oops %d -> %d\n", aio->io_len, mshadow->io_len);
+ aio->io_len = mshadow->io_len;
+ }
+ if (unlikely(mshadow_a == aio_a)) {
+ XIO_ERR("oops %p == %p\n", mshadow_a, aio_a);
+ return -EINVAL;
+ }
+
+ diff = aio->io_pos - mshadow->io_pos;
+ if (unlikely(diff < 0)) {
+ XIO_ERR("oops diff = %d\n", diff);
+ return -EINVAL;
+ }
+
+ /* Attach aio to the existing shadow ("slave shadow").
+ */
+ aio_a->shadow_data = mshadow_a->shadow_data + diff;
+ aio_a->do_dealloc = false;
+ if (!aio->io_data) { /* buffered IO */
+ aio->io_data = aio_a->shadow_data;
+ aio_a->do_buffered = true;
+ atomic_inc(&brick->total_sshadow_buffered_count);
+ }
+ aio->io_flags = mshadow->io_flags;
+ aio_a->shadow_aio = mshadow_a;
+ aio_a->my_brick = brick;
+
+ /* Get an ordinary internal reference
+ */
+ obj_get_first(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+
+ /* The internal reference from slave to master is already
+ * present due to hash_find(),
+ * such that the master cannot go away before the slave.
+ * It is compensated by master transition in __trans_logger_io_put()
+ */
+ atomic_inc(&brick->inner_balance_count);
+
+ atomic_inc(&brick->sshadow_count);
+ atomic_inc(&brick->total_sshadow_count);
+
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("oops, len = %d\n", aio->io_len);
+ return -EINVAL;
+ }
+
+ return aio->io_len;
+}
+
+static
+int _read_io_get(struct trans_logger_output *output, struct trans_logger_aio_aspect *aio_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ struct trans_logger_input *input = brick->inputs[TL_INPUT_READ];
+ struct trans_logger_aio_aspect *mshadow_a;
+
+ /* Check whether there is a newer version on the fly, shadowing
+ * the old one.
+ * When a shadow is found, use it as the buffer for the aio.
+ */
+ mshadow_a = hash_find(brick, aio->io_pos, &aio->io_len, false);
+ if (!mshadow_a)
+ return GENERIC_INPUT_CALL(input, aio_get, aio);
+
+ return _make_sshadow(output, aio_a, mshadow_a);
+}
+
+static
+int _write_io_get(struct trans_logger_output *output, struct trans_logger_aio_aspect *aio_a)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct aio_object *aio = aio_a->object;
+ void *data;
+
+#ifdef KEEP_UNIQUE
+ struct trans_logger_aio_aspect *mshadow_a;
+
+#endif
+
+#ifdef CONFIG_MARS_DEBUG
+ if (unlikely(aio->io_len <= 0)) {
+ XIO_ERR("oops, io_len = %d\n", aio->io_len);
+ return -EINVAL;
+ }
+#endif
+
+#ifdef KEEP_UNIQUE
+ mshadow_a = hash_find(brick, aio->io_pos, &aio->io_len, true);
+ if (mshadow_a)
+ return _make_sshadow(output, aio_a, mshadow_a);
+#endif
+
+#ifdef DELAY_CALLERS
+ /* delay in case of too many master shadows / memory shortage */
+ wait_event_interruptible_timeout(
+ brick->caller_event,
+ !brick->delay_callers &&
+ (brick_global_memlimit < 1024 ||
+ atomic64_read(&global_mshadow_used) / 1024 < brick_global_memlimit),
+ HZ / 2);
+#endif
+
+ /* create a new master shadow */
+ data = brick_block_alloc(aio->io_pos, aio->io_len);
+ aio_a->alloc_len = aio->io_len;
+ atomic64_add(aio->io_len, &brick->shadow_mem_used);
+#ifdef CONFIG_MARS_DEBUG
+ memset(data, 0x11, aio->io_len);
+#endif
+ aio_a->shadow_data = data;
+ aio_a->do_dealloc = true;
+ if (!aio->io_data) { /* buffered IO */
+ aio->io_data = data;
+ aio_a->do_buffered = true;
+ atomic_inc(&brick->total_mshadow_buffered_count);
+ }
+ aio_a->my_brick = brick;
+ aio->io_flags = 0;
+ aio_a->shadow_aio = aio_a; /* cyclic self-reference => indicates master shadow */
+
+ atomic_inc(&brick->mshadow_count);
+ atomic_inc(&brick->total_mshadow_count);
+ atomic_inc(&global_mshadow_count);
+ atomic64_add(aio->io_len, &global_mshadow_used);
+
+ atomic_inc(&brick->inner_balance_count);
+ obj_get_first(aio); /* must be paired with __trans_logger_io_put() */
+
+ return aio->io_len;
+}
+
+static
+int trans_logger_io_get(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick;
+ struct trans_logger_aio_aspect *aio_a;
+ loff_t base_offset;
+
+ CHECK_PTR(output, err);
+ brick = output->brick;
+ CHECK_PTR(brick, err);
+ CHECK_PTR(aio, err);
+
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ atomic_inc(&brick->outer_balance_count);
+
+ if (aio->obj_initialized) { /* setup already performed */
+ obj_check(aio);
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ return aio->io_len;
+ }
+
+ get_lamport(&aio_a->stamp);
+
+ if (aio->io_len > CONF_TRANS_MAX_AIO_SIZE && CONF_TRANS_MAX_AIO_SIZE > 0)
+ aio->io_len = CONF_TRANS_MAX_AIO_SIZE;
+
+ /* ensure that REGION_SIZE boundaries are obeyed by hashing */
+ base_offset = aio->io_pos & (loff_t)(REGION_SIZE - 1);
+ if (aio->io_len > REGION_SIZE - base_offset)
+ aio->io_len = REGION_SIZE - base_offset;
+
+ /* Reads go directly through when possible.
+ * When necessary, slave shadow buffers are used.
+ * The latter happens only during re-read of data which is pending
+ * in writeback.
+ */
+ if (aio->io_may_write == READ)
+ return _read_io_get(output, aio_a);
+
+ /* Only in emergency mode: directly access the underlying disk.
+ */
+ if (brick->stopped_logging) { /* only in EMERGENCY mode */
+ struct trans_logger_input *input = brick->inputs[TL_INPUT_READ];
+
+ aio_a->is_emergency = true;
+ return GENERIC_INPUT_CALL(input, aio_get, aio);
+ }
+
+ /* FIXME: THIS IS PROVISIONAL PARANOIA, to be removed.
+ * It should not be necessary at all, but when trying to achieve
+ * better reliability than the hardware (more than 99.999%)
+ * I became a little bit paranoid.
+ * I found some extremely rare, unexplainable cases where IO was
+ * probably submitted by XFS _after_ the device was closed.
+ * Reproduction is extremely hard.
+ * Probably the iSCSI session management also plays a role
+ * at the wrong moment.
+ */
+ while (unlikely(!brick->power.on_led))
+ brick_msleep(HZ / 10);
+
+ return _write_io_get(output, aio_a);
+
+err:
+ return -EINVAL;
+}
+
+static void pos_complete(struct trans_logger_aio_aspect *orig_aio_a);
+
+static
+void __trans_logger_io_put(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio;
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_input *input;
+
+restart:
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+
+ obj_check(aio);
+
+ /* are we a shadow (whether master or slave)? */
+ shadow_a = aio_a->shadow_aio;
+ if (shadow_a) {
+ bool finished;
+
+ CHECK_PTR(shadow_a, err);
+ CHECK_PTR(shadow_a->object, err);
+ obj_check(shadow_a->object);
+
+ finished = obj_put(aio);
+ atomic_dec(&brick->inner_balance_count);
+ if (unlikely(finished && aio_a->is_hashed)) {
+ XIO_ERR("trying to put a hashed aio, pos = %lld len = %d\n", aio->io_pos, aio->io_len);
+ finished = false; /* leaves a memleak */
+ }
+
+ if (!finished)
+ goto out_return;
+
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ CHECK_HEAD_EMPTY(&aio_a->replay_head);
+ CHECK_HEAD_EMPTY(&aio_a->collect_head);
+ CHECK_HEAD_EMPTY(&aio_a->sub_list);
+ CHECK_HEAD_EMPTY(&aio_a->sub_head);
+
+ if (aio_a->is_collected && likely(aio_a->wb_error >= 0))
+ pos_complete(aio_a);
+
+ CHECK_HEAD_EMPTY(&aio_a->pos_head);
+
+ if (shadow_a != aio_a) { /* we are a slave shadow */
+ /* XIO_DBG("slave\n"); */
+ atomic_dec(&brick->sshadow_count);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ obj_free(aio);
+ /* now put the master shadow */
+ aio_a = shadow_a;
+ goto restart;
+ }
+ /* we are a master shadow */
+ CHECK_PTR(aio_a->shadow_data, err);
+ if (aio_a->do_dealloc) {
+ brick_block_free(aio_a->shadow_data, aio_a->alloc_len);
+ atomic64_sub(aio_a->alloc_len, &brick->shadow_mem_used);
+ aio_a->shadow_data = NULL;
+ aio_a->do_dealloc = false;
+ }
+ if (aio_a->do_buffered)
+ aio->io_data = NULL;
+ atomic_dec(&brick->mshadow_count);
+ atomic_dec(&global_mshadow_count);
+ atomic64_sub(aio->io_len, &global_mshadow_used);
+ obj_free(aio);
+ goto out_return;
+ }
+
+ /* only READ is allowed on non-shadow buffers */
+ if (unlikely(aio->io_rw != READ && !aio_a->is_emergency))
+ XIO_FAT("bad operation %d on non-shadow\n", aio->io_rw);
+
+ /* no shadow => call through */
+ input = brick->inputs[TL_INPUT_READ];
+ CHECK_PTR(input, err);
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+
+err:;
+out_return:;
+}
+
+static
+void _trans_logger_io_put(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_aio_aspect *aio_a;
+
+ aio_a = trans_logger_aio_get_aspect(output->brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ __trans_logger_io_put(output->brick, aio_a);
+ goto out_return;
+err:
+ XIO_FAT("giving up...\n");
+out_return:;
+}
+
+static
+void trans_logger_io_put(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick = output->brick;
+
+ atomic_dec(&brick->outer_balance_count);
+ _trans_logger_io_put(output, aio);
+}
+
+static
+void _trans_logger_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *aio_a;
+ struct trans_logger_brick *brick;
+
+ _crashme(20, false);
+
+ aio_a = cb->cb_private;
+ CHECK_PTR(aio_a, err);
+ if (unlikely(&aio_a->cb != cb)) {
+ XIO_FAT("bad callback -- hanging up\n");
+ goto err;
+ }
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ NEXT_CHECKED_CALLBACK(cb, err);
+
+ if (aio_a->my_queue)
+ qq_dec_flying(aio_a->my_queue);
+ atomic_dec(&brick->any_fly_count);
+ atomic_inc(&brick->total_cb_count);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle callback\n");
+out_return:;
+}
+
+static
+void __trans_logger_io_io(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_input *input;
+
+ /* is this a shadow buffer? */
+ shadow_a = aio_a->shadow_aio;
+ if (shadow_a) {
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ CHECK_HEAD_EMPTY(&aio_a->pos_head);
+
+ obj_get(aio_a->object); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+
+ qq_aio_insert(&brick->q_phase[0], aio_a);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+ }
+
+ /* only READ is allowed on non-shadow buffers */
+ if (unlikely(aio_a->object->io_rw != READ && !aio_a->is_emergency))
+ XIO_FAT("bad operation %d on non-shadow\n", aio_a->object->io_rw);
+
+ atomic_inc(&brick->any_fly_count);
+
+ INSERT_CALLBACK(aio_a->object, &aio_a->cb, _trans_logger_endio, aio_a);
+
+ input = brick->inputs[TL_INPUT_READ];
+
+ GENERIC_INPUT_CALL(input, aio_io, aio_a->object);
+out_return:;
+}
+
+static
+void trans_logger_io_io(struct trans_logger_output *output, struct aio_object *aio)
+{
+ struct trans_logger_brick *brick = output->brick;
+ struct trans_logger_aio_aspect *aio_a;
+
+ obj_check(aio);
+
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, err);
+ CHECK_ASPECT(aio_a, aio, err);
+
+ aio_a->my_brick = brick;
+
+ /* statistics */
+ if (aio->io_rw)
+ atomic_inc(&brick->total_write_count);
+ else
+ atomic_inc(&brick->total_read_count);
+ if (aio_a->is_emergency && _congested(brick, LOGGER_QUEUES)) {
+ /* Only during transition from writeback mode to emergency mode:
+ * Wait until writeback has finished, by queuing into the last queue.
+ * We have to do this because writeback is out-of-order.
+ * Otherwise storage semantics could be violated for some time.
+ */
+ obj_get(aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->any_fly_count);
+ qq_aio_insert(&brick->q_phase[4], aio_a);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+ }
+
+ __trans_logger_io_io(brick, aio_a);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle IO\n");
+out_return:;
+}
+
+/***************************** writeback info *****************************/
+
+/* save final completion status when necessary
+ */
+static
+void pos_complete(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct trans_logger_brick *brick = orig_aio_a->my_brick;
+ struct trans_logger_input *log_input = orig_aio_a->log_input;
+ loff_t finished;
+ struct list_head *tmp;
+
+ CHECK_PTR(brick, err);
+ CHECK_PTR(log_input, err);
+
+ atomic_inc(&brick->total_writeback_count);
+
+ tmp = &orig_aio_a->pos_head;
+
+ down(&log_input->inf_mutex);
+
+ finished = orig_aio_a->log_pos;
+ /* Am I the first member? (i.e. the oldest entry, since new ones are appended at the tail) */
+ if (tmp == log_input->pos_list.next) {
+ if (unlikely(finished <= log_input->inf.inf_min_pos))
+ XIO_ERR("backskip in log writeback: %lld -> %lld\n", log_input->inf.inf_min_pos, finished);
+ if (unlikely(finished > log_input->inf.inf_max_pos))
+ XIO_ERR("min_pos > max_pos: %lld > %lld\n", finished, log_input->inf.inf_max_pos);
+ log_input->inf.inf_min_pos = finished;
+ get_lamport(&log_input->inf.inf_min_pos_stamp);
+ _inf_callback(log_input, false);
+ } else {
+ struct trans_logger_aio_aspect *prev_aio_a;
+
+ prev_aio_a = container_of(tmp->prev, struct trans_logger_aio_aspect, pos_head);
+ if (unlikely(finished <= prev_aio_a->log_pos)) {
+ XIO_ERR("backskip: %lld -> %lld\n", finished, prev_aio_a->log_pos);
+ } else {
+ /* Transitively transfer log_pos to the predecessor
+ * to correctly reflect the committed region.
+ */
+ prev_aio_a->log_pos = finished;
+ }
+ }
+
+ list_del_init(tmp);
+ atomic_dec(&log_input->pos_count);
+
+ up(&log_input->inf_mutex);
+err:;
+}
+
+static
+void free_writeback(struct writeback_info *wb)
+{
+ struct list_head *tmp;
+
+ if (unlikely(wb->w_error < 0)) {
+ XIO_ERR(
+ "writeback error = %d at pos = %lld len = %d, writeback is incomplete\n",
+ wb->w_error,
+ wb->w_pos,
+ wb->w_len);
+ }
+
+ /* Now complete the original requests.
+ */
+ while ((tmp = wb->w_collect_list.next) != &wb->w_collect_list) {
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct aio_object *orig_aio;
+
+ list_del_init(tmp);
+
+ orig_aio_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ orig_aio = orig_aio_a->object;
+
+ obj_check(orig_aio);
+ if (unlikely(!orig_aio_a->is_collected)) {
+ XIO_ERR(
+ "request %lld (len = %d) was not collected\n", orig_aio->io_pos, orig_aio->io_len);
+ }
+ if (unlikely(wb->w_error < 0))
+ orig_aio_a->wb_error = wb->w_error;
+
+ __trans_logger_io_put(orig_aio_a->my_brick, orig_aio_a);
+ }
+
+ brick_mem_free(wb);
+}
+
+/* Generic endio() for writeback_info
+ */
+static
+void wb_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb;
+ atomic_t *dec;
+ int rw;
+ void (**_endio)(struct generic_callback *cb);
+ void (*endio)(struct generic_callback *cb);
+
+ _crashme(21, false);
+
+ LAST_CALLBACK(cb);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ sub_aio = sub_aio_a->object;
+ CHECK_PTR(sub_aio, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (cb->cb_error < 0)
+ wb->w_error = cb->cb_error;
+
+ atomic_dec(&brick->wb_balance_count);
+
+ rw = sub_aio_a->orig_rw;
+ dec = rw ? &wb->w_sub_write_count : &wb->w_sub_read_count;
+ CHECK_ATOMIC(dec, 1);
+ if (!atomic_dec_and_test(dec))
+ goto done;
+
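+ /* Only the last completing sub-aio gets here (see atomic_dec_and_test
+ * above). Fetch and clear the stored endio pointer before invoking it,
+ * so the aggregate callback can fire at most once.
+ */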
+ _endio = rw ? &wb->write_endio : &wb->read_endio;
+ endio = *_endio;
+ *_endio = NULL;
+ if (likely(endio))
+ endio(cb);
+ else
+ XIO_ERR("internal: no endio defined\n");
+done:
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+/* Atomically create writeback info, based on "snapshot" of current hash
+ * state.
+ * Notice that the hash can change during writeback IO, thus we need
+ * struct writeback_info to precisely catch that information at a single
+ * point in time.
+ */
+static
+struct writeback_info *make_writeback(struct trans_logger_brick *brick, loff_t pos, int len)
+{
+ struct writeback_info *wb;
+ struct trans_logger_input *read_input;
+ struct trans_logger_input *write_input;
+ int write_input_nr;
+
+ /* Allocate structure representing a bunch of adjacent writebacks
+ */
+ wb = brick_zmem_alloc(sizeof(struct writeback_info));
+ if (unlikely(len < 0))
+ XIO_ERR("len = %d\n", len);
+
+ wb->w_brick = brick;
+ wb->w_pos = pos;
+ wb->w_len = len;
+ wb->w_lh.lh_pos = &wb->w_pos;
+ INIT_LIST_HEAD(&wb->w_lh.lh_head);
+ INIT_LIST_HEAD(&wb->w_collect_list);
+ INIT_LIST_HEAD(&wb->w_sub_read_list);
+ INIT_LIST_HEAD(&wb->w_sub_write_list);
+
+ /* Atomically fetch transitive closure on all requests
+ * overlapping with the current search region.
+ */
+ hash_extend(brick, &wb->w_pos, &wb->w_len, &wb->w_collect_list);
+
+ if (list_empty(&wb->w_collect_list))
+ goto collision;
+
+ pos = wb->w_pos;
+ len = wb->w_len;
+
+ if (unlikely(len < 0))
+ XIO_ERR("len = %d\n", len);
+
+ /* Determine the "channels" we want to operate on
+ */
+ read_input = brick->inputs[TL_INPUT_READ];
+ write_input_nr = TL_INPUT_WRITEBACK;
+ write_input = brick->inputs[write_input_nr];
+ if (!write_input->connect) {
+ write_input_nr = TL_INPUT_READ;
+ write_input = read_input;
+ }
+
+ /* Create sub_aios for read of old disk version (phase1)
+ */
+ if (brick->log_reads) {
+ while (len > 0) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *log_input;
+ int this_len;
+ int status;
+
+ sub_aio = trans_logger_alloc_aio(brick);
+
+ sub_aio->io_pos = pos;
+ sub_aio->io_len = len;
+ sub_aio->io_may_write = READ;
+ sub_aio->io_rw = READ;
+ sub_aio->io_data = NULL;
+
+ sub_aio_a = trans_logger_aio_get_aspect(brick, sub_aio);
+ CHECK_PTR(sub_aio_a, err);
+ CHECK_ASPECT(sub_aio_a, sub_aio, err);
+
+ sub_aio_a->my_input = read_input;
+ log_input = brick->inputs[brick->log_input_nr];
+ sub_aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ sub_aio_a->my_brick = brick;
+ sub_aio_a->orig_rw = READ;
+ sub_aio_a->wb = wb;
+
+ status = GENERIC_INPUT_CALL(read_input, aio_get, sub_aio);
+ if (unlikely(status < 0)) {
+ XIO_FAT("cannot get sub_aio, status = %d\n", status);
+ goto err;
+ }
+
+ list_add_tail(&sub_aio_a->sub_head, &wb->w_sub_read_list);
+ atomic_inc(&wb->w_sub_read_count);
+ atomic_inc(&brick->wb_balance_count);
+
+ this_len = sub_aio->io_len;
+ pos += this_len;
+ len -= this_len;
+ }
+ /* Re-initialize for starting over
+ */
+ pos = wb->w_pos;
+ len = wb->w_len;
+ }
+
+ /* Always create sub_aios for writeback (phase3)
+ */
+ while (len > 0) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct aio_object *orig_aio;
+ struct trans_logger_input *log_input;
+ void *data;
+ int this_len = len;
+ int diff;
+ int status;
+
+ atomic_inc(&brick->total_hash_find_count);
+
+ orig_aio_a = _hash_find(&wb->w_collect_list, pos, &this_len, true, false);
+ if (unlikely(!orig_aio_a)) {
+ XIO_FAT("could not find data\n");
+ goto err;
+ }
+
+ orig_aio = orig_aio_a->object;
+ diff = pos - orig_aio->io_pos;
+ if (unlikely(diff < 0)) {
+ XIO_FAT("bad diff %d\n", diff);
+ goto err;
+ }
+ data = orig_aio_a->shadow_data + diff;
+
+ sub_aio = trans_logger_alloc_aio(brick);
+
+ sub_aio->io_pos = pos;
+ sub_aio->io_len = this_len;
+ sub_aio->io_may_write = WRITE;
+ sub_aio->io_rw = WRITE;
+ sub_aio->io_data = data;
+
+ sub_aio_a = trans_logger_aio_get_aspect(brick, sub_aio);
+ CHECK_PTR(sub_aio_a, err);
+ CHECK_ASPECT(sub_aio_a, sub_aio, err);
+
+ sub_aio_a->orig_aio_a = orig_aio_a;
+ sub_aio_a->my_input = write_input;
+ log_input = orig_aio_a->log_input;
+ sub_aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ sub_aio_a->my_brick = brick;
+ sub_aio_a->orig_rw = WRITE;
+ sub_aio_a->wb = wb;
+
+ status = GENERIC_INPUT_CALL(write_input, aio_get, sub_aio);
+ if (unlikely(status < 0)) {
+ XIO_FAT("cannot get sub_aio, status = %d\n", status);
+ wb->w_error = status;
+ goto err;
+ }
+
+ list_add_tail(&sub_aio_a->sub_head, &wb->w_sub_write_list);
+ atomic_inc(&wb->w_sub_write_count);
+ atomic_inc(&brick->wb_balance_count);
+
+ this_len = sub_aio->io_len;
+ pos += this_len;
+ len -= this_len;
+ }
+
+ return wb;
+
+err:
+ XIO_ERR("cleaning up...\n");
+collision:
+ if (wb)
+ free_writeback(wb);
+ return NULL;
+}
+
+static inline
+void _fire_one(struct list_head *tmp, bool do_update)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *sub_input;
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+
+ if (unlikely(sub_aio_a->is_fired)) {
+ XIO_ERR("trying to fire twice\n");
+ goto out_return;
+ }
+ sub_aio_a->is_fired = true;
+
+ SETUP_CALLBACK(sub_aio, wb_endio, sub_aio_a);
+
+ sub_input = sub_aio_a->my_input;
+
+#ifdef DO_WRITEBACK
+ GENERIC_INPUT_CALL(sub_input, aio_io, sub_aio);
+#else
+ SIMPLE_CALLBACK(sub_aio, 0);
+#endif
+ if (do_update) { /* CHECK: shouldn't we do this always? */
+ GENERIC_INPUT_CALL(sub_input, aio_put, sub_aio);
+ }
+out_return:;
+}
+
+static inline
+void fire_writeback(struct list_head *start, bool do_update)
+{
+ struct list_head *tmp;
+
+ /* Caution! The wb structure may get deallocated
+ * during _fire_one() in some cases (e.g. when the
+ * callback is directly called by the aio_io operation).
+ * Ensure that no ptr dereferencing can take
+ * place after working on the last list member.
+ */
+ tmp = start->next;
+ while (tmp != start) {
+ struct list_head *next = tmp->next;
+
+ list_del_init(tmp);
+ _fire_one(tmp, do_update);
+ tmp = next;
+ }
+}
+
+static inline
+void update_max_pos(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ loff_t max_pos = orig_aio_a->log_pos;
+ struct trans_logger_input *log_input = orig_aio_a->log_input;
+
+ CHECK_PTR(log_input, done);
+
+ down(&log_input->inf_mutex);
+
+ if (unlikely(max_pos < log_input->inf.inf_min_pos))
+ XIO_ERR("new max_pos < min_pos: %lld < %lld\n", max_pos, log_input->inf.inf_min_pos);
+ if (log_input->inf.inf_max_pos < max_pos) {
+ log_input->inf.inf_max_pos = max_pos;
+ get_lamport(&log_input->inf.inf_max_pos_stamp);
+ _inf_callback(log_input, false);
+ }
+
+ up(&log_input->inf_mutex);
+done:;
+}
+
+static inline
+void update_writeback_info(struct writeback_info *wb)
+{
+ struct list_head *start = &wb->w_collect_list;
+ struct list_head *tmp;
+
+ /* Notice: in case of log rotation, each list member
+ * may belong to a different log_input.
+ */
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *orig_aio_a;
+
+ orig_aio_a = container_of(tmp, struct trans_logger_aio_aspect, collect_head);
+ update_max_pos(orig_aio_a);
+ }
+}
+
+/***************************** worker thread *****************************/
+
+/*********************************************************************
+ * Phase 0: write transaction log entry for the original write request.
+ */
+
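+/* Signal completion of the original write request to the upper layer.
+ * The tunable trans_logger_completion_semantics apparently selects how
+ * early this may happen:
+ * 0 = always complete already at pre_io time (before the log
+ * entry has reached the disk),
+ * 1 = complete early only when the caller has set io_skip_sync
+ * (i.e. does not insist on durability),
+ * >= 2 = never complete early; wait for phase0_endio().
+ */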
+static
+void _complete(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *orig_aio_a, int error, bool pre_io)
+{
+ struct aio_object *orig_aio;
+
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+
+ if (orig_aio_a->is_completed ||
+ (pre_io &&
+ (trans_logger_completion_semantics >= 2 ||
+ (trans_logger_completion_semantics >= 1 && !orig_aio->io_skip_sync)))) {
+ goto done;
+ }
+
+ if (cmpxchg(&orig_aio_a->is_completed, false, true))
+ goto done;
+
+ atomic_dec(&brick->log_fly_count);
+
+ if (likely(error >= 0)) {
+ aio_checksum(orig_aio);
+ orig_aio->io_flags &= ~AIO_WRITING;
+ orig_aio->io_flags |= AIO_UPTODATE;
+ }
+ CHECKED_CALLBACK(orig_aio, error, err);
+
+done:
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
+static
+void phase0_preio(void *private)
+{
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct trans_logger_brick *brick;
+
+ orig_aio_a = private;
+ CHECK_PTR(orig_aio_a, err);
+ CHECK_PTR(orig_aio_a->object, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ /* signal completion to the upper layer */
+
+ obj_check(orig_aio_a->object);
+ _complete(brick, orig_aio_a, 0, true);
+ obj_check(orig_aio_a->object);
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
+static
+void phase0_endio(void *private, int error)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ struct trans_logger_brick *brick;
+
+ orig_aio_a = private;
+ CHECK_PTR(orig_aio_a, err);
+
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+
+ orig_aio_a->is_persistent = true;
+ qq_dec_flying(&brick->q_phase[0]);
+
+ _CHECK(orig_aio_a->shadow_aio, err);
+
+ /* signal completion to the upper layer */
+ _complete(brick, orig_aio_a, error, false);
+
+ /* Queue up for the next phase.
+ */
+ qq_aio_insert(&brick->q_phase[1], orig_aio_a);
+
+ /* Undo the above pinning
+ */
+ __trans_logger_io_put(brick, orig_aio_a);
+
+ banning_reset(&brick->q_phase[0].q_banning);
+
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_ERR("giving up...\n");
+out_return:;
+}
+
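+/* Write the log entry for one original write request:
+ * reserve space in the log buffer, mark the hash entry stable (stable
+ * entries are no longer offered as slave-shadow candidates, see
+ * _hash_find()), copy the shadow data into the buffer, pin the aio,
+ * and finalize the entry with phase0_endio() as completion callback.
+ */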
+static
+bool phase0_startio(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_brick *brick;
+ struct trans_logger_input *input;
+ struct log_status *logst;
+ loff_t log_pos;
+ void *data;
+ bool ok;
+
+ CHECK_PTR(orig_aio_a, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+ input = orig_aio_a->log_input;
+ CHECK_PTR(input, err);
+ logst = &input->logst;
+ logst->do_crc = trans_logger_do_crc;
+
+ {
+ struct log_header l = {
+ .l_stamp = orig_aio_a->stamp,
+ .l_pos = orig_aio->io_pos,
+ .l_len = orig_aio->io_len,
+ .l_code = CODE_WRITE_NEW,
+ };
+ data = log_reserve(logst, &l);
+ }
+ if (unlikely(!data))
+ goto err;
+
+ hash_ensure_stableness(brick, orig_aio_a);
+
+ memcpy(data, orig_aio_a->shadow_data, orig_aio->io_len);
+
+ /* Pin aio->obj_count so it can't go away
+ * after _complete().
+ * This may happen rather early in phase0_preio().
+ */
+ obj_get(orig_aio); /* must be paired with __trans_logger_io_put() */
+ atomic_inc(&brick->inner_balance_count);
+ atomic_inc(&brick->log_fly_count);
+
+ ok = log_finalize(logst, orig_aio->io_len, phase0_endio, orig_aio_a);
+ if (unlikely(!ok)) {
+ atomic_dec(&brick->log_fly_count);
+ goto err;
+ }
+ log_pos = logst->log_pos + logst->offset;
+ orig_aio_a->log_pos = log_pos;
+
+ /* update new log_pos in the symlinks */
+ down(&input->inf_mutex);
+ input->inf.inf_log_pos = log_pos;
+ memcpy(&input->inf.inf_log_pos_stamp, &logst->log_pos_stamp, sizeof(input->inf.inf_log_pos_stamp));
+ _inf_callback(input, false);
+
+#ifdef CONFIG_MARS_DEBUG
+ if (!list_empty(&input->pos_list)) {
+ struct trans_logger_aio_aspect *last_aio_a;
+
+ last_aio_a = container_of(input->pos_list.prev, struct trans_logger_aio_aspect, pos_head);
+ if (last_aio_a->log_pos >= orig_aio_a->log_pos)
+ XIO_ERR("backskip in pos_list, %lld >= %lld\n", last_aio_a->log_pos, orig_aio_a->log_pos);
+ }
+#endif
+ list_add_tail(&orig_aio_a->pos_head, &input->pos_list);
+ atomic_inc(&input->pos_count);
+ up(&input->inf_mutex);
+
+ qq_inc_flying(&brick->q_phase[0]);
+
+ phase0_preio(orig_aio_a);
+
+ return true;
+
+err:
+ return false;
+}
+
+static
+bool prep_phase_startio(struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ struct trans_logger_aio_aspect *shadow_a;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(aio, err);
+ shadow_a = aio_a->shadow_aio;
+ CHECK_PTR(shadow_a, err);
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (aio->io_rw == READ) {
+ /* nothing to do: directly signal success. */
+ struct aio_object *shadow = shadow_a->object;
+
+ if (unlikely(shadow == aio))
+ XIO_ERR("oops, we should be a slave shadow, but are a master one\n");
+#ifdef USE_MEMCPY
+ if (aio_a->shadow_data != aio->io_data) {
+ if (unlikely(aio->io_len <= 0 || aio->io_len > PAGE_SIZE))
+ XIO_ERR("implausible io_len = %d\n", aio->io_len);
+ memcpy(aio->io_data, aio_a->shadow_data, aio->io_len);
+ }
+#endif
+ aio->io_flags |= AIO_UPTODATE;
+
+ CHECKED_CALLBACK(aio, 0, err);
+
+ __trans_logger_io_put(brick, aio_a);
+
+ return true;
+ }
+ /* else WRITE */
+ CHECK_HEAD_EMPTY(&aio_a->lh.lh_head);
+ CHECK_HEAD_EMPTY(&aio_a->hash_head);
+ if (unlikely(aio->io_flags & (AIO_READING | AIO_WRITING)))
+ XIO_ERR("bad flags %d\n", aio->io_flags);
+ /* In case of non-buffered IO, the buffer is
+ * under control of the user. In particular, he
+ * may change it without telling us.
+ * Therefore we make a copy (or "snapshot") here.
+ */
+ aio->io_flags |= AIO_WRITING;
+#ifdef USE_MEMCPY
+ if (aio_a->shadow_data != aio->io_data) {
+ if (unlikely(aio->io_len <= 0 || aio->io_len > PAGE_SIZE))
+ XIO_ERR("implausible io_len = %d\n", aio->io_len);
+ memcpy(aio_a->shadow_data, aio->io_data, aio->io_len);
+ }
+#endif
+ aio_a->is_dirty = true;
+ aio_a->shadow_aio->is_dirty = true;
+#ifndef KEEP_UNIQUE
+ if (unlikely(aio_a->shadow_aio != aio_a))
+ XIO_ERR("something is wrong: %p != %p\n", aio_a->shadow_aio, aio_a);
+#endif
+ if (likely(!aio_a->is_hashed)) {
+ struct trans_logger_input *log_input;
+
+ log_input = brick->inputs[brick->log_input_nr];
+ aio_a->log_input = log_input;
+ atomic_inc(&log_input->log_obj_count);
+ hash_insert(brick, aio_a);
+ } else {
+ XIO_ERR("tried to hash twice\n");
+ }
+ return phase0_startio(aio_a);
+
+err:
+ XIO_ERR("cannot work\n");
+ brick_msleep(1000);
+ return false;
+}
+
+/*********************************************************************
+ * Phase 1: read original version of data.
+ * This happens _after_ phase 0, deliberately.
+ * We are explicitly dealing with old and new versions.
+ * The new version is hashed in memory all the time (such that parallel
+ * READs will see it), so we have plenty of time for getting the
+ * old version from disk sometime later, e.g. when IO contention is low.
+ */
+
+static
+void phase1_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct writeback_info *wb;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(cb, err);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ XIO_FAT("IO error %d\n", cb->cb_error);
+ goto err;
+ }
+
+ qq_dec_flying(&brick->q_phase[1]);
+
+ banning_reset(&brick->q_phase[1].q_banning);
+
+ /* queue up for the next phase */
+ qq_wb_insert(&brick->q_phase[2], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static void phase3_endio(struct generic_callback *cb);
+
+static bool phase3_startio(struct writeback_info *wb);
+
+static
+bool phase1_startio(struct trans_logger_aio_aspect *orig_aio_a)
+{
+ struct aio_object *orig_aio;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb = NULL;
+
+ CHECK_PTR(orig_aio_a, err);
+ orig_aio = orig_aio_a->object;
+ CHECK_PTR(orig_aio, err);
+ brick = orig_aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (orig_aio_a->is_collected)
+ goto done;
+ if (!orig_aio_a->is_hashed)
+ goto done;
+
+ wb = make_writeback(brick, orig_aio->io_pos, orig_aio->io_len);
+ if (unlikely(!wb))
+ goto collision;
+
+ if (unlikely(list_empty(&wb->w_sub_write_list))) {
+ XIO_ERR(
+ "sub_write_list is empty, orig pos = %lld len = %d (collected=%d), extended pos = %lld len = %d\n",
+ orig_aio->io_pos,
+ orig_aio->io_len,
+ (int)orig_aio_a->is_collected,
+ wb->w_pos,
+ wb->w_len);
+ goto err;
+ }
+
+ wb->read_endio = phase1_endio;
+ wb->write_endio = phase3_endio;
+ atomic_set(&wb->w_sub_log_count, atomic_read(&wb->w_sub_read_count));
+
+ if (brick->log_reads) {
+ qq_inc_flying(&brick->q_phase[1]);
+ fire_writeback(&wb->w_sub_read_list, false);
+ } else { /* shortcut */
+#ifndef SHORTCUT_1_to_3
+ qq_wb_insert(&brick->q_phase[3], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+#else
+ return phase3_startio(wb);
+#endif
+ }
+
+done:
+ return true;
+
+err:
+ if (wb)
+ free_writeback(wb);
+collision:
+ return false;
+}
+
+/*********************************************************************
+ * Phase 2: log the old disk version.
+ */
+
+static inline
+void _phase2_endio(struct writeback_info *wb)
+{
+ struct trans_logger_brick *brick = wb->w_brick;
+
+ /* queue up for the next phase */
+ qq_wb_insert(&brick->q_phase[3], wb);
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+out_return:;
+}
+
+static
+void phase2_endio(void *private, int error)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct trans_logger_brick *brick;
+ struct writeback_info *wb;
+
+ sub_aio_a = private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ qq_dec_flying(&brick->q_phase[2]);
+
+ if (unlikely(error < 0)) {
+ XIO_FAT("IO error %d\n", error);
+ goto err; /* FIXME: this leads to hanging requests. do better. */
+ }
+
+ CHECK_ATOMIC(&wb->w_sub_log_count, 1);
+ if (atomic_dec_and_test(&wb->w_sub_log_count)) {
+ banning_reset(&brick->q_phase[2].q_banning);
+ _phase2_endio(wb);
+ }
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static
+bool _phase2_startio(struct trans_logger_aio_aspect *sub_aio_a)
+{
+ struct aio_object *sub_aio = NULL;
+ struct writeback_info *wb;
+ struct trans_logger_input *input;
+ struct trans_logger_brick *brick;
+ struct log_status *logst;
+ void *data;
+ bool ok;
+
+ CHECK_PTR(sub_aio_a, err);
+ sub_aio = sub_aio_a->object;
+ CHECK_PTR(sub_aio, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+ input = sub_aio_a->log_input;
+ CHECK_PTR(input, err);
+ logst = &input->logst;
+ logst->do_crc = trans_logger_do_crc;
+
+ {
+ struct log_header l = {
+ .l_stamp = sub_aio_a->stamp,
+ .l_pos = sub_aio->io_pos,
+ .l_len = sub_aio->io_len,
+ .l_code = CODE_WRITE_OLD,
+ };
+ data = log_reserve(logst, &l);
+ }
+
+ if (unlikely(!data))
+ goto err;
+
+ memcpy(data, sub_aio->io_data, sub_aio->io_len);
+
+ ok = log_finalize(logst, sub_aio->io_len, phase2_endio, sub_aio_a);
+ if (unlikely(!ok))
+ goto err;
+
+ qq_inc_flying(&brick->q_phase[2]);
+
+ return true;
+
+err:
+ XIO_FAT(
+ "cannot log old data, pos = %lld len = %d\n",
+ sub_aio ? sub_aio->io_pos : 0,
+ sub_aio ? sub_aio->io_len : 0);
+ return false;
+}
+
+static
+bool phase2_startio(struct writeback_info *wb)
+{
+ struct trans_logger_brick *brick;
+ bool ok = true;
+
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (brick->log_reads && atomic_read(&wb->w_sub_log_count) > 0) {
+ struct list_head *start;
+ struct list_head *tmp;
+
+ start = &wb->w_sub_read_list;
+ for (tmp = start->next; tmp != start; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+
+ if (!_phase2_startio(sub_aio_a))
+ ok = false;
+ }
+ wake_up_interruptible_all(&brick->worker_event);
+ } else {
+ _phase2_endio(wb);
+ }
+ return ok;
+err:
+ return false;
+}
+
+/*********************************************************************
+ * Phase 3: overwrite old disk version with new version.
+ */
+
+static
+void phase3_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct writeback_info *wb;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(cb, err);
+ sub_aio_a = cb->cb_private;
+ CHECK_PTR(sub_aio_a, err);
+ wb = sub_aio_a->wb;
+ CHECK_PTR(wb, err);
+ brick = wb->w_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ XIO_FAT("IO error %d\n", cb->cb_error);
+ goto err;
+ }
+
+ hash_put_all(brick, &wb->w_collect_list);
+
+ qq_dec_flying(&brick->q_phase[3]);
+ atomic_inc(&brick->total_writeback_cluster_count);
+
+ free_writeback(wb);
+
+ banning_reset(&brick->q_phase[3].q_banning);
+
+ wake_up_interruptible_all(&brick->worker_event);
+
+ goto out_return;
+err:
+ XIO_FAT("hanging up....\n");
+out_return:;
+}
+
+static
+bool phase3_startio(struct writeback_info *wb)
+{
+ struct list_head *start = &wb->w_sub_read_list;
+ struct list_head *tmp;
+
+ /* Cleanup read requests (if they exist from previous phases)
+ */
+ while ((tmp = start->next) != start) {
+ struct trans_logger_aio_aspect *sub_aio_a;
+ struct aio_object *sub_aio;
+ struct trans_logger_input *sub_input;
+
+ list_del_init(tmp);
+
+ sub_aio_a = container_of(tmp, struct trans_logger_aio_aspect, sub_head);
+ sub_aio = sub_aio_a->object;
+ sub_input = sub_aio_a->my_input;
+
+ GENERIC_INPUT_CALL(sub_input, aio_put, sub_aio);
+ }
+
+ update_writeback_info(wb);
+
+ /* Start writeback IO
+ */
+ qq_inc_flying(&wb->w_brick->q_phase[3]);
+ fire_writeback(&wb->w_sub_write_list, true);
+ return true;
+}
+
+/*********************************************************************
+ * Phase 4: only used during transition from normal operations
+ * to emergency mode.
+ * This is needed to guarantee consistency.
+ * Writeback must have fully completed before the underlying disk
+ * can be accessed directly.
+ */
+
+static
+bool phase4_startio(struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio;
+ struct trans_logger_brick *brick;
+
+ CHECK_PTR(aio_a, err);
+ aio = aio_a->object;
+ CHECK_PTR(aio, err);
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ aio_a->my_queue = &brick->q_phase[4];
+ qq_inc_flying(&brick->q_phase[4]);
+ __trans_logger_io_io(brick, aio_a);
+ atomic_dec(&brick->any_fly_count);
+ __trans_logger_io_put(brick, aio_a);
+
+ return true;
+
+err:
+ return false;
+}
+
+/*********************************************************************
+ * The logger thread.
+ * There is only a single instance, dealing with all requests in parallel.
+ */
+
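+/* Pop up to max entries from an aio queue and start IO on each of them.
+ * On the first startio() failure the entry is pushed back and the run
+ * stops. When do_limit is set, the processed volume is charged to the
+ * global writeback rate limiter (apparently in KiB units, rounded up).
+ */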
+static
+int run_aio_queue(
+struct logger_queue *q, bool (*startio)(struct trans_logger_aio_aspect *sub_aio_a), int max, bool do_limit)
+{
+ struct trans_logger_brick *brick = q->q_brick;
+ int total_len = 0;
+ bool found = false;
+ bool ok;
+ int res = 0;
+
+ do {
+ struct trans_logger_aio_aspect *aio_a;
+
+ aio_a = qq_aio_fetch(q);
+ if (!aio_a)
+ goto done;
+
+ if (likely(aio_a->object))
+ total_len += aio_a->object->io_len;
+
+ ok = startio(aio_a);
+ if (unlikely(!ok)) {
+ qq_aio_pushback(q, aio_a);
+ goto done;
+ }
+ res++;
+ found = true;
+ __trans_logger_io_put(aio_a->my_brick, aio_a);
+ } while (--max > 0);
+
+done:
+ if (found) {
+ if (do_limit && total_len)
+ rate_limit(&global_writeback.limiter, (total_len - 1) / 1024 + 1);
+ wake_up_interruptible_all(&brick->worker_event);
+ }
+ return res;
+}
+
+static
+int run_wb_queue(struct logger_queue *q, bool (*startio)(struct writeback_info *wb), int max)
+{
+ struct trans_logger_brick *brick = q->q_brick;
+ int total_len = 0;
+ bool found = false;
+ bool ok;
+ int res = 0;
+
+ do {
+ struct writeback_info *wb;
+
+ wb = qq_wb_fetch(q);
+ if (!wb)
+ goto done;
+
+ total_len += wb->w_len;
+
+ ok = startio(wb);
+ if (unlikely(!ok)) {
+ qq_wb_pushback(q, wb);
+ goto done;
+ }
+ res++;
+ found = true;
+ } while (--max > 0);
+
+done:
+ if (found) {
+ rate_limit(&global_writeback.limiter, (total_len - 1) / 1024 + 1);
+ wake_up_interruptible_all(&brick->worker_event);
+ }
+ return res;
+}
+
+/* Ranking tables.
+ */
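+/* Each rank_info table is a list of { load value, points } pairs,
+ * terminated by RKI_DUMMY. ranking_compute() (defined elsewhere)
+ * presumably interpolates between the entries; positive points favour
+ * selecting a queue, large negative points effectively veto it once
+ * the load exceeds the given threshold.
+ */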
+static
+struct rank_info float_queue_rank_log[] = {
+ { 0, 0 },
+ { 1, 100 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_queue_rank_io[] = {
+ { 0, 0 },
+ { 1, 1 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_fly_rank_log[] = {
+ { 0, 0 },
+ { 1, 1 },
+ { 32, 10 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info float_fly_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 2, -10 },
+ { 10000, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info nofloat_queue_rank_log[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info nofloat_queue_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 100, 100 },
+ { RKI_DUMMY }
+};
+
+#define nofloat_fly_rank_log float_fly_rank_log
+
+static
+struct rank_info nofloat_fly_rank_io[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 128, 8 },
+ { 129, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info *queue_ranks[2][LOGGER_QUEUES] = {
+ [0] = {
+ [0] = float_queue_rank_log,
+ [1] = float_queue_rank_io,
+ [2] = float_queue_rank_io,
+ [3] = float_queue_rank_io,
+ },
+ [1] = {
+ [0] = nofloat_queue_rank_log,
+ [1] = nofloat_queue_rank_io,
+ [2] = nofloat_queue_rank_io,
+ [3] = nofloat_queue_rank_io,
+ },
+};
+
+static
+struct rank_info *fly_ranks[2][LOGGER_QUEUES] = {
+ [0] = {
+ [0] = float_fly_rank_log,
+ [1] = float_fly_rank_io,
+ [2] = float_fly_rank_io,
+ [3] = float_fly_rank_io,
+ },
+ [1] = {
+ [0] = nofloat_fly_rank_log,
+ [1] = nofloat_fly_rank_io,
+ [2] = nofloat_fly_rank_io,
+ [3] = nofloat_fly_rank_io,
+ },
+};
+
+static
+struct rank_info extra_rank_aio_flying[] = {
+ { 0, 0 },
+ { 1, 10 },
+ { 16, 30 },
+ { 31, 0 },
+ { 32, -200 },
+ { RKI_DUMMY }
+};
+
+static
+struct rank_info global_rank_aio_flying[] = {
+ { 0, 0 },
+ { 63, 0 },
+ { 64, -200 },
+ { RKI_DUMMY }
+};
+
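+/* Decide which queue to serve next.
+ * First the memory situation is checked (possibly delaying new callers
+ * and switching to floating mode), then each non-empty, non-banned queue
+ * is scored via the ranking tables above. Returns the winning queue
+ * index, or a negative value when no queue should run right now.
+ */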
+static
+int _do_ranking(struct trans_logger_brick *brick)
+{
+ struct rank_data *rkd = brick->rkd;
+ int res;
+ int i;
+ int floating_mode;
+ int aio_flying;
+ bool delay_callers;
+
+ ranking_start(rkd, LOGGER_QUEUES);
+
+ /* check the memory situation... */
+ delay_callers = false;
+ floating_mode = 1;
+ if (brick_global_memlimit >= 1024) {
+ int global_mem_used = atomic64_read(&global_mshadow_used) / 1024;
+
+ trans_logger_mem_usage = global_mem_used;
+
+ floating_mode = (global_mem_used < brick_global_memlimit / 2) ? 0 : 1;
+
+ if (global_mem_used >= brick_global_memlimit)
+ delay_callers = true;
+ } else if (brick->shadow_mem_limit >= 8) {
+ int local_mem_used = atomic64_read(&brick->shadow_mem_used) / 1024;
+
+ floating_mode = (local_mem_used < brick->shadow_mem_limit / 2) ? 0 : 1;
+
+ if (local_mem_used >= brick->shadow_mem_limit)
+ delay_callers = true;
+ }
+ if (delay_callers) {
+ if (!brick->delay_callers) {
+ brick->delay_callers = true;
+ atomic_inc(&brick->total_delay_count);
+ }
+ } else if (brick->delay_callers) {
+ brick->delay_callers = false;
+ wake_up_interruptible(&brick->caller_event);
+ }
+
+ /* global limit for flying aios */
+ ranking_compute(&rkd[0], global_rank_aio_flying, atomic_read(&global_aio_flying));
+
+ /* local limit for flying aios */
+ aio_flying = 0;
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+
+ aio_flying += atomic_read(&input->logst.aio_flying);
+ }
+
+ /* obey the basic rules... */
+ for (i = 0; i < LOGGER_QUEUES; i++) {
+ int queued = atomic_read(&brick->q_phase[i].q_queued);
+ int flying;
+
+ /* This must come first.
+ * When a queue is empty, you must not credit any positive points.
+ * Otherwise, (almost) infinite selection of untreatable
+ * queues may occur.
+ */
+ if (queued <= 0)
+ continue;
+
+ if (banning_is_hit(&brick->q_phase[i].q_banning))
+ break;
+
+ if (i == 0) {
+ /* limit aio IO parallelism on transaction log */
+ ranking_compute(&rkd[0], extra_rank_aio_flying, aio_flying);
+ } else if (i == 1 && !floating_mode) {
+ struct trans_logger_brick *leader;
+ int lim;
+
+ if (!aio_flying && atomic_read(&brick->q_phase[0].q_queued) > 0)
+ break;
+
+ leader = elect_leader(&global_writeback);
+ if (leader != brick)
+ break;
+
+ if (banning_is_hit(&xio_global_ban))
+ break;
+
+ lim = rate_limit(&global_writeback.limiter, 0);
+ if (lim > 0)
+ break;
+ }
+
+ ranking_compute(&rkd[i], queue_ranks[floating_mode][i], queued);
+
+ flying = atomic_read(&brick->q_phase[i].q_flying);
+
+ ranking_compute(&rkd[i], fly_ranks[floating_mode][i], flying);
+ }
+
+ /* finalize it */
+ ranking_stop(rkd, LOGGER_QUEUES);
+
+ res = ranking_select(rkd, LOGGER_QUEUES);
+
+ /* Ensure that the extra queue is only run when all others are empty.
+ */
+ if (res < 0 &&
+ atomic_read(&brick->q_phase[EXTRA_QUEUES - 1].q_queued) &&
+ !_congested(brick, LOGGER_QUEUES)) {
+ res = EXTRA_QUEUES - 1;
+ }
+
+ return res;
+}
+
+static
+void _init_input(struct trans_logger_input *input, loff_t start_pos, loff_t end_pos)
+{
+ struct trans_logger_brick *brick = input->brick;
+ struct log_status *logst = &input->logst;
+
+ init_logst(logst, (void *)input, start_pos, end_pos);
+ logst->signal_event = &brick->worker_event;
+ logst->align_size = CONF_TRANS_ALIGN;
+ logst->chunk_size = CONF_TRANS_CHUNKSIZE;
+ logst->max_size = CONF_TRANS_MAX_AIO_SIZE;
+
+ input->inf.inf_min_pos = start_pos;
+ input->inf.inf_max_pos = end_pos;
+ get_lamport(&input->inf.inf_max_pos_stamp);
+ memcpy(&input->inf.inf_min_pos_stamp, &input->inf.inf_max_pos_stamp, sizeof(input->inf.inf_min_pos_stamp));
+
+ logst->log_pos = start_pos;
+ input->inf.inf_log_pos = start_pos;
+ input->inf_last_jiffies = jiffies;
+ input->inf.inf_is_replaying = false;
+ input->inf.inf_is_logging = false;
+
+ input->is_operating = true;
+}
+
+static
+void _init_inputs(struct trans_logger_brick *brick, bool is_first)
+{
+ struct trans_logger_input *input;
+ int old_nr = brick->old_input_nr;
+ int log_nr = brick->log_input_nr;
+ int new_nr = brick->new_input_nr;
+
+ if (!is_first &&
+ (new_nr == log_nr ||
+ log_nr != old_nr)) {
+ goto done;
+ }
+ if (unlikely(new_nr < TL_INPUT_LOG1 || new_nr > TL_INPUT_LOG2)) {
+ XIO_ERR("bad new_input_nr = %d\n", new_nr);
+ goto done;
+ }
+
+ input = brick->inputs[new_nr];
+ CHECK_PTR(input, done);
+
+ if (input->is_operating || !input->connect)
+ goto done;
+
+ down(&input->inf_mutex);
+
+ _init_input(input, 0, 0);
+ input->inf.inf_is_logging = is_first;
+
+ /* from now on, new requests should go to the new input */
+ brick->log_input_nr = new_nr;
+ XIO_INF("switched over to new logfile %d (old = %d)\n", new_nr, old_nr);
+
+ /* Flush the old log buffer and update its symlinks.
+ * Notice: for some short time, _both_ logfiles may grow
+ * due to (harmless) races with log_flush().
+ */
+ if (likely(!is_first)) {
+ struct trans_logger_input *other_input = brick->inputs[old_nr];
+
+ down(&other_input->inf_mutex);
+ log_flush(&other_input->logst);
+ _inf_callback(other_input, true);
+ up(&other_input->inf_mutex);
+ }
+
+ _inf_callback(input, true);
+
+ up(&input->inf_mutex);
+done:;
+}
+
+static
+int _nr_flying_inputs(struct trans_logger_brick *brick)
+{
+ int count = 0;
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating)
+ count += logst->count;
+ }
+ return count;
+}
+
+static
+void _flush_inputs(struct trans_logger_brick *brick)
+{
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating && logst->count > 0) {
+ atomic_inc(&brick->total_flush_count);
+ log_flush(logst);
+ }
+ }
+}
+
+static
+void _exit_inputs(struct trans_logger_brick *brick, bool force)
+{
+ int i;
+
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+ struct log_status *logst = &input->logst;
+
+ if (input->is_operating &&
+ (force || !input->connect)) {
+ bool old_replaying = input->inf.inf_is_replaying;
+ bool old_logging = input->inf.inf_is_logging;
+
+ XIO_DBG(
+ "cleaning up input %d (log = %d old = %d), old_replaying = %d old_logging = %d\n",
+ i,
+ brick->log_input_nr,
+ brick->old_input_nr,
+ old_replaying,
+ old_logging);
+ exit_logst(logst);
+ /* no locking here: we should be the only thread doing this. */
+ _inf_callback(input, true);
+ input->inf_last_jiffies = 0;
+ input->inf.inf_is_replaying = false;
+ input->inf.inf_is_logging = false;
+ input->is_operating = false;
+ if (i == brick->old_input_nr && i != brick->log_input_nr) {
+ struct trans_logger_input *other_input = brick->inputs[brick->log_input_nr];
+
+ down(&other_input->inf_mutex);
+ brick->old_input_nr = brick->log_input_nr;
+ other_input->inf.inf_is_replaying = old_replaying;
+ other_input->inf.inf_is_logging = old_logging;
+ _inf_callback(other_input, true);
+ up(&other_input->inf_mutex);
+ }
+ }
+ }
+}
+
+/* Performance-critical:
+ * Calling log_flush() too often may result in
+ * increased overhead (and thus in lower throughput).
+ * Call it only when the IO scheduler need not do anything else.
+ * OTOH, calling it too seldom may hold back
+ * IO completion for the end user for too long.
+ *
+ * Be careful to flush any leftovers in the log buffer, at least after
+ * some short delay.
+ *
+ * Description of flush_mode:
+ * 0 = flush unconditionally
+ * 1 = flush only when nothing can be appended to the transaction log
+ * 2 = see 1 && flush only when the user is waiting for an answer
+ * 3 = see 1 && not 2 && flush only when there is no other activity (background mode)
+ * Notice: 3 only makes sense for leftovers where the user is _not_
+ * waiting for an answer.
+ */
+static inline
+void flush_inputs(struct trans_logger_brick *brick, int flush_mode)
+{
+ if (flush_mode < 1 ||
+ /* there is nothing to append any more */
+ (atomic_read(&brick->q_phase[0].q_queued) <= 0 &&
+ /* and the user is waiting for an answer */
+ (flush_mode < 2 ||
+ atomic_read(&brick->log_fly_count) > 0 ||
+ /* else flush any leftovers in background, when there is no writeback activity */
+ (flush_mode == 3 &&
+ atomic_read(&brick->q_phase[1].q_flying) + atomic_read(&brick->q_phase[3].q_flying) <= 0))))
+ _flush_inputs(brick);
+}
+
+static
+void trans_logger_log(struct trans_logger_brick *brick)
+{
+ long long old_jiffies = jiffies;
+ long long work_jiffies = jiffies;
+ int interleave = 0;
+ int nr_flying;
+
+ memset(brick->rkd, 0, sizeof(brick->rkd));
+ brick->replay_code = 0; /* indicates "running" */
+ brick->disk_io_error = 0;
+
+ _init_inputs(brick, true);
+
+ xio_set_power_on_led((void *)brick, true);
+
+ while (!brick_thread_should_stop() || _congested(brick, EXTRA_QUEUES)) {
+ int winner;
+ int nr;
+
+ wait_event_interruptible_timeout(
+ brick->worker_event,
+ ({
+ winner = _do_ranking(brick);
+ if (winner < 0) { /* no more work to do */
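+ /* Escalate flushing with increasing idle time:
+ * idle < 2s => mode 2, 2s..4s => mode 1,
+ * >= 4s => mode <= 0 (flush unconditionally).
+ */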
+ int flush_mode = 2 - ((int)(jiffies - work_jiffies)) / (HZ * 2);
+
+ flush_inputs(brick, flush_mode);
+ interleave = 0;
+ } else { /* reset the timer whenever something is to do */
+ work_jiffies = jiffies;
+ }
+ winner >= 0;
+ }),
+ HZ / 10);
+
+ atomic_inc(&brick->total_round_count);
+
+ if (brick->cease_logging)
+ brick->stopped_logging = true;
+ else if (brick->stopped_logging && !_congested(brick, EXTRA_QUEUES))
+ brick->stopped_logging = false;
+
+ _init_inputs(brick, false);
+
+ switch (winner) {
+ case 0:
+ interleave = 0;
+ nr = run_aio_queue(
+ &brick->q_phase[0], prep_phase_startio, brick->q_phase[0].q_batchlen, true);
+ goto done;
+ case 1:
+ if (interleave >= trans_logger_max_interleave && trans_logger_max_interleave >= 0) {
+ interleave = 0;
+ flush_inputs(brick, 3);
+ }
+ nr = run_aio_queue(&brick->q_phase[1], phase1_startio, brick->q_phase[1].q_batchlen, true);
+ interleave += nr;
+ goto done;
+ case 2:
+ interleave = 0;
+ nr = run_wb_queue(&brick->q_phase[2], phase2_startio, brick->q_phase[2].q_batchlen);
+ goto done;
+ case 3:
+ if (interleave >= trans_logger_max_interleave && trans_logger_max_interleave >= 0) {
+ interleave = 0;
+ flush_inputs(brick, 3);
+ }
+ nr = run_wb_queue(&brick->q_phase[3], phase3_startio, brick->q_phase[3].q_batchlen);
+ interleave += nr;
+ goto done;
+ case 4:
+ nr = run_aio_queue(&brick->q_phase[4], phase4_startio, brick->q_phase[4].q_batchlen, false);
+done:
+ if (unlikely(nr <= 0)) {
+ /* This should not happen!
+ * However, in error situations, the ranking
+ * algorithm cannot foresee anything.
+ */
+ brick->q_phase[winner].no_progress_count++;
+ banning_hit(&brick->q_phase[winner].q_banning, 10000);
+ flush_inputs(brick, 0);
+ }
+ ranking_select_done(brick->rkd, winner, nr);
+ break;
+
+ default:
+ break;
+ }
+
+ /* Update symlinks even during pauses.
+ */
+ if (winner < 0 && ((long long)jiffies) - old_jiffies >= HZ) {
+ int i;
+
+ old_jiffies = jiffies;
+ for (i = TL_INPUT_LOG1; i <= TL_INPUT_LOG2; i++) {
+ struct trans_logger_input *input = brick->inputs[i];
+
+ down(&input->inf_mutex);
+ _inf_callback(input, false);
+ up(&input->inf_mutex);
+ }
+ }
+
+ _exit_inputs(brick, false);
+ }
+
+ for (;;) {
+ _exit_inputs(brick, true);
+ nr_flying = _nr_flying_inputs(brick);
+ if (nr_flying <= 0)
+ break;
+ XIO_INF("%d inputs are operating\n", nr_flying);
+ brick_msleep(1000);
+ }
+}
+
+/***************************** log replay *****************************/
+
+static
+void replay_endio(struct generic_callback *cb)
+{
+ struct trans_logger_aio_aspect *aio_a = cb->cb_private;
+ struct trans_logger_brick *brick;
+ bool ok;
+ unsigned long flags;
+
+ _crashme(22, false);
+
+ LAST_CALLBACK(cb);
+ CHECK_PTR(aio_a, err);
+ brick = aio_a->my_brick;
+ CHECK_PTR(brick, err);
+
+ if (unlikely(cb->cb_error < 0)) {
+ brick->disk_io_error = cb->cb_error;
+ XIO_ERR("IO error = %d\n", cb->cb_error);
+ }
+
+ spin_lock_irqsave(&brick->replay_lock, flags);
+ ok = !list_empty(&aio_a->replay_head);
+ list_del_init(&aio_a->replay_head);
+ spin_unlock_irqrestore(&brick->replay_lock, flags);
+
+ if (likely(ok))
+ atomic_dec(&brick->replay_count);
+ else
+ XIO_ERR("callback with empty replay_head (replay_count=%d)\n", atomic_read(&brick->replay_count));
+ wake_up_interruptible_all(&brick->worker_event);
+ goto out_return;
+err:
+ XIO_FAT("cannot handle replay IO\n");
+out_return:;
+}
+
+static
+bool _has_conflict(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ struct list_head *tmp;
+ bool res = false;
+ unsigned long flags;
+
+ spin_lock_irqsave(&brick->replay_lock, flags);
+
+ for (tmp = brick->replay_list.next; tmp != &brick->replay_list; tmp = tmp->next) {
+ struct trans_logger_aio_aspect *tmp_a;
+ struct aio_object *tmp_aio;
+
+ tmp_a = container_of(tmp, struct trans_logger_aio_aspect, replay_head);
+ tmp_aio = tmp_a->object;
+ if (tmp_aio->io_pos + tmp_aio->io_len > aio->io_pos &&
+ tmp_aio->io_pos < aio->io_pos + aio->io_len) {
+ res = true;
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&brick->replay_lock, flags);
+ return res;
+}
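+
+/* Editor's note (explanatory sketch, not part of the original patch):
+ * the loop above is the standard half-open interval overlap test.
+ * Two requests [a, a+la) and [b, b+lb) conflict iff
+ *
+ *   b + lb > a && b < a + la
+ *
+ * so wait_replay() below only delays replay IO that would touch bytes
+ * still in flight (plus a global parallelism cap of 512 requests).
+ */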
+
+static
+void wait_replay(struct trans_logger_brick *brick, struct trans_logger_aio_aspect *aio_a)
+{
+ const int max = 512; /* limit parallelism somewhat */
+ int conflicts = 0;
+ bool ok = false;
+ bool was_empty;
+ unsigned long flags;
+
+ wait_event_interruptible_timeout(
+ brick->worker_event,
+ atomic_read(&brick->replay_count) < max &&
+ (_has_conflict(brick, aio_a) ? conflicts++ : (ok = true), ok),
+ 60 * HZ);
+
+ atomic_inc(&brick->total_replay_count);
+ if (conflicts)
+ atomic_inc(&brick->total_replay_conflict_count);
+
+ spin_lock_irqsave(&brick->replay_lock, flags);
+ was_empty = !!list_empty(&aio_a->replay_head);
+ if (likely(was_empty))
+ atomic_inc(&brick->replay_count);
+ else
+ list_del(&aio_a->replay_head);
+ list_add(&aio_a->replay_head, &brick->replay_list);
+ spin_unlock_irqrestore(&brick->replay_lock, flags);
+
+ if (unlikely(!was_empty)) {
+ XIO_ERR("replay_head was already used (ok=%d, conflicts=%d, replay_count=%d)\n",
+ ok, conflicts, atomic_read(&brick->replay_count));
+ }
+}
+
+static
+int replay_data(struct trans_logger_brick *brick, loff_t pos, void *buf, int len)
+{
+ struct trans_logger_input *input = brick->inputs[TL_INPUT_WRITEBACK];
+ int status;
+
+ if (!input->connect)
+ input = brick->inputs[TL_INPUT_READ];
+
+ /* TODO for better efficiency:
+ * Instead of starting IO here, just put the data into the hashes
+ * and queues such that ordinary IO will be corrected.
+ * Writeback will be lazy then.
+ * The switch infrastructure must be changed before this
+ * becomes possible.
+ */
+#ifdef REPLAY_DATA
+ while (len > 0) {
+ struct aio_object *aio;
+ struct trans_logger_aio_aspect *aio_a;
+
+ status = -ENOMEM;
+ aio = trans_logger_alloc_aio(brick);
+ aio_a = trans_logger_aio_get_aspect(brick, aio);
+ CHECK_PTR(aio_a, done);
+ CHECK_ASPECT(aio_a, aio, done);
+
+ aio->io_pos = pos;
+ aio->io_data = NULL;
+ aio->io_len = len;
+ aio->io_may_write = WRITE;
+ aio->io_rw = WRITE;
+
+ status = GENERIC_INPUT_CALL(input, aio_get, aio);
+ if (unlikely(status < 0)) {
+ XIO_ERR("cannot get aio, status = %d\n", status);
+ goto done;
+ }
+ if (unlikely(!aio->io_data)) {
+ status = -ENOMEM;
+ XIO_ERR("cannot get aio, status = %d\n", status);
+ goto done;
+ }
+ if (unlikely(aio->io_len <= 0 || aio->io_len > len)) {
+ status = -EINVAL;
+ XIO_ERR("bad aio len = %d (requested = %d)\n", aio->io_len, len);
+ goto done;
+ }
+
+ wait_replay(brick, aio_a);
+
+ memcpy(aio->io_data, buf, aio->io_len);
+
+ SETUP_CALLBACK(aio, replay_endio, aio_a);
+ aio_a->my_brick = brick;
+
+ GENERIC_INPUT_CALL(input, aio_io, aio);
+
+ if (unlikely(aio->io_len <= 0)) {
+ status = -EINVAL;
+ XIO_ERR("bad aio len = %d (requested = %d)\n", aio->io_len, len);
+ goto done;
+ }
+
+ pos += aio->io_len;
+ buf += aio->io_len;
+ len -= aio->io_len;
+
+ GENERIC_INPUT_CALL(input, aio_put, aio);
+ }
+#endif
+ status = 0;
+done:
+ return status;
+}
+
+static
+void trans_logger_replay(struct trans_logger_brick *brick)
+{
+ struct trans_logger_input *input = brick->inputs[brick->log_input_nr];
+ struct log_header lh = {};
+ loff_t start_pos;
+ loff_t end_pos;
+ loff_t finished_pos = -1;
+ loff_t new_finished_pos = -1;
+ long long old_jiffies = jiffies;
+ int nr_flying;
+ int backoff = 0;
+ int status = 0;
+
+ brick->replay_code = 0; /* indicates "running" */
+ brick->disk_io_error = 0;
+
+ start_pos = brick->replay_start_pos;
+ end_pos = brick->replay_end_pos;
+ brick->replay_current_pos = start_pos;
+
+ _init_input(input, start_pos, end_pos);
+
+ input->inf.inf_min_pos = start_pos;
+ input->inf.inf_max_pos = end_pos;
+ input->inf.inf_log_pos = end_pos;
+ input->inf.inf_is_replaying = true;
+ input->inf.inf_is_logging = false;
+
+ XIO_INF("starting replay from %lld to %lld\n", start_pos, end_pos);
+
+ xio_set_power_on_led((void *)brick, true);
+
+ for (;;) {
+ void *buf = NULL;
+ int len = 0;
+
+ if (brick_thread_should_stop() ||
+ (!brick->continuous_replay_mode && finished_pos >= brick->replay_end_pos)) {
+ status = 0; /* treat as EOF */
+ break;
+ }
+
+ status = log_read(&input->logst, false, &lh, &buf, &len);
+
+ new_finished_pos = input->logst.log_pos + input->logst.offset;
+ XIO_RPL("read %lld %lld\n", finished_pos, new_finished_pos);
+
+ if (status == -EAGAIN) {
+ loff_t remaining = brick->replay_end_pos - new_finished_pos;
+
+ XIO_DBG("got -EAGAIN, remaining = %lld\n", remaining);
+ if (brick->replay_tolerance > 0 && remaining < brick->replay_tolerance) {
+ XIO_WRN(
+ "logfile is truncated at position %lld (end_pos = %lld, remaining = %lld, tolerance = %d)\n",
+ new_finished_pos,
+ brick->replay_end_pos,
+ remaining,
+ brick->replay_tolerance);
+ finished_pos = new_finished_pos;
+ brick->replay_code = status;
+ break;
+ }
+ brick_msleep(backoff);
+ if (backoff < trans_logger_replay_timeout * 1000) {
+ backoff += 100;
+ } else {
+ XIO_WRN(
+ "logfile replay not possible at position %lld (end_pos = %lld, remaining = %lld), please check/repair your logfile in userspace by some tool!\n",
+ new_finished_pos,
+ brick->replay_end_pos,
+ remaining);
+ brick->replay_code = status;
+ break;
+ }
+ continue;
+ }
+ if (unlikely(status < 0)) {
+ brick->replay_code = status;
+ XIO_WRN("cannot read logfile data, status = %d\n", status);
+ break;
+ }
+
+ if ((!status && len <= 0) ||
+ new_finished_pos > brick->replay_end_pos) {
+ /* EOF -> wait until brick_thread_should_stop() */
+ XIO_DBG(
+ "EOF at %lld (old = %lld, end_pos = %lld)\n",
+ new_finished_pos,
+ finished_pos,
+ brick->replay_end_pos);
+ if (!brick->continuous_replay_mode) {
+ /* notice: finished_pos remains at old value here! */
+ break;
+ }
+ brick_msleep(1000);
+ continue;
+ }
+
+ if (lh.l_code != CODE_WRITE_NEW) {
+ /* ignore other records silently */
+ } else if (unlikely(brick->disk_io_error)) {
+ status = brick->disk_io_error;
+ brick->replay_code = status;
+ XIO_ERR("IO error %d\n", status);
+ break;
+ } else if (likely(buf && len)) {
+ if (brick->replay_limiter)
+ rate_limit_sleep(brick->replay_limiter, (len - 1) / 1024 + 1);
+ status = replay_data(brick, lh.l_pos, buf, len);
+ XIO_RPL(
+ "replay %lld %lld (pos=%lld status=%d)\n", finished_pos, new_finished_pos, lh.l_pos, status);
+ if (unlikely(status < 0)) {
+ brick->replay_code = status;
+ XIO_ERR(
+ "cannot replay data at pos = %lld len = %d, status = %d\n", lh.l_pos, len, status);
+ break;
+ }
+ finished_pos = new_finished_pos;
+ }
+
+ /* do this _after_ any opportunities for errors... */
+ if ((atomic_read(&brick->replay_count) <= 0 ||
+ ((long long)jiffies) - old_jiffies >= HZ * 3) &&
+ finished_pos >= 0) {
+ /* for safety, wait until the IO queue has drained. */
+ wait_event_interruptible_timeout(
+ brick->worker_event, atomic_read(&brick->replay_count) <= 0, 30 * HZ);
+
+ if (unlikely(brick->disk_io_error)) {
+ status = brick->disk_io_error;
+ brick->replay_code = status;
+ XIO_ERR("IO error %d\n", status);
+ break;
+ }
+
+ down(&input->inf_mutex);
+ input->inf.inf_min_pos = finished_pos;
+ get_lamport(&input->inf.inf_min_pos_stamp);
+ old_jiffies = jiffies;
+ _inf_callback(input, false);
+ up(&input->inf_mutex);
+ }
+ _exit_inputs(brick, false);
+ }
+
+ XIO_INF("waiting for finish...\n");
+
+ wait_event_interruptible_timeout(brick->worker_event, atomic_read(&brick->replay_count) <= 0, 60 * HZ);
+
+ if (unlikely(finished_pos > brick->replay_end_pos)) {
+ XIO_ERR(
+ "finished_pos too large: %lld + %d = %lld > %lld\n",
+ input->logst.log_pos,
+ input->logst.offset,
+ finished_pos,
+ brick->replay_end_pos);
+ }
+
+ if (finished_pos >= 0 && !brick->disk_io_error) {
+ input->inf.inf_min_pos = finished_pos;
+ brick->replay_current_pos = finished_pos;
+ }
+
+ get_lamport(&input->inf.inf_min_pos_stamp);
+
+ if (status >= 0 && finished_pos == brick->replay_end_pos) {
+ XIO_INF("replay finished at %lld\n", finished_pos);
+ brick->replay_code = 1;
+ } else if (status == -EAGAIN && finished_pos + brick->replay_tolerance > brick->replay_end_pos) {
+ XIO_INF("TOLERANCE: logfile is incomplete at %lld (of %lld)\n", finished_pos, brick->replay_end_pos);
+ brick->replay_code = 2;
+ } else if (status < 0) {
+ if (finished_pos < 0)
+ finished_pos = new_finished_pos;
+ if (finished_pos + brick->replay_tolerance > brick->replay_end_pos) {
+ XIO_INF(
+ "TOLERANCE: logfile is incomplete at %lld (of %lld), status = %d\n",
+ finished_pos,
+ brick->replay_end_pos,
+ status);
+ } else {
+ XIO_ERR("replay error %d at %lld (of %lld)\n", status, finished_pos, brick->replay_end_pos);
+ }
+ brick->replay_code = status;
+ } else {
+ XIO_INF("replay stopped prematurely at %lld (of %lld)\n", finished_pos, brick->replay_end_pos);
+ brick->replay_code = 2;
+ }
+
+ for (;;) {
+ _exit_inputs(brick, true);
+ nr_flying = _nr_flying_inputs(brick);
+ if (nr_flying <= 0)
+ break;
+ XIO_INF("%d inputs are operating\n", nr_flying);
+ brick_msleep(1000);
+ }
+
+ local_trigger();
+
+ while (!brick_thread_should_stop())
+ brick_msleep(500);
+}
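+
+/* Editor's summary (not part of the original patch) of the replay_code
+ * conventions established above:
+ *
+ *   0    replay still running
+ *   1    replay finished exactly at replay_end_pos
+ *   2    logfile incomplete but within replay_tolerance, or stopped early
+ *   < 0  error code from log_read() / replay_data() / disk IO
+ */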
+
+/************************ logger thread * switching ************************/
+
+static
+int trans_logger_thread(void *data)
+{
+ struct trans_logger_output *output = data;
+ struct trans_logger_brick *brick = output->brick;
+
+ XIO_INF("........... logger has started.\n");
+
+ if (brick->replay_mode)
+ trans_logger_replay(brick);
+ else
+ trans_logger_log(brick);
+ XIO_INF("........... logger has stopped.\n");
+ xio_set_power_on_led((void *)brick, false);
+ xio_set_power_off_led((void *)brick, true);
+ return 0;
+}
+
+static
+int trans_logger_switch(struct trans_logger_brick *brick)
+{
+ static int index;
+ struct trans_logger_output *output = brick->outputs[0];
+
+ if (brick->power.button) {
+ if (!brick->thread && brick->power.off_led) {
+ xio_set_power_off_led((void *)brick, false);
+
+ brick->thread = brick_thread_create(trans_logger_thread, output, "xio_logger%d", index++);
+ if (unlikely(!brick->thread)) {
+ XIO_ERR("cannot create logger thread\n");
+ return -ENOENT;
+ }
+ }
+ } else {
+ xio_set_power_on_led((void *)brick, false);
+ if (brick->thread) {
+ XIO_INF("stopping thread...\n");
+ brick_thread_stop(brick->thread);
+ brick->thread = NULL;
+ }
+ }
+ return 0;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *trans_logger_statistics(struct trans_logger_brick *brick, int verbose)
+{
+ char *res = brick_string_alloc(STATIST_SIZE);
+
+ snprintf(
+ res, STATIST_SIZE - 1,
+ "mode replay=%d continuous=%d replay_code=%d disk_io_error=%d log_reads=%d | delay_callers = %d cease_logging=%d stopped_logging=%d congested=%d | replay_start_pos = %lld replay_end_pos = %lld | new_input_nr = %d log_input_nr = %d (old = %d) inf_min_pos1 = %lld inf_max_pos1 = %lld inf_min_pos2 = %lld inf_max_pos2 = %lld | total hash_insert=%d hash_find=%d hash_extend=%d replay=%d replay_conflict=%d (%d%%) callbacks=%d reads=%d writes=%d flushes=%d (%d%%) wb_clusters=%d writebacks=%d (%d%%) shortcut=%d (%d%%) mshadow=%d sshadow=%d mshadow_buffered=%d sshadow_buffered=%d rounds=%d restarts=%d delays=%d phase0=%d phase1=%d phase2=%d phase3=%d phase4=%d | current #aios = %d shadow_mem_used=%ld/%lld replay_count=%d mshadow=%d/%d sshadow=%d hash_count=%d balance=%d/%d/%d/%d pos_count1=%d pos_count2=%d "
+ "log_aios1=%d log_aios2=%d any_fly=%d log_fly=%d aio_flying1=%d aio_flying2=%d ban0=%d ban1=%d ban2=%d ban3=%d ban4=%d phase0=%d+%d <%d/%d> phase1=%d+%d <%d/%d> phase2=%d+%d <%d/%d> phase3=%d+%d <%d/%d> phase4=%d+%d <%d/%d>\n",
+ brick->replay_mode,
+ brick->continuous_replay_mode,
+ brick->replay_code,
+ brick->disk_io_error,
+ brick->log_reads,
+ brick->delay_callers,
+ brick->cease_logging,
+ brick->stopped_logging,
+ _congested(brick, EXTRA_QUEUES),
+ brick->replay_start_pos,
+ brick->replay_end_pos,
+ brick->new_input_nr,
+ brick->log_input_nr,
+ brick->old_input_nr,
+ brick->inputs[TL_INPUT_LOG1]->inf.inf_min_pos,
+ brick->inputs[TL_INPUT_LOG1]->inf.inf_max_pos,
+ brick->inputs[TL_INPUT_LOG2]->inf.inf_min_pos,
+ brick->inputs[TL_INPUT_LOG2]->inf.inf_max_pos,
+ atomic_read(&brick->total_hash_insert_count),
+ atomic_read(&brick->total_hash_find_count),
+ atomic_read(&brick->total_hash_extend_count),
+ atomic_read(&brick->total_replay_count),
+ atomic_read(&brick->total_replay_conflict_count),
+ atomic_read(&brick->total_replay_count) ?
+ atomic_read(&brick->total_replay_conflict_count) * 100 /
+ atomic_read(&brick->total_replay_count) : 0,
+ atomic_read(&brick->total_cb_count),
+ atomic_read(&brick->total_read_count),
+ atomic_read(&brick->total_write_count),
+ atomic_read(&brick->total_flush_count),
+ atomic_read(&brick->total_write_count) ?
+ atomic_read(&brick->total_flush_count) * 100 /
+ atomic_read(&brick->total_write_count) : 0,
+ atomic_read(&brick->total_writeback_cluster_count),
+ atomic_read(&brick->total_writeback_count),
+ atomic_read(&brick->total_writeback_cluster_count) ?
+ atomic_read(&brick->total_writeback_count) * 100 /
+ atomic_read(&brick->total_writeback_cluster_count) : 0,
+ atomic_read(&brick->total_shortcut_count),
+ atomic_read(&brick->total_writeback_count) ?
+ atomic_read(&brick->total_shortcut_count) * 100 /
+ atomic_read(&brick->total_writeback_count) : 0,
+ atomic_read(&brick->total_mshadow_count),
+ atomic_read(&brick->total_sshadow_count),
+ atomic_read(&brick->total_mshadow_buffered_count),
+ atomic_read(&brick->total_sshadow_buffered_count),
+ atomic_read(&brick->total_round_count),
+ atomic_read(&brick->total_restart_count),
+ atomic_read(&brick->total_delay_count),
+ atomic_read(&brick->q_phase[0].q_total),
+ atomic_read(&brick->q_phase[1].q_total),
+ atomic_read(&brick->q_phase[2].q_total),
+ atomic_read(&brick->q_phase[3].q_total),
+ atomic_read(&brick->q_phase[4].q_total),
+ atomic_read(&brick->aio_object_layout.alloc_count),
+ atomic64_read(&brick->shadow_mem_used) / 1024,
+ brick_global_memlimit,
+ atomic_read(&brick->replay_count),
+ atomic_read(&brick->mshadow_count),
+ brick->shadow_mem_limit,
+ atomic_read(&brick->sshadow_count),
+ atomic_read(&brick->hash_count),
+ atomic_read(&brick->sub_balance_count),
+ atomic_read(&brick->inner_balance_count),
+ atomic_read(&brick->outer_balance_count),
+ atomic_read(&brick->wb_balance_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->pos_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->pos_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->log_obj_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->log_obj_count),
+ atomic_read(&brick->any_fly_count),
+ atomic_read(&brick->log_fly_count),
+ atomic_read(&brick->inputs[TL_INPUT_LOG1]->logst.aio_flying),
+ atomic_read(&brick->inputs[TL_INPUT_LOG2]->logst.aio_flying),
+ banning_is_hit(&brick->q_phase[0].q_banning),
+ banning_is_hit(&brick->q_phase[1].q_banning),
+ banning_is_hit(&brick->q_phase[2].q_banning),
+ banning_is_hit(&brick->q_phase[3].q_banning),
+ banning_is_hit(&brick->q_phase[4].q_banning),
+ atomic_read(&brick->q_phase[0].q_queued),
+ atomic_read(&brick->q_phase[0].q_flying),
+ brick->q_phase[0].pushback_count,
+ brick->q_phase[0].no_progress_count,
+ atomic_read(&brick->q_phase[1].q_queued),
+ atomic_read(&brick->q_phase[1].q_flying),
+ brick->q_phase[1].pushback_count,
+ brick->q_phase[1].no_progress_count,
+ atomic_read(&brick->q_phase[2].q_queued),
+ atomic_read(&brick->q_phase[2].q_flying),
+ brick->q_phase[2].pushback_count,
+ brick->q_phase[2].no_progress_count,
+ atomic_read(&brick->q_phase[3].q_queued),
+ atomic_read(&brick->q_phase[3].q_flying),
+ brick->q_phase[3].pushback_count,
+ brick->q_phase[3].no_progress_count,
+ atomic_read(&brick->q_phase[4].q_queued),
+ atomic_read(&brick->q_phase[4].q_flying),
+ brick->q_phase[4].pushback_count,
+ brick->q_phase[4].no_progress_count);
+ return res;
+}
+
+static
+void trans_logger_reset_statistics(struct trans_logger_brick *brick)
+{
+ atomic_set(&brick->total_hash_insert_count, 0);
+ atomic_set(&brick->total_hash_find_count, 0);
+ atomic_set(&brick->total_hash_extend_count, 0);
+ atomic_set(&brick->total_replay_count, 0);
+ atomic_set(&brick->total_replay_conflict_count, 0);
+ atomic_set(&brick->total_cb_count, 0);
+ atomic_set(&brick->total_read_count, 0);
+ atomic_set(&brick->total_write_count, 0);
+ atomic_set(&brick->total_flush_count, 0);
+ atomic_set(&brick->total_writeback_count, 0);
+ atomic_set(&brick->total_writeback_cluster_count, 0);
+ atomic_set(&brick->total_shortcut_count, 0);
+ atomic_set(&brick->total_mshadow_count, 0);
+ atomic_set(&brick->total_sshadow_count, 0);
+ atomic_set(&brick->total_mshadow_buffered_count, 0);
+ atomic_set(&brick->total_sshadow_buffered_count, 0);
+ atomic_set(&brick->total_round_count, 0);
+ atomic_set(&brick->total_restart_count, 0);
+ atomic_set(&brick->total_delay_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static
+int trans_logger_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct trans_logger_aio_aspect *ini = (void *)_ini;
+
+ ini->lh.lh_pos = &ini->object->io_pos;
+ INIT_LIST_HEAD(&ini->lh.lh_head);
+ INIT_LIST_HEAD(&ini->hash_head);
+ INIT_LIST_HEAD(&ini->pos_head);
+ INIT_LIST_HEAD(&ini->replay_head);
+ INIT_LIST_HEAD(&ini->collect_head);
+ INIT_LIST_HEAD(&ini->sub_list);
+ INIT_LIST_HEAD(&ini->sub_head);
+ return 0;
+}
+
+static
+void trans_logger_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct trans_logger_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->lh.lh_head);
+ CHECK_HEAD_EMPTY(&ini->hash_head);
+ CHECK_HEAD_EMPTY(&ini->pos_head);
+ CHECK_HEAD_EMPTY(&ini->replay_head);
+ CHECK_HEAD_EMPTY(&ini->collect_head);
+ CHECK_HEAD_EMPTY(&ini->sub_list);
+ CHECK_HEAD_EMPTY(&ini->sub_head);
+ if (ini->log_input)
+ atomic_dec(&ini->log_input->log_obj_count);
+}
+
+XIO_MAKE_STATICS(trans_logger);
+
+/********************* brick constructors * destructors *******************/
+
+static
+void _free_pages(struct trans_logger_brick *brick)
+{
+ int i;
+
+ for (i = 0; i < NR_HASH_PAGES; i++) {
+ struct trans_logger_hash_anchor *sub_table = brick->hash_table[i];
+ int j;
+
+ if (!sub_table)
+ continue;
+ for (j = 0; j < HASH_PER_PAGE; j++) {
+ struct trans_logger_hash_anchor *start = &sub_table[j];
+
+ CHECK_HEAD_EMPTY(&start->hash_anchor);
+ }
+ brick_block_free(sub_table, PAGE_SIZE);
+ }
+ brick_block_free(brick->hash_table, PAGE_SIZE);
+}
+
+static
+int trans_logger_brick_construct(struct trans_logger_brick *brick)
+{
+ int i;
+
+ brick->hash_table = brick_block_alloc(0, PAGE_SIZE);
+ memset(brick->hash_table, 0, PAGE_SIZE);
+
+ for (i = 0; i < NR_HASH_PAGES; i++) {
+ struct trans_logger_hash_anchor *sub_table;
+ int j;
+
+ /* this should usually be optimized away as dead code */
+ if (unlikely(i >= MAX_HASH_PAGES)) {
+ XIO_ERR("sorry, subtable index %d is too large.\n", i);
+ _free_pages(brick);
+ return -EINVAL;
+ }
+
+ sub_table = brick_block_alloc(0, PAGE_SIZE);
+ brick->hash_table[i] = sub_table;
+
+ memset(sub_table, 0, PAGE_SIZE);
+ for (j = 0; j < HASH_PER_PAGE; j++) {
+ struct trans_logger_hash_anchor *start = &sub_table[j];
+
+ init_rwsem(&start->hash_mutex);
+ INIT_LIST_HEAD(&start->hash_anchor);
+ }
+ }
+
+ atomic_set(&brick->hash_count, 0);
+ spin_lock_init(&brick->replay_lock);
+ INIT_LIST_HEAD(&brick->replay_list);
+ INIT_LIST_HEAD(&brick->group_head);
+ init_waitqueue_head(&brick->worker_event);
+ init_waitqueue_head(&brick->caller_event);
+ qq_init(&brick->q_phase[0], brick);
+ qq_init(&brick->q_phase[1], brick);
+ qq_init(&brick->q_phase[2], brick);
+ qq_init(&brick->q_phase[3], brick);
+ qq_init(&brick->q_phase[4], brick);
+ brick->q_phase[0].q_insert_info = "q0_ins";
+ brick->q_phase[0].q_pushback_info = "q0_push";
+ brick->q_phase[0].q_fetch_info = "q0_fetch";
+ brick->q_phase[1].q_insert_info = "q1_ins";
+ brick->q_phase[1].q_pushback_info = "q1_push";
+ brick->q_phase[1].q_fetch_info = "q1_fetch";
+ brick->q_phase[2].q_insert_info = "q2_ins";
+ brick->q_phase[2].q_pushback_info = "q2_push";
+ brick->q_phase[2].q_fetch_info = "q2_fetch";
+ brick->q_phase[3].q_insert_info = "q3_ins";
+ brick->q_phase[3].q_pushback_info = "q3_push";
+ brick->q_phase[3].q_fetch_info = "q3_fetch";
+ brick->q_phase[4].q_insert_info = "q4_ins";
+ brick->q_phase[4].q_pushback_info = "q4_push";
+ brick->new_input_nr = TL_INPUT_LOG1;
+ brick->log_input_nr = TL_INPUT_LOG1;
+ brick->old_input_nr = TL_INPUT_LOG1;
+ add_to_group(&global_writeback, brick);
+ return 0;
+}
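+
+/* Editor's note: an illustrative sketch of the two-level hash table
+ * built above (the indexing helper is an assumption; the real lookup
+ * code lives elsewhere in this patch series). One page holds
+ * NR_HASH_PAGES sub-table pointers, and each sub-table page holds
+ * HASH_PER_PAGE anchors, so a bucket number would resolve as:
+ *
+ *   static inline struct trans_logger_hash_anchor *
+ *   hash_anchor(struct trans_logger_brick *brick, unsigned int nr)
+ *   {
+ *           return &brick->hash_table[nr / HASH_PER_PAGE][nr % HASH_PER_PAGE];
+ *   }
+ */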
+
+static
+int trans_logger_brick_destruct(struct trans_logger_brick *brick)
+{
+ _free_pages(brick);
+ CHECK_HEAD_EMPTY(&brick->replay_list);
+ remove_from_group(&global_writeback, brick);
+ return 0;
+}
+
+static
+int trans_logger_output_construct(struct trans_logger_output *output)
+{
+ return 0;
+}
+
+static
+int trans_logger_input_construct(struct trans_logger_input *input)
+{
+ INIT_LIST_HEAD(&input->pos_list);
+ sema_init(&input->inf_mutex, 1);
+ return 0;
+}
+
+static
+int trans_logger_input_destruct(struct trans_logger_input *input)
+{
+ CHECK_HEAD_EMPTY(&input->pos_list);
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct trans_logger_brick_ops trans_logger_brick_ops = {
+ .brick_switch = trans_logger_switch,
+ .brick_statistics = trans_logger_statistics,
+ .reset_statistics = trans_logger_reset_statistics,
+};
+
+static struct trans_logger_output_ops trans_logger_output_ops = {
+ .xio_get_info = trans_logger_get_info,
+ .aio_get = trans_logger_io_get,
+ .aio_put = trans_logger_io_put,
+ .aio_io = trans_logger_io_io,
+};
+
+const struct trans_logger_input_type trans_logger_input_type = {
+ .type_name = "trans_logger_input",
+ .input_size = sizeof(struct trans_logger_input),
+ .input_construct = &trans_logger_input_construct,
+ .input_destruct = &trans_logger_input_destruct,
+};
+
+static const struct trans_logger_input_type *trans_logger_input_types[] = {
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+ &trans_logger_input_type,
+};
+
+const struct trans_logger_output_type trans_logger_output_type = {
+ .type_name = "trans_logger_output",
+ .output_size = sizeof(struct trans_logger_output),
+ .master_ops = &trans_logger_output_ops,
+ .output_construct = &trans_logger_output_construct,
+};
+
+static const struct trans_logger_output_type *trans_logger_output_types[] = {
+ &trans_logger_output_type,
+};
+
+const struct trans_logger_brick_type trans_logger_brick_type = {
+ .type_name = "trans_logger_brick",
+ .brick_size = sizeof(struct trans_logger_brick),
+ .max_inputs = TL_INPUT_NR,
+ .max_outputs = 1,
+ .master_ops = &trans_logger_brick_ops,
+ .aspect_types = trans_logger_aspect_types,
+ .default_input_types = trans_logger_input_types,
+ .default_output_types = trans_logger_output_types,
+ .brick_construct = &trans_logger_brick_construct,
+ .brick_destruct = &trans_logger_brick_destruct,
+};
+
+/***************** module init stuff ************************/
+
+int __init init_xio_trans_logger(void)
+{
+ XIO_INF("init_trans_logger()\n");
+ return trans_logger_register_brick_type();
+}
+
+void exit_xio_trans_logger(void)
+{
+ XIO_INF("exit_trans_logger()\n");
+ trans_logger_unregister_brick_type();
+}
diff --git a/include/linux/xio/xio_trans_logger.h b/include/linux/xio/xio_trans_logger.h
new file mode 100644
index 000000000000..79bd4ee1b255
--- /dev/null
+++ b/include/linux/xio/xio_trans_logger.h
@@ -0,0 +1,271 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_TRANS_LOGGER_H
+#define XIO_TRANS_LOGGER_H
+
+#define REGION_SIZE_BITS (PAGE_SHIFT + 4)
+#define REGION_SIZE (1 << REGION_SIZE_BITS)
+#define LOGGER_QUEUES 4
+#define EXTRA_QUEUES (LOGGER_QUEUES + 1)
+
+#include <linux/time.h>
+#include <net/sock.h>
+
+#include <linux/xio/xio.h>
+#include <linux/xio/lib_log.h>
+#include <linux/brick/lib_pairing_heap.h>
+#include <linux/brick/lib_queue.h>
+#include <linux/brick/lib_timing.h>
+#include <linux/brick/lib_rank.h>
+#include <linux/brick/lib_limiter.h>
+
+/************************ global tuning ***********************/
+
+/* 0 = early completion of all writes
+ * 1 = early completion of non-sync
+ * 2 = late completion
+ */
+extern int trans_logger_completion_semantics;
+extern int trans_logger_do_crc;
+extern int trans_logger_mem_usage; /* in KB */
+extern int trans_logger_max_interleave;
+extern int trans_logger_resume;
+extern int trans_logger_replay_timeout; /* in s */
+extern atomic_t global_mshadow_count;
+extern atomic64_t global_mshadow_used;
+
+struct writeback_group {
+ rwlock_t lock;
+ struct trans_logger_brick *leader;
+ loff_t biggest;
+ struct list_head group_anchor;
+
+ /* tuning */
+ struct rate_limiter limiter;
+ int until_percent;
+};
+
+extern struct writeback_group global_writeback;
+
+/******************************************************************/
+
+_PAIRING_HEAP_TYPEDEF(logger, /*empty*/);
+
+struct logger_queue {
+ QUEUE_ANCHOR(logger, loff_t, logger);
+ struct trans_logger_brick *q_brick;
+ const char *q_insert_info;
+ const char *q_pushback_info;
+ const char *q_fetch_info;
+ struct banning q_banning;
+ int no_progress_count;
+ int pushback_count;
+};
+
+struct logger_head {
+ struct list_head lh_head;
+ loff_t *lh_pos;
+ struct pairing_heap_logger ph;
+};
+
+/******************************************************************/
+
+#define TL_INPUT_READ 0
+#define TL_INPUT_WRITEBACK 0
+#define TL_INPUT_LOG1 1
+#define TL_INPUT_LOG2 2
+#define TL_INPUT_NR 3
+
+struct writeback_info {
+ struct trans_logger_brick *w_brick;
+ struct logger_head w_lh;
+ loff_t w_pos;
+ int w_len;
+ int w_error;
+
+ struct list_head w_collect_list; /* list of collected orig requests */
+ struct list_head w_sub_read_list; /* for saving the old data before overwrite */
+ struct list_head w_sub_write_list; /* for overwriting */
+ atomic_t w_sub_read_count;
+ atomic_t w_sub_write_count;
+ atomic_t w_sub_log_count;
+ void (*read_endio)(struct generic_callback *cb);
+ void (*write_endio)(struct generic_callback *cb);
+};
+
+struct trans_logger_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct trans_logger_brick *my_brick;
+ struct trans_logger_input *my_input;
+ struct trans_logger_input *log_input;
+ struct logger_queue *my_queue;
+ struct logger_head lh;
+ struct list_head hash_head;
+ struct list_head pos_head;
+ struct list_head replay_head;
+ struct list_head collect_head;
+ struct pairing_heap_logger ph;
+ struct trans_logger_aio_aspect *shadow_aio;
+ struct trans_logger_aio_aspect *orig_aio_a;
+ void *shadow_data;
+ int orig_rw;
+ int wb_error;
+ bool do_dealloc;
+ bool do_buffered;
+ bool is_hashed;
+ bool is_stable;
+ bool is_dirty;
+ bool is_collected;
+ bool is_fired;
+ bool is_completed;
+ bool is_endio;
+ bool is_persistent;
+ bool is_emergency;
+ struct timespec stamp;
+ loff_t log_pos;
+ struct generic_callback cb;
+ struct writeback_info *wb;
+ struct list_head sub_list;
+ struct list_head sub_head;
+ int total_sub_count;
+ int alloc_len;
+ atomic_t current_sub_count;
+};
+
+struct trans_logger_hash_anchor;
+
+struct trans_logger_brick {
+ XIO_BRICK(trans_logger);
+ /* parameters */
+ struct rate_limiter *replay_limiter;
+
+ int shadow_mem_limit; /* max # master shadows */
+ bool replay_mode; /* mode of operation */
+ bool continuous_replay_mode; /* mode of operation */
+ bool log_reads; /* additionally log pre-images */
+ bool cease_logging; /* direct IO without logging (only in case of EMERGENCY) */
+ loff_t replay_start_pos; /* where to start replay */
+ loff_t replay_end_pos; /* end of replay */
+ int new_input_nr; /* whereto we should switchover ASAP */
+ int replay_tolerance; /* how many bytes to ignore at truncated logfiles */
+ /* readonly from outside */
+ loff_t replay_current_pos; /* end of replay */
+ int log_input_nr; /* where we are currently logging to */
+ int old_input_nr; /* where old IO requests may be on the fly */
+ int replay_code; /* replay errors (if any) */
+ bool stopped_logging; /* direct IO without logging (only in case of EMERGENCY) */
+ /* private */
+ int disk_io_error; /* replay errors from callbacks */
+ struct trans_logger_hash_anchor **hash_table;
+ struct list_head group_head;
+ loff_t old_margin;
+ spinlock_t replay_lock;
+ struct list_head replay_list;
+ struct task_struct *thread;
+
+ wait_queue_head_t worker_event;
+ wait_queue_head_t caller_event;
+ struct sockaddr peer_addr;
+
+ /* statistics */
+ atomic64_t shadow_mem_used;
+ atomic_t replay_count;
+ atomic_t any_fly_count;
+ atomic_t log_fly_count;
+ atomic_t hash_count;
+ atomic_t mshadow_count;
+ atomic_t sshadow_count;
+ atomic_t outer_balance_count;
+ atomic_t inner_balance_count;
+ atomic_t sub_balance_count;
+ atomic_t wb_balance_count;
+ atomic_t total_hash_insert_count;
+ atomic_t total_hash_find_count;
+ atomic_t total_hash_extend_count;
+ atomic_t total_replay_count;
+ atomic_t total_replay_conflict_count;
+ atomic_t total_cb_count;
+ atomic_t total_read_count;
+ atomic_t total_write_count;
+ atomic_t total_flush_count;
+ atomic_t total_writeback_count;
+ atomic_t total_writeback_cluster_count;
+ atomic_t total_shortcut_count;
+ atomic_t total_mshadow_count;
+ atomic_t total_sshadow_count;
+ atomic_t total_mshadow_buffered_count;
+ atomic_t total_sshadow_buffered_count;
+ atomic_t total_round_count;
+ atomic_t total_restart_count;
+ atomic_t total_delay_count;
+
+ /* queues */
+ struct logger_queue q_phase[EXTRA_QUEUES];
+ struct rank_data rkd[EXTRA_QUEUES];
+ bool delay_callers;
+};
+
+struct trans_logger_output {
+ XIO_OUTPUT(trans_logger);
+};
+
+#define MAX_HOST_LEN 32
+
+struct trans_logger_info {
+ /* to be maintained / initialized from outside */
+ void (*inf_callback)(struct trans_logger_info *inf);
+ void *inf_private;
+ char inf_host[MAX_HOST_LEN];
+
+ int inf_sequence; /* logfile sequence number */
+
+ /* maintained by trans_logger */
+ loff_t inf_min_pos; /* current replay position (both in replay mode and in logging mode) */
+ loff_t inf_max_pos; /* ditto, indicating the "dirty" area which could potentially be "inconsistent" */
+ loff_t inf_log_pos; /* position of transaction logging (may be ahead of replay position) */
+ struct timespec inf_min_pos_stamp; /* when the data has been _successfully_ overwritten */
+/* when overwrite of the data has _started_ (maybe "trashed" in case of errors / aborts) */
+ struct timespec inf_max_pos_stamp;
+
+ struct timespec inf_log_pos_stamp; /* stamp from transaction log */
+ bool inf_is_replaying;
+ bool inf_is_logging;
+};
+
+struct trans_logger_input {
+ XIO_INPUT(trans_logger);
+ /* parameters */
+ /* informational */
+ struct trans_logger_info inf;
+
+ /* readonly from outside */
+ atomic_t log_obj_count;
+ atomic_t pos_count;
+ bool is_operating;
+ long long last_jiffies;
+
+ /* private */
+ struct log_status logst;
+ struct list_head pos_list;
+ long long inf_last_jiffies;
+ struct semaphore inf_mutex;
+};
+
+XIO_TYPES(trans_logger);
+
+#endif
--
2.11.0

2016-12-30 23:01:28

by Thomas Schoebel-Theuer

Subject: [RFC 13/32] mars: add new module xio

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio.c | 227 ++++++++++++++++++++++++
include/linux/xio/xio.h | 319 ++++++++++++++++++++++++++++++++++
2 files changed, 546 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio.c
create mode 100644 include/linux/xio/xio.h

diff --git a/drivers/staging/mars/xio_bricks/xio.c b/drivers/staging/mars/xio_bricks/xio.c
new file mode 100644
index 000000000000..e58f11f497f9
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio.c
@@ -0,0 +1,227 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+
+#include <linux/xio/xio.h>
+
+/************************************************************/
+
+/* infrastructure */
+
+struct banning xio_global_ban = {};
+atomic_t xio_global_io_flying = ATOMIC_INIT(0);
+
+/************************************************************/
+
+/* object stuff */
+
+const struct generic_object_type aio_type = {
+ .object_type_name = "aio",
+ .default_size = sizeof(struct aio_object),
+ .object_type_nr = OBJ_TYPE_AIO,
+};
+
+/************************************************************/
+
+/* brick stuff */
+
+/*******************************************************************/
+
+/* meta descriptions */
+
+const struct meta xio_info_meta[] = {
+ META_INI(current_size, struct xio_info, FIELD_INT),
+ META_INI(tf_align, struct xio_info, FIELD_INT),
+ META_INI(tf_min_size, struct xio_info, FIELD_INT),
+ {}
+};
+
+const struct meta xio_aio_user_meta[] = {
+ META_INI(_object_cb.cb_error, struct aio_object, FIELD_INT),
+ META_INI(io_pos, struct aio_object, FIELD_INT),
+ META_INI(io_len, struct aio_object, FIELD_INT),
+ META_INI(io_may_write, struct aio_object, FIELD_INT),
+ META_INI(io_prio, struct aio_object, FIELD_INT),
+ META_INI(io_cs_mode, struct aio_object, FIELD_INT),
+ META_INI(io_timeout, struct aio_object, FIELD_INT),
+ META_INI(io_total_size, struct aio_object, FIELD_INT),
+ META_INI(io_checksum, struct aio_object, FIELD_RAW),
+ META_INI(io_flags, struct aio_object, FIELD_INT),
+ META_INI(io_rw, struct aio_object, FIELD_INT),
+ META_INI(io_id, struct aio_object, FIELD_INT),
+ META_INI(io_skip_sync, struct aio_object, FIELD_INT),
+ {}
+};
+
+const struct meta xio_timespec_meta[] = {
+ META_INI_TRANSFER(tv_sec, struct timespec, FIELD_UINT, 8),
+ META_INI_TRANSFER(tv_nsec, struct timespec, FIELD_UINT, 4),
+ {}
+};
+
+/************************************************************/
+
+/* crypto stuff */
+
+#include <linux/scatterlist.h>
+#include <linux/crypto.h>
+
+/* 896545098777564212b9e91af4c973f094649aa7 */
+#ifndef crt_hash
+#define HAS_NEW_CRYPTO
+#endif
+
+#ifdef HAS_NEW_CRYPTO
+
+/* For now, use shash.
+ * Later, asynchronous support should be added for full exploitation
+ * of crypto hardware.
+ */
+#include <crypto/hash.h>
+
+static struct crypto_shash *xio_tfm;
+int xio_digest_size;
+
+struct mars_sdesc {
+ struct shash_desc shash;
+ char ctx[];
+};
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+ int size = sizeof(struct mars_sdesc) + crypto_shash_descsize(xio_tfm);
+ struct mars_sdesc *sdesc = brick_mem_alloc(size);
+ int status;
+
+ sdesc->shash.tfm = xio_tfm;
+ sdesc->shash.flags = 0;
+
+ memset(digest, 0, xio_digest_size);
+ status = crypto_shash_digest(&sdesc->shash, data, len, digest);
+ if (unlikely(status < 0))
+ XIO_ERR(
+ "cannot calculate cksum on %p len=%d, status=%d\n",
+ data, len,
+ status);
+
+ brick_mem_free(sdesc);
+}
+
+#else /* HAS_NEW_CRYPTO */
+
+/* Old implementation, to disappear.
+ * It was a quick'n'dirty lab prototype with unnecessary
+ * global variables and locking.
+ */
+
+static struct crypto_hash *xio_tfm;
+static struct semaphore tfm_sem;
+int xio_digest_size;
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+ struct hash_desc desc = {
+ .tfm = xio_tfm,
+ .flags = 0,
+ };
+ struct scatterlist sg;
+
+ memset(digest, 0, xio_digest_size);
+
+ /* TODO: use per-thread instance, omit locking */
+ down(&tfm_sem);
+
+ crypto_hash_init(&desc);
+ sg_init_table(&sg, 1);
+ sg_set_buf(&sg, data, len);
+ crypto_hash_update(&desc, &sg, sg.length);
+ crypto_hash_final(&desc, digest);
+ up(&tfm_sem);
+}
+
+#endif /* HAS_NEW_CRYPTO */
+
+void aio_checksum(struct aio_object *aio)
+{
+ unsigned char checksum[xio_digest_size];
+ int len;
+
+ if (aio->io_cs_mode <= 0 || !aio->io_data)
+ goto out_return;
+ xio_digest(checksum, aio->io_data, aio->io_len);
+
+ len = sizeof(aio->io_checksum);
+ if (len > xio_digest_size)
+ len = xio_digest_size;
+ memcpy(&aio->io_checksum, checksum, len);
+out_return:;
+}
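+
+/* Editor's note: a minimal usage sketch (not part of the original patch).
+ * Any caller holding a buffer can obtain its md5 digest:
+ *
+ *   unsigned char digest[XIO_CHECKSUM_SIZE];
+ *
+ *   xio_digest(digest, buf, buf_len);
+ *
+ * For md5, xio_digest_size is 16 and thus matches XIO_CHECKSUM_SIZE,
+ * so aio_checksum() above copies the full digest into io_checksum.
+ */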
+
+/*******************************************************************/
+
+/* init stuff */
+
+int __init init_xio(void)
+{
+ XIO_INF("init_xio()\n");
+
+ sema_init(&tfm_sem, 1);
+
+#ifdef HAS_NEW_CRYPTO
+ xio_tfm = crypto_alloc_shash("md5", 0, 0);
+ if (unlikely(!xio_tfm) || IS_ERR(xio_tfm)) {
+ XIO_ERR(
+ "cannot alloc crypto hash, status=%ld\n",
+ PTR_ERR(xio_tfm));
+ return -ELIBACC;
+ }
+ xio_digest_size = crypto_shash_digestsize(xio_tfm);
+#else /* HAS_NEW_CRYPTO */
+ xio_tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
+ if (!xio_tfm) {
+ XIO_ERR("cannot alloc crypto hash\n");
+ return -ENOMEM;
+ }
+ if (IS_ERR(xio_tfm)) {
+ XIO_ERR("alloc crypto hash failed, status = %d\n", (int)PTR_ERR(xio_tfm));
+ return PTR_ERR(xio_tfm);
+ }
+ xio_digest_size = crypto_hash_digestsize(xio_tfm);
+#endif /* HAS_NEW_CRYPTO */
+ XIO_INF("digest_size = %d\n", xio_digest_size);
+
+ return 0;
+}
+
+void exit_xio(void)
+{
+ XIO_INF("exit_xio()\n");
+
+ if (xio_tfm) {
+#ifdef HAS_NEW_CRYPTO
+ crypto_free_shash(xio_tfm);
+#else /* HAS_NEW_CRYPTO */
+ crypto_free_hash(xio_tfm);
+#endif /* HAS_NEW_CRYPTO */
+ }
+}
diff --git a/include/linux/xio/xio.h b/include/linux/xio/xio.h
new file mode 100644
index 000000000000..d26a1c761ee3
--- /dev/null
+++ b/include/linux/xio/xio.h
@@ -0,0 +1,319 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_H
+#define XIO_H
+
+#include <linux/semaphore.h>
+#include <linux/rwsem.h>
+#include <linux/major.h>
+
+#if defined(CONFIG_CRYPTO_LZO) || defined(CONFIG_CRYPTO_LZO_MODULE)
+#define __HAVE_LZO
+#endif
+
+#ifdef __enabled_CONFIG_CRYPTO_LZO
+#if __enabled_CONFIG_CRYPTO_LZO
+#define __HAVE_LZO
+#endif
+#endif
+
+#ifdef __enabled_CONFIG_CRYPTO_LZO_MODULE
+#if __enabled_CONFIG_CRYPTO_LZO_MODULE
+#define __HAVE_LZO
+#endif
+#endif
+
+/* TRANSITIONAL compatibility to BOTH the old prepatch
+ * and the new wrapper around vfs_*(). Both will be replaced
+ * for kernel upstream.
+ */
+#include <linux/brick/vfs_compat.h>
+#ifndef MARS_MAJOR
+#define __USE_COMPAT
+#endif
+
+/***********************************************************************/
+
+/* include the generic brick infrastructure */
+
+#define OBJ_TYPE_AIO 0
+#define OBJ_TYPE_MAX 1
+
+#include <linux/brick/brick.h>
+#include <linux/brick/brick_mem.h>
+#include <linux/brick/lamport.h>
+#include <linux/brick/lib_timing.h>
+
+/***********************************************************************/
+
+/* XIO-specific debugging helpers */
+
+#define _XIO_MSG(_class, _dump, _fmt, _args...) \
+ brick_say(_class, _dump, "XIO", __BASE_FILE__, __LINE__, __func__, _fmt, ##_args)
+
+#define XIO_FAT(_fmt, _args...) _XIO_MSG(SAY_FATAL, true, _fmt, ##_args)
+#define XIO_ERR(_fmt, _args...) _XIO_MSG(SAY_ERROR, false, _fmt, ##_args)
+#define XIO_WRN(_fmt, _args...) _XIO_MSG(SAY_WARN, false, _fmt, ##_args)
+#define XIO_INF(_fmt, _args...) _XIO_MSG(SAY_INFO, false, _fmt, ##_args)
+
+#ifdef XIO_DEBUGGING
+#define XIO_DBG(_fmt, _args...) _XIO_MSG(SAY_DEBUG, false, _fmt, ##_args)
+#else
+#define XIO_DBG(_args...) /**/
+#endif
+
+/***********************************************************************/
+
+/* XIO-specific definitions */
+
+#define XIO_PRIO_HIGH -1
+#define XIO_PRIO_NORMAL 0 /* this is automatically used by memset() */
+#define XIO_PRIO_LOW 1
+#define XIO_PRIO_NR 3
+
+/* object stuff */
+
+/* aio */
+
+#define AIO_UPTODATE 1
+#define AIO_READING 2
+#define AIO_WRITING 4
+
+extern const struct generic_object_type aio_type;
+
+#define XIO_CHECKSUM_SIZE 16
+
+#define AIO_OBJECT(OBJTYPE) \
+ CALLBACK_OBJECT(OBJTYPE); \
+ /* supplied by caller */ \
+ void *io_data; /* preset to NULL for buffered IO */ \
+ loff_t io_pos; \
+ int io_len; \
+ int io_may_write; \
+ int io_prio; \
+ int io_timeout; \
+ int io_cs_mode; /* 0 = off, 1 = checksum + data, 2 = checksum only */\
+ /* maintained by the aio implementation, readable for callers */\
+ loff_t io_total_size; /* just for info, need not be implemented */\
+ unsigned char io_checksum[XIO_CHECKSUM_SIZE]; \
+ int io_flags; \
+ int io_rw; \
+ int io_id; /* not mandatory; may be used for identification */\
+ bool io_skip_sync /* skip sync for this particular aio */
+
+struct aio_object {
+ AIO_OBJECT(aio);
+};
+
+/* internal helper structs */
+
+struct xio_info {
+ loff_t current_size;
+
+ int tf_align; /* transfer alignment constraint */
+ int tf_min_size; /* transfer is only possible in multiples of this */
+};
+
+/* brick stuff */
+
+#define XIO_BRICK(BRITYPE) \
+ GENERIC_BRICK(BRITYPE); \
+ struct generic_object_layout aio_object_layout; \
+ struct list_head global_brick_link; \
+ struct list_head dent_brick_link; \
+ const char *brick_name; \
+ const char *brick_path; \
+ void *private_ptr; \
+ void **kill_ptr; \
+ int *mode_ptr; \
+ int kill_round; \
+ bool killme; \
+ void (*show_status)(struct xio_brick *brick, bool shutdown)
+
+struct xio_brick {
+ XIO_BRICK(xio);
+};
+
+#define XIO_INPUT(BRITYPE) \
+ GENERIC_INPUT(BRITYPE)
+
+struct xio_input {
+ XIO_INPUT(xio);
+};
+
+#define XIO_OUTPUT(BRITYPE) \
+ GENERIC_OUTPUT(BRITYPE)
+
+struct xio_output {
+ XIO_OUTPUT(xio);
+};
+
+#define XIO_BRICK_OPS(BRITYPE) \
+ GENERIC_BRICK_OPS(BRITYPE); \
+ char *(*brick_statistics)(struct BRITYPE##_brick *brick, int verbose);\
+ void (*reset_statistics)(struct BRITYPE##_brick *brick)
+
+#define XIO_OUTPUT_OPS(BRITYPE) \
+ GENERIC_OUTPUT_OPS(BRITYPE); \
+ int (*xio_get_info)(struct BRITYPE##_output *output, struct xio_info *info);\
+ /* aio */ \
+ int (*aio_get)(struct BRITYPE##_output *output, struct aio_object *aio);\
+ void (*aio_io)(struct BRITYPE##_output *output, struct aio_object *aio);\
+ void (*aio_put)(struct BRITYPE##_output *output, struct aio_object *aio)
+
+/* all non-extendable types */
+
+#define _XIO_TYPES(BRITYPE) \
+ \
+struct BRITYPE##_brick_ops { \
+ XIO_BRICK_OPS(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_output_ops { \
+ XIO_OUTPUT_OPS(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_brick_type { \
+ GENERIC_BRICK_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_input_type { \
+ GENERIC_INPUT_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_output_type { \
+ GENERIC_OUTPUT_TYPE(BRITYPE); \
+}; \
+ \
+struct BRITYPE##_callback { \
+ GENERIC_CALLBACK(BRITYPE); \
+}; \
+ \
+DECLARE_BRICK_FUNCTIONS(BRITYPE)
+
+#define XIO_TYPES(BRITYPE) \
+ \
+_XIO_TYPES(BRITYPE); \
+ \
+DECLARE_ASPECT_FUNCTIONS(BRITYPE, aio) \
+extern int init_xio_##BRITYPE(void); \
+extern void exit_xio_##BRITYPE(void)
+
+/* instantiate pseudo base-classes */
+
+DECLARE_OBJECT_FUNCTIONS(aio)
+_XIO_TYPES(xio);
+DECLARE_ASPECT_FUNCTIONS(xio, aio)
+
+/***********************************************************************/
+
+/* XIO-specific helpers */
+
+#define XIO_MAKE_STATICS(BRITYPE) \
+ \
+int BRITYPE##_brick_nr = -EEXIST; \
+ \
+static const struct generic_aspect_type BRITYPE##_aio_aspect_type = { \
+ .aspect_type_name = #BRITYPE "_aio_aspect_type", \
+ .object_type = &aio_type, \
+ .aspect_size = sizeof(struct BRITYPE##_aio_aspect), \
+ .init_fn = BRITYPE##_aio_aspect_init_fn, \
+ .exit_fn = BRITYPE##_aio_aspect_exit_fn, \
+}; \
+ \
+static const struct generic_aspect_type *BRITYPE##_aspect_types[OBJ_TYPE_MAX] = {\
+ [OBJ_TYPE_AIO] = &BRITYPE##_aio_aspect_type, \
+}
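+
+/* Editor's note (usage sketch, not part of the original patch):
+ * each brick implementation instantiates the statics once, e.g.
+ *
+ *   XIO_MAKE_STATICS(trans_logger);
+ *
+ * which defines trans_logger_aspect_types[], referenced by the
+ * .aspect_types field of the corresponding brick type.
+ */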
+
+extern const struct meta xio_info_meta[];
+extern const struct meta xio_aio_user_meta[];
+extern const struct meta xio_timespec_meta[];
+
+/***********************************************************************/
+
+/* Some minimal upcalls from generic IO layer to the strategy layer.
+ * TODO: abstract away.
+ */
+
+extern void xio_set_power_on_led(struct xio_brick *brick, bool val);
+extern void xio_set_power_off_led(struct xio_brick *brick, bool val);
+
+/* this should disappear!
+ */
+extern void (*_local_trigger)(void);
+extern void (*_remote_trigger)(void);
+#define local_trigger() do { if (_local_trigger) { XIO_DBG("trigger...\n"); _local_trigger(); } } while (0)
+#define remote_trigger() \
+do { if (_remote_trigger) { XIO_DBG("remote_trigger...\n"); _remote_trigger(); } } while (0)
+
+/***********************************************************************/
+
+/* Some global stuff.
+ */
+
+extern struct banning xio_global_ban;
+
+extern atomic_t xio_global_io_flying;
+
+extern int xio_throttle_start;
+extern int xio_throttle_end;
+
+/***********************************************************************/
+
+/* Some special brick types for avoidance of cyclic references.
+ *
+ * The client/server network bricks use this for independent instantiation
+ * from the main instantiation logic (separate modprobe for xio_server
+ * is possible).
+ */
+extern const struct generic_brick_type *_client_brick_type;
+extern const struct generic_brick_type *_bio_brick_type;
+extern const struct generic_brick_type *_sio_brick_type;
+
+/***********************************************************************/
+
+/* Crypto stuff
+ */
+
+extern int xio_digest_size;
+extern void xio_digest(unsigned char *digest, void *data, int len);
+extern void aio_checksum(struct aio_object *aio);
+
+/***********************************************************************/
+
+/* Crash-testing instrumentation.
+ * Only for debugging. Never use this for production.
+ * Simulate a crash at the "wrong moment".
+ */
+
+#ifdef CONFIG_MARS_DEBUG
+extern int mars_crash_mode;
+extern int mars_hang_mode;
+extern void _crashme(int mode, bool do_sync);
+#else
+extern inline void _crashme(int mode, bool do_sync) {}
+#endif
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio(void);
+extern void exit_xio(void);
+
+#endif
--
2.11.0

2016-12-30 23:02:02

by Thomas Schoebel-Theuer

Subject: [RFC 03/32] mars: add new module brick_mem

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/brick_mem.c | 1080 ++++++++++++++++++++++++++++++++++++++
include/linux/brick/brick_mem.h | 218 ++++++++
2 files changed, 1298 insertions(+)
create mode 100644 drivers/staging/mars/brick_mem.c
create mode 100644 include/linux/brick/brick_mem.h

diff --git a/drivers/staging/mars/brick_mem.c b/drivers/staging/mars/brick_mem.c
new file mode 100644
index 000000000000..232dbf6cb0ca
--- /dev/null
+++ b/drivers/staging/mars/brick_mem.c
@@ -0,0 +1,1080 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <linux/atomic.h>
+
+#include <linux/brick/brick_mem.h>
+#include <linux/brick/brick_say.h>
+#include <linux/brick/lamport.h>
+
+#define USE_KERNEL_PAGES /* currently mandatory (vmalloc does not work) */
+
+#define MAGIC_BLOCK 0x8B395D7B
+#define MAGIC_BEND 0x8B395D7C
+#define MAGIC_MEM1 0x8B395D7D
+#define MAGIC_MEM2 0x9B395D8D
+#define MAGIC_MEND1 0x8B395D7E
+#define MAGIC_MEND2 0x9B395D8E
+#define MAGIC_STR 0x8B395D7F
+#define MAGIC_SEND 0x9B395D8F
+
+#define INT_ACCESS(ptr, offset) (*(int *)(((char *)(ptr)) + (offset)))
+
+#define _BRICK_FMT(_fmt, _class) \
+ "%ld.%09ld %ld.%09ld MEM_%-5s %s[%d] %s:%d %s(): " \
+ _fmt, \
+ _s_now.tv_sec, _s_now.tv_nsec, \
+ _l_now.tv_sec, _l_now.tv_nsec, \
+ say_class[_class], \
+ current->comm, (int)smp_processor_id(), \
+ __BASE_FILE__, \
+ __LINE__, \
+ __func__
+
+#define _BRICK_MSG(_class, _dump, _fmt, _args...) \
+ do { \
+ struct timespec _s_now = CURRENT_TIME; \
+ struct timespec _l_now; \
+ get_lamport(&_l_now); \
+ say(_class, _BRICK_FMT(_fmt, _class), ##_args); \
+ if (_dump) \
+ dump_stack(); \
+ } while (0)
+
+#define BRICK_ERR(_fmt, _args...) _BRICK_MSG(SAY_ERROR, true, _fmt, ##_args)
+#define BRICK_WRN(_fmt, _args...) _BRICK_MSG(SAY_WARN, false, _fmt, ##_args)
+#define BRICK_INF(_fmt, _args...) _BRICK_MSG(SAY_INFO, false, _fmt, ##_args)
+
+/***********************************************************************/
+
+/* limit handling */
+
+#include <linux/swap.h>
+
+long long brick_global_memavail;
+long long brick_global_memlimit;
+
+atomic64_t brick_global_block_used = ATOMIC64_INIT(0);
+
+void get_total_ram(void)
+{
+ struct sysinfo i = {};
+
+ si_meminfo(&i);
+ /* si_swapinfo(&i); */
+ brick_global_memavail = (long long)i.totalram * (PAGE_SIZE / 1024);
+ BRICK_INF("total RAM = %lld [KiB]\n", brick_global_memavail);
+}
+
+/***********************************************************************/
+
+/* small memory allocation (use this only for len < PAGE_SIZE) */
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_mem_alloc = ATOMIC_INIT(0);
+static atomic_t mem_redirect_alloc = ATOMIC_INIT(0);
+static atomic_t mem_count[BRICK_DEBUG_MEM];
+static atomic_t mem_free[BRICK_DEBUG_MEM];
+static int mem_len[BRICK_DEBUG_MEM];
+
+#define PLUS_SIZE (6 * sizeof(int))
+#else
+#define PLUS_SIZE (2 * sizeof(int))
+#endif
+
+static inline
+void *__brick_mem_alloc(int len)
+{
+ void *res;
+
+ if (len >= PAGE_SIZE) {
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&mem_redirect_alloc);
+#endif
+ res = _brick_block_alloc(0, len, 0);
+ } else {
+ for (;;) {
+ res = kmalloc(len, GFP_BRICK);
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&phys_mem_alloc);
+#endif
+ }
+ return res;
+}
+
+static inline
+void __brick_mem_free(void *data, int len)
+{
+ if (len >= PAGE_SIZE) {
+ _brick_block_free(data, len, 0);
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&mem_redirect_alloc);
+#endif
+ } else {
+ kfree(data);
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&phys_mem_alloc);
+#endif
+ }
+}
+
+void *_brick_mem_alloc(int len, int line)
+{
+ void *res;
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+ res = __brick_mem_alloc(len + PLUS_SIZE);
+
+#ifdef BRICK_DEBUG_MEM
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ INT_ACCESS(res, 0 * sizeof(int)) = MAGIC_MEM1;
+ INT_ACCESS(res, 1 * sizeof(int)) = len;
+ INT_ACCESS(res, 2 * sizeof(int)) = line;
+ INT_ACCESS(res, 3 * sizeof(int)) = MAGIC_MEM2;
+ res += 4 * sizeof(int);
+ INT_ACCESS(res, len + 0 * sizeof(int)) = MAGIC_MEND1;
+ INT_ACCESS(res, len + 1 * sizeof(int)) = MAGIC_MEND2;
+ atomic_inc(&mem_count[line]);
+ mem_len[line] = len;
+#else
+ INT_ACCESS(res, 0 * sizeof(int)) = len;
+ res += PLUS_SIZE;
+#endif
+ return res;
+}
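+
+/* Editor's note (layout sketch, not part of the original patch):
+ * with BRICK_DEBUG_MEM, each allocation returned above is framed as
+ *
+ *   [MAGIC_MEM1][len][line][MAGIC_MEM2][ user data... ][MAGIC_MEND1][MAGIC_MEND2]
+ *     4 bytes     4    4      4 bytes      len bytes       4 bytes      4 bytes
+ *
+ * i.e. PLUS_SIZE = 6 * sizeof(int) = 24 bytes of overhead;
+ * _brick_mem_free() below re-checks all four magics to detect both
+ * under- and overruns.
+ */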
+
+void _brick_mem_free(void *data, int cline)
+{
+#ifdef BRICK_DEBUG_MEM
+ void *test = data - 4 * sizeof(int);
+ int magic1 = INT_ACCESS(test, 0 * sizeof(int));
+ int len = INT_ACCESS(test, 1 * sizeof(int));
+ int line = INT_ACCESS(test, 2 * sizeof(int));
+ int magic2 = INT_ACCESS(test, 3 * sizeof(int));
+
+ if (unlikely(magic1 != MAGIC_MEM1)) {
+ BRICK_ERR(
+ "line %d memory corruption: magix1 %08x != %08x, len = %d\n", cline, magic1, MAGIC_MEM1, len);
+ goto _out_return;
+ }
+ if (unlikely(magic2 != MAGIC_MEM2)) {
+ BRICK_ERR(
+ "line %d memory corruption: magix2 %08x != %08x, len = %d\n", cline, magic2, MAGIC_MEM2, len);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("line %d memory corruption: alloc line = %d, len = %d\n", cline, line, len);
+ goto _out_return;
+ }
+ INT_ACCESS(test, 0) = 0xffffffff;
+ magic1 = INT_ACCESS(data, len + 0 * sizeof(int));
+ if (unlikely(magic1 != MAGIC_MEND1)) {
+ BRICK_ERR(
+ "line %d memory corruption: magix1 %08x != %08x, len = %d\n", cline, magic1, MAGIC_MEND1, len);
+ goto _out_return;
+ }
+ magic2 = INT_ACCESS(data, len + 1 * sizeof(int));
+ if (unlikely(magic2 != MAGIC_MEND2)) {
+ BRICK_ERR(
+ "line %d memory corruption: magix2 %08x != %08x, len = %d\n", cline, magic2, MAGIC_MEND2, len);
+ goto _out_return;
+ }
+ INT_ACCESS(data, len) = 0xffffffff;
+ atomic_dec(&mem_count[line]);
+ atomic_inc(&mem_free[line]);
+#else
+ void *test = data - PLUS_SIZE;
+ int len = INT_ACCESS(test, 0 * sizeof(int));
+
+#endif
+ data = test;
+ __brick_mem_free(data, len + PLUS_SIZE);
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+
+/***********************************************************************/
+
+/* string memory allocation */
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+# define STRING_CANARY \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
+ "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy" \
+ "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" \
+ " FILE = " __FILE__ \
+ " VERSION = " __VERSION__ \
+ " xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx STRING_error xxx\n"
+# define STRING_PLUS (sizeof(int) * 3 + sizeof(STRING_CANARY))
+#elif defined(BRICK_DEBUG_MEM)
+# define STRING_PLUS (sizeof(int) * 4)
+#else
+# define STRING_PLUS 0
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_string_alloc = ATOMIC_INIT(0);
+static atomic_t string_count[BRICK_DEBUG_MEM];
+static atomic_t string_free[BRICK_DEBUG_MEM];
+
+#endif
+
+char *_brick_string_alloc(int len, int line)
+{
+ char *res;
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+ if (unlikely(len > PAGE_SIZE))
+ BRICK_WRN("line = %d string too long: len = %d\n", line, len);
+#endif
+ if (len <= 0)
+ len = BRICK_STRING_LEN;
+
+ for (;;) {
+ res = kzalloc(len + STRING_PLUS, GFP_BRICK);
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+
+#ifdef BRICK_DEBUG_MEM
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ memset(res + 1, '?', len - 1);
+#endif
+ atomic_inc(&phys_string_alloc);
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ INT_ACCESS(res, 0) = MAGIC_STR;
+ INT_ACCESS(res, sizeof(int)) = len;
+ INT_ACCESS(res, sizeof(int) * 2) = line;
+ res += sizeof(int) * 3;
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ strcpy(res + len, STRING_CANARY);
+#else
+ INT_ACCESS(res, len) = MAGIC_SEND;
+#endif
+ atomic_inc(&string_count[line]);
+#endif
+ return res;
+}
+
+void _brick_string_free(const char *data, int cline)
+{
+#ifdef BRICK_DEBUG_MEM
+ int magic;
+ int len;
+ int line;
+ char *orig = (void *)data;
+
+ data -= sizeof(int) * 3;
+ magic = INT_ACCESS(data, 0);
+ if (unlikely(magic != MAGIC_STR)) {
+ BRICK_ERR("cline %d stringmem corruption: magix %08x != %08x\n", cline, magic, MAGIC_STR);
+ goto _out_return;
+ }
+ len = INT_ACCESS(data, sizeof(int));
+ line = INT_ACCESS(data, sizeof(int) * 2);
+ if (unlikely(len <= 0)) {
+ BRICK_ERR("cline %d stringmem corruption: line = %d len = %d\n", cline, line, len);
+ goto _out_return;
+ }
+ if (unlikely(len > PAGE_SIZE))
+ BRICK_ERR("cline %d string too long: line = %d len = %d string='%s'\n", cline, line, len, orig);
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR("cline %d stringmem corruption: line = %d (len = %d)\n", cline, line, len);
+ goto _out_return;
+ }
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ if (unlikely(strcmp(orig + len, STRING_CANARY))) {
+ BRICK_ERR(
+ "cline %d stringmem corruption: bad canary '%s', line = %d len = %d\n",
+ cline, STRING_CANARY, line, len);
+ goto _out_return;
+ }
+ orig[len]--;
+ memset(orig, '!', len);
+#else
+ magic = INT_ACCESS(orig, len);
+ if (unlikely(magic != MAGIC_SEND)) {
+ BRICK_ERR(
+ "cline %d stringmem corruption: end_magix %08x != %08x, line = %d len = %d\n",
+ cline, magic, MAGIC_SEND, line, len);
+ goto _out_return;
+ }
+ INT_ACCESS(orig, len) = 0xffffffff;
+#endif
+ atomic_dec(&string_count[line]);
+ atomic_inc(&string_free[line]);
+ atomic_dec(&phys_string_alloc);
+#endif
+ kfree(data);
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+
+/***********************************************************************/
+
+/* block memory allocation */
+
+static
+int len2order(int len)
+{
+ int order = 0;
+
+ if (unlikely(len <= 0)) {
+ BRICK_ERR("trying to use %d bytes\n", len);
+ return 0;
+ }
+
+ while ((PAGE_SIZE << order) < len)
+ order++;
+
+ if (unlikely(order > BRICK_MAX_ORDER)) {
+ BRICK_ERR("trying to use %d bytes (oder = %d, max = %d)\n", len, order, BRICK_MAX_ORDER);
+ return BRICK_MAX_ORDER;
+ }
+ return order;
+}
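+
+/* Examples (illustration only, assuming PAGE_SIZE == 4096):
+ * len2order(1) == 0, len2order(4096) == 0, len2order(4097) == 1,
+ * len2order(16384) == 2; the allocation granularity is PAGE_SIZE << order.
+ */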
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+static atomic_t _alloc_count[BRICK_MAX_ORDER + 1];
+int brick_mem_alloc_count[BRICK_MAX_ORDER + 1] = {};
+int brick_mem_alloc_max[BRICK_MAX_ORDER + 1] = {};
+int brick_mem_freelist_max[BRICK_MAX_ORDER + 1] = {};
+
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_block_alloc = ATOMIC_INIT(0);
+
+/* indexed by line */
+static atomic_t block_count[BRICK_DEBUG_MEM];
+static atomic_t block_free[BRICK_DEBUG_MEM];
+static int block_len[BRICK_DEBUG_MEM];
+
+/* indexed by order */
+static atomic_t op_count[BRICK_MAX_ORDER + 1];
+static atomic_t raw_count[BRICK_MAX_ORDER + 1];
+static int alloc_line[BRICK_MAX_ORDER + 1];
+static int alloc_len[BRICK_MAX_ORDER + 1];
+
+#endif
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+
+#define MAX_INFO_LISTS 1024
+
+#define INFO_LIST_HASH(addr) ((unsigned long)(addr) / (PAGE_SIZE * 2) % MAX_INFO_LISTS)
+
+struct mem_block_info {
+ struct list_head inf_head;
+ void *inf_data;
+ int inf_len;
+ int inf_line;
+ bool inf_used;
+};
+
+static struct list_head inf_anchor[MAX_INFO_LISTS];
+static rwlock_t inf_lock[MAX_INFO_LISTS];
+
+static
+void _new_block_info(void *data, int len, int cline)
+{
+ struct mem_block_info *inf;
+ int hash;
+
+ for (;;) {
+ inf = kmalloc(sizeof(*inf), GFP_BRICK);
+ if (likely(inf))
+ break;
+ msleep(1000);
+ }
+ inf->inf_data = data;
+ inf->inf_len = len;
+ inf->inf_line = cline;
+ inf->inf_used = true;
+
+ hash = INFO_LIST_HASH(data);
+
+ write_lock(&inf_lock[hash]);
+ list_add(&inf->inf_head, &inf_anchor[hash]);
+ write_unlock(&inf_lock[hash]);
+}
+
+static
+struct mem_block_info *_find_block_info(void *data, bool remove)
+{
+ struct mem_block_info *res = NULL;
+ struct list_head *tmp;
+ int hash = INFO_LIST_HASH(data);
+
+ if (remove)
+ write_lock(&inf_lock[hash]);
+ else
+ read_lock(&inf_lock[hash]);
+ for (tmp = inf_anchor[hash].next; tmp != &inf_anchor[hash]; tmp = tmp->next) {
+ struct mem_block_info *inf = container_of(tmp, struct mem_block_info, inf_head);
+
+ if (inf->inf_data != data)
+ continue;
+ if (remove)
+ list_del_init(tmp);
+ res = inf;
+ break;
+ }
+ if (remove)
+ write_unlock(&inf_lock[hash]);
+ else
+ read_unlock(&inf_lock[hash]);
+ return res;
+}
+
+#endif /* CONFIG_MARS_DEBUG_MEM_STRONG */
+
+static inline
+void *__brick_block_alloc(gfp_t gfp, int order, int cline)
+{
+ void *res;
+
+ for (;;) {
+#ifdef USE_KERNEL_PAGES
+ res = (void *)__get_free_pages(gfp, order);
+#else
+ res = __vmalloc(PAGE_SIZE << order, gfp, PAGE_KERNEL_IO);
+#endif
+ if (likely(res))
+ break;
+ msleep(1000);
+ }
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ _new_block_info(res, PAGE_SIZE << order, cline);
+#endif
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&phys_block_alloc);
+ atomic_inc(&raw_count[order]);
+#endif
+ atomic64_add((PAGE_SIZE / 1024) << order, &brick_global_block_used);
+
+ return res;
+}
+
+static inline
+void __brick_block_free(void *data, int order, int cline)
+{
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ struct mem_block_info *inf = _find_block_info(data, true);
+
+ if (likely(inf)) {
+ int inf_len = inf->inf_len;
+ int inf_line = inf->inf_line;
+
+ kfree(inf);
+ if (unlikely(inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR(
+ "line %d: address %p: bad freeing size %d (correct should be %d, previous line = %d)\n",
+ cline, data, (int)(PAGE_SIZE << order), inf_len, inf_line);
+ goto err;
+ }
+ } else {
+ BRICK_ERR("line %d: trying to free non-existent address %p (order = %d)\n", cline, data, order);
+ goto err;
+ }
+#endif
+#ifdef USE_KERNEL_PAGES
+ __free_pages(virt_to_page((unsigned long)data), order);
+#else
+ vfree(data);
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+err:
+#endif
+#ifdef BRICK_DEBUG_MEM
+ atomic_dec(&phys_block_alloc);
+ atomic_dec(&raw_count[order]);
+#endif
+ atomic64_sub((PAGE_SIZE / 1024) << order, &brick_global_block_used);
+}
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+int brick_allow_freelist = 1;
+
+int brick_pre_reserve[BRICK_MAX_ORDER + 1] = {};
+
+/* Note: we have no separate lists per CPU.
+ * This should not hurt because the freelists are only used
+ * for higher-order pages which should be rather low-frequency.
+ */
+static spinlock_t freelist_lock[BRICK_MAX_ORDER + 1];
+static void *brick_freelist[BRICK_MAX_ORDER + 1];
+static atomic_t freelist_count[BRICK_MAX_ORDER + 1];
+
+static
+void *_get_free(int order, int cline)
+{
+ void *data;
+ unsigned long flags;
+
+ spin_lock_irqsave(&freelist_lock[order], flags);
+ data = brick_freelist[order];
+ if (likely(data)) {
+ void *next = *(void **)data;
+
+#ifdef BRICK_DEBUG_MEM /* check for corruptions */
+ long pattern = *(((long *)data) + 1);
+ void *copy = *(((void **)data) + 2);
+
+ if (unlikely(pattern != 0xf0f0f0f0f0f0f0f0 || next != copy)) { /* found a corruption */
+ /* prevent further trouble by leaving a memleak */
+ brick_freelist[order] = NULL;
+ spin_unlock_irqrestore(&freelist_lock[order], flags);
+ BRICK_ERR(
+ "line %d:freelist corruption at %p (pattern = %lx next %p != %p, murdered = %d), order = %d\n",
+ cline, data, pattern, next, copy, atomic_read(&freelist_count[order]), order);
+ return NULL;
+ }
+#endif
+ brick_freelist[order] = next;
+ atomic_dec(&freelist_count[order]);
+ }
+ spin_unlock_irqrestore(&freelist_lock[order], flags);
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ if (data) {
+ struct mem_block_info *inf = _find_block_info(data, false);
+
+ if (likely(inf)) {
+ if (unlikely(inf->inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR(
+ "line %d: address %p: bad freelist size %d (correct should be %d, previous line = %d)\n",
+ cline, data, (int)(PAGE_SIZE << order), inf->inf_len, inf->inf_line);
+ }
+ inf->inf_line = cline;
+ inf->inf_used = true;
+ } else {
+ BRICK_ERR("line %d: freelist address %p is invalid (order = %d)\n", cline, data, order);
+ }
+ }
+#endif
+ return data;
+}
+
+static
+void _put_free(void *data, int order)
+{
+ void *next;
+ unsigned long flags;
+
+#ifdef BRICK_DEBUG_MEM /* fill with pattern */
+ memset(data, 0xf0, PAGE_SIZE << order);
+#endif
+
+ spin_lock_irqsave(&freelist_lock[order], flags);
+ next = brick_freelist[order];
+ *(void **)data = next;
+#ifdef BRICK_DEBUG_MEM /* insert redundant copy for checking */
+ *(((void **)data) + 2) = next;
+#endif
+ brick_freelist[order] = data;
+ spin_unlock_irqrestore(&freelist_lock[order], flags);
+ atomic_inc(&freelist_count[order]);
+}
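+
+/* Freelist node layout under BRICK_DEBUG_MEM (illustration only):
+ * word 0 holds the next pointer, word 1 keeps the 0xf0 poison pattern
+ * from the memset, and word 2 holds a redundant copy of the next
+ * pointer. _get_free() re-checks words 1 and 2; a mismatch indicates
+ * a write to a freed block, and the list is deliberately leaked
+ * instead of risking further corruption.
+ */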
+
+static
+void _free_all(void)
+{
+ int order;
+
+ for (order = BRICK_MAX_ORDER; order >= 0; order--) {
+ for (;;) {
+ void *data = _get_free(order, __LINE__);
+
+ if (!data)
+ break;
+ __brick_block_free(data, order, __LINE__);
+ }
+ }
+}
+
+int brick_mem_reserve(void)
+{
+ int order;
+ int status = 0;
+
+ for (order = BRICK_MAX_ORDER; order >= 0; order--) {
+ int max = brick_pre_reserve[order];
+ int i;
+
+ brick_mem_freelist_max[order] += max;
+ BRICK_INF(
+ "preallocating %d at order %d (new maxlevel = %d)\n", max, order, brick_mem_freelist_max[order]);
+
+ max = brick_mem_freelist_max[order] - atomic_read(&freelist_count[order]);
+ if (max >= 0) {
+ for (i = 0; i < max; i++) {
+ void *data = __brick_block_alloc(GFP_KERNEL, order, __LINE__);
+
+ if (likely(data))
+ _put_free(data, order);
+ else
+ status = -ENOMEM;
+ }
+ } else {
+ for (i = 0; i < -max; i++) {
+ void *data = _get_free(order, __LINE__);
+
+ if (likely(data))
+ __brick_block_free(data, order, __LINE__);
+ }
+ }
+ }
+ return status;
+}
+#else
+int brick_mem_reserve(void)
+{
+ BRICK_INF("preallocation is not compiled in\n");
+ return 0;
+}
+#endif
+
+void *_brick_block_alloc(loff_t pos, int len, int line)
+{
+ void *data;
+ int count;
+
+#ifdef BRICK_DEBUG_MEM
+#ifdef BRICK_DEBUG_ORDER0
+ const int plus0 = PAGE_SIZE;
+
+#else
+ const int plus0 = 0;
+
+#endif
+ const int plus = len <= PAGE_SIZE ? plus0 : PAGE_SIZE * 2;
+
+#else
+ const int plus = 0;
+
+#endif
+ int order = len2order(len + plus);
+
+ if (unlikely(order < 0)) {
+ BRICK_ERR("trying to allocate %d bytes (max = %d)\n", len, (int)(PAGE_SIZE << order));
+ return NULL;
+ }
+
+#ifdef CONFIG_MARS_DEBUG
+ might_sleep();
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ count = atomic_add_return(1, &_alloc_count[order]);
+ brick_mem_alloc_count[order] = count;
+ if (count > brick_mem_alloc_max[order])
+ brick_mem_alloc_max[order] = count;
+#endif
+
+#ifdef BRICK_DEBUG_MEM
+ atomic_inc(&op_count[order]);
+ /* statistics */
+ alloc_line[order] = line;
+ alloc_len[order] = len;
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ /* Dynamic increase of limits, in order to reduce
+ * fragmentation on higher-order pages.
+ * This comes at the cost of higher memory usage.
+ */
+ if (order > 0 && count > brick_mem_freelist_max[order])
+ brick_mem_freelist_max[order] = count;
+#endif
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ data = _get_free(order, line);
+ if (!data)
+#endif
+ data = __brick_block_alloc(GFP_BRICK, order, line);
+
+#ifdef BRICK_DEBUG_MEM
+ if (order > 0) {
+ if (unlikely(line < 0))
+ line = 0;
+ else if (unlikely(line >= BRICK_DEBUG_MEM))
+ line = BRICK_DEBUG_MEM - 1;
+ atomic_inc(&block_count[line]);
+ block_len[line] = len;
+ if (order > 1) {
+ INT_ACCESS(data, 0 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, 1 * sizeof(int)) = line;
+ INT_ACCESS(data, 2 * sizeof(int)) = len;
+ data += PAGE_SIZE;
+ INT_ACCESS(data, -1 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, len) = MAGIC_BEND;
+ } else if (order == 1) {
+ INT_ACCESS(data, PAGE_SIZE + 0 * sizeof(int)) = MAGIC_BLOCK;
+ INT_ACCESS(data, PAGE_SIZE + 1 * sizeof(int)) = line;
+ INT_ACCESS(data, PAGE_SIZE + 2 * sizeof(int)) = len;
+ }
+ }
+#endif
+ return data;
+}
+
+void _brick_block_free(void *data, int len, int cline)
+{
+ int order;
+
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ struct mem_block_info *inf;
+ char *real_data;
+
+#endif
+#ifdef BRICK_DEBUG_MEM
+ int prev_line = 0;
+
+#ifdef BRICK_DEBUG_ORDER0
+ const int plus0 = PAGE_SIZE;
+
+#else
+ const int plus0 = 0;
+
+#endif
+ const int plus = len <= PAGE_SIZE ? plus0 : PAGE_SIZE * 2;
+
+#else
+ const int plus = 0;
+
+#endif
+
+ order = len2order(len + plus);
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ real_data = data;
+ if (order > 1)
+ real_data -= PAGE_SIZE;
+ inf = _find_block_info(real_data, false);
+ if (likely(inf)) {
+ prev_line = inf->inf_line;
+ if (unlikely(inf->inf_len != (PAGE_SIZE << order))) {
+ BRICK_ERR(
+ "line %d: address %p: bad freeing size %d (correct should be %d, previous line = %d)\n",
+ cline, data, (int)(PAGE_SIZE << order), inf->inf_len, prev_line);
+ goto _out_return;
+ }
+ if (unlikely(!inf->inf_used)) {
+ BRICK_ERR(
+ "line %d: address %p: double freeing (previous line = %d)\n", cline, data, prev_line);
+ goto _out_return;
+ }
+ inf->inf_line = cline;
+ inf->inf_used = false;
+ } else {
+ BRICK_ERR("line %d: trying to free non-existent address %p (order = %d)\n", cline, data, order);
+ goto _out_return;
+ }
+#endif
+#ifdef BRICK_DEBUG_MEM
+ if (order > 1) {
+ void *test = data - PAGE_SIZE;
+ int magic = INT_ACCESS(test, 0);
+ int line = INT_ACCESS(test, sizeof(int));
+ int oldlen = INT_ACCESS(test, sizeof(int) * 2);
+ int magic1 = INT_ACCESS(data, -1 * sizeof(int));
+ int magic2;
+
+ if (unlikely(magic1 != MAGIC_BLOCK)) {
+ BRICK_ERR(
+ "line %d memory corruption: %p magix1 %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic1,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(magic != MAGIC_BLOCK)) {
+ BRICK_ERR(
+ "line %d memory corruption: %p magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: alloc line = %d (previous line = %d)\n",
+ cline,
+ data,
+ line,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(oldlen != len)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: len != oldlen (%d != %d, previous line = %d))\n",
+ cline,
+ data,
+ len,
+ oldlen,
+ prev_line);
+ goto _out_return;
+ }
+ magic2 = INT_ACCESS(data, len);
+ if (unlikely(magic2 != MAGIC_BEND)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BEND,
+ prev_line);
+ goto _out_return;
+ }
+ INT_ACCESS(test, 0) = 0xffffffff;
+ INT_ACCESS(data, len) = 0xffffffff;
+ data = test;
+ atomic_dec(&block_count[line]);
+ atomic_inc(&block_free[line]);
+ } else if (order == 1) {
+ void *test = data + PAGE_SIZE;
+ int magic = INT_ACCESS(test, 0 * sizeof(int));
+ int line = INT_ACCESS(test, 1 * sizeof(int));
+ int oldlen = INT_ACCESS(test, 2 * sizeof(int));
+
+ if (unlikely(magic != MAGIC_BLOCK)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: magix %08x != %08x (previous line = %d)\n",
+ cline,
+ data,
+ magic,
+ MAGIC_BLOCK,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(line < 0 || line >= BRICK_DEBUG_MEM)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: alloc line = %d (previous line = %d)\n",
+ cline,
+ data,
+ line,
+ prev_line);
+ goto _out_return;
+ }
+ if (unlikely(oldlen != len)) {
+ BRICK_ERR(
+ "line %d memory corruption %p: len != oldlen (%d != %d, previous line = %d))\n",
+ cline,
+ data,
+ len,
+ oldlen,
+ prev_line);
+ goto _out_return;
+ }
+ atomic_dec(&block_count[line]);
+ atomic_inc(&block_free[line]);
+ }
+#endif
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ if (order > 0 && brick_allow_freelist &&
+ atomic_read(&freelist_count[order]) <= brick_mem_freelist_max[order]) {
+ _put_free(data, order);
+ } else
+#endif
+ __brick_block_free(data, order, cline);
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ brick_mem_alloc_count[order] = atomic_dec_return(&_alloc_count[order]);
+#endif
+#ifdef BRICK_DEBUG_MEM
+_out_return:;
+#endif
+}
+
+struct page *brick_iomap(void *data, int *offset, int *len)
+{
+ int _offset = ((unsigned long)data) & (PAGE_SIZE - 1);
+ struct page *page;
+
+ *offset = _offset;
+ if (*len > PAGE_SIZE - _offset)
+ *len = PAGE_SIZE - _offset;
+ if (is_vmalloc_addr(data))
+ page = vmalloc_to_page(data);
+ else
+ page = virt_to_page(data);
+ return page;
+}
+
+/***********************************************************************/
+
+/* module */
+
+void brick_mem_statistics(bool final)
+{
+#ifdef BRICK_DEBUG_MEM
+ int i;
+ int count = 0;
+ int places = 0;
+
+ BRICK_INF("======== page allocation:\n");
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ for (i = 0; i <= BRICK_MAX_ORDER; i++) {
+ BRICK_INF(
+ "pages order = %2d operations = %9d freelist_count = %4d / %3d raw_count = %5d alloc_count = %5d alloc_len = %5d line = %5d max_count = %5d\n",
+ i,
+ atomic_read(&op_count[i]),
+ atomic_read(&freelist_count[i]),
+ brick_mem_freelist_max[i],
+ atomic_read(&raw_count[i]),
+ brick_mem_alloc_count[i],
+ alloc_len[i],
+ alloc_line[i],
+ brick_mem_alloc_max[i]);
+ }
+#endif
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&block_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (last size = %4d, freed = %6d)\n",
+ i,
+ val,
+ block_len[i],
+ atomic_read(&block_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d block allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_block_alloc));
+ } else {
+ BRICK_ERR("======== %d block allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_block_alloc));
+ }
+ count = 0;
+ places = 0;
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&mem_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (last size = %4d, freed = %6d)\n",
+ i,
+ val,
+ mem_len[i],
+ atomic_read(&mem_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d memory allocations in %d places (phys=%d,redirect=%d)\n",
+ count, places,
+ atomic_read(&phys_mem_alloc), atomic_read(&mem_redirect_alloc));
+ } else {
+ BRICK_ERR("======== %d memory allocations in %d places (phys=%d,redirect=%d)\n",
+ count, places,
+ atomic_read(&phys_mem_alloc), atomic_read(&mem_redirect_alloc));
+ }
+ count = 0;
+ places = 0;
+ for (i = 0; i < BRICK_DEBUG_MEM; i++) {
+ int val = atomic_read(&string_count[i]);
+
+ if (val) {
+ count += val;
+ places++;
+ BRICK_INF("line %4d: %6d allocated (freed = %6d)\n",
+ i,
+ val,
+ atomic_read(&string_free[i]));
+ }
+ }
+ if (!final || !count) {
+ BRICK_INF("======== %d string allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_string_alloc));
+ } else {
+ BRICK_ERR("======== %d string allocations in %d places (phys=%d)\n",
+ count, places, atomic_read(&phys_string_alloc));
+ }
+#endif
+}
+
+/* module init stuff */
+
+int __init init_brick_mem(void)
+{
+ int i;
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ for (i = BRICK_MAX_ORDER; i >= 0; i--)
+ spin_lock_init(&freelist_lock[i]);
+#endif
+#ifdef CONFIG_MARS_DEBUG_MEM_STRONG
+ for (i = 0; i < MAX_INFO_LISTS; i++) {
+ INIT_LIST_HEAD(&inf_anchor[i]);
+ rwlock_init(&inf_lock[i]);
+ }
+#else
+ (void)i;
+#endif
+
+ get_total_ram();
+
+ return 0;
+}
+
+void exit_brick_mem(void)
+{
+ BRICK_INF("deallocating memory...\n");
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ _free_all();
+#endif
+
+ brick_mem_statistics(true);
+}
diff --git a/include/linux/brick/brick_mem.h b/include/linux/brick/brick_mem.h
new file mode 100644
index 000000000000..cb812da83877
--- /dev/null
+++ b/include/linux/brick/brick_mem.h
@@ -0,0 +1,218 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_MEM_H
+#define BRICK_MEM_H
+
+#include <linux/mm_types.h>
+
+#define BRICK_DEBUG_MEM 4096
+
+#ifndef CONFIG_MARS_DEBUG_MEM
+#undef BRICK_DEBUG_MEM
+#endif
+#ifdef CONFIG_MARS_DEBUG_ORDER0
+#define BRICK_DEBUG_ORDER0
+#endif
+
+#define CONFIG_MARS_MEM_PREALLOC /* this is VITAL - disable only for experiments! */
+
+#define GFP_BRICK GFP_NOIO
+
+extern long long brick_global_memavail;
+extern long long brick_global_memlimit;
+extern atomic64_t brick_global_block_used;
+
+/* All brick memory allocations are guaranteed to succeed.
+ * In case of low memory, they will just retry (forever).
+ *
+ * We always prefer threads for concurrency.
+ * Therefore, in_interrupt() code does not occur, and we can
+ * always sleep in case of memory pressure.
+ *
+ * Resource deadlocks are avoided by the above memory limits.
+ * When exceeded, new memory is simply not allocated any more
+ * (except for vital memory, such as IO memory for which a
+ * low_mem_reserve must always exist, anyway).
+ */
+
+/***********************************************************************/
+
+/* compiler tweaking */
+
+/* Some functions are known to return non-null pointer values,
+ * at least under some Kconfig conditions.
+ *
+ * In code like...
+ *
+ * void *ptr = myfunction();
+ * if (unlikely(!ptr)) {
+ * printk("ERROR: this should not happen\n");
+ * goto fail;
+ * }
+ *
+ * ... the dead code elimination of gcc will not remove the if clause
+ * because the function might return a NULL value, even if a human
+ * would know that myfunction() does not return a NULL value.
+ *
+ * Unfortunately, the __attribute__((nonnull)) can only be applied
+ * to input parameters, but not to the return value.
+ *
+ * More unfortunately, a small inline wrapper does not help either:
+ * the nonnull attribute seems to be eliminated altogether
+ * when the wrapper itself is eliminated.
+ * I don't know whether this is a bug or a feature (or just a weakness).
+ *
+ * Following is a small hack which solves the problem at least for gcc 4.7.
+ *
+ * For this to be useful, -fdelete-null-pointer-checks must be enabled.
+ * Since BRICK is superuser-only anyway, enabling this for MARS should not
+ * be a security risk
+ * (c.f. upstream kernel commit a3ca86aea507904148870946d599e07a340b39bf)
+ */
+extern inline
+void *brick_mark_nonnull(void *_ptr)
+{
+ char *ptr = _ptr;
+
+ /* fool gcc into believing that the pointer is dereferenced... */
+ asm("" : : "X" (*ptr));
+ return ptr;
+}
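+
+/* Usage sketch (illustration only): the allocation wrappers below pipe
+ * their results through brick_mark_nonnull(), so gcc may drop explicit
+ * NULL checks on e.g.
+ *
+ *   char *copy = brick_strdup(orig);  // non-NULL by contract
+ */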
+
+/***********************************************************************/
+
+/* small memory allocation (use this only for len < PAGE_SIZE) */
+
+#define brick_mem_alloc(_len_) \
+ ({ \
+ void *_res_ = _brick_mem_alloc(_len_, __LINE__); \
+ brick_mark_nonnull(_res_); \
+ })
+
+#define brick_zmem_alloc(_len_) \
+ ({ \
+ void *_res_ = _brick_mem_alloc(_len_, __LINE__); \
+ _res_ = brick_mark_nonnull(_res_); \
+ memset(_res_, 0, _len_); \
+ _res_; \
+ })
+
+#define brick_mem_free(_data_) \
+ do { \
+ if (_data_) { \
+ _brick_mem_free(_data_, __LINE__); \
+ } \
+ } while (0)
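+
+/* Usage sketch (illustration only, with a hypothetical struct foo):
+ *
+ *   struct foo *f = brick_zmem_alloc(sizeof(struct foo));
+ *   ...
+ *   brick_mem_free(f);  // NULL-tolerant, records __LINE__
+ */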
+
+/* don't use the following directly */
+extern void *_brick_mem_alloc(int len, int line) __attribute__((malloc)) __attribute__((alloc_size(1)));
+extern void _brick_mem_free(void *data, int line);
+
+/***********************************************************************/
+
+/* string memory allocation */
+
+#define BRICK_STRING_LEN 1024 /* default value when len == 0 */
+
+#define brick_string_alloc(_len_) \
+ ({ \
+ char *_res_ = _brick_string_alloc((_len_), __LINE__); \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_strndup(_orig_, _len_) \
+ ({ \
+ char *_res_ = _brick_string_alloc((_len_) + 1, __LINE__);\
+ _res_ = brick_mark_nonnull(_res_); \
+ strncpy(_res_, (_orig_), (_len_)); \
+ /* always null-terminate for safety */ \
+ _res_[_len_] = '\0'; \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_strdup(_orig_) \
+ ({ \
+ int _len_ = strlen(_orig_); \
+ char *_res_ = _brick_string_alloc((_len_) + 1, __LINE__);\
+ _res_ = brick_mark_nonnull(_res_); \
+ strncpy(_res_, (_orig_), (_len_) + 1); \
+ (char *)brick_mark_nonnull(_res_); \
+ })
+
+#define brick_string_free(_data_) \
+ do { \
+ if (_data_) { \
+ _brick_string_free(_data_, __LINE__); \
+ } \
+ } while (0)
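+
+/* Usage sketch (illustration only):
+ *
+ *   char *name = brick_strdup("resource");
+ *   char *sub = brick_strndup(name, 3);  // "res", always terminated
+ *   brick_string_free(sub);
+ *   brick_string_free(name);
+ */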
+
+/* don't use the following directly */
+extern char *_brick_string_alloc(int len, int line) __attribute__((malloc));
+extern void _brick_string_free(const char *data, int line);
+
+/***********************************************************************/
+
+/* block memory allocation (for aligned multiples of 512 or PAGE_SIZE, respectively) */
+
+#define brick_block_alloc(_pos_, _len_) \
+ ({ \
+ void *_res_ = _brick_block_alloc((_pos_), (_len_), __LINE__);\
+ brick_mark_nonnull(_res_); \
+ })
+
+#define brick_block_free(_data_, _len_) \
+ do { \
+ if (_data_) { \
+ _brick_block_free((_data_), (_len_), __LINE__); \
+ } \
+ } while (0)
+
+extern struct page *brick_iomap(void *data, int *offset, int *len);
+
+/* don't use the following directly */
+extern void *_brick_block_alloc(loff_t pos, int len, int line) __attribute__((malloc)) __attribute__((alloc_size(2)));
+extern void _brick_block_free(void *data, int len, int cline);
+
+/***********************************************************************/
+
+/* reservations / preallocation */
+
+#define BRICK_MAX_ORDER 11
+
+#ifdef CONFIG_MARS_MEM_PREALLOC
+extern int brick_allow_freelist;
+
+extern int brick_pre_reserve[BRICK_MAX_ORDER+1];
+extern int brick_mem_freelist_max[BRICK_MAX_ORDER+1];
+extern int brick_mem_alloc_count[BRICK_MAX_ORDER+1];
+extern int brick_mem_alloc_max[BRICK_MAX_ORDER+1];
+
+extern int brick_mem_reserve(void);
+
+#endif
+
+extern void brick_mem_statistics(bool final);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_brick_mem(void);
+extern void exit_brick_mem(void);
+
+#endif
--
2.11.0

2016-12-30 23:02:51

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 14/32] mars: add new module xio_net

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_net.c | 1849 +++++++++++++++++++++++++++++
include/linux/xio/xio_net.h | 177 +++
2 files changed, 2026 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c
create mode 100644 include/linux/xio/xio_net.h

diff --git a/drivers/staging/mars/xio_bricks/xio_net.c b/drivers/staging/mars/xio_bricks/xio_net.c
new file mode 100644
index 000000000000..441eee1f3912
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_net.c
@@ -0,0 +1,1849 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/moduleparam.h>
+#include <linux/lzo.h>
+#include <linux/utsname.h>
+
+#include <linux/xio/xio.h>
+#include <linux/xio/xio_net.h>
+
+/******************************************************************/
+
+/* provisional version detection */
+
+#ifndef TCP_MAX_REORDERING
+#define __HAS_IOV_ITER
+#endif
+
+#ifdef sk_net_refcnt
+/* see eeb1bd5c40edb0e2fd925c8535e2fdebdbc5cef2 */
+#define __HAS_STRUCT_NET
+#endif
+
+/******************************************************************/
+
+#define USE_BUFFERING
+
+#define SEND_PROTO_VERSION 2
+
+enum COMPRESS_TYPES {
+ COMPRESS_NONE = 0,
+ COMPRESS_LZO = 1,
+ /* insert further methods here */
+};
+
+int xio_net_compress_data;
+
+const u16 net_global_flags = 0
+#ifdef __HAVE_LZO
+ | COMPRESS_LZO
+#endif
+ ;
+
+/******************************************************************/
+
+/* Internal data structures for low-level transfer of C structures
+ * described by struct meta.
+ * Only these low-level fields need to have a fixed size like s64.
+ * The size and bytesex of the higher-level C structures is converted
+ * automatically; therefore classical "int" or "long long" etc is viable.
+ */
+
+#define MAX_FIELD_LEN (32 + 16)
+
+/* Please keep this at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_cache {
+ u8 cache_sender_proto;
+ u8 cache_recver_proto;
+ s8 cache_is_bigendian;
+ u8 cache_spare0;
+ s16 cache_items;
+ u16 cache_spare1;
+ u32 cache_spare2;
+ u32 cache_spare3;
+ u64 cache_spare4[4];
+ u64 cache_sender_cookie;
+ u64 cache_recver_cookie;
+};
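+
+/* Size check (illustration only): 4 * 1 byte + 2 * 2 bytes + 2 * 4 bytes
+ * + 4 * 8 bytes spare + 2 * 8 bytes cookies = 64 bytes, as required above.
+ */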
+
+/* Please keep this also at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_item {
+ s8 field_type;
+ s8 field_spare0;
+ s16 field_data_size;
+ s16 field_sender_size;
+ s16 field_sender_offset;
+ s16 field_recver_size;
+ s16 field_recver_offset;
+ s32 field_spare;
+ char field_name[MAX_FIELD_LEN];
+};
+
+/* This must not be mirror symmetric between big and little endian
+ */
+#define XIO_DESC_MAGIC 0x73D0A2EC6148F48Ell
+
+struct xio_desc_header {
+ u64 h_magic;
+ u64 h_cookie;
+ s16 h_meta_len;
+ s16 h_index;
+ u32 h_spare1;
+ u64 h_spare2;
+};
+
+#define MAX_INT_TRANSFER 16
+
+/******************************************************************/
+
+/* Bytesex conversion / sign extension
+ */
+
+#ifdef __LITTLE_ENDIAN
+static const bool myself_is_bigendian;
+
+#endif
+#ifdef __BIG_ENDIAN
+static const bool myself_is_bigendian = true;
+
+#endif
+
+static inline
+void swap_bytes(void *data, int len)
+{
+ char *a = data;
+ char *b = data + len - 1;
+
+ while (a < b) {
+ char tmp = *a;
+
+ *a = *b;
+ *b = tmp;
+ a++;
+ b--;
+ }
+}
+
+#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x))
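+
+/* Example (illustration only): swap_bytes() reverses the byte order in
+ * place, so a u32 holding 0x11223344 becomes 0x44332211;
+ * SWAP_FIELD(x) covers exactly sizeof(x) bytes.
+ */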
+
+static inline
+void swap_mc(struct xio_desc_cache *mc, int len)
+{
+ struct xio_desc_item *mi;
+
+ SWAP_FIELD(mc->cache_sender_cookie);
+ SWAP_FIELD(mc->cache_recver_cookie);
+ SWAP_FIELD(mc->cache_items);
+
+ len -= sizeof(*mc);
+
+ for (mi = (void *)(mc + 1); len > 0; mi++, len -= sizeof(*mi)) {
+ SWAP_FIELD(mi->field_data_size);
+ SWAP_FIELD(mi->field_sender_size);
+ SWAP_FIELD(mi->field_sender_offset);
+ SWAP_FIELD(mi->field_recver_size);
+ SWAP_FIELD(mi->field_recver_offset);
+ }
+}
+
+static inline
+char get_sign(const void *data, int len, bool is_bigendian, bool is_signed)
+{
+ if (is_signed) {
+ char x = is_bigendian ?
+ ((const char *)data)[0] :
+ ((const char *)data)[len - 1];
+ if (x < 0)
+ return -1;
+ }
+ return 0;
+}
+
+/******************************************************************/
+
+/* Low-level network traffic
+ */
+
+int xio_net_default_port = CONFIG_MARS_DEFAULT_PORT;
+
+module_param_named(xio_port, xio_net_default_port, int, 0);
+
+int xio_net_bind_before_listen = 1;
+
+module_param_named(xio_net_bind_before_listen, xio_net_bind_before_listen, int, 0);
+
+int xio_net_bind_before_connect = 1;
+
+/* TODO: add authentication.
+ * TODO: add encryption.
+ */
+
+struct xio_tcp_params repl_tcp_params = {
+ .ip_tos = IPTOS_LOWDELAY,
+ .tcp_window_size = 8 * 1024 * 1024, /* for long distance replications */
+ .tcp_nodelay = 0,
+ .tcp_timeout = 2,
+ .tcp_keepcnt = 3,
+ .tcp_keepintvl = 3, /* keepalive ping time */
+ .tcp_keepidle = 4,
+};
+
+struct xio_tcp_params device_tcp_params = {
+ .ip_tos = IPTOS_LOWDELAY,
+ .tcp_window_size = 2 * 1024 * 1024,
+ .tcp_nodelay = 1,
+ .tcp_timeout = 2,
+ .tcp_keepcnt = 3,
+ .tcp_keepintvl = 3, /* keepalive ping time */
+ .tcp_keepidle = 4,
+};
+
+static char *id;
+
+char *my_id(void)
+{
+ struct new_utsname *u;
+
+ if (!id) {
+ /* down_read(&uts_sem); // FIXME: this is currently not EXPORTed from the kernel! */
+ u = utsname();
+ if (u)
+ id = brick_strdup(u->nodename);
+ /* up_read(&uts_sem); */
+ }
+ return id;
+}
+
+static
+void __setsockopt(struct socket *sock, int level, int optname, char *optval, int optsize)
+{
+ int status = kernel_setsockopt(sock, level, optname, optval, optsize);
+
+ if (status < 0) {
+ XIO_WRN(
+ "cannot set %d socket option %d to value %d, status = %d\n",
+ level, optname, *(int *)optval, status);
+ }
+}
+
+#define _setsockopt(sock, level, optname, val) __setsockopt(sock, level, optname, (char *)&(val), sizeof(val))
+
+int xio_create_sockaddr(struct sockaddr_storage *addr, const char *spec)
+{
+ struct sockaddr_in *sockaddr = (void *)addr;
+ const char *new_spec;
+ const char *tmp_spec;
+ int status = 0;
+
+ memset(addr, 0, sizeof(*addr));
+ sockaddr->sin_family = AF_INET;
+ sockaddr->sin_port = htons(xio_net_default_port);
+
+ /* Try to translate hostnames to IPs if possible.
+ */
+ if (xio_translate_hostname)
+ new_spec = xio_translate_hostname(spec);
+ else
+ new_spec = brick_strdup(spec);
+ tmp_spec = new_spec;
+
+ /* This is PROVISIONAL!
+ * TODO: add IPV6 syntax and many more features :)
+ */
+ if (!*tmp_spec)
+ goto done;
+ if (*tmp_spec != ':') {
+ unsigned char u0 = 0, u1 = 0, u2 = 0, u3 = 0;
+
+ status = sscanf(tmp_spec, "%hhu.%hhu.%hhu.%hhu", &u0, &u1, &u2, &u3);
+ if (status != 4) {
+ XIO_ERR("invalid sockaddr IP syntax '%s', status = %d\n", tmp_spec, status);
+ status = -EINVAL;
+ goto done;
+ }
+ XIO_DBG("decoded IP = %u.%u.%u.%u\n", u0, u1, u2, u3);
+ sockaddr->sin_addr.s_addr = (__be32)u0 | (__be32)u1 << 8 | (__be32)u2 << 16 | (__be32)u3 << 24;
+ }
+ /* decode port number (when present) */
+ tmp_spec = spec;
+ while (*tmp_spec && *tmp_spec++ != ':')
+ ; /* empty */
+ if (*tmp_spec) {
+ int port = 0;
+
+ status = kstrtoint(tmp_spec, 10, &port);
+ if (unlikely(status)) {
+ XIO_ERR("invalid sockaddr PORT syntax '%s', status = %d\n", tmp_spec, status);
+ status = -EINVAL;
+ goto done;
+ }
+ XIO_DBG("decoded PORT = %d\n", port);
+ sockaddr->sin_port = htons(port);
+ }
+ status = 0;
+done:
+ brick_string_free(new_spec);
+ return status;
+}
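+
+/* Accepted spec syntax (illustration only):
+ *
+ *   "192.168.1.2"       IP only, port defaults to xio_net_default_port
+ *   "192.168.1.2:7777"  IP plus explicit port
+ *   ":7777"             default address, explicit port
+ *
+ * Hostnames are resolved first when xio_translate_hostname is set.
+ */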
+
+static int current_debug_nr; /* no locking, just for debugging */
+
+static
+void _set_socketopts(struct socket *sock, struct xio_tcp_params *params)
+{
+ struct timeval t = {
+ .tv_sec = params->tcp_timeout,
+ };
+ int x_true = 1;
+
+ /* TODO: improve this by a table-driven approach
+ */
+ sock->sk->sk_sndtimeo = params->tcp_timeout * HZ;
+ sock->sk->sk_rcvtimeo = params->tcp_timeout * HZ;
+ sock->sk->sk_reuse = 1;
+ _setsockopt(sock, SOL_SOCKET, SO_SNDBUFFORCE, params->tcp_window_size);
+ _setsockopt(sock, SOL_SOCKET, SO_RCVBUFFORCE, params->tcp_window_size);
+ _setsockopt(sock, SOL_IP, SO_PRIORITY, params->ip_tos);
+ _setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, params->tcp_nodelay);
+ _setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, x_true);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, params->tcp_keepcnt);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, params->tcp_keepintvl);
+ _setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, params->tcp_keepidle);
+ _setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, t);
+ _setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, t);
+
+ if (sock->file) { /* switch back to blocking mode */
+ sock->file->f_flags &= ~O_NONBLOCK;
+ }
+}
+
+static int _xio_send_raw(struct xio_socket *msock, const void *buf, int len, int flags);
+static int _xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen, int flags);
+
+static
+void xio_proto_check(struct xio_socket *msock)
+{
+ u8 service_version = 0;
+ u16 service_flags = 0;
+ int status;
+
+ status = _xio_recv_raw(msock, &service_version, 1, 1, 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG(
+ "#%d protocol exchange failed at receiving, status = %d\n",
+ msock->s_debug_nr,
+ status);
+ goto out_return;
+ }
+
+ /* take the minimum of both protocol versions */
+ if (service_version > msock->s_send_proto)
+ service_version = msock->s_send_proto;
+ msock->s_send_proto = service_version;
+
+ status = _xio_recv_raw(msock, &service_flags, 2, 2, 0);
+ if (unlikely(status < 0)) {
+ XIO_DBG(
+ "#%d protocol exchange failed at receiving, status = %d\n",
+ msock->s_debug_nr,
+ status);
+ goto out_return;
+ }
+
+ msock->s_recv_flags = service_flags;
+out_return:;
+}
+
+static
+int xio_proto_exchange(struct xio_socket *msock, const char *msg)
+{
+ int status;
+
+ msock->s_send_proto = SEND_PROTO_VERSION;
+ status = xio_send_raw(msock, &msock->s_send_proto, 1, false);
+ if (unlikely(status < 0)) {
+ XIO_DBG(
+ "#%d protocol exchange on %s failed at sending, status = %d\n",
+ msock->s_debug_nr,
+ msg,
+ status);
+ goto done;
+ }
+
+ msock->s_send_flags = net_global_flags;
+ status = xio_send_raw(msock, &msock->s_send_flags, 2, false);
+ if (unlikely(status < 0)) {
+ XIO_DBG(
+ "#%d flags exchange on %s failed at sending, status = %d\n",
+ msock->s_debug_nr,
+ msg,
+ status);
+ goto done;
+ }
+
+done:
+ return status;
+}
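+
+/* Handshake wire format (summary, illustration only): each side sends
+ * one byte of protocol version followed by two bytes of capability
+ * flags; xio_proto_check() adopts the minimum of both versions and
+ * records the peer's flags in s_recv_flags.
+ */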
+
+int xio_create_socket(
+struct xio_socket *msock,
+struct sockaddr_storage *src_addr,
+struct sockaddr_storage *dst_addr,
+struct xio_tcp_params *params)
+{
+ struct socket *sock;
+ struct sockaddr *src_sockaddr = (void *)src_addr;
+ struct sockaddr *dst_sockaddr = (void *)dst_addr;
+ int status = -EEXIST;
+
+ if (unlikely(atomic_read(&msock->s_count))) {
+ XIO_ERR("#%d socket already in use\n", msock->s_debug_nr);
+ goto final;
+ }
+ if (unlikely(msock->s_socket)) {
+ XIO_ERR("#%d socket already open\n", msock->s_debug_nr);
+ goto final;
+ }
+ atomic_set(&msock->s_count, 1);
+
+#ifdef __HAS_STRUCT_NET
+ status = sock_create_kern(&init_net, AF_INET, SOCK_STREAM, IPPROTO_TCP, &msock->s_socket);
+#else
+ status = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &msock->s_socket);
+#endif
+ if (unlikely(status < 0 || !msock->s_socket)) {
+ msock->s_socket = NULL;
+ XIO_WRN("cannot create socket, status = %d\n", status);
+ goto final;
+ }
+ msock->s_debug_nr = ++current_debug_nr;
+ sock = msock->s_socket;
+ CHECK_PTR(sock, done);
+ msock->s_alive = true;
+
+ _set_socketopts(sock, params);
+
+ if (!dst_sockaddr) { /* we are server */
+ struct sockaddr_in bind_addr;
+
+ if (unlikely(!src_sockaddr)) {
+ XIO_ERR("no srcaddr given for bind()\n");
+ status = -EINVAL;
+ goto done;
+ }
+
+ memcpy(&bind_addr, src_sockaddr, sizeof(bind_addr));
+ if (!xio_net_bind_before_listen)
+ memset(&bind_addr.sin_addr, 0, sizeof(bind_addr.sin_addr));
+
+ status = kernel_bind(sock, (struct sockaddr *)&bind_addr, sizeof(bind_addr));
+ if (unlikely(status < 0)) {
+ XIO_WRN("#%d bind failed, status = %d\n", msock->s_debug_nr, status);
+ goto done;
+ }
+ status = kernel_listen(sock, 16);
+ if (status < 0)
+ XIO_WRN("#%d listen failed, status = %d\n", msock->s_debug_nr, status);
+ } else {
+ /* When both src and dst are given, explicitly bind local address.
+ * Needed for multihomed hosts.
+ */
+ if (src_sockaddr && xio_net_bind_before_connect) {
+ struct sockaddr_in bind_addr;
+
+ memcpy(&bind_addr, src_sockaddr, sizeof(bind_addr));
+ bind_addr.sin_port = 0;
+
+ status = kernel_bind(sock, (struct sockaddr *)&bind_addr, sizeof(struct sockaddr));
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "#%d bind before connect failed, ignored, status = %d\n",
+ msock->s_debug_nr, status);
+ }
+ }
+
+ status = kernel_connect(sock, dst_sockaddr, sizeof(*dst_sockaddr), 0);
+ /* Treat non-blocking connects as successful.
+ * Any potential errors will show up later during traffic.
+ */
+ if (status == -EINPROGRESS) {
+ XIO_DBG("#%d connect in progress\n", msock->s_debug_nr);
+ status = 0;
+ }
+ if (unlikely(status < 0)) {
+ XIO_DBG("#%d connect failed, status = %d\n", msock->s_debug_nr, status);
+ goto done;
+ }
+ status = xio_proto_exchange(msock, "connect");
+ }
+
+done:
+ if (status < 0)
+ xio_put_socket(msock);
+ else
+ XIO_DBG("successfully created socket #%d\n", msock->s_debug_nr);
+final:
+ return status;
+}
+
+int xio_accept_socket(struct xio_socket *new_msock, struct xio_socket *old_msock, struct xio_tcp_params *params)
+{
+ int status = -ENOENT;
+ struct socket *new_socket = NULL;
+ bool ok;
+
+ ok = xio_get_socket(old_msock);
+ if (likely(ok)) {
+ struct socket *sock = old_msock->s_socket;
+
+ if (unlikely(!sock))
+ goto err;
+
+ status = kernel_accept(sock, &new_socket, O_NONBLOCK);
+ if (unlikely(status < 0))
+ goto err;
+ if (unlikely(!new_socket)) {
+ status = -EBADF;
+ goto err;
+ }
+
+ _set_socketopts(new_socket, params);
+
+ memset(new_msock, 0, sizeof(struct xio_socket));
+ new_msock->s_socket = new_socket;
+ atomic_set(&new_msock->s_count, 1);
+ new_msock->s_alive = true;
+ new_msock->s_debug_nr = ++current_debug_nr;
+ XIO_DBG("#%d successfully accepted socket #%d\n", old_msock->s_debug_nr, new_msock->s_debug_nr);
+
+ status = xio_proto_exchange(new_msock, "accept");
+err:
+ xio_put_socket(old_msock);
+ }
+ return status;
+}
+
+bool xio_get_socket(struct xio_socket *msock)
+{
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p\n", msock->s_debug_nr, msock);
+ return false;
+ }
+
+ atomic_inc(&msock->s_count);
+
+ if (unlikely(!msock->s_socket || !msock->s_alive)) {
+ xio_put_socket(msock);
+ return false;
+ }
+ return true;
+}
+
+void xio_put_socket(struct xio_socket *msock)
+{
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p sock = %p\n", msock->s_debug_nr, msock, msock->s_socket);
+ } else if (atomic_dec_and_test(&msock->s_count)) {
+ struct socket *sock = msock->s_socket;
+ int i;
+
+ XIO_DBG("#%d closing socket %p\n", msock->s_debug_nr, sock);
+ if (likely(sock && cmpxchg(&msock->s_alive, true, false)))
+ kernel_sock_shutdown(sock, SHUT_RDWR);
+ if (likely(sock && !msock->s_alive)) {
+ XIO_DBG("#%d releasing socket %p\n", msock->s_debug_nr, sock);
+ sock_release(sock);
+ }
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ if (msock->s_desc_send[i])
+ brick_block_free(msock->s_desc_send[i], PAGE_SIZE);
+ if (msock->s_desc_recv[i])
+ brick_block_free(msock->s_desc_recv[i], PAGE_SIZE);
+ }
+ brick_block_free(msock->s_buffer, PAGE_SIZE);
+ memset(msock, 0, sizeof(struct xio_socket));
+ }
+}
+
+void xio_shutdown_socket(struct xio_socket *msock)
+{
+ if (msock->s_socket) {
+ bool ok = xio_get_socket(msock);
+
+ if (likely(ok)) {
+ struct socket *sock = msock->s_socket;
+
+ if (likely(sock && cmpxchg(&msock->s_alive, true, false))) {
+ XIO_DBG("#%d shutdown socket %p\n", msock->s_debug_nr, sock);
+ kernel_sock_shutdown(sock, SHUT_RDWR);
+ }
+ xio_put_socket(msock);
+ }
+ }
+}
+
+bool xio_socket_is_alive(struct xio_socket *msock)
+{
+ bool res = false;
+
+ if (!msock->s_socket || !msock->s_alive)
+ goto done;
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p sock = %p\n", msock->s_debug_nr, msock, msock->s_socket);
+ goto done;
+ }
+ res = true;
+done:
+ return res;
+}
+
+long xio_socket_send_space_available(struct xio_socket *msock)
+{
+ struct socket *raw_sock = msock->s_socket;
+ long res = 0;
+
+ if (!msock->s_alive || !raw_sock || !raw_sock->sk)
+ goto done;
+ if (unlikely(atomic_read(&msock->s_count) <= 0)) {
+ XIO_ERR("#%d bad nesting on msock = %p sock = %p\n", msock->s_debug_nr, msock, msock->s_socket);
+ goto done;
+ }
+
+ res = raw_sock->sk->sk_sndbuf - raw_sock->sk->sk_wmem_queued;
+ if (res < 0)
+ res = 0;
+ res += msock->s_pos;
+
+done:
+ return res;
+}
+
+static
+int _xio_send_raw(struct xio_socket *msock, const void *buf, int len, int flags)
+{
+ int sleeptime = 1000 / HZ;
+ int sent = 0;
+ int status = 0;
+
+ msock->s_send_cnt = 0;
+ while (len > 0) {
+ int this_len = len;
+ struct socket *sock = msock->s_socket;
+
+ if (unlikely(!sock || !xio_net_is_alive || brick_thread_should_stop())) {
+ XIO_WRN("interrupting, sent = %d\n", sent);
+ status = -EIDRM;
+ break;
+ }
+
+ {
+ struct kvec iov = {
+ .iov_base = (void *)buf,
+ .iov_len = this_len,
+ };
+ struct msghdr msg = {
+#ifndef __HAS_IOV_ITER
+ .msg_iov = (struct iovec *)&iov,
+#endif
+ .msg_flags = 0 | MSG_NOSIGNAL,
+ };
+ status = kernel_sendmsg(sock, &msg, &iov, 1, this_len);
+ }
+
+ if (status == -EAGAIN) {
+ if (msock->s_send_abort > 0 && ++msock->s_send_cnt > msock->s_send_abort) {
+ XIO_WRN("#%d reached send abort %d\n", msock->s_debug_nr, msock->s_send_abort);
+ status = -EINTR;
+ break;
+ }
+ brick_msleep(sleeptime);
+ /* linearly increasing backoff */
+ if (sleeptime < 100)
+ sleeptime += 1000 / HZ;
+ continue;
+ }
+ msock->s_send_cnt = 0;
+ if (unlikely(status == -EINTR)) { /* ignore it */
+ flush_signals(current);
+ brick_msleep(50);
+ continue;
+ }
+ if (unlikely(!status)) {
+ XIO_WRN("#%d EOF from socket upon send_page()\n", msock->s_debug_nr);
+ brick_msleep(50);
+ status = -ECOMM;
+ break;
+ }
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "#%d bad socket sendmsg, len=%d, this_len=%d, sent=%d, status = %d\n",
+ msock->s_debug_nr,
+ len,
+ this_len,
+ sent,
+ status);
+ break;
+ }
+
+ len -= status;
+ buf += status;
+ sent += status;
+ sleeptime = 1000 / HZ;
+ }
+
+ msock->s_send_bytes += sent;
+ if (status >= 0)
+ status = sent;
+
+ return status;
+}
+
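+/* Buffering strategy (summary, illustration only): small writes are
+ * staged in the PAGE_SIZE buffer s_buffer; the buffer is flushed when
+ * the next write would overflow it or when cork == false, and writes
+ * of a full page or more bypass the buffer entirely. This coalesces
+ * many tiny protocol fields into few sendmsg() calls.
+ */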
+int xio_send_raw(struct xio_socket *msock, const void *buf, int len, bool cork)
+{
+#ifdef USE_BUFFERING
+ int sent = 0;
+ int rest = len;
+
+#endif
+ int status = -EINVAL;
+
+ if (!xio_get_socket(msock))
+ goto final;
+
+#ifdef USE_BUFFERING
+restart:
+ if (!msock->s_buffer) {
+ msock->s_pos = 0;
+ msock->s_buffer = brick_block_alloc(0, PAGE_SIZE);
+ }
+
+ if (msock->s_pos + rest < PAGE_SIZE) {
+ memcpy(msock->s_buffer + msock->s_pos, buf, rest);
+ msock->s_pos += rest;
+ sent += rest;
+ rest = 0;
+ status = sent;
+ if (cork)
+ goto done;
+ }
+
+ if (msock->s_pos > 0) {
+ status = _xio_send_raw(msock, msock->s_buffer, msock->s_pos, 0);
+ if (status < 0)
+ goto done;
+
+ brick_block_free(msock->s_buffer, PAGE_SIZE);
+ msock->s_buffer = NULL;
+ msock->s_pos = 0;
+ }
+
+ if (rest >= PAGE_SIZE) {
+ status = _xio_send_raw(msock, buf, rest, 0);
+ goto done;
+ } else if (rest > 0) {
+ goto restart;
+ }
+ status = sent;
+
+done:
+#else
+ status = _xio_send_raw(msock, buf, len, 0);
+#endif
+ if (status < 0 && msock->s_shutdown_on_err)
+ xio_shutdown_socket(msock);
+
+ xio_put_socket(msock);
+
+final:
+ return status;
+}
+
+/**
+ * _xio_recv_raw() - Get [min, max] number of bytes
+ * @msock: socket to read from
+ * @buf: buffer to put the data in
+ * @minlen: minimum number of bytes to read
+ * @maxlen: maximum number of bytes to read
+ * @flags: additional MSG_* flags for the receive
+ *
+ * Returns a negative error code or a number between [@minlen, @maxlen].
+ * Short reads are mapped to an error.
+ *
+ * Hint: by setting @minlen to 1, you can read any number up to @maxlen.
+ * However, the most important use case is @minlen == @maxlen.
+ *
+ * Note: buf may be NULL. In this case, the data is simply consumed,
+ * like /dev/null
+ */
+static
+int _xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen, int flags)
+{
+ void *dummy = NULL;
+ int sleeptime = 1000 / HZ;
+ int status = -EIDRM;
+ int done = 0;
+
+ if (!buf) {
+ buf = brick_block_alloc(0, maxlen);
+ dummy = buf;
+ }
+
+ if (!xio_get_socket(msock))
+ goto final;
+
+ if (minlen < maxlen) {
+ struct socket *sock = msock->s_socket;
+
+ if (sock && sock->file) {
+ /* Use nonblocking reads to consume as much data
+ * as possible
+ */
+ sock->file->f_flags |= O_NONBLOCK;
+ }
+ }
+
+ msock->s_recv_cnt = 0;
+ while (done < minlen || (!minlen && !done)) {
+ struct kvec iov = {
+ .iov_base = buf + done,
+ .iov_len = maxlen - done,
+ };
+ struct msghdr msg = {
+#ifndef __HAS_IOV_ITER
+ .msg_iovlen = 1,
+ .msg_iov = (struct iovec *)&iov,
+ .msg_flags = flags | MSG_NOSIGNAL,
+#endif
+ };
+ struct socket *sock = msock->s_socket;
+
+ if (unlikely(!sock)) {
+ XIO_WRN("#%d socket has disappeared\n", msock->s_debug_nr);
+ status = -EIDRM;
+ goto err;
+ }
+
+ if (!xio_net_is_alive || brick_thread_should_stop()) {
+ XIO_WRN("#%d interrupting, done = %d\n", msock->s_debug_nr, done);
+ status = -EIDRM;
+ goto err;
+ }
+
+ status = kernel_recvmsg(sock, &msg, &iov, 1, maxlen - done, msg.msg_flags);
+
+ if (!xio_net_is_alive || brick_thread_should_stop()) {
+ XIO_WRN("#%d interrupting, done = %d\n", msock->s_debug_nr, done);
+ status = -EIDRM;
+ goto err;
+ }
+
+ if (status == -EAGAIN) {
+ if (msock->s_recv_abort > 0 && ++msock->s_recv_cnt > msock->s_recv_abort) {
+ XIO_WRN("#%d reached recv abort %d\n", msock->s_debug_nr, msock->s_recv_abort);
+ status = -EINTR;
+ goto err;
+ }
+ brick_msleep(sleeptime);
+ if (minlen <= 0)
+ break;
+ /* linearly increasing backoff */
+ if (sleeptime < 100)
+ sleeptime += 1000 / HZ;
+ continue;
+ }
+ msock->s_recv_cnt = 0;
+ if (!status) { /* EOF */
+ XIO_WRN(
+ "#%d got EOF from socket (done=%d, req_size=%d)\n", msock->s_debug_nr, done, maxlen - done);
+ status = -EPIPE;
+ goto err;
+ }
+ if (status < 0) {
+ XIO_WRN("#%d bad recvmsg, status = %d\n", msock->s_debug_nr, status);
+ goto err;
+ }
+ done += status;
+ sleeptime = 1000 / HZ;
+ }
+ status = done;
+ msock->s_recv_bytes += done;
+
+err:
+ if (status < 0 && msock->s_shutdown_on_err)
+ xio_shutdown_socket(msock);
+ xio_put_socket(msock);
+final:
+ if (dummy)
+ brick_block_free(dummy, maxlen);
+ return status;
+}
+
+int xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen)
+{
+ /* Check the very first received byte for higher-level protocol
+ * information. This saves one ping-pong cycle at
+ * xio_proto_exchange() because the sender can immediately
+ * start sending bulk data without needing to wait there.
+ * This is important for latency, so we exceptionally break
+ * the layering hierarchy here. Also, we start sending at
+ * the lowest possible protocol version and may increase
+ * the protocol capabilities dynamically at runtime,
+ * at some later time. This introduces some slight nondeterminism,
+ * but we accept it for performance reasons.
+ */
+ if (unlikely(!msock->s_recv_bytes))
+ xio_proto_check(msock);
+
+ return _xio_recv_raw(msock, buf, minlen, maxlen, 0);
+}
+
+int xio_send_compressed(struct xio_socket *msock, const void *buf, s32 len, int compress, bool cork)
+{
+ void *compr_data = NULL;
+
+ s16 compr_code = 0;
+ int status;
+
+ switch (compress) {
+ case COMPRESS_LZO:
+#ifdef __HAVE_LZO
+ /* tolerate mixes of different proto versions */
+ if (msock->s_send_proto >= 2 && (msock->s_recv_flags & COMPRESS_LZO)) {
+ size_t compr_len = 0;
+ int lzo_status;
+ void *wrkmem;
+
+ compr_data = brick_mem_alloc(lzo1x_worst_compress(len));
+ wrkmem = brick_mem_alloc(LZO1X_1_MEM_COMPRESS);
+
+ lzo_status = lzo1x_1_compress(buf, len, compr_data, &compr_len, wrkmem);
+
+ brick_mem_free(wrkmem);
+ if (likely(lzo_status == LZO_E_OK && compr_len < len)) {
+ compr_code = COMPRESS_LZO;
+ buf = compr_data;
+ len = compr_len;
+ }
+ }
+#endif
+ break;
+
+ /* implement further methods here */
+
+ default:
+ /* ignore unknown compress codes */
+ break;
+ }
+
+ /* allow mixing of different proto versions */
+ if (likely(msock->s_send_proto >= 2)) {
+ status = xio_send_raw(msock, &compr_code, sizeof(compr_code), true);
+ if (unlikely(status < 0))
+ goto done;
+ if (compr_code > 0) {
+ status = xio_send_raw(msock, &len, sizeof(len), true);
+ if (unlikely(status < 0))
+ goto done;
+ }
+ }
+
+ status = xio_send_raw(msock, buf, len, cork);
+
+done:
+ brick_mem_free(compr_data);
+ return status;
+}
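+
+/* Wire format for proto >= 2 (summary, illustration only):
+ *
+ *   s16 compr_code   COMPRESS_NONE or COMPRESS_LZO
+ *   s32 len          only present when compr_code > 0
+ *   payload          raw or LZO1X-compressed data
+ *
+ * Compression is only used when it actually shrinks the payload;
+ * otherwise compr_code stays 0 and the data travels uncompressed.
+ */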
+
+int xio_recv_compressed(struct xio_socket *msock, void *buf, int minlen, int maxlen)
+{
+ void *compr_data = NULL;
+
+ s16 compr_code = COMPRESS_NONE;
+ int status;
+
+ /* allow mixing of different proto versions */
+ if (msock->s_send_proto >= 2) {
+ status = xio_recv_raw(msock, &compr_code, sizeof(compr_code), sizeof(compr_code));
+ if (unlikely(status < 0))
+ goto done;
+ }
+
+ switch (compr_code) {
+ case COMPRESS_NONE:
+ status = xio_recv_raw(msock, buf, minlen, maxlen);
+ break;
+
+ case COMPRESS_LZO:
+#ifdef __HAVE_LZO
+ {
+ s32 compr_len = 0;
+ size_t this_len;
+ int lzo_status;
+
+ status = xio_recv_raw(msock, &compr_len, sizeof(compr_len), sizeof(compr_len));
+ if (unlikely(status < 0))
+ goto done;
+ if (unlikely(compr_len <= 0 || compr_len >= maxlen)) {
+ XIO_ERR(
+ "bad comp_len = %d, real minlen = %d maxlen = %d\n",
+ compr_len, minlen, maxlen);
+ status = -EOVERFLOW;
+ goto done;
+ }
+
+ compr_data = brick_mem_alloc(compr_len);
+
+ status = xio_recv_raw(msock, compr_data, compr_len, compr_len);
+ if (unlikely(status < 0))
+ goto done;
+
+ this_len = maxlen;
+ lzo_status = lzo1x_decompress_safe(compr_data, compr_len, buf, &this_len);
+
+ status = this_len;
+ if (unlikely(lzo_status != LZO_E_OK)) {
+ XIO_ERR("bad decompression, lzo_status = %d\n", lzo_status);
+ status = -EBADE;
+ goto done;
+ }
+ if (unlikely(this_len < minlen || this_len > maxlen)) {
+ XIO_WRN(
+ "bad decompression length this_len = %ld, minlen = %d maxlen = %d\n", (
+ long)this_len, minlen, maxlen);
+ status = -EBADMSG;
+ goto done;
+ }
+ break;
+ }
+#else
+ XIO_WRN("cannot LZO decompress\n");
+ status = -EBADMSG;
+ break;
+#endif
+
+ /* implement further methods here */
+
+ default:
+ XIO_WRN("got unknown compr_code = %d\n", compr_code);
+ status = -EBADRQC;
+ }
+
+done:
+ brick_mem_free(compr_data);
+ return status;
+}
+
+/*********************************************************************/
+
+/* Mid-level field data exchange
+ */
+
+static
+void dump_meta(const struct meta *meta)
+{
+ int count = 0;
+
+ for (; meta->field_name; meta++) {
+ XIO_ERR(
+ "%2d %4d %4d %4d %p '%s'\n",
+ meta->field_type,
+ meta->field_data_size,
+ meta->field_transfer_size,
+ meta->field_offset,
+ meta->field_ref,
+ meta->field_name);
+ count++;
+ }
+ XIO_ERR("-------- %d fields.\n", count);
+}
+
+static
+int _add_fields(struct xio_desc_item *mi, const struct meta *meta, int offset, const char *prefix, int maxlen)
+{
+ int count = 0;
+
+ for (; meta->field_name; meta++) {
+ const char *new_prefix;
+ int new_offset;
+ int len;
+
+ short this_size;
+
+ new_prefix = mi->field_name;
+ new_offset = offset + meta->field_offset;
+
+ if (unlikely(maxlen < sizeof(struct xio_desc_item))) {
+ XIO_ERR("desc cache item overflow\n");
+ count = -1;
+ goto done;
+ }
+
+ len = scnprintf(mi->field_name, MAX_FIELD_LEN, "%s.%s", prefix, meta->field_name);
+ if (unlikely(len >= MAX_FIELD_LEN)) {
+ XIO_ERR("field len overflow on '%s.%s'\n", prefix, meta->field_name);
+ count = -1;
+ goto done;
+ }
+ mi->field_type = meta->field_type;
+ this_size = meta->field_data_size;
+ mi->field_data_size = this_size;
+ mi->field_sender_size = this_size;
+ this_size = meta->field_transfer_size;
+ if (this_size > 0)
+ mi->field_sender_size = this_size;
+ mi->field_sender_offset = new_offset;
+ mi->field_recver_offset = -1;
+
+ mi++;
+ maxlen -= sizeof(struct xio_desc_item);
+ count++;
+
+ if (meta->field_type == FIELD_SUB) {
+ int sub_count;
+
+ sub_count = _add_fields(mi, meta->field_ref, new_offset, new_prefix, maxlen);
+ if (sub_count < 0)
+ return sub_count;
+
+ mi += sub_count;
+ count += sub_count;
+ maxlen -= sub_count * sizeof(struct xio_desc_item);
+ }
+ }
+done:
+ return count;
+}
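+
+/* For illustration: _add_fields() flattens a nested meta description
+ * into dotted names. Assuming xio_timespec_meta describes the members
+ * tv_sec and tv_nsec (an assumption of this example), the xio_cmd_meta
+ * description from xio_net.c yields the items
+ *
+ *   ".cmd_stamp"          (FIELD_SUB)
+ *   ".cmd_stamp.tv_sec"
+ *   ".cmd_stamp.tv_nsec"
+ *   ".cmd_code"
+ *   ".cmd_int1"
+ *   ".cmd_str1"
+ *
+ * so that sender and receiver can match fields by name even when
+ * struct layout or field order differs between versions.
+ */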
+
+static
+struct xio_desc_cache *make_sender_cache(struct xio_socket *msock, const struct meta *meta, int *cache_index)
+{
+ int orig_len = PAGE_SIZE;
+ int maxlen = orig_len;
+ struct xio_desc_cache *mc;
+ struct xio_desc_item *mi;
+ int i;
+ int status;
+
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ mc = msock->s_desc_send[i];
+ if (!mc)
+ break;
+ if (mc->cache_sender_cookie == (u64)meta)
+ goto done;
+ }
+
+ if (unlikely(i >= MAX_DESC_CACHE - 1)) {
+ XIO_ERR("#%d desc cache overflow\n", msock->s_debug_nr);
+ return NULL;
+ }
+
+ mc = brick_block_alloc(0, maxlen);
+
+ memset(mc, 0, maxlen);
+ mc->cache_sender_cookie = (u64)meta;
+ /* further bits may be used in future */
+ mc->cache_sender_proto = msock->s_send_proto;
+ mc->cache_recver_proto = msock->s_recv_proto;
+
+ maxlen -= sizeof(struct xio_desc_cache);
+ mi = (void *)(mc + 1);
+
+ status = _add_fields(mi, meta, 0, "", maxlen);
+
+ if (likely(status > 0)) {
+ mc->cache_items = status;
+ mc->cache_is_bigendian = myself_is_bigendian;
+ msock->s_desc_send[i] = mc;
+ *cache_index = i;
+ } else {
+ brick_block_free(mc, orig_len);
+ mc = NULL;
+ }
+
+done:
+ return mc;
+}
+
+static
+int _make_recver_cache(struct xio_desc_cache *mc, const struct meta *meta, int offset, const char *prefix)
+{
+ char *tmp = brick_string_alloc(MAX_FIELD_LEN);
+ int count = 0;
+ int i;
+
+ for (; meta->field_name; meta++, count++) {
+ snprintf(tmp, MAX_FIELD_LEN, "%s.%s", prefix, meta->field_name);
+ for (i = 0; i < mc->cache_items; i++) {
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + i;
+
+ if (meta->field_type == mi->field_type &&
+ !strcmp(tmp, mi->field_name)) {
+ mi->field_recver_size = meta->field_data_size;
+ mi->field_recver_offset = offset + meta->field_offset;
+ if (meta->field_type == FIELD_SUB) {
+ int sub_count = _make_recver_cache(
+ mc, meta->field_ref, mi->field_recver_offset, tmp);
+ if (unlikely(sub_count <= 0)) {
+ count = 0;
+ goto done;
+ }
+ }
+ goto found;
+ }
+ }
+ if (unlikely(!count)) {
+ XIO_ERR("field '%s' is missing\n", meta->field_name);
+ goto done;
+ }
+ XIO_WRN("field %2d '%s' is missing\n", count, meta->field_name);
+found:;
+ }
+done:
+ brick_string_free(tmp);
+ return count;
+}
+
+static
+int make_recver_cache(struct xio_desc_cache *mc, const struct meta *meta)
+{
+ int count;
+ int i;
+
+ mc->cache_recver_cookie = (u64)meta;
+ count = _make_recver_cache(mc, meta, 0, "");
+
+ for (i = 0; i < mc->cache_items; i++) {
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + i;
+
+ if (unlikely(mi->field_recver_offset < 0))
+ XIO_WRN("field '%s' is not transferred\n", mi->field_name);
+ }
+ return count;
+}
+
+#define _CHECK_STATUS(_txt_) \
+do { \
+ if (unlikely(status < 0)) { \
+ XIO_DBG("%s status = %d\n", _txt_, status); \
+ goto err; \
+ } \
+} while (0)
+
+static
+int _desc_send_item(
+struct xio_socket *msock, const void *data, const struct xio_desc_cache *mc, int index, bool cork)
+{
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + index;
+ const void *item = data + mi->field_sender_offset;
+
+ s16 data_len = mi->field_data_size;
+ s16 transfer_len = mi->field_sender_size;
+ int status;
+ bool is_signed = false;
+ int res = -1;
+
+ switch (mi->field_type) {
+ case FIELD_REF:
+ XIO_ERR("field '%s' NYI type = %d\n", mi->field_name, mi->field_type);
+ goto err;
+ case FIELD_SUB:
+ /* skip this */
+ res = 0;
+ break;
+ case FIELD_INT:
+ is_signed = true;
+ /* fallthrough */
+ case FIELD_UINT:
+ if (unlikely(data_len <= 0 || data_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (unlikely(transfer_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ }
+
+ if (likely(data_len == transfer_len))
+ goto raw;
+
+ if (transfer_len > data_len) {
+ int diff = transfer_len - data_len;
+ char empty[diff];
+ char sign;
+
+ sign = get_sign(item, data_len, myself_is_bigendian, is_signed);
+ memset(empty, sign, diff);
+
+ if (myself_is_bigendian) {
+ status = xio_send_raw(msock, empty, diff, true);
+ _CHECK_STATUS("send_diff");
+ status = xio_send_raw(msock, item, data_len, cork);
+ _CHECK_STATUS("send_item");
+ } else {
+ status = xio_send_raw(msock, item, data_len, true);
+ _CHECK_STATUS("send_item");
+ status = xio_send_raw(msock, empty, diff, cork);
+ _CHECK_STATUS("send_diff");
+ }
+
+ res = data_len;
+ break;
+ } else if (unlikely(transfer_len <= 0)) {
+ XIO_ERR("bad transfer_len = %d\n", transfer_len);
+ goto err;
+ } else { /* transfer_len < data_len */
+ char check = get_sign(item, data_len, myself_is_bigendian, is_signed);
+ int start;
+ int end;
+ int i;
+
+ if (is_signed &&
+ unlikely(get_sign(item, transfer_len, myself_is_bigendian, true) != check)) {
+ XIO_ERR(
+ "cannot sign-reduce signed integer from %d to %d bytes, byte %d !~ %d\n",
+ data_len,
+ transfer_len,
+ ((char *)item)[transfer_len - 1],
+ check);
+ goto err;
+ }
+
+ if (myself_is_bigendian) {
+ start = 0;
+ end = data_len - transfer_len;
+ } else {
+ start = transfer_len;
+ end = data_len;
+ }
+
+ for (i = start; i < end; i++) {
+ if (unlikely(((char *)item)[i] != check)) {
+ XIO_ERR(
+ "cannot sign-reduce %ssigned integer from %d to %d bytes at pos %d, byte %d != %d\n",
+ is_signed ? "" : "un",
+ data_len,
+ transfer_len,
+ i,
+ ((char *)item)[i],
+ check);
+ goto err;
+ }
+ }
+
+ /* just omit the higher/lower bytes */
+ data_len = transfer_len;
+ if (myself_is_bigendian)
+ item += end;
+ goto raw;
+ }
+ case FIELD_STRING:
+ item = *(void **)item;
+ data_len = 0;
+ if (item)
+ data_len = strlen(item) + 1;
+
+ status = xio_send_raw(msock, &data_len, sizeof(data_len), true);
+ _CHECK_STATUS("send_string_len");
+ /* fallthrough */
+ case FIELD_RAW:
+raw:
+ if (unlikely(data_len < 0)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ status = xio_send_raw(msock, item, data_len, cork);
+ _CHECK_STATUS("send_raw");
+ res = data_len;
+ break;
+ default:
+ XIO_ERR("field '%s' unknown type = %d\n", mi->field_name, mi->field_type);
+ }
+err:
+ return res;
+}
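+
+/* Worked example for the sign-reduction above, on a little-endian
+ * sender: the s64 value -1 (all bytes 0xff) may be reduced to a 2-byte
+ * transfer, since every omitted byte equals the sign byte 0xff.
+ * Reducing 511 = 0x1ff to a single byte would lose the 0x01 and is
+ * therefore rejected. On reception, _desc_recv_item() re-extends the
+ * transferred bytes with the sign byte.
+ */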
+
+static
+int _desc_recv_item(struct xio_socket *msock, void *data, const struct xio_desc_cache *mc, int index, int line)
+{
+ struct xio_desc_item *mi = ((struct xio_desc_item *)(mc + 1)) + index;
+ void *item = NULL;
+
+ s16 data_len = mi->field_recver_size;
+ s16 transfer_len = mi->field_sender_size;
+ int status;
+ bool is_signed = false;
+ int res = -1;
+
+ if (likely(data && data_len > 0 && mi->field_recver_offset >= 0))
+ item = data + mi->field_recver_offset;
+
+ switch (mi->field_type) {
+ case FIELD_REF:
+ XIO_ERR("field '%s' NYI type = %d\n", mi->field_name, mi->field_type);
+ goto err;
+ case FIELD_SUB:
+ /* skip this */
+ res = 0;
+ break;
+ case FIELD_INT:
+ is_signed = true;
+ /* fallthrough */
+ case FIELD_UINT:
+ if (unlikely(data_len <= 0 || data_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (unlikely(transfer_len > MAX_INT_TRANSFER)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ }
+
+ if (likely(data_len == transfer_len))
+ goto raw;
+
+ if (transfer_len > data_len) {
+ int diff = transfer_len - data_len;
+ char empty[diff];
+ char check;
+
+ memset(empty, 0, diff);
+
+ if (myself_is_bigendian) {
+ status = xio_recv_raw(msock, empty, diff, diff);
+ _CHECK_STATUS("recv_diff");
+ }
+
+ status = xio_recv_raw(msock, item, data_len, data_len);
+ _CHECK_STATUS("recv_item");
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian && item))
+ swap_bytes(item, data_len);
+
+ if (!myself_is_bigendian) {
+ status = xio_recv_raw(msock, empty, diff, diff);
+ _CHECK_STATUS("recv_diff");
+ }
+
+ /* check that sign extension did no harm */
+ check = get_sign(empty, diff, mc->cache_is_bigendian, is_signed);
+ while (--diff >= 0) {
+ if (unlikely(empty[diff] != check)) {
+ XIO_ERR(
+ "field '%s' %sSIGNED INTEGER OVERFLOW on size reduction from %d to %d, byte %d != %d\n",
+ mi->field_name,
+ is_signed ? "" : "UN",
+ transfer_len,
+ data_len,
+ empty[diff],
+ check);
+ goto err;
+ }
+ }
+ if (is_signed && item &&
+ unlikely(get_sign(item, data_len, myself_is_bigendian, true) != check)) {
+ XIO_ERR(
+ "field '%s' SIGNED INTEGER OVERLOW on reduction from size %d to %d, byte %d !~ %d\n",
+ mi->field_name,
+ transfer_len,
+ data_len,
+ ((char *)item)[data_len - 1],
+ check);
+ goto err;
+ }
+
+ res = data_len;
+ break;
+ } else if (unlikely(transfer_len <= 0)) {
+ XIO_ERR("field '%s' bad transfer_len = %d\n", mi->field_name, transfer_len);
+ goto err;
+ } else if (unlikely(!item)) { /* shortcut without checks */
+ data_len = transfer_len;
+ goto raw;
+ } else { /* transfer_len < data_len */
+ int diff = data_len - transfer_len;
+ char *transfer_ptr = item;
+ char sign;
+
+ if (myself_is_bigendian)
+ transfer_ptr += diff;
+
+ status = xio_recv_raw(msock, transfer_ptr, transfer_len, transfer_len);
+ _CHECK_STATUS("recv_transfer");
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian))
+ swap_bytes(transfer_ptr, transfer_len);
+
+ /* sign-extend from transfer_len to data_len */
+ sign = get_sign(transfer_ptr, transfer_len, myself_is_bigendian, is_signed);
+ if (myself_is_bigendian)
+ memset(item, sign, diff);
+ else
+ memset(item + transfer_len, sign, diff);
+ res = data_len;
+ break;
+ }
+ case FIELD_STRING:
+ data_len = 0;
+ status = xio_recv_raw(msock, &data_len, sizeof(data_len), sizeof(data_len));
+ _CHECK_STATUS("recv_string_len");
+
+ if (unlikely(mc->cache_is_bigendian != myself_is_bigendian))
+ swap_bytes(&data_len, sizeof(data_len));
+
+ if (data_len > 0 && item) {
+ char *str = _brick_string_alloc(data_len, line);
+
+ *(void **)item = str;
+ item = str;
+ }
+
+ transfer_len = data_len;
+ /* fallthrough */
+ case FIELD_RAW:
+raw:
+ if (unlikely(data_len < 0)) {
+ XIO_ERR("field = '%s' implausible data_len = %d\n", mi->field_name, data_len);
+ goto err;
+ }
+ if (likely(data_len > 0)) {
+ if (unlikely(transfer_len != data_len)) {
+ XIO_ERR(
+ "cannot handle generic mismatch in transfer sizes, field = '%s', %d != %d\n",
+ mi->field_name,
+ transfer_len,
+ data_len);
+ goto err;
+ }
+ status = xio_recv_raw(msock, item, data_len, data_len);
+ _CHECK_STATUS("recv_raw");
+ }
+ res = data_len;
+ break;
+ default:
+ XIO_ERR("field '%s' unknown type = %d\n", mi->field_name, mi->field_type);
+ }
+err:
+ return res;
+}
+
+static inline
+int _desc_send_struct(struct xio_socket *msock, int cache_index, const void *data, int h_meta_len, bool cork)
+{
+ const struct xio_desc_cache *mc = msock->s_desc_send[cache_index];
+
+ struct xio_desc_header header = {
+ .h_magic = XIO_DESC_MAGIC,
+ .h_cookie = mc->cache_sender_cookie,
+ .h_meta_len = h_meta_len,
+ .h_index = data ? cache_index : -1,
+ };
+ int index;
+ int count = 0;
+ int status = 0;
+
+ status = xio_send_raw(msock, &header, sizeof(header), cork || data);
+ _CHECK_STATUS("send_header");
+
+ if (unlikely(h_meta_len > 0)) {
+ status = xio_send_raw(msock, mc, h_meta_len, true);
+ _CHECK_STATUS("send_meta");
+ }
+
+ if (likely(data)) {
+ for (index = 0; index < mc->cache_items; index++) {
+ status = _desc_send_item(msock, data, mc, index, cork || index < mc->cache_items - 1);
+ _CHECK_STATUS("send_cache_item");
+ count++;
+ }
+ }
+
+ if (status >= 0)
+ status = count;
+err:
+ return status;
+}
+
+static
+int desc_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta, bool cork)
+{
+ struct xio_desc_cache *mc;
+ int i;
+ int h_meta_len = 0;
+ int status = -EINVAL;
+
+ for (i = 0; i < MAX_DESC_CACHE; i++) {
+ mc = msock->s_desc_send[i];
+ if (!mc)
+ break;
+ if (mc->cache_sender_cookie == (u64)meta)
+ goto found;
+ }
+
+ mc = make_sender_cache(msock, meta, &i);
+ if (unlikely(!mc))
+ goto done;
+
+ h_meta_len = mc->cache_items * sizeof(struct xio_desc_item) + sizeof(struct xio_desc_cache);
+
+found:
+ status = _desc_send_struct(msock, i, data, h_meta_len, cork);
+
+done:
+ return status;
+}
+
+static
+int desc_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line)
+{
+ struct xio_desc_header header = {};
+ struct xio_desc_cache *mc;
+ int cache_index;
+ int index;
+ int count = 0;
+ int status = 0;
+ bool need_swap = false;
+
+ status = xio_recv_raw(msock, &header, sizeof(header), sizeof(header));
+ _CHECK_STATUS("recv_header");
+
+ if (unlikely(header.h_magic != XIO_DESC_MAGIC)) {
+ need_swap = true;
+ SWAP_FIELD(header.h_magic);
+ if (unlikely(header.h_magic != XIO_DESC_MAGIC)) {
+ XIO_WRN(
+ "#%d called from line %d bad packet header magic = %llx\n",
+ msock->s_debug_nr,
+ line,
+ header.h_magic);
+ status = -ENOMSG;
+ goto err;
+ }
+ SWAP_FIELD(header.h_cookie);
+ SWAP_FIELD(header.h_meta_len);
+ SWAP_FIELD(header.h_index);
+ }
+
+ cache_index = header.h_index;
+ if (cache_index < 0) { /* EOR */
+ goto done;
+ }
+ if (unlikely(cache_index >= MAX_DESC_CACHE - 1)) {
+ XIO_WRN("#%d called from line %d bad cache index %d\n", msock->s_debug_nr, line, cache_index);
+ status = -EBADF;
+ goto err;
+ }
+
+ mc = msock->s_desc_recv[cache_index];
+ if (unlikely(!mc)) {
+ if (unlikely(header.h_meta_len <= 0)) {
+ XIO_WRN("#%d called from line %d missing meta information\n", msock->s_debug_nr, line);
+ status = -ENOMSG;
+ goto err;
+ }
+
+ mc = _brick_block_alloc(0, PAGE_SIZE, line);
+
+ status = xio_recv_raw(msock, mc, header.h_meta_len, header.h_meta_len);
+ if (unlikely(status < 0))
+ brick_block_free(mc, PAGE_SIZE);
+ _CHECK_STATUS("recv_meta");
+
+ if (unlikely(need_swap))
+ swap_mc(mc, header.h_meta_len);
+
+ status = make_recver_cache(mc, meta);
+ if (unlikely(status < 0)) {
+ brick_block_free(mc, PAGE_SIZE);
+ goto err;
+ }
+ msock->s_desc_recv[cache_index] = mc;
+ } else if (unlikely(header.h_meta_len > 0)) {
+ XIO_WRN(
+ "#%d called from line %d has %d unexpected meta bytes\n", msock->s_debug_nr, line, header.h_meta_len);
+ status = -EMSGSIZE;
+ goto err;
+ } else if (unlikely(mc->cache_recver_cookie != (u64)meta)) {
+ XIO_ERR("#%d protocol error %p != %p\n", msock->s_debug_nr, meta, (void *)mc->cache_recver_cookie);
+ dump_meta((void *)mc->cache_recver_cookie);
+ dump_meta(meta);
+ status = -EPROTO;
+ goto err;
+ }
+
+ for (index = 0; index < mc->cache_items; index++) {
+ status = _desc_recv_item(msock, data, mc, index, line);
+ _CHECK_STATUS("recv_cache_item");
+ count++;
+ }
+
+done:
+ if (status >= 0)
+ status = count;
+err:
+ return status;
+}
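+
+/* The resulting exchange, sketched: the first transfer of a given meta
+ * description carries the full field catalogue, later transfers only
+ * reference it by cache index:
+ *
+ *   1st:  [header: h_index = i, h_meta_len > 0][desc cache + items][field data]
+ *   nth:  [header: h_index = i, h_meta_len = 0][field data]
+ *   EOR:  [header: h_index = -1]
+ *
+ * Byte order is detected via h_magic and corrected on the receiver.
+ */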
+
+int xio_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta)
+{
+ return desc_send_struct(msock, data, meta, false);
+}
+
+int _xio_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line)
+{
+ return desc_recv_struct(msock, data, meta, line);
+}
+
+/*********************************************************************/
+
+/* High-level transport of xio structures
+ */
+
+const struct meta xio_cmd_meta[] = {
+ META_INI_SUB(cmd_stamp, struct xio_cmd, xio_timespec_meta),
+ META_INI(cmd_code, struct xio_cmd, FIELD_INT),
+ META_INI(cmd_int1, struct xio_cmd, FIELD_INT),
+ META_INI(cmd_str1, struct xio_cmd, FIELD_STRING),
+ {}
+};
+
+int xio_send_aio(struct xio_socket *msock, struct aio_object *aio)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_AIO,
+ .cmd_int1 = aio->io_id,
+ };
+ int status;
+
+ if (aio->io_rw != 0 && aio->io_data && aio->io_cs_mode < 2)
+ cmd.cmd_code |= CMD_FLAG_HAS_DATA;
+
+ get_lamport(&cmd.cmd_stamp);
+
+ status = desc_send_struct(msock, &cmd, xio_cmd_meta, true);
+ if (status < 0)
+ goto done;
+
+ status = desc_send_struct(msock, aio, xio_aio_user_meta, cmd.cmd_code & CMD_FLAG_HAS_DATA);
+ if (status < 0)
+ goto done;
+
+ if (cmd.cmd_code & CMD_FLAG_HAS_DATA)
+ status = xio_send_compressed(msock, aio->io_data, aio->io_len, xio_net_compress_data, false);
+done:
+ return status;
+}
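+
+/* A complete aio transfer thus consists of up to three corked parts:
+ *
+ *   desc_send_struct(cmd)          CMD_AIO, optionally | CMD_FLAG_HAS_DATA
+ *   desc_send_struct(aio)
+ *   xio_send_compressed(io_data)   only when CMD_FLAG_HAS_DATA is set
+ *
+ * The peer first receives the cmd itself, then calls xio_recv_aio()
+ * below for the rest of the sequence.
+ */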
+
+int xio_recv_aio(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd)
+{
+ int status;
+
+ status = desc_recv_struct(msock, aio, xio_aio_user_meta, __LINE__);
+ if (status < 0)
+ goto done;
+
+ set_lamport(&cmd->cmd_stamp);
+
+ if (cmd->cmd_code & CMD_FLAG_HAS_DATA) {
+ if (!aio->io_data)
+ aio->io_data = brick_block_alloc(0, aio->io_len);
+ status = xio_recv_compressed(msock, aio->io_data, aio->io_len, aio->io_len);
+ if (unlikely(status < 0))
+ XIO_WRN("#%d aio_len = %d, status = %d\n", msock->s_debug_nr, aio->io_len, status);
+ }
+done:
+ return status;
+}
+
+int xio_send_cb(struct xio_socket *msock, struct aio_object *aio)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_CB,
+ .cmd_int1 = aio->io_id,
+ };
+ int status;
+
+ if (aio->io_rw == 0 && aio->io_data && aio->io_cs_mode < 2)
+ cmd.cmd_code |= CMD_FLAG_HAS_DATA;
+
+ get_lamport(&cmd.cmd_stamp);
+
+ status = desc_send_struct(msock, &cmd, xio_cmd_meta, true);
+ if (status < 0)
+ goto done;
+
+ status = desc_send_struct(msock, aio, xio_aio_user_meta, cmd.cmd_code & CMD_FLAG_HAS_DATA);
+ if (status < 0)
+ goto done;
+
+ if (cmd.cmd_code & CMD_FLAG_HAS_DATA)
+ status = xio_send_compressed(msock, aio->io_data, aio->io_len, xio_net_compress_data, false);
+done:
+ return status;
+}
+
+int xio_recv_cb(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd)
+{
+ int status;
+
+ status = desc_recv_struct(msock, aio, xio_aio_user_meta, __LINE__);
+ if (status < 0)
+ goto done;
+
+ set_lamport(&cmd->cmd_stamp);
+
+ if (cmd->cmd_code & CMD_FLAG_HAS_DATA) {
+ if (!aio->io_data) {
+ XIO_WRN("#%d no internal buffer available\n", msock->s_debug_nr);
+ status = -EINVAL;
+ goto done;
+ }
+ status = xio_recv_compressed(msock, aio->io_data, aio->io_len, aio->io_len);
+ }
+done:
+ return status;
+}
+
+/***************** module init stuff ************************/
+
+char *(*xio_translate_hostname)(const char *name) = NULL;
+
+bool xio_net_is_alive;
+
+int __init init_xio_net(void)
+{
+ XIO_INF("init_net()\n");
+ xio_net_is_alive = true;
+ return 0;
+}
+
+void exit_xio_net(void)
+{
+ xio_net_is_alive = false;
+ brick_string_free(id);
+ id = NULL;
+ XIO_INF("exit_net()\n");
+}
diff --git a/include/linux/xio/xio_net.h b/include/linux/xio/xio_net.h
new file mode 100644
index 000000000000..4c000015863f
--- /dev/null
+++ b/include/linux/xio/xio_net.h
@@ -0,0 +1,177 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_NET_H
+#define XIO_NET_H
+
+#include <net/sock.h>
+#include <net/ipconfig.h>
+#include <net/tcp.h>
+
+#include <linux/brick/brick.h>
+
+extern int xio_net_compress_data;
+
+extern int xio_net_default_port;
+extern int xio_net_bind_before_listen;
+extern int xio_net_bind_before_connect;
+
+extern bool xio_net_is_alive;
+
+#define MAX_DESC_CACHE 16
+
+/* The original struct socket has no refcount. This leads to problems
+ * during long-lasting system calls when racing with socket shutdown.
+ *
+ * The original idea of struct xio_socket was just a small wrapper
+ * adding a refcount and some debugging aid.
+ * Later, some buffering was added in order to take advantage of
+ * kernel_sendpage().
+ * Caching of meta description has also been added.
+ *
+ * Notice: we have a slightly restricted parallelism model.
+ * One sender and one receiver thread may work in parallel
+ * on the same socket instance. At the low level, multiple readers
+ * must not run in parallel with each other, nor multiple writers
+ * in parallel with each other. Otherwise, higher-level
+ * protocol sequences would be disturbed anyway.
+ * When needed, you may achieve higher parallelism by doing your own
+ * semaphore locking around xio_{send, recv}_struct() or even longer
+ * sequences of subsets of your high-level protocol.
+ */
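+/* A minimal sketch of such external locking, e.g. with a mutex
+ * (my_mutex and my_meta are assumptions of this example, not part of
+ * this interface):
+ *
+ *   static DEFINE_MUTEX(my_mutex);
+ *
+ *   mutex_lock(&my_mutex);
+ *   status = xio_send_struct(msock, &cmd, xio_cmd_meta);
+ *   if (status >= 0)
+ *           status = xio_send_struct(msock, &data, my_meta);
+ *   mutex_unlock(&my_mutex);
+ *
+ * This keeps the two-part sequence atomic with respect to other
+ * senders on the same socket.
+ */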
+struct xio_socket {
+ struct socket *s_socket;
+
+ u64 s_send_bytes;
+ u64 s_recv_bytes;
+ void *s_buffer;
+ atomic_t s_count;
+ int s_pos;
+ int s_debug_nr;
+ int s_send_abort;
+ int s_recv_abort;
+ int s_send_cnt;
+ int s_recv_cnt;
+ bool s_shutdown_on_err;
+ bool s_alive;
+
+ u8 s_send_proto;
+ u8 s_recv_proto;
+ u16 s_send_flags;
+ u16 s_recv_flags;
+ struct xio_desc_cache *s_desc_send[MAX_DESC_CACHE];
+ struct xio_desc_cache *s_desc_recv[MAX_DESC_CACHE];
+};
+
+struct xio_tcp_params {
+ int ip_tos;
+ int tcp_window_size;
+ int tcp_nodelay;
+ int tcp_timeout;
+ int tcp_keepcnt;
+ int tcp_keepintvl;
+ int tcp_keepidle;
+};
+
+extern struct xio_tcp_params repl_tcp_params;
+extern struct xio_tcp_params device_tcp_params;
+
+enum {
+ CMD_NOP,
+ CMD_NOTIFY,
+ CMD_CONNECT,
+ CMD_GETINFO,
+ CMD_GETENTS,
+ CMD_AIO,
+ CMD_CB,
+ CMD_CONNECT_LOGGER,
+};
+
+#define CMD_FLAG_MASK 255
+#define CMD_FLAG_HAS_DATA 256
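+
+/* The low byte of cmd_code carries the opcode, higher bits carry
+ * flags, e.g.:
+ *
+ *   cmd.cmd_code = CMD_AIO | CMD_FLAG_HAS_DATA;
+ *   ...
+ *   switch (cmd.cmd_code & CMD_FLAG_MASK) { case CMD_AIO: ... }
+ */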
+
+struct xio_cmd {
+ struct timespec cmd_stamp; /* for automatic lamport clock */
+ int cmd_code;
+ int cmd_int1;
+
+ /* int cmd_int2; */
+ /* int cmd_int3; */
+ char *cmd_str1;
+
+ /* char *cmd_str2; */
+ /* char *cmd_str3; */
+};
+
+extern const struct meta xio_cmd_meta[];
+
+extern char *(*xio_translate_hostname)(const char *name);
+
+extern char *my_id(void);
+
+/* Low-level network traffic
+ */
+extern int xio_create_sockaddr(struct sockaddr_storage *addr, const char *spec);
+
+extern int xio_create_socket(
+struct xio_socket *msock,
+struct sockaddr_storage *src_addr,
+struct sockaddr_storage *dst_addr,
+struct xio_tcp_params *params);
+
+extern int xio_accept_socket(
+struct xio_socket *new_msock, struct xio_socket *old_msock, struct xio_tcp_params *params);
+
+extern bool xio_get_socket(struct xio_socket *msock);
+extern void xio_put_socket(struct xio_socket *msock);
+extern void xio_shutdown_socket(struct xio_socket *msock);
+extern bool xio_socket_is_alive(struct xio_socket *msock);
+extern long xio_socket_send_space_available(struct xio_socket *msock);
+
+extern int xio_send_raw(struct xio_socket *msock, const void *buf, int len, bool cork);
+extern int xio_recv_raw(struct xio_socket *msock, void *buf, int minlen, int maxlen);
+
+int xio_send_compressed(struct xio_socket *msock, const void *buf, s32 len, int compress, bool cork);
+int xio_recv_compressed(struct xio_socket *msock, void *buf, int minlen, int maxlen);
+
+/* Mid-level generic field data exchange
+ */
+extern int xio_send_struct(struct xio_socket *msock, const void *data, const struct meta *meta);
+#define xio_recv_struct(_sock_, _data_, _meta_) \
+ ({ \
+ _xio_recv_struct(_sock_, _data_, _meta_, __LINE__); \
+ })
+extern int _xio_recv_struct(struct xio_socket *msock, void *data, const struct meta *meta, int line);
+
+/* High-level transport of xio structures
+ */
+extern int xio_send_dent_list(struct xio_socket *msock, struct list_head *anchor);
+extern int xio_recv_dent_list(struct xio_socket *msock, struct list_head *anchor);
+
+extern int xio_send_aio(struct xio_socket *msock, struct aio_object *aio);
+extern int xio_recv_aio(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd);
+extern int xio_send_cb(struct xio_socket *msock, struct aio_object *aio);
+extern int xio_recv_cb(struct xio_socket *msock, struct aio_object *aio, struct xio_cmd *cmd);
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio_net(void);
+extern void exit_xio_net(void);
+
+#endif
--
2.11.0

2016-12-30 23:02:50

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 19/32] mars: add new module xio_client

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/xio_bricks/xio_client.c | 1083 ++++++++++++++++++++++++++
include/linux/xio/xio_client.h | 105 +++
2 files changed, 1188 insertions(+)
create mode 100644 drivers/staging/mars/xio_bricks/xio_client.c
create mode 100644 include/linux/xio/xio_client.h

diff --git a/drivers/staging/mars/xio_bricks/xio_client.c b/drivers/staging/mars/xio_bricks/xio_client.c
new file mode 100644
index 000000000000..209523378660
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_client.c
@@ -0,0 +1,1083 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/jiffies.h>
+
+#include <linux/xio/xio.h>
+
+/************************ own type definitions ***********************/
+
+#include <linux/xio/xio_client.h>
+
+#define CLIENT_HASH_MAX (PAGE_SIZE / sizeof(struct list_head))
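+
+/* e.g. 4096 / 16 = 256 hash buckets with 4 KiB pages on 64 bit, so
+ * the whole table fits into the single page allocated in
+ * client_output_construct().
+ */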
+
+int xio_client_abort = 10;
+
+int max_client_channels = 1;
+
+int max_client_bulk = 16;
+
+/************************ own helper functions ***********************/
+
+static int thread_count;
+
+static
+void _do_resubmit(struct client_channel *ch)
+{
+ struct client_output *output = ch->output;
+ unsigned long flags;
+
+ spin_lock_irqsave(&output->lock, flags);
+ if (!list_empty(&ch->wait_list)) {
+ struct list_head *first = ch->wait_list.next;
+ struct list_head *last = ch->wait_list.prev;
+ struct list_head *old_start = output->aio_list.next;
+
+#define list_connect __list_del /* the original routine has a misleading name: in reality it is more general */
+ list_connect(&output->aio_list, first);
+ list_connect(last, old_start);
+ INIT_LIST_HEAD(&ch->wait_list);
+ }
+ spin_unlock_irqrestore(&output->lock, flags);
+}
+
+static
+void _kill_thread(struct client_threadinfo *ti, const char *name)
+{
+ struct task_struct *thread = ti->thread;
+
+ if (thread) {
+ XIO_DBG("stopping %s thread\n", name);
+ ti->thread = NULL;
+ brick_thread_stop(thread);
+ }
+}
+
+static
+void _kill_channel(struct client_channel *ch)
+{
+ XIO_DBG("channel = %p\n", ch);
+ if (xio_socket_is_alive(&ch->socket)) {
+ XIO_DBG("shutdown socket\n");
+ xio_shutdown_socket(&ch->socket);
+ }
+ _kill_thread(&ch->receiver, "receiver");
+ if (ch->is_open) {
+ XIO_DBG("close socket\n");
+ xio_put_socket(&ch->socket);
+ }
+ ch->recv_error = 0;
+ ch->is_used = false;
+ ch->is_open = false;
+ ch->is_connected = false;
+ /* Re-submit any waiting requests
+ */
+ _do_resubmit(ch);
+}
+
+static inline
+void _kill_all_channels(struct client_bundle *bundle)
+{
+ int i;
+
+ /* first pass: shutdown in parallel without waiting */
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+ struct client_channel *ch = &bundle->channel[i];
+
+ if (xio_socket_is_alive(&ch->socket)) {
+ XIO_DBG("shutdown socket %d\n", i);
+ xio_shutdown_socket(&ch->socket);
+ }
+ }
+ /* separate pass (may wait) */
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++)
+ _kill_channel(&bundle->channel[i]);
+}
+
+static int receiver_thread(void *data);
+
+static
+int _setup_channel(struct client_bundle *bundle, int ch_nr)
+{
+ struct client_channel *ch = &bundle->channel[ch_nr];
+ struct sockaddr_storage src_sockaddr;
+ struct sockaddr_storage dst_sockaddr;
+ int status;
+
+ ch->ch_nr = ch_nr;
+ if (unlikely(ch->receiver.thread)) {
+ XIO_WRN("receiver thread %d unexpectedly not dead\n", ch_nr);
+ _kill_thread(&ch->receiver, "receiver");
+ }
+
+ status = xio_create_sockaddr(&src_sockaddr, my_id());
+ if (unlikely(status < 0)) {
+ XIO_DBG("no src sockaddr, status = %d\n", status);
+ goto done;
+ }
+
+ status = xio_create_sockaddr(&dst_sockaddr, bundle->host);
+ if (unlikely(status < 0)) {
+ XIO_DBG("no dst sockaddr, status = %d\n", status);
+ goto done;
+ }
+
+ status = xio_create_socket(&ch->socket, &src_sockaddr, &dst_sockaddr, &device_tcp_params);
+ if (unlikely(status < 0)) {
+ XIO_DBG("no socket, status = %d\n", status);
+ goto really_done;
+ }
+ ch->socket.s_shutdown_on_err = true;
+ ch->socket.s_send_abort = xio_client_abort;
+ ch->socket.s_recv_abort = xio_client_abort;
+ ch->is_open = true;
+
+ ch->receiver.thread = brick_thread_create(
+ receiver_thread, ch, "xio_receiver%d.%d.%d", bundle->thread_count, ch_nr, ch->thread_count++);
+ if (unlikely(!ch->receiver.thread)) {
+ XIO_ERR("cannot start receiver thread %d, status = %d\n", ch_nr, status);
+ status = -ENOENT;
+ goto done;
+ }
+ ch->is_used = true;
+
+done:
+ if (status < 0) {
+ XIO_DBG(
+ "cannot connect channel %d to remote host '%s', retrying, status = %d\n",
+ ch_nr,
+ bundle->host ? bundle->host : "NULL",
+ status);
+ _kill_channel(ch);
+ }
+
+really_done:
+ return status;
+}
+
+static
+void _kill_bundle(struct client_bundle *bundle)
+{
+ _kill_thread(&bundle->sender, "sender");
+ _kill_all_channels(bundle);
+}
+
+static
+void _maintain_bundle(struct client_bundle *bundle)
+{
+ int i;
+
+ /* Re-open _any_ failed channel, even old ones.
+ * Reason: the number of channels might change during operation.
+ */
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+ struct client_channel *ch = &bundle->channel[i];
+
+ if (!ch->is_used ||
+ (!ch->recv_error && xio_socket_is_alive(&ch->socket)))
+ continue;
+
+ XIO_DBG("killing channel %d\n", i);
+ _kill_channel(ch);
+ /* Re-setup, including the connect operation, is done later.
+ */
+ }
+}
+
+static
+struct client_channel *_get_channel(struct client_bundle *bundle, int min_channel, int max_channel)
+{
+ struct client_channel *res;
+ long best_space;
+ int best_channel;
+ int i;
+
+ if (unlikely(max_channel <= 0 || max_channel > MAX_CLIENT_CHANNELS))
+ max_channel = MAX_CLIENT_CHANNELS;
+ if (unlikely(min_channel < 0 || min_channel >= max_channel)) {
+ min_channel = max_channel - 1;
+ if (unlikely(min_channel < 0))
+ min_channel = 0;
+ }
+
+ /* Fast path.
+ * Speculate that the next channel is already usable,
+ * and that it has enough room.
+ */
+ best_channel = bundle->old_channel + 1;
+ if (best_channel >= max_channel)
+ best_channel = min_channel;
+ res = &bundle->channel[best_channel];
+ if (res->is_connected && !res->recv_error && xio_socket_is_alive(&res->socket)) {
+ res->current_space = xio_socket_send_space_available(&res->socket);
+ if (res->current_space > (PAGE_SIZE + PAGE_SIZE / 4))
+ goto found;
+ }
+
+ /* Slow path. Do all the tedious work.
+ */
+ _maintain_bundle(bundle);
+
+ res = NULL;
+ best_space = -1;
+ best_channel = -1;
+ for (i = min_channel; i < max_channel; i++) {
+ struct client_channel *ch = &bundle->channel[i];
+ long this_space;
+
+ /* create new channels when necessary */
+ if (unlikely(!ch->is_open)) {
+ int status;
+
+ /* only create one new channel at a time */
+ status = _setup_channel(bundle, i);
+ XIO_DBG("setup channel %d status=%d\n", i, status);
+ if (unlikely(status < 0))
+ continue;
+
+ this_space = xio_socket_send_space_available(&ch->socket);
+ ch->current_space = this_space;
+ /* Always prefer the newly opened channel */
+ res = ch;
+ best_channel = i;
+ break;
+ }
+
+ /* select the best usable channel */
+ this_space = xio_socket_send_space_available(&ch->socket);
+ ch->current_space = this_space;
+ if (this_space > best_space) {
+ best_space = this_space;
+ best_channel = i;
+ res = ch;
+ }
+ }
+
+ if (unlikely(!res)) {
+ XIO_WRN(
+ "cannot setup communication channel '%s' @%s\n",
+ bundle->path,
+ bundle->host);
+ goto done;
+ }
+
+ /* send initial connect command */
+ if (unlikely(!res->is_connected)) {
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_CONNECT,
+ .cmd_str1 = bundle->path,
+ };
+ int status;
+
+ if (strstr(bundle->path, "/replay-"))
+ cmd.cmd_code = CMD_CONNECT_LOGGER;
+
+ status = xio_send_struct(&res->socket, &cmd, xio_cmd_meta);
+ XIO_DBG("send CMD_CONNECT status = %d\n", status);
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "connect '%s' @%s on channel %d failed, status = %d\n",
+ bundle->path,
+ bundle->host,
+ best_channel,
+ status);
+ _kill_channel(res);
+ res = NULL;
+ goto done;
+ }
+ res->is_connected = true;
+ }
+
+found:
+ bundle->old_channel = best_channel;
+
+done:
+ return res;
+}
+
+static
+int _request_info(struct client_channel *ch)
+{
+ struct xio_cmd cmd = {
+ .cmd_code = CMD_GETINFO,
+ };
+ int status;
+
+ status = xio_send_struct(&ch->socket, &cmd, xio_cmd_meta);
+ XIO_DBG("send CMD_GETINFO status = %d\n", status);
+ if (unlikely(status < 0))
+ XIO_DBG("send of getinfo failed, status = %d\n", status);
+ return status;
+}
+
+static int sender_thread(void *data);
+
+static
+int _setup_bundle(struct client_bundle *bundle, const char *str)
+{
+ int status = -ENOMEM;
+
+ _kill_bundle(bundle);
+ brick_string_free(bundle->path);
+
+ bundle->path = brick_strdup(str);
+
+ status = -EINVAL;
+ bundle->host = strchr(bundle->path, '@');
+ if (unlikely(!bundle->host)) {
+ brick_string_free(bundle->path);
+ bundle->path = NULL;
+ XIO_ERR("parameter string '%s' contains no remote specifier with '@'-syntax\n", str);
+ goto done;
+ }
+ *bundle->host++ = '\0';
+
+ bundle->thread_count = thread_count++;
+ bundle->sender.thread = brick_thread_create(sender_thread, bundle, "xio_sender%d", bundle->thread_count);
+ if (unlikely(!bundle->sender.thread)) {
+ XIO_ERR(
+ "cannot start sender thread for '%s' @%s\n",
+ bundle->path,
+ bundle->host);
+ status = -ENOENT;
+ goto done;
+ }
+
+ status = 0;
+
+done:
+ XIO_DBG("status = %d\n", status);
+ return status;
+}
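+
+/* Example of the parameter syntax parsed above: a brick name like
+ * "/mars/resource-foo/data@otherhost" (the concrete path is only an
+ * illustration) is split at the '@' into
+ * path = "/mars/resource-foo/data" and host = "otherhost".
+ */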
+
+/***************** own brick * input * output operations *****************/
+
+static int client_get_info(struct client_output *output, struct xio_info *info)
+{
+ int status;
+
+ output->get_info = true;
+ wake_up_interruptible_all(&output->bundle.sender_event);
+
+ wait_event_interruptible_timeout(output->info_event, output->got_info, 20 * HZ);
+ status = -ETIME;
+ if (output->got_info && info) {
+ output->got_info = false;
+ memcpy(info, &output->info, sizeof(*info));
+ status = 0;
+ }
+
+ return status;
+}
+
+static int client_io_get(struct client_output *output, struct aio_object *aio)
+{
+ int maxlen;
+
+ if (aio->obj_initialized) {
+ obj_get(aio);
+ return aio->io_len;
+ }
+
+ /* Limit transfers to page boundaries.
+ * Currently, this is more restrictive than necessary.
+ * TODO: relax this restriction to improve performance.
+ * This would need efficient support from the server side.
+ */
+ maxlen = PAGE_SIZE - (aio->io_pos & (PAGE_SIZE - 1));
+ if (aio->io_len > maxlen)
+ aio->io_len = maxlen;
+
+ if (!aio->io_data) { /* buffered IO */
+ struct client_aio_aspect *aio_a = client_aio_get_aspect(output->brick, aio);
+
+ if (!aio_a)
+ return -EILSEQ;
+
+ aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+
+ aio_a->do_dealloc = true;
+ aio->io_flags = 0;
+ }
+
+ obj_get_first(aio);
+ return 0;
+}
+
+static void client_io_put(struct client_output *output, struct aio_object *aio)
+{
+ struct client_aio_aspect *aio_a;
+
+ if (!obj_put(aio))
+ goto out_return;
+ aio_a = client_aio_get_aspect(output->brick, aio);
+ if (aio_a && aio_a->do_dealloc)
+ brick_block_free(aio->io_data, aio_a->alloc_len);
+ obj_free(aio);
+out_return:;
+}
+
+static
+void _hash_insert(struct client_output *output, struct client_aio_aspect *aio_a)
+{
+ struct aio_object *aio = aio_a->object;
+ unsigned long flags;
+ int hash_index;
+
+ spin_lock_irqsave(&output->lock, flags);
+ list_del(&aio_a->io_head);
+ list_add_tail(&aio_a->io_head, &output->aio_list);
+ list_del(&aio_a->hash_head);
+ aio->io_id = ++output->last_id;
+ hash_index = aio->io_id % CLIENT_HASH_MAX;
+ list_add_tail(&aio_a->hash_head, &output->hash_table[hash_index]);
+ spin_unlock_irqrestore(&output->lock, flags);
+}
+
+static void client_io_io(struct client_output *output, struct aio_object *aio)
+{
+ struct client_aio_aspect *aio_a;
+ int error = -EINVAL;
+
+ aio_a = client_aio_get_aspect(output->brick, aio);
+ if (unlikely(!aio_a))
+ goto error;
+
+ while (output->brick->max_flying > 0 && atomic_read(&output->fly_count) > output->brick->max_flying)
+ brick_msleep(1000 * 2 / HZ);
+
+ if (!output->brick->power.on_led)
+ XIO_ERR("IO submission on dead instance\n");
+
+ atomic_inc(&xio_global_io_flying);
+ atomic_inc(&output->fly_count);
+ obj_get(aio);
+
+ aio_a->submit_jiffies = jiffies;
+ _hash_insert(output, aio_a);
+
+ wake_up_interruptible_all(&output->bundle.sender_event);
+
+ goto out_return;
+error:
+ XIO_ERR("IO error = %d\n", error);
+ SIMPLE_CALLBACK(aio, error);
+ client_io_put(output, aio);
+out_return:;
+}
+
+static
+int receiver_thread(void *data)
+{
+ struct client_channel *ch = data;
+ struct client_output *output = ch->output;
+ int status = 0;
+
+ while (!brick_thread_should_stop()) {
+ struct xio_cmd cmd = {};
+ struct list_head *tmp;
+ struct client_aio_aspect *aio_a = NULL;
+ struct aio_object *aio = NULL;
+ unsigned long flags;
+
+ if (ch->recv_error) {
+ /* The protocol may be out of sync.
+ * Consume some data to avoid distributed deadlocks.
+ */
+ (void)xio_recv_raw(&ch->socket, &cmd, 0, sizeof(cmd));
+ brick_msleep(100);
+ status = ch->recv_error;
+ continue;
+ }
+
+ status = xio_recv_struct(&ch->socket, &cmd, xio_cmd_meta);
+ if (status <= 0) {
+ if (!xio_socket_is_alive(&ch->socket)) {
+ XIO_DBG("socket is dead\n");
+ brick_msleep(1000);
+ continue;
+ }
+ goto done;
+ }
+
+ switch (cmd.cmd_code & CMD_FLAG_MASK) {
+ case CMD_NOTIFY:
+ local_trigger();
+ break;
+ case CMD_CONNECT:
+ if (cmd.cmd_int1 < 0) {
+ status = cmd.cmd_int1;
+ XIO_ERR(
+ "remote brick connect '%s' @%s failed, remote status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ goto done;
+ }
+ break;
+ case CMD_CB:
+ {
+ int hash_index = cmd.cmd_int1 % CLIENT_HASH_MAX;
+
+ spin_lock_irqsave(&output->lock, flags);
+ for (tmp = output->hash_table[hash_index].next;
+ tmp != &output->hash_table[hash_index];
+ tmp = tmp->next) {
+ struct aio_object *tmp_aio;
+
+ aio_a = container_of(tmp, struct client_aio_aspect, hash_head);
+ tmp_aio = aio_a->object;
+ CHECK_PTR(tmp_aio, err);
+ if (tmp_aio->io_id != cmd.cmd_int1)
+ continue;
+ aio = tmp_aio;
+ list_del_init(&aio_a->hash_head);
+ list_del_init(&aio_a->io_head);
+ break;
+
+err:
+ spin_unlock_irqrestore(&output->lock, flags);
+ status = -EBADR;
+ goto done;
+ }
+ spin_unlock_irqrestore(&output->lock, flags);
+
+ if (unlikely(!aio)) {
+ XIO_WRN(
+ "got unknown callback id %d on '%s' @%s\n",
+ cmd.cmd_int1,
+ output->bundle.path,
+ output->bundle.host);
+ /* try to consume the corresponding payload */
+ aio = client_alloc_aio(output->brick);
+ status = xio_recv_cb(&ch->socket, aio, &cmd);
+ obj_free(aio);
+ goto done;
+ }
+
+ status = xio_recv_cb(&ch->socket, aio, &cmd);
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "interrupted data transfer during callback on '%s' @%s, status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ _hash_insert(output, aio_a);
+ goto done;
+ }
+
+ if (aio->_object_cb.cb_error < 0)
+ XIO_DBG("ERROR %d\n", aio->_object_cb.cb_error);
+ SIMPLE_CALLBACK(aio, aio->_object_cb.cb_error);
+
+ client_io_put(output, aio);
+
+ atomic_dec(&output->fly_count);
+ atomic_dec(&xio_global_io_flying);
+ break;
+ }
+ case CMD_GETINFO:
+ status = xio_recv_struct(&ch->socket, &output->info, xio_info_meta);
+ if (status < 0) {
+ XIO_WRN(
+ "got bad info from remote '%s' @%s, status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ goto done;
+ }
+ output->got_info = true;
+ wake_up_interruptible_all(&output->info_event);
+ break;
+ default:
+ XIO_ERR(
+ "got bad command %d from remote '%s' @%s, terminating.\n",
+ cmd.cmd_code,
+ output->bundle.path,
+ output->bundle.host);
+ status = -EBADR;
+ goto done;
+ }
+done:
+ brick_string_free(cmd.cmd_str1);
+ if (unlikely(status < 0)) {
+ if (!ch->recv_error) {
+ XIO_DBG("signalling recv_error = %d\n", status);
+ ch->recv_error = status;
+ }
+ brick_msleep(100);
+ }
+ /* wake up sender in any case */
+ wake_up_interruptible_all(&output->bundle.sender_event);
+ }
+
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "receiver thread '%s' @%s terminated with status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ }
+
+ xio_shutdown_socket(&ch->socket);
+ return status;
+}
+
+static
+void _do_timeout(struct client_output *output, struct list_head *anchor, int *rounds, bool force)
+{
+ struct client_brick *brick = output->brick;
+ struct list_head *tmp;
+ struct list_head *next;
+ LIST_HEAD(tmp_list);
+ long io_timeout = brick->power.io_timeout;
+ unsigned long flags;
+
+ if (list_empty(anchor))
+ goto out_return;
+ /* When io_timeout is 0, use the global default.
+ * When io_timeout is negative, no timeout will occur.
+ * Exception: when the brick is forcefully shutting down.
+ */
+ if (!io_timeout)
+ io_timeout = global_net_io_timeout;
+
+ if (!xio_net_is_alive || !brick->power.button)
+ force = true;
+
+ if (!force && io_timeout <= 0)
+ goto out_return;
+ io_timeout *= HZ;
+
+ spin_lock_irqsave(&output->lock, flags);
+ for (tmp = anchor->next, next = tmp->next; tmp != anchor; tmp = next, next = tmp->next) {
+ struct client_aio_aspect *aio_a;
+
+ aio_a = container_of(tmp, struct client_aio_aspect, io_head);
+
+ if (!force &&
+ !time_is_before_jiffies(aio_a->submit_jiffies + io_timeout)) {
+ continue;
+ }
+
+ list_del_init(&aio_a->hash_head);
+ list_del_init(&aio_a->io_head);
+ list_add_tail(&aio_a->tmp_head, &tmp_list);
+ }
+ spin_unlock_irqrestore(&output->lock, flags);
+
+ while (!list_empty(&tmp_list)) {
+ struct client_aio_aspect *aio_a;
+ struct aio_object *aio;
+
+ tmp = tmp_list.next;
+ list_del_init(tmp);
+ aio_a = container_of(tmp, struct client_aio_aspect, tmp_head);
+ aio = aio_a->object;
+
+ if (unlikely(!(*rounds)++)) {
+ XIO_WRN(
+ "'%s' @%s timeout after %ld: signalling IO error at pos = %lld len = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ io_timeout,
+ aio->io_pos,
+ aio->io_len);
+ }
+
+ atomic_inc(&output->timeout_count);
+
+ SIMPLE_CALLBACK(aio, -ESTALE);
+
+ client_io_put(output, aio);
+
+ atomic_dec(&output->fly_count);
+ atomic_dec(&xio_global_io_flying);
+ }
+out_return:;
+}
+
+static
+void _do_timeout_all(struct client_output *output, bool force)
+{
+ int rounds = 0;
+ int i;
+
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+ struct client_channel *ch = &output->bundle.channel[i];
+
+ if (!ch->is_used)
+ continue;
+ _do_timeout(output, &ch->wait_list, &rounds, force);
+ }
+ _do_timeout(output, &output->aio_list, &rounds, force);
+ if (unlikely(rounds > 0)) {
+ XIO_WRN(
+ "'%s' @%s had %d timeouts, force = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ rounds,
+ force);
+ }
+}
+
+static int sender_thread(void *data)
+{
+ struct client_bundle *bundle = data;
+ struct client_output *output = container_of(bundle, struct client_output, bundle);
+ struct client_brick *brick = output->brick;
+ struct client_channel *ch = NULL;
+ bool do_timeout = false;
+ int ch_skip = max_client_bulk;
+ int status = -ESHUTDOWN;
+ unsigned long flags;
+
+ while (!brick_thread_should_stop()) {
+ struct list_head *tmp = NULL;
+ struct client_aio_aspect *aio_a;
+ struct aio_object *aio;
+ int min_nr;
+ int max_nr;
+
+ /* Timeout handling is rather expensive, so don't do it too often */
+ if (do_timeout) {
+ do_timeout = false;
+ _maintain_bundle(&output->bundle);
+ _do_timeout_all(output, false);
+ }
+
+ wait_event_interruptible_timeout(
+ output->bundle.sender_event,
+ !list_empty(&output->aio_list) ||
+ output->get_info,
+ 2 * HZ);
+
+ if (output->get_info) {
+ ch = _get_channel(bundle, 0, 1);
+ if (unlikely(!ch)) {
+ do_timeout = true;
+ brick_msleep(1000);
+ continue;
+ }
+ status = _request_info(ch);
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "cannot send info request '%s' @%s, status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ do_timeout = true;
+ brick_msleep(1000);
+ continue;
+ }
+ output->get_info = false;
+ }
+
+ /* Grab the next aio from the queue
+ */
+ spin_lock_irqsave(&output->lock, flags);
+ tmp = output->aio_list.next;
+ if (tmp == &output->aio_list) {
+ spin_unlock_irqrestore(&output->lock, flags);
+ XIO_DBG("empty %d %d\n", output->get_info, brick_thread_should_stop());
+ do_timeout = true;
+ continue;
+ }
+ list_del_init(tmp);
+ /* notice: hash_head remains in its list! */
+ spin_unlock_irqrestore(&output->lock, flags);
+
+ aio_a = container_of(tmp, struct client_aio_aspect, io_head);
+ aio = aio_a->object;
+
+ if (brick->limit_mode) {
+ int amount = 0;
+
+ if (aio->io_cs_mode < 2)
+ amount = (aio->io_len - 1) / 1024 + 1;
+ rate_limit_sleep(&client_limiter, amount);
+ }
+
+ /* try to spread reads over multiple channels.... */
+ min_nr = 0;
+ max_nr = max_client_channels;
+ if (!aio->io_rw) {
+ /* optionally separate reads from writes */
+ if (brick->separate_reads && max_nr > 1)
+ min_nr = 1;
+ } else if (!brick->allow_permuting_writes) {
+ max_nr = 1;
+ }
+ if (!ch || ch->recv_error ||
+ !xio_socket_is_alive(&ch->socket))
+ do_timeout = true;
+ if (do_timeout || ch->ch_nr >= max_nr || --ch_skip < 0) {
+ ch = _get_channel(bundle, min_nr, max_nr);
+ if (unlikely(!ch)) {
+ /* notice: this will re-assign hash_head without harm */
+ _hash_insert(output, aio_a);
+ brick_msleep(1000);
+ continue;
+ }
+ /* estimate: add some headroom for overhead */
+ ch_skip = ch->current_space / PAGE_SIZE +
+ ch->current_space / (PAGE_SIZE * 8);
+ if (ch_skip > max_client_bulk)
+ ch_skip = max_client_bulk;
+ }
+
+ spin_lock_irqsave(&output->lock, flags);
+ list_add(tmp, &ch->wait_list);
+ /* notice: hash_head is already there! */
+ spin_unlock_irqrestore(&output->lock, flags);
+
+ status = xio_send_aio(&ch->socket, aio);
+ if (unlikely(status < 0)) {
+ _hash_insert(output, aio_a);
+ do_timeout = true;
+ ch = NULL;
+ /* retry submission on the next occasion. */
+ XIO_WRN(
+ "aio send '%s' @%s failed, status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+
+ brick_msleep(100);
+ continue;
+ }
+ }
+
+ if (unlikely(status < 0)) {
+ XIO_WRN(
+ "sender thread '%s' @%s terminated with status = %d\n",
+ output->bundle.path,
+ output->bundle.host,
+ status);
+ }
+
+ _kill_all_channels(bundle);
+
+ /* Signal error on all pending IO requests.
+ * We have no other chance (except probably delaying
+ * this until destruction which is probably not what
+ * we want).
+ */
+ _do_timeout_all(output, true);
+ wake_up_interruptible_all(&output->bundle.sender_event);
+ XIO_DBG("sender terminated\n");
+ return status;
+}
+
+static int client_switch(struct client_brick *brick)
+{
+ struct client_output *output = brick->outputs[0];
+ int status = 0;
+
+ if (brick->power.button) {
+ if (brick->power.on_led)
+ goto done;
+ xio_set_power_off_led((void *)brick, false);
+ status = _setup_bundle(&output->bundle, brick->brick_name);
+ if (likely(status >= 0)) {
+ output->get_info = true;
+ brick->connection_state = 1;
+ xio_set_power_on_led((void *)brick, true);
+ }
+ } else {
+ if (brick->power.off_led)
+ goto done;
+ xio_set_power_on_led((void *)brick, false);
+ _kill_bundle(&output->bundle);
+ _do_timeout_all(output, true);
+ output->got_info = false;
+ brick->connection_state = 0;
+ xio_set_power_off_led((void *)brick, !output->bundle.sender.thread);
+ }
+done:
+ return status;
+}
+
+/*************** informational * statistics **************/
+
+static
+char *client_statistics(struct client_brick *brick, int verbose)
+{
+ struct client_output *output = brick->outputs[0];
+ char *res = brick_string_alloc(1024);
+ int socket_count = 0;
+ int i;
+
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+ struct client_channel *ch = &output->bundle.channel[i];
+
+ if (xio_socket_is_alive(&ch->socket))
+ socket_count++;
+ }
+ snprintf(
+ res, 1024,
+ "socket_count = %d max_flying = %d io_timeout = %d | timeout_count = %d fly_count = %d\n",
+ socket_count,
+ brick->max_flying,
+ brick->power.io_timeout,
+ atomic_read(&output->timeout_count),
+ atomic_read(&output->fly_count));
+
+ return res;
+}
+
+static
+void client_reset_statistics(struct client_brick *brick)
+{
+ struct client_output *output = brick->outputs[0];
+
+ atomic_set(&output->timeout_count, 0);
+}
+
+/*************** object * aspect constructors * destructors **************/
+
+static int client_aio_aspect_init_fn(struct generic_aspect *_ini)
+{
+ struct client_aio_aspect *ini = (void *)_ini;
+
+ INIT_LIST_HEAD(&ini->io_head);
+ INIT_LIST_HEAD(&ini->hash_head);
+ INIT_LIST_HEAD(&ini->tmp_head);
+ return 0;
+}
+
+static void client_aio_aspect_exit_fn(struct generic_aspect *_ini)
+{
+ struct client_aio_aspect *ini = (void *)_ini;
+
+ CHECK_HEAD_EMPTY(&ini->io_head);
+ CHECK_HEAD_EMPTY(&ini->hash_head);
+}
+
+XIO_MAKE_STATICS(client);
+
+/********************* brick constructors * destructors *******************/
+
+static int client_brick_construct(struct client_brick *brick)
+{
+ return 0;
+}
+
+static int client_output_construct(struct client_output *output)
+{
+ int i;
+
+ output->hash_table = brick_block_alloc(0, PAGE_SIZE);
+
+ for (i = 0; i < CLIENT_HASH_MAX; i++)
+ INIT_LIST_HEAD(&output->hash_table[i]);
+
+ for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+ struct client_channel *ch = &output->bundle.channel[i];
+
+ ch->output = output;
+ INIT_LIST_HEAD(&ch->wait_list);
+ }
+
+ init_waitqueue_head(&output->bundle.sender_event);
+
+ spin_lock_init(&output->lock);
+ INIT_LIST_HEAD(&output->aio_list);
+ init_waitqueue_head(&output->info_event);
+ return 0;
+}
+
+static int client_output_destruct(struct client_output *output)
+{
+ brick_string_free(output->bundle.path);
+ output->bundle.path = NULL;
+ brick_block_free(output->hash_table, PAGE_SIZE);
+ return 0;
+}
+
+/************************ static structs ***********************/
+
+static struct client_brick_ops client_brick_ops = {
+ .brick_switch = client_switch,
+ .brick_statistics = client_statistics,
+ .reset_statistics = client_reset_statistics,
+};
+
+static struct client_output_ops client_output_ops = {
+ .xio_get_info = client_get_info,
+ .aio_get = client_io_get,
+ .aio_put = client_io_put,
+ .aio_io = client_io_io,
+};
+
+const struct client_input_type client_input_type = {
+ .type_name = "client_input",
+ .input_size = sizeof(struct client_input),
+};
+
+static const struct client_input_type *client_input_types[] = {
+ &client_input_type,
+};
+
+const struct client_output_type client_output_type = {
+ .type_name = "client_output",
+ .output_size = sizeof(struct client_output),
+ .master_ops = &client_output_ops,
+ .output_construct = &client_output_construct,
+ .output_destruct = &client_output_destruct,
+};
+
+static const struct client_output_type *client_output_types[] = {
+ &client_output_type,
+};
+
+const struct client_brick_type client_brick_type = {
+ .type_name = "client_brick",
+ .brick_size = sizeof(struct client_brick),
+ .max_inputs = 0,
+ .max_outputs = 1,
+ .master_ops = &client_brick_ops,
+ .aspect_types = client_aspect_types,
+ .default_input_types = client_input_types,
+ .default_output_types = client_output_types,
+ .brick_construct = &client_brick_construct,
+};
+
+/***************** module init stuff ************************/
+
+struct rate_limiter client_limiter = {
+ /* Let all be zero */
+};
+
+int global_net_io_timeout = 30;
+
+module_param_named(net_io_timeout, global_net_io_timeout, int, 0);
+
+int __init init_xio_client(void)
+{
+ XIO_INF("init_client()\n");
+ _client_brick_type = (void *)&client_brick_type;
+ return client_register_brick_type();
+}
+
+void exit_xio_client(void)
+{
+ XIO_INF("exit_client()\n");
+ client_unregister_brick_type();
+}
diff --git a/include/linux/xio/xio_client.h b/include/linux/xio/xio_client.h
new file mode 100644
index 000000000000..5accb4898adc
--- /dev/null
+++ b/include/linux/xio/xio_client.h
@@ -0,0 +1,105 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef XIO_CLIENT_H
+#define XIO_CLIENT_H
+
+#include <linux/xio/xio_net.h>
+#include <linux/brick/lib_limiter.h>
+
+extern struct rate_limiter client_limiter;
+extern int global_net_io_timeout;
+extern int xio_client_abort;
+extern int max_client_channels;
+extern int max_client_bulk;
+
+#define MAX_CLIENT_CHANNELS 4
+
+struct client_aio_aspect {
+ GENERIC_ASPECT(aio);
+ struct list_head io_head;
+ struct list_head hash_head;
+ struct list_head tmp_head;
+ unsigned long submit_jiffies;
+ int alloc_len;
+ bool do_dealloc;
+};
+
+struct client_brick {
+ XIO_BRICK(client);
+ /* tunables */
+ int max_flying; /* limit on parallelism */
+ bool limit_mode;
+ bool allow_permuting_writes;
+ bool separate_reads;
+
+ /* readonly from outside */
+ int connection_state; /* 0 = switched off, 1 = not connected, 2 = connected */
+};
+
+struct client_input {
+ XIO_INPUT(client);
+};
+
+struct client_threadinfo {
+ struct task_struct *thread;
+};
+
+struct client_channel {
+ struct xio_socket socket;
+ struct client_threadinfo receiver;
+ struct list_head wait_list;
+ struct client_output *output;
+ long current_space;
+ int thread_count;
+ int recv_error;
+ int ch_nr;
+ bool is_used;
+ bool is_open;
+ bool is_connected;
+};
+
+struct client_bundle {
+ char *host;
+ char *path;
+ int thread_count;
+ int old_channel;
+
+ wait_queue_head_t sender_event;
+ struct client_threadinfo sender;
+ struct client_channel channel[MAX_CLIENT_CHANNELS];
+};
+
+struct client_output {
+ XIO_OUTPUT(client);
+ atomic_t fly_count;
+ atomic_t timeout_count;
+ spinlock_t lock;
+ struct list_head aio_list;
+ int last_id;
+ struct client_bundle bundle;
+ struct xio_info info;
+
+ wait_queue_head_t info_event;
+ bool get_info;
+ bool got_info;
+ struct list_head *hash_table;
+};
+
+XIO_TYPES(client);
+
+#endif
--
2.11.0

2016-12-30 23:03:51

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 09/32] mars: add new module lib_rank

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/lib/lib_rank.c | 87 +++++++++++++++++++++++
include/linux/brick/lib_rank.h | 136 ++++++++++++++++++++++++++++++++++++
2 files changed, 223 insertions(+)
create mode 100644 drivers/staging/mars/lib/lib_rank.c
create mode 100644 include/linux/brick/lib_rank.h

diff --git a/drivers/staging/mars/lib/lib_rank.c b/drivers/staging/mars/lib/lib_rank.c
new file mode 100644
index 000000000000..6327479039b6
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_rank.c
@@ -0,0 +1,87 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/brick/lib_rank.h>
+
+void ranking_compute(struct rank_data *rkd, const struct rank_info rki[], int x)
+{
+ int points = 0;
+ int i;
+
+ for (i = 0; ; i++) {
+ int x0;
+ int x1;
+ int y0;
+ int y1;
+
+ x0 = rki[i].rki_x;
+ if (x < x0)
+ break;
+
+ x1 = rki[i + 1].rki_x;
+
+ if (unlikely(x1 == RKI_DUMMY)) {
+ points = rki[i].rki_y;
+ break;
+ }
+
+ if (x > x1)
+ continue;
+
+ y0 = rki[i].rki_y;
+ y1 = rki[i + 1].rki_y;
+
+ /* linear interpolation */
+ points = ((long long)(x - x0) * (long long)(y1 - y0)) / (x1 - x0) + y0;
+ break;
+ }
+ rkd->rkd_tmp += points;
+}
+
+int ranking_select(struct rank_data rkd[], int rkd_count)
+{
+ int res = -1;
+ long long max = LLONG_MIN / 2;
+ int i;
+
+ for (i = 0; i < rkd_count; i++) {
+ struct rank_data *tmp = &rkd[i];
+ long long rest = tmp->rkd_current_points;
+
+ if (rest <= 0)
+ continue;
+ /* rest -= tmp->rkd_got; */
+ if (rest > max) {
+ max = rest;
+ res = i;
+ }
+ }
+ /* Prevent underflow in the long term
+ * and reset the "clocks" after each round of
+ * weighted round-robin selection.
+ */
+ if (max < 0 && res >= 0) {
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_got += max;
+ }
+ return res;
+}
diff --git a/include/linux/brick/lib_rank.h b/include/linux/brick/lib_rank.h
new file mode 100644
index 000000000000..fa18fdf15597
--- /dev/null
+++ b/include/linux/brick/lib_rank.h
@@ -0,0 +1,136 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#ifndef LIB_RANK_H
+#define LIB_RANK_H
+
+/* Generic round-robin scheduler based on ranking information.
+ */
+
+#define RKI_DUMMY INT_MIN
+
+struct rank_info {
+ int rki_x;
+ int rki_y;
+};
+
+struct rank_data {
+ /* public readonly */
+ long long rkd_current_points;
+
+ /* private */
+ long long rkd_tmp;
+ long long rkd_got;
+};
+
+/* Ranking phase.
+ *
+ * Calls should follow the following usage pattern:
+ *
+ * ranking_start(...);
+ * for (...) {
+ * ranking_compute(&rkd[this_time], ...);
+ * // usually you need at least 1 call for each rkd[] element,
+ * // but you can call more often to include ranking information
+ * // from many different sources.
+ * // Note: instead / additionally, you may also use
+ * // ranking_add() or ranking_override().
+ * }
+ * ranking_stop(...);
+ *
+ * => now the new ranking values are computed and already active
+ * for the round-robin ranking_select() mechanism described below.
+ *
+ * Important: the rki[] array describes a ranking function at some
+ * example points (x_i, y_i) which must be ordered according to x_i
+ * in ascending order. And, of course, you need to supply at least
+ * two sample points (otherwise a linear function cannot
+ * be described).
+ * The array _must_ always end with a dummy record where the x_i has the
+ * value RKI_DUMMY.
+ */
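+
+/* Example (illustrative only): a ranking function falling from
+ * y = 100 at x = 0 to y = 10 at x = 1000, and constant afterwards:
+ *
+ *   static const struct rank_info example_rki[] = {
+ *       { .rki_x = 0,         .rki_y = 100 },
+ *       { .rki_x = 1000,      .rki_y = 10  },
+ *       { .rki_x = RKI_DUMMY, .rki_y = 0   },
+ *   };
+ *
+ * ranking_compute(&rkd[i], example_rki, 500) then adds
+ * (500 - 0) * (10 - 100) / (1000 - 0) + 100 = 55 points to rkd[i].
+ */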
+
+extern inline
+void ranking_start(struct rank_data rkd[], int rkd_count)
+{
+ int i;
+
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_tmp = 0;
+}
+
+extern void ranking_compute(struct rank_data *rkd, const struct rank_info rki[], int x);
+
+/* This may be used to (exceptionally) add some extra salt...
+ */
+extern inline
+void ranking_add(struct rank_data *rkd, int y)
+{
+ rkd->rkd_tmp += y;
+}
+
+/* This may be used to (exceptionally) override certain ranking values.
+ */
+extern inline
+void ranking_override(struct rank_data *rkd, int y)
+{
+ rkd->rkd_tmp = y;
+}
+
+extern inline
+void ranking_stop(struct rank_data rkd[], int rkd_count)
+{
+ int i;
+
+ for (i = 0; i < rkd_count; i++)
+ rkd[i].rkd_current_points = rkd[i].rkd_tmp;
+}
+
+/* This is a round-robin scheduler taking its weights
+ * from the previous ranking phase (the more ranking points,
+ * the more frequently a candidate will be selected).
+ *
+ * Typical usage pattern (independent from the above ranking phase
+ * usage pattern):
+ *
+ * while (__there_is_work_to_be_done(...)) {
+ * int winner = ranking_select(...);
+ * if (winner >= 0) {
+ * __do_something(winner);
+ * ranking_select_done(..., winner, 1); // or higher, winpoints >= 1 must hold
+ * }
+ * ...
+ * }
+ *
+ */
+
+extern int ranking_select(struct rank_data rkd[], int rkd_count);
+
+extern inline
+void ranking_select_done(struct rank_data rkd[], int winner, int win_points)
+{
+ if (winner >= 0) {
+ if (win_points < 1)
+ win_points = 1;
+ rkd[winner].rkd_got += win_points;
+ }
+}
+
+#endif
--
2.11.0

2016-12-30 23:04:09

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 07/32] mars: add new module lib_pairing_heap

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
include/linux/brick/lib_pairing_heap.h | 109 +++++++++++++++++++++++++++++++++
1 file changed, 109 insertions(+)
create mode 100644 include/linux/brick/lib_pairing_heap.h

diff --git a/include/linux/brick/lib_pairing_heap.h b/include/linux/brick/lib_pairing_heap.h
new file mode 100644
index 000000000000..9456e9ea348c
--- /dev/null
+++ b/include/linux/brick/lib_pairing_heap.h
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef PAIRING_HEAP_H
+#define PAIRING_HEAP_H
+
+/* Algorithm: see http://en.wikipedia.org/wiki/Pairing_heap
+ * This is just an efficient translation from recursive to iterative form.
+ *
+ * Note: find_min() is so trivial that we don't implement it.
+ */
+
+/* generic version: KEYDEF is kept separate, allowing you to
+ * embed this structure into other container structures already
+ * possessing some key (just provide an empty KEYDEF in this case).
+ */
+#define _PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYDEF) \
+ \
+struct pairing_heap_##KEYTYPE { \
+ KEYDEF \
+ struct pairing_heap_##KEYTYPE *next; \
+ struct pairing_heap_##KEYTYPE *subheaps; \
+}
+
+/* less generic version: define the key inside.
+ */
+#define PAIRING_HEAP_TYPEDEF(KEYTYPE) \
+ _PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYTYPE key;)
+
+/* generic methods: allow arbitrary CMP() functions.
+ */
+#define _PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE, CMP) \
+ \
+_STATIC \
+struct pairing_heap_##KEYTYPE *_ph_merge_##KEYTYPE( \
+struct pairing_heap_##KEYTYPE *heap1, struct pairing_heap_##KEYTYPE *heap2)\
+{ \
+ if (!heap1) \
+ return heap2; \
+ if (!heap2) \
+ return heap1; \
+ if (CMP(heap1, heap2) < 0) { \
+ heap2->next = heap1->subheaps; \
+ heap1->subheaps = heap2; \
+ return heap1; \
+ } \
+ heap1->next = heap2->subheaps; \
+ heap2->subheaps = heap1; \
+ return heap2; \
+} \
+ \
+_STATIC \
+void ph_insert_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap, struct pairing_heap_##KEYTYPE *new)\
+{ \
+ new->next = NULL; \
+ new->subheaps = NULL; \
+ *heap = _ph_merge_##KEYTYPE(*heap, new); \
+} \
+ \
+_STATIC \
+void ph_delete_min_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap) \
+{ \
+ struct pairing_heap_##KEYTYPE *tmplist = NULL; \
+ struct pairing_heap_##KEYTYPE *ptr; \
+ struct pairing_heap_##KEYTYPE *next; \
+ struct pairing_heap_##KEYTYPE *res; \
+ if (!*heap) { \
+ return; \
+ } \
+ for (ptr = (*heap)->subheaps; ptr; ptr = next) { \
+ struct pairing_heap_##KEYTYPE *p2 = ptr->next; \
+ next = p2; \
+ if (p2) { \
+ next = p2->next; \
+ ptr = _ph_merge_##KEYTYPE(ptr, p2); \
+ } \
+ ptr->next = tmplist; \
+ tmplist = ptr; \
+ } \
+ res = NULL; \
+ for (ptr = tmplist; ptr; ptr = next) { \
+ next = ptr->next; \
+ res = _ph_merge_##KEYTYPE(res, ptr); \
+ } \
+ *heap = res; \
+}
+
+/* some default CMP() function */
+#define PAIRING_HEAP_COMPARE(a, b) ((a)->key < (b)->key ? -1 : ((a)->key > (b)->key ? 1 : 0))
+
+/* less generic version: use the default CMP() function */
+#define PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE) \
+ _PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE, PAIRING_HEAP_COMPARE)
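+
+/* Example instantiation (illustrative only): a min-heap keyed by int.
+ *
+ *   PAIRING_HEAP_TYPEDEF(int);
+ *   PAIRING_HEAP_FUNCTIONS(static, int);
+ *
+ * This yields struct pairing_heap_int together with the static helpers
+ * ph_insert_int() and ph_delete_min_int(). Since find_min() is trivial,
+ * the current minimum is simply the root pointer itself.
+ */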
+
+#endif
--
2.11.0

2016-12-30 23:04:22

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 06/32] mars: add new module brick

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/brick.c | 723 +++++++++++++++++++++++++++++++++++++++++++
include/linux/brick/brick.h | 620 +++++++++++++++++++++++++++++++++++++
2 files changed, 1343 insertions(+)
create mode 100644 drivers/staging/mars/brick.c
create mode 100644 include/linux/brick/brick.h

diff --git a/drivers/staging/mars/brick.c b/drivers/staging/mars/brick.c
new file mode 100644
index 000000000000..be741e896fc9
--- /dev/null
+++ b/drivers/staging/mars/brick.c
@@ -0,0 +1,723 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#define _STRATEGY
+
+#include <linux/brick/brick.h>
+#include <linux/brick/brick_mem.h>
+
+/************************************************************/
+
+/* init / exit functions */
+
+void _generic_output_init(
+struct generic_brick *brick, const struct generic_output_type *type, struct generic_output *output)
+{
+ output->brick = brick;
+ output->type = type;
+ output->ops = type->master_ops;
+ output->nr_connected = 0;
+ INIT_LIST_HEAD(&output->output_head);
+}
+
+void _generic_output_exit(struct generic_output *output)
+{
+ list_del_init(&output->output_head);
+ output->brick = NULL;
+ output->type = NULL;
+ output->ops = NULL;
+ output->nr_connected = 0;
+}
+
+int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick)
+{
+ brick->aspect_context.brick_index = get_brick_nr();
+ brick->type = type;
+ brick->ops = type->master_ops;
+ brick->nr_inputs = 0;
+ brick->nr_outputs = 0;
+ brick->power.off_led = true;
+ init_waitqueue_head(&brick->power.event);
+ INIT_LIST_HEAD(&brick->tmp_head);
+ return 0;
+}
+
+void generic_brick_exit(struct generic_brick *brick)
+{
+ list_del_init(&brick->tmp_head);
+ brick->type = NULL;
+ brick->ops = NULL;
+ brick->nr_inputs = 0;
+ brick->nr_outputs = 0;
+ put_brick_nr(brick->aspect_context.brick_index);
+}
+
+int generic_input_init(
+struct generic_brick *brick, int index, const struct generic_input_type *type, struct generic_input *input)
+{
+ if (index < 0 || index >= brick->type->max_inputs)
+ return -EINVAL;
+ if (brick->inputs[index])
+ return -EEXIST;
+ input->brick = brick;
+ input->type = type;
+ input->connect = NULL;
+ INIT_LIST_HEAD(&input->input_head);
+ brick->inputs[index] = input;
+ brick->nr_inputs++;
+ return 0;
+}
+
+void generic_input_exit(struct generic_input *input)
+{
+ list_del_init(&input->input_head);
+ input->brick = NULL;
+ input->type = NULL;
+ input->connect = NULL;
+}
+
+int generic_output_init(
+struct generic_brick *brick, int index, const struct generic_output_type *type, struct generic_output *output)
+{
+ if (index < 0 || index >= brick->type->max_outputs)
+ return -EINVAL;
+ if (brick->outputs[index])
+ return -EEXIST;
+ _generic_output_init(brick, type, output);
+ brick->outputs[index] = output;
+ brick->nr_outputs++;
+ return 0;
+}
+
+int generic_size(const struct generic_brick_type *brick_type)
+{
+ int size = brick_type->brick_size;
+ int i;
+
+ size += brick_type->max_inputs * sizeof(void *);
+ for (i = 0; i < brick_type->max_inputs; i++)
+ size += brick_type->default_input_types[i]->input_size;
+ size += brick_type->max_outputs * sizeof(void *);
+ for (i = 0; i < brick_type->max_outputs; i++)
+ size += brick_type->default_output_types[i]->output_size;
+ return size;
+}
+
+int generic_connect(struct generic_input *input, struct generic_output *output)
+{
+ BRICK_DBG("generic_connect(input=%p, output=%p)\n", input, output);
+ if (unlikely(!input || !output))
+ return -EINVAL;
+ if (unlikely(input->connect))
+ return -EEXIST;
+ if (unlikely(!list_empty(&input->input_head)))
+ return -EINVAL;
+ /* helps only against the most common errors */
+ if (unlikely(input->brick == output->brick))
+ return -EDEADLK;
+
+ input->connect = output;
+ output->nr_connected++;
+ list_add(&input->input_head, &output->output_head);
+ return 0;
+}
+
+int generic_disconnect(struct generic_input *input)
+{
+ struct generic_output *connect;
+
+ BRICK_DBG("generic_disconnect(input=%p)\n", input);
+ if (!input)
+ return -EINVAL;
+ connect = input->connect;
+ if (connect) {
+ connect->nr_connected--;
+ input->connect = NULL;
+ list_del_init(&input->input_head);
+ }
+ return 0;
+}
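+
+/* Illustrative wiring of two hypothetical bricks:
+ *
+ *   status = generic_connect(alpha->inputs[0], bravo->outputs[0]);
+ *   ...
+ *   status = generic_disconnect(alpha->inputs[0]);
+ *
+ * Connecting an input to an output of the same brick is rejected
+ * with -EDEADLK as a cheap guard against the most common cycles.
+ */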
+
+/************************************************************/
+
+/* general */
+
+int _brick_msleep(int msecs, bool shorten)
+{
+ unsigned long timeout;
+
+ flush_signals(current);
+ if (msecs <= 0) {
+ schedule();
+ return 0;
+ }
+ timeout = msecs_to_jiffies(msecs) + 1;
+
+ timeout = schedule_timeout_interruptible(timeout);
+
+ if (!shorten) {
+ while ((long)timeout > 0)
+ timeout = schedule_timeout_uninterruptible(timeout);
+ }
+
+ return jiffies_to_msecs(timeout);
+}
+
+/************************************************************/
+
+/* number management */
+
+static char *nr_table;
+int nr_max = 256;
+
+int get_brick_nr(void)
+{
+ char *new;
+ int nr;
+
+ if (unlikely(!nr_table))
+ nr_table = brick_zmem_alloc(nr_max);
+
+ for (;;) {
+ for (nr = 1; nr < nr_max; nr++) {
+ if (!nr_table[nr]) {
+ nr_table[nr] = 1;
+ return nr;
+ }
+ }
+ new = brick_zmem_alloc(nr_max << 1);
+ memcpy(new, nr_table, nr_max);
+ brick_mem_free(nr_table);
+ nr_table = new;
+ nr_max <<= 1;
+ }
+}
+
+void put_brick_nr(int nr)
+{
+ if (likely(nr_table && nr > 0 && nr < nr_max))
+ nr_table[nr] = 0;
+}
+
+/************************************************************/
+
+/* object stuff */
+
+/************************************************************/
+
+/* brick stuff */
+
+static int nr_brick_types;
+static const struct generic_brick_type *brick_types[MAX_BRICK_TYPES];
+
+int generic_register_brick_type(const struct generic_brick_type *new_type)
+{
+ int i;
+ int found = -1;
+
+ BRICK_DBG("generic_register_brick_type() name=%s\n", new_type->type_name);
+ for (i = 0; i < nr_brick_types; i++) {
+ if (!brick_types[i]) {
+ found = i;
+ continue;
+ }
+ if (!strcmp(brick_types[i]->type_name, new_type->type_name))
+ return 0;
+ }
+ if (found < 0) {
+ if (nr_brick_types >= MAX_BRICK_TYPES) {
+ BRICK_ERR("sorry, cannot register bricktype %s.\n", new_type->type_name);
+ return -ENOMEM;
+ }
+ found = nr_brick_types++;
+ }
+ brick_types[found] = new_type;
+ BRICK_DBG("generic_register_brick_type() done.\n");
+ return 0;
+}
+
+int generic_unregister_brick_type(const struct generic_brick_type *old_type)
+{
+ BRICK_DBG("generic_unregister_brick_type()\n");
+ return -1; /* NYI */
+}
+
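+/* Memory layout produced by generic_brick_init_full() over a single
+ * allocation of generic_size() bytes (illustrative sketch):
+ *
+ *   [brick][input ptr array][input 0][input 1]...[output ptr array][output 0]...
+ *
+ * The running "data"/"size" pair below walks through exactly these regions.
+ */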
+int generic_brick_init_full(
+ void *data,
+ int size,
+ const struct generic_brick_type *brick_type,
+ const struct generic_input_type **input_types,
+ const struct generic_output_type **output_types)
+{
+ struct generic_brick *brick = data;
+ int status;
+ int i;
+
+ if (unlikely(!data)) {
+ BRICK_ERR("invalid memory\n");
+ return -EINVAL;
+ }
+
+ /* call the generic constructors */
+
+ status = generic_brick_init(brick_type, brick);
+ if (status)
+ return status;
+ data += brick_type->brick_size;
+ size -= brick_type->brick_size;
+ if (size < 0) {
+ BRICK_ERR("Not enough MEMORY\n");
+ return -ENOMEM;
+ }
+ if (!input_types) {
+ input_types = brick_type->default_input_types;
+ if (unlikely(!input_types)) {
+ BRICK_ERR("no input types specified\n");
+ return -EINVAL;
+ }
+ }
+ brick->inputs = data;
+ data += sizeof(void *) * brick_type->max_inputs;
+ size -= sizeof(void *) * brick_type->max_inputs;
+ if (size < 0)
+ return -ENOMEM;
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ struct generic_input *input = data;
+ const struct generic_input_type *type = *input_types++;
+
+ if (!type || type->input_size <= 0)
+ return -EINVAL;
+ BRICK_DBG("generic_brick_init_full: calling generic_input_init()\n");
+ status = generic_input_init(brick, i, type, input);
+ if (status < 0)
+ return status;
+ data += type->input_size;
+ size -= type->input_size;
+ if (size < 0)
+ return -ENOMEM;
+ }
+ if (!output_types) {
+ output_types = brick_type->default_output_types;
+ if (unlikely(!output_types)) {
+ BRICK_ERR("no output types specified\n");
+ return -EINVAL;
+ }
+ }
+ brick->outputs = data;
+ data += sizeof(void *) * brick_type->max_outputs;
+ size -= sizeof(void *) * brick_type->max_outputs;
+ if (size < 0)
+ return -ENOMEM;
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ struct generic_output *output = data;
+ const struct generic_output_type *type = *output_types++;
+
+ if (!type || type->output_size <= 0)
+ return -EINVAL;
+ BRICK_DBG("generic_brick_init_full: calling generic_output_init()\n");
+ status = generic_output_init(brick, i, type, output);
+ if (status < 0)
+ return status;
+ data += type->output_size;
+ size -= type->output_size;
+ if (size < 0)
+ return -ENOMEM;
+ }
+
+ /* call the specific constructors */
+ if (brick_type->brick_construct) {
+ BRICK_DBG("generic_brick_init_full: calling brick_construct()\n");
+ status = brick_type->brick_construct(brick);
+ if (status < 0)
+ return status;
+ }
+ for (i = 0; i < brick_type->max_inputs; i++) {
+ struct generic_input *input = brick->inputs[i];
+
+ if (!input)
+ continue;
+ if (!input->type) {
+ BRICK_ERR("input has no associated type!\n");
+ continue;
+ }
+ if (input->type->input_construct) {
+ BRICK_DBG("generic_brick_init_full: calling input_construct()\n");
+ status = input->type->input_construct(input);
+ if (status < 0)
+ return status;
+ }
+ }
+ for (i = 0; i < brick_type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->type->output_construct) {
+ BRICK_DBG("generic_brick_init_full: calling output_construct()\n");
+ status = output->type->output_construct(output);
+ if (status < 0)
+ return status;
+ }
+ }
+ return 0;
+}
+
+int generic_brick_exit_full(struct generic_brick *brick)
+{
+ int i;
+ int status;
+
+ /* first, check all outputs */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->nr_connected) {
+ BRICK_ERR("output is connected!\n");
+ return -EPERM;
+ }
+ }
+ /* ok, test succeeded. start destruction... */
+ for (i = 0; i < brick->type->max_outputs; i++) {
+ struct generic_output *output = brick->outputs[i];
+
+ if (!output)
+ continue;
+ if (!output->type) {
+ BRICK_ERR("output has no associated type!\n");
+ continue;
+ }
+ if (output->type->output_destruct) {
+ BRICK_DBG("generic_brick_exit_full: calling output_destruct()\n");
+ status = output->type->output_destruct(output);
+ if (status < 0)
+ return status;
+ _generic_output_exit(output);
+ brick->outputs[i] = NULL; /* others may remain leftover */
+ }
+ }
+ for (i = 0; i < brick->type->max_inputs; i++) {
+ struct generic_input *input = brick->inputs[i];
+
+ if (!input)
+ continue;
+ if (!input->type) {
+ BRICK_ERR("input has no associated type!\n");
+ continue;
+ }
+ if (input->type->input_destruct) {
+ status = generic_disconnect(input);
+ if (status < 0)
+ return status;
+ BRICK_DBG("generic_brick_exit_full: calling input_destruct()\n");
+ status = input->type->input_destruct(input);
+ if (status < 0)
+ return status;
+ brick->inputs[i] = NULL; /* others may remain leftover */
+ generic_input_exit(input);
+ }
+ }
+ if (brick->type->brick_destruct) {
+ BRICK_DBG("generic_brick_exit_full: calling brick_destruct()\n");
+ status = brick->type->brick_destruct(brick);
+ if (status < 0)
+ return status;
+ }
+ generic_brick_exit(brick);
+ return 0;
+}
+
+/**********************************************************************/
+
+/* default implementations */
+
+struct generic_object *generic_alloc(
+struct generic_object_layout *object_layout, const struct generic_object_type *object_type)
+{
+ struct generic_object *object;
+ void *data;
+ int object_size;
+ int aspect_nr_max;
+ int total_size;
+ int hint_size;
+
+ CHECK_PTR_NULL(object_type, err);
+ CHECK_PTR(object_layout, err);
+
+ object_size = object_type->default_size;
+ aspect_nr_max = nr_max;
+ total_size = object_size + aspect_nr_max * sizeof(void *);
+ hint_size = object_layout->size_hint;
+ if (likely(total_size <= hint_size)) {
+ total_size = hint_size;
+ } else { /* usually happens only at the first time */
+ object_layout->size_hint = total_size;
+ }
+
+ data = brick_zmem_alloc(total_size);
+
+ atomic_inc(&object_layout->alloc_count);
+ atomic_inc(&object_layout->total_alloc_count);
+
+ object = data;
+ object->object_type = object_type;
+ object->object_layout = object_layout;
+ object->aspects = data + object_size;
+ object->aspect_nr_max = aspect_nr_max;
+ object->free_offset = object_size + aspect_nr_max * sizeof(void *);
+ object->max_offset = total_size;
+
+ if (object_type->init_fn) {
+ int status = object_type->init_fn(object);
+
+ if (status < 0)
+ goto err_free;
+ }
+
+ return object;
+
+err_free:
+ brick_mem_free(data);
+err:
+ return NULL;
+}
+
+void generic_free(struct generic_object *object)
+{
+ const struct generic_object_type *object_type;
+ struct generic_object_layout *object_layout;
+ int i;
+
+ CHECK_PTR(object, done);
+ object_type = object->object_type;
+ CHECK_PTR_NULL(object_type, done);
+ object_layout = object->object_layout;
+ CHECK_PTR(object_layout, done);
+ _CHECK_ATOMIC(&object->obj_count, !=, 0);
+
+ atomic_dec(&object_layout->alloc_count);
+ for (i = 0; i < object->aspect_nr_max; i++) {
+ const struct generic_aspect_type *aspect_type;
+ struct generic_aspect *aspect = object->aspects[i];
+
+ if (!aspect)
+ continue;
+ object->aspects[i] = NULL;
+ aspect_type = aspect->aspect_type;
+ CHECK_PTR_NULL(aspect_type, done);
+ if (aspect_type->exit_fn)
+ aspect_type->exit_fn(aspect);
+ if (aspect->shortcut)
+ continue;
+ brick_mem_free(aspect);
+ atomic_dec(&object_layout->aspect_count);
+ }
+ if (object_type->exit_fn)
+ object_type->exit_fn(object);
+ brick_mem_free(object);
+done:;
+}
+
+static inline
+struct generic_aspect *_new_aspect(const struct generic_aspect_type *aspect_type, struct generic_object *obj)
+{
+ struct generic_aspect *res = NULL;
+ int size;
+ int rest;
+
+ size = aspect_type->aspect_size;
+ rest = obj->max_offset - obj->free_offset;
+ if (likely(size <= rest)) {
+ /* Optimisation: re-use single memory allocation for both
+ * the object and the new aspect.
+ */
+ res = ((void *)obj) + obj->free_offset;
+ obj->free_offset += size;
+ res->shortcut = true;
+ } else {
+ struct generic_object_layout *object_layout = obj->object_layout;
+
+ CHECK_PTR(object_layout, done);
+ /* Maintain the size hint.
+ * In future, only small aspects should be integrated into
+ * the same memory block, and the hint should not grow larger
+ * than PAGE_SIZE if it was smaller before.
+ */
+ if (size < PAGE_SIZE / 2) {
+ int max;
+
+ max = obj->free_offset + size;
+ /* This is racy, but races won't do any harm because
+ * it is just a hint, not essential.
+ */
+ if ((max < PAGE_SIZE || object_layout->size_hint > PAGE_SIZE) &&
+ object_layout->size_hint < max)
+ object_layout->size_hint = max;
+ }
+
+ res = brick_zmem_alloc(size);
+ atomic_inc(&object_layout->aspect_count);
+ atomic_inc(&object_layout->total_aspect_count);
+ }
+ res->object = obj;
+ res->aspect_type = aspect_type;
+
+ if (aspect_type->init_fn) {
+ int status = aspect_type->init_fn(res);
+
+ if (unlikely(status < 0)) {
+ BRICK_ERR("aspect init %p %p %p status = %d\n", aspect_type, obj, res, status);
+ goto done;
+ }
+ }
+
+done:
+ return res;
+}
+
+struct generic_aspect *generic_get_aspect(struct generic_brick *brick, struct generic_object *obj)
+{
+ struct generic_aspect *res = NULL;
+ int nr;
+
+ CHECK_PTR(brick, done);
+ CHECK_PTR(obj, done);
+
+ nr = brick->aspect_context.brick_index;
+ if (unlikely(nr <= 0 || nr >= obj->aspect_nr_max)) {
+ BRICK_ERR("bad nr = %d\n", nr);
+ goto done;
+ }
+
+ res = obj->aspects[nr];
+ if (!res) {
+ const struct generic_object_type *object_type = obj->object_type;
+ const struct generic_brick_type *brick_type = brick->type;
+ const struct generic_aspect_type *aspect_type;
+ int object_type_nr;
+
+ CHECK_PTR_NULL(object_type, done);
+ CHECK_PTR_NULL(brick_type, done);
+ object_type_nr = object_type->object_type_nr;
+ aspect_type = brick_type->aspect_types[object_type_nr];
+ CHECK_PTR_NULL(aspect_type, done);
+
+ res = _new_aspect(aspect_type, obj);
+
+ obj->aspects[nr] = res;
+ }
+ CHECK_PTR(res, done);
+ CHECK_PTR(res->object, done);
+ _CHECK(res->object == obj, done);
+
+done:
+ return res;
+}
+
+/***************************************************************/
+
+/* helper stuff */
+
+void set_button(struct generic_switch *sw, bool val, bool force)
+{
+ bool oldval = sw->button;
+
+ sw->force_off |= force;
+ if (sw->force_off)
+ val = false;
+ if (val != oldval) {
+ sw->button = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+
+void set_on_led(struct generic_switch *sw, bool val)
+{
+ bool oldval = sw->on_led;
+
+ if (val != oldval) {
+ sw->on_led = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+
+void set_off_led(struct generic_switch *sw, bool val)
+{
+ bool oldval = sw->off_led;
+
+ if (val != oldval) {
+ sw->off_led = val;
+ wake_up_interruptible(&sw->event);
+ }
+}
+
+void set_button_wait(struct generic_brick *brick, bool val, bool force, int timeout)
+{
+ set_button(&brick->power, val, force);
+ if (brick->ops)
+ (void)brick->ops->brick_switch(brick);
+ if (val)
+ wait_event_interruptible_timeout(brick->power.event, brick->power.on_led, timeout);
+ else
+ wait_event_interruptible_timeout(brick->power.event, brick->power.off_led, timeout);
+}
+
+/***************************************************************/
+
+/* meta stuff */
+
+const struct meta *find_meta(const struct meta *meta, const char *field_name)
+{
+ const struct meta *tmp;
+
+ for (tmp = meta; tmp->field_name; tmp++) {
+ if (!strcmp(field_name, tmp->field_name))
+ return tmp;
+ }
+ return NULL;
+}
+
+/***********************************************************************/
+
+/* module init stuff */
+
+int __init init_brick(void)
+{
+ nr_table = brick_zmem_alloc(nr_max);
+ return 0;
+}
+
+void exit_brick(void)
+{
+ if (nr_table) {
+ brick_mem_free(nr_table);
+ nr_table = NULL;
+ }
+}
diff --git a/include/linux/brick/brick.h b/include/linux/brick/brick.h
new file mode 100644
index 000000000000..04f5084e26ff
--- /dev/null
+++ b/include/linux/brick/brick.h
@@ -0,0 +1,620 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_H
+#define BRICK_H
+
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kthread.h>
+
+#include <linux/atomic.h>
+
+#include <linux/brick/brick_say.h>
+#include <linux/brick/meta.h>
+
+#define MAX_BRICK_TYPES 64
+
+#define brick_msleep(msecs) _brick_msleep(msecs, false)
+extern int _brick_msleep(int msecs, bool shorten);
+#define brick_yield() brick_msleep(0)
+
+/***********************************************************************/
+
+/* printk() replacements */
+
+#define _BRICK_MSG(_class, _dump, _fmt, _args...) \
+ brick_say(_class, _dump, "BRICK", __BASE_FILE__, __LINE__, __func__, _fmt, ##_args)
+
+#define BRICK_FAT(_fmt, _args...) _BRICK_MSG(SAY_FATAL, true, _fmt, ##_args)
+#define BRICK_ERR(_fmt, _args...) _BRICK_MSG(SAY_ERROR, false, _fmt, ##_args)
+#define BRICK_WRN(_fmt, _args...) _BRICK_MSG(SAY_WARN, false, _fmt, ##_args)
+#define BRICK_INF(_fmt, _args...) _BRICK_MSG(SAY_INFO, false, _fmt, ##_args)
+
+#ifdef BRICK_DEBUGGING
+#define BRICK_DBG(_fmt, _args...) _BRICK_MSG(SAY_DEBUG, false, _fmt, ##_args)
+#else
+#define BRICK_DBG(_args...) /**/
+#endif
+
+#include <linux/brick/brick_checking.h>
+
+/***********************************************************************/
+
+/* number management helpers */
+
+extern int get_brick_nr(void);
+extern void put_brick_nr(int nr);
+
+/***********************************************************************/
+
+/* definitions for generic objects with aspects */
+
+struct generic_object;
+struct generic_aspect;
+
+#define GENERIC_ASPECT_TYPE(OBJTYPE) \
+ /* readonly from outside */ \
+ const char *aspect_type_name; \
+ const struct generic_object_type *object_type; \
+ /* private */ \
+ int aspect_size; \
+ int (*init_fn)(struct OBJTYPE##_aspect *ini); \
+ void (*exit_fn)(struct OBJTYPE##_aspect *ini)
+
+struct generic_aspect_type {
+ GENERIC_ASPECT_TYPE(generic);
+};
+
+#define GENERIC_OBJECT_TYPE(OBJTYPE) \
+ /* readonly from outside */ \
+ const char *object_type_name; \
+ /* private */ \
+ int default_size; \
+ int object_type_nr; \
+ int (*init_fn)(struct OBJTYPE##_object *ini); \
+ void (*exit_fn)(struct OBJTYPE##_object *ini)
+
+struct generic_object_type {
+ GENERIC_OBJECT_TYPE(generic);
+};
+
+#define GENERIC_OBJECT_LAYOUT(OBJTYPE) \
+ /* private */ \
+ int size_hint; \
+ atomic_t alloc_count; \
+ atomic_t aspect_count; \
+ atomic_t total_alloc_count; \
+ atomic_t total_aspect_count
+
+struct generic_object_layout {
+ GENERIC_OBJECT_LAYOUT(generic);
+};
+
+#define GENERIC_OBJECT(OBJTYPE) \
+ /* maintenance, access by macros */ \
+ atomic_t obj_count; /* reference counter */ \
+ bool obj_initialized; /* internally used for checking */ \
+ /* readonly from outside */ \
+ const struct generic_object_type *object_type; \
+ /* private */ \
+ struct generic_object_layout *object_layout; \
+ struct OBJTYPE##_aspect **aspects; \
+ int aspect_nr_max; \
+ int free_offset; \
+ int max_offset
+
+struct generic_object {
+ GENERIC_OBJECT(generic);
+};
+
+#define GENERIC_ASPECT(OBJTYPE) \
+ /* readonly from outside */ \
+ struct OBJTYPE##_object *object; \
+ const struct generic_aspect_type *aspect_type; \
+ /* private */ \
+ bool shortcut
+
+struct generic_aspect {
+ GENERIC_ASPECT(generic);
+};
+
+#define GENERIC_ASPECT_CONTEXT(OBJTYPE) \
+ /* private (for any layer) */ \
+ int brick_index /* globally unique */
+
+struct generic_aspect_context {
+ GENERIC_ASPECT_CONTEXT(generic);
+};
+
+#define obj_check(object) \
+ ({ \
+ if (unlikely(BRICK_CHECKING && !(object)->obj_initialized)) {\
+ BRICK_ERR("object %p is not initialized\n", (object));\
+ } \
+ CHECK_ATOMIC(&(object)->obj_count, 1); \
+ })
+
+#define obj_get_first(object) \
+ ({ \
+ if (unlikely(BRICK_CHECKING && (object)->obj_initialized)) {\
+ BRICK_ERR("object %p is already initialized\n", (object));\
+ } \
+ _CHECK_ATOMIC(&(object)->obj_count, !=, 0); \
+ (object)->obj_initialized = true; \
+ atomic_inc(&(object)->obj_count); \
+ })
+
+#define obj_get(object) \
+ ({ \
+ obj_check(object); \
+ atomic_inc(&(object)->obj_count); \
+ })
+
+#define obj_put(object) \
+ ({ \
+ obj_check(object); \
+ atomic_dec_and_test(&(object)->obj_count); \
+ })
+
+#define obj_free(object) \
+ ({ \
+ if (likely(object)) { \
+ generic_free((struct generic_object *)(object));\
+ } \
+ })
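+
+/* Typical object lifecycle (illustrative sketch, assuming "obj" was
+ * just returned by generic_alloc()):
+ *
+ *   obj_get_first(obj);    // take the first reference
+ *   obj_get(obj);          // take an additional reference
+ *   obj_put(obj);          // drop it again
+ *   if (obj_put(obj))      // true when the count reaches zero
+ *       obj_free(obj);
+ */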
+
+/***********************************************************************/
+
+/* definitions for asynchronous callback objects */
+
+#define GENERIC_CALLBACK(OBJTYPE) \
+ /* set by macros, afterwards readonly from outside */ \
+ void (*cb_fn)(struct OBJTYPE##_callback *cb); \
+ void *cb_private; \
+ int cb_error; \
+ /* private */ \
+ struct generic_callback *cb_next
+
+struct generic_callback {
+ GENERIC_CALLBACK(generic);
+};
+
+#define CALLBACK_OBJECT(OBJTYPE) \
+ GENERIC_OBJECT(OBJTYPE); \
+ /* private, access by macros */ \
+ struct generic_callback *object_cb; \
+ struct generic_callback _object_cb
+
+struct callback_object {
+ CALLBACK_OBJECT(generic);
+};
+
+/* Initial setup of the callback chain
+ */
+#define _SETUP_CALLBACK(obj, fn, priv) \
+do { \
+ (obj)->_object_cb.cb_fn = (fn); \
+ (obj)->_object_cb.cb_private = (priv); \
+ (obj)->_object_cb.cb_error = 0; \
+ (obj)->_object_cb.cb_next = NULL; \
+ (obj)->object_cb = &(obj)->_object_cb; \
+} while (0)
+
+#ifdef BRICK_DEBUGGING
+#define SETUP_CALLBACK(obj, fn, priv) \
+do { \
+ if (unlikely((obj)->_object_cb.cb_fn)) { \
+ BRICK_ERR("callback function %p is already installed (new=%p)\n",\
+ (obj)->_object_cb.cb_fn, (fn)); \
+ } \
+ _SETUP_CALLBACK(obj, fn, priv) \
+} while (0)
+#else
+#define SETUP_CALLBACK(obj, fn, priv) _SETUP_CALLBACK(obj, fn, priv)
+#endif
+
+/* Insert a new member into the callback chain
+ */
+#define _INSERT_CALLBACK(obj, new, fn, priv) \
+do { \
+ if (likely(!(new)->cb_fn)) { \
+ (new)->cb_fn = (fn); \
+ (new)->cb_private = (priv); \
+ (new)->cb_error = 0; \
+ (new)->cb_next = (obj)->object_cb; \
+ (obj)->object_cb = (new); \
+ } \
+} while (0)
+
+#ifdef BRICK_DEBUGGING
+#define INSERT_CALLBACK(obj, new, fn, priv) \
+do { \
+ if (unlikely(!(obj)->_object_cb.cb_fn)) { \
+ BRICK_ERR("initical callback function is missing\n"); \
+ } \
+ if (unlikely((new)->cb_fn)) { \
+ BRICK_ERR("new object %p is not pristine\n", (new)->cb_fn);\
+ } \
+ _INSERT_CALLBACK(obj, new, fn, priv); \
+} while (0)
+#else
+#define INSERT_CALLBACK(obj, new, fn, priv) _INSERT_CALLBACK(obj, new, fn, priv)
+#endif
+
+/* Call the first callback in the chain.
+ */
+#define SIMPLE_CALLBACK(obj, err) \
+do { \
+ if (likely(obj)) { \
+ struct generic_callback *__cb = (obj)->object_cb; \
+ if (likely(__cb)) { \
+ __cb->cb_error = (err); \
+ __cb->cb_fn(__cb); \
+ } else { \
+ BRICK_ERR("callback object_cb pointer is NULL\n");\
+ } \
+ } else { \
+ BRICK_ERR("callback obj pointer is NULL\n"); \
+ } \
+} while (0)
+
+#define CHECKED_CALLBACK(obj, err, done) \
+do { \
+ struct generic_callback *__cb; \
+ CHECK_PTR(obj, done); \
+ __cb = (obj)->object_cb; \
+ CHECK_PTR_NULL(__cb, done); \
+ __cb->cb_error = (err); \
+ __cb->cb_fn(__cb); \
+} while (0)
+
+/* An intermediate callback handler must call this
+ * to continue the callback chain.
+ */
+#define NEXT_CHECKED_CALLBACK(cb, done) \
+do { \
+ struct generic_callback *__next_cb = (cb)->cb_next; \
+ CHECK_PTR_NULL(__next_cb, done); \
+ __next_cb->cb_error = (cb)->cb_error; \
+ __next_cb->cb_fn(__next_cb); \
+} while (0)
+
+/* The last callback handler in the chain should call this
+ * for checking whether the end of the chain has been reached
+ */
+#define LAST_CALLBACK(cb) \
+do { \
+ struct generic_callback *__next_cb = (cb)->cb_next; \
+ if (unlikely(__next_cb)) { \
+ BRICK_ERR("end of callback chain %p has not been reached, rest = %p\n", (cb), __next_cb);\
+ } \
+} while (0)
+
+/* Query the callback status.
+ * This always uses the first member of the chain!
+ */
+#define CALLBACK_ERROR(obj) \
+ ((obj)->object_cb ? (obj)->object_cb->cb_error : -EINVAL)
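+
+/* Illustrative chain setup (hypothetical handler names):
+ *
+ *   SETUP_CALLBACK(obj, my_final_fn, my_private);     // end of the chain
+ *   INSERT_CALLBACK(obj, &my_cb, my_filter_fn, NULL); // prepended member
+ *
+ * my_filter_fn() then does its intermediate work and continues via
+ * NEXT_CHECKED_CALLBACK(cb, done), while my_final_fn() should verify
+ * termination with LAST_CALLBACK(cb).
+ */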
+
+/***********************************************************************/
+
+/* definitions for generic bricks */
+
+struct generic_input;
+struct generic_output;
+struct generic_brick_ops;
+struct generic_output_ops;
+struct generic_brick_type;
+
+struct generic_switch {
+ /* public */
+ bool button; /* in: main switch (on/off) */
+ bool on_led; /* out: indicate regular operation */
+ bool off_led; /* out: indicate no activity of any kind */
+ bool force_off; /* in: make ready for destruction */
+ int io_timeout; /* in: report IO errors after timeout (seconds) */
+ int percent_done; /* out: generic progress indicator */
+ /* private (for any layer) */
+ wait_queue_head_t event;
+};
+
+#define GENERIC_BRICK(BRITYPE) \
+ /* accessible */ \
+ struct generic_switch power; \
+ /* set by strategy layer, readonly from worker layer */ \
+ const struct BRITYPE##_brick_type *type; \
+ int nr_inputs; \
+ int nr_outputs; \
+ struct BRITYPE##_input **inputs; \
+ struct BRITYPE##_output **outputs; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_brick_ops *ops; \
+ struct generic_aspect_context aspect_context; \
+ int (*free)(struct BRITYPE##_brick *del); \
+ struct list_head tmp_head
+
+struct generic_brick {
+ GENERIC_BRICK(generic);
+};
+
+#define GENERIC_INPUT(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ struct BRITYPE##_brick *brick; \
+ const struct BRITYPE##_input_type *type; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_output *connect; \
+ struct list_head input_head
+
+struct generic_input {
+ GENERIC_INPUT(generic);
+};
+
+#define GENERIC_OUTPUT(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ struct BRITYPE##_brick *brick; \
+ const struct BRITYPE##_output_type *type; \
+ /* private (for any layer) */ \
+ struct BRITYPE##_output_ops *ops; \
+ struct list_head output_head; \
+ int nr_connected
+
+struct generic_output {
+ GENERIC_OUTPUT(generic);
+};
+
+#define GENERIC_OUTPUT_CALL(OUTPUT, OP, ARGS...) \
+ ( \
+ (OUTPUT) && (OUTPUT)->ops->OP ? \
+ (OUTPUT)->ops->OP(OUTPUT, ##ARGS) : \
+ -ENOTCONN \
+ )
+
+#define GENERIC_INPUT_CALL(INPUT, OP, ARGS...) \
+ ( \
+ (INPUT) && (INPUT)->connect ? \
+ GENERIC_OUTPUT_CALL((INPUT)->connect, OP, ##ARGS) : \
+ -ENOTCONN \
+ )
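+
+/* Illustrative call through a connected input (hypothetical op name):
+ *
+ *   status = GENERIC_INPUT_CALL(input, my_op, &arg);
+ *
+ * resolves to input->connect->ops->my_op(input->connect, &arg),
+ * or evaluates to -ENOTCONN when the input is not wired to an output
+ * or the op is not provided.
+ */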
+
+#define GENERIC_BRICK_OPS(BRITYPE) \
+ int (*brick_switch)(struct BRITYPE##_brick *brick)
+
+struct generic_brick_ops {
+ GENERIC_BRICK_OPS(generic);
+};
+
+#define GENERIC_OUTPUT_OPS(BRITYPE) \
+ /*int (*output_start)(struct BRITYPE##_output *output);*/ \
+ /*int (*output_stop)(struct BRITYPE##_output *output);*/
+
+struct generic_output_ops {
+ GENERIC_OUTPUT_OPS(generic);
+};
+
+/* although possible, *_type should never be extended */
+#define GENERIC_BRICK_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ const char *type_name; \
+ int max_inputs; \
+ int max_outputs; \
+ const struct BRITYPE##_input_type **default_input_types; \
+ const char **default_input_names; \
+ const struct BRITYPE##_output_type **default_output_types; \
+ const char **default_output_names; \
+ /* private (for any layer) */ \
+ int brick_size; \
+ struct BRITYPE##_brick_ops *master_ops; \
+ const struct generic_aspect_type **aspect_types; \
+ const struct BRITYPE##_input_types **default_type; \
+ int (*brick_construct)(struct BRITYPE##_brick *brick); \
+ int (*brick_destruct)(struct BRITYPE##_brick *brick)
+
+struct generic_brick_type {
+ GENERIC_BRICK_TYPE(generic);
+};
+
+#define GENERIC_INPUT_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ char *type_name; \
+ /* private (for any layer) */ \
+ int input_size; \
+ int (*input_construct)(struct BRITYPE##_input *input); \
+ int (*input_destruct)(struct BRITYPE##_input *input)
+
+struct generic_input_type {
+ GENERIC_INPUT_TYPE(generic);
+};
+
+#define GENERIC_OUTPUT_TYPE(BRITYPE) \
+ /* set by strategy layer, readonly from worker layer */ \
+ char *type_name; \
+ /* private (for any layer) */ \
+ int output_size; \
+ struct BRITYPE##_output_ops *master_ops; \
+ int (*output_construct)(struct BRITYPE##_output *output); \
+ int (*output_destruct)(struct BRITYPE##_output *output)
+
+struct generic_output_type {
+ GENERIC_OUTPUT_TYPE(generic);
+};
+
+int generic_register_brick_type(const struct generic_brick_type *new_type);
+int generic_unregister_brick_type(const struct generic_brick_type *old_type);
+
+extern void _generic_output_init(
+struct generic_brick *brick, const struct generic_output_type *type, struct generic_output *output);
+
+extern void _generic_output_exit(struct generic_output *output);
+
+#ifdef _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+/* you need this only if you circumvent generic_brick_init_full() */
+extern int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick);
+
+extern void generic_brick_exit(struct generic_brick *brick);
+
+extern int generic_input_init(
+struct generic_brick *brick, int index, const struct generic_input_type *type, struct generic_input *input);
+
+extern void generic_input_exit(struct generic_input *input);
+
+extern int generic_output_init(
+struct generic_brick *brick, int index, const struct generic_output_type *type, struct generic_output *output);
+
+extern int generic_size(const struct generic_brick_type *brick_type);
+
+extern int generic_connect(struct generic_input *input, struct generic_output *output);
+
+extern int generic_disconnect(struct generic_input *input);
+
+/* If possible, use this instead of generic_*_init().
+ * input_types and output_types may be NULL => use default_*_types
+ */
+int generic_brick_init_full(
+ void *data,
+ int size,
+ const struct generic_brick_type *brick_type,
+ const struct generic_input_type **input_types,
+ const struct generic_output_type **output_types);
+
+int generic_brick_exit_full(
+ struct generic_brick *brick);
+
+#endif /* _STRATEGY */
+
+/* simple wrappers for type safety */
+
+#define DECLARE_BRICK_FUNCTIONS(BRITYPE) \
+extern inline int BRITYPE##_register_brick_type(void) \
+{ \
+ extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+ extern int BRITYPE##_brick_nr; \
+ if (unlikely(BRITYPE##_brick_nr >= 0)) { \
+ BRICK_ERR("brick type " #BRITYPE " is already registered.\n");\
+ return -EEXIST; \
+ } \
+ BRITYPE##_brick_nr = generic_register_brick_type((const struct generic_brick_type *)&BRITYPE##_brick_type);\
+ return BRITYPE##_brick_nr < 0 ? BRITYPE##_brick_nr : 0; \
+} \
+ \
+extern inline int BRITYPE##_unregister_brick_type(void) \
+{ \
+ extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+ return generic_unregister_brick_type((const struct generic_brick_type *)&BRITYPE##_brick_type);\
+} \
+ \
+extern const struct BRITYPE##_brick_type BRITYPE##_brick_type; \
+extern const struct BRITYPE##_input_type BRITYPE##_input_type; \
+extern const struct BRITYPE##_output_type BRITYPE##_output_type
+
+/*********************************************************************/
+
+/* default operations on objects / aspects */
+
+extern struct generic_object *generic_alloc(
+struct generic_object_layout *object_layout, const struct generic_object_type *object_type);
+
+extern void generic_free(struct generic_object *object);
+extern struct generic_aspect *generic_get_aspect(struct generic_brick *brick, struct generic_object *obj);
+
+#define DECLARE_OBJECT_FUNCTIONS(OBJTYPE) \
+extern inline struct OBJTYPE##_object *alloc_##OBJTYPE(struct generic_object_layout *layout)\
+{ \
+ return (void *)generic_alloc(layout, &OBJTYPE##_type); \
+}
+
+#define DECLARE_ASPECT_FUNCTIONS(BRITYPE, OBJTYPE) \
+ \
+extern inline struct OBJTYPE##_object *BRITYPE##_alloc_##OBJTYPE(struct BRITYPE##_brick *brick)\
+{ \
+ return alloc_##OBJTYPE(&brick->OBJTYPE##_object_layout); \
+} \
+ \
+extern inline struct BRITYPE##_##OBJTYPE##_aspect *BRITYPE##_##OBJTYPE##_get_aspect(\
+struct BRITYPE##_brick *brick, struct OBJTYPE##_object *obj) \
+{ \
+ return (void *)generic_get_aspect((struct generic_brick *)brick, (struct generic_object *)obj);\
+}
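+
+/* Illustrative instantiation for a hypothetical brick type "dummy"
+ * working on aio objects (assuming DECLARE_OBJECT_FUNCTIONS(aio)
+ * has been instantiated elsewhere):
+ *
+ *   DECLARE_BRICK_FUNCTIONS(dummy);
+ *   DECLARE_ASPECT_FUNCTIONS(dummy, aio);
+ *
+ * This yields dummy_register_brick_type(), dummy_unregister_brick_type(),
+ * dummy_alloc_aio() and dummy_aio_get_aspect().
+ */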
+
+/*********************************************************************/
+
+/* some general helpers */
+
+#ifdef _STRATEGY /* call this only from the strategy implementation */
+
+/* Generic interface to simple brick status changes.
+ */
+extern void set_button(struct generic_switch *sw, bool val, bool force);
+extern void set_on_led(struct generic_switch *sw, bool val);
+extern void set_off_led(struct generic_switch *sw, bool val);
+/*
+ * "Forced switch off" means that it cannot be switched on again.
+ */
+extern void set_button_wait(struct generic_brick *brick, bool val, bool force, int timeout);
+
+#endif
+
+/***********************************************************************/
+
+/* thread automation (avoid code duplication) */
+
+#define brick_thread_create(_thread_fn, _data, _fmt, _args...) \
+ ({ \
+ struct task_struct *_thr = kthread_create(_thread_fn, _data, _fmt, ##_args);\
+ if (unlikely(IS_ERR(_thr))) { \
+ int _err = PTR_ERR(_thr); \
+ BRICK_ERR("cannot create thread '%s', status = %d\n", _fmt, _err);\
+ _thr = NULL; \
+ } else { \
+ struct say_channel *ch = get_binding(current); \
+ if (likely(ch)) \
+ bind_to_channel(ch, _thr); \
+ get_task_struct(_thr); \
+ wake_up_process(_thr); \
+ } \
+ _thr; \
+ })
+
+#define brick_thread_stop(_thread) \
+ do { \
+ struct task_struct *__thread__ = (_thread); \
+ if (likely(__thread__)) { \
+ BRICK_DBG("stopping thread '%s'\n", __thread__->comm);\
+ kthread_stop(__thread__); \
+ BRICK_DBG("thread '%s' finished.\n", __thread__->comm);\
+ remove_binding(__thread__); \
+ put_task_struct(__thread__); \
+ _thread = NULL; \
+ } \
+ } while (0)
+
+#define brick_thread_should_stop() \
+ ({ \
+ brick_yield(); \
+ kthread_should_stop(); \
+ })
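+
+/* Illustrative thread lifecycle (hypothetical worker function):
+ *
+ *   brick->thread = brick_thread_create(my_worker, brick, "mars_%s", name);
+ *   ...
+ *   brick_thread_stop(brick->thread);
+ *
+ * my_worker() should poll brick_thread_should_stop() in its main loop,
+ * since brick_thread_stop() relies on kthread_stop() semantics.
+ */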
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_brick(void);
+extern void exit_brick(void);
+
+#endif
--
2.11.0

2016-12-30 23:04:33

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 28/32] mars: add new module mars_proc

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/mars/mars_proc.c | 389 ++++++++++++++++++++++++++++++++++
drivers/staging/mars/mars/mars_proc.h | 34 +++
2 files changed, 423 insertions(+)
create mode 100644 drivers/staging/mars/mars/mars_proc.c
create mode 100644 drivers/staging/mars/mars/mars_proc.h

diff --git a/drivers/staging/mars/mars/mars_proc.c b/drivers/staging/mars/mars/mars_proc.c
new file mode 100644
index 000000000000..84b4dfc82211
--- /dev/null
+++ b/drivers/staging/mars/mars/mars_proc.c
@@ -0,0 +1,389 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/sysctl.h>
+#include <linux/uaccess.h>
+
+#include "strategy.h"
+#include "mars_proc.h"
+#include <linux/xio/lib_mapfree.h>
+#include <linux/xio/xio_bio.h>
+#include <linux/xio/xio_if.h>
+#include <linux/xio/xio_copy.h>
+#include <linux/xio/xio_client.h>
+#include <linux/xio/xio_server.h>
+#include <linux/xio/xio_trans_logger.h>
+
+xio_info_fn xio_info;
+
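+/* Illustrative usage from userspace, once the module is loaded:
+ *
+ *   echo 1 > /proc/sys/mars/trigger   # fires local_trigger()
+ *   echo 2 > /proc/sys/mars/trigger   # additionally fires remote_trigger()
+ */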
+static
+int trigger_sysctl_handler(
+ struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length,
+ loff_t *ppos)
+{
+ ssize_t res = 0;
+ size_t len = *length;
+
+ XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+ if (!len || *ppos > 0)
+ goto done;
+
+ if (write) {
+ char tmp[8] = {};
+
+ res = len; /* fake consumption of all data */
+
+ if (len > 7)
+ len = 7;
+ if (!copy_from_user(tmp, buffer, len)) {
+ int code = 0;
+ int status = kstrtoint(tmp, 10, &code);
+
+ /* the return value from kstrtoint() does not matter */
+ (void)status;
+ if (code > 0)
+ local_trigger();
+ if (code > 1)
+ remote_trigger();
+ }
+ } else {
+ char *answer = "MARS module not operational\n";
+ char *tmp = NULL;
+ int mylen;
+
+ if (xio_info) {
+ answer = "internal error while determining xio_info\n";
+ tmp = xio_info();
+ if (tmp)
+ answer = tmp;
+ }
+
+ mylen = strlen(answer);
+ if (len > mylen)
+ len = mylen;
+ res = len;
+ if (copy_to_user(buffer, answer, len)) {
+ XIO_ERR("write %ld bytes at %p failed\n", len, buffer);
+ res = -EFAULT;
+ }
+ brick_string_free(tmp);
+ }
+
+done:
+ XIO_DBG("res = %ld\n", res);
+ *length = res;
+ if (res >= 0) {
+ *ppos += res;
+ return 0;
+ }
+ return res;
+}
+
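+/* Illustrative read-only usage:
+ *
+ *   cat /proc/sys/mars/lamport_clock
+ *
+ * prints the local CURRENT_TIME alongside the Lamport clock value.
+ */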
+static
+int lamport_sysctl_handler(
+ struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length,
+ loff_t *ppos)
+{
+ ssize_t res = 0;
+ size_t len = *length;
+ int my_len = 128;
+ char *tmp = brick_string_alloc(my_len);
+ struct timespec know = CURRENT_TIME;
+ struct timespec lnow;
+
+ XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+ if (!len || *ppos > 0)
+ goto done;
+
+ if (write) {
+ res = -EINVAL;
+ goto done;
+ }
+
+ get_lamport(&lnow);
+
+ res = scnprintf(
+ tmp,
+ my_len,
+ "CURRENT_TIME=%ld.%09ld\nlamport_now=%ld.%09ld\n",
+ know.tv_sec, know.tv_nsec,
+ lnow.tv_sec, lnow.tv_nsec
+ );
+
+ if (copy_to_user(buffer, tmp, res)) {
+ XIO_ERR("write %ld bytes at %p failed\n", res, buffer);
+ res = -EFAULT;
+ }
+
+done:
+ brick_string_free(tmp);
+ XIO_DBG("res = %ld\n", res);
+ *length = res;
+ if (res >= 0) {
+ *ppos += res;
+ return 0;
+ }
+ return res;
+}
+
+#ifdef CTL_UNNUMBERED
+#define _CTL_NAME .ctl_name = CTL_UNNUMBERED,
+#define _CTL_STRATEGY(handler) .strategy = &handler,
+#else
+#define _CTL_NAME /*empty*/
+#define _CTL_STRATEGY(handler) /*empty*/
+#endif
+
+#define VEC_ENTRY(NAME, VAR, MODE, COUNT) \
+ { \
+ _CTL_NAME \
+ .procname = NAME, \
+ .data = &(VAR), \
+ .maxlen = sizeof(int) * (COUNT), \
+ .mode = MODE, \
+ .proc_handler = &proc_dointvec, \
+ _CTL_STRATEGY(sysctl_intvec) \
+ }
+
+#define INT_ENTRY(NAME, VAR, MODE) \
+ VEC_ENTRY(NAME, VAR, MODE, 1)
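+
+/* Example (illustrative only):
+ *
+ *   INT_ENTRY("peer_abort", mars_peer_abort, 0600)
+ *
+ * creates a ctl_table entry that reads and writes the single int
+ * mars_peer_abort via proc_dointvec, appearing as
+ * /proc/sys/mars/peer_abort (see mars_table below).
+ */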
+
+/* checkpatch.pl: no, these complex values cannot be easily enclosed
+ * in parentheses. If { ... } were used inside the macro body, it would
+ * no longer be possible to add additional fields externally.
+ * I could inject further fields externally via parameters, but
+ * that would make it less understandable.
+ */
+#define LIMITER_ENTRIES(VAR, PREFIX, SUFFIX) \
+ INT_ENTRY(PREFIX "_ops_total_" SUFFIX, (VAR)->lim_total_ops, 0400),\
+ INT_ENTRY(PREFIX "_amount_total_" SUFFIX, (VAR)->lim_total_amount, 0400),\
+ INT_ENTRY(PREFIX "_ops_ratelimit_" SUFFIX, (VAR)->lim_max_ops_rate, 0600),\
+ INT_ENTRY(PREFIX "_amount_ratelimit_" SUFFIX, (VAR)->lim_max_amount_rate, 0600),\
+ INT_ENTRY(PREFIX "_maxdelay_ms", (VAR)->lim_max_delay, 0600), \
+ INT_ENTRY(PREFIX "_minwindow_ms", (VAR)->lim_min_window, 0600),\
+ INT_ENTRY(PREFIX "_maxwindow_ms", (VAR)->lim_max_window, 0600),\
+ INT_ENTRY(PREFIX "_ops_cumul_" SUFFIX, (VAR)->lim_ops_cumul, 0600),\
+ INT_ENTRY(PREFIX "_amount_cumul_" SUFFIX, (VAR)->lim_amount_cumul, 0600),\
+ INT_ENTRY(PREFIX "_ops_rate_" SUFFIX, (VAR)->lim_ops_rate, 0400),\
+ INT_ENTRY(PREFIX "_amount_rate_" SUFFIX, (VAR)->lim_amount_rate, 0400)\
+
+#define THRESHOLD_ENTRIES(VAR, PREFIX) \
+ INT_ENTRY(PREFIX "_threshold_us", (VAR)->thr_limit, 0600), \
+ INT_ENTRY(PREFIX "_factor_percent", (VAR)->thr_factor, 0600), \
+ INT_ENTRY(PREFIX "_plus_us", (VAR)->thr_plus, 0600), \
+ INT_ENTRY(PREFIX "_max_ms", (VAR)->thr_max, 0600), \
+ INT_ENTRY(PREFIX "_triggered", (VAR)->thr_triggered, 0400),\
+ INT_ENTRY(PREFIX "_true_hit", (VAR)->thr_true_hit, 0400) \
+
+static
+struct ctl_table traffic_tuning_table[] = {
+ LIMITER_ENTRIES(&client_limiter, "client_role_traffic", "kb"),
+ LIMITER_ENTRIES(&server_limiter, "server_role_traffic", "kb"),
+ {}
+};
+
+static
+struct ctl_table io_tuning_table[] = {
+ LIMITER_ENTRIES(&global_writeback.limiter, "writeback", "kb"),
+ INT_ENTRY("writeback_until_percent", global_writeback.until_percent, 0600),
+ THRESHOLD_ENTRIES(&global_io_threshold, "global_io"),
+ THRESHOLD_ENTRIES(&bio_submit_threshold, "bio_submit"),
+ THRESHOLD_ENTRIES(&bio_io_threshold[0], "bio_io_r"),
+ THRESHOLD_ENTRIES(&bio_io_threshold[1], "bio_io_w"),
+ {}
+};
+
+#define TCP_TUNING_ENTRIES(VAR) \
+ INT_ENTRY("ip_tos", (VAR)->ip_tos, 0600), \
+ INT_ENTRY("tcp_window_size", (VAR)->tcp_window_size, 0600), \
+ INT_ENTRY("tcp_nodelay", (VAR)->tcp_nodelay, 0600), \
+ INT_ENTRY("tcp_timeout", (VAR)->tcp_timeout, 0600), \
+ INT_ENTRY("tcp_keepcnt", (VAR)->tcp_keepcnt, 0600), \
+ INT_ENTRY("tcp_keepintvl", (VAR)->tcp_keepintvl, 0600), \
+ INT_ENTRY("tcp_keepidle", (VAR)->tcp_keepidle, 0600), \
+
+static
+struct ctl_table repl_tuning_table[] = {
+ TCP_TUNING_ENTRIES(&repl_tcp_params)
+ {}
+};
+
+static
+struct ctl_table device_tuning_table[] = {
+ TCP_TUNING_ENTRIES(&device_tcp_params)
+ {}
+};
+
+static
+struct ctl_table mars_table[] = {
+ {
+ _CTL_NAME
+ .procname = "trigger",
+ .mode = 0200,
+ .proc_handler = &trigger_sysctl_handler,
+ },
+ {
+ _CTL_NAME
+ .procname = "info",
+ .mode = 0400,
+ .proc_handler = &trigger_sysctl_handler,
+ },
+ {
+ _CTL_NAME
+ .procname = "lamport_clock",
+ .mode = 0400,
+ .proc_handler = &lamport_sysctl_handler,
+ },
+ INT_ENTRY("show_log_messages", brick_say_logging, 0600),
+ INT_ENTRY("show_debug_messages", brick_say_debug, 0600),
+ INT_ENTRY("show_statistics_global", global_show_statist, 0600),
+ INT_ENTRY("show_statistics_server", server_show_statist, 0600),
+#ifdef CONFIG_MARS_DEBUG
+ INT_ENTRY("debug_crash_mode", mars_crash_mode, 0600),
+ INT_ENTRY("debug_hang_mode", mars_hang_mode, 0600),
+#endif
+ INT_ENTRY("logger_completion_semantics", trans_logger_completion_semantics, 0600),
+ INT_ENTRY("logger_do_crc", trans_logger_do_crc, 0600),
+ INT_ENTRY("syslog_min_class", brick_say_syslog_min, 0600),
+ INT_ENTRY("syslog_max_class", brick_say_syslog_max, 0600),
+ INT_ENTRY("syslog_flood_class", brick_say_syslog_flood_class, 0600),
+ INT_ENTRY("syslog_flood_limit", brick_say_syslog_flood_limit, 0600),
+ INT_ENTRY("syslog_flood_recovery_s", brick_say_syslog_flood_recovery, 0600),
+ INT_ENTRY("delay_say_on_overflow", delay_say_on_overflow, 0600),
+ INT_ENTRY("mapfree_period_sec", mapfree_period_sec, 0600),
+ INT_ENTRY("mapfree_grace_keep_mb", mapfree_grace_keep_mb, 0600),
+ INT_ENTRY("logger_max_interleave", trans_logger_max_interleave, 0600),
+ INT_ENTRY("logger_resume", trans_logger_resume, 0600),
+ INT_ENTRY("logger_replay_timeout_sec", trans_logger_replay_timeout, 0600),
+ INT_ENTRY("mem_limit_percent", mars_mem_percent, 0600),
+ INT_ENTRY("logger_mem_used_kb", trans_logger_mem_usage, 0400),
+ INT_ENTRY("mem_used_raw_kb", brick_global_block_used, 0400),
+#ifdef CONFIG_MARS_MEM_PREALLOC
+ INT_ENTRY("mem_allow_freelist", brick_allow_freelist, 0600),
+ VEC_ENTRY("mem_freelist_max", brick_mem_freelist_max, 0600, BRICK_MAX_ORDER + 1),
+ VEC_ENTRY("mem_alloc_count", brick_mem_alloc_count, 0400, BRICK_MAX_ORDER + 1),
+ VEC_ENTRY("mem_alloc_max", brick_mem_alloc_count, 0600, BRICK_MAX_ORDER + 1),
+#endif
+ INT_ENTRY("io_flying_count", xio_global_io_flying, 0400),
+ INT_ENTRY("copy_overlap", xio_copy_overlap, 0600),
+ INT_ENTRY("copy_read_prio", xio_copy_read_prio, 0600),
+ INT_ENTRY("copy_write_prio", xio_copy_write_prio, 0600),
+ INT_ENTRY("copy_read_max_fly", xio_copy_read_max_fly, 0600),
+ INT_ENTRY("copy_write_max_fly", xio_copy_write_max_fly, 0600),
+ INT_ENTRY("statusfiles_rollover_sec", mars_rollover_interval, 0600),
+ INT_ENTRY("scan_interval_sec", mars_scan_interval, 0600),
+ INT_ENTRY("propagate_interval_sec", mars_propagate_interval, 0600),
+ INT_ENTRY("sync_flip_interval_sec", mars_sync_flip_interval, 0600),
+ INT_ENTRY("peer_abort", mars_peer_abort, 0600),
+ INT_ENTRY("client_abort", xio_client_abort, 0600),
+ INT_ENTRY("do_fast_fullsync", mars_fast_fullsync, 0600),
+ INT_ENTRY("logrot_auto_gb", global_logrot_auto, 0600),
+ INT_ENTRY("remaining_space_kb", global_remaining_space, 0400),
+ INT_ENTRY("required_total_space_0_gb", global_free_space_0, 0600),
+ INT_ENTRY("required_free_space_1_gb", global_free_space_1, 0600),
+ INT_ENTRY("required_free_space_2_gb", global_free_space_2, 0600),
+ INT_ENTRY("required_free_space_3_gb", global_free_space_3, 0600),
+ INT_ENTRY("required_free_space_4_gb", global_free_space_4, 0600),
+ INT_ENTRY("sync_want", global_sync_want, 0400),
+ INT_ENTRY("sync_nr", global_sync_nr, 0400),
+ INT_ENTRY("sync_limit", global_sync_limit, 0600),
+ INT_ENTRY("mars_emergency_mode", mars_emergency_mode, 0600),
+ INT_ENTRY("mars_reset_emergency", mars_reset_emergency, 0600),
+ INT_ENTRY("mars_keep_msg_s", mars_keep_msg, 0600),
+ INT_ENTRY("write_throttle_start_percent", xio_throttle_start, 0600),
+ INT_ENTRY("write_throttle_end_percent", xio_throttle_end, 0600),
+ INT_ENTRY("write_throttle_size_threshold_kb", if_throttle_start_size, 0400),
+ LIMITER_ENTRIES(&if_throttle, "write_throttle", "kb"),
+ /* changing makes no sense because the server will immediately start upon modprobe */
+ INT_ENTRY("xio_port", xio_net_default_port, 0400),
+#ifdef __HAVE_LZO
+ INT_ENTRY("network_compress_data", xio_net_compress_data, 0600),
+#endif
+ INT_ENTRY("net_bind_before_listen", xio_net_bind_before_listen, 0600),
+ INT_ENTRY("net_bind_before_connect", xio_net_bind_before_connect, 0600),
+ INT_ENTRY("network_io_timeout", global_net_io_timeout, 0600),
+ INT_ENTRY("parallel_connections", max_client_channels, 0600),
+ INT_ENTRY("parallel_bulk_feed", max_client_bulk, 0600),
+ {
+ _CTL_NAME
+ .procname = "traffic_tuning",
+ .mode = 0500,
+ .child = traffic_tuning_table,
+ },
+ {
+ _CTL_NAME
+ .procname = "io_tuning",
+ .mode = 0500,
+ .child = io_tuning_table,
+ },
+ {
+ _CTL_NAME
+ .procname = "tcp_tuning_repl",
+ .mode = 0500,
+ .child = repl_tuning_table,
+ },
+ {
+ _CTL_NAME
+ .procname = "tcp_tuning_device",
+ .mode = 0500,
+ .child = device_tuning_table,
+ },
+ {}
+};
+
+static
+struct ctl_table mars_root_table[] = {
+ {
+ _CTL_NAME
+ .procname = "mars",
+ .mode = 0500,
+ .child = mars_table,
+ },
+ {}
+};
+
+/***************** module init stuff ************************/
+
+static struct ctl_table_header *header;
+
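+/*
+ * Registering mars_root_table exposes the tables above under
+ * /proc/sys/mars/. Illustrative example (assuming the module is
+ * loaded): echo 1 > /proc/sys/mars/trigger
+ */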
+int __init init_xio_proc(void)
+{
+ XIO_INF("init_proc()\n");
+
+ header = register_sysctl_table(mars_root_table);
+
+ return 0;
+}
+
+void exit_xio_proc(void)
+{
+ XIO_INF("exit_proc()\n");
+ if (header) {
+ unregister_sysctl_table(header);
+ header = NULL;
+ }
+}
diff --git a/drivers/staging/mars/mars/mars_proc.h b/drivers/staging/mars/mars/mars_proc.h
new file mode 100644
index 000000000000..883b819eced1
--- /dev/null
+++ b/drivers/staging/mars/mars/mars_proc.h
@@ -0,0 +1,34 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef MARS_PROC_H
+#define MARS_PROC_H
+
+typedef char * (*xio_info_fn)(void);
+
+extern xio_info_fn xio_info;
+
+extern int min_free_kbytes;
+
+/***********************************************************************/
+
+/* init */
+
+extern int init_xio_proc(void);
+extern void exit_xio_proc(void);
+
+#endif
--
2.11.0

2016-12-30 23:04:52

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: [RFC 02/32] mars: add new module brick_say

Signed-off-by: Thomas Schoebel-Theuer <[email protected]>
---
drivers/staging/mars/brick_say.c | 920 +++++++++++++++++++++++++++++++++++++++
include/linux/brick/brick_say.h | 89 ++++
2 files changed, 1009 insertions(+)
create mode 100644 drivers/staging/mars/brick_say.c
create mode 100644 include/linux/brick/brick_say.h

diff --git a/drivers/staging/mars/brick_say.c b/drivers/staging/mars/brick_say.c
new file mode 100644
index 000000000000..f3bb49a0dfc3
--- /dev/null
+++ b/drivers/staging/mars/brick_say.c
@@ -0,0 +1,920 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <linux/brick/brick_say.h>
+#include <linux/brick/lamport.h>
+
+/*******************************************************************/
+
+/* messaging */
+
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/file.h>
+#include <linux/sched.h>
+#include <linux/preempt.h>
+#include <linux/hardirq.h>
+#include <linux/smp.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/syscalls.h>
+
+#include <linux/uaccess.h>
+
+#include <linux/brick/vfs_compat.h>
+
+#ifndef GFP_BRICK
+#define GFP_BRICK GFP_NOIO
+#endif
+
+#define SAY_ORDER 0
+#define SAY_BUFMAX (PAGE_SIZE << SAY_ORDER)
+#define SAY_BUF_LIMIT (SAY_BUFMAX - 1500)
+#define MAX_FILELEN 16
+#define MAX_IDS 1000
+
+const char *say_class[MAX_SAY_CLASS] = {
+ [SAY_DEBUG] = "debug",
+ [SAY_INFO] = "info",
+ [SAY_WARN] = "warn",
+ [SAY_ERROR] = "error",
+ [SAY_FATAL] = "fatal",
+ [SAY_TOTAL] = "total",
+};
+
+int brick_say_logging = 1;
+
+module_param_named(say_logging, brick_say_logging, int, 0);
+int brick_say_debug;
+
+module_param_named(say_debug, brick_say_debug, int, 0);
+
+int brick_say_syslog_min = 1;
+int brick_say_syslog_max = -1;
+int brick_say_syslog_flood_class = 3;
+int brick_say_syslog_flood_limit = 20;
+int brick_say_syslog_flood_recovery = 300;
+
+int delay_say_on_overflow =
+#ifdef CONFIG_MARS_DEBUG
+ 1;
+#else
+ 0;
+#endif
+
+static atomic_t say_alloc_channels = ATOMIC_INIT(0);
+static atomic_t say_alloc_names = ATOMIC_INIT(0);
+static atomic_t say_alloc_pages = ATOMIC_INIT(0);
+
+static unsigned long flood_start_jiffies;
+static int flood_count;
+
+struct say_channel {
+ char *ch_name;
+ struct say_channel *ch_next;
+
+ /* protect against concurrent writes */
+ spinlock_t ch_lock[MAX_SAY_CLASS];
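+ /* double buffers per class: [0] collects new text, [1] is being flushed */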
+ char *ch_buf[MAX_SAY_CLASS][2];
+
+ short ch_index[MAX_SAY_CLASS];
+ struct file *ch_filp[MAX_SAY_CLASS][2];
+ int ch_overflow[MAX_SAY_CLASS];
+ bool ch_written[MAX_SAY_CLASS];
+ bool ch_rollover;
+ bool ch_must_exist;
+ bool ch_is_dir;
+ bool ch_delete;
+ int ch_status_written;
+ int ch_id_max;
+ void *ch_ids[MAX_IDS];
+
+ wait_queue_head_t ch_progress;
+};
+
+struct say_channel *default_channel;
+
+static struct say_channel *channel_list;
+
+static rwlock_t say_lock = __RW_LOCK_UNLOCKED(say_lock);
+
+static struct task_struct *say_thread;
+
+static DECLARE_WAIT_QUEUE_HEAD(say_event);
+
+bool say_dirty;
+
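+/*
+ * True when sleeping is not allowed (atomic / interrupt context or
+ * disabled interrupts); used to choose GFP flags and to skip waiting.
+ */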
+#define use_atomic() \
+ ((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0 || irqs_disabled())
+
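+/* throttle writers on a nearly full buffer, but never sleep in atomic context */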
+static
+void wait_channel(struct say_channel *ch, int class)
+{
+ if (delay_say_on_overflow && ch->ch_index[class] > SAY_BUF_LIMIT) {
+ if (!use_atomic()) {
+ say_dirty = true;
+ wake_up_interruptible(&say_event);
+ wait_event_interruptible_timeout(
+ ch->ch_progress, ch->ch_index[class] < SAY_BUF_LIMIT, HZ / 10);
+ }
+ }
+}
+
+static
+struct say_channel *find_channel(const void *id)
+{
+ struct say_channel *res = default_channel;
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ int i;
+
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == id) {
+ res = ch;
+ goto found;
+ }
+ }
+ }
+found:
+ read_unlock(&say_lock);
+ return res;
+}
+
+static
+void _remove_binding(struct task_struct *whom)
+{
+ struct say_channel *ch;
+ int i;
+
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom)
+ ch->ch_ids[i] = NULL;
+ }
+ }
+}
+
+void bind_to_channel(struct say_channel *ch, struct task_struct *whom)
+{
+ int i;
+
+ write_lock(&say_lock);
+ _remove_binding(whom);
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (!ch->ch_ids[i]) {
+ ch->ch_ids[i] = whom;
+ goto done;
+ }
+ }
+ if (likely(ch->ch_id_max < MAX_IDS - 1))
+ ch->ch_ids[ch->ch_id_max++] = whom;
+ else
+ goto err;
+done:
+ write_unlock(&say_lock);
+ goto out_return;
+err:
+ write_unlock(&say_lock);
+
+ say_to(default_channel, SAY_ERROR, "ID overflow for thread '%s'\n", whom->comm);
+out_return:;
+}
+
+struct say_channel *get_binding(struct task_struct *whom)
+{
+ struct say_channel *ch;
+ int i;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom)
+ goto found;
+ }
+ }
+ ch = NULL;
+found:
+ read_unlock(&say_lock);
+ return ch;
+}
+
+void remove_binding_from(struct say_channel *ch, struct task_struct *whom)
+{
+ bool found = false;
+ int i;
+
+ write_lock(&say_lock);
+ for (i = 0; i < ch->ch_id_max; i++) {
+ if (ch->ch_ids[i] == whom) {
+ ch->ch_ids[i] = NULL;
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ _remove_binding(whom);
+ write_unlock(&say_lock);
+}
+
+void remove_binding(struct task_struct *whom)
+{
+ write_lock(&say_lock);
+ _remove_binding(whom);
+ write_unlock(&say_lock);
+}
+
+void rollover_channel(struct say_channel *ch)
+{
+ if (!ch)
+ ch = find_channel(current);
+ if (likely(ch))
+ ch->ch_rollover = true;
+}
+
+void rollover_all(void)
+{
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next)
+ ch->ch_rollover = true;
+ read_unlock(&say_lock);
+}
+
+void del_channel(struct say_channel *ch)
+{
+ if (unlikely(!ch))
+ goto out_return;
+ if (unlikely(ch == default_channel)) {
+ say_to(default_channel, SAY_ERROR, "thread '%s' tried to delete the default channel\n", current->comm);
+ goto out_return;
+ }
+
+ ch->ch_delete = true;
+out_return:;
+}
+
+static
+void _del_channel(struct say_channel *ch)
+{
+ struct say_channel *tmp;
+ struct say_channel **_tmp;
+ int i, j;
+
+ if (!ch)
+ goto out_return;
+ write_lock(&say_lock);
+ for (_tmp = &channel_list; (tmp = *_tmp) != NULL; _tmp = &tmp->ch_next) {
+ if (tmp == ch) {
+ *_tmp = tmp->ch_next;
+ break;
+ }
+ }
+ write_unlock(&say_lock);
+
+ for (i = 0; i < MAX_SAY_CLASS; i++) {
+ for (j = 0; j < 2; j++) {
+ if (ch->ch_filp[i][j]) {
+ filp_close(ch->ch_filp[i][j], NULL);
+ ch->ch_filp[i][j] = NULL;
+ }
+ }
+ for (j = 0; j < 2; j++) {
+ char *buf = ch->ch_buf[i][j];
+
+ if (buf) {
+ __free_pages(virt_to_page((unsigned long)buf), SAY_ORDER);
+ atomic_dec(&say_alloc_pages);
+ }
+ }
+ }
+ if (ch->ch_name) {
+ atomic_dec(&say_alloc_names);
+ kfree(ch->ch_name);
+ }
+ kfree(ch);
+ atomic_dec(&say_alloc_channels);
+out_return:;
+}
+
+static
+struct say_channel *_make_channel(const char *name, bool must_exist)
+{
+ struct say_channel *res = NULL;
+ struct kstat kstat = {};
+ int i, j;
+ unsigned long mode = use_atomic() ? GFP_ATOMIC : GFP_BRICK;
+
+ mm_segment_t oldfs;
+ bool is_dir = false;
+ int status;
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+ status = vfs_stat((char *)name, &kstat);
+ set_fs(oldfs);
+
+ if (unlikely(status < 0)) {
+ if (must_exist) {
+ say(SAY_ERROR, "cannot create channel '%s', status = %d\n", name, status);
+ goto done;
+ }
+ } else {
+ is_dir = S_ISDIR(kstat.mode);
+ }
+
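+ /* channel creation must not fail: busy-retry all allocations */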
+restart:
+ res = kzalloc(sizeof(*res), mode);
+ if (unlikely(!res)) {
+ schedule();
+ goto restart;
+ }
+ atomic_inc(&say_alloc_channels);
+ res->ch_must_exist = must_exist;
+ res->ch_is_dir = is_dir;
+ init_waitqueue_head(&res->ch_progress);
+restart2:
+ res->ch_name = kstrdup(name, mode);
+ if (unlikely(!res->ch_name)) {
+ schedule();
+ goto restart2;
+ }
+ atomic_inc(&say_alloc_names);
+ for (i = 0; i < MAX_SAY_CLASS; i++) {
+ spin_lock_init(&res->ch_lock[i]);
+ for (j = 0; j < 2; j++) {
+ char *buf;
+
+restart3:
+ buf = (void *)__get_free_pages(mode, SAY_ORDER);
+ if (unlikely(!buf)) {
+ schedule();
+ goto restart3;
+ }
+ atomic_inc(&say_alloc_pages);
+ res->ch_buf[i][j] = buf;
+ }
+ }
+done:
+ return res;
+}
+
+struct say_channel *make_channel(const char *name, bool must_exist)
+{
+ struct say_channel *res = NULL;
+ struct say_channel *ch;
+
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (!strcmp(ch->ch_name, name)) {
+ res = ch;
+ break;
+ }
+ }
+ read_unlock(&say_lock);
+
+ if (unlikely(!res)) {
+ res = _make_channel(name, must_exist);
+ if (unlikely(!res))
+ goto done;
+
+ write_lock(&say_lock);
+
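+ /* re-check under the write lock: the same channel may have been
+ * created concurrently while the read lock was dropped
+ */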
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (ch != res && unlikely(!strcmp(ch->ch_name, name))) {
+ _del_channel(res);
+ res = ch;
+ goto race_found;
+ }
+ }
+
+ res->ch_next = channel_list;
+ channel_list = res;
+
+race_found:
+ write_unlock(&say_lock);
+ }
+
+done:
+ return res;
+}
+
+/* tell gcc to check for varargs errors */
+static
+void _say(struct say_channel *ch, int class, va_list args, bool use_args, const char *fmt, ...) __printf(5, 6);
+
+static
+void _say(struct say_channel *ch, int class, va_list args, bool use_args, const char *fmt, ...)
+{
+ char *start;
+ int offset;
+ int rest;
+ int written;
+
+ if (unlikely(!ch))
+ goto out_return;
+ if (unlikely(ch->ch_delete && ch != default_channel)) {
+ say_to(default_channel, SAY_ERROR, "thread '%s' tried to write on deleted channel\n", current->comm);
+ goto out_return;
+ }
+
+ offset = ch->ch_index[class];
+ start = ch->ch_buf[class][0] + offset;
+ rest = SAY_BUFMAX - 1 - offset;
+ if (unlikely(rest <= 0)) {
+ ch->ch_overflow[class]++;
+ goto out_return;
+ }
+
+ if (use_args) {
+ va_list args2;
+
+ va_start(args2, fmt);
+ written = vscnprintf(start, rest, fmt, args2);
+ va_end(args2);
+ } else {
+ written = vscnprintf(start, rest, fmt, args);
+ }
+
+ if (likely(rest > written)) {
+ start[written] = '\0';
+ ch->ch_index[class] += written;
+ say_dirty = true;
+ } else {
+ /* indicate overflow */
+ start[0] = '\0';
+ ch->ch_overflow[class]++;
+ }
+out_return:;
+}
+
+void say_to(struct say_channel *ch, int class, const char *fmt, ...)
+{
+ va_list args;
+ unsigned long flags;
+
+ if (!class && !brick_say_debug)
+ goto out_return;
+ if (!ch)
+ ch = find_channel(current);
+
+ if (likely(ch)) {
+ if (!ch->ch_is_dir)
+ class = SAY_TOTAL;
+ if (likely(class >= 0 && class < MAX_SAY_CLASS)) {
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+ }
+ }
+
+ ch = default_channel;
+ if (likely(ch)) {
+ class = SAY_TOTAL;
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+
+ wake_up_interruptible(&say_event);
+ }
+out_return:;
+}
+
+void brick_say_to(struct say_channel *ch, int class, bool dump,
+ const char *prefix, const char *file, int line,
+ const char *func, const char *fmt, ...)
+{
+ const char *channel_name = "-";
+ struct timespec s_now;
+ struct timespec l_now;
+ int filelen;
+ int orig_class;
+ va_list args;
+ unsigned long flags;
+
+ if (!class && !brick_say_debug)
+ goto out_return;
+ s_now = CURRENT_TIME;
+ get_lamport(&l_now);
+
+ if (!ch)
+ ch = find_channel(current);
+
+ orig_class = class;
+
+ /* limit the filename */
+ filelen = strlen(file);
+ if (filelen > MAX_FILELEN)
+ file += filelen - MAX_FILELEN;
+
+ if (likely(ch)) {
+ channel_name = ch->ch_name;
+ if (!ch->ch_is_dir)
+ class = SAY_TOTAL;
+ if (likely(class >= 0 && class < MAX_SAY_CLASS)) {
+ wait_channel(ch, class);
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ _say(
+ ch, class, NULL, true,
+ "%ld.%09ld %ld.%09ld %s %s[%d] %s:%d %s(): ",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ prefix,
+ current->comm, (int)smp_processor_id(),
+ file, line,
+ func);
+
+ va_start(args, fmt);
+ _say(ch, class, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+ }
+ }
+
+ ch = default_channel;
+ if (likely(ch)) {
+ wait_channel(ch, SAY_TOTAL);
+ spin_lock_irqsave(&ch->ch_lock[SAY_TOTAL], flags);
+
+ _say(
+ ch, SAY_TOTAL, NULL, true,
+ "%ld.%09ld %ld.%09ld %s_%-5s %s %s[%d] %s:%d %s(): ",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ prefix, say_class[orig_class],
+ channel_name,
+ current->comm, (int)smp_processor_id(),
+ file, line,
+ func);
+
+ va_start(args, fmt);
+ _say(ch, SAY_TOTAL, args, false, fmt);
+ va_end(args);
+
+ spin_unlock_irqrestore(&ch->ch_lock[SAY_TOTAL], flags);
+ }
+#ifdef CONFIG_MARS_DEBUG
+ if (dump)
+ brick_dump_stack();
+#endif
+ wake_up_interruptible(&say_event);
+out_return:;
+}
+
+static
+void try_open_file(struct file **file, char *filename, bool creat)
+{
+ struct address_space *mapping;
+ int flags = O_APPEND | O_WRONLY | O_LARGEFILE;
+ int prot = 0600;
+
+ if (creat)
+ flags |= O_CREAT;
+
+ *file = filp_open(filename, flags, prot);
+ if (unlikely(IS_ERR(*file))) {
+ *file = NULL;
+ goto out_return;
+ }
+ mapping = (*file)->f_mapping;
+ if (likely(mapping))
+ mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~(__GFP_IO | __GFP_FS));
+out_return:;
+}
+
+static
+void out_to_file(struct file *file, char *buf, int len)
+{
+ loff_t log_pos = 0;
+
+ mm_segment_t oldfs;
+
+ if (file) {
+ oldfs = get_fs();
+ set_fs(get_ds());
+ (void)vfs_write(file, buf, len, &log_pos);
+ set_fs(oldfs);
+ }
+}
+
+static inline
+void reset_flood(void)
+{
+ if (flood_start_jiffies &&
+ time_is_before_jiffies(flood_start_jiffies + brick_say_syslog_flood_recovery * HZ)) {
+ flood_start_jiffies = 0;
+ flood_count = 0;
+ }
+}
+
+static
+void printk_with_class(int class, char *buf)
+{
+ switch (class) {
+ case SAY_INFO:
+ printk(KERN_INFO "%s", buf);
+ break;
+ case SAY_WARN:
+ printk(KERN_WARNING "%s", buf);
+ break;
+ case SAY_ERROR:
+ case SAY_FATAL:
+ printk(KERN_ERR "%s", buf);
+ break;
+ default:
+ printk(KERN_DEBUG "%s", buf);
+ }
+}
+
+static
+void out_to_syslog(int class, char *buf, int len)
+{
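+ /*
+ * Flood protection: messages outside the [syslog_min, syslog_max]
+ * range but at least syslog_flood_class are passed to printk() at
+ * most syslog_flood_limit times; the mute is lifted after
+ * syslog_flood_recovery seconds without further flood messages.
+ */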
+ reset_flood();
+ if (class >= brick_say_syslog_min && class <= brick_say_syslog_max) {
+ buf[len] = '\0';
+ printk_with_class(class, buf);
+ } else if (class >= brick_say_syslog_flood_class && brick_say_syslog_flood_class >= 0 && class != SAY_TOTAL) {
+ flood_start_jiffies = jiffies;
+ if (++flood_count <= brick_say_syslog_flood_limit) {
+ buf[len] = '\0';
+ printk_with_class(class, buf);
+ }
+ }
+}
+
+static inline
+char *_make_filename(struct say_channel *ch, int class, int transact, int add_tmp)
+{
+ char *filename;
+
+restart:
+ filename = kmalloc(1024, GFP_KERNEL);
+ if (unlikely(!filename)) {
+ schedule();
+ goto restart;
+ }
+ atomic_inc(&say_alloc_names);
+ if (ch->ch_is_dir) {
+ snprintf(
+ filename,
+ 1023,
+ "%s/%d.%s.%s%s",
+ ch->ch_name,
+ class,
+ say_class[class],
+ transact ? "status" : "log",
+ add_tmp ? ".tmp" : "");
+ } else {
+ snprintf(filename, 1023, "%s.%s%s", ch->ch_name, transact ? "status" : "log", add_tmp ? ".tmp" : "");
+ }
+ return filename;
+}
+
+static
+void _rollover_channel(struct say_channel *ch)
+{
+ int start = 0;
+ int class;
+
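+ /* close the open status files and rename *.status.tmp to *.status */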
+ ch->ch_rollover = false;
+ ch->ch_status_written = 0;
+
+ if (!ch->ch_is_dir)
+ start = SAY_TOTAL;
+
+ for (class = start; class < MAX_SAY_CLASS; class++) {
+ char *old = _make_filename(ch, class, 1, 1);
+ char *new = _make_filename(ch, class, 1, 0);
+
+ if (likely(old && new)) {
+ int i;
+
+ mm_segment_t oldfs;
+
+ for (i = 0; i < 2; i++) {
+ if (ch->ch_filp[class][i]) {
+ filp_close(ch->ch_filp[class][i], NULL);
+ ch->ch_filp[class][i] = NULL;
+ }
+ }
+
+ oldfs = get_fs();
+ set_fs(get_ds());
+#ifdef __USE_COMPAT
+ _compat_rename(old, new);
+#else
+ sys_rename(old, new);
+#endif
+ set_fs(oldfs);
+ }
+
+ if (likely(old)) {
+ kfree(old);
+ atomic_dec(&say_alloc_names);
+ }
+ if (likely(new)) {
+ kfree(new);
+ atomic_dec(&say_alloc_names);
+ }
+ }
+}
+
+static
+void treat_channel(struct say_channel *ch, int class)
+{
+ int len;
+ int overflow;
+ int transact;
+ int start;
+ char *buf;
+ char *tmp;
+ unsigned long flags;
+
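+ /* swap the double buffers under the lock so that writers can keep
+ * logging into [0] while the snapshot is written out
+ */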
+ spin_lock_irqsave(&ch->ch_lock[class], flags);
+
+ buf = ch->ch_buf[class][0];
+ tmp = ch->ch_buf[class][1];
+ ch->ch_buf[class][1] = buf;
+ ch->ch_buf[class][0] = tmp;
+ len = ch->ch_index[class];
+ ch->ch_index[class] = 0;
+ overflow = ch->ch_overflow[class];
+ ch->ch_overflow[class] = 0;
+
+ spin_unlock_irqrestore(&ch->ch_lock[class], flags);
+
+ wake_up_interruptible(&ch->ch_progress);
+
+ ch->ch_status_written += len;
+ out_to_syslog(class, buf, len);
+ start = 0;
+ if (!brick_say_logging)
+ start++;
+ for (transact = start; transact < 2; transact++) {
+ if (unlikely(!ch->ch_filp[class][transact])) {
+ char *filename = _make_filename(ch, class, transact, transact);
+
+ if (likely(filename)) {
+ try_open_file(&ch->ch_filp[class][transact], filename, transact);
+ kfree(filename);
+ atomic_dec(&say_alloc_names);
+ }
+ }
+ out_to_file(ch->ch_filp[class][transact], buf, len);
+ }
+
+ if (unlikely(overflow > 0)) {
+ struct timespec s_now = CURRENT_TIME;
+ struct timespec l_now;
+
+ get_lamport(&l_now);
+ len = scnprintf(
+ buf,
+ SAY_BUFMAX,
+ "%ld.%09ld %ld.%09ld %s %d OVERFLOW %d times\n",
+ s_now.tv_sec, s_now.tv_nsec,
+ l_now.tv_sec, l_now.tv_nsec,
+ ch->ch_name,
+ class,
+ overflow);
+ ch->ch_status_written += len;
+ out_to_syslog(class, buf, len);
+ for (transact = 0; transact < 2; transact++)
+ out_to_file(ch->ch_filp[class][transact], buf, len);
+ }
+}
+
+static
+int _say_thread(void *data)
+{
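+ /*
+ * Writeback daemon: wait for work, then flush any pending channel
+ * buffers. Each scan restarts from the head of the list because the
+ * lock has to be dropped for the actual IO.
+ */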
+ while (!kthread_should_stop()) {
+ struct say_channel *ch;
+ int i;
+
+ wait_event_interruptible_timeout(say_event, say_dirty, HZ);
+ say_dirty = false;
+
+restart_rollover:
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ if (ch->ch_rollover && ch->ch_status_written > 0) {
+ read_unlock(&say_lock);
+ _rollover_channel(ch);
+ goto restart_rollover;
+ }
+ }
+ read_unlock(&say_lock);
+
+restart:
+ read_lock(&say_lock);
+ for (ch = channel_list; ch; ch = ch->ch_next) {
+ int start = 0;
+
+ if (!ch->ch_is_dir)
+ start = SAY_TOTAL;
+ for (i = start; i < MAX_SAY_CLASS; i++) {
+ if (ch->ch_index[i] > 0) {
+ read_unlock(&say_lock);
+ treat_channel(ch, i);
+ goto restart;
+ }
+ }
+ if (ch->ch_delete) {
+ read_unlock(&say_lock);
+ _del_channel(ch);
+ goto restart;
+ }
+ }
+ read_unlock(&say_lock);
+ }
+
+ return 0;
+}
+
+void init_say(void)
+{
+ default_channel = make_channel(CONFIG_MARS_LOGDIR, true);
+ say_thread = kthread_create(_say_thread, NULL, "brick_say");
+ if (IS_ERR(say_thread)) {
+ say_thread = NULL;
+ } else {
+ get_task_struct(say_thread);
+ wake_up_process(say_thread);
+ }
+}
+
+void exit_say(void)
+{
+ int memleak_channels;
+ int memleak_names;
+ int memleak_pages;
+
+ if (say_thread) {
+ kthread_stop(say_thread);
+ put_task_struct(say_thread);
+ say_thread = NULL;
+ }
+
+ default_channel = NULL;
+ while (channel_list)
+ _del_channel(channel_list);
+
+ memleak_channels = atomic_read(&say_alloc_channels);
+ memleak_names = atomic_read(&say_alloc_names);
+ memleak_pages = atomic_read(&say_alloc_pages);
+ if (unlikely(memleak_channels || memleak_names || memleak_pages))
+ printk("MEMLEAK: channels=%d names=%d pages=%d\n", memleak_channels, memleak_names, memleak_pages);
+}
+
+#ifdef CONFIG_MARS_DEBUG
+
+static int dump_max = 5;
+
+void brick_dump_stack(void)
+{
+ if (dump_max > 0) {
+ dump_max--; /* racy, but does no harm */
+ dump_stack();
+ }
+}
+
+#endif
diff --git a/include/linux/brick/brick_say.h b/include/linux/brick/brick_say.h
new file mode 100644
index 000000000000..13a28c80081f
--- /dev/null
+++ b/include/linux/brick/brick_say.h
@@ -0,0 +1,89 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_SAY_H
+#define BRICK_SAY_H
+
+/***********************************************************************/
+
+extern int brick_say_logging;
+extern int brick_say_debug;
+extern int brick_say_syslog_min;
+extern int brick_say_syslog_max;
+extern int brick_say_syslog_flood_class;
+extern int brick_say_syslog_flood_limit;
+extern int brick_say_syslog_flood_recovery;
+extern int delay_say_on_overflow;
+
+/* printk() replacements */
+
+enum {
+ SAY_DEBUG,
+ SAY_INFO,
+ SAY_WARN,
+ SAY_ERROR,
+ SAY_FATAL,
+ SAY_TOTAL,
+ MAX_SAY_CLASS
+};
+
+extern const char *say_class[MAX_SAY_CLASS];
+
+struct say_channel;
+
+extern struct say_channel *default_channel;
+
+extern struct say_channel *make_channel(const char *name, bool must_exist);
+
+extern void del_channel(struct say_channel *ch);
+
+extern void bind_to_channel(struct say_channel *ch, struct task_struct *whom);
+
+#define bind_me(_name) \
+ bind_to_channel(make_channel(_name, true), current)
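+
+/*
+ * Hypothetical example: direct this thread's messages to a dedicated
+ * channel (the name is a path, here an assumed per-resource log dir):
+ *
+ *   bind_me("/mars/resource-mydata");
+ */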
+
+extern struct say_channel *get_binding(struct task_struct *whom);
+
+extern void remove_binding_from(struct say_channel *ch, struct task_struct *whom);
+extern void remove_binding(struct task_struct *whom);
+
+extern void rollover_channel(struct say_channel *ch);
+extern void rollover_all(void);
+
+extern void say_to(struct say_channel *ch, int class, const char *fmt, ...) __printf(3, 4);
+
+#define say(_class, _fmt, _args...) \
+ say_to(NULL, _class, _fmt, ##_args)
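+
+/*
+ * Illustrative call (hypothetical message and arguments):
+ *
+ *   say(SAY_INFO, "resource %s is now primary\n", res_name);
+ */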
+
+extern void brick_say_to(struct say_channel *ch, int class, bool dump,
+ const char *prefix, const char *file, int line,
+ const char *func, const char *fmt, ...) __printf(8, 9);
+
+#define brick_say(_class, _dump, _prefix, _file, _line, _func, _fmt, _args...)\
+ brick_say_to(NULL, _class, _dump, _prefix, _file, _line, _func, _fmt, ##_args)
+
+extern void init_say(void);
+extern void exit_say(void);
+
+#ifdef CONFIG_MARS_DEBUG
+extern void brick_dump_stack(void);
+#else /* CONFIG_MARS_DEBUG */
+#define brick_dump_stack() /*empty*/
+#endif /* CONFIG_MARS_DEBUG */
+
+#endif
--
2.11.0

2016-12-31 06:47:08

by Thomas Schoebel-Theuer

[permalink] [raw]
Subject: Re: [RFC 00/32] State of MARS Geo-Redundancy Module

Typo correction:

On 12/30/2016 11:57 PM, Thomas Schoebel-Theuer wrote:
> standalone servers with local hardware RAIDs. They are hosting about
> 500 MARS resources (originally DRBD resources) just for the web servers;

This must read 2500. Somehow the leading "2" was eaten by the line wrap.