From: Thomas Schoebel-Theuer
To: linux-kernel@vger.kernel.org, tst@schoebel-theuer.de
Subject: [RFC 00/32] State of MARS Geo-Redundancy Module
Date: Fri, 30 Dec 2016 23:57:26 +0100

Hi all,

here is my traditional annual status report on the development of MARS [1].

In the meantime, the out-of-tree MARS has replaced DRBD as the backbone of the 1&1 geo-redundancy feature, as publicly advertised for 1&1 Shared Hosting Linux (ShaHoLin). MARS is also running on several other 1&1 clusters, and some people elsewhere in the world have seemingly started to use it as well.

At 1&1, MARS is now running on more than 2000 servers holding more than 2 * 8 petabytes of data, and it has accumulated more than 20 million operating hours.

The slides [1] explain why the sharding architecture supported by MARS has no problem scaling to such numbers, while some other non-MARS, non-blocklevel 1&1 clusters (architecturally called "Big Clusters" in the slides, although they hold less than 1 PB of data) have seemingly reached their _practical_ scaling limits at their practical dimensioning and practical workload, although they were originally advertised as scaling almost "unlimited" in theory, and some people seem to have _believed_ this in the past. Some of the reasons for the massive differences in scalability are explained in the slides; some more explanations will hopefully follow in 2017.

During 2016, I published several bugfix releases for the stable branch, as well as some portability improvements for newer kernels, in the out-of-tree (OOT) version of MARS at the github repo [2]. There is also a prototype of a prepatch-less WIP-compatibility branch which is not yet merged into master.

Some minor developments have also started: there is a lab prototype for md5 checksumming of 4k blocks on the underlying disk devices. This was motivated by the observation that most operational incidents are due to hardware defects, and we want to catch them as early as possible. Conversely, this also implies that MARS is considered more stable than the hardware, but of course this is _expected_ from an HA solution ;)
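For illustration only, here is a minimal sketch of what such a per-block digest could look like on top of the kernel crypto shash API. This is NOT the actual lab prototype; the function name, the fixed 4k granularity, and the per-call tfm allocation are assumptions made for the example:

	#include <linux/types.h>
	#include <linux/err.h>
	#include <crypto/hash.h>

	#define CHECK_BLOCKSIZE 4096	/* assumed checksumming granularity */

	/* Hypothetical example: digest one 4k block.
	 * A real implementation would allocate the tfm once and reuse it
	 * in the IO path instead of allocating per call.
	 */
	static int example_md5_block(const void *block, u8 digest[16])
	{
		struct crypto_shash *tfm;
		int status;

		tfm = crypto_alloc_shash("md5", 0, 0);
		if (IS_ERR(tfm))
			return PTR_ERR(tfm);

		{
			SHASH_DESC_ON_STACK(desc, tfm);

			desc->tfm = tfm;
			desc->flags = 0;	/* field present in kernels of this era */
			status = crypto_shash_digest(desc, block,
						     CHECK_BLOCKSIZE, digest);
		}

		crypto_free_shash(tfm);
		return status;
	}

The stored digests would then presumably be verified on each read, so that silent corruption from defective hardware is detected before it can propagate to the replicas.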
In January 2016, I took a few holidays for improving the upstream version of MARS a little bit, mainly by fixing some checkpatch issues. After that, I had to do much other work at 1&1, so unfortunately the development of this part got stuck at that point. Sorry. The attached code is more or less just for your information that this part is not dead, and that development will continue.

This autumn, I got a new boss and some new objectives for 2017. One of the new objectives will involve MARS.

The current ShaHoLin sharding architecture has been directly migrated from the former DRBD hardware setup to MARS: it just consists of about 1000 _pairs_ of hard-iron iSCSI storage servers and some hard-iron standalone servers with local hardware RAIDs. They are hosting about 500 MARS resources (originally DRBD resources) just for the web servers; there are even more resources at the (already virtualized) database servers. The former should be virtualized during 2017 to reduce the server iron, likely using LXC and/or KVM.

The resulting future system should increase flexibility through MARS resource data migration across the _whole_ pool. This means that MARS will give up the traditional DRBD-like pairing in favour of a new feature: treating all of the existing storage like one big "virtual LVM-like storage pool".

Notice that MARS's internal _architecture_ can already do this: it allows for k > 2 replicas, and it already has dynamic join-cluster, join-resource and leave-resource operations which can easily be used for runtime data migration during operation (while resources are loaded), even for very big resources. After adding a new operation "merge-cluster", which checks that all the resource names are disjoint, this _would_ even work with the current version of MARS, at least in theory; the core of that check is sketched below.
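To make the merge-cluster precondition concrete: before uniting two clusters, the resource name sets of both sides must be pairwise disjoint, otherwise the merge must be refused. A minimal userspace sketch of that check (merge-cluster does not exist yet; the function and the example names are hypothetical, and the real operation would read the names from cluster metadata):

	#include <stdio.h>
	#include <string.h>

	/* Refuse the merge if any resource name occurs in both clusters. */
	static int resources_disjoint(const char **a, int na,
				      const char **b, int nb)
	{
		int i, j;

		for (i = 0; i < na; i++)
			for (j = 0; j < nb; j++)
				if (!strcmp(a[i], b[j]))
					return 0;	/* name clash */
		return 1;
	}

	int main(void)
	{
		const char *cluster_a[] = { "web001", "web002" };
		const char *cluster_b[] = { "db001", "web002" };

		if (!resources_disjoint(cluster_a, 2, cluster_b, 2))
			printf("merge-cluster refused: duplicate resource name\n");
		return 0;
	}

For the resource counts mentioned above (hundreds per cluster), the naive O(na * nb) comparison is perfectly adequate; sorting the name sets first would bring it down to O(n log n) if that ever mattered.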
However, the current limitation is in the internal _metadata_ updates (not in the IO data paths): currently, all cluster members exchange all _metadata_ with all other nodes, leading to O(n^2) _metadata_ communication; for example, a full mesh of ~2000 nodes would already mean roughly 4 million directed peer relations. A future version of MARS will reduce this to the necessary scale (exchange only among the nodes participating in a given resource), and it will communicate the remaining metadata less frequently and no longer full-mesh.

As a side note: hopefully I will also get the necessary time for replacing the current symlink tree with metadata files, which should further improve the metadata scaling properties.

The goal is a very low number of MARS clusters (one for the US, one for the EU, and both probably split into web hosting versus databases), each consisting of several thousands of nodes. The realtime-critical data IO paths will remain at the sharding principle, retaining excellent scalability. Only the _metadata_ updates will follow the "big cluster" architectural approach, and only as far as necessary; they are not time-critical anyway.

As always, regarding the opensource community part of my work: it would be nice if some other kernel hackers would start joining the MARS development in 2017, at least to help me get it upstream. I would be excited to be invited to the next kernel summit or a similar meeting.

A happy new year from your devoted
Thomas

[1] https://github.com/schoebel/mars/blob/master/docu/MARS_GUUG2016.pdf
[2] https://github.com/schoebel/mars

Thomas Schoebel-Theuer (32):
  mars: add new module lamport
  mars: add new module brick_say
  mars: add new module brick_mem
  mars: add new module brick_checking
  mars: add new module meta
  mars: add new module brick
  mars: add new module lib_pairing_heap
  mars: add new module lib_queue
  mars: add new module lib_rank
  mars: add new module lib_limiter
  mars: add new module lib_timing
  mars: add new module vfs_compat
  mars: add new module xio
  mars: add new module xio_net
  mars: add new module lib_mapfree
  mars: add new module lib_log
  mars: add new module xio_bio
  mars: add new module xio_sio
  mars: add new module xio_client
  mars: add new module xio_if
  mars: add new module xio_copy
  mars: add new module xio_trans_logger
  mars: add new module xio_server
  mars: add new module strategy
  mars: add new module main_strategy
  mars: add new module net
  mars: add new module server_strategy
  mars: add new module mars_proc
  mars: add new module mars_main
  mars: add new module Makefile
  mars: add new module Kconfig
  mars: activate build

 drivers/staging/Kconfig                            |    2 +
 drivers/staging/Makefile                           |    1 +
 drivers/staging/mars/Kconfig                       |  266 +
 drivers/staging/mars/Makefile                      |   96 +
 drivers/staging/mars/brick.c                       |  723 +++
 drivers/staging/mars/brick_mem.c                   | 1080 ++++
 drivers/staging/mars/brick_say.c                   |  920 +++
 drivers/staging/mars/lamport.c                     |   61 +
 drivers/staging/mars/lib/lib_limiter.c             |  163 +
 drivers/staging/mars/lib/lib_rank.c                |   87 +
 drivers/staging/mars/lib/lib_timing.c              |   68 +
 drivers/staging/mars/mars/main_strategy.c          | 2135 +++++++
 drivers/staging/mars/mars/mars_main.c              | 6160 ++++++++++++++++++++++
 drivers/staging/mars/mars/mars_proc.c              |  389 ++
 drivers/staging/mars/mars/mars_proc.h              |   34 +
 drivers/staging/mars/mars/net.c                    |  109 +
 drivers/staging/mars/mars/server_strategy.c        |  436 ++
 drivers/staging/mars/mars/strategy.h               |  239 +
 drivers/staging/mars/xio_bricks/lib_log.c          |  506 ++
 drivers/staging/mars/xio_bricks/lib_mapfree.c      |  382 ++
 drivers/staging/mars/xio_bricks/xio.c              |  227 +
 drivers/staging/mars/xio_bricks/xio_bio.c          |  845 +++
 drivers/staging/mars/xio_bricks/xio_client.c       | 1083 ++++
 drivers/staging/mars/xio_bricks/xio_copy.c         | 1005 ++++
 drivers/staging/mars/xio_bricks/xio_if.c           |  892 +++
 drivers/staging/mars/xio_bricks/xio_net.c          | 1849 ++++++
 drivers/staging/mars/xio_bricks/xio_server.c       |  493 ++
 drivers/staging/mars/xio_bricks/xio_sio.c          |  578 ++
 drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 +++++++++++
 include/linux/brick/brick.h                        |  620 ++
 include/linux/brick/brick_checking.h               |  107 +
 include/linux/brick/brick_mem.h                    |  218 +
 include/linux/brick/brick_say.h                    |   89 +
 include/linux/brick/lamport.h                      |   26 +
 include/linux/brick/lib_limiter.h                  |   52 +
 include/linux/brick/lib_pairing_heap.h             |  109 +
 include/linux/brick/lib_queue.h                    |  165 +
 include/linux/brick/lib_rank.h                     |  136 +
 include/linux/brick/lib_timing.h                   |  182 +
 include/linux/brick/meta.h                         |  106 +
 include/linux/brick/vfs_compat.h                   |   48 +
 include/linux/xio/lib_log.h                        |  333 ++
 include/linux/xio/lib_mapfree.h                    |   84 +
 include/linux/xio/xio.h                            |  319 +
 include/linux/xio/xio_bio.h                        |   85 +
 include/linux/xio/xio_client.h                     |  105 +
 include/linux/xio/xio_copy.h                       |  115 +
 include/linux/xio/xio_if.h                         |  109 +
 include/linux/xio/xio_net.h                        |  177 +
 include/linux/xio/xio_server.h                     |   91 +
 include/linux/xio/xio_sio.h                        |   68 +
 include/linux/xio/xio_trans_logger.h               |  271 +
 52 files changed, 27854 insertions(+)
 create mode 100644 drivers/staging/mars/Kconfig
 create mode 100644 drivers/staging/mars/Makefile
 create mode 100644 drivers/staging/mars/brick.c
 create mode 100644 drivers/staging/mars/brick_mem.c
 create mode 100644 drivers/staging/mars/brick_say.c
 create mode 100644 drivers/staging/mars/lamport.c
 create mode 100644 drivers/staging/mars/lib/lib_limiter.c
 create mode 100644 drivers/staging/mars/lib/lib_rank.c
 create mode 100644 drivers/staging/mars/lib/lib_timing.c
 create mode 100644 drivers/staging/mars/mars/main_strategy.c
 create mode 100644 drivers/staging/mars/mars/mars_main.c
 create mode 100644 drivers/staging/mars/mars/mars_proc.c
 create mode 100644 drivers/staging/mars/mars/mars_proc.h
 create mode 100644 drivers/staging/mars/mars/net.c
 create mode 100644 drivers/staging/mars/mars/server_strategy.c
 create mode 100644 drivers/staging/mars/mars/strategy.h
 create mode 100644 drivers/staging/mars/xio_bricks/lib_log.c
 create mode 100644 drivers/staging/mars/xio_bricks/lib_mapfree.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_client.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_copy.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c
 create mode 100644 drivers/staging/mars/xio_bricks/xio_trans_logger.c
 create mode 100644 include/linux/brick/brick.h
 create mode 100644 include/linux/brick/brick_checking.h
 create mode 100644 include/linux/brick/brick_mem.h
 create mode 100644 include/linux/brick/brick_say.h
 create mode 100644 include/linux/brick/lamport.h
 create mode 100644 include/linux/brick/lib_limiter.h
 create mode 100644 include/linux/brick/lib_pairing_heap.h
 create mode 100644 include/linux/brick/lib_queue.h
 create mode 100644 include/linux/brick/lib_rank.h
 create mode 100644 include/linux/brick/lib_timing.h
 create mode 100644 include/linux/brick/meta.h
 create mode 100644 include/linux/brick/vfs_compat.h
 create mode 100644 include/linux/xio/lib_log.h
 create mode 100644 include/linux/xio/lib_mapfree.h
 create mode 100644 include/linux/xio/xio.h
 create mode 100644 include/linux/xio/xio_bio.h
 create mode 100644 include/linux/xio/xio_client.h
 create mode 100644 include/linux/xio/xio_copy.h
 create mode 100644 include/linux/xio/xio_if.h
 create mode 100644 include/linux/xio/xio_net.h
 create mode 100644 include/linux/xio/xio_server.h
 create mode 100644 include/linux/xio/xio_sio.h
 create mode 100644 include/linux/xio/xio_trans_logger.h

-- 
2.11.0