2014-07-04 11:07:34

by Thomas Schöbel-Theuer

Subject: Selling Points for MARS Light

Hi all,

since not everyone has attended my presentations on MARS Light, and
since probably not everyone is willing to click on links such as the
presentation slides
https://github.com/schoebel/mars/blob/master/docu/MARS_LinuxTag2014.pdf?raw=true
, I will try to explain some things that go beyond the presentations,
but nevertheless I don't want to repeat too much from the slides.

So the following is hopefully also interesting for people who attended
the presentations.

My main selling point:

To my knowledge, MARS Light is the only enterprise-grade opensource
solution constructed for mass replication of whole datacenters
(thousands to tens of thousands of instances) over _long_ _distances_,
aka _geo-redundancy_.

Let's take a guided tour through the important keywords in this sentence.

1) geo-redundancy

There exist some big companies operating datacenters, typically
equipped with many thousands of servers each (in some cases it goes up
to the order of a hundred thousand servers per datacenter). Many big
companies have more than one datacenter.

CEOs do not sleep well at night when they dream of man-made or natural
disasters such as terrorist bombings, earthquakes, fire, or simply power
blackouts due to human error or technical failure of components which
should protect against exactly that, but fail at the wrong moment.

In addition, when companies are traded on the stock exchange, they are
ranked not only by the market, but also by rating agencies.

If you (as a company) have implemented geo-redundancy as a means for
failover to a backup datacenter, your rating will improve.

If the _existence_ of your company _depends_ on IT and is potentially
vulnerable to such disasters, geo-redundancy is not even a matter of
ranking, but simply a _must_.

2) long-distance

In theory, this is just a simple consequence of geo-redundancy: the
longer the distances between your datacenters, the better the protection
against large-scale disasters such as earthquakes.

There are further reasons for long distances. Probably the most
important is the cost of electricity, which is much cheaper (by
_factors_) when you build your datacenters near mountains with cheap
hydro-electric power, but far away from overcrowded cities. Electricity
is one of the main cost factors (I know some outlines of business
cases).

A further reason is the structure of the internet: alternative routing
/ multipathing does not always work well, e.g. in case of massive DDOS
attacks. You don't want to concentrate too many datacenters in one spot
on the globe, and you don't want to move too far away from the bulk of
your customers / web surfers if you are a webhosting company such as
1&1.

When you are a world-wide company, distances between your datacenters
are naturally likely to be large.

In practice, long distances raise very difficult technical problems,
which have to do with

a) Einstein's law, the speed of light (aka network latencies),

b) network bottlenecks due to multiplexing masses of parallel data
traffic through single long-distance fibres via routers (aka packet
loss), and

c) frequent short network outages; even long-lasting network outages
_do_ occur from time to time. See the daily press.

Notice that in classic TCP/IP, packet loss is practically the _only_
signal used for congestion control.

It is _impossible_ to avoid packet loss over shared long-distance lines
in practice. You would need to _guarantee_ an extremely high total
bandwidth, capable of delivering every single packet from a packet storm
or from a DDOS attack. Even if this were technically possible by
building up extremely high bandwidth capacities (which in many cases it
isn't), it would be so expensive as to be essentially unaffordable.
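
Just as a hedged illustration of how badly even small packet loss rates
hurt a single long-distance TCP stream, here is a quick sketch based on
the well-known rule of thumb by Mathis et al.; the numbers are made up
for illustration and are not measurements from any of our lines:

    import math

    # Rule of thumb (Mathis et al.): loss-limited TCP throughput is roughly
    #   BW ~ (MSS / RTT) * C / sqrt(p),   with C ~ 1.22
    mss_bytes = 1460
    rtt_s = 0.120                    # ~120 ms round trip, e.g. Europe <-> USA
    for loss in (1e-4, 1e-3, 1e-2):
        bw_mbit = (mss_bytes / rtt_s) * 1.22 / math.sqrt(loss) * 8 / 1e6
        print(f"packet loss {loss:.2%}: ~{bw_mbit:5.1f} Mbit/s per TCP stream")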

3) replication of whole datacenters

3a) block layer

Many datacenters host a wide spectrum of heterogeneous applications.
Many of them carry a _history_ (sometimes dating back decades), i.e.
there exist many "running systems" which, in sysadmin speak, should
"never be touched", let alone be re-engineered just for the sake of
adding replication in order to improve your stock market ranking. That
would be expensive, and it could go wrong.

Therefore, long-distance replication is needed at the block device
layer.

Doing it there is _no_ _fundamental_ change to the overall architecture,
because almost every contemporary standard server has hard disks or SSDs
with a block layer interface.

Only when implementing _new_ applications could you try to do
long-distance replication _always_ (consistently throughout your whole
datacenter, even when it contains some Windows boxes) at higher levels
such as the filesystem level or the application level (e.g. by
programming it yourself), but I wish you good luck. The spectrum of
applications, operating systems, storage systems, IO load behaviour,
etc. is too wide to allow general statements such as "do long-distance
replication _always_ at filesystem level". Maybe it is _sometimes_ both
possible and affordable (and maybe sometimes even advantageous) to do
long-distance replication at filesystem or even at application level,
but not in general.

3b) synchronous vs asynchronous replication

Probably most of you already know that "synchronous" replication means
applying a write-through strategy: a write operation is only signalled
back to the initiator as "completed" when it has been committed to all
replicas, that is, transferred from the active datacenter side to the
passive one (or to _all_ passive ones when having k > 2 replicas), and
after an answer indicating successful completion has been transferred
back over the network again (a ping-pong game).

Some years ago, I evaluated DRBD (which works synchronously, and which
had already been in productive use for _short_ distances for some years,
where it worked well), but this time for long-distance replication
between one of our datacenters located in Europe and another one in the
USA, over a 10Gig line which is fully administered by our WAN department
and which was only lightly loaded at that time.

Guess the result?

Well, I did a detailed analysis using an early prototype version of my
blkreplay toolkit (see http://www.blkreplay.org). I used a real-life
load recorded by blktrace from shared hosting in our datacenter.
Blkreplay generates very detailed analysis graphics on latency
behaviour, as well as on many other interesting properties like the
working-set behaviour of the application, etc.

Ok, what do you guess?

At that time, I asked our sysadmins internally: nobody was betting that
there was a chance it could work. All believed that network latencies of
~120ms (ping) would kill the idea.
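
Just to make that intuition concrete, here is a back-of-the-envelope
sketch (illustrative numbers only, not taken from the measurements
below):

    # With synchronous (write-through) replication, every write must wait
    # for at least one full network round trip before it may be reported
    # back as completed.
    rtt_s = 0.120           # ~120 ms ping, Europe <-> USA
    local_commit_s = 0.001  # optimistic local commit time, 1 ms (made up)

    per_write_s = rtt_s + local_commit_s
    print(f"at most ~{1 / per_write_s:.1f} synchronous writes/s per depth-1 writer")
    # -> roughly 8 writes/s, versus thousands per second on a local disk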

They were right with respect to the final result. It was easy to see
from the so-called "sonar diagram" graphics, even for our management,
that it did not work. I tried lots of network settings such as TCP send
buffer tuning, all available TCP congestion control algorithms including
the historic ones, etc., but nothing helped.

I strongly believe that this is not the fault of the DRBD
implementation, but the fault of synchronous replication at the
_concept_ level. I got oral reports from other people inside 1&1 that
similar long-distance setups with commercial storage boxes in
synchronous mode also don't work, at least for them.

Well, when looking at the old graphics again with my present
experience, I should modify my former conclusions a little bit. It was
likely not the base latencies as such that killed the idea of using
synchronous replication for long distances, but the packet loss (and
what happens after it). Details are out of scope of this discussion (you
can ask me in personal conversations).

OK, can I assume in this audience that I don't need to provide a
stronger proof that asynchronous replication is _mandatory_ for
long-distance replication?

With help from Philipp Reisner from Linbit, I also evaluated the
combination DRBD + proxy, which does asynchronous replication. Since
proxy is under a commercial license, it is probably out of scope of the
discussion in this forum. For interested people: a feature comparison
between MARS Light and DRBD+proxy is on slide 4 of the LinuxTag2014
presentation.

In short: proxy uses RAM for asynchronous buffering. There is no
persistence at that point. So you can imagine why we decided not to use
it for replication of whole datacenters. Details are out of scope here.

4) enterprise grade

I am arguing at the feature level here, not about maturity (where MARS
Light certainly still has to catch up, at least when compared to
long-established solutions like DRBD).

At the feature level, you need at least all of the following when
replicating whole datacenters over long distances:

a) fully automatic handover / switchover

Not only in case of disasters, but also for operational reasons such as
necessary reboots due to security fixes or kernel upgrades, you need to
switch the primary side in a controlled manner. Whenever possible, you
should avoid creating a split brain.

Notice that the _problem space_ of long-distance / asynchronous
replication bears some hidden pitfalls which are not obvious even when
you are already familiar with synchronous replication. For example, it
is impossible to eliminate split brain in general; you can only do your
best to avoid it.

b) close-to-automatic resolution of split brain

That sounds easy for DRBD users, but it isn't when you have huge
diverging backlogs caused by long-lasting network disruptions or other
problems, and/or when this is combined with more than 2 replicas. You
may argue that you should avoid (or even try to prohibit wherever you
can) any split-brain situation in advance because you lose data upon
resolution, but such an argument is not practical. Shit happens in
practice.

You must be able to clean up the shit once it has happened, as
automatically as possible.

c) shared storage for the backlogs (called "transaction logs" in MARS Light)

When you have to have n resources on the same storage server (which is
a _must_ for storage consolidation in big datacenters), and n is at
least of the order of 10 (or is even planned to grow by further powers
of 10 in the future), you will notice a problem which is probably not
yet clear to everyone here.

Let me try to explain it.

Probably at least the filesystem experts in this audience will know the
90/10 rule, sometimes also called the 80/20 rule, or other naming
variants.

It is an _observation_ from _practice_.

The rule occurs not only in filesystems, but also in many other places.
Well-known instances are "90% of accesses go to 10% of all the data",
"10% of all files consume 90% of all storage space because they are so
big", "10% of all users create 90% of the system load", "10% of all
customers create 90% of our revenue", "90% of all complaints are caused
by 10% of all customers", and so on. The list can be extended almost
indefinitely.

Please don't argue about whether 90%/10% is "correct" or not; in some
cases the "correct" numbers are 95%/5% or even 99%/1%, while in other
cases it's only 70%/30%, or whatever. It is just a practical rule of
thumb where exact numbers don't matter. In many cases the so-called
"Zipf's law" appears to be the reason for the rule, but even Zipf's law
is an observation from practice for which no full scientific proof
exists up to now, AFAIK.

My point is: the 90%/10% rule of thumb also applies, as you may guess,
to what?

When you subdivide your available storage space into n statically
allocated ring buffers as a means of storing the backlogs (which are
_needed_ for asynchronous replication), the cited rule of thumb says
that you are wasting about one order of magnitude of storage space when
the first ring buffer is full, but all other (n - 1) ring buffers are
not yet full. The rule says that for sufficiently large n, about 10% of
all ring buffers will contain about 90% of all waste.

Imagine: translate this into € or $ for thousands or tens of thousands
of servers, and you will probably get the point.

Just as there is no scientific proof for Zipf's law, I cannot prove
this in a strong sense. But it is just what I observe whenever I log
into an arbitrary MARS node having some long-lasting network problem and
type "du -s /mars/resource-* | sort -n".

IMNSHO, this observation retrospectively justifies why an
enterprise-class asynchronous replication system needs not only _shared_
storage for _masses_ of backlogs, but also _dynamic_ storage.

Some additional justification for the latter: notice that Zipf's law
and friends are observable not only in the space dimension, but also in
the time dimension when looking at throughput, at least when choosing
the right scale.

The consequence: block layer concepts, which are supposed to deal with
rather static or near-static storage spaces, are not necessarily the
appropriate concepts when implementing huge masses of backlogs for huge
masses of resources, which in turn is needed for enterprise-class
reasons.

5) "only"

I know that there exist some opensource prototype implementations of
asynchronous long-distance replication, more or less in the form of
concept studies or similar.

I don't know of any that implements the enterprise-grade requirements
for thousands of servers / whole datacenters as pointed out above.

--

OK, I am at the end of this posting, and I have deliberately omitted
the other main selling point from the slides, called "Anytime
Consistency". If you are interested, you can ask questions, which I will
try to answer as best I can.

Cheers,

Thomas


2014-07-04 13:04:13

by Richard Weinberger

Subject: Re: Selling Points for MARS Light

On Fri, Jul 4, 2014 at 1:01 PM, Thomas Schöbel-Theuer
<[email protected]> wrote:
> Hi all,
>
> since not everyone has attended my presentations on MARS Light, and
> since probably not everyone is willing to click on links such as the
> presentation slides
> https://github.com/schoebel/mars/blob/master/docu/MARS_LinuxTag2014.pdf?raw=true
> , I will try to explain some things that go beyond the presentations,
> but nevertheless I don't want to repeat too much from the slides.
>
> So the following is hopefully also interesting for people who attended
> the presentations.
>
> My main selling point:
>
> To my knowledge, MARS Light is the only enterprise-grade opensource
> solution constructed for mass replication of whole datacenters
> (thousands to tens of thousands of instances) over _long_ _distances_,
> aka _geo-redundancy_.

I think hch did not mean a high-level, marketing-drone-ready design
document by "explain and sell your design".
We are interested in the real low-level design ideas,
i.e. why do you need all syscalls exported?

--
Thanks,
//richard

2014-07-04 17:37:47

by Greg Kroah-Hartman

Subject: Re: Selling Points for MARS Light

On Fri, Jul 04, 2014 at 03:04:07PM +0200, Richard Weinberger wrote:
> I think hch did not mean a high-level, marketing-drone-ready design
> document by "explain and sell your design".
> We are interested in the real low-level design ideas,
> i.e. why do you need all syscalls exported?

Exactly. Normally we don't care about any high-level things; I want to
know why you think you have to call remove within a kernel module when
no one else does that.

Also, I doubt you handle the namespace issue properly, but given that it
is impossible to review the 50+ patches as sent, it might be correct, or
might not, I don't know...

greg k-h

2014-07-04 18:32:14

by Thomas Schoebel-Theuer

Subject: Re: Selling Points for MARS Light

> We are interested in the real low-level design ideas,
> i.e. why do you need all syscalls exported?

OK, simple question, simple first-order answer, more complicated
second-order answer.

First-order answer:

Exporting all syscalls was a mistake on my part. It's not needed; only
a handful are really needed. Of course, I am willing to correct my
mistake.

A first-order fix for this is really simple. A second-order fix
(avoiding _any_ additional exports, or almost any) would get much more
complicated.

Second-order answer to the more detailed question behind your question:
why does tst need filesystem operations like *_symlink() or *_readlink()
at the low level?

I tried to explain the general high-level rationale behind the
architecture of MARS in the last part of my "marketing drone" appraisal
(a genre which I also hated some years ago, but which I had to learn to
produce when changing from academia to industry).

Detailed steps of the rationale, with some non-obvious intermediate steps:

Due to the enterprise-class dynamic shared store needed for masses of
backlog data (as explained in the last part of my former posting), I had
the low-level implementation choice of

a) just re-using existing filesystem technology for that, or

b) implementing my own filesystem-like dynamic store for that.

I also examined another potential alternative c), namely trying to use
LVM / DM for that. I looked at the concepts and some parts of the
interfaces, as well as at the implementation (all in a rather old kernel
as used in production!!!), and found some clashes at the concept level.
I therefore got the impression that it would likely become very
cumbersome, so I ruled out this potential alternative at a rather early
stage.

At that time, my office was next door to the IT operations system
architect, who became a friend of mine. So we were discussing many
political as well as detailed technical implementation problems in a
very small team (sometimes when drinking a beer together).

At that time, the goal was to produce a POC (proof of concept) ASAP, in
order to acquire some resources for the emerging MARS project
internally. Around that time, we jointly decided to postpone all
"expensive" efforts to a later version called "MARS FULL" (originally
all in capital letters), and to try to get a quick win called "MARS
LIGHT" ASAP.

Under that premise, it becomes immediately clear that a) had to be
taken for MARS Light; whenever b) should turn out to be the better
alternative (which I think is quite possible), we decided to do it
"later".

An additional advantage of a) is that ordinary sysadmin commands such
as "ls -l" can be used to inspect the internal state of MARS:
implementing a key->value store this way has the low-level advantage
that you can easily experiment with it or even work around some bugs via
"ln -s" and friends.

Not as an excuse, but please believe me that MARS originally started as
a strictly _internal_ 1&1 project, and it took quite a while until I got
official permission from our CTO to publish it under the GPL (before
that, I had to convince many other people). Until that point, there were
no official thoughts of ever going kernel upstream, probably apart from
some beer-drinking in a very small group ;)

Yes, well, life is not straightforward... not only were some letters
lowercased, but the rollout of MARS Light into practice, as originally
planned, took much longer than expected (due to reasons which are out of
scope here).

In the meantime, MARS Light went into production in a _different_
sysadmin team than originally planned. They just took it, and it just
worked for them.

Of course, that was fantastic for me in first order.

But a second-order consequence now is that any change to the low-level
data formats (such as replacing the symlinks with other formats such as
files) _must_ provide an upgrade path.

If I do not operate very carefully, this could become a problem in the
future. I have to avoid any bigger fork between the out-of-tree and the
potential in-tree versions; at least I have to provide an easy and
robust upgrade path between them (probably in both directions).

In addition, I have to ensure that the timescale for parallel
maintenance of two major flavours is limited. If I did things which
would perpetuate the version split ad infinitum, this would very likely
become a no-go for going upstream.

I hope you can understand that. In no way do I want to put pressure on
you. It's just the diverging pressure I am trapped in.

Now, the discussion here has revealed some pain points for me (in my
subjective view).

In order to try to resolve the complaints about the "symlink farms", I
could try to replace some operations, such as the symlink operations,
with file operations. But I am not sure whether this really helps. It
just complicates things.

Therefore I will keep trying to find, together with you, a solution
that works for both sides: one which does not unnecessarily violate the
upstream rules, while also not consuming too much of my valuable
manpower for maintaining both versions in a manner that stays compatible
from a luser's perspective.

Now this posting has become longer than intended. But I hope it can
contribute to a better mutual understanding.

Cheers,

yours sincerely, Thomas