2023-02-15 11:44:23

by Roger Heflin

Subject: Re: [dm-devel] RAID4 with no striping mode request

I think he wants parity computed across the data blocks of the
separate filesystems (some sort of parity across fs[1-8]/block0 into
parity/block0).
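
Roughly, in code (a minimal sketch assuming plain XOR parity, the same
math RAID4 uses; the function names and equal-sized blocks are just
illustrative assumptions):

# Hypothetical sketch: XOR parity across the same block offset on N
# independent, non-striped data disks (fs[1-8]/block0 -> parity/block0).
from functools import reduce

def parity_block(data_blocks):
    # XOR together the blocks at one offset from every data disk.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_blocks))

def update_parity(old_parity, old_data, new_data):
    # Read-modify-write rule: new parity = old parity XOR old data XOR new data,
    # so a write touches only the one data disk and the parity disk.
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

def rebuild_block(surviving_blocks, parity):
    # A lost disk's block is the XOR of the parity block with all survivors.
    return parity_block(surviving_blocks + [parity])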

It is not clear to me how this setup would be enough better than the
current options to be worth it. For the same price one could have 8
spinning disks + 1 SSD, or 12 spinning disks, and two 6-disk RAID6
arrays would have the same usable space while being pretty safe (you
can lose any 2 of the 6 and lose no data). The separate-filesystems
requirement would also need some software above the filesystems to
manage spreading the data across them. The risk of another disk
going bad (while one was already failed) and losing a disk's worth of
data would push me to the 6-disk RAID6.
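
As a rough back-of-the-envelope comparison (equal-sized disks are an
assumption here, just to show the usable-space claim):

# Usable capacity, in units of one disk, for the layouts mentioned above.
proposed  = 8            # 8 data disks + 1 parity disk; parity adds no usable space
two_raid6 = 2 * (6 - 2)  # each 6-disk RAID6 loses 2 disks to parity
print(proposed, two_raid6)  # both give 8 disks' worth of usable space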

WOL: current SSDs are rated for around 1000-2000 write/erase cycles
per cell, so a 1TB disk can sustain roughly 1000-2000TB of total
writes. Filesystem blocks would get re-written more often than data
blocks, so how well it would work depends on how often the data is
deleted and re-written. If the disks are some sort of long-term
storage then the SSD is not going to get used up. And I am not sure
the rated wear-out really means anything unless you are using a
STUPID enterprise controller that proactively disables/kills the SSD
when it says the rated writes have happened. I have a 500GB SSD in a
mirror that "FAILED" according to SMART 2 years ago and is still
fully functional, and it is "GOOD" again because the counters used to
track total writes seem to have rolled over.
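
The endurance figure is just capacity times rated cycles; spelled out
(using the 1TB size and 1000-2000 cycle rating assumed above):

# Rough SSD endurance: total write endurance ~= capacity * rated P/E cycles.
capacity_tb = 1.0
for cycles in (1000, 2000):
    print(f"{cycles} rated cycles -> ~{capacity_tb * cycles:.0f}TB of total writes")
# -> ~1000TB and ~2000TB, i.e. the 1000-2000TB figure above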

On Tue, Feb 14, 2023 at 8:23 PM Heinz Mauelshagen <[email protected]> wrote:
>
> Roger,
>
> as any of the currently implemented 'parity' algorithms (block XOR / P-/Q-syndrome) provided by DM/MD RAID
> need at least two data blocks to calculate parity: are you, apart from the filesystem thoughts you bring up, thinking
> about running those on e.g. pairs of disks of the mentioned even-numbered set of 8?
>
> Heinz
>
> On Tue, Feb 14, 2023 at 11:28 PM Roger Heflin <[email protected]> wrote:
>>
>> On Tue, Feb 14, 2023 at 3:27 PM Heinz Mauelshagen <[email protected]> wrote:
>> >
>>
>> >
>> >
>> > ...which is RAID1 plus a parity disk, which seems superfluous as you already achieve
>> > resilience against single device failures (up to N-1 of them) without the latter.
>> >
>> > What would you need such parity disk for?
>> >
>> > Heinz
>> >
>>
>> I thought that at first too, but threw that idea out as it did not
>> make much sense.
>>
>> What he appears to want is 8 linear non-striped data disks + a parity disk.
>>
>> Such that you can lose any one data disk and parity can rebuild that
>> disk. And if you lose several data disks, you still have intact
>> non-striped data on the remaining disks.
>>
>> It would almost seem that you would need to put a separate filesystem
>> on each data disk/section (or have a filesystem that is redundant
>> enough to survive), otherwise losing an entire data disk would leave
>> the filesystem in a mess.
>>
>> So N filesystems + a parity disk covering the data on the N separate
>> filesystems. And each write requires reading the old data from the
>> disk being written to and the old parity, recalculating the parity,
>> and writing out both the new data and the new parity.
>>
>> If the parity disk were an SSD it would be fast enough, but then I
>> would expect it to get used up/burned out, since the parity is
>> re-written for every write on every data disk, unless you bought an
>> expensive high-write SSD.
>>
>> The only advantage of the setup is that if you lose too many disks you
>> still have some data.
>>
>> It is not clear to me that it would be any cheaper, if parity needs
>> to be a normal SSD (SSDs are about 4x the price/GB and high-write
>> ones are even more), than a classic bunch of mirrors, or even say a
>> 4-disk RAID6 where you can lose any 2 disks and still have your data.
>>


2023-02-15 14:53:13

by Wols Lists

Subject: Re: [dm-devel] RAID4 with no striping mode request

On 15/02/2023 11:44, Roger Heflin wrote:
> WOL: current SSDs are rated for around 1000-2000 write/erase cycles
> per cell, so a 1TB disk can sustain roughly 1000-2000TB of total
> writes. Filesystem blocks would get re-written more often than data
> blocks, so how well it would work depends on how often the data is
> deleted and re-written.

When did that guy do that study of SSDs? Basically hammered them to
death 24/7? I think it took about three years of continuous write/erase
cycles to destroy them.

Given that most drives are obsolete long before they've had three years
of writes ... the conclusion was that "modern" (as they were several
years ago) SSDs would probably outlast mechanical drives under the same
write load.

(Cheap SD cards, on the other hand ...)

Cheers,
Wol

2023-02-15 15:22:52

by Roger Heflin

Subject: Re: [dm-devel] RAID4 with no striping mode request

SMART marks the disk as FAILED when you hit the manufacturer's posted
limit (1000 or 2000 writes per cell on average). I am sure using a
"FAILED" disk would make a lot of people nervous.

The conclusion that you can write as fast as you can and it will still
take 3 years to wear the drive out is specific to that brand/version
with its particular chips; it may or may not hold for other
vendors/chips/versions, so there could be quite a bit of variation. I
think I remember seeing that study, but I don't remember what the
average write rate was. The one I just found says 200TB of writes on a
240GB drive, so about 800 erases per cell, was the lowest failure
point, with some drives making it 3-5x higher.
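
Spelling out that arithmetic with the quoted figures:

# Erase cycles implied above: total bytes written / drive capacity.
total_writes_tb = 200   # 200TB written before the weakest drive failed
capacity_gb = 240       # 240GB drive
print(round(total_writes_tb * 1000 / capacity_gb))  # ~833 full-drive writes, i.e. ~800 erases per cell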



2023-02-16 00:02:14

by Kyle Sanderson

Subject: Re: [dm-devel] RAID4 with no striping mode request

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin <[email protected]> wrote:
>
> I think he wants parity computed across the data blocks of the
> separate filesystems (some sort of parity across fs[1-8]/block0 into
> parity/block0).

Correct.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin <[email protected]> wrote:
> It is not clear to me how this setup would be enough better than the
> current options to be worth it. For the same price one could have 8
> spinning disks + 1 SSD, or 12 spinning disks, and two 6-disk RAID6
> arrays would have the same usable space while being pretty safe (you
> can lose any 2 of the 6 and lose no data).

They're not the same price though. Remember these disks are mixed
sizes and various ages, and the scheme exposes their entire capacity
(4T + 8T + 12T data disks plus a 12T parity disk gives you 24T of
usable storage), all protected by the single parity disk.
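
A quick sanity check on that figure (the assumption, as in
unRAID/SnapRAID-style layouts, is that the parity disk must be at
least as large as the largest data disk and contributes no usable
space):

# Usable space of the mixed-size, single-parity layout above.
data_disks_tb = [4, 8, 12]                   # data disks of mixed sizes
parity_disk_tb = 12                          # parity disk
assert parity_disk_tb >= max(data_disks_tb)  # parity must cover the largest data disk
print(sum(data_disks_tb), "T usable")        # -> 24 T usable, as stated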

Yes, higher levels of RAID will always be better. However, that's not
how the millions of these appliances developed by a number of
manufacturers and sold at your local retailer are built. The proposal
(and ask for help) that I've raised is to have an open-source
alternative to these proprietary MD-based implementations, as opposed
to being trapped with buggy MD drivers on firmware that's glitchy and
breaks other aspects of the kernel.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin <[email protected]> wrote:
> The separate-filesystems requirement would also need some software
> above the filesystems to manage spreading the data across them. The
> risk of another disk going bad (while one was already failed) and
> losing a disk's worth of data would push me to the 6-disk RAID6.

This has long been solved by a number of FUSE filesystems, as well as
overlayfs (it would be nice if it could gradually spool data down into
layers, but that's another ball of wax).
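
For illustration only, the "software above the filesystems" can be as
simple as a placement policy over independent per-disk mounts; a
minimal sketch (the mount paths and most-free-space policy here are
assumptions, roughly what mergerfs-style union filesystems do):

# Hypothetical placement policy: put each new file on whichever
# per-disk filesystem currently has the most free space.
import shutil

MOUNTS = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]  # assumed per-disk mount points

def pick_mount(mounts=MOUNTS):
    # shutil.disk_usage() reports (total, used, free) bytes for a path.
    return max(mounts, key=lambda m: shutil.disk_usage(m).free)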

Hopefully that makes sense. The only thing coming close to this is
bcachefs, but that still looks like a multi-year road (while the above
has been deployed in homes since the early 2000s).
