Four Drive RAID-5 on RAIDFrame Considered Harmful...
Robert P. Thille
2007-10-09 23:59:58 UTC
I'm building up a new server to replace my Cobalt RaQ2+ running NetBSD.
I got a Mini-ITX board and a 1U case and hung 4 drives off it (actually 5
while I bring it up, but the 5th will go away once the RAID set is
properly set up, and there's not much I/O to it).

The drives are all 400GB Seagate drives, and because of the I/O ports on the
Via EN-15000G, 2 are SATA and 2 are IDE (each IDE drive on its own channel).

Setting it up initially, I raided the 4 drives together with two
partitions on each of the components: a small one for RAID-1 to load the
kernel, and a large one for RAID-5. Unfortunately, the RAID-5 had
horrible performance: 2-3MB/sec sometimes and never higher than about
12MB/sec.

I tracked it down to the problem that 2 and 3 are relatively prime :-)
Given 4 drives in a RAID-5 setup, you get 3 data blocks per RAID stripe,
but the filesystem block size must be a power of 2, so you always have
to write at least one partial stripe.
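
To make the mismatch concrete, here's a quick back-of-the-envelope check
(a Python sketch; 512-byte sectors assumed, and the SectsPerSU values below
are just examples, not my exact config):

SECTOR = 512

def data_stripe_bytes(n_drives, sects_per_su):
    # Bytes of data (excluding parity) in one RAID-5 stripe.
    return (n_drives - 1) * sects_per_su * SECTOR

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

for sects_per_su in (16, 32, 64, 128):
    width = data_stripe_bytes(4, sects_per_su)
    print("4 drives, %3d SectsPerSU -> %dK of data per stripe, power of two? %s"
          % (sects_per_su, width // 1024, is_power_of_two(width)))

# The data width is always 3 * (a power of two), never a power of two itself,
# so no power-of-two filesystem block can line up with a full stripe: every
# block-sized write ends up as a partial-stripe (read-modify-write) write.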

I'm still doing testing, but at one point I saw differences in
performance between 3 (+1 spare) and 4 drive RAID-5 sets of 20:1. That
is, adding the 4th drive into the RAID set caused performance to drop by
a factor of 20.

So far, it looks like the best overall performance I'm getting is with 3
drives, 32 SectsPerSU (16K stripe units, so 32K of data per stripe), with a
filesystem block size also of 32K.
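
For reference, the arithmetic behind that choice (again just a sketch,
assuming 512-byte sectors):

SECTOR = 512

# 3-drive RAID-5: 2 data stripe units + 1 parity unit per stripe.
n_drives, sects_per_su = 3, 32
su_bytes = sects_per_su * SECTOR                # 32 * 512 = 16K per stripe unit
data_per_stripe = (n_drives - 1) * su_bytes     # 2 * 16K = 32K of data per stripe

fs_block = 32 * 1024                            # filesystem block size of 32K
print(data_per_stripe == fs_block)              # True: one FS block covers exactly
                                                # one stripe's worth of data, so
                                                # aligned block writes can be
                                                # full-stripe writes (no r-m-w)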

I'm sort of disappointed at losing 25% of my storage, but the
performance loss just isn't worth it. Would a hardware RAID card have
these issues, or do they do tricks with buffering or something to get
around it?

Once I finish testing, I'll post my results.

Thanks,

Robert
--
Robert Thille 7575 Meadowlark Dr.; Sebastopol, CA 95472
Home: 707.824.9753 Office/VOIP: 707.780.1560 Cell: 707.217.7544
***@mirapoint.com YIM:rthille http://www.rangat.org/rthille
Cyclist, Mountain Biker, Freediver, Kayaker, Rock Climber, Hiker, Geek
May your spirit dive deep the blue, where the fish are many and large!
Thor Lancelot Simon
2007-10-10 01:53:01 UTC
Post by Robert P. Thille
I tracked it down to the problem that 2 and 3 are relatively prime :-)
Given 4 drives in a RAID-5 setup, you get 3 data blocks per raid stripe,
but the filesystem block size Must be a power of 2, so you always have
to write at least one partial stripe.
No. Examine the "maxcontig" FFS parameter.
Robert P. Thille
2007-10-10 03:22:42 UTC
Post by Thor Lancelot Simon
Post by Robert P. Thille
I tracked it down to the problem that 2 and 3 are relatively
prime :-)
Given 4 drives in a RAID-5 setup, you get 3 data blocks per raid stripe,
but the filesystem block size Must be a power of 2, so you always have
to write at least one partial stripe.
No. Examine the "maxcontig" FFS parameter.
maxcontig appears to be obsolete. Looking through the sources (current
from a month or so ago), I don't think it's used by the kernel/filesystem,
and the options mentioned in postings I found in the mailing lists don't
seem to exist anymore.

Am I missing something?

Thanks,

Robert


--
Robert Thille 7575 Meadowlark Dr.; Sebastopol, CA 95472
Home: 707.824.9753 Office/VOIP: 707.780.1560 Cell: 707.217.7544
***@rangat.org YIM:rthille http://www.rangat.org/rthille
Cyclist, Mountain Biker, Freediver, Kayaker, Rock Climber, Hiker, Geek
May your spirit dive deep the blue, where the fish are many and large!
Zafer Aydogan
2007-10-10 09:12:28 UTC
Post by Robert P. Thille
Post by Thor Lancelot Simon
Post by Robert P. Thille
I tracked it down to the problem that 2 and 3 are relatively prime :-)
Given 4 drives in a RAID-5 setup, you get 3 data blocks per raid stripe,
but the filesystem block size Must be a power of 2, so you always have
to write at least one partial stripe.
No. Examine the "maxcontig" FFS parameter.
maxcontig appears to be obsolete. Looking thru the sources (current
from a month or so ago), I don't think it's used by the kernel/
filesystem, and the options mentioned in postings I found in the
mailing lists don't seem to exist anymore.
Am I missing something?
Thanks,
Robert
--
Robert Thille 7575 Meadowlark Dr.; Sebastopol, CA 95472
Home: 707.824.9753 Office/VOIP: 707.780.1560 Cell: 707.217.7544
Cyclist, Mountain Biker, Freediver, Kayaker, Rock Climber, Hiker, Geek
May your spirit dive deep the blue, where the fish are many and large!
I wouldn't recommend RAID 5 at all. If one of your drives dies, or your
controller dies (in the case where you have hardware RAID), you can
have a lot of trouble recovering your data. But if you have RAID 0
over RAID 1 pairs (RAID 10?), you can always replace the drives and copy
the data off, since the data is always available as a whole unit.

Zafer.
Greg Troxel
2007-10-10 11:48:50 UTC
Post by Robert P. Thille
Setting it up initially, I raided the 4 drives together with two
partitions on each of the components: a small one for RAID-1 to load
the kernel, and a large one for RAID-5. Unfortunately, the RAID-5 had
horrible performance: 2-3MB/sec sometimes and never higher than about
12MB/sec.

I am not 100% clear on this, but I have the impression that RAID-5
requires read-modify-write and that in the event of system crash or
power loss you can get corruption, and thus the good hardware
controllers have a) battery backed RAM and b) code that won't crash. (I
don't mean to malign the raidframe code - but because it's in-kernel if
the kernel crashes for any reason - not unheard of - then pending raid
writes may not happen.)
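
My rough mental model of the RAID-5 small-write path is something like
this (a toy sketch in Python, plain XOR arithmetic only - I'm not claiming
this is RAIDframe's actual code):

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(stripe, column, new_data, parity_column):
    # Update a single data column: 2 reads and 2 writes.
    old_data = stripe[column]                    # read old data
    old_parity = stripe[parity_column]           # read old parity
    new_parity = xor(xor(old_parity, old_data), new_data)
    stripe[column] = new_data                    # write new data
    # <-- a crash right here leaves new data with stale parity
    stripe[parity_column] = new_parity           # write new parity

stripe = [b"\xaa", b"\xbb", xor(b"\xaa", b"\xbb")]   # 2 data columns + parity
small_write(stripe, 0, b"\x11", parity_column=2)
assert stripe[2] == xor(stripe[0], stripe[1])        # parity consistent again

# A full-stripe write computes parity from the new data alone, so it needs
# no reads at all -- which is why you want writes to cover whole stripes.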

Because of this I've always just bought two big disks and done RAID-1.

Perhaps Greg Oster will chime in, and it would be a good addition to the
Guide to discuss the wisdom of using RAID-5.
http://www.netbsd.org/docs/guide/en/chap-rf.html
Robert P. Thille
2007-10-10 15:42:45 UTC
Post by Greg Troxel
Post by Robert P. Thille
Setting it up initially, I raided the 4 drives together with two
partitions on each of the components: a small one for RAID-1 to load
the kernel, and a large one for RAID-5. Unfortunately, the
RAID-5 had
horrible performance: 2-3MB/sec sometimes and never higher than about
12MB/sec.
I am not 100% clear on this, but I have the impression that RAID-5
requires read-modify-write
Yep, unless you can do "full-stripe" writes, which is why you want
the filesystem block size to match the stripe size...
Post by Greg Troxel
and that in the event of system crash or
power loss you can get corruption, and thus the good hardware
controllers have a) battery backed RAM and b) code that won't crash. (I
don't mean to malign the raidframe code - but because it's in-kernel if
the kernel crashes for any reason - not unheard of - then pending raid
writes may not happen.)
Well, I'm not that familiar with the 'RAID-5 write hole', but Wikipedia
seems to indicate that the parity can be left stale by a crash, and
unless you detect that before a component fails, you'll lose data. But since
RAIDFrame always regenerates the parity when it's not shut down
cleanly, that shouldn't (?) be a problem...I think :-)
Post by Greg Troxel
Because of this I've always just bought two big disks and done RAID-1.
Perhaps Greg Oster will chime in, and it would be a good addition to the
Guide to discuss the wisdom of using RAID-5.
http://www.netbsd.org/docs/guide/en/chap-rf.html
I thought about cross-posting to tech-kern, since it seems like that
gets a fair amount of RAIDFrame traffic, but my initial concern was
primarily with the performance I was seeing, and I know cross-posting
is frowned on.

Thanks,

Robert

--
Robert Thille 7575 Meadowlark Dr.; Sebastopol, CA 95472
Home: 707.824.9753 Office/VOIP: 707.780.1560 Cell: 707.217.7544
***@rangat.org YIM:rthille http://www.rangat.org/rthille
Cyclist, Mountain Biker, Freediver, Kayaker, Rock Climber, Hiker, Geek
May your spirit dive deep the blue, where the fish are many and large!
Greg Oster
2007-10-10 20:59:37 UTC
Post by Greg Troxel
Post by Robert P. Thille
Setting it up initially, I raided the 4 drives together with two
partitions on each of the components: a small one for RAID-1 to load
the kernel, and a large one for RAID-5. Unfortunately, the RAID-5 had
horrible performance: 2-3MB/sec sometimes and never higher than about
12MB/sec.
I am not 100% clear on this, but I have the impression that RAID-5
requires read-modify-write
There is no 'requirement' for that. You might end up having to do
that, depending on configuration settings, but the reality is that
you want to avoid r-m-w as much as possible.
Post by Greg Troxel
and that in the event of system crash or
power loss you can get corruption,
What sort of corruption would you like? :) In general, the filesystem
corruption that you'd see is about the same as what you'd see with a
single disk. Where there is a chance for "more corruption than with
a single disk" is if a component fails before parity has been
completely verified. In that case, if incorrect parity is used to
reconstruct lost data, that lost data might be gone for good (and bad
data returned). Of course, you could get that same sort of bad data
on any filesystem that overwrites files in-place, and the system
happens to go down at an inopportune moment...
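
To make that failure mode concrete, a toy example in Python (plain XOR,
nothing RAIDframe-specific):

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"OLD0", b"DAT1"
parity = xor(d0, d1)          # parity is consistent here

d0 = b"NEW0"                  # crash hit after the data write but before
                              # the parity write: parity is now stale

# Later the disk holding d1 fails; reconstruct it from d0 and parity:
print(xor(d0, parity))        # not b"DAT1" -- bad data comes back

# Rewriting parity after the unclean shutdown (recomputing it as d0 XOR d1
# before any component fails) closes that window, which is what the
# parity re-check after an unclean shutdown is for.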
Post by Greg Troxel
and thus the good hardware
controllers have a) battery backed RAM and b) code that won't crash. (I
don't mean to malign the raidframe code - but because it's in-kernel if
the kernel crashes for any reason - not unheard of - then pending raid
writes may not happen.)
Right. Just like any other pending writes to anything won't happen.
Post by Greg Troxel
Because of this I've always just bought two big disks and done RAID-1.
RAID 1 has the same issue -- Say the machine dies at the point where
block n is written to component 0 but not to component 1. If
component 0 dies before block n gets synced between the two, then
when you read block n from component 1, you're going to get the old
data.
Post by Greg Troxel
Perhaps Greg Oster will chime in, and it would be a good addition to the
Guide to discuss the wisdom of using RAID-5.
http://www.netbsd.org/docs/guide/en/chap-rf.html
The reality is that RAID (hardware or software) is not a panacea;
you're just attempting to bump the odds in your favour. And just as
there is a non-zero chance of corruption, there is also a non-zero chance
that your disks won't all die at once, or that a power surge won't
take out your hardware RAID configuration.

Personally, I like the odds that I'll be able to get a parity rewrite
done in time, vs. the odds of having my only disk die. :)

Later...

Greg Oster
Pawel Jakub Dawidek
2007-10-11 10:18:23 UTC
Post by Greg Oster
Post by Greg Troxel
Because of this I've always just bought two big disks and done RAID-1.
RAID 1 has the same issue -- Say the machine dies at the point where
block n is written to component 0 but not to component 1. If
component 0 dies before block n gets synced between the two, then
when you read block n from component 1, you're going to get the old
data.
It's not exactly the same. IMHO old data (in the RAID 1 case) is better
than random data (in the RAID 5 case).
--
Pawel Jakub Dawidek http://www.wheel.pl
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
Pavel Cahyna
2007-10-19 19:50:51 UTC
Post by Greg Oster
Post by Greg Troxel
Because of this I've always just bought two big disks and done RAID-1.
RAID 1 has the same issue -- Say the machine dies at the point where
block n is written to component 0 but not to component 1. If
component 0 dies before block n gets synced between the two, then
when you read block n from component 1, you're going to get the old
data.
I am curious - what do the BIOS-based software RAID controllers (as
offered by e.g. Promise and supported by ataraid(4)) do to correct the
situation when block n is written to component 0 but not 1? AIUI, you have
to do a re-sync after such unclean shutdown (hoping that the component
won't die before the data are synced) or have a battery backed cache.
So, do those "RAID controllers" re-sync parity after unclean shutdown?

Pavel
Greg Troxel
2007-10-19 20:14:24 UTC
Post by Pavel Cahyna
Post by Greg Oster
Post by Greg Troxel
Because of this I've always just bought two big disks and done RAID-1.
RAID 1 has the same issue -- Say the machine dies at the point where
block n is written to component 0 but not to component 1. If
component 0 dies before block n gets synced between the two, then
when you read block n from component 1, you're going to get the old
data.
But that's OK - the property that if you crash near a write you get
either the old data or the new data means that the RAID set behaves like
a single non-failing disk.

My concern about RAID-5, which seems to be addressed by judicious stripe
size choices, is that I thought it was possible to have the system go
down, without any drive failures, and after reboot/reconstruction have bits
that are neither the old data nor the new.

It would be really nice to explain this and what stripe sizes to use in
the RAIDframe section of the guide. I definitely don't understand this
well enough to give advice - just enough to be worried.
Post by Pavel Cahyna
I am curious - what do the BIOS-based software RAID controllers (as
offered by e.g. Promise and supported by ataraid(4)) do to correct the
situation when block n is written to component 0 but not 1? AIUI, you have
to do a re-sync after such unclean shutdown (hoping that the component
won't die before the data are synced) or have a battery backed cache.
So, do those "RAID controllers" re-sync parity after unclean shutdown?
That's a good question - I have assumed they do a parity rewrite much
like RAIDframe does. I have the impression none of the included RAID
features have battery-backed caches (that's for expensive cards, which
may well be worth it), and I don't see how else they could work. I have
avoided using BIOS RAID because with RAIDframe I can take one of the
disks, mount it in any machine, and get at the data, which is comforting
for recovery from problems.


Manuel Bouyer
2007-10-10 18:45:15 UTC
Post by Robert P. Thille
[...]
I'm sort of disappointed at losing 25% of my storage, but the
performance loss just isn't worth it. Would a hardware RAID card have
these issues, or do they do tricks with buffering or something to get
around it?
Hardware RAID controllers are much faster if they have a battery-backed cache
and can use a write-back caching policy. With the write-back cache they can
defer the partial writes until they have enough data to do a full-stripe write.
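
Roughly, the write-back cache lets the controller do something like this
(an illustrative Python sketch, not any particular controller's firmware;
the StripeCache class is made up for illustration):

class StripeCache:
    # Accumulate sub-stripe writes in (battery-backed) cache and only push
    # a stripe to the disks once every data column for it has been collected,
    # so those writes never need a read-modify-write cycle.
    def __init__(self, data_columns):
        self.data_columns = data_columns
        self.pending = {}                         # stripe number -> {column: data}

    def write(self, stripe_no, column, data, flush_full_stripe):
        cols = self.pending.setdefault(stripe_no, {})
        cols[column] = data                       # acknowledged from cache
        if len(cols) == self.data_columns:        # whole stripe is now cached:
            flush_full_stripe(stripe_no, cols)    # parity from new data only
            del self.pending[stripe_no]

cache = StripeCache(data_columns=2)
flush = lambda s, c: print("full-stripe write for stripe", s)
cache.write(0, 0, b"aa", flush)                   # buffered, nothing hits disk
cache.write(0, 1, b"bb", flush)                   # stripe complete, flushed
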
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference