Discussion: FFSv1 performance on large filesystems
Matthias Scheler
2005-03-03 13:05:51 UTC
Hello,

While searching for the reason for the "bad" NFS performance in my home
network, I got some interesting filesystem performance numbers:

Filesystem Size    Device    Write Performance(*)
1.9G               raid0     34MB/Sec
101G               raid0     22MB/Sec   <--
9.7G               raid1     34MB/Sec
26G                raid1     33MB/Sec

It looks like FFSv1 has a performance problem with large partitions. Any
ideas why that would happen? I wonder if it distributes the data too
much and therefore renders the 8MB cache of my IDE disks useless.

Kind regards

(*) Measured with "dd if=/dev/zero of=test.img bs=1024k count=256"
--
Matthias Scheler http://scheler.de/~matthias/
Thor Lancelot Simon
2005-03-03 15:32:42 UTC
Post by Matthias Scheler
Hello,
while searching for the reason for the "bad" NFS performance in my home
network I got interesting filesystem performance numbers
Filesystem Size Device Write Performance(*)
1.9G raid0 34MB/Sec
101G raid0 22MB/Sec <--
How large is the underlying disk? You should see a significant difference
in performance from center to edge of the disk.

How large is the file you're writing? If you want contiguous writes of
very very large files you'll need to play with the -e option to newfs.

I actually think that, in any case where we have more than 100 or so
cylinder groups, we should default -e to be a full group (it would be
nice if it could meaningfully be _more_ than a group).
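
(For illustration only, with a placeholder device name and value, not a recommendation: -e sets maxbpg, the maximum number of blocks a single file may allocate out of one cylinder group, and it can be set at creation time or changed later on an unmounted file system:

  newfs -e 8192 /dev/rwd0f           # set maxbpg when creating the file system
  tunefs -e 8192 /dev/rwd0f          # or adjust it afterwards (file system unmounted)
  dumpfs /dev/rwd0f | grep maxbpg    # inspect the current value
)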

Thor
Matthias Scheler
2005-03-03 16:32:38 UTC
Post by Thor Lancelot Simon
Post by Matthias Scheler
Filesystem Size Device Write Performance(*)
1.9G raid0 34MB/Sec
101G raid0 22MB/Sec <--
How large is the underlying disk?
wd0 at atabus0 drive 0: <WDC WD1600JB-32EVA0>
wd0: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd1 at atabus0 drive 1: <WDC WD1600JB-32EVA0>
wd1: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
raid0: Components: /dev/wd0a /dev/wd1a
Post by Thor Lancelot Simon
You should see a significant difference in performance from center to
edge of the disk.
I had that idea too, which is why I tested this partition ...

26G raid1 33MB/Sec

... which is at the end of a pair of 80GB disks.

I've run another test on my NetBSD-current system on a single disk ...

wd1 at atabus3 drive 0: <WDC WD1600JD-00GBB0>
wd1: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors

... and performance is much better here ...
dd if=/dev/zero of=test.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 5.739 secs (46773907 bytes/sec)

... on an even bigger partition:

Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/wd1f 111479275 2122056 103783256 2% /export/scratch

So either somebody "fixed" FFSv1 after NetBSD 2.0.1 or the problem isn't
related to FFSv1. Possible other reasons:

1.) The disks
The only difference between the disks is the interface (PATA vs. SATA).
And in my tests ("dd" from a raw device; see the sketch below) that
didn't make much difference.

2.) RAIDframe
I'm also not convinced that RAIDframe causes the problem because all
initial test cases used RAIDframe RAID 1 devices and the problem
only affected a single partition.
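
(The raw-device tests mentioned under 1.) were simple read tests against the whole-disk partition, roughly like this on each machine; the device name and count here are illustrative only:

  dd if=/dev/rwd0d of=/dev/null bs=1024k count=256
)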

Any other ideas?
Post by Thor Lancelot Simon
How large is the file you're writing?
256MB
Post by Thor Lancelot Simon
I actually think that, in any case where we have more than 100 or so
cylinder groups, we should default -e to be a full group (it would be
nice if it could meaningfully be _more_ than a group).
After "tunefs -e 40960 ..." the performance dropped below 20MB/Sec.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Chris Gilbert
2005-03-03 21:48:06 UTC
On Thu, 3 Mar 2005 16:32:38 +0000
Post by Matthias Scheler
Post by Thor Lancelot Simon
Post by Matthias Scheler
Filesystem Size Device Write Performance(*)
1.9G raid0 34MB/Sec
101G raid0 22MB/Sec <--
How large is the underlying disk?
wd0 at atabus0 drive 0: <WDC WD1600JB-32EVA0>
wd0: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd1 at atabus0 drive 1: <WDC WD1600JB-32EVA0>
wd1: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
raid0: Components: /dev/wd0a /dev/wd1a
So either somebody "fixed" FFSv1 after NetBSD 2.0.1 or the problem isn't
1.) The disks
The only difference between the disk is the interface (PATA vs. SATA).
And in my tests ("dd" from a raw device) that didn't make much
difference.
Looking at the above, are both disks on the same PATA channel (master and slave)? I believe there's some overhead in swapping between the master and slave, but I'd be surprised if it was that much.

Have you tried having each disk on its own PATA channel?

Thanks,
Chris
Jason Thorpe
2005-03-03 22:29:13 UTC
Post by Chris Gilbert
Looking at the above, are both disks on the same PATA channel (master
and slave)? I believe there's some overhead in swapping between the
master and slave, but I'd be surprised if it was that much.
It's not the overhead so much as "both disks cannot be active at the
same time". There is no parallelism in this case.
Post by Chris Gilbert
Have you tried having each disk on its own PATA channel?
That would certainly be what I would recommend.

-- thorpej
Matthias Scheler
2005-03-03 23:56:16 UTC
Post by Jason Thorpe
Post by Chris Gilbert
Have you tried having each disk on its own PATA channel?
That would certainly be what I would recommend.
As mentioned in another e-mail: I've got too many disks.

I really wish there were motherboards with one of Intel's new PCIe chipsets
(with four SATA ports and decent slots for Gigabit Ethernet) that would work
with my old 2.0GHz P4 Northwood, which doesn't use as much power as the
new Prescott-based P4s.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Matthias Scheler
2005-03-03 22:46:44 UTC
Post by Chris Gilbert
Looking at the above, both disks on the same PATA channel?
Yes.
Post by Chris Gilbert
Have you tried having each disk on its own PATA channel?
No. The machine has four PATA disks and only two PATA channels which means
that I cannot avoid sharing. I arranged the disks as they are now (raid0
on wd0 and wd1, raid1 on wd2 and wd3) to avoid performance hits on raid0
when raid1 (which e.g. contains my NetBSD source trees) is very busy.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Chris Gilbert
2005-03-04 00:25:08 UTC
On Thu, 3 Mar 2005 22:46:44 +0000
Post by Matthias Scheler
Post by Chris Gilbert
Looking at the above, both disks on the same PATA channel?
Yes.
Post by Chris Gilbert
Have you tried having each disk on its own PATA channel?
No. The machine has four PATA disks and only two PATA channels which means
that I cannot avoid sharing. I arranged the disks as they are now (raid0
on wd0 and wd1, raid1 on wd2 and wd3) to avoid performance hits on raid0
when raid1 (which e.g. contains my NetBSD source trees) is very busy.
Might be worth a try, but have you considered moving things around a bit, e.g.:

currently:
Channel 0: raid0 disk 1 and raid0 disk 2
Channel 1: raid1 disk 1 and raid1 disk 2

and try:
Channel 0: raid0 disk 1 and raid1 disk 1
Channel 1: raid0 disk 2 and raid1 disk 2

That way both disks in raid0 can be accessed in parallel, as can both disks in raid1, which might speed things up. Obviously performance when raid0 and raid1 are busy at the same time may suck, but perhaps no worse than currently?
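
(If the disks do get shuffled around, something along these lines should confirm which wd ended up on which atabus after the reboot; the exact output format varies by kernel:

  dmesg | grep -E '^wd[0-9]|atabus'
)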

Chris
Matthias Scheler
2005-03-04 09:18:05 UTC
Post by Chris Gilbert
Channel 0: raid0 disk 1 and raid1 disk 1
Channel 1: raid0 disk 2 and raid1 disk 2
I'll try that when I shutdown the server the next time.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Matthias Scheler
2005-03-24 00:58:01 UTC
single disk 51MB/Sec
RAIDframe 25MB/Sec
So it looks to me like RAIDframe really writes to both disks serially
and not in parallel.
Is this a RAID-0? If so, please try ccd(4) and compare.
No, it's RAID-1. But writes should still happen in parallel there.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Greg Oster
2005-03-24 01:13:49 UTC
Post by Matthias Scheler
single disk 51MB/Sec
RAIDframe 25MB/Sec
So it looks to me like RAIDframe really writes to both disks serially
and not in parallel.
Is this a RAID-0? If so, please try ccd(4) and compare.
No, it's RAID-1. But writes should still happen in parallel there.
They do.

What does your raid0.conf file look like? I'd expect a RAID 1 set to
write a little slower than the raw disk, but not this much slower.

I'd also be interested in ccd(4) results (with the same stripe
widths...)

Later...

Greg Oster
Matthias Scheler
2005-03-24 02:25:22 UTC
Post by Greg Oster
Post by Matthias Scheler
No, it's RAID-1. But writes should still happen in parallel there.
They do.
RAIDframe queues them in parallel. But maybe something in the kernel
serializes them?
Post by Greg Oster
What does your raid0.conf file look like?
The RAID gets autoconfigured. But the configuration file I used for
creating it looked like this:

START array
1 2 0
START disks
/dev/wd0a
/dev/wd1a
START layout
128 1 1 1
START queue
fifo 100
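
(For reference, my reading of the layout line, based on raid(4); the comment below is mine and was not in the original file:

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  128 1 1 1

i.e. a stripe unit of 128 sectors (64KB) and RAID level 1.)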
Post by Greg Oster
I'd expect a RAID 1 set to write a little slower than the raw disk,
but not this much slower.
Yes, me too. It should be slower by the latency it takes to deliver the
write to the second disk. But not by a factor of two.
Post by Greg Oster
I'd also be interested in ccd(4) results (with the same stripe
widths...)
It's not a stripe, it's a mirror.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Greg Oster
2005-03-24 02:53:06 UTC
Post by Matthias Scheler
Post by Greg Oster
Post by Matthias Scheler
No, it's RAID-1. But writes should still happen in parallel there.
They do.
RAIDframe queues them in parallel. But maybe something in the kernel
serializes them?
Shouldn't be...

For kicks: Try your "single disk" benchmark on the two disks at the same
time...
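
(For example, roughly like this, assuming the two component disks temporarily carry their own file systems mounted somewhere individually; the paths are placeholders:

  dd if=/dev/zero of=/mnt/wd0/test.img bs=1024k count=256 &
  dd if=/dev/zero of=/mnt/wd1/test.img bs=1024k count=256 &
  wait
)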

Also: do both disks bench out at the same rate? RAID-1 will be
limited by the slower of the two.
Post by Matthias Scheler
Post by Greg Oster
What does your raid0.conf file look like?
The RAID gets autoconfigured. But the configuration file I used for
START array
1 2 0
START disks
/dev/wd0a
/dev/wd1a
START layout
128 1 1 1
START queue
fifo 100
Looks fine/normal.
Post by Matthias Scheler
Post by Greg Oster
I'd expect a RAID 1 set to write a little slower than the raw disk,
but not this much slower.
Yes, me too. It should be slower by the latency it takes to deliver the
write to the second disk. But not by a factor of two.
It should just be slower by the amount of time it takes to make sure
the drives claim the writes complete before reporting that a given
write is done...

What do your disklabels look like for the RAID set and for the
'single disk' test case?
Post by Matthias Scheler
Post by Greg Oster
I'd also be interested in ccd(4) results (with the same stripe
widths...)
It's not a stripe, it's a mirror.
Oh.. right.. you said that, and I knew that.. *blush* (I had "ccd on
the brain" from thorpej's earlier comments :-} )

Later...

Greg Oster

Johan Danielsson
2005-03-04 10:01:58 UTC
Post by Matthias Scheler
Post by Matthias Scheler
dd if=/dev/zero of=test.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 5.739 secs (46773907 bytes/sec)
How much memory do you have in this machine? 256MB isn't a very large
dataset.

/Johan
Matthias Scheler
2005-03-04 20:36:32 UTC
Post by Johan Danielsson
Post by Matthias Scheler
Post by Matthias Scheler
dd if=/dev/zero of=test.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 5.739 secs (46773907 bytes/sec)
How much memory do you have in this machine?
2GB
Post by Johan Danielsson
256MB isn't a very large dataset.
Yes, but the other machine has 1GB, which is also much more than 256MB.
So if buffering were the explanation, that machine would achieve the same
performance. And 46MB/Sec matches the raw disk performance quite well.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
Matthias Scheler
2005-03-04 21:12:18 UTC
Post by Matthias Scheler
Post by Johan Danielsson
Post by Matthias Scheler
Post by Matthias Scheler
dd if=/dev/zero of=test.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes transferred in 5.739 secs (46773907 bytes/sec)
How much memory do you have in this machine?
2GB
Post by Johan Danielsson
256MB isn't a very large dataset.
Yes, but the other machine has 1GB, which is also much more than 256MB.
So if buffering were the explanation, that machine would achieve the same
performance. And 46MB/Sec matches the raw disk performance quite well.
dd if=/dev/zero bs=1024k count=4096 of=/export/scratch/tron/test.img
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 101.291 secs (42402259 bytes/sec)

The performance didn't change much, as you can see.

Kind regards
--
Matthias Scheler http://scheler.de/~matthias/