Discussion: Interactive responsiveness under heavy I/O load
Thor Lancelot Simon
2004-01-26 16:58:45 UTC
Hello,
I have been noticing some disturbing patterns on two different machines
I'm trying NetBSD on. When I am untarring a file, or generally doing
anything that is causing large amounts of data to be written at once,
interactive performance is seriously degraded. For instance, while I
The new I/O sorting algorithm in -current should make this significantly
better. I am hoping that it can become the default for 2.0.
It seems like when this problem occurs, an I/O scheduler somewhere is
starving everything but the big writing process of resources. But I
have no idea if this is tweakable somewhere, or how to go about fixing
it.
You say later that you're using softdep. The likely problem is an
interaction of softdep and the questionable behaviour of the delayed
write scheduling (the "smooth-sync" or "syncer") code that was
imported along with softdep.

The basic problem is that softdep allows an almost arbitrary number of
metadata operations (directory writes, allocation bitmap writes, etc.)
to be scheduled so long as your machine has sufficient metadata cache
buffers to hold them (when it doesn't, some will be written out to
make space for others; this inherently paces the I/O, but the problem
still exists). All delayed writes are put on a syncer "worklist"
corresponding to a particular second (in some cases, we process two
worklists per second, but usually only one, and never more than two).

Directory writes all get the same delay, "dirdelay", currently set to
15 seconds. Other metadata writes get the delay "metadelay", currently
set to 20 seconds.

What that means, in practice, is that "dirdelay" (15) seconds after you
fire off all that heavy I/O, the syncer will try to complete it all in
a second. This floods the disk queues and, in practice, can make the
system sluggish for _several_ seconds. By prioritizing reads around
async writes, the new sorting algorithm can at least minimize the
effect of this problem on interactive use.

I have experimented with introducing jitter into the scheduling of
directory I/O, with mixed results. It is clear that without the new
disksort, it is a lose; with it, it _should_ be a win but the jury is
still out.
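
To make the worklist mechanics above concrete, here is a small
illustrative userland simulation (plain C, not kernel code; the slot
count, write count and jitter range are made up for the example).
Every directory write fired off in the same second gets the same
"dirdelay", so all of them land on a single one-second worklist;
jittering the delay spreads them over several slots.

#include <stdio.h>
#include <stdlib.h>

#define NSLOTS   32	/* pretend: one syncer worklist per second */
#define DIRDELAY 15	/* seconds, as with the real dirdelay */
#define NWRITES  1000	/* directory writes issued in one second */

static int worklist[NSLOTS];

static void
schedule(int now, int delay)
{
	worklist[(now + delay) % NSLOTS]++;
}

int
main(void)
{
	int i, jitter;

	/* First pass: fixed dirdelay.  Second pass: dirdelay +/- 4s jitter. */
	for (jitter = 0; jitter <= 4; jitter += 4) {
		for (i = 0; i < NSLOTS; i++)
			worklist[i] = 0;
		for (i = 0; i < NWRITES; i++)
			schedule(0, DIRDELAY + (jitter ?
			    rand() % (2 * jitter + 1) - jitter : 0));
		printf("jitter=%d:", jitter);
		for (i = 0; i < NSLOTS; i++)
			if (worklist[i])
				printf("  slot %d: %d", i, worklist[i]);
		printf("\n");
	}
	return 0;
}

With jitter=0 all 1000 writes land on slot 15, which is exactly the
one-second flood described above; with jitter they spread over slots
11 through 19 and the worst single-second burst shrinks accordingly.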

Another problem is that the current syncer keeps its worklist by _file_.
That means that there are some other degenerate cases, one or more of
which you may be seeing:

1) If all of your I/O is to a small number of very busy files, plain
file I/O will bog the system down every "filedelay" (currently 30)
seconds.

2) It is possible, though not likely, that I/O for non-directory
metadata is generating enough load to make your system seem
sluggish. Because this is all expressed to the syncer as I/O
for a single vnode corresponding to the entire filesystem, there
is nothing we can really do to make it flush out smoothly.

I'd appreciate it if someone else could read the code and confirm #2
but it's what I get, having looked at it a few times. There is
ongoing work to address #1; I am actively working on the directory
I/O problem (rebalancing in the syncer itself now looks more
promising than spreading the I/O when it is originally scheduled). In
any case, the new disksort should help a lot.

So there is a light at the end of the tunnel. If you want immediate
relief, turning off softdep should make your system's interactive
performance more predictable, though it will probably make your
I/O itself slower.

Thor
John Goerzen
2004-01-26 17:16:19 UTC
Thanks for your reply and explanation, Thor. I have a couple of questions.
Post by Thor Lancelot Simon
The new I/O sorting algorithm in -current should make this significantly
better. I am hoping that it can become the default for 2.0.
If I were to upgrade to current, how would I enable this sorting
algorithm on my system?
Post by Thor Lancelot Simon
It seems like when this problem occurs, an I/O scheduler somewhere is
starving everything but the big writing process of resources. But I
have no idea if this is tweakable somewhere, or how to go about fixing
it.
You say later that you're using softdep. The likely problem is an
interaction of softdep and the questionable behaviour of the delayed
write scheduling code (the "smooth-sync" or "syncer") code that was
imported along with softdep.
[snip]
Post by Thor Lancelot Simon
So there is a light at the end of the tunnel. If you want immediate
relief, turning off softdep should make your system's interactive
performance more predictable, though it will probably make your
I/O itself slower.
Significantly. One of the first things I did with my new NetBSD system
was to untar pkgsrc.tar.gz. It was S L O W. I didn't make formal
benchmarks, but after turning on softdep, I'd say the performance was at
least four times better, if not more (maybe even twice that).

Have I stumbled across the reason softdep is not enabled by default, or
is there some other logic behind this?

While we're on the topic, does anyone have a nice comparison of ffs or
lfs to reiserfs somewhere? I have found a number of (dated) comparisons
of ffs to ext2.

-- John
Johan A.van Zanten
2004-01-26 20:12:24 UTC
Post by John Goerzen
Have I stumbled across the reason softdep is not enabled by default, or
is there some other logic behind this?
My understanding of how soft dependencies works is that file system
metadata is cached in memory for a short period of time (like 20
seconds?), which means it's at risk (of being lost) if the machine were to
lose power. The mount man page also says that it's gone through "moderate
to heavy testing, but should still be used with care."

-johan
Alfred Perlstein
2004-01-26 20:53:53 UTC
Post by Johan A.van Zanten
Post by John Goerzen
Have I stumbled across the reason softdep is not enabled by default, or
is there some other logic behind this?
My understanding of how soft dependencies works is that file system
metadata is cached in memory for a short period of time (like 20
seconds?), which means it's at risk (of being lost) if the machine were to
loose power. The mount man page also says that it's gone through "moderate
to heavy testing, but should still be used with care."
That's the same as with any other filesystem that has safe metadata
handling, journalling systems included. Your data is not
guaranteed to be there unless you fsync it.
--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Daniel Carosone
2004-01-26 22:04:33 UTC
Post by Johan A.van Zanten
Post by John Goerzen
Have I stumbled across the reason softdep is not enabled by default, or
is there some other logic behind this?
My understanding of how soft dependencies works is that file system
metadata is cached in memory for a short period of time (like 20
seconds?), which means it's at risk (of being lost) if the machine
were to loose power.
Two points:

- The question of "on by default" is shady; softdeps aren't enabled
unless you specify the mount option, but I understand sysinst now
creates fstabs with that option by default. So most users would
have them enabled on new systems at least. I may be wrong about
sysinst, it's not something I really ever use.

- You are right about the metadata being delayed, but it's still
written in-order and before the relevant file data. The essential
point of softdep is to allow the ordering of updates to the on-disk
data structures to be preserved, so the ffs and fsck semantics that
depend on these still work, but without requiring synchronous
writes that make everything stop and wait. If you sync or fsync at
the right time, the previous semantics about data on disk or not
still hold.

Speaking very broadly, if your system crashes, the resulting
filesystem state with softdep is similar to if the machine had
crashed a little earlier without (unless you sync).

Softdep is such a huge performance win, especially for tasks like
extracting a pkgsrc tree, because lots of metadata writes update
the same disk block repetitively (think about adding files one at a
time to the same directory). With softdep, each of these updates is
done in memory (without the sync disk wait) and the resulting final
directory blocks written to disk (again, speaking very broadly).
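
To put a toy number on that (illustrative userland C, not softdep
itself; the file count and directory-entries-per-block figure are
invented): adding N files to one directory dirties the same directory
block over and over, and deferring/aggregating those updates turns N
metadata writes into a handful.

#include <stdio.h>

#define NFILES		  2000
#define DIRENTS_PER_BLOCK   64	/* assumed; depends on fs parameters */

int
main(void)
{
	int sync_writes = 0, aggregated_writes = 0;
	int dirty = 0, i;

	for (i = 0; i < NFILES; i++) {
		sync_writes++;		/* sync metadata: write the block now */
		dirty = 1;		/* softdep-like: just mark it dirty */
		if ((i + 1) % DIRENTS_PER_BLOCK == 0) {
			aggregated_writes += dirty;	/* block full: flush once */
			dirty = 0;
		}
	}
	aggregated_writes += dirty;	/* flush the final partial block */

	printf("synchronous metadata:      %d writes\n", sync_writes);
	printf("aggregated (softdep-like): %d writes\n", aggregated_writes);
	return 0;
}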

--
Dan.
David Laight
2004-01-26 22:22:13 UTC
Post by Daniel Carosone
- The question of "on by default" is shady; softdeps aren't enabled
unless you specify the mount option, but I understand sysinst now
creates fstab's with that option by default. So most users would
have them enabled on new systems at least. I may be wrong about
sysinst, its not something I really ever use.
Sysinst doesn't enable it by default, but does let you select it
(if you select the relevant field on the relevant menu...)

David
--
David Laight: ***@l8s.co.uk
Johnny Billquist
2004-01-26 22:43:18 UTC
Post by Daniel Carosone
Post by Johan A.van Zanten
Post by John Goerzen
Have I stumbled across the reason softdep is not enabled by default, or
is there some other logic behind this?
My understanding of how soft dependencies works is that file system
metadata is cached in memory for a short period of time (like 20
seconds?), which means it's at risk (of being lost) if the machine
were to loose power.
- The question of "on by default" is shady; softdeps aren't enabled
unless you specify the mount option, but I understand sysinst now
creates fstab's with that option by default. So most users would
have them enabled on new systems at least. I may be wrong about
sysinst, its not something I really ever use.
- You are right about the metadata being delayed, but it's still
written in-order and before the relevant file data. The essential
point of softdep is to allow the ordering up updates to the on-disk
data structures to be preserved, so the ffs and fsck semantics that
depend on these still work, but without requiring synchronous
writes that make everything stop and wait. If you sync or fsync at
the right time, the previous semantics about data on disk or not
still hold.
Speaking very broadly, if your system crashes, the resulting
filesystem state with softdep is similar to if the machine had
crashed a little earlier without (unless you sync).
Softdep is such a huge performance win, especially for tasks like
extracting a pkgsrc tree, because lots of metadata writes update
the same disk block repetitively (think about adding files one at a
time to the same directory). With softdep, each of these updates is
done in memory (without the sync disk wait) and the resulting final
directory blocks written to disk (again, speaking very broadly).
Softdep really sounds nice, but last I tried it (a few months ago) it
still crashed my VAX very predictably, so I'd guess it still isn't good
enough to turn on for people in general.

Johnny

Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: ***@update.uu.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol
Gordon Waidhofer
2004-01-26 23:16:28 UTC
Disk I/O performance can be **dramatically** improved
by I/O coalescing. My experience with a proprietary
operating system proved this many times over. Linux
coalesces disk I/Os.

When there are multiple bufs (struct bufs) on the
disk queue doing the same command (read/write) and
accessing consecutive disk addresses, the bufs are
turned into one I/O and use scatter/gather. Error
recovery amounts to bursting the chain and retrying
the bufs individually. It works best if I/Os can be
scheduled in batches, ala strategy_list(bp) rather
than one at a time ala strategy(bp).
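
As a purely illustrative sketch of the merge step (userland C with an
invented request structure, not real struct bufs or driver code): walk
a queue already sorted by block number and fold requests with the same
direction and consecutive addresses into one larger transfer. Error
recovery would then "burst" a merged request back into its pieces and
retry them individually, as described above.

#include <stdio.h>

struct req {
	int	write;		/* 1 = write, 0 = read */
	long	blkno;		/* starting block */
	long	nblks;		/* length in blocks */
};

int
main(void)
{
	struct req q[] = {		/* already disksort()ed by blkno */
		{ 1, 100, 8 }, { 1, 108, 8 }, { 1, 116, 8 },
		{ 0, 300, 16 }, { 1, 400, 8 }, { 1, 408, 8 },
	};
	int n = sizeof(q) / sizeof(q[0]);
	int i, issued = 0;

	for (i = 0; i < n; ) {
		long blkno = q[i].blkno, nblks = q[i].nblks;
		int j = i + 1;

		/* fold in neighbours that continue the same I/O */
		while (j < n && q[j].write == q[i].write &&
		    q[j].blkno == blkno + nblks) {
			nblks += q[j].nblks;
			j++;
		}
		printf("%s blk %ld len %ld (merged %d bufs)\n",
		    q[i].write ? "write" : "read", blkno, nblks, j - i);
		issued++;
		i = j;
	}
	printf("%d I/Os issued for %d bufs\n", issued, n);
	return 0;
}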

There was a reference in the "LFS on BSD" paper
(Seltzer, et al) to a Berkeley project around
the same time that added I/O coalescing to
stock FFS. The performance rivaled LFS. I've
never been able to find a paper on that project.
Anybody know more?

Is anybody working on coalesced disk I/Os?

I've only given it a cursory look. I want
to find a way to add coalescing without
tearing up the whole struct buf subsystem.
It works best if done very near the device
driver. I'm hoping to find a way for the
device drivers to do some sort of "just in time"
coalescing. That way, the struct bufs and
queues would be fine the way they are.
No new data structures need be added or maintained.
It'll be the latter part of the year before
I can really wade in.

-gww
Alfred Perlstein
2004-01-26 23:35:23 UTC
Post by Gordon Waidhofer
I've only given it a cursory look. I want
to find a way to add coallescing without
tearing-up the whole struct buf subsystem.
It works best if done very near the device
driver. I'm hoping to find a way for the
device drivers to do some sort of "just in time"
coallescing. That way, the struct bufs and
queues would be fine the way they are.
No new data structures need be added or maintained.
It'll be the later part of the year before
I can really wade in.
See FreeBSD's vfs_cluster.c.
--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Hauke Fath
2004-01-27 18:30:43 UTC
Post by Johnny Billquist
Post by Daniel Carosone
Softdep is such a huge performance win, especially for tasks like
extracting a pkgsrc tree, because lots of metadata writes update
the same disk block repetitively (think about adding files one at a
time to the same directory). With softdep, each of these updates is
done in memory (without the sync disk wait) and the resulting final
directory blocks written to disk (again, speaking very broadly).
softdep:s really sounds nice, but last I tried it (a few months ago) it
still crashed my VAX very predictably, so I'd guess it still isn't good
enough to turn on for people in general.
Same here on a SparcStation 10. Both with current and 1.6 kernels, softdep
mounts make sure the box locks up during the next /etc/daily run.

hauke

--
/~\ The ASCII Ribbon Campaign
\ / No HTML/RTF in email
X No Word docs in email
/ \ Respect for open standards
David Brownlee
2004-01-27 19:37:28 UTC
Post by Hauke Fath
Post by Johnny Billquist
Post by Daniel Carosone
Softdep is such a huge performance win, especially for tasks like
extracting a pkgsrc tree, because lots of metadata writes update
the same disk block repetitively (think about adding files one at a
time to the same directory). With softdep, each of these updates is
done in memory (without the sync disk wait) and the resulting final
directory blocks written to disk (again, speaking very broadly).
softdep:s really sounds nice, but last I tried it (a few months ago) it
still crashed my VAX very predictably, so I'd guess it still isn't good
enough to turn on for people in general.
Same here on a SparcStation 10. Both with current and 1.6 kernels, softdep
mounts make sure the box locks up during the next /etc/daily run.
I'm running softdeps on 1.6.1_STABLE alpha and sparc5 boxes as
reasonably loaded NFS servers, and on 1.6ZI sparc & sparc64 boxes
as NFS clients (twenty or so users).

Also on i386 boxes, but those are less interesting :)

Alpha has been up for 165 days (I forget why it was rebooted),
sparc5 tends to go down every few months. Client boxes running
current are a little harder to benchmark as they keep getting
updated :)

I'm pretty sure there are still softdep problems, or problems
elsewhere exposed by softdeps pushing things harder...
--
David/absolute -- www.netbsd.org: No hype required --
Johan A.van Zanten
2004-01-27 20:13:53 UTC
Post by David Brownlee
Post by Hauke Fath
Same here on a SparcStation 10. Both with current and 1.6 kernels, softdep
mounts make sure the box locks up during the next /etc/daily run.
I'm running softdeps on a 1.6.1_STABLE, alpha and sparc5 as
reasonably loaded NFS servers, and on 1.6ZI sparc & sparc64 boxes
as NFS clients (twenty or so users).
For the record, i've not had any trouble with soft dep. on SPARC and
Alpha. I generally stay a little behind the bleeding edge: i run releases,
not -current.

As far as i can recollect, i have had no trouble with soft dep. on either
of those platforms with 1.6.1, 1.6, 1.5.4 or 1.5.3.

brahma:/root # mount
/dev/sd0a on / type ffs (local)
/dev/sd0f on /var type ffs (local)
/dev/sd0g on /usr type ffs (local)
/dev/sd0h on /local type ffs (NFS exported, local)
/dev/sd0e on /export/home/000 type ffs (soft dependencies, NFS exported, local)
/dev/sd0d on /export/tew/share type ffs (soft dependencies, NFS exported, local)
brahma:/root # uptime
2:08PM up 120 days, 22:12, 1 user, load averages: 0.15, 0.16, 0.11
brahma:/root # uname -a
NetBSD brahma 1.6.1_STABLE NetBSD 1.6.1_STABLE (BRAHMA) #0: Wed Aug 27 17:47:36 CDT 2003 ***@brahma:/tew/netbsd-src/NetBSD-1.6/usr/src/sys/arch/sparc/compile/BRAHMA sparc


sarasvati:/root # uptime
2:10PM up 97 days, 14:19, 1 user, load averages: 0.09, 0.08, 0.08
sarasvati:/root # mount
/dev/sd0a on / type ffs (local)
/dev/sd0f on /var type ffs (local)
/dev/sd0g on /usr type ffs (soft dependencies, local)
/dev/sd0h on /local type ffs (soft dependencies, NFS exported, local)
mfs:116 on /tmp type mfs (synchronous, local)
sarasvati:/root # uname -a
NetBSD sarasvati 1.6.1_STABLE NetBSD 1.6.1_STABLE (BAGELSPIT) #0: Wed Oct 22 00:28:05 CDT 2003 ***@sarasvati:/local/src/NetBSD/NetBSD-1.6/usr/src/sys/arch/alpha/compile/BAGELSPIT alpha

The "/local"s on each of these machines are somewhat busy file systems.
They contain my source trees (NFS mounted on various hosts) and the local
build areas. I'd estimate that, infrequently, they are fairly heavily
used for hours.


-johan
Johnny Billquist
2004-01-27 20:54:27 UTC
Post by David Brownlee
Post by Hauke Fath
Post by Johnny Billquist
Post by Daniel Carosone
Softdep is such a huge performance win, especially for tasks like
extracting a pkgsrc tree, because lots of metadata writes update
the same disk block repetitively (think about adding files one at a
time to the same directory). With softdep, each of these updates is
done in memory (without the sync disk wait) and the resulting final
directory blocks written to disk (again, speaking very broadly).
softdep:s really sounds nice, but last I tried it (a few months ago) it
still crashed my VAX very predictably, so I'd guess it still isn't good
enough to turn on for people in general.
Same here on a SparcStation 10. Both with current and 1.6 kernels, softdep
mounts make sure the box locks up during the next /etc/daily run.
I'm running softdeps on a 1.6.1_STABLE, alpha and sparc5 as
reasonably loaded NFS servers, and on 1.6ZI sparc & sparc64 boxes
as NFS clients (twenty or so users).
Also on i386 boxes, but those are less interesting :)
Alpha has been up for 165 days (I forget why it was rebooted),
sparc5 tends to go down every few months. Client boxes running
current are a little harder to benchmark as they keep getting
updated :)
I'm pretty sure there are still softdep problems, or problems
elsewhere exposed by softdeps pushing things harder...
Well, if someone really wanted to take a shot at it, I could provide a
system, and a way of shooting it down by using softdep.

But we're talking slow VAXen here...

Johnny

Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: ***@update.uu.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol
Steven M. Bellovin
2004-01-26 17:28:57 UTC
Post by Thor Lancelot Simon
Hello,
I have been noticing some disturbing patterns on two different machines
I'm trying NetBSD on. When I am untarring a file, or generally doing
anything that is causing large amounts of data to be written at once,
interactive performance is seriously degraded. For instance, while I
The new I/O sorting algorithm in -current should make this significantly
better. I am hoping that it can become the default for 2.0.
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
values, on a 256M machine:

vm.nkmempages = 16354
vm.anonmin = 30
vm.execmin = 30
vm.filemin = 10
vm.maxslp = 20
vm.uspace = 16384
vm.anonmax = 80
vm.execmax = 60
vm.filemax = 50
vm.bufcache = 20
vm.bufmem_lowater = 3348480
vm.bufmem_hiwater = 53575680

since I've been told that those are important. (This is a kernel from
22 Jan.)

--Steve Bellovin, http://www.research.att.com/~smb
David S.
2004-01-26 19:32:52 UTC
Post by Steven M. Bellovin
Post by Thor Lancelot Simon
The new I/O sorting algorithm in -current should make this significantly
better. I am hoping that it can become the default for 2.0.
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
I can confirm this. I'm using i386 1.6ZH with NEW_BUFQ_STRATEGY and
soft updates on IDE disks. My vm sysctl values are the same as below.
Yesterday, I upgraded Mozilla and MozillaFirebird on the machine.
While the source files of those packages were un-tarring, the system
was dead in the water, and not only for interactive use. That machine
also serves NFS from some other disks, and access to those from
network clients also halted during all of the heavy I/O activity.

David S.
Post by Steven M. Bellovin
vm.nkmempages = 16354
vm.anonmin = 30
vm.execmin = 30
vm.filemin = 10
vm.maxslp = 20
vm.uspace = 16384
vm.anonmax = 80
vm.execmax = 60
vm.filemax = 50
vm.bufcache = 20
vm.bufmem_lowater = 3348480
vm.bufmem_hiwater = 53575680
Sean Davis
2004-01-26 23:03:50 UTC
Post by David S.
Post by Steven M. Bellovin
Post by Thor Lancelot Simon
The new I/O sorting algorithm in -current should make this significantly
better. I am hoping that it can become the default for 2.0.
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
I can confirm this. I'm using i386 1.6ZH with NEW_BUFQ_STRATEGY and
soft updates on IDE disks. My vm sysctl values are the same as below.
Yesterday, I upgraded Mozilla and MozillaFirebird on the machine.
While the source files of those packages was un-tarring, the system
was dead in the water, and not only for interactive use. That machine
also servers NFS from some other disks, and access to those from
network clients also halted during all of the heavy I/O activity.
I can also toss in a "me too" on this one. NEW_BUFQ_STRATEGY is *not*
stable. I can perform tasks like David describes above without it, and the
system will lag a bit, of course, but it won't become totally unresponsive.
With NEW_BUFQ_STRATEGY, the system becomes totally unusable for upwards of
thirty seconds. (As I stated in a previous mail, in the thread started by
Chuck Silvers, unless I miss my guess, about whether NEW_BUFQ_STRATEGY
should be made the default: my opinion is PLEASE DON'T :)
Post by David S.
Post by Steven M. Bellovin
vm.nkmempages = 16354
vm.anonmin = 30
vm.execmin = 30
vm.filemin = 10
vm.maxslp = 20
vm.uspace = 16384
vm.anonmax = 80
vm.execmax = 60
vm.filemax = 50
vm.bufcache = 20
vm.bufmem_lowater = 3348480
vm.bufmem_hiwater = 53575680

My values, for comparison:

vm.nkmempages = 24551
vm.anonmin = 30
vm.execmin = 20
vm.filemin = 20
vm.maxslp = 20
vm.uspace = 16384
vm.anonmax = 80
vm.execmax = 60
vm.filemax = 60
vm.bufcache = 5
vm.bufmem_lowater = 1256960
vm.bufmem_hiwater = 20111360

-Sean

--
/~\ The ASCII
\ / Ribbon Campaign Sean Davis
X Against HTML aka dive
/ \ Email!
Thor Lancelot Simon
2004-01-27 00:54:12 UTC
Post by Steven M. Bellovin
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
What kind of disk do you have? The option for NEW_BUFQ_STRATEGY is
stupidly implemented -- it's turned on or off down in the disk driver.
If you happen to be using an 'ld', for instance, there's no #ifdef in
ld.c, so the option does nothing. ("Don't get me started...")
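
(For anyone following along who hasn't tried it: the option itself is
a kernel compile-time option, roughly as sketched below; whether it
then takes effect depends on those per-driver #ifdefs. Treat this as a
sketch and check options(4) and your own config file rather than my
spelling.)

# Illustrative kernel config fragment (e.g. in a copy of GENERIC);
# rebuild and install a new kernel afterwards.  Whether a given disk
# driver honours the option is another matter, as noted above.
options 	NEW_BUFQ_STRATEGY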

Are you using softdep? When Paul removed the fixed limit on the number
of entries in the metadata cache, he accidentally upset the entire house
of cards on which the stability of the I/O system rested, in the softdep
case at least. The _only_ thing keeping softdep from flooding the queues
with more I/O per second than they could actually handle was that our old
buffer cache implementation imposed an extremely low limit on the number
of pending directory operations at any given time, causing writers (well,
directory-operators, really) to block and preventing them from pouring
as much I/O into the pipe as softdep would let them in one second.

I have a patch, as I said, to spread the I/O out by playing with the
use of "dirdelay". I can send it to you to try out if you like.

Thor
Thor Lancelot Simon
2004-01-27 00:57:32 UTC
Post by Thor Lancelot Simon
Post by Steven M. Bellovin
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
What kind of disk do you have? The option for NEW_BUFQ_STRATEGY is
stupidly implemented -- it's turned on or off down in the disk driver.
Just to be clear -- and apologies to anyone I offended, which wasn't
my intent -- the NEW_BUFQ_STRATEGY code is clean, and works great. It's
just that, at the moment, the way it is selected via the option isn't
quite right. Easily enough fixed, and really more a sign that having
multiple disk sorting algorithms to choose from wasn't really intended
to last this long than anything else, AFAICT.

Thor
Steven M. Bellovin
2004-01-27 01:38:16 UTC
Post by Thor Lancelot Simon
Post by Steven M. Bellovin
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
What kind of disk do you have? The option for NEW_BUFQ_STRATEGY is
stupidly implemented -- it's turned on or off down in the disk driver.
If you happen to be using an 'ld', for instance, there's no #ifdef in
ld.c, so the option does nothing. ("Don't get me started...")
Are you using softdep? When Paul removed the fixed limit on the number
of entries in the metadata cache, he accidentally upset the entire house
of cards on which the stability of the I/O system rested, in the softdep
case at least. The _only_ thing keeping softdep from flooding the queues
with more I/O per second than they could actually handle was that our old
buffer cache implementation imposed an extremely low limit on the number
of pending directory operations at any given time, causing writers (well,
directory-operators, really) to block and preventing them from pouring
as much I/O into the pipe as softdep would let them in one second.
I have a patch, as I said, to spread the I/O out by playing with the
use of "dirdelay". I can send it to you to try out if you like.
Unfortunately, I can't tell you right now. Or rather, I know I'm using
ordinary IDE disks, via the wd driver; I don't think I have softdep on,
but I can't say for sure because my -current machine is having POST
problems and won't boot right now. (Thinking I had a work-around, I
stupidly upgraded to 1.6ZI just now. I did get it to boot, once, long
enough to install userspace, but I couldn't get it to reboot. Sigh.)


--Steve Bellovin, http://www.research.att.com/~smb
Stephan Uphoff
2004-01-26 23:52:37 UTC
Post by Gordon Waidhofer
Disk I/O performance can be **dramatically** improved
by I/O coallescing. My experience with a proprietary
operating system proved this many times over. Linux
coallesces disk I/Os.
I agree - really useful for a filesystem log and small nfs type write requests.

When I initially wrote the file system for Network Storage Solution
they had a proprietary kernel and I could use scatter/gather for
I/O coalescing.

Porting to NetBSD, we had Chris Jepeway implement clustering using
(architecture-specific) MMU tricks.

NSS allowed Chris Jepeway to post some of the clustering code.

http://mail-index.netbsd.org/tech-perform/2002/09/07/0000.html

However, since the code only works on i386 (and probably a few other
architectures), interest in the code was low. Chris was busy with other
work and never got around to making it arch-independent.


Stephan
Post by Gordon Waidhofer
When there are multiple bufs (struct bufs) on the
disk queue doing the same command (read/write) and
accessing consecutive disk addresses, the bufs are
turned into one I/O and use scatter/gather. Error
recovery amounts to bursting the chain and retrying
the bufs individually. It works best if I/Os can be
scheduled in batches, ala strategy_list(bp) rather
than one at a time ala strategy(bp).
There was a reference in the "LFS on BSD" paper
(Seltzer, et al) to a Berkeley project around
the same time that added I/O coallescing to
stock FFS. The performance rivaled LFS. I've
never been able to find a paper on that project.
Anybody know more?
Is anybody working on coallesced disk I/Os?
I've only given it a cursory look. I want
to find a way to add coallescing without
tearing-up the whole struct buf subsystem.
It works best if done very near the device
driver. I'm hoping to find a way for the
device drivers to do some sort of "just in time"
coallescing. That way, the struct bufs and
queues would be fine the way they are.
No new data structures need be added or maintained.
It'll be the later part of the year before
I can really wade in.
-gww
Thor Lancelot Simon
2004-01-27 01:08:50 UTC
Post by Stephan Uphoff
When I initially wrote the file system for Network Storage Solution
they had a proprietary kernel and I could use scatter/gather for
I/O coalescing.
Porting to NetBSD we had Chris Jepeway implement clustering using
(architecture specific)
MMU tricks.
I personally find this quite useful -- and it should be possible to do
entirely using MI calls into the UVM subsystem. The problem is that
I/O coalescing by creating page aliases using the MMU is incredibly
inefficient on some architectures, either because of the cache flushing
it forces or because of the cost of MMU operations. For that reason,
a number of developers are strongly opposed to integrating *that
particular* kind of I/O coalescing implementation into NetBSD.

I'm a bit less absolutist about it. I'd personally prefer that we
played such tricks where they were reasonably cheap and effective,
even if a more general implementation using physical addresses for I/O
or chains of buffers were chosen as a long-term solution.

However, I'm pretty sure I/O coalescing is not the way out of the hole
we're currently in. Fixing the syncer is one key part of the problem;
managing to more effectively apply backpressure on writers (including
softdep's stream of directory and metadata updates) so that they do not
behave in a bursty manner is another. But I do not see the applicability
of I/O coalescing to either of these problems, and I do not see evidence
that we are causing ourselves the current trouble by generating many
small contiguous I/O operations, which is the problem that coalescing
would solve; in fact, I see a great deal of evidence that the causes
of the problem lie elsewhere.

Thor
Alfred Perlstein
2004-01-27 02:28:19 UTC
Post by Thor Lancelot Simon
I personally find this quite useful -- and it should be possible to do
entirely using MI calls into the UVM subsystem. The problem is that
I/O coalescing by creating page aliases using the MMU is incredibly
inefficient on some architectures, either because of the cache flushing
it forces or because of the cost of MMU operations. For that reason,
a number of developers are strongly opposed to integrating *that
particular* kind of I/O coalescing implementation into NetBSD.
Yes, the remapping stuff is wasteful because typically this happens:

bufs are mapped to a "super buf" and handed to a driver,
the driver then does VTOPHYS to be able to DMA the data.

so the mapping is useless.

It wouldn't be that difficult to pass a SG list down to drivers that
support it.

And that support could be optional based on a callback from the
driver saying "I can take the SG list input".
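
A minimal userland mock of that idea (every name below is invented for
illustration; this is not a proposed kernel interface): the driver
advertises whether it accepts a scatter/gather list, and the submitter
either hands the whole list down in one call or falls back to one
request per segment.

#include <stdio.h>
#include <stddef.h>

struct sg_seg {
	void	*addr;
	size_t	 len;
};

struct mock_driver {
	const char *name;
	/* optional: NULL means the driver takes one segment at a time */
	void	(*strategy_sg)(const struct sg_seg *, int);
	void	(*strategy)(const struct sg_seg *);
};

static void
one_at_a_time(const struct sg_seg *seg)
{
	printf("  single I/O, %zu bytes\n", seg->len);
}

static void
takes_sg(const struct sg_seg *sg, int nseg)
{
	printf("  one I/O, %d segments\n", nseg);
}

static void
submit(const struct mock_driver *drv, const struct sg_seg *sg, int nseg)
{
	int i;

	printf("%s:\n", drv->name);
	if (drv->strategy_sg != NULL)
		drv->strategy_sg(sg, nseg);	/* driver coalesces via S/G */
	else
		for (i = 0; i < nseg; i++)	/* fallback: one at a time */
			drv->strategy(&sg[i]);
}

int
main(void)
{
	static char a[4096], b[4096], c[4096];
	struct sg_seg sg[3] = {
		{ a, sizeof(a) }, { b, sizeof(b) }, { c, sizeof(c) },
	};
	struct mock_driver legacy = { "legacy driver", NULL, one_at_a_time };
	struct mock_driver modern = { "sg-capable driver", takes_sg, one_at_a_time };

	submit(&legacy, sg, 3);
	submit(&modern, sg, 3);
	return 0;
}
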
--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Gordon Waidhofer
2004-01-27 03:51:26 UTC
Post by Thor Lancelot Simon
I/O coalescing by creating page aliases using the MMU is incredibly
inefficient on some architectures, either because of the cache flushing
it forces or because of the cost of MMU operations.
Yes, the remapping stuff is wasteful ....
pmap stuff is really expensive, especially on some
architectures and especially on multiprocessors. And,
as you say, in the end it amounts to very expensive
pointer memory.

Forming a "super bp" above strategy() really isn't the
way to go. Here's a little thought experiment.

A filesystem/database/videoserver/other generates a
huge pile of I/O requests (bufs). These get handed
into strategy(). Wow, it isn't a disk but an LVM.
The requests get dispersed across the constituent
disks. Now is a good time to coalesce.

If the first batch of bufs had been coalesced,
all that would happen is the LVM would go through
the expensive step of taking them all apart again.

I did look at vfs_cluster.c on FreeBSD. But just
briefly. It doesn't look like it's coalescing. It's
doing the "super bp" thing and only for file data.
Don't smack me if I misinterpreted vfs_cluster.c.
Thanx for the pointer.

Consider another thought experiment. sync() flushes
a big pile of inodes. That means lots of inode blocks.
Inode blocks have a habit of being consecutive on disk.
Cool. So those 37 inode blocks could be delivered in
a single, coalesced I/O. Trust me, this is a huge
performance win under heavy load. 1/37 the interrupts.
1/37 the CPU power. 1/37 the access latency. Big win.

Coalescing I/O isn't going to do anything about softdep
bugs. I just wanted to start a thread since there was
discussion about performance under heavy load.
It's been good. Nothing urgent, though.

Cheers,
-gww
Thor Lancelot Simon
2004-01-27 04:35:41 UTC
Post by Gordon Waidhofer
Coallescing I/O isn't going to do anything about softdep
bugs. I just wanted to start a thread since there was
discussion about performance under heavy load.
It's been good. Nothing urgent, though.
FWIW, I see no reason that disksort couldn't build something
like the buffer chains that BSD/OS used to use to do FFS
clustering without the awful pagemove() hack. Several people
have proposed this, or its equivalent. The thing is, nobody's
seemed willing to go define such an interface and whack it into
the various disk drivers; making disksort actually merge is
by far the easy part.

Generally, our I/O subsystem needs to get smarter about several
facets of request handling. The N of us who've picked up
per-device MAXPHYS but never finished it are all on the hook for
some of that... :-)
--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Gordon Waidhofer
2004-01-27 05:29:13 UTC
I did spend a couple of hours (that I didn't have)
looking into the driver changes, et al. That rabbit
hole goes mighty deep :)

Disksort() should be coalescing friendly, I think.
When there is channel capacity, the buf queue can
be scanned according to a set of constraints (max
transfer size, DMA slots, etc). That's what I meant
by "Just In Time" coalescing.
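
A toy version of that scan (illustrative userland C with invented
constants, not driver code): when the channel has capacity, walk the
head of the already-sorted queue and build one transfer, stopping when
the next buf is not contiguous, when a maximum transfer size would be
exceeded, or when the controller's scatter/gather slots run out.

#include <stdio.h>

#define MAX_XFER_BLKS	64	/* e.g. MAXPHYS expressed in blocks (assumed) */
#define MAX_DMA_SEGS	 4	/* e.g. controller S/G slots (assumed) */

struct breq {
	long	blkno;
	long	nblks;
};

int
main(void)
{
	struct breq q[] = {		/* sorted, all contiguous */
		{ 100, 16 }, { 116, 16 }, { 132, 16 },
		{ 148, 16 }, { 164, 16 }, { 180, 16 },
	};
	int n = sizeof(q) / sizeof(q[0]), i = 0;

	while (i < n) {
		long start = q[i].blkno, len = q[i].nblks;
		int segs = 1;

		while (i + segs < n &&
		    q[i + segs].blkno == start + len &&
		    len + q[i + segs].nblks <= MAX_XFER_BLKS &&
		    segs < MAX_DMA_SEGS)
			len += q[i + segs++].nblks;

		printf("issue blk %ld len %ld (%d bufs)\n", start, len, segs);
		i += segs;
	}
	return 0;
}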

JITC? Doesn't roll off the tongue, does it.

So, disksort shouldn't coalesce per se. It
should just make coalescing at the right time
easier.

As you say, the rub is all the data structures
for DMA setup. For example, the SCSI request has
but one dataptr/datalen pair. Changing the request
structure means a tear-up of a helluva lot of code.

Mighty deep rabbit hole.

Definitely consensus on what's to be done
should be built before building anything else.

I highly recommend, as part of this train of
thought, reviewing Seltzer's paper "Disk
Scheduling Revisited", from about 12 years ago.
At first look it'll seem tepid. But give it
a chance. There's a lot of insight there.
Pay particular attention to the loads measured.
It takes a huge load before the "exotic" head
schedulers matter.

Cheers,
-gww
Greywolf
2004-01-27 06:31:40 UTC
Thus spake Thor Lancelot Simon ("TLS> ") sometime Today...

TLS> I personally find this quite useful -- and it should be possible to do
TLS> entirely using MI calls into the UVM subsystem. The problem is that
TLS> I/O coalescing by creating page aliases using the MMU is incredibly
TLS> inefficient on some architectures, either because of the cache flushing
TLS> it forces or because of the cost of MMU operations. For that reason,
TLS> a number of developers are strongly opposed to integrating *that
TLS> particular* kind of I/O coalescing implementation into NetBSD.

I think it's laudable to do this sort of thing MI, but if it's not
practical to do so, perhaps having it MD, or at least group-of-M-D
[naming the identifiers and pulling them in via some conditional
mechanism (compile-time might look uglier, but it's probably more
efficient than run-time)] would be the best way to go.

i.e. i386/sparc/alpha might work best with method A, vax/mips/m68k
might work best with method B, etc. If I understand correctly,
there aren't a whole slew of different ways to do this; perhaps
three, maybe four.

It seems inelegant on the surface, I'm sure, but if one size doesn't
fit all, we shouldn't be trying to shoehorn it.

This is from someone who is Not A Kernel Engineer.

[reading further...]

TLS> I'm a bit less absolutist about it. I'd personally prefer that we
TLS> played such tricks where they were reasonably cheap and effective,
TLS> even if a more general implementation using physical addresses for I/O
TLS> or chains of buffers were chosen as a long-term solution.

I would agree, as long as we keep in mind that the long-term solution
should not degrade performance from the "tricks".

TLS> However, I'm pretty sure I/O coalescing is not the way out of the hole
TLS> we're currently in.

Perhaps not the way completely out, but it appears to be part of it.
The hole seems to me to be more complex than "just a hole".

TLS> Fixing the syncer is one key part of the problem;
TLS> managing to more effectively apply backpressure on writers (including
TLS> softdep's stream of directory and metadata updates) so that they do not
TLS> behave in a bursty manner is another. But I do not see the applicability
TLS> of I/O coalescing to either of these problems, and I do not see evidence
TLS> that we are causing ourselves the current trouble by generating many
TLS> small contiguous I/O operations, which is the problem that coalescing
TLS> would solve; in fact, I see a great deal of evidence that the causes
TLS> of the problem lie elsewhere.

To my untrained eye, it's not all sitting in the buffer cache, either,
though. I think, to quote a famous writer, "the way out is through".
In short, the path lies somewhere in the middle.

Has anyone taken the time to profile the kernel -- specifically, of
course, the I/O system -- or is that another one of those things which is
Not Yet Working? I ask only because we see a lot of "well, this looks
broken, and that looks broken, and as of this update/major change,
performance has been drawn through the grate"; in short, quite a bit of
speculation.

[But I seem to recall something about kernel profiling Not Quite Working.]

Apologies if my ignorance on this offends anyone, or if I'm way off
base. I'm always learning -- please be kind.

--*greywolf;
--
22 Ways to Get Yourself Killed While Watching 'The Lord Of The Rings':

#1: Stand up halfway through the movie and yell loudly, "Wait...where the
f**k is Harry Potter?"
MLH
2004-01-27 03:40:11 UTC
Post by David S.
Post by Steven M. Bellovin
Thor, I find -current to be almost unusable for interactive work when
there's something I/O-intensive running at the same time. I have
NEW_BUFQ_STRATEGY enabled, as best I can tell. Here are my vm. sysctl
I can confirm this. I'm using i386 1.6ZH with NEW_BUFQ_STRATEGY and
soft updates on IDE disks. My vm sysctl values are the same as below.
Yesterday, I upgraded Mozilla and MozillaFirebird on the machine.
While the source files of those packages was un-tarring, the system
was dead in the water, and not only for interactive use. That machine
also servers NFS from some other disks, and access to those from
network clients also halted during all of the heavy I/O activity.
I have several -current machines with softdeps. Two of them are an NFS
server and client; when I do a 'nice +20 cvs up' in pkgsrc on the client,
both machines essentially grind to a halt, with tiny spurts of
other activity, until the cvs up is completed. Several times lately,
a cdrecord session has failed, indicating a buffer underrun, yet
there was no evidence of an actual underrun. It appears the FS
just quit supplying data to the cdrecorder long enough to trash
the write. The KDE login sequence, which used to take about 20
seconds on several XP2000-2200s, now takes a minute to a minute and
a half. Konqueror startups take about 20 seconds on these same
machines when they used to take about two. Note that these likely
access many small config files. xterms, etc. start up immediately,
as usual.

vm.loadavg: 0.82 0.75 0.78
vm.nkmempages = 32743
vm.anonmin = 10
vm.execmin = 5
vm.filemin = 10
vm.maxslp = 20
vm.uspace = 16384
vm.anonmax = 80
vm.execmax = 30
vm.filemax = 50
vm.bufcache = 30
vm.bufmem_lowater = 10053120
vm.bufmem_hiwater = 160849920

On a machine that isn't really very busy right now.
Thor Lancelot Simon
2004-01-27 04:39:07 UTC
Post by MLH
vm.bufcache = 30
Unless you increased vm.bufcache, this kernel's from the few days in which
the backpressure mechanism on the buffer cache was, basically, _completely_
broken. You probably want to upgrade and try again.
--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Stephan Uphoff
2004-01-27 18:25:32 UTC
Post by Gordon Waidhofer
Post by Thor Lancelot Simon
I/O coalescing by creating page aliases using the MMU is incredibly
inefficient on some architectures, either because of the cache flushing
it forces or because of the cost of MMU operations.
Yes, the remapping stuff is wasteful ....
pmap stuff is really expensive, especially on some
architectures and especially on multiprocessors. And,
as you say, in the end it amounts to very expensive
pointer memory.
I agree - I/O coalescing through page aliasing is far from the ideal
solution. Scatter/gather is a lot better, especially since it would allow
coalescing non-page-aligned buffers (mbuf chains, for example).
It would also be nice to avoid uvm_pagermapin/uvm_pagermapout
calls by using physical addresses where possible.

FYI: the newest i386 pmap (1.168) avoids expensive TLB flushes on mappings
never touched by the CPU (to optimize for DMA-only mappings). This makes
the page aliasing trick a *LOT* less expensive on i386 SMP.
Post by Gordon Waidhofer
As you say, the rub is all the data structures
for DMA setup. For example, the SCSI request has
but one dataprt/datalen pair. Changing the request
structure means a tear-up of a helluva lot of code.
Mighty deep rabbit hole.
Well - with Chris's patches you can easily take a look at the bottom of
the hole to see if there is any gold there or just a dead rabbit ;-)
Once you can document a gold find - you might find more helping
hands to bring in the heavy mining equipment.

Stephan