Discussion:
Disk scheduling policy (Re: NEW_BUFQ_STRATEGY)
Thor Lancelot Simon
2003-12-01 20:07:32 UTC
Permalink
On Mon, Dec 01, 2003 at 01:35:23PM -0500, Thor Lancelot Simon wrote:
>
> SGI has an interesting general-purpose policy with two queues, pulling a
> configurable number of requests from each queue in turn, that is described
> in the release notes for a recent version of Irix. It appears to perform
> reasonably well for both interactive and fileserver workloads and might be
> a better default than either our current policy or NEW_BUFQ_STRATEGY. I've
> posted a pointer to it here before -- if anyone wants to implement this but
> can't find the details I'll dig them up again.

From the now-dead URL http://www.sgi.com/developers/feature/2001/roadmap.html:

| Disk sorting in 6.5.8
|
| Previously, all disk requests were sorted by block
| number. Unfortunately, if the filesystem write activity was more than
| the disk could satisfy, the disk could get swamped with delayed write
| requests. This would result in reads and synchronous writes being delayed
| for extensive periods. In extreme cases, the system would appear to stall
| or would experience NFS timeouts.
|
| In 6.5.8, the queues are split. Doing this permits queuing delayed writes
| into one queue, while synchronous writes and reads are entered into another
| queue. In 6.5.8 the disk driver will alternate between queues. This ensures
| that a large queue of delayed write requests will not adversely impact
| interactive response. If both delayed writes and other requests are pending,
| the driver will alternate between them, issuing several delayed writes,
| then several of the other requests. Selecting several from each queue each
| time, rather than just one from each queue each time, makes sequential
| I/O faster and disk performance is maximized.

A really interesting examination of a certain type of pathological load
for the traditional BSD elevator sort scheduler can be found at:

http://www.cs.rice.edu/~ssiyer/r/antsched/ssiyer-thesis.pdf

It seems to me that softdep, if flushing dependencies, may cause exactly
the kind of "deceptive idleness" the paper discusses.

Thor
Jason Thorpe
2003-12-01 22:00:09 UTC
Permalink
On Dec 1, 2003, at 12:07 PM, Thor Lancelot Simon wrote:

> | In 6.5.8, the queues are split. Doing this permits queuing delayed writes
> | into one queue, while synchronous writes and reads are entered into another
> | queue. In 6.5.8 the disk driver will alternate between queues. This ensures
> | that a large queue of delayed write requests will not adversely impact
> | interactive response. If both delayed writes and other requests are pending,
> | the driver will alternate between them, issuing several delayed writes,
> | then several of the other requests. Selecting several from each queue each
> | time, rather than just one from each queue each time, makes sequential
> | I/O faster and disk performance is maximized.

This is a pretty cool, and fairly simple, algorithm. I bet it could be
implemented as a back-end to the new bufq code pretty easily.
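
Something like this, maybe (just a sketch with made-up types and names:
struct req, struct twoq, twoq_get(); not the real bufq interface). An
actual back-end would operate on struct buf and keep each queue disksorted:

#include <sys/queue.h>

struct req {
        TAILQ_ENTRY(req) r_link;
        unsigned long    r_blkno;
};
TAILQ_HEAD(reqq, req);

struct twoq {
        struct reqq q[2];       /* q[0]: reads + sync writes, q[1]: delayed writes */
        int cur;                /* queue currently being drained */
        int issued;             /* requests issued from q[cur] this turn */
        int burst;              /* tunable: how many to take before switching */
};

/*
 * Pick the next request to issue, alternating between the two queues
 * every "burst" requests so that a pile of delayed writes cannot shut
 * out reads and synchronous writes.
 */
static struct req *
twoq_get(struct twoq *tq)
{
        struct req *r;
        int other = tq->cur ^ 1;

        if ((tq->issued >= tq->burst || TAILQ_EMPTY(&tq->q[tq->cur])) &&
            !TAILQ_EMPTY(&tq->q[other])) {
                tq->cur = other;
                tq->issued = 0;
        }
        r = TAILQ_FIRST(&tq->q[tq->cur]);
        if (r != NULL) {
                TAILQ_REMOVE(&tq->q[tq->cur], r, r_link);
                tq->issued++;
        }
        return r;
}

Insertion into each queue would still be disksorted, so the elevator
behaviour is preserved within each burst.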

-- Jason R. Thorpe <***@wasabisystems.com>
Alfred Perlstein
2003-12-01 23:04:58 UTC
Permalink
* Jason Thorpe <***@wasabisystems.com> [031201 14:00] wrote:
>
> On Dec 1, 2003, at 12:07 PM, Thor Lancelot Simon wrote:
>
> >| In 6.5.8, the queues are split. Doing this permits queuing delayed writes
> >| into one queue, while synchronous writes and reads are entered into another
> >| queue. In 6.5.8 the disk driver will alternate between queues. This
>
> This is a pretty cool, and fairly simple, algorithm. I bet it could be
> implemented as a back-end to the new bufq code pretty easily.

If you keep the bufs on two queues, a combined queue and separate
queues, you'll pay double the sorting cost, but only have to disrupt
the elevator algorithm when you reach a tunable threshold. In fact
you could also keep a single queue and split it when the I/O becomes
too saturated, by keeping counts of the mix in the single
queue.
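
(Hypothetically, the bookkeeping for the single-queue variant could be
as simple as the sketch below; struct mixq and the names are invented,
the point is only when to decide to split.)

struct mixq {
        int ndelayed;           /* delayed writes currently queued */
        int nother;             /* reads and synchronous writes currently queued */
        int split;              /* currently running in two-queue mode? */
        int threshold;          /* tunable */
};

/* Called when a request is enqueued; decide whether the single sorted
 * queue should now be split into delayed/other queues. */
static void
mixq_account(struct mixq *m, int is_delayed)
{
        if (is_delayed)
                m->ndelayed++;
        else
                m->nother++;

        /* Only disrupt the plain elevator sort once delayed writes
         * swamp everything else. */
        if (!m->split && m->ndelayed > m->threshold && m->nother > 0)
                m->split = 1;   /* caller re-queues onto the two queues */
}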



--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Thor Lancelot Simon
2003-12-01 23:11:59 UTC
Permalink
On Mon, Dec 01, 2003 at 03:04:58PM -0800, Alfred Perlstein wrote:
> * Jason Thorpe <***@wasabisystems.com> [031201 14:00] wrote:
> >
> > On Dec 1, 2003, at 12:07 PM, Thor Lancelot Simon wrote:
> >
> > >| In 6.5.8, the queues are split. Doing this permits queuing delayed writes
> > >| into one queue, while synchronous writes and reads are entered into another
> > >| queue. In 6.5.8 the disk driver will alternate between queues. This
> >
> > This is a pretty cool, and fairly simple, algorithm. I bet it could be
> > implemented as a back-end to the new bufq code pretty easily.
>
> If you keep the bufs on two queues, a combined queue and separate
> queues, you'll pay double the sorting cost, but only have to disrupt
> the elevator algorithm when you reach a tunable threshold.

Matt Thomas suggested simply starting at whichever end of the other
queue is nearer the current head position, which is an interesting
idea. I was pondering zoning the disk, so you'd have multiple "queues"
on each side of the "delayed"/"normal" divide, and could try to switch
to the one nearest the block you're currently at, but that adds
significant complexity. I'm going to implement a couple of different
things and measure them for different workloads.
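
For concreteness, the first idea could look roughly like this, assuming
the other queue is kept sorted by block number and we remember the last
block number issued (the names here, struct req and near_end_first(),
are invented for the sketch):

#include <sys/queue.h>

struct req {
        TAILQ_ENTRY(req) r_link;
        unsigned long    r_blkno;
};
TAILQ_HEAD(reqq, req);

/*
 * When switching queues, start from whichever end of the block-sorted
 * queue is closer to the last block issued, to keep the first seek short.
 */
static struct req *
near_end_first(struct reqq *q, unsigned long lastblk)
{
        struct req *head = TAILQ_FIRST(q);
        struct req *tail = TAILQ_LAST(q, reqq);
        unsigned long dhead, dtail;

        if (head == NULL)
                return NULL;
        dhead = (head->r_blkno > lastblk) ?
            head->r_blkno - lastblk : lastblk - head->r_blkno;
        dtail = (tail->r_blkno > lastblk) ?
            tail->r_blkno - lastblk : lastblk - tail->r_blkno;
        return (dhead <= dtail) ? head : tail;
}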

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Jason Thorpe
2003-12-01 23:16:40 UTC
Permalink
On Dec 1, 2003, at 3:11 PM, Thor Lancelot Simon wrote:

> Matt Thomas suggested simply starting at whichever end of the other
> queue is nearer the current head position, which is an interesting
> idea. I was pondering zoning the disk, so you'd have multiple "queues"
> on each side of the "delayed"/"normal" divide, and could try to switch
> to the one nearest the block you're currently at, but that adds
> significant complexity. I'm going to implement a couple of different
> things and measure them for different workloads.

Presumably this would imply recording the block # of the last request
that was issued on the disk?

I can see a danger in this kind of algorithm... if you were to always
prefer issuing requests for "near by" blocks, then you could get into a
situation where an app that performs many transactions to a localized
region could starve other apps whose data is "somewhere else".

-- Jason R. Thorpe <***@wasabisystems.com>
Matt Thomas
2003-12-01 23:47:44 UTC
Permalink
At 03:16 PM 12/1/2003, Jason Thorpe wrote:
>On Dec 1, 2003, at 3:11 PM, Thor Lancelot Simon wrote:
>
>>Matt Thomas suggested simply starting at whichever end of the other
>>queue is nearer the current head position, which is an interesting
>>idea. I was pondering zoning the disk, so you'd have multiple "queues"
>>on each side of the "delayed"/"normal" divide, and could try to switch
>>to the one nearest the block you're currently at, but that adds
>>significant complexity. I'm going to implement a couple of different
>>things and measure them for different workloads.
>
>Presumably this would imply recording the block # of the last request that
>was issued on the disk?
>
>I can see a danger in this kind of algorithm... if you were to always
>prefer issuing requests for "near by" blocks, then you could get into a
>situation where an app that performs many transactions to a localized
>region could starve other apps whose data is "somewhere else".

My assumption was that you *always* drain the current queue before
switching to other queue. In that scenario, there is no starvation.


--
Matt Thomas email: ***@3am-software.com
3am Software Foundry www: http://3am-software.com/bio/matt/
Cupertino, CA disclaimer: I avow all knowledge of this message.
Thor Lancelot Simon
2003-12-01 23:57:18 UTC
Permalink
On Mon, Dec 01, 2003 at 03:47:44PM -0800, Matt Thomas wrote:
> At 03:16 PM 12/1/2003, Jason Thorpe wrote:
> >On Dec 1, 2003, at 3:11 PM, Thor Lancelot Simon wrote:
> >
> My assumption was that you *always* drain the current queue before
> switching to other queue. In that scenario, there is no starvation.

The way I read the SGI text, they do N requests from queue A, then
N requests from queue B, and so forth. A simple implementation of
this seems like it might disrupt the elevator sort quite badly, so I
wonder if they actually did something more clever.

I can think of a reasonably simple way to tune "N" if we can
characterize the seek time of the disk... but that seems dangerous.

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Jason Thorpe
2003-12-02 00:14:54 UTC
Permalink
On Dec 1, 2003, at 3:57 PM, Thor Lancelot Simon wrote:

> The way I read the SGI text, they do N requests from queue A, then
> N requests from queue B, and so forth. A simple implementation of
> this seems like it might disrupt the elevator sort quite badly, so I
> wonder if they actually did something more clever.

They probably didn't have to do anything more clever. SGI systems
almost exclusively used SCSI disks (tuned a certain way), and could
thus rely on the disk to mitigate any disruption of the elevator sort
(through command reordering).

Really, it's not clear that the elevator sort buys you much anyway,
when you're talking to raw disks, because disks don't really expose
their real geometry anymore.

That said, elevator sort could potentially be VERY useful on RAID
systems. I know of a RAID card vendor whose firmware sets a timer when
it receives a write request for a block within a given stripe, and then
buffers the write ("stalls" it from the OS's perspective). If, before
the timer expires, writes that fill out the rest of the stripe are
received, the firmware skips the r/m/w cycle for the stripe. This can
greatly improve performance.

This, of course, makes this card look horrible if you use dd(1) to test
RAID-5 write performance, since each of the writes from the dd program
is issued in lock-step. However, if the disk queue sorting algorithm
can arrange to group writes for a stripe together (not even necessarily
in sequential order), then you can potentially have a major positive
impact on overall system performance.

I would also like to see a disk sorting algorithm that could coalesce
adjacent writes or reads into single requests (perhaps enqueueing an
uber-buf that pointed to a list of sub-bufs that were treated as s/g
elements, or something). As part of this, I'd really like to add a
bus_dmamap_load_buf() that could handle various different data
representations within "struct buf" (I have a project I'm currently
planning that could really make use of attaching mbuf chains to bufs,
rather than simple linear buffers).

-- Jason R. Thorpe <***@wasabisystems.com>
Thor Lancelot Simon
2003-12-02 00:23:58 UTC
Permalink
On Mon, Dec 01, 2003 at 04:14:54PM -0800, Jason Thorpe wrote:
>
> Really, it's not clear that the elevator sort buys you much anyway,
> when you're talking to raw disks, because disks don't really expose
> their real geometry anymore.

Well, I have yet to encounter one that randomly shuffles blocks
such that sorting in order of increasing block number would be
_harmful_. Indeed, since we can see more information about future
requests than the disk can, it's not clear to me why the elevator
sort is not highly beneficial.

> I would also like to see a disk sorting algorithm that could coalesce
> adjacent writes or reads into single requests (perhaps enqueueing an
> uber-buf that pointed to a list of sub-bufs that were treated as s/g
> elements, or something). As part of this, I'd really like to add a

I thought I suggested this to you about a year ago and you were rather
strongly opposed, due to the VM system tricks involved in the obvious
way one would do this given the rest of our current implementation? I
am _definitely_ in favor of request coalescing in disksort.

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Jason Thorpe
2003-12-02 00:29:38 UTC
Permalink
On Dec 1, 2003, at 4:23 PM, Thor Lancelot Simon wrote:

> I thought I suggested this to you about a year ago and you were rather
> strongly opposed, due to the VM system tricks involved in the obvious
> way one would do this given the rest of our current implementation? I
> am _definitely_ in favor of request coalescing in disksort.

I am only opposed to any solution that uses VM to perform the
coalescing. I would prefer that most of the requests issued with
"struct buf" not be mapped into KVA at all (see yamt's patches that get
us part of the way there); it's rather silly to map a page into KVA
just so we can extract the PA of the page for DMA later.

This is why I suggested creating an uber-buf that contained a list of
sub-bufs. The uber-buf would contain the starting block # and the
overall length, and the sub-bufs would describe the individual
transfers that were gathered into the uber-buf. You could even use the
normal bufq linkage (or a TAILQ union over it) to hook the sub-bufs
into the uber-buf.
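
Very roughly something like the sketch below; the names (struct uberbuf,
ub_*) are made up and the real struct buf fields are elided:

#include <sys/types.h>
#include <sys/queue.h>

struct buf;                             /* the existing buffer header */

struct uberbuf {
        daddr_t           ub_blkno;     /* starting block # of the merged request */
        long              ub_bcount;    /* total byte count of all sub-bufs */
        int               ub_flags;     /* B_READ/B_WRITE, etc. */
        TAILQ_HEAD(, buf) ub_bufs;      /* sub-bufs in block order; each one
                                         * becomes one s/g element for DMA */
};

The controller driver would see a single request covering ub_blkno for
ub_bcount bytes, and completion could just walk ub_bufs and biodone()
each sub-buf.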

-- Jason R. Thorpe <***@wasabisystems.com>
Thor Lancelot Simon
2003-12-02 00:56:06 UTC
Permalink
On Mon, Dec 01, 2003 at 04:29:38PM -0800, Jason Thorpe wrote:
>
> This is why I suggested creating an uber-buf that contained a list of
> sub-bufs. The uber-buf would contain the starting block # and the
> overall length, and the sub-bufs would describe the individual
> transfers that were gathered into the uber-buf. You could even use the
> normal bufq linkage (or a TAILQ union over it) to hook the sub-bufs
> into the uber-buf.

Unfortunately, this requires modifying every disk driver to handle
uber-bufs instead of bufs, no? It's highly similar to the change
that BSDI made long ago to chain buffers together through a pointer
in struct buf -- we could have whacked pagemove() much earlier if
we picked up that change, but we didn't because of the changes it
would have required in all of the disk drivers.

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Jason Thorpe
2003-12-02 01:00:23 UTC
Permalink
On Dec 1, 2003, at 4:56 PM, Thor Lancelot Simon wrote:

> Unfortunately, this requires modifying every disk driver to handle
> uber-bufs instead of bufs, no? It's highly similar to the change
> that BSDI made long ago to chain buffers together through a pointer
> in struct buf -- we could have whacked pagemove() much earlier if
> we picked up that change, but we didn't because of the changes it
> would have required in all of the disk drivers.

Yes, it would require a change to disk controller drivers (not disk
drivers; they enqueue/dequeue, but don't really "process" the bufs).
But is that really so horrible?

See what I said about adding a bus_dmamap_load_buf(); we could add this
API call, and then such drivers would automatically pick up additional
enhancements to the buf data representation mechanism.
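
I'm imagining something as simple as this sketch; bus_dmamap_load() and
bus_dmamap_load_mbuf() exist today, while B_MBUF and bp->b_mbuf are
stand-ins for whatever representation flag and member we would add:

#include <sys/param.h>
#include <sys/buf.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

int
bus_dmamap_load_buf(bus_dma_tag_t t, bus_dmamap_t map, struct buf *bp,
    int flags)
{

        /* Dispatch on how the data is represented in this buf. */
        if (bp->b_flags & B_MBUF)       /* hypothetical flag */
                return (bus_dmamap_load_mbuf(t, map, bp->b_mbuf, flags));

        /* Default: the plain linear KVA buffer we have today. */
        return (bus_dmamap_load(t, map, bp->b_data, bp->b_bcount,
            NULL, flags));
}

Controller drivers would then call only bus_dmamap_load_buf() and pick
up new buf data representations without further changes.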

-- Jason R. Thorpe <***@wasabisystems.com>
Alfred Perlstein
2003-12-02 01:32:49 UTC
Permalink
* Jason Thorpe <***@wasabisystems.com> [031201 17:00] wrote:
>
> On Dec 1, 2003, at 4:56 PM, Thor Lancelot Simon wrote:
>
> >Unfortunately, this requires modifying every disk driver to handle
> >uber-bufs instead of bufs, no? It's highly similar to the change
> >that BSDI made long ago to chain buffers together through a pointer
> >in struct buf -- we could have whacked pagemove() much earlier if
> >we picked up that change, but we didn't because of the changes it
> >would have required in all of the disk drivers.
>
> Yes, it would require a change to disk controller drivers (not disk
> drivers; they enqueue/dequeue, but don't really "process" the bufs).
> But is that really so horrible?
>
> See what I said about adding a bus_dmamap_load_buf(); we could add this
> API call, and then such drivers would automatically pick up additional
> enhancements to the buf data representation mechanism.

This is one of the problems we face in FreeBSD. My suggestion would
be to have the disk driver register at load time whether or not it
can support the new bufs and have the kernel do the appropriate
thing: basically, offer backwards compat for old drivers by mapping
the bufs and skipping the large scatter/gather support.

>
> -- Jason R. Thorpe <***@wasabisystems.com>
>



--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Jason Thorpe
2003-12-02 02:07:31 UTC
Permalink
On Dec 1, 2003, at 5:32 PM, Alfred Perlstein wrote:

> This is one of the problems we face in FreeBSD. My suggestion would
> be to have the disk driver register at load time whether or not it
> can support the new bufs and have the kernel do the appropriate
> thing: basically, offer backwards compat for old drivers by mapping
> the bufs and skipping the large scatter/gather support.

Yah, this should be no problem... a flag to bufq_alloc() would probably
do the right thing in NetBSD.

-- Jason R. Thorpe <***@wasabisystems.com>
der Mouse
2003-12-02 00:56:18 UTC
Permalink
>> Really, it's not clear that the elevator sort buys you much anyway,
>> when you're talking to raw disks, because disks don't really expose
>> their real geometry anymore.
> Well, I have yet to encounter one that randomly shuffles blocks such
> that sorting in order of increasing block number would be _harmful_.

Not randomly, maybe, but unless you can get the disk to tell you about
every damaged block it's reassigned, and where it's been reassigned to,
you can't know whether you're sending requests in physical order.
(Whether this is a big enough effect to matter is another question, one
for experiment rather than theory.)

/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Jason Thorpe
2003-12-02 01:08:49 UTC
Permalink
On Dec 1, 2003, at 4:23 PM, Thor Lancelot Simon wrote:

> Well, I have yet to encounter one that randomly shuffles blocks
> such that sorting in order of increasing block number would be
> _harmful_. Indeed, since we can see more information about future
> requests than the disk can, it's not clear to me why the elevator
> sort is not highly beneficial.

Well, I certainly agree that elevator sort is not really *harmful*, so
long as doing it doesn't defeat any queueing tricks that the disk is
trying to do. (This all gets really spooky when your "disk" is really
a disk array with some huge chunk of cache memory in front of it.)

-- Jason R. Thorpe <***@wasabisystems.com>
Matthew Orgass
2003-12-02 04:54:50 UTC
Permalink
On 2003-12-01 ***@rek.tjls.com wrote:
> I can think of a reasonably simple way to tune "N" if we can
> characterize the seek time of the disk... but that seems dangerous.

The seek time = 0 case at least should be considered since then you
actually want to jump around. In particular, now that multi-gigabyte
compact flash drives are available (which are usually quite slow), it
should be possible to do a background compile and still be able to at
least edit and save a text file with minimal delay. On the flip side,
DVD+RW can have very high seek times. It seems to me like there should be
four or five different types of strategies (or adjustments), with the
appropriate one selected per disk (though I know almost nothing about this
part of NetBSD...).

Matthew Orgass
***@city-net.com
Alfred Perlstein
2003-12-01 23:29:37 UTC
Permalink
* Thor Lancelot Simon <***@rek.tjls.com> [031201 15:12] wrote:
>
> Matt Thomas suggested simply starting at whichever end of the other
> queue is nearer the current head position, which is an interesting
> idea. I was pondering zoning the disk, so you'd have multiple "queues"
> on each side of the "delayed"/"normal" divide, and could try to switch
> to the one nearest the block you're currently at, but that adds
> significant complexity. I'm going to implement a couple of different
> things and measure them for different workloads.

That can lead to starvation. The best bet is to keep it simple and
make sure that you sweep back and forth, never doubling back.

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684
Greywolf
2003-12-01 23:41:13 UTC
Permalink
Thus spake Alfred Perlstein ("AP> ") sometime Today...

AP> That can lead to starvation. The best bet is to keep it simple and
AP> make sure that you sweep back and forth, never doubling back.

"sweep back and forth" and "never double back" are kind of mutually
exclusive, unless you know of some way to tell the disk to start
at the end and write blocks in reverse order...

Question:

Would it not be better to simply do iterative sweeps from [0..n] each
time, and if a request comes in to write a block at an address which
has already been surpassed, it will not get serviced until the next
regular iteration?

or would that be less than optimal? (hm, i think I see the dilemma
on balancing delayed vs. immediate writes...)

Other question:

If you have a simple sweep algorithm, would it be truly necessary to
differentiate between delayed and immediate writes? When is an
immediate write really REALLY needed, as in "You need to write this
NOW and DAMN all the other pending requests"?

--*greywolf;
--
NetBSD: Servers' choice!
der Mouse
2003-12-01 23:52:55 UTC
Permalink
>>> [...disk access scheduling...]
>> That can lead to starvation. The best bet is to keep it simple and
>> make sure that you sweep back and forth, never doubling back.
> Would it not be better to simply do iterative sweeps from [0..n] each
> time, and if a request comes in to write a block at an address which
> has already been surpassed, it will not get serviced until the next
> regular iteration?

Practically any operating systems textbook will talk about such things;
they did even in the early '80s when I was in school. There's been a
lot of work put in on the question already....

Unfortunately, there's no short summing-up of the accumulated wisdom.
Like a lot of cases where there are many plausible simple algorithms,
the answer is "it depends on your workload and your goodness metric".
Sometimes elevator sort wins. Sometimes bidirectional elevator sort
wins. Sometimes closest-block wins. Sometimes even FCFS wins. Etc.

/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Robert Elz
2003-12-02 11:16:02 UTC
Permalink
Date: Mon, 1 Dec 2003 15:29:37 -0800
From: Alfred Perlstein <***@mu.org>
Message-ID: <***@elvis.mu.org>

| That can lead to starvation. The best bet is to keep it simple and
| make sure that you sweep back and forth, never doubling back.

Actually, I suspect that's the theoretical best bet, and not the
unix filesystem best bet.

There was a lot of work on i/o scheduling for unix back in the 70's
(back in the days when drives were slow, and buffer caches minuscule, so
waiting for i/o was just about all a busy system would ever do).

It turns out (from experiments then, which may no longer be valid; those
were the days when "read block N" meant exactly that) that reading backwards
through the drive gives sub-optimal results. The best bet is usually to
sweep from 0 .. N, then return to 0 and start all over again.

The reason is obvious once it has been explained ... the vast majority of
I/O is sequential, and the filesystem does read-ahead, but returning the
read-ahead blocks before the block the process actually wanted is a waste
of time; nothing wants them yet. Worse than that, it can actually cause
the block read ahead to be flushed (due to non-use) before the process
gets a chance to access it.

Filesystems of the time (unlike current ones) did not necessarily allocate
blocks of a file in increasing order through the drive, so this may seem
strange - but another system change was to make that happen: keep the free
list (roughly) sorted, so that blocks allocated to a sequentially written
file (most of them) would go to the disc in mostly increasing block number
order. FFS (and, I believe, LFS) does that now - precisely to allow better
read algorithms.

In any case, be wary of assuming that theoretical algorithms, which mostly
assume random arrival of requests (with some distribution or other), will
apply to a unix system - the blocks requested are not random, or anything
like it.

And speaking of I/O and readahead ... another change with dramatic
performance benefits was individual per-file readahead tuning - that
is, increasing the readahead amount for a file (for a particular
instance of reading that file) until i/o requests never block (or
some upper limit is reached). As long as i/o remains sequential, and
i/o keeps blocking waiting for the data to appear, the readahead is
increased. (It was a bit more than that, but you get the idea.)
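
A sketch of the idea (names invented, nothing to do with the actual code
of the time): keep a little per-open-file state and grow the window while
sequential reads keep blocking.

#include <sys/types.h>

struct ra_state {
        off_t ra_next;          /* file offset we expect the next read at */
        int   ra_window;        /* current readahead amount, in blocks */
};

#define RA_MAX  64              /* upper limit on the window */

/* Called for each read; "had_to_wait" means the read blocked on I/O. */
static void
ra_adjust(struct ra_state *ra, off_t off, size_t len, int had_to_wait)
{
        if (off != ra->ra_next) {
                /* No longer sequential: collapse the window. */
                ra->ra_window = 1;
        } else if (had_to_wait && ra->ra_window < RA_MAX) {
                /* Still sequential and still blocking: read further ahead. */
                ra->ra_window *= 2;
        }
        ra->ra_next = off + len;
        /* The caller then issues ra->ra_window blocks of readahead
         * beyond ra_next. */
}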

For completeness, much of this work was done by Tim Long at the University
of Sydney (along wth Piers Lauder and Chris Maltby).

kre
Thor Lancelot Simon
2003-12-03 08:08:59 UTC
Permalink
On Wed, Dec 03, 2003 at 12:05:30AM -0800, Chuck Silvers wrote:
> On Sun, Nov 30, 2003 at 10:33:47AM -0800, Jason Thorpe wrote:
> >
> > On Nov 28, 2003, at 2:10 PM, Chuck Silvers wrote:
> >
> > >has anyone had any problems with NEW_BUFQ_STRATEGY lately?
> > >is there consensus on whether or not this is ready to become the
> > >default?
> >
> > I think the consensus is that it's stable, but I heard that it can have
> > some pretty horrible performance effects with some file systems if
> > they're exported by NFS.
>
> I tried bonnie++ over NFS, and NEW_BUFQ_STRATEGY didn't affect the results
> significantly. if this effect really does exist, I don't know how to
> trigger it.

I have the advantage of academic access to the SPEC benchmark suites,
but perhaps you do (or can arrange to) as well. The NFS and web
benchmarks are ideal for this sort of thing.

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Thor Lancelot Simon
2003-12-05 15:18:51 UTC
Permalink
On Fri, Dec 05, 2003 at 08:23:12PM +0900, YAMAMOTO Takashi wrote:
> hi,
>
> > On Dec 3, 2003, at 5:38 PM, YAMAMOTO Takashi wrote:
> >
> > > because synchronous writes, which nfsd often requests, easily suffer?
> > > i guess explicitly distinguishing sync/async requests like the
> > > attached diff makes it better.
> >
> > Cool, though I guess BUFQ_READ_PRIO should no longer be the name of the
> > queueing discipline... perhaps just BUFQ_PRIO?
>
> i think it's better to leave BUFQ_READ_PRIO as-is and create a new one
> for people who like benchmarks... :)

Your patch basically turns the policy into the same one SGI uses, and I
think it's a much more reasonable default than pure "read priority"; there
are problems with read priority delaying synchronous writes dating back
at least 10 years in the literature.

One thing I'm not so sure about is using a FCFS queue for the "time
sensitive" requests. I think that the average latency for requests from
that queue is probably significantly increased by using FCFS rather than
the increasing-block-number sort, because we lose the benefit of
readahead, which will be particularly severe if the queues are long. The
sort seems particularly likely to be beneficial given the rather large
number of requests we take from the queues in a "burst" and the consequent
likelihood that we'd get a lot of track cache hits.

Thor
Jason Thorpe
2003-12-05 16:52:18 UTC
Permalink
On Dec 5, 2003, at 7:18 AM, Thor Lancelot Simon wrote:

> One thing I'm not so sure about is using a FCFS queue for the "time
> sensitive" requests. I think that the average latency for requests
> from
> that queue is probably significantly increased by using FCFS rather
> than
> the increasing-block-number sort, because we lose the benefit of
> readahead, which will be particularly severe if the queues are long.
> The
> sort seems particularly likely to be beneficial given the rather large
> number of requests we take from the queues in a "burst" and the
> consequent
> likelihood that we'd get a lot of track cache hits.

Yah, I agree with this analysis.

-- Jason R. Thorpe <***@wasabisystems.com>
YAMAMOTO Takashi
2003-12-14 02:16:53 UTC
Permalink
hi,

> One thing I'm not so sure about is using a FCFS queue for the "time
> sensitive" requests. I think that the average latency for requests from
> that queue is probably significantly increased by using FCFS rather than
> the increasing-block-number sort, because we lose the benefit of
> readahead, which will be particularly severe if the queues are long. The
> sort seems particularly likely to be beneficial given the rather large
> number of requests we take from the queues in a "burst" and the consequent
> likelihood that we'd get a lot of track cache hits.

then what do you think about the attached one?

YAMAMOTO Takashi
Thor Lancelot Simon
2003-12-14 03:13:04 UTC
Permalink
On Sun, Dec 14, 2003 at 11:16:53AM +0900, YAMAMOTO Takashi wrote:
> hi,
>
> > One thing I'm not so sure about is using a FCFS queue for the "time
> > sensitive" requests. I think that the average latency for requests from
> > that queue is probably significantly increased by using FCFS rather than
> > the increasing-block-number sort, because we lose the benefit of
> > readahead, which will be particularly severe if the queues are long. The
> > sort seems particularly likely to be beneficial given the rather large
> > number of requests we take from the queues in a "burst" and the consequent
> > likelihood that we'd get a lot of track cache hits.
>
> then what do you think about the attached one?

Just at a first glance, why three queues?

Thor
YAMAMOTO Takashi
2003-12-14 03:41:58 UTC
Permalink
> > > One thing I'm not so sure about is using a FCFS queue for the "time
> > > sensitive" requests. I think that the average latency for requests from
> > > that queue is probably significantly increased by using FCFS rather than
> > > the increasing-block-number sort, because we lose the benefit of
> > > readahead, which will be particularly severe if the queues are long. The
> > > sort seems particularly likely to be beneficial given the rather large
> > > number of requests we take from the queues in a "burst" and the consequent
> > > likelihood that we'd get a lot of track cache hits.
> >
> > then what do you think about the attached one?
>
> Just at a first glance, why three queues?
>
> Thor

i forgot to attach a header part..

three queues are for
- time sensitive requests.
e.g. sync read/write, requests from pagedaemon(?)
- time non sensitive requests which should be served by
some period because they will likely become time sensitive
requests later.
e.g. read ahead
- time non sensitive requests.
e.g. delayed writes
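
one way to arbitrate among them (just a sketch with made-up names,
not necessarily what the attached patch does) is to give each priority
level a per-round quota, so the lower queues cannot be starved:

#include <sys/queue.h>

#define NPRIO   3               /* 0 = time sensitive ... 2 = delayed writes */

struct req {
        TAILQ_ENTRY(req) r_link;
};
TAILQ_HEAD(reqq, req);

struct prioq {
        struct reqq q[NPRIO];
        int quota[NPRIO];       /* e.g. { 8, 4, 2 } requests per round */
        int left[NPRIO];        /* quota remaining in this round */
};

static struct req *
prioq_get(struct prioq *pq)
{
        struct req *r;
        int i, pass;

        for (pass = 0; pass < 2; pass++) {
                /* Take from the highest-priority queue that still has
                 * both work and quota left in this round. */
                for (i = 0; i < NPRIO; i++) {
                        if (pq->left[i] > 0 &&
                            (r = TAILQ_FIRST(&pq->q[i])) != NULL) {
                                TAILQ_REMOVE(&pq->q[i], r, r_link);
                                pq->left[i]--;
                                return r;
                        }
                }
                /* Nothing eligible: start a new round and retry once. */
                for (i = 0; i < NPRIO; i++)
                        pq->left[i] = pq->quota[i];
        }
        return NULL;
}

(each queue itself would still be kept sorted by block number.)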

YAMAMOTO Takashi
Christos Zoulas
2003-12-14 05:06:44 UTC
Permalink
In article <***@yamt.dyndns.org>,
YAMAMOTO Takashi <***@mwd.biglobe.ne.jp> wrote:
>three queues are for
> - time sensitive requests.
> e.g. sync read/write, requests from pagedaemon(?)
> - time non sensitive requests which should be served by
> some period because they will likely become time sensitive
> requests later.
> e.g. read ahead
> - time non sensitive requests.
> e.g. delayed writes
>

I would definitely lose the `3' for:

#define BPRIO_NQUEUES 3

christos
Daniel Carosone
2003-12-15 05:37:07 UTC
Permalink
On Sun, Dec 14, 2003 at 12:41:58PM +0900, YAMAMOTO Takashi wrote:
> three queues are for
> [...]
> - time non sensitive requests.
> e.g. delayed writes

Something that's been discussed before, but is worth mentioning again
at this point - I'd just love to see a mechanism that allowed
really-not-very-interesting requests (the canonical example being
atime updates) to wait ~forever if the disk is not spun up, and then
to proceed when some other higher-priority request wakes up the disk
(or there are too many such waiting?)

That needs some help from the layers above, and probably below.

--
Dan.
Manuel Bouyer
2003-12-16 21:38:42 UTC
Permalink
On Mon, Dec 15, 2003 at 04:37:07PM +1100, Daniel Carosone wrote:
> On Sun, Dec 14, 2003 at 12:41:58PM +0900, YAMAMOTO Takashi wrote:
> > three queues are for
> > [...]
> > - time non sensitive requests.
> > e.g. delayed writes
>
> Something that's been discussed before, but is worth mentioning again
> at this point - I'd just love to see a mechanism that allowed
> really-not-very-interesting requests (the canonical example being
> atime updates) to wait ~forever if the disk is not spun up, and then
> to proceed when some other higher-priority request wakes up the disk
> (or there are too many such waiting?)
>
> That needs some help from the layers above, and probably below.

I'd like to see this too. IMHO, this can share the same queue as delayed
writes. Only wake up the disk when there are too many requests in the queue,
or we have something else to do.
This would also prevent writes to log files from waking up the disk.

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 24 ans d'experience feront toujours la difference
--
YAMAMOTO Takashi
2003-12-17 04:34:02 UTC
Permalink
> > Something that's been discussed before, but is worth mentioning again
> > at this point - I'd just love to see a mechanism that allowed
> > really-not-very-interesting requests (the canonical example being
> > atime updates) to wait ~forever if the disk is not spun up, and then
> > to proceed when some other higher-priority request wakes up the disk
> > (or there are too many such waiting?)
> >
> > That needs some help from the layers above, and probably below.
>
> I'd like to see this too. IMHO, this can share the same queue as delayed
> writes. Only wake up the disk when there are too many requests in the queue,
> or we have something else to do.
> This would also prevent writes to log files from waking up the disk.

it isn't so easy.
"delayed writes" can't be delayed so much because e.g. synchronous
read requests can happen on the delayed-writed page and we have no way
to look up corresponding buf from a page.

YAMAMOTO Takashi
Thor Lancelot Simon
2003-12-17 05:03:02 UTC
Permalink
On Wed, Dec 17, 2003 at 01:34:02PM +0900, YAMAMOTO Takashi wrote:
>
> it isn't so easy.
> "delayed writes" can't be delayed so much because e.g. synchronous
> read requests can happen on the delayed-writed page and we have no way
> to look up corresponding buf from a page.

What keeps that from leading to data-integrity issues *right now*?

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
YAMAMOTO Takashi
2003-12-17 05:19:59 UTC
Permalink
> On Wed, Dec 17, 2003 at 01:34:02PM +0900, YAMAMOTO Takashi wrote:
> >
> > it isn't so easy.
> > "delayed writes" can't be delayed so much because e.g. synchronous
> > read requests can happen on the delayed-writed page and we have no way
> > to look up corresponding buf from a page.
>
> What keeps that from leading to data-integrity issues *right now*?

how do you think data-integrity issues can happen?

YAMAMOTO Takashi
Thor Lancelot Simon
2003-12-17 05:36:52 UTC
Permalink
On Wed, Dec 17, 2003 at 02:19:59PM +0900, YAMAMOTO Takashi wrote:
> > On Wed, Dec 17, 2003 at 01:34:02PM +0900, YAMAMOTO Takashi wrote:
> > >
> > > it isn't so easy.
> > > "delayed writes" can't be delayed so much because e.g. synchronous
> > > read requests can happen on the delayed-writed page and we have no way
> > > to look up corresponding buf from a page.
> >
> > What keeps that from leading to data-integrity issues *right now*?
>
> how do you think data-integrity issues can happen?

I must not understand what you're saying. You say "synchronous read
requests can happen on the delayed-writed page and we have no way to
look up the corresponding buf from a page". That suggests to me that
the read will return the wrong data; am I incorrect? If not, what is
the problem, exactly, if synchronous read requests happen on pages
that have delayed writes in the I/O queues?

Is it simply that the read will block indefinitely waiting for the
write to happen, with no way to cause the delayed-write queue to be
flushed?

Thor
YAMAMOTO Takashi
2003-12-17 05:41:09 UTC
Permalink
> Is it simply that the read will block indefinitely waiting for the
> write to happen, with no way to cause the delayed-write queue to be
> flushed?

yes.

YAMAMOTO Takashi
Thor Lancelot Simon
2003-12-17 06:07:34 UTC
Permalink
On Wed, Dec 17, 2003 at 02:41:09PM +0900, YAMAMOTO Takashi wrote:
> > Is it simply that the read will block indefinitely waiting for the
> > write to happen, with no way to cause the delayed-write queue to be
> > flushed?
>
> yes.

Why would these reads be generated? If the write has not completed, is
it not the case that the page must still be dirty, and should thus still
be in the cache?

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
YAMAMOTO Takashi
2003-12-17 06:30:15 UTC
Permalink
hi,

> > > Is it simply that the read will block indefinitely waiting for the
> > > write to happen, with no way to cause the delayed-write queue to be
> > > flushed?
> >
> > yes.
>
> Why would these reads be generated? If the write has not completed, is
> it not the case that the page must still be dirty, and should thus still
> be in the cache?

because the page is marked as PG_BUSY during delayed write,
VOP_GETPAGES will wait on it. we have no way to hurry
the i/o in this case.

maybe distinguishing PG_BUSY-for-read and PG_BUSY-for-write can be
an alternative solution?
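
something like this in the getpages path, maybe (PG_BUSY is real;
PG_BUSY_WRITE and page_readable() are only to illustrate the idea):

#include <uvm/uvm.h>

/*
 * Can this page be read right now?  A page that is busy only because
 * it is being written back still has valid contents.
 */
static int
page_readable(struct vm_page *pg)
{

        if ((pg->flags & PG_BUSY) == 0)
                return 1;
        if (pg->flags & PG_BUSY_WRITE) /* hypothetical "busy for write-back" flag */
                return 1;
        return 0;               /* being filled: caller must wait as now */
}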

YAMAMOTO Takashi
Thor Lancelot Simon
2003-12-17 06:33:53 UTC
Permalink
On Wed, Dec 17, 2003 at 03:30:15PM +0900, YAMAMOTO Takashi wrote:
> >
> > Why would these reads be generated? If the write has not completed, is
> > it not the case that the page must still be dirty, and should thus still
> > be in the cache?
>
> because the page is marked as PG_BUSY during delayed write,
> VOP_GETPAGES will wait on it. we have no way to hurry
> the i/o in this case.
>
> maybe distinguishing PG_BUSY-for-read and PG_BUSY-for-write can be
> an alternative solution?

Won't this also help avoid reads, period, when the I/O queues get lots
of writes backed up in them? That seems like a very good thing.

--
Thor Lancelot Simon ***@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
YAMAMOTO Takashi
2003-12-17 06:46:35 UTC
Permalink
> > > Why would these reads be generated? If the write has not completed, is
> > > it not the case that the page must still be dirty, and should thus still
> > > be in the cache?
> >
> > because the page is marked as PG_BUSY during delayed write,
> > VOP_GETPAGES will wait on it. we have no way to hurry
> > the i/o in this case.
> >
> > maybe distinguishing PG_BUSY-for-read and PG_BUSY-for-write can be
> > an alternative solution?
>
> Won't this also help avoid reads, period, when the I/O queues get lots
> of writes backed up in them? That seems like a very good thing.

anyway, it won't help other cases like fsync on the
delayed writed pages, etc.

YAMAMOTO Takashi
Manuel Bouyer
2003-12-17 12:55:53 UTC
Permalink
On Wed, Dec 17, 2003 at 03:46:35PM +0900, YAMAMOTO Takashi wrote:
>
> anyway, it won't help other cases like fsync on the
> delayed writed pages, etc.

For this, we probably need a way to move buffers from a queue to another.
This means we need a way to find the buffer from the page.

--
Manuel Bouyer, LIP6, Universite Paris VI. ***@lip6.fr
NetBSD: 24 ans d'experience feront toujours la difference
--
Jason Thorpe
2003-12-17 17:06:21 UTC
Permalink
On Dec 16, 2003, at 9:36 PM, Thor Lancelot Simon wrote:

> I must not understand what you're saying. You say "synchronous read
> requests can happen on the delayed-writed page and we have no way to
> look up the corresponding buf from a page". That suggests to me that
> the read will return the wrong data; am I incorrect? If not, what is
> the problem, exactly, if synchronous read requests happen on pages
> that have delayed writes in the I/O queues?

I'm not sure what all the hubbub is about ... don't all file data
requests go through UBC in any case?

-- Jason R. Thorpe <***@wasabisystems.com>
YAMAMOTO Takashi
2003-12-17 22:50:37 UTC
Permalink
hi,

> > I must not understand what you're saying. You say "synchronous read
> > requests can happen on the delayed-writed page and we have no way to
> > look up the corresponding buf from a page". That suggests to me that
> > the read will return the wrong data; am I incorrect? If not, what is
> > the problem, exactly, if synchronous read requests happen on pages
> > that have delayed writes in the I/O queues?
>
> I'm not sure what all the hubbub is about ... don't all file data
> requests go through UBC in any case?

did you read my replies to tls@ and bouyer@? weren't they enough?
or am i missing the point?

my understanding is,
- page is PG_BUSY during write back. if we keep a delayed write
request at buffer queue forever, the page will be left PG_BUSY forever.
- getpages will wait on the PG_BUSY page.

YAMAMOTO Takashi
Daniel Carosone
2003-12-17 23:02:14 UTC
Permalink
On Thu, Dec 18, 2003 at 07:50:37AM +0900, YAMAMOTO Takashi wrote:
> my understanding is,
> - page is PG_BUSY during write back. if we keep a delayed write
> request at buffer queue forever, the page will be left PG_BUSY forever.
> - getpages will wait on the PG_BUSY page.

This sounds like it could already be causing noticeable
performance/concurrency issues, even without extra delays?

--
Dan.
Jason Thorpe
2003-12-18 01:53:11 UTC
Permalink
On Dec 17, 2003, at 2:50 PM, YAMAMOTO Takashi wrote:

> did you read my replies to tls@ and bouyer@? weren't they enough?
> or am i missing the point?

I did read them, but after I replied. Sorry :(

> my understanding is,
> - page is PG_BUSY during write back. if we keep a delayed write
> request at buffer queue forever, the page will be left PG_BUSY
> forever.
> - getpages will wait on the PG_BUSY page.

Yah. It would be nice for outgoing PG_BUSY pages to not block, because
their contents *are* valid...

This is something that should probably be fixed.

-- Jason R. Thorpe <***@wasabisystems.com>
Manuel Bouyer
2003-12-17 08:50:54 UTC
Permalink
On Wed, Dec 17, 2003 at 01:34:02PM +0900, YAMAMOTO Takashi wrote:
> >
> > I'd like to see this too. IMHO, this can share the same queue as delayed
> > writes. Only wake up the disk when there are too many requests in the queue,
> > or we have something else to do.
> > This would also prevent writes to log files from waking up the disk.
>
> it isn't so easy.
> "delayed writes" can't be delayed so much because e.g. synchronous
> read requests can happen on the delayed-writed page and we have no way
> to look up corresponding buf from a page.

I don't understand what you mean here. If we have a read request for a
delayed-write page, why would the read end in the disk's queue at all ?
UVM should find the data in memory.

--
Manuel Bouyer, LIP6, Universite Paris VI. ***@lip6.fr
NetBSD: 24 ans d'experience feront toujours la difference
--
YAMAMOTO Takashi
2003-12-17 11:34:58 UTC
Permalink
hi,

> > > I'd like to see this too. IMHO, this can share the same queue as delayed
> > > writes. Only wake up the disk when there are too many requests in the queue,
> > > or we have something else to do.
> > > This would also prevent writes to log files from waking up the disk.
> >
> > it isn't so easy.
> > "delayed writes" can't be delayed so much because e.g. synchronous
> > read requests can happen on the delayed-writed page and we have no way
> > to look up corresponding buf from a page.
>
> I don't understand what you mean here. If we have a read request for a
> delayed-write page, why would the read end in the disk's queue at all ?
> UVM should find the data in memory.

please read my replies to ***@.
(maybe the term "read request" was not appropriate to use, sorry.)
i meant, e.g. getpages will find the page in the cache as you say
and then notice the page is marked PG_BUSY and wait on it.

YAMAMOTO Takashi