eddy+public+
2006-04-28 23:33:01 UTC
Greetings all,
Note: Some of these ramblings are ia32/amd64-focused, but the principles
are general.
While exploring PAE last November, I wound up browsing through uvm/pmap
code. I've had a few additional ideas, and would like some [more]
feedback.
/* Big Pages */
Begin by allocating memory with a 2M/4M stride (2M iff PAE, 4M iff
!PAE). Track wasted 4K [sub]pages. Split big pages into smaller ones
when needed, but avoid using page tables until then. Coalesce smaller
pages into bigger ones when free RAM permits.
Rationale: Hopefully less MMU management overhead and fewer TLB misses
while memory is plentiful. Fall back to standard behavior when needed.
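A rough sketch of the bookkeeping this implies (purely illustrative C;
the struct and function names are invented, not existing uvm/pmap code):

    /*
     * Track each 2M/4M big page plus a bitmap of its 4K subpages: hand
     * out whole big pages while RAM is plentiful, split on demand, and
     * coalesce again once every subpage has been freed.
     */
    #include <stdint.h>

    #define BIGPAGE_SHIFT 21                    /* 2M with PAE; 22 (4M) without */
    #define SUBPAGES      (1u << (BIGPAGE_SHIFT - 12))  /* 4K subpages per big page */

    struct bigpage {
        uint64_t bp_pa;                  /* physical base of the big page */
        uint32_t bp_nfree;               /* free 4K subpages remaining */
        uint8_t  bp_map[SUBPAGES / 8];   /* one bit per subpage: 1 = free */
        int      bp_split;               /* nonzero once demoted to 4K pages */
    };

    /* Demote: start handing out 4K subpages instead of the whole big page. */
    static void
    bigpage_split(struct bigpage *bp)
    {
        bp->bp_split = 1;
        /* page tables for this region are built lazily, only from here on */
    }

    /* Promote again once every subpage has come back. */
    static int
    bigpage_try_coalesce(struct bigpage *bp)
    {
        if (bp->bp_nfree != SUBPAGES)
            return 0;
        bp->bp_split = 0;                /* eligible for one large mapping again */
        return 1;
    }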
/* Fractional/Checkpointed Zeroing of Big Pages */
I whipped up a crude program that performed 1000 bzero(3) iterations on
a 2M chunk. Each iteration took about 9 ms on a PIII/500 notebook.
Should the idle-zero loop zero a fraction of a big page? What about
dedicating a PDE slot (Intel terminology) to the zero code?
Rationale: Several milliseconds -- although certainly less than 9 ms
on a faster CPU with optimized zeroing code -- is an eternity.
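Roughly what I have in mind (illustrative only; the names are invented
and this is not uvm's actual idle-zero code):

    /*
     * Each idle pass zeroes only a slice of a big page and checkpoints
     * the offset, so a 2M page is cleared across many short idle slices
     * instead of one ~9 ms bzero().
     */
    #include <stddef.h>
    #include <string.h>

    #define BIGPAGE_SIZE (2UL * 1024 * 1024)
    #define ZERO_CHUNK   (64UL * 1024)   /* work per idle pass (tunable) */

    struct zero_state {
        char  *zs_va;                    /* kernel mapping of the big page */
        size_t zs_done;                  /* bytes already zeroed (checkpoint) */
    };

    /* Zero one chunk; returns nonzero once the whole big page is clean. */
    static int
    idle_zero_step(struct zero_state *zs)
    {
        size_t n = BIGPAGE_SIZE - zs->zs_done;

        if (n > ZERO_CHUNK)
            n = ZERO_CHUNK;
        memset(zs->zs_va + zs->zs_done, 0, n);
        zs->zs_done += n;
        return zs->zs_done == BIGPAGE_SIZE;
    }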
/* Per-CPU Management */
Both of the above, as well as free page lists, should be per-CPU. Can a
CPU be forced to work with the memory closest to it? (Consider NUMA
performance, e.g. on multiprocessor Opteron systems.)
Rationale: Reduced inter-CPU contention. Assuming processes have
significant CPU affinity, using "nearby" memory would reduce both
interconnect bandwidth use and memory access time.
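For example (again purely illustrative; these are not existing uvm
structures):

    /*
     * Each CPU keeps its own big-page and 4K free lists, ideally filled
     * from physically "nearby" memory on NUMA machines, and falls back
     * to a shared, locked global pool only when the local lists run dry.
     */
    #include <stddef.h>

    struct vm_page;                      /* opaque here */

    struct pagelist {
        struct vm_page *pl_head;         /* linked free list */
        size_t          pl_count;
    };

    struct percpu_pages {
        struct pagelist pc_bigfree;      /* free 2M/4M pages, local node */
        struct pagelist pc_smallfree;    /* free 4K pages, local node */
        struct pagelist pc_zeroed;       /* pre-zeroed pages, if any */
        int             pc_node;         /* NUMA node this CPU sits on */
    };

    /* Prefer the local lists; cross the interconnect only as a last resort. */
    struct vm_page *percpu_alloc(struct percpu_pages *pc, int want_big);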
/* Ring Buffers */
A native mapping for ring buffers would be nice:
u_char *ringbuf = mmapringbuf(..., MAP_RINGBUF, ...);
would allocate a memory region from <base> to <base + 2 * size>, i.e.,
<base> and <base + size> would both be aliased to the same physical
pages. Voila! Simple, linear ringbuf where the MMU handles wraparound
at the region's end.
Rationale: It's just so much easier this way. :-)
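Until something like MAP_RINGBUF exists, the same trick can be
approximated in userland with plain POSIX calls: back the buffer with a
shared memory object and map it twice, back to back (sketch only; size
must be a multiple of the page size):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stddef.h>

    static void *
    ringbuf_alloc(size_t size)
    {
        char *base;
        int fd;

        fd = shm_open("/ringbuf-demo", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd == -1)
            return NULL;
        shm_unlink("/ringbuf-demo");     /* keep the object anonymous */
        if (ftruncate(fd, (off_t)size) == -1) {
            close(fd);
            return NULL;
        }

        /* Reserve 2 * size of address space, then overlay the two halves. */
        base = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (base == MAP_FAILED) {
            close(fd);
            return NULL;
        }
        if (mmap(base, size, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
            mmap(base + size, size, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            munmap(base, 2 * size);
            close(fd);
            return NULL;
        }
        close(fd);
        return base;
    }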
/* mremap() */
Zero-copy allocation-size changes are convenient.
Rationale: Obvious.
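For reference, a Linux-style mremap() call does exactly this: grow (or
shrink) a mapping and let the kernel move the page tables instead of
copying the data (illustrative wrapper only):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    static void *
    grow_mapping(void *old, size_t oldsz, size_t newsz)
    {
        /* MREMAP_MAYMOVE: relocate the mapping if it can't grow in place. */
        void *p = mremap(old, oldsz, newsz, MREMAP_MAYMOVE);

        return p == MAP_FAILED ? NULL : p;
    }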
Eddy
--
Everquick Internet - http://www.everquick.net/
A division of Brotsman & Dreger, Inc. - http://www.brotsman.com/
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
________________________________________________________________________
DO NOT send mail to the following addresses:
***@brics.com -*- ***@intc.net -*- ***@everquick.net
Sending mail to spambait addresses is a great way to get blocked.
Ditto for broken OOO autoresponders and foolish AV software backscatter.