diff to speed up fdalloc using two-level bitmaps

Discussion:

Niels Provos

2003-10-28 16:25:59 UTC

Could you show the graph for just < 100 fds?
That's after all where most of us live. :)

It could need some fine tuning

Loading Image...

An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

Niels.

Alfred Perlstein

2003-10-28 16:38:10 UTC

Permalink

Post by Niels Provos

Could you show the graph for just < 100 fds?
That's after all where most of us live. :)

It could need some fine tuning
http://www.citi.umich.edu/u/provos/benchmark/netbsd-fdalloc-zoom.jpg
An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

ffs(9) is available under FreeBSD. might be on NetBSD.

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: ***@mu.org cell: 408-480-4684

David Laight

2003-10-28 18:08:26 UTC

Permalink

Post by Niels Provos

Could you show the graph for just < 100 fds?
That's after all where most of us live. :)

It could need some fine tuning
http://www.citi.umich.edu/u/provos/benchmark/netbsd-fdalloc-zoom.jpg

Have you factored out the cost of gettimeofday()?

Post by Niels Provos
An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

A standard name for something that if probably inlined...

Anyone know the execution time for bsf on anything new?
It wouldn't surprise me if it is microcoded - so an expanded
function might be quicker!
(especially in a benchmark where the code/data will be cached!)

David

--
David Laight: ***@l8s.co.uk

Alan Barrett

2003-10-28 21:19:52 UTC

Permalink

Post by Niels Provos
An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

It's much easier to find the last (that is, least significant) set
or clear bit. You can sometimes change the algorithm or the data
representation to suit.

unsigned int find_last_set(unsigned int x)
{
return (x & -x);
}

unsigned int find_last_clear(unsigned int x)
{
return (~x & (x+1));
}

--apb (Alan Barrett)

David Laight

2003-10-29 09:14:08 UTC

Permalink

Post by Alan Barrett
It's much easier to find the last (that is, least significant) set
or clear bit. You can sometimes change the algorithm or the data
representation to suit.
unsigned int find_last_set(unsigned int x)
{
return (x & -x);
}
unsigned int find_last_clear(unsigned int x)
{
return (~x & (x+1));
}

Unfortunately you still need to do the log2(x)....

David

--
David Laight: ***@l8s.co.uk

Lennart Augustsson

2003-10-29 01:12:24 UTC

Permalink

Post by Niels Provos

Could you show the graph for just < 100 fds?
That's after all where most of us live. :)

Thanks. That doesn't look bad enough for me to worry.
The most common case looks identical, and then we pay an
overhead that is <15%. I can live with that. :)

Good work!

-- Lennart

Niels Provos

2003-10-29 06:59:13 UTC

Permalink

Post by Niels Provos
It could need some fine tuning
http://www.citi.umich.edu/u/provos/benchmark/netbsd-fdalloc-zoom.jpg
An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

The same benchmark with the following inline assembly

static __inline__ uint32_t ffz(uint32_t word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"r" (~word));
return word;
}

looks much better:

http://www.citi.umich.edu/u/provos/benchmark/netbsd-fdalloc-zoom.jpg

Now, it is faster than before in all cases. It would be great if we
could have a

<machine/bitops.h>

include file that would define similar inline assembly for all
architectures or the equivalent C code if there is no instruction
support.

Niels.

Matt Thomas

2003-10-29 08:44:44 UTC

Permalink

Post by Niels Provos

The same benchmark with the following inline assembly
static __inline__ uint32_t ffz(uint32_t word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"r" (~word));
return word;
}
http://www.citi.umich.edu/u/provos/benchmark/netbsd-fdalloc-zoom.jpg
Now, it is faster than before in all cases. It would be great if we
could have a
<machine/bitops.h>
include file that would define similar inline assembly for all
architectures or the equivalent C code if there is no instruction
support.

Note that some architectures (like PowerPC) count bits differently
(bit 0 is the "leftmost" bit) and the cntlzw (count leading zeroes
word) count from left to right.

ffs(3) has not implied bit ordering. To use the more efficient
instructions, the bitops must deal with both bit orderings.

--
Matt Thomas email: ***@3am-software.com
3am Software Foundry www: http://3am-software.com/bio/matt/
Cupertino, CA disclaimer: I avow all knowledge of this
message.

Bang Jun-Young

2003-10-29 11:48:56 UTC

Permalink

Post by Niels Provos

The same benchmark with the following inline assembly
static __inline__ uint32_t ffz(uint32_t word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"r" (~word));
return word;
}

One caveat is, if the argument word is passed as 0, the result of
bsfl is undefined (i.e., unpredictable). So there's needed additional
code to check if it's 0.

BTW, how about making ffs(9) inline as well? Calling overhead seems
to be quite high on i386...

Jun-Young

--
Bang Jun-Young <***@NetBSD.org>

Bang Jun-Young

2003-10-29 12:33:15 UTC

Permalink

Post by Bang Jun-Young

Post by Niels Provos

The same benchmark with the following inline assembly
static __inline__ uint32_t ffz(uint32_t word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"r" (~word));
return word;
}

One caveat is, if the argument word is passed as 0, the result of
bsfl is undefined (i.e., unpredictable). So there's needed additional
code to check if it's 0.

Secondly, unlike ffs(), your ffz() counts from 0. This is not a
real bug, but if consistency mattered...

Post by Bang Jun-Young
BTW, how about making ffs(9) inline as well? Calling overhead seems
to be quite high on i386...

On second look, I found that ffs() is defined as __builtin_ffs()
for all platforms but vax in sys/systm.h. I'm wondering when
__builtin_ffs() is called and when the real ffs() called, but it might
be an off-topic.

Jun-Young

--
Bang Jun-Young <***@NetBSD.org>

Jason Thorpe

2003-10-29 14:37:10 UTC

Permalink

Post by Bang Jun-Young
BTW, how about making ffs(9) inline as well? Calling overhead seems
to be quite high on i386...

GCC will already inline ffs() if the CPU back-end provides the
appropriate pattern. The right answer, if GCC is not doing it on i386,
would be to add that pattern to i386.md.

-- Jason R. Thorpe <***@wasabisystems.com>

Bang Jun-Young

2003-10-29 16:04:49 UTC

Permalink

Post by Jason Thorpe

Post by Bang Jun-Young
BTW, how about making ffs(9) inline as well? Calling overhead seems
to be quite high on i386...

GCC will already inline ffs() if the CPU back-end provides the
appropriate pattern. The right answer, if GCC is not doing it on i386,
would be to add that pattern to i386.md.

I looked further and found that ffs() was properly inlined in the
kernel:

c01c02a4 <fdalloc>:
[snip]
c01c0348: 83 fa ff cmp $0xffffffff,%edx
c01c034b: 0f 84 5e 01 00 00 je c01c04af <fdalloc+0x20b>
c01c0351: f7 d2 not %edx
c01c0353: c1 e3 05 shl $0x5,%ebx
c01c0356: 31 c0 xor %eax,%eax
c01c0358: 0f bc d2 bsf %edx,%edx
c01c035b: 0f 94 c0 sete %al
c01c035e: f7 d8 neg %eax

So it's clear that a little speedup with inlined ffz() as shown in
provos' graph was due to incomplete implementation, isn't it?

Jun-Young

--
Bang Jun-Young <***@NetBSD.org>

Bang Jun-Young

2003-10-29 12:37:38 UTC

Permalink

Post by Niels Provos
Now, it is faster than before in all cases. It would be great if we
could have a
<machine/bitops.h>
include file that would define similar inline assembly for all
architectures or the equivalent C code if there is no instruction
support.

<machine/macros.h> exists for that purpose in vax/.

Jun-Young

--
Bang Jun-Young <***@NetBSD.org>

Jason Thorpe

2003-10-29 14:33:34 UTC

Permalink

Post by Niels Provos
An inline assembly function for finding the first zero bit in a
word would be a great thing to have across architectures.

Note that GCC has an ffs() built-in, so it will already inline that if
the CPU back-end for the compiler provides the appropriate pattern.

-- Jason R. Thorpe <***@wasabisystems.com>