Discussion:
fork performance
Lloyd Parkes
2012-10-17 23:52:47 UTC
Permalink
So, with a slightly closer look, a guess, and some tests to verify the guess, I think I have found my performance problem converting the NetBSD CVS repositories to Mercurial.

The CVS server forks once for each command it receives, and it receives a lot of commands. NetBSD fork(2) seems to be much slower than OS X fork(2). Since the cvs server never execs anything, vfork isn't an option. I have implemented a program that can do the CVS log extraction efficiently and correctly (i.e. without forking), but extracting the versioned data itself isn't trivial.
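For context, the rough shape of a fork-rate microbenchmark that can expose this kind of cross-system difference might look like the following Python sketch (illustrative only, not the Mercurial converter's code):

```python
import os
import time

def time_forks(n):
    """Time n fork()+waitpid() round trips; return mean seconds per fork."""
    start = time.monotonic()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child exits immediately, skipping cleanup handlers
        os.waitpid(pid, 0)
    return (time.monotonic() - start) / n

if __name__ == "__main__":
    print("%.3f ms per fork" % (time_forks(200) * 1000.0))
```

Running the same script on each system gives a like-for-like number, since the cost being measured is dominated by the kernel's fork path rather than the interpreter.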

Cheers,
Lloyd
Thor Lancelot Simon
2012-10-18 03:47:22 UTC
Permalink
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to verify my guess, and I think I have found my performance problem converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it receives a lot of commands. NetBSD fork(2) seems to be much slower than OS X fork(2). Since the cvs server never execs anything, vfork isn't an option. I have implemented a program that can do the CVS log extraction efficiently and correctly (i.e. without forking), but extracting the versioned data itself isn't trivial.
Why don't you run cvs locally?

Thor
Lloyd Parkes
2012-10-18 05:36:18 UTC
Permalink
Post by Thor Lancelot Simon
Why don't you run cvs locally?
I am. The cvs server is run from the main program via popen and this allows the main program to issue many cvs commands through a single pipe.

A bit more background: this is all being driven from a Python program that is a part of Mercurial which does a pretty good job of converting CVS repositories to Mercurial. The original code used popen("cvs log") once to read the entire revision log in one go and popen("cvs server") once to issue many checkouts (one per revision per file) to read the actual file data. Unfortunately, the output from "cvs log" cannot be parsed in the general case, and the NetBSD repositories present the general case. I changed the first bit of code to work like the second bit of code because "cvs log" for one revision of one file can be parsed in the general case.
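The single-pipe pattern described above can be sketched as follows; "cat" stands in for "cvs server" here so the example is self-contained (the real cvs client/server protocol exchanges multi-line requests and responses, which this one-line echo deliberately simplifies):

```python
import subprocess

def open_server(argv):
    """Spawn one long-lived child process; many commands can then be
    issued over a single pair of pipes instead of one process per command."""
    return subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

def request(proc, line):
    """Write one request line and read back one response line."""
    proc.stdin.write(line + b"\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip(b"\n")

if __name__ == "__main__":
    # "cat" echoes each request back; a real caller would run
    # ["cvs", "server"] and speak the cvs protocol instead.
    srv = open_server(["cat"])
    print(request(srv, b"version"))
    srv.stdin.close()
    srv.wait()
```

The point of the design is that the client side pays the process-creation cost once; it is the server side that still forks per command.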

I have to admit that if I were running the individual commands locally, Python would hopefully be using vfork instead of fork. More thinking and testing is required.

Cheers,
Lloyd
David Laight
2012-10-18 07:15:15 UTC
Permalink
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to verify
my guess, and I think I have found my performance problem converting
the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it receives
a lot of commands. NetBSD fork(2) seems to be much slower than OS X
fork(2).
I've seen things that show that a process's memory map entries aren't
getting merged - so there are a lot of entries to process
during fork(). (cat something in /proc ...)

The malloc NetBSD uses (which uses mmap() instead of sbrk()) probably
makes this much more significant, especially if a big program - like a
Python interpreter - is doing the fork()s.
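One quick way to eyeball the map-entry count is via procfs; the paths below are an assumption about the platform (NetBSD exposes /proc/curproc/map only when procfs is mounted, Linux has /proc/self/maps), so the sketch falls back gracefully:

```python
import os

def count_map_entries():
    """Count the current process's VM map entries via procfs.
    NetBSD: /proc/curproc/map (requires procfs mounted);
    Linux: /proc/self/maps. Returns -1 if neither is available."""
    for path in ("/proc/curproc/map", "/proc/self/maps"):
        if os.path.exists(path):
            with open(path) as f:
                return sum(1 for _ in f)
    return -1

if __name__ == "__main__":
    print("map entries:", count_map_entries())
```

A count that keeps growing as the process allocates, instead of staying flat as adjacent anonymous regions merge, points at exactly the per-entry fork() work described above.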

David
--
David Laight: ***@l8s.co.uk
Lars Heidieker
2012-10-18 07:39:36 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance problem
converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it
receives a lot of commands. NetBSD fork(2) seems to be much
slower than OS X fork(2).
I've seen things that show that a processes memory page list isn't
getting its entries merged - so there are a lot of items to
process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big C++
program - like a python interpreter - is doing the forks().
David
Hi,

Currently the amap layer limits the size of amaps to 255 * PAGE_SIZE,
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494

That's why the map entries for anonymous memory don't get merged.

This hurts fork performance.
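The effect of unmerged anonymous map entries on fork can be demonstrated from user space; the sketch below (illustrative, thresholds chosen arbitrarily) creates many single-page anonymous mappings and times fork before and after:

```python
import mmap
import os
import time

def fork_time(n_forks=50):
    """Mean seconds per fork()+waitpid() round trip."""
    start = time.monotonic()
    for _ in range(n_forks):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        os.waitpid(pid, 0)
    return (time.monotonic() - start) / n_forks

if __name__ == "__main__":
    baseline = fork_time()
    # Many separate single-page anonymous mappings: if the kernel cannot
    # merge the resulting map entries, each one adds per-entry work to
    # every subsequent fork(). Keep the list referenced so nothing is
    # unmapped while we measure.
    mappings = [mmap.mmap(-1, mmap.PAGESIZE) for _ in range(2000)]
    print("baseline: %.1f us/fork, with 2000 mappings: %.1f us/fork"
          % (baseline * 1e6, fork_time() * 1e6))
```

On a kernel that merges adjacent anonymous entries the two numbers should stay close; a large gap is consistent with the amap size limit described above.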

Lars

- --
- ------------------------------------

Mystical explanations:
Mystical explanations are considered deep;
the truth is that they are not even superficial.

-- Friedrich Nietzsche
[ Die Fröhliche Wissenschaft, Book 3, 126 ]
Lloyd Parkes
2012-10-21 18:54:48 UTC
Permalink
Post by Lars Heidieker
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
This will hit fork performance.
I've just had another look at the ktrace data and this could be what's causing the problem for CVS. The CVS server is calling mmap quite a lot for anonymous memory. CVS doesn't appear to call mmap directly, but jemalloc shouldn't be mapping memory in chunks as small as what I'm seeing. I'm just going to have to poke it with a stick and see what happens.

I also decided to swallow my BSD pride and try running this code on Linux. It turns out that Linux is nowhere near as fast as OS X either, so maybe I shouldn't feel too bad.

Thanks for the advice.

Lloyd
Masao Uebayashi
2012-10-22 00:27:39 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance problem
converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it
receives a lot of commands. NetBSD fork(2) seems to be much
slower than OS X fork(2).
I've seen things that show that a processes memory page list isn't
getting its entries merged - so there are a lot of items to
process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big C++
program - like a python interpreter - is doing the forks().
David
Hi,
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
What happens if a larger page size is used?
Lloyd Parkes
2012-10-22 18:41:51 UTC
Permalink
Early yesterday I decided to gather more data and run some experiments, because I had found that kdump displays the elapsed time in system calls (which ktruss doesn't do) and it looked like fcntl(2) was taking much more time than fork(2). I compiled a copy of cvs from pkgsrc with the cvs flow control option disabled (which appeared to be where the fcntl(2) calls were coming from) and the problem went away. I then recompiled cvs with flow control enabled so that I would have a proper test, and the problem stayed away.

Further testing shows me that the problem is related to some difference between cvs 1.12.13 and cvs 1.11.23. The answer to my problem is fairly obvious: use cvs 1.11.23. I have taken a 30-second ktrace from cvs 1.12.13 that shows fork taking a quarter of a second every time. This was after cvs had been running for about 12 hours on this task, and it didn't occur to me to get a copy of its memory map before I killed it.

Thanks for the help,
Lloyd
David Laight
2012-10-22 21:26:36 UTC
Permalink
Post by Lloyd Parkes
Further testing shows me that the problem is related to some difference
between cvs 1.12.13 and cvs 1.11.23.
The answer to my problem is fairly obvious: use cvs 1.11.23.
I have taken a 30-second ktrace from cvs 1.12.13 that shows fork
taking a quarter of a second every time.
Was that before it returned in the parent, or in the child?
I've seen issues in the past (wasn't actually NetBSD) where the child
was scheduled before the parent (dunno which NetBSD schedules first)
and if the child didn't block the parent's priority slowly got lower
and lower (if the system was busy).
This was a listener, and the process's priority got so bad connect
requests started timing out!
Post by Lloyd Parkes
This was after cvs had been running for about 12 hours on this task and it
didn't occur to me to get a copy of its memory map before I killed it.
Did you even look at the size?
Might have been growing a lot.

David
--
David Laight: ***@l8s.co.uk
Lloyd Parkes
2012-10-22 22:42:17 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
I have taken 30 second ktrace from cvs 1.12.13 that shows fork
taking a quarter of a second every time.
Was that before it returned in the parent, or in the child?
This is in the parent.
Post by David Laight
I've seen issues in the past (wasn't actually NetBSD) where the child
was scheduled before the parent (dunno which NetBSD schedules first)
and if the child didn't block the parent's priority slowly got lower
and lower (if the system was busy).
My test system is a VirtualBox guest with two CPUs so scheduling shouldn't get this pathological. The VirtualBox host has hyperthreading, so I can't guarantee two real CPUs though.
Post by David Laight
Post by Lloyd Parkes
This was after cvs had been running for about 12 hours on this task and it
didn't occur to me to get a copy of its memory map before I killed it.
Did you even look at the size?
Might have been growing a lot.
I checked swap and it wasn't being used. The system had enough RAM for the task at hand and not much more. In fact I had to tune the vm sysctl stuff to avoid swapping. The system now tries to keep much more anonymous memory in RAM and I also reduced the target inactive percentage to 10% for no good reason just before the horrendous CVS 1.12.13 test run.

Since this is a VirtualBox guest, I can just fire it up again if anyone wants more information.

Cheers,
Lloyd
Lars Heidieker
2012-10-22 21:05:02 UTC
Permalink
Post by Masao Uebayashi
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance
problem converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and
it receives a lot of commands. NetBSD fork(2) seems to be
much slower than OS X fork(2).
I've seen things that show that a processes memory page list
isn't getting its entries merged - so there are a lot of items
to process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big
C++ program - like a python interpreter - is doing the
forks().
David
Hi,
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
What happens if larger page size is used?
Hi,

I think we can make the check
(http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494):

if ((slotneed * sizeof(*newsl)) > PAGE_SIZE) {

This will give us 1024 slots, so for a 4k PAGE_SIZE we have 4 MB of
reach. It re-enables a code path in amap_copy that breaks up a large
amap (larger than the 255-slot limit) into chunks; this code has been
disabled for some years (since rev 1.59 afaik) but it has run stably
so far on my test machine with the raised limit in place.
(This is however a short-term solution, just a step in the right
direction.)

I think we need to cap those allocations in size at the largest kmem
cache we have, which currently is PAGE_SIZE.

In the longer term a proper solution might be to change the allocation
strategy of the amap to something like a two-level page table once we
get larger than PAGE_SIZE.

The limit was introduced 7 1/2 years ago in revision 1.59. In that
time-frame allocations were done via malloc(9), which had a hard upper
limit of 64k; for some years now they have been done via kmem(9), so
things have changed quite a bit since.
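The two-level idea can be sketched in a few lines; this is purely illustrative user-space Python, not kernel code, and the names and the 8-byte-pointer assumption are made up for the sketch:

```python
PAGE_SIZE = 4096
SLOTS_PER_CHUNK = PAGE_SIZE // 8  # assuming 8-byte slot pointers

class TwoLevelSlots:
    """Illustrative two-level slot table: a top-level index of chunks,
    each chunk holding up to SLOTS_PER_CHUNK entries, so no single
    allocation ever needs to exceed PAGE_SIZE."""

    def __init__(self):
        self.chunks = {}  # top level: chunk index -> chunk dict

    def set(self, slot, anon):
        hi, lo = divmod(slot, SLOTS_PER_CHUNK)
        # Allocate the second-level chunk lazily, as a page table would.
        self.chunks.setdefault(hi, {})[lo] = anon

    def get(self, slot):
        hi, lo = divmod(slot, SLOTS_PER_CHUNK)
        return self.chunks.get(hi, {}).get(lo)
```

The design point is that growing the amap only ever allocates one more PAGE_SIZE-bounded chunk, keeping every allocation within the largest kmem cache.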

(yamt@ I have added you as you introduced the limit)

kind regards,
Lars


- --
- ------------------------------------

Mystical explanations:
Mystical explanations are considered deep;
the truth is that they are not even superficial.

-- Friedrich Nietzsche
[ Die Fröhliche Wissenschaft, Book 3, 126 ]
YAMAMOTO Takashi
2012-10-24 07:34:50 UTC
Permalink
hi,
i forgot about the particular change, sorry.

YAMAMOTO Takashi
