Discussion:
fork performance
Lloyd Parkes
2012-10-17 23:52:47 UTC
Permalink
So, with a slightly closer look, a guess, and some tests to verify the guess, I think I have found my performance problem converting the NetBSD CVS repositories to Mercurial.

The CVS server forks once for each command it receives, and it receives a lot of commands. NetBSD fork(2) seems to be much slower than OS X fork(2). Since the cvs server never execs anything, vfork isn't an option. I have implemented a program that can do the CVS log extraction efficiently and correctly (i.e. without forking), but extracting the versioned data itself isn't trivial.
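For context, the rough shape of a fork-rate microbenchmark that can expose this kind of cross-system difference might look like the following Python sketch (illustrative only, not the Mercurial converter's code):

```python
import os
import time

def time_forks(n):
    """Time n fork()+waitpid() round trips; return mean seconds per fork."""
    start = time.monotonic()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child exits immediately, skipping cleanup handlers
        os.waitpid(pid, 0)
    return (time.monotonic() - start) / n

if __name__ == "__main__":
    print("%.3f ms per fork" % (time_forks(200) * 1000.0))
```

Running the same script on each system gives a like-for-like number, since the cost being measured is dominated by the kernel's fork path rather than the interpreter.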

Cheers,
Lloyd
Thor Lancelot Simon
2012-10-18 03:47:22 UTC
Permalink
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to verify my guess, and I think I have found my performance problem converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it receives a lot of commands. NetBSD fork(2) seems to be much slower than OS X fork(2). Since the cvs server never execs anything, vfork isn't an option. I have implemented a program that can do the CVS log extraction efficiently and correctly (i.e. without forking), but extracting the versioned data itself isn't trivial.
Why don't you run cvs locally?

Thor
Lloyd Parkes
2012-10-18 05:36:18 UTC
Permalink
Post by Thor Lancelot Simon
Why don't you run cvs locally?
I am. The cvs server is run from the main program via popen and this allows the main program to issue many cvs commands through a single pipe.

A bit more background: this is all being driven from a Python program that is a part of Mercurial which does a pretty good job of converting CVS repositories to Mercurial. The original code used popen("cvs log") once to read the entire revision log in one go and popen("cvs server") once to issue many checkouts (one per revision per file) to read the actual file data. Unfortunately, the output from "cvs log" cannot be parsed in the general case, and the NetBSD repositories present the general case. I changed the first bit of code to work like the second bit of code because "cvs log" for one revision of one file can be parsed in the general case.
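The single-pipe pattern described above can be sketched as follows; "cat" stands in for "cvs server" here so the example is self-contained (the real cvs client/server protocol exchanges multi-line requests and responses, which this one-line echo deliberately simplifies):

```python
import subprocess

def open_server(argv):
    """Spawn one long-lived child process; many commands can then be
    issued over a single pair of pipes instead of one process per command."""
    return subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

def request(proc, line):
    """Write one request line and read back one response line."""
    proc.stdin.write(line + b"\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip(b"\n")

if __name__ == "__main__":
    # "cat" echoes each request back; a real caller would run
    # ["cvs", "server"] and speak the cvs protocol instead.
    srv = open_server(["cat"])
    print(request(srv, b"version"))
    srv.stdin.close()
    srv.wait()
```

The point of the design is that the client side pays the process-creation cost once; it is the server side that still forks per command.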

I have to admit that if I were running the individual commands locally, Python would hopefully be using vfork instead of fork. More thinking and testing is required.

Cheers,
Lloyd
David Laight
2012-10-18 07:15:15 UTC
Permalink
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to verify
my guess, and I think I have found my performance problem converting
the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it receives
a lot of commands. NetBSD fork(2) seems to be much slower than OS X
fork(2).
I've seen things that show that a process's memory map entries aren't
getting merged - so there are a lot of entries to process
during fork(). (cat something in /proc ...)

The malloc NetBSD uses (which uses mmap() instead of sbrk()) probably
makes this much more significant, especially if a big program - like a
Python interpreter - is doing the fork()s.
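One quick way to eyeball the map-entry count is via procfs; the paths below are an assumption about the platform (NetBSD exposes /proc/curproc/map only when procfs is mounted, Linux has /proc/self/maps), so the sketch falls back gracefully:

```python
import os

def count_map_entries():
    """Count the current process's VM map entries via procfs.
    NetBSD: /proc/curproc/map (requires procfs mounted);
    Linux: /proc/self/maps. Returns -1 if neither is available."""
    for path in ("/proc/curproc/map", "/proc/self/maps"):
        if os.path.exists(path):
            with open(path) as f:
                return sum(1 for _ in f)
    return -1

if __name__ == "__main__":
    print("map entries:", count_map_entries())
```

A count that keeps growing as the process allocates, instead of staying flat as adjacent anonymous regions merge, points at exactly the per-entry fork() work described above.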

David
--
David Laight: ***@l8s.co.uk
Lars Heidieker
2012-10-18 07:39:36 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance problem
converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it
receives a lot of commands. NetBSD fork(2) seems to be much
slower than OS X fork(2).
I've seen things that show that a processes memory page list isn't
getting its entries merged - so there are a lot of items to
process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big C++
program - like a python interpreter - is doing the forks().
David
Hi,

Currently the amap layer limits the size of amaps to 255 * PAGE_SIZE,
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494

That's why the map entries for anonymous memory don't get merged.

This hurts fork performance.
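The effect of unmerged anonymous map entries on fork can be demonstrated from user space; the sketch below (illustrative, thresholds chosen arbitrarily) creates many single-page anonymous mappings and times fork before and after:

```python
import mmap
import os
import time

def fork_time(n_forks=50):
    """Mean seconds per fork()+waitpid() round trip."""
    start = time.monotonic()
    for _ in range(n_forks):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        os.waitpid(pid, 0)
    return (time.monotonic() - start) / n_forks

if __name__ == "__main__":
    baseline = fork_time()
    # Many separate single-page anonymous mappings: if the kernel cannot
    # merge the resulting map entries, each one adds per-entry work to
    # every subsequent fork(). Keep the list referenced so nothing is
    # unmapped while we measure.
    mappings = [mmap.mmap(-1, mmap.PAGESIZE) for _ in range(2000)]
    print("baseline: %.1f us/fork, with 2000 mappings: %.1f us/fork"
          % (baseline * 1e6, fork_time() * 1e6))
```

On a kernel that merges adjacent anonymous entries the two numbers should stay close; a large gap is consistent with the amap size limit described above.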

Lars

- --
- ------------------------------------

Mystical explanations:
Mystical explanations are considered deep;
the truth is that they are not even superficial.

-- Friedrich Nietzsche
[ Die Fröhliche Wissenschaft, Book 3, 126 ]
Lloyd Parkes
2012-10-21 18:54:48 UTC
Permalink
Post by Lars Heidieker
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
This will hit fork performance.
I've just had another look at the ktrace data and this could be what's causing the problem for CVS. The CVS server is calling mmap quite a lot for anonymous memory. CVS doesn't appear to call mmap directly, but jemalloc shouldn't be mapping memory in chunks as small as what I'm seeing. I'm just going to have to poke it with a stick and see what happens.

I also decided to swallow my BSD pride and try running this code on Linux. It turns out that Linux is nowhere near as fast as OS X either, so maybe I shouldn't feel too bad.

Thanks for the advice.

Lloyd
Masao Uebayashi
2012-10-22 00:27:39 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance problem
converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and it
receives a lot of commands. NetBSD fork(2) seems to be much
slower than OS X fork(2).
I've seen things that show that a processes memory page list isn't
getting its entries merged - so there are a lot of items to
process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big C++
program - like a python interpreter - is doing the forks().
David
Hi,
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
What happens if a larger page size is used?
Lloyd Parkes
2012-10-22 18:41:51 UTC
Permalink
Early yesterday I decided to gather more data and run some experiments, because I had found that kdump displays the elapsed time in system calls (which ktruss doesn't do) and it looked like fcntl(2) was taking much more time than fork(2). I compiled a copy of cvs from pkgsrc with the cvs flow control option disabled (which appeared to be where the fcntl(2) calls were coming from) and the problem went away. I then recompiled cvs with flow control enabled so that I would have a proper test, and the problem stayed away.

Further testing shows me that the problem is related to some difference between cvs 1.12.13 and cvs 1.11.23. The answer to my problem is fairly obvious: use cvs 1.11.23. I have taken a 30-second ktrace from cvs 1.12.13 that shows fork taking a quarter of a second every time. This was after cvs had been running for about 12 hours on this task, and it didn't occur to me to get a copy of its memory map before I killed it.

Thanks for the help,
Lloyd
David Laight
2012-10-22 21:26:36 UTC
Permalink
Post by Lloyd Parkes
Further testing shows me that the problem is related to some difference
between cvs 1.12.13 and cvs 1.11.23.
The answer to my problem is fairly obvious: use cvs 1.11.23.
I have taken a 30-second ktrace from cvs 1.12.13 that shows fork
taking a quarter of a second every time.
Was that before it returned in the parent, or in the child?
I've seen issues in the past (wasn't actually NetBSD) where the child
was scheduled before the parent (dunno which NetBSD schedules first)
and if the child didn't block the parent's priority slowly got lower
and lower (if the system was busy).
This was a listener, and the process's priority got so bad connect
requests started timing out!
Post by Lloyd Parkes
This was after cvs had been running for about 12 hours on this task and it
didn't occur to me to get a copy of its memory map before I killed it.
Did you even look at the size?
Might have been growing a lot.

David
--
David Laight: ***@l8s.co.uk
Lloyd Parkes
2012-10-22 22:42:17 UTC
Permalink
Post by David Laight
Post by Lloyd Parkes
I have taken 30 second ktrace from cvs 1.12.13 that shows fork
taking a quarter of a second every time.
Was that before it returned in the parent, or in the child?
This is in the parent.
Post by David Laight
I've seen issues in the past (wasn't actually NetBSD) where the child
was scheduled before the parent (dunno which NetBSD schedules first)
and if the child didn't block the parent's priority slowly got lower
and lower (if the system was busy).
My test system is a VirtualBox guest with two CPUs so scheduling shouldn't get this pathological. The VirtualBox host has hyperthreading, so I can't guarantee two real CPUs though.
Post by David Laight
Post by Lloyd Parkes
This was after cvs had been running for about 12 hours on this task and it
didn't occur to me to get a copy of its memory map before I killed it.
Did you even look at the size?
Might have been growing a lot.
I checked swap and it wasn't being used. The system had enough RAM for the task at hand and not much more. In fact I had to tune the vm sysctl stuff to avoid swapping. The system now tries to keep much more anonymous memory in RAM and I also reduced the target inactive percentage to 10% for no good reason just before the horrendous CVS 1.12.13 test run.

Since this is a VirtualBox guest, I can just fire it up again if anyone wants more information.

Cheers,
Lloyd
Lars Heidieker
2012-10-22 21:05:02 UTC
Permalink
Post by Masao Uebayashi
Post by David Laight
Post by Lloyd Parkes
So, with a slightly closer look, a guess and some tests to
verify my guess, and I think I have found my performance
problem converting the NetBSD CVS repositories to Mercurial.
The CVS server forks once for each command it receives, and
it receives a lot of commands. NetBSD fork(2) seems to be
much slower than OS X fork(2).
I've seen things that show that a processes memory page list
isn't getting its entries merged - so there are a lot of items
to process during fork(). (cat something in /proc ...)
The malloc netbsd uses (that uses mmap() instead of sbrk())
probably makes this much more significant. Especially if a big
C++ program - like a python interpreter - is doing the
forks().
David
Hi,
currently the amap layer limits the size of amaps to 255 * PAGE_SIZE
see: http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494
that's why the map entries for anon memory don't get merged.
What happens if larger page size is used?
Hi,

I think we can make the check
(http://nxr.netbsd.org/xref/src/sys/uvm/uvm_amap.c#494):

if ((slotneed * sizeof(*newsl)) > PAGE_SIZE) {

This will give us 1024 slots, so for a 4k PAGE_SIZE we have 4 MB of
reach. It re-enables a code path in amap_copy that breaks up a large
amap (larger than the 255-slot limit) into chunks; this code has been
disabled for some years (since rev 1.59 afaik) but it has run stably
so far on my test machine with the raised limit in place.
(This is however a short-term solution, just a step in the right
direction.)

I think we need to cap those allocations in size at the largest kmem
cache we have, which currently is PAGE_SIZE.

In the longer term a proper solution might be to change the allocation
strategy of the amap to something like a two-level page table once we
get larger than PAGE_SIZE.

The limit was introduced 7 1/2 years ago in revision 1.59. In that
time-frame allocations were done via malloc(9), which had a hard upper
limit of 64k; for some years now they have been done via kmem(9), so
things have changed quite a bit since.
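The two-level idea can be sketched in a few lines; this is purely illustrative user-space Python, not kernel code, and the names and the 8-byte-pointer assumption are made up for the sketch:

```python
PAGE_SIZE = 4096
SLOTS_PER_CHUNK = PAGE_SIZE // 8  # assuming 8-byte slot pointers

class TwoLevelSlots:
    """Illustrative two-level slot table: a top-level index of chunks,
    each chunk holding up to SLOTS_PER_CHUNK entries, so no single
    allocation ever needs to exceed PAGE_SIZE."""

    def __init__(self):
        self.chunks = {}  # top level: chunk index -> chunk dict

    def set(self, slot, anon):
        hi, lo = divmod(slot, SLOTS_PER_CHUNK)
        # Allocate the second-level chunk lazily, as a page table would.
        self.chunks.setdefault(hi, {})[lo] = anon

    def get(self, slot):
        hi, lo = divmod(slot, SLOTS_PER_CHUNK)
        return self.chunks.get(hi, {}).get(lo)
```

The design point is that growing the amap only ever allocates one more PAGE_SIZE-bounded chunk, keeping every allocation within the largest kmem cache.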

(yamt@ I have added you as you introduced the limit)

kind regards,
Lars


- --
- ------------------------------------

Mystical explanations:
Mystical explanations are considered deep;
the truth is that they are not even superficial.

-- Friedrich Nietzsche
[ Die Fröhliche Wissenschaft, Book 3, 126 ]
YAMAMOTO Takashi
2012-10-24 07:34:50 UTC
Permalink
hi,
i forgot about the particular change, sorry.

YAMAMOTO Takashi
