Jane Street

Open Source @ Jane Street

View on GitHub

Writing Performance Sensitive OCaml Code

1) OCaml Performance

The notes fall into the following categories:

When possible have tried to motivate these with snippets of assembly code generated by ocamlopt.

As a caveat, performance optimizations are sometimes at odds with desirable programming styles such as preserving abstraction boundaries and higher order functional patterns. Hand inlining code across abstraction boundaries and avoiding closure allocations buy you some performance, however these also make code brittle and inflexible. One should do these sorts of optimizations sparingly and only in the heart of tight loops.

Possibly the most useful performance tuning tip is to identify the places where your program spends most time and see if there are high level algorithmic approaches and changes in representation of data that can speed up things. These high level approaches will usually buy you more than machine level optimizations. Resort to low level tuning described in this document only after you have tried the former.

1.1) Compiling

Most of the discussion below assumes a 64-bit platform. Assembly code presented here is x86_64 asm generated by using ocamlopt with the compiler flags -S, -inline 20 and -nodynlink on a 64-bit Linux platform.

You can add the -S flag, which instructs the compiler to generate assembly code, by adding

OCAMLOPTFLAGS += -S

to your local OMakefile.

It is best to generate assembly code with full inlining and other optimizations that the compiler supports. Even thought inlining and optimization might make the code a bit harder to read, it will give you a more accurate picture of what executes on the CPU. Profiling tool like gprof where the inlining and such only serves to confuse the profiler.

2) Reading OCaml Assembly

This section introduces the minimal assembly instruction patterns that one must know to follow some of the cost arguments later. This is not a general introduction to x86 assembly.

2.1) Named functions are mangled but recognizable.

In the file called test_asm.ml the function f

let f a = a+1

The generated asm looks like the following:

 camlTest_asm__f_33700:
.L119:
        addq    $2, %rax
        ret

Note that the file/module name and the name of the function appear in the generated label. An anonymous function however will just have some numeric label making it difficult to locate.

2.2) Memory allocation

When OCaml allocates memory it decrements the "heap top" which is typically in %r15 followed by an overflow check. If there is an overflow it calls the GC and the allocation is retried.

The snippet below allocates 24 bytes (three 64-bit words).

.L121:  subq    $24, %r15
        cmpq    caml_young_limit(%rip), %r15
        jb      .L122

This is the code at label .L122 that has the call to the GC followed by the jump to retry allocation with the new %r15 and young limit value after garbage collection. .L122: call caml_call_gc .L123: jmp .L121

2.3) Boxed Floats and Immediate ints

OCaml uses IEEE 64-bit floats and are usually heap allocated (ie. boxed). The main reason for this is that OCaml needs to tag its values such that the GC can identify them at runtime. OCaml allocates floats on the heap using two words - the first word is a tag that says that the next is a float.

In asm the following snippet indicates a typical boxing operation where the actual float value is in the register %xmm0. Looking out for these when trying to track down boxed floats in your code.

  1. First allocate the 2 words.
.L120:  subq    $16, %r15
        cmpq    caml_young_limit(%rip), %r15
        jb      .L121
  1. Assign the tag value (the first word).
       leaq    8(%r15), %rax
       movq    $1277, -8(%rax)
  1. Assign the actual value (the second word).
       movlpd  %xmm0, (%rax)-

NOTE: An important thing to note is that a pointer to a heap allocated value does not point to the tag word but the actual first value. This is why the "load-effective-address" (lea) instruction adds the offset of 8 bytes. This is true for all heap allocated values, not just floats.

Unlike floats, ints are represented as immediates, i.e., they are not allocated on the heap. To tag ints OCaml steals the LSB bit of an int and sets it to 1 always. This means that we get 63 bits of precision in OCaml ints and any int value n will appear in the assembly as 2*n+1 (the LSB is always 1 and the bits of the number shifted one place in).

2.4) Tuples

Tuples are heap allocated. A pair will cause the allocation of 3 words: one word to indicate that it is a pair and two words for the actual values.

If a tuple contains floats, the floats will be boxed. If one wishes to return a pair of floats, a record containing the two floats may be more efficient than using a tuple - more on this later.

2.5) Option types

Option types are heap allocated. Passing a value as an option will cause memory allocation. The value None is however immediate and represented as 1, the same as the representation of the int value 0.

2.6) Write barriers

A write barrier is a basically a call to caml_modify and looks like the following:

    call    caml_modify

This webpage (pdf) and this one (pdf) give a reasonable overview of what caml_modify does. Any introductory text on garbage collection should give you some insight into why write barriers exist.

2.7) Asserts are helpful in reading asm.

If you have a large function and wish to see how a particular part of the function translates into asm, a simple way to track down the fragment in assembly is to put asserts around it. This is just a hack, but in practice it saves you some time reading asm.

Sometimes OCaml tends to optimize away provably correct asserts - this is something to watch out for. In the generated asm, the presence of the assert is indicated by:

    leaq    caml_exn_Assert_failure(%rip), %rbx

Hence if you are looking for a snippet of code in function f, first find f and then look for the asserts that you put in it. A more general method is the line numbers methods below:

2.8) Approximating line numbers in OCaml assembly

Assembly generated by OCaml does not have source line numbers. Having line numbers is useful when comparing the generated asm back to OCaml source. This hack lets one simulate have source line numbers in generated assembly. This relies on the fact that OCaml does not inline recursive functions.

To start, we define a handy set of recursive definitions to act as line number markers:

let rec _line0 () = if false then _line0 ()
let rec _line1 () = if false then _line1 ()
let rec _line2 () = if false then _line2 ()
let rec _line3 () = if false then _line3 ()
let rec _line4 () = if false then _line4 ()

And we annotate a part of the source code that we would like to read:

_line0 ();
config.script <- Some script;
_line1 ();
config.source_server <- server;
_line2 ();
config.source_path <- source_path;
_line3 ();

When we compile this code with the -S flag, this gives us:

    movq    $1, %rax
    .loc    1   207
    call    camlFile_copier_test___line0_79835
.L224:
.L225:  subq    $16, %r15
    cmpq    caml_young_limit(%rip), %r15
    jb  .L226
    leaq    8(%r15), %rsi
    movq    $1024, -8(%rsi)
    movq    0(%rsp), %rax
    movq    %rax, (%rsi)
    movq    camlConfig + 432(%rip), %rdi
    call    caml_modify
    movq    $1, %rax
    .loc    1   209
    call    camlFile_copier_test___line1_79836
.L228:
    movq    8(%rsp), %rax
    movq    88(%rax), %rsi
    movq    camlConfig + 432(%rip), %rdi
    addq    $24, %rdi
    call    caml_modify
    movq    $1, %rax
    .loc    1   211
    call    camlFile_copier_test___line2_79837
.L229:
    movq    8(%rsp), %rax
    movq    96(%rax), %rsi
    movq    camlConfig + 432(%rip), %rdi
    addq    $32, %rdi
    call    caml_modify
    movq    $1, %rax
    .loc    1   213
    call    camlFile_copier_test___line3_79838

Here the call to _line0 looks like:

    movq    $1, %rax
    .loc    1   207
    call    camlFile_copier_test___line0_79835

The $1 in the %rax register is the () argument. The lines between the calls _line functions are the asm generated from the original source. For example, the source line

    config.script <- Some script;

produced:

    subq    $16, %r15
    cmpq    caml_young_limit(%rip), %r15
    jb  .L226
    leaq    8(%r15), %rsi
    movq    $1024, -8(%rsi)
    movq    0(%rsp), %rax
    movq    %rax, (%rsi)
    movq    camlConfig + 432(%rip), %rdi
    call    caml_modify

Note that the .loc annotation also gives us information about where the function being called is defined in the original source file. This approach to annotating the asm makes the effects of inlining operations and such done by the compiler evident in the assembly.

3) Cost of OCaml constructs

3.1) Cost of memory allocation

Having seen the OCaml memory allocation instructions its easy to see that memory allocation itself is cheap - its just one addition and a test, in the common case.

The real cost of memory allocation comes during GC time. The more memory you allocate, even though the allocations are cheap, the more time your program will spend in the GC. So a general rule of thumb for speeding up programs is to reduce allocation rates. Useful reading includes:

3.2) Closure allocation

One significant source of memory allocation is the allocation of closures. Closure allocation basically involves allocating memory for each of the free variables of the function and for a code (memory operations are more expensive than register operations) and pointer.

  1. If a function is called just once or not called at all, try and restructure your code to not allocate the closure. For example: if one is going to fold or map over a list that is frequently empty, testing to see if the list is empty before doing the map operation will avoid the closure allocation.

  2. Alternately, try to avoid writing functions that have free variables - instead pass all the arguments explicitly.

3.3) Inlining

It is easier to describe when the compiler does not inline rather than when it does.

  1. It does not inline large functions. I don't know the exact metric.
  2. It does not inline "rec" bound functions - even if they are small and even if the function is not really recursive.
  3. It does not inline local functions
  4. It does not inline functions containing a structured constant such as:
    let f x = x :: [1]
  5. Function given as parameters are never inlined: http://caml.inria.fr/mantis/view.php?id=5917

To see if a function will be inlined or not you can look at the output of ocamlobjinfo. The inlining of a function is decided at declaration time, so a function is either always inlined, either never. Unless the .cmx file containing it is not available when compiling a unit using this function.

3.4) Stack allocation of arguments

Only a small number of arguments are passed in registers to functions - the actual number varies according to platform and possibly other optimizations that the compiler performs.

The remainder of the arguments have to passed via the stack. Hence functions that receive a large number of arguments will involve several memory operations for processing the callee saved register and then the stack allocation of formal arguments. In cases where it is feasible, breaking up the called function into smaller functions will allow for the arguments to be passed in registers and in possibly allow for inlining of the resulting smaller functions.

3.5) Polymorphic compare

OCaml provides a polymorphic compare caml_compare for datatypes. It works in much the same way the GC does by looking at the tags on objects to determine their types at runtime and then using the appropriate compare functions. This is more expensive than using a compare operation written specifically for the actual type (for example caml_float_compare for comparing floats).

let cmp a b =
  compare a b

in assembly: camlTest_asm__cmp_33386: subq $8, %rsp movq %rax, %rdi movq %rbx, %rsi leaq caml_compare(%rip), %rax call caml_c_call Here is an explicit float comparison:

let cmp_float1 (a : float) (b : float) =
  Float.compare a b

in assembly: camlTest_asm__cmp_float1_33390: subq $8, %rsp movq %rax, %rdi movq %rbx, %rsi call caml_float_compare addq $8, %rsp ret Here is an inferred float comparison:

let cmp_float3 (a : float) (b : float) =
  compare a b

in assembly: camlTest_asm__cmp_float3_33679: subq $8, %rsp .L107: movq %rax, %rdi movq %rbx, %rsi call caml_float_compare addq $8, %rsp ret

3.6) Inlining and polymorphic compare

In the following snippets the compiler can infer that the compare is only float comparison, however the inliner does not specialize the comparison operation.

let cmp_float2 (a : float) (b : float) =
  cmp a b

Note that the inlining has happened as expected, but no specialization: camlTest_asm__cmp_float2_33676: subq $8, %rsp movq %rax, %rdi movq %rbx, %rsi leaq caml_compare(%rip), %rax call caml_c_call addq $8, %rsp ret Doing just renaming also does not buy us anything, it is just as bad.

let cmp_compare = compare

let cmp_float4 (a : float) (b : float) =
  cmp_compare a b

As before, inlined but not specialized. camlTest_asm__cmp_float4_33682: subq $8, %rsp movq %rax, %rdi movq %rbx, %rsi leaq caml_compare(%rip), %rax call caml_c_call addq $8, %rsp ret So when in doubt, manually specify the type specific compare function.

3.7) Float boxing. Time is usually represented as a float.

Why are floats boxed/unboxed? GC, IEEE format, SSE instructions, tagging... this is a long story and one should look for descriptions of this online. Here are some general rules about when floats occur boxed and unboxed:

  1. Float arguments to (non-inlined) functions are boxed.
  2. Float return values from (non-inlined) functions are boxed.
  3. Floats have to be unboxed for doing calculations - they usually use the SSE instructions of the CPU.
  4. An array of floats uses an immediate representation - ie. each float is not boxed.
  5. A record of consisting only of floats uses an immediate representation. This does not apply to polymorphic records that get specialized to floats. This does not apply to abstract types that are internally defined to be floats.
  6. An record that has floats and non-floats has the floats boxed.
  7. A tuple of floats has the individual floats boxed.

3.8) Float compare

Compare for floats caml_float_compare requires the floats to be boxed. This can be terrible when you have immediate floats.

type t = {
  x : float;
  y : float;
}
let cmp_t1 a b =
  begin match compare a.x b.x with
  0 -> compare a.y b.y
  x -> x
  end

Becomes the following. Notice the boxing operations for the floats. camlTest_asm__cmp_t1_33693: subq $8, %rsp .L113: movq %rax, %rbp .L114: subq $32, %r15 cmpq caml_young_limit(%rip), %r15 jb .L115 leaq 8(%r15), %rsi movq $1277, -8(%rsi) movlpd (%rbx), %xmm0 movlpd %xmm0, (%rsi) leaq 16(%rsi), %rdi movq $1277, -8(%rdi) movlpd (%rbp), %xmm0 movlpd %xmm0, (%rdi) call caml_float_compare cmpq $1, %rax je .L112 addq $8, %rsp ret .align 4 .L112: .L117: subq $32, %r15 cmpq caml_young_limit(%rip), %r15 jb .L118 leaq 8(%r15), %rsi movq $1277, -8(%rsi) movlpd 8(%rbx), %xmm0 movlpd %xmm0, (%rsi) leaq 16(%rsi), %rdi movq $1277, -8(%rdi) movlpd 8(%rbp), %xmm0 movlpd %xmm0, (%rdi) call caml_float_compare addq $8, %rsp ret The floats are boxed and then there are calls to the underlying float comparison. A much more efficient, though tedious function to write would be:

let cmp_t2 a b =
  if a.x = b.x then begin
    if a.y = b.y then 1
    else if a.y < b.y then -1
    else 1
  end
  else begin
    if a.x < b.x
    then -1
    else 1
  end

Which becomes: camlTest_asm__cmp_t2_33697: .L124: movlpd (%rbx), %xmm1 movlpd (%rax), %xmm0 ucomisd %xmm1, %xmm0 jp .L121 jne .L121 movlpd 8(%rbx), %xmm1 movlpd 8(%rax), %xmm0 ucomisd %xmm1, %xmm0 jp .L123 jne .L123 movq $3, %rax ret Hence use =, <= and >= when possible in the case of floats. These map directly to the underlying asm instructions. NOTE: It may be useful to rewrite Float.compare using =, <= and >=.

3.9) Mutable fields in records

Mutations have an additional cost when the mutated value is heap allocated - ie. when the value is a boxed float, a list, a tuple, a record etc.

These mutations involve calls to the caml write barrier. A simple benchmark shows that invoking the write barrier repeatedly makes thing about 5-6 times slower. Consider the following snippet: Consider:

type t = {
  x : int;
  mutable y : float;
}
let set_y t y =
  t.y <- y

Which generates the following. Here there are calls to the write barrier. camlTest_asm__set_y_33718: subq $8, %rsp .L136: addq $8, %rax movq %rax, %rdi movq %rbx, %rsi call caml_modify movq $1, %rax addq $8, %rsp ret Contrast the above with the following. Here the only difference is the type of x which causes the float to have an immediate representation (remember that in mixed records floats are boxed, whereas they are not in float-only records).

type t = {
  x : float;
  mutable y : float;
}
let set_y t y =
  t.y <- y

Which generates: camlTest_asm__set_y_33727: .L137: movlpd (%rbx), %xmm0 movlpd %xmm0, 8(%rax) movq $1, %rax ret Changing representation of your data to avoid calls to the write barrier will help performance. As a side effect, your code may have the additional overhead of introducing unboxing and boxing instructions and so the next result may or may not be advantageous - one has to profile to sure.

3.10) Mutable single-field float records can outperform float refs.

As a direct consequence of the above, float ref represents the float as a boxed value. Mutating the ref causes calls to the write barrier. The following type does not:

type float_ref = {
  mutable x : float;
}

While it is tempting to try and represent floats in an immediate way as much as possible, however this sometimes hinder performance rather than help it. The fact that function arguments/returns are boxed can cause repeated boxing and unboxing as the floats are passed around.

Note: A useful experiment to try is to use the above float_ref type everywhere that one uses floats and explicitly cast to float only when one needs to do calculations. With respect to memory layout, float_ref is essentially the same as a boxed float but we explicitly have to insert the boxing and unboxing operations - making these explicit can help us optimize away needless cases and save on mutations.

3.11) Inlining avoids boxing (sometimes).

If the following function is inlined the argument and return value are not boxed.

let f x = x + 0.1

However, in the following function it is possible that x is used as a boxed float sometimes, the OCaml compiler will always box the float on entry to the function.

let f x =
  if rarely_true_flag
  then ls := (x :: !ls);
  x + 0.1

NOTE: This could be something that can be fixed in the compiler by pushing the box allocation into the conditional.

3.12) Local module definitions are not free

Local module definitions have a runtime cost - they are not purely syntactic.

let mod_test x =
  let module M = T1.T2 in
  M.f x

Generates: camlTest_asm__mod_test_33742: movq camlTest_asm + 120(%rip), %rbx movq 8(%rbx), %rbx movq (%rax), %rax ret Notice that the value in %rbx has no real purpose. The inlined call to M.f is only the instruction movq (%rax), %rax. If we write:

module M = T1.T2
let mod_test x =
  M.f x

We have: camlTest_asm__mod_test_33745: movq (%rax), %rax ret

3.13) Functor application is a runtime operation.

This:

module S(T : T_intf) = struct
  let g x = (T.f x) + 1
end
let mod_test2 x =
  let module M = S(T1.T2) in
  M.g x

Generates: call camlTest_asm__S_33743 The name of the functor S shows up as a function name. There is not much one can do about this other than to move functor applications outside function definitions such that they are not repeatedly applied.

3.16) Option.iter is sometimes not inlined

This:

let opt_iter opt =
  Option.iter opt ~f:(fun x -> x_ref := x)

Becomes:

camlTest_asm__opt_iter_33732:
.L140:
    leaq    camlTest_asm__18(%rip), %rbx
    cmpq    $1, %rax
    je      .L139
    movq    (%rax), %rax
    movq    (%rbx), %rdi
    jmp     *%rdi
    .align  4
.L139:
    movq    $1, %rax
    ret

The compiler does not always do this. I don't know why - one should check.

NOTE: The above code is not as optimal as it could be - in particular the anonymous function itself is not inlined. Even if the inlining is not done it could still be better by moving the functional address lookup inside the conditional and avoiding the assignment to %rax. We should ask the compiler folk to fix this.

One should hand-inline the above code when perf matters.

let opt_iter opt =
  match opt with
  Some x -> x_ref := x
  None -> ()

3.17) Optional arguments are not free even when inlined.

When a function with optional arguments (and default values) is marked as an inlined function, the inlining does not eliminate the cost of allocating the intermediate option type. Details below.

Consider:

let f2 t ?(i1=t.i1) ?(i2=t.i2) ?(sub1=t.sub1) () =
  let t = { t with i1; i2; sub1; } in
  t

let _f3 i2 =
  let t = {i1 = 10; i2= 20; sub1={f1=1.0; f2=None}; } in
  let t = f2 t ~i1:200 ~i2 () in
  t

At the call site, the call to f2 is inlined. However the asm looks like this. See annotations inline.

.L128:  subq    $16, %r15
    cmpq    caml_young_limit(%rip), %r15  ; (2 words) camlAsm___f3_65118
    jb  .L129
    leaq    8(%r15), %rbx
    movq    $1024, -8(%rbx)  ; cons(tag 0)  size 1
    movq    0(%rsp), %rax
    movq    %rax, (%rbx) ; rbx now points to [Some i2] which just got
                         ; allocated.
    leaq    camlAsm__2(%rip), %rax ; this is the constant ~i1:200
    cmpq    $1, %rax    ; check for None
    je  .L122
    movq    (%rax), %rdx  ; this path is always taken.
    jmp .L121
    .align  4
.L122:
    movq    8(%rsp), %rax  ; this path is never taken.
    movq    (%rax), %rdx
.L121:
    cmpq    $1, %rbx   ; check for None in rbx (which we had just allocated)
    je  .L120
    movq    (%rbx), %rsi  ; this path is always taken
    jmp .L119
    .align  4
.L120:
    movq    8(%rsp), %rax ; this path is never taken.
    movq    8(%rax), %rsi
.L119:

In summary:

1) The call to f2 has indeed been inlined. However the inlining has only saved us the cost of making a function call (roughly, the cost of doing one jump). It has not optimized away the cost of the optional arguments which can now be fully determined at compile time.

2) The argument ~i1:200 is allocated at compile time to the constant Some 200. At runtime the compiler checks this constant to see if it is a None or Some.

3) The argument ~i2 is allocated to a Some i2 at runtime and is immediately checked for at runtime to see if it None or Some.

This should really be fixed in the compiler because optional arguments that are properly inlined are a very valuable feature.

4) Misc

Some misc perf notes.

1) glibc pow() and the OCaml ** operator.

OCaml's ** operator calls pow() in glibc. Until glibc version 2.12, the execution time of pow() from the standard math lib varies significantly depending on the numeric values of arguments. This behavior is apparently fixed in glibc 2.14. Consider:

open Core.Std let _ = _squelch_unused_module_warning_
open Core_bench.Std

let () =
  Command.run (Bench.make_command [
    Bench.Test.create ~name:"Float ** (1)" (fun () ->
      ignore (1.0000000000000020 ** 0.5000000000100000));
    Bench.Test.create ~name:"Float ** (2)" (fun () ->
      ignore (1.0000000000000020 ** 0.5000000000000000));
    Bench.Test.create ~name:"Float ** (3)" (fun () ->
      ignore (1.0000000000000020 ** 0.4999999999000000));
    Bench.Test.create ~name:"Float ** (4)" (fun () ->
      ignore (1.0000000000000020 ** 0.4999900000000000));
  ])

which produces:


$ ./z.exe -q 1 
Estimated testing time 4s (4 benchmarks x 1s). Change using -quota SECS.
┌──────────────┬──────────────┬─────────┬────────────┐
│ Name         │     Time/Run │ mWd/Run │ Percentage │
├──────────────┼──────────────┼─────────┼────────────┤
│ Float ** (1) │  11_491.57ns │   2.07w │      2.87% │
│ Float ** (2) │ 400_987.47ns │   2.57w │    100.00% │
│ Float ** (3) │  11_465.42ns │   2.07w │      2.86% │
│ Float ** (4) │      59.25ns │   2.00w │      0.01% │
└──────────────┴──────────────┴─────────┴────────────┘

The typical execution time for this function is 60-70ns, but in some specific cases it can be as bad as 400us. One workaround for the problem is to reduce the precision of x in pow(x, y), by doing (Float.round_nearest (x *. 1e8)) *. 1e-8..

There is surprisingly little about this issue on the Internet. Here are two somewhat relevant links:

http://entropymine.com/imageworsener/slowpow/

http://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function

2) Strings, Bigstrings and the OCaml runtime lock.

In the following code, the OCaml runtime lock is being held while the time consuming LZ4 compression routine is run. The OCaml runtime lock forces all calls into the C code to be sequential, since it behaves like a critical section.

CAMLprim value _LZ4_compress_bigstring(value src, value dst, value isize)
{
    CAMLparam3(src, dst, isize);
    char* src_c = get_bstr(src, 0);
    char* dst_c = get_bstr(dst, 0);

     int   res   = LZ4_compress(src_c, dst_c, Int_val(isize));
    CAMLreturn (Val_int(res));
}

The code snippet can be made "faster" by releasing runtime lock as follows:

CAMLprim value _LZ4_compress_bigstring(value src, value dst, value isize)
{
    CAMLparam3(src, dst, isize);
    char* src_c = get_bstr(src, 0);
    char* dst_c = get_bstr(dst, 0);

    caml_enter_blocking_section()
    int   res   = LZ4_compress(src_c, dst_c, Int_val(isize));
    caml_leave_blocking_section()
    CAMLreturn (Val_int(res));
}

Note that src and dst are Bigstrings.

The snippet is faster in the sense that it allows the OCaml runtime to schedule other threads that may use the runtime lock. For example, this may be the case with an Async program where many jobs are trying to compress strings and multiple such compression routines can happen on different worker threads.

This does not work with strings. In particular, doing the following is unsafe:

CAMLprim value _LZ4_compress_string(value src, value dst, value isize)
{
    CAMLparam3(src, dst, isize);
    char* src_c = String_val(src);
    char* dst_c = String_val(dst);

    caml_enter_blocking_section()
    int   res   = LZ4_compress(src_c, dst_c, Int_val(isize));
    caml_leave_blocking_section()
    CAMLreturn (Val_int(res));
}

The reason is that OCaml strings live on the OCaml heap. By releasing the runtime lock, we allow the OCaml GC to run. If the OCaml GC moves the string in memory while the C code is using it, then things fail.

This is safe to do with Bigstrings since they are allocated on the C heap. The OCaml GC never moves them. The GC may move the stub that point to the C heap, but the values of the src_c and dst_c pointers always point to the right data.

Related reading: http://d.hatena.ne.jp/camlspotter/20100309/1268111257 (pdf)

3) Fast, Slow and Incorrect Array blits

Array blits in OCaml, in the general case, are slow -- i.e. they are not merely memmove/memcpy calls as one might assume.

Array.blit as provided by Core and also supplied by OCaml call caml_array_blit provided by the runtime. In the OCaml runtime, there are arrays tagged as "double arrays" -- these are typically float arrays. Here are the rules:

1) caml_array_blit does a memmove for double arrays. (For float arrays) 2) caml_array_blit does a memmove for arrays in the minor heap. (i.e. for arrays created recently). 3) For every other array type there is a for-loop that calls caml_modify.

Here, caml_modify is OCaml's write barrier (see notes earlier about this). Blits that fall under case 3 are an order of magnitude slower. Long lived int arrays, bool arrays and arrays of other simple immediate value representations will fall under the third case thereby incurring the cost of calling caml_modify.

1) Compiler patching

It may be possible to patch the compiler to create arrays with representation guarantees at initialization time, such they are not conservatively treated as arrays of mutable references. We don't have this yet.

2) Use Bigarrays

One can use Bigarrays instead of regular OCaml arrays for immediate values. One approach to blitting Bigarrays is to use the provided blit function:

let blit arr ~len ~offset ~delta =
  let sub1 = Array1.sub arr offset len in
  let sub2 = Array1.sub arr (offset+delta) len in
  Array1.blit sub1 sub2

However, this is not as fast as one would expect for two reasons: (1) creating sub-arrays is slow and (2) creating sub-arrays allocates words on the major heap.

A faster blit for Bigarrays is to (1) write your own in C OR (2) hijack the blit provided by Bigstrings (which are essentially Bigarrays). One would do the later as follows:

let unsafe_blit ~arr1 ~arr2 ~len ~offset1 ~offset2 ~sizeof =
  let src_pos = offset1 * sizeof in
  let dst_pos = offset2 * sizeof in
  let len = len * sizeof in
  Bigstring.unsafe_blit
    ~src:(Obj.magic arr1)
    ~dst:(Obj.magic arr2)
    ~src_pos
    ~dst_pos
    ~len

This is indeed a fast blit and works between Bigarrays of immediate types such as floats, ints, bools, chars etc. Further, for large enough blits one can give up the caml runtime lock, thereby allow other ocaml code to run in parallel with the blit. (see Section on the OCaml runtime lock.)

To blit from a bigarray to an OCaml array one would do something like:

let unsafe_blit_bigstring_string ~arr1 ~arr2 ~len ~offset1 ~offset2 ~sizeof =
  let src_pos = offset1 * sizeof in
  let dst_pos = offset2 * sizeof in
  let len = len * sizeof in
  Bigstring.unsafe_blit_bigstring_string
    ~src:(Obj.magic arr1)
    ~dst:(Obj.magic arr2)
    ~src_pos
    ~dst_pos
    ~len

The above snippet works well for Bigarrays of floats but not for much else, i.e. this approach does not work well if one needs to blit values our from Bigarrays to regular OCaml arrays of any type.

The reason this is so is that the when assigning a value into a Bigarray, the library changes the values representation from its OCaml representation to a representation compatible with native C code. (Bigarrays were after all originally designed for interop with C code). In the case of floats, OCaml does change the word level representation of floats and so the above snippet words fine. Hence the following function works as expected:

let snapshot_to_float_array (arr : float_bigarray) ~offset ~len =
  if len = 0 then [||]
  else begin
    let dest = Array.create ~len 0.0 in
    unsafe_blit_bigstring_string
      ~arr1:arr ~arr2:dest
      ~offset1:offset ~offset2:0
      ~len ~sizeof:sizeof_float64;
    dest
  end

However for ints and other OCaml immediate types, the OCaml representation involves setting some tag bits. When values are assigned to Bigarrays, "bigarr.{i} <- ", this representation is stripped: (x) >> 1. When values are read back using the getter provided, "<- bigarr.{i}", the tag bits are reconstructed: ((intnat)(x) << 1) + 1.

Hence the following does not work:

let snapshot_to_int_array (arr : int_bigarray) ~offset ~len =
  if len = 0 then [||]
  else begin
    let dest = Array.create ~len 0 in
    unsafe_blit_bigstring_string
      ~arr1:arr ~arr2:dest
      ~offset1:offset ~offset2:0
      ~len ~sizeof:sizeof_int;
    dest
  end

If one were to blit/memove the values from Bigarrays to OCaml arrays, the tag bits are not reconstructed causing numbers to be garbled and confusing the GC. The following does work however, but it is slow:

let snapshot_to_array bigarr ~offset ~len =
  if len = 0 then [||]
  else begin
    let arr = Array.create ~len bigarr.{0} in
    for i = 0 to len - 1 do
      arr.(i) <- bigarr.{offset+i}
    done;
    arr
  end

3) Use FFI and build your own.

The most direct solution is to write your own FFI and call a memmove in C. To ensure that this is not misused for array types that may have references, the types should be fixed to the concrete types that this is allowed for.

CAMLprim value _array_blit64(value src, value dst,
                             value src_pos, value dst_pos,
                             value len)
{
  CAMLparam5(src, dst, src_pos, dst_pos, len);
  int isrc_pos = Int_val(src_pos);
  int idst_pos = Int_val(dst_pos);
  int ilen = Int_val(len);

  int idst_len = Wosize_val(dst);
  int isrc_len = Wosize_val(src);

  // ensure we are dealing with 64-bit values
  assert(sizeof(value) == 8);

  //ensure that we are within bounds
  assert(ilen >=0);
  assert(isrc_pos >= 0 && isrc_pos + ilen <= isrc_len);
  assert(idst_pos >= 0 && idst_pos + ilen <= idst_len);

  memmove(String_val(dst) + idst_pos * sizeof(value),
          String_val(src) + isrc_pos * sizeof(value),
          ilen * sizeof(value));

  CAMLreturn (Val_unit);
}

ml/mli file bindings:

external int_array_blit :
  src:int array -> dst:int array ->
  src_pos:int -> dst_pos:int ->
  len:int -> unit
    = "_array_blit64"

external float_array_blit :
  src:float array -> dst:float array ->
  src_pos:int -> dst_pos:int ->
  len:int -> unit
    = "_array_blit64"

While this gives us fast blits on immediate arrays, we do lose the ability to give up the caml runtime lock for large blits.

4) Time.now() and RDTSC

What may arguably be the best approach to timing OCaml code is to manually instrument your code using Time.now. One can experiment with various configurations of the code by commenting out parts of the code to understand the relative costs of particular functions. The downside to using Time.now() is that (1) it is an expensive call and (2) it has limited resolution and is hence not useful in measuring snippets that may only take milliseconds.

RDTSC is a light weight CPU counter which can be used instead of Time.now(). It gives a measure of CPU cycles elapsed between subsequent calls and it useful for timing.

CAMLprim value caml_rdtsc( )
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return Val_int( ((unsigned long long)lo)|( ((unsigned long long)hi)<<32 ));
}