NAME

The Perl Multicore Specification and Implementation

SYNOPSIS

  #include "perlmulticore.h"

  // in your XS function:

  perlinterp_release ();
  do_the_C_thing ();
  perlinterp_acquire ();

DESCRIPTION

This specification describes a simple mechanism for XS modules to allow re-use of the perl interpreter for other threads while doing some lengthy operation, such as cryptography, SQL queries, disk I/O and so on.

This is essentially the same approach that practically all other scripting languages (e.g. Python) use when implementing real threads.

The design goals for this mechanism were simplicity of use, extremely low overhead when not active, low code and data size overhead, and broad applicability.

The newest version of this document can be found at http://perlmulticore.schmorp.de/.

The newest version of the header file that implements this specification can be downloaded from http://perlmulticore.schmorp.de/perlmulticore.h.

XS? HOW DO I USE THIS FROM PERL?

This document is only about the XS-level mechanism that defines generic callbacks - to make use of this, you need a module that provides an implementation for these callbacks, for example Coro::Multicore.

WHICH MODULES SUPPORT IT?

You can check the perl multicore registry for a list of modules that support this specification.

HOW DO I USE THIS IN MY MODULES?

The usage is very simple - you include this header file in your XS module. Then, before you do your lengthy operation, you release the perl interpreter:

   perlinterp_release ();

And when you are done with your computation, you acquire it again:

   perlinterp_acquire ();

And that's it. This doesn't load any modules and consists of only a few machine instructions when no module to take advantage of it is loaded.

Here is a simple example, a flock wrapper implemented in XS. Unlike perl's built-in flock, it allows other threads (for example, those provided by Coro) to execute, instead of blocking the whole perl interpreter. For the sake of this example, it requires a file descriptor instead of a handle.

   #include "perlmulticore.h" // this header file

   // and in the XS portion
   int
   flock (int fd, int operation)
           CODE:
           perlinterp_release ();
           RETVAL = flock (fd, operation);
           perlinterp_acquire ();
           OUTPUT:
           RETVAL

You can find more examples in the Case Studies appendix.

HOW ABOUT NOT-SO-LONG WORK?

Sometimes you don't know how long your code will take - in a compression library, for example, compressing a few hundred kilobytes of data can take a while, while 50 bytes will compress so fast that even attempting to do something else could be more costly than just doing it.

This is a very hard problem to solve. The best you can do at the moment is to release the perl interpreter only when you think the work to be done justifies the expense.

As a rule of thumb, if you expect to need more than a few thousand cycles, you should release the interpreter, else you shouldn't. When in doubt, release.

For example, in a compression library, you might want to do this:

   if (bytes_to_be_compressed > 2000) perlinterp_release ();
   do_compress (...);
   if (bytes_to_be_compressed > 2000) perlinterp_acquire ();

Make sure the if conditions are exactly the same and don't change, so you always call acquire when you release, and vice versa.
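
One way to make the pairing robust is to latch the decision in a local variable (a sketch; do_compress and the threshold are as in the example above):

   /* remember the decision, so release and acquire always pair up,
      even if the threshold expression could evaluate differently */
   int released = bytes_to_be_compressed > 2000;

   if (released) perlinterp_release ();
   do_compress (...);
   if (released) perlinterp_acquire ();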

When you don't have a handy indicator, you might still do something useful. For example, if you do some file locking with fcntl and you expect the lock to be available immediately in most cases, you could try with F_SETLK (which doesn't wait), and only release/wait/acquire when the lock couldn't be set:

   struct flock lock;
   lock.l_type   = F_WRLCK;   /* example: a write lock... */
   lock.l_whence = SEEK_SET;
   lock.l_start  = 0;
   lock.l_len    = 0;         /* ...on the whole file */

   int res = fcntl (fd, F_SETLK, &lock);

   if (res)
     {
       // error, assume lock is held by another process and do it the slow way
       perlinterp_release ();
       res = fcntl (fd, F_SETLKW, &lock);
       perlinterp_acquire ();
     }

THE HARD AND FAST RULES

As with everything, there are a number of rules to follow.

Never touch any perl data structures after calling perlinterp_release.

Possibly the most important rule of them all, anything perl is completely off-limits after perlinterp_release, until you call perlinterp_acquire, after which you can access perl stuff again.

That includes anything in the perl interpreter that you haven't proven to be safe - and proven to be safe in older and future versions of perl: global variables, local perl scalars (even if you are sure nobody accesses them and you only "read" their value), and so on.

If you need to access perl things, do it before releasing the interpreter with perlinterp_release, or after acquiring it again with perlinterp_acquire.
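
For example, to pass the contents of a perl scalar to some C worker function, extract the pointer and length while you still hold the interpreter (a minimal sketch; sv is an SV * argument and do_something is a hypothetical worker function):

   STRLEN len;
   char *data = SvPV (sv, len); /* touches perl - do it before releasing */

   perlinterp_release ();
   do_something (data, len);    /* pure C from here on */
   perlinterp_acquire ();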

Always call perlinterp_release and perlinterp_acquire in pairs.

For each perlinterp_release call there must be a perlinterp_acquire call. They don't have to be in the same function, and you can have multiple calls to them, as long as every perlinterp_release call is followed by exactly one perlinterp_acquire call.

For example, this would be fine:

   perlinterp_release ();

   if (!function_that_fails_with_0_return_value ())
     {
       perlinterp_acquire ();
       croak ("error");
       // croak doesn't return
     }

   perlinterp_acquire ();
   // do other stuff

Never nest calls to perlinterp_release and perlinterp_acquire.

That simply means that after calling perlinterp_release, you must call perlinterp_acquire before calling perlinterp_release again. Likewise, after perlinterp_acquire, you can call perlinterp_release but not another perlinterp_acquire.
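
For illustration, this hypothetical sequence violates the rule, because the second release is not preceded by an acquire:

   perlinterp_release ();
   do_step_one ();        /* hypothetical worker functions */
   perlinterp_release (); /* WRONG: nested release - acquire first */
   do_step_two ();
   perlinterp_acquire ();
   perlinterp_acquire ();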

Always call perlinterp_release first.

Also simple: you must not call perlinterp_acquire without having called perlinterp_release before.

Never underestimate threads.

While it's easy to add parallel execution ability to your XS module, that doesn't mean it is safe. After you release the perl interpreter, it's perfectly possible that perl will call your XS function again, in another thread, even while your original call still executes. In other words: your C code must be thread-safe, and if you use any library, that library must be thread-safe, too.

Always assume that the code between perlinterp_release and perlinterp_acquire is executed in parallel on multiple CPUs at the same time. If your code can't cope with that, you could consider using a mutex to only allow one such execution, which is still better than blocking everybody else from doing anything:

   static pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;

   perlinterp_release ();
   pthread_mutex_lock (&my_mutex);
   do_your_non_thread_safe_thing ();
   pthread_mutex_unlock (&my_mutex);
   perlinterp_acquire ();

Don't get confused by having to release first.

In many real-world scenarios, you acquire a resource, do something, then release it again. Don't let this confuse you: here, you already own the resource (the perl interpreter), so you have to release it first and acquire it again later, not the other way around.

DESIGN PRINCIPLES

This section discusses how the design goals were reached (you be the judge), how it is implemented, and what overheads this implies.

Simple to Use

All you have to do is identify the place in your existing code where you stop touching perl stuff, do your actual work, and start touching perl stuff again.

Then slap perlinterp_release () and perlinterp_acquire () around the actual work code.

You have to include perlmulticore.h and distribute it with your XS code, but all these things border on the trivial.

Very Efficient

The definitions of perlinterp_release and perlinterp_acquire are very short:

   #define perlinterp_release() perl_multicore_api->pmapi_release ()
   #define perlinterp_acquire() perl_multicore_api->pmapi_acquire ()

Both are macros that read a pointer from memory (perl_multicore_api), dereference a function pointer stored at that place, and call the function, which takes no arguments and returns nothing.
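
In other words, perl_multicore_api can be thought of as pointing to a two-entry function table along these lines (a sketch consistent with the macros above; the authoritative definition is in perlmulticore.h):

   struct perl_multicore_api
   {
     void (*pmapi_release) (void);
     void (*pmapi_acquire) (void);
   };

   static struct perl_multicore_api *perl_multicore_api;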

The first call to perlinterp_release will check for the presence of any supporting module, and if none is loaded, will create a dummy implementation where both pmapi_release and pmapi_acquire execute this function:

  static void perl_multicore_nop (void) { }

So in the case of no magical module being loaded, all calls except the first are two memory accesses and a predictable function call of an empty function.
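
That is, once the dummy implementation is installed, the situation is equivalent to this sketch (using the struct layout assumed above):

   static void perl_multicore_nop (void) { }

   /* dummy table: both entries point at the empty function */
   static struct perl_multicore_api perl_multicore_nop_api =
     { perl_multicore_nop, perl_multicore_nop };

   /* ... and perl_multicore_api = &perl_multicore_nop_api; */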

Of course, the overhead is much higher when these functions actually implement anything useful, but you always get what you pay for.

With Coro::Multicore, every release/acquire involves two pthread switches, two coro thread switches, a bunch of syscalls, and sometimes interacting with the event loop.

A dedicated thread pool such as the one IO::AIO uses could reduce these overheads, and would also reduce the dependencies (AnyEvent is a smaller and more portable dependency than Coro), but it would require a lot more work on the side of the module author wanting to support it than this solution.

Low Code and Data Size Overhead

On a 64 bit system, perlmulticore.h uses exactly 8 octets (one pointer) of your data segment, to store the perl_multicore_api pointer. In addition it creates a 16 octet perl string to store the function pointers in, and stores it in a hash provided by perl for this purpose.

This is pretty much the equivalent of executing this code:

   $existing_hash{perl_multicore_api} = "123456781234567812345678";

And that's it - which is, I think, very little indeed.

As for code size and speed, on my amd64 system, every call to perlinterp_release or perlinterp_acquire results in a variation of the following 9-10 octet sequence, which is easy for modern CPUs to predict, as the function pointer is constant after initialisation:

   150>   mov    0x200f23(%rip),%rax  # <perl_multicore_api>
   157>   callq  *0x8(%rax)

The actual function being called when no backend is installed or enabled looks like this:

   1310>  retq

The biggest part is the initialisation code, which consists of 11 lines of typical XS code. On my system, all the code in perlmulticore.h compiles to less than 160 octets of read-only data.

Broad Applicability

While there are alternative ways to achieve the goal of parallel execution with threads that might be more efficient, this mechanism was chosen because it is very simple to retrofit existing modules with, and because it satisfies the remaining design goals: it is very efficient when not needed, has low code and data size overhead, and is broadly applicable.

DISABLING PERL MULTICORE AT COMPILE TIME

You can disable the complete perl multicore API by defining the symbol PERL_MULTICORE_DISABLE to 1 (e.g. by specifying -DPERL_MULTICORE_DISABLE as compiler argument).

This will leave no traces of the API in the compiled code; suitable "empty" perlinterp_release and perlinterp_acquire definitions will be provided instead.

This could be added to perl's CPPFLAGS when configuring perl on platforms that do not support threading at all for example, and would reduce the overhead to nothing. It is by no means required, though, as the header will compile and work just fine without any thread support.
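
Conceptually, the disabled case reduces both calls to no-ops, along these lines (a sketch; the header's exact definitions may differ):

   #ifdef PERL_MULTICORE_DISABLE
   # define perlinterp_release() do { } while (0)
   # define perlinterp_acquire() do { } while (0)
   #endif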

APPENDIX: CASE STUDIES

This appendix contains some case studies on how to patch existing modules. Unless they are available on CPAN, the patched modules (including diffs) can be found at the perl multicore repository (see the perl multicore registry).

In addition to the patches shown, the perlmulticore.h header must be added to the module and included in any XS or C file that uses it.

Case Study: Digest::MD5

The Digest::MD5 module presents some unique challenges because it mixes Perl I/O and CPU-based processing.

So first let's identify the easy cases - set-up (in new) and calculating the final digest are very fast operations that would be unlikely to profit from running in a separate thread. That leaves the add method and the md5 (md5_hex, md5_base64) functions.

They are both very easy to update - the MD5Update call doesn't access any perl data structures, so you can slap perlinterp_release/perlinterp_acquire around it:

   if (len > 8000) perlinterp_release ();
   MD5Update(context, data, len);
   if (len > 8000) perlinterp_acquire ();

This works for both add and md5 XS functions. The 8000 is somewhat arbitrary.

This leaves addfile, which would normally be the ideal candidate, because it is often used on large files and needs to wait both for I/O and the CPU. Unfortunately, it is implemented like this (only the inner loop is shown):

   unsigned char buffer[4096];

   while ( (n = PerlIO_read(fh, buffer, sizeof(buffer))) > 0) {
       MD5Update(context, buffer, n);
   }

That is, it uses a 4KB buffer per MD5Update. Putting perlinterp_release/perlinterp_acquire calls around it would be way too inefficient. Ideally, you would want to put them around the whole loop.

Unfortunately, Digest::MD5 uses PerlIO for the actual I/O, and PerlIO is not thread-safe. We can't even use a mutex, as we would have to protect against all other PerlIO calls.

As a compromise, we can use the USE_HEAP_INSTEAD_OF_STACK option that Digest::MD5 provides, which puts the buffer onto the heap, and use a far larger buffer:

   #define USE_HEAP_INSTEAD_OF_STACK

   New(0, buffer, 1024 * 1024, unsigned char);

   /* buffer is a pointer now, so pass the allocated size -
      sizeof (buffer) would only be the size of the pointer */
   while ( (n = PerlIO_read(fh, buffer, 1024 * 1024)) > 0) {
       if (n > 8000) perlinterp_release ();
       MD5Update(context, buffer, n);
       if (n > 8000) perlinterp_acquire ();
   }

This will unfortunately still block on I/O, and allocate a large block of memory, but it is better than nothing.
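
If one were willing to bypass PerlIO entirely (an approach Digest::MD5 does not take, and which would lose PerlIO layers, so it only suits raw binary handles), the whole loop could run with the interpreter released. A hypothetical sketch:

   /* hypothetical alternative: read from the raw fd, so the whole
      loop can run without touching perl */
   int fd = PerlIO_fileno (fh);
   ssize_t n;
   unsigned char buffer[65536];

   perlinterp_release ();
   while ((n = read (fd, buffer, sizeof (buffer))) > 0)
     MD5Update (context, buffer, n);
   perlinterp_acquire ();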

Case Study: DBD::mysql

Another example would be to modify DBD::mysql to allow other threads to execute while executing SQL queries.

The code that needs to be patched is not actually in an .xs file, but in the dbdimp.c file, which is included in an XS file.

While there are many calls, the most important ones are the statement execute calls. There are only two in dbdimp.c, one call in mysql_st_internal_execute41, and one in dbd_st_execute, both calling the undocumented internal mysql_st_internal_execute function.

The difference is that the former is used with mysql 4.1+ and prepared statements.

The call in dbd_st_execute is easy, as it does all the important work and doesn't access any perl data structures (I checked DBIc_NUM_PARAMS manually to make sure):

   perlinterp_release ();
   imp_sth->row_num= mysql_st_internal_execute(
                                               sth,
                                               *statement,
                                               NULL,
                                               DBIc_NUM_PARAMS(imp_sth),
                                               imp_sth->params,
                                               &imp_sth->result,
                                               imp_dbh->pmysql,
                                               imp_sth->use_mysql_use_result
                                              );
   perlinterp_acquire ();

Despite the name, mysql_st_internal_execute41 isn't actually from libmysqlclient, but a long function in dbdimp.c. Here is an abridged version, with perlinterp_release/perlinterp_acquire calls:

     int i;
     enum enum_field_types enum_type;
     dTHX;
     int execute_retval;
     my_ulonglong rows=0;
     D_imp_xxh(sth);

     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                     "\t-> mysql_st_internal_execute41\n");

     perlinterp_release ();

     if (num_params > 0 && !(*has_been_bound))
     {
       if (mysql_stmt_bind_param(stmt,bind))
         goto error;
     }

     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       {
         perlinterp_acquire ();
         PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                       "\t\tmysql_st_internal_execute41 calling mysql_execute with %d num_params\n",
                       num_params);
         perlinterp_release ();
     }

     execute_retval= mysql_stmt_execute(stmt);

     if (execute_retval)
       goto error;

     /*
      This statement does not return a result set (INSERT, UPDATE...)
     */
     if (!(*result= mysql_stmt_result_metadata(stmt)))
     {
       if (mysql_stmt_errno(stmt))
         goto error;

       rows= mysql_stmt_affected_rows(stmt);
     }
     /*
       This statement returns a result set (SELECT...)
     */
     else
     {
       for (i = mysql_stmt_field_count(stmt) - 1; i >=0; --i) {
           enum_type = mysql_to_perl_type(stmt->fields[i].type);
           if (enum_type != MYSQL_TYPE_DOUBLE && enum_type != MYSQL_TYPE_LONG)
           {
               /* mysql_stmt_store_result to update MYSQL_FIELD->max_length */
               my_bool on = 1;
               mysql_stmt_attr_set(stmt, STMT_ATTR_UPDATE_MAX_LENGTH, &on);
               break;
           }
       }
       /* Get the total rows affected and return */
       if (mysql_stmt_store_result(stmt))
         goto error;
       else
         rows= mysql_stmt_num_rows(stmt);
     }
     perlinterp_acquire ();
     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                     "\t<- mysql_internal_execute_41 returning %d rows\n",
                     (int) rows);
     return(rows);

   error:
     if (*result)
     {
       mysql_free_result(*result);
       *result= 0;
     }
     perlinterp_acquire ();
     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                     "     errno %d err message %s\n",
                     mysql_stmt_errno(stmt),
                     mysql_stmt_error(stmt));

So perlinterp_release is called after some logging, and perlinterp_acquire only after all mysql calls are done - in the error path, only after the mysql_free_result call.

To make things more interesting, the function has multiple calls to PerlIO to log things, none of which are thread-safe, and all of which need to be surrounded with perlinterp_acquire and perlinterp_release calls to temporarily re-acquire the interpreter. This is slow, but logging is normally off:

     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       {
         perlinterp_acquire ();
         PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                       "\t\tmysql_st_internal_execute41 calling mysql_execute with %d num_params\n",
                       num_params);
         perlinterp_release ();
       }

The function also has a separate error exit, and each exit path needs its own perlinterp_acquire call. First the normal function exit:

     perlinterp_acquire ();
     if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
       PerlIO_printf(DBIc_LOGPIO(imp_xxh),
                     "\t<- mysql_internal_execute_41 returning %d rows\n",
                     (int) rows);
     return(rows);

And this is the error exit:

   error:
     if (*result)
     {
       mysql_free_result(*result);
       *result= 0;
     }
     perlinterp_acquire ();

This is enough to run DBI's execute calls in separate threads.

Interlude: the various DBD::mysql async mechanisms

Here is a short discussion of the four principal ways to run DBD::mysql SQL queries asynchronously.

in a separate process

Both AnyEvent::DBI and DBD::Gofer (via DBD::Gofer::Transport::corostream) can run DBI calls in a separate process, and this is not limited to mysql. This has to be paid for with more complex management, some limitations in what can be done, and an extra serialisation/deserialisation step for all data.

DBD::mysql's async support

This lets you execute the SQL query while waiting for the results via an event loop or similar mechanism. This is reasonably fast and very compatible, but the disadvantages are that DBD::mysql requires undocumented internal functions to do this, and, more importantly, that this only covers the actual execution phase, not the data transfer phase: for statements with large results, the program blocks until all of it is transferred, which can include large amounts of disk I/O.

Coro::Mysql

This module actually works quite similarly to perl multicore, but uses Coro threads exclusively. It shares the advantages of DBD::mysql's async mode but, at least in theory, not its disadvantages. In practice, the mechanism it uses isn't undocumented, but distributions often don't come with the correct header file needed to use it, and Oracle's MySQL has broken this mechanism multiple times (MariaDB supports it), so it's actually less reliably available than DBD::mysql's async mode or perl multicore.

It also requires Coro.

perl multicore

This method has all the advantages of Coro::Mysql without most of the disadvantages, except that it incurs higher overhead due to the extra thread switching.

Pick your poison.

SEE ALSO

This document's canonical web address: http://perlmulticore.schmorp.de/

The header file you need in your XS module: http://perlmulticore.schmorp.de/perlmulticore.h

Status of CPAN modules, and pre-patched module tarballs: http://perlmulticore.schmorp.de/registry

AUTHOR

   Marc A. Lehmann <perlmulticore@schmorp.de>
   http://perlmulticore.schmorp.de/

LICENSE

The perlmulticore.h header file itself is in the public domain. Where this is legally not possible, or at your option, it can be licensed under the Creative Commons CC0 license: https://creativecommons.org/publicdomain/zero/1.0/.

This document is licensed under the GNU General Public License, version 3, or any later version.