
# Parallelization with R (High Performance Computing, BigData)

The exponential growth of data requires efficient means of analysing it.

More and more, analytical computation requires the installation of clusters running R together with technologies such as MPI or Hadoop, in order to distribute the computation and the data.

Databases of hundreds of Gigabytes (and even Terabytes) are now common (1 TB = 1,000 GB = 1 million MB = 10^12 bytes).

For instance, the number of Facebook users is approximately 1 billion. If we assume that each user writes 10 articles of 1,000 characters each, the data weigh 10,000 GB = 10 TB.

The number of transactions on financial markets is in the tens of billions per day. Multiply that number by at least 100 to get an idea of the quantity of data involved if you want to analyse the whole order book (to build a "high frequency" strategy, for instance).

In biology, the DNA of the 23 chromosomes weighs a bit less than 1 TB.

**High Performance Computing and BigData**

"BigData" is the trending buzzword.

It is not possible to perform statistical analysis on databases weighing Terabytes on a single PC. You will need a distributed platform.

At least three High Performance Computing (HPC) solutions can be implemented with R.

**Rmpi**

Rmpi is based on MPI (Message Passing Interface), a long-established standard for distributed computing.
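As a rough illustration, here is a minimal Rmpi sketch, assuming a working MPI installation (e.g. Open MPI) on the machines; the number of workers is arbitrary:

```r
library(Rmpi)

# Spawn 4 R worker processes through MPI
mpi.spawn.Rslaves(nslaves = 4)

# Distribute a simple computation across the workers
results <- mpi.parSapply(1:100, function(i) i^2)

# Shut the workers down and leave the MPI environment
mpi.close.Rslaves()
mpi.quit()
```

The pattern is the same for real workloads: replace the squaring function with your per-chunk computation.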

**Rserve**

Rserve turns R into a server that can be installed on several machines in order to form a cluster. It is not by itself a parallelization solution: you will have to distribute the computation yourself.
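A minimal sketch of this "do it yourself" approach, using the RSclient package on the client side; the hostname is hypothetical:

```r
# On each machine of the cluster, start the server:
library(Rserve)
Rserve()   # listens on TCP port 6311 by default

# On the client, connect to a node and evaluate an expression remotely:
library(RSclient)
conn <- RS.connect(host = "node1.example.com")
partial <- RS.eval(conn, sum(rnorm(1e6)))   # runs on node1, result comes back
RS.close(conn)
```

To parallelize a real job, you would open one connection per node, send each node its share of the work, then aggregate the partial results yourself.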

**Rhadoop**

Hadoop is more recent and uses the Map/Reduce mechanism to distribute the computations. It is likely to become the standard for HPC.
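A word-count-style sketch with the rmr2 package from the RHadoop project, assuming a configured Hadoop cluster (the exact API may vary with the rmr2 version):

```r
library(rmr2)

# Write some toy (key, 1) pairs to HDFS
input <- to.dfs(keyval(sample(1:10, 1000, replace = TRUE), 1))

# Map emits the pairs unchanged; reduce sums the ones per key
result <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(k, v),
  reduce = function(k, counts) keyval(k, sum(counts))
)

from.dfs(result)   # collect the (key, count) pairs back to the client
```

The map step runs on the nodes that host the data, which is exactly the HDFS principle discussed below.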

MATLAB also includes distributed computing functionality: the Parallel Computing Toolbox can be used to distribute computations on a cluster of machines, with minimal coding effort.

However, MATLAB does not distribute the data itself by default.

**What should I use?**

It depends on the type of calculation, the quantity of data, and the way you access it (binary files? a relational database? NoSQL?). Is your problem easy to parallelize? How much RAM do you need?

Very often, the bottleneck is not the CPU but rather the data access.

This problem is cleverly solved by Hadoop's HDFS. Instead of having parallel CPUs pulling data from a single database, with the data transiting over the network to be processed, Hadoop takes the opposite approach: it sends the code responsible for the analysis to the element of the cluster that hosts the data. No data is transmitted over the network.

This requires that the problem can be sliced into independent computations working on independent data. The computation on one node of the Hadoop cluster must depend only on the data present on that node.

Here is an example of a class of problems that is easy to parallelize on a Hadoop cluster.

You have a database with thousands of clients, and for each client you have millions of data points (logs, website navigation, phone calls, for instance). You want to compute some statistics on the behaviour of your clients.

This is easy to parallelize:

1/ For each client, do the computation on the data of that client. This is the computation-intensive step.

2/ Compute an average over all clients (or a histogram, or a pie chart...) that summarizes the statistics.

In order to solve this problem, you only need a distributed database where each node contains a few clients, with all the data for those clients. Each element of the cluster can process its own clients.
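The two steps above can be sketched on a single multicore machine with the base `parallel` package; `client_data` is a hypothetical list with one element per client (note that `mclapply` forks and therefore does not run in parallel on Windows, where `parLapply` is the alternative):

```r
library(parallel)

# Toy data: 100 clients, each with a vector of observations
client_data <- split(rnorm(1e5), sample(1:100, 1e5, replace = TRUE))

# 1/ the computation-intensive step: one statistic per client, in parallel
per_client <- mclapply(client_data, mean, mc.cores = 4)

# 2/ the cheap aggregation step: summarize across clients
overall <- mean(unlist(per_client))
```

On a Hadoop cluster the structure is identical: step 1 is the map, step 2 the reduce.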

Another example comes from High Frequency Trading in the finance industry (a market-maker, for instance).

You have a database which contains all the daily transactions for a thousand stocks. For each stock, you have one year of history, with 10 million transactions.

Your problem is: for each stock, you want to simulate an investment strategy.

This problem is once again "embarrassingly parallel": the database can be distributed across the instruments. Each element of the Hadoop HDFS cluster will contain a few stocks, with all the transactions for those stocks, and the investment strategy for those stocks can be simulated on each node of the cluster.
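A sketch of this per-stock distribution, on a socket cluster from the base `parallel` package; the data and the `backtest` function are hypothetical toys:

```r
library(parallel)

# Toy data: 10 stocks with 1,000 transactions each, keyed by stock
transactions <- split(data.frame(price = rnorm(1e4, mean = 100)),
                      rep(paste0("STOCK", 1:10), each = 1e3))

# Hypothetical strategy: count the upticks as a crude P&L proxy
backtest <- function(trades) sum(diff(trades$price) > 0)

cl <- makeCluster(4)
pnl <- parLapply(cl, transactions, backtest)  # one simulation per stock
stopCluster(cl)
```

On Hadoop, each node would run `backtest` on the stocks whose transactions it hosts.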

There are other classes of "embarrassingly parallel" problems (Monte Carlo simulation, option pricing, etc.).

But suppose your strategy cannot be computed stock by stock, and that on the contrary you need all the data in order to construct your portfolio. This is generally the case: if you have bought a few stocks and are fully invested at time t (no cash left), you cannot buy more stocks at time t+1. The decisions made on one stock influence the decisions you make on other stocks.

You can therefore no longer distribute the computation per stock on the Hadoop cluster.

In a case like this one, you can look at the problem from another point of view: rather than distributing the computation per stock, distribute it per trading day. Each node of the Hadoop cluster will contain ALL the transactions for ALL the stocks, but only over a short period of time: one day, for instance. It is then possible to compute the whole portfolio, with all the buy/sell signals, for that day on that node of the cluster.
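Re-keying the data this way is just a change of the split variable; `all_trades` below is a hypothetical data frame with `date`, `stock` and `price` columns:

```r
# Toy data: 5 trading days, 10 stocks, one row per transaction
all_trades <- data.frame(
  date  = rep(seq(as.Date("2013-01-01"), by = "day", length.out = 5), each = 200),
  stock = sample(paste0("STOCK", 1:10), 1000, replace = TRUE),
  price = rnorm(1000, mean = 100)
)

# Split by stock: one chunk per instrument (the first distribution scheme)
by_stock <- split(all_trades, all_trades$stock)

# Split by day: each chunk holds ALL stocks for ONE day, and can live
# on a single node of the cluster (the second scheme)
by_day <- split(all_trades, all_trades$date)
```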

As this example shows, it is sometimes possible to make a computation easy to parallelize by looking at it from another point of view.

**Do I really need it?**

Not necessarily. If your main concern is CPU time, it can often be addressed by optimizing the code (after a profiling session, for instance; both R and MATLAB have profiling tools).

Avoid loops. If you need some, move out of them any computation that can be pre-calculated. Use vectorized programming in R and MATLAB.
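A small illustration of vectorization in R; both versions compute the same sum of squares, but the second avoids the explicit loop:

```r
x <- rnorm(1e6)

# Loop version
s <- 0
for (v in x) {
  s <- s + v^2
}

# Vectorized version: same result, one expression, typically much faster
s2 <- sum(x^2)

all.equal(s, s2)   # the two results agree
```

In R, the difference can be measured with `system.time()` around each version; in MATLAB, the profiler plays the same role.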

We can help you optimize the performance of your analyses, in terms of both CPU and RAM.

If your software is already (at least partially) optimized and its core can be parallelized, then installing Hadoop will make your application fully scalable.

Use the "Contact" link below for more information.