Big Data: big engineering

Financial markets produce several hundred gigabytes of data per day. As a consequence, a few years of historical price records weigh tens of terabytes.
Even if you do not analyse the complete dataset, such volumes pose technical challenges.
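To make that order of magnitude concrete, here is a back-of-the-envelope sizing in R (the daily volume, number of years and compression ratio below are illustrative assumptions, not measurements):

```r
# Back-of-the-envelope sizing of a historical price archive.
# All figures below are illustrative assumptions, not measurements.
gb_per_day   <- 300      # assumed raw feed volume per trading day, in GB
trading_days <- 252      # trading days per year
years        <- 5        # depth of history kept

raw_tb <- gb_per_day * trading_days * years / 1024
cat(sprintf("Raw archive: ~%.0f TB\n", raw_tb))             # ~369 TB

# Even with an optimistic 10:1 compression ratio, tens of TB remain.
cat(sprintf("Compressed (10:1): ~%.0f TB\n", raw_tb / 10))  # ~37 TB
```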

Efficient storage and compression are necessary, but so are fast ways to access the data.
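As a minimal sketch of that trade-off, assuming a synthetic price table and plain gzip compression (the file names are made up for illustration), R can write and read compressed files directly, trading CPU time on access for a much smaller footprint on disk:

```r
# Write a small synthetic price table both uncompressed and gzip-compressed,
# then compare file sizes. The exact ratio depends entirely on the data.
prices <- data.frame(
  timestamp = as.POSIXct("2014-01-02 09:00:00") + 0:99999,
  price     = cumsum(rnorm(100000, sd = 0.01)) + 100
)

write.csv(prices, "prices.csv", row.names = FALSE)

gz <- gzfile("prices.csv.gz", "w")
write.csv(prices, gz, row.names = FALSE)
close(gz)

file.size(c("prices.csv", "prices.csv.gz"))   # the compressed file is much smaller

# Reading back is transparent: read.csv() handles the gzipped file directly.
head(read.csv("prices.csv.gz"))
```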

Hadoop (HDFS) is an inexpensive distributed file system which can be set up on commodity hardware.
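For a feel of what day-to-day interaction with HDFS can look like from R, here is a minimal sketch that simply shells out to the standard hdfs command-line tool (it assumes a configured Hadoop client, and the paths and file names are hypothetical):

```r
# Minimal interaction with HDFS from R via the standard `hdfs dfs` CLI.
# Assumes a configured Hadoop client; the paths below are illustrative.

# Copy a local (compressed) price file into the distributed file system.
system2("hdfs", c("dfs", "-put", "prices.csv.gz", "/data/prices/"))

# List what is stored for that dataset.
system2("hdfs", c("dfs", "-ls", "/data/prices"))

# Stream a file back and read it straight into R, without a local copy.
con    <- pipe("hdfs dfs -cat /data/prices/prices.csv.gz | gunzip")
prices <- read.csv(con)
```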

But what kind? How much RAM? CPU? Hard-drive response time? Which one is going to be your limiting factor?
What network configuration? What kind of network? Network topology? Fiber or Ethernet?
Which Linux distribution? Debian, CentOS?
Which Hadoop distribution? Apache, Cloudera, MapR?
What cluster configuration? How do you administer it? CM? ZooKeeper?
Which software should you install? Hive? HBase? Pig?
Which tools do you have to build in order to make the most of your data efficiently (Linux scripts, Java, R, C++, the HBase query language, or a mixture of them)?

... and finally, how does your computation scale? If you double your IT infrastructure, you should be able to analyse twice as much data (or the same amount in half the time). Developers have to keep this constantly in mind while building the software.
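One way to keep scaling in sight during development is to measure it routinely. Below is a minimal sketch using base R's parallel package; the per-day task is a placeholder for a real calculation, and the worker counts are illustrative:

```r
library(parallel)

# Stand-in for a per-day analysis task (e.g. computing daily statistics).
analyse_day <- function(day) {
  x <- rnorm(1e6)            # placeholder workload, not a real calculation
  c(mean = mean(x), sd = sd(x))
}

days <- 1:64                 # pretend there are 64 days of data to process

# Time the same job on 1, 2 and 4 workers: ideally, doubling the workers
# should roughly halve the elapsed time. If it does not, profile why.
for (workers in c(1, 2, 4)) {
  cl <- makeCluster(workers)
  elapsed <- system.time(parLapply(cl, days, analyse_day))["elapsed"]
  stopCluster(cl)
  cat(sprintf("%d worker(s): %.1f s\n", workers, elapsed))
}
```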

There is no easy answer. It depends on your data and on the kind of analysis you want to perform on it. Different problems, different bottlenecks. Do not believe the hype, and do not rush into buying a supposedly off-the-shelf, turnkey solution. Hadoop technologies are powerful, but they have to be tuned to your specific needs.

In order to save time and money, a thorough analysis and a proof-of-concept prototype are needed before the full-scale solution can be deployed smoothly.