R in your organization


Big Data, what is it?

Are you a CIO? You have heard of Big Data, but you are unsure what it can bring you. This article is here to help.

Your databases are too large for a single disk: you need a distributed database, spread over several computers (a cluster). This is where Big Data starts.

A few examples:

- Telecom companies, and firms with millions of clients or millions of items to sell (Amazon, eBay), doing large-scale customer analytics.

- bank fraud detection (modelling "normal" behaviour and detecting unusual deviations from it)

- web analytics (analysis of server logs), in order to understand visitor behaviour and target visitors better.

- genetics/biotechnologies (DNA, microarray data)

- huge science experiments: CERN (the particle-accelerator laboratory in Geneva that proved the existence of the Higgs boson) collects on the order of 15 petabytes of data a year.

- Social networks with hundreds of millions of users (Twitter, LinkedIn, Facebook)

- Web search engines (Google, Yahoo): web-page classification, PageRank computation, and web crawlers that find and analyse information on millions of web pages.

- High-frequency data (financial markets)

In terms of size, Big Data databases are too large to fit on a single disk (roughly, larger than 2 TB).

One of the immediate problems with such databases is the difficulty of performing JOIN operations between tables. The concepts are different from those of classical SQL databases.

Data is not stored following a classical entity-relationship model (tables with primary/foreign keys) but rather under a document model.

Instead of a "Client" table and an "Address" table with a foreign key from "Address" to "Client" (a client can have multiple addresses), the document model stores a Client document that contains its addresses directly. There are no tables anymore, only records identified by a primary key (the client id) with all their attributes (addresses, etc.) stored alongside.

These are the so-called NoSQL (Not only SQL) technologies.
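As a rough illustration of that document model (the field names below are invented), here is what such a client document could look like when built in R as a nested list and serialized to JSON, the natural format of document stores such as MongoDB or CouchDB:

```r
# A minimal sketch (hypothetical field names): one "Client" document embedding
# its addresses, instead of a Client table + Address table linked by a foreign key.
library(jsonlite)

client <- list(
  client_id = 2156,
  name      = "ACME Corp",
  addresses = list(                      # embedded directly, no foreign key needed
    list(type = "billing",  city = "Paris", zip = "75001"),
    list(type = "shipping", city = "Lyon",  zip = "69002")
  )
)

# Serialize to JSON, the format document stores work with
cat(toJSON(client, auto_unbox = TRUE, pretty = TRUE))
```

Reading the client back returns its addresses in the same operation, which is why JOINs largely disappear.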

Because data is stored differently, we need new ways of retrieving and analysing it: there are no SQL queries anymore.

Some use cases of data analytics with big databases:

- pure storage: you access the data as you would in a library: "please give me book number 2156", "please give me the book with this exact title". This case is easy to solve: no computation is needed and there is no bulk data transfer through the network; you receive only the book you asked for, and finding it is easy: just look it up in the index of all books.

- intermediate cases: web-page searches by search engines, or searches in distributed networks (Kademlia/eMule). All the computation on the keywords was done while building the indexes. Building the indexes is computationally intensive, but once that is done, the user searches themselves require no further computation, just intersections of keyword sets (a toy sketch of this appears a little further down).

- a database that you can query in unplanned ways. Example: what is the frequency, per year, of the word "computer" in all the books of the library since 1920? This query is very different: you need to fetch every book, read it completely, count the word, then sum the counts per year. With a single computer doing the work, that computer will request the books sequentially (one after the other), then count the words, and so on. The whole library would be transmitted over the network to produce the answer, and one computer would compute everything: it would be dreadfully slow. Therefore, new technologies have emerged:

MapReduce, Hive, Pig Latin

Since the database is distributed, it makes sense to distribute the computation as well. Ideally, each computation should run on the computer that already holds the corresponding data, so as to avoid network transfers: this yields a huge gain in time.

Drawback: these queries have to be programmed in brand-new languages: MapReduce, Hive, Pig Latin.
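Going back to the "intermediate cases" above, here is the promised toy sketch in plain R (the documents are invented): the inverted index is built once, and a query is then answered by a simple intersection of keyword sets:

```r
# A toy sketch in plain R (invented documents): the inverted index maps each
# keyword to the ids of the documents containing it. Building it is the
# expensive step; answering a query is just a set intersection.
docs <- c("1" = "big data with r",
          "2" = "r and parallel computation",
          "3" = "big data and nosql databases")

tokens   <- strsplit(docs, " ")
inverted <- tapply(rep(names(docs), lengths(tokens)), unlist(tokens), unique)

# Query "big data": intersect the two posting lists, no further computation
Reduce(intersect, inverted[c("big", "data")])
#> [1] "1" "3"
```

And as a hedged sketch of what the per-year word count could look like with the rmr2 package from the RHadoop project (the toy data, field layout and backend setting are assumptions, not code from this article):

```r
# A hedged sketch with the rmr2 package (RHadoop project); data and field
# layout are invented. Each input record is a (year, text of one page) pair.
library(rmr2)
# rmr.options(backend = "local")   # handy for testing without a Hadoop cluster

# Toy input pushed to the distributed file system (real books would already be in HDFS)
books <- to.dfs(keyval(c(1920, 1920, 1921),
                       c("the computer age", "no match here", "computer computer")))

counts <- mapreduce(
  input  = books,
  # map: runs where the data lives, emits (year, occurrences of "computer" in that page)
  map    = function(year, text) {
    keyval(year, sapply(strsplit(text, " "), function(w) sum(w == "computer")))
  },
  # reduce: sums the partial counts for each year
  reduce = function(year, n) keyval(year, sum(n))
)

from.dfs(counts)   # a list with $key = years and $val = counts per year
```

The map step runs on the nodes that already hold the books, and only the small (year, count) pairs travel over the network, which is exactly the gain described above.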

In terms of technologies (databases and R packages), we have:

- Hadoop/HDFS, with the corresponding R packages from the RHadoop project (rmr, rhdfs)

- MongoDB (document store, JSON-oriented): rmongodb

- CouchDB (document store): R4CouchDB

- Cassandra (key-value): RCassandra

- HBase (key-value): rhbase

- Redis (key-value): rredis (see the short access sketch after this list)

- Neo4j: a graph database (accessible from R through its REST API), useful if your data consists of relationships between entities.
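As a quick, hedged illustration of how one of these stores is accessed from R, here is a minimal rredis sketch; it assumes a Redis server is running locally on the default port, and the key and value are made up:

```r
# A minimal sketch with the rredis package; assumes a Redis server is reachable
# on localhost with the default port. Key and value are invented.
library(rredis)

redisConnect()                          # connect to the local Redis server
redisSet("client:2156", list(name = "ACME Corp", city = "Paris"))
redisGet("client:2156")                 # retrieve the stored R object by its key
redisClose()
```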

Remark: these databases generally run under Linux.

The choice of a database depends on your data type (key-value, document or graph model) and on the intended usage (real time? replication? fault tolerance? consistency?).

[Figure: a Google Trends comparison of these technologies]

See also this comparison of NoSQL databases

See also our article on R and parallel computation

Any relationship with other buzzwords?

- Cloud computing, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
=> No direct relationship. That said, setting up a distributed database cluster has a cost (infrastructure, maintenance...); you may prefer to host it with a cloud-services provider (Amazon EC2, for instance).

- Scalability: NoSQL databases are scalable by nature. If you are worried about the response time of your IT infrastructure, NoSQL databases may be a solution.

- Virtualization (VMware...): no direct relationship. On the contrary, running a NoSQL cluster on virtual machines does not seem worthwhile; it can only degrade performance.

Do I really need NoSQL?

If you do not know, you probably do not need it... today.

But are you making the most of your client data? Of your connection logs? Are your marketing teams aware of the statistics that could be computed from these data?

Do I need to migrate my current relational databases to a distributed (Big Data/NoSQL) database?

In most cases, you do not need to migrate everything. Your databases (Oracle, SQL Server, Sybase) contain two different types of tables:

- reference tables (code/name tables, statuses...) with a "few" records (a few thousand, or tens of thousands)

- "Big" tables : records that are linked to the activity of your organization which grow a bit more every day. (Invoices, connexion logs, etc.). They often have a temporal dimension (a "date’" column in the table).

It is enough to migrate only these big tables to a distributed database and keep a small classical SQL database for the reference tables.
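As a hedged sketch of the resulting hybrid setup (table and column names are invented, and the output of the distributed job is stubbed as a plain data frame): the small reference table stays in a classical SQL database, while the big table is reduced on the cluster and the two are joined in R afterwards:

```r
# A hedged sketch (invented table/column names): the reference table stays in a
# small classical SQL database; the big table lives on the cluster and comes back
# already aggregated, so the final join is cheap and done locally in R.
library(DBI)
library(RSQLite)

# Small classical SQL database holding the reference table
# (in-memory here so the sketch runs as-is; in practice your existing Oracle/SQL Server)
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "ref_status",
             data.frame(status_code  = c("OK", "KO"),
                        status_label = c("Paid", "Unpaid")))
statuses <- dbGetQuery(con, "SELECT status_code, status_label FROM ref_status")
dbDisconnect(con)

# Aggregated output of the distributed job over the big table (e.g. invoices);
# stubbed here as a plain data frame
invoice_totals <- data.frame(status_code = c("OK", "KO"),
                             total       = c(125000, 8300))

# Local join: cheap, because both sides are now small
merge(invoice_totals, statuses, by = "status_code")
```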