Is Blockchain The Future Of Big Data and Machine Learning?

Oct 27, 2018

The popularity of blockchain has exploded recently, driven by the ICO craze, but not every project is worth the time and money invested in it. Let's take stock of the power of this technology, and of everything useful that could be built with it, before crypto trading kills it.

Quick definition
For the uninitiated (you are fewer and fewer), a reminder of the key concepts of blockchain and smart contracts:

  • Blockchain: the basic principle is simple. A first block of data initializes the chain. This block is broadcast to a network of nodes (servers) that handle writing data into the chain. Each subsequent block contains the hash (a reference) of the previous block, along with other data (anything, as long as it respects the maximum block size). The read/write servers are often called validators; under the consensus mechanism used by most chains, Proof of Work, they are usually paid for their work, i.e. solving a puzzle that proves the transaction has been validated on the network (see the sketch after this list).
  • Smart contracts: smart contracts are an evolution of the previous system. Each transaction is associated with a piece of code that typically runs when the data is written. This code can do many different things; it is relatively open-ended. For example, it can check the consistency of the write against other blocks, or call an API. It is common, for instance, to check that an exchange has taken place before triggering a payment.
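
To make this chaining concrete, here is a minimal, purely illustrative Python sketch with a toy Proof of Work; the names (`Block`, `mine`) and the difficulty setting are ours, not taken from any real chain:

```python
import hashlib
import json
import time

def sha256(payload: str) -> str:
    """Hash helper: computes the 'reference' each block keeps to its predecessor."""
    return hashlib.sha256(payload.encode()).hexdigest()

class Block:
    def __init__(self, index: int, prev_hash: str, data):
        self.index = index
        self.prev_hash = prev_hash   # hash of the previous block: this is the chaining
        self.data = data             # anything, as long as it respects the block size limit
        self.timestamp = time.time()
        self.nonce = 0               # incremented by the proof of work below

    def hash(self) -> str:
        body = json.dumps([self.index, self.prev_hash, self.data,
                           self.timestamp, self.nonce])
        return sha256(body)

def mine(block: Block, difficulty: int = 4) -> Block:
    """Toy Proof of Work: find a nonce so the block hash starts with N zeros.
    Solving this kind of puzzle is the 'work' validators are paid for."""
    while not block.hash().startswith("0" * difficulty):
        block.nonce += 1
    return block

# A genesis block initializes the chain; each new block references the last hash.
chain = [mine(Block(0, "0" * 64, "genesis"))]
chain.append(mine(Block(1, chain[-1].hash(), {"from": "alice", "to": "bob", "amount": 5})))
print(chain[1].prev_hash == chain[0].hash())  # True: altering block 0 would break the link
```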

The power of the chain against traditional databases
A few projects really exploit the essence of the blockchain: a decentralized database, append-only and historized by nature, unalterable. There is no way to modify records or corrupt them.

Because it is decentralized, there is no need to worry about replication or backups: all the nodes keep the chain alive. Everyone can easily access the data, and everyone can contribute.

Smart contracts and other blockchain technologies make it easy to implement more complex read and write validation processes, including events that can be triggered when data is added to the chain.

Data lineage emerges naturally from the sequence of blocks, which makes it easy to trace a product or an incident, for example. A block can even reference another block that is not its immediate predecessor but is linked to it in business terms, weaving a more complex mesh (sketched below).
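
As a sketch of that idea, a block could carry business references in addition to the structural one; the `related` field below is an invented name for illustration:

```python
# Hypothetical sketch: besides its predecessor, a block references earlier
# blocks it is linked to in business terms (the `related` field is invented).
incident_block = {
    "prev_hash": "ab12...",             # structural link: the previous block
    "related": ["8f3c...", "d410..."],  # business links: e.g. the product's
                                        # manufacturing and shipping blocks
    "data": {"event": "incident", "product_id": "SKU-42"},
}
# Traversing `related` hashes, rather than only `prev_hash`, turns the linear
# chain into a mesh that can be followed per product or per incident.
```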

Consider, for example, the decentralization of banks with a blockchain: the central bank disappears, because it is no longer necessary.


Less effort for open data?
Historically, businesses were cautious about distributing their public data. Today, with the advent of machine learning, open data is the new goal in many sectors, especially in the public sphere with transport and health, driven by initiatives like those of President Macron in France.

But how much effort and expense it takes to get there! Companies everywhere are striving to extract their historical data into custom open data portals, developing expensive new APIs that must be maintained, documented, and so on. At a global scale it is a titanic undertaking that costs money and only complicates data exchange and retrieval.

Why not use a blockchain to disseminate this data? A frequently cited use case is healthcare. Streaming this data through a blockchain would have many advantages (a sketch of one possible record format follows the list):

  • Follow-up of a patient between different institutions around the world
  • Standardization of medical data formats
  • History of a person's illness
  • An unalterable medical record
  • Access for family members to follow a loved one's condition
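
As a purely hypothetical sketch (the field names are ours, not an existing medical standard), one such record entry might look like this:

```python
# Purely illustrative: one possible shape for a standardized medical entry
# written to a shared chain. Field names are invented, not an existing standard.
medical_entry = {
    "patient_id": "anon-7f9e",          # pseudonymous ID, never the real identity
    "institution": "hospital-lyon-03",
    "record": {"type": "diagnosis", "code": "ICD-10:J45", "date": "2018-10-01"},
    "access": ["patient", "family-delegate", "treating-physician"],
}
# Appended as blocks, such entries provide the unalterable history and the
# cross-institution follow-up listed above.
```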

The possibilities are almost endless, and it is obvious that at such a scale an organization like this cannot exist without decentralization: the number of actors involved makes a centralized implementation impossible.

Towards a hyper-connected world
What if the whole world wrote to the blockchain? All IoT objects? What if every country provided nodes for a gigantic network? We would then have access to a huge worldwide data lake!

Megalomaniacal and utopian, you say? Let's be honest: today every major country has intelligence services that collect all the data circulating on the internet. The Pentagon even benefits from Google's help to process its drone videos. Between that, the daily server hacks, the recent Facebook and Cambridge Analytica scandal, and so on, the real utopia is believing that our data can still be protected. That time is over. Unless you live in a cave without a smartphone or internet access, you exist, or will exist, on the internet whether you like it or not.

So why not offer most of this data freely? It would of course be anonymized. Data is today's gold for machine learning, and everyone is looking for new use cases. If a large part of this data became public, there would be far fewer attempts to harvest it illegally. Even anonymized, this data would be of enormous value, especially if it were formatted within the same blockchain and available worldwide.

Cost limits and storage capacity
Currently, the real problem for this kind of global system is cost. As explained very well in this article, injecting large amounts of data into an Ethereum-type public blockchain costs a fortune, because of the transaction fees to settle and the indefinite storage of the data.
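
A back-of-envelope estimate illustrates the scale of the problem. On Ethereum, persisting a new 32-byte storage word costs on the order of 20,000 gas; the gas price and ETH price below are illustrative assumptions, not current market values:

```python
# Rough, illustrative cost of storing raw data on an Ethereum-like public chain.
# Assumptions: 20,000 gas per new 32-byte storage word (the SSTORE cost),
# a gas price of 10 gwei, and an ETH price of 200 USD (both hypothetical).
GAS_PER_WORD = 20_000
GWEI_PER_GAS = 10
USD_PER_ETH = 200

def storage_cost_usd(n_bytes: int) -> float:
    words = -(-n_bytes // 32)        # ceiling division into 32-byte words
    gas = words * GAS_PER_WORD
    eth = gas * GWEI_PER_GAS * 1e-9  # 1 gwei = 1e-9 ETH
    return eth * USD_PER_ETH

print(f"Storing 1 MB: ${storage_cost_usd(1_000_000):,.0f}")  # about $1,250
```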

Using a blockchain like Hyperledger solves part of the problem thanks to its PBFT consensus method, which means there are no fees to pay to write to the chain. The nodes write to the chain via a gossip-based protocol, which also avoids wasteful server computation.

The problem of storage remains: over time the chain keeps growing, requiring node machines capable of handling this history. The risk for the network is ending up with only a few nodes that have this capacity, the other nodes holding only the part of the chain needed to validate recent transactions. We would then fall back into a system that is certainly decentralized, but no longer distributed.

Fortunately, different techniques are under development or already exist to overcome this type of problem:

  • State tree pruning: this technique consists of keeping, on ordinary nodes, only the most recent blocks needed for the network to operate. Blocks that are unused or too old are dropped from the history of ordinary nodes; only certain master nodes retain the full history (see the sketch after this list).
  • Sharding: the principle of sharding is to split the history into several subsets. Nodes can then be dedicated to one subset, requiring less storage space. There are still master nodes, but their role is to check the consistency of the whole, no longer to manage every transaction and its contents.
  • Channels: in Hyperledger, channels separate transactions by topic as soon as they are created. Nodes do not need to manage every channel, only those they subscribe to, which automatically reduces the size of the chain each node must handle.
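
Here is a minimal sketch of the pruning idea, assuming blocks are simple dictionaries with a `hash` field; the function and threshold are illustrative, not any chain's actual implementation:

```python
# Minimal pruning sketch for an ordinary (light) node: drop old block bodies
# but keep their hashes, so the chain of references stays verifiable.
def prune(chain: list, keep_last: int = 1000) -> list:
    """Keep full blocks only for the recent tail; master/archive nodes
    retain the complete history that ordinary nodes discard."""
    cutoff = max(0, len(chain) - keep_last)
    pruned = []
    for i, block in enumerate(chain):
        if i < cutoff:
            pruned.append({"hash": block["hash"]})  # header-only stub
        else:
            pruned.append(block)                    # full recent block
    return pruned
```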

Watch out for abuses
To be honest, there are many serious issues with this kind of open global network, but they are more about regulating uses than restricting the data. An obvious example is the potential for abuse in the insurance and banking professions. Training systems to detect whether an individual will be able to repay a loan based on their state of health, for example, would be totally immoral. Unfortunately, this type of practice already exists, and locking down the data will not prevent it: only strict regulation and total transparency from these organizations will make it possible to avoid such abuses.

The fear of open data in companies
Business is not just about data, it is about service. And today, a quality service that presents information in a unified way is what users expect. Why limit yourself to displaying your own bus company's routes when you can add public transport and competitors? If your app is the best, it will be used, and that is where your customers will be. Thinking that controlling your data gives you a monopoly on your business is a mistake: it will only encourage scrapers to pollute your website's traffic with bots.

Citymapper now offers its own bus service. Originally, Citymapper had no data of its own; it was merely an aggregator and route provider. A new business was created.

Google aggregates all transport data in Maps to attract users. In the end, the product is Maps itself, a complete product thanks to Google's work on the data.

The Open Data blockchain for individuals
This is the thorny subject of the moment, especially with the GDPR regulations just around the corner in Europe, the Facebook and Cambridge Analytica scandal, and so on. Certainly, users' data will have to be anonymized. But it will remain a very interesting source of data to cross-reference with other business data.

Another use case for the general public could be the validation of rumor-type information. The nodes of a blockchain validate transactions via various algorithms; one could imagine data being propagated and validated by several users in the same way.

Take the example of a late train. A user requests to enter this data into the chain; the data is not validated until five users confirm it. It could then be cross-checked against the official carrier's passenger information, for example (a minimal sketch follows).
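
A minimal sketch of this confirmation threshold, with the five-user rule from the example (the class and names are ours, and the train details are invented):

```python
# Toy crowd validation: a report enters the chain only once enough
# independent users have confirmed it (five, as in the train example).
CONFIRMATIONS_REQUIRED = 5

class RumorReport:
    def __init__(self, claim: str):
        self.claim = claim
        self.confirmed_by: set[str] = set()

    def confirm(self, user_id: str) -> bool:
        """Record one user's confirmation; return True once validated."""
        self.confirmed_by.add(user_id)  # a set, so the same user cannot vote twice
        return len(self.confirmed_by) >= CONFIRMATIONS_REQUIRED

report = RumorReport("Train 6614 Paris-Lyon delayed 20 minutes")
validated = False
for user in ["u1", "u2", "u3", "u4", "u5"]:
    validated = report.confirm(user)
print(validated)  # True: the report can now be written to the chain
```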

Hashgraph, based on the Practical Byzantine Fault Tolerance protocol, could be an interesting technology for this type of use case.

The key to a master AI?
Imagine an AI connected to this gigantic data lake, where all objects and events could be linked. We could teach it to connect blocks of information that seem unrelated at first and to discover hidden correlations. And who knows, perhaps to predict the future? We would surely see butterfly-effect patterns, where an isolated incident triggers several actions on a larger scale.



Source: HOB