The recent boom in the data industry has raised many questions about fields such as data science, data engineering, big data, artificial intelligence, and machine learning. So, what are these disciplines and what is the hype all about?

Despite all the confusion these new terms have caused, companies did not hesitate to get on board and hire specialists to tackle the problems that big data has introduced. After all, it is critical for every organization to understand its data and to turn it into business value. But what is BIG DATA?

Big data is a term describing data sets that are too large and too complex for traditional data processing. The concept has been around for a couple of years now, but it has not lost its relevance, which is why we cannot simply ignore it. Big data is commonly characterized by three V’s: volume (amount of data), variety (types of data), and velocity (speed at which data is generated and processed). In other words, we are exposed to huge amounts of data, much of it unstructured. Therefore, instead of or in addition to using a relational database for storing and accessing data, one often needs NoSQL databases and technologies such as Hadoop. Additionally, you will hear people talk about two more V’s of big data: veracity (quality and accuracy of data) and value (worth of data). Collecting and analyzing data has no value if we do not turn it into action. Hence, the real benefit lies in the processes you improve, the decisions you make, and the strategies you employ. To accomplish this, businesses need to understand the importance of having data engineers and data scientists on the team.

Data engineering, data science, and even data analytics are terms that are sometimes used interchangeably for domains that “do something with data”. The reason is that these fields are still new and overlap to some extent. This is especially true for small and mid-sized companies, where one or a few individuals do everything related to data, so putting a label on their work is pointless. Larger companies, on the other hand, have been building whole “data teams” and separating these disciplines in order to distribute the workload. Either way, the crucial part is mutual understanding and communication. Hence, for the sake of a better understanding of everyone’s roles and responsibilities, let’s try to define these disciplines and positions:

  • Data engineers define how to collect and organize data, and ensure clean, reliable, and performant access to it. In other words, before data can be used for value creation, it must be prepared, stored in a data warehouse by means of ETL (Extract, Transform, Load) processes, and sent to different analysis tools. Data engineers often need to improve data reliability and quality, i.e. turn big data into smart data, and ensure that the architectures in place (databases and large-scale processing systems) support the requirements of the data scientists and the business.
  • Data analysts or business analysts use programming languages, spreadsheets, and business intelligence tools to describe, categorize, visualize, and present existing data in order to uncover actionable insights into current problems. Sometimes they can also be found under the data science umbrella.
  • Data scientists usually take the input from the data engineer (and data analyst) and analyze it further. They apply sophisticated analytics tools, machine learning, and statistical methods in predictive and prescriptive modeling to ultimately answer industry and business questions. Let’s pause for a second to digest the last sentence.
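As a brief aside on the data engineering role, the ETL idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample data, the column names, the `orders` table, and the in-memory SQLite database standing in for a data warehouse are all hypothetical and not taken from any particular system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1, Alice ,19.99
2,Bob,not_available
3,carol,5.00
"""

def extract(text):
    """Extract: read the raw rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy up names, parse amounts, drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip unreliable records instead of loading them
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)

# Downstream analysis tools can now query clean, reliable data.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(f"Total revenue from valid orders: {total:.2f}")
```

The point of the sketch is the separation of concerns: the analyst or data scientist only ever sees the cleaned `orders` table, not the messy export.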

Data analytics tools include programming languages such as R and Python and their packages, but also commercial tools such as Tableau and Power BI. So far so good.

Machine learning (ML) is usually mentioned in the context of artificial intelligence (AI). Although many people have tried to define AI, no single agreed-upon definition exists. Nevertheless, we can think of AI as any system that can think or act humanly or rationally. Machine learning is often described as a subset of AI and, to keep it simple, I would adopt that view. ML is a data analytics technique that teaches computers to learn from experience.

There are three main techniques used in ML: supervised, unsupervised, and reinforcement learning. A supervised learning algorithm takes a known set of input data (our explanatory or independent variables) and known responses (the dependent variable) and trains a model to generate predictions for the response to new data, i.e. we have both input and output data (labels) to use for modeling. These algorithms include classification (e.g. support vector machines, k-nearest neighbors, …) and regression. Unsupervised learning algorithms only use input data without responses. They are used for finding hidden patterns. The most common technique is clustering (k-means, hidden Markov models, …). Reinforcement learning is about taking suitable actions to maximize reward in a particular situation.
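To make the difference between supervised and unsupervised learning more tangible, here is a minimal sketch in plain Python. It implements a tiny k-nearest neighbors classifier (supervised: it needs labels) and a bare-bones k-means (unsupervised: it only sees the inputs). The data points, the labels "small" and "large", and the function names `knn_predict` and `kmeans` are made up for illustration; in practice one would typically reach for an established ML library instead.

```python
import math
import random

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority vote among the k labeled points nearest to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = [train_labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def kmeans(points, k=2, iterations=10, seed=0):
    """Unsupervised: group unlabeled points into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical toy data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised: inputs AND labels are used to predict the label of a new point.
print(knn_predict(points, labels, (1.2, 1.4)))

# Unsupervised: only the inputs are used; the groups are discovered from the data.
centers, clusters = kmeans(points, k=2)
print(clusters)
```

Note that `knn_predict` cannot work without `labels`, whereas `kmeans` never sees them. That is the whole distinction in a nutshell.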

Many statistical methods are used in the domain of ML. Statistical methods concentrate on inference and on fitting a probability model, usually with smaller datasets in long format (more observations than variables), while ML often works with wide data and huge datasets. Nonetheless, I would keep these two concepts close together.


Hopefully, the confusion about data has been cleared up somewhat and the next time you hear some of the above-mentioned concepts, you will know what people are talking about. I believe that data is the biggest conqueror of today’s digital world and, instead of fearing this revolution, businesses and organizations should welcome the technological advancements and take action.