Big Data Terms and Definitions
The world of data is changing dramatically. Traditional approaches to data collection and analysis are being challenged by new methods that are transforming the way we interact with and utilize data. This shift is being driven by a confluence of technological advancements, changing consumer behaviors, and the rise of big data. In this blog post, we'll explore the key paradigm shifts that are shaping the future of data and what they mean for businesses.
In building the next generation of data-driven applications, companies and developers need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires solutions tailored to that application's unique challenges, business model and ROI. The primary objective is to enable a data-centric culture while avoiding the development of large, monolithic applications and single-point-of-failure data stacks. Some things change, and I'd like to point to some of those changes, what they signify - and what they don't. Some topics may be well-known, some may be well-known but not implemented, and some may be new. Hopefully, I can assist you in navigating the data journey.
Big Data
The term Big Data refers to new technologies and approaches needed to cope with - and leverage - the massive growth of data due to the fast evolution of decentralization.
We characterize Big Data by the following four Vs: volume (the sheer amount of data), velocity (the speed at which data arrives and must be processed), variety (the diversity of data formats and sources), and veracity (the trustworthiness and quality of the data).
Unstructured, semi-structured, and structured data
The ability to work with unstructured and semi-structured data removes the need to impose a fixed, predefined structure on data. This makes it easier to incorporate new data sources quickly. Moreover, it replaces a single rigid data structure with problem-specific structures that are defined at read time and adapted to the given use case.
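As a minimal sketch in Python (standard library only, with made-up example records), the same semi-structured data can be read with different, use-case-specific structures defined only at access time:

```python
import json

# Raw, semi-structured events stored as-is: no structure is imposed at write time.
raw_events = [
    '{"user": "anna", "action": "login", "ts": "2024-01-05T08:30:00", "device": {"os": "ios"}}',
    '{"user": "ben", "action": "purchase", "ts": "2024-01-05T09:10:00", "amount": 42.50}',
]

# Use case 1: a security analysis only needs user, action and timestamp.
security_view = [
    {"user": e["user"], "action": e["action"], "ts": e["ts"]}
    for e in map(json.loads, raw_events)
]

# Use case 2: a revenue analysis only needs purchase amounts; missing fields default to 0.
revenue_view = [
    e.get("amount", 0.0) for e in map(json.loads, raw_events) if e["action"] == "purchase"
]

print(security_view)
print(sum(revenue_view))
```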
Horizontal vs. vertical scaling
Scalability is the ability of a system, network, or process to handle a growing amount of work, or its ability to be enlarged to accommodate that growth. Vertical scaling adds more resources (CPU, memory, storage) to a single machine, whereas horizontal scaling adds more machines (nodes) to the system; Big Data architectures rely primarily on horizontal scaling.
Agile development
Agile development is a collection of principles and methods for adaptive solution development by self-organizing cross-functional teams. Through iterations of short development cycles in small highly-skilled teams, agile development promotes early delivery, adaptive planning, exploration of possibilities, continuous improvement, and responsiveness to change (even at late stages).
Data Architecture
Data architecture consists of models, policies, and rules that determine how data is collected, arranged, stored, integrated, and retrieved for use. Furthermore, the term describes the adoption, use and integration of the various tools, techniques and operations used to technically process the collected and stored data.
Data Science
Data science is closely linked with Big Data, but not dependent on it. The goal of data science is the extraction of knowledge from various data points. It goes beyond, but has its roots in, the fields of mathematics, statistics, information theory, information philosophy, event and signal processing, machine learning and computer science. Data scientists work closely with business and problem owners to identify data sources and questions that can be answered using data. From these, they develop schemes and solutions for insight generation, decision-making, and process and product optimization.
Data Lake
A data lake is the Big Data alternative to the traditional data warehouse. It consists of a horizontally scalable and centralized platform where all data can be stored in its original fidelity and accessed at any time with robust data security, protection, governance, and management. In contrast to a traditional data warehouse, no predefined schema is imposed on data in a data lake (see schema). This permits the same data to be used in different scenarios, with a case-specific structure defined at access time.
Virtual Data Lakehouse
A Virtual Data Lakehouse combines data mesh principles and cross-platform data processing technology to seamlessly connect all your data lakes and data silos into a large-scale, interconnected, federated data lake. It enables organizations to store and analyze their data across various storage systems; the architecture is also known as federated data lakes. Every data lake remains independent in its physical and virtual instance. The distributed nature of this concept allows privacy-sensitive data to be processed where it resides, an approach also known as Federated Learning.
Data Gravity
Data Gravity describes the uncontrolled flow of data into a lake or ocean without categorizing the data (source, content, structure). Under a "never delete" paradigm, this captured data remains in the lake or ocean forever, but over time its structure, content and source are lost. The data then provides no further value and simply occupies storage, yet it cannot safely be deleted because its meaning is no longer clear.
Data model
A data model describes and standardizes the data elements and their relationships to one another in a certain domain. A data model determines the organization of data and how it is stored and accessed.
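A minimal, hypothetical example of a data model, sketched in Python: a Customer entity, an Order entity, and the relationship between them (the customer_id reference):

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# A toy data model for a hypothetical sales domain:
# a Customer places Orders; each Order references its customer by id.
@dataclass
class Customer:
    customer_id: int
    name: str
    email: str

@dataclass
class Order:
    order_id: int
    customer_id: int      # relationship: reference (foreign key) to Customer
    order_date: date
    total_amount: float

alice = Customer(customer_id=1, name="Alice", email="alice@example.com")
orders: List[Order] = [
    Order(order_id=100, customer_id=alice.customer_id,
          order_date=date(2024, 1, 5), total_amount=99.90)
]
print(orders[0])
```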
Data profiling
Data profiling is the process of examining data to collect statistics and contextual data information in order to evaluate the quality and quantity of the data, as well as the integration potential with other data sources and modeling feasibility. Data profiling is a preliminary step before any further action on data.
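A small profiling sketch using pandas, assuming a toy customer table: it collects row counts, missing values, distinct values and summary statistics before any further processing:

```python
import pandas as pd

# Hypothetical sample of a customer table to be profiled.
df = pd.DataFrame({
    "age": [34, 45, None, 29, 51],
    "country": ["DE", "DE", "US", None, "FR"],
    "revenue": [120.0, 80.5, 230.0, 55.0, 99.9],
})

# Basic profile: volume, completeness and distinct values per column.
profile = {
    "rows": len(df),
    "missing_per_column": df.isna().sum().to_dict(),
    "distinct_per_column": df.nunique().to_dict(),
}
print(profile)

# Summary statistics for numeric and categorical columns.
print(df.describe(include="all"))
```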
Data fusion
Data fusion (or blending) refers to the integration of different data from heterogeneous sources into representations that are meaningful and useful for further modeling.
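A minimal blending sketch in pandas, assuming two hypothetical sources (a CRM export and a web-analytics feed) that describe the same customers:

```python
import pandas as pd

# Two heterogeneous sources describing the same customers.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "segment": ["gold", "silver", "gold"]})
web = pd.DataFrame([{"customer_id": 1, "visits_last_30d": 12},
                    {"customer_id": 3, "visits_last_30d": 4}])

# Blend both sources into one representation usable for modeling;
# customers without web data get a neutral default.
fused = crm.merge(web, on="customer_id", how="left").fillna({"visits_last_30d": 0})
print(fused)
```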
Data mining
Data mining refers to the set of techniques used for exploring, extracting and defining previously unknown patterns in data. These patterns can take the form of groups of data records (clusters), dependencies (association rules) or anomalies. The ultimate goal of data mining is to describe and understand hidden relationships in data, whereas machine learning focuses mostly on probabilistic approaches for making predictions and optimization.
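As an illustrative sketch of one data mining technique (clustering), here is a toy example using scikit-learn's KMeans on made-up purchase behavior; the groups are not predefined but discovered from the data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy purchase behavior: [orders per year, average basket value].
X = np.array([[2, 20], [3, 25], [2, 22],      # occasional small buyers
              [40, 15], [38, 18],             # frequent small buyers
              [5, 400], [6, 380]])            # rare big spenders

# Discover previously unknown groups (clusters) in the data.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment per record
print(model.cluster_centers_)  # the "profile" of each discovered group
```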
Fast analytics
Fast analytics refers to the interactive visual representation of insights and results from data science. It empowers end-users (typically business users/experts) to interact with the data in a fast and lightweight way, and to formulate and answer a wide range of meaningful questions. New answers are returned very quickly (sub-second) and serve as prototypes for deeper analyses.
Massively Parallel Processing (MPP)
Massively Parallel Processing refers to the use of a large number of processing units to perform a set of coordinated computations in parallel, as used in frameworks like Apache Hadoop, Spark, Kafka or Flink.
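A minimal local sketch with PySpark to illustrate the idea: the data is split into partitions and processed in parallel, and on a real cluster the same code would run across many worker nodes:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster, the same logic is
# distributed across many executors (MPP style).
spark = SparkSession.builder.master("local[*]").appName("mpp-sketch").getOrCreate()

# Distribute the data into partitions and compute a sum in parallel.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()
```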
Machine Learning
Machine learning deals with the study and construction of algorithms that can learn from data. Models learned from data can be used to make predictions, optimize processes and products or support decision making.
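A minimal sketch with scikit-learn, assuming a made-up relationship between advertising spend and revenue: the model is learned from past data and then used to support a decision about a planned spend:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend -> observed revenue.
spend = np.array([[10], [20], [30], [40], [50]])
revenue = np.array([25, 45, 62, 85, 105])

# Learn a model from the data, then use it to support a decision
# (e.g., predict revenue for a planned spend of 60).
model = LinearRegression().fit(spend, revenue)
print(model.predict(np.array([[60]])))
```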
Metadata
Metadata describes the information layer about data. It provides information about one or more aspects of the data, such as means of creation, purpose, time and date of creation, author, location, and standards.
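A small, hypothetical metadata record for a single dataset, sketched as a Python dictionary (all names and paths are illustrative):

```python
# Illustrative metadata record: information about the data, not the data itself.
dataset_metadata = {
    "name": "crm_customers",
    "created_at": "2024-01-05T08:30:00Z",
    "author": "data-engineering-team",
    "source_system": "CRM export",
    "purpose": "customer segmentation",
    "location": "s3://example-bucket/crm/customers/",  # hypothetical storage path
    "format": "parquet",
    "schema_version": "1.2",
}
print(dataset_metadata["purpose"])
```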
NoSQL
NoSQL (often interpreted as "Not only SQL") is a modern category of databases that are not restricted to storing and retrieving tabular data (as in the case of relational databases). Examples of NoSQL databases include document-oriented databases, key-value stores and graph databases.
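A minimal, hypothetical illustration in Python of what a record in a document-oriented store or a key-value store can look like, in contrast to fixed relational columns:

```python
import json

# A single record as it could be stored in a document-oriented NoSQL database:
# nested, flexible structure instead of fixed relational columns.
customer_document = {
    "_id": "cust-001",
    "name": "Alice",
    "addresses": [
        {"type": "billing", "city": "Berlin"},
        {"type": "shipping", "city": "Hamburg"},
    ],
    "preferences": {"newsletter": True},
}

# The same idea in a key-value store: an opaque value stored under a key.
key_value_store = {"cust-001": json.dumps(customer_document)}
print(key_value_store["cust-001"])
```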
Predictive modeling
Predictive modeling is a generic term for methodology that uses machine learning, data mining and statistics to predict possible future outcomes based on past data. The model is developed on salted, anonymized original data and trained in production on original data. As with every model in the AI space, training is part of production and often requires reinforcement learning and retraining processes (see Reinforcement Learning).
Reinforcement Learning (RL)
Reinforcement Learning (RL) stands in contrast to model-based machine learning. RL does not need an exact mathematical routine or definition of the solution; instead, the process balances exploration and exploitation to find the "sweet spot" of confidence, and then starts the learning process again. That makes this procedure extremely powerful, and it is similar to the neurological learning process of an infant.
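A minimal epsilon-greedy sketch in Python (a toy multi-armed bandit with made-up reward probabilities) that illustrates the balance between exploration and exploitation:

```python
import random

# Epsilon-greedy bandit: explore (try actions at random) with probability epsilon,
# otherwise exploit (use the action with the best estimated reward).
true_rewards = [0.2, 0.5, 0.8]   # hidden win probabilities of three actions
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                    # fraction of steps spent exploring

random.seed(42)
for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore
    else:
        arm = max(range(3), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_rewards[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

print(estimates)  # converges toward the true reward probabilities
```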
Schema on write / Schema on read
A schema refers to the formal definition of data items and their relationships, as in a relational database. With schema on write, this structure is enforced when data is stored; with schema on read, data is stored as-is and a case-specific structure is applied only when the data is accessed. Schema-less storage thus avoids imposing case-specific structures on data, thereby permitting the storage and use of semi- or unstructured data (see un- / semi-structured data).
Shared-nothing architecture (data locality)
In a shared-nothing distributed architecture, data is processed on the same node where it is stored (data locality). This avoids the need to move the data from one node to another, which would make computation with large amounts of data practically impossible.
Stream processing
Stream processing is the processing of data as it is received, i.e., as it becomes available in near real time. Stream processing is not bound to a specific data flow and/or delivery mechanism such as IoT. It can also be useful in connections between RDBMSs on a transactional basis, for example to detect anomalies in CRM transactions.
This is contrasted with batch processing, which often relies on large datasets stored in cold or semi-cold storage archives like a Data Lake.
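A minimal Python sketch contrasting the two styles, using a hypothetical sensor stream: each event is handled on arrival with a simple anomaly rule, whereas a batch job would first collect all events and analyze them later:

```python
import time
from typing import Dict, Iterator

def sensor_stream() -> Iterator[Dict]:
    """Hypothetical unbounded source: yields events as they arrive."""
    for i in range(5):
        time.sleep(0.1)  # simulate arrival over time
        yield {"sensor": "s1", "value": 20 + i, "seq": i}

# Stream processing: handle each event on arrival, keeping only running state.
running_max = float("-inf")
for event in sensor_stream():
    running_max = max(running_max, event["value"])
    if event["value"] > 23:  # simple anomaly rule, evaluated immediately
        print(f"anomaly at seq {event['seq']}: {event['value']}")

# Batch processing, by contrast, would first collect all events
# (e.g., in a data lake) and analyze them later in one pass.
print("max so far:", running_max)
```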
I hope this short overview is useful and helps you speed up the data career you are targeting. The paradigm shifts in the world of data are transforming the way we interact with and utilize data. The rise of AI and machine learning, the importance of data privacy, the democratization of data, and the shift to the cloud are just some of the key trends shaping the future of data. Businesses that are able to adapt and take advantage of these shifts will be well-positioned to succeed in an increasingly data-driven world.