Introduction to Big Data World

Hello Everyone,

I have started my third beautiful journey under the mentorship of the World Record Holder Mr. Vimal Daga Sir. On the second day of my journey at ARTH - The School of Technologies, I came across the concept of Big Data, one of the giant technologies in the world of technology.

What is Big Data ?

Big Data is a term used for collections of data sets so large and complex that they are difficult to process using traditional applications and tools. It is data exceeding terabytes in size. Because of the variety of data it encompasses, big data always brings a number of challenges relating to its volume and complexity. A recent survey says that 80% of the data created in the world is unstructured. One challenge is how this unstructured data can be structured before we attempt to understand and capture the most important data. Another challenge is how we can store it. The top technologies used to store and analyse Big Data can be categorised into two groups: storage and querying/analysis.


Four V's of Big Data

Big data can be described by the following characteristics:

  • Volume:

The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

  • Variety:

The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies. Big Data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later, these tools and technologies were also applied to structured data, mainly for storage; processing of structured data remained optional, using either big data tools or traditional RDBMSs. This helps in analyzing data towards effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and video, and completes missing pieces through data fusion.

  • Velocity:

The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data are produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.

  • Veracity:

It is the extended definition for big data, referring to data quality and data value. The quality of captured data can vary greatly, affecting the accuracy of analysis.

Other Characteristics of Big Data :

The other characteristics of Big Data are--

  • Exhaustive :

Whether the entire system is captured or recorded.

  • Fine-grained and uniquely lexical :

Respectively, the proportion of specific data collected for each element, and whether each element and its characteristics are properly indexed or identified.

  • Relational :

If the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets.

  • Extensional :

If new fields in each element of the data collected can be added or changed easily.

  • Scalability :

If the size of the data can expand rapidly.

  • Value :

The utility that can be extracted from the data.

  • Variability :

It refers to data whose value or other characteristics are shifting in relation to the context in which they are being generated.

Fields of Use of Big Data :

  • Banking and Security.
  • Communication, Media and Entertainment.
  • Healthcare Providers.
  • Education.
  • Manufacturing and Natural Resources.
  • Government.
  • Insurance.
  • Retail and Wholesale Trade.
  • Transportation.
  • Energy and Utilities.


Big Data Companies :

  • Amazon :
The online retail giant has access to a massive amount of data on its customers; names, addresses, payments and search histories are all filed away in its data bank.

While this information is obviously put to use in advertising algorithms, Amazon also uses the information to improve customer relations, an area that many big data users overlook.

The next time you contact the Amazon help desk with a query, don't be surprised when the employee on the other end already has most of the pertinent information about you on hand. This allows for a faster, more efficient customer service experience that doesn't include having to spell out your name three times.

  • Google :
Google was founded in 1998 and is headquartered in California. It has a $101.8 billion market capitalization and $80.5 billion of sales as of May 2017. Around 61,000 employees currently work with Google across the globe.

Google provides integrated, end-to-end Big Data solutions based on innovation at Google, helping different organizations capture, process, analyze, and transfer data in a single platform. Google is expanding its Big Data analytics; BigQuery is a cloud-based analytics platform that analyzes huge data sets quickly.

BigQuery is a serverless, fully managed, low-cost enterprise data warehouse, so it requires neither a database administrator nor any infrastructure to manage. BigQuery can scan terabytes of data in seconds and petabytes of data in minutes.

  • Microsoft :
Microsoft is a US-based software and programming company, founded in 1975 and headquartered in Washington. As per the Forbes list, it has a market capitalization of $507.5 billion and $85.27 billion of sales. It currently employs around 114,000 people across the globe.

Microsoft’s Big Data strategy is wide and growing fast. This strategy includes a partnership with Hortonworks, a Big Data startup. The partnership provides the HDInsight tool for analyzing structured and unstructured data on the Hortonworks Data Platform (HDP).

Recently Microsoft acquired Revolution Analytics, a Big Data analytics platform built around the R programming language. R is used for building Big Data apps that do not require the skills of a Data Scientist.

  • Facebook :
Arguably the world’s most popular social media network, with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. It was estimated that there would be more than 183 million Facebook users in the United States alone by October 2019. Facebook is still among the top 100 public companies in the world, with a market value of approximately $475 billion.

Every day, we feed Facebook’s data beast with mounds of information. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are posted. That is a lot of data.

At first, this information may not seem to mean very much. But with data like this, Facebook knows who our friends are, what we look like, where we are, what we are doing, our likes, our dislikes, and so much more. Some researchers even say Facebook has enough data to know us better than our therapists!

  • American Express :
The American Express Company is using big data to analyse and predict consumer behaviour.

By looking at historical transactions and incorporating more than 100 variables, the company employs sophisticated predictive models in place of traditional business intelligence-based hindsight reporting.

This allows a more accurate forecast of potential churn and customer loyalty. In fact, American Express has claimed that, in their Australian market, they are able to predict 24% of accounts that will close within four months.

  • IBM :
International Business Machines (IBM) is an American company headquartered in New York. IBM is listed at 43 in the Forbes list with a market capitalization of $162.4 billion as of May 2017. The company’s operations span 170 countries, and it is among the largest employers, with around 414,400 employees.

IBM has sales of around $79.9 billion and a profit of $11.9 billion. As of 2017, IBM had held the most patents generated by a business for 24 consecutive years.

IBM is the biggest vendor of Big Data-related products and services. IBM Big Data solutions provide features to store, manage, and analyze data.

This data comes from numerous sources and is accessible to all users: business analysts, data scientists, etc. DB2, Informix, and Infosphere are popular IBM database platforms that support Big Data analytics. IBM also offers well-known analytics applications such as Cognos.

  • Oracle :
Oracle offers fully integrated cloud applications and platform services, with more than 420,000 customers and 136,000 employees across 145 countries. It has a market capitalization of $182.2 billion and sales of $37.4 billion as per the Forbes list.

Oracle is the biggest player in the Big Data area; it is also well known for its flagship database. Oracle leverages the benefits of big data in the cloud and helps organizations define their data strategy and approach, including big data and cloud technology.

It provides a business solution that leverages Big Data Analytics, applications, and infrastructure to provide insight for logistics, fraud, etc. Oracle also provides Industry solutions which ensure that your organization takes advantage of Big Data opportunities.

Oracle’s Big Data industry solutions address the growing demand for different industries such as Banking, Health Care, Communications, Public Sector, Retail, etc. There are a variety of Technology solutions such as Cloud Computing, Application Development, and System Integration.

  • SAP :
SAP is the largest business software company, founded in 1972 and headquartered in Walldorf, Germany. It has a market capitalization of $119.7 billion with a total employee count of 84,183 as of May 2017.

As per the Forbes list, SAP has sales of $24.4 billion and a profit of around $4 B with 345,000 customers. It is the largest provider of enterprise application software and the best cloud company with 110 million cloud subscribers.

SAP provides a variety of analytics tools, but its main Big Data tool is HANA, an in-memory relational database. HANA integrates with Hadoop and can run on 80 terabytes of data.

SAP helps organizations turn huge amounts of Big Data into real-time insight with Hadoop, which enables distributed data storage and advanced computation capabilities.

  • BDO :
National accounting and audit firm BDO puts big data analytics to use in identifying risk and fraud during audits.

Where, in the past, finding the source of a discrepancy would involve numerous interviews and hours of manpower, consulting internal data first allows for a significantly narrowed field and streamlined process.

In one case, BDO Consulting Director Kirstie Tiernan noted, they were able to cut a list of thousands of vendors down to a dozen and, from there, review data individually for inconsistencies. A specific source was identified relatively quickly.

  • Capital One :
Marketing is one of the most common uses for big data, and Capital One is at the top of the game, utilising big data management to help ensure the success of all customer offerings.

Through analysis of the demographics and spending habits of customers, Capital One determines the optimal times to present various offers to clients, thus increasing the conversion rates from their communications.

Not only does this result in better uptake but marketing strategies become far more targeted and relevant, therefore improving budget allocation.

  • General Electric :
GE is using the data from sensors on machinery like gas turbines and jet engines to identify ways to improve working processes and reliability.

The resultant reports are then passed to GE's analytics team to develop tools and improvements for increased efficiency.

The company has estimated that data could boost productivity in the US by 1.5%, which, over a 20-year period, could save enough cash to raise average national incomes by as much as 30%.

  • Netflix:
The entertainment streaming service has a wealth of data and analytics providing insight into the viewing habits of millions of international consumers.

Netflix uses this data to commission original programming content that appeals globally as well as purchasing the rights to films and series boxsets that they know will perform well with certain audiences.

For example, Adam Sandler has proven unpopular in the US and UK markets in recent years but Netflix green-lighted four new films with the actor in 2015, armed with the knowledge that his previous work had been successful in Latin America.


Big Data Challenges :

  • Handling a Large Amount of Data :

There is a huge explosion in the data available. Look back a few years, and compare it with today, and you will see that there has been an exponential increase in the data that enterprises can access. They have data for everything, right from what a consumer likes, to how they react, to a particular scent, to the amazing restaurant that opened up in Italy last weekend.

This data exceeds the amount that can be stored, computed, and retrieved. The challenge is not so much the availability as the management of this data. With statistics claiming that by 2020 stored data would stretch 6.6 times the distance between the earth and the moon, this is definitely a challenge.

Along with the rise in unstructured data, there has also been a rise in the number of data formats: video, audio, social media, and smart-device data, to name just a few.

Some of the newest ways developed to manage this data are hybrids of relational databases combined with NoSQL databases. An example is MongoDB, an inherent part of the MEAN stack. There are also distributed computing systems like Hadoop to help manage Big Data volumes.

Netflix is a content streaming platform based on Node.js. With the increased load of content and the complex formats available on the platform, they needed a stack that could handle the storage and retrieval of the data. They used the MEAN stack, and with MongoDB's document-oriented data model they could in fact manage the data.

  • Real-time can be Complex :

When I say data, I’m not limiting this to the “stagnant” data available at common disposal. A lot of data keeps updating every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from their current purchases can help. There are data analysis tools built for exactly this velocity and veracity; they come with ETL engines, visualization, computation engines, frameworks, and other necessary inputs.
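The real-time idea above, aggregating a retail customer's current purchases as they stream in, can be sketched with a simple sliding-window counter. This is a minimal, standard-library illustration (the class name, event keys, and timestamps are invented for the example), not any particular streaming engine:

```python
from collections import deque
import time

class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` of a stream."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def add(self, key, timestamp=None):
        self.events.append((timestamp if timestamp is not None else time.time(), key))

    def counts(self, now=None):
        now = now if now is not None else time.time()
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result

# Simulated purchase stream with explicit timestamps (in seconds).
w = SlidingWindowCounter(window_seconds=60)
w.add("laptop", timestamp=0)
w.add("phone", timestamp=50)
w.add("phone", timestamp=95)
print(w.counts(now=100))   # the t=0 event has expired -> {'phone': 2}
```

Real streaming platforms add persistence, distribution, and exactly-once semantics on top of essentially this eviction-and-count loop.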

It is important for businesses to keep themselves updated with this data, along with the “stagnant” and always available data. This will help build better insights and enhance decision-making capabilities.

However, not all organizations are able to keep up with real-time data, as they are not updated with the evolving nature of the tools and technologies needed. Currently, there are a few reliable tools, though many still lack the necessary sophistication.

  • Data Security :

A lot of organizations claim that they face trouble with Data Security. This happens to be a bigger challenge for them than many other data-related problems. The data that comes into enterprises is made available from a wide range of sources, some of which cannot be trusted to be secure and compliant within organizational standards.

They need to use a variety of data collection strategies to keep up with data needs. This in turn leads to inconsistencies in the data, and then the outcomes of the analysis. A simple example such as annual turnover for the retail industry can be different if analyzed from different sources of input. A business will need to adjust the differences, and narrow it down to an answer that is valid and interesting.

This data is made available from numerous sources, and therefore has potential security problems. You may never know which channel of data is compromised, thus compromising the security of the data available in the organization, and giving hackers a chance to move in.

It’s necessary to introduce Data Security best practices for secure data collection, storage and retrieval.

  • Paying loads of Money :

Big data adoption projects entail lots of expenses. If you opt for an on-premises solution, you’ll have to mind the costs of new hardware, new hires (administrators and developers), electricity and so on. Plus: although the needed frameworks are open-source, you’ll still need to pay for the development, setup, configuration and maintenance of new software.

If you decide on a cloud-based big data solution, you’ll still need to hire staff (as above) and pay for cloud services, big data solution development as well as setup and maintenance of needed frameworks.

Moreover, in both cases, you’ll need to allow for future expansion to avoid big data growth getting out of hand and costing you a fortune.

  • Shortage of Skilled People :

There is a definite shortage of skilled Big Data professionals available at this time. This has been mentioned by many enterprises seeking to better utilize Big Data and build more effective Data Analysis systems. There is a lack of experienced people and certified Data Scientists or Data Analysts available at present, which makes the “number crunching” difficult and insight building slow.

Again, training people at entry level can be expensive for a company dealing with new technologies. Many are instead working on automation solutions involving Machine Learning and Artificial Intelligence to build insights, but this also takes well-trained staff or the outsourcing of skilled developers.

  • Trouble in Upscaling :

The most typical feature of big data is its dramatic ability to grow. And one of the most serious challenges of big data is associated exactly with this.

Your solution’s design may be well thought through and adjusted to upscaling with no extra effort. But the real problem isn’t the actual process of introducing new processing and storage capacity. It lies in the complexity of scaling up in such a way that your system’s performance doesn’t decline and you stay within budget.


Big Data Case Studies :

Undoubtedly Big Data has become a big game-changer in most modern industries over the last few years. As Big Data continues to permeate our day-to-day lives, the number of companies adopting it continues to increase. Let us see how Big Data helped them perform exponentially in the market with these case studies.

  • Case Study 1 : Walmart
Walmart is the largest retailer in the world and the world’s largest company by revenue, with more than 2 million employees and 20,000 stores in 28 countries. It started making use of big data analytics long before the term Big Data came into the picture.

Walmart uses data mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together. By applying effective data mining, Walmart has increased its customer conversion rate. It has been speeding along big data analysis to provide best-in-class e-commerce technologies with a motive to deliver a superior customer experience. The main objective of big data at Walmart is to optimize the shopping experience of customers when they are in a Walmart store. Big data solutions at Walmart are developed with the intent of redesigning global websites and building innovative applications to customize the shopping experience for customers while increasing logistics efficiency. Hadoop and NoSQL technologies are used to provide internal customers with access to real-time data collected from different sources and centralized for effective use.
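The pattern-mining idea behind "products bought together" can be sketched as simple basket co-occurrence counting. This is a toy illustration of the general technique (the function names and sample baskets are invented), not Walmart's actual system:

```python
from collections import Counter
from itertools import combinations

def co_occurrence(baskets):
    """Count how often each unordered pair of products appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(product, pairs, k=3):
    """Top-k products most often bought together with `product`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [p for p, _ in scores.most_common(k)]

baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "cereal"],
    ["bread", "butter"],
]
pairs = co_occurrence(baskets)
print(recommend("bread", pairs))   # 'milk' ranks first: bought with bread twice
```

At Walmart scale the same counting runs as a distributed job over billions of baskets, but the underlying idea is this co-occurrence tally.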

  • Case Study 2 : Uber

Uber is the first choice for people around the world when they think of moving people and making deliveries. It uses users' personal data to closely monitor which features of the service are used most, to analyze usage patterns, and to determine where the services should be focused. Uber bases prices on the supply of and demand for its services, so the price of a ride changes accordingly; one of Uber’s biggest uses of data is therefore surge pricing. For instance, if you are running late for an appointment and book a cab in a crowded place, you must be ready to pay twice the usual amount.

For example, on New Year’s Eve the price for driving one mile can go from 200 to 1000. In the short term, surge pricing affects the rate of demand, while its long-term use could be the key to retaining or losing customers. Machine learning algorithms are used to determine where demand is strong.
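A toy model of how a surge multiplier might scale a fare with the demand/supply ratio. All names, rates, and the cap here are invented for illustration; Uber's real pricing models are far more sophisticated:

```python
def surge_multiplier(open_requests, available_drivers,
                     base=1.0, max_surge=5.0):
    """Toy surge model: the price multiplier grows with the
    demand/supply ratio, clamped between `base` and `max_surge`."""
    if available_drivers == 0:
        return max_surge
    ratio = open_requests / available_drivers
    return min(max(base, ratio), max_surge)

def fare(miles, per_mile_rate, requests, drivers):
    return miles * per_mile_rate * surge_multiplier(requests, drivers)

# Balanced market: demand equals supply, no surge.
print(fare(1, 200, requests=10, drivers=10))   # 200.0
# New Year's Eve: five riders per available driver.
print(fare(1, 200, requests=50, drivers=10))   # 1000.0
```

With these made-up numbers the one-mile fare goes from 200 to 1000, mirroring the New Year's Eve example above.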

  • Case Study 3 : Flipkart
Flipkart, one of the world's leading e-commerce platforms, uses analytics and algorithms to get better insights into its business during any type of sale or festival season. This section explains how Flipkart leverages its Big Data platform for processing big data in streams and batches. This service-oriented architecture empowers the user experience, optimizes logistics, and improves product listings.

The Flipkart Data Platform (FDP) is a service-oriented architecture capable of computing batch data as well as streaming data. It comprises various microservices that support the user experience through efficient product listings, price optimization, and maintenance of various data domains: Redis, HBase, SQL, etc. The FDP can store 35 petabytes of data and manages 800+ Hadoop nodes on the server. This is just a brief look at how Big Data is helping Flipkart.


What is Distributed Storage ?

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premise distributed storage systems like Cloudian Hyperstore.

Distributed storage systems can store several types of data:

  • Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
  • Block storage—a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
  • Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.
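The last idea, identifying each object by a hash of its contents, can be sketched in a few lines. A minimal in-memory illustration assuming SHA-256 as the ID function (the class and method names are invented; real object stores add buckets, metadata, and durability on top):

```python
import hashlib

class ObjectStore:
    """Toy content-addressed object store: each object is identified
    by the SHA-256 hash of its contents."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data   # identical data dedupes for free
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"sensor reading 42")
assert store.get(oid) == b"sensor reading 42"
# Storing the same bytes again yields the same ID (deduplication):
assert store.put(b"sensor reading 42") == oid
```

A nice side effect of content addressing is visible in the last line: storing the same bytes twice produces one object, which is one reason object stores handle huge volumes cheaply.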

Distributed storage systems have several advantages:

  • Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
  • Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
  • Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
  • Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.

Distributed Storage Features and Limitations

Most distributed storage systems have some or all of the following features:

  • Partitioning—the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve the data from multiple nodes.
  • Replication—the ability to replicate the same data item across multiple cluster nodes and maintain consistency of the data as clients update it.
  • Fault tolerance—the ability to retain availability to data even when one or more nodes in the distributed storage cluster goes down.
  • Elastic scalability—enabling data users to receive more storage space if needed, and enabling storage system operators to scale the storage system up and down by adding or removing storage units to the cluster.
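Partitioning and replication, the first two features above, can be sketched together: hash a key to a primary node, then place copies on the next nodes in ring order. A toy placement function (the node names and replica count are invented; production systems typically use consistent hashing with virtual nodes so that adding a node moves only a small fraction of the data):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2   # each item lives on its primary node plus one backup

def placement(key, nodes=NODES, replicas=REPLICAS):
    """Pick the nodes responsible for `key`: hash the key to choose a
    primary node, then place replicas on the following nodes in ring order."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

for key in ["user:42", "order:7", "cart:19"]:
    print(key, "->", placement(key))
```

Because the placement is a pure function of the key, any client can compute where to read or write without consulting a central directory, which is what makes this kind of partitioning scale.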

An inherent limitation of distributed storage systems is described by the CAP theorem. The theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance (the ability to keep operating when a network partition separates parts of the cluster). It has to give up at least one of these three properties. Many distributed storage systems give up strict consistency while guaranteeing availability and partition tolerance.


What is Hadoop ?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Importance of Hadoop :

  • Ability to store and process huge amounts of any kind of data, quickly.

With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

  • Computing power.

 Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

  • Fault tolerance. 

Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

  • Flexibility. 

Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

  • Low cost. 

The open-source framework is free and uses commodity hardware to store large quantities of data.

  • Scalability.

 You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Challenges of Using Hadoop :

  • MapReduce programming is not a good match for all problems.

 It’s good for simple information requests and problems that can be divided into independent units, but it's not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.
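The map-shuffle/sort-reduce structure described above is easiest to see in the classic word-count example. A minimal single-process sketch of the MapReduce phases in Python (real Hadoop jobs run these same phases distributed across nodes, with the shuffle writing intermediate files between them):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(mapped):
    """Shuffle/sort: group all emitted values by key across mapper outputs."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle([map_phase(d) for d in documents]))
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The file-intensity criticism above is visible here: an iterative algorithm would have to run this whole map-shuffle-reduce pipeline repeatedly, materializing the intermediate `groups` to disk on every pass.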

  • There’s a widely acknowledged talent gap. 

It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

  • Data security.

 Another challenge centers around the fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

  • Full-fledged data management and governance. 

Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance, and metadata. Especially lacking are tools for data quality and standardization.

Conclusion:

The availability of Big Data, low-cost commodity hardware, and new information management and analytic software have produced a unique moment in the history of data analysis. The convergence of these trends means that we have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the first time in history. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability.

The Age of Big Data is here, and these are truly revolutionary times if both business and technology professionals continue to work together and deliver on the promise.


That's All

Thank You

By...

Hriddha Bhowmik

16/09/2020













