Data Persistence
Data
§ Data is a set of values of
subjects with respect to qualitative or quantitative variables.
§ Data and information or
knowledge are often used interchangeably; however, data becomes information
when it is viewed in context or in post-analysis.
§ Data is measured,
collected and reported, and analyzed, whereupon it can be visualized using
graphs, images or other analysis tools.
§ Data as a general concept
refers to the fact that some existing information or knowledge is represented
or coded in some form suitable for better usage or processing.
§ Raw data
("unprocessed data") is a collection of numbers or characters before
it has been "cleaned" and corrected by researchers. Raw data needs to
be corrected to remove outliers or obvious instrument or data entry errors
(e.g., a thermometer reading from an outdoor Arctic location recording a
tropical temperature).
§ Data processing commonly
occurs by stages, and the "processed data" from one stage may be
considered the "raw data" of the next stage.
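As a rough illustration of the cleaning step described above, the sketch below drops data entry errors and implausible readings from a list of raw values. The function name `clean_readings`, the median-based rule, and the 15-degree threshold are assumptions chosen for the example, not a standard method.

```python
from statistics import median

def clean_readings(raw, max_deviation=15.0):
    """Turn raw readings into processed data (hypothetical helper).

    Non-numeric entries are treated as data entry errors, and values
    further than max_deviation from the median are treated as outliers.
    """
    numeric = []
    for value in raw:
        try:
            numeric.append(float(value))
        except (TypeError, ValueError):
            continue  # data entry error: not a number
    if not numeric:
        return []
    centre = median(numeric)
    return [v for v in numeric if abs(v - centre) <= max_deviation]

# Raw data from an outdoor Arctic sensor, including a tropical reading.
raw_arctic = [-31.5, -29.8, "n/a", 27.0, -30.2, -28.9]
processed = clean_readings(raw_arctic)
print(processed)
```

The output of this stage could in turn serve as the "raw data" of a later stage, such as averaging or plotting.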
Database
§ A database is an organized
collection of data, generally stored and accessed electronically from a
computer system. Where databases are more complex, they are often developed
using formal design and modeling techniques.
§ The database management
system (DBMS) is the software that interacts with end users, applications, and
the database itself to capture and analyze the data.
§ The DBMS software
additionally encompasses the core facilities provided to administer the
database.
§ The sum of the database,
the DBMS and the associated applications can be referred to as a "database
system".
Database Server
§ A database server is a
server which houses a database application that provides database services to
other computer programs or to computers, as defined by the client–server model.
§ Database management systems (DBMSs) frequently provide
database-server functionality, and some of them (such as
MySQL) rely exclusively on the client–server model for database access, while
others (e.g. SQLite) are meant to be used as embedded databases.
§ Users access a database
server either through a "front end" running on the user's computer –
which displays requested data – or through the "back end", which runs
on the server and handles tasks such as data analysis and storage.
§ In a master-slave model,
database master servers are central and primary locations of data while database
slave servers are synchronized backups of the master acting as proxies.
Database Management System
§ A database management
system (DBMS) is system software for creating and managing databases. The DBMS
provides users and programmers with a systematic way to create, retrieve,
update and manage data.
§ A DBMS makes it possible
for end users to create, read, update and delete data in a database.
§ The DBMS essentially
serves as an interface between the database and end users or application
programs, ensuring that data is consistently organized and remains easily accessible.
§ The DBMS manages three
important things: the data, the database engine that allows data to be
accessed, locked and modified, and the database schema, which defines the
database's logical structure.
§ These three foundational elements help provide
concurrency, security, data integrity and uniform administration procedures.
§ Typical database
administration tasks supported by the DBMS include change management,
performance monitoring/tuning and backup and recovery.
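The create, read, update and delete operations mentioned above can be sketched with Python's built-in `sqlite3` module, which embeds the SQLite DBMS. The `student` table and its rows are made up for the example.

```python
import sqlite3

# An in-memory database; the schema defines the logical structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Create
conn.execute("INSERT INTO student (name) VALUES (?)", ("Ada",))
conn.execute("INSERT INTO student (name) VALUES (?)", ("Grace",))

# Read
names = [row[0] for row in conn.execute("SELECT name FROM student ORDER BY id")]

# Update
conn.execute("UPDATE student SET name = ? WHERE name = ?", ("Ada L.", "Ada"))

# Delete
conn.execute("DELETE FROM student WHERE name = ?", ("Grace",))
conn.commit()

remaining = [row[0] for row in conn.execute("SELECT name FROM student ORDER BY id")]
print(names, remaining)
```

The application only issues statements; how rows are stored, locked and found is left to the database engine.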
Files Vs Databases
Pros of the File System
§ Performance can be better than in a database - If you
store large files in the database, performance may suffer, because even a simple
query to retrieve the list of files or file names will also load the file data if
you use SELECT * in your query. In a file system, accessing a file is
simple and lightweight.
§ Saving and downloading files is much simpler in the file system
than in a database - A simple "Save As" function
will do. Downloading can be done by addressing a URL with the
location of the saved file.
§ Migrating the data is an easy process - You can just copy and
paste the folder to your desired destination while ensuring that write
permissions are provided to your destination.
§ It's cost effective - In most cases it is cheaper to expand your web
server than to pay for certain databases.
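To make the "save" and "migrate" points concrete, here is a small sketch using only the standard library. The folder names `uploads` and `backup` and the file `report.txt` are placeholders, and everything happens inside a temporary directory so nothing is left behind.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    uploads = Path(root) / "uploads"
    uploads.mkdir()

    # "Save As": writing the bytes to a path is all it takes.
    (uploads / "report.txt").write_bytes(b"quarterly numbers")

    # Migration: copy the whole folder to the new destination.
    # The destination's parent must be writable by the current user.
    backup = Path(root) / "backup"
    shutil.copytree(uploads, backup)

    migrated = sorted(p.name for p in backup.iterdir())
    same_content = (backup / "report.txt").read_bytes() == b"quarterly numbers"

print(migrated, same_content)
```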
Cons of the File System
§ Loosely packed - The file system offers no ACID (Atomicity,
Consistency, Isolation, Durability) operations that tie the files to the
relational data referring to them, so there is no guarantee that the two stay
consistent. Consider a scenario in which your files are
deleted from their location manually or by an attacker.
§ Low security - Since your files are saved in a folder that
must have write permissions, the folder is prone to security issues and
invites trouble, such as hacking. It's best to avoid saving in the file system if
you cannot afford to compromise on security.
Pros of Database
§ ACID (Atomicity, Consistency, Isolation, Durability) guarantees - These include the
rollback of an update, which is complicated when files are stored outside the
database.
§ Files will be in sync with the database - and cannot be orphaned,
which gives you the upper hand in tracking transactions.
§ Storing files in the database is more secure than storing them in the file system.
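The rollback and "no orphans" points can be illustrated with a small `sqlite3` sketch. The two tables (`file_data`, `file_meta`) and the helper `store_file` are hypothetical; the point is only that both inserts succeed together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_data (name TEXT PRIMARY KEY, content BLOB NOT NULL)")
conn.execute("CREATE TABLE file_meta (name TEXT PRIMARY KEY, owner TEXT NOT NULL)")

def store_file(conn, name, owner, content):
    """Store a file and its metadata atomically (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO file_data VALUES (?, ?)", (name, content))
            conn.execute("INSERT INTO file_meta VALUES (?, ?)", (name, owner))
        return True
    except sqlite3.IntegrityError:
        return False

first = store_file(conn, "report.pdf", "ana", b"%PDF-1.4 ...")
# The owner is missing, so the second insert violates NOT NULL and the
# already inserted file_data row is rolled back with it.
second = store_file(conn, "notes.txt", None, b"hello")

data_rows = conn.execute("SELECT COUNT(*) FROM file_data").fetchone()[0]
meta_rows = conn.execute("SELECT COUNT(*) FROM file_meta").fetchone()[0]
print(first, second, data_rows, meta_rows)
```

With files kept on disk next to a metadata table, the same failure would leave a file behind that no row refers to, unless the application cleans it up itself.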
Cons of Database
§ You may have to convert the files to blob in order to store them in
the database.
§ Database backups - These become larger and take longer to create and restore.
§ Memory use is inefficient - RDBMSs are often RAM-driven, so
all data must go through RAM first. Consider
what happens when an RDBMS must find and sort data: it tracks each data page,
even for the smallest amount of data read or written, and it must track whether
the page is in memory or on disk, whether it is indexed, whether it is sorted
physically, and so on.
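The sketch below shows what storing a file as a blob looks like in practice, and why the choice of columns in a query matters once blobs are involved. The `upload` table and the 1 MB payload are stand-ins for a real file read with `open(path, "rb").read()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upload (id INTEGER PRIMARY KEY, filename TEXT, content BLOB)"
)

# Stand-in for the bytes of a 1 MB file.
payload = bytes(1_000_000)
conn.execute(
    "INSERT INTO upload (filename, content) VALUES (?, ?)", ("video.bin", payload)
)
conn.commit()

# Listing files by name: the blob is never handed to the application.
listing = conn.execute("SELECT id, filename FROM upload").fetchall()

# SELECT * returns the full content with every row.
everything = conn.execute("SELECT * FROM upload").fetchall()

listing_bytes = sum(len(name) for _, name in listing)
full_bytes = sum(len(content) for _, _, content in everything)
print(listing_bytes, full_bytes)
```

Naming the columns explicitly keeps file listings cheap even when the content lives in the same table.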
Data Arrangements
Data Warehouse
§ In computing, a data
warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component
of business intelligence. DWs are central repositories of integrated data from
one or more disparate sources. They store current and historical data in one
single place, where it is used to create analytical reports for workers
throughout the enterprise.
§ The data stored in the
warehouse is uploaded from the operational systems (such as marketing or
sales). The data may pass through an operational data store and may require
data cleansing for additional operations to ensure data quality before it is
used in the DW for reporting.
§ The typical extract,
transform, load (ETL)-based data warehouse uses staging, data integration, and
access layers to house its key functions.
§ The staging layer or
staging database stores raw data extracted from each of the disparate source
data systems.
§ The integration layer
integrates the disparate data sets by transforming the data from the staging
layer, often storing the transformed data in an operational data store (ODS)
database.
§ The integrated data are
then moved to yet another database, often called the data warehouse database,
where the data is arranged into hierarchical groups, often called dimensions,
and into facts and aggregate facts.
§ The main source of the
data is cleansed, transformed, catalogued, and made available for use by
managers and other business professionals for data mining, online analytical
processing, market research and decision support.
§ However, the means to
retrieve and analyze data, to extract, transform, and load data, and to manage
the data dictionary are also considered essential components of a data
warehousing system.
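The staging, integration and warehouse layers described above can be sketched in a few lines of plain Python. The source names (`sales_eu`, `sales_us`), the row fields and the cleansing rule are invented for the example; a real warehouse would use database tables rather than in-memory lists.

```python
from collections import defaultdict

# Staging layer: raw rows exactly as extracted from two hypothetical sources.
staging = {
    "sales_eu": [
        {"sku": "a-1", "amount": "19.90", "date": "2024-01-03"},
        {"sku": "A-2", "amount": "5.00", "date": "2024-01-03"},
    ],
    "sales_us": [
        {"sku": "A-1 ", "amount": "10.10", "date": "2024-01-04"},
        {"sku": "A-2", "amount": "oops", "date": "2024-01-04"},
    ],
}

def integrate(staging):
    """Integration layer: cleanse rows and bring them into one common shape."""
    for source, rows in staging.items():
        for row in rows:
            try:
                amount = float(row["amount"])
            except ValueError:
                continue  # data cleansing: skip rows that fail validation
            yield {
                "source": source,
                "sku": row["sku"].strip().upper(),
                "amount": amount,
                "date": row["date"],
            }

def load(rows):
    """Warehouse layer: aggregate facts along the product dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sku"]] += row["amount"]
    return {sku: round(total, 2) for sku, total in totals.items()}

warehouse = load(integrate(staging))
print(warehouse)
```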
Big Data
§ Big data is a field that
treats ways to analyze, systematically extract information from, or otherwise
deal with data sets that are too large or complex to be dealt with by
traditional data-processing application software.
§ Data with many cases
(rows) offer greater statistical power, while data with higher complexity (more
attributes or columns) may lead to a higher false discovery rate.
§ Big data challenges include capturing data,
data storage, data analysis, search, sharing, transfer, visualization,
querying, updating, information privacy and data source. Big data was
originally associated with three key concepts: volume, variety, and velocity.
§ "There is little
doubt that the quantities of data now available are indeed large, but that's
not the most relevant characteristic of this new data ecosystem." Analysis
of data sets can find new correlations to "spot business trends, prevent
diseases, combat crime and so on." Scientists, business executives,
practitioners of medicine, advertising and governments alike regularly meet
difficulties with large data-sets in areas including Internet search, fintech,
urban informatics, etc.
Data Warehouse vs Big Data
§ Data warehousing has been a common
term for the last 10-20 years, whereas big data has been a hot trend for the last
5-10 years.
§ Both hold a lot of data,
are used for reporting, and are managed on electronic storage devices.
§ So, a common thought among
many people is that big data will replace the older data warehousing very
soon. But big data and data warehousing are not interchangeable, as they
are used for entirely different purposes.
§ So, let us look at
big data and the data warehouse in detail in this post.
| Basis for Comparison | Data warehouse | Big data |
| --- | --- | --- |
| Meaning | A data warehouse is mainly an architecture, not a technology. It extracts data from a variety of SQL-based data sources (mainly relational databases) and helps generate analytic reports. By definition, a data repository that is built by one process and used for analytic reports is a data warehouse. | Big data is mainly a technology that rests on the volume, velocity, and variety of the data. Volume is the amount of data coming from different sources, velocity is the speed of data processing, and variety is the number of types of data (it supports mainly all data formats). |
| Preferences | If an organization wants to make informed decisions (such as knowing what is going on in the corporation, or planning next year based on current-year performance data), it prefers data warehousing, because this kind of report needs reliable, believable data from the sources. | If an organization needs to compare against a large amount of big data that contains valuable information and helps it make better decisions (such as how to gain more revenue, more profitability, more customers), it prefers the big data approach. |
| Accepted data source | Accepts one or more homogeneous (all sites use the same DBMS product) or heterogeneous (sites may run different DBMS products) data sources. | Accepts any kind of source, including business transactions, social media, and sensor or machine-specific data. The data may or may not come from a DBMS product. |
| Accepted type of formats | Handles mainly structured data (specifically relational data). | Accepts all types of formats: structured data, relational data, and unstructured data including text documents, email, video, audio, stock ticker data and financial transactions. |
| Subject-Oriented | A data warehouse is subject-oriented because it provides information on a specific subject (such as a product, customers, suppliers, sales, revenue) rather than on the organization's ongoing operations. It mainly focuses on analyzing or displaying data that helps with decision making. | Big data is also subject-oriented. The main difference is the source of the data, as big data can accept and process data from all sources, including social media and sensor or machine-specific data. It also aims to provide exact analysis of data for a specific subject. |
| Time-Variant | The data collected in a data warehouse is identified by a time period, as it mainly holds historical data for analytical reports. | Big data has many approaches to identifying already loaded data, and a time period is one of them. As big data mainly processes flat files, archiving with date and time is the best approach to identifying loaded data. But it can also work with streaming data, so it does not always hold historical data. |
| Non-volatile | Previous data is never erased when new data is added. This is one of the major features of a data warehouse. As it is entirely separate from an operational database, changes to an operational database do not directly affect the data warehouse. | For big data, too, previous data is never erased when new data is added. It is stored as a file that represents a table. In the streaming case, however, Hive or Spark is sometimes used directly as the operating environment. |
| Distributed File System | Processing huge data in data warehousing is time-consuming and sometimes takes an entire day to complete. | This is one of the big utilities of big data. HDFS (Hadoop Distributed File System) is mainly designed to load huge data into distributed systems using MapReduce programs. |
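The MapReduce model mentioned in the last row can be sketched in plain Python as a word count. This is not Hadoop's API, only an illustration of the map, shuffle and reduce phases; the `blocks` list stands in for file blocks that HDFS would spread across machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each inner list stands in for one block of a distributed file.
blocks = [["big data big"], ["data warehouse"]]

mapped = [pair for block in blocks for line in block for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster the map calls run in parallel on the machines that hold each block, which is what makes the approach suitable for data too large for a single server.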