Friday, 12 April 2019

DATA PERSISTENCE





Data

§  Data is a set of values of subjects with respect to qualitative or quantitative variables.

§  Data and information or knowledge are often used interchangeably; however, data becomes information when it is viewed in context or in post-analysis.

§  Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools.

§  Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

§  Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and corrected by researchers. Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors (e.g., a thermometer reading from an outdoor Arctic location recording a tropical temperature).

§  Data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next stage.
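As a minimal sketch of the cleaning and staging ideas above, the Python snippet below drops implausible readings from a hypothetical Arctic outdoor thermometer and then feeds the cleaned output into a second stage. The function names and the plausible range are assumptions chosen for illustration only.

```python
from typing import List

# Hypothetical plausible range for an outdoor Arctic thermometer, in degrees Celsius.
MIN_PLAUSIBLE_C = -70.0
MAX_PLAUSIBLE_C = 25.0


def clean_readings(raw: List[float]) -> List[float]:
    """Stage one: keep only the readings that fall inside the plausible range."""
    return [r for r in raw if MIN_PLAUSIBLE_C <= r <= MAX_PLAUSIBLE_C]


def mean_temperature(cleaned: List[float]) -> float:
    """Stage two: the processed output of stage one is the raw input here."""
    return sum(cleaned) / len(cleaned)


raw_readings = [-31.5, -29.0, 35.2, -30.5]  # 35.2 is an obvious instrument error
cleaned = clean_readings(raw_readings)
print(cleaned)
print(mean_temperature(cleaned))
```

A fixed range is the simplest possible rule; real cleaning usually needs domain knowledge about the instrument and the location.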

Database

§  A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex, they are often developed using formal design and modeling techniques.

§  The database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data.

§  The DBMS software additionally encompasses the core facilities provided to administer the database.

§  The sum of the database, the DBMS and the associated applications can be referred to as a "database system".
Database Server

§  A database server is a server which houses a database application that provides database services to other computer programs or to computers, as defined by the client–server model.

§  Database management systems frequently provide database-server functionality, and some (such as MySQL) rely exclusively on the client–server model for database access, while others (e.g. SQLite) are meant to be used as an embedded database.

§  Users access a database server either through a "front end" running on the user's computer – which displays requested data – or through the "back end", which runs on the server and handles tasks such as data analysis and storage.

§  In a master-slave model, database master servers are central and primary locations of data while database slave servers are synchronized backups of the master acting as proxies.
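The master-slave arrangement can be pictured with a toy, in-process sketch. The class names `MasterServer` and `ReplicaServer` are hypothetical, and the synchronous copy on every write is a simplifying assumption, not a description of how any particular database product replicates.

```python
from typing import Dict, List, Optional


class ReplicaServer:
    """Holds a synchronized copy of the master's data and serves reads."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


class MasterServer:
    """Central, primary location of the data; every write goes through it."""

    def __init__(self, replicas: List[ReplicaServer]) -> None:
        self.data: Dict[str, str] = {}
        self.replicas = replicas

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        # Synchronous propagation keeps the example simple; real systems
        # often replicate asynchronously.
        for replica in self.replicas:
            replica.apply(key, value)


replicas = [ReplicaServer(), ReplicaServer()]
master = MasterServer(replicas)
master.write("user:1", "Alice")
print(replicas[0].read("user:1"))
print(replicas[1].read("user:1"))
```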

Database Management System

§  A database management system (DBMS) is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.

§  A DBMS makes it possible for end users to create, read, update and delete data in a database.

§  The DBMS essentially serves as an interface between the database and end users or application programs, ensuring that data is consistently organized and remains easily accessible.
§  The DBMS manages three important things: the data, the database engine that allows data to be accessed, locked and modified, and the database schema, which defines the database's logical structure.

§  These three foundational elements help provide concurrency, security, data integrity and uniform administration procedures.

§  Typical database administration tasks supported by the DBMS include change management, performance monitoring/tuning and backup and recovery.
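To make the create, read, update and delete operations concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `users` table and its columns are made up for the example.

```python
import sqlite3

# An in-memory SQLite database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# The schema defines the logical structure of the database.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Create
conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))

# Read
rows = conn.execute("SELECT id, name FROM users").fetchall()
print(rows)

# Update
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Alicia", 1))
updated = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(updated)

# Delete
conn.execute("DELETE FROM users WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(remaining)

conn.close()
```

The application only issues SQL statements; how the rows are laid out, locked and retrieved is left to the database engine.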






Files Vs Databases


Pros of the File System

§  Performance can be better than in a database - If you store large files in the DB, performance may suffer, because even a simple query to retrieve the list of files or a filename will also load the file data if you used SELECT * in your query. In a file system, accessing a file is quite simple and lightweight.

§  Saving and downloading files is much simpler than in a database - A simple "Save As" function will help you out, and downloading can be done by addressing a URL with the location of the saved file.

§  Migrating the data is an easy process - You can just copy and paste the folder to your desired destination while ensuring that write permissions are provided to your destination.

§  It's cost effective - In most cases it is cheaper to expand your web server than to pay for certain databases.
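The SELECT * point from the first item can be illustrated with `sqlite3`. This is only a sketch: the `files` table, the file names and the 1 MB payloads are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

# Two hypothetical files of 1 MB each, stored as BLOBs.
payload = b"\x00" * 1_000_000
conn.executemany(
    "INSERT INTO files (name, content) VALUES (?, ?)",
    [("report.pdf", payload), ("photo.jpg", payload)],
)

# SELECT * drags every BLOB along just to list the files.
everything = conn.execute("SELECT * FROM files").fetchall()
bytes_loaded = sum(len(row[2]) for row in everything)

# Naming only the needed column avoids loading the file contents.
names = [row[0] for row in conn.execute("SELECT name FROM files ORDER BY id")]

print(bytes_loaded)
print(names)
conn.close()
```

So the slowdown comes from the query shape as much as from the storage choice; listing only the columns you need sidesteps it.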

Cons of the File System

§  Loosely packed - A file system offers no ACID (Atomicity, Consistency, Isolation, Durability) guarantees, so nothing ensures that files stay consistent with the records that refer to them. Consider a scenario in which your files are deleted from their location manually or by an attacker.

§  Low security - Since files are saved in a folder to which write permissions must be granted, the file system is prone to security issues and invites trouble such as hacking. It's best to avoid saving in the file system if you cannot afford to compromise on security.

Pros of Database

§  ACID (Atomicity, Consistency, Isolation, Durability) guarantees - These include rollback of an update, which is complicated when files are stored outside the database.

§  Files stay in sync with the database - They cannot be orphaned, which gives you the upper hand in tracking transactions.

§  Security - Storing files in a database is more secure than keeping them in a file system.
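The rollback behaviour mentioned in the first item can be sketched with `sqlite3`. The `accounts` table and the `transfer` helper are hypothetical; the point is only that a failed multi-step update leaves no partial change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok)
print(balances)
conn.close()
```

The credit to `bob` is applied first and then undone when the debit violates the CHECK constraint. Getting the same all-or-nothing behaviour with loose files on disk means writing that bookkeeping yourself.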


Cons of Database

§  You may have to convert the files to blob in order to store them in the database.

§  Database backups - Backups become larger and take longer.

§  Memory use is inefficient - RDBMSs are often RAM-driven, so all data must go through RAM first. Have you ever thought about what happens when an RDBMS must find and sort data? It tracks each data page, even for the smallest amount of data read or written, and it must track whether the page is in memory or on disk, whether it is indexed or physically sorted, and so on.
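Converting a file to a blob, as the first item puts it, usually just means reading its bytes and binding them to a BLOB column. A rough sketch with `sqlite3` follows; the temporary file and the `files` table are stand-ins invented for the example.

```python
import os
import sqlite3
import tempfile

# Write a small hypothetical file so the example has something to store.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"hello, persistence")
    path = tmp.name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB)")

# Convert the file to a BLOB: read its bytes and bind them as a parameter.
with open(path, "rb") as f:
    blob = f.read()
conn.execute("INSERT INTO files VALUES (?, ?)", (os.path.basename(path), blob))

# Reading the column back returns the same bytes.
restored = conn.execute("SELECT content FROM files").fetchone()[0]
print(restored == blob)

conn.close()
os.remove(path)
```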

Data Arrangements

Data warehouse

§  In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that is used for creating analytical reports for workers throughout the enterprise.

§  The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.

§  The typical extract, transform, load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions.

§  The staging layer or staging database stores raw data extracted from each of the disparate source data systems.

§  The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database.

§  The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts.

§  The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support.

§  However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.
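The staging, integration and warehouse layers described above can be sketched in plain Python. Everything here is an assumption made for illustration: the two source systems, the field names and the cleansing rule are invented, and a real warehouse would use databases rather than lists and dictionaries.

```python
from collections import defaultdict

# Extract: raw rows from two hypothetical operational systems.
sales_system = [
    {"product": " Widget ", "amount": "10.50", "date": "2019-04-01"},
    {"product": "gadget", "amount": "n/a", "date": "2019-04-01"},  # bad amount
]
marketing_system = [
    {"product": "WIDGET", "amount": "4.50", "date": "2019-04-02"},
]

# Staging layer: raw data stored as extracted, tagged with its source.
staging = [dict(row, source="sales") for row in sales_system]
staging += [dict(row, source="marketing") for row in marketing_system]


def transform(row):
    """Integration layer: cleanse and normalize one staged row, or drop it."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # cleansing: discard rows with an unusable amount
    return {"product": row["product"].strip().lower(), "amount": amount, "date": row["date"]}


integrated = [r for r in (transform(row) for row in staging) if r is not None]

# Warehouse: a product dimension, a list of facts, and an aggregate fact per product.
product_names = sorted({r["product"] for r in integrated})
product_dim = {name: key for key, name in enumerate(product_names, start=1)}
facts = [
    {"product_key": product_dim[r["product"]], "date": r["date"], "amount": r["amount"]}
    for r in integrated
]

totals = defaultdict(float)
for fact in facts:
    totals[fact["product_key"]] += fact["amount"]

print(product_dim)
print(dict(totals))
```

Keeping the untouched rows in `staging` means the transform step can be rerun with different rules without going back to the source systems.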



Big Data

§  Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

§  Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.

§  Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity.


§  "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fintech, urban informatics, etc.
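One way to get a feel for why more attributes can mean more false discoveries: if you run many independent tests on attributes that have no real effect, the chance that at least one looks significant grows quickly. The sketch below computes that chance. It is a related quantity (the family-wise error probability), not the false discovery rate itself, and it assumes independent tests at a significance level of 0.05; the function name is hypothetical.

```python
def prob_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests with no real effect."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 10, 100):
    print(m, round(prob_any_false_positive(m), 3))
```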

Data warehouse VS Big data

§  Data warehousing has been a common term for the last 10-20 years, whereas big data has been a hot trend for the last 5-10 years.

§  Both hold a lot of data, are used for reporting, and are managed by electronic storage devices.

§  So, a common belief is that big data will replace the older data warehousing very soon. But big data and data warehousing are not interchangeable, as they are used for entirely different purposes.

§  So, let us look at big data and the data warehouse in detail in this post.

Basis for Comparison: Data warehouse vs Big data

Meaning

§  Data warehouse: A data warehouse is mainly an architecture, not a technology. It extracts data from a variety of SQL-based data sources (mainly relational databases) and helps generate analytic reports. By definition, a data repository that is built by one process and used for analytic reports is a data warehouse.

§  Big data: Big data is mainly a technology that rests on the volume, velocity, and variety of the data. Volume is the amount of data coming from different sources, velocity is the speed of data processing, and variety is the number of types of data (it supports mostly all data formats).

Preferences

§  Data warehouse: If an organization wants to make informed decisions (such as understanding what is going on in the corporation, or planning next year based on the current year's performance data), it tends to choose data warehousing, since this kind of report needs reliable, believable data from the sources.

§  Big data: If an organization needs to compare against large volumes of big data that contain valuable information and help it make better decisions (such as how to achieve more revenue, more profitability or more customers), it tends to prefer the big data approach.

Accepted data source

§  Data warehouse: Accepts one or more homogeneous (all sites use the same DBMS product) or heterogeneous (sites may run different DBMS products) data sources.

§  Big data: Accepts any kind of source, including business transactions, social media, and sensor or machine-specific data. The data may or may not come from a DBMS product.

Accepted type of formats

§  Data warehouse: Handles mainly structured data (specifically relational data).

§  Big data: Accepts all types of formats: structured data, relational data, and unstructured data including text documents, email, video, audio, stock ticker data and financial transactions.

Subject-Oriented

§  Data warehouse: A data warehouse is subject-oriented because it provides information on a specific subject (such as a product, customers, suppliers, sales or revenue) rather than on the organization's ongoing operations. It focuses mainly on analyzing or displaying data that helps decision making.

§  Big data: Big data is also subject-oriented. The main difference is the source of the data, as big data can accept and process data from all sources, including social media and sensor or machine-specific data. It also mainly aims to provide precise, subject-oriented analysis of the data.

Time-Variant

§  Data warehouse: The data collected in a data warehouse is identified by a time period, as it mainly holds historical data for analytical reports.

§  Big data: Big data has many approaches for identifying already loaded data, and a time period is one of them. As big data mainly processes flat files, archiving with a date and time is the best approach for identifying loaded data. But it can also work with streaming data, so it does not always hold historical data.

Non-volatile

§  Data warehouse: Previous data is never erased when new data is added. This is one of the major features of a data warehouse. As it is entirely separate from an operational database, changes to an operational database do not directly affect the data warehouse.

§  Big data: Again, previous data is never erased when new data is added. It is stored as a file that represents a table. In the case of streaming, however, Hive or Spark is sometimes used directly as the operating environment.

Distributed File System

§  Data warehouse: Processing huge amounts of data in a data warehouse is time-consuming, and sometimes takes an entire day to complete.

§  Big data: This is one of the big advantages of big data. HDFS (Hadoop Distributed File System) is mainly designed to load huge data into distributed systems using map reduce programs.

