SQL vs. NoSQL for Data Science

Data come in a variety of forms, at different paces, and at different volumes. And even if all three criteria define the difference between SQL and NoSQL, all three are still irrelevant for data science.

My thesis is that no matter the shape, size, frequency, value, or trustworthiness of the data, the SQL way of presenting data is still the number one player.

Before you all jump in and start writing comments, hold on and keep reading.

Technology

SQL databases, as we have known them and as we know them today, have improved over the years in terms of velocity, performance, volume, and frequency, and to this day they keep the relational model, ACID guarantees, isolation levels, and so on. All these fixed, rigid, complex tendencies (if you would like to call them that) have not only kept these databases number one, but have also provided business, the digitalization wave, and cloud computing with structure and data visibility.

NoSQL database technologies, on the other hand, are described by many as flexible: they can scale horizontally, give better query performance thanks to data denormalization, and run parallel computations.

I cannot argue with parallel computation and horizontal scaling. But with the rest, I have big issues.

Offerings in Microsoft Azure

Let’s take a look at the different offerings, comparing SQL and NoSQL:

Feature | SQL | NoSQL
Storage | Predefined tables | Key-value pair storage, document storage, columnstore storage, graph-based storage
Data Model | Tables of rows and columns with relations between other tables, related data stored separately, joined to form complex queries | Stores data depending on database type: key-value, documents, columnar type, and graph databases
Example | SQL Server, Azure SQL Server, Azure Database | Azure Cosmos DB (Table API, Cassandra API, Graph API), HBase in HDInsight, Azure Cache for Redis, Azure Table Storage, Azure Blob Store, Azure Data Lake Store, Azure File Store, Azure Time Series Insights
Business Purpose | General purpose systems, CRM, Accounting, Finance, Human resources, Planning, Inventory management, Transactional management system | Mobile apps, IoT apps, Real-time data stream and analytics, Content management
Scale | Vertically by increasing server load | Horizontally by sharding across multiple
Analytics | SQL, SQL with R, Python, and Java, Analysis Services with Data Mining, Integration Services with data profiling, Azure Machine Learning | Spark, Scala, Python, R, Hive, SQL, .NET Core, Azure Machine Learning for Python, Stream Analytics

Analytical hiccups and trade-offs with NoSQL

Horizontal scaling and parallel computations are great. They both provide elasticity of resources and faster analytical results. But when delivering data to the analytics department, the ends must meet.
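To make horizontal scaling and parallel computation a bit more tangible, here is a minimal sketch of hash-based sharding in plain Python. It is only an illustration under my own assumptions: the function names (`shard_for`, `distribute`, `sum_over_shards`), the `device_id` key, and the three shards are hypothetical, and real NoSQL engines use their own partitioning schemes.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a record key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def distribute(records, key_field, num_shards):
    """Place each record on the shard chosen by its key."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(str(record[key_field]), num_shards)].append(record)
    return shards

def sum_over_shards(shards, value_field):
    """Each shard yields a partial sum on its own; the partials are then combined."""
    partials = [sum(r[value_field] for r in shard) for shard in shards]
    return sum(partials)

# Hypothetical IoT readings, spread over three shards.
readings = [{"device_id": f"dev-{i}", "value": i} for i in range(10)]
shards = distribute(readings, "device_id", 3)
total = sum_over_shards(shards, "value")
```

The partial sums are computed one after another here; on a real cluster each shard would do its part on a separate node, which is where the speed-up comes from.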

Strong consistency

You have probably heard the phrase “from eventual to strong consistency”. This simply means that the schema is defined on read, and when a document, key-value pair, binary file, or columnar file is copied between different transformation zones, it will eventually become consistent with the schema, ACID rules, and integrity constraints. But with eventually consistent data, you gain a fast response (analytics) at the cost of potential errors or stale data. And delivering this type of data to the data science department will always result in back-and-forth communication, full of nagging questions and the clearing up of many data inconsistencies.
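A small sketch of what “schema on read” can look like in practice, using only the Python standard library. The schema, the field names, and the sample documents are made up for illustration; the point is only that the inconsistencies surface at read time, on the consumer's side.

```python
import json

# Hypothetical schema the analytics side expects: field name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer": str, "amount": float}

def read_with_schema(raw_lines, schema):
    """Apply the schema at read time; return (clean_rows, rejected)."""
    clean, rejected = [], []
    for line in raw_lines:
        doc = json.loads(line)
        try:
            row = {field: cast(doc[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError) as exc:
            # Missing field or a value that cannot be cast to the expected type.
            rejected.append((doc, repr(exc)))
        else:
            clean.append(row)
    return clean, rejected

raw = [
    '{"order_id": 1, "customer": "Ana", "amount": 19.9}',
    '{"order_id": "2", "customer": "Bor", "amount": "5"}',   # strings, still castable
    '{"order_id": 3, "customer": "Cene"}',                   # amount is missing
    '{"order_id": 4, "customer": "Dana", "amount": "n/a"}',  # amount is not a number
]
clean, rejected = read_with_schema(raw, EXPECTED_SCHEMA)
```

Two of the four documents end up in `rejected`, and those are exactly the rows that would otherwise come back from the data science department as questions.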

Data transformation

Keeping data in its original format is great; transforming it into a readable dataset is a big trade-off for data architects and data engineers, and an absolute must for data science.

Faster query performance can be achieved in NoSQL because there are no complex query joins. Denormalized data means that all transactional data, along with all dimensional data (names, explanations), are included in each record. This makes datasets inadvertently large, but with cheap storage and scalability that is not a problem. The problem lies with data consistency, updates and, well, ACID. And with its orchestration. Making shards and copies of data every time something is changed, updated, or deleted can be tedious and invisible, and can cause many storage issues. But all orchestration issues are also solvable with the right software and some coding.
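The trade-off can be sketched with Python's built-in `sqlite3` module standing in for a relational database and a plain list of dictionaries standing in for a document store. The tables, names, and values are hypothetical; this is an illustration of the update problem, not a benchmark.

```python
import sqlite3

# Normalized (SQL) side: a customer name lives in exactly one row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Bor');
    INSERT INTO orders VALUES (10, 1, 19.9), (11, 1, 5.0), (12, 2, 7.5);
""")

# Denormalized (document) side: every order carries its own copy of the name.
documents = [
    {"order_id": 10, "customer": {"id": 1, "name": "Ana"}, "amount": 19.9},
    {"order_id": 11, "customer": {"id": 1, "name": "Ana"}, "amount": 5.0},
    {"order_id": 12, "customer": {"id": 2, "name": "Bor"}, "amount": 7.5},
]

# Renaming one customer changes a single row on the normalized side...
conn.execute("UPDATE customer SET name = 'Ana K.' WHERE id = 1")

# ...but every embedded copy must be found and rewritten on the document side.
touched = 0
for doc in documents:
    if doc["customer"]["id"] == 1:
        doc["customer"]["name"] = "Ana K."
        touched += 1

# The join is the price paid at query time for storing the name only once.
joined = conn.execute("""
    SELECT o.id, c.name, o.amount
    FROM orders AS o
    JOIN customer AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```

Reading the documents needs no join, which is where the faster queries come from, but the rename had to touch two documents instead of one row. If one copy is missed, the dataset is inconsistent.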

Data usage and delivery

Data scientists love Python's pandas and NumPy, R's dplyr and data.table, Spark's DataFrames and Datasets, and Julia's DataFrames.jl. You get the picture. All are column- and row-based. In other words, all NoSQL data are delivered as SQL-typed or columnar-typed data. Even graph data (with edges and vertices) are transformed into this format.
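Here is a rough, standard-library-only sketch of that delivery step: nested documents are flattened into columns and rows, and a graph is turned into an edge list. In pandas one would typically reach for `pandas.json_normalize` instead; the helper names (`flatten`, `to_table`) and the sample data below are my own assumptions.

```python
def flatten(doc, parent="", sep="."):
    """Flatten nested dicts into one level, using dotted column names."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def to_table(docs):
    """Turn heterogeneous documents into (columns, rows); gaps become None."""
    flat_docs = [flatten(d) for d in docs]
    columns = sorted({col for d in flat_docs for col in d})
    rows = [tuple(d.get(col) for col in columns) for d in flat_docs]
    return columns, rows

# Hypothetical documents with slightly different shapes.
docs = [
    {"id": 1, "device": {"type": "sensor", "site": "LJ"}, "value": 21.5},
    {"id": 2, "device": {"type": "camera"}, "value": 3.0},
]
columns, rows = to_table(docs)

# A graph is delivered the same way: one row per edge (source, target, weight).
graph = {"a": {"b": 1.0, "c": 2.5}, "b": {"c": 0.5}}
edge_rows = [(src, dst, w) for src, nbrs in graph.items() for dst, w in nbrs.items()]
```

The second document has no `device.site`, so that cell becomes `None`. That is the tabular equivalent of a NULL, and it is one more thing to settle before the data reach the data science department.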

Flexibility

I tend to calculate the flexibility of NoSQL data through a “time to market” KPI. How fast you can add a new type of data (an image, an altered schema, a new KPI) or a new change to the data and deliver it to the data science department is key to flexibility. There are certainly other key factors regarding flexibility as well, for example deployment and operations, replication, or even availability.

Final thoughts

I am happy that NoSQL concepts, technologies, and statistical approaches have penetrated the world of data science. They have not only helped develop new ways of calculation and improve and develop new algorithms, but have also opened new ways of analyzing formats, bringing them to the community faster than we had envisioned. But there are caveats to these concepts. Data still need to be cleaned, harmonized, and consolidated in order to bring consistency and accuracy through transformation, wrangling, and orchestration. And these processes must (!) not be neglected, overlooked, or underestimated. If they are, the complete NoSQL paradigm will go down the drain. And this will cost the company precious resources and energy, and cause unhappiness.

This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.
