Data come in a variety of forms, at different paces, and at different volumes. And even if these three criteria define the difference between SQL and NoSQL, all three are still irrelevant for data science.
My theorem is that, no matter the shape, size, frequency, value, and trustworthiness of the data, the SQL way of presenting data is still the number one player.
Before you all jump in and start writing comments, hold on and continue reading.
Technology
SQL databases, as we have known them and as we know them today, have improved over the years in terms of velocity, performance, volume, and frequency, and to this day keep the relational model, ACID guarantees, isolation levels, and so on. All of these tendencies, call them fixed, rigid, or complex if you like, have not only kept these databases number one, but have provided the business world, the digitalization wave, and cloud computing with structure and data visibility.
NoSQL database technologies are, on the other hand, as described by many, flexible: they can scale horizontally, give better query performance thanks to denormalized data, and can run parallel computations.
I cannot argue with parallel computation and horizontal scaling. But with the rest, I have big issues.
Offerings in Microsoft Azure
Let’s take a look at the different offerings and compare SQL and NoSQL:
| Feature | SQL | NoSQL |
|---|---|---|
| Storage | Predefined tables | Key-value pair storage, Document Storage, Columnstore storage, Graph based storage |
| Data Model | Tables of rows and columns with relations between other tables, related data stored separately, joined to form complex queries | Stores data depending on database type: Key-value, documents, columnar type and graph databases |
| Example | SQL Server, Azure SQL Server, Azure Database | Azure Cosmos DB ( Table API, Cassandra API, Graph API), HBase in HDInsight, Azure Cache for Redis, Azure Table Storage, Azure Blob Store, Azure Data Lake store, Azure File Store, Azure Time Series Insights |
| Business Purpose | General purpose systems, CRM, Accounting, Finance, Human resources, Planning, Inventory management, Transactional management System | Mobile Apps, IoT Apps, Real-time data Stream and Analytics, Content management |
| Scale | Vertically by increasing server capacity | Horizontally by sharding across multiple servers |
| Analytics | SQL, SQL with R, Python, and Java, Analysis Services with Data Mining, Integration Services with data profiling, Azure Machine Learning | Spark, Scala, Python, R, Hive, SQL, .NET Core, Azure Machine Learning for Python, Stream Analytics |
Analytical hiccups and trade-offs with NoSQL
Horizontal scaling and parallel computations are great. They both provide elasticity of resources and faster analytical results. When delivering data to the analytical department, the ends must meet.
Strong consistency
You have probably heard the phrase “from eventual to strong consistency”. With eventual consistency, the schema is defined on read, and when a document, key-value pair, binary file, or columnar file is copied between different transformation zones, it only eventually becomes consistent with the schema, ACID rules, and integrity constraints. With eventually consistent data, you gain fast responses (analytics) at the cost of potential errors or stale data. And delivering this type of data to the data science department will always result in back-and-forth communication, full of nagging questions and clearing up many data inconsistencies.
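To make the stale-data cost concrete, here is a minimal toy sketch (the class and replica names are hypothetical, not any real database API) of how an eventually consistent read can lag behind a write:

```python
import time

# Toy model of two replicas: writes land on the primary immediately and
# propagate to the secondary only after a replication delay.
class EventuallyConsistentStore:
    def __init__(self, replication_delay=0.2):
        self.primary = {}
        self.secondary = {}
        self.pending = []  # (apply_at, key, value) replication events
        self.delay = replication_delay

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((time.monotonic() + self.delay, key, value))

    def read_secondary(self, key):
        # Apply replication events whose time has come, keep the rest pending.
        now = time.monotonic()
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.secondary[k] = v
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.secondary.get(key)

store = EventuallyConsistentStore(replication_delay=0.2)
store.write("order_42", "shipped")
print(store.read_secondary("order_42"))  # None: the replica is still stale
time.sleep(0.3)
print(store.read_secondary("order_42"))  # "shipped": eventually consistent
```

The first read is exactly the kind of stale answer that triggers the back-and-forth with the data science department described above.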
Data transformation
Keeping data in its original format is great; transforming it into a readable dataset is a big trade-off for data architects and data engineers, and a huge must for data science.
Faster query performance can be achieved in NoSQL without complex query joins. Denormalized data means that all transactional data, along with all dimensional data (names, explanations), are included in the same record. This makes datasets inadvertently large, but with cheap storage and scalability, that is not a problem. The problem lies with data consistency, updates, well, ACID. And with its orchestration. Making shards and copies of the data every time something is changed, updated, or deleted can be tedious, invisible, and result in many storage issues. But all orchestration issues are also solvable with the right software and some coding.
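A small pandas sketch (with made-up sample data) of the trade-off: the denormalized form needs no join at read time, but a change to a single dimension value must then be propagated to every copy:

```python
import pandas as pd

# Normalized: facts reference dimensions by key; a join is needed per query.
sales = pd.DataFrame({"product_id": [1, 1, 2], "qty": [3, 5, 2]})
products = pd.DataFrame({"product_id": [1, 2], "name": ["pen", "book"]})
joined = sales.merge(products, on="product_id")

# Denormalized: the product name is copied into every sales row,
# so reads need no join at all...
denorm = joined.copy()

# ...but renaming one product now means updating every copy of that name.
# This is the orchestration cost mentioned above.
denorm.loc[denorm["product_id"] == 1, "name"] = "gel pen"
print(denorm["name"].tolist())  # ['gel pen', 'gel pen', 'book']
```

In a sharded store, that last update would have to run on every shard and replica holding a copy, which is exactly where the invisible tedium creeps in.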
Data usage and delivery
Data scientists love Python’s pandas and NumPy, R’s dplyr and data.table, Spark’s DataFrames and Datasets, and Julia’s DataFrames.jl. You get the picture. All are column and row based. In other words, all NoSQL data is ultimately delivered as SQL-typed or columnar-typed data. Even graph data (with edges and vertices) is transformed into this format.
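As a small illustration of that delivery step (the documents below are invented sample data), pandas’ `json_normalize` flattens nested NoSQL-style documents straight into the rows-and-columns shape every one of these libraries expects:

```python
import pandas as pd

# Hypothetical document-store records with a nested sub-document.
docs = [
    {"id": 1, "user": {"name": "Ana", "country": "SI"}, "events": 4},
    {"id": 2, "user": {"name": "Bo", "country": "DE"}, "events": 7},
]

# Nested fields become dotted column names in a flat, tabular DataFrame.
flat = pd.json_normalize(docs)
print(sorted(flat.columns))  # ['events', 'id', 'user.country', 'user.name']
```

The documents were schemaless on write; the moment an analyst touches them, they are rows and columns again.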
Flexibility
I tend to calculate the flexibility of NoSQL data through the “time to market” KPI. How fast you can add a new type of data (an image, a schema alteration, a new KPI) or a new change to the data and deliver it to the data science department is key to flexibility. There are, for sure, other key factors regarding flexibility as well, for example deployment and operations, replication, or even availability.
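The flexibility and its hidden cost can both be seen in one short sketch (field names are invented): a new field simply appears in later documents with no migration, but the data science department inherits the resulting gaps:

```python
import pandas as pd

# Schema-on-read: "channel" appears only in newer documents.
# Nothing had to be migrated to add it, which is the time-to-market win.
old_docs = [{"user": "ana", "clicks": 10}]
new_docs = [{"user": "bo", "clicks": 7, "channel": "mobile"}]

df = pd.DataFrame(old_docs + new_docs)

# The win has a price: older records now carry missing values that
# someone downstream must explain, impute, or filter out.
print(df["channel"].isna().sum())  # 1
```

The ALTER TABLE that a SQL database would have forced up front has not disappeared; it has moved downstream as a data cleaning task.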
Final thoughts
I am happy that NoSQL concepts, technologies, and statistical approaches have penetrated the world of data science. Not only do they help develop new ways of calculation and improve and create new algorithms, but they have also opened up new formats for analysis, bringing them to the community faster than we had envisioned. But there are caveats to these concepts. Data still need to be cleaned, harmonized, and consolidated in order to bring consistency and accuracy through transformation, wrangling, and orchestration. And these processes must (!) not be neglected, overlooked, or underestimated. If they are, the complete NoSQL paradigm will go down the drain, and that will cost the company precious resources and energy, and cause unhappiness.
[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]