Data and analytics workloads: How to choose the right technology & tool

In this framework, every ‘Application’ comes with ‘Functional’ and ‘Non-Functional’ requirements, and these drive the decision about which technology or tool to use.

This article explains the drivers that determine which technology or tool to use for a data and analytics workload. The key areas below each need a suitable technology, and each choice follows from its underlying drivers.

In short:

  • Application requirements drive the “Data Storage” technology choice.
  • The characteristics of the data to be ingested drive the “Data Ingestion” technology choice.
  • Once the “Data Storage” and “Data Ingestion” choices are made, business rules, data quality, and latency drive the “Data Processing” technology choice.

1. Data Store Technologies:

NoSQL is the preferred choice for the following use cases:

  • Applications that require horizontal scaling: for example, a mobile app with a huge number of users, where each user performs read / write operations and the data is sharded
  • Low latency / session state: ad tech and game state
  • Application monitoring and IoT, which involve streaming data ingestion and continuous writes
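To make the sharding idea in the first bullet concrete, here is a minimal sketch of hash-based sharding by user ID. The `ShardedStore` class, the `shard_for` helper, and the in-memory dictionaries standing in for database nodes are all hypothetical names chosen for illustration; a real NoSQL engine handles this routing internally.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store: each dict stands in for one database node."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, user_id: str, value: dict) -> None:
        # Every write for a given user lands on the same shard.
        self.shards[shard_for(user_id, len(self.shards))][user_id] = value

    def get(self, user_id: str):
        # Reads are routed with the same hash, so only one shard is touched.
        return self.shards[shard_for(user_id, len(self.shards))].get(user_id)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user-{i}", {"score": i})

print(store.get("user-42"))
print([len(shard) for shard in store.shards])
```

Because each user's reads and writes touch a single shard, capacity can grow by adding nodes. Note that plain modulo hashing moves most keys when the shard count changes; consistent hashing is a common alternative for that reason.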

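The selection of a NoSQL database depends on the CAP theorem (Consistency, Availability, and Partition Tolerance): when a network partition occurs, a distributed store has to give up either consistency or availability, so the choice follows from which of the two the application can relax.

Many NoSQL DB engines are not limited to being a ‘Key-Value Store’; they can also be used as a ‘Document Store’, ‘Time Series DB’, etc. However, each is known and popular for one primary purpose.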

Relational / SQL supported DB Engines

  • Traditional RDBMS engines for ACID operations, ranging from SQL Server to MySQL and PostgreSQL
  • Analytical database engines that support OLAP use cases: columnar data stores, which store data in a column-oriented model rather than a row-based one
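The difference between row-based and column-oriented storage can be sketched in a few lines. The sample `rows` and the `to_columns` helper below are made up for illustration and do not reflect how any particular engine lays out data on disk.

```python
# Row-based layout: each record is stored together (typical for OLTP).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each column is stored together (typical for OLAP).
columns = to_columns(rows)

# A row store has to visit every record to aggregate one field...
total_row_based = sum(row["amount"] for row in rows)
# ...while a column store reads only the one column it needs.
total_columnar = sum(columns["amount"])

print(total_row_based, total_columnar)
```

Both layouts give the same answer; the columnar one simply touches less data for an aggregate over a single field, which is why it suits analytical queries.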

Distributed File Systems

The Hadoop ecosystem has ‘HDFS’, its core file system, which stores files irrespective of their format and structure.

Another popular term is ‘Data Lake’. It is a concept and may be built from one or more technologies: sometimes a ‘Columnar DW’ plays the role of the Data Lake, and sometimes HDFS does. A Data Lake primarily holds all data in its source format, without any ‘Processing’ applied to it. It requires strong ‘Metadata Management’ to identify and query the required data sets for downstream applications. Microsoft offers an out-of-the-box ‘Data Lake Store’ solution. We shall focus on Data Lakes in separate papers / sections.

2. Data Ingestion Technologies:

Once the ‘Data Store’ choice is made, the data to be ingested needs to be analyzed in terms of its characteristics and quality. The combination of ‘Kind of Data’ and ‘Target Data Store’ determines the right ‘Data Ingestion Technology’ choice. For example, streaming data such as ‘log files’ may need to be ingested and stored in a ‘Wide Column Store’; that pairing narrows down the right ingestion technology.
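As a rough illustration of the log-file example, the sketch below ingests a stream of log lines into a structure shaped like a wide column store: one row per host, one column per timestamp. The log format, the `parse_log_line` and `ingest` functions, and the nested dictionary standing in for the store are assumptions made for this example only.

```python
from collections import defaultdict


def parse_log_line(line: str) -> dict:
    """Parse 'timestamp host level message' (a hypothetical log format)."""
    timestamp, host, level, message = line.split(" ", 3)
    return {"timestamp": timestamp, "host": host, "level": level, "message": message}


def ingest(lines, store) -> int:
    """Consume lines one at a time and write each into the wide-column layout."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing the whole stream
        event = parse_log_line(line)
        # Row key = host, column name = timestamp, cell = level and message.
        store[event["host"]][event["timestamp"]] = f'{event["level"]} {event["message"]}'
        count += 1
    return count


store = defaultdict(dict)  # stand-in for a wide column store
sample_stream = iter([
    "2020-03-23T10:00:00Z web-1 INFO request served",
    "2020-03-23T10:00:01Z web-2 ERROR upstream timeout",
    "",
    "2020-03-23T10:00:02Z web-1 WARN slow response",
])
ingested = ingest(sample_stream, store)

print(ingested, dict(store))
```

The row key and column naming follow from how the data will be queried (here, "all events for a host, ordered by time"), which is the kind of data and target store pairing described above.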

3. Data Processing Technologies:

The ‘Operations’ that need to be performed on the incoming data before / after it is stored in the ‘Data Store’, together with the ‘Performance’ expectations, determine the right technology choice for ‘Data Processing’. A few popular technology choices are ‘Spark’ and ‘MapReduce’.

In the Big Data world, processing streaming data in distributed systems with little to no latency is a key requirement. ‘Streaming Data’ is unbounded: there is no literal end to the incoming data feeds. Below are a few essential parameters to keep in mind before making the technology choice.

  • Delivery Guarantee: incoming streaming data needs to be ingested and processed irrespective of failures. There are three modes: ‘At least once’, ‘Exactly once’ and ‘At most once’
  • Fault Tolerance: in case of failure, processing must resume from the point of failure
  • State Management: the ‘State’ of the data should be persisted so that it survives a failure
  • Performance: consumers should be able to read the processed data in real time / near real time
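The first three parameters can be sketched together with a simple checkpointing loop. This is a toy model, not how any specific engine works: the `Checkpoint` class is an in-memory stand-in for durable storage, the finite `events` list stands in for an unbounded stream, and `fail_at` is a hypothetical switch used only to simulate a crash.

```python
class Checkpoint:
    """In-memory stand-in for durable storage of the offset and the state."""

    def __init__(self):
        self.offset = 0   # index of the next event to process
        self.state = {}   # running count per key


def process_stream(events, checkpoint, fail_at=None):
    """Count events per key, resuming from the last committed offset."""
    state = dict(checkpoint.state)
    for offset in range(checkpoint.offset, len(events)):
        if fail_at is not None and offset == fail_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        # Commit state and offset together, so a restart neither loses
        # nor double-counts an event (an exactly-once effect in this sketch).
        checkpoint.state = dict(state)
        checkpoint.offset = offset + 1
    return state


events = ["a", "b", "a", "c", "a"]
checkpoint = Checkpoint()

try:
    process_stream(events, checkpoint, fail_at=3)  # first run crashes mid-stream
except RuntimeError:
    pass

result = process_stream(events, checkpoint)  # restart resumes at offset 3
print(result)
```

If the offset were committed before the state update, a crash in between would drop an event (‘At most once’); if it were committed after, the event would be replayed on restart (‘At least once’). Committing both together is what gives the exactly-once effect here.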

Based on the above considerations we can further classify the available technologies into two categories

All the technologies mentioned in the three sections above come with a set of libraries and support multiple programming languages from both the ‘Data’ world and the ‘Application Development’ world, e.g. Java, .NET, Go, Scala, SQL, Python, R, etc. Other modules that are important to consider are ‘Management & Coordination’, ‘Resource Management’, ‘Security’ and ‘Scheduling’.

About Technovert:

It all comes back to good data. We’ve got you covered for any data project across verticals. Our comprehensive offering, inclusive of business intelligence consulting services, PowerBI solutions, Data Visualization, and Dashboard Design & Development, along with DBA support services, can help.

Our experts help you discover how to unlock the true potential of your data and take the next step in choosing the right tool and technology for any of your data and analytics workloads.

Originally published on March 23, 2020.
