Data Processing Platforms: What They Are and Whether Everyone Needs Them

Companies of all kinds process data, that is, extract value from it. Even a chain of family hairdressers in the neighborhood can keep records in Excel and use it as a CRM system. Did the data give you a list of clients who have not come in for a haircut for a long time? It’s time to send them an SMS with an “individual” discount.
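As a rough illustration of that "who has not come in for a long time" list, here is a minimal Python sketch. The visit log, the client names and the 60-day threshold are all assumptions made up for the example; in practice the records would come from the spreadsheet.

```python
from datetime import date, timedelta

# Hypothetical visit log: (client name, date of visit).
visits = [
    ("Anna", date(2024, 1, 10)),
    ("Boris", date(2024, 5, 2)),
    ("Anna", date(2024, 2, 14)),
    ("Clara", date(2024, 4, 28)),
]

def lapsed_clients(visits, today, max_gap_days=60):
    """Return clients whose most recent visit is older than max_gap_days."""
    last_visit = {}
    for client, day in visits:
        # Keep only the latest visit date per client.
        if client not in last_visit or day > last_visit[client]:
            last_visit[client] = day
    cutoff = today - timedelta(days=max_gap_days)
    return sorted(c for c, day in last_visit.items() if day < cutoff)

# Clients to send the "individual" discount to.
print(lapsed_clients(visits, today=date(2024, 5, 15)))
```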

At what point should a business set up an entire data processing platform? Is data processing always about big data?

From Excel to ML – Data Analytics Maturity Levels

At the beginning of the text, we mentioned a family barbershop. It is a good example for continuing the conversation about what is happening in the world of data analytics in general. For the rest of the story, let it be a chain of barbershops called “Bearded SysAdmin”.

Below is a graph of the maturity of analytical systems, based on Gartner’s classification. It has four levels. Not every company evolves linearly from start to finish. Some jump straight to levels 3-4. The main thing is to have the necessary resources (money and specialists) as well as relevant business tasks. And some companies will stay on Excel spreadsheets and a simple BI system for their entire existence. This is also normal.

First level: descriptive

We combine the first three points into one block: raw data, cleaned data, and standard reports. This is the most basic level of working with data, and it is most often done in Google Sheets or Excel.

So, our barbershop began to collect data on clients who come for a haircut and to count visits. The administrator enters information manually, and some data is pulled from the registration form on the site. The manager can remove duplicates, correct errors made during registration, and even structure the data by the number and variety of services provided per month.

Based on this, you can produce regular reports: find out whether the number of customers is growing month over month, or which brought in more income over the summer, beard trims or haircuts.
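The cleanup and the regular report described here could be sketched in a few lines of Python. The record layout, names and prices are hypothetical, and "cleaning" is reduced to normalizing names and dropping exact duplicates; real registration data would need more care.

```python
from collections import defaultdict

# Hypothetical raw records: (client, "YYYY-MM-DD", service, price).
raw = [
    ("Anna", "2024-06-03", "haircut", 30),
    ("anna ", "2024-06-03", "haircut", 30),  # duplicate with a messy name
    ("Boris", "2024-06-10", "beard", 20),
    ("Boris", "2024-07-01", "haircut", 30),
    ("Clara", "2024-07-15", "beard", 20),
]

def clean(records):
    """Normalize client names and drop exact duplicates, preserving order."""
    seen, result = set(), []
    for client, day, service, price in records:
        row = (client.strip().title(), day, service, price)
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

def monthly_revenue(records):
    """Sum revenue per (month, service) pair."""
    totals = defaultdict(int)
    for _, day, service, price in records:
        totals[(day[:7], service)] += price  # day[:7] is "YYYY-MM"
    return dict(totals)

report = monthly_revenue(clean(raw))
for (month, service), revenue in sorted(report.items()):
    print(month, service, revenue)
```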

This data answers the question: what happened? Based on it, hypotheses can be formulated and decisions made, for the most part manually and through the cognitive effort of the manager.

Analytics formats such as ad hoc reports and OLAP also belong to this level. Ad hoc reports are made for a specific business request. Most often this is something non-standard that the usual reporting does not cover. For example, the manager of “Bearded SysAdmin” is asked to find out how many sales over three months came from the cohort of bald but bearded visitors (broken down by day).
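An ad hoc request like this one usually comes down to a one-off filter plus a grouping. The sketch below assumes, purely for illustration, that each sale record carries "bald" and "bearded" flags; the field names and data are invented.

```python
from collections import Counter

# Hypothetical sales records; the "bald" and "bearded" flags are assumed fields.
sales = [
    {"day": "2024-06-01", "bald": True, "bearded": True, "amount": 20},
    {"day": "2024-06-01", "bald": False, "bearded": True, "amount": 30},
    {"day": "2024-06-01", "bald": True, "bearded": True, "amount": 25},
    {"day": "2024-06-02", "bald": True, "bearded": True, "amount": 20},
    {"day": "2024-06-02", "bald": True, "bearded": False, "amount": 15},
]

def cohort_sales_by_day(sales, start, end):
    """Count sales per day for bald-but-bearded visitors within [start, end]."""
    counts = Counter()
    for s in sales:
        # ISO-formatted dates compare correctly as plain strings.
        if s["bald"] and s["bearded"] and start <= s["day"] <= end:
            counts[s["day"]] += 1
    return dict(sorted(counts.items()))

print(cohort_sales_by_day(sales, "2024-06-01", "2024-08-31"))
```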

Second level: diagnostic

This level is home to so-called self-service analytics (self-service BI). It implies that specialists of different profiles, not just data analysts, can run queries on the data they need and generate summary reports. This approach also shows up in the use of BI systems such as Power BI, Qlik or Tableau. The dashboards in them, however, are usually configured by data specialists.

Here the data answers the question: why did this happen? It does not just describe the current state of the company but serves as a source of analytical conclusions. For example, the revenue of “Bearded SysAdmin” doubled compared to the previous month. The data shows that this happened thanks to several advertising posts on Telegram about the barbershop’s promotion.

At this level, a company can move from Excel spreadsheets to Python scripts and SQL queries. The team also can no longer do without one or two data analysts.
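To give a feel for the "Python scripts and SQL queries" step, here is a small sketch that uses Python's built-in sqlite3 module to break revenue down by acquisition channel and see where the month-over-month growth came from. The table layout, channel names and numbers are assumptions chosen to mirror the doubled-revenue example.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-05", "walk-in", 1000),
        ("2024-05", "telegram", 0),
        ("2024-06", "walk-in", 1100),
        ("2024-06", "telegram", 900),
    ],
)

def revenue_by_channel(conn, month):
    """Return {channel: total revenue} for one month."""
    rows = conn.execute(
        "SELECT channel, SUM(amount) FROM sales "
        "WHERE month = ? GROUP BY channel ORDER BY channel",
        (month,),
    ).fetchall()
    return dict(rows)

may = revenue_by_channel(conn, "2024-05")
june = revenue_by_channel(conn, "2024-06")

# Growth per channel shows which channel explains the increase.
growth = {channel: june[channel] - may.get(channel, 0) for channel in june}
print(growth)
```

In this made-up data set total revenue doubles, and almost all of the increase sits in the "telegram" channel, which is the kind of conclusion the diagnostic level is about.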

Why switch to more complex tools at all?

The reasons may differ for each specific company:

  • The amount of work with data has increased. The company has started not only to calculate profits and expenses for the month, but also to collect data on marketing activities, track customer churn, and so on. Producing dozens of new Excel tables becomes impractical: it is easy to get lost in them and hard to find correlations between events.
  • There is a need for automation. Employees spend a lot of time collecting data manually. They could devote this time to work that is more useful for business growth.
  • The quality of the data needs to improve. The less a process is automated, the more room there is for human error. Some data may stop being collected or may be entered with errors. Automation and BI systems help to “clean up” data better and to find new directions for analytics.
  • The number of analysts has increased. For example, the company has begun to expand into several regions. Each region has its own analyst, but the data needs to be brought together in one place. To unify tools and approaches, you can use a single BI system and a common repository (or at least a database).

Third level: predictive and prescriptive

At this level, work begins with more complex concepts: predictive and prescriptive analytics.

In the first case, the data answers the question: what will happen next? For example, you can predict the growth of revenue or of the customer base over the next six months. Here an ML model can form the basis of the analysis.
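The simplest possible version of such a forecast is a straight-line trend fitted by least squares. The sketch below is only that: a hand-rolled linear fit on made-up monthly revenue figures, standing in for whatever ML model a real team would choose.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast(ys, months_ahead):
    """Extrapolate the fitted trend months_ahead past the last observation."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + months_ahead)

# Hypothetical monthly revenue for the last six months.
revenue = [100, 110, 120, 130, 140, 150]
print(forecast(revenue, months_ahead=6))
```

A linear trend ignores seasonality and one-off promotions, so it is a baseline to compare better models against, not a recommendation.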

Prescriptive analytics answers the question: what should be optimized? For example, the data may show that for barbershop revenue to grow by 60%, the advertising budget needs to increase by 15%.

At this stage, we are no longer talking about several analysts, but about a whole team that can work in several business areas. Typically, at this point, companies need data platforms.

Fourth level

This level is an autonomous analytics system based on artificial intelligence. Here the machine proposes a presumably correct decision based on the results of big data analysis, and a person makes the final decision.

Banks can use similar systems, for example credit scoring systems for issuing loans. And our barbershop can use lead scoring, a technique for assessing how ready the customers in a database are to purchase the company’s products.
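To make the idea of lead scoring concrete, here is a deliberately simple rule-based sketch. It is not an AI system: the signals and weights are invented for illustration, whereas a system at this maturity level would learn them from historical data. The machine ranks the customers, and a person still decides whom to contact.

```python
# Hypothetical signals and weights; a real system would learn them from data.
WEIGHTS = {
    "visits_last_90_days": 10,
    "opened_last_sms": 15,
    "bought_product_before": 25,
}

def lead_score(customer):
    """Weighted sum of customer signals, capped at 100."""
    score = sum(weight * customer.get(signal, 0) for signal, weight in WEIGHTS.items())
    return min(score, 100)

customers = {
    "Anna": {"visits_last_90_days": 3, "opened_last_sms": 1, "bought_product_before": 1},
    "Boris": {"visits_last_90_days": 1, "opened_last_sms": 0, "bought_product_before": 0},
}

# Rank customers from most to least ready to buy.
ranked = sorted(customers, key=lambda name: lead_score(customers[name]), reverse=True)
print(ranked)
```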

Are the third and fourth levels only for big data?

The short answer is no.

The amount of data is not as important as the tasks facing the company

Of course, the more data, the more representative the results. But arguing along the lines of “I have a database of only a million people, all this platform processing is not for me” is also wrong.

There may not be much data, but it can be very diverse: recordings of conversations with clients, footage from surveillance cameras, user images, etc. All of it needs to be stored systematically so that the company can extract valuable knowledge from it and apply that knowledge to business tasks.

The amount of data is not as important as the number of analysts and analytics teams

If a company has several analytics teams in different business areas, this leads to problems. Teams can use the same data source but different analytics tools and different storage. Sometimes they analyze the same thing twice or calculate the same indicator differently, which is wasteful. A newly added analytics team runs the risk of duplicating work that has already been done.

The heterogeneity of analytical pipelines also delays the response to business requests. A product manager asks for a fix to the product revenue dashboard and receives it only 1.5 months later.

As the complexity of analytics tasks and the number of analysts grow, companies start looking at data platforms. These provide a common base and shared conventions: which tools to use and how to take data from sources, where to put it, and how to organize storage.
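The "same indicator calculated differently" problem, and the kind of shared convention that addresses it, can be shown with a toy example. Everything here is hypothetical: two teams compute an average check with different rules, and a shared registry (the `METRICS` dictionary, an invented name) records the single agreed definition.

```python
def average_check_team_a(orders):
    """Team A's version: mean over all orders, refunded ones included."""
    return sum(o["amount"] for o in orders) / len(orders)

def average_check_team_b(orders):
    """Team B's version: mean over non-refunded orders only."""
    paid = [o for o in orders if not o["refunded"]]
    return sum(o["amount"] for o in paid) / len(paid)

# Hypothetical shared registry: one agreed definition per metric name.
METRICS = {"average_check": average_check_team_b}

orders = [
    {"amount": 30, "refunded": False},
    {"amount": 50, "refunded": False},
    {"amount": 10, "refunded": True},
]

print(average_check_team_a(orders))      # the two teams disagree ...
print(average_check_team_b(orders))
print(METRICS["average_check"](orders))  # ... the registry settles it
```

Which definition is "right" is a business decision. The point of the sketch is only that it is made once and reused.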

What are data processing platforms?

In general, a data platform is a set of integrated tools that allow companies to do regular and reproducible data analytics.

The set of tools can vary, but they fit into roughly the same pipeline for working with data:

  1. Sources. The whole set of data sources – from simple files and relational databases to SaaS solutions that collect any information that is potentially useful for business.
  2. Data processing and transformation. This is where ETL or ELT tools come into play. Data is taken from the source, transformed if necessary, and sent to the repository. Tools such as Apache Spark, Kafka, Airflow can be involved here.
  3. Storing data in a format suitable for further work with it. The most popular tools for this are Greenplum, ClickHouse, Vertica, and tools from the Hadoop ecosystem.
  4. Data analysis itself, descriptive and/or predictive. SQL, Python, or any other language can be used as a tool here.
  5. Data output/visualization for end users. Most often, some kind of BI system adopted by the company (Power BI, Qlik, Tableau, Apache Superset or their analogues).
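The five stages above can be mimicked end to end in miniature with nothing but the Python standard library. This is a toy sketch, not how a production platform is built: an inlined CSV string stands in for the source, plain functions stand in for ETL tooling, an in-memory SQLite database stands in for the warehouse, and `print` stands in for the BI layer. The column names and branch names are invented.

```python
import csv
import io
import sqlite3

# 1. Source: a hypothetical CSV export, inlined to keep the sketch self-contained.
SOURCE_CSV = """branch,service,amount
North,haircut,30
north,beard,20
South,haircut,25
South,haircut,
"""

def extract(text):
    """Read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """2. Normalize branch names, drop rows without an amount, cast types."""
    return [
        (r["branch"].strip().title(), r["service"], float(r["amount"]))
        for r in rows
        if r["amount"].strip()
    ]

def load(conn, rows):
    """3. Store the cleaned rows in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (branch TEXT, service TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def analyze(conn):
    """4. Descriptive analysis: total revenue per branch."""
    return conn.execute(
        "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
result = analyze(conn)

# 5. Output for end users (a BI dashboard in a real platform).
for branch, total in result:
    print(f"{branch}: {total:.2f}")
```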

How to build a data platform

Here we return to our “Bearded SysAdmin”. It’s hard to imagine a barbershop that needs a data processing platform, but we have already taken the example this far. Its branches are open in 6 regions of the country and 30 cities. It has also launched online courses on beard care at home and a platform for barbers with a system of personal accounts.

In short, there is a lot of data and plenty of requests tied to business growth, and the analytics teams cannot cope. What are the options?

Build it ourselves, from scratch

This is the most difficult option to implement, but it cannot be ruled out completely. In this case, the company needs to hire expensive specialists from the market: DevOps or data engineers. And hope they manage without a data architect (or hire one too).

It will also be necessary to rent or purchase infrastructure for the platform. You will need fast servers and good bandwidth. If the infrastructure is on-premises, the servers will of course also need to be maintained (plus shift engineers on the technical team for 24/7 support).

The entire set of software selected for the platform will need to be configured and made to work together so that data processing runs as autonomously and reliably as possible. In fact, there is no industry standard, and there are very few ready-made instructions.

In short, the project is large-scale: you need to invest a lot of money in something that will bring no profit until the “construction” is finished, or for some time afterwards. And the work will take several months at best.

This does not suit our barbershop. It lacks the necessary specialists, it has no IT brand to attract good ones, and it needs the profit from data analysis as quickly as possible.

We need to look for something closer to ready-made. What are the options?

Go to a cloud provider

Foreign companies, which are often cloud native, share one common scenario. When they need a data processing platform, they go to one of the popular foreign clouds (for example, AWS, Google Cloud or Azure) and assemble a system there from separate “cubes”.

These providers have a lot of products, and you can find a suitable “boxed” solution for each of the pipeline stages discussed above. The “cubes”, however, will also need to be linked together, either by the company’s own cloud architects or through a corresponding managed service from the provider.

Purchase a ready-made platform

Another option is to turn to a vendor such as Cloudera, which is currently the only adequate supplier of Hadoop. You can get a ready-made, already assembled platform and even technical support from them. But it will be expensive. Only a well-established enterprise will be able to afford the price tag.

Why companies need data platforms

We have already written a lot about the structure and implementation options of data platforms. Now, in a nutshell, here is how companies can benefit from using them:

  • On prepared, high-quality data you can build recommender systems (relevant for e-commerce and retail). These are what offer you your favorite products at a discount after you order from a delivery service. This way the company increases the average check and upsells its services.
  • The company gets a common toolkit for all its analytics teams: this limits the list of tools in use and saves on hiring new specialists.
  • Data platforms help take analytics to the next level, from descriptive to predictive, and unlock more business-critical insights.
  • In solutions with technical support, the cost of data engineers’ work can be moved from payroll to OPEX.
  • Ready-made platforms reduce the burden on data engineers and data scientists, who no longer spend time setting up the software and making it compatible with the infrastructure.

Data processing platforms are not a must-have for every company, but neither are they some unique tool available only to large and very large companies. They can be a solution for medium-sized businesses that want to grow and see a data-driven approach as the way to do it.

Navid Anjum

Full-stack web developer and founder of Laravelaura. He makes his tutorials as simple as humanly possible and focuses on getting the students to the point where they can build projects independently. https://github.com/NavidAnjum
