Modern data systems are becoming increasingly complex, prompting companies to seek unified platforms that can handle everything from raw data storage to advanced analytics and machine learning.
We talked to our expert, Oleksii Zarembovskyi, to learn about Databricks and how it serves as a solution for managing big data across different cloud environments.
Background & experience:
Oleksii has over 6 years of experience in the data engineering field. He leads a team of 15+ Data Engineers, focusing on tackling organisational and technical debt, improving performance, and implementing best practices.
How does Databricks differ from other data platforms (such as Snowflake, BigQuery, or traditional Spark)?
Oleksii Zarembovskyi: Databricks is a platform that provides services for working with big data based on Spark. On top of that, it also offers persistent data storage (Delta Lake), includes a metadata catalogue (Unity Catalog), allows querying data with SQL, and enables applying machine learning models to that data. In other words, it is quite a versatile service that offers many capabilities.
Compared to competitors like Snowflake or BigQuery, which mainly position themselves as analytical data warehouses with SQL-based data access, Databricks allows working with data in multiple programming languages, like Python, Java, Scala, SQL, and others. This means that both developers and business users can interact with the data, which is a significant advantage for collaboration between different roles on a project.
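To make the multi-language point concrete, here is a minimal sketch of the same aggregation expressed first in SQL and then with the PySpark DataFrame API inside a Databricks notebook, where the `spark` session is provided out of the box. The `main.sales.orders` table and its columns are purely illustrative.

```python
# The same (hypothetical) table queried two ways in a Databricks notebook.
from pyspark.sql import functions as F

# SQL-first users (analysts, business users) can query the table directly:
top_orders_sql = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM main.sales.orders
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
""")

# Developers can express the same logic with the PySpark DataFrame API:
top_orders_df = (
    spark.table("main.sales.orders")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.col("total_amount").desc())
    .limit(10)
)

top_orders_df.show()
```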
In addition, the engine driving data transformation in Databricks is Spark itself, along with platform-level optimisations such as Delta Lake and Unity Catalog. Another important advantage is that Databricks is not tied to a specific cloud provider; it can run on any (AWS, Azure, GCP, etc.), unlike, for example, BigQuery, which is tied to Google Cloud. This is critically important in modern solutions, where data can be stored across different clouds, and there is a need to integrate it. Databricks can help achieve that.
In what scenarios does Databricks prove itself as a “universal tool”? Where is it truly convenient and effective?
OZ: As I mentioned earlier, Databricks is essentially a combination of different services in one. It functions as a data platform utilising cloud infrastructure, where data is stored across various cloud providers, including AWS and Azure. This means we have persistent storage for data, and importantly, Databricks provides so-called ACID guarantees. This is especially relevant for financial institutions because it ensures data integrity: the data won’t be lost, you can roll back to a previous version, and there is control over the schema.
All of this is implemented through Delta Lake, which underlies their lakehouse approach. Simply put, it’s a combination of a data lake and a data warehouse: we store both “raw” data in various formats and processed data in a structured form.
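For illustration, here is a hedged sketch of the Delta Lake behaviours mentioned above, as they might look in a Databricks notebook; the `main.raw.events` table name is an assumption.

```python
from pyspark.sql import functions as F

# Writes to a Delta table are ACID transactions; readers never observe partial writes.
events = spark.range(1000).withColumn("event_type", F.lit("click"))
events.write.format("delta").mode("append").saveAsTable("main.raw.events")

# Every commit is versioned, so the change history can be inspected...
spark.sql("DESCRIBE HISTORY main.raw.events").show(truncate=False)

# ...an older version can be queried ("time travel")...
old_snapshot = spark.sql("SELECT * FROM main.raw.events VERSION AS OF 0")

# ...and the table can be rolled back if something went wrong.
spark.sql("RESTORE TABLE main.raw.events TO VERSION AS OF 0")

# Schema is enforced: appending data with an incompatible schema is rejected
# unless schema evolution is explicitly enabled (e.g. the mergeSchema option).
```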
There are several data tiers: bronze for raw data ingested as-is from the sources, silver for cleaned and validated data, and gold for aggregated, business-ready data.
All of this allows different users to work with the same data: an analyst can explore it, a developer can build pipelines, all in one environment.
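As a rough sketch of how those tiers fit together, the flow below moves hypothetical order data from bronze to silver to gold; the storage path, table names, and the cleaning and aggregation rules are all illustrative, not a prescribed layout.

```python
from pyspark.sql import functions as F

# Bronze: raw data landed as-is from the source (here, JSON files in cloud storage).
bronze = spark.read.json("s3://my-landing-bucket/orders/")  # hypothetical path
bronze.write.format("delta").mode("append").saveAsTable("main.bronze.orders")

# Silver: cleaned and validated data in a structured form.
silver = (
    spark.table("main.bronze.orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.format("delta").mode("overwrite").saveAsTable("main.silver.orders")

# Gold: aggregated, business-ready data that analysts can query directly with SQL.
gold = (
    spark.table("main.silver.orders")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("main.gold.daily_revenue")
```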
Previously, I hadn’t seen systems where one could do all of this in one place. Databricks is a tool with a wide range of features right out of the box. But it must be used with the understanding that it’s still a cloud service, and if something is misconfigured, costs can skyrocket. For example, each query spins up a cluster, and you pay for its runtime, while the cluster’s size and power also affect the cost. That’s why it’s important to calibrate everything properly and monitor expenses closely.
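As an illustration of the kind of settings that keep costs in check, here is a rough sketch of a cost-conscious cluster definition sent to the Databricks Clusters API. The runtime version, node type, and sizing values are assumptions and depend on the cloud and workspace in question.

```python
import os
import requests

cluster_spec = {
    "cluster_name": "team-dev-cluster",
    "spark_version": "15.4.x-scala2.12",   # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",            # cloud-specific instance type
    "autoscale": {"min_workers": 1, "max_workers": 4},  # scale down when load drops
    "autotermination_minutes": 30,          # shut down after 30 minutes of inactivity
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns the cluster_id
```

For scheduled workloads, ephemeral job clusters that terminate as soon as the job finishes are another common way to avoid paying for idle compute.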
What challenges or limitations have you encountered when working with Databricks?
OZ: Well, I wouldn’t call them “limitations” exactly; I’d say it’s more about the specifics of how the system itself works. Some business analysts and testers are simply used to working with standard relational databases, which operate quite differently. If you compare those to Databricks, the main point is that classical databases are optimised for fast processing of small amounts of data; for example, you select 5–10 rows, and it works instantly.
In Databricks, however, if you want to run a job through a cluster, the cluster startup alone can take about 5 minutes. You have to wait for the cluster to spin up, warm up, install all the necessary libraries, and only then does execution begin. Also, this system is not designed for working with small data volumes. Pulling 5 rows might take just as long, or even longer, than pulling a million or two, because it’s a different scale of system.
Another challenge I’ve faced is related to automation, particularly when deploying infrastructure in Databricks. Yes, you can manually configure which notebooks are used for running jobs, how pipeline orchestration is done, and what dependencies exist between them. But if you want to create something long-term, continuous, and standardised, you need a full-fledged Software Development Lifecycle.
For example, you need a repository where the code is stored, and for each branch, changes are automatically deployed to the corresponding environment. This way, everything can be tested in each environment. The deployment of components happens through CI/CD and IaC (Infrastructure as Code) tooling. Essentially, you have a configuration that specifies which notebooks to deploy, where to deploy them, which cluster to run them on, what type of job it should be, how many resources it needs, and so on.
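To give a sense of the shape of such a configuration, here is an illustrative per-environment deployment config that a CI/CD pipeline could read; every key, path, host, and size here is an assumption, and in practice this role is often played by Databricks Asset Bundles or Terraform definitions rather than hand-rolled dictionaries.

```python
# Hypothetical per-environment deployment configuration.
DEPLOYMENT_CONFIG = {
    "dev": {
        "workspace_host": "https://dev-workspace.cloud.databricks.com",
        "notebooks": ["/Repos/project/etl/ingest", "/Repos/project/etl/transform"],
        "job_type": "scheduled",
        "cluster": {"num_workers": 2, "node_type_id": "i3.xlarge"},
    },
    "prod": {
        "workspace_host": "https://prod-workspace.cloud.databricks.com",
        "notebooks": ["/Repos/project/etl/ingest", "/Repos/project/etl/transform"],
        "job_type": "scheduled",
        "cluster": {"num_workers": 8, "node_type_id": "i3.2xlarge"},
    },
}

def config_for_branch(branch: str) -> dict:
    """Map a Git branch to the environment it deploys to (a common convention)."""
    mapping = {"dev": "dev", "main": "prod"}
    return DEPLOYMENT_CONFIG[mapping.get(branch, "dev")]
```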
How does a company’s approach to working with data change when Databricks is implemented?
OZ: Databricks recommends the approach to data management described in the DAMA Book (the DAMA-DMBOK). This book, published by the Data Management Association, outlines the components of effective data management, including data architecture, data governance, data quality, data warehousing, and so on. It also makes a specific statement about data that outlines how we should treat and use it.
In other words, they say that there needs to be a shift in mindset. This applies to developers and all stakeholders involved with data; data needs to be treated as an asset. The idea is that one can use this data to provide insights to the business, to build data-driven solutions, to present data in a structured way for the business, and essentially to plan further steps based on it.
At the same time, we must understand that data has certain "risks". For example, user data contains certain personal information that must be stored in a secure environment, without access for unauthorised persons. And in accordance with regulations in various regions, for instance, GDPR, if there is a request, the data must be deleted from the database along with all dependent data related to that user. Data retention should also be clearly defined for a specific period.
Again, this is about a change in the approach to data and the understanding of what it should be: data collection, processing, presentation, and how this cycle should continue. Once one gains some insights, one may gather new data, again collect it, process it, and so on. But eventually, the outcome should be either archiving or, if needed, deletion of that data. This is about the data lifecycle, which is also part of Data Management, as described in the DAMA Book.
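As a hedged sketch of how a GDPR-style "right to be forgotten" request and a retention rule might be applied to Delta tables, consider the snippet below; the table names, the user identifier, and the retention window are all illustrative.

```python
user_id = "user-123"  # hypothetical subject of the deletion request

# Remove the user's records from the main table and dependent tables.
spark.sql(f"DELETE FROM main.silver.users  WHERE user_id = '{user_id}'")
spark.sql(f"DELETE FROM main.silver.orders WHERE user_id = '{user_id}'")

# DELETE only removes rows logically; VACUUM physically removes the underlying
# data files that are no longer referenced, after the retention period.
spark.sql("VACUUM main.silver.users RETAIN 168 HOURS")  # 7 days, the default floor

# A simple retention rule: keep only the last 3 years of events.
spark.sql("""
    DELETE FROM main.silver.events
    WHERE event_date < date_sub(current_date(), 3 * 365)
""")
```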
How does the end-to-end development lifecycle in Databricks look? Can you give a real-life example?
OZ: It’s pretty close to what Databricks itself recommends. First of all, development starts, at a minimum, with planning and understanding what exactly we want to do and what our goals are, breaking them down into individual features and planning them so that we can gradually move toward that goal. This is, roughly speaking, the Scrum approach. Based on that, you can create tasks on a board, in Jira, Azure DevOps, or similar tools, so we can track our activities.
When making changes or doing development, we either write new code or modify existing code. For that, we need version control, i.e., Git repositories. Depending on the complexity of the project, there can be a single repository or a separate repository for each service. In some cases, there can be dozens, but that can make it difficult to manage dependencies between services and the deployment order.
Databricks recommends managing code versioning through a Git repository linked to the Databricks workspace. This way, you can do development, link a notebook to a cluster, run it, test it, and check if it works. If everything is fine, we merge the changes into the dev branch.
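For illustration, linking a Git repository to the workspace can be done through the Databricks Repos API; in the minimal sketch below, the repository URL and target workspace path are assumptions.

```python
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "url": "https://github.com/example-org/data-pipelines",  # hypothetical repo
        "provider": "gitHub",
        "path": "/Repos/data-team/data-pipelines",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # contains the repo id and the checked-out branch
```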
But that’s not the end, because we still need to deploy and run everything. Notebooks are essentially just sets of code; they don’t run automatically and have no dependencies or parameters. All of that can be organised using Databricks Workflows, basically jobs that run on a schedule defined in a configuration.
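Here is a hedged sketch of turning a notebook into a scheduled Databricks Workflows job via the Jobs API; the job name, notebook path, cron expression, and cluster sizing are illustrative.

```python
import os
import requests

job_spec = {
    "name": "nightly-orders-pipeline",
    "tasks": [
        {
            "task_key": "transform_orders",
            "notebook_task": {
                "notebook_path": "/Repos/data-team/data-pipelines/etl/transform",
                "base_parameters": {"env": "dev"},
            },
            # Ephemeral job cluster, billed only while the job runs.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns the job_id
```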
So, in addition to the Git repository, we add a CI/CD component and an infrastructure component. This means that with every change to a branch in the repository, code and configuration deployment are triggered automatically. If we have any changes to notebooks or clusters, these are also tracked, and if something new appears or something is removed, those changes are applied automatically.
After applying these changes, we can run everything in the appropriate environment. When everything works and we are confident the product is fine, we can promote these changes, for example, to staging, so that it can be tested both by QA engineers and business analysts. After that, we need to get a sign-off from the stakeholders.
For example, we might agree internally that staging is the environment stakeholders can directly use. They can access it for downloading results, or analysts can connect to it to present data to end users.
This way, first, we confirm whether the product works as expected and matches what we planned. Second, we ensure stakeholders are satisfied and that it meets their expectations. If everything is fine, it goes to production. And again, at every stage, for every branch, we have a CI/CD deployment configuration that deploys changes automatically. The infrastructure is also provisioned using tools like Terraform, which keeps the state of our services in what’s called the Terraform state.
In this way, we achieve a complete end-to-end development cycle.
Databricks is a platform for big data that helps with storage, processing, and analysis using Apache Spark technology. It allows developers and business users to work together using languages like Python, Java, Scala, and SQL. Databricks offers tools for transforming data and deploying machine learning models. It helps companies manage the entire data lifecycle, from collecting raw data to delivering analytics, all in one cloud-friendly solution.
Can Databricks be used for ETL? Yes, Databricks provides a platform and tools for building ETL pipelines.