The migration from AWS with Redshift to Databricks for e-commerce company WWL

21 December 2023 | 5 minutes of reading time

The migration from AWS with Redshift to Databricks for e-commerce company WWL by i-spark

The i-spark team is actively engaged with World Wide Lighting [WWL], continuing a multi-year collaboration in which i-spark forms a vital part of WWL’s data team through its Data Team-as-a-Service solution. About a year ago, the e-commerce lighting company reached a point where it had to choose between investing in its current data platform and switching to a modern cloud solution that could scale with the growth of the business. WWL aimed to enhance the reliability, stability, and speed of its data processes, including clickstream and ERP data from its 24 international webshops, while keeping costs acceptable. The consultancy role on this question, and later the execution role, was a perfect fit for i-spark.

Their existing data platform was based on an Amazon Redshift data warehouse within AWS. However, high demand on and heavy usage of the platform caused challenges in reliability and resiliency. Further investment in the current platform would inevitably run into the same challenges sooner or later, possibly leading to large investments with an uncertain outcome. Therefore, the decision was made to migrate to a modern data platform.

Tool Selection

i-spark explored multiple cloud solutions, and Databricks turned out to be the best fit for WWL’s requirements because the platform was more comprehensive. Databricks also had the advantage of being accessible to a broader group of users, such as analysts, thanks to features like the SQL editor.
The many processes that ran on the AWS platform, using services such as Lambda, Glue, and ECS, were perfect candidates for migration to Databricks. Additionally, Databricks offered more solutions for Machine Learning, which was one of WWL’s future aspirations. A business case, set up as a Proof of Concept, confirmed that Databricks was indeed a viable solution for enhancing the reliability, stability, and speed of their data processes.

Solution Architecture and Roadmap

A fair share of time was invested in this phase, because well begun is half done.
In the architecture we laid the foundation of the future platform, based on what we had learned from the previous data platform. To that end, we gathered both functional and non-functional requirements from the business and translated these into a comprehensive plan for the architecture, including a design for the flow and storage of the large volumes of data. This resulted in a design based on the MACH architecture principles, using a medallion structure for the storage of data. MACH means that the platform is built around microservices (specific jobs for specific tasks), uses APIs where possible, is cloud-native, and is headless. Headless means that other tools, such as reporting or marketing automation, have (curated) access to the data in Databricks and to its compute power.
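To make the medallion structure concrete, here is a minimal sketch in plain Python of the usual three layers: raw data (bronze), cleaned and deduplicated data (silver), and business-level aggregates (gold). The field names and sample rows are hypothetical and not taken from the WWL platform, where this layering lives in Databricks tables rather than in Python lists.

```python
from collections import defaultdict

# Bronze: raw records as ingested (hypothetical fields, deliberately messy).
bronze = [
    {"shop": "nl", "order_id": "1", "amount": "19.99"},
    {"shop": "NL", "order_id": "1", "amount": "19.99"},  # duplicate of the first row
    {"shop": "de", "order_id": "2", "amount": "5.00"},
    {"shop": "de", "order_id": "3", "amount": None},     # incomplete record
]


def to_silver(rows):
    """Silver: drop incomplete rows, normalize types, and deduplicate."""
    seen = set()
    silver = []
    for row in rows:
        if row["amount"] is None:
            continue
        key = (row["shop"].lower(), row["order_id"])
        if key in seen:
            continue
        seen.add(key)
        silver.append(
            {"shop": key[0], "order_id": key[1], "amount": float(row["amount"])}
        )
    return silver


def to_gold(rows):
    """Gold: aggregate the cleaned rows into revenue per shop."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["shop"]] += row["amount"]
    return dict(revenue)


gold = to_gold(to_silver(bronze))
```

Each layer reads only from the layer before it, which is what allows reporting tools to be pointed at the curated gold data without touching the raw input.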

As the next step, we determined the best approach for the delivery of the project, allocated hours amongst the team, and divided the project into multiple phases. The result was a project roadmap in which the realization of the architecture was mapped onto a timeline using the phased approach.

“i-spark’s approach and good preparation ensured that the delivery of the final solution was done in under two days”

Development

The roadmap guided the initial development. It began with migrating the processes that handled the most data or were the most complex in terms of processing and setup. We started with a small team of Data Engineers for the first steps and gradually expanded it into a multidisciplinary team of Data Engineers, Analytics Engineers, and Data Analysts as more and more data became available. The Data Engineers were assigned processes concerning data sources they were already familiar with, and worked in parallel on multiple flows to make this part of the project as efficient as possible. Analytics Engineers and Data Analysts followed a more waterfall-based approach because of strong dependencies between data sets. The project was delivered through a ‘Big Bang’ release, following a few weeks of operating the new platform alongside the existing one. This parallel run was conducted to validate data and address initial issues.

This approach and good preparation ensured that the delivery of the final solution was done in under two days. Any minor errors were immediately resolved, as the major ones had been prevented.

Data History

A challenge during the engineering process was migrating some of the existing data from Redshift to Databricks. We did this by exporting the historic data from Redshift as files and then importing those files into Databricks. This imported history was then merged with the new data, so that both historical and newly collected data could be retrieved in the same way in the analysis tool, Looker.
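The export, import, and merge steps could look roughly like the sketch below, which only builds the SQL statements as strings. The text above does not state which file format or commands were used, so the Parquet format, the use of Redshift `UNLOAD`, Databricks `COPY INTO` and `MERGE INTO`, and every table name, key, bucket, and role are assumptions for illustration.

```python
def unload_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Redshift: export a table to files on S3 (Parquet is an assumption)."""
    return (
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO '{s3_prefix}' IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


def copy_into_sql(target: str, s3_prefix: str) -> str:
    """Databricks: load the exported files into an existing Delta table."""
    return f"COPY INTO {target} FROM '{s3_prefix}' FILEFORMAT = PARQUET;"


def merge_sql(target: str, history: str, key: str) -> str:
    """Databricks: add history rows that are not yet present in the target.

    INSERT * assumes both tables share the same columns.
    """
    return (
        f"MERGE INTO {target} AS t USING {history} AS h "
        f"ON t.{key} = h.{key} "
        "WHEN NOT MATCHED THEN INSERT *;"
    )


# Hypothetical names, for illustration only.
prefix = "s3://example-bucket/history/orders/"
statements = [
    unload_sql("public.orders", prefix, "arn:aws:iam::123456789012:role/example"),
    copy_into_sql("bronze.orders_history", prefix),
    merge_sql("silver.orders", "bronze.orders_history", "order_id"),
]
```

Matching on a key and inserting only unmatched rows keeps the merge safe to re-run, which helps when history and new data overlap during a parallel run.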

Looker

Since the source of data for Looker was changed from Redshift to Databricks, a new project within Looker was necessary. Looker is not able to connect to multiple data sources in one project. Therefore, in Looker, all existing explores were first rebuilt and converted to a new project by our Data Analysts. We also took the opportunity to move logic from Looker to dbt, to make future development and data management easier and more resilient. During the migration we encountered a few challenges:

  • Databricks uses a different SQL dialect than Redshift, necessitating adjustments in specific SQL functions. 
  • There was no separate development or test environment running a duplicate instance of Looker. This meant that we had to work in the production environment, while taking all necessary precautions to keep production running smoothly.
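As an illustration of the kind of dialect adjustment mentioned above, the sketch below rewrites two Redshift function calls into equivalents that are valid in Databricks SQL. The mappings are illustrative assumptions, not the list used in the project, and newer Databricks runtimes may accept some Redshift-style functions natively.

```python
import re

# (pattern, replacement) pairs; illustrative only, not an exhaustive mapping.
REWRITES = [
    (re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE), "current_timestamp()"),
    (re.compile(r"\bLEN\s*\(", re.IGNORECASE), "length("),
]


def translate(sql: str) -> str:
    """Apply simple function-name rewrites to a SQL string.

    This is a plain text substitution: it does not parse SQL, so it would
    also rewrite matches that appear inside string literals or comments.
    """
    for pattern, replacement in REWRITES:
        sql = pattern.sub(replacement, sql)
    return sql


converted = translate("SELECT LEN(name), GETDATE() FROM orders")
```

For anything beyond a handful of functions, a SQL-aware parser is a safer choice than regular expressions.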

Since Looker was unable to manage more than one data source, certain processes that had already been moved to Databricks temporarily transferred their data back to Redshift. This was necessary to keep the existing production reporting running.

“All of these optimizations and solutions on performance and costs led to a comprehensive insight into how Databricks works ‘under the hood’ for many different Spark and SQL workloads and processes, in combination with the infrastructure it is running on”

Cost Optimization

The truly challenging part of this migration journey began after its initial completion. During development we focused strongly on the first three requirements (enhancing the reliability, stability, and speed of the data processes) and planned to address the fourth requirement, maintaining acceptable costs, after delivery. If we had to do this again, we would focus on costs earlier in the process.

As a result, compute resources and storage in Databricks had to be optimized afterwards to find the right balance between data costs and consumption. We spent a lot of time and effort on finding that optimum, refactoring newly developed processes and infrastructure. Processing large data volumes requires a lot of memory, and fast processing demands considerable computing power, all of which affects costs. Finding the right balance was challenging but very educational for us. Many possible solutions were explored, tested, and implemented, resulting in cost optimization in various ways, such as:

  • Smart VACUUM of old data in the data lake to lower storage costs;
  • Intelligent, dynamic up- and downscaling of SQL Warehouse configurations, beyond the default auto-scaling that Databricks offers for SQL Warehouses;
  • Finding the ideal AWS EC2-instances for certain specific workloads as Job Compute Clusters in Databricks Workflows;
  • Using Compute Cluster Policies to limit the creation of Compute Clusters to pre-defined EC2-instances;
  • Moving certain workloads to dbt from Looker and Databricks Jobs to run more efficiently.
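Two of the measures above can be sketched in a few lines: a compute policy definition that restricts clusters to an allowlist of EC2 instance types, and a `VACUUM` statement for a Delta table. This is a minimal sketch with assumed values: the instance types, worker limit, and table name are hypothetical, and the actual policies and retention settings used for WWL are not described here.

```python
import json

DEFAULT_RETENTION_HOURS = 168  # Delta Lake's default retention threshold (7 days)


def job_cluster_policy(allowed_instances, max_workers):
    """Build a Databricks compute policy definition as a dict.

    Limits node types to an allowlist and caps autoscaling.
    """
    if not allowed_instances:
        raise ValueError("at least one instance type is required")
    return {
        "node_type_id": {
            "type": "allowlist",
            "values": list(allowed_instances),
            "defaultValue": allowed_instances[0],
        },
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
    }


def vacuum_sql(table, retain_hours=DEFAULT_RETENTION_HOURS):
    """Build a VACUUM statement that removes files older than the retention period."""
    if retain_hours < DEFAULT_RETENTION_HOURS:
        # Shorter retention needs Delta's safety check disabled; refuse here.
        raise ValueError("retention below 168 hours is not allowed in this sketch")
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


# Hypothetical values, for illustration only.
policy = job_cluster_policy(["m5.xlarge", "r5.xlarge"], max_workers=8)
policy_json = json.dumps(policy, indent=2)
statement = vacuum_sql("silver.clickstream", retain_hours=336)
```

The JSON in `policy_json` is the kind of definition a compute policy expects, and the `VACUUM` statement can be scheduled as a regular job so that storage cleanup does not depend on manual action.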

All of these optimizations and solutions on performance and costs led to a comprehensive insight into how Databricks works ‘under the hood’ for many different Spark and SQL workloads and processes, in combination with the infrastructure it is running on. In future migrations, cost and performance optimization will be requirements number one and two.

“We adeptly transitioned WWL from a large-scale AWS-based data platform to an enterprise-grade Databricks Data Hub on AWS infrastructure”

Experienced yet Embracing a Learning Curve

In collaboration with Databricks and WWL, this project has further elevated i-spark’s expertise and knowledge, establishing it as a pivotal migration milestone. We adeptly transitioned WWL from a large-scale AWS-based data platform to an enterprise-grade Databricks Data Hub on AWS infrastructure. During the migration we faced numerous challenges, yet our effective and swift responses ensured a seamless transition. We are proud to declare the successful migration of WWL to a future-proof platform that fulfills all their requirements, reinforcing our commitment to excellence and innovation in data solutions. We look forward to continuing our collaboration with WWL in the coming years.

Are you ready for a Strategic Migration or Cost Optimization Partner?

Unlock the full potential of your data solutions with i-spark, your premier partner for migration and cost optimization. Our journey with Databricks has equipped us with insights and experience, positioning us uniquely to guide your migration journey. Specializing in cost-effective strategies, we do not just optimize your current Databricks implementation; we revolutionize it. Choose i-spark for your migration needs and witness a transformation in your data solutions, powered by the years of experience and expertise of our Data Team-as-a-Service.

Feel free to contact us for more information about your migration