Abstract
This thesis presents a comprehensive study and comparative analysis of three distinct data lakehouse systems: Delta Lake, Apache Iceberg, and Apache Hudi. Data lakehouse systems are an emergent concept that combines the capabilities of data warehouses and data lakes to provide a unified platform for large-scale data management and analysis.
Three experimental scenarios were conducted, focusing on data ingestion, query performance, and scaling, each assessing a different aspect of the systems' capabilities. The results show that each data lakehouse system has its own strengths and weaknesses: Apache Iceberg demonstrated the best data ingestion speed, Delta Lake exhibited consistent performance across all testing scenarios, and Apache Hudi excelled with smaller datasets.
The study also considered the ease of implementation and use of each system. Apache Iceberg emerged as the most user-friendly, with comprehensive documentation. Delta Lake presented a slightly steeper learning curve, while Apache Hudi was the most challenging to implement.
This study underscores the promising potential of data lakehouses as alternatives to traditional database architectures. However, further research is necessary to solidify the positioning of data lakehouses as the new generation of database architectures.
