Data lake

Data lakes are large stores of minimally processed data, in formats like Parquet and ORC, kept on cheap object storage like S3. The idea is to dump data in now and figure out how to query it later.

In practice, this historically meant Hadoop and HDFS, which have largely been replaced by object storage like S3 with a table format like Iceberg on top.
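The "store now, decide how to query later" idea is sometimes called schema-on-read. A minimal sketch of it follows, assuming a local temporary directory as a stand-in for an object store and JSON lines instead of Parquet or ORC so that it runs with the standard library alone. The `dump` and `scan` helpers and the sample records are hypothetical, made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a local directory stands in for an object store bucket.
lake = Path(tempfile.mkdtemp(prefix="lake_"))


def dump(name, records):
    """Write records as JSON lines without validating or reshaping them."""
    path = lake / f"{name}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def scan(fields):
    """Apply a schema at read time: keep the requested fields, None if absent."""
    for path in sorted(lake.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield {field: record.get(field) for field in fields}


# Two producers write differently shaped events; nothing enforces a schema on write.
dump("clicks", [{"user": "a", "page": "/home"}, {"user": "b", "page": "/docs"}])
dump("purchases", [{"user": "a", "amount": 12.5}])

# The reader decides later which fields matter.
rows = list(scan(["user", "amount"]))
for row in rows:
    print(row)
```

The write path never rejects or reshapes anything, so all interpretation is pushed onto whoever reads the files later. That is both the appeal and the cost of the approach.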

Data warehouse

Data warehouses are structured stores optimized for analytical queries, which often aggregate over large fractions of the stored tables.

In practice, Snowflake and BigQuery are modern examples.
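To make "aggregate over large fractions of the stored tables" concrete, here is a small sketch of that query shape. It uses Python's built-in `sqlite3` purely as a stand-in so the example is runnable. SQLite is not a data warehouse, and the `orders` table and its rows are invented for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a warehouse.
# Real warehouses typically use columnar storage and run queries in parallel.
conn = sqlite3.connect(":memory:")

# Unlike a lake, the schema is declared up front and enforced on write.
conn.execute("CREATE TABLE orders (region TEXT NOT NULL, amount REAL NOT NULL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("eu", 10.0), ("eu", 30.0), ("us", 5.0), ("us", 15.0), ("us", 40.0)],
)

# An analytical query: it reads every row and reduces the table to a few summaries.
totals = conn.execute(
    """
    SELECT region, COUNT(*), SUM(amount), AVG(amount)
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for region, n, total, mean in totals:
    print(region, n, total, mean)
```

The query scans the whole table but only needs two of its columns, which is the access pattern that warehouse storage layouts are generally built around.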

Data lakehouse

Data lakehouses are what you get when you add analytical query capabilities on top of a data lake. For example, if you layer Trino, Spark, or DuckDB over Iceberg over S3, you get something that functionally approaches a data warehouse but is implemented more like a data lake.
