This post outlines the uses of data lakes and data warehouses.
Data lakes are generally built on databases that can store unstructured or semi-structured data. There is no strict schema, unlike a data warehouse built on top of an RDBMS.
This allows new fields to be added to tables (or documents) as they become available. For example, suppose we already have 1 million customers stored and now need to import another million that carry a new field, such as a credit rating. In a data lake this can be done without modifying the existing customers or back-filling null credit ratings onto historical rows.
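The schema-on-read idea above can be sketched in a few lines of Python. This is a minimal illustration, not a real data-lake API; the record and field names (`credit_rating` and so on) are hypothetical.

```python
import json

# Records in a document-style data lake, stored as JSON lines.
existing_customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# New customers arrive with an extra field, credit_rating.
new_customers = [
    {"id": 3, "name": "Carol", "credit_rating": "A"},
]

# With no fixed schema, the new records are simply appended;
# historical rows are untouched and carry no null credit_rating column.
lake = existing_customers + new_customers

for record in lake:
    print(json.dumps(record))
```

Contrast this with an RDBMS, where importing the second batch would typically require an `ALTER TABLE` and every historical row would gain a null `credit_rating` column.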
The data lake acts as a buffer between live application databases and analytics. Data from the live databases is periodically exported into the data lake, and analytics teams then work only with the data held there.
A data lake is intermediate storage for analytics; the raw data imported into it is generally not readily analysable. It goes through a series of transformations, such as ETL jobs or stream processing, to convert the semi-structured data into a warehouse schema. That schema may be held in an RDBMS, allowing business professionals to query it with SQL. Transforming data-lake contents into structured data requires scripting, not just querying; this is typically done in languages such as Python or Scala, hosted on platforms such as Spark or Hadoop. These jobs take the semi-structured data and shape it into a schema, after which it can be queried for reporting purposes.
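The transformation step described above can be sketched as a small Python function that flattens nested, semi-structured records into rows matching a fixed warehouse schema. In practice this would run as a Spark or Hadoop job over far larger data; the record shapes and column names here are hypothetical.

```python
# Minimal ETL-style sketch: flatten semi-structured data-lake records
# into a fixed warehouse schema.

raw_events = [
    {"customer": {"id": 1, "name": "Alice"}, "order": {"total": 25.0}},
    {"customer": {"id": 2, "name": "Bob"}},  # no order recorded yet
]

# The target warehouse schema, e.g. columns of a table in an RDBMS.
WAREHOUSE_COLUMNS = ("customer_id", "customer_name", "order_total")

def to_warehouse_row(event):
    """Map one nested record onto the flat warehouse schema."""
    customer = event.get("customer", {})
    order = event.get("order", {})
    return (
        customer.get("id"),
        customer.get("name"),
        order.get("total"),  # becomes NULL when the source field is absent
    )

rows = [to_warehouse_row(e) for e in raw_events]
```

Once rows like these are loaded into the warehouse tables, they can be queried with ordinary SQL for reporting.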
Below is a typical data pipeline showing where a data lake would sit:
Below is an article outlining the differences: