A data lake is a system or repository that stores data in its natural (raw) format, usually as object blobs or files, and allows data with differing schemas and structural forms to be kept side by side. The idea of a data lake is to provide a single store for all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), creating a centralized data store that accommodates all forms of data.
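The split between raw and transformed data described above can be illustrated with a minimal sketch. It assumes a local directory stands in for the lake's object store, and the zone names `raw` and `transformed`, the helpers `ingest_raw` and `total_by_customer`, and the sample source systems are all hypothetical names chosen for illustration, not part of any particular product.

```python
import csv
import json
import shutil
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source_system: str, src_file: Path) -> Path:
    """Copy a file unchanged into raw/<source_system>/ and return the new path."""
    dest_dir = lake_root / "raw" / source_system
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src_file.name
    shutil.copy2(src_file, dest)  # byte-for-byte copy; no parsing, no schema applied
    return dest


def total_by_customer(lake_root: Path, raw_orders: Path) -> Path:
    """Read a raw CSV of orders and write per-customer totals to the transformed zone."""
    totals: dict[str, float] = {}
    with raw_orders.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["customer"]] = totals.get(row["customer"], 0.0) + float(row["amount"])
    out_dir = lake_root / "transformed" / "orders"
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / "total_by_customer.json"
    out.write_text(json.dumps(totals, indent=2, sort_keys=True))
    return out


def demo() -> dict[str, float]:
    """Land three differently shaped files in the lake, then derive one dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        lake = tmp_path / "lake"
        incoming = tmp_path / "incoming"
        incoming.mkdir()

        # Hypothetical source data: tabular, semi-structured, and binary.
        (incoming / "orders.csv").write_text("customer,amount\nalice,10.5\nbob,4\nalice,2.5\n")
        (incoming / "events.json").write_text(json.dumps({"event": "login", "user": "alice"}))
        (incoming / "logo.png").write_bytes(b"\x89PNG\r\n\x1a\n")  # only the PNG signature

        raw_orders = ingest_raw(lake, "shop_db", incoming / "orders.csv")
        ingest_raw(lake, "web_app", incoming / "events.json")
        ingest_raw(lake, "assets", incoming / "logo.png")

        # The transformed dataset is derived from the raw copy, which stays untouched.
        result = total_by_customer(lake, raw_orders)
        return json.loads(result.read_text())


if __name__ == "__main__":
    print(demo())
```

Run as a script, the sketch prints `{'alice': 13.0, 'bob': 4.0}`. The point of the layout is that ingestion never interprets the data: every file is stored as delivered, and a schema is applied only when a transformed dataset is derived from it.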
James Dixon, then chief technology officer at Pentaho, reportedly coined the term to contrast it with a data mart, which is a smaller repository of interesting attributes extracted from raw data. He argued that data marts have several inherent problems, often referred to as information siloing, and promoted data lakes as an alternative. PricewaterhouseCoopers said that data lakes could "put an end to data silos." In its study on data lakes, the firm noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
Many companies have since entered this space: Microsoft, Zaloni, Teradata, Cloudera, and Amazon, among others, all have data lake offerings. As with any new technology trend, vendor claims should be evaluated with caution.
One example of a data lake is the Hadoop Distributed File System (HDFS), the distributed file system used in Apache Hadoop.
Many companies also use cloud storage services such as Amazon S3. There is growing academic interest in the concept of data lakes. For instance, the Personal DataLake project at Cardiff University aims to create a new type of data lake for managing the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.