How to Maximize Virtual Data Lakes

Dec 16, 2021 | Data Quality

By Kevin Kautz

Before we jump into virtual data lakes, let’s consider the promise and shortcomings of data lakes in general, and distinguish physical data lakes from virtual ones.

A data lake provides a consolidated view and a single access point for analytical workloads, including operational dashboards and reporting as well as exploratory analysis and data science research. It expands access to data while restricting access to transactional systems. It offers the opportunity to enforce security and to protect the compute resources of operational systems. In short, it protects, secures, and enables data analytics.

Unfortunately, copying the data from operational systems to populate a data lake adds cost and introduces delays. Forcing data into a single storage technology narrows the available tools for research, and frequently optimizes for data warehouse cubes that answer narrowly designed questions without guiding teams to a shared understanding of the business and its key measures. It costs more, takes longer, and can reduce consensus.

If your organization invests in a data lake, consider carefully how you will ensure the freshness of the data that you copy from operational systems. If you want to reduce or eliminate copies, then you may benefit from a virtual data lake.

Now let’s move on to emerging best practices for virtual data lakes.

Defining Terms

Since you might have different definitions of key terminology, here’s how I use the following terms:

  • Where data represents both the raw materials and the finished goods, the operational systems are the ones that ingest data and push it through the data pipelines that cleanse, standardize, transform, and enhance the data before it is delivered to clients. In your industry, operational systems may be broader and more complex as they collect and generate data that you also wish to use for analytics.

  • To measure and to improve a company’s operations, each data warehouse uses a fixed schema with aggregated data cubes to enable decision-makers to answer predetermined questions via dashboards and reports. Each cube focuses narrowly on a single system from which it gathers its content.

  • An organization’s data lake gathers data content from multiple operational systems. Its purpose is to enable analytical workloads for exploratory analysis and data science research.

  • We make a distinction between a physical data lake and a virtual data lake. A physical data lake holds a separate copy of data content and uses a single data storage layer. A virtual data lake uses software and metadata to reduce or eliminate copies, and accepts many storage layers and formats.

Cloud Migration

When both operational systems and the data lake are in the cloud, we can refresh the data lake more quickly. This allows us to copy from operational systems only once, and we can build the data warehouse cubes from the data lake instead of copying individually for each data warehouse.

Moving operational systems into the cloud is an essential first step toward virtual data lakes. When only the data lake is in the cloud, it is both slow and costly to copy data from operational systems that are not. We cannot do it as frequently as we would like. Long runtimes also steal compute resources from operational systems.

To maximize the benefits of cloud migration, and to unlock the potential of virtual data lakes, it is not enough to lift and shift relational databases. Instead, use cloud technologies that:

  • Decouple compute from storage. High-speed cloud networks allow us to separate compute resources from cloud data storage. In this way, we can increase or decrease compute resources independently from data resources. This is an essential building block to scalability and to virtualization.

Traditional relational databases tightly couple compute and storage in order to guarantee transactional integrity (see ACID compliance). Where transactional integrity is essential to the operational systems, we must copy the data content from relational databases to a decoupled, cloud-optimized data storage layer for the data lake.

In practice, very few operational systems require transactional integrity. In particular, data pipelines that ingest, cleanse, standardize, and enhance data content can be safely implemented with full recoverability without using ACID-compliant transactions.

  • Decouple access and security from storage. When data content resides in cloud-optimized data storage layers, the storage itself does not enforce database-style access controls, such as table-, row-, or column-level permissions. When we decouple data storage from compute, we also decouple the access and security from data storage.

We must reintroduce data access rules and information security through the tools that consume data from the data lake. One way to do this is to add a new component in data platform architecture for cloud data lakes: the federated data access layer. This is described in greater detail below.
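To make the idea concrete, here is a minimal sketch of a policy gate that sits between consumers and raw storage. It is an illustration only, not any vendor's API: the `AccessRule` and `FederatedAccessLayer` names, the column-level rule shape, the audit log format, and the sample provider records are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessRule:
    """Columns that a role may read from one dataset (hypothetical policy shape)."""
    role: str
    dataset: str
    allowed_columns: frozenset


@dataclass
class FederatedAccessLayer:
    """Toy policy gate that sits between consumers and raw storage."""
    rules: list
    audit_log: list = field(default_factory=list)

    def read(self, user, role, dataset, rows, columns):
        # Find the rule for this role and dataset; no rule means no access.
        rule = next(
            (r for r in self.rules if r.role == role and r.dataset == dataset),
            None,
        )
        allowed = rule is not None and set(columns) <= rule.allowed_columns
        # Record every attempt, allowed or not, so usage can be reviewed later.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "dataset": dataset,
            "columns": sorted(columns),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{role} may not read {sorted(columns)} from {dataset}")
        # Return only the requested columns, never the whole record.
        return [{c: row[c] for c in columns} for row in rows]


# Hypothetical records standing in for files in cloud object storage.
providers = [
    {"provider_id": 1, "state": "NY", "license_no": "A-100"},
    {"provider_id": 2, "state": "CA", "license_no": "B-200"},
]

layer = FederatedAccessLayer(rules=[
    AccessRule("analyst", "providers", frozenset({"provider_id", "state"})),
])

visible = layer.read("dana", "analyst", "providers", providers, ["provider_id", "state"])
```

In this sketch the storage holds no policy at all. The rules and the audit trail live in the layer that every consumer must pass through, which is the point of decoupling access and security from storage.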

Virtualization Recommendations

After your data assets have migrated to the cloud, what are the remaining steps to construct a virtual data lake from those data assets?

  • Get the data out of the database. This bears repeating. Put your data into a cloud-optimized data storage layer. This means S3 for AWS, ADLS for Azure, and GCS for GCP. These work with high-speed cloud networks, include redundancy and recoverability, and, if you plan carefully, straightforwardly support partitioning. The result is highly scalable parallelized data access from parallelized compute resources.

  • Introduce federated data access. When we add a federated data access layer, we can expand access to data at the same time that we improve our security footprint. Block access to your operational systems from non-operational users. Direct all data access for analytics, reporting, and data warehouse creation through this federated data layer. Add roles and rules to impose a consistent data access policy, and track query, reporting, and analytical research activity so that you know who is using which data.

  • Reduce or eliminate copies. In the simplest definition, data virtualization means that the data exists only once in a data storage layer. The same data can be used for dashboards and reports and also for analytical exploration and data science research.

For data warehouse cubes in data visualization tools such as Tableau, Power BI, Qlik, and Apache Superset, employ the pass-through design pattern. Use your data federation layer to cache dimensions or cube content so that you no longer need to build and refresh cubes within the visualization servers. That is, avoid copies and make the data as fresh as possible.
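As a rough illustration of the pass-through pattern, the sketch below caches query results in the federation layer for a short time and otherwise passes each query through to the single stored copy. The `PassThroughCache` class, the 60-second time-to-live, and the fake backend are assumptions for this example. Real federation products expose their own caching features, which this does not attempt to model.

```python
import time


class PassThroughCache:
    """Caches query results in the federation layer for a short time-to-live."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query      # function that queries the single stored copy
        self._ttl = ttl_seconds          # how stale a cached answer may be
        self._clock = clock              # injectable so the sketch is easy to test
        self._entries = {}               # query text -> (fetched_at, result)

    def query(self, sql):
        now = self._clock()
        hit = self._entries.get(sql)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                # fresh enough: serve from the cache
        result = self._run_query(sql)    # otherwise pass through to storage
        self._entries[sql] = (now, result)
        return result


# Hypothetical backend that records how often the underlying storage is queried.
calls = []


def run_query(sql):
    calls.append(sql)
    return [("NY", 2), ("CA", 1)]


fake_now = [0.0]  # a controllable clock, in seconds
cache = PassThroughCache(run_query, ttl_seconds=60, clock=lambda: fake_now[0])

SQL = "SELECT state, COUNT(*) FROM providers GROUP BY state"
first = cache.query(SQL)    # goes to storage
second = cache.query(SQL)   # served from the cache
fake_now[0] = 61.0
third = cache.query(SQL)    # cache entry expired, so it goes to storage again
```

The dashboard never holds its own copy of the cube. The time-to-live is the knob that trades freshness against load on the storage layer.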

  • Replace data transactions with data transformations. Data analytics developers are familiar with a different paradigm of data transformation than those who work with relational databases. Statistical packages such as SPSS and SAS, and languages such as Scala and R, use an approach that builds a new dataset with each aggregation or transformation.

When you apply this approach to data pipelines, you remove the need for transactional integrity. Instead, you use reproducibility and recoverability to achieve similar results.

For those who prefer to think of data through the language of SQL, you can understand this approach as follows. Build your data pipeline as a sequence of CTAS (“create table as … select …”) statements, where each new table contains all of the data content from the prior table, with changes introduced via filters, joins, and aggregations.
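Here is a minimal sketch of such a CTAS sequence, using Python's built-in `sqlite3` module as a stand-in for whatever SQL engine sits on top of your cloud storage. The table names, the sample rows, and the `run_pipeline` helper are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stage 0: raw ingested records (hypothetical sample data).
conn.execute("CREATE TABLE raw_providers (provider_id INTEGER, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO raw_providers VALUES (?, ?, ?)",
    [(1, " Ada Lovelace ", "ny"), (2, "Grace Hopper", "CA"), (3, None, "ny")],
)

# Each later stage is a CTAS over the previous one; no stage is updated in place.
PIPELINE = [
    ("cleansed",
     "SELECT provider_id, name, state FROM raw_providers WHERE name IS NOT NULL"),
    ("standardized",
     "SELECT provider_id, TRIM(name) AS name, UPPER(state) AS state FROM cleansed"),
    ("by_state",
     "SELECT state, COUNT(*) AS provider_count FROM standardized GROUP BY state"),
]


def run_pipeline(conn, start_at=0):
    """Rebuild every stage from start_at onward; safe to rerun after a failure."""
    for table, select_sql in PIPELINE[start_at:]:
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"CREATE TABLE {table} AS {select_sql}")


run_pipeline(conn)
run_pipeline(conn, start_at=1)  # recover by rebuilding from an intermediate stage

result = conn.execute(
    "SELECT state, provider_count FROM by_state ORDER BY state"
).fetchall()
```

Because every stage is derived entirely from the one before it, a failed or corrected step can be fixed by dropping and rebuilding from that point, with no rollback logic required.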

  • Use parallelized in-memory analytics. With partitioned data in cloud-optimized data storage, you can make use of newer analytics engines that populate in-memory dataframes in Python and R so that you can achieve dramatically faster execution of data science models and exploratory research.

If you use Spark in a map-reduce cluster, read and write directly from and to the cloud data storage (S3, ADLS, GCS). If you use Python or R, use the Apache Arrow Flight libraries to interact with a cluster of Arrow dataframe worker nodes. One of the federated data access vendors natively supports Apache Arrow and Arrow Flight.
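The sketch below shows why partitioning matters for parallel access: each partition can be read and aggregated independently, and the partial results are combined at the end. It deliberately uses only the Python standard library as a stand-in. A local temporary directory replaces S3, ADLS, or GCS, CSV files replace a columnar format, and a thread pool replaces a cluster of worker nodes. The `state=<value>` directory layout, the function names, and the sample rows are assumptions for this example, and none of this is the Spark or Arrow Flight API.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical sample rows: (state, amount).
ROWS = [("NY", 10), ("NY", 5), ("CA", 7), ("TX", 3), ("CA", 1)]


def write_partitions(root, rows):
    """Write one file per state under state=<value>/ directories."""
    by_state = {}
    for state, amount in rows:
        by_state.setdefault(state, []).append(amount)
    for state, amounts in by_state.items():
        part_dir = Path(root) / f"state={state}"
        part_dir.mkdir(parents=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            csv.writer(f).writerows([[a] for a in amounts])


def sum_partition(part_dir):
    """Aggregate a single partition independently of all the others."""
    state = part_dir.name.split("=", 1)[1]
    total = 0
    for path in part_dir.glob("*.csv"):
        with open(path, newline="") as f:
            total += sum(int(row[0]) for row in csv.reader(f))
    return state, total


def totals_by_state(root, workers=4):
    """Fan out one task per partition, then combine the partial results."""
    parts = sorted(p for p in Path(root).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sum_partition, parts))


with tempfile.TemporaryDirectory() as root:
    write_partitions(root, ROWS)
    totals = totals_by_state(root)
```

The same shape carries over to real engines: the partition key in the storage path lets each worker read only its own slice, so adding workers speeds up the scan without any coordination between them.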

Seven Virtual Data Lake Best Practices

To bring it all together, use the following seven best practices to reduce or eliminate copies of data, to increase the freshness of data for decision-making, to provide broader access to data across multiple operational systems, and to increase the speed of analytical exploration and data science research.

  1. Migrate operational systems to the cloud.
  2. Move your data into cloud-optimized data storage.
  3. Redesign your data pipelines without transactions.
  4. Tap separate compute resources for operational systems, for data visualization of data warehouses, and for analytical exploration and data science research.
  5. Use pass-through data warehouse design in data visualization tools.
  6. Deploy a federated data access layer that includes access rules and information security enforcement.
  7. Use analytical tools that rely on parallelized in-memory dataframes.

This describes Verisys’ journey in modernizing data analytics with a virtual data lake. We would love to hear about yours. We’ve implemented some of these best practices and others are in our engineering roadmaps. In future blog posts, we will share more about successes and lessons learned along the way.

To learn more about how Verisys is transforming provider data management, contact us today.

Kevin Kautz is the Chief Data Officer at Verisys. His background includes data engineering, data science, and data platform architecture at Nielsen, US Office Products, and several other companies. He lives near Albany, N.Y.