How-to: Create a Data Lake using AWS Lake Formation
Gathering insights from analytic data at a rapid pace for strategic decision making has become a key priority for companies in the last few years.
The growth and ubiquity of the AWS Cloud has made it super convenient for companies to leverage various tools and adopt them for rapid application development.
In addition to application development, AWS also offers a significant array of data-oriented services to help companies gain maximum value from their data, without having to worry about managing infrastructure and resources in on-prem data centers.
Plus, with AWS, one can always test out different solutions by quickly spinning up a few services. Teams can then think about long-term support, and tear down the experiments once they’ve settled on a particular solution.
What is a Data Lake on AWS? 🤔
Data is widely hailed as the golden asset of companies. Rightfully so, as it yields immense value by allowing companies to make key business decisions in a timely manner.
As a result, properly managing data to extract maximum value from it becomes critical.
One way to manage that data is to build a data lake.
At a high level, a data lake is a central repository that houses raw data from multiple data producers (sources) in a scalable and cost-effective manner. Producers can store their data as-is in the data lake, whether it is structured, semi-structured, or unstructured.
After the data has landed in the data lake, producers can (securely) make it consumption-ready via ETL, validation, and maintenance processes. Consumers of the data can then run analytics queries on it to obtain insights and make strategic business choices.
Nice huh? Now let’s see what it takes to manage a data lake. 😉
Nothing is free 🙂
Any effort that reaps large-scale advantages also brings challenges for companies to work through.
Creating and then maintaining a data lake is not so easy. It requires a significant number of manual processes that can be complex and pretty time-consuming.
In many companies, data comes from a multitude of sources. It has to be monitored and checked carefully to ensure its accuracy and integrity for end users.
Managing the data also involves procedures to prevent security holes and leaks (especially of sensitive data). Teams need to set up access policies with permissions, and secure the different layers of data (like views and tables).
If proper consideration is not given to the technical stack, architecture, data validation, data quality, and data governance, a data lake setup can pretty quickly become a raging mess that is difficult to access, understand, and leverage.
Luckily, AWS offers a few key services to help manage data lakes.
This tutorial assumes that the reader already has some familiarity with AWS Glue and S3.
If not, feel free to follow my earlier tutorial below to learn more about Glue and S3!
How-to: Analyze Coffee using Amazon Athena, S3, & Glue
In this tutorial, we will learn how to create a super simple data lake via AWS Lake Formation. We will also learn about the governance and security benefits that Lake Formation enables over other AWS services.
Let’s get started! 🙂
So, what security benefits does AWS Lake Formation offer? 🧐
Lake Formation provides a permissions model that supplements the permissions provided by AWS IAM. It revolves around three personas:
IAM Administrator (Set of IAM Permissions)
This is the user who can create IAM users and roles, and who holds the AdministratorAccess AWS managed policy. This user may also be designated as a data lake administrator.
Data Lake Administrator (Set of Lake Formation Permissions)
This is the user who can register data sources (S3 buckets/prefixes), access the Data Catalog, create databases, create and run workflows, GRANT/REVOKE Lake Formation permissions to other users, and view AWS CloudTrail logs.
Workflow Role
This is a role that runs a workflow on behalf of a data producer/user. You specify this role when you create a workflow from a Lake Formation blueprint.
By default, Lake Formation grants the Super permission to the IAMAllowedPrincipals group on existing Data Catalog resources, and enables the Use only IAM access control setting for new databases and tables.
This provides backward compatibility, so that existing workloads in Glue, Athena, and S3 that rely on IAM policies keep working.
You get a centrally defined access model that enables fine-grained permissions on the data stored in the data lake, using simple GRANT/REVOKE commands.
Permissions can also be set up and enforced at the individual table and column levels. Plus, Lake Formation is super versatile in how it works with many AWS services for Machine Learning & Data Analytics (SageMaker, Redshift, Athena, Glue, etc.).
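To make the GRANT idea a bit more concrete, here is a minimal sketch of a column-level grant, written as a request for the boto3 Lake Formation client. The role ARN, database, table, and column names are hypothetical placeholders, and the actual API call is left in a comment because it needs AWS credentials.

```python
from typing import Any


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: list[str]) -> dict[str, Any]:
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # only these columns become readable
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and catalog objects, for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/analyst",
    "coffee_db",
    "orders",
    ["order_id", "roast"],
)

# With credentials configured, the grant could be applied like this:
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.grant_permissions(**request)
# and taken back with:
#   lf.revoke_permissions(**request)
```

Keeping the request as a plain dictionary makes it easy to review (or unit test) before anything is sent to AWS.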
Access control in Lake Formation is represented in two key areas:
1. Metadata Access
- These are permissions on the Glue Data Catalog resources that allow principals to create, read, update, and delete metadata databases and tables.
2. Underlying Data Access
- These are permissions on the S3 data source itself that include data access and data location permissions.
- Data access permissions enable principals to READ and WRITE to the S3 bucket contents.
- Data location permissions enable the creation of metadata databases and tables that point to the specific S3 prefixes/buckets.
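The two areas above map to different resource types in a grant request. The sketch below builds one request of each kind for the boto3 Lake Formation client. All names and ARNs are hypothetical, and the specific permission choices are just one reasonable example.

```python
from typing import Any, Dict

Request = Dict[str, Any]


def _principal(arn: str) -> Dict[str, str]:
    return {"DataLakePrincipalIdentifier": arn}


def metadata_grant(principal_arn: str, database: str) -> Request:
    """Metadata access: allow creating tables in a Data Catalog database."""
    return {
        "Principal": _principal(principal_arn),
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["CREATE_TABLE", "DESCRIBE"],
    }


def data_access_grant(principal_arn: str, database: str, table: str) -> Request:
    """Underlying data access: read and write the data behind a table."""
    return {
        "Principal": _principal(principal_arn),
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT", "INSERT"],
    }


def data_location_grant(principal_arn: str, bucket: str, prefix: str = "") -> Request:
    """Data location: allow catalog objects that point at this S3 location."""
    resource_arn = f"arn:aws:s3:::{bucket}/{prefix}".rstrip("/")
    return {
        "Principal": _principal(principal_arn),
        "Resource": {"DataLocation": {"ResourceArn": resource_arn}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


# Each request could be passed to boto3.client("lakeformation").grant_permissions(**req)
producer = "arn:aws:iam::111122223333:role/producer"  # hypothetical role
location = data_location_grant(producer, "my-coffee-lake", "raw/")
```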
Leveraging these two controls for centralizing the data access policies is pretty easy:
- First, you’d want to remove any direct access to the required data sources in S3, so that Lake Formation manages all data access.
- Second, set up data protection and access policies, and enforce them across all the AWS services that access the data in the data lake. Using metadata and S3 object permissions, one can set up users and roles that can only access specific data, down to the table and column level (least-privilege access).
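On the Lake Formation side, handing over control usually also means switching off the IAM-only defaults. Here is a rough sketch of what that could look like with boto3, assuming a hypothetical database named coffee_db. Removing direct S3 access itself is done in your IAM and bucket policies and is not shown here.

```python
from typing import Any, Dict


def lake_formation_only_settings(current: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the data lake settings with the IAM-only defaults
    for newly created databases and tables switched off."""
    updated = dict(current)
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    return updated


def revoke_iam_allowed_principals(database: str) -> Dict[str, Any]:
    """Build a revoke request that removes Super ("ALL" in the API) from
    the IAMAllowedPrincipals group on an existing database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["ALL"],
    }


# With credentials configured, this could be applied as follows:
#   import boto3
#   lf = boto3.client("lakeformation")
#   current = lf.get_data_lake_settings()["DataLakeSettings"]
#   lf.put_data_lake_settings(DataLakeSettings=lake_formation_only_settings(current))
#   lf.revoke_permissions(**revoke_iam_allowed_principals("coffee_db"))
```

It is worth granting the Lake Formation permissions your users need before revoking the defaults, otherwise existing queries may start failing.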
How do the permissions work? 🧐
In Lake Formation, fine-grained permissions work alongside coarser-grained IAM permissions: IAM decides which API calls a principal may make, while Lake Formation decides which databases, tables, and columns those calls may touch.
This heightens security, as access control transitions from the coarse IAM permission set to one ultimately managed by Lake Formation.
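As a mental model only (this is a toy illustration, not how AWS implements it), you can think of a request as having to pass both checks: the coarse IAM check on the API call, and the fine-grained Lake Formation check on the columns. The function and column names below are made up for the example.

```python
from typing import Iterable, List


def visible_columns(iam_allows_call: bool,
                    table_columns: Iterable[str],
                    lf_granted_columns: Iterable[str]) -> List[str]:
    """Toy model: a principal sees a column only if IAM allows the API call
    AND Lake Formation has granted access to that column."""
    if not iam_allows_call:
        return []
    granted = set(lf_granted_columns)
    return [column for column in table_columns if column in granted]


columns = ["order_id", "customer_email", "roast"]

# IAM allows the query, Lake Formation hides the sensitive column.
analyst_view = visible_columns(True, columns, {"order_id", "roast"})

# Without the coarse IAM permission, the Lake Formation grant alone is not enough.
blocked_view = visible_columns(False, columns, {"order_id", "roast"})
```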
Here is a simple visual highlighting the possible permissions:
To see the list of permissions available in Lake Formation, refer to the AWS Lake Formation permissions documentation.
Lake Formation also currently supports server-side encryption on S3, in addition to private endpoints for a VPC.
It also logs all activity in AWS CloudTrail. Together, these features help with network isolation and traceability.
Let’s dive in! 🤿
Author’s note: Although I will demo this tutorial by taking steps in the AWS Console, please note that in an enterprise setting, you should aim to build this setup with an Infrastructure as Code tool (e.g. Terraform, CloudFormation, or Pulumi).
Step 1) Navigate to Lake Formation via the AWS Console
AWS Console > Search > Lake Formation > Click Get Started
After clicking Get Started, we’ll be presented with a modal asking us to set up the administrator of the data lake.
You can add individual AWS Users and available Roles on your AWS account. For the purpose of this tutorial, I will check the box to Add myself, and then click Get started.
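If you would rather script this step than click through the modal, a minimal sketch with boto3 could look like the following. The helper name and the user ARN are hypothetical. The helper only prepares the updated settings, and the API calls are left in comments because they need AWS credentials.

```python
from typing import Any, Dict


def with_admin(settings: Dict[str, Any], principal_arn: str) -> Dict[str, Any]:
    """Return a copy of the data lake settings with principal_arn added to
    the list of data lake administrators (no duplicates)."""
    admins = list(settings.get("DataLakeAdmins", []))
    entry = {"DataLakePrincipalIdentifier": principal_arn}
    if entry not in admins:
        admins.append(entry)
    return {**settings, "DataLakeAdmins": admins}


# With credentials configured:
#   import boto3
#   lf = boto3.client("lakeformation")
#   current = lf.get_data_lake_settings()["DataLakeSettings"]
#   lf.put_data_lake_settings(
#       DataLakeSettings=with_admin(current, "arn:aws:iam::111122223333:user/me")
#   )
```

Reading the current settings first matters here, since the put call replaces the whole settings object rather than appending to it.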
Step 2) Add a Data lake location
Once we have access to the Lake Formation dashboard, we need to add a Data lake location, which is basically an S3 bucket location/prefix that holds (producer-type) data that can be retrieved.
Note: In Lake Formation, the data can be obtained by various means, such as AWS Glue jobs, the combination of Amazon Kinesis Data Streams & Data Firehose, or even a simple upload into S3 (which is what I did for this tutorial).
Click > Register location.
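The same registration can be sketched in code. The snippet below builds the arguments for the boto3 Lake Formation register_resource call. The bucket, prefix, and helper name are placeholders I made up, and the call itself is shown as a comment.

```python
from typing import Any, Dict, Optional


def register_location_request(bucket: str, prefix: str = "",
                              role_arn: Optional[str] = None) -> Dict[str, Any]:
    """Build arguments for registering an S3 bucket/prefix as a data lake
    location. Without a custom role, the service-linked role is used."""
    resource_arn = f"arn:aws:s3:::{bucket}/{prefix}".rstrip("/")
    if role_arn is None:
        return {"ResourceArn": resource_arn, "UseServiceLinkedRole": True}
    return {
        "ResourceArn": resource_arn,
        "UseServiceLinkedRole": False,
        "RoleArn": role_arn,
    }


request = register_location_request("my-coffee-lake", "raw/coffee")

# With credentials configured:
#   import boto3
#   boto3.client("lakeformation").register_resource(**request)
```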
Selecting Use blueprint will bring us to a form where we can select whether we want to grab data from a database or from a log source.
One can follow the setup to build a workflow, which in a nutshell is just a Glue ETL job where all the options for the Extract, Transform, and Load steps are in the same place.
Example: For a MySQL, MSSQL, or Oracle database, add (or create) an AWS Glue connection, and specify the source DB and table in this format:
<database>/<table>
Add (or create) the target Glue Catalog by specifying a DB and a table, and use the provided browse tool to pick a suitable S3 path to host the catalog data.
Select a workflow name, decide the job frequency (like Run on demand), and choose a table prefix. The other options can be left as defaults.
- When S3 is the target location, opt for the Parquet format, as Parquet gives a solid performance advantage for dataset operations later on.
- Also, if you plan to use Athena to query your catalog, use “_” instead of “-” in database & table names, since the “-” character can sometimes lead to unwanted compatibility issues.
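For the naming tip, a tiny helper like the one below can normalize names before you type them into the form. The function name is my own, and the rule it applies (lowercase letters, digits, and underscores only) is a conservative assumption that goes a bit further than just replacing “-”.

```python
import re


def athena_safe_name(name: str) -> str:
    """Lowercase the name and replace anything that is not a letter,
    digit, or underscore with an underscore."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


table_prefix = athena_safe_name("Coffee-Orders-2023")
```

Here, table_prefix ends up as coffee_orders_2023.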
Fine-Grained Access 🔒
Lake Formation permissions apply only in the region in which they were granted.
For backward compatibility, Lake Formation passes through IAM permissions for new resources (until instructed to do differently).
Streamlined and Centralized Governance
The aspect that truly helps maintain control over a data lake is that with Lake Formation, we finally have a centralized dashboard to control all of the various:
- S3 Buckets/Prefixes,
- ETL Jobs,
- Glue Crawlers,
- Glue Database Catalogs,
- and permissions. 🙂
One additional advantage is that Lake Formation comes with CloudTrail logging enabled by default, so every action taken by end users or by other AWS services via IAM roles is logged and can be reviewed directly from the dashboard.
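Outside the dashboard, the same events can be pulled from CloudTrail. Below is a minimal sketch that filters on the Lake Formation event source using boto3's CloudTrail lookup_events call. The helper names and the sample event are made up for illustration, and the real call is left in a comment.

```python
from typing import Any, Dict, Iterable, List, Tuple


def lake_formation_events_request(max_results: int = 50) -> Dict[str, Any]:
    """Build arguments for CloudTrail lookup_events, limited to events
    emitted by Lake Formation."""
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        "MaxResults": max_results,
    }


def summarize(events: Iterable[Dict[str, Any]]) -> List[Tuple[Any, str, str]]:
    """Reduce CloudTrail events to (time, user, action) tuples."""
    return [
        (event["EventTime"], event.get("Username", "unknown"), event["EventName"])
        for event in events
    ]


# With credentials configured:
#   import boto3
#   ct = boto3.client("cloudtrail")
#   response = ct.lookup_events(**lake_formation_events_request())
#   for row in summarize(response["Events"]):
#       print(row)

# Hand-written sample event, shaped like an entry in response["Events"].
sample = [{"EventTime": "2023-01-01T10:00:00Z", "Username": "me",
           "EventName": "GrantPermissions"}]
rows = summarize(sample)
```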
So, Lake Formation is optimal for use cases where there are many different producers of data with all of their different S3 buckets, catalogs, and crawlers.
One can easily manage those different data sources via Lake Formation, to enable centralized access to any data source.
Lake Formation enables GRANT/REVOKE permissions for users or roles at the table/column level.
I really like how AWS Lake Formation permissions are far more granular than IAM permissions for securing a centralized data lake, since they are enforced on logical objects like databases, tables, and columns instead of files and directories.
Thank you for following along! 🙏
As always, feel more than welcome to ask questions / leave comments in the section below.