Creating a governance model around Amazon DataZone


In this article, I'll share my experience using Amazon DataZone as a governance tool and how we can leverage it to provide access to metadata, glossaries, and data with proper checks and audits in place. Based on my experience, DataZone is still evolving and falls short in a few areas from a governance perspective. We'll go through these constraints and their workarounds.

Amazon DataZone is a data management service that makes it faster and easier to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources. With Amazon DataZone, administrators who oversee an organization’s data assets can manage and govern access to data using fine-grained controls. These controls help ensure access with the right level of privilege and context. Amazon DataZone makes it easy for engineers, data scientists, product managers, analysts, and business users to share and access data across the organization so they can discover, use, and collaborate to derive data-driven insights.

 

Features of Amazon DataZone

  1. Unified data management portal: Amazon DataZone provides a web-based application that serves as a centralized portal for users to catalog, discover, access, analyze, and govern data. This portal allows for seamless collaboration among data engineers, data scientists, product managers, analysts, and business users.

  2. Fine-Grained Access Controls: Administrators and data stewards can manage and govern access to data using detailed access controls. This ensures that data is accessible with the appropriate level of privileges and context, enhancing data security and compliance.

  3. Built-In Workflows: Amazon DataZone includes built-in workflows for data consumers to request access to data and for data owners to approve these requests. This automated process streamlines data access while maintaining strict governance.

  4. Integration with Analytics Tools: DataZone integrates with various AWS analytics tools like Amazon Redshift Query Editor and Amazon Athena, allowing users to consume data directly from the DataZone portal without needing to log into the AWS Management Console.

  5. Support for Business Glossaries: The service includes a business data catalog that supports a business glossary, providing consistent definitions for business terms across the organization. This enhances data understanding and ensures consistent use of terminology.

  6. Data Projects and Environments: DataZone projects are groupings based on business use cases, where users, data assets, and analytics tools are organized to facilitate collaboration. Within projects, environments provide the necessary infrastructure and access controls for data and analytics tools.

Constraints

  1. DataZone lineage feature still in preview – It doesn't support Glue DynamicFrames, and we observed improper lineage generation due to the mixing of Spark logs.

  2. No email notifications – At the time of writing, DataZone doesn't have an email notification feature to inform data owners/stewards of subscription requests, or consumers of the responses. We ended up writing our own code using Amazon EventBridge to support this.

  3. Filters on data products – Currently, DataZone does not support adding asset filters to data products, which limits the ability to apply fine-grained access control on the data assets under a data product.

  4. Import/export of glossaries – DataZone doesn't support importing/exporting glossaries and business definitions across accounts. This is a big drawback when migrating a solution from lower environments to production.
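As a workaround for the missing glossary import/export, the search API can be used to dump glossary terms to JSON and re-create them in the target account. Below is a minimal Python sketch; the domain ID is a placeholder, and the response field names are assumptions based on the boto3 DataZone `search` API, so verify them against your SDK version.

```python
import json

def terms_to_export_records(search_items):
    """Flatten DataZone search results (GLOSSARY_TERM scope) into
    portable records that can be re-created in another domain."""
    records = []
    for item in search_items:
        term = item.get("glossaryTermItem", {})
        records.append({
            "name": term.get("name"),
            "shortDescription": term.get("shortDescription"),
            "longDescription": term.get("longDescription"),
        })
    return records

def export_glossary_terms(client, domain_id):
    """Page through the search API and serialize glossary terms to JSON.
    `client` is a boto3 DataZone client."""
    items, token = [], None
    while True:
        kwargs = {"domainIdentifier": domain_id, "searchScope": "GLOSSARY_TERM"}
        if token:
            kwargs["nextToken"] = token
        page = client.search(**kwargs)
        items.extend(page.get("items", []))
        token = page.get("nextToken")
        if not token:
            break
    return json.dumps(terms_to_export_records(items), indent=2)

if __name__ == "__main__":
    import boto3  # assumes AWS credentials are configured
    dz = boto3.client("datazone")
    print(export_glossary_terms(dz, "dzd_example123"))  # hypothetical domain ID
```

The exported JSON can then be replayed in the production account with `create_glossary_term` calls.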

Login Process

DataZone works best in an enterprise environment using IAM Identity Center with implicit user access. There is a good article on enabling federated access to Amazon DataZone with Okta, and the same approach can be extended to other identity providers such as Entra ID.

Note that users accessing DataZone must exist in IAM Identity Center for their login to work. We can create a group in the IdP named 'datazone' and synchronize it with IAM Identity Center as described in the article above. We can also use this group as an additional security control, allowing users to join DataZone projects only if they are members of this group.
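That membership check can be automated before adding a user to a project. Here is a hedged sketch using the boto3 `identitystore` client; the group name and attribute path are assumptions matching the setup described above.

```python
def member_user_ids(membership_pages):
    """Collect user IDs from list_group_memberships response pages."""
    ids = set()
    for page in membership_pages:
        for membership in page.get("GroupMemberships", []):
            user_id = membership.get("MemberId", {}).get("UserId")
            if user_id:
                ids.add(user_id)
    return ids

def user_in_datazone_group(idc, identity_store_id, user_id, group_name="datazone"):
    """Return True if the user belongs to the 'datazone' IdC group.
    `idc` is a boto3 'identitystore' client; the displayName attribute
    path is an assumption about how the group was synchronized."""
    group = idc.get_group_id(
        IdentityStoreId=identity_store_id,
        AlternateIdentifier={"UniqueAttribute": {
            "AttributePath": "displayName", "AttributeValue": group_name}},
    )
    paginator = idc.get_paginator("list_group_memberships")
    pages = paginator.paginate(IdentityStoreId=identity_store_id,
                               GroupId=group["GroupId"])
    return user_id in member_user_ids(pages)
```

A project-provisioning script could call `user_in_datazone_group` as a gate before invoking the DataZone membership APIs.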

 

Data Discovery & Usage

By centralizing all business and technical metadata in one location, end users can go to a single web application to search, discover, and query data. Datasets are discovered via standard search and filtering tools that let a user find an asset quickly. When a data asset is found, the user can subscribe to it, which kicks off an access-request workflow to the assigned data steward.
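The same discovery is available programmatically. A minimal sketch of a keyword search in the ASSET scope with the boto3 DataZone `search` API follows; the response item shape is an assumption to verify against your SDK version.

```python
def summarize_assets(search_items):
    """Pull asset name and description out of Search API result items."""
    return [
        {"name": item["assetItem"].get("name"),
         "description": item["assetItem"].get("description")}
        for item in search_items if "assetItem" in item
    ]

def find_assets(dz, domain_id, text):
    """Keyword search over published assets; `dz` is a boto3 DataZone
    client and the domain ID is a placeholder."""
    response = dz.search(domainIdentifier=domain_id,
                         searchScope="ASSET",
                         searchText=text,
                         maxResults=25)
    return summarize_assets(response.get("items", []))
```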

Search based on Glossary Terms

The terms in a business glossary can be added to assets to classify them or to enhance their discoverability during search. These terms play a significant role in data governance, ensuring consistency and clarity across analytics processes.

Generally, for effectively discovering and categorizing data assets, I found the glossary terms below to be most effective across a broad set of use cases:

  • Based upon subject areas e.g. finance, HR, manufacturing, asset management, etc. This is useful in enterprise models where we want to categorize data assets based on subject areas.

  • Based upon projects e.g. spend analytics, workforce management, etc. This is useful in scenarios where we want to differentiate assets based on projects.

  • Based upon entity type, i.e. master, reference, or transactional data. This helps identify whether a data asset is referential or transactional.

  • Based upon data sources, i.e. if data comes from multiple sources, tagging data assets can help identify all the sources populating a particular data asset.
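A taxonomy like the one above can be bootstrapped with the boto3 `create_glossary_term` API. The category lists below are illustrative, and the domain/glossary IDs are placeholders.

```python
TAXONOMY = {
    # Illustrative categories following the list above
    "subject-area": ["finance", "HR", "manufacturing"],
    "project": ["spend analytics", "workforce management"],
    "entity-type": ["master", "reference", "transactional"],
}

def term_requests(domain_id, glossary_id, taxonomy):
    """Build one create_glossary_term payload per category/term pair."""
    return [
        {"domainIdentifier": domain_id,
         "glossaryIdentifier": glossary_id,
         "name": term,
         "shortDescription": f"{category} term",
         "status": "ENABLED"}
        for category, terms in taxonomy.items()
        for term in terms
    ]

def create_terms(dz, domain_id, glossary_id):
    """Create all taxonomy terms; `dz` is a boto3 DataZone client."""
    for request in term_requests(domain_id, glossary_id, TAXONOMY):
        dz.create_glossary_term(**request)
```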

Data Publication and Subscription

 

Publish the data asset

Data owners can publish datasets directly from the DataZone portal and give data stewards the capability to curate the data and manage access requests. Publishing data creates a relationship between the underlying dataset, the Glue catalog, the AWS account, the DataZone project, and the DataZone-curated metadata. This linkage allows for easy discovery through search and allows data stewards to control access.

Subscribe to the data asset

Once a user discovers a dataset of interest, they can request access through a subscription process. This generally involves finding the data asset and subscribing to it, providing a justification for the request.
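The same subscription request can be raised via the `create_subscription_request` API. A hedged sketch follows; all IDs are placeholders, and the payload shape is an assumption based on the boto3 DataZone API model.

```python
def subscription_request_payload(domain_id, listing_id, project_id, reason):
    """Shape of a CreateSubscriptionRequest call: a consumer project
    subscribes to a published listing with a justification."""
    return {
        "domainIdentifier": domain_id,
        "subscribedListings": [{"identifier": listing_id}],
        "subscribedPrincipals": [{"project": {"identifier": project_id}}],
        "requestReason": reason,
    }

def request_access(dz, domain_id, listing_id, project_id, reason):
    """Submit the request; `dz` is a boto3 DataZone client."""
    return dz.create_subscription_request(
        **subscription_request_payload(domain_id, listing_id, project_id, reason))
```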

Once the subscription request is submitted, the dialog box below will be shown.

 

You can also view your subscription request by going to the My Subscription page of the respective data asset.

Please note that the user needs to raise the request on behalf of a consumer project. For instance, in the example above, it's the Analysis project that has requested data access from the Common Data Warehouse project. The creation of consumer projects is restricted and is covered in the permissions & roles section.

Approve or reject the subscription request

This subscription request process notifies the responsible data steward, who can then grant or reject access to the data.
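Stewards can also act on pending requests through the API. A minimal sketch using `accept_subscription_request` / `reject_subscription_request` follows; the IDs are placeholders.

```python
def decision_action(approve):
    """Map a boolean decision to the corresponding DataZone API call name."""
    return ("accept_subscription_request" if approve
            else "reject_subscription_request")

def review_subscription(dz, domain_id, request_id, approve, comment):
    """Steward decision on a pending subscription request; `dz` is a
    boto3 DataZone client."""
    action = getattr(dz, decision_action(approve))
    return action(domainIdentifier=domain_id,
                  identifier=request_id,
                  decisionComment=comment)
```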

Fine-Grained Access Control

Data owners/stewards can restrict access to project or sensitive data with fine-grained access control using row or column filters. We can also create default filters to provide access to a subset of data, applying row-level filters or hiding sensitive columns altogether. This is done by choosing filters in the subscription request, as shown below:

Some considerations based on my data architecture journey

  • PII or other sensitive data – restricting sensitive columns based on compliance requirements.

  • Division, department, organization, or geographical region – restricting data based on the level of access.

  • Source systems – restricting data to a particular source system when building a common data warehouse.
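The filters above can also be created with the `create_asset_filter` API. The configuration shapes below are assumptions based on the DataZone API model, so verify them against your SDK version; the IDs are placeholders.

```python
def column_filter_config(included_columns):
    """Column-level filter: expose only the listed columns."""
    return {"columnConfiguration": {"includedColumnNames": included_columns}}

def row_filter_config(column, value):
    """Row-level filter: keep only rows where `column` equals `value`."""
    return {"rowConfiguration": {"rowFilter": {
        "expression": {"equalTo": {"columnName": column, "value": value}}}}}

def create_column_filter(dz, domain_id, asset_id, name, columns):
    """Register a column filter on an asset; `dz` is a boto3 DataZone
    client. A steward can then apply it when approving a subscription."""
    return dz.create_asset_filter(
        domainIdentifier=domain_id,
        assetIdentifier=asset_id,
        name=name,
        configuration=column_filter_config(columns))
```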

Data Consumption

When the data steward approves a subscription for project use, a DataZone user with project access can query the data using Amazon Athena. The data is exposed in a project-specific Glue catalog, and user access is federated from DataZone to Athena. No AWS console access is needed, as users are passed directly from DataZone to Athena. The DataZone consumer project controls access to the Glue catalog and data assets, as shown in the example below.
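For programmatic consumption, the same subscribed tables can be queried with the boto3 Athena client. A minimal sketch follows, assuming the project's Glue database and an S3 output location; all names are placeholders.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def is_terminal_state(state):
    """Athena query states that require no further polling."""
    return state in TERMINAL_STATES

def athena_query(athena, sql, database, output_s3, catalog="AwsDataCatalog"):
    """Run a query against the project's Glue database and poll until it
    finishes. `athena` is a boto3 Athena client; the catalog name may
    differ for a project-specific DataZone catalog."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database, "Catalog": catalog},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if is_terminal_state(state):
            return query_id, state
        time.sleep(2)
```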

 

Notification & Alerts

At the time of writing, DataZone doesn't provide email notification services out of the box. We can build our own notification and alert service using EventBridge and Lambda to send out emails for proper management of users and data. This can be done using the DataZone boto3 library.

Alerts

Alerts are generated by DataZone based on various events, as mentioned here. We can capture these events and trigger Lambda code to send out emails. For our use case, we triggered on the subscription request and response events.

  • On request – An email alert goes to the publisher project's data owners and stewards when a consumer project subscribes to their data asset.

  • On response – When the respective data steward approves or rejects the request, an email alert is sent to the subscribing consumer project's owner(s), notifying them of the status of the subscription request.
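A sketch of the Lambda side of this EventBridge-based alerting follows. The event-pattern detail-type strings are assumptions to verify against the actual DataZone events in your EventBridge console, and the SES addresses are placeholders.

```python
import json

# Assumed detail-types for DataZone subscription events; verify the
# exact strings before deploying the EventBridge rule.
SUBSCRIPTION_EVENT_PATTERN = {
    "source": ["aws.datazone"],
    "detail-type": ["Subscription Request Created",
                    "Subscription Request Accepted",
                    "Subscription Request Rejected"],
}

def email_body(event):
    """Render a plain-text email from an EventBridge event envelope."""
    detail = event.get("detail", {})
    return (f"DataZone event: {event.get('detail-type')}\n"
            f"Domain: {detail.get('metadata', {}).get('domain', 'n/a')}\n"
            f"Detail: {json.dumps(detail, default=str)}")

def lambda_handler(event, context):
    """Lambda target of the EventBridge rule: forwards the event by
    email over SES. Sender/recipient addresses are placeholders."""
    import boto3
    ses = boto3.client("ses")
    ses.send_email(
        Source="datazone-alerts@example.com",
        Destination={"ToAddresses": ["stewards@example.com"]},
        Message={
            "Subject": {"Data": event.get("detail-type", "DataZone event")},
            "Body": {"Text": {"Data": email_body(event)}},
        },
    )
```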

Notifications

We can also send out notifications to the publisher project data owners/stewards and all the project owners.

  • To publisher project data owners/stewards – These kinds of notifications help data owners/stewards of publisher projects keep a security check on who can access what data assets and at what level. We can capture & email various fields such as published asset name, subscriber project name, members having access to the asset and filters applied by the owner/steward.

  • To all the project owners – These email notifications are distributed to all project owners and contain information about members having access to their projects' resources. This helps project owners keep a security check on who can access the data assets (owned or subscribed) of their respective projects.

 
