Microsoft Purview: data catalog as a service
One data catalog for browsing your lakehouse, with governance!
Register, Scan and Search your cloud storage with Microsoft Purview’s Data Map and Data Catalog. A range of Storage options is supported.
Join me in exploring and evaluating Microsoft Purview’s data catalog features (Solutions) in this technical guide.
I structured this guide according to Microsoft Purview’s official product naming and navigation in the portal, in an effort to document the domain as clearly as possible.
Get started
- Fulfill requirements
- Sign in to portal at https://purview.microsoft.com/
- Register, Scan and Search your cloud storage!
Solutions
At the top level, Microsoft Purview is divided into Solutions. A Solution is a product that offers features. In many cases, a Solution or a feature is locked behind a Microsoft Purview Enterprise license, a Microsoft 365 E3 or E5 license, a Solution-specific license, or a combination of these.
Data Map
Solutions → Data Map
- Register Data Sources
- Scan Data Sources
- Manage Domains and Collections
Data Source
Solutions → Data Map → Data Sources
In order to be Scanned, Storage options must be Registered as a Data Source.
Register
Solutions → Data Map → Data Sources → Register
- Register your Data Source to a Domain or Collection.
- RBAC
Scan
Solutions → Data Map → Data Sources → [Data Source name] → New scan
- Reads some data from the Registered Data Source (using the configured credential)
- Generates Metadata for Scanned Data Source
a. Available in Data Catalog
- Configured per Data Source, with support for:
a. Scan rules
b. Single or scheduled runs
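To make "reads some data, generates Metadata" more concrete, here is a minimal Python sketch that samples the first rows of a CSV file and infers column names and coarse data types. It is an illustration only and not Microsoft Purview's actual scanning logic; the function names `infer_type` and `infer_schema`, the `sample_rows` parameter and the type labels are all my own assumptions.

```python
import csv
import io


def infer_type(values):
    """Return a coarse type label for a list of strings (labels are illustrative)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "string"  # nothing sampled, fall back to the most general type
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "string"


def infer_schema(csv_text, sample_rows=100):
    """Infer {column name: type label} from at most `sample_rows` rows of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break  # assumption: only a sample of the data is read
        for name in columns:
            value = row.get(name)
            if value not in (None, ""):
                columns[name].append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}


sample = "id,price,city\n1,9.5,Oslo\n2,12,Bergen\n"
schema = infer_schema(sample)
print(schema)  # {'id': 'int', 'price': 'double', 'city': 'string'}
```

The point of the sketch is that a schema can be derived from a small sample of rows, so producing Metadata this way does not require reading the full data set.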
Domains and Collections
Solutions → Data Map → Domains
Partition your Data Sources.
- Associate Data Sources to domain-specifically named partitions
a. Filter Search on partition
- Access control your Data Sources on partition
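The two uses of a partition (filtering and access control) can be sketched with a small in-memory model. This is a simplified, flat illustration and not how Microsoft Purview stores Domains and Collections; the `Collection` class, the `visible_sources` function and every name in the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A named partition holding Data Sources and the users allowed to read it."""
    name: str
    data_sources: set = field(default_factory=set)
    readers: set = field(default_factory=set)


def visible_sources(collections, user, collection_name=None):
    """Return Data Sources the user may see, optionally limited to one partition."""
    visible = []
    for c in collections:
        if user not in c.readers:
            continue  # access control: skip partitions the user cannot read
        if collection_name is not None and c.name != collection_name:
            continue  # filter: keep only the requested partition
        visible.extend(sorted(c.data_sources))
    return visible


collections = [
    Collection("finance", {"adls-finance", "sql-ledger"}, {"alice"}),
    Collection("marketing", {"s3-campaigns"}, {"alice", "bob"}),
]

print(visible_sources(collections, "bob"))               # ['s3-campaigns']
print(visible_sources(collections, "alice", "finance"))  # ['adls-finance', 'sql-ledger']
```

Because access is decided per partition, adding a Data Source to `finance` makes it visible to every reader of that partition without touching individual permissions.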
Data Catalog
Solutions → Data Catalog
- Browse Scanned Data Sources
a. View Metadata
- Browse Azure resources (requires read access on the Azure resource, e.g. a Contributor role assigned to the Entra user)
Search
Solutions → Data Catalog → Discovery → Data assets
Supports free-text, Storage option, Domain and Collection filters.
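How these filters combine can be illustrated with a few lines of Python. The sketch below is an in-memory stand-in, not the Data Catalog's search implementation or API; the `Asset` class, the `search_assets` function and the sample assets are hypothetical, and I assume that all supplied filters must match at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    storage_option: str
    collection: str


def search_assets(assets, text=None, storage_option=None, collection=None):
    """Return assets matching every filter that was supplied (None means no filter)."""
    results = []
    for a in assets:
        if text is not None and text.lower() not in a.name.lower():
            continue  # free-text: case-insensitive substring match on the name
        if storage_option is not None and a.storage_option != storage_option:
            continue
        if collection is not None and a.collection != collection:
            continue
        results.append(a)
    return results


assets = [
    Asset("sales_2024.parquet", "Azure Data Lake Storage Gen 2", "finance"),
    Asset("sales_raw.csv", "Amazon S3", "finance"),
    Asset("campaigns.parquet", "Azure Data Lake Storage Gen 2", "marketing"),
]

hits = search_assets(assets, text="sales", storage_option="Amazon S3")
print([a.name for a in hits])  # ['sales_raw.csv']
```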
Metadata
Solutions → Data Catalog → Discovery → Data assets → [asset name](.[file extension])
Auto-generated on Scan.
Schema
Lineage
Graph describing data states and the processes responsible — e.g. pipeline jobs (Azure Data Factory or Microsoft Fabric) and stored procedures.
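A Lineage graph of this kind can be modelled as a directed graph in which data states and processes are both nodes. The Python sketch below walks such a graph upstream to find the processes behind an asset. It is only an illustration of the idea; the node names, the `adf:` and `proc:` prefixes and both functions are hypothetical and do not reflect how Microsoft Purview stores Lineage.

```python
from collections import deque

# Edges point downstream: node -> nodes it feeds. All names are made up.
lineage = {
    "bronze/orders.parquet": ["adf:clean_orders"],
    "adf:clean_orders": ["silver/orders.parquet"],
    "silver/orders.parquet": ["proc:aggregate_orders"],
    "proc:aggregate_orders": ["gold/orders_daily"],
}


def upstream(graph, node):
    """Return every node that `node` depends on, nearest first (breadth-first)."""
    parents = {}
    for src, targets in graph.items():
        for t in targets:
            parents.setdefault(t, []).append(src)
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for p in parents.get(current, []):
            if p not in seen:
                seen.add(p)
                order.append(p)
                queue.append(p)
    return order


def producing_processes(graph, node, prefixes=("adf:", "proc:")):
    """Keep only upstream nodes that represent processes rather than data states."""
    return [n for n in upstream(graph, node) if n.startswith(prefixes)]


print(producing_processes(lineage, "gold/orders_daily"))
# ['proc:aggregate_orders', 'adf:clean_orders']
```

Walking upstream answers "which jobs produced this data?", while the same graph walked downstream would answer "what breaks if I change this job?".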
Conclusion
Microsoft Purview offers a data catalog for cloud storage, with the following pros and cons.
Pros
- Cross-platform
- Access controlled
- Populated with Metadata (auto-generated on Scanned Data Sources)
a. Schema with column names and data types
b. Lineage for data pipelines and stored procedures
Cons
- (Stuck) in preview (“Microsoft Preview”)
- Lacking in documentation
Opinion
Data Catalog ships with support for a range of Storage options, Access control, Search, and Metadata. Together these provide the insight and overview needed to maintain a sizable and growing lakehouse.
Microsoft Purview Enterprise offers an intuitive data catalog with pioneering governance features in preview.
Appendix
Requirements
- Azure tenant
- At least one Storage option
- Entra user (assigned to Azure tenant)
a. Privileged Microsoft Purview role group (less dependence on Entra Global Administrator)
- Microsoft Purview Enterprise License
- In order to Scan Registered Data Sources, Microsoft Purview needs read access on the resource from the respective cloud platform.
a. For Azure, Microsoft Purview’s managed identity (MSI) may be used. The MSI must be granted access on the resource(s) by an Entra user with an Owner or User Access Administrator role assignment on the resource(s).
b. For AWS, e.g. Amazon S3, use AWS role ARN credential
Access control
Solutions RBAC
Settings → Roles and scopes → Role groups
Manage Microsoft Purview role groups, which control access to functionality in Microsoft Purview (requires an Entra user with a privileged Microsoft Purview role group).
Privileged Microsoft Purview role group
- Purview Administrator OR Organization Management (OR Entra Global Administrator)
- Authorized to escalate privilege for all Solutions
- Assigned to Entra user by Entra user with privileged Microsoft Purview role group
Domains and Collections RBAC
Solutions → Data Map → Domains → [Domain or Collection name] → Role assignments
Manage access to partitions and respective Registered Data Sources in Microsoft Purview.
Storage options
I have tested italic options:
Azure
Azure Blob Storage, Azure Data Lake Storage Gen 2, Azure SQL Database, Azure Subscriptions, Resource Groups, Microsoft Fabric
Other
Amazon S3, Snowflake, SAP, MySQL, MongoDB, PostgreSQL, Tableau, Cassandra, Db2, erwin, HDFS, Hive Metastore, Looker, Google BigQuery, Teradata, Qlik Sense, Salesforce, Dataverse
License and pricing
Microsoft Purview instances are assigned per Azure tenant — all tenants ship with an active Microsoft Purview Free License.
Pricing — Microsoft Purview | Microsoft Azure
At a bare minimum, you will be billed for the following in order to achieve the results from this article:
- Compute costs from Scanning — Schema and Lineage extraction (Data Map)
- Compute costs from Lineage Visualization and Search (Data Catalog)
- Microsoft Purview Enterprise license
a. Trial can be activated in the Microsoft Purview portal by the Azure tenant’s Entra Global Administrator
b. Generates Microsoft Purview resource in Azure tenant — tracks license and compute costs
For labeling in Data Map (did not work):
Example use case
A data scientist needs to process data in a lakehouse that uses the Medallion architecture, but first needs information about the data’s structure, its relationships to existing processes, and its storage location in order to proceed.
- Entra user is authorized access to Domains and Collections
- Inspects Schemas, Lineages, Storage option URLs by Searching the Data Catalog
- Identifies an existing ADF (Azure Data Factory) ETL job related to .parquet files stored in an Azure Data Lake Storage Gen 2 Data Source in an authorized Domain or Collection by inspecting Lineage
- Follows link to identified ADF ETL job, discovered in Lineage
- Extends job in ADF, problem solved!
Issues
- Sensitivity labels and classifications for Data Map did not work
a. In preview
b. Microsoft 365 E5 license requirement — fulfilled on my Entra user
c. Scans report classifications but none are visible in Schemas
Question: Is the E5 license requirement enforced on the Entra user that initiates the Scan? I failed to find the answer.
- Docs carry “classic” terminology, which raises questions such as “Is this deprecated?” and “Where’s current info on this?”