Microsoft Fabric - DP-600
Microsoft Fabric -
Unified SaaS platform used by data professionals and business users.
It is capable of ingesting, storing, processing, and analyzing data in a single environment.
Integrated user interface: a single product that is easy to understand, set up, create in, and manage.
Scalable analytics can be complex, fragmented, and expensive. With Microsoft Fabric, you don't have to spend all of your time combining various services from different vendors.
Because Fabric is a SaaS platform, it allows you to quickly and easily provision and run any type of workload or job without needing pre-approval or planning. This means that you can scale resources up or down as needed, and be more agile and responsive to changing business needs.
The data silo problem is addressed because all workloads share a single copy of the data (data silos are isolated sets or repositories of data within an organization that are not easily accessible or integrated with other systems).
Example -
Imagine your company has been using a data warehouse to store structured data from its transactional systems, such as order history, inventory levels, and customer information. You have also collected unstructured data from social media, website logs, and third-party sources that are difficult to manage and analyze using the existing data warehouse infrastructure. Your company's new directive is to improve its decision-making capabilities by analyzing data in various formats across multiple sources, so the company chooses Microsoft Fabric.
Fabric includes the following services:
Data Engineering
Data Integration
Data Warehousing
Real-Time Analytics
Data Science
Business Intelligence
The permissions required to enable Fabric are one of the following:
Fabric admin
Power Platform admin
Microsoft 365 admin
OneLake -
OneCopy is a key component of OneLake that allows you to read data from a single copy, without moving or duplicating data.
Fabric's data warehousing, data engineering (Lakehouses and Notebooks), data integration (pipelines and dataflows), real-time analytics, and Power BI all use OneLake as their native store without needing any extra configuration.
OneLake is built on top of Azure Data Lake Storage (ADLS) and data can be stored in any format, including Delta, Parquet, CSV, JSON, and more.
For tabular data, the analytical engines in Fabric write data in Delta-Parquet format.
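To make that concrete, here is a minimal sketch (not from the exam content) of writing tabular data from a Fabric notebook attached to a lakehouse; the table and column names are made up for illustration.

from pyspark.sql import SparkSession

# In a Fabric notebook a `spark` session already exists; getOrCreate() just reuses it.
spark = SparkSession.builder.getOrCreate()

# A small, made-up DataFrame of tabular data.
df = spark.createDataFrame(
    [(1, "Contoso", 250.0), (2, "Fabrikam", 125.5)],
    ["OrderID", "Customer", "Amount"],
)

# Saving as a managed table stores the data in OneLake as Delta-Parquet:
# Parquet data files plus a Delta transaction log.
df.write.format("delta").mode("overwrite").saveAsTable("orders")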
Shortcuts - references in OneLake that point to data stored in other locations (for example, another OneLake location or external storage such as ADLS Gen2), so the data can be used without copying or moving it.
Lakehouse -
It is a unified platform that combines:
The flexible and scalable storage of a data lake
The ability to query and analyze data of a data warehouse
A Lakehouse is a great option if you want a scalable analytics solution that maintains data consistency.
As cloud-based solutions, lakehouses can scale automatically and provide high availability and disaster recovery.
Lakehouse data is organized in a schema-on-read format, meaning the schema is defined when the data is read rather than when it is written (see the sketch below).
Lakehouses support ACID transactions (via the Delta Lake format).
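A minimal sketch of schema-on-read, assuming a Fabric notebook attached to a lakehouse; the folder path and column names are hypothetical. The schema is supplied by the reader at query time, not stored with the raw files.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Schema defined by the reader (schema-on-read), not by the raw CSV files.
sales_schema = StructType([
    StructField("OrderID", IntegerType()),
    StructField("Customer", StringType()),
    StructField("Amount", DoubleType()),
])

raw_sales = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(sales_schema)
    .load("Files/raw/sales/*.csv")   # hypothetical folder in the lakehouse Files area
)

raw_sales.show(5)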
Fabric Lakehouse:
You can load data - in any common format - from various sources, including local files, databases, or APIs.
Data ingestion can also be automated using Data Factory Pipelines or Dataflows (Gen2) in Microsoft Fabric.
You can create Fabric shortcuts to data in external sources, such as Azure Data Lake Store Gen2 or a Microsoft OneLake location outside of the lakehouse's own storage.
How to use Lakehouse -
The Lakehouse is the lakehouse storage and metadata, where you interact with files, folders, and table data.
The default semantic model is automatically created based on the tables in the lakehouse; Power BI reports can be built from it.
The SQL analytics endpoint is a read-only endpoint through which you can connect and query data with Transact-SQL.
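As a hedged illustration (not part of the course material): because the SQL analytics endpoint speaks T-SQL over TDS, it can be queried from Python with pyodbc. The server, database, and table names below are placeholders; the real SQL connection string is copied from the endpoint's settings in Fabric.

import pyodbc

# Placeholders -- copy the actual SQL connection string from the SQL analytics
# endpoint in the Fabric portal. Authentication here uses an interactive
# Microsoft Entra ID sign-in and assumes the Microsoft ODBC Driver 18 is installed.
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-endpoint>.datawarehouse.fabric.microsoft.com;"
    "DATABASE=<your_lakehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # The endpoint is read-only, so only SELECT-style T-SQL is expected to work here.
    cursor.execute("SELECT TOP 10 * FROM dbo.orders")   # hypothetical table
    for row in cursor.fetchall():
        print(row)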
Ingest & Transform data into a lakehouse -
There are many ways to load data into a Fabric lakehouse, including:
Upload: Upload local files or folders to the lakehouse. You can then explore and process the file data, and load the results into tables.
Dataflows (Gen2): Import and transform data from a range of sources using Power Query Online, and load it directly into a table in the lakehouse.
Notebooks: Use notebooks in Fabric to ingest and transform data, and load it into tables or files in the lakehouse (see the sketch after this list).
Data Factory pipelines: Copy data and orchestrate data processing activities, loading the results into tables or files in the lakehouse.
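A minimal sketch of the notebook option above, assuming a Fabric notebook attached to a lakehouse; the file path, column names, and table name are hypothetical. It reads an uploaded CSV from the Files section, applies a small transformation, and loads the result into a Delta table.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

# Read a file that was uploaded to the lakehouse Files area.
orders = (
    spark.read.option("header", "true")
    .csv("Files/uploads/orders.csv")
    .withColumn("OrderDate", to_date(col("OrderDate")))   # cast string to date
    .withColumn("Amount", col("Amount").cast("double"))   # cast string to number
)

# Load the transformed data into a table in the lakehouse.
orders.write.format("delta").mode("append").saveAsTable("orders_staging")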
Real-Time Analytics in Fabric -
Synapse Real-Time Analytics in Fabric uses a KQL database to provide table storage and the Kusto Query Language (KQL), which is a powerful tool for analyzing data.
KQL is optimized for data that includes a time series component, such as real-time data from log files or streaming services.
It provides an efficient way to find insights and patterns from textual or structured data.
It supports automatic partitioning and indexing of data.
It delivers high performance for data of various sizes, ranging from a few gigabytes to several petabytes.
It can be used for solutions like IoT and log analytics in many scenarios including manufacturing, oil and gas, and automotive.
Objects inside a KQL Database -
Table - is a schema entity that contains a set of columns and rows of data.
You can use the .create table command to create a new table, the .show table command to show the table schema, and the .ingest command to ingest data into a table.
Function - is a schema entity that encapsulates a subquery expression that can be invoked from within other KQL queries.
A stored function has a name, an optional list of parameters, and a body that contains the subquery expression.
You can use the .create function command to create a new stored function, and the .show functions command to show the stored functions in a database.
Materialized view - is a schema entity that stores precomputed results of a query for faster retrieval.
A materialized view has a name, an optional list of parameters, and a body that contains the query expression.
You can use the .create materialized-view command to create a new materialized view, and the .show materialized-views command to show the materialized views in a database.
Datastream - is a representation of all of the attached KQL event streams connected to the KQL database.
KQL -
A KQL query is a read-only request to process data and return results.
A query statement consists of a table name followed by one or more operators that take, filter, transform, aggregate, or join data.
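To make that shape concrete, here is a hedged sketch (not from the course material) that runs a KQL query against a Fabric KQL database from Python using the azure-kusto-data package; the cluster URI, database, table, and column names are all placeholders, and the real query URI comes from the KQL database's details page.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table

cluster_uri = "https://<your-cluster>.kusto.fabric.microsoft.com"   # placeholder
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster_uri)
client = KustoClient(kcsb)

# A typical KQL query: a table name piped through filter, aggregate, and sort operators.
query = """
DeviceTelemetry                           // hypothetical table
| where Timestamp > ago(1h)               // filter to the last hour
| summarize AvgTemp = avg(Temperature) by bin(Timestamp, 5m)
| order by Timestamp asc
"""

result = client.execute("MyKqlDatabase", query)   # placeholder database name
df = dataframe_from_result_table(result.primary_results[0])
print(df.head())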
Data science in Microsoft Fabric -
Fabric Lakehouse design - medallion architecture
The medallion architecture is a recommended data design pattern used to organize data in a lakehouse logically
Data lakehouses in Fabric are built on the Delta Lake format, which natively supports ACID (Atomicity, Consistency, Isolation, Durability) transactions.
The architecture typically has three layers – bronze (raw), silver (validated), and gold (enriched), each representing higher data quality levels.
Some people also call it a "multi-hop" architecture, meaning that data can move between layers as needed.
In some cases, there may be additional raw (before bronze) and platinum (after gold) layers. Regardless of the names and number of layers, the medallion architecture is flexible and can be tailored to meet your organization's particular requirements.
Bronze layer -
The bronze or raw layer is the first layer of the lakehouse.
It's the landing zone for all data, whether it's structured, semi-structured, or unstructured.
The data is stored in its original format, and no changes are made to it.
Silver layer -
The silver or validated layer is the second layer of the lakehouse.
It's where you'll validate and refine your data.
Typical activities in the silver layer include combining and merging data and enforcing data validation rules like removing nulls and deduplicating.
The silver layer can be thought of as a central repository across an organization or team, where data is stored in a consistent format and can be accessed by multiple teams.
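A minimal bronze-to-silver sketch, assuming a Fabric notebook and hypothetical table and column names: read the raw bronze table, apply the validation rules described above (remove nulls, deduplicate), and write the result to a silver Delta table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze_orders")   # hypothetical bronze table

silver = (
    bronze
    .dropna(subset=["OrderID", "CustomerID"])   # remove rows missing key fields
    .dropDuplicates(["OrderID"])                # deduplicate on the business key
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")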
Gold layer -
The gold or enriched layer is the third layer of the lakehouse.
In the gold layer, data undergoes further refinement to align with specific business and analytics needs. This could involve aggregating data to a particular granularity, such as daily or hourly, or enriching it with external information.
Once the data reaches the gold stage, it becomes ready for use by downstream teams, including analytics, data science, or MLOps.
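And a minimal silver-to-gold sketch with hypothetical names, aggregating the validated data to a daily grain so it is ready for downstream reporting.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, sum as sum_, countDistinct

spark = SparkSession.builder.getOrCreate()

silver = spark.read.table("silver_orders")

# Aggregate to a daily granularity for reporting.
gold_daily_sales = (
    silver
    .withColumn("OrderDate", to_date("OrderTimestamp"))
    .groupBy("OrderDate")
    .agg(
        sum_("Amount").alias("TotalSales"),
        countDistinct("CustomerID").alias("UniqueCustomers"),
    )
)

gold_daily_sales.write.format("delta").mode("overwrite").saveAsTable("gold_daily_sales")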
More info about medallion architecture - ref
Move data across layers in Fabric -
There are a few things to consider when deciding how to move and transform data across layers.
How much data are you working with?
How complex are the transformations you need to make?
How often will you need to move data between layers?
What tools are you most comfortable with?
Tools for data transformation in Fabric include
Dataflows (Gen2) - These are a great option for smaller semantic models and simple transformations.
Notebooks - These are a better option for larger semantic models and more complex transformations.
Data orchestration refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome. The primary tool for data orchestration in Fabric is pipelines.
Ingest data with Spark and Microsoft Fabric notebooks - Sample Notebook code for authentication, path setup, etc - ref
Refer for practical knowledge -
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/01-lakehouse.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/02-analyze-spark.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/03-delta-lake.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/04-ingest-pipeline.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/05-dataflows-gen2.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/06-data-warehouse.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/07-real-time-analytics.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/03b-medallion-lakehouse.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/10-ingest-notebooks.html
https://microsoftlearning.github.io/mslearn-fabric/Instructions/Labs/11-data-activator.html