1. You have data in PostgreSQL that was designed to reduce redundancy. You are transferring this data to BigQuery for analytics. The source data is hierarchical and frequently queried together. You need to design a BigQuery schema that is performant. What should you do?
- Copy the primary tables and use federated queries for secondary tables.
- Copy the normalized data into partitions.
- Use nested and repeated fields.
- Retain the data in normalized form always.
✅ Correct Answer: Use nested and repeated fields.
Why this is the best approach
You’re moving data from PostgreSQL (OLTP) to BigQuery (OLAP) with these characteristics:
🔁 Source schema is normalized (to reduce redundancy)
🌳 Data is hierarchical
🔍 Tables are frequently queried together
⚡ You need high-performance analytics
👉 BigQuery performs best with denormalized schemas, especially using nested (STRUCT) and repeated (ARRAY) fields.
Benefits of nested & repeated fields in BigQuery
🚀 Fewer joins → faster queries
💰 Lower cost → less data scanned
📊 Simpler SQL for analysts
🧱 Preserves hierarchical relationships naturally (e.g., orders → line items)
Example:
orders
├─ order_id
├─ customer_id
└─ items ARRAY<STRUCT<product_id, quantity, price>>
This keeps related data together and aligns perfectly with BigQuery’s columnar storage.
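As a minimal sketch (the project, dataset, and column names are hypothetical), the same hierarchy can be declared and queried in BigQuery without any joins:
CREATE TABLE `my_project.retail.orders` (
  order_id    STRING,
  customer_id STRING,
  items ARRAY<STRUCT<product_id STRING, quantity INT64, price NUMERIC>>
);

-- Flatten the repeated field only when line-item detail is needed
SELECT
  o.order_id,
  item.product_id,
  item.quantity * item.price AS line_total
FROM `my_project.retail.orders` AS o,
  UNNEST(o.items) AS item;
Order-level queries read the table directly; UNNEST is needed only when the analysis requires line-item granularity.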
Why the other options are incorrect
❌ Copy primary tables and use federated queries
Adds latency
Increases complexity
Not optimized for analytics
❌ Copy normalized data into partitions
Partitioning helps pruning, but does not reduce joins
Still inefficient for hierarchical queries
❌ Retain the data in normalized form
OLTP best practice, not OLAP
Leads to expensive joins and slower queries in BigQuery
Exam rule of thumb 🧠
BigQuery + hierarchical data + frequent joins → Nested & repeated fields
2. You used Dataplex to create lakes and zones for your business data. However, some files are not being discovered. What could be the issue?
- You have scheduled discovery to run every hour.
- The files are in Parquet format.
- You have an exclude pattern that matches the files.
- The files are in ORC format.
✅ Correct Answer: You have an exclude pattern that matches the files.
Why this is the correct explanation
In Dataplex, data discovery works based on inclusion and exclusion patterns defined at the asset level. If some files are not being discovered, the most common and likely cause is:
👉 An exclude pattern that unintentionally matches those files.
When an exclude pattern matches:
Dataplex skips those files entirely
They will never appear in the catalog, even if:
Discovery runs successfully
The file format is supported
The schedule is frequent
Why the other options are incorrect
❌ Scheduled discovery runs every hour
This would increase discovery frequency, not prevent it.
❌ Files are in Parquet format
Parquet is fully supported by Dataplex.
❌ Files are in ORC format
ORC is also supported by Dataplex.
Exam rule of thumb 🧠
Dataplex discovery missing files → Check include/exclude patterns first
3. You have analytics data stored in BigQuery. You need an efficient way to compute values across a group of rows and return a single result for each row. What should you do?
- Use a UDF (user-defined function).
- Use a window function with an OVER clause.
- Use an aggregate function.
- Use BigQuery ML.
✅ Correct Answer: Use a window function with an OVER clause.
Why this is the correct solution
You need to:
Compute values across a group of rows
Return a result for each row (not collapse rows)
👉 This is exactly what window (analytic) functions in BigQuery are designed for.
What window functions do
Perform calculations across a defined window of rows
Preserve row-level detail
Examples:
Running totals
Moving averages
Rank, dense_rank
Percent of total per row
Example:
SELECT
  order_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;
Each row keeps its identity while sharing a group-level computation.
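Adding an ORDER BY inside the OVER clause turns the same pattern into a running total. This sketch assumes the same hypothetical orders table, plus an order_date column:
SELECT
  order_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;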
Why the other options are incorrect
❌ UDF (user-defined function)
Used for custom scalar logic
Does not by itself compute values across a group of rows, and adds overhead
❌ Aggregate function
Aggregates collapse rows
Do not return a result per row unless used with OVER
❌ BigQuery ML
Used for machine learning
Not for SQL analytics patterns
Exam rule of thumb 🧠
Group-level computation + row-level output → Window function (OVER)
4. Your data in BigQuery has some columns that are extremely sensitive. You need to enable only some users to see certain columns. What should you do?
- Create a new dataset with the column's data.
- Create a new table with the column's data.
- Use policy tags.
- Use Identity and Access Management (IAM) permissions.
✅ Correct Answer: Use policy tags.
Why this is the correct solution
Your requirement is column-level security in BigQuery:
🔐 Some columns are extremely sensitive
👥 Only specific users should be able to see them
📊 Other users should still access the rest of the table
👉 Policy tags (from Data Catalog) are specifically designed for this exact use case.
What policy tags provide
Column-level access control
Centralized management of sensitive data classifications
Fine-grained permissions like:
Who can see PII columns
Who can see financial data
Enforcement at query time
Example:
Tag credit_card_number with a policy tag
Grant access only to a specific IAM group
Users without permission see an error similar to:
Access Denied: Permission denied on column credit_card_number
Why the other options are incorrect
❌ Create a new dataset with the column's data
Data duplication
Hard to maintain
Not scalable
❌ Create a new table with the column's data
Same issues: duplication, sync problems
❌ Use IAM permissions
IAM works at:
Project
Dataset
Table
Cannot restrict access to individual columns
Exam rule of thumb 🧠
BigQuery column-level security → Policy tags
5. You built machine learning (ML) models based on your own data. In production, the ML models are not giving satisfactory results. When you examine the data, it appears that the existing data is not sufficiently representing the business goals. You need to create a more accurate machine learning model. What should you do?
- Train the model with the same data, but use more epochs.
- Perform feature engineering, and use domain knowledge to enhance the column data.
- Perform L2 regularization.
- Train the model with more of similar data.
✅ Correct Answer: Perform feature engineering, and use domain knowledge to enhance the column data.
Why this is the best solution
Your core problem is data quality and representation, not model training mechanics:
📉 The model performs poorly in production
🔍 Analysis shows the data does not sufficiently represent business goals
🎯 You want a more accurate model
👉 This points directly to a feature problem, not a training or tuning problem.
Why feature engineering is key
Improves how well raw data represents real-world business concepts
Incorporates domain knowledge (often the most valuable signal in ML)
Often has more impact than changing algorithms or hyperparameters
Helps the model learn the right patterns, not just more patterns
Examples:
Creating ratios, aggregates, or flags
Encoding business rules into features
Transforming raw timestamps into business-relevant periods
Deriving customer lifetime value, churn indicators, etc.
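As a sketch of what this can look like in BigQuery SQL (the transactions table, column names, and the 500-unit threshold are all hypothetical), several of these features can be derived in a single query:
SELECT
  customer_id,
  COUNT(*) AS purchase_count,
  SUM(amount) AS total_spend,
  SAFE_DIVIDE(SUM(amount), COUNT(*)) AS avg_order_value,  -- ratio feature
  COUNTIF(amount > 500) AS high_value_orders,             -- business-rule flag
  DATE_DIFF(CURRENT_DATE(), MAX(DATE(purchase_ts)), DAY) AS days_since_last_purchase  -- recency
FROM `my_project.sales.transactions`
GROUP BY customer_id;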
Why the other options are incorrect
❌ Train with the same data using more epochs
Risks overfitting
Does not fix missing or poorly representative features
❌ Perform L2 regularization
Helps prevent overfitting
Does not improve data representativeness
❌ Train with more of similar data
More of the same biased or insufficient data won’t fix the issue
Garbage in → garbage out
Exam rule of thumb 🧠
Poor ML results + data doesn’t reflect business reality → Feature engineering
6. You need to optimize the performance of queries in BigQuery. Your tables are not partitioned or clustered. What optimization technique can you use?
- Filter data as late as possible.
- Use the LIMIT clause to reduce the data read.
- Batch your updates and inserts.
- Perform self-joins on data.
✅ Correct Answer: Batch your updates and inserts.
Why this is the correct answer
Although the question mentions queries, BigQuery performance depends not only on SELECT statements but also on how data is written and updated.
Why batching updates and inserts is correct
BigQuery is append-optimized
Row-by-row inserts and updates are expensive
Batching:
Reduces job overhead
Improves execution efficiency
Is a documented BigQuery optimization best practice
Especially relevant when tables are not partitioned or clustered
For tables that are neither partitioned nor clustered, this is often the most impactful optimization available.
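As a minimal sketch (table names are hypothetical), one batched statement replaces many single-row DML jobs:
-- Avoid: one DML job per row
-- INSERT INTO `my_project.sales.events` (event_id, payload) VALUES ('e1', 'a');
-- INSERT INTO `my_project.sales.events` (event_id, payload) VALUES ('e2', 'b');

-- Prefer: a single batched INSERT
INSERT INTO `my_project.sales.events` (event_id, payload)
VALUES ('e1', 'a'), ('e2', 'b'), ('e3', 'c');

-- Batched updates: apply a staging table in one MERGE statement
MERGE `my_project.sales.events` AS t
USING `my_project.sales.events_staging` AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT (event_id, payload) VALUES (s.event_id, s.payload);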
Why the other options are incorrect
❌ Filter data as late as possible
Opposite of best practice — filtering should be done as early as possible.
❌ Use the LIMIT clause
LIMIT does not reliably reduce bytes scanned
Not considered a true performance optimization
❌ Perform self-joins
Increases cost and execution time
7. You repeatedly run the same queries by joining multiple tables. The original tables change about ten times per day. You want an optimized querying approach. Which feature should you use?
- Federated queries
- Materialized views
- Partitions
- Views
✅ Correct Answer: Materialized views
Why materialized views are the best choice
Your situation:
🔁 You repeatedly run the same queries
🔗 Queries join multiple tables
🔄 The source tables change ~10 times per day
⚡ You want an optimized querying approach (better performance, less cost)
👉 Materialized views are designed exactly for this pattern.
What materialized views do
Precompute and store the results of a query
Refresh automatically and incrementally when source data changes
Dramatically reduce query execution time
Lower query cost by avoiding repeated heavy joins
This means:
Analysts query the materialized view instead of recomputing joins
Data stays reasonably fresh without full recomputation
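A minimal sketch, assuming hypothetical orders and customers tables; note that BigQuery materialized views support only a restricted set of query shapes (for example, inner joins with aggregation), so a real query may need adjusting:
CREATE MATERIALIZED VIEW `my_project.retail.daily_revenue_by_region` AS
SELECT
  c.region,
  DATE(o.order_ts) AS order_date,
  SUM(o.amount) AS revenue
FROM `my_project.retail.orders` AS o
JOIN `my_project.retail.customers` AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region, DATE(o.order_ts);

-- Analysts query the view; BigQuery keeps it incrementally refreshed
SELECT region, order_date, revenue
FROM `my_project.retail.daily_revenue_by_region`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);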
Why the other options are not correct
❌ Federated queries
Slower (data accessed externally)
Not optimized for repeated analytical joins
❌ Partitions
Help reduce scanned data
Do not eliminate repeated joins or recomputation
❌ Views
Logical only
Query is recomputed every time
No performance improvement for repeated complex queries
Exam rule of thumb 🧠
Repeated complex queries + frequently changing data → Materialized views
8. You have a complex set of data that comes from multiple sources. The analysts in your team need to analyze the data, visualize it, and publish reports to internal and external stakeholders. You need to make it easier for the analysts to work with the data by abstracting the multiple data sources. What tool do you recommend?
- Connected Sheets
- Looker
- Looker Studio
- D3.js library
✅ Correct Answer: Looker
Why Looker is the right recommendation
Your requirements are:
📊 Complex data from multiple sources
🧠 Analysts need to analyze the data
📈 Visualize insights
📰 Publish reports to internal and external stakeholders
🧱 Abstract underlying data complexity so analysts don’t deal with raw joins and source differences
👉 Looker is specifically built to model and abstract data complexity using a semantic layer.
What Looker provides (key differentiator)
LookML semantic layer
Central definition of metrics, dimensions, joins
One source of truth for business logic
Analysts work with business-friendly fields, not raw SQL
Supports:
Multiple data sources
Governed access
Reusable metrics
Enables:
Dashboards
Embedded analytics
Sharing with internal & external users
This directly satisfies the requirement to abstract multiple data sources.
Why the other options are not suitable
❌ Connected Sheets
Great for spreadsheet analysis
No semantic layer
Not designed for complex multi-source abstraction
❌ Looker Studio
Visualization-focused
Lacks a governed semantic modeling layer
Logic often duplicated per report
❌ D3.js library
Low-level visualization library
Requires heavy engineering effort
Not an analytics platform
Exam rule of thumb 🧠
Abstract complex data + governed metrics + enterprise BI → Looker
9. Your company uses Google Workspace and your leadership team is familiar with its business apps and collaboration tools. They want a cost-effective solution that uses their existing knowledge to evaluate, analyze, filter, and visualize data that is stored in BigQuery. What should you do to create a solution for the leadership team?
- Configure Looker Studio.
- Configure Connected Sheets.
- Create models in Looker.
- Configure Tableau.
✅ Correct Answer: Configure Connected Sheets
Why Connected Sheets is the best solution
Your requirements are very specific:
🧑💼 Leadership team, not data engineers
🧠 Already familiar with Google Workspace
💰 Cost-effective
📊 Need to evaluate, analyze, filter, and visualize data
🗄 Data is stored in BigQuery
👉 Connected Sheets is purpose-built for exactly this scenario.
What Connected Sheets provides
Direct connection between Google Sheets and BigQuery
Lets users:
Filter
Sort
Pivot
Chart
Uses the Google Sheets UI that leadership already knows
No SQL required
No data duplication
Very low learning curve
Highly cost-effective (no extra BI licensing)
Leadership can work with billions of rows from BigQuery as if they were working in Sheets.
Why the other options are not ideal
❌ Looker Studio
Good for dashboards
Less interactive for ad-hoc analysis compared to Sheets
❌ Looker
Powerful but heavier
Requires modeling (LookML)
More suitable for analysts and enterprise BI teams
❌ Tableau
Additional licensing cost
Not aligned with Google Workspace familiarity
Exam rule of thumb 🧠
Google Workspace users + BigQuery data + cost-effective analysis → Connected Sheets
10. Your business has collected industry-relevant data over many years. The processed data is useful for your partners and they are willing to pay for its usage. You need to ensure proper access control over the data. What should you do?
- Host the data on Analytics Hub.
- Host the data on Cloud SQL.
- Export the data to zip files and share it through Cloud Storage.
- Export the data to persistent disks and share it through an FTP endpoint.
✅ Correct Answer: Host the data on Analytics Hub
Why Analytics Hub is the right solution
Your requirements are:
📊 Processed, industry-relevant data
🤝 Shared with partners
💰 Partners are willing to pay
🔐 Strong access control
☁️ Google Cloud–native, scalable solution
👉 Analytics Hub is purpose-built for secure data sharing and monetization on Google Cloud.
What Analytics Hub provides
Secure data sharing without copying data
Fine-grained access control
Supports internal and external partners
Central marketplace-style model:
Data providers publish datasets
Consumers subscribe with controlled access
Works natively with BigQuery
Easy governance, auditing, and revocation
This is exactly how Google recommends sharing commercial datasets.
Why the other options are incorrect
❌ Cloud SQL
Not designed for analytics or data sharing at scale
Poor for large partner access
❌ Zip files via Cloud Storage
No fine-grained governance
Data duplication
Manual and insecure for paid access
❌ Persistent disks + FTP
Operationally complex
Security risks
Not scalable or cloud-native
Exam rule of thumb 🧠
Paid or partner data sharing with governance → Analytics Hub
11. You need to share inventory data from Cymbal Retail with a partner company that uses BigQuery to store and analyze its data. What tool can you use to securely and efficiently share the data?
- Cloud Storage
- Analytics Hub
- Data Catalog
- Cloud Data Loss Prevention (Cloud DLP)
✅ Correct Answer: Analytics Hub
Why Analytics Hub is the best choice
Your requirements are:
📦 Share inventory data
🤝 With a partner company
📊 Partner already uses BigQuery
🔐 Needs to be secure
⚡ Needs to be efficient (no copying, no syncing)
👉 Analytics Hub is purpose-built for secure BigQuery-to-BigQuery data sharing.
What Analytics Hub enables
Share BigQuery datasets without copying data
Fine-grained access control
Easy onboarding/offboarding of partners
Near real-time access to the latest data
Auditing and governance built in
The partner queries your data as if it were their own, while you retain full control.
Why the other options are incorrect
❌ Cloud Storage
Requires exporting data
Loses BigQuery-native sharing and governance
❌ Data Catalog
Metadata discovery tool
Does not share data
❌ Cloud DLP
Used for detecting/redacting sensitive data
Not for data sharing
Exam rule of thumb 🧠
BigQuery → BigQuery partner data sharing → Analytics Hub
12. Cymbal Retail has a team of ML engineers that builds and maintains machine learning models. As a Professional Data Engineer, how will you support this team?
- Process and prepare existing data to enable feature engineering.
- Finalize the type of machine learning model to use.
- Keep on improving the machine learning model after initial deployment.
- Identify what type of data is required to build ML models.
✅ Correct Answer: Process and prepare existing data to enable feature engineering.
Why this is the correct choice
As a Professional Data Engineer, your primary responsibility in supporting ML engineers is to ensure that high-quality, reliable, and well-structured data is available for modeling.
👉 Feature engineering depends heavily on data pipelines, which are owned and maintained by data engineers.
How a Data Engineer supports ML teams
Ingest data from multiple sources
Clean, validate, and transform raw data
Build scalable data pipelines (batch/streaming)
Prepare datasets and features in formats suitable for ML
Ensure data quality, freshness, and availability
This directly enables ML engineers to focus on modeling, not data wrangling.
Why the other options are incorrect
❌ Finalize the type of machine learning model to use
This is the responsibility of ML engineers or data scientists
❌ Keep on improving the machine learning model after deployment
Model tuning, retraining, and evaluation are ML engineering tasks
❌ Identify what type of data is required to build ML models
This is typically a collaborative design decision, led by ML engineers and business stakeholders
Exam rule of thumb 🧠
Data Engineer → Data pipelines & feature readiness
ML Engineer → Model design, training, tuning