
GCP PROFESSIONAL DATA ENGINEER CERTIFICATION Questions P-4

1. You have data in PostgreSQL that was designed to reduce redundancy. You are transferring this data to BigQuery for analytics. The source data is hierarchical and frequently queried together. You need to design a BigQuery schema that is performant. What should you do?

- Copy the primary tables and use federated queries for secondary tables.
- Copy the normalized data into partitions.
- Use nested and repeated fields.
- Retain the data in normalized form always.

✅ Correct Answer: Use nested and repeated fields.
Why this is the best approach

You’re moving data from PostgreSQL (OLTP) to BigQuery (OLAP) with these characteristics:

🔁 Source schema is normalized (to reduce redundancy)

🌳 Data is hierarchical

🔍 Tables are frequently queried together

⚡ You need high-performance analytics

👉 BigQuery performs best with denormalized schemas, especially using nested (STRUCT) and repeated (ARRAY) fields.

Benefits of nested & repeated fields in BigQuery

🚀 Fewer joins → faster queries

💰 Lower cost → less data scanned

📊 Simpler SQL for analysts

🧱 Preserves hierarchical relationships naturally (e.g., orders → line items)

Example:

orders
├─ order_id
├─ customer_id
└─ items ARRAY<STRUCT<product_id, quantity, price>>


This keeps related data together and aligns perfectly with BigQuery’s columnar storage.
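
A minimal DDL-and-query sketch of this layout (the dataset, table, and column names here are illustrative assumptions, not the actual source schema):

-- Denormalized orders table with nested, repeated line items
CREATE TABLE retail.orders (
  order_id STRING,
  customer_id STRING,
  order_date DATE,
  items ARRAY<STRUCT<product_id STRING, quantity INT64, price NUMERIC>>
);

-- Flatten line items only when needed, using UNNEST instead of a join
SELECT
  order_id,
  item.product_id,
  item.quantity * item.price AS line_total
FROM retail.orders, UNNEST(items) AS item;

Analysts who only need order-level fields never pay the cost of expanding the nested items.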

Why the other options are incorrect

❌ Copy primary tables and use federated queries

Adds latency

Increases complexity

Not optimized for analytics

❌ Copy normalized data into partitions

Partitioning helps prune the data scanned, but does not reduce joins

Still inefficient for hierarchical queries

❌ Retain the data in normalized form

OLTP best practice, not OLAP

Leads to expensive joins and slower queries in BigQuery

Exam rule of thumb 🧠

BigQuery + hierarchical data + frequent joins → Nested & repeated fields


2. You used Dataplex to create lakes and zones for your business data. However, some files are not being discovered. What could be the issue?

- You have scheduled discovery to run every hour.
- The files are in Parquet format.
- You have an exclude pattern that matches the files.
- The files are in ORC format.

✅ Correct Answer: You have an exclude pattern that matches the files.
Why this is the correct explanation

In Dataplex, data discovery works based on include and exclude patterns defined on the zone or asset. If some files are not being discovered, the most likely cause is:

👉 An exclude pattern that unintentionally matches those files.

When an exclude pattern matches:

Dataplex skips those files entirely

They will never appear in the catalog, even if:

Discovery runs successfully

The file format is supported

The schedule is frequent
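
For example, discovery include and exclude patterns are glob patterns matched against the Cloud Storage object paths in the asset. A hypothetical illustration (bucket name, paths, and patterns are assumptions):

Include pattern:  **/*              (discover everything in the asset)
Exclude pattern:  **/staging/**     (meant to skip temporary data)

gs://retail-lake/sales/staging/2024/orders_01.parquet  → excluded, never cataloged
gs://retail-lake/sales/2024/orders_02.parquet          → discovered as expected

Narrowing or removing the exclude pattern and re-running discovery makes the missing files appear.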

Why the other options are incorrect

❌ Scheduled discovery runs every hour

An hourly schedule makes discovery run more often; it does not prevent files from being discovered.

❌ Files are in Parquet format

Parquet is fully supported by Dataplex.

❌ Files are in ORC format

ORC is also supported by Dataplex.

Exam rule of thumb 🧠

Dataplex discovery missing files → Check include/exclude patterns first


3. You have analytics data stored in BigQuery. You need an efficient way to compute values across a group of rows and return a single result for each row. What should you do?

- Use a UDF (user-defined function).
- Use a window function with an OVER clause.
- Use an aggregate function.
- Use BigQuery ML.

✅ Correct Answer: Use a window function with an OVER clause.
Why this is the correct solution

You need to:

Compute values across a group of rows

Return a result for each row (not collapse rows)

👉 This is exactly what window (analytic) functions in BigQuery are designed for.

What window functions do

Perform calculations across a defined window of rows

Preserve row-level detail

Examples:

Running totals

Moving averages

RANK and DENSE_RANK

Percent of total per row

Example:

SELECT
  order_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;


Each row keeps its identity while sharing a group-level computation.
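
A second hedged sketch (table and column names are illustrative) showing ordered windows for the running-total and moving-average patterns listed above:

SELECT
  order_date,
  customer_id,
  amount,
  -- Running total per customer, ordered by date
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS running_total,
  -- Average over the current row and the 6 preceding rows
  AVG(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7
FROM orders;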

Why the other options are incorrect

❌ UDF (user-defined function)

Used for custom logic

Inefficient for row-group analytics

❌ Aggregate function

Aggregates collapse rows

Do not return a result per row unless used with OVER

❌ BigQuery ML

Used for machine learning

Not for SQL analytics patterns

Exam rule of thumb 🧠

Group-level computation + row-level output → Window function (OVER)


4. Your data in BigQuery has some columns that are extremely sensitive. You need to enable only some users to see certain columns. What should you do?

- Create a new dataset with the column's data.
- Create a new table with the column's data.
- Use policy tags.
- Use Identity and Access Management (IAM) permissions.

✅ Correct Answer: Use policy tags.
Why this is the correct solution

Your requirement is column-level security in BigQuery:

🔐 Some columns are extremely sensitive

👥 Only specific users should be able to see them

📊 Other users should still access the rest of the table

👉 Policy tags (from Data Catalog) are specifically designed for this exact use case.

What policy tags provide

Column-level access control

Centralized management of sensitive data classifications

Fine-grained permissions like:

Who can see PII columns

Who can see financial data

Enforcement at query time

Example:

Tag credit_card_number with a policy tag

Grant access only to a specific IAM group

Users without permission get an error like:

Access Denied: Permission denied on column credit_card_number

Why the other options are incorrect

❌ Create a new dataset with the column's data

Data duplication

Hard to maintain

Not scalable

❌ Create a new table with the column's data

Same issues: duplication, sync problems

❌ Use IAM permissions

IAM works at:

Project

Dataset

Table

❌ Cannot restrict access to individual columns

Exam rule of thumb 🧠

BigQuery column-level security → Policy tags


5. You built machine learning (ML) models based on your own data. In production, the ML models are not giving satisfactory results. When you examine the data, it appears that the existing data is not sufficiently representing the business goals. You need to create a more accurate machine learning model. What should you do?

- Train the model with the same data, but use more epochs.
- Perform feature engineering, and use domain knowledge to enhance the column data.
- Perform L2 regularization.
- Train the model with more of similar data.

✅ Correct Answer: Perform feature engineering, and use domain knowledge to enhance the column data.
Why this is the best solution

Your core problem is data quality and representation, not model training mechanics:

📉 The model performs poorly in production

🔍 Analysis shows the data does not sufficiently represent business goals

🎯 You want a more accurate model

👉 This points directly to a feature problem, not a training or tuning problem.

Why feature engineering is key

Improves how well raw data represents real-world business concepts

Incorporates domain knowledge (the most valuable signal in ML)

Often has more impact than changing algorithms or hyperparameters

Helps the model learn the right patterns, not just more patterns

Examples (sketched in SQL after this list):

Creating ratios, aggregates, or flags

Encoding business rules into features

Transforming raw timestamps into business-relevant periods

Deriving customer lifetime value, churn indicators, etc.
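
A small SQL sketch of these ideas (raw_orders and the derived feature names are hypothetical):

-- Turn raw transactions into business-aware features
SELECT
  customer_id,
  COUNT(*) AS orders_last_90d,                                   -- aggregate
  SUM(amount) / NULLIF(COUNT(*), 0) AS avg_order_value,          -- ratio
  DATE_DIFF(CURRENT_DATE(), DATE(MAX(order_ts)), DAY) AS days_since_last_order,  -- recency
  COUNTIF(is_returned) > 3 AS frequent_returner                  -- business-rule flag
FROM retail.raw_orders
WHERE order_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY customer_id;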

Why the other options are incorrect

❌ Train with the same data using more epochs

Risks overfitting

Does not fix missing or poorly representative features

❌ Perform L2 regularization

Helps prevent overfitting

Does not improve data representativeness

❌ Train with more of similar data

More of the same biased or insufficient data won’t fix the issue

Garbage in → garbage out

Exam rule of thumb 🧠

Poor ML results + data doesn’t reflect business reality → Feature engineering


6. You need to optimize the performance of queries in BigQuery. Your tables are not partitioned or clustered. What optimization technique can you use?

- Filter data as late as possible.
- Use the LIMIT clause to reduce the data read.
- Batch your updates and inserts.
- Perform self-joins on data.

✅ Correct Answer: Batch your updates and inserts.
Why this is the correct approach

While the question mentions queries, BigQuery performance is affected not only by SELECT queries but also by how data is written and updated.

✅ Why batching updates and inserts is correct

BigQuery is append-optimized

Row-by-row inserts and updates are expensive

Batching:

Reduces job overhead

Improves execution efficiency

Is a documented BigQuery optimization best practice

Especially relevant when tables are not partitioned or clustered

Writing changes in batches keeps job overhead low even without partitioning or clustering, as the sketch below illustrates.
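
A hedged illustration (table names and values are hypothetical) of the difference:

-- Anti-pattern: one DML job per change
INSERT INTO retail.inventory (sku, quantity) VALUES ('A-100', 5);
INSERT INTO retail.inventory (sku, quantity) VALUES ('A-101', 2);

-- Better: a single batched insert
INSERT INTO retail.inventory (sku, quantity)
VALUES ('A-100', 5), ('A-101', 2), ('A-102', 9);

-- Or apply a batch of accumulated changes in one MERGE from a staging table
MERGE retail.inventory AS t
USING retail.inventory_staging AS s
ON t.sku = s.sku
WHEN MATCHED THEN
  UPDATE SET quantity = s.quantity
WHEN NOT MATCHED THEN
  INSERT (sku, quantity) VALUES (s.sku, s.quantity);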

Why the other options are incorrect

❌ Filter data as late as possible

Opposite of best practice — filtering should be done as early as possible.

❌ Use the LIMIT clause

LIMIT restricts the rows returned, but does not reduce the bytes scanned or billed

Not considered a true performance optimization

❌ Perform self-joins

Increases cost and execution time


7. You repeatedly run the same queries by joining multiple tables. The original tables change about ten times per day. You want an optimized querying approach. Which feature should you use?


- Federated queries
- Materialized views
- Partitions
- Views

✅ Correct Answer: Materialized views
Why materialized views are the best choice

Your situation:

🔁 You repeatedly run the same queries

🔗 Queries join multiple tables

🔄 The source tables change ~10 times per day

⚡ You want an optimized querying approach (better performance, less cost)

👉 Materialized views are designed exactly for this pattern.

What materialized views do

Precompute and store the results of a query

Refresh automatically and incrementally when source data changes

Dramatically reduce query execution time

Lower query cost by avoiding repeated heavy joins

This means:

Analysts query the materialized view instead of recomputing joins

Data stays reasonably fresh without full recomputation
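
A hedged sketch of such a materialized view (dataset, tables, and columns are assumptions; BigQuery materialized views support only a subset of SQL, such as inner joins with aggregation, so the real query may need adjusting):

CREATE MATERIALIZED VIEW sales.order_revenue_mv AS
SELECT
  o.order_date,
  o.customer_id,
  SUM(i.quantity * i.price) AS revenue
FROM sales.orders AS o
JOIN sales.order_items AS i
  ON o.order_id = i.order_id
GROUP BY o.order_date, o.customer_id;

Queries against the view read precomputed results, and BigQuery keeps them incrementally up to date as the base tables change.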

Why the other options are not correct

❌ Federated queries

Slower (data accessed externally)

Not optimized for repeated analytical joins

❌ Partitions

Help reduce scanned data

Do not eliminate repeated joins or recomputation

❌ Views

Logical only

Query is recomputed every time

No performance improvement for repeated complex queries

Exam rule of thumb 🧠

Repeated complex queries + frequently changing data → Materialized views


8. You have a complex set of data that comes from multiple sources. The analysts in your team need to analyze the data, visualize it, and publish reports to internal and external stakeholders. You need to make it easier for the analysts to work with the data by abstracting the multiple data sources. What tool do you recommend?


- Connected Sheets
- Looker
- Looker Studio
- D3.js library

✅ Correct Answer: Looker
Why Looker is the right recommendation

Your requirements are:

📊 Complex data from multiple sources

🧠 Analysts need to analyze the data

📈 Visualize insights

📰 Publish reports to internal and external stakeholders

🧱 Abstract underlying data complexity so analysts don’t deal with raw joins and source differences

👉 Looker is specifically built to model and abstract data complexity using a semantic layer.

What Looker provides (key differentiator)

LookML semantic layer

Central definition of metrics, dimensions, joins

One source of truth for business logic

Analysts work with business-friendly fields, not raw SQL

Supports:

Multiple data sources

Governed access

Reusable metrics

Enables:

Dashboards

Embedded analytics

Sharing with internal & external users

This directly satisfies the requirement to abstract multiple data sources.

Why the other options are not suitable

❌ Connected Sheets

Great for spreadsheet analysis

❌ No semantic layer

❌ Not designed for complex multi-source abstraction

❌ Looker Studio

Visualization-focused

❌ Lacks a governed semantic modeling layer

Logic often duplicated per report

❌ D3.js library

Low-level visualization library

Requires heavy engineering effort

❌ Not an analytics platform

Exam rule of thumb 🧠

Abstract complex data + governed metrics + enterprise BI → Looker


9. Your company uses Google Workspace and your leadership team is familiar with its business apps and collaboration tools. They want a cost-effective solution that uses their existing knowledge to evaluate, analyze, filter, and visualize data that is stored in BigQuery. What should you do to create a solution for the leadership team?


- Configure Looker Studio.
- Configure Connected Sheets.
- Create models in Looker.
- Configure Tableau.

✅ Correct Answer: Configure Connected Sheets
Why Connected Sheets is the best solution

Your requirements are very specific:

🧑‍💼 Leadership team, not data engineers

🧠 Already familiar with Google Workspace

💰 Cost-effective

📊 Need to evaluate, analyze, filter, and visualize data

🗄 Data is stored in BigQuery

👉 Connected Sheets is purpose-built for exactly this scenario.

What Connected Sheets provides

Direct connection between Google Sheets and BigQuery

Lets users:

Filter

Sort

Pivot

Chart

Uses the Google Sheets UI that leadership already knows

No SQL required

No data duplication

Very low learning curve

Highly cost-effective (no extra BI licensing)

Leadership can work with billions of rows from BigQuery as if they were working in Sheets.

Why the other options are not ideal

❌ Looker Studio

Good for dashboards

Less interactive for ad-hoc analysis compared to Sheets

❌ Looker

Powerful but heavier

Requires modeling (LookML)

More suitable for analysts and enterprise BI teams

❌ Tableau

Additional licensing cost

Not aligned with Google Workspace familiarity

Exam rule of thumb 🧠

Google Workspace users + BigQuery data + cost-effective analysis → Connected Sheets


10. Your business has collected industry-relevant data over many years. The processed data is useful for your partners and they are willing to pay for its usage. You need to ensure proper access control over the data. What should you do?

- Host the data on Analytics Hub.
- Host the data on Cloud SQL.
- Export the data to zip files and share it through Cloud Storage.
- Export the data to persistent disks and share it through an FTP endpoint.

✅ Correct Answer: Host the data on Analytics Hub
Why Analytics Hub is the right solution

Your requirements are:

📊 Processed, industry-relevant data

🤝 Shared with partners

💰 Partners are willing to pay

🔐 Strong access control

☁️ Google Cloud–native, scalable solution

👉 Analytics Hub is purpose-built for secure data sharing and monetization on Google Cloud.

What Analytics Hub provides

Secure data sharing without copying data

Fine-grained access control

Supports internal and external partners

Central marketplace-style model:

Data providers publish datasets

Consumers subscribe with controlled access

Works natively with BigQuery

Easy governance, auditing, and revocation

This is exactly how Google recommends sharing commercial datasets.

Why the other options are incorrect

❌ Cloud SQL

Not designed for analytics or data sharing at scale

Poor for large partner access

❌ Zip files via Cloud Storage

No fine-grained governance

Data duplication

Manual and insecure for paid access

❌ Persistent disks + FTP

Operationally complex

Security risks

Not scalable or cloud-native

Exam rule of thumb 🧠

Paid or partner data sharing with governance → Analytics Hub


11. You need to share inventory data from Cymbal Retail with a partner company that uses BigQuery to store and analyze its data. What tool can you use to securely and efficiently share the data?

- Cloud Storage
- Analytics Hub
- Data Catalog
- Cloud Data Loss Prevention (Cloud DLP)

✅ Correct Answer: Analytics Hub
Why Analytics Hub is the best choice

Your requirements are:

📦 Share inventory data

🤝 With a partner company

📊 Partner already uses BigQuery

🔐 Needs to be secure

⚡ Needs to be efficient (no copying, no syncing)

👉 Analytics Hub is purpose-built for secure BigQuery-to-BigQuery data sharing.

What Analytics Hub enables

Share BigQuery datasets without copying data

Fine-grained access control

Easy onboarding/offboarding of partners

Near real-time access to the latest data

Auditing and governance built in

The partner queries your data as if it were their own, while you retain full control.

Why the other options are incorrect

❌ Cloud Storage

Requires exporting data

Loses BigQuery-native sharing and governance

❌ Data Catalog

Metadata discovery tool

Does not share data

❌ Cloud DLP

Used for detecting/redacting sensitive data

Not for data sharing

Exam rule of thumb 🧠

BigQuery → BigQuery partner data sharing → Analytics Hub


12. Cymbal Retail has a team of ML engineers that builds and maintains machine learning models. As a Professional Data Engineer, how will you support this team?

- Process and prepare existing data to enable feature engineering.
- Finalize the type of machine learning model to use.
- Keep on improving the machine learning model after initial deployment.
- Identify what type of data is required to build ML models.

✅ Correct Answer: Process and prepare existing data to enable feature engineering.
Why this is the correct choice

As a Professional Data Engineer, your primary responsibility in supporting ML engineers is to ensure that high-quality, reliable, and well-structured data is available for modeling.

👉 Feature engineering depends heavily on data pipelines, which are owned and maintained by data engineers.

How a Data Engineer supports ML teams

Ingest data from multiple sources

Clean, validate, and transform raw data

Build scalable data pipelines (batch/streaming)

Prepare datasets and features in formats suitable for ML

Ensure data quality, freshness, and availability

This directly enables ML engineers to focus on modeling, not data wrangling.

Why the other options are incorrect

❌ Finalize the type of machine learning model to use

This is the responsibility of ML engineers or data scientists

❌ Keep on improving the machine learning model after deployment

Model tuning, retraining, and evaluation are ML engineering tasks

❌ Identify what type of data is required to build ML models

This is typically a collaborative design decision, led by ML engineers and business stakeholders

Exam rule of thumb 🧠

Data Engineer → Data pipelines & feature readiness
ML Engineer → Model design, training, tuning