GCP PROFESSIONAL DATA ENGINEER CERTIFICATION Questions P-9

1. To comply with data privacy regulations, Cymbal E-commerce uses Sensitive Data Protection to manage PII. What are the three core functions of this process?

- Ingestion, Transformation, and Loading.
- Auditing, Alerting, and Reporting.
- Discovery, Classification, and Protection.
- Authentication, Authorization, and Accounting.

✅ Correct Answer

Discovery, Classification, and Protection.

🧠 Why this is the exam-correct answer

Sensitive Data Protection (formerly Cloud DLP) is designed to help organizations like Cymbal E-commerce comply with data privacy regulations by managing PII and sensitive data across their data estate.

Its three core functions are:

Discovery

- Scans data across BigQuery, Cloud Storage, databases, and other repositories.
- Identifies where sensitive data exists.

Classification

- Detects and labels sensitive elements such as names, emails, phone numbers, credit card numbers, and government IDs.
- Uses predefined and custom infoTypes.

Protection

- Applies controls such as masking, tokenization, redaction, and encryption.
- Ensures sensitive data is protected before analysis or sharing.

✔ Exam keywords matched:

PII · privacy regulations · detect · classify · protect

These keywords are directly associated with Sensitive Data Protection in the PDE exam.

❌ Why the other options are wrong (common PDE traps)

Ingestion, Transformation, Loading
❌ That’s ETL, not data privacy.

Auditing, Alerting, Reporting
❌ Security monitoring concepts, not PII management.

Authentication, Authorization, Accounting
❌ That’s IAM (AAA), not data protection.


2. While BigQuery ML is excellent for many use cases, Cymbal's data scientists need to build a complex, custom product recommendation engine. Which Google Cloud service provides an end-to-end platform for this advanced task and integrates seamlessly with BigQuery?

- Dataplex.
- Looker.
- Vertex AI.
- Sensitive Data Protection.

✅ Correct Answer

Vertex AI

🧠 Why this is the exam-correct answer

For building a complex, custom product recommendation engine, Vertex AI is the end-to-end machine learning platform on Google Cloud that goes far beyond what BigQuery ML is designed to handle.

Vertex AI provides:

- Custom model development (TensorFlow, PyTorch, XGBoost, custom containers)
- An end-to-end ML lifecycle:
  - Data preparation
  - Training (custom and AutoML)
  - Hyperparameter tuning
  - Model evaluation
  - Deployment and online prediction
- Seamless integration with BigQuery (one hand-off pattern is sketched below):
  - Train models directly from BigQuery tables
  - Use BigQuery as a feature source
  - Write predictions back to BigQuery
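One common hand-off pattern is to stage training data from BigQuery into Cloud Storage for a Vertex AI custom training job (Vertex AI managed datasets can also read BigQuery tables directly). The sketch below is illustrative only; the bucket, dataset, and column names are hypothetical.

```sql
-- Illustrative hand-off: stage interaction data from BigQuery to Cloud Storage
-- so a Vertex AI custom training job can consume it.
-- Bucket, dataset, and column names are hypothetical.
EXPORT DATA
  OPTIONS (
    uri = 'gs://cymbal-ml-staging/recs/train-*.csv',
    format = 'CSV',
    overwrite = true,
    header = true
  )
AS (
  SELECT customer_id, product_id, rating, event_ts
  FROM `cymbal-prod.silver.product_interactions`
);
```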

✔ Exam keywords matched:

complex · custom ML · end-to-end platform · recommendation engine · BigQuery integration

These keywords strongly and uniquely map to Vertex AI.

❌ Why the other options are wrong (common PDE traps)
Dataplex

Data governance and metadata management

❌ Not an ML development platform

Looker

BI and visualization

❌ Consumes ML outputs, does not build models

Sensitive Data Protection

PII discovery and masking

❌ Security/compliance service, not ML


3. When setting up a Sensitive Data Protection job to de-identify a table for the analytics team, an engineer wants to replace email addresses with a non-reversible token that still allows for counting unique customers. Which transformation should they apply?

- Redaction, to replace emails with a placeholder like [SENSITIVE_DATA].
- Anonymization, to remove the email column entirely.
- Masking, to show only the first part of the email address.
- Tokenization with a cryptographic hash, to maintain referential integrity.

✅ Correct Answer

Tokenization with a cryptographic hash, to maintain referential integrity.

🧠 Why this is the exam-correct answer

Sensitive Data Protection (formerly Cloud DLP) supports multiple de-identification techniques. The requirement here has three key constraints:

- Non-reversible → original emails must not be recoverable
- Consistent replacement → the same email produces the same token every time
- Analytics-friendly → ability to count unique customers

Tokenization using a cryptographic hash (often with a salt) satisfies all three:

- Produces a deterministic token
- Is non-reversible (one-way)
- Preserves referential integrity, enabling COUNT(DISTINCT customer_token) along with joins and aggregations without exposing PII, as sketched below
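In Sensitive Data Protection itself this is configured as a cryptographic-hash de-identification transformation on the EMAIL_ADDRESS infoType. The SQL below is only a minimal sketch of the same idea, using BigQuery's SHA256 function on a hypothetical table, to show why a deterministic one-way token still supports unique counts.

```sql
-- Minimal sketch (not the DLP transformation itself): a keyed SHA-256 token.
-- Table name, column names, and the secret value are hypothetical.
WITH tokenized AS (
  SELECT
    order_id,
    TO_HEX(SHA256(CONCAT('my-secret-value', LOWER(email)))) AS customer_token
  FROM `cymbal-prod.silver.orders`
)
SELECT COUNT(DISTINCT customer_token) AS unique_customers
FROM tokenized;
```

The same email always yields the same token, so distinct counts and joins still work, but the token cannot be reversed to recover the address.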

✔ Exam keywords matched:

non-reversible · count unique · referential integrity

❌ Why the other options are wrong (common PDE traps)
Redaction

Replaces all values with the same placeholder

❌ Loses uniqueness → cannot count unique customers

Anonymization (column removal)

Completely removes the field

❌ No way to distinguish customers

Masking

Partially exposes the value (e.g., jo***@gmail.com)

❌ Still reveals PII

❌ Not suitable for strong privacy guarantees


4. A marketing analyst at Cymbal E-commerce needs access to a customer table but should not be able to view the customers' contact information. Which security control is most effective for protecting this Personally Identifiable Information (PII)?

- Column-level security.
- Dynamic data masking.
- Row-level security.
- IAM bucket-level permissions.

✅ Correct Answer

Column-level security.

🧠 Why this is the exam-correct answer

In BigQuery, column-level security (CLS) is the most effective and precise control when a user:

- Needs access to the table
- Must not see specific sensitive columns (e.g., email, phone, address)
- Should still be able to analyze non-PII columns (e.g., customer_id, region, purchase totals)

For Cymbal’s marketing analyst:

- Grant access to the table
- Restrict visibility on the PII columns only (via policy tags)
- No data duplication or masking required (see the query sketch below)
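BigQuery enforces column-level security through policy tags attached to the column schema (configured in the console, bq tool, or Terraform rather than in SQL). Once a tag is in place, an analyst without the Fine-Grained Reader role cannot select that column. The sketch below uses hypothetical table and column names to show how the analyst queries around the protected columns.

```sql
-- Hypothetical table `cymbal-prod.gold.customers` with policy tags on the
-- email and phone_number columns. SELECT * would fail for the analyst with an
-- access-denied error, so the protected columns are excluded explicitly:
SELECT * EXCEPT (email, phone_number)
FROM `cymbal-prod.gold.customers`
WHERE region = 'EMEA';
```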

✔ Exam keywords matched:

access to table · hide specific columns · PII protection

This is exactly what column-level security is designed for.

❌ Why the other options are wrong (common PDE traps)
Dynamic data masking

Masks values (e.g., ****@gmail.com)

User can still see the column exists

Best when partial visibility is acceptable

❌ Less strict than CLS

Row-level security

Restricts which rows a user can see

Used for scenarios like “only see your region”

❌ Does not protect specific columns

IAM bucket-level permissions

Applies to Cloud Storage, not BigQuery tables

Too coarse-grained


5. Which SQL statement in BigQuery ML is used to begin the process of training a new model?

- ML.EVALUATE
- CREATE MODEL
- ML.PREDICT
- SELECT MODEL

✅ Correct Answer

CREATE MODEL

🧠 Why this is the exam-correct answer

In BigQuery ML, the process of training a new machine learning model always starts with the CREATE MODEL statement. This statement:

- Defines the model type (e.g., linear regression, logistic regression, boosted trees)
- Specifies training options
- Executes the training job using data from a SELECT query (see the sketch below)
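A minimal sketch of what this looks like in practice; the dataset, table, and column names are hypothetical.

```sql
-- Minimal sketch: train a logistic regression model in BigQuery ML.
-- Dataset, table, and column names are hypothetical.
CREATE OR REPLACE MODEL `cymbal_ml.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT
  tenure_months,
  orders_last_90d,
  avg_order_value,
  churned
FROM `cymbal-prod.gold.customer_features`;
```

Once the model exists, ML.EVALUATE and ML.PREDICT operate on it, which is why those statements are post-training steps rather than the start of training.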

✔ Exam keyword matched:

begin the process of training a new model

Only CREATE MODEL initiates training.

❌ Why the other options are wrong (common PDE traps)

ML.EVALUATE
❌ Used after training to assess model performance

ML.PREDICT
❌ Used after training to generate predictions

SELECT MODEL
❌ Not a valid BigQuery ML statement


6. In the Medallion Architecture, which zone contains the final, highly refined, and aggregated data optimized for analytics and reporting?

- The Gold Zone.
- The Bronze Zone.
- The Raw Zone.
- The Silver Zone.

✅ Correct Answer

The Gold Zone.

🧠 Why this is the exam-correct answer

In the Medallion Architecture (commonly used in lakehouse designs), data flows through progressively refined layers:

Bronze Zone (Raw)

- Ingested data in its original format
- Minimal or no transformations

Silver Zone (Cleansed/Conformed)

- Cleaned, deduplicated, standardized data
- Business rules applied

Gold Zone (Curated)

- Highly refined, aggregated, business-ready data
- Optimized for analytics, BI, and reporting
- Used directly by dashboards and analysts (a Silver-to-Gold example follows)
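To make the Gold Zone concrete, the sketch below promotes cleansed Silver data into a business-ready aggregate; the dataset, table, and column names are hypothetical.

```sql
-- Illustrative Silver-to-Gold promotion: a daily revenue rollup that dashboards
-- can query directly. All dataset, table, and column names are hypothetical.
CREATE OR REPLACE TABLE `cymbal-prod.gold.daily_sales_by_region` AS
SELECT
  DATE(order_ts) AS order_date,
  region,
  COUNT(DISTINCT order_id) AS orders,
  SUM(order_total) AS revenue
FROM `cymbal-prod.silver.orders`
GROUP BY order_date, region;
```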

✔ Exam keywords matched:

final · highly refined · aggregated · analytics and reporting

These words uniquely identify the Gold Zone.

❌ Why the other options are wrong (exam traps)

Bronze Zone / Raw Zone
❌ Raw, unprocessed ingestion layer

Silver Zone
❌ Cleaned and conformed, but not final aggregates


7. What is the primary function of Dataplex in Cymbal E-commerce's data lakehouse?

- To automatically de-identify all incoming sensitive data streams.
- To execute machine learning models directly on raw data.
- To enforce row-level security on Cloud Storage buckets.
- To act as a universal catalog for all data assets across BigQuery, Cloud Storage, and BigLake.

✅ Correct Answer

To act as a universal catalog for all data assets across BigQuery, Cloud Storage, and BigLake.

🧠 Why this is the exam-correct answer

Dataplex is Google Cloud’s data governance and metadata management service for lakehouse architectures.

In Cymbal E-commerce’s data lakehouse, Dataplex’s primary function is to:

- Provide a unified, centralized catalog of data assets
- Automatically discover and organize data across BigQuery, Cloud Storage, and BigLake
- Manage technical metadata, business metadata, and data lineage
- Enable consistent governance across the lakehouse

✔ Exam keywords matched:

universal catalog · data assets · lakehouse governance · BigQuery + Cloud Storage

These keywords map directly and uniquely to Dataplex.

❌ Why the other options are wrong (common PDE traps)

Automatically de-identify sensitive data
❌ That is handled by Sensitive Data Protection (DLP)

Execute machine learning models
❌ That is the role of Vertex AI

Enforce row-level security on Cloud Storage buckets
❌ RLS is applied in BigQuery, not directly on buckets


8. What is the primary advantage of using BigQuery ML for a task like predicting customer churn at Cymbal E-commerce?

- It is only capable of building deep learning models using TensorFlow.
- It automatically cleans and prepares all raw data from the bronze zone without user input.
- It requires moving data to a separate, specialized machine learning environment for training.
- It allows data analysts to build and deploy machine learning models using simple SQL queries.

✅ Correct Answer

It allows data analysts to build and deploy machine learning models using simple SQL queries.

🧠 Why this is the exam-correct answer

BigQuery ML is designed to democratize machine learning by enabling data analysts to train, evaluate, and use ML models directly in BigQuery using SQL.

For a task like predicting customer churn at Cymbal E-commerce, BigQuery ML:

- Eliminates the need for a separate ML environment
- Allows analysts to stay within SQL
- Trains models where the data already lives
- Reduces operational and development overhead (see the sketch below)
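As a sketch of the churn workflow (reusing the hypothetical model from question 5), an analyst can score customers and persist the results without leaving BigQuery; all names are illustrative.

```sql
-- Score current customers with the hypothetical churn model and write the
-- predictions back to a table, entirely in SQL. Names are illustrative.
CREATE OR REPLACE TABLE `cymbal-prod.gold.churn_scores` AS
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(
  MODEL `cymbal_ml.churn_model`,
  (
    SELECT customer_id, tenure_months, orders_last_90d, avg_order_value
    FROM `cymbal-prod.gold.customer_features`
  )
);
```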

✔ Exam keywords matched:

data analysts · simple SQL · build and deploy models

These phrases uniquely identify BigQuery ML in PDE exams.

❌ Why the other options are wrong (common PDE traps)

Only capable of deep learning models
❌ BigQuery ML supports linear/logistic regression, boosted trees, k-means, ARIMA, matrix factorization, and more

Automatically cleans all raw bronze data
❌ Data preparation is still the user’s responsibility

Requires moving data to a separate ML environment
❌ That describes traditional ML platforms, not BigQuery ML
