1. To comply with data privacy regulations, Cymbal E-commerce uses Sensitive Data Protection to manage PII. What are the three core functions of this process?
- Ingestion, Transformation, and Loading.
- Auditing, Alerting, and Reporting.
- Discovery, Classification, and Protection.
- Authentication, Authorization, and Accounting.
✅ Correct Answer
Discovery, Classification, and Protection.
🧠 Why this is the exam-correct answer
Sensitive Data Protection (formerly Cloud DLP) is designed to help organizations like Cymbal E-commerce comply with data privacy regulations by managing PII and sensitive data across their data estate.
Its three core functions are:
- Discovery
  - Scans data across BigQuery, Cloud Storage, databases, and more
  - Identifies where sensitive data exists
- Classification
  - Detects and labels sensitive elements such as names, email addresses, phone numbers, credit card numbers, and government IDs
  - Uses predefined and custom infoTypes
- Protection
  - Applies controls such as masking, tokenization, redaction, and encryption
  - Ensures sensitive data is protected before analysis or sharing
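For concreteness, here is a minimal sketch of the Discovery and Classification steps using the google-cloud-dlp Python client to scan a text snippet for predefined infoTypes; the project ID and sample text are placeholders:

```python
# Minimal sketch: classify a text snippet with predefined infoTypes.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project/locations/global",  # placeholder project
        "inspect_config": {
            # Predefined infoTypes drive the classification step
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "PHONE_NUMBER"},
                {"name": "CREDIT_CARD_NUMBER"},
            ],
            "include_quote": True,
        },
        "item": {"value": "Contact Jane at jane@example.com or 555-0100."},
    }
)

for finding in response.result.findings:
    # Each finding reports the detected infoType, a likelihood, and the match
    print(finding.info_type.name, finding.likelihood, finding.quote)
```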
✔ Exam keywords matched:
PII · privacy regulations · detect · classify · protect
The verbs detect, classify, and protect map directly to Sensitive Data Protection in the PDE exam.
❌ Why the other options are wrong (common PDE traps)
- Ingestion, Transformation, and Loading
  - ❌ That's ETL (data pipeline stages), not data privacy.
- Auditing, Alerting, and Reporting
  - ❌ Those are security monitoring concepts, not PII management.
- Authentication, Authorization, and Accounting
  - ❌ That's the classic AAA access-control model, not data protection.
2. While BigQuery ML is excellent for many use cases, Cymbal's data scientists need to build a complex, custom product recommendation engine. Which Google Cloud service provides an end-to-end platform for this advanced task and integrates seamlessly with BigQuery?
- Dataplex.
- Looker.
- Vertex AI.
- Sensitive Data Protection.
✅ Correct Answer
Vertex AI
🧠 Why this is the exam-correct answer
For building a complex, custom product recommendation engine, Vertex AI is the end-to-end machine learning platform on Google Cloud that goes far beyond what BigQuery ML is designed to handle.
Vertex AI provides:
- Custom model development (TensorFlow, PyTorch, XGBoost, custom containers)
- Support for the end-to-end ML lifecycle:
  - Data preparation
  - Training (custom and AutoML)
  - Hyperparameter tuning
  - Model evaluation
  - Deployment and online prediction
- Seamless integration with BigQuery:
  - Train models directly from BigQuery tables
  - Use BigQuery as a feature source
  - Write predictions back to BigQuery
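The sketch below shows how these pieces fit together with the Vertex AI SDK; the project, table, training script, and container URIs are illustrative placeholders, not a definitive setup:

```python
# Minimal sketch: custom training on Vertex AI, sourced from BigQuery.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Managed dataset backed directly by a BigQuery table
dataset = aiplatform.TabularDataset.create(
    display_name="purchase-history",
    bq_source="bq://my-project.analytics.purchases",
)

# Custom training code (e.g., a TensorFlow or PyTorch recommender);
# train_recommender.py and the container URIs are illustrative
job = aiplatform.CustomTrainingJob(
    display_name="reco-engine-training",
    script_path="train_recommender.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest",
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)

model = job.run(dataset=dataset, replica_count=1, machine_type="n1-standard-8")

# Deploy the trained model for online recommendations
endpoint = model.deploy(machine_type="n1-standard-4")
```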
✔ Exam keywords matched:
complex · custom ML · end-to-end platform · recommendation engine · BigQuery integration
These keywords strongly and uniquely map to Vertex AI.
❌ Why the other options are wrong (common PDE traps)
- Dataplex
  - Data governance and metadata management
  - ❌ Not an ML development platform
- Looker
  - BI and visualization
  - ❌ Consumes ML outputs; does not build models
- Sensitive Data Protection
  - PII discovery and masking
  - ❌ A security/compliance service, not ML
3. When setting up a Sensitive Data Protection job to de-identify a table for the analytics team, an engineer wants to replace email addresses with a non-reversible token that still allows for counting unique customers. Which transformation should they apply?
- Redaction, to replace emails with a placeholder like [SENSITIVE_DATA].
- Anonymization, to remove the email column entirely.
- Masking, to show only the first part of the email address.
- Tokenization with a cryptographic hash, to maintain referential integrity.
✅ Correct Answer
Tokenization with a cryptographic hash, to maintain referential integrity.
🧠 Why this is the exam-correct answer
Sensitive Data Protection (formerly Cloud DLP) supports multiple de-identification techniques. The requirement here has three key constraints:
- Non-reversible: the original emails must not be recoverable
- Consistent replacement: the same email always yields the same token
- Analytics-friendly: unique customers can still be counted

Tokenization using a keyed cryptographic hash (Sensitive Data Protection's CryptoHashConfig, which applies an HMAC under a secret key) satisfies all three:
- Produces a deterministic token
- Is non-reversible (one-way)
- Preserves referential integrity, enabling:
  - COUNT(DISTINCT customer_token)
  - Joins and aggregations without exposing PII
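A minimal sketch of this transformation with the google-cloud-dlp client follows; the project ID, key, and input value are placeholders, and in practice the key would be persisted (e.g., KMS-wrapped) so tokens stay consistent across jobs:

```python
# Minimal sketch: replace emails with a deterministic, one-way hash token.
import os
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

# Demo key only: persist and protect the real key (e.g., via Cloud KMS),
# because the same key must be reused for tokens to stay consistent.
key_bytes = os.urandom(32)

response = client.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "info_types": [{"name": "EMAIL_ADDRESS"}],
                        "primitive_transformation": {
                            # Keyed hash: same email -> same token, non-reversible
                            "crypto_hash_config": {
                                "crypto_key": {"unwrapped": {"key": key_bytes}}
                            }
                        },
                    }
                ]
            }
        },
        "item": {"value": "Order placed by jane@example.com"},
    }
)
print(response.item.value)  # email replaced by a stable hash token
```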
✔ Exam keywords matched:
non-reversible · count unique · referential integrity
❌ Why the other options are wrong (common PDE traps)
- Redaction
  - Replaces every value with the same placeholder
  - ❌ Loses uniqueness, so unique customers cannot be counted
- Anonymization (column removal)
  - Removes the field entirely
  - ❌ No way to distinguish customers
- Masking
  - Partially exposes the value (e.g., jo***@gmail.com)
  - ❌ Still reveals PII and does not provide a strong privacy guarantee
4. A marketing analyst at Cymbal E-commerce needs access to a customer table but should not be able to view the customers' contact information. Which security control is most effective for protecting this Personally Identifiable Information (PII)?
- Column-level security.
- Dynamic data masking.
- Row-level security.
- IAM bucket-level permissions.
✅ Correct Answer
Column-level security.
🧠 Why this is the exam-correct answer
In BigQuery, column-level security (CLS) is the most effective and precise control when a user:
- Needs access to the table
- Must not see specific sensitive columns (e.g., email, phone, address)
- Should still be able to analyze non-PII columns (e.g., customer_id, region, purchase totals)

For Cymbal's marketing analyst, CLS means:
- Grant access to the table
- Restrict visibility of the PII columns only
- No data duplication or masking required
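As a sketch, column-level security in BigQuery is enforced by attaching policy tags to sensitive columns; the resource names below (project, dataset, taxonomy, and policy tag IDs) are placeholders:

```python
# Minimal sketch: tag the email column with a policy tag so that only
# principals with Fine-Grained Reader on that tag can read it.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.crm.customers")

pii_tag = bigquery.PolicyTagList(
    names=["projects/my-project/locations/us/taxonomies/123/policyTags/456"]
)

# Rebuild the schema, tagging only the sensitive column
new_schema = []
for field in table.schema:
    if field.name == "email":
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode, policy_tags=pii_tag
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```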
✔ Exam keywords matched:
access to table · hide specific columns · PII protection
This is exactly what column-level security is designed for.
❌ Why the other options are wrong (common PDE traps)
- Dynamic data masking
  - Masks values (e.g., ****@gmail.com); the user can still see that the column exists
  - Best when partial visibility is acceptable
  - ❌ Less strict than CLS
- Row-level security
  - Restricts which rows a user can see (e.g., "only see your region")
  - ❌ Does not protect specific columns
- IAM bucket-level permissions
  - Apply to Cloud Storage, not BigQuery tables
  - ❌ Too coarse-grained
5. Which SQL statement in BigQuery ML is used to begin the process of training a new model?
- ML.EVALUATE
- CREATE MODEL
- ML.PREDICT
- SELECT MODEL
✅ Correct Answer
CREATE MODEL
🧠 Why this is the exam-correct answer
In BigQuery ML, training a new machine learning model always starts with the SQL statement CREATE MODEL. This statement:
- Defines the model type (e.g., linear regression, logistic regression, boosted trees)
- Specifies training options
- Runs the training job on the data returned by a SELECT query
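For illustration, here is a minimal sketch that runs a CREATE MODEL statement through the BigQuery Python client; the project, dataset, and column names are placeholders:

```python
# Minimal sketch: train a BigQuery ML model with CREATE MODEL.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
CREATE OR REPLACE MODEL `my-project.analytics.order_value_model`
OPTIONS (
  model_type = 'LINEAR_REG',          -- the model type
  input_label_cols = ['order_total']  -- training options
) AS
SELECT * FROM `my-project.analytics.training_data`  -- the training data
""").result()  # blocks until the training job completes
```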
✔ Exam keyword matched:
begin the process of training a new model
Only CREATE MODEL initiates training.
❌ Why the other options are wrong (common PDE traps)
- ML.EVALUATE
  - ❌ Used after training to assess model performance
- ML.PREDICT
  - ❌ Used after training to generate predictions
- SELECT MODEL
  - ❌ Not a valid BigQuery ML statement
6. In the Medallion Architecture, which zone contains the final, highly refined, and aggregated data optimized for analytics and reporting?
- The Gold Zone.
- The Bronze Zone.
- The Raw Zone.
- The Silver Zone.
✅ Correct Answer
The Gold Zone.
🧠 Why this is the exam-correct answer
In the Medallion Architecture (commonly used in lakehouse designs), data flows through progressively refined layers:
- Bronze Zone (raw)
  - Ingested data in its original format
  - Minimal or no transformations
- Silver Zone (cleansed/conformed)
  - Cleaned, deduplicated, standardized data
  - Business rules applied
- Gold Zone (curated)
  - Highly refined, aggregated, business-ready data
  - Optimized for analytics, BI, and reporting
  - Used directly by dashboards and analysts
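As a small illustration of the Gold Zone, the sketch below materializes a business-ready aggregate from cleansed Silver-zone data in BigQuery; all dataset and table names are placeholders:

```python
# Minimal sketch: build a Gold-zone aggregate from Silver-zone data.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
CREATE OR REPLACE TABLE `my-project.gold.daily_revenue` AS
SELECT
  order_date,
  region,
  SUM(order_total) AS revenue,            -- aggregated, business-ready metric
  COUNT(DISTINCT customer_id) AS buyers
FROM `my-project.silver.orders`           -- cleansed, deduplicated input
GROUP BY order_date, region
""").result()
```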
✔ Exam keywords matched:
final · highly refined · aggregated · analytics and reporting
These words uniquely identify the Gold Zone.
❌ Why the other options are wrong (exam traps)
- Bronze Zone / Raw Zone
  - ❌ The raw, unprocessed ingestion layer
- Silver Zone
  - ❌ Cleaned and conformed, but not the final aggregates
7. What is the primary function of Dataplex in Cymbal E-commerce's data lakehouse?
- To automatically de-identify all incoming sensitive data streams.
- To execute machine learning models directly on raw data.
- To enforce row-level security on Cloud Storage buckets.
- To act as a universal catalog for all data assets across BigQuery, Cloud Storage, and BigLake.
✅ Correct Answer
To act as a universal catalog for all data assets across BigQuery, Cloud Storage, and BigLake.
🧠 Why this is the exam-correct answer
Dataplex is Google Cloud’s data governance and metadata management service for lakehouse architectures.
In Cymbal E-commerce's data lakehouse, Dataplex's primary function is to:
- Provide a unified, centralized catalog of data assets
- Automatically discover and organize data across BigQuery, Cloud Storage, and BigLake
- Manage technical metadata, business metadata, and data lineage
- Enable consistent governance across the lakehouse
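The sketch below shows one way this looks in practice, attaching a Cloud Storage bucket to a Dataplex zone so its data is discovered and cataloged; all resource names are placeholders, and the client and method names are my best reading of the google-cloud-dataplex (dataplex_v1) API:

```python
# Minimal sketch: register a Cloud Storage bucket as a Dataplex asset.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
        name="projects/my-project/buckets/raw-events",  # placeholder bucket
    )
)

operation = client.create_asset(
    parent="projects/my-project/locations/us-central1/lakes/cymbal-lake/zones/raw-zone",
    asset_id="raw-events-asset",
    asset=asset,
)
print(operation.result().name)  # the asset is now discoverable in the catalog
```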
✔ Exam keywords matched:
universal catalog · data assets · lakehouse governance · BigQuery + Cloud Storage
These keywords map directly and uniquely to Dataplex.
❌ Why the other options are wrong (common PDE traps)
- Automatically de-identify sensitive data
  - ❌ That is handled by Sensitive Data Protection (Cloud DLP)
- Execute machine learning models
  - ❌ That is the role of Vertex AI
- Enforce row-level security on Cloud Storage buckets
  - ❌ Row-level security is applied in BigQuery, not directly on buckets
8. What is the primary advantage of using BigQuery ML for a task like predicting customer churn at Cymbal E-commerce?
- It is only capable of building deep learning models using TensorFlow.
- It automatically cleans and prepares all raw data from the bronze zone without user input.
- It requires moving data to a separate, specialized machine learning environment for training.
- It allows data analysts to build and deploy machine learning models using simple SQL queries.
✅ Correct Answer
It allows data analysts to build and deploy machine learning models using simple SQL queries.
🧠 Why this is the exam-correct answer
BigQuery ML is designed to democratize machine learning by letting data analysts train, evaluate, and use ML models directly in BigQuery using SQL.
For a task like predicting customer churn at Cymbal E-commerce, BigQuery ML:
- Eliminates the need for a separate ML environment
- Lets analysts stay within SQL
- Trains models where the data already lives
- Reduces operational and development overhead
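The sketch below shows this analyst workflow end to end, training a churn classifier and scoring current customers entirely in SQL; the project, dataset, and table names are placeholders:

```python
# Minimal sketch: churn prediction with BigQuery ML, all in SQL.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Train a logistic regression classifier on historical customer features
client.query("""
CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
SELECT * FROM `my-project.analytics.customer_features`
""").result()

# Score current customers with the trained model
rows = client.query("""
SELECT customer_id, predicted_churned
FROM ML.PREDICT(
  MODEL `my-project.analytics.churn_model`,
  TABLE `my-project.analytics.current_customers`)
""").result()

for row in rows:
    print(row.customer_id, row.predicted_churned)
```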
✔ Exam keywords matched:
data analysts · simple SQL · build and deploy models
These phrases uniquely identify BigQuery ML in PDE exams.
❌ Why the other options are wrong (common PDE traps)
- Only capable of deep learning models
  - ❌ BigQuery ML supports linear/logistic regression, boosted trees, k-means, ARIMA, matrix factorization, and more
- Automatically cleans all raw bronze data
  - ❌ Data preparation is still the user's responsibility
- Requires moving data to a separate ML environment
  - ❌ That describes traditional ML workflows, not BigQuery ML