
GCP PROFESSIONAL DATA ENGINEER CERTIFICATION: Diagnostic Questions, Part 1

1. Cymbal Retail has acquired another company in Europe. Data access permissions and policies in this new region differ from those in Cymbal Retail’s headquarters, which is in North America. You need to define a consistent set of policies for projects in each region that follow recommended practices. What should you do?

- Create top-level folders for each region, and assign policies at the folder level.
- Implement a flat hierarchy, and assign policies to each project according to its region.
- Create a new organization for all projects in Europe and assign policies in each organization that comply with regional laws.
- Implement policies at the resource level that comply with regional laws.

The correct answer is:

✅ Create top-level folders for each region, and assign policies at the folder level.
Why this is the best practice

Google Cloud recommends using folders to reflect organizational, geographic, or regulatory boundaries. In your scenario:

North America and Europe have different data access laws and policies

You want consistency, scalability, and centralized management

By creating top-level folders per region, you can:

Apply IAM policies, organization policies, and constraints once at the folder level

Have policies automatically inherited by all projects in that region

Avoid duplicating policy management across individual projects

Easily onboard new projects in a region with the correct policies by default

This is the cleanest and most scalable design.
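
For illustration, here is a minimal Python sketch of that layout using the Resource Manager client. The organization ID, folder name, role, and group are hypothetical; in practice you would also attach your region-specific organization policy constraints at the same folder level.

```python
from google.cloud import resourcemanager_v3

folders_client = resourcemanager_v3.FoldersClient()

# Create a top-level folder for the new region under the organization
# (organization ID is hypothetical).
operation = folders_client.create_folder(
    folder=resourcemanager_v3.Folder(
        parent="organizations/123456789012",
        display_name="europe",
    )
)
europe_folder = operation.result()

# Bind a role once at the folder level; every current and future project
# created under this folder inherits the binding.
policy = folders_client.get_iam_policy(resource=europe_folder.name)
policy.bindings.add(
    role="roles/bigquery.dataViewer",                 # example role
    members=["group:eu-data-analysts@example.com"],   # hypothetical group
)
folders_client.set_iam_policy(
    request={"resource": europe_folder.name, "policy": policy}
)
```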


2. You are using Dataproc to process a large number of CSV files. The storage option you choose needs to be flexible to serve many worker nodes in multiple clusters. These worker nodes will read the data and also write to it for intermediate storage between processing jobs. What is the recommended storage option on Google Cloud?

- Local SSD
- Zonal persistent disks
- Cloud Storage
- Cloud SQL

The correct answer is:

✅ Cloud Storage
Why Cloud Storage is recommended

For Dataproc workloads processing large numbers of CSV files, especially when:

Many worker nodes

Multiple clusters

Concurrent read and write access

Intermediate storage between jobs

👉 Cloud Storage is the best and recommended option.

Key reasons:

Highly scalable & shared across all clusters and nodes

Supports parallel reads/writes

Decoupled from compute (clusters can be created/deleted without data loss)

Ideal for input, output, and intermediate data

Native integration with Dataproc, Spark, Hadoop, and Beam

This matches Google’s best practice:

Use Cloud Storage as the data lake for Dataproc.
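
As a rough illustration, the PySpark sketch below (bucket and paths are hypothetical) reads the shared CSVs from Cloud Storage and writes intermediate output straight back to it; the gs:// paths work because the Cloud Storage connector ships with Dataproc clusters.

```python
# Submitted as a Dataproc PySpark job; workers on any cluster can read the same
# input and share the intermediate output written here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-stage1").getOrCreate()

orders = spark.read.option("header", True).csv("gs://cymbal-data-lake/raw/orders/*.csv")

# Intermediate results go back to Cloud Storage so a later job, possibly on a
# different cluster, can pick them up.
orders.dropDuplicates().write.mode("overwrite").parquet(
    "gs://cymbal-data-lake/intermediate/orders_dedup/"
)
```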

Why the other options are not suitable

❌ Local SSD

Attached to a single VM

Data is lost when the VM stops

Not shareable across nodes or clusters

❌ Zonal persistent disks

Limited to a single VM or cluster

Not designed for multi-cluster shared access

Less scalable for large distributed workloads

❌ Cloud SQL

Relational database

Not suitable for large-scale file-based batch processing

Exam tip 🧠

If you see:

Dataproc

Multiple workers / clusters

Shared, scalable storage

👉 Always think Cloud Storage first.


3. You are running a pipeline in Dataflow that uses a user-supplied DoFn you defined yourself. The code is running slowly, and you want to examine the pipeline code further to get better visibility into why. What should you do?

- Use Cloud Logging
- Use Cloud Profiler
- Use Cloud Monitoring
- Use Cloud Audit Logs

The correct answer is:

✅ Use Cloud Profiler
Why Cloud Profiler is the right choice

You want better visibility into why your user-defined DoFn code is running slowly. That means you need code-level performance analysis, not just metrics or logs.

Cloud Profiler:

Attaches to running Dataflow workers

Shows CPU usage, memory usage, and hot paths

Pinpoints which methods in your DoFn are slow

Helps identify:

Expensive computations

Inefficient loops

Excessive object creation

Serialization overhead

This is exactly what you need when the problem is slow pipeline code, especially inside a user-supplied DoFn.
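
Here is a minimal Beam sketch of how profiling is switched on for a Python Dataflow job; the project, bucket, and DoFn are hypothetical. The dataflow_service_options value enable_google_cloud_profiler asks the service to attach Cloud Profiler to the workers, after which flame graphs for your DoFn appear in the Profiler UI.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical
    # Ask the Dataflow service to attach Cloud Profiler to the workers.
    dataflow_service_options=["enable_google_cloud_profiler"],
)

class ParseOrder(beam.DoFn):
    """The user-supplied DoFn whose hot paths Profiler will surface."""
    def process(self, line):
        yield line.split(",")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/orders/*.csv")
        | "Parse" >> beam.ParDo(ParseOrder())
    )
```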

Why the other options are not correct

❌ Cloud Logging
Logs events and messages, but does not show performance bottlenecks.

❌ Cloud Monitoring
Shows metrics (CPU, memory, throughput), but not which lines of code are slow.

❌ Cloud Audit Logs
Used for administrative and security auditing, not performance analysis.

Quick exam tip 🧠

If the question mentions:

“pipeline code is slow”

“user-defined DoFn”

“examine code behavior”

👉 Cloud Profiler is the answer.


4. You are managing the data for Cymbal Retail, which consists of multiple teams including retail, sales, marketing, and legal. These teams consume data from multiple producers including point-of-sale systems, industry data, orders, and more. Currently, teams that consume data have to repeatedly ask the teams that produce it to verify the most up-to-date data and to clarify other questions about the data, such as source and ownership. This process is unreliable and time-consuming and often leads to repeated escalations. You need to implement a centralized solution that provides a unified view of the organization's data and improves searchability. What should you do?

- Implement Looker dashboards that provide views of the data that meet each team's requirements.
- Implement a data lake with Cloud Storage, and create buckets for each team such as retail, sales, marketing.
- Implement a data warehouse by using BigQuery, and create datasets for each team such as retail, sales, marketing.
- Implement a data mesh with Dataplex and have producers tag data when created.

The correct answer is:

✅ Implement a data mesh with Dataplex and have producers tag data when created.
Why this is the right solution

Your problem is not storage or dashboards — it’s data discovery, ownership, trust, and governance.

You need:

A centralized view of all data assets

Clear metadata (source, owner, freshness, schema)

Better searchability

Fewer manual escalations between producers and consumers

👉 Dataplex is designed exactly for this.

What Dataplex gives you

A logical data mesh across BigQuery, Cloud Storage, etc.

Centralized metadata management

Business and technical tags (owner, domain, SLA, sensitivity)

Data discovery and search across teams

Clear producer/consumer responsibility model

Producers tag data at creation → consumers can self-serve without asking questions repeatedly.
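
As one possible producer-side flow, the sketch below uses the Data Catalog API, which Dataplex builds its search and metadata on; the project, location, dataset, table, and tag values are all hypothetical.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
project, location = "my-project", "us-central1"

# One-time setup: a tag template describing ownership metadata.
template = datacatalog_v1.TagTemplate()
template.display_name = "Data ownership"
template.fields["owner"] = datacatalog_v1.TagTemplateField()
template.fields["owner"].display_name = "Owning team"
template.fields["owner"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.STRING
)
template = client.create_tag_template(
    parent=f"projects/{project}/locations/{location}",
    tag_template_id="data_ownership",
    tag_template=template,
)

# Producers tag each asset as it is created, e.g. a BigQuery table.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            f"//bigquery.googleapis.com/projects/{project}"
            "/datasets/sales/tables/orders"
        )
    }
)
tag = datacatalog_v1.Tag()
tag.template = template.name
tag.fields["owner"] = datacatalog_v1.TagField()
tag.fields["owner"].string_value = "sales-team"
client.create_tag(parent=entry.name, tag=tag)
```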

Why the other options are insufficient
❌ Looker dashboards

Good for visualization

Does not solve data ownership, lineage, or discovery

Still requires manual clarification

❌ Data lake with Cloud Storage buckets

Only storage

No metadata governance or discovery layer

Consumers still won’t know which data is authoritative

❌ BigQuery datasets per team

Improves organization

But still lacks cross-team discovery, ownership metadata, and governance

Does not scale for many producers/consumers

Key concept (exam + real world)

When the problem is data discovery, trust, ownership, and governance → Dataplex (data mesh).


5. You are migrating on-premises data to a data warehouse on Google Cloud. This data will be made available to business analysts. Local regulations require that customer information including credit card numbers, phone numbers, and email IDs be captured, but not used in analysis. You need to use a reliable, recommended solution to redact the sensitive data. What should you do?

- Create a regular expression to identify and delete patterns that resemble credit card numbers, phone numbers, and email IDs.
- Use the Cloud Data Loss Prevention API (DLP API) to perform date shifting of any entries with credit card numbers, phone numbers, and email IDs.
- Use the Cloud Data Loss Prevention API (DLP API) to identify and redact data that matches infoTypes like credit card numbers, phone numbers, and email IDs.
- Delete all columns with a title similar to "credit card," "phone," and "email."

The correct answer is:

✅ Use the Cloud Data Loss Prevention API (DLP API) to identify and redact data that matches infoTypes like credit card numbers, phone numbers, and email IDs.
Why this is the recommended solution

You have regulatory requirements and need a reliable, Google-recommended way to handle sensitive data before analytics.

Cloud DLP API is purpose-built for this:

Uses predefined infoTypes (e.g. CREDIT_CARD_NUMBER, PHONE_NUMBER, EMAIL_ADDRESS)

Accurately detects sensitive data, even when column names or formats vary

Supports redaction, masking, tokenization, or replacement

Designed for compliance (PCI, GDPR, etc.)

Integrates cleanly with BigQuery, Cloud Storage, Dataflow

This ensures analysts can work with data without exposing sensitive information.
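
A minimal sketch of infoType-based redaction with the DLP client library, assuming a hypothetical project ID and sample text. For full tables, the same inspect and de-identify configuration is typically applied through DLP jobs or Dataflow de-identification templates rather than record by record.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"   # hypothetical project ID

inspect_config = {
    "info_types": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
    ]
}
# Replace each finding with the name of its infoType (redaction by substitution).
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

item = {"value": "Call 555-0100 or email jane@example.com, card 4111-1111-1111-1111"}

response = client.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)   # sensitive values replaced with [PHONE_NUMBER], etc.
```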

Why the other options are incorrect

❌ Regular expressions

Error-prone and hard to maintain

High risk of false positives/negatives

Not recommended for compliance use cases

❌ Date shifting with DLP

Date shifting is for timestamps, not PII like credit cards or emails

❌ Deleting columns by name

Column names are unreliable

Sensitive data may exist in unexpected fields

Leads to data loss instead of safe redaction

Exam tip 🧠

If the question mentions:

PII

Regulations

Redaction / masking

Recommended solution

👉 The answer is almost always Cloud DLP API.


6. Business analysts in your team need to run analysis on data that was loaded into BigQuery. You need to follow recommended practices and grant permissions. What role should you grant the business analysts?

- bigquery.resourceViewer and bigquery.dataViewer
- bigquery.user and bigquery.dataViewer
- storage.objectViewer and bigquery.user
- bigquery.dataOwner

The correct answer is:

✅ bigquery.user and bigquery.dataViewer
Why this is the recommended practice

Business analysts need to:

Run queries

Create temporary tables

View data in existing tables

The least-privilege, recommended IAM combination is:

🔹 roles/bigquery.user

Grants ability to:

Run query jobs

Create datasets and jobs

Use the BigQuery UI and APIs

🔹 roles/bigquery.dataViewer

Grants ability to:

Read data from tables and views

View metadata and schemas

Together, these roles allow analysts to analyze data without modifying it.
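
For example, both roles could be bound once at the project level with the Resource Manager client; the project ID and analyst group below are hypothetical, and dataset-level grants of bigquery.dataViewer are an equally valid, narrower alternative.

```python
from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
resource = "projects/my-project"                    # hypothetical project ID
analysts = "group:business-analysts@example.com"    # hypothetical group

# Read the current project policy and append both analyst roles.
policy = client.get_iam_policy(resource=resource)
for role in ("roles/bigquery.user", "roles/bigquery.dataViewer"):
    policy.bindings.add(role=role, members=[analysts])
client.set_iam_policy(request={"resource": resource, "policy": policy})
```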

Why the other options are incorrect
❌ bigquery.resourceViewer and bigquery.dataViewer

resourceViewer only allows viewing metadata

❌ Cannot run queries or create jobs

❌ storage.objectViewer and bigquery.user

Cloud Storage access is unnecessary

❌ Does not grant permission to read BigQuery table data

❌ bigquery.dataOwner

Too permissive

Allows modifying and deleting data

❌ Violates least-privilege principle

Exam rule of thumb 🧠

Business analysts = bigquery.user + bigquery.dataViewer


7. Cymbal Retail has a team of business analysts who need to fix and enhance a set of large input data files. For example, duplicates need to be removed, erroneous rows should be deleted, and missing data should be added. These steps need to be performed on all of the present files and on any files received in the future in a repeatable, automated process. The business analysts are not adept at programming. What should they do?

- Create a Dataflow pipeline with the data fixes you need.
- Load the data into Dataprep, explore the data, and edit the transformations as needed.
- Create a Dataproc job to perform the data fixes you need.
- Load the data into Google Sheets, explore the data, and fix the data as needed.


The correct answer is:

✅ Load the data into Dataprep, explore the data, and edit the transformations as needed.
Why this is the best solution

Your key requirements are:

✔️ Business analysts (non-programmers)

✔️ Data cleaning and preparation (deduplication, deletion, filling missing values)

✔️ Repeatable and automated for current and future files

✔️ Recommended Google Cloud practice

Dataprep (by Trifacta) is purpose-built for exactly this use case.

What Dataprep provides

No-code / low-code UI for data preparation

Interactive data exploration

Rule-based transformations that are:

Reusable

Automatable

Can be scheduled to run on new incoming files

Runs at scale using Dataflow under the hood (but analysts don’t need to code)

This allows analysts to fix data once and apply the same logic consistently going forward.

Why the other options are not suitable
❌ Create a Dataflow pipeline

Requires programming (Java/Python)

Not suitable for non-technical analysts

❌ Create a Dataproc job

Requires Spark/Hadoop knowledge

Too complex for business analysts

❌ Load data into Google Sheets

Not scalable for large files

Not automated or repeatable

Manual and error-prone

Exam tip 🧠

If you see:

Business analysts

Data cleaning

Repeatable transformations

No coding

👉 The answer is Dataprep


8. Your data and applications reside in multiple geographies on Google Cloud. Some regional laws require you to hold your own keys outside of the cloud provider environment, whereas other laws are less restrictive and allow storing keys with the same provider who stores the data. The management of these keys has increased in complexity, and you need a solution that can centrally manage all your keys. What should you do?

- Store your keys in Cloud Hardware Security Module (Cloud HSM), and retrieve keys from it when required.
- Enable confidential computing for all your virtual machines.
- Store keys in Cloud Key Management Service (Cloud KMS), and reduce the number of days for automatic key rotation.
- Store your keys on a supported external key management partner, and use Cloud External Key Manager (Cloud EKM) to get keys when required.


The correct answer is:

✅ Store your keys on a supported external key management partner, and use Cloud External Key Manager (Cloud EKM) to get keys when required.
Why this is the best solution

Your requirements are very specific:

🔐 Some regions require you to hold encryption keys outside Google Cloud

🌍 Other regions allow keys to be stored with the cloud provider

🧩 Key management has become complex

🎯 You want centralized key management across geographies

👉 Cloud External Key Manager (EKM) is designed exactly for this scenario.

What Cloud EKM provides

Lets you keep encryption keys outside Google Cloud

Keys remain in:

On-prem HSMs, or

Supported third-party key management partners

Google Cloud never stores or controls the keys

Enables centralized management and policy control

Meets strict regulatory and sovereignty requirements

This allows you to:

Use external keys where required

Use Cloud KMS/HSM where allowed

Still manage everything centrally and consistently
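
Concretely, here is a hedged sketch of the internet-based EKM flow with the Cloud KMS client: the key material stays with the external partner, and Cloud KMS only holds a reference to it. The key ring is assumed to already exist, and the project, location, key names, and partner key URI are hypothetical.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_ring = client.key_ring_path("my-project", "europe-west1", "eu-keys")

# Create an EXTERNAL-protection key with no initial version.
crypto_key = client.create_crypto_key(
    request={
        "parent": key_ring,
        "crypto_key_id": "orders-cmek",
        "crypto_key": {
            "purpose": kms.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT,
            "version_template": {
                "protection_level": kms.ProtectionLevel.EXTERNAL,
                "algorithm": (
                    kms.CryptoKeyVersion.CryptoKeyVersionAlgorithm.EXTERNAL_SYMMETRIC_ENCRYPTION
                ),
            },
        },
        "skip_initial_version_creation": True,
    }
)

# Point the first key version at the key held by the external partner.
client.create_crypto_key_version(
    request={
        "parent": crypto_key.name,
        "crypto_key_version": {
            "external_protection_level_options": {
                "external_key_uri": "https://ekm.partner.example/v1/keys/orders-key"
            }
        },
    }
)
```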

Why the other options are incorrect
❌ Cloud HSM

Keys are still inside Google Cloud

Does not meet requirements where keys must be external

❌ Confidential computing

Protects data in use

Has nothing to do with key management

❌ Cloud KMS with shorter rotation

Keys are still stored within Google Cloud

Rotation does not solve external key custody requirements

Exam rule of thumb 🧠

If the question mentions “hold keys outside the cloud provider” → Cloud EKM


9. Cymbal Retail is migrating its private data centers to Google Cloud. Over many years, hundreds of terabytes of data were accumulated. You currently have a 100 Mbps line and you need to transfer this data reliably before commencing operations on Google Cloud in 45 days. What should you do?

- Zip and upload the data to Cloud Storage buckets by using the Google Cloud console.
- Order a transfer appliance, export the data to it, and ship it to Google.
- Store the data in an HTTPS endpoint, and configure Storage Transfer Service to copy the data to Cloud Storage.
- Upload the data to Cloud Storage by using gcloud storage.

The correct answer is:

✅ Order a transfer appliance, export the data to it, and ship it to Google.
Why this is the correct choice

You have these key constraints:

Hundreds of terabytes of data

Only a 100 Mbps network connection

A strict 45-day deadline

Need a reliable, recommended Google Cloud solution

At 100 Mbps, transferring hundreds of TB over the network would take many months, not 45 days.

👉 Transfer Appliance is specifically designed for this exact scenario.

What Transfer Appliance provides

Physical, encrypted storage devices (up to hundreds of TB per appliance)

Offline data transfer (no network bottleneck)

Google securely ships the appliance to you

You load the data locally

Ship it back to Google

Data is uploaded directly into Cloud Storage

This is the fastest and most reliable way to migrate very large datasets with limited bandwidth.

Why the other options are incorrect

❌ Zip and upload via Console

Manual, slow, unreliable

Completely impractical for hundreds of TB

❌ Storage Transfer Service from HTTPS endpoint

Still limited by your 100 Mbps link

Too slow for the timeline

❌ Upload using gcloud storage

Same bandwidth limitation

Not feasible within 45 days

Quick math (intuition check) 🧠

At 100 Mbps:

~1 TB/day (best case)

100+ TB → months

Transfer Appliance bypasses this entirely.
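
The arithmetic behind that intuition, with a hypothetical 300 TB standing in for "hundreds of terabytes":

```python
# Back-of-the-envelope transfer time over a 100 Mbps link (ideal, no overhead).
link_mbps = 100
data_tb = 300   # hypothetical stand-in for "hundreds of terabytes"

bytes_per_day = link_mbps / 8 * 1e6 * 86400      # ~1.08 TB per day
days = data_tb * 1e12 / bytes_per_day
print(f"{days:.0f} days")                        # ~278 days, far beyond 45
```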


10. Laws in the region where you operate require that files related to all orders made each day are stored immutably for 365 days. The solution that you recommend has to be cost-effective. What should you do?

- Store the data in a Cloud Storage bucket, enable object versioning, and delete any version greater than 365.
- Store the data in a Cloud Storage bucket, and enable object versioning and delete any version older than 365 days.
- Store the data in a Cloud Storage bucket, and specify a retention period.
- Store the data in a Cloud Storage bucket, and set a lifecycle policy to delete the file after 365 days.

The correct answer is:

✅ Store the data in a Cloud Storage bucket, and specify a retention period.
Why this is the right solution

Your requirements are very clear:

📜 Legal requirement → Data must be immutable

⏱ Exactly 365 days

💰 Cost-effective

📁 Applies to files related to daily orders

👉 Cloud Storage retention policies are designed specifically for this use case.

What a retention policy guarantees

Objects cannot be deleted or overwritten until the retention period expires

Enforces immutability by policy, not by convention

Meets regulatory and compliance requirements

Minimal operational overhead

More cost-effective and simpler than versioning

Once the 365 days pass, the object can be deleted normally.
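
A minimal sketch with the Cloud Storage client, assuming hypothetical project and bucket names; the optional lifecycle rule keeps costs down by cleaning up objects once the retention period has elapsed.

```python
from google.cloud import storage

client = storage.Client(project="my-project")         # hypothetical
bucket = client.get_bucket("cymbal-daily-orders")      # hypothetical

# Objects cannot be deleted or overwritten until they are 365 days old.
bucket.retention_period = 365 * 24 * 60 * 60   # seconds
bucket.patch()

# Optional: delete objects automatically once retention no longer blocks it.
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()

# bucket.lock_retention_policy() would additionally lock the policy itself
# (irreversible), if regulators require that the retention period cannot be reduced.
```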

Why the other options are incorrect
❌ Object versioning + delete versions

Versioning does not enforce immutability

Objects can still be deleted unless additional controls are added

More expensive (stores multiple versions)

Not legally robust

❌ Lifecycle policy to delete after 365 days

Lifecycle rules do not prevent deletion or modification before 365 days

Does not guarantee immutability

❌ Versioning + delete versions older than 365 days

Same issue: versioning ≠ immutability

Higher storage costs

Complex and error-prone

Exam rule of thumb 🧠

If the question says immutable + compliance + fixed duration → Retention policy
