
GCP Professional Data Engineer Certification Questions (Part 2)

1. Your company takes data protection seriously and decides to implement the Principle of Least Privilege. What should you do to comply with this policy?

- When a task is assigned, ensure that it gets assigned to a person with the minimum privileges.
- Ensure that the access permissions are given strictly based on the person’s title and job role.
- Ensure that the users are verified every time they request access, even if they were authenticated earlier.
- Give just enough permissions to get the task done.

The correct answer is:

✅ Give just enough permissions to get the task done.
Why this is correct

The Principle of Least Privilege (PoLP) means:

A user, service account, or process should be granted only the minimum permissions necessary to perform its required tasks—and nothing more.

This reduces:

Security risk

Blast radius of compromised accounts

Accidental misuse of permissions

This definition maps exactly to:
“Give just enough permissions to get the task done.”
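For example, instead of granting a broad project-level role, you would grant the narrowest role on only the resource the task needs. A minimal sketch with the Cloud Storage Python client, assuming a hypothetical bucket and user; the role shown is one possible "just enough" choice for a read-only task:

```python
from google.cloud import storage

def grant_read_only_access(bucket_name: str, member: str) -> None:
    """Grant objectViewer on one bucket only, not Editor on the whole project."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Fetch the current IAM policy (version 3 supports conditional bindings).
    policy = bucket.get_iam_policy(requested_policy_version=3)

    # Bind the narrowest predefined role that still lets the task succeed.
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",  # read objects, nothing else
            "members": {member},                   # e.g. "user:analyst@example.com"
        }
    )
    bucket.set_iam_policy(policy)

# Hypothetical usage:
# grant_read_only_access("marketing-raw-data", "user:analyst@example.com")
```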

Why the other options are incorrect

❌ Assign the task to a person with minimum privileges
→ Least privilege scopes permissions to the needs of the task; it is not about picking a person who happens to hold few privileges.

❌ Grant access based strictly on title or job role
→ Job titles are often too broad and can lead to over-privileged access.

❌ Verify users every time they request access
→ This relates to authentication or zero trust, not least privilege.

Exam tip 🧠

If you see:

Least Privilege

Security best practice

IAM permissions

👉 The answer is always about minimizing permissions, not identity or authentication.


2. A company collects large amounts of consumer data from online marketing campaigns. The company plans to use Google Cloud to store this collected data. Top management is worried about exposing personally identifiable information (PII) that may be present in this data. What should you do to reduce the risk of exposing PII data?

- Ensure that all stored data is monitored by Security Command Center.
- Use the Cloud Data Loss Prevention API (DLP API) to inspect and redact PII data.
- Store all data in BigQuery and turn on column-level access to protect sensitive data.
- Ensure that all PII data is removed from the collected data before storing it on Google Cloud.

The correct answer is:

✅ Use the Cloud Data Loss Prevention API (DLP API) to inspect and redact PII data.
Why this is the best solution

Your key concerns are:

PII may be present

Risk of exposure

Need a reliable, recommended Google Cloud solution

👉 Cloud DLP API is purpose-built for this exact use case.

What Cloud DLP does

Automatically detects PII using built-in infoTypes (names, emails, phone numbers, credit cards, etc.)

Can redact, mask, tokenize, or anonymize sensitive data

Works across Cloud Storage, BigQuery, and data pipelines

Helps meet compliance and privacy regulations (GDPR, CCPA, etc.)

This directly reduces the risk of exposing PII while still allowing analytics.
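As an illustration, here is a minimal sketch of calling the DLP API from Python to redact emails and phone numbers in a text record. The project ID and input text are placeholders, and the info types shown are only a small subset of the built-in ones:

```python
from google.cloud import dlp_v2

def redact_pii(project_id: str, text: str) -> str:
    """Replace detected PII with the name of its infoType, e.g. [EMAIL_ADDRESS]."""
    dlp = dlp_v2.DlpServiceClient()

    inspect_config = {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    }
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }

    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

# Hypothetical usage:
# print(redact_pii("my-project", "Contact jane@example.com or 555-0100"))
```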

Why the other options are not sufficient

❌ Security Command Center
→ Focuses on security posture and threats, not PII inspection or redaction.

❌ Column-level access in BigQuery
→ Protects access, but does not remove or mask PII.
→ Risk remains if permissions are misconfigured.

❌ Remove all PII before storing data
→ Not always feasible or reliable.
→ DLP provides automated and consistent enforcement.

Exam rule of thumb 🧠

If the question mentions PII + risk reduction + Google-recommended solution → Cloud DLP API


3. You need to run batch jobs, which could take many days to complete. You do not want to manage the infrastructure provisioning. What should you do?

- Use Workflows to run the jobs.
- Use Cloud Scheduler to run the jobs.
- Use Cloud Run to run the jobs.
- Run the jobs on Batch.

The correct answer is:

✅ Run the jobs on Batch
Why this is correct

Your requirements are:

🕒 Batch jobs that may take many days

🧑‍💻 No infrastructure provisioning or management

⚙️ Managed, scalable execution

👉 Google Cloud Batch is designed exactly for this use case.

What Batch provides

Fully managed batch job execution

Automatically provisions and tears down compute resources

Supports long-running jobs (hours to days)

Handles scheduling, retries, and scaling

You focus on jobs, not servers or clusters

This makes it the recommended solution.
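A minimal sketch of submitting a script-based job with the Batch Python client, assuming placeholder project, region, and workload script; the run duration is an illustrative value:

```python
from google.cloud import batch_v1

def submit_batch_job(project_id: str, region: str, job_name: str) -> batch_v1.Job:
    """Submit a script job; Batch provisions and tears down the VMs for you."""
    client = batch_v1.BatchServiceClient()

    # What each task runs: a placeholder shell script standing in for the real workload.
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo 'long-running processing step goes here'"

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]
    task.max_run_duration = "864000s"  # allow up to 10 days per task

    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    job = batch_v1.Job()
    job.task_groups = [group]
    job.logs_policy = batch_v1.LogsPolicy(
        destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
    )

    request = batch_v1.CreateJobRequest(
        parent=f"projects/{project_id}/locations/{region}",
        job=job,
        job_id=job_name,
    )
    return client.create_job(request)
```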

Why the other options are incorrect

❌ Workflows

Orchestration service, not meant for long-running compute jobs

❌ Cloud Scheduler

Triggers jobs on a schedule, but does not run them

❌ Cloud Run

Designed for stateless services

Has execution time limits

Not suitable for multi-day batch jobs

Exam tip 🧠

If the question says long-running batch jobs + no infrastructure management → Batch


4. The first stage of your data pipeline processes tens of terabytes of financial data and creates a sparse, time-series dataset as a key-value pair. Which of these is a suitable sink for the pipeline's first stage?


- Cloud SQL
- AlloyDB
- Cloud Storage
- Bigtable

The correct answer is:

✅ Bigtable
Why Bigtable is the right sink

Your first pipeline stage:

Processes tens of terabytes of data

Produces a sparse, time-series dataset

Uses a key-value data model

Likely needs high write throughput and low-latency access

👉 Cloud Bigtable is purpose-built for exactly this pattern.

Bigtable strengths for this use case

Designed for massive scale (TBs–PBs)

Optimized for sparse, wide tables

Excellent for time-series data

High-throughput sequential writes

Key-value access with predictable performance

This makes Bigtable the ideal intermediate sink for the first stage of such a pipeline.
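A minimal sketch of how the first stage might write sparse time-series key-value rows with the Bigtable Python client. The instance, table, column family, and row-key scheme are assumptions for illustration:

```python
import datetime
from google.cloud import bigtable

def write_tick(project_id: str, instance_id: str, table_id: str,
               instrument: str, price: float) -> None:
    """Write one time-series cell keyed by instrument + timestamp."""
    client = bigtable.Client(project=project_id)
    table = client.instance(instance_id).table(table_id)

    now = datetime.datetime.now(datetime.timezone.utc)
    # Row-key design matters: prefix with the entity, then a timestamp, so reads
    # for one instrument over a time range become a single contiguous row scan.
    row_key = f"{instrument}#{now.strftime('%Y%m%d%H%M%S')}".encode()

    row = table.direct_row(row_key)
    # Sparse wide-column model: only the cells that actually exist are stored.
    row.set_cell("prices", "close", str(price).encode(), timestamp=now)
    row.commit()

# Hypothetical usage:
# write_tick("my-project", "ts-instance", "financial_ticks", "ACME", 101.25)
```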

Why the other options are not suitable

❌ Cloud SQL

Relational database

Not designed for TB-scale ingestion or time-series workloads

❌ AlloyDB

High-performance relational database

Still not suited for massive sparse time-series key-value data

❌ Cloud Storage

Good for raw files or archives

Not optimized for fast key-based reads/writes or time-series access

Exam rule of thumb 🧠

Massive scale + sparse + time-series + key-value → Bigtable


5. Your data engineering team receives data in JSON format from external sources at the end of each day. You need to design the data pipeline. What should you do?

- Store the data in Cloud Storage and create an extract, transform, and load (ETL) pipeline.
- Store the data in persistent disks and create an ETL pipeline.
- Create a public API to allow external applications to add the data to your warehouse.
- Make your BigQuery data warehouse public and ask the external sources to insert the data.

The correct answer is:

✅ Store the data in Cloud Storage and create an extract, transform, and load (ETL) pipeline.
Why this is the right design

Your requirements:

📄 Data arrives daily

📦 Data format is JSON

🌐 Data comes from external sources

🏗 You need a robust, scalable data pipeline

📊 Ultimately used for analytics (data warehouse)

👉 The recommended Google Cloud pattern is:

Land raw data in Cloud Storage (data lake)

Use an ETL pipeline (Dataflow, Dataproc, Dataprep, etc.) to:

Extract JSON files

Transform/clean/validate data

Load into BigQuery or another warehouse

Benefits

Cloud Storage is durable, scalable, cost-effective

Keeps raw data immutable for reprocessing

Decouples ingestion from processing

Works well for batch, daily ingestion
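To make this pattern concrete, here is a minimal Apache Beam sketch (runnable locally or on Dataflow) that reads newline-delimited JSON from a Cloud Storage landing bucket, applies a light transform, and loads the result into BigQuery. The bucket, table, schema, and field names are placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Pass --runner=DataflowRunner, --project, --region, etc. at launch time.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadJsonFiles" >> beam.io.ReadFromText("gs://my-landing-bucket/daily/*.json")
            | "ParseJson" >> beam.Map(json.loads)
            # Transform step: keep only the fields the warehouse schema expects.
            | "SelectFields" >> beam.Map(
                lambda r: {"order_id": r.get("order_id"), "amount": r.get("amount")}
            )
            | "LoadToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:marketing.orders",
                schema="order_id:STRING,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```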

Why the other options are incorrect

❌ Persistent disks

Tied to VMs

Not scalable or shareable

Poor choice for ingestion landing zone

❌ Create a public API

Adds unnecessary complexity

Security risks

Not required for batch JSON ingestion

❌ Make BigQuery public

Severe security risk

Violates best practices

No validation or transformation layer

Exam rule of thumb 🧠

External batch data → Cloud Storage → ETL → BigQuery


6. You are creating a data pipeline for streaming data on Dataflow for Cymbal Retail's point-of-sale data. You want to calculate the total sales per hour on a continuous basis. Which of these windowing options should you use?

- Tumbling windows (fixed windows in Apache Beam)
- Session windows
- Global window
- Hopping windows (sliding windows in Apache Beam)

The correct answer is:

✅ Tumbling windows (fixed windows in Apache Beam)
Why this is correct

Your requirement is:

📊 Total sales per hour

🔄 Continuous calculation

📡 Streaming data (POS events)

👉 This maps exactly to fixed (tumbling) windows.

What tumbling (fixed) windows do

Divide time into non-overlapping, fixed-size intervals

Example:

10:00–11:00

11:00–12:00

Each event belongs to one and only one window

Ideal for hourly aggregates like:

Total sales per hour

Orders per hour

Revenue per hour

This is the canonical solution for time-based metrics.
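In Apache Beam's Python SDK this is a one-line windowing step. A minimal sketch of the hourly aggregation, assuming the input PCollection already carries event timestamps and is keyed as (store_id, sale_amount):

```python
import apache_beam as beam
from apache_beam.transforms import window

def hourly_totals(sales: beam.PCollection) -> beam.PCollection:
    """Sum sales per store for each non-overlapping one-hour window."""
    return (
        sales
        # Fixed (tumbling) windows: 10:00-11:00, 11:00-12:00, ...
        | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
        # Aggregate within each window; results emit as the watermark passes.
        | "SumPerStore" >> beam.CombinePerKey(sum)
    )
```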

Why the other options are incorrect

❌ Session windows

Used for grouping events by user activity gaps

Not suitable for regular hourly aggregation

❌ Global window

All data in one window

Not practical for continuous hourly metrics

❌ Hopping (sliding) windows

Used when overlapping windows are required

Example: rolling last 1 hour every 5 minutes

More complex and unnecessary here

Exam rule of thumb 🧠

Per-hour / per-day / per-minute aggregates → Fixed (tumbling) windows


7. You are processing large amounts of input data in BigQuery. You need to combine this data with a small amount of frequently changing data that is available in Cloud SQL. What should you do?

- Copy the data from Cloud SQL to a new BigQuery table hourly.
- Create a Dataflow pipeline to combine the BigQuery and Cloud SQL data when the Cloud SQL data changes.
- Use a federated query to get data from Cloud SQL.
- Copy the data from Cloud SQL and create a combined, normalized table hourly.

The correct answer is:

✅ Use a federated query to get data from Cloud SQL.
Why this is the best solution

Your situation:

📊 Large datasets already in BigQuery

🔄 Small dataset in Cloud SQL

⏱ Frequently changing data

⚡ Need up-to-date joins without heavy pipelines

👉 Federated queries are designed exactly for this pattern.

What a federated query gives you

Query Cloud SQL data directly from BigQuery

No copying or syncing required

Always reads the latest Cloud SQL data

Simple architecture, low operational overhead

Ideal when:

Cloud SQL data is small

Changes frequently

Used mainly for lookups or joins

Example use case:

Join a large BigQuery fact table with a small, frequently updated reference table in Cloud SQL.
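A minimal sketch of what such a join looks like from the BigQuery Python client, using the EXTERNAL_QUERY function against a Cloud SQL connection. The connection ID, tables, and columns are placeholders:

```python
from google.cloud import bigquery

def join_with_cloud_sql_lookup() -> None:
    client = bigquery.Client()

    # EXTERNAL_QUERY runs the inner SQL on Cloud SQL through a BigQuery
    # connection resource, so the lookup data is always read live.
    sql = """
    SELECT f.order_id, f.amount, ref.category_name
    FROM `my-project.sales.fact_orders` AS f
    JOIN EXTERNAL_QUERY(
        'my-project.us.cloudsql-connection',
        'SELECT category_id, category_name FROM categories'
    ) AS ref
    ON f.category_id = ref.category_id
    """

    for row in client.query(sql).result():
        print(row.order_id, row.category_name)
```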

Why the other options are inferior

❌ Copy Cloud SQL data hourly

Data may be stale

Extra cost and complexity

Not ideal for frequently changing data

❌ Dataflow pipeline

Over-engineered for a small lookup dataset

Adds operational burden

❌ Create combined normalized table hourly

Same staleness and complexity issues

Violates simplicity and freshness goals

Exam rule of thumb 🧠

BigQuery + small, frequently changing Cloud SQL data → Federated query


8. You have a data pipeline that requires you to monitor a Cloud Storage bucket for a file, start a Dataflow job to process data in the file, run a shell script to validate the processed data in BigQuery, and then delete the original file. You need to orchestrate this pipeline by using recommended tools. Which product should you choose?

- Cloud Composer
- Cloud Tasks
- Cloud Run
- Cloud Scheduler

The correct answer is:

✅ Cloud Composer
Why Cloud Composer is the right choice

Your pipeline requires orchestration of multiple dependent steps:

Monitor a Cloud Storage bucket for a file

Start a Dataflow job

Run a shell script to validate data in BigQuery

Delete the original file after successful processing

👉 This is a classic workflow orchestration problem.

What Cloud Composer (Apache Airflow) provides

Native support for DAGs (Directed Acyclic Graphs) with dependencies

Sensors to watch Cloud Storage for file arrival

Operators to:

Trigger Dataflow jobs

Run shell scripts

Execute BigQuery validation queries

Perform cleanup steps (delete files)

Built-in retry, logging, monitoring, and error handling

This is exactly the recommended Google Cloud tool for orchestrating multi-step data pipelines.
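A condensed sketch of such a DAG using operators from the Google provider package for Airflow. The bucket, object, Dataflow template, and script paths are placeholders, and exact operator names and parameters can vary by provider version:

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="daily_file_pipeline",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    # 1. Wait for the input file to land in the bucket.
    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_file",
        bucket="my-landing-bucket",
        object="incoming/sales.csv",
    )

    # 2. Start the Dataflow job that processes the file.
    process_file = DataflowTemplatedJobStartOperator(
        task_id="process_file",
        job_name="process-sales",
        template="gs://my-templates/process_sales_template",
        location="us-central1",
    )

    # 3. Run a shell script that validates the processed data in BigQuery.
    #    The trailing space stops Jinja from treating the .sh path as a template file.
    validate = BashOperator(
        task_id="validate_output",
        bash_command="bash /home/airflow/gcs/data/validate_bq.sh ",
    )

    # 4. Delete the original file once everything has succeeded.
    cleanup = GCSDeleteObjectsOperator(
        task_id="delete_source_file",
        bucket_name="my-landing-bucket",
        objects=["incoming/sales.csv"],
    )

    wait_for_file >> process_file >> validate >> cleanup
```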

Why the other options are incorrect

❌ Cloud Tasks

Asynchronous task queue

Not designed for multi-step workflows with dependencies

❌ Cloud Run

Serverless compute platform

Executes code, but does not orchestrate multiple services

❌ Cloud Scheduler

Time-based trigger only

Cannot manage conditional or dependent steps

Exam rule of thumb 🧠

Multi-step, dependent data pipelines → Cloud Composer


9. You manage a PySpark batch data pipeline by using Dataproc. You want to take a hands-off approach to running the workload, and you do not want to provision and manage your own cluster. What should you do?

- Configure the job to run on Dataproc Serverless.
- Rewrite the job in Spark SQL.
- Rewrite the job in Dataflow with SQL.
- Configure the job to run with Spot VMs.

The correct answer is:

✅ Configure the job to run on Dataproc Serverless
Why this is correct

Your requirements are:

🧑‍💻 Hands-off operation

⚙️ No cluster provisioning or management

🧠 Existing PySpark batch job

☁️ Recommended Google Cloud approach

👉 Dataproc Serverless for Spark is built exactly for this scenario.

What Dataproc Serverless gives you

No cluster creation or lifecycle management

Automatically provisions and scales resources

Runs PySpark and Spark SQL jobs natively

Pay only for the resources your job uses

Ideal for batch workloads

You submit the job → Google Cloud handles everything else.
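A minimal sketch of submitting an existing PySpark script as a Dataproc Serverless batch with the Python client; the project, region, and script URI are placeholders:

```python
from google.cloud import dataproc_v1

def submit_serverless_pyspark(project_id: str, region: str, script_uri: str) -> None:
    """Submit a PySpark batch; no cluster is created or managed by you."""
    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    batch = dataproc_v1.Batch()
    batch.pyspark_batch.main_python_file_uri = script_uri  # e.g. gs://my-bucket/jobs/etl.py

    request = dataproc_v1.CreateBatchRequest(
        parent=f"projects/{project_id}/locations/{region}",
        batch=batch,
    )

    operation = client.create_batch(request)
    # Blocks until the batch finishes; compute is provisioned and released
    # by Dataproc Serverless, not by you.
    result = operation.result()
    print(f"Batch finished in state: {result.state.name}")
```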

Why the other options are incorrect

❌ Rewrite the job in Spark SQL
→ Still requires a cluster or serverless runtime; doesn’t remove infrastructure management by itself.

❌ Rewrite the job in Dataflow with SQL
→ Requires rewriting logic and changing processing engine; unnecessary when Spark already works.

❌ Use Spot VMs
→ Reduces cost, but you still manage clusters and handle interruptions.

Exam rule of thumb 🧠

Spark job + no cluster management → Dataproc Serverless


10. You are running Dataflow jobs for data processing. When developers update the code in Cloud Source Repositories, you need to test and deploy the updated code with minimal effort. Which of these would you use to build your continuous integration and delivery (CI/CD) pipeline for data processing?


- Compute Engine
- Terraform
- Cloud Build
- Cloud Code

The correct answer is:

✅ Cloud Build
Why Cloud Build is the right choice

Your requirements are:

🔄 Automatically test and deploy Dataflow code

🧑‍💻 Triggered by code updates in Cloud Source Repositories

⚡ Minimal effort to set up CI/CD

🧱 Recommended Google Cloud service

👉 Cloud Build is Google Cloud’s native CI/CD service and integrates directly with:

Cloud Source Repositories

GitHub / GitLab

Dataflow (build, test, deploy pipelines)

What Cloud Build provides

Build triggers on code commits

Automated:

Unit tests

Pipeline packaging

Dataflow job submission

Fully managed (no servers to manage)

YAML-based configuration (cloudbuild.yaml)

This makes Cloud Build the standard choice for CI/CD of Dataflow pipelines.

Why the other options are incorrect

❌ Compute Engine
→ Infrastructure service, not CI/CD

❌ Terraform
→ Infrastructure provisioning (IaC), not pipeline testing/deployment

❌ Cloud Code
→ IDE plugin for local development, not CI/CD automation

Exam rule of thumb 🧠

CI/CD on Google Cloud → Cloud Build

11. Your company has multiple data analysts but a limited data engineering team. You need to choose a tool that lets the analysts build data pipelines themselves through a graphical user interface. Which of these products is the most appropriate?

- Cloud Data Fusion
- Cloud Composer
- Dataflow
- Dataproc

The correct answer is:

✅ Cloud Data Fusion
Why Cloud Data Fusion is the most appropriate

Your constraints are clear:

👥 Many data analysts

👩‍💻 Limited data engineering team

🧩 Analysts should build pipelines themselves

🖱 Graphical user interface (GUI) required

👉 Cloud Data Fusion is specifically designed for this use case.

What Cloud Data Fusion provides

Drag-and-drop UI for building data pipelines

No/low-code experience for analysts

Built-in connectors for:

Cloud Storage

BigQuery

Cloud SQL

On-prem sources

Handles ETL/ELT visually

Runs pipelines at scale (powered by Dataproc under the hood)

This lets analysts be self-sufficient, reducing dependency on data engineers.

Why the other options are not suitable

❌ Cloud Composer

Workflow orchestration (Airflow)

Requires Python and DAG coding

❌ Dataflow

Requires Java/Python/SQL coding

Not GUI-based

❌ Dataproc

Hadoop/Spark cluster service

Requires programming and cluster knowledge

Exam rule of thumb 🧠

Non-programmers + visual pipelines → Cloud Data Fusion


12. You want to build a streaming data analytics pipeline in Google Cloud. You need to choose the right products that support streaming data. Which of these would you choose?

- Pub/Sub, Dataprep, BigQuery
- Pub/Sub, Dataflow, BigQuery
- Cloud Storage, Dataprep, AlloyDB
- Cloud Storage, Dataflow, Cloud SQL

The correct answer is:

✅ Pub/Sub, Dataflow, BigQuery
Why this is the right choice

To build a streaming data analytics pipeline on Google Cloud, you typically need:

Ingestion of streaming events

Real-time stream processing

Analytics-ready storage

Here’s how this stack fits perfectly:

🔹 Pub/Sub

Fully managed event ingestion service

Handles high-throughput real-time streaming data

🔹 Dataflow

Serverless stream and batch processing engine (Apache Beam)

Supports windowing, aggregation, late data handling

Ideal for real-time analytics pipelines

🔹 BigQuery

Scalable analytics data warehouse

Supports streaming inserts

Enables real-time dashboards and analysis

This is the reference architecture for streaming analytics on Google Cloud.
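A condensed sketch of this reference architecture as a single Beam streaming pipeline; the subscription, table, schema, and message format are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def run():
    # Streaming mode; Dataflow runner options are added at launch time.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # 1. Ingest: read JSON events from a Pub/Sub subscription.
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/pos-events"
            )
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # 2. Process: aggregate sales per store per minute.
            | "KeyByStore" >> beam.Map(lambda e: (e["store_id"], e["amount"]))
            | "MinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
            | "SumSales" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"store_id": kv[0], "total": kv[1]})
            # 3. Serve: stream the aggregates into BigQuery for analysis.
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:retail.sales_by_minute",
                schema="store_id:STRING,total:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
            )
        )

if __name__ == "__main__":
    run()
```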

Why the other options are incorrect

❌ Pub/Sub, Dataprep, BigQuery

Dataprep is for batch data preparation, not streaming

❌ Cloud Storage, Dataprep, AlloyDB

No streaming ingestion or real-time processing

❌ Cloud Storage, Dataflow, Cloud SQL

Cloud Storage is not a streaming source

Cloud SQL is not designed for streaming analytics

Exam rule of thumb 🧠

Streaming analytics on GCP = Pub/Sub → Dataflow → BigQuery


13. A company collects large amounts of data that is useful for improving business operations. The collected data is already clean and is in a format that is suitable for further analysis. The company uses BigQuery as a data warehouse. What approach will you recommend to move this data to BigQuery?


- Split the data into smaller files and then move to Google Cloud.
- Directly load the data using the Extract and Load (EL) approach.
- Implement Extract, Transform and Load (ETL) pipelines using tools like Dataflow.
- Do the transformation using Extract, Load, and Transform (ELT).

The correct answer is:

✅ Directly load the data using the Extract and Load (EL) approach.
Why this is the best recommendation

Your scenario states:

📦 Data is already clean

🧾 Data is already in an analysis-ready format

🏢 BigQuery is the target data warehouse

👉 In this case, no transformation is required, so the simplest and most efficient approach is EL (Extract and Load).

Benefits of EL here

Avoids unnecessary processing

Faster ingestion

Lower cost and operational complexity

Aligns with BigQuery best practices when data is already prepared
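A minimal sketch of a direct EL load with the BigQuery Python client, pulling newline-delimited JSON straight from Cloud Storage into a table; the URIs and table names are placeholders:

```python
from google.cloud import bigquery

def load_clean_json(table_id: str, gcs_uri: str) -> None:
    """Extract-and-Load: no transformation step, just a managed load job."""
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # data is already clean, so let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to complete
    print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")

# Hypothetical usage:
# load_clean_json("my-project.ops.daily_metrics", "gs://my-bucket/exports/*.json")
```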

Why the other options are not appropriate

❌ Split data into smaller files

An implementation detail, not an ingestion strategy

Does not address how data is moved or processed

❌ ETL with Dataflow

Overkill when data is already clean

Adds unnecessary transformation and infrastructure cost

❌ ELT (transform after load)

Still unnecessary since no transformation is needed

Exam rule of thumb 🧠

Clean, analytics-ready data → EL into BigQuery


14. A company wants to improve productivity and decides to programmatically schedule and monitor workflows. What tool can you use to automate your workflows?


- Data Fusion
- Cloud Composer
- Apache Beam and Dataflow
- Dataproc

The correct answer is:

✅ Cloud Composer
Why Cloud Composer is the right tool

Your requirement is to programmatically schedule and monitor workflows and automate them.

👉 Cloud Composer (managed Apache Airflow) is specifically designed for this purpose.

What Cloud Composer provides

Define workflows as code (DAGs)

Schedule workflows (time-based or event-driven)

Monitor execution, retries, failures, and dependencies

Orchestrate tasks across many services:

BigQuery

Dataflow

Dataproc

Cloud Storage

Cloud Run, etc.

This makes it the recommended workflow automation and orchestration tool on Google Cloud.
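For instance, a minimal DAG that schedules and monitors a nightly BigQuery step might look like the sketch below; the query, project, and dataset names are placeholders, while scheduling, retries, and monitoring come from Airflow itself:

```python
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="nightly_reporting",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="0 2 * * *",          # every day at 02:00 UTC
    catchup=False,
    default_args={"retries": 2},   # automatic retries on failure
) as dag:
    refresh_daily_summary = BigQueryInsertJobOperator(
        task_id="refresh_daily_summary",
        configuration={
            "query": {
                "query": "CALL `my-project.reporting.refresh_daily_summary`()",
                "useLegacySql": False,
            }
        },
    )
```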

Why the other options are incorrect

❌ Data Fusion

Focused on ETL pipeline creation

Not a general-purpose workflow orchestrator

❌ Apache Beam and Dataflow

Used for data processing, not scheduling and monitoring workflows

❌ Dataproc

Managed Hadoop/Spark service

Not for workflow automation

Exam rule of thumb 🧠

Workflow automation + scheduling + monitoring → Cloud Composer
