GCP PROFESSIONAL DATA ENGINEER CERTIFICATION Questions P-15

1. You would like to set up an alerting policy to detect whether the processed data in a streaming pipeline is still fresh. Which metrics can be used to monitor data freshness?

job/system_lag
job/is_failed
job/data_watermark_age
job/per_stage_data_watermark_age

✅ Correct metrics for monitoring data freshness

✔ job/data_watermark_age
✔ job/per_stage_data_watermark_age

Why BOTH are correct (deep, precise explanation)

In a streaming pipeline on Google Cloud Dataflow, data freshness is defined by event-time progress, which is tracked using watermarks.

✅ job/data_watermark_age (job-level freshness)

Measures how far behind real time the overall pipeline watermark is

Best metric to answer:

“Is my pipeline producing fresh results overall?”

This is the primary, high-level freshness metric

✅ job/per_stage_data_watermark_age (stage-level freshness)

Measures watermark age per pipeline stage

Lets you:

Identify which transform is causing delays

Detect localized bottlenecks (e.g., slow sinks, heavy joins)

Essential for root-cause analysis when freshness degrades

➡ This metric is explicitly valid and correct for freshness monitoring.

Why the others are NOT correct ❌
❌ job/system_lag

Measures the maximum time an element has been awaiting processing (processing backlog)

Indicates resource pressure

❌ Does not directly reflect event-time freshness

❌ job/is_failed

Binary job state

A job can be running but producing stale data

❌ Not a freshness metric
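
To turn this into an actual alerting policy, here is a minimal sketch using the Cloud Monitoring Python client (google-cloud-monitoring); the project ID, the one-hour threshold, and the five-minute duration are illustrative values, not part of the question.

```python
# Hedged sketch: alert when Dataflow data freshness (watermark age)
# exceeds one hour. Project ID and thresholds are illustrative.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"  # hypothetical project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Data freshness above 1 hour",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="dataflow.googleapis.com/job/data_watermark_age" '
            'AND resource.type="dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=3600,  # watermark age, in seconds
        duration=duration_pb2.Duration(seconds=300),  # sustained for 5 minutes
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow data freshness alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

client.create_alert_policy(name=project_name, alert_policy=policy)
```

The same shape of policy can target job/per_stage_data_watermark_age instead if you want stage-level alerts.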


2. You have a Pub/Sub subscription with data that hasn’t been processed for 3 days. You set up a streaming pipeline that reads data from the subscription, does a few Beam transformations, and then sinks to Cloud Storage. When the pipeline is launched, it is able to read from Pub/Sub, but cannot sink data to Cloud Storage due to the service account missing permissions to write to the bucket. When viewing the Job Metrics tab, what do you expect to see in the data freshness graph?

Initial start point at 3 days, with a downward sloping line.
Initial start point at 3 days, with an upward sloping line.
Initial start point at 0 days, with an upward sloping line to 3 days and beyond.
Initial start point at 3 days, with a flat horizontal line.

✅ Correct answer:

Initial start point at 3 days, with an upward sloping line.

Why this is the correct behavior (deep, precise explanation)

This question is about data freshness = watermark age, specifically output watermark age in Google Cloud Dataflow.

What happens in your scenario

Pub/Sub has 3 days of backlog → initial watermark age = 3 days

The pipeline can read from Pub/Sub

The pipeline cannot write to Cloud Storage (sink permission failure)

Because the sink is blocked:

Output watermark cannot advance

No data is successfully completed end-to-end

What the freshness graph measures

The freshness graph shows how old the output watermark is compared to now

If the watermark is stuck, but wall-clock time keeps moving forward, then:

Watermark age increases over time

Resulting graph shape

Initial point: 3 days (existing backlog)

Over time: watermark does not move, but “now” does

Effect: freshness age keeps growing

📈 Upward sloping line starting at 3 days
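
A quick way to see why (illustrative arithmetic, not from the question): the graph plots roughly now minus the output watermark. With the watermark stuck three days behind the launch time:

freshness(t) = t − (launch_time − 3 days) = 3 days + (t − launch_time)

So the graph starts at 3 days when the job launches and climbs by one hour for every hour of wall-clock time that passes.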

Why the other options are wrong ❌

❌ 3 days with downward slope

Would require successful end-to-end processing

Impossible with a broken sink

❌ 0 days upward to 3 days

Incorrect — there is already a 3-day backlog at start

❌ 3 days flat line

Would only happen if the watermark advanced at exactly the same rate as wall-clock time (keeping pace with new data but never clearing the backlog)

With the sink blocked, the watermark does not advance at all, so the age keeps increasing


3. Select all that apply - the BigQuery Jobs tab shows jobs from:

Streaming Extracts
Query jobs
Load jobs
Streaming Inserts

Correct selections (✔):

✅ Query jobs
✅ Load jobs

Explanation (BigQuery UI & job types — reverified)

In BigQuery, the Jobs tab shows asynchronous jobs that BigQuery executes (query, load, extract, and copy jobs). Of the options listed:

✅ Query jobs

SQL queries you run in the UI, CLI, or API

Always appear in the Jobs tab

✅ Load jobs

Batch loads from sources like:

Cloud Storage

Local files

These are explicit jobs and are tracked in the Jobs tab

Why the other options are NOT shown ❌

❌ Streaming Inserts

Streaming inserts use the streaming API

They are not jobs

Therefore, they do NOT appear in the Jobs tab

❌ Streaming Extracts

There is no such BigQuery job type

BigQuery supports extract jobs (exports), but not “streaming extracts”
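
A small sketch with the google-cloud-bigquery Python client illustrates the distinction; the table, bucket, and file names are hypothetical.

```python
# Hedged sketch: a load job appears in the Jobs tab, a streaming insert
# does not, because the streaming API never creates a job.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

# Load job: an asynchronous BigQuery job (type LOAD), visible in the Jobs tab.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",  # hypothetical source file
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    ),
)
load_job.result()  # blocks until the job finishes; its job ID shows in the UI

# Streaming insert: a plain API call, no job object, nothing in the Jobs tab.
errors = client.insert_rows_json(table_id, [{"name": "example", "value": 1}])
```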



4. Your batch job has failed, and when viewing the Diagnostic tab, you see the following insights: Which of the options below is the best one to undertake to resolve the issue?

Use a larger machine size
Increase the number of machines used
Switch Beam code from Java to Python
Increase persistent disk size

✅ Correct answer: Use a larger machine size
Why “Use a larger machine size” is correct (logical + Diagnostic-tab aligned)

The key signal in the question is:

“Your batch job has failed, and when viewing the Diagnostic tab, you see insights”

In Google Cloud Dataflow, when the Diagnostics tab explicitly provides insights, it usually points to resource pressure, most commonly:

CPU exhaustion

Memory pressure (OOM kills)

Workers repeatedly restarting

When Diagnostics recommend action, the most common and direct remediation shown is:

Increase worker machine type

Why this fixes the failure

Larger machine size increases:

CPU

Memory

Many batch failures happen due to:

Large GroupBy / Combine steps

In-memory pressure during shuffle

Worker OOM crashes

These failures persist even if you add more workers, because:

The problem is per-worker resource limits, not total cluster capacity

➡ Bigger machines directly address the failure cause.
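
If you relaunch the job, the fix is a one-line change to the worker options. A minimal sketch in the Python SDK follows; the project, region, bucket, and the n1-highmem-8 machine type are illustrative choices.

```python
# Hedged sketch: rerun the batch job on larger workers by setting a bigger
# machine type in the pipeline options. All resource names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical project
    region="us-central1",                # hypothetical region
    temp_location="gs://my-bucket/tmp",  # hypothetical bucket
    machine_type="n1-highmem-8",         # more CPU and memory per worker
)

with beam.Pipeline(options=options) as pipeline:
    # ... the existing batch transforms go here ...
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)
```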

Why the other options are NOT correct ❌
❌ Increase the number of machines used

Improves throughput

❌ Does not fix per-worker memory/CPU exhaustion

Diagnostics would not recommend this for hard failures

❌ Increase persistent disk size

Correct only if Diagnostics explicitly mention disk pressure

The question does not say disk exhaustion

Diagnostics usually distinguish disk vs memory issues clearly

❌ Switch Beam code from Java to Python

Irrelevant

No impact on worker failure modes


5. Which two of the following statements are true for failures while building the pipeline?

The failure is reproducible with the Direct Runner.

The error message is visible in Dataflow.

The failure can be caused by insufficient permissions granted to the controller service account.

The failure can be caused by incorrect input/output specifications.


Correct answers (select TWO): ✅

✔ The failure is reproducible with the Direct Runner.
✔ The failure can be caused by incorrect input/output specifications.

Why these two are correct (Beam / Dataflow–accurate)

Failures while building the pipeline happen before the job is submitted to the Dataflow service. These are pipeline construction–time errors, not execution-time errors.

✅ Reproducible with the Direct Runner

Pipeline build errors occur in the client SDK (Python/Java)

The same pipeline graph is built regardless of runner

Therefore, these errors:

Also fail with the Direct Runner

Are commonly caught locally before submission

✅ Caused by incorrect input/output specifications

Examples:

Invalid file paths

Wrong schema definitions

Invalid table specs

Incorrect Pub/Sub topic/subscription formats

These errors surface while constructing transforms

Very common build-time failures
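
Because the graph is built in the client, the same construction error surfaces with the Direct Runner before anything is submitted; a minimal sketch (hypothetical paths):

```python
# Hedged sketch: construction-time failures reproduce locally with the
# DirectRunner. Paths below are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input-*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Write" >> beam.io.WriteToText("/tmp/output")
    )
# A misconfigured transform (e.g., an invalid I/O specification) raises its
# error right here, in the client, before any Dataflow job exists.
```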

Why the other options are NOT correct ❌

❌ The error message is visible in Dataflow.

Build-time failures happen before the job is created

No Dataflow job exists yet

Therefore:

❌ Nothing appears in the Dataflow UI

Errors appear only in local logs / console output

❌ The failure can be caused by insufficient permissions granted to the controller service account.

Service account permissions are checked:

When launching

Or during execution

This is not a pipeline build-time failure


6. Your Dataflow batch job fails after running for close to 5 hours. Which two of the following troubleshooting steps would you take to understand the root cause of the failure?

Log the failing elements and check the output using Cloud Logging.
Investigate the wall time of the individual steps in the job.
Check the Dataflow worker logs for warnings or errors related to work item failures.
Monitor the Data Freshness and System Latency graphs to understand the job performance.

✅ Correct answers (select TWO):

✔ Check the Dataflow worker logs for warnings or errors related to work item failures.
✔ Log the failing elements and check the output using Cloud Logging.

Why these two are correct (thoughtful, failure-focused reasoning)

The key phrase in the question is:

“fails after running for close to 5 hours”

This implies:

The pipeline built successfully

The job ran for a long time

The failure happened during execution, likely due to bad data or runtime exceptions

✅ Check the Dataflow worker logs

This is always the first and most reliable step.

Worker logs show:

Exceptions thrown by DoFns

Serialization errors

OOMs

Disk / shuffle failures

Repeated work-item retries

➡ Without worker logs, you cannot identify the real root cause.

✔ Correct

✅ Log the failing elements and inspect via Cloud Logging

When a batch job fails after hours, the most common causes are:

Malformed records

Unexpected nulls

Schema violations

Edge-case data values

To identify these:

You log elements (or bad records)

Inspect them in Cloud Logging

Fix the pipeline or data accordingly

This is a standard Dataflow debugging technique for batch failures.

✔ Correct
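
A minimal sketch of this technique in the Python SDK; the JSON parsing is an illustrative stand-in for whatever processing is failing.

```python
# Hedged sketch: catch the exception, log the offending element, and keep
# going. Dataflow forwards these worker logs to Cloud Logging.
import json
import logging

import apache_beam as beam

class ParseWithLogging(beam.DoFn):
    """Parses JSON records and logs bad elements instead of crashing."""

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception:
            # Appears as a worker log entry, searchable in Cloud Logging
            # under the Dataflow job's logs.
            logging.exception("Failed to process element: %r", element)
```

In practice you would usually also route the bad records to a dead-letter output rather than only logging them.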

Why the other options are NOT correct ❌
❌ Investigate the wall time of the individual steps

Wall time analysis helps with performance tuning

❌ It does not explain why a job failed

A job can fail even if wall times look normal

➡ Useful for optimization, not root-cause failure analysis

❌ Monitor Data Freshness and System Latency graphs

These are streaming-only metrics

Batch jobs do not use freshness or latency

Completely irrelevant here

➡ Eliminate immediately

🧠 Correct mental model

Long-running batch job failure = runtime error or bad data

So you:

Check worker logs

Inspect failing elements


7. Which one of the following is not a consideration for designing performant pipelines in Dataflow?
Coders and decoders used in the pipeline.
Logging
Filtering data early.
SDK used for developing the pipeline.

Correct answer: ✅ SDK used for developing the pipeline
The SDK used to develop the pipeline is not a performance consideration; the SDK is chosen based on other requirements, such as developer skills and the libraries you need.

Explanation (Dataflow performance design — carefully reasoned)

When designing performant pipelines in Google Cloud Dataflow, performance is driven by how the pipeline is structured and how data flows, not by the choice of SDK language.

Let’s evaluate each option:

✅ Coders and decoders used in the pipeline — Performance consideration

Inefficient coders increase:

CPU usage

Serialization/deserialization overhead

Choosing efficient, schema-aware coders improves performance
✔ Valid consideration

✅ Logging — Performance consideration

Excessive logging:

Slows down workers

Increases I/O and Cloud Logging costs

Especially harmful inside hot paths like DoFn.processElement
✔ Valid consideration

✅ Filtering data early — Performance consideration

Reduces:

Data volume

Shuffle size

Memory and disk usage

One of the most important performance best practices
✔ Valid consideration
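
A tiny sketch of filtering early (made-up records): the Filter runs before the GroupByKey, so the shuffle only moves the rows that survive.

```python
# Hedged sketch: filter before the shuffle so GroupByKey sees less data.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([
            {"user": "a", "status": "ACTIVE"},
            {"user": "b", "status": "INACTIVE"},
        ])
        | "FilterEarly" >> beam.Filter(lambda e: e["status"] == "ACTIVE")
        | "KeyByUser" >> beam.Map(lambda e: (e["user"], e))
        | "GroupPerUser" >> beam.GroupByKey()  # shuffle handles only filtered rows
        | "Print" >> beam.Map(print)
    )
```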

❌ SDK used for developing the pipeline — NOT a performance consideration

Beam’s execution model is runner-optimized

Dataflow performance depends on:

Pipeline graph

Data volume

Resource usage

Not on whether you use:

Java

Python

Go

Differences are negligible compared to pipeline design

❌ Not a key performance factor


8. Select the options we can use to mitigate data skew in Dataflow pipelines.

Use Dataflow shuffle for batch pipelines and Dataflow streaming option for streaming pipelines.
Use APIs like “withFanout” or “withHotKeyFanout”.
Add composite windows and triggers.
Add more worker machines.

✅ Correct answer:

Use APIs like withFanout or withHotKeyFanout.

➡️ This is the ONLY correct option.

Why this is the correct (and only) skew-mitigation technique
✅ Use APIs like withFanout or withHotKeyFanout ✔

Data skew is caused by hot keys (a few keys receiving most of the data).

withFanout / withHotKeyFanout:

Explicitly splits hot keys into multiple shards

Processes them in parallel

Re-combines results later

This is the canonical and recommended Beam/Dataflow solution for skew.

✔ Directly addresses the root cause of skew
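
In the Python SDK the equivalent API is with_hot_key_fanout; a minimal sketch with made-up data and an illustrative fanout factor of 16:

```python
# Hedged sketch: pre-aggregate the hot key on many workers, then merge the
# partial sums. The data and the fanout factor are illustrative.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("hot_key", 1)] * 1000 + [("cold_key", 1)])
        | "SumPerKey" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        | "Print" >> beam.Map(print)
    )
```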

Why the other options are NOT correct ❌
❌ Use Dataflow shuffle for batch pipelines and Streaming Engine for streaming

These improve scalability and resource management

They do not fix hot keys

A single hot key still bottlenecks execution

➡ Helps performance, not skew

❌ Add composite windows and triggers

Windows and triggers control when results are emitted

They do nothing to redistribute skewed keys

➡ Completely unrelated to skew mitigation

❌ Add more worker machines

This is a very common misconception

If one key dominates:

Only one worker processes it

Other workers sit idle

More workers ≠ more parallelism for hot keys

➡ Does not mitigate skew


9. When should we avoid fusion in a Dataflow pipeline?

Only in specific scenarios, like if your pipeline involves massive fanouts.
Always
Never

Correct answer: ✅
Only in specific scenarios, like if your pipeline involves massive fanouts.

Why this is correct (Dataflow execution–accurate)

In Google Cloud Dataflow, fusion is an optimization where multiple transforms are combined into a single execution stage to reduce overhead and improve performance.

👉 Fusion is generally beneficial and enabled by default.
You should avoid or break fusion only in specific scenarios where it causes problems.

When should you avoid fusion? (key scenarios)

Massive fanouts that:

Create memory pressure

Cause downstream bottlenecks

Hot keys or skewed workloads

When:

One slow transform blocks faster downstream transforms

You need better parallelism between stages

Common techniques to break fusion:

Reshuffle

Writing/reading from an external system

Using GroupByKey
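
A rough sketch of breaking fusion with Reshuffle after a massive fan-out (the fan-out function and sizes are made up):

```python
# Hedged sketch: Reshuffle forces a materialization point, so the fanned-out
# elements are redistributed across workers instead of staying fused with
# the step that produced them.
import apache_beam as beam

def fan_out(element):
    # Hypothetical fan-out: each input element yields many outputs.
    for i in range(10_000):
        yield (element, i)

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["a", "b"])
        | "FanOut" >> beam.FlatMap(fan_out)
        | "BreakFusion" >> beam.Reshuffle()
        | "HeavyWork" >> beam.Map(lambda kv: kv)  # placeholder downstream step
    )
```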

Why the other options are incorrect ❌

❌ Always

Fusion is a core optimization

Disabling it everywhere would severely hurt performance

❌ Never

Fusion can cause performance issues in certain pipelines

Blindly allowing fusion can lead to:

Worker OOMs

Poor parallelism

Back pressure


10. When draining a streaming pipeline, what should you expect to happen?

A snapshot is taken of the source, then ingestion is stopped. Windows are closed and processing of in-flight elements will be allowed to complete.
Any open windows will wait for new data so that aggregations are completed. Then the pipeline will be canceled.
Both processing and ingestion stop immediately.
Ingestion stops immediately, windows are closed, and processing of in-flight elements will be allowed to complete.

Correct answer: ✅
Ingestion stops immediately, windows are closed, and processing of in-flight elements will be allowed to complete.

Why this is correct (Dataflow drain semantics — reverified)

In Google Cloud Dataflow, draining a streaming pipeline is a graceful shutdown:

Ingestion stops immediately

No new data is read from the source (e.g., Pub/Sub)

Open windows are closed

Triggers fire as configured

Final results are emitted

In-flight elements continue processing

Any data already read is fully processed and written to sinks

The job ends cleanly, preserving correctness

This behavior is specifically designed to avoid data loss while allowing a controlled stop.

Why the other options are incorrect ❌

❌ “A snapshot is taken of the source…”

Dataflow does not take a source snapshot during drain

That description is closer to checkpointing semantics, not draining

❌ “Any open windows will wait for new data…”

Ingestion is stopped

No new data arrives

Windows do not wait for future data

❌ “Both processing and ingestion stop immediately.”

That describes cancel, not drain

Cancel can cause data loss



11. Using anonymous subclasses in your ParDos is an anti-pattern because:

ParDos are required to contain a concrete subclass of a DoFn.
Anonymous subclasses are bad for the performance of your pipeline.
Anonymous subclasses are harder to test than concrete subclasses.

Correct answer: ✅
Anonymous subclasses are harder to test than concrete subclasses.
Anonymous subclasses are more difficult to test because they lack a clear identity, are tightly coupled to the surrounding code, and cannot easily be reused.

Explanation (Beam / Dataflow best practices — reverified)

In Apache Beam, using anonymous subclasses of DoFn inside ParDo is considered an anti-pattern primarily for maintainability and testability reasons.

✅ Why this is the correct reason

Anonymous DoFns:

Cannot be easily instantiated in isolation

Are difficult to unit test with TestPipeline and DoFnTester

Make reuse and mocking harder

Concrete (named) DoFn classes:

Are reusable

Are easy to test independently

Improve readability and maintainability

This is the main reason Beam guidance discourages anonymous subclasses.
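
The question refers to Java's anonymous subclasses; the same idea, sketched in the Python SDK, is that a named DoFn can be unit tested in isolation with TestPipeline, which inline, unnamed logic cannot.

```python
# Hedged sketch: a named DoFn is trivially testable on its own.
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

class ExtractDomain(beam.DoFn):
    """Named DoFn: identifiable, reusable, and testable in isolation."""

    def process(self, email):
        yield email.split("@")[1]

def test_extract_domain():
    with TestPipeline() as p:
        result = (
            p
            | beam.Create(["a@example.com", "b@example.org"])
            | beam.ParDo(ExtractDomain())
        )
        assert_that(result, equal_to(["example.com", "example.org"]))
```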

Why the other options are incorrect ❌

❌ “ParDos are required to contain a concrete subclass of a DoFn.”

False. Beam allows anonymous subclasses.

It’s discouraged, not forbidden.

❌ “Anonymous subclasses are bad for the performance of your pipeline.”

False. There is no inherent performance penalty.

Execution performance is determined by:

Transform logic

Data volume

Runner optimizations

Not by whether the DoFn is anonymous or named.
