1. You would like to set up an alerting policy to catch whether the processed data is still fresh in a streaming pipeline. Which metrics can be used to monitor whether the processed data is still fresh?
job/system_lag
job/is_failed
job/data_watermark_age
job/per_stage_data_watermark_age
✅ Correct metrics for monitoring data freshness
✔ job/data_watermark_age
✔ job/per_stage_data_watermark_age
Why BOTH are correct (deep, precise explanation)
In a streaming pipeline on Google Cloud Dataflow, data freshness is defined by event-time progress, which is tracked using watermarks.
✅ job/data_watermark_age (job-level freshness)
Measures how far behind real time the overall pipeline watermark is
Best metric to answer:
“Is my pipeline producing fresh results overall?”
This is the primary, high-level freshness metric
✅ job/per_stage_data_watermark_age (stage-level freshness)
Measures watermark age per pipeline stage
Lets you:
Identify which transform is causing delays
Detect localized bottlenecks (e.g., slow sinks, heavy joins)
Essential for root-cause analysis when freshness degrades
➡ This metric is explicitly valid and correct for freshness monitoring.
Why the others are NOT correct ❌
❌ job/system_lag
Measures the maximum time an element has been awaiting processing (processing backlog / internal lag)
Indicates resource pressure
❌ Does not directly reflect event-time freshness
❌ job/is_failed
Binary job state
A job can be running but producing stale data
❌ Not a freshness metric
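For reference, a minimal sketch of an alerting policy on job/data_watermark_age using the Cloud Monitoring Python client (google-cloud-monitoring). The project ID, threshold, and display names are placeholder assumptions, and a per-stage policy would filter on job/per_stage_data_watermark_age instead:
```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Data watermark age too high",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Job-level freshness metric; swap in
        # dataflow.googleapis.com/job/per_stage_data_watermark_age for a per-stage policy.
        filter=(
            'metric.type = "dataflow.googleapis.com/job/data_watermark_age" '
            'AND resource.type = "dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=600,  # assumption: metric reported in seconds (alert above 10 min)
        duration=duration_pb2.Duration(seconds=300),  # sustained for 5 minutes
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow data freshness",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[condition],
)

client.create_alert_policy(
    name="projects/my-project-id",  # placeholder project
    alert_policy=policy,
)
```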
2. You have a Pub/Sub subscription with data that hasn’t been processed for 3 days. You set up a streaming pipeline that reads data from the subscription, does a few Beam transformations, and then sinks to Cloud Storage. When the pipeline is launched, it is able to read from Pub/Sub, but cannot sink data to Cloud Storage due to the service account missing permissions to write to the bucket. When viewing the Job Metrics tab, what do you expect to see in the data freshness graph?
Initial start point at 3 days, with a downward sloping line.
Initial start point at 3 days, with an upward sloping line.
Initial start point at 0 days, with an upward sloping line to 3 days and beyond.
Initial start point at 3 days, with a flat horizontal line.
✅ Correct answer:
Initial start point at 3 days, with an upward sloping line.
Why this is the correct behavior (deep, precise explanation)
This question is about data freshness = watermark age, specifically output watermark age in Google Cloud Dataflow.
What happens in your scenario
Pub/Sub has 3 days of backlog → initial watermark age = 3 days
The pipeline can read from Pub/Sub
The pipeline cannot write to Cloud Storage (sink permission failure)
Because the sink is blocked:
Output watermark cannot advance
No data is successfully completed end-to-end
What the freshness graph measures
The freshness graph shows how old the output watermark is compared to now
If the watermark is stuck, but wall-clock time keeps moving forward, then:
Watermark age increases over time
Resulting graph shape
Initial point: 3 days (existing backlog)
Over time: watermark does not move, but “now” does
Effect: freshness age keeps growing
📈 Upward sloping line starting at 3 days
Why the other options are wrong ❌
❌ 3 days with downward slope
Would require successful end-to-end processing
Impossible with a broken sink
❌ 0 days upward to 3 days
Incorrect — there is already a 3-day backlog at start
❌ 3 days flat line
Would only happen if “now” were frozen
In reality, time passes → age increases
3. Select all that apply - the BigQuery Jobs tab shows jobs from:
Streaming Extracts
Query jobs
Load jobs
Streaming Inserts
Correct selections (✔):
✅ Query jobs
✅ Load jobs
Explanation (BigQuery UI & job types — reverified)
In BigQuery, the Jobs tab shows asynchronous jobs that BigQuery executes. These include:
✅ Query jobs
SQL queries you run in the UI, CLI, or API
Always appear in the Jobs tab
✅ Load jobs
Batch loads from sources like:
Cloud Storage
Local files
These are explicit jobs and are tracked in the Jobs tab
Why the other options are NOT shown ❌
❌ Streaming Inserts
Streaming inserts use the streaming API
They are not jobs
Therefore, they do NOT appear in the Jobs tab
❌ Streaming Extracts
There is no such BigQuery job type
BigQuery supports extract jobs (exports), but not “streaming extracts”
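As a quick way to confirm this, here is a minimal sketch (assuming the google-cloud-bigquery client library and default credentials) that lists recent jobs and their types; only query, load, extract, and copy jobs show up, and streaming inserts never do:
```python
from google.cloud import bigquery

client = bigquery.Client()  # default project and credentials assumed

for job in client.list_jobs(max_results=10):
    # job_type is one of "query", "load", "extract", or "copy";
    # streaming inserts are API writes, not jobs, so they never appear here.
    print(job.job_id, job.job_type, job.state)
```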
4. Your batch job has failed, and when viewing the Diagnostic tab, you see the following insights: Which of the options below is the best one to resolve the issue?
Use a larger machine size
Increase the number of machines used
Switch Beam code from Java to Python
Increase persistent disk size
✅ Correct answer: Use a larger machine size
Why “Use a larger machine size” is correct (logical + Diagnostic-tab aligned)
The key signal in the question is:
“Your batch job has failed, and when viewing the Diagnostic tab, you see insights”
In Google Cloud Dataflow, when the Diagnostics tab explicitly provides insights, they usually point to resource pressure, most commonly:
CPU exhaustion
Memory pressure (OOM kills)
Workers repeatedly restarting
When Diagnostics recommend action, the most common and direct remediation shown is:
Increase worker machine type
Why this fixes the failure
Larger machine size increases:
CPU
Memory
Many batch failures happen due to:
Large GroupBy / Combine steps
In-memory pressure during shuffle
Worker OOM crashes
These failures persist even if you add more workers, because:
The problem is per-worker resource limits, not total cluster capacity
➡ Bigger machines directly address the failure cause.
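A minimal sketch of applying this remediation with the Python SDK; the project, region, bucket, and machine type values are illustrative placeholders (the Java equivalent flag is --workerMachineType):
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative options only; swap in your real project, region, and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project-id",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    machine_type="n1-highmem-8",  # larger per-worker CPU and memory
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.Create(["example"])   # stand-in for the real source
     | "Process" >> beam.Map(lambda x: x))  # stand-in for the heavy transforms
```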
Why the other options are NOT correct ❌
❌ Increase the number of machines used
Improves throughput
❌ Does not fix per-worker memory/CPU exhaustion
Diagnostics would not recommend this for hard failures
❌ Increase persistent disk size
Correct only if Diagnostics explicitly mention disk pressure
The question does not say disk exhaustion
Diagnostics usually distinguish disk vs memory issues clearly
❌ Switch Beam code from Java to Python
Irrelevant
No impact on worker failure modes
5. Which two of the following statements are true for failures while building the pipeline?
The failure is reproducible with the Direct Runner.
The error message is visible in Dataflow.
The failure can be caused by insufficient permissions granted to the controller service account.
The failure can be caused by incorrect input/output specifications.
Correct answers (select TWO): ✅
✔ The failure is reproducible with the Direct Runner.
✔ The failure can be caused by incorrect input/output specifications.
Why these two are correct (Beam / Dataflow–accurate)
Failures while building the pipeline happen before the job is submitted to the Dataflow service. These are pipeline construction–time errors, not execution-time errors.
✅ Reproducible with the Direct Runner
Pipeline build errors occur in the client SDK (Python/Java)
The same pipeline graph is built regardless of runner
Therefore, these errors:
Also fail with the Direct Runner
Are commonly caught locally before submission
✅ Caused by incorrect input/output specifications
Examples:
Invalid file paths
Wrong schema definitions
Invalid table specs
Incorrect Pub/Sub topic/subscription formats
These errors surface while constructing transforms
Very common build-time failures
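A minimal sketch of catching these errors locally: the pipeline below runs on the DirectRunner (the Python SDK default), so a bad input/output specification fails on your machine while the graph is being constructed, before any Dataflow job exists. The paths are placeholders.
```python
import apache_beam as beam

# No runner specified, so this uses the DirectRunner by default.
with beam.Pipeline() as p:
    (p
     # A malformed path or table spec in an I/O transform typically raises
     # here, during local pipeline construction, not on the Dataflow service.
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))
```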
Why the other options are NOT correct ❌
❌ The error message is visible in Dataflow.
Build-time failures happen before the job is created
No Dataflow job exists yet
Therefore:
❌ Nothing appears in the Dataflow UI
Errors appear only in local logs / console output
❌ The failure can be caused by insufficient permissions granted to the controller service account.
Service account permissions are checked:
When launching
Or during execution
This is not a pipeline build-time failure
6. Your Dataflow batch job fails after running for close to 5 hours. Which two of the following troubleshooting steps would you take to understand the root cause of the failure?
Log the failing elements and check the output using Cloud Logging.
Investigate the wall time of the individual steps in the job.
Check the Dataflow worker logs for warnings or errors related to work item failures.
Monitor the Data Freshness and System Latency graphs to understand the job performance.
✅ Correct answers (select TWO):
✔ Check the Dataflow worker logs for warnings or errors related to work item failures.
✔ Log the failing elements and check the output using Cloud Logging.
Why these two are correct (thoughtful, failure-focused reasoning)
The key phrase in the question is:
“fails after running for close to 5 hours”
This implies:
The pipeline built successfully
The job ran for a long time
The failure happened during execution, likely due to bad data or runtime exceptions
✅ Check the Dataflow worker logs
This is always the first and most reliable step.
Worker logs show:
Exceptions thrown by DoFns
Serialization errors
OOMs
Disk / shuffle failures
Repeated work-item retries
➡ Without worker logs, you cannot identify the real root cause.
✔ Correct
✅ Log the failing elements and inspect via Cloud Logging
When a batch job fails after hours, the most common causes are:
Malformed records
Unexpected nulls
Schema violations
Edge-case data values
To identify these:
You log elements (or bad records)
Inspect them in Cloud Logging
Fix the pipeline or data accordingly
This is a standard Dataflow debugging technique for batch failures.
✔ Correct
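A minimal sketch of this pattern (names and parsing logic are illustrative): wrap the per-element work in a try/except, log the bad record so it appears in Cloud Logging under the Dataflow worker logs, and route failures to a side output instead of crashing the job.
```python
import logging

import apache_beam as beam

class ParseRecord(beam.DoFn):
    FAILED = "failed"

    def process(self, element):
        try:
            yield int(element)  # stand-in for the real parsing logic
        except (ValueError, TypeError):
            # Visible in Cloud Logging under the Dataflow worker logs.
            logging.error("Failed to parse element: %r", element)
            yield beam.pvalue.TaggedOutput(self.FAILED, element)

# Usage: route good and bad records to separate outputs.
# results = lines | beam.ParDo(ParseRecord()).with_outputs(
#     ParseRecord.FAILED, main="parsed")
# parsed, failed = results.parsed, results[ParseRecord.FAILED]
```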
Why the other options are NOT correct ❌
❌ Investigate the wall time of the individual steps
Wall time analysis helps with performance tuning
❌ It does not explain why a job failed
A job can fail even if wall times look normal
➡ Useful for optimization, not root-cause failure analysis
❌ Monitor Data Freshness and System Latency graphs
These are streaming-only metrics
Batch jobs do not report data freshness or system latency
Completely irrelevant here
➡ Eliminate immediately
🧠 Correct mental model
Long-running batch job failure = runtime error or bad data
So you:
Check worker logs
Inspect failing elements
7. Which one of the following is not a consideration for designing performant pipelines in Dataflow?
Coders and decoders used in pipeline.
Logging
Filtering data early.
SDK used for developing the pipeline.
Correct answer: ✅ SDK used for developing the pipeline
The SDK used to develop the pipeline is not a performance consideration. The SDK is chosen based on other requirements, such as developer skills and the libraries you need.
Explanation (Dataflow performance design — carefully reasoned)
When designing performant pipelines in Google Cloud Dataflow, performance is driven by how the pipeline is structured and how data flows, not by the choice of SDK language.
Let’s evaluate each option:
✅ Coders and decoders used in the pipeline — Performance consideration
Inefficient coders increase:
CPU usage
Serialization/deserialization overhead
Choosing efficient, schema-aware coders improves performance
✔ Valid consideration
✅ Logging — Performance consideration
Excessive logging:
Slows down workers
Increases I/O and Cloud Logging costs
Especially harmful inside hot paths like DoFn.processElement
✔ Valid consideration
✅ Filtering data early — Performance consideration
Reduces:
Data volume
Shuffle size
Memory and disk usage
One of the most important performance best practices
✔ Valid consideration
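A minimal sketch of the filter-early practice (field names are hypothetical): because the filter runs before the keyed combine, the shuffle only sees the records that survive.
```python
import apache_beam as beam

def count_clicks_per_user(events):
    return (
        events  # PCollection of dicts; field names are hypothetical
        | "FilterEarly" >> beam.Filter(lambda e: e.get("country") == "DE")
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)  # shuffle only sees filtered data
    )
```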
❌ SDK used for developing the pipeline — NOT a performance consideration
Beam’s execution model is runner-optimized
Dataflow performance depends on:
Pipeline graph
Data volume
Resource usage
Not on whether you use:
Java
Python
Go
Differences are negligible compared to pipeline design
❌ Not a key performance factor
8. Select the options we can use to mitigate data skew in Dataflow pipelines.
Use Dataflow shuffle for batch pipelines and Dataflow streaming option for streaming pipelines.
Use APIs like “withFanout” or “withHotKeyFanout”.
Add composite windows and triggers.
Add more worker machines.
✅ Correct answer:
Use APIs like withFanout or withHotKeyFanout.
➡️ This is the ONLY correct option.
Why this is the correct (and only) skew-mitigation technique
✅ Use APIs like withFanout or withHotKeyFanout ✔
Data skew is caused by hot keys (a few keys receiving most of the data).
withFanout / withHotKeyFanout:
Explicitly splits hot keys into multiple shards
Processes them in parallel
Re-combines results later
This is the canonical and recommended Beam/Dataflow solution for skew.
✔ Directly addresses the root cause of skew
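A minimal sketch in the Python SDK, where the equivalent methods are with_hot_key_fanout and with_fanout (the Java names in the option are withHotKeyFanout and withFanout); the fanout factor of 16 is an arbitrary illustration.
```python
import apache_beam as beam

def sum_per_key(keyed_values):
    return (
        keyed_values  # PCollection of (key, value) pairs with a few hot keys
        | "SumPerKey" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
    )

# A global combine can be fanned out the same way:
# total = values | beam.CombineGlobally(sum).with_fanout(16)
```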
Why the other options are NOT correct ❌
❌ Use Dataflow shuffle for batch pipelines and Streaming Engine for streaming
These improve scalability and resource management
They do not fix hot keys
A single hot key still bottlenecks execution
➡ Helps performance, not skew
❌ Add composite windows and triggers
Windows and triggers control when results are emitted
They do nothing to redistribute skewed keys
➡ Completely unrelated to skew mitigation
❌ Add more worker machines
This is a very common misconception
If one key dominates:
Only one worker processes it
Other workers sit idle
More workers ≠ more parallelism for hot keys
➡ Does not mitigate skew
9. When should we avoid fusion in a Dataflow pipeline?
Only in specific scenarios, like if your pipeline involves massive fanouts.
Always
Never
Correct answer: ✅
Only in specific scenarios, like if your pipeline involves massive fanouts.
Why this is correct (Dataflow execution–accurate)
In Google Cloud Dataflow, fusion is an optimization where multiple transforms are combined into a single execution stage to reduce overhead and improve performance.
👉 Fusion is generally beneficial and enabled by default.
You should avoid or break fusion only in specific scenarios where it causes problems.
When should you avoid fusion? (key scenarios)
Massive fanouts, where a small input expands into a huge intermediate collection
If the fan-out stays fused with the downstream steps, parallelism remains limited by the small input
Workers come under memory pressure and downstream steps become bottlenecks
Hot keys or skewed workloads
When:
One slow transform blocks faster downstream transforms
You need better parallelism between stages
Common techniques to break fusion:
Reshuffle
Writing/reading from an external system
Using GroupByKey
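A minimal sketch of breaking fusion after a massive fan-out using Reshuffle from the list above; the fan-out function and downstream work are stand-ins.
```python
import apache_beam as beam

def fan_out(element):
    # Hypothetical fan-out: each input element yields many outputs.
    for i in range(10_000):
        yield (element, i)

def expand_and_process(seeds):
    return (
        seeds                                      # small input PCollection
        | "FanOut" >> beam.FlatMap(fan_out)        # massive fan-out
        | "BreakFusion" >> beam.Reshuffle()        # prevents fusing FanOut with HeavyWork
        | "HeavyWork" >> beam.Map(lambda kv: kv)   # stand-in for expensive processing
    )
```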
Why the other options are incorrect ❌
❌ Always
Fusion is a core optimization
Disabling it everywhere would severely hurt performance
❌ Never
Fusion can cause performance issues in certain pipelines
Blindly allowing fusion can lead to:
Worker OOMs
Poor parallelism
Back pressure
10. When draining a streaming pipeline, what should you expect to happen?
A snapshot is taken of the source, then ingestion is stopped. Windows are closed and processing of in-flight elements will be allowed to complete.
Any open windows will wait for new data so that aggregations are completed. Then the pipeline will be canceled.
Both processing and ingestion stop immediately.
Ingestion stops immediately, windows are closed, and processing of in-flight elements will be allowed to complete.
Correct answer: ✅
Ingestion stops immediately, windows are closed, and processing of in-flight elements will be allowed to complete.
Why this is correct (Dataflow drain semantics — reverified)
In Google Cloud Dataflow, draining a streaming pipeline is a graceful shutdown:
Ingestion stops immediately
No new data is read from the source (e.g., Pub/Sub)
Open windows are closed
Triggers fire as configured
Final results are emitted
In-flight elements continue processing
Any data already read is fully processed and written to sinks
The job ends cleanly, preserving correctness
This behavior is specifically designed to avoid data loss while allowing a controlled stop.
Why the other options are incorrect ❌
❌ “A snapshot is taken of the source…”
Dataflow does not take a source snapshot during drain
That description is closer to snapshot/checkpointing semantics than to draining
❌ “Any open windows will wait for new data…”
Ingestion is stopped
No new data arrives
Windows do not wait for future data
❌ “Both processing and ingestion stop immediately.”
That describes cancel, not drain
Cancel can cause data loss
11. Using anonymous subclasses in your ParDos is an anti-pattern because:
ParDos are required to contain a concrete subclass of a DoFn.
Anonymous subclasses are bad for the performance of your pipeline.
Anonymous subclasses are harder to test than concrete subclasses.
Correct answer: ✅
Anonymous subclasses are harder to test than concrete subclasses.
Anonymous subclasses are more difficult to test because they lack identifiability and reusability and are tightly coupled to the surrounding pipeline code.
Explanation (Beam / Dataflow best practices — reverified)
In Apache Beam, using anonymous subclasses of DoFn inside ParDo is considered an anti-pattern primarily for maintainability and testability reasons.
✅ Why this is the correct reason
Anonymous DoFns:
Cannot be easily instantiated in isolation
Are difficult to unit test with TestPipeline and DoFnTester
Make reuse and mocking harder
Concrete (named) DoFn classes:
Are reusable
Are easy to test independently
Improve readability and maintainability
This is the main reason Beam guidance discourages anonymous subclasses.
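A minimal Python-SDK sketch of the testability argument (the same reasoning applies to Java's anonymous inner classes): a named DoFn can be imported and exercised on its own with TestPipeline, which is impractical when the logic is buried in an anonymous or inline class.
```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

class Normalize(beam.DoFn):  # named, importable, reusable across pipelines
    def process(self, element):
        yield element.strip().lower()

def test_normalize():
    with TestPipeline() as p:
        output = (p
                  | beam.Create(["  Foo", "BAR "])
                  | beam.ParDo(Normalize()))
        assert_that(output, equal_to(["foo", "bar"]))
```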
Why the other options are incorrect ❌
❌ “ParDos are required to contain a concrete subclass of a DoFn.”
False. Beam allows anonymous subclasses.
It’s discouraged, not forbidden.
❌ “Anonymous subclasses are bad for the performance of your pipeline.”
False. There is no inherent performance penalty.
Execution performance is determined by:
Transform logic
Data volume
Runner optimizations
Not by whether the DoFn is anonymous or named.