1. You have a Dataflow pipeline in production. For certain data, the system seems to be stuck longer than usual. This is causing delays in the pipeline execution. You want to reliably and proactively track and resolve such issues. What should you do?
- Review the Dataflow logs regularly.
- Set up alerts on Cloud Monitoring based on system lag.
- Review the Cloud Monitoring dashboard regularly.
- Set up alerts with Cloud Run functions code that reviews the audit logs regularly.
✅ Correct Answer: Set up alerts on Cloud Monitoring based on system lag.
Why this is the right approach
Your problem is a Dataflow pipeline that gets stuck for certain data, causing delays, and you want to:
Proactively detect issues
Reliably track pipeline health
Act before delays become severe
Cloud Monitoring alerts on system lag are designed for exactly this.
What "system lag" tells you
How far behind the pipeline is compared to real time
Whether data is accumulating faster than it's being processed
Early indicators of:
Hot keys
Backpressure
Insufficient resources
Stalled stages
By setting alerts (for example, when lag exceeds a threshold for N minutes), you can automatically notify on-call engineers and take action.
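As an illustration, here is a minimal alerting-policy sketch using the google-cloud-monitoring Python client. The project ID, threshold, duration, and notification channel are placeholder assumptions; you would tune them to your pipeline.

```python
# Sketch: alert when Dataflow system lag exceeds 5 minutes for 10 minutes straight.
# PROJECT_ID and CHANNEL_ID are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow system lag too high",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="system_lag > 300s for 10 min",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="dataflow.googleapis.com/job/system_lag" '
                    'AND resource.type="dataflow_job"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=300,  # seconds of lag
                duration=duration_pb2.Duration(seconds=600),
            ),
        )
    ],
    notification_channels=[
        f"projects/{PROJECT_ID}/notificationChannels/CHANNEL_ID"  # placeholder
    ],
)

created = client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
print("Created alert policy:", created.name)
```

The same policy can be built in the Cloud Monitoring UI or with gcloud; the key idea is alerting on the job/system_lag metric instead of waiting for someone to notice the delay.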
Why the other options are weaker
❌ Review Dataflow logs regularly
Reactive and manual; issues are easy to miss
Not proactive or reliable at scale
❌ Review the Cloud Monitoring dashboard regularly
Still manual
Requires someone to be constantly watching
❌ Set up alerts using Cloud Run functions and audit logs
Overengineered
Audit logs are not meant for pipeline health or performance tracking
Exam rule of thumb 🧠
Streaming delays or "stuck" pipelines → Monitor and alert on system lag
2. When running Dataflow jobs, you see this error in the logs: "A hot key HOT_KEY_NAME was detected in…". You need to resolve this issue and make the workload performant. What should you do?
- Increase the data with the hot key.
- Ensure that your data is evenly distributed.
- Disable Dataflow shuffle.
- Add more compute instances for processing.
✅ Correct Answer: Ensure that your data is evenly distributed.
Why this is the right solution
The error:
"A hot key HOT_KEY_NAME was detected …"
means that too much data is concentrated on a single key (or a small set of keys). In Dataflow (Apache Beam), this causes:
One worker doing most of the work
Other workers sitting idle
Pipeline stages appearing stuck or slow
Poor scalability and performance
This problem is called data skew, and the correct way to fix it is to rebalance the data distribution.
What "ensure data is evenly distributed" means in practice
Common techniques include:
Redesigning keys to avoid skew
Adding randomness or salting to keys
Using Beam features like:
withFanout()
withHotKeyFanout()
Pre-aggregating before shuffles
These approaches spread the workload across workers, which directly resolves hot key issues.
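For example, here is a minimal Apache Beam Python sketch of the fanout approach (with_hot_key_fanout is the Python SDK counterpart of withHotKeyFanout); the element shape and fanout value are illustrative assumptions.

```python
# Sketch: pre-combine a skewed key on many workers before the final merge,
# instead of funnelling every value for that key through a single worker.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateEvents" >> beam.Create([("store-1", 5.0), ("store-1", 7.5), ("store-2", 3.0)])
        | "SumPerStore" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        | "Print" >> beam.Map(print)
    )
```

Salting keys (appending a random suffix, aggregating, then re-aggregating on the original key) achieves the same spreading effect when the aggregation is not a simple combine.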
Why the other options are incorrect
❌ Increase the data with the hot key
Makes the problem worse
❌ Disable Dataflow shuffle
Not a fix for hot keys
Shuffle is required for grouping operations
❌ Add more compute instances
Hot keys limit parallelism
More workers won't help if one key bottlenecks everything
Exam rule of thumb 🧠
Hot key detected → Fix data skew, not infrastructure
3. You have a team of data analysts that run queries interactively on BigQuery during work hours. You also have thousands of report generation queries that run simultaneously. You often see an error: Exceeded rate limits: too many concurrent queries for this project_and_region. How would you resolve this issue?
- Create a yearly reservation of BigQuery slots.
- Run the report generation queries in batch mode.
- Run all queries in interactive mode.
- Create a view to run the queries.
✅ Correct Answer: Run the report generation queries in batch mode.
Why this is the correct solution
Your situation:
Data analysts run interactive queries during work hours
Thousands of report-generation queries run simultaneously
You see the error: "Exceeded rate limits: too many concurrent queries for this project and region"
This error is specifically about interactive query concurrency limits in BigQuery.
How batch queries solve this
Batch queries:
Do not count against interactive concurrency limits
Are queued and executed when resources are available
Are ideal for:
Scheduled reports
Large numbers of non-urgent queries
Background processing
By moving report generation queries to batch mode, you:
Free up interactive slots for analysts
Eliminate concurrency errors
Maintain performance for real-time analysis
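As a sketch, moving a report query to batch mode is a one-line job-configuration change in the BigQuery Python client; the SQL and table below are placeholders.

```python
# Sketch: submit a report query with BATCH priority so it queues for free slots
# instead of counting against the interactive concurrency limit.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

job = client.query(
    "SELECT store_id, SUM(amount) AS total FROM `my-project.sales.orders` GROUP BY store_id",
    job_config=job_config,
)

rows = job.result()  # blocks until BigQuery schedules and completes the batch job
```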
Why the other options are incorrect
❌ Create a yearly reservation of BigQuery slots
Helps with compute capacity
Does not remove interactive concurrency limits
❌ Run all queries in interactive mode
Makes the problem worse
❌ Create a view to run the queries
Views do not reduce concurrency
Queries still execute underneath
Exam rule of thumb 🧠
Too many concurrent queries → Move non-urgent workloads to batch mode
4. You are running a Dataflow pipeline in production. The input data for this pipeline is occasionally inconsistent. Separately from processing the valid data, you want to efficiently capture the erroneous input data for analysis. What should you do?
- Read the data once, and split it into two pipelines, one to output valid data and another to output erroneous data.
- Check for the erroneous data in the logs.
- Create a side output for the erroneous data.
- Re-read the input data and create separate outputs for valid and erroneous data.
✅ Correct Answer: Create a side output for the erroneous data.
Why this is the best approach
"Using side outputs can collect the erroneous data efficiently and is a recommended approach."
In Apache Beam / Dataflow, when you want to:
Continue processing valid data
Separately capture invalid or erroneous records
Do this efficiently in a single pass
The recommended pattern is to use side outputs (also called multiple outputs) from a ParDo.
How side outputs solve your problem
With side outputs, you can:
Read the input once
Validate each record
Route records to:
Main output β valid data
Side output β invalid / malformed data
This is:
Efficient (no re-reading)
Scalable
Cleanly separated
Easy to monitor and analyze
Example conceptually:
ParDo
├─ main output → valid records
└─ side output → invalid records
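A minimal Beam Python sketch of this pattern is shown below; the JSON parsing logic and the downstream handling steps are illustrative assumptions.

```python
# Sketch: one ParDo, two outputs: valid records on the main output,
# unparseable input on a tagged side output.
import json
import apache_beam as beam
from apache_beam import pvalue


class ParseRecord(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                     # main output: valid records
        except ValueError:
            yield pvalue.TaggedOutput("errors", line)  # side output: erroneous input


with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | "Read" >> beam.Create(['{"id": 1}', "not-json"])
        | "Parse" >> beam.ParDo(ParseRecord()).with_outputs("errors", main="valid")
    )
    results.valid | "HandleValid" >> beam.Map(print)
    results.errors | "HandleErrors" >> beam.Map(lambda bad: print("bad record:", bad))
```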
Why the other options are incorrect
❌ Read the data once and split into two pipelines
Not idiomatic Beam
More complex than necessary
Side outputs are the intended mechanism
❌ Check for erroneous data in logs
Logs are for debugging, not data capture
Not reliable or structured for analysis
❌ Re-read the input data
Inefficient
Increases cost and latency
Violates streaming/batch best practices
Exam rule of thumb 🧠
Valid + invalid data handling in Dataflow → Side outputs
5. Multiple analysts need to prepare reports on Monday mornings, which causes heavy BigQuery utilization. You want to take a cost-effective approach to managing this demand. What should you do?
- Use BigQuery Enterprise edition with a one-year commitment.
- Use Flex Slots.
- Use BigQuery Enterprise Plus edition with a three-year commitment.
- Use on-demand pricing.
✅ Correct Answer: Use Flex Slots.
Why Flex Slots are the best choice
Your situation:
Multiple analysts
Heavy BigQuery usage only on Monday mornings
You want a cost-effective way to handle short, predictable spikes
Flex Slots are specifically designed for this scenario.
What Flex Slots give you
Purchase BigQuery slots for short durations (minimum 60 seconds)
Scale up during peak demand (Monday mornings)
Scale down immediately after demand drops
Pay only for the time you actually need extra capacity
This avoids paying for unused capacity during the rest of the week.
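Here is a minimal sketch of the purchase-and-release cycle with the BigQuery Reservation API Python client. The project, location, and slot count are placeholders, and the reservation/assignment steps that route the purchased slots to specific projects are omitted.

```python
# Sketch: buy Flex Slots before the Monday-morning spike, release them afterwards.
# Slot purchases are made in multiples of 100.
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/my-admin-project/locations/US"  # placeholder

commitment = client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        slot_count=500,
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)
print("Purchased:", commitment.name)

# ... Monday-morning reporting runs ...

# Flex commitments can be cancelled after 60 seconds; delete to stop the charges.
client.delete_capacity_commitment(name=commitment.name)
```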
Why the other options are not appropriate
❌ BigQuery Enterprise edition with a one-year commitment
Long-term commitment
Wasteful if demand spikes only once a week
❌ BigQuery Enterprise Plus with a three-year commitment
Very expensive
Designed for always-on, mission-critical workloads
❌ On-demand pricing
No control over concurrency or performance
Can become expensive and unpredictable during heavy usage
Exam rule of thumb 🧠
Short, predictable spikes in BigQuery usage → Flex Slots
6. You run a Cloud SQL instance for a business that requires that the database is accessible for transactions. You need to ensure minimal downtime for database transactions. What should you do?
- Configure replication.
- Configure backups and increase the number of backups.
- Configure backups.
- Configure high availability.
✅ Correct Answer: Configure high availability.
Why this is the right choice
Your requirement is minimal downtime for database transactions on a Cloud SQL instance.
High Availability (HA) is specifically designed to ensure:
Automatic failover
Minimal service interruption
Continuous transaction availability
What Cloud SQL High Availability provides
A primary instance and a standby instance in a different zone
Automatic failover if the primary instance becomes unavailable
No manual intervention required
Best option for transactional workloads (OLTP) that must stay online
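As a hedged sketch, HA can be enabled on an existing instance by setting availabilityType to REGIONAL through the Cloud SQL Admin API; the project and instance names below are placeholders, and the equivalent gcloud command is `gcloud sql instances patch INSTANCE --availability-type=REGIONAL`.

```python
# Sketch: patch an existing Cloud SQL instance to regional (high) availability.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

response = (
    sqladmin.instances()
    .patch(
        project="my-project",   # placeholder
        instance="orders-db",   # placeholder
        body={"settings": {"availabilityType": "REGIONAL"}},  # REGIONAL enables HA with a standby
    )
    .execute()
)
print("Started operation:", response["name"])
```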
Why the other options are insufficient
❌ Configure replication
Read replicas improve read scalability
Failover is manual
Not designed to minimize downtime
❌ Configure backups / increase backups
Used for data recovery
Do not help with availability or uptime
Exam rule of thumb 🧠
Minimal downtime for Cloud SQL transactions → High Availability
7. Cymbal Retail processes streaming data on Dataflow with Pub/Sub as a source. You need to plan for disaster recovery and protect against zonal failures. What should you do?
- Create Dataflow jobs from templates.
- Enable vertical autoscaling.
- Take Dataflow snapshots periodically.
- Enable Dataflow shuffle.
✅ Correct Answer: Take Dataflow snapshots periodically.
Why this is the correct solution
Your requirements are:
Streaming pipeline (Dataflow + Pub/Sub)
Disaster recovery planning
Protection against zonal failures
Dataflow snapshots are the official and recommended DR mechanism for streaming Dataflow jobs.
What Dataflow snapshots provide
Capture the entire state of a streaming pipeline:
In-flight data
Windowing state
Timers
Checkpoints
Allow you to restart the job from the snapshot
Enable recovery with minimal data loss
Critical for zonal or worker failures
If a zone goes down, you can restore the pipeline in another zone/region using the snapshot.
Why the other options are incorrect
❌ Create Dataflow jobs from templates
Helps with deployment
Does not preserve streaming state or in-flight data
❌ Enable vertical autoscaling
Helps performance
Not related to disaster recovery
❌ Enable Dataflow shuffle
Improves performance and resource usage
Not a DR mechanism
Exam rule of thumb 🧠
Streaming Dataflow + disaster recovery → Snapshots
8. A colleague at Cymbal Retail asks you about the configuration of Dataproc autoscaling for a project. In which situation does Google recommend enabling autoscaling?
- When you want to scale out single-job clusters.
- When you want to down-scale idle clusters to minimum size.
- When there are different size workloads on the cluster.
- When you want to scale on-cluster Hadoop Distributed File System (HDFS).
✅ Correct Answer: When you want to scale out single-job clusters.
Why the other options are incorrect
❌ When you want to down-scale idle clusters to minimum size
Google recommends deleting idle clusters rather than autoscaling them down to a minimum size.
❌ When there are different size workloads on the cluster
Long-running jobs from other workloads interfere with autoscaling and delay downscaling.
❌ When you want to scale on-cluster HDFS
Autoscaling on-cluster HDFS is explicitly discouraged because scaling workers requires rebalancing HDFS data.
Why this is the correct answer
Dataproc autoscaling is recommended when:
The cluster is running a single job
The job has variable or unpredictable resource needs
Autoscaling can safely:
Scale workers up during heavy stages
Scale workers down when stages complete
There is no interference from other jobs
This is exactly what Google means by autoscaling within a single workload.
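As an illustration, here is a hedged sketch of an autoscaling policy suited to a single-job cluster, created with the Dataproc Python client; the project, region, sizing, and factor values are placeholder assumptions.

```python
# Sketch: create an autoscaling policy for a single-job cluster, then reference
# it (via autoscaling_config.policy_uri) when creating that cluster.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="single-job-autoscaling",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            scale_up_factor=1.0,    # add workers aggressively during heavy stages
            scale_down_factor=1.0,  # release workers as soon as stages complete
            graceful_decommission_timeout=duration_pb2.Duration(seconds=600),
        )
    ),
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2,
        max_instances=20,
    ),
)

created = client.create_autoscaling_policy(
    parent=f"projects/my-project/regions/{region}", policy=policy
)
print("Created policy:", created.name)
```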
9. You need to create repeatable data processing tasks by using Cloud Composer. You need to follow best practices and recommended approaches. What should you do?
- Combine multiple functionalities in a single task execution.
- Update data with INSERT statements during the task run.
- Use current time with the now() function for computation.
- Write each task to be responsible for one operation.
✅ Correct Answer: Write each task to be responsible for one operation.
Why this is the best practice for Cloud Composer
Cloud Composer (managed Apache Airflow) follows well-established workflow design principles. To create repeatable, reliable, and maintainable data processing tasks, Google recommends:
Designing tasks that do exactly one thing.
Benefits of single-responsibility tasks
Repeatable and idempotent executions
Easier debugging and retries
Better reusability
Clear dependency management
Improved observability and failure isolation
This aligns with both:
Airflow best practices
Software engineering principles (Single Responsibility Principle)
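As a sketch of the pattern (Airflow 2.x syntax; the operators, scripts, and schedule are illustrative assumptions, not a prescribed pipeline), each task below does exactly one thing, and the templated logical date {{ ds }} is used instead of now() so retries and backfills stay repeatable.

```python
# Sketch: small, single-purpose tasks chained into one repeatable DAG.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_report",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task = one operation; each step can be retried or rerun on its own.
    extract = BashOperator(
        task_id="extract_orders",
        bash_command="python extract_orders.py --date {{ ds }}",    # hypothetical script
    )
    transform = BashOperator(
        task_id="transform_orders",
        bash_command="python transform_orders.py --date {{ ds }}",  # hypothetical script
    )
    load = BashOperator(
        task_id="load_report",
        bash_command="python load_report.py --date {{ ds }}",       # hypothetical script
    )

    extract >> transform >> load
```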
Why the other options are incorrect
❌ Combine multiple functionalities in a single task
Hard to debug
Failures are difficult to isolate
Breaks Airflow best practices
❌ Update data with INSERT statements during the task run
Plain INSERTs are not idempotent; a retry or backfill can write duplicate rows
Prefer idempotent writes such as partition overwrites or MERGE
❌ Use current time with the now() function
Leads to non-deterministic workflows
Breaks reproducibility and backfills; use the run's logical date instead
Exam rule of thumb 🧠
Cloud Composer / Airflow → Small, single-purpose tasks
10. You need to design a Dataproc cluster to run multiple small jobs. Many jobs (but not all) are of high priority. What should you do?
- Reuse the same cluster to run all jobs in parallel.
- Use ephemeral clusters.
- Reuse the same cluster and run each job in sequence.
- Use cluster autoscaling.
✅ Correct Answer: Use ephemeral clusters.
Why ephemeral clusters are correct
"Jobs can use ephemeral clusters to quickly run the job and then deallocate the resources after use. Multiple jobs can be run in parallel without interfering with each other."
Ephemeral (per-job) clusters are the recommended pattern when:
You have many small jobs
Jobs have different priorities
You want:
Job isolation
Predictable performance
No interference between jobs
Clean environments
You want to avoid autoscaling conflicts
Each job:
Spins up its own cluster
Runs independently
Shuts down after completion
This ensures:
High-priority jobs are never delayed
No contention between workloads
Better reliability and reproducibility
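Here is a hedged sketch of the ephemeral-cluster lifecycle with the Dataproc Python client; names, region, machine types, and the PySpark file URI are placeholders, and in practice this create/submit/delete flow is often wrapped in a workflow template or a Composer DAG.

```python
# Sketch: create a per-job cluster, run the job, delete the cluster.
from google.cloud import dataproc_v1

project, region, cluster_name = "my-project", "us-central1", "report-job-cluster"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a small cluster dedicated to this one job.
cluster = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2. Submit the job and wait for it to finish.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/report.py"},  # placeholder
}
job_client.submit_job_as_operation(
    request={"project_id": project, "region": region, "job": job}
).result()

# 3. Tear the cluster down so the resources are released immediately.
cluster_client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": cluster_name}
).result()
```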
Why the other options are incorrect
❌ Reuse the same cluster to run jobs in parallel
Causes resource contention
High-priority jobs can be starved
❌ Reuse the same cluster and run jobs sequentially
Wastes time
High-priority jobs may wait
❌ Use cluster autoscaling
Explicitly not recommended for multiple concurrent jobs
Scaling decisions become unstable
Exam-safe rule 🧠
Dataproc + many small jobs + mixed priority → Ephemeral clusters
11. Cymbal Retail uses Google Cloud and has automated repeatable data processing workloads to achieve reliability and cost efficiency. You want out-of-the-box metric collection dashboards and the ability to generate alerts when specific conditions are met. What tool can you use?
- Cloud Data Loss Prevention (Cloud DLP)
- Data Catalog
- Cloud Monitoring
- Cloud Composer
✅ Correct Answer: Cloud Monitoring
Why Cloud Monitoring is the right tool
Your requirements are:
Automated, repeatable data processing workloads
Out-of-the-box metrics and dashboards
Ability to generate alerts when conditions are met
A Google Cloud-native solution
Cloud Monitoring (formerly Stackdriver Monitoring) is specifically designed for this.
What Cloud Monitoring provides
Prebuilt dashboards for Google Cloud services (BigQuery, Dataflow, Dataproc, etc.)
Automatic metric collection
Custom alerting policies (thresholds, duration, conditions)
Integration with Cloud Logging
Supports reliability and cost-efficiency goals
This matches your needs exactly, with no custom tooling required.
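For example, the metrics behind those dashboards are also queryable programmatically; below is a minimal sketch that reads the last hour of the Dataflow system lag metric with the Cloud Monitoring Python client (the project ID is a placeholder).

```python
# Sketch: list the last hour of the built-in Dataflow system_lag metric.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder
        "filter": 'metric.type="dataflow.googleapis.com/job/system_lag"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    job_name = ts.resource.labels.get("job_name", "unknown")
    print(job_name, "data points in the last hour:", len(ts.points))
```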
Why the other options are incorrect
❌ Cloud DLP
Used for detecting and redacting sensitive data
Not for monitoring workloads or alerting
❌ Data Catalog
Metadata discovery and governance tool
No dashboards or alerting
❌ Cloud Composer
Workflow orchestration (Airflow)
Does not provide monitoring dashboards or alerting
Exam rule of thumb 🧠
Metrics + dashboards + alerts → Cloud Monitoring
12. Your company recently migrated to Google Cloud and started using BigQuery. The team members don't know how much querying they are going to do, and they need to be efficient with their spend. As a Professional Data Engineer, what pricing model would you recommend?
- Create a pool of resources using BigQuery Reservations.
- Use BigQuery's on-demand pricing model.
- Use IAM service to block access to BigQuery till the team figures out how much querying they are going to do.
- Decide how much compute capacity you need and reserve it using capacity pricing.
✅ Correct Answer: Use BigQuery's on-demand pricing model.
Why this is the best recommendation
Your situation:
Recently migrated to BigQuery
Uncertain and unpredictable query volume
Need to be cost-efficient
Want to avoid overcommitting too early
BigQuery on-demand pricing is specifically designed for this scenario.
Benefits of on-demand pricing
Pay only for data scanned by queries
No upfront commitment
No need to estimate capacity
Ideal during:
Early adoption
Exploration phases
Unpredictable workloads
This lets the team learn usage patterns before committing to reservations.
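Because on-demand pricing bills per byte scanned, the team can also estimate a query's cost before running it; below is a minimal dry-run sketch with the BigQuery Python client (the SQL and table are placeholders).

```python
# Sketch: a dry run reports how many bytes a query would scan without running it.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT customer_id, SUM(amount) FROM `my-project.sales.orders` GROUP BY customer_id",
    job_config=job_config,
)

gib_scanned = job.total_bytes_processed / 1024**3
print(f"This query would scan {gib_scanned:.2f} GiB")  # on-demand cost scales with this
```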
Why the other options are incorrect
❌ Create a pool of resources using BigQuery Reservations
Requires predictable workload
Risk of paying for unused capacity
❌ Decide compute capacity and reserve it (capacity pricing)
Same issue: premature commitment
Better suited once usage stabilizes
❌ Block access to BigQuery using IAM
Prevents productivity
Not a cost-management strategy
Exam rule of thumb 🧠
Unknown or unpredictable BigQuery usage → On-demand pricing
Predictable, steady workloads → Reservations / capacity pricing