1. You have a Dataflow pipeline in production. For certain data, the system seems to be stuck longer than usual. This is causing delays in the pipeline execution. You want to reliably and proactively track and resolve such issues. What should you do?
- Review the Dataflow logs regularly.
- Set up alerts on Cloud Monitoring based on system lag.
- Review the Cloud Monitoring dashboard regularly.
- Set up alerts with Cloud Run functions code that reviews the audit logs regularly.
✅ Correct Answer: Set up alerts on Cloud Monitoring based on system lag.
Why this is the right approach
Your problem is a Dataflow pipeline that gets stuck for certain data, causing delays, and you want to:
Proactively detect issues
Reliably track pipeline health
Act before delays become severe
Cloud Monitoring alerts on system lag are designed for exactly this.
What "system lag" tells you
How far behind the pipeline is compared to real time
Whether data is accumulating faster than it's being processed
Early indicators of:
Hot keys
Backpressure
Insufficient resources
Stalled stages
By setting alerts (for example, when lag exceeds a threshold for N minutes), you can automatically notify on-call engineers and take action.
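As an illustration, here is a minimal alerting-policy sketch using the google-cloud-monitoring Python client. The project ID, threshold, duration, and notification channel are placeholder assumptions; you would tune them to your pipeline.

```python
# Sketch: alert when Dataflow system lag exceeds 5 minutes for 10 minutes straight.
# PROJECT_ID and CHANNEL_ID are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow system lag too high",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="system_lag > 300s for 10 min",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="dataflow.googleapis.com/job/system_lag" '
                    'AND resource.type="dataflow_job"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=300,  # seconds of lag
                duration=duration_pb2.Duration(seconds=600),
            ),
        )
    ],
    notification_channels=[
        f"projects/{PROJECT_ID}/notificationChannels/CHANNEL_ID"  # placeholder
    ],
)

created = client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
print("Created alert policy:", created.name)
```

The same policy can be built in the Cloud Monitoring UI or with gcloud; the key idea is alerting on the job/system_lag metric instead of waiting for someone to notice the delay.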
Why the other options are weaker
❌ Review Dataflow logs regularly
Reactive and manual; issues are easy to miss
Not proactive or reliable at scale
❌ Review the Cloud Monitoring dashboard regularly
Still manual
Requires someone to be constantly watching
❌ Set up alerts using Cloud Run functions and audit logs
Overengineered
Audit logs are not meant for pipeline health or performance tracking
Exam rule of thumb 🧠
Streaming delays or "stuck" pipelines → Monitor and alert on system lag
2. When running Dataflow jobs, you see this error in the logs: "A hot key HOT_KEY_NAME was detected in…". You need to resolve this issue and make the workload performant. What should you do?
- Increase the data with the hot key.
- Ensure that your data is evenly distributed.
- Disable Dataflow shuffle.
- Add more compute instances for processing.
✅ Correct Answer: Ensure that your data is evenly distributed.
Why this is the right solution
The error:
"A hot key HOT_KEY_NAME was detected …"
means that too much data is concentrated on a single key (or a small set of keys). In Dataflow (Apache Beam), this causes:
One worker doing most of the work
Other workers sitting idle
Pipeline stages appearing stuck or slow
Poor scalability and performance
This problem is called data skew, and the correct way to fix it is to rebalance the data distribution.
What "ensure data is evenly distributed" means in practice
Common techniques include:
Redesigning keys to avoid skew
Adding randomness or salting to keys
Using Beam features like:
withFanout()
withHotKeyFanout()
Pre-aggregating before shuffles
These approaches spread the workload across workers, which directly resolves hot key issues.
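For example, here is a minimal Apache Beam Python sketch of the fanout approach (with_hot_key_fanout is the Python SDK counterpart of withHotKeyFanout); the element shape and fanout value are illustrative assumptions.

```python
# Sketch: pre-combine a skewed key on many workers before the final merge,
# instead of funnelling every value for that key through a single worker.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateEvents" >> beam.Create([("store-1", 5.0), ("store-1", 7.5), ("store-2", 3.0)])
        | "SumPerStore" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        | "Print" >> beam.Map(print)
    )
```

Salting keys (appending a random suffix, aggregating, then re-aggregating on the original key) achieves the same spreading effect when the aggregation is not a simple combine.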
Why the other options are incorrect
❌ Increase the data with the hot key
Makes the problem worse
❌ Disable Dataflow shuffle
Not a fix for hot keys
Shuffle is required for grouping operations
❌ Add more compute instances
Hot keys limit parallelism
More workers won't help if one key bottlenecks everything
Exam rule of thumb 🧠
Hot key detected → Fix data skew, not infrastructure
3. You have a team of data analysts that run queries interactively on BigQuery during work hours. You also have thousands of report generation queries that run simultaneously. You often see an error: Exceeded rate limits: too many concurrent queries for this project_and_region. How would you resolve this issue?
- Create a yearly reservation of BigQuery slots.
- Run the report generation queries in batch mode.
- Run all queries in interactive mode.
- Create a view to run the queries.
✅ Correct Answer: Run the report generation queries in batch mode.
Why this is the correct solution
Your situation:
Data analysts run interactive queries during work hours
Thousands of report-generation queries run simultaneously
You see the error: "Exceeded rate limits: too many concurrent queries for this project and region"
This error is specifically about interactive query concurrency limits in BigQuery.
How batch queries solve this
Batch queries:
Do not count against interactive concurrency limits
Are queued and executed when resources are available
Are ideal for:
Scheduled reports
Large numbers of non-urgent queries
Background processing
By moving report generation queries to batch mode, you:
Free up interactive slots for analysts
Eliminate concurrency errors
Maintain performance for real-time analysis
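As a sketch, moving a report query to batch mode is a one-line job-configuration change in the BigQuery Python client; the SQL and table below are placeholders.

```python
# Sketch: submit a report query with BATCH priority so it queues for free slots
# instead of counting against the interactive concurrency limit.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

job = client.query(
    "SELECT store_id, SUM(amount) AS total FROM `my-project.sales.orders` GROUP BY store_id",
    job_config=job_config,
)

rows = job.result()  # blocks until BigQuery schedules and completes the batch job
```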
Why the other options are incorrect
❌ Create a yearly reservation of BigQuery slots
Helps with compute capacity
Does not remove interactive concurrency limits
❌ Run all queries in interactive mode
Makes the problem worse
❌ Create a view to run the queries
Views do not reduce concurrency
Queries still execute underneath
Exam rule of thumb 🧠
Too many concurrent queries → Move non-urgent workloads to batch mode
4. You are running a Dataflow pipeline in production. The input data for this pipeline is occasionally inconsistent. Separately from processing the valid data, you want to efficiently capture the erroneous input data for analysis. What should you do?
- Read the data once, and split it into two pipelines, one to output valid data and another to output erroneous data.
- Check for the erroneous data in the logs.
- Create a side output for the erroneous data.
- Re-read the input data and create separate outputs for valid and erroneous data.
✅ Correct Answer: Create a side output for the erroneous data.
Why this is the best approach
"Using side outputs can collect the erroneous data efficiently and is a recommended approach."
In Apache Beam / Dataflow, when you want to:
Continue processing valid data
Separately capture invalid or erroneous records
Do this efficiently in a single pass
The recommended pattern is to use side outputs (also called multiple outputs) from a ParDo.
How side outputs solve your problem
With side outputs, you can:
Read the input once
Validate each record
Route records to:
Main output β valid data
Side output β invalid / malformed data
This is:
Efficient (no re-reading)
Scalable
Cleanly separated
Easy to monitor and analyze
Example conceptually:
ParDo
├─ main output → valid records
└─ side output → invalid records
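A minimal Beam Python sketch of this pattern is shown below; the JSON parsing logic and the downstream handling steps are illustrative assumptions.

```python
# Sketch: one ParDo, two outputs: valid records on the main output,
# unparseable input on a tagged side output.
import json
import apache_beam as beam
from apache_beam import pvalue


class ParseRecord(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                     # main output: valid records
        except ValueError:
            yield pvalue.TaggedOutput("errors", line)  # side output: erroneous input


with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | "Read" >> beam.Create(['{"id": 1}', "not-json"])
        | "Parse" >> beam.ParDo(ParseRecord()).with_outputs("errors", main="valid")
    )
    results.valid | "HandleValid" >> beam.Map(print)
    results.errors | "HandleErrors" >> beam.Map(lambda bad: print("bad record:", bad))
```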
Why the other options are incorrect
❌ Read the data once and split into two pipelines
Not idiomatic Beam
More complex than necessary
Side outputs are the intended mechanism
❌ Check for erroneous data in logs
Logs are for debugging, not data capture
Not reliable or structured for analysis
❌ Re-read the input data
Inefficient
Increases cost and latency
Violates streaming/batch best practices
Exam rule of thumb 🧠
Valid + invalid data handling in Dataflow → Side outputs
5. Multiple analysts need to prepare reports on Monday mornings, which causes heavy BigQuery utilization. You want to take a cost-effective approach to managing this demand. What should you do?
- Use BigQuery Enterprise edition with a one-year commitment.
- Use Flex Slots.
- Use BigQuery Enterprise Plus edition with a three-year commitment.
- Use on-demand pricing.
✅ Correct Answer: Use Flex Slots.
Why Flex Slots are the best choice
Your situation:
Multiple analysts
Heavy BigQuery usage only on Monday mornings
You want a cost-effective way to handle short, predictable spikes
Flex Slots are specifically designed for this scenario.
What Flex Slots give you
Purchase BigQuery slots for short durations (minimum 60 seconds)
Scale up during peak demand (Monday mornings)
Scale down immediately after demand drops
Pay only for the time you actually need extra capacity
This avoids paying for unused capacity during the rest of the week.
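Here is a minimal sketch of the purchase-and-release cycle with the BigQuery Reservation API Python client. The project, location, and slot count are placeholders, and the reservation/assignment steps that route the purchased slots to specific projects are omitted.

```python
# Sketch: buy Flex Slots before the Monday-morning spike, release them afterwards.
# Slot purchases are made in multiples of 100.
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/my-admin-project/locations/US"  # placeholder

commitment = client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        slot_count=500,
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)
print("Purchased:", commitment.name)

# ... Monday-morning reporting runs ...

# Flex commitments can be cancelled after 60 seconds; delete to stop the charges.
client.delete_capacity_commitment(name=commitment.name)
```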
Why the other options are not appropriate
❌ BigQuery Enterprise edition with a one-year commitment
Long-term commitment
Wasteful if demand spikes only once a week
❌ BigQuery Enterprise Plus with a three-year commitment
Very expensive
Designed for always-on, mission-critical workloads
❌ On-demand pricing
No control over concurrency or performance
Can become expensive and unpredictable during heavy usage
Exam rule of thumb 🧠
Short, predictable spikes in BigQuery usage → Flex Slots
6. You run a Cloud SQL instance for a business that requires that the database is accessible for transactions. You need to ensure minimal downtime for database transactions. What should you do?
- Configure replication.
- Configure backups and increase the number of backups.
- Configure backups.
- Configure high availability.
✅ Correct Answer: Configure high availability.
Why this is the right choice
Your requirement is minimal downtime for database transactions on a Cloud SQL instance.
High Availability (HA) is specifically designed to ensure:
Automatic failover
Minimal service interruption
Continuous transaction availability
What Cloud SQL High Availability provides
A primary instance and a standby instance in a different zone
Automatic failover if the primary instance becomes unavailable
No manual intervention required
Best option for transactional workloads (OLTP) that must stay online
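As a hedged sketch, HA can be enabled on an existing instance by setting availabilityType to REGIONAL through the Cloud SQL Admin API; the project and instance names below are placeholders, and the equivalent gcloud command is `gcloud sql instances patch INSTANCE --availability-type=REGIONAL`.

```python
# Sketch: patch an existing Cloud SQL instance to regional (high) availability.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

response = (
    sqladmin.instances()
    .patch(
        project="my-project",   # placeholder
        instance="orders-db",   # placeholder
        body={"settings": {"availabilityType": "REGIONAL"}},  # REGIONAL enables HA with a standby
    )
    .execute()
)
print("Started operation:", response["name"])
```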
Why the other options are insufficient
❌ Configure replication
Read replicas improve read scalability
Failover is manual
Not designed to minimize downtime
❌ Configure backups / increase backups
Used for data recovery
Do not help with availability or uptime
Exam rule of thumb 🧠
Minimal downtime for Cloud SQL transactions → High Availability
7. Cymbal Retail processes streaming data on Dataflow with Pub/Sub as a source. You need to plan for disaster recovery and protect against zonal failures. What should you do?
- Create Dataflow jobs from templates.
- Enable vertical autoscaling.
- Take Dataflow snapshots periodically.
- Enable Dataflow shuffle.
✅ Correct Answer: Take Dataflow snapshots periodically.
Why this is the correct solution
Your requirements are:
Streaming pipeline (Dataflow + Pub/Sub)
Disaster recovery planning
Protection against zonal failures
Dataflow snapshots are the official and recommended DR mechanism for streaming Dataflow jobs.
What Dataflow snapshots provide
Capture the entire state of a streaming pipeline:
In-flight data
Windowing state
Timers
Checkpoints
Allow you to restart the job from the snapshot
Enable recovery with minimal data loss
Critical for zonal or worker failures
If a zone goes down, you can restore the pipeline in another zone/region using the snapshot.
Why the other options are incorrect
❌ Create Dataflow jobs from templates
Helps with deployment
Does not preserve streaming state or in-flight data
❌ Enable vertical autoscaling
Helps performance
Not related to disaster recovery
❌ Enable Dataflow shuffle
Improves performance and resource usage
Not a DR mechanism
Exam rule of thumb 🧠
Streaming Dataflow + disaster recovery → Snapshots
8. A colleague at Cymbal Retail asks you about the configuration of Dataproc autoscaling for a project. In which situation does Google recommend enabling autoscaling?
- When you want to scale out single-job clusters.
- When you want to down-scale idle clusters to minimum size.
- When there are different size workloads on the cluster.
- When you want to scale on-cluster Hadoop Distributed File System (HDFS).
✅ Correct Answer: When you want to scale out single-job clusters.
Why the other options are incorrect
❌ When you want to down-scale idle clusters to minimum size
Google recommends deleting idle clusters rather than autoscaling them down to a minimum size.
❌ When there are different size workloads on the cluster
Long-running jobs from other workloads interfere with autoscaling and delay downscaling.
❌ When you want to scale on-cluster HDFS
Autoscaling on-cluster HDFS is explicitly discouraged because scaling workers requires rebalancing HDFS data.
Why this is the correct answer
Dataproc autoscaling is recommended when:
The cluster is running a single job
The job has variable or unpredictable resource needs
Autoscaling can safely:
Scale workers up during heavy stages
Scale workers down when stages complete
There is no interference from other jobs
This is exactly what Google means by autoscaling within a single workload.
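As an illustration, here is a hedged sketch of an autoscaling policy suited to a single-job cluster, created with the Dataproc Python client; the project, region, sizing, and factor values are placeholder assumptions.

```python
# Sketch: create an autoscaling policy for a single-job cluster, then reference
# it (via autoscaling_config.policy_uri) when creating that cluster.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="single-job-autoscaling",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            scale_up_factor=1.0,    # add workers aggressively during heavy stages
            scale_down_factor=1.0,  # release workers as soon as stages complete
            graceful_decommission_timeout=duration_pb2.Duration(seconds=600),
        )
    ),
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2,
        max_instances=20,
    ),
)

created = client.create_autoscaling_policy(
    parent=f"projects/my-project/regions/{region}", policy=policy
)
print("Created policy:", created.name)
```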
9. You need to create repeatable data processing tasks by using Cloud Composer. You need to follow best practices and recommended approaches. What should you do?
- Combine multiple functionalities in a single task execution.
- Update data with INSERT statements during the task run.
- Use current time with the now() function for computation.
- Write each task to be responsible for one operation.
✅ Correct Answer: Write each task to be responsible for one operation.
Why this is the best practice for Cloud Composer
Cloud Composer (managed Apache Airflow) follows well-established workflow design principles. To create repeatable, reliable, and maintainable data processing tasks, Google recommends:
Designing tasks that do exactly one thing.
Benefits of single-responsibility tasks
Repeatable and idempotent executions
Easier debugging and retries
Better reusability
Clear dependency management
Improved observability and failure isolation
This aligns with both:
Airflow best practices
Software engineering principles (Single Responsibility Principle)
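As a sketch of the pattern (Airflow 2.x syntax; the operators, scripts, and schedule are illustrative assumptions, not a prescribed pipeline), each task below does exactly one thing, and the templated logical date {{ ds }} is used instead of now() so retries and backfills stay repeatable.

```python
# Sketch: small, single-purpose tasks chained into one repeatable DAG.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_report",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task = one operation; each step can be retried or rerun on its own.
    extract = BashOperator(
        task_id="extract_orders",
        bash_command="python extract_orders.py --date {{ ds }}",    # hypothetical script
    )
    transform = BashOperator(
        task_id="transform_orders",
        bash_command="python transform_orders.py --date {{ ds }}",  # hypothetical script
    )
    load = BashOperator(
        task_id="load_report",
        bash_command="python load_report.py --date {{ ds }}",       # hypothetical script
    )

    extract >> transform >> load
```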
Why the other options are incorrect
❌ Combine multiple functionalities in a single task
Hard to debug
Failures are difficult to isolate
Breaks Airflow best practices
❌ Update data with INSERT statements during the task run
Plain INSERTs are not idempotent; a retry or backfill can write duplicate rows
Prefer idempotent writes such as partition overwrites or MERGE
❌ Use current time with the now() function
Leads to non-deterministic workflows
Breaks reproducibility and backfills; use the run's logical date instead
Exam rule of thumb 🧠
Cloud Composer / Airflow → Small, single-purpose tasks
10. You need to design a Dataproc cluster to run multiple small jobs. Many jobs (but not all) are of high priority. What should you do?
- Reuse the same cluster to run all jobs in parallel.
- Use ephemeral clusters.
- Reuse the same cluster and run each job in sequence.
- Use cluster autoscaling.
✅ Correct Answer: Use ephemeral clusters.
Why ephemeral clusters are correct
"Jobs can use ephemeral clusters to quickly run the job and then deallocate the resources after use. Multiple jobs can be run in parallel without interfering with each other."
Ephemeral (per-job) clusters are the recommended pattern when:
You have many small jobs
Jobs have different priorities
You want:
Job isolation
Predictable performance
No interference between jobs
Clean environments
You want to avoid autoscaling conflicts
Each job:
Spins up its own cluster
Runs independently
Shuts down after completion
This ensures:
High-priority jobs are never delayed
No contention between workloads
Better reliability and reproducibility
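Here is a hedged sketch of the ephemeral-cluster lifecycle with the Dataproc Python client; names, region, machine types, and the PySpark file URI are placeholders, and in practice this create/submit/delete flow is often wrapped in a workflow template or a Composer DAG.

```python
# Sketch: create a per-job cluster, run the job, delete the cluster.
from google.cloud import dataproc_v1

project, region, cluster_name = "my-project", "us-central1", "report-job-cluster"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a small cluster dedicated to this one job.
cluster = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2. Submit the job and wait for it to finish.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/report.py"},  # placeholder
}
job_client.submit_job_as_operation(
    request={"project_id": project, "region": region, "job": job}
).result()

# 3. Tear the cluster down so the resources are released immediately.
cluster_client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": cluster_name}
).result()
```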
Why the other options are incorrect
❌ Reuse the same cluster to run jobs in parallel
Causes resource contention
High-priority jobs can be starved
❌ Reuse the same cluster and run jobs sequentially
Wastes time
High-priority jobs may wait
❌ Use cluster autoscaling
Explicitly not recommended for multiple concurrent jobs
Scaling decisions become unstable
Exam-safe rule 🧠
Dataproc + many small jobs + mixed priority → Ephemeral clusters
11. Cymbal Retail uses Google Cloud and has automated repeatable data processing workloads to achieve reliability and cost efficiency. You want out-of-the-box metric collection dashboards and the ability to generate alerts when specific conditions are met. What tool can you use?
- Cloud Data Loss Prevention (Cloud DLP)
- Data Catalog
- Cloud Monitoring
- Cloud Composer
✅ Correct Answer: Cloud Monitoring
Why Cloud Monitoring is the right tool
Your requirements are:
Automated, repeatable data processing workloads
Out-of-the-box metrics and dashboards
Ability to generate alerts when conditions are met
A Google Cloud-native solution
Cloud Monitoring (formerly Stackdriver Monitoring) is specifically designed for this.
What Cloud Monitoring provides
Prebuilt dashboards for Google Cloud services (BigQuery, Dataflow, Dataproc, etc.)
Automatic metric collection
Custom alerting policies (thresholds, duration, conditions)
Integration with Cloud Logging
Supports reliability and cost-efficiency goals
This matches your needs exactly, with no custom tooling required.
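For example, the metrics behind those dashboards are also queryable programmatically; below is a minimal sketch that reads the last hour of the Dataflow system lag metric with the Cloud Monitoring Python client (the project ID is a placeholder).

```python
# Sketch: list the last hour of the built-in Dataflow system_lag metric.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder
        "filter": 'metric.type="dataflow.googleapis.com/job/system_lag"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    job_name = ts.resource.labels.get("job_name", "unknown")
    print(job_name, "data points in the last hour:", len(ts.points))
```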
Why the other options are incorrect
❌ Cloud DLP
Used for detecting and redacting sensitive data
Not for monitoring workloads or alerting
❌ Data Catalog
Metadata discovery and governance tool
No dashboards or alerting
❌ Cloud Composer
Workflow orchestration (Airflow)
Does not provide monitoring dashboards or alerting
Exam rule of thumb 🧠
Metrics + dashboards + alerts → Cloud Monitoring
12. Your company recently migrated to Google Cloud and started using BigQuery. The team members don't know how much querying they are going to do, and they need to be efficient with their spend. As a Professional Data Engineer, what pricing model would you recommend?
- Create a pool of resources using BigQuery Reservations.
- Use BigQuery's on-demand pricing model.
- Use IAM service to block access to BigQuery till the team figures out how much querying they are going to do.
- Decide how much compute capacity you need and reserve it using capacity pricing.
✅ Correct Answer: Use BigQuery's on-demand pricing model.
Why this is the best recommendation
Your situation:
Recently migrated to BigQuery
Uncertain and unpredictable query volume
Need to be cost-efficient
Want to avoid overcommitting too early
BigQuery on-demand pricing is specifically designed for this scenario.
Benefits of on-demand pricing
Pay only for data scanned by queries
No upfront commitment
No need to estimate capacity
Ideal during:
Early adoption
Exploration phases
Unpredictable workloads
This lets the team learn usage patterns before committing to reservations.
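Because on-demand pricing bills per byte scanned, the team can also estimate a query's cost before running it; below is a minimal dry-run sketch with the BigQuery Python client (the SQL and table are placeholders).

```python
# Sketch: a dry run reports how many bytes a query would scan without running it.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT customer_id, SUM(amount) FROM `my-project.sales.orders` GROUP BY customer_id",
    job_config=job_config,
)

gib_scanned = job.total_bytes_processed / 1024**3
print(f"This query would scan {gib_scanned:.2f} GiB")  # on-demand cost scales with this
```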
Why the other options are incorrect
❌ Create a pool of resources using BigQuery Reservations
Requires predictable workload
Risk of paying for unused capacity
❌ Decide compute capacity and reserve it (capacity pricing)
Same issue: premature commitment
Better suited once usage stabilizes
❌ Block access to BigQuery using IAM
Prevents productivity
Not a cost-management strategy
Exam rule of thumb 🧠
Unknown or unpredictable BigQuery usage → On-demand pricing
Predictable, steady workloads → Reservations / capacity pricing