1. Google Cloud Pub/Sub provides native support for exactly-once message processing without requiring additional idempotent logic in the subscriber application.
- True
- False
Correct answer: ❌ False
Explanation (PDE exam–accurate)
Google Cloud Pub/Sub does not provide native end-to-end exactly-once message processing for subscribers.
Pub/Sub guarantees at-least-once delivery
A message can be delivered more than once
To achieve exactly-once semantics, subscriber applications must implement idempotent logic or use downstream systems that handle deduplication (for example, BigQuery or Dataflow)
Important clarification (common exam trap ⚠️)
Pub/Sub does support exactly-once delivery when messages are consumed by a streaming Dataflow pipeline,
👉 but this does NOT mean Pub/Sub itself provides exactly-once processing for all subscribers
Since the question says “without requiring additional idempotent logic in the subscriber application”, the statement is False.
PDE memory rule 🧠
Pub/Sub alone → at-least-once
Exactly-once → requires Dataflow or idempotent subscribers
While Pub/Sub offers at-least-once delivery by default (with an exactly-once delivery option for pull subscriptions and ordering keys for per-key ordering), achieving true end-to-end exactly-once processing typically still requires the subscriber to implement idempotent logic to handle potential duplicate deliveries. Managed Service for Apache Kafka, with appropriate configuration, can provide exactly-once processing guarantees. A minimal sketch of such idempotent subscriber logic is shown below.
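Here is a hedged sketch of that idempotent-subscriber pattern, assuming a local Redis instance as the deduplication store and a hypothetical process_order() business function; any store with an atomic check-and-set works the same way:

```python
# Hedged sketch: deduplicate on the Pub/Sub message ID so a redelivered message
# is acknowledged without re-running the side effect. Redis and process_order()
# are illustrative assumptions, not part of Pub/Sub itself.
import redis
from google.cloud import pubsub_v1

dedup = redis.Redis()  # hypothetical deduplication store
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "orders-sub")  # hypothetical names

def callback(message):
    # SET with nx=True only succeeds the first time this message ID is seen.
    first_delivery = dedup.set(f"seen:{message.message_id}", 1, nx=True, ex=24 * 3600)
    if first_delivery:
        process_order(message.data)  # hypothetical business logic
    message.ack()  # ack duplicates too, so Pub/Sub stops redelivering them

subscriber.subscribe(subscription, callback=callback).result()
```

In practice the deduplication key is often a business-level ID (for example an order ID) rather than the Pub/Sub message ID, because a retried publish produces a new message ID.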
2. Select all the statements that accurately describe advantages of Pub/Sub.
It automatically integrates with a wide range of Google Cloud services without the need for additional connectors.
It is completely serverless and offers a pay-for-what-you-use pricing model.
It is a fully managed, "no-ops" service that simplifies operations by removing the need for cluster management.
It guarantees strict message ordering across all messages within a topic without affecting throughput.
It provides long-term message persistence, allowing consumers to replay data from any historical offset.
Correct answers (select all that apply):
✅ It automatically integrates with a wide range of Google Cloud services without the need for additional connectors.
Being a native Google Cloud service, Pub/Sub offers deep, built-in integration with other services like Dataflow, BigQuery, and Cloud Functions. This eliminates the need for separate connectors, simplifying your data pipelines.
✅ It is completely serverless and offers a pay-for-what-you-use pricing model.
Pub/Sub is a serverless platform. You don't provision any instances, and the pricing is based on the volume of messages you process, which is a pay-for-what-you-use model.
✅ It is a fully managed, “no-ops” service that simplifies operations by removing the need for cluster management.
Pub/Sub is designed as a fully managed service, meaning Google handles the infrastructure provisioning, maintenance, and scaling. This allows you to focus on your application logic rather than managing servers or clusters.
These are true advantages of Google Cloud Pub/Sub and are commonly tested in the PDE exam.
Incorrect options (do NOT select):
❌ It guarantees strict message ordering across all messages within a topic without affecting throughput.
Pub/Sub does not provide global ordering per topic. Ordering is only supported per ordering key, and enabling it can reduce throughput.
❌ It provides long-term message persistence, allowing consumers to replay data from any historical offset.
Pub/Sub retains messages only for a limited window (7 days by default for subscriptions; topic-level retention can be extended, but only up to 31 days). It is not meant for long-term storage or replay from any historical offset.
3. What is a key benefit of Google Cloud's Managed Service for Apache Kafka?
- It does not require any knowledge of the Kafka API.
- It automates complex operational tasks like broker sizing and rebalancing.
- It is completely serverless with no cluster configuration.
Correct answer (✔):
✅ It automates complex operational tasks like broker sizing and rebalancing.
Managed Service for Apache Kafka handles the heavy lifting of operational tasks, such as scaling the brokers to meet demand (sizing) and redistributing partitions across the cluster to ensure even load distribution (rebalancing). This allows developers to focus on building their applications rather than managing the underlying infrastructure.
Explanation (PDE exam–focused)
Managed Service for Apache Kafka is a fully managed Kafka offering, and its key benefit is reducing operational overhead while preserving Kafka compatibility.
Google Cloud handles:
Broker provisioning and sizing
Automatic scaling
Partition rebalancing
High availability and maintenance
This allows teams to keep using Kafka without running or tuning clusters themselves.
Why the other options are incorrect ❌
❌ It does not require any knowledge of the Kafka API.
Incorrect. This service is Kafka API–compatible by design.
You must understand Kafka concepts (topics, partitions, producers, consumers).
❌ It is completely serverless with no cluster configuration.
Incorrect.
While managed, it is not serverless like Pub/Sub.
You still work with Kafka clusters (though Google manages them for you).
PDE exam shortcut 🧠
Pub/Sub → serverless, event-driven, no Kafka API
Managed Service for Apache Kafka → Kafka-compatible, managed ops, not serverless
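To make the "Kafka API-compatible, but not serverless" point concrete, here is a hedged sketch: an ordinary Kafka client publishing to a Managed Service for Apache Kafka cluster. The bootstrap hostname and topic are hypothetical, and the TLS/SASL settings the managed cluster requires are omitted; everything else is standard Kafka.

```python
# Hedged sketch: unchanged confluent-kafka producer code pointed at a managed
# cluster. Only the bootstrap address (hypothetical here) and the cluster's
# auth settings differ from self-managed Kafka.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "bootstrap.grand-prix-cluster.us-central1.example.goog:9092",
})

# Standard Kafka concepts still apply: you choose the topic and the key,
# and the key determines the partition.
producer.produce("race-telemetry", key=b"kart-7", value=b'{"lap": 3, "speed": 212}')
producer.flush()
```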
4. You need to build a feature for the "Galactic Grand Prix" that can automatically detect when a player has likely disconnected. The requirement is to group all of a player's in-game actions together and trigger an alert if there is a gap of more than two minutes between any two consecutive actions from that player. Which windowing strategy is the most appropriate for this use case?
- Session Windows with a 2-minute gap duration.
- Sliding (Hopping) Windows with a 2-minute duration and a 30-second period.
- Global Windows with a time-based trigger.
- Fixed (Tumbling) Windows of 2 minutes.
Correct answer: ✅ Session Windows with a 2-minute gap duration
Session windows are specifically designed for this scenario. They group elements by key (in this case, player_id) and define a window that ends after a specified period of inactivity (the "gap duration"). This matches the requirement to identify when a player has gone idle or disconnected.
Why this is the correct choice (PDE exam logic)
This use case requires you to:
Group events by player
Detect inactivity (a gap of more than 2 minutes between actions)
Trigger an alert when the gap occurs
Session windows are specifically designed for this exact pattern.
With Session Windows (gap = 2 minutes):
Events from the same player are grouped into one session as long as actions keep arriving within 2 minutes of each other.
If no event arrives for more than 2 minutes, the session closes automatically.
The end of a session = inactivity detected, which is precisely when you want to trigger the alert.
This is a classic user activity / disconnect detection pattern and is commonly tested in streaming questions (often in Google Cloud Dataflow scenarios).
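As a concrete illustration, here is a hedged Apache Beam (Python) sketch of the pattern; the topic name and the player_id field are assumptions, and in a real pipeline the "session closed" output would feed an alerting sink rather than a plain string.

```python
# Hedged sketch: key each action by player_id and window into Sessions with a
# 2-minute gap; a session only emits once no action has arrived for 2 minutes,
# which is exactly the disconnect signal we want.
import json
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadActions" >> beam.io.ReadFromPubSub(
         topic="projects/my-project/topics/player-actions")      # hypothetical topic
     | "Parse" >> beam.Map(json.loads)
     | "KeyByPlayer" >> beam.Map(lambda e: (e["player_id"], e))  # assumed field name
     | "Sessionize" >> beam.WindowInto(window.Sessions(gap_size=2 * 60))
     | "GroupSession" >> beam.GroupByKey()
     | "FlagIdlePlayer" >> beam.Map(
         lambda kv: f"player {kv[0]} idle after {len(list(kv[1]))} actions"))
```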
Why the other options are incorrect ❌
❌ Sliding (Hopping) Windows (2 min duration, 30 sec period)
Designed for continuous overlapping aggregates (e.g., rolling averages).
They do not detect inactivity gaps between events.
You would still see windows even if the user stopped sending events.
❌ Global Windows with a time-based trigger
Global windows never close on their own.
You would need custom trigger logic, making this unnecessarily complex.
Poor choice for detecting session breaks in exams.
❌ Fixed (Tumbling) Windows of 2 minutes
Arbitrarily slices time.
A player could act at 1:01 and then not again until 3:59 (a gap of almost three minutes), yet each 2-minute window would still contain an action, so the inactivity is never detected; conversely, an active burst can be split arbitrarily across window boundaries.
Does not model user behavior naturally.
PDE exam memory trick 🧠
Inactivity / user disconnect / session behavior → Session Windows
Rolling metrics → Sliding Windows
Periodic reporting → Fixed Windows
Custom logic only → Global Windows
5. As the lead data engineer for the "Galactic Grand Prix," you've chosen Dataflow as your processing engine. Which of the following core streaming challenges is Dataflow designed to solve automatically or with its built-in features?
- Automatically scaling worker resources up and down to handle a sudden 10x surge in viewership.
- Ensuring data is not lost or processed twice if a worker node unexpectedly fails.
- Generating the initial raw JSON data from the game clients.
- Correctly ordering player events that arrive from different geographic regions with different network speeds.
- Authenticating user identities on the fan engagement website.
Correct selections (✔):
✅ Automatically scaling worker resources up and down to handle a sudden 10x surge in viewership.
Dataflow is serverless and provides autoscaling to manage fluctuating workloads.
✅ Ensuring data is not lost or processed twice if a worker node unexpectedly fails.
Dataflow ensures fault tolerance and exactly-once processing semantics.
✅ Correctly ordering player events that arrive from different geographic regions with different network speeds.
Dataflow uses event-time processing and watermarks to handle out-of-order data.
These are core streaming challenges that Dataflow is designed to handle using built-in capabilities.
Why these are correct (PDE exam reasoning)
Google Cloud Dataflow is a fully managed, serverless stream and batch processing service based on Apache Beam. It natively solves several hard distributed-systems problems:
✅ Auto-scaling
Dataflow automatically scales workers up and down
Handles sudden traffic spikes (e.g., 10× viewership)
No manual capacity planning required
✅ Fault tolerance & exactly-once (when supported by sink)
Uses checkpointing and state persistence
Automatically retries failed work
Prevents data loss and minimizes duplicate processing
With supported sources and sinks (e.g., Pub/Sub → Dataflow → BigQuery), it can achieve exactly-once semantics
✅ Event-time processing & out-of-order data
Uses event time, watermarks, and triggers
Correctly processes late and out-of-order events
Critical for globally distributed players with varying network latency
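For the autoscaling point specifically, the behaviour is mostly configuration rather than code. A hedged sketch of the relevant Dataflow pipeline options follows; project, bucket, topic, and table names are hypothetical.

```python
# Hedged sketch: a streaming Beam pipeline submitted to Dataflow with
# throughput-based autoscaling and an upper worker bound; Dataflow adds and
# removes workers itself as the Pub/Sub backlog grows or shrinks.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale on backlog, not a fixed size
    max_num_workers=50,                        # cap for the 10x viewership surge
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/game-events")
     | beam.Map(lambda b: {"raw": b.decode("utf-8")})
     | beam.io.WriteToBigQuery("my-project:analytics.game_events", schema="raw:STRING"))
```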
Incorrect selections (✘):
❌ Generating the initial raw JSON data from the game clients.
This is the responsibility of the application or clients, not Dataflow.
Dataflow processes data after it is produced (e.g., from Pub/Sub).
❌ Authenticating user identities on the fan engagement website.
Authentication is handled by services like Identity Platform, OAuth, or IAM.
Dataflow is a data processing engine, not an auth system.
PDE exam memory shortcut 🧠
Dataflow solves:
Scaling
Fault tolerance
Exactly-once (with supported sinks)
Event-time & out-of-order data
Dataflow does NOT solve:
Data generation
Authentication
Frontend or app logic
6. In a Dataflow pipeline, once a watermark passes a certain point in event time (e.g., 12:05 PM), it is a guarantee that no data with a timestamp earlier than 12:05 PM can ever be processed in that window.
- False
- True
Correct answer: ❌ False
A watermark is a heuristic, representing the point in event time where the system believes all data has arrived. While it's a strong signal, Dataflow's architecture allows for "late data" to arrive after the watermark has passed. You can configure how to handle this late data.
Explanation (PDE exam–accurate)
In Google Cloud Dataflow, a watermark is an estimate, not a hard guarantee.
When the watermark passes 12:05 PM, it means Dataflow expects that most data earlier than that time has arrived.
Late data can still arrive with timestamps earlier than 12:05 PM due to:
Network delays
Client buffering
Cross-region latency
Dataflow can still process this late data depending on:
Allowed lateness
Triggers
Therefore, the statement claiming a guarantee is incorrect.
Key PDE concepts to remember 🧠
Watermark = heuristic / estimate
Not a strict cutoff
Late data is handled via:
Allowed lateness
Late-firing triggers
Discarding (if configured)
Exam memory rule
Watermarks predict completeness; they do not promise it.
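A hedged Beam (Python) sketch of how that late data is actually admitted rather than dropped; the topic name is hypothetical and the durations are illustrative:

```python
# Hedged sketch: a 1-minute fixed window that re-fires for late records.
# allowed_lateness keeps window state for 10 minutes past the watermark, and the
# AfterWatermark(late=...) trigger emits an updated pane when late data arrives.
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | beam.Map(lambda b: ("events", 1))
     | beam.WindowInto(
         window.FixedWindows(60),
         trigger=AfterWatermark(late=AfterProcessingTime(30)),
         allowed_lateness=10 * 60,
         accumulation_mode=AccumulationMode.ACCUMULATING)
     | beam.CombinePerKey(sum))
```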
7. A development team is storing live operational data for a mobile application in a Bigtable instance to ensure low-latency reads and writes for the app. A business analyst needs to run an ad-hoc SQL query on this live data to analyze a recent trend. What is the most efficient and recommended way to achieve this without impacting the mobile application's performance?
- Create an external table in BigQuery that points to the Bigtable instance, and have the analyst run their query against that external table using a 'Data Boost' profile.
- Implement a Dataflow pipeline to read all the data from Bigtable and write it into a new, native BigQuery table for the analyst to query.
- Grant the analyst's user account direct read access to the production Bigtable instance and have them use the CBT command-line tool to retrieve the data.
- Export the data from Bigtable to a CSV file in Cloud Storage, and then load that CSV file into BigQuery.
Correct answer: ✅
Create an external table in BigQuery that points to the Bigtable instance, and have the analyst run their query against that external table using a Data Boost profile.
This is the exact use case for the BigQuery and Bigtable integration. Creating an external table allows BigQuery to query the data directly within Bigtable using familiar SQL. Using Data Boost is the critical part that ensures performance isolation; it runs the analytical query using dedicated, on-demand compute resources, which guarantees that the heavy analytical workload will not interfere with the low-latency traffic from the production mobile application.
Why this is the best and recommended approach (PDE exam logic)
The key requirements are:
Ad-hoc SQL querying
Live operational data
No performance impact on the mobile app
Most efficient / recommended solution
Using BigQuery external tables over Cloud Bigtable together with Bigtable Data Boost is designed for exactly this scenario.
What this gives you:
No data duplication (queries run directly against Bigtable)
SQL access for analysts via BigQuery
Isolation from production traffic
Data Boost uses separate compute resources
Does not consume Bigtable node CPU
Near real-time analysis of live data
This is a textbook PDE exam answer.
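As a hedged sketch of what this looks like in practice, assume the external table `analytics.app_events_ext` has already been defined over the Bigtable table (for example with `CREATE EXTERNAL TABLE ... OPTIONS(format = 'CLOUD_BIGTABLE', ...)`, reading the row key as a STRING) and that a Data Boost app profile is configured on the instance; the analyst then simply runs SQL through BigQuery. All names below are hypothetical.

```python
# Hedged sketch: the analyst queries live Bigtable data through the BigQuery
# external (federated) table; the Data Boost app profile keeps this analytical
# scan off the serving nodes that back the mobile app.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT COUNT(*) AS row_count
FROM `my-project.analytics.app_events_ext`
WHERE STARTS_WITH(rowkey, 'user#')   -- assumed row-key prefix convention
"""
result = next(iter(client.query(sql).result()))
print(result.row_count)
```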
Why the other options are incorrect ❌
❌ Implement a Dataflow pipeline to copy data into BigQuery
Introduces data latency
Requires ongoing pipeline maintenance
Duplicates data and increases cost
Not ideal for ad-hoc or live analysis
(Use Google Cloud Dataflow only when transformation or historical warehousing is required.)
❌ Grant direct Bigtable access and use the CBT CLI
Bigtable is not SQL-friendly
CBT is for administration/debugging, not analytics
High risk of impacting production performance
Not analyst-friendly
❌ Export to CSV → Cloud Storage → Load into BigQuery
Completely batch-oriented
High operational overhead
Stale data
Worst choice for live operational insights
PDE exam memory rule 🧠
Bigtable (OLTP) + BigQuery (OLAP) + Data Boost = real-time analytics without production impact
8. In BigQuery's serverless architecture, query processing capacity (compute) is tightly coupled with the amount of data you have stored. To improve query performance, you must first increase your storage capacity.
- False
- True
Correct answer: ❌ False
BigQuery's architecture fundamentally separates compute from storage. This is a core concept. Query processing is handled by the Dremel engine, which allocates and scales compute resources automatically and independently of the storage layer. You do not need to provision or scale storage to get more query power. This separation allows BigQuery to handle petabyte-scale queries without requiring users to manage underlying infrastructure.
Explanation (PDE exam–accurate)
In BigQuery, compute and storage are decoupled by design.
Storage: Scales independently based on how much data you store
Compute (query processing capacity): Scales automatically and independently when queries run
You do NOT need to increase storage to get more compute power
BigQuery automatically provisions the required compute resources to execute queries, which is a core benefit of its serverless architecture.
Why this statement is incorrect ❌
The statement describes a traditional data warehouse model, not BigQuery.
In BigQuery:
Query performance improvements are achieved via:
Better query design
Partitioning and clustering
Slots / reservations (if using capacity-based pricing)
Not by increasing storage size
PDE exam memory rule 🧠
BigQuery = serverless, decoupled storage & compute
If a question suggests:
“Add more storage to get more compute” → False
9. A data analyst complains that their queries on a large, multi-year historical sales data table are slow and costly. The queries often filter by the transaction date and group by product category. Which of the following BigQuery features could be used to improve query performance and reduce costs for this specific workload? (Select all that apply)
- Partitioning the table by transaction date
- Using the BigQuery Storage Write API
- Clustering the table by product category
- Creating a materialized view for common aggregations
Correct selections (✔):
✅ Partitioning the table by transaction date
Partitioning the table by date (for example, daily or monthly) is the most effective optimization here. When queries filter by the transaction date, BigQuery can prune partitions, meaning it only scans the data in the relevant date partitions, dramatically reducing the amount of data processed, which lowers cost and improves speed.
✅ Clustering the table by product category
After partitioning by date, clustering by product_category would physically co-locate data for the same category within each partition. This improves the performance of queries that filter or group by category, as BigQuery can more efficiently read the relevant blocks of data.
✅ Creating a materialized view for common aggregations
Since the analyst is running common aggregations (grouping by category), a materialized view could pre-compute and store these results. Queries that match the materialized view's logic would read from the much smaller, pre-aggregated table instead of the raw data, making them significantly faster and cheaper.
These features directly improve query performance and cost efficiency in BigQuery for the described workload.
Why these are correct (PDE exam reasoning)
✅ Partitioning by transaction date
Queries filter by transaction date
Partitioning allows BigQuery to scan only relevant partitions
Results in:
Much lower data scanned
Faster queries
Reduced cost
👉 This is the most important optimization for time-based historical data.
✅ Clustering by product category
Queries GROUP BY product category
Clustering organizes data by column values within partitions
Improves:
Aggregations
Filter performance
Query efficiency on grouped columns
👉 Partitioning + clustering is a classic PDE exam combo.
✅ Materialized views for common aggregations
If analysts repeatedly run similar aggregation queries
Materialized views:
Store precomputed results
Automatically stay updated
Dramatically reduce query time and cost
👉 Best for frequent, repetitive analytics patterns.
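A hedged sketch of all three optimizations using the BigQuery Python client; the project, dataset, table, and column names are hypothetical.

```python
# Hedged sketch: create the partitioned + clustered table, then a materialized
# view for the recurring aggregation.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.sales.transactions",
    schema=[
        bigquery.SchemaField("transaction_date", "DATE"),
        bigquery.SchemaField("product_category", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="transaction_date")  # daily partitions
table.clustering_fields = ["product_category"]  # co-locate rows per category within each partition
client.create_table(table)

# Pre-compute the common GROUP BY so repetitive analyst queries read the small
# aggregated result instead of scanning the raw table.
client.query("""
CREATE MATERIALIZED VIEW `my-project.sales.revenue_by_category` AS
SELECT transaction_date, product_category, SUM(amount) AS revenue
FROM `my-project.sales.transactions`
GROUP BY transaction_date, product_category
""").result()
```

Queries that filter on transaction_date then prune partitions, and queries matching the view's aggregation can be transparently rewritten by BigQuery to read the materialized view.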
Why the other option is incorrect ❌
❌ Using the BigQuery Storage Write API
This is an ingestion optimization, not a query optimization
Improves write throughput and latency
Has no impact on query performance or cost
10. A social media company uses a single, massive BigQuery table to store all user events (likes, posts, comments), which is partitioned by day. The Data Science team needs to run complex machine learning training queries that scan months of historical data. At the same time, the Product Analytics team runs many small, concurrent dashboard queries that typically look at the last 7 days of data. The analytics dashboards are becoming slow whenever the ML training jobs are running. What is the best way to ensure the Product Analytics team's dashboards remain fast and responsive without stopping the Data Science team's work?
- Use a BigQuery BI Engine reservation to accelerate the dashboards.
- Switch BigQuery from the on-demand pricing model to a flat-rate model and create separate reservations/assignments for the Data Science and Product Analytics teams.
- Split the single table into two separate tables: one for historical data and one for the last 7 days.
- Ask the Data Science team to only run their heavy queries at night when the Product Analytics team is offline.
Correct answer: ✅
Switch BigQuery from the on-demand pricing model to a flat-rate (capacity-based) model and create separate reservations/assignments for the Data Science and Product Analytics teams.
Flat-rate reservations are designed specifically for this scenario. By purchasing a dedicated amount of slot capacity (BigQuery's unit of compute) and creating separate reservations, you can assign one pool of slots to the Data Science team (for their heavy, long-running jobs) and another pool to the Analytics team (for their short, high-concurrency dashboard queries). This guarantees that a surge in demand from one team does not consume the compute resources needed by the other, ensuring consistent performance for both.
Why this is the best solution (PDE exam logic)
The core problem here is resource contention:
ML training jobs → long-running, heavy queries scanning months of data
Dashboards → many small, concurrent, latency-sensitive queries
Both are competing for the same BigQuery compute resources
In BigQuery, the only way to guarantee performance isolation between workloads is to use capacity-based pricing with reservations.
What this approach gives you:
Dedicated query processing capacity (slots) per team
Guaranteed performance for dashboards, regardless of ML workload
No need to stop or delay Data Science jobs
Predictable performance and cost control
This is a very common PDE exam scenario:
👉 Multiple teams + mixed workloads + performance interference = reservations.
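A hedged sketch using the BigQuery Reservation API Python client, assuming slot capacity has already been purchased in a central admin project; the project IDs, reservation names, and slot counts are hypothetical.

```python
# Hedged sketch: two reservations carved out of the purchased capacity, one per
# team, each assigned to that team's project for QUERY jobs. Heavy ML queries
# can then only exhaust the ml-training slots, never the dashboards slots.
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/admin-project/locations/US"  # admin project that owns the capacity

teams = [
    ("ml-training", 400, "projects/data-science-project"),
    ("dashboards", 100, "projects/product-analytics-project"),
]
for reservation_id, slots, assignee_project in teams:
    res = client.create_reservation(
        parent=parent,
        reservation_id=reservation_id,
        reservation=reservation.Reservation(slot_capacity=slots, ignore_idle_slots=False),
    )
    client.create_assignment(
        parent=res.name,
        assignment=reservation.Assignment(
            job_type=reservation.Assignment.JobType.QUERY,
            assignee=assignee_project,
        ),
    )
```

With ignore_idle_slots=False, either reservation can still borrow slots that the other is not using, so capacity is not wasted while isolation under load is preserved.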
Why the other options are NOT correct ❌
❌ Use BigQuery BI Engine to accelerate the dashboards
BI Engine helps with in-memory acceleration
It does not prevent slot contention
Heavy ML queries can still starve dashboard queries
Helpful, but not sufficient or guaranteed
❌ Split the table into historical and recent data
Adds unnecessary data management complexity
Does not solve compute contention
ML queries would still compete with dashboard queries
❌ Ask the Data Science team to run queries at night
Operational workaround, not a technical solution
Not scalable, not reliable, and never the correct exam answer
11. For ingesting high-volume time-series data, the optimal row key design is to always begin the key with the event timestamp (e.g., 2025-09-15T10:30:00Z#sensor_123) to ensure data is stored chronologically and is optimized for fast range scans.
- True
- False
Correct answer: ❌ False
While starting a row key with a timestamp does store data chronologically, it is a common anti-pattern for high-throughput ingestion. This design would cause all new writes to target a single node in the Bigtable cluster, creating a performance bottleneck known as "hotspotting." A better practice is to design the key to distribute writes, for example, by prefixing it with a more varied value like a sensor ID or by using a reversed timestamp.
Explanation (PDE exam–accurate)
For high-volume time-series data in Cloud Bigtable, starting the row key with a timestamp is an anti-pattern.
Why this is NOT optimal
Bigtable stores rows lexicographically by row key
A monotonically increasing timestamp prefix causes:
Hotspotting on a small set of tablets
Write bottlenecks under high ingest rates
This severely hurts throughput and scalability
What is recommended instead ✅
To balance write distribution and query efficiency:
Use a salting / hashing prefix
Or reverse the timestamp
Or start with a high-cardinality entity ID (e.g., sensor ID)
Example good row key patterns:
hash(sensor_id)#reverse_timestamp
sensor_123#reverse_timestamp
bucket#sensor_id#reverse_timestamp
These designs:
Distribute writes evenly
Still allow efficient time-range scans per sensor
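A hedged sketch of the hash(sensor_id)#sensor_id#reverse_timestamp pattern using the Bigtable Python client; the instance, table, and column-family names are hypothetical.

```python
# Hedged sketch: build a write-distributed row key and insert one reading.
import hashlib
import time
from google.cloud import bigtable

MAX_TS_MILLIS = 2**63 - 1  # used to reverse the timestamp so newest rows sort first

def make_row_key(sensor_id: str, event_time: float) -> bytes:
    salt = hashlib.md5(sensor_id.encode()).hexdigest()[:4]  # spreads writes across tablets
    reverse_ts = MAX_TS_MILLIS - int(event_time * 1000)
    return f"{salt}#{sensor_id}#{reverse_ts:020d}".encode()

client = bigtable.Client(project="my-project")
table = client.instance("telemetry").table("sensor_events")

row = table.direct_row(make_row_key("sensor_123", time.time()))
row.set_cell("metrics", "temperature", b"21.7")
row.commit()
```

Because the salt is derived from the sensor ID, all rows for one sensor share the same prefix, so a time-range scan for a single sensor remains a cheap prefix scan even though writes from many sensors land on different tablets.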
PDE exam memory rule 🧠
Bigtable row keys should avoid monotonically increasing prefixes.
Timestamp first = hotspot risk = wrong answer.
12. A development team stores a large volume of historical user activity data in a Bigtable table. Which of the following are valid and recommended methods for analyzing or interacting with this data? (Select all that apply)
- Serving low-latency requests (e.g., retrieving a user's last 10 actions) for a user-facing application dashboard.
- Performing a JOIN operation with a customer dimension table using the Bigtable SQL interface.
- Using the native HBase client for programmatic data access in a Java application.
- Executing a federated query from BigQuery to run interactive SQL analysis without moving the data.
Correct selections (✔):
✅ Serving low-latency requests (e.g., retrieving a user's last 10 actions) for a user-facing application dashboard.
Serving low-latency data for applications is a primary use case for Bigtable.
✅ Using the native HBase client for programmatic data access in a Java application.
Bigtable is compatible with the HBase API.
✅ Executing a federated query from BigQuery to run interactive SQL analysis without moving the data.
BigQuery supports federated queries, which allow you to run SQL on data stored in Bigtable without a separate ETL step.
Why these are correct (PDE exam reasoning)
✅ Low-latency serving for applications
Cloud Bigtable is built for high-throughput, low-latency reads and writes.
Ideal for patterns like “get last N actions for a user” using a well-designed row key.
This is Bigtable’s primary strength (OLTP-style access).
✅ Native HBase client (Java)
Bigtable is HBase API–compatible.
Java applications can use the HBase client for programmatic access.
This is a standard, recommended integration path for production services.
✅ Federated queries from BigQuery
You can analyze Bigtable data using BigQuery external (federated) tables.
Enables interactive SQL analysis without copying data.
Best for ad-hoc analytics while keeping operational workloads in Bigtable.
Why the other option is incorrect ❌
❌ Performing a JOIN operation with a customer dimension table using the Bigtable SQL interface.
Bigtable is a NoSQL wide-column store, not a relational SQL engine; its query surface is limited to single-table access.
It does not support JOINs internally, so you cannot join against a customer dimension table there.
Joins are handled in analytics engines like BigQuery (for example, over a federated table), not in Bigtable itself.
13. An e-commerce platform is designing a real-time fraud detection system. For each incoming purchase, the system must check the customer's static profile data (name, address) and their 5 most recent transactions. This entire lookup must complete in under 20 milliseconds. Which Bigtable schema design would be most effective for meeting this low-latency requirement?
- Use a single "wide" table with a row key of customerID. Store profile data in a profile column family and each new transaction in the transactions column family. Fetch the required data with a single ReadRows request for that customerID, configured to retrieve the 5 most recent cells from the transactions family.
- Use two separate tables: a profiles table and a transactions table. When a purchase occurs, query both tables with the customerID.
- Ingest all data into BigQuery and use a scheduled query to regularly export the latest transaction data for each user into a Bigtable table.
- Use a single "tall" table with a row key of customerID#transactionID. To get the data, perform a scan for the customerID prefix and apply a limit(5) filter.
Correct answer: ✅
Use a single “wide” table with a row key of customerID. Store profile data in a profile column family and each new transaction in the transactions column family. Fetch the required data with a single ReadRows request for that customerID, configured to retrieve the 5 most recent cells from the transactions family.
This approach is the most performant. Using the customerID as the row key allows you to fetch all the necessary data (profile and transactions) in a single, highly optimized ReadRows operation. Bigtable is designed to efficiently retrieve a specified number of the most recent cells (versions) in a column, making it ideal for "latest N" queries like this.
Why this is the most effective design (PDE exam logic)
For a sub-20 ms fraud check, the overriding goal is minimize round trips and scans.
With Cloud Bigtable:
A single-row lookup is the fastest possible access pattern.
A wide-row design lets you colocate all data needed for a customer.
Using column families:
profile → static data (name, address)
transactions → time-ordered transaction cells
Bigtable supports cell versioning, so you can:
Store transactions as multiple versions
Fetch only the 5 most recent cells in one request
👉 Result: one ReadRows call, no scans, no joins, ultra-low latency.
This is a canonical Bigtable serving-layer pattern and a frequent PDE exam answer.
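A hedged sketch of that single read with the Bigtable Python client; the instance, table, and family names are hypothetical, and transactions are assumed to be stored as cell versions in the transactions family.

```python
# Hedged sketch: one read_row call returns the full profile family plus only the
# 5 most recent transaction cells, via a union of two row filters.
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("fraud-detection").table("customers")

fraud_check_filter = row_filters.RowFilterUnion(filters=[
    row_filters.FamilyNameRegexFilter("profile"),       # all static profile data
    row_filters.RowFilterChain(filters=[
        row_filters.FamilyNameRegexFilter("transactions"),
        row_filters.CellsColumnLimitFilter(5),           # newest 5 cell versions
    ]),
])

row = table.read_row(b"customer_42", filter_=fraud_check_filter)
profile = row.cells["profile"]
recent_transactions = row.cells["transactions"]
```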
Why the other options are not optimal ❌
❌ Two separate tables (profiles + transactions)
Requires two network calls
Increases latency and failure points
Risky for a hard 20 ms SLA
❌ Ingest into BigQuery and export back to Bigtable
BigQuery is an analytics (OLAP) system, not a low-latency serving store
Adds unnecessary pipeline complexity and latency
Not suitable for real-time fraud checks
❌ Tall table with customerID#transactionID and prefix scan
Requires a row scan, even with a limit
Scans are slower and less predictable than direct row reads
Inferior to a single-row fetch for latency-critical paths