19. Is it possible to mix elements with different schemas in PCollections inside a single Beam pipeline? (Select the two correct answers.)
Not possible within the same PCollection
Yes, but only across different PCollections
Not at all
Yes, in all scenarios
Correct answers (select TWO): ✅
✔ Yes, but only across different PCollections
✔ Not possible within the same PCollection
Explanation (Beam schema model — carefully verified)
In Apache Beam, schemas are enforced at the PCollection level.
🔹 Key rule
All elements within a single schema-aware PCollection must conform to the same schema
You cannot mix elements with different schemas inside the same PCollection
Why these two are correct ✅
✔ Yes, but only across different PCollections
A Beam pipeline can contain multiple PCollections
Each PCollection can have:
A different schema
Or no schema at all
Mixing schemas across PCollections is perfectly valid
✔ Not possible within the same PCollection
Schema consistency is mandatory within one PCollection
Mixed-schema elements would violate Beam’s type and schema guarantees
Why the other options are incorrect ❌
❌ Yes, in all scenarios
Incorrect — schema mixing is restricted by PCollection boundaries
❌ Not at all
Incorrect — schema mixing is allowed across different PCollections
🧠 Exam memory rule
Schema consistency is enforced per PCollection, not per pipeline
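A minimal Python SDK sketch of this rule (field names are invented for illustration): two PCollections in one pipeline, each with its own schema inferred from beam.Row.
```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Each PCollection carries one schema; the two schemas differ across PCollections.
    users = p | 'Users' >> beam.Create([
        beam.Row(user_id=1, name='Ada'),
        beam.Row(user_id=2, name='Lin'),
    ])
    orders = p | 'Orders' >> beam.Create([
        beam.Row(order_id='o-1', amount=9.99),
    ])
    # Within `users`, every element must conform to the (user_id, name) schema;
    # adding an order-shaped Row to `users` would break schema consistency.
```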
20. Which of the following element types can be encoded as a schema from a PCollection? (Select ALL that apply.)
Protobuf objects
Byte string objects
A single list of JSON objects
Avro objects
✅ Correct answers (select ALL that apply):
✔ Avro objects
✔ Byte string objects
✔ A single list of JSON objects
✔ Protobuf objects
Why all four are correct (Beam-accurate, reconciled)
In Apache Beam, a PCollection element can be schema-encoded if Beam can represent it using its schema type system (either directly or via supported container/primitive mappings).
Let’s go one by one.
✅ Avro objects
Avro is explicitly schema-based
Beam has native Avro schema providers
Fully supported for Beam SQL and schema transforms
✔ Correct
✅ Protobuf objects
Protobuf messages have a strongly defined schema
Beam automatically maps Protobuf fields to Beam schemas
✔ Correct
✅ Byte string objects
Beam schemas support primitive field types, including BYTES
A PCollection<byte[]> or ByteString can be schema-encoded
This is explicitly supported in Beam’s schema type system
✔ Correct
✅ A single list of JSON objects
This is the subtle one.
Beam schemas support nested and collection types (ARRAY, MAP, ROW)
A list of JSON objects with consistent structure can be represented as:
ARRAY<ROW<…>>
Beam does not require the source format to be “schema-native” (like Avro)
As long as the structure is consistent, it can be encoded as a schema
✔ Correct
⚠️ Important distinction:
This does not mean “any arbitrary JSON is magically inferred”, but that a list of JSON objects can be schema-encoded in Beam’s model.
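A hedged Python illustration of BYTES and ARRAY<ROW> fields, using the documented NamedTuple-plus-RowCoder pattern (the Event/Item types are invented for this example; Avro and Protobuf classes instead get their schemas from Beam's schema providers):
```python
import typing
import apache_beam as beam
from apache_beam import coders

class Item(typing.NamedTuple):   # nested ROW
    sku: str
    qty: int

class Event(typing.NamedTuple):
    payload: bytes               # BYTES field
    items: typing.List[Item]     # ARRAY<ROW<sku, qty>>, e.g. consistently structured JSON objects

coders.registry.register_coder(Item, coders.RowCoder)
coders.registry.register_coder(Event, coders.RowCoder)

with beam.Pipeline() as p:
    events = p | beam.Create([
        Event(payload=b'\x00\x01', items=[Item('a1', 2), Item('b7', 1)]),
    ]).with_output_types(Event)
```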
21. What is the use case for timers in Beam’s State & Timers API?
You can use timers instead of state variables to do timely aggregations.
Timers are used in combination with state variables to ensure that the state is cleared at regular time intervals.
Correct answer: ✅
Timers are used in combination with state variables to ensure that the state is cleared at regular time intervals.
Explanation (Beam State & Timers — carefully reverified)
In Apache Beam, timers are not a replacement for state. Instead, they are designed to work together with state.
✅ Correct use case
State holds intermediate data (counts, aggregates, session info, etc.)
Timers:
Fire at a specific event-time or processing-time
Trigger logic such as:
Emitting results
Cleaning up state
Closing sessions
Handling inactivity timeouts
A very common pattern is:
“Accumulate data in state → set a timer → when the timer fires, emit results and clear state.”
Why the other option is incorrect ❌
❌ “You can use timers instead of state variables to do timely aggregations.”
Timers do not store data
They only trigger callbacks (onTimer)
Aggregations require state to hold intermediate results
Timers alone are useless without state
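A minimal Python sketch of the "accumulate in state, set a timer, emit and clear" pattern (the 60-second interval and the summing logic are arbitrary choices for illustration; stateful DoFns require keyed input):
```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.utils.timestamp import Duration, Timestamp

class BufferThenFlushFn(beam.DoFn):
    BUFFER = BagStateSpec('buffer', VarIntCoder())     # state: holds intermediate values
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)   # processing-time timer

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element                                 # input is a (key, value) pair
        buffer.add(value)
        flush.set(Timestamp.now() + Duration(seconds=60))  # fire ~60s from now (resets on each element)

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())                           # emit the aggregate
        buffer.clear()                                     # and clear the state
```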
22. Can you do aggregations with ParDo?
You can do aggregations using state variables in a DoFn.
No, you cannot do any type of aggregations with ParDo.
Correct answer: ✅
You can do aggregations using state variables in a DoFn.
Explanation (Beam–accurate, reverified)
In Apache Beam, a ParDo by itself is a per-element transform and stateless by default.
However, when you use State & Timers API inside a DoFn, you can perform aggregations.
How it works:
State variables store intermediate aggregation results (counts, sums, lists, etc.)
Timers control when to emit results or clear state
This enables:
Streaming aggregations
Session-based logic
Custom window-like behavior
So aggregations are possible, but only with state.
Why the other option is incorrect ❌
❌ “No, you cannot do any type of aggregations with ParDo.”
Incorrect.
While ParDo alone is not an aggregation transform,
ParDo + state absolutely can aggregate.
Beam even documents this as an advanced but valid pattern.
Important exam nuance 🧠
Preferred / standard aggregations → GroupByKey, Combine, CombinePerKey
Custom / advanced aggregations → ParDo with State & Timers
If the question asks “Can you do aggregations?”, the correct answer is YES, using state.
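As a rough sketch of ParDo-based aggregation with state (Python; the per-key running count is just an example, and the input must be keyed):
```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class RunningCountFn(beam.DoFn):
    # State variable that keeps a per-key count across elements.
    COUNT = CombiningValueStateSpec('count', VarIntCoder(), combine_fn=sum)

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element            # element is a (key, value) pair
        count.add(1)
        yield key, count.read()     # emit the running total seen so far for this key
```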
23. Which functions of the DoFn lifecycle are recommended to be used for micro-batching?
startBundle and finishBundle
init and destroy
setup and teardown
Correct answer: ✅
startBundle and finishBundle
Explanation (Beam DoFn lifecycle — reverified)
In Apache Beam, micro-batching refers to processing elements in small bundles rather than strictly one-at-a-time.
The lifecycle methods specifically designed for bundle-level logic are:
✅ startBundle()
Called once per bundle, before processing elements
Ideal for:
Initializing per-bundle buffers
Opening batch connections
Preparing in-memory accumulators
✅ finishBundle()
Called once per bundle, after all elements are processed
Ideal for:
Flushing buffered writes
Emitting aggregated results
Cleaning up bundle-level resources
Together, these methods enable efficient micro-batching.
Why the other options are incorrect ❌
❌ init and destroy
These are not DoFn lifecycle methods in Beam
❌ setup and teardown
These run once per DoFn instance, not per bundle
Intended for:
Long-lived resources (clients, connections)
❌ Not suitable for micro-batching logic
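A minimal micro-batching sketch in the Python SDK, where the hooks are named start_bundle/finish_bundle (the Java SDK uses @StartBundle/@FinishBundle). The batch size and the write call are placeholders:
```python
import apache_beam as beam

class MicroBatchWriteFn(beam.DoFn):
    BATCH_SIZE = 100                      # hypothetical flush threshold

    def start_bundle(self):
        self._batch = []                  # per-bundle buffer

    def process(self, element):
        self._batch.append(element)
        if len(self._batch) >= self.BATCH_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()                     # flush whatever is still buffered for this bundle

    def _flush(self):
        if self._batch:
            # write_batch(...) stands in for a single batched call to an external system
            # write_batch(self._batch)
            self._batch = []
```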
24. Choose all the applicable options:
If your pipelines interact with external systems,
Testing external systems against peak volume is not important.
External System doesn’t impact performance of a Dataflow pipeline as they are run outside the Dataflow environment.
Not provisioning external systems appropriately may impact the performance of your pipeline due to back pressure.
It is important to provision those external systems appropriately (i.e., to handle peak volumes).
Correct selections (✔):
✅ Not provisioning external systems appropriately may impact the performance of your pipeline due to back pressure.
✅ It is important to provision those external systems appropriately (i.e., to handle peak volumes).
Explanation (Beam / Dataflow–accurate)
In Google Cloud Dataflow (and Apache Beam in general), pipelines often read from or write to external systems such as databases, APIs, message queues, or storage services.
✅ Back pressure is real
If an external sink/source (e.g., database, REST API) is slow or under-provisioned:
Writes block
Queues build up
Workers slow down or idle
This causes back pressure, which directly reduces pipeline throughput and increases latency.
✅ Proper provisioning is critical
External systems must be sized to:
Handle peak throughput
Support parallel requests from multiple workers
This is a standard production best practice for Dataflow pipelines.
Why the remaining option is incorrect ❌
❌ External System doesn’t impact performance of a Dataflow pipeline as they are run outside the Dataflow environment.
Incorrect.
Even though external systems run outside Dataflow:
Dataflow workers depend on them
Slow external systems directly slow the pipeline
This is a very common exam trap.
25. What is the recommended way to convert JSON objects to POJOs?
Use JsonToRow
Use JsonToPOJO
✅ Correct answer:
Use JsonToRow
Why this is the correct answer (Beam-accurate)
In Apache Beam, there is no official or recommended transform called JsonToPOJO.
The recommended pattern to convert JSON into strongly typed objects (POJOs) is:
Convert JSON → Row using JsonToRow
Convert Row → POJO using Beam's schema-aware conversion (e.g. Convert.to(MyPojo.class) in the Java SDK)
This aligns with Beam’s schema-first design.
Why Beam recommends this approach
Row is Beam’s schema-native representation
Enables:
Beam SQL
Schema-aware transforms
Cross-language portability
POJOs are derived from schemas, not directly from JSON
Why JsonToPOJO is wrong ❌
❌ There is no standard Beam transform named JsonToPOJO
❌ Not part of Apache Beam SDK
❌ Not recommended in Beam documentation or training
This option is a decoy / trap answer, commonly used in quizzes.
26. Which two of the following interfaces support Calcite SQL?
Dataflow template
Beam SQL client
Dataflow SQL
Correct answers (select TWO): ✅
✔ Beam SQL client
✔ Dataflow template (when the template packages a Beam SQL-based pipeline)
Explanation (Beam SQL / Dataflow)
Beam SQL client: Beam SQL's default dialect is Beam Calcite SQL, i.e. it is based on Apache Calcite.
Dataflow template: Templates are a deployment mechanism; a template built from a Beam SQL pipeline runs the Calcite-compiled SQL as a Dataflow job. Google/Beam training content includes SQL-authored pipelines packaged this way.
Why Dataflow SQL is the wrong pick ❌
Dataflow SQL used a variant of ZetaSQL rather than Beam Calcite SQL, and the feature has been discontinued:
Console access ended July 31, 2024
CLI usage ended January 31, 2025
So Dataflow SQL should not be selected as an interface that supports Calcite SQL today.
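To show what "Beam Calcite SQL" looks like in practice, here is a small hedged example using the Python SqlTransform (a cross-language transform that needs a Java environment available for expansion; the field names and query are invented):
```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as p:
    filtered = (
        p
        | beam.Create([beam.Row(item='apple', qty=2), beam.Row(item='pear', qty=5)])
        # Beam Calcite SQL dialect; PCOLLECTION refers to the single input PCollection.
        | SqlTransform("SELECT item, qty FROM PCOLLECTION WHERE qty > 3"))
```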
27. What operations can you do in standard Pandas DataFrames that are not possible in Beam DataFrames?
Write the DataFrame columns as rows
Compute two different aggregates based on the input data
Shift the DataFrame
✅ Correct answer:
Shift the DataFrame
Why only this one is correct
✅ Shift the DataFrame
df.shift() depends on row order and neighboring rows
Apache Beam DataFrames:
Do not guarantee row order
Do not support row-relative operations
Therefore, shift is NOT possible in Beam DataFrames
✔ Correct
Why the others are NOT correct ❌
❌ Write the DataFrame columns as rows
This describes reshaping columns into rows (wide-to-long, e.g. df.melt())
Row-wise reshaping can be expressed as a distributed transform, so Beam DataFrames support it
It is not order-dependent the way shift is
❌ Compute two different aggregates based on the input data
Fully supported in Beam DataFrames
Example:
df.groupby("key").agg({"col1": "sum", "col2": "mean"})
Aggregations are core distributed operations
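A short Beam DataFrames sketch contrasting the two cases (illustrative data; the WontImplementError comment reflects how Beam DataFrames reject order-sensitive operations):
```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe

with beam.Pipeline() as p:
    pcoll = p | beam.Create([
        beam.Row(key='a', col1=1.0, col2=2.0),
        beam.Row(key='b', col1=3.0, col2=4.0),
    ])
    df = to_dataframe(pcoll)

    # Supported: multiple aggregates computed as a distributed group-by.
    agg = df.groupby('key').agg({'col1': 'sum', 'col2': 'mean'})

    # Not supported: order-sensitive operations such as shift.
    # df.shift(1)   # raises WontImplementError, since row order is not defined
```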
28. Which one of these statements is true?
You can use the option include_window_info from ib.show to get extra metadata about each element in a PCollection.
When using the interactive runner, you have to create a logging DoFn to see the values of an intermediate PCollection.
When using the interactive runner, if you want to play with the values from a PCollection within a dataframe, you must access them from within a DoFn.
Correct answer: ✅
You can use the option include_window_info from ib.show to get extra metadata about each element in a PCollection.
Correct. ib.show(windowed_word_counts, include_window_info=True) can be used.
Explanation (Apache Beam Interactive Runner — reverified)
In Apache Beam, when working with the Interactive Runner (for example in notebooks), you can inspect intermediate PCollections using:
ib.show(pcoll, include_window_info=True)
This option:
Displays windowing metadata
Includes:
Window assignment
Pane info
Timestamps
Is extremely useful when debugging windowing, triggers, and late data
✔ This statement is true
Why the other statements are false ❌
❌ “When using the interactive runner, you have to create a logging DoFn to see the values of an intermediate PCollection.”
False.
The Interactive Runner is explicitly designed so you do NOT need logging DoFns.
You can directly inspect intermediate PCollections with ib.show().
❌ “When using the interactive runner, if you want to play with the values from a PCollection within a dataframe, you must access them from within a DoFn.”
False.
You can convert a PCollection to a DataFrame using Beam DataFrames and interact with it outside a DoFn.
That’s one of the main benefits of the Interactive Runner.
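A small notebook-style sketch (the word-count data is invented) showing both inspection with include_window_info and pulling a PCollection into a pandas DataFrame outside any DoFn:
```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())
counts = (p
          | beam.Create(['cat', 'dog', 'cat'])
          | beam.combiners.Count.PerElement())

ib.show(counts, include_window_info=True)   # adds window, pane, and timestamp columns
df = ib.collect(counts)                     # pandas DataFrame, usable directly in the notebook
```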
29. Which two of the following statements are true about using the interactive runner?
You can limit the number of elements the interactive runner records from an unbounded source by setting the recording_element_count option.
You can limit the amount of time the interactive runner records data from an unbounded source by using the recording_duration option.
You can limit the amount of data the interactive runner records from an unbounded source by setting recording_size_limit.
Correct answers (select TWO): ✅
✔ You can limit the amount of time the interactive runner records data from an unbounded source by using the recording_duration option.
You can set this limit. Make use of this value when you are not dealing with sources that have large volumes.
✔ You can limit the amount of data the interactive runner records from an unbounded source by setting recording_size_limit.
You can set this limit. Use it when dealing with sources that have large volumes, to ensure that your notebook does not read too much data.
Explanation (Apache Beam Interactive Runner — carefully reverified)
In Apache Beam, the Interactive Runner is designed for exploration and debugging, especially in notebooks. For unbounded sources, Beam provides explicit controls to avoid unbounded data capture.
✅ recording_duration
Limits how long data is recorded from an unbounded source
Example: record only the first 30 seconds of a stream
Very commonly used during interactive development
✅ recording_size_limit
Limits how much data (bytes) is recorded
Prevents excessive memory or disk usage
Useful when element sizes vary
Why the remaining option is incorrect ❌
❌ recording_element_count
There is no supported option by this name in the Interactive Runner
Beam does not provide an element-count–based recording limit
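These limits are set on ib.options before recording; a minimal sketch (the specific values are arbitrary):
```python
import apache_beam.runners.interactive.interactive_beam as ib

# Record at most 60 seconds of data from unbounded sources...
ib.options.recording_duration = '60s'
# ...and never more than ~1 GB in total.
ib.options.recording_size_limit = 1e9
```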