19. Is it possible to mix elements with different schemas in PCollections inside a single Beam pipeline? (Select the two correct answers.)
Not possible within the same PCollection
Yes, but only across different PCollections
Not at all
Yes, in all scenarios
Correct answers (select TWO): ✅
✔ Yes, but only across different PCollections
✔ Not possible within the same PCollection
Explanation (Beam schema model — carefully verified)
In Apache Beam, schemas are enforced at the PCollection level.
🔹 Key rule
All elements within a single schema-aware PCollection must conform to the same schema
You cannot mix elements with different schemas inside the same PCollection
Why these two are correct ✅
✔ Yes, but only across different PCollections
A Beam pipeline can contain multiple PCollections
Each PCollection can have:
A different schema
Or no schema at all
Mixing schemas across PCollections is perfectly valid
✔ Not possible within the same PCollection
Schema consistency is mandatory within one PCollection
Mixed-schema elements would violate Beam’s type and schema guarantees
Why the other options are incorrect ❌
❌ Yes, in all scenarios
Incorrect — schema mixing is restricted by PCollection boundaries
❌ Not at all
Incorrect — schema mixing is allowed across different PCollections
🧠 Exam memory rule
Schema consistency is enforced per PCollection, not per pipeline
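A minimal Python SDK sketch of this rule (field names are invented for illustration): two PCollections in one pipeline, each with its own schema inferred from beam.Row.
```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Each PCollection carries one schema; the two schemas differ across PCollections.
    users = p | 'Users' >> beam.Create([
        beam.Row(user_id=1, name='Ada'),
        beam.Row(user_id=2, name='Lin'),
    ])
    orders = p | 'Orders' >> beam.Create([
        beam.Row(order_id='o-1', amount=9.99),
    ])
    # Within `users`, every element must conform to the (user_id, name) schema;
    # adding an order-shaped Row to `users` would break schema consistency.
```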
20. Which of the following element types can be encoded as a schema from a PCollection? (Select ALL that apply.)
Protobuf objects
Byte string objects
A single list of JSON objects
Avro objects
✅ Correct answers (select ALL that apply):
✔ Avro objects
✔ Byte string objects
✔ A single list of JSON objects
✔ Protobuf objects
Why all four are correct (Beam-accurate, reconciled)
In Apache Beam, a PCollection element can be schema-encoded if Beam can represent it using its schema type system (either directly or via supported container/primitive mappings).
Let’s go one by one.
✅ Avro objects
Avro is explicitly schema-based
Beam has native Avro schema providers
Fully supported for Beam SQL and schema transforms
✔ Correct
✅ Protobuf objects
Protobuf messages have a strongly defined schema
Beam automatically maps Protobuf fields to Beam schemas
✔ Correct
✅ Byte string objects
Beam schemas support primitive field types, including BYTES
A PCollection<byte[]> or ByteString can be schema-encoded
This is explicitly supported in Beam’s schema type system
✔ Correct
✅ A single list of JSON objects
This is the subtle one.
Beam schemas support nested and collection types (ARRAY, MAP, ROW)
A list of JSON objects with consistent structure can be represented as:
ARRAY<ROW<…>>
Beam does not require the source format to be “schema-native” (like Avro)
As long as the structure is consistent, it can be encoded as a schema
✔ Correct
⚠️ Important distinction:
This does not mean “any arbitrary JSON is magically inferred”, but that a list of JSON objects can be schema-encoded in Beam’s model.
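A hedged Python illustration of BYTES and ARRAY<ROW> fields, using the documented NamedTuple-plus-RowCoder pattern (the Event/Item types are invented for this example; Avro and Protobuf classes instead get their schemas from Beam's schema providers):
```python
import typing
import apache_beam as beam
from apache_beam import coders

class Item(typing.NamedTuple):   # nested ROW
    sku: str
    qty: int

class Event(typing.NamedTuple):
    payload: bytes               # BYTES field
    items: typing.List[Item]     # ARRAY<ROW<sku, qty>>, e.g. consistently structured JSON objects

coders.registry.register_coder(Item, coders.RowCoder)
coders.registry.register_coder(Event, coders.RowCoder)

with beam.Pipeline() as p:
    events = p | beam.Create([
        Event(payload=b'\x00\x01', items=[Item('a1', 2), Item('b7', 1)]),
    ]).with_output_types(Event)
```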
21. What is the use case for timers in Beam’s State & Timers API?
You can use timers instead of state variables to do timely aggregations.
Timers are used in combination with state variables to ensure that the state is cleared at regular time intervals.
Correct answer: ✅
Timers are used in combination with state variables to ensure that the state is cleared at regular time intervals.
Explanation (Beam State & Timers — carefully reverified)
In Apache Beam, timers are not a replacement for state. Instead, they are designed to work together with state.
✅ Correct use case
State holds intermediate data (counts, aggregates, session info, etc.)
Timers:
Fire at a specific event-time or processing-time
Trigger logic such as:
Emitting results
Cleaning up state
Closing sessions
Handling inactivity timeouts
A very common pattern is:
“Accumulate data in state → set a timer → when the timer fires, emit results and clear state.”
Why the other option is incorrect ❌
❌ “You can use timers instead of state variables to do timely aggregations.”
Timers do not store data
They only trigger callbacks (onTimer)
Aggregations require state to hold intermediate results
Timers alone are useless without state
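A minimal Python sketch of the "accumulate in state, set a timer, emit and clear" pattern (the 60-second interval and the summing logic are arbitrary choices for illustration; stateful DoFns require keyed input):
```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.utils.timestamp import Duration, Timestamp

class BufferThenFlushFn(beam.DoFn):
    BUFFER = BagStateSpec('buffer', VarIntCoder())     # state: holds intermediate values
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)   # processing-time timer

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element                                 # input is a (key, value) pair
        buffer.add(value)
        flush.set(Timestamp.now() + Duration(seconds=60))  # fire ~60s from now (resets on each element)

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())                           # emit the aggregate
        buffer.clear()                                     # and clear the state
```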
22. Can you do aggregations with ParDo?
You can do aggregations using state variables in a DoFn.
No, you cannot do any type of aggregations with ParDo.
Correct answer: ✅
You can do aggregations using state variables in a DoFn.
Explanation (Beam–accurate, reverified)
In Apache Beam, a ParDo by itself is a per-element transform and stateless by default.
However, when you use State & Timers API inside a DoFn, you can perform aggregations.
How it works:
State variables store intermediate aggregation results (counts, sums, lists, etc.)
Timers control when to emit results or clear state
This enables:
Streaming aggregations
Session-based logic
Custom window-like behavior
So aggregations are possible, but only with state.
Why the other option is incorrect ❌
❌ “No, you cannot do any type of aggregations with ParDo.”
Incorrect.
While ParDo alone is not an aggregation transform,
ParDo + state absolutely can aggregate.
Beam even documents this as an advanced but valid pattern.
Important exam nuance 🧠
Preferred / standard aggregations → GroupByKey, Combine, CombinePerKey
Custom / advanced aggregations → ParDo with State & Timers
If the question asks “Can you do aggregations?”, the correct answer is YES, using state.
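As a rough sketch of ParDo-based aggregation with state (Python; the per-key running count is just an example, and the input must be keyed):
```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class RunningCountFn(beam.DoFn):
    # State variable that keeps a per-key count across elements.
    COUNT = CombiningValueStateSpec('count', VarIntCoder(), combine_fn=sum)

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element            # element is a (key, value) pair
        count.add(1)
        yield key, count.read()     # emit the running total seen so far for this key
```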
23. Which functions of the DoFn lifecycle are recommended to be used for micro-batching?
startBundle and finishBundle
init and destroy
setup and teardown
Correct answer: ✅
startBundle and finishBundle
Explanation (Beam DoFn lifecycle — reverified)
In Apache Beam, micro-batching refers to processing elements in small bundles rather than strictly one-at-a-time.
The lifecycle methods specifically designed for bundle-level logic are:
✅ startBundle()
Called once per bundle, before processing elements
Ideal for:
Initializing per-bundle buffers
Opening batch connections
Preparing in-memory accumulators
✅ finishBundle()
Called once per bundle, after all elements are processed
Ideal for:
Flushing buffered writes
Emitting aggregated results
Cleaning up bundle-level resources
Together, these methods enable efficient micro-batching.
Why the other options are incorrect ❌
❌ init and destroy
These are not DoFn lifecycle methods in Beam
❌ setup and teardown
These run once per DoFn instance, not per bundle
Intended for:
Long-lived resources (clients, connections)
❌ Not suitable for micro-batching logic
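A minimal micro-batching sketch in the Python SDK, where the hooks are named start_bundle/finish_bundle (the Java SDK uses @StartBundle/@FinishBundle). The batch size and the write call are placeholders:
```python
import apache_beam as beam

class MicroBatchWriteFn(beam.DoFn):
    BATCH_SIZE = 100                      # hypothetical flush threshold

    def start_bundle(self):
        self._batch = []                  # per-bundle buffer

    def process(self, element):
        self._batch.append(element)
        if len(self._batch) >= self.BATCH_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()                     # flush whatever is still buffered for this bundle

    def _flush(self):
        if self._batch:
            # write_batch(...) stands in for a single batched call to an external system
            # write_batch(self._batch)
            self._batch = []
```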
24. Choose all the applicable options:
If your pipelines interact with external systems,
Testing external systems against peak volume is not important.
External System doesn’t impact performance of a Dataflow pipeline as they are run outside the Dataflow environment.
Not provisioning external systems appropriately may impact the performance of your pipeline due to back pressure.
It is important to provision those external systems appropriately (i.e., to handle peak volumes).
Correct selections (✔):
✅ Not provisioning external systems appropriately may impact the performance of your pipeline due to back pressure.
✅ It is important to provision those external systems appropriately (i.e., to handle peak volumes).
Explanation (Beam / Dataflow–accurate)
In Google Cloud Dataflow (and Apache Beam in general), pipelines often read from or write to external systems such as databases, APIs, message queues, or storage services.
✅ Back pressure is real
If an external sink/source (e.g., database, REST API) is slow or under-provisioned:
Writes block
Queues build up
Workers slow down or idle
This causes back pressure, which directly reduces pipeline throughput and increases latency.
✅ Proper provisioning is critical
External systems must be sized to:
Handle peak throughput
Support parallel requests from multiple workers
This is a standard production best practice for Dataflow pipelines.
Why the remaining option is incorrect ❌
❌ External System doesn’t impact performance of a Dataflow pipeline as they are run outside the Dataflow environment.
Incorrect.
Even though external systems run outside Dataflow:
Dataflow workers depend on them
Slow external systems directly slow the pipeline
This is a very common exam trap.
25. What is the recommended way to convert JSON objects to POJOs?
Use JsonToRow
Use JsonToPOJO
✅ Correct answer:
Use JsonToRow
Why this is the correct answer (Beam-accurate)
In Apache Beam, there is no official or recommended transform called JsonToPOJO.
The recommended pattern to convert JSON into strongly typed objects (POJOs) is:
Convert JSON → Row using JsonToRow
Convert Row → POJO using Beam's schema-aware conversion (e.g. Convert.to(MyPojo.class) in the Java SDK)
This aligns with Beam’s schema-first design.
Why Beam recommends this approach
Row is Beam’s schema-native representation
Enables:
Beam SQL
Schema-aware transforms
Cross-language portability
POJOs are derived from schemas, not directly from JSON
Why JsonToPOJO is wrong ❌
❌ There is no standard Beam transform named JsonToPOJO
❌ Not part of Apache Beam SDK
❌ Not recommended in Beam documentation or training
This option is a decoy / trap answer, commonly used in quizzes.
26. Which two of the following interfaces support Calcite SQL?
Dataflow template
Beam SQL client
Dataflow SQL
Correct answers (select TWO): ✅
✔ Beam SQL client
✔ Dataflow template (when the template packages a Beam SQL-based pipeline)
Explanation (Beam SQL / Dataflow)
Beam SQL client: Beam SQL's default dialect is Beam Calcite SQL, i.e. it is based on Apache Calcite.
Dataflow template: Templates are a deployment mechanism; a template built from a Beam SQL pipeline runs the Calcite-compiled SQL as a Dataflow job. Google/Beam training content includes SQL-authored pipelines packaged this way.
Why Dataflow SQL is the wrong pick ❌
Dataflow SQL used a variant of ZetaSQL rather than Beam Calcite SQL, and the feature has been discontinued:
Console access ended July 31, 2024
CLI usage ended January 31, 2025
So Dataflow SQL should not be selected as an interface that supports Calcite SQL today.
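To show what "Beam Calcite SQL" looks like in practice, here is a small hedged example using the Python SqlTransform (a cross-language transform that needs a Java environment available for expansion; the field names and query are invented):
```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as p:
    filtered = (
        p
        | beam.Create([beam.Row(item='apple', qty=2), beam.Row(item='pear', qty=5)])
        # Beam Calcite SQL dialect; PCOLLECTION refers to the single input PCollection.
        | SqlTransform("SELECT item, qty FROM PCOLLECTION WHERE qty > 3"))
```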
27. What operations can you do in standard Pandas DataFrames that are not possible in Beam DataFrames?
Write the DataFrame columns as rows
Compute two different aggregates based on the input data
Shift the DataFrame
✅ Correct answer:
Shift the DataFrame
Why only this one is correct
✅ Shift the DataFrame
df.shift() depends on row order and neighboring rows
Apache Beam DataFrames:
Do not guarantee row order
Do not support row-relative operations
Therefore, shift is NOT possible in Beam DataFrames
✔ Correct
Why the others are NOT correct ❌
❌ Write the DataFrame columns as rows
This describes reshaping columns into rows (wide-to-long, e.g. df.melt())
Row-wise reshaping can be expressed as a distributed transform, so Beam DataFrames support it
It is not order-dependent the way shift is
❌ Compute two different aggregates based on the input data
Fully supported in Beam DataFrames
Example:
df.groupby("key").agg({"col1": "sum", "col2": "mean"})
Aggregations are core distributed operations
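A short Beam DataFrames sketch contrasting the two cases (illustrative data; the WontImplementError comment reflects how Beam DataFrames reject order-sensitive operations):
```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe

with beam.Pipeline() as p:
    pcoll = p | beam.Create([
        beam.Row(key='a', col1=1.0, col2=2.0),
        beam.Row(key='b', col1=3.0, col2=4.0),
    ])
    df = to_dataframe(pcoll)

    # Supported: multiple aggregates computed as a distributed group-by.
    agg = df.groupby('key').agg({'col1': 'sum', 'col2': 'mean'})

    # Not supported: order-sensitive operations such as shift.
    # df.shift(1)   # raises WontImplementError, since row order is not defined
```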
28. Which one of these statements is true?
You can use the option include_window_info from ib.show to get extra metadata about each element in a PCollection.
When using the interactive runner, you have to create a logging DoFn to see the values of an intermediate PCollection.
When using the interactive runner, if you want to play with the values from a PCollection within a dataframe, you must access them from within a DoFn.
Correct answer: ✅
You can use the option include_window_info from ib.show to get extra metadata about each element in a PCollection.
Correct. ib.show(windowed_word_counts, include_window_info=True) can be used.
Explanation (Apache Beam Interactive Runner — reverified)
In Apache Beam, when working with the Interactive Runner (for example in notebooks), you can inspect intermediate PCollections using:
ib.show(pcoll, include_window_info=True)
This option:
Displays windowing metadata
Includes:
Window assignment
Pane info
Timestamps
Is extremely useful when debugging windowing, triggers, and late data
✔ This statement is true
Why the other statements are false ❌
❌ “When using the interactive runner, you have to create a logging DoFn to see the values of an intermediate PCollection.”
False.
The Interactive Runner is explicitly designed so you do NOT need logging DoFns.
You can directly inspect intermediate PCollections with ib.show().
❌ “When using the interactive runner, if you want to play with the values from a PCollection within a dataframe, you must access them from within a DoFn.”
False.
You can convert a PCollection to a DataFrame using Beam DataFrames and interact with it outside a DoFn.
That’s one of the main benefits of the Interactive Runner.
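A small notebook-style sketch (the word-count data is invented) showing both inspection with include_window_info and pulling a PCollection into a pandas DataFrame outside any DoFn:
```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())
counts = (p
          | beam.Create(['cat', 'dog', 'cat'])
          | beam.combiners.Count.PerElement())

ib.show(counts, include_window_info=True)   # adds window, pane, and timestamp columns
df = ib.collect(counts)                     # pandas DataFrame, usable directly in the notebook
```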
29. Which two of the following statements are true about using the interactive runner?
You can limit the number of elements the interactive runner records from an unbounded source by setting the recording_element_count option.
You can limit the amount of time the interactive runner records data from an unbounded source by using the recording_duration option.
You can limit the amount of data the interactive runner records from an unbounded source by setting recording_size_limit.
Correct answers (select TWO): ✅
✔ You can limit the amount of time the interactive runner records data from an unbounded source by using the recording_duration option.
You can set this limit. Make use of this value when you are not dealing with sources that have large volumes.
✔ You can limit the amount of data the interactive runner records from an unbounded source by setting recording_size_limit.
You can set this limit. Use it when dealing with sources that have large volumes, to ensure that your notebook does not read too much data.
Explanation (Apache Beam Interactive Runner — carefully reverified)
In Apache Beam, the Interactive Runner is designed for exploration and debugging, especially in notebooks. For unbounded sources, Beam provides explicit controls to avoid unbounded data capture.
✅ recording_duration
Limits how long data is recorded from an unbounded source
Example: record only the first 30 seconds of a stream
Very commonly used during interactive development
✅ recording_size_limit
Limits how much data (bytes) is recorded
Prevents excessive memory or disk usage
Useful when element sizes vary
Why the remaining option is incorrect ❌
❌ recording_element_count
There is no supported option by this name in the Interactive Runner
Beam does not provide an element-count–based recording limit
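These limits are set on ib.options before recording; a minimal sketch (the specific values are arbitrary):
```python
import apache_beam.runners.interactive.interactive_beam as ib

# Record at most 60 seconds of data from unbounded sources...
ib.options.recording_duration = '60s'
# ...and never more than ~1 GB in total.
ib.options.recording_size_limit = 1e9
```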