1. You need to choose a data storage solution to support a transactional system. Your customers are primarily based in one region. You want to reduce your administration tasks and focus engineering effort on building your business application. What should you do?
- Create a regional Cloud Storage bucket.
- Use Spanner.
- Install a database of your choice on a Compute Engine VM.
- Use Cloud SQL.
✅ Correct Answer: Use Cloud SQL.
Why Cloud SQL is the best choice
Your requirements are:
- Transactional system (OLTP)
- Customers primarily in one region
- Minimal administration
- Focus engineering effort on application development, not ops

Cloud SQL is a fully managed relational database service designed exactly for this scenario.
What Cloud SQL gives you
- Managed MySQL, PostgreSQL, or SQL Server
- Automatic backups, patching, replication, and high availability
- Strong consistency and ACID transactions
- Regional deployment with low latency
- Minimal operational overhead
This lets your team focus on business logic, not database maintenance.
Why the other options are not appropriate

❌ Cloud Storage (regional bucket)
- Object storage, not transactional
- No ACID transactions or SQL support

❌ Spanner
- Designed for global-scale, multi-region systems
- More complex and expensive than needed

❌ Database on a Compute Engine VM
- High operational burden (patching, backups, scaling)
- Not aligned with "reduce administration tasks"
Exam rule of thumb
Single-region transactional app + low ops → Cloud SQL
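To illustrate the low-ops point, here is a minimal sketch of an application connecting to a Cloud SQL for PostgreSQL instance through the Cloud SQL Python Connector; the instance connection name, user, password, and database are hypothetical placeholders.

```python
# Requires: pip install "cloud-sql-python-connector[pg8000]" sqlalchemy
from google.cloud.sql.connector import Connector
import sqlalchemy

connector = Connector()

def getconn():
    # "my-project:us-central1:orders-db" is a hypothetical instance connection name.
    return connector.connect(
        "my-project:us-central1:orders-db",
        "pg8000",
        user="app-user",
        password="change-me",
        db="orders",
    )

# The application only manages a connection pool; backups, patching,
# replication, and high availability are handled by Cloud SQL.
pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)

with pool.connect() as conn:
    conn.execute(sqlalchemy.text("SELECT 1"))
```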
2. You have data that is ingested daily and frequently analyzed in the first month. Thereafter, the data is retained only for audits, which happen occasionally every few years. You need to configure cost-effective storage. What should you do?
- Configure a lifecycle policy on Cloud Storage.
- Create a bucket on Cloud Storage with Autoclass configured.
- Configure a data retention policy on Cloud Storage.
- Create a bucket on Cloud Storage with object versioning configured.
✅ Correct Answer: Configure a lifecycle policy on Cloud Storage.
Clear and final explanation
Your data access pattern is well-defined and predictable:
- Ingested daily
- Frequently analyzed in the first month
- Rarely accessed afterward (audits every few years)
- Goal: cost-effective storage

When the hot → cold → archive timeline is known, the recommended and exam-expected solution is to use Cloud Storage lifecycle policies.
Why lifecycle policy is correct
A lifecycle policy lets you explicitly control storage costs by automatically transitioning data based on age (see the sketch after this question), for example:
- Standard for the first 30 days (frequent access)
- Coldline or Archive after 30 days (rare access)
- Optional deletion after a long retention period
This provides:
- Precise cost optimization
- Predictable behavior
- Low operational overhead
- Alignment with Google Cloud best practices when access patterns are known
Why the other options are incorrect

❌ Autoclass
- Best when access patterns are unknown or unpredictable; here they are clearly known.

❌ Retention policy
- Enforces immutability; it does not reduce storage cost.

❌ Object versioning
- Increases storage usage and cost; does not optimize for access frequency.
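As referenced above, a minimal sketch of such a lifecycle policy using the google-cloud-storage Python client; the bucket name and the exact ages are hypothetical assumptions for this scenario.

```python
# Requires: pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("cymbal-daily-ingest")  # hypothetical bucket name

# Transition objects to Coldline 30 days after creation,
# then delete them after roughly seven years (assumed audit window).
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_delete_rule(age=7 * 365)
bucket.patch()

for rule in bucket.lifecycle_rules:
    print(rule)
```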
3. A manager at Cymbal Retail expresses concern about unauthorized access to objects in your Cloud Storage bucket. You need to evaluate all access on all objects in the bucket. What should you do?
- Enable and then review the Data Access audit logs.
- Route the Admin Activity logs to a BigQuery sink and analyze the logs with SQL queries.
- Change the permissions on the bucket to only trusted employees.
- Review the Admin Activity audit logs.
✅ Correct Answer: Enable and then review the Data Access audit logs.
Why this is the correct solution
Your concern is unauthorized access to objects in a Cloud Storage bucket.
To evaluate who accessed which objects (read/write/list), you must look at data-level access events.
Data Access audit logs record:
- Object reads
- Object writes
- Object deletions
- Object list operations
These logs provide per-object visibility, which is exactly what you need.
⚠️ Important: Data Access audit logs are NOT enabled by default; you must explicitly enable them first.
Why the other options are incorrect

❌ Review Admin Activity audit logs
- Admin Activity logs track configuration changes (IAM, bucket settings)
- They do NOT show object reads or writes

❌ Route Admin Activity logs to BigQuery
- Still the wrong log type
- You would only see admin actions, not object access

❌ Change permissions to trusted employees
- Preventive, not investigative
- Does not help you evaluate existing or past access
Exam rule of thumb
Need to see who accessed data → Data Access audit logs
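Once Data Access audit logs are enabled for Cloud Storage, reviewing them could look roughly like this sketch with the google-cloud-logging Python client; the bucket name is hypothetical and the payload fields printed are the usual audit-log fields, so treat this as an illustration rather than a definitive query.

```python
# Requires: pip install google-cloud-logging
from google.cloud import logging

client = logging.Client()

# Filter for Cloud Storage Data Access audit log entries on one bucket.
log_filter = (
    'logName:"cloudaudit.googleapis.com%2Fdata_access" '
    'AND resource.type="gcs_bucket" '
    'AND resource.labels.bucket_name="cymbal-retail-objects"'  # hypothetical bucket
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    payload = entry.payload  # AuditLog proto rendered as a dict
    print(
        entry.timestamp,
        payload.get("methodName"),    # e.g. storage.objects.get
        payload.get("resourceName"),  # which object was touched
        payload.get("authenticationInfo", {}).get("principalEmail"),
    )
```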
4. You need to store data long term and use it to create quarterly reports. What storage class should you choose?
- Archive
- Standard
- Coldline
- Nearline
✅ Correct Answer: Coldline
Why Coldline is the best choice
Your requirements are:
- Store data long term
- Access it quarterly (about once every 3 months)
- Optimize storage cost

Cloud Storage Coldline is specifically designed for data that is:
- Accessed roughly once a quarter or less (90-day minimum storage duration)
- Needed occasionally (e.g., quarterly reports)
- Still readily retrievable, with lower access costs than Archive
Coldline offers:
- Much lower storage cost than Standard or Nearline
- Reasonable access costs for infrequent (quarterly) reads
- No long restore delays
Why the other options are incorrect

❌ Standard
- Too expensive for long-term, infrequently accessed data

❌ Nearline
- Optimized for data accessed about once per month, not quarterly

❌ Archive
- Best for data accessed less than once a year
- Higher retrieval cost than Coldline
- Overkill for quarterly reporting
Quick decision table

| Access frequency | Best storage class |
| --- | --- |
| Daily / weekly | Standard |
| Monthly | Nearline |
| Quarterly | Coldline ✅ |
| Less than once a year | Archive |
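A minimal sketch of provisioning a Coldline bucket for the quarterly-report data with the google-cloud-storage Python client; the bucket name and location are hypothetical.

```python
# Requires: pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket for long-term data read about once per quarter.
bucket = storage.Bucket(client, name="cymbal-quarterly-reports")
bucket.storage_class = "COLDLINE"
bucket = client.create_bucket(bucket, location="us-central1")

print(bucket.name, bucket.storage_class)  # cymbal-quarterly-reports COLDLINE
```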
5. You have several large tables in your transaction databases. You need to move all the data to BigQuery for the business analysts to explore and analyze the data. How should you design the schema in BigQuery?
- Combine all the transactional database tables into a single table using outer joins.
- Redesign the schema to normalize the data by removing all redundancies.
- Retain the data on BigQuery with the same schema as the source.
- Redesign the schema to denormalize the data with nested and repeated data.
✅ Correct Answer: Redesign the schema to denormalize the data with nested and repeated data.
Why this is the best practice in BigQuery
BigQuery is a columnar, analytics-first data warehouse, not a traditional OLTP database. It is optimized for:
- Large-scale scans
- Fewer joins
- Analytical queries run by business analysts

Because of this, denormalized schemas with nested and repeated fields are the recommended design pattern.
Benefits of denormalization in BigQuery
- Better query performance (fewer joins)
- Lower cost (less data scanned)
- Simpler SQL for analysts
- Uses BigQuery's strengths: nested (STRUCT) and repeated (ARRAY) fields

Example: an Orders table with repeated line items instead of a separate order_items table (see the schema sketch after this question).
Why the other options are incorrect

❌ Combine all tables into a single table with outer joins
- Creates extremely wide tables
- Difficult to maintain
- Often leads to data duplication and confusion

❌ Normalize the schema
- An OLTP best practice, not an OLAP one
- Causes expensive joins and poor performance in BigQuery

❌ Retain the same schema as the source
- OLTP schemas are usually normalized
- Misses BigQuery optimization opportunities
Exam rule of thumb
BigQuery analytics → Denormalize + nested & repeated fields
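A minimal sketch of the Orders example above, defining a nested and repeated line_items field with the google-cloud-bigquery Python client; the project, dataset, and column names are hypothetical.

```python
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Denormalized orders table: line items are nested (STRUCT) and repeated (ARRAY),
# so analysts do not need to join a separate order_items table.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("order_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField(
        "line_items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.retail.orders", schema=schema)  # hypothetical IDs
client.create_table(table)
```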
6. Cymbal Retail has accumulated a large amount of data. Analysts and leadership are finding it difficult to understand the meaning of the data, such as BigQuery columns. Users of the data don't know who owns what. You need to improve the searchability of the data. What should you do?
- Export the data to Cloud Storage with descriptive file names.
- Create tags for data entries in Data Catalog.
- Rename BigQuery columns with more descriptive names.
- Add a description column corresponding to each data column.
✅ Correct Answer: Create tags for data entries in Data Catalog.
Why this is the right solution
Your core problems are about data discovery and understanding, not data storage or schema design:
- Analysts don't understand what columns mean
- Users don't know who owns which data
- You need to improve searchability across datasets

Data Catalog is Google Cloud's centralized metadata management and discovery service, designed exactly for this use case.
What Data Catalog tags provide
- Business and technical metadata such as data owner, data domain, sensitivity / classification, and column meaning
- Searchable metadata across BigQuery, Cloud Storage, etc.
- A shared, authoritative source of truth about data assets
- No need to modify schemas or duplicate data

This directly answers:
- "What does this column mean?"
- "Who owns this data?"
- "Where is the authoritative dataset?"
Why the other options are not sufficient

❌ Export data to Cloud Storage with descriptive names
- Loses BigQuery analytics capabilities
- Does not solve metadata or ownership discovery

❌ Rename BigQuery columns
- Helpful, but limited
- Does not capture ownership, domain, or business context
- Risky for existing queries and pipelines

❌ Add a description column for each column
- Not scalable or maintainable
- Pollutes the schema
- Not searchable in a meaningful way
Exam rule of thumb
Data meaning + ownership + searchability → Data Catalog tags
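As a rough illustration, a sketch of attaching an ownership tag to an existing BigQuery table with the google-cloud-datacatalog Python client. It assumes a tag template named data_governance with a string owner field already exists; every project, dataset, table, and template name here is hypothetical.

```python
# Requires: pip install google-cloud-datacatalog
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the catalog entry for an existing BigQuery table (hypothetical IDs).
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/my-project"
            "/datasets/retail/tables/orders"
        )
    }
)

# Attach a tag from a pre-existing "data_governance" template that records the owner.
tag = datacatalog_v1.Tag()
tag.template = "projects/my-project/locations/us/tagTemplates/data_governance"
tag.fields["owner"] = datacatalog_v1.TagField()
tag.fields["owner"].string_value = "retail-analytics-team"

created_tag = client.create_tag(parent=entry.name, tag=tag)
print(created_tag.name)
```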
7. You are ingesting data that is spread out over a wide range of dates into BigQuery at a fast rate. You need to partition the table to make queries performant. What should you do?
- Create an integer-range partitioned table.
- Create an ingestion-time partitioned table with daily partitioning type.
- Create an ingestion-time partitioned table with yearly partitioning type.
- Create a time-unit column-partitioned table with yearly partitioning type.
✅ Correct Answer: Create an ingestion-time partitioned table with daily partitioning type.
Why this is the best choice
Your situation:
- Data spans a wide range of dates
- Data is ingested at a fast rate
- You want performant queries in BigQuery
- Queries likely filter by recent days / date ranges

Ingestion-time partitioning with daily partitions is the recommended and most common approach for this scenario.
Benefits
- Automatically partitions data based on when it arrives
- No dependency on a date column being present or clean
- Optimized for fast ingestion and for queries that filter by recent time windows
- Simple to manage and widely used in streaming and batch ingestion
Why the other options are not ideal

❌ Integer-range partitioned table
- Useful for numeric ranges (IDs, counters)
- Not suitable for date-based analytics

❌ Ingestion-time partitioned table with yearly partitioning
- Partitions would be too large
- Poor partition pruning and query performance

❌ Time-unit column-partitioned table with yearly partitioning
- Yearly partitions are too coarse
- Leads to scanning excessive data
Exam rule of thumb
High ingest rate + wide date range + analytics → Ingestion-time partitioned table (daily)
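A minimal sketch of creating an ingestion-time, daily-partitioned table and pruning partitions in a query, using the google-cloud-bigquery Python client; the table ID is hypothetical.

```python
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Ingestion-time partitioned table: no partitioning column is specified,
# so BigQuery partitions rows by the day they arrive.
table = bigquery.Table("my-project.analytics.events")  # hypothetical table ID
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)

# Queries prune partitions by filtering on the _PARTITIONDATE pseudocolumn.
query = """
    SELECT COUNT(*) AS events_last_week
    FROM `my-project.analytics.events`
    WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
"""
print(list(client.query(query).result()))
```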
8. Your analysts repeatedly run the same complex queries that combine and filter through a lot of data on BigQuery. The data changes frequently. You need to reduce the effort for the analysts. What should you do?
- Create a dataset with the data that is frequently queried.
- Export the frequently queried data into Cloud SQL.
- Export the frequently queried data into a new table.
- Create a view of the frequently queried data.
✅ Correct Answer: Create a view of the frequently queried data.
Why this is the best solution
Your situation:
- Analysts repeatedly run the same complex queries
- Underlying data changes frequently
- The goal is to reduce analyst effort, not duplicate data

BigQuery views are designed exactly for this use case.
Benefits of using a view
- Encapsulates complex SQL logic in one place
- Analysts can query the view with simple SELECT statements
- Always reflects the latest underlying data
- No data duplication
- Easy to maintain and update centrally
This dramatically improves productivity and consistency.
Why the other options are incorrect

❌ Create a dataset with frequently queried data
- A dataset is just a container, not a solution

❌ Export data into Cloud SQL
- Not suitable for large analytical workloads
- Adds unnecessary complexity

❌ Export data into a new table
- Duplicates data
- Requires refresh logic
- Risks serving stale data
Exam rule of thumb
Repeated complex queries + frequently changing data → View
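A minimal sketch of wrapping the analysts' complex query in a view with the google-cloud-bigquery Python client; the view, table, and column names are hypothetical.

```python
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# A logical view that encapsulates the complex query once (hypothetical IDs).
view = bigquery.Table("my-project.retail.daily_store_sales_v")
view.view_query = """
    SELECT
      store_id,
      DATE(order_ts) AS order_date,
      SUM(amount) AS total_sales
    FROM `my-project.retail.orders`
    WHERE status = 'COMPLETE'
    GROUP BY store_id, order_date
"""
client.create_table(view)

# Analysts now run a simple SELECT against the view, and results always
# reflect the latest underlying data:
#   SELECT * FROM `my-project.retail.daily_store_sales_v`
#   WHERE order_date = CURRENT_DATE()
```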
9. You have data stored in a Cloud Storage bucket. You are using both Identity and Access Management (IAM) and Access Control Lists (ACLs) to configure access control. Which statement describes a user's access to objects in the bucket?
- The user has no access if IAM denies the permission.
- The user has access if either IAM or ACLs grant a permission.
- The user has no access if either IAM or ACLs deny a permission.
- The user only has access if both IAM and ACLs grant a permission.
✅ Correct Answer: The user has access if either IAM or ACLs grant a permission.
Why this is correct
In Cloud Storage, when both IAM and ACLs are enabled, access evaluation works as follows:
- Permissions are additive
- There is no explicit deny
- A user is granted access if ANY applicable policy allows it

This means: if IAM allows OR an ACL allows, access is granted.
Key rule to remember
Effective access = IAM permissions ∪ ACL permissions
So even if:
- IAM does not grant access but an ACL does → ✅ access
- An ACL does not grant access but IAM does → ✅ access
Why the other options are incorrect

❌ The user has no access if IAM denies the permission
- There is no explicit "deny" in this access model; lack of permission ≠ deny.

❌ The user has no access if either IAM or ACLs deny a permission
- Again, there is no explicit deny; permissions are additive.

❌ The user only has access if both IAM and ACLs grant a permission
- Incorrect; both are not required.
Exam rule of thumb
IAM + ACLs = additive permissions (logical OR)
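To make the additive rule concrete, a minimal sketch that grants read access to a single object through an ACL alone, using the google-cloud-storage Python client; the bucket, object, and user are hypothetical, and this only applies to buckets that still use fine-grained (ACL-based) access rather than uniform bucket-level access.

```python
# Requires: pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("cymbal-legacy-bucket")  # hypothetical, fine-grained access

# Grant object-level read via an ACL only; no IAM role is involved.
blob = bucket.blob("reports/2024-q1.csv")
blob.acl.user("analyst@example.com").grant_read()
blob.acl.save()

# Because IAM and ACL permissions are additive (a logical OR), this user can
# now read the object even without any IAM grant on the bucket or project.
```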
10. You have large amounts of data stored on Cloud Storage and BigQuery. Some of it is processed, but some is yet unprocessed. You have a data mesh created in Dataplex. You need to make it convenient for internal users of the data to discover and use the data. What should you do?
- Create a lake for Cloud Storage data and a zone for BigQuery data.
- Create a lake for unprocessed data and assets for processed data.
- Create a raw zone for the unprocessed data and a curated zone for the processed data.
- Create a lake for BigQuery data and a zone for Cloud Storage data.
✅ Correct Answer: Create a raw zone for the unprocessed data and a curated zone for the processed data.
Why this is the right approach
You already have a data mesh implemented with Dataplex, and your goal is to make data:
- Easy to discover
- Easy to understand
- Convenient for internal users to consume
Dataplex best practices recommend organizing data by data lifecycle and quality, not by storage system.
How Dataplex is meant to be structured
- Lake → represents a business domain (e.g., Retail, Sales)
- Zones → represent stages of data maturity
- Assets → point to actual data (BigQuery datasets, Cloud Storage buckets)

The canonical zone pattern is:

Raw zone
- Unprocessed / ingested data
- Source-aligned, minimal validation
- Not intended for broad consumption

Curated zone
- Cleaned, transformed, trusted data
- Business-ready
- Intended for analysts, dashboards, ML, and reporting

This makes it very clear to users:
- What data is safe and ready to use
- What data is still being prepared
Why the other options are incorrect

❌ Create a lake for Cloud Storage data and a zone for BigQuery data
- Lakes are not meant to separate data by storage type.

❌ Create a lake for unprocessed data and assets for processed data
- Lakes represent domains, not processing stages.

❌ Create a lake for BigQuery data and a zone for Cloud Storage data
- Same issue: the wrong abstraction.
Exam rule of thumb
Dataplex data organization → Raw zone → Curated zone
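A rough sketch of creating raw and curated zones inside an existing Dataplex lake with the google-cloud-dataplex Python client; the project, location, and lake IDs are hypothetical, and the exact zone fields should be verified against the current client library.

```python
# Requires: pip install google-cloud-dataplex
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/us-central1/lakes/retail-lake"  # hypothetical

def make_zone(zone_type):
    # Single-region zone of the given maturity type (RAW or CURATED).
    return dataplex_v1.Zone(
        type_=zone_type,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    )

# Raw zone for unprocessed data, curated zone for processed, business-ready data.
client.create_zone(
    parent=parent, zone_id="raw", zone=make_zone(dataplex_v1.Zone.Type.RAW)
).result()
client.create_zone(
    parent=parent, zone_id="curated", zone=make_zone(dataplex_v1.Zone.Type.CURATED)
).result()
```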
11. Cymbal Retail collects large amounts of data that is useful for improving business operations. The company wants to store and analyze this data in a serverless and cost-effective manner using Google Cloud. The analysts need to use SQL to write the queries. What tool can you use to meet these requirements?
- Data Fusion
- Spanner
- BigQuery
- Memorystore
✅ Correct Answer: BigQuery
Why BigQuery is the right tool
Your requirements are very clear:
- Serverless (no infrastructure management)
- Cost-effective for large-scale analytics
- Large amounts of data
- SQL-based analysis for analysts
- Designed for business analytics

BigQuery is Google Cloud's fully managed, serverless data warehouse, built exactly for this use case.
What BigQuery provides
- Serverless architecture (no servers, clusters, or tuning)
- Pay-per-query or capacity-based pricing → cost control
- Standard SQL support
- Scales automatically to petabytes of data
- Integrates with BI tools and ML (BigQuery ML)
Why the other options are incorrect

❌ Data Fusion
- ETL / pipeline creation tool
- Not a data warehouse or analytics engine

❌ Spanner
- Globally distributed transactional database (OLTP)
- Not cost-effective for analytical workloads

❌ Memorystore
- In-memory cache (Redis / Memcached)
- Not for analytics or SQL querying
Exam rule of thumb
Serverless + SQL analytics + large data → BigQuery
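A minimal sketch of the analyst workflow: running a standard SQL aggregation with the google-cloud-bigquery Python client, with no infrastructure to provision; the table and columns are hypothetical.

```python
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# An analyst-style aggregation over a hypothetical orders table;
# no clusters or servers need to be provisioned to run it.
query = """
    SELECT store_id, SUM(amount) AS total_sales
    FROM `my-project.retail.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY store_id
    ORDER BY total_sales DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.store_id, row.total_sales)
```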
12. Cymbal Retail also collects large amounts of structured, semistructured, and unstructured data. The company wants a centralized repository to store this data in a cost-effective manner using Google Cloud. What tool can you use to meet these requirements?
- Cloud SQL
- Cloud Storage
- Bigtable
- Dataflow
✅ Correct Answer: Cloud Storage
Why Cloud Storage is the right choice
Your requirements are:
- Large amounts of data
- Structured, semi-structured, and unstructured data
- Centralized repository
- Cost-effective
- Google Cloud native

Cloud Storage is designed exactly for this use case and acts as a data lake on Google Cloud.
What Cloud Storage provides
- Supports all data types: structured (CSV, Parquet, Avro), semi-structured (JSON, XML), and unstructured (images, videos, logs)
- Highly durable and scalable
- Multiple cost-optimized storage classes
- Serverless (no infrastructure to manage)
- Integrates with BigQuery, Dataflow, Dataproc, and Dataplex
This makes Cloud Storage the recommended centralized storage layer.
Why the other options are incorrect

❌ Cloud SQL
- Relational database
- Not designed for large-scale or unstructured data

❌ Bigtable
- Optimized for low-latency key-value access
- Not a general-purpose data lake

❌ Dataflow
- Data processing service
- Does not store data
Exam rule of thumb
Centralized, cost-effective storage for all data types → Cloud Storage
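A minimal sketch of using one Cloud Storage bucket as the centralized repository for all three data shapes, via the google-cloud-storage Python client; the bucket name, object paths, and local files are hypothetical.

```python
# Requires: pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("cymbal-data-lake")  # hypothetical central bucket

# The same bucket holds structured, semi-structured, and unstructured data.
bucket.blob("structured/orders/2024-06-01.parquet").upload_from_filename("orders.parquet")
bucket.blob("semi_structured/clickstream/2024-06-01.json").upload_from_filename("events.json")
bucket.blob("unstructured/images/storefront.jpg").upload_from_filename("storefront.jpg")

# Downstream tools (BigQuery, Dataflow, Dataproc, Dataplex) can read directly
# from these paths; here we just list what landed in the structured prefix.
for blob in client.list_blobs("cymbal-data-lake", prefix="structured/"):
    print(blob.name, blob.size)
```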