1. What is the primary function of a data engineer?
- Conducting statistical analysis on data.
- Managing network security for data systems.
- Designing user interfaces for data visualization.
- Building and maintaining data pipelines.
Correct Answer: Building and maintaining data pipelines.
Why this is the primary function of a data engineer
A data engineer focuses on the infrastructure and systems that make data usable across an organization.
Their core responsibilities are to:
Ingest data from multiple sources
Clean, transform, and validate data
Build reliable, scalable data pipelines
Ensure data is available, fresh, and high quality
Support analytics, BI, and ML teams
This directly maps to building and maintaining data pipelines.
Why the other options are incorrect
❌ Conducting statistical analysis on data
Typically done by data analysts or data scientists
❌ Managing network security for data systems
Responsibility of security or infrastructure teams
❌ Designing user interfaces for data visualization
Done by BI developers or frontend engineers
Exam rule of thumb
Data Engineer → Pipelines & data infrastructure
Data Analyst → Insights & queries
Data Scientist → Models & statistics
2. Which stage in a data pipeline involves modifying and preparing data for specific downstream requirements?
- Replicate and migrate
- Ingest
- Transform
- Store
Correct Answer: Transform
Why Transform is correct
In a data pipeline, the transform stage is where raw data is:
Cleaned
Filtered
Aggregated
Enriched
Reshaped
This stage modifies and prepares data so it meets the specific requirements of downstream systems, such as analytics, reporting, or machine learning.
Examples of transformations (a short SQL sketch follows this list):
Converting data formats
Removing duplicates
Applying business rules
Creating derived columns
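For illustration, here is a minimal BigQuery SQL sketch of a transform step that applies several of these operations at once; the dataset, table, and column names are hypothetical.

```sql
-- Hypothetical transform: clean, deduplicate, and derive a column in one step.
CREATE OR REPLACE TABLE analytics.clean_events AS
SELECT
  event_id,
  user_id,
  event_timestamp,
  DATE(event_timestamp) AS event_date        -- derived column
FROM staging.raw_events
WHERE event_id IS NOT NULL                   -- basic cleaning rule
QUALIFY ROW_NUMBER()
  OVER (PARTITION BY event_id ORDER BY event_timestamp DESC) = 1;  -- deduplicate
```

Downstream reporting or ML jobs then read from the prepared table instead of the raw landing data.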
Why the other options are incorrect
❌ Ingest
Responsible for collecting and loading raw data
Does not modify or transform the data
❌ Store
Focuses on persisting data
Does not prepare data for use
❌ Replicate and migrate
Concerned with moving data
Not transforming it
Exam rule of thumb
Modify / prepare data → Transform
3. What is the purpose of Analytics Hub in Google Cloud?
- To enable secure and controlled data sharing within and outside an organization.
- To provide a centralized platform for data visualization.
- To automate the process of data cleaning and transformation.
- To perform complex machine learning tasks on large datasets.
Correct Answer: To enable secure and controlled data sharing within and outside an organization.
Why this is the correct answer
Analytics Hub is Google Cloud's service for sharing BigQuery datasets in a secure, governed, and scalable way.
It allows:
Organizations to publish datasets
Internal teams or external partners to subscribe to those datasets
Data to be shared without copying, so it stays up to date
Fine-grained access control, auditing, and revocation
This makes it ideal for:
Partner data sharing
Data monetization
Internal data marketplaces
Why the other options are incorrect
❌ Centralized platform for data visualization
→ That's Looker or Looker Studio, not Analytics Hub.
❌ Automate data cleaning and transformation
→ That's handled by Dataflow, Dataprep, or Data Fusion.
❌ Perform complex machine learning tasks
→ That's Vertex AI or BigQuery ML, not Analytics Hub.
Exam rule of thumb
BigQuery data sharing (internal or external) → Analytics Hub
4. Which Google Cloud product is best suited for storing unstructured data like images and videos?
- Cloud Storage
- BigQuery
- Cloud SQL
- Bigtable
Correct Answer: Cloud Storage
Why Cloud Storage is the best fit
Unstructured data such as images, videos, audio files, and documents needs:
Object-based storage
High durability
Massive scalability
Cost-effective storage classes
Cloud Storage is designed exactly for this.
Key advantages
Stores unstructured objects of any size
Multiple storage classes (Standard, Nearline, Coldline, Archive)
Highly durable and scalable
Integrates with analytics and ML tools
Why the other options are incorrect
❌ BigQuery
Analytical data warehouse
Not designed to store raw images/videos
❌ Cloud SQL
Relational database
Not suitable for large binary objects at scale
❌ Bigtable
NoSQL wide-column database
Optimized for key-value access, not object storage
5. What is the key difference between a data lake and a data warehouse?
- Data lakes are managed by data scientists, while data warehouses are managed by data engineers.
- Data lakes store only structured data, while data warehouses store unstructured data.
- Data lakes store raw, unprocessed data, while data warehouses store processed and organized data.
- Data lakes are used for real-time analytics, while data warehouses are used for long-term storage.
Correct Answer: Data lakes store raw, unprocessed data, while data warehouses store processed and organized data.
Why this is the key difference
The fundamental distinction between a data lake and a data warehouse is how data is stored and prepared:
Data lake
Stores raw, unprocessed data
Can include structured, semi-structured, and unstructured data
Schema-on-read
Used for exploration, ML, and flexible analytics
Data warehouse
Stores cleaned, transformed, and structured data
Schema-on-write
Optimized for reporting and BI
This makes the third option the canonical definition.
Why the other options are incorrect
❌ Managed by different roles
Not a defining technical difference
❌ Data type reversal
Data lakes can store all types; warehouses focus on structured data
❌ Real-time vs long-term storage
Both can support real-time and historical analytics
Exam rule of thumb
Raw → Data lake | Curated → Data warehouse
6. In Datastream event messages, which section contains the actual data changes in a key-value format?
- Change log
- Payload
- Generic metadata
- Source-specific metadata
Correct Answer: Payload
Why Payload is correct
A Datastream event message is logically divided into sections. The actual data changes (the before/after values of columns, represented as key-value pairs) are contained in the Payload.
What each section contains (for clarity)
Payload
The row-level data changes
Column names and values (key-value format)
Insert, update, and delete data
Generic metadata
Common event info (timestamps, operation type, etc.)
Source-specific metadata
Database-specific details (e.g., MySQL binlog position)
Change log
Not the section that holds the actual column data
Exam rule of thumb
Datastream actual data values → Payload
7. Which of the following services efficiently moves large datasets from on-premises, multicloud file systems, object stores, and HDFS into Cloud Storage and supports scheduled transfers?
- Vertex AI
- gcloud storage
- Storage Transfer Service
- Transfer Appliance
Correct Answer: Storage Transfer Service
Why Storage Transfer Service is the right choice
Your requirements are very specific:
Large datasets
From on-premises systems
From multicloud file systems and object stores
From HDFS
Scheduled / recurring transfers
Destination: Cloud Storage
Storage Transfer Service (STS) is explicitly designed for this use case.
What Storage Transfer Service supports
Sources:
On-premises file systems
AWS S3 and other cloud object stores
HDFS
Destination:
Cloud Storage
Features:
Scheduled and incremental transfers
Retry, checksum validation
High-throughput, reliable transfers
Fully managed (no agents on GCP side)
Why the other options are incorrect
❌ Vertex AI
Machine learning platform
Not a data transfer service
❌ gcloud storage
CLI for ad-hoc transfers
Not optimized for large-scale, scheduled, enterprise transfers
❌ Transfer Appliance
Physical device
Best for offline / network-constrained migrations
Does not support scheduled transfers
Exam rule of thumb
Large-scale + scheduled + on-prem / multicloud → Storage Transfer Service
8. The ease of migrating data is heavily influenced by which two factors?
- Data source and destination platform
- Data complexity and network latency
- Data size and network bandwidth
- Data format and storage type
Correct Answer: Data size and network bandwidth
Why this is correct
The ease of migrating data is primarily determined by how long the migration will take and how reliably it can be completed. These are most strongly affected by:
Data size
Larger datasets take longer to transfer
May require special tools (Transfer Appliance, Storage Transfer Service)
Network bandwidth
Higher bandwidth = faster migrations
Limited bandwidth can make online migration impractical
Together, these two factors directly determine:
Migration duration
Tool selection
Cost
Risk of failure
Why the other options are less correct
❌ Data source and destination platform
Affects tool choice, but not the ease as much as size/bandwidth
❌ Data complexity and network latency
Complexity affects transformation effort, not raw migration ease
Latency matters less than bandwidth for bulk transfers
❌ Data format and storage type
Usually manageable with modern tools
Rarely the biggest blocker compared to size and bandwidth
Exam rule of thumb
Data migration difficulty → How much data + how fast you can move it
9. Which tool or service uses the βcpβ command to facilitate ad-hoc transfers directly to Cloud Storage from various on-premise sources?
- Datastream
- Transfer Appliance
- gcloud storage command
- Storage Transfer Service
Correct Answer: gcloud storage command
Why this is correct
The cp command is used with the gcloud storage CLI to perform ad-hoc, direct transfers to Cloud Storage.
Example:
gcloud storage cp localfile.txt gs://my-bucket/
This is ideal for:
One-time or ad-hoc uploads
Smaller or moderate-sized datasets
Transfers from on-premises systems where scripting is sufficient
Why the other options are incorrect
❌ Datastream
Used for change data capture (CDC) from databases
Does not use cp
❌ Transfer Appliance
Physical device for offline migrations
No CLI cp command
❌ Storage Transfer Service
Managed service for large-scale, scheduled transfers
Uses jobs and agents, not cp
Exam rule of thumb
Ad-hoc file copy to Cloud Storage using cp → gcloud storage
10. Which of the following tools is best suited for migrating very large datasets offline?
- gcloud storage command
- Storage Transfer Service
- Transfer Appliance
- Datastream
Correct Answer: Transfer Appliance
Why Transfer Appliance is best for very large offline migrations
Your requirement is to migrate very large datasets when online transfer is impractical (due to bandwidth limits, time constraints, or reliability).
Transfer Appliance is a physical device provided by Google Cloud. The workflow is:
Load the appliance with data on-premises
Ship it back to Google
Google uploads the data directly into Cloud Storage
This makes it ideal for:
Hundreds of terabytes to petabytes of data
Limited or slow network connectivity
Strict migration timelines
Why the other options are incorrect
❌ gcloud storage command
Online, ad-hoc transfers
Not suitable for massive datasets
❌ Storage Transfer Service
Online and scheduled transfers
Requires sufficient network bandwidth
❌ Datastream
CDC for databases
Not a bulk migration tool
Exam rule of thumb
Very large data + offline migration → Transfer Appliance
11. The LOAD DATA statement in BigQuery is best suited for which scenario?
- Loading data from a local CSV file using a graphical interface.
- Automating data loading into BigQuery tables within a script or application.
- Creating a new dataset in BigQuery.
- Querying data directly from Google Sheets without loading it into BigQuery.
Correct Answer: Automating data loading into BigQuery tables within a script or application.
Why this is correct
The LOAD DATA statement in BigQuery is a SQL-based, programmatic way to load data into BigQuery tables. It is best suited when you want to:
Automate data ingestion as part of:
SQL scripts
Applications
Scheduled jobs
Stored procedures
Load data from supported sources such as:
Cloud Storage
Google Drive (in some cases)
Integrate loading logic directly into data pipelines
This aligns perfectly with automation and repeatability.
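A minimal sketch of the statement, assuming a hypothetical dataset, table, and Cloud Storage bucket:

```sql
-- Appends CSV files from Cloud Storage into a BigQuery table (names are hypothetical).
LOAD DATA INTO mydataset.sales
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://my-example-bucket/sales/*.csv']
);
```

Because it is plain SQL, the same statement can be embedded in scheduled queries, scripts, or stored procedures.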
Why the other options are incorrect
❌ Loading data from a local CSV file using a graphical interface
That's done via the BigQuery UI, not LOAD DATA.
❌ Creating a new dataset in BigQuery
Done using the CREATE SCHEMA statement (or the bq mk command), not LOAD DATA.
❌ Querying data directly from Google Sheets without loading it
Done using external tables or federated queries, not LOAD DATA.
Exam rule of thumb
BigQuery LOAD DATA → Automated, SQL-driven data ingestion
12. Which of the following best describes the BigQuery Data Transfer Service?
- It is a command-line tool for loading data into BigQuery.
- It is a service that allows querying data directly in Cloud Storage.
- It is a feature that enables querying data across Cloud Storage and other cloud object stores.
- It is a fully-managed service for scheduling and automating data transfers into BigQuery from various sources.
Correct Answer: It is a fully-managed service for scheduling and automating data transfers into BigQuery from various sources.
Why this is correct
The BigQuery Data Transfer Service (BQ DTS) is designed to automatically load data into BigQuery on a schedule with minimal operational effort.
It supports:
Scheduled and recurring transfers
Fully managed execution (no infrastructure to manage)
Data sources such as:
Google Ads, Google Analytics
Cloud Storage
Amazon S3
SaaS applications and partner sources
This makes it ideal for ongoing, automated ingestion into BigQuery.
Why the other options are incorrect
❌ Command-line tool for loading data
→ That's bq load or LOAD DATA, not the Data Transfer Service.
❌ Querying data directly in Cloud Storage
→ That's done using external tables or BigLake.
❌ Querying across cloud object stores
→ Again, external tables / BigLake, not BQ DTS.
Exam tip
Scheduled, automated ingestion into BigQuery → BigQuery Data Transfer Service
13. Which of the following statements accurately describes the staleness configuration for BigLake's metadata cache?
- The staleness can be configured from 15 minutes to 1 day.
- The cache is refreshed only manually.
- The staleness can be configured from 30 minutes to 7 days and can be refreshed automatically or manually.
- The staleness cannot be configured and remains fixed.
Correct Answer: The staleness can be configured from 30 minutes to 7 days and can be refreshed automatically or manually.
Why this is correct
BigLake uses a metadata cache to improve query performance when accessing data stored outside BigQuery (for example, in Cloud Storage).
Key points about BigLake metadata cache:
You can configure staleness
Supported range is 30 minutes to 7 days
Cache refresh can be:
Automatic (based on staleness setting)
Manual (on demand)
This lets you balance metadata freshness against query latency and cost.
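As an illustration, the staleness settings are defined when the BigLake table is created; the following is a minimal sketch in which the project, connection, dataset, and bucket names are hypothetical.

```sql
-- BigLake table with metadata caching (all resource names are hypothetical).
CREATE OR REPLACE EXTERNAL TABLE mydataset.biglake_events
WITH CONNECTION `my-project.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-example-bucket/events/*.parquet'],
  max_staleness = INTERVAL 4 HOUR,      -- allowed range: 30 minutes to 7 days
  metadata_cache_mode = 'AUTOMATIC'     -- or 'MANUAL' for on-demand refresh
);
```

With 'MANUAL' mode, the cache is refreshed on demand (for example, via the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure).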
Why the other options are incorrect
❌ 15 minutes to 1 day
Incorrect range
❌ Refreshed only manually
Automatic refresh is supported
❌ Staleness cannot be configured
It is configurable
Exam tip
BigLake metadata cache → configurable staleness (30 minutes to 7 days), automatic or manual refresh
14. What is the main advantage of using BigLake tables over external tables in BigQuery?
- BigLake tables have strong limitations on data formats and storage locations.
- BigLake tables are simpler to set up.
- BigLake tables offer enhanced performance, security, and flexibility.
- BigLake tables require separate permissions for the table and the data source.
Correct Answer: BigLake tables offer enhanced performance, security, and flexibility.
Why this is the main advantage
BigLake tables extend BigQuery's capabilities to data stored in Cloud Storage and other object stores, while overcoming many limitations of traditional external tables.
Key advantages include:
Better performance
Metadata caching
Improved query planning
Enhanced security
Fine-grained access control
Unified BigQuery IAM and policy tags
Greater flexibility
Supports multiple engines (BigQuery, Spark)
Works across data lakes and warehouses
Better schema evolution support
Why the other options are incorrect
❌ Strong limitations on formats and locations
BigLake reduces limitations; it does not add them.
❌ Simpler to set up
BigLake is more powerful, but setup is more involved than for external tables.
❌ Requires separate permissions
BigLake centralizes permission management rather than fragmenting it.
Exam rule of thumb
External tables → simple access
BigLake → performance + security + governance
15. Which of the following BigQuery features or services allows you to query data directly in Cloud Storage without loading it into BigQuery?
- The bq load command
- The BigQuery Data Transfer Service
- The bq mk command
- External tables
Correct Answer: External tables
Why this is correct
External tables in BigQuery allow you to:
Query data directly from Cloud Storage
Avoid loading or copying data into BigQuery storage
Use standard SQL to analyze data stored as:
CSV
JSON
Avro
Parquet
ORC
This is exactly what the question asks: query data directly in Cloud Storage without loading it into BigQuery.
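A minimal sketch, assuming a hypothetical dataset and bucket:

```sql
-- Define an external table over Parquet files in Cloud Storage (names are hypothetical).
CREATE OR REPLACE EXTERNAL TABLE mydataset.ext_orders
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-example-bucket/orders/*.parquet']
);

-- The files stay in Cloud Storage; only the query runs in BigQuery.
SELECT order_id, SUM(amount) AS total_amount
FROM mydataset.ext_orders
GROUP BY order_id;
```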
Why the other options are incorrect
❌ bq load command
Loads data into BigQuery tables (copying data)
❌ BigQuery Data Transfer Service
Automates data ingestion into BigQuery
Still results in data being stored in BigQuery
❌ bq mk command
Used to create datasets, tables, or views
Does not query data
Exam rule of thumb
Query data in Cloud Storage without loading → External tables
16. Which of the following is a key benefit of using stored procedures in BigQuery?
- The improvement of code reusability and maintainability.
- The flexibility to integrate with external APIs directly.
- The ability to define custom data transformations only using Python.
- The simplification of ad-hoc SQL query execution.
Correct Answer: The improvement of code reusability and maintainability.
Why this is the key benefit
Stored procedures in BigQuery allow you to:
Encapsulate complex SQL logic in one place
Reuse the same logic across multiple workflows
Reduce duplication of SQL code
Make changes in one place instead of many
This leads directly to better maintainability and reusability, which is the primary design goal of stored procedures.
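For illustration, a minimal stored procedure that encapsulates a small, repeatable load step might look like this (dataset, table, and column names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE mydataset.refresh_daily_sales(run_date DATE)
BEGIN
  -- Remove any existing rows for the day, then rebuild them.
  DELETE FROM mydataset.daily_sales WHERE sale_date = run_date;

  INSERT INTO mydataset.daily_sales (sale_date, total_amount)
  SELECT run_date, SUM(amount)
  FROM mydataset.raw_sales
  WHERE sale_date = run_date;
END;

-- Reuse the same logic anywhere with a single call.
CALL mydataset.refresh_daily_sales(CURRENT_DATE());
```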
They are especially useful for:
ETL / ELT workflows
Data quality checks
Multi-step SQL logic
Scheduled or automated jobs
Why the other options are incorrect
❌ Integrate with external APIs directly
BigQuery stored procedures cannot call external APIs
❌ Define transformations only using Python
Stored procedures use SQL scripting, not Python
❌ Simplification of ad-hoc SQL execution
Stored procedures are for repeatable logic, not ad-hoc queries
Exam rule of thumb
Stored procedures → reusable, maintainable, multi-step SQL logic
17. What is the primary advantage of using BigQuery's procedural language support?
- It enhances the performance of individual SQL queries.
- It allows the use of external libraries within SQL queries.
- It simplifies the creation of custom data transformations using Python.
- It enables the execution of multiple SQL statements in sequence with shared state.
Correct Answer: It enables the execution of multiple SQL statements in sequence with shared state.
Why this is the primary advantage
BigQuery's procedural language support (SQL scripting) allows you to:
Execute multiple SQL statements in order
Use control flow such as:
IF / ELSE
LOOP / WHILE
Declare and use variables
Share state across statements in the same script or stored procedure
This makes it possible to implement complex, multi-step workflows directly inside BigQuery.
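A minimal sketch of a BigQuery SQL script using variables and control flow (dataset and table names are hypothetical):

```sql
DECLARE pending_rows INT64;

SET pending_rows = (SELECT COUNT(*) FROM mydataset.staging_orders);

IF pending_rows > 0 THEN
  -- Move staged rows into the main table, then clear the staging table.
  INSERT INTO mydataset.orders
  SELECT * FROM mydataset.staging_orders;

  TRUNCATE TABLE mydataset.staging_orders;
ELSE
  SELECT 'No new rows to load' AS message;
END IF;
```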
Why the other options are incorrect
❌ Enhances performance of individual queries
Procedural scripting improves workflow logic, not query execution speed
❌ Allows external libraries inside SQL
BigQuery SQL does not support external libraries
❌ Simplifies transformations using Python
BigQuery procedural language is SQL-based, not Python
Exam rule of thumb
BigQuery procedural SQL → control flow + variables + multi-statement execution
18. How does Dataform enhance the management of SQL workflows in BigQuery?
- It unifies transformation, assertion, and automation within BigQuery, streamlining data operations.
- It replaces the need for SQL entirely, allowing transformations using only Python.
- It provides a visual interface for designing and executing ad-hoc SQL queries.
- It automates data migration from on-premises databases to BigQuery.
Correct Answer: It unifies transformation, assertion, and automation within BigQuery, streamlining data operations.
Why this is correct
Dataform is designed specifically to manage SQL-based data transformations in BigQuery by providing:
Transformations
Manage SQL models (tables, views, incremental tables)
Assertions
Built-in data quality checks
Automation
Dependency management
Scheduled execution
Version control integration
All of this is done natively on BigQuery, making SQL workflows more reliable, maintainable, and scalable.
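For illustration, a minimal SQLX file might look like the sketch below. It assumes a Dataform repository with a declared source or table named orders; the dataset name and the built-in assertion shown are hypothetical choices.

```sql
config {
  type: "table",
  schema: "analytics",
  assertions: {
    nonNull: ["order_date"]
  }
}

SELECT
  order_date,
  SUM(amount) AS total_revenue
FROM ${ref("orders")}   -- ref() is what Dataform uses to build the dependency graph
GROUP BY order_date
```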
Why the other options are incorrect
❌ Replaces SQL with Python
Dataform is SQL-first, not Python-based
❌ Visual interface for ad-hoc queries
That's the BigQuery UI / Looker Studio, not Dataform
❌ Automates on-prem data migration
That's handled by tools like Datastream or Storage Transfer Service
Exam rule of thumb
SQL workflow management in BigQuery → Dataform
19. What is the primary purpose of assertions in Dataform?
- To define data quality tests, ensuring data consistency and accuracy.
- To schedule the execution of SQL workflows at specific intervals.
- To compile SQLX files into executable SQL scripts.
- To define and manage dependencies between tables and views.
Correct Answer: To define data quality tests, ensuring data consistency and accuracy.
Why this is correct
In Dataform, assertions are used to:
Validate data quality
Enforce business rules
Ensure data consistency and correctness
Catch issues like:
Null values where they shouldn't exist
Invalid ranges
Duplicate keys
Assertions help prevent bad data from propagating downstream.
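For illustration, a manual assertion is just a SQLX file whose query should return zero rows; the sketch below assumes a source named orders and hypothetical business rules.

```sql
config {
  type: "assertion"
}

-- The assertion fails if this query returns any rows.
SELECT order_id
FROM ${ref("orders")}
WHERE order_id IS NULL
   OR amount < 0
```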
Why the other options are incorrect
❌ Schedule execution of workflows
Scheduling is handled by Dataform workflow configurations or external orchestrators
❌ Compile SQLX files
Compilation is part of Dataform's build process, not assertions
❌ Manage dependencies
Dependencies are defined by refs and the DAG, not assertions
Exam rule of thumb
Dataform assertions → data quality checks
20. Which of the following best describes the core concept of extract, load, and transform (ELT)?
- Transforming data before loading it into a data warehouse.
- Extracting data, loading it into a staging area, and then transforming it.
- Loading data into a data warehouse and then transforming it within the warehouse.
- Extracting only the necessary data and transforming it before loading.
Correct Answer: Loading data into a data warehouse and then transforming it within the warehouse.
Why this is correct
The core idea of ELT (Extract, Load, Transform) is:
Extract data from source systems
Load the raw data directly into the data warehouse (for example, BigQuery)
Transform the data inside the warehouse using its compute power
This approach is common with modern cloud data warehouses because they are highly scalable, cost-efficient, and optimized for large-scale transformations using SQL.
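A minimal sketch of the in-warehouse transform step in BigQuery, assuming the raw data has already been loaded as-is into a landing table (all names are hypothetical):

```sql
-- ELT: raw data was loaded untouched; the transform runs inside the warehouse.
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT
  CAST(order_id AS INT64) AS order_id,
  LOWER(TRIM(customer_email)) AS customer_email,
  SAFE_CAST(order_total AS NUMERIC) AS order_total
FROM landing.orders_raw
WHERE order_id IS NOT NULL;
```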
Why the other options are incorrect
❌ Transforming data before loading
That describes ETL, not ELT
❌ Loading into a staging area, then transforming
Also describes ETL
❌ Extracting only necessary data and transforming before loading
Again, ETL
Exam rule of thumb
ELT → Load first, transform later (inside the warehouse)