Understanding Apache Iceberg's Metadata.json
Introduction
Apache Iceberg is a table format for data lakehouses, designed to solve many of the problems of large-scale data lakes by giving them the capabilities of a data warehouse. It supports schema evolution, time travel queries, and efficient data partitioning, all while remaining compatible with existing data processing engines. Central to Iceberg's functionality is the metadata.json file, which serves as the heart of table metadata management.
Purpose of metadata.json
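The metadata.json file in Apache Iceberg serves several critical purposes: it identifies the table and its format version, records the table's location, schemas, partition specs, and sort orders, and tracks the snapshots and earlier metadata files that make up the table's history.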
This file is not just a static record but a dynamic document that evolves with the table, making it an indispensable component of Apache Iceberg's architecture.
Detailed Breakdown of Fields
Identification and Versioning
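format-version: the version of the Iceberg table spec the file follows (2 in the example below). A reader should refuse a version it does not support.
table-uuid: a unique identifier generated when the table is created, used to confirm that refreshed metadata still belongs to the same table.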
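As a rough illustration of how a reader might check these two fields before trusting the rest of the file, the sketch below validates them on an already-parsed metadata dictionary. The function name check_identity and the set of supported versions are assumptions made for this example, not part of any Iceberg library.

```python
import uuid

# Assumption for this sketch: the reader understands format versions 1 and 2.
SUPPORTED_FORMAT_VERSIONS = {1, 2}


def check_identity(metadata: dict) -> tuple[int, uuid.UUID]:
    """Return (format-version, table-uuid); expects both keys to be present."""
    version = metadata.get("format-version")
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise ValueError(f"unsupported format-version: {version!r}")
    # uuid.UUID raises ValueError when the string is not a valid UUID.
    table_uuid = uuid.UUID(metadata["table-uuid"])
    return version, table_uuid
```

A caller would typically run a check like this once, right after parsing the JSON, and stop if it raises.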
Table Structure and Location
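location: the table's base location, which writers use to decide where to store data files, manifest files, and metadata files.
last-updated-ms: the time the table was last updated, in milliseconds since the Unix epoch.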
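Because last-updated-ms is an epoch value in milliseconds, it usually needs converting before a person can read it. A minimal sketch, assuming the metadata has already been parsed into a dictionary (the helper name last_updated_utc is hypothetical):

```python
from datetime import datetime, timezone


def last_updated_utc(metadata: dict) -> datetime:
    """Convert last-updated-ms (epoch milliseconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(metadata["last-updated-ms"] / 1000, tz=timezone.utc)
```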
Schema Management
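schemas and current-schema-id: schemas lists every schema the table has had, each with its own schema-id, and current-schema-id names the one currently in effect.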
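The relationship between the two fields can be sketched as a simple lookup. This assumes a parsed metadata dictionary; current_schema is a name invented for the example.

```python
def current_schema(metadata: dict) -> dict:
    """Return the entry in schemas whose schema-id equals current-schema-id."""
    wanted = metadata["current-schema-id"]
    for schema in metadata["schemas"]:
        if schema["schema-id"] == wanted:
            return schema
    raise ValueError(f"current-schema-id {wanted} not found in schemas")
```

Older schemas stay in the list, which is what lets a reader interpret snapshots that were written before the latest schema change.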
Data Partitioning
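partition-specs and default-spec-id: partition-specs lists every partition spec the table has used, and default-spec-id names the spec writers should apply to new data by default.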
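Transforms are stored as strings such as identity or bucket[4], so a reader has to split the name from its argument. The following is a small sketch of that step under the assumption that each partition field carries source-id and transform keys; the helper names are hypothetical and the regular expression only covers the simple name[number] shape.

```python
import re
from typing import Optional

# Matches names such as "identity", "day", "bucket[4]" or "truncate[10]".
_TRANSFORM = re.compile(r"^([a-z]+)(?:\[(\d+)\])?$")


def parse_transform(text: str) -> tuple[str, Optional[int]]:
    """Split a transform string into (name, numeric argument or None)."""
    match = _TRANSFORM.match(text)
    if match is None:
        raise ValueError(f"unrecognised transform: {text!r}")
    name, arg = match.groups()
    return name, (int(arg) if arg is not None else None)


def default_spec_transforms(metadata: dict) -> list:
    """Return (source-id, transform name, argument) for the default spec."""
    wanted = metadata["default-spec-id"]
    for spec in metadata["partition-specs"]:
        if spec["spec-id"] == wanted:
            return [(f["source-id"], *parse_transform(f["transform"]))
                    for f in spec["fields"]]
    raise ValueError(f"default-spec-id {wanted} not found in partition-specs")
```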
Snapshots and History
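last-sequence-number: the highest sequence number assigned so far; it increases with each commit.
current-snapshot-id: the ID of the snapshot that represents the table's current state.
snapshots: the list of valid snapshots of the table.
snapshot-log: a list of timestamp and snapshot ID pairs recording each change to the current snapshot.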
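The snapshot-log is what makes a time travel query by timestamp possible. A minimal sketch, assuming the log entries are in the order they were appended (oldest first); snapshot_id_as_of is a hypothetical helper, not an Iceberg API:

```python
from typing import Optional


def snapshot_id_as_of(metadata: dict, timestamp_ms: int) -> Optional[int]:
    """Return the snapshot that was current at timestamp_ms, or None if the
    table had no snapshot yet at that time."""
    result = None
    for entry in metadata.get("snapshot-log", []):
        if entry["timestamp-ms"] <= timestamp_ms:
            result = entry["snapshot-id"]  # keep the latest entry not after the target
    return result
```

With the log from the example file in this article, a timestamp between the second and third entries resolves to snapshot 2.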
Metadata Logging
metadata-log: a list of timestamp and file location pairs recording the table's previous metadata files.
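One way to use this log is to walk back through earlier versions of the table metadata, for auditing or debugging. A hedged sketch, with a helper name chosen for the example:

```python
def previous_metadata_files(metadata: dict) -> list:
    """Return earlier metadata file locations, most recent first."""
    entries = sorted(
        metadata.get("metadata-log", []),
        key=lambda entry: entry["timestamp-ms"],
        reverse=True,
    )
    return [entry["metadata-file"] for entry in entries]
```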
Sorting and Ordering
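sort-orders and default-sort-order-id: sort-orders lists the sort orders defined for the table, and default-sort-order-id names the one writers should use by default.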
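To make the meaning of a sort order concrete, the sketch below applies one to a few in-memory rows. It assumes each sort field has source-id, direction ("asc" or "desc"), and null-order ("nulls-first" or "nulls-last") keys, as the Iceberg table spec defines them, and it ignores transforms. The function sort_rows and the id_to_name mapping are inventions for this example; real engines sort while writing data files.

```python
def sort_rows(rows, sort_fields, id_to_name):
    """Order rows (dicts keyed by column name) according to sort_fields."""
    rows = list(rows)
    # Apply the least significant key first; each pass keeps the relative
    # order of rows that tie, so earlier fields end up taking priority.
    for field in reversed(sort_fields):
        column = id_to_name[field["source-id"]]
        descending = field["direction"] == "desc"
        nulls = [r for r in rows if r[column] is None]
        non_null = sorted(
            (r for r in rows if r[column] is not None),
            key=lambda r: r[column],
            reverse=descending,
        )
        if field["null-order"] == "nulls-first":
            rows = nulls + non_null
        else:
            rows = non_null + nulls
    return rows
```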
Example Metadata.json
{
  "format-version": 2,
  "table-uuid": "5f8b14d8-0a14-4e6a-8b04-7b1b9341c939",
  "location": "s3://my-bucket/tables/my_table",
  "last-updated-ms": 1692643800000,
  "last-sequence-number": 100,
  "last-column-id": 10,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 1,
      "fields": [
        {"id": 1, "name": "id", "required": true, "type": "int"},
        {"id": 2, "name": "name", "required": false, "type": "string"}
      ]
    },
    {
      "type": "struct",
      "schema-id": 2,
      "fields": [
        {"id": 1, "name": "id", "required": true, "type": "int"},
        {"id": 2, "name": "name", "required": false, "type": "string"},
        {"id": 3, "name": "age", "required": false, "type": "int"}
      ]
    }
  ],
  "current-schema-id": 2,
  "partition-specs": [
    {
      "spec-id": 1,
      "fields": [
        {"name": "name", "transform": "identity", "source-id": 2, "field-id": 1000}
      ]
    },
    {
      "spec-id": 2,
      "fields": [
        {"name": "age_bucket", "transform": "bucket[4]", "source-id": 3, "field-id": 1001}
      ]
    }
  ],
  "default-spec-id": 2,
  "last-partition-id": 1001,
  "properties": {
    "commit.retry.num-retries": "5"
  },
  "current-snapshot-id": 3,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1692643200000},
    {"snapshot-id": 2, "timestamp-ms": 1692643500000},
    {"snapshot-id": 3, "timestamp-ms": 1692643800000}
  ],
  "snapshot-log": [
    {"timestamp-ms": 1692643200000, "snapshot-id": 1},
    {"timestamp-ms": 1692643500000, "snapshot-id": 2},
    {"timestamp-ms": 1692643800000, "snapshot-id": 3}
  ],
  "metadata-log": [
    {"timestamp-ms": 1692643200000, "metadata-file": "s3://my-bucket/tables/my_table/metadata/00001.json"},
    {"timestamp-ms": 1692643500000, "metadata-file": "s3://my-bucket/tables/my_table/metadata/00002.json"}
  ],
  "sort-orders": [
    {
      "order-id": 1,
      "fields": [
        {"transform": "identity", "source-id": 1, "direction": "asc", "null-order": "nulls-first"}
      ]
    }
  ],
  "default-sort-order-id": 1,
  "refs": {
    "main": {"snapshot-id": 3, "type": "branch"}
  },
  "statistics": [
    {
      "snapshot-id": 3,
      "statistics-path": "s3://my-bucket/tables/my_table/stats/00003.puffin",
      "file-size-in-bytes": 1024,
      "file-footer-size-in-bytes": 64,
      "blob-metadata": [
        {
          "type": "table-stats",
          "snapshot-id": 3,
          "sequence-number": 100,
          "fields": [1, 2, 3],
          "properties": {
            "statistic-type": "summary"
          }
        }
      ]
    }
  ],
  "partition-statistics": [
    {
      "snapshot-id": 3,
      "statistics-path": "s3://my-bucket/tables/my_table/partition_stats/00003.parquet",
      "file-size-in-bytes": 512
    }
  ]
}
How Engines Use metadata.json
Query Planning
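One of the primary uses of metadata.json by data processing engines is query planning. An engine reads the file to find the current snapshot, the current schema, and the partition specs, then uses them to decide which data files a query actually needs to scan.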
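The first step of planning, finding the snapshot to read, might look roughly like the sketch below. It assumes a parsed metadata dictionary with the refs, current-snapshot-id, and snapshots fields shown in the example file; resolve_snapshot is a hypothetical helper, and a real engine would go on to read the manifests that the snapshot points to.

```python
def resolve_snapshot(metadata: dict, ref: str = "main") -> dict:
    """Return the snapshot entry a query against the given ref should read."""
    refs = metadata.get("refs", {})
    if ref in refs:
        snapshot_id = refs[ref]["snapshot-id"]
    elif ref == "main":
        # Fall back to current-snapshot-id when no refs are recorded.
        snapshot_id = metadata.get("current-snapshot-id")
    else:
        raise KeyError(f"unknown ref: {ref!r}")
    by_id = {s["snapshot-id"]: s for s in metadata.get("snapshots", [])}
    if snapshot_id not in by_id:
        raise ValueError(f"snapshot {snapshot_id} is not listed in snapshots")
    return by_id[snapshot_id]
```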
Schema Evolution
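Iceberg tracks columns by numeric field ID rather than by name, which is why a file written under an older schema can still be read under a newer one. The sketch below illustrates the idea on a single row, assuming the row's values are keyed by field ID and the schema uses a fields list with id and name keys as in the Iceberg table spec; project_row is a name made up for this example.

```python
def project_row(values_by_field_id: dict, read_schema: dict) -> dict:
    """Present stored values under the column names of read_schema.

    Columns that did not exist when the row was written come back as None.
    """
    return {
        field["name"]: values_by_field_id.get(field["id"])
        for field in read_schema["fields"]
    }
```

A row written before an age column was added simply reads back with age set to None, and a renamed column keeps its data because its ID is unchanged.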
Data Consistency
Data Layout and Sorting
Metadata Updates
Optimistic Concurrency Controls
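Under optimistic concurrency, a writer prepares a new metadata file and then asks the catalog to swap the table's pointer from the file it started with to the new one; if another writer got there first, the swap fails and the writer retries against the newer state. The following is only a toy model of that pattern: InMemoryCatalog and commit_with_retries are hypothetical, and the retry count loosely mirrors the commit.retry.num-retries property from the example file.

```python
import threading


class InMemoryCatalog:
    """Toy catalog holding one pointer to the current metadata file."""

    def __init__(self, current_location: str):
        self._current = current_location
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._current

    def swap(self, expected: str, new: str) -> bool:
        """Atomically replace the pointer only if it still equals expected."""
        with self._lock:
            if self._current != expected:
                return False
            self._current = new
            return True


def commit_with_retries(catalog, make_new_metadata, num_retries: int = 5) -> str:
    """Try to commit; on conflict, rebuild from the latest pointer and retry."""
    for _ in range(num_retries + 1):
        base = catalog.current()
        new_location = make_new_metadata(base)  # writes a new file in a real system
        if catalog.swap(base, new_location):
            return new_location
    raise RuntimeError("commit failed after retries")
```

The design choice worth noting is that writers never lock the table while preparing changes; the only serialized step is the pointer swap.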
Conclusion on Engine Usage
The metadata.json in Apache Iceberg acts as a comprehensive guide for data engines, enabling them to efficiently manage, query, and evolve large-scale data tables. By providing detailed metadata, it allows for optimizations at various levels, from query planning to data consistency, making Iceberg tables highly performant and flexible.