
REST catalog, S3tables with botocore session #2657

@Flogue

Description

Apache Iceberg version

0.10.0

Please describe the bug 🐞

Since 0.10.0 it is possible to pass a botocore session to a REST catalog, so:

import io
import os

import pandas as pd
import pyarrow as pa

from boto3 import Session
from pyiceberg.catalog import load_catalog

boto3_session = Session(profile_name='a_profile', region_name='us-east-1')

catalog = load_catalog(
        "catalog",
        type="rest",
        botocore_session=boto3_session._session,
        warehouse="arn:aws:s3tables:us-east-1:XXXXXXXXXXX:bucket/a_bucket",
        uri=f"https://s3tables.us-east-1.amazonaws.com/iceberg",
        **{
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "s3tables",
            "rest.signing-region": "us-east-1"
        })

table = catalog.load_table("namespace.a_table")

json_string = "[{\"data\":\"000000000000\", ...}]"
df = pd.read_json(io.StringIO(json_string), orient='records')

arrow_table = pa.Table.from_pandas(df=df, schema=table.schema().as_arrow())

table.overwrite(arrow_table)

Everything works until we call .overwrite():

OSError: When reading information for key 'metadata/snap-6778585584222594295-0-3ae9518f-fd1c-488f-b3d2-4ca1724317a1.avro' in bucket '2c8e7acb-67a1-4dc9-8ym9eg38966b8bazzfjn487w5o9wruse1b--table-s3': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

To "fix" it, we can do:

boto3_session = Session(profile_name='a_profile', region_name='us-east-1')

catalog = load_catalog(
        "catalog",
        type="rest",
        botocore_session=boto3_session._session,
        warehouse="arn:aws:s3tables:us-east-1:XXXXXXXXXXX:bucket/a_bucket",
        uri=f"https://s3tables.us-east-1.amazonaws.com/iceberg",
        **{
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "s3tables",
            "rest.signing-region": "us-east-1"
        })

table = catalog.load_table("namespace.a_table")

json_string = "[{\"data\":\"000000000000\", ...}]"
df = pd.read_json(io.StringIO(json_string), orient='records')

arrow_table = pa.Table.from_pandas(df=df, schema=table.schema().as_arrow())

credentials = boto3_session.get_credentials().get_frozen_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
if credentials.token:
    os.environ["AWS_SESSION_TOKEN"] = credentials.token
table.overwrite(arrow_table)

which works but defeats the purpose.
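
A less intrusive variant of the same workaround (just a sketch, assuming pyiceberg's documented s3.* FileIO properties; the profile, bucket and region values are placeholders as above) is to hand the frozen credentials to the catalog so they reach the S3 FileIO without touching os.environ:

credentials = boto3_session.get_credentials().get_frozen_credentials()

# Pass the frozen credentials to the FileIO via catalog properties
s3_props = {
    "s3.access-key-id": credentials.access_key,
    "s3.secret-access-key": credentials.secret_key,
    "s3.region": "us-east-1",
}
if credentials.token:
    s3_props["s3.session-token"] = credentials.token

catalog = load_catalog(
        "catalog",
        type="rest",
        botocore_session=boto3_session._session,
        warehouse="arn:aws:s3tables:us-east-1:XXXXXXXXXXX:bucket/a_bucket",
        uri="https://s3tables.us-east-1.amazonaws.com/iceberg",
        **{
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "s3tables",
            "rest.signing-region": "us-east-1",
            **s3_props,
        })

This still requires freezing the credentials by hand, so it only moves the problem out of the environment rather than solving it.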

We can still access .schema() and the like, so it seems the overwrite path is not going through the proper SigV4Adapter (pyiceberg/catalog/rest/__init__.py).
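
A small diagnostic sketch (hypothetical, reusing the table object from above) that illustrates the split between the signed catalog calls and the file I/O:

# These succeed: the table metadata was fetched through the REST catalog,
# whose requests are signed by SigV4Adapter
print(table.schema())
print(table.current_snapshot())

# The error in .overwrite() comes from file I/O instead, which goes through
# the table's FileIO and presumably resolves AWS credentials on its own
# (which would explain why the env-var workaround above helps)
print(type(table.io))              # e.g. PyArrowFileIO
print(sorted(table.io.properties))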

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
