
Batch process all your records to store structured outputs in a DuckDB installation.

The requirements are as follows.

  • A DuckDB installation.

  • A persistent database. You can create one, for example, by running the DuckDB CLI command duckdb <my-database-filename>.db or duckdb <my-database-filename>.duckdb, replacing <my-database-filename> with the name of the target file.

  • The path to the target persistent database file.

  • A schema in the target database.

    • Create a schema.

    • You can list available schemas and their parent catalogs by running the following DuckDB CLI command:

      SELECT * FROM information_schema.schemata;
      

    The DuckDB connector uses the default schema name of main if not otherwise specified.

  • A table in the target schema.

    • Create a table.

    • You can list available tables in a schema by running the following DuckDB CLI commands, replacing the target catalog and schema names:

      USE <catalog-name>.<schema-name>;
      SHOW TABLES;
      

    The DuckDB connector uses the default table name of elements if not otherwise specified.

    For maximum compatibility, Unstructured recommends the following table schema:

    CREATE TABLE elements (
        id VARCHAR,
        element_id VARCHAR,
        text TEXT,
        embeddings FLOAT[],
        type VARCHAR,
        system VARCHAR,
        layout_width DECIMAL,
        layout_height DECIMAL,
        points TEXT,
        url TEXT,
        version VARCHAR,
        date_created INTEGER,
        date_modified INTEGER,
        date_processed DOUBLE,
        permissions_data TEXT,
        record_locator TEXT,
        category_depth INTEGER,
        parent_id VARCHAR,
        attached_filename VARCHAR,
        filetype VARCHAR,
        last_modified TIMESTAMP,
        file_directory VARCHAR,
        filename VARCHAR,
        languages VARCHAR[],
        page_number VARCHAR,
        links TEXT,
        page_name VARCHAR,
        link_urls VARCHAR[],
        link_texts VARCHAR[],
        sent_from VARCHAR[],
        sent_to VARCHAR[],
        subject VARCHAR,
        section VARCHAR,
        header_footer_type VARCHAR,
        emphasized_text_contents VARCHAR[],
        emphasized_text_tags VARCHAR[],
        text_as_html TEXT,
        regex_metadata TEXT,
        detection_class_prob DECIMAL
    );
    

    You can list the schema of a table by running the following DuckDB CLI commands, replacing the target catalog, schema, and table names:

    USE <catalog-name>.<schema-name>;
    DESCRIBE TABLE <table-name>;
    

Install the DuckDB connector dependencies, which are the same for the CLI and Python:

pip install "unstructured-ingest[duckdb]"

You might also need to install additional dependencies, depending on your needs.

Set the following environment variables:

  • DUCKDB_DATABASE - The path to the target DuckDB persistent database file with the extension .db or .duckdb, represented by --database (CLI) or database (Python).
  • DUCKDB_DB_SCHEMA - The name of the target schema in the database, represented by --db-schema (CLI) or db_schema (Python).
  • DUCKDB_TABLE - The name of the target table in the schema, represented by --table (CLI) or table (Python).

Also set these environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
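
As a rough illustration of how these variables map to the Python parameter names, the sketch below reads them into a dictionary. The function name load_settings and the choice of which variables to treat as required are assumptions made for this example, not behavior of the Unstructured SDK. The fallback values main and elements are the connector defaults described earlier.

```python
import os
from typing import Mapping

# Assumption for this sketch: these three must be set; schema and table may fall back.
REQUIRED = ("DUCKDB_DATABASE", "UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect connector settings from environment variables (hypothetical helper)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "database": env["DUCKDB_DATABASE"],                # database (Python)
        "db_schema": env.get("DUCKDB_DB_SCHEMA", "main"),  # db_schema (Python)
        "table": env.get("DUCKDB_TABLE", "elements"),      # table (Python)
        "api_key": env["UNSTRUCTURED_API_KEY"],
        "api_url": env["UNSTRUCTURED_API_URL"],
    }
```

Passing a plain dictionary instead of os.environ makes the helper easy to try out without changing your shell environment.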

Now call the Unstructured CLI or Python SDK. You can use any supported source connector. This example uses the local source connector: