DuckDB


Batch process all your records to store structured outputs in a DuckDB installation.

The requirements are as follows.

A DuckDB installation.
A persistent database. You can create one, for example, by running the DuckDB CLI command duckdb <my-database-filename>.db or duckdb <my-database-filename>.duckdb, replacing <my-database-filename> with the name of the target file.
The path to the target persistent database file.
A schema in the target database.
- Create a schema, for example with a CREATE SCHEMA statement.
- You can list available schemas and their parent catalogs by running the following DuckDB CLI command:
  SELECT * FROM information_schema.schemata;
The DuckDB connector uses the default schema name of main if not otherwise specified.
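
As a minimal sketch, the database file and schema can also be prepared from Python instead of the DuckDB CLI. This assumes the `duckdb` Python package is installed. The helper names `schema_setup_statements` and `ensure_schema` are hypothetical and are not part of the connector.

```python
def schema_setup_statements(schema_name: str = "main") -> list[str]:
    """Return SQL that creates the schema if needed, then lists all schemas."""
    # The name is interpolated into SQL, so accept plain identifiers only.
    if not schema_name.isidentifier():
        raise ValueError(f"unsafe schema name: {schema_name!r}")
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema_name};",
        "SELECT catalog_name, schema_name FROM information_schema.schemata;",
    ]


def ensure_schema(database_path: str, schema_name: str = "main") -> list[tuple]:
    """Open (or create) the persistent database file and ensure the schema exists."""
    # Third-party package, imported here so the helper above works without it.
    import duckdb

    create_sql, list_sql = schema_setup_statements(schema_name)
    # Connecting to a file path creates the database file if it does not exist.
    with duckdb.connect(database_path) as con:
        con.execute(create_sql)
        return con.execute(list_sql).fetchall()
```

Calling `ensure_schema("my-database.duckdb", "my_schema")` would return one `(catalog_name, schema_name)` tuple per schema, which should include the newly created one.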

A table in the target schema.

- Create a table, for example with a CREATE TABLE statement.
- You can list available tables in a schema by running the following DuckDB CLI commands, replacing the target catalog and schema names:
```
USE <catalog-name>.<schema-name>;
SHOW TABLES;
```

The DuckDB connector uses the default table name of elements if not otherwise specified.

For maximum compatibility, Unstructured recommends the following table schema:

```
CREATE TABLE elements (
    id VARCHAR,
    element_id VARCHAR,
    text TEXT,
    embeddings FLOAT[],
    type VARCHAR,
    system VARCHAR,
    layout_width DECIMAL,
    layout_height DECIMAL,
    points TEXT,
    url TEXT,
    version VARCHAR,
    date_created INTEGER,
    date_modified INTEGER,
    date_processed DOUBLE,
    permissions_data TEXT,
    record_locator TEXT,
    category_depth INTEGER,
    parent_id VARCHAR,
    attached_filename VARCHAR,
    filetype VARCHAR,
    last_modified TIMESTAMP,
    file_directory VARCHAR,
    filename VARCHAR,
    languages VARCHAR[],
    page_number VARCHAR,
    links TEXT,
    page_name VARCHAR,
    link_urls VARCHAR[],
    link_texts VARCHAR[],
    sent_from VARCHAR[],
    sent_to VARCHAR[],
    subject VARCHAR,
    section VARCHAR,
    header_footer_type VARCHAR,
    emphasized_text_contents VARCHAR[],
    emphasized_text_tags VARCHAR[],
    text_as_html TEXT,
    regex_metadata TEXT,
    detection_class_prob DECIMAL
);
```

You can list the schema of a table by running the following DuckDB CLI commands, replacing the target catalog, schema, and table names:

```
USE <catalog-name>.<schema-name>;
DESCRIBE TABLE <table-name>;
```
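
The same check can be scripted. The following is a rough sketch, assuming the `duckdb` Python package is installed, that compares a table's actual columns against the ones you expect. The helper names `missing_columns` and `describe_table` are hypothetical, and the sketch assumes the column name is the first field of each row that DESCRIBE returns.

```python
from typing import Iterable, Sequence


def missing_columns(describe_rows: Iterable[Sequence], expected: Iterable[str]) -> list[str]:
    """Return the expected column names that are absent from the DESCRIBE output."""
    # Each DESCRIBE row is assumed to start with the column name.
    present = {row[0] for row in describe_rows}
    return [name for name in expected if name not in present]


def describe_table(database_path: str, schema_name: str = "main",
                   table_name: str = "elements") -> list[tuple]:
    """Run DESCRIBE against the target table and return the raw rows."""
    # Names are interpolated into SQL, so accept plain identifiers only.
    for name in (schema_name, table_name):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # Third-party package, imported here so missing_columns works without it.
    import duckdb

    # Read-only access is enough for inspecting the table.
    with duckdb.connect(database_path, read_only=True) as con:
        return con.execute(f"DESCRIBE {schema_name}.{table_name};").fetchall()
```

For example, `missing_columns(describe_table("my-database.duckdb"), ["id", "element_id", "text"])` would return an empty list when all three columns exist in the default `elements` table.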

The DuckDB connector dependencies, installed the same way for both the CLI and Python:

```
pip install "unstructured-ingest[duckdb]"
```

You might also need to install additional dependencies, depending on your needs.

The following environment variables:

DUCKDB_DATABASE - The path to the target DuckDB persistent database file with the extension .db or .duckdb, represented by --database (CLI) or database (Python).
DUCKDB_DB_SCHEMA - The name of the target schema in the database, represented by --db-schema (CLI) or db_schema (Python).
DUCKDB_TABLE - The name of the target table in the schema, represented by --table (CLI) or table (Python).

These environment variables:

UNSTRUCTURED_API_KEY - Your Unstructured API key value.
UNSTRUCTURED_API_URL - Your Unstructured API URL.
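
Before starting a run, it can help to confirm that these variables are set and to apply the connector's documented defaults (`main` and `elements`). The sketch below uses only the Python standard library. The names `DuckDBDestination`, `destination_from_env`, and `missing_api_variables` are hypothetical helpers, not part of the Unstructured SDK.

```python
import os
from dataclasses import dataclass
from pathlib import PurePath
from typing import Mapping, Optional


@dataclass(frozen=True)
class DuckDBDestination:
    database: str
    db_schema: str = "main"    # default schema name used by the connector
    table: str = "elements"    # default table name used by the connector


def destination_from_env(env: Optional[Mapping[str, str]] = None) -> DuckDBDestination:
    """Read the DUCKDB_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    database = env.get("DUCKDB_DATABASE", "")
    # The database file is expected to end in .db or .duckdb.
    if PurePath(database).suffix not in (".db", ".duckdb"):
        raise ValueError("DUCKDB_DATABASE must point to a .db or .duckdb file")
    return DuckDBDestination(
        database=database,
        db_schema=env.get("DUCKDB_DB_SCHEMA") or "main",
        table=env.get("DUCKDB_TABLE") or "elements",
    )


def missing_api_variables(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of the Unstructured API variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")
    return [name for name in required if not env.get(name)]
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helpers easy to exercise in isolation.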

Now call the Unstructured CLI or Python SDK. You can use any supported source connector. This example uses the local source connector:
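
As a partial sketch only, the destination portion of a CLI call can be assembled from the three flags listed above (`--database`, `--db-schema`, `--table`). The `duckdb` subcommand name and the helper `duckdb_destination_args` are assumptions made for illustration. The source connector and partitioning arguments are deliberately left out, so check them against the CLI's own help output.

```python
import shlex


def duckdb_destination_args(database: str, db_schema: str = "main",
                            table: str = "elements") -> list[str]:
    """Build the destination arguments for the DuckDB connector.

    The "duckdb" subcommand name is an assumption; the three flags are the
    ones documented for the connector.
    """
    return [
        "duckdb",
        "--database", database,
        "--db-schema", db_schema,
        "--table", table,
    ]


# Render the arguments as a shell-safe string for inspection or logging.
destination = shlex.join(duckdb_destination_args("my-database.duckdb"))
```

Keeping the arguments as a list until the last moment avoids quoting problems when a database path contains spaces.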
