Overview
Concepts
Ingestion is the term that Unstructured uses for the set of activities that happen when files are input for processing. Ingestion enables multiple files to be processed as a batch.
You can perform ingestion with the following tools:
- The Unstructured Platform, a no-code, unlimited pay-as-you-go platform with a user interface for getting all of your data ready for Retrieval Augmented Generation (RAG) and model fine-tuning.
- The Unstructured Ingest CLI, with unlimited pay-as-you-go and limited free options, which enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.
- The Unstructured Ingest Python library, with unlimited pay-as-you-go and limited free options, which enables you to use Python code to get all of your data ready for RAG and model fine-tuning.
The Unstructured Python SDK and Unstructured JavaScript/TypeScript SDK can process only one file at a time.
Files are ingested from an originating source location. Each batch of ingested files is processed either entirely by Unstructured or entirely locally. The processed data is sent to a target destination location. The kinds of locations you can specify vary:
- When you use the Unstructured Platform, the source and destination must both be in cloud storage. Local source or local destination locations are not allowed. The Unstructured Platform enables you to connect to many kinds of sources and destinations.
- If you use the Unstructured Ingest CLI or the Unstructured Ingest Python library, the source or destination can be a cloud storage location or a local location. Unstructured provides many source and destination connectors.
Ingestion options for the Unstructured service
In this flow, files are sent to Unstructured for processing, and Unstructured delivers the processed data:
- This flow always happens for the Unstructured Platform. The Platform only allows sending files from cloud storage and sending processed data to cloud storage.
- For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:
  - When using the Unstructured Ingest CLI, include the --partition-by-api option and set --api-key and --partition-endpoint to a valid, matching Unstructured API key and API URL, respectively.
  - When using the Unstructured Ingest Python library, set partition_by_api=True and set api_key and partition_endpoint to a valid, matching Unstructured API key and API URL, respectively.
Local ingestion options
This is the flow for processing files locally. No files are sent to Unstructured for processing:
- This flow never happens for the Unstructured Platform. The Platform does not allow sending files from a local source to Unstructured, or Unstructured sending processed data to a local destination.
- For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:
  - When using the Unstructured Ingest CLI, omit the --partition-by-api, --api-key, and --partition-endpoint options.
  - When using the Unstructured Ingest Python library, omit partition_by_api or explicitly set partition_by_api=False. Also omit api_key and partition_endpoint.
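The difference between the two flows comes down to three settings. The following is a minimal sketch of that rule in plain Python. The helper names partitioner_settings and cli_flags are hypothetical and are not part of the Unstructured Ingest Python library; only the setting names (partition_by_api, api_key, partition_endpoint) and the CLI option names come from the descriptions above.

```python
from typing import Dict, List, Optional


def partitioner_settings(
    use_api: bool,
    api_key: Optional[str] = None,
    partition_endpoint: Optional[str] = None,
) -> Dict[str, object]:
    """Return the partitioning settings for one of the two flows.

    Hypothetical helper: it only encodes which settings each flow needs.
    """
    if not use_api:
        # Local flow: api_key and partition_endpoint are left out entirely.
        return {"partition_by_api": False}
    if not api_key or not partition_endpoint:
        raise ValueError("The API flow needs both api_key and partition_endpoint.")
    return {
        "partition_by_api": True,
        "api_key": api_key,
        "partition_endpoint": partition_endpoint,
    }


def cli_flags(settings: Dict[str, object]) -> List[str]:
    """Translate the settings into the matching Ingest CLI options."""
    if not settings.get("partition_by_api"):
        return []  # Local flow: omit all three options.
    return [
        "--partition-by-api",
        "--api-key", str(settings["api_key"]),
        "--partition-endpoint", str(settings["partition_endpoint"]),
    ]
```

In this sketch, calling partitioner_settings(False) describes the local flow and produces no CLI options, while partitioner_settings(True, ...) with a matching key and URL describes the flow that sends files to Unstructured.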
Unstructured Ingest CLI
The Unstructured Ingest CLI enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.
One approach to using the CLI is installing Python and then running the following command to install the CLI:
pip install unstructured-ingest
This default installation option enables the ingestion of plain text files, HTML, XML, JSON and emails that do not require any extra dependencies. This default option also enables you to specify local source and destination locations.
You might also need to install additional dependencies, depending on your needs. Learn more.
For additional installation options, see:
- Run the library in a container
- Installing the library
- The installation commands for additional connectors for sources and destinations
To display the list of available source connector commands, run the following command:
To display the list of available destination connector commands, run the following command:
To display help for a specific source connector command, run the following command:
To display help for a specific destination connector command, run the following command:
To begin using the CLI, see the quickstarts for the:
If you previously installed the CLI by running pip install unstructured, see the migration guide.
Unstructured Ingest Python library
The Unstructured Ingest Python library enables you to use Python code to get all of your data ready for RAG and model fine-tuning.
One approach to using the Unstructured Ingest Python library is installing Python and then running the following command to install the library and the default connectors:
pip install unstructured-ingest
This default installation option enables the ingestion of plain text files, HTML, XML, JSON and emails that do not require any extra dependencies. This default option also enables you to specify local source and destination locations.
You might also need to install additional dependencies, depending on your needs. Learn more.
For additional installation options, see:
- Run the library in a container
- Installing the library
- The installation commands for additional connectors for sources and destinations
Some source and destination connectors provide newer v2 and older v1 implementations, while some provide only older v1 implementations. You should use the v2 implementations wherever they are available, to help ensure better forward-compatibility of your code. For the lists of available v2 and v1 connectors, see:
- v2 non-fsspec connectors
- v2 fsspec connectors
- v1 non-fsspec connectors
- v1 fsspec connectors
- v1 Notion connector
To begin using the Unstructured Ingest Python library, see the code examples for the source and destination connectors.
If you previously installed the library by running pip install unstructured, see the migration guide.
Generate Python code examples
You can connect any available source connector to any available destination connector. However, the source connector code examples in the documentation show connecting only to the local destination connector. Similarly, the destination connector code examples in the documentation show connecting only to the local source connector.
To quickly generate an Unstructured Ingest Python library code example that connects any available source connector to any available destination connector, do the following:
- Open the Unstructured Ingest Code Generator webpage.
- Select your input (source) location type from the Get unstructured documents from drop-down list.
- Select your output (destination) location type from the Upload RAG-ready documents to drop-down list.
- Select your chunking strategy from the Chunking strategy drop-down list:
  - None - Do not chunk the data elements’ content.
  - basic - Combine sequential data elements to maximally fill each chunk. However, do not mix Table and non-Table elements in the same chunk.
  - by_title - Use the basic strategy and also preserve section boundaries. Optionally preserve page boundaries as well.
  - by_page - Use the basic strategy and also preserve page boundaries.
  - by_similarity - Use the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. This strategy is available only when calling Unstructured API services.
  To learn more, see Chunking strategies and Chunking configuration.
- For any chunking strategy other than None:
  - Enter your chunk size in the Chunk size (characters) box, or leave the default of 1000 characters.
  - If you need the chunks to overlap, enter the chunk overlap size in the Chunk overlap (characters) box, or leave the default of 20 characters.
  To learn more, see Chunking configuration.
- To generate vector embeddings, select the provider in the Embedding provider drop-down list.
  To learn more, see Embedding configuration.
- Click Generate code.
- Copy the example code from the Generated Code pane into your code project.
- The code example will contain one or more environment variables that you must set for the code to run correctly. To learn what to set these variables to, click the documentation links that are below the Generated Code pane.
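To build intuition for the chunk size and chunk overlap settings, here is a simplified illustration of the basic strategy as described above: sequential elements are combined until the chunk is full, and Table and non-Table elements are never mixed. This is not Unstructured's implementation. The (type, text) tuples, the function names, and the choice to apply overlap only when a single oversized element is split are all assumptions made for this sketch.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text): a simplified stand-in


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split one oversized text into windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return pieces


def basic_chunks(
    elements: List[Element], chunk_size: int = 1000, overlap: int = 20
) -> List[str]:
    """Greedily pack sequential elements into chunks of at most chunk_size."""
    chunks: List[str] = []
    current: List[str] = []
    current_is_table = False

    def flush() -> None:
        # Close the chunk being built, if any.
        if current:
            chunks.append(" ".join(current))
            current.clear()

    for kind, text in elements:
        is_table = kind == "Table"
        if len(text) > chunk_size:
            # An element that cannot fit in any chunk is split on its own.
            flush()
            chunks.extend(split_with_overlap(text, chunk_size, overlap))
            continue
        fits = len(" ".join(current + [text])) <= chunk_size
        # Start a new chunk on overflow or when Table and non-Table would mix.
        if current and (not fits or is_table != current_is_table):
            flush()
        current.append(text)
        current_is_table = is_table
    flush()
    return chunks
```

With a chunk size of 50, a title and a short paragraph share one chunk, a following table gets its own chunk, and the text after the table starts a third. The by_title and by_page strategies add further boundaries on top of this behavior.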
Migration guide
The older unstructured versions of the Unstructured Ingest CLI and Unstructured Ingest Python library have been replaced and are now deprecated.
To migrate to the newer unstructured-ingest versions of the Ingest CLI and Ingest Python library, do the following:
- If you previously ran pip install unstructured only for the purposes of using the Ingest CLI or the Ingest Python library, upgrade to the latest versions by running the following commands:
  a. pip uninstall unstructured
  b. pip install unstructured-ingest
- If you previously installed an older version of a source or destination connector, for example pip install "unstructured[azure]" for the Azure Storage connector, upgrade to the latest version by running the following commands:
  a. pip uninstall "unstructured[azure]"
  b. pip install "unstructured-ingest[azure]"
- If you were running Python code against an older version of the Ingest Python library, update your import statements by replacing all instances of unstructured.ingest with unstructured_ingest to run against the latest version.
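For a larger code base, the import replacement in the last step can be scripted. The following is a minimal sketch, assuming your sources are available as text. The function name migrate_imports is hypothetical. It performs only the textual replacement named above and does not check whether the rest of each module path still exists in the newer package.

```python
import re

# Match the old package path only as a whole dotted name, so that text such as
# "my_unstructured.ingest" or "unstructured.ingestion" is left alone.
_OLD_IMPORT = re.compile(r"(?<![\w.])unstructured\.ingest\b")


def migrate_imports(source: str) -> str:
    """Replace every instance of unstructured.ingest with unstructured_ingest."""
    return _OLD_IMPORT.sub("unstructured_ingest", source)
```

One way to apply it is to read each .py file, pass its text through migrate_imports, review the difference, and then write the result back.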