Extract images and tables from documents
Task
You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document.
Approach
Extract the Base64-encoded representation of specific elements, such as images and tables, in the document. For each of these extracted elements, decode the Base64-encoded representation of the element into its original visual representation and then show it.
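The decode step can be sketched in isolation using only the Python standard library. The element below is a hypothetical stand-in rather than real Unstructured output: its payload is a handful of placeholder bytes that begin with the PNG signature, and the helper name `decode_element_image` is introduced here purely for illustration.

```python
import base64
from typing import Optional


def decode_element_image(element: dict) -> Optional[bytes]:
    """Return the decoded bytes of an element's image_base64 field, or None if absent."""
    encoded = element.get("metadata", {}).get("image_base64")
    if encoded is None:
        return None
    return base64.b64decode(encoded)


# Hypothetical element: placeholder bytes, not a complete image file.
placeholder_bytes = b"\x89PNG\r\n\x1a\n" + b"placeholder"
sample_element = {
    "type": "Image",
    "metadata": {
        "image_base64": base64.b64encode(placeholder_bytes).decode("ascii")
    },
}

decoded = decode_element_image(sample_element)
```

With real output, the decoded bytes are what the examples in the Code section pass to `Image.open` through `io.BytesIO`.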
To run this example
You will need a document that is one of the document types supported by the extract_image_block_types argument. See the extract_image_block_types entry in API Parameters.
This example uses a PDF file with embedded images and tables.
Code
For the Unstructured Ingest Python library, after processing is complete, you can use the standard Python json.load function to load the contents of a JSON file that the library outputs. In this example, the loaded result is a list of elements, each of which is a Python dictionary.
import json, base64, io
from PIL import Image

def get_image_block_types(input_json_file_path: str):
    with open(input_json_file_path, 'r') as file:
        file_elements = json.load(file)

    for element in file_elements:
        if "image_base64" in element["metadata"]:
            # Decode the Base64-encoded representation of the
            # processed "Image" or "Table" element into its original
            # visual representation, and then show it.
            image_data = base64.b64decode(element["metadata"]["image_base64"])
            image = Image.open(io.BytesIO(image_data))
            image.show()

if __name__ == "__main__":
    # Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/pdf/embedded-images-tables.pdf
    # Specify where to get the local file, relative to this .py file.
    get_image_block_types(
        input_json_file_path="local-ingest-output/embedded-images-tables.json"
    )
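On a machine without a display, or when a document has many images, writing the decoded bytes to disk may be more practical than calling `image.show()`. The following is a minimal sketch under stated assumptions: the function name `save_extracted_images`, the file-naming scheme, and the `EXTENSIONS` mapping are all hypothetical, and it assumes each element's metadata may carry an `image_mime_type` value alongside `image_base64`. When that value is missing or unrecognized, the sketch falls back to a `.bin` extension.

```python
import base64
import tempfile
from pathlib import Path
from typing import List

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png"}


def save_extracted_images(elements: List[dict], output_dir: str) -> List[Path]:
    """Write each element's decoded image_base64 payload to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, element in enumerate(elements):
        metadata = element.get("metadata", {})
        if "image_base64" not in metadata:
            continue  # Element has no extracted image payload.
        extension = EXTENSIONS.get(metadata.get("image_mime_type"), ".bin")
        kind = str(element.get("type", "element")).lower()
        target = out / f"{index:04d}-{kind}{extension}"
        target.write_bytes(base64.b64decode(metadata["image_base64"]))
        saved.append(target)
    return saved


# Hypothetical elements with placeholder payloads, not real images.
sample_elements = [
    {"type": "Image", "metadata": {
        "image_base64": base64.b64encode(b"abc").decode("ascii"),
        "image_mime_type": "image/png"}},
    {"type": "Title", "metadata": {}},
    {"type": "Table", "metadata": {
        "image_base64": base64.b64encode(b"xyz").decode("ascii")}},
]

with tempfile.TemporaryDirectory() as demo_dir:
    demo_paths = save_extracted_images(sample_elements, demo_dir)
    demo_names = [path.name for path in demo_paths]
    demo_first_bytes = demo_paths[0].read_bytes()
```

To use it with real output, load the JSON file with `json.load` as shown above and pass the resulting list as `elements`.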
For the Unstructured Python SDK, you'll need these environment variables:
- UNSTRUCTURED_API_KEY: Your Unstructured API key value.
- UNSTRUCTURED_API_URL: Your Unstructured API URL.
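Because `os.getenv` returns `None` for an unset variable, a missing key or URL might only surface later as a less obvious request error. One way to fail early is sketched below; the helper name `require_env` is hypothetical and is not part of the Unstructured SDK.

```python
import os
from typing import Dict, Iterable


def require_env(names: Iterable[str]) -> Dict[str, str]:
    """Return the named environment variables, raising if any is unset or empty."""
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

For example, calling `require_env(["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL"])` before constructing the client would raise a `RuntimeError` that names whichever variable is absent.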
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

import os
import base64
from PIL import Image
import io

if __name__ == "__main__":
    client = UnstructuredClient(
        api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
        server_url=os.getenv("UNSTRUCTURED_API_URL")
    )

    # Source: https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/embedded-images-tables.pdf
    # Where to get the input file, relative to this .py file.
    local_input_filepath = "local-ingest-input-pdf/embedded-images-tables.pdf"

    with open(local_input_filepath, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=local_input_filepath
        )

    request = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=files,
            strategy=shared.Strategy.HI_RES,
            split_pdf_page=True,
            split_pdf_allow_failed=True,
            split_pdf_concurrency_level=15,
            # Extract the Base64-encoded representation of each
            # processed "Image" and "Table" element. Extract each into
            # an "image_base64" object, as a child of the
            # "metadata" object, for that element in the result.
            # Element type names, such as "Image" and "Table" here,
            # are case-insensitive.
            # Any available Unstructured element type is allowed.
            extract_image_block_types=["Image", "Table"]
        )
    )

    try:
        # Use the synchronous call here. The partition_async variant
        # can only be awaited from inside a coroutine.
        result = client.general.partition(request=request)

        for element in result.elements:
            if "image_base64" in element["metadata"]:
                # Decode the Base64-encoded representation of the
                # processed "Image" or "Table" element into its original
                # visual representation, and then show it.
                image_data = base64.b64decode(element["metadata"]["image_base64"])
                image = Image.open(io.BytesIO(image_data))
                image.show()
    except Exception as e:
        print(e)
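Each call to `image.show()` typically opens a separate viewer window, so it can help to check how many elements carry an extracted payload before displaying them. The sketch below counts those elements by their `type` value; the function name `count_extracted_by_type` and the sample elements are hypothetical, and the sample payloads are empty strings rather than real Base64 data.

```python
from collections import Counter
from typing import Iterable


def count_extracted_by_type(elements: Iterable[dict]) -> Counter:
    """Count elements that have an image_base64 payload, grouped by element type."""
    return Counter(
        element.get("type", "Unknown")
        for element in elements
        if "image_base64" in element.get("metadata", {})
    )


# Hypothetical elements shaped like the dictionaries used in the examples above.
sample_elements = [
    {"type": "Image", "metadata": {"image_base64": ""}},
    {"type": "Title", "metadata": {}},
    {"type": "Table", "metadata": {"image_base64": ""}},
    {"type": "Image", "metadata": {"image_base64": ""}},
]

counts = count_extracted_by_type(sample_elements)
```

The same function could be applied to `result.elements` from the SDK example or to the list returned by `json.load` in the Ingest example.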