Get element contents
Task
You want to get, manipulate, and print or save the contents of the document elements and metadata in the processed data that Unstructured returns.
Approach
Each document element contains fields for the element’s type, its ID, the extracted text, and associated metadata.
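To make that shape concrete, the following sketch represents one element as a Python dictionary. The field names (type, element_id, text, metadata) follow the description above and the examples later on this page; every value, including the ID and file name, is made up for illustration, and real output typically carries more metadata keys.

```python
# A hypothetical element; all values are illustrative, not real output.
element = {
    "type": "NarrativeText",
    "element_id": "0123456789abcdef",
    "text": "An example sentence extracted from a document.",
    "metadata": {
        "filename": "my-file.pdf",
    },
}

# Top-level fields are read with ordinary dictionary access.
element_type = element["type"]
element_text = element["text"]

# Metadata is a nested dictionary; .get() avoids a KeyError when a key
# is absent for a given element.
filename = element["metadata"].get("filename")
page_number = element["metadata"].get("page_number")  # Not set in this sample.

print(f"{element_type}: {element_text} (from {filename})")
```

Using .get() for metadata keys is a defensive choice here, on the assumption that not every element carries every metadata key.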
The programmatic approach you take to get these document elements will depend on which tool, SDK, or library you use:
For the Unstructured Ingest CLI, you can use a tool such as jq to work with the JSON file that the CLI outputs after processing is complete.
For example, the following script uses jq to access and print each element’s ID, text, and originating file name:
#!/usr/bin/env bash
JSON_FILE="local-ingest-output/my-file.json"
jq -r '.[] | "ID: \(.element_id)\nText: \(.text)\nFilename: \(.metadata.filename)\n"' \
"$JSON_FILE"
For the Unstructured Ingest Python library, you can use the standard Python json.load function to load the contents of a JSON file that the Ingest Python library outputs after processing is complete. The result is a Python list of dictionaries, one per element.
For example, the following code example uses standard Python to access and print each element’s ID, text, and originating file name:
import json


def parse_json_file(input_file_path: str):
    # The output file contains a JSON array with one object per element.
    with open(input_file_path, "r") as file:
        file_elements = json.load(file)

    for element in file_elements:
        print(f"ID: {element['element_id']}")
        print(f"Text: {element['text']}")
        print(f"Filename: {element['metadata']['filename']}\n")


if __name__ == "__main__":
    parse_json_file(
        input_file_path="local-ingest-output/my-file.json"
    )
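Because the loaded data is a plain list of dictionaries, manipulating and saving it needs nothing beyond the standard library. The sketch below keeps only the elements of one type and writes them to a new file. The helper name filter_elements_by_type, the sample data, and the file names are hypothetical; a temporary directory stands in for the real output location.

```python
import json
import os
import tempfile


def filter_elements_by_type(elements: list[dict], element_type: str) -> list[dict]:
    """Return only the elements whose "type" field equals element_type."""
    return [e for e in elements if e.get("type") == element_type]


# Sample data standing in for a file that the Ingest library would produce.
sample_elements = [
    {"type": "Title", "element_id": "a1", "text": "Report",
     "metadata": {"filename": "my-file.pdf"}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Body text.",
     "metadata": {"filename": "my-file.pdf"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    input_path = os.path.join(tmp_dir, "my-file.json")
    output_path = os.path.join(tmp_dir, "titles-only.json")

    # Write the sample so there is a file to load.
    with open(input_path, "w", encoding="utf-8") as f:
        json.dump(sample_elements, f)

    # Load, filter, and save the subset.
    with open(input_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)
    titles = filter_elements_by_type(loaded, "Title")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(titles, f, indent=2)

print([t["text"] for t in titles])
```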
For the Unstructured Python SDK, calling an UnstructuredClient object’s general.partition_async method returns a PartitionResponse object.
This PartitionResponse object’s elements variable contains a list of key-value dictionaries (List[Dict[str, Any]]). For example:
# ...
res = await client.general.partition_async(request=req)
# Do something with the elements, for example:
save_elements_to_file(res.elements)
# ...
You can use standard Python list operations and looping techniques on this list to access each element.
To work with an individual element’s contents, you can use standard dictionary operations on the element.
For example:
# ...
res = await client.general.partition_async(request=req)

for element in res.elements:
    # Do something with each element, for example:
    save_element_to_database(f"{element['element_id']}")
    save_element_to_database(f"{element['text']}")
    save_element_to_database(f"{element['metadata']['filename']}\n")
# ...
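As a sketch of the list and dictionary operations described above, the following example counts elements by type and joins the text of one type. It uses a hand-written list in place of res.elements so that it runs without an API call; the sample values are made up.

```python
from collections import Counter

# Stand-in for res.elements: a list of key-value dictionaries.
elements = [
    {"type": "Title", "element_id": "a1", "text": "Quarterly report",
     "metadata": {"filename": "report.pdf"}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew.",
     "metadata": {"filename": "report.pdf"}},
    {"type": "NarrativeText", "element_id": "c3", "text": "Costs fell.",
     "metadata": {"filename": "report.pdf"}},
]

# Count how many elements of each type are in the list.
type_counts = Counter(element["type"] for element in elements)

# Collect the text of the narrative elements, in document order.
narrative_text = " ".join(
    element["text"] for element in elements if element["type"] == "NarrativeText"
)

print(type_counts)
print(narrative_text)
```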
To serialize this list as JSON, you can:
- Use the elements_from_dicts function to convert the list of key-value dictionaries (Iterable[Dict[str, Any]]) into a list of elements (Iterable[Element]).
- Use the elements_to_json function to convert the list of elements into a JSON-formatted string and then print or save that string.
For example:
from unstructured.staging.base import elements_from_dicts, elements_to_json

# ...
res = await client.general.partition_async(request=req)

dict_elements = elements_from_dicts(
    element_dicts=res.elements
)

elements_to_json(
    elements=dict_elements,
    indent=2,
    filename=output_filepath
)
# ...
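Because res.elements is already a list of dictionaries, the standard json module may be enough when you do not need Element objects. This is an alternative sketch, not the approach shown above; it assumes every value in the dictionaries is JSON-serializable, and the sample data and output file name are hypothetical.

```python
import json
import os
import tempfile

# Stand-in for res.elements; values are made up.
elements = [
    {"type": "Title", "element_id": "a1", "text": "Report",
     "metadata": {"filename": "report.pdf"}},
]

# Serialize the list to a JSON-formatted string.
json_string = json.dumps(elements, indent=2)

# Save the string to a file, then read it back to confirm the round trip.
# A temporary directory is used so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_filepath = os.path.join(tmp_dir, "output.json")
    with open(output_filepath, "w", encoding="utf-8") as f:
        f.write(json_string)
    with open(output_filepath, "r", encoding="utf-8") as f:
        round_tripped = json.load(f)

print(json_string)
```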
For the Unstructured JavaScript/TypeScript SDK, calling an UnstructuredClient object’s general.partition method returns a Promise<PartitionResponse> object.
This PartitionResponse object’s elements property contains an Array of string-keyed objects ({ [k: string]: any; }[]). For example:
// ...
client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: inputFilepath
        },
        strategy: Strategy.HiRes,
        splitPdfPage: true,
        splitPdfAllowFailed: true,
        splitPdfConcurrencyLevel: 15
    }
}).then((res) => {
    if (res.statusCode === 200) {
        // Do something with the elements, for example:
        saveElementsToFile(res.elements)
    }
})
// ...
You can use standard Array operations on this array, including techniques such as forEach to access each object in it. For example:
// ...
client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: inputFilepath
        },
        strategy: Strategy.HiRes,
        splitPdfPage: true,
        splitPdfAllowFailed: true,
        splitPdfConcurrencyLevel: 15
    }
}).then((res) => {
    if (res.statusCode === 200) {
        res.elements?.forEach(element => {
            // Do something with each element, for example:
            saveElementToDatabase(`${element["element_id"]}`)
            saveElementToDatabase(`${element["text"]}`)
            saveElementToDatabase(`${element["metadata"]["filename"]}`)
        })
    }
})
// ...
To serialize this array as JSON, you can use the standard JSON.stringify function to convert it to a JSON-formatted string and the Node.js fs.writeFileSync function to save that string as a file. For example:
// ...
client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: inputFilepath
        },
        strategy: Strategy.HiRes,
        splitPdfPage: true,
        splitPdfAllowFailed: true,
        splitPdfConcurrencyLevel: 15
    }
}).then((res) => {
    if (res.statusCode === 200) {
        const jsonElements = JSON.stringify(res.elements, null, 2)
        fs.writeFileSync(outputFilepath, jsonElements)
    }
})
// ...
For the Unstructured open-source library, calling the partition_via_api function returns a list of elements (list[Element]). For example:
# ...
elements = partition_via_api(
    filename=input_filepath,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    api_url=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res"
)
# ...
You can use standard Python list operations and looping techniques on this list to access each element.
Each individual element has the following attributes:
- .text provides the element’s text field value as a str. See Element example.
- .metadata provides the element’s metadata field as an ElementMetadata object. See Metadata.
- .category provides the element’s type field value as a str. See Element type.
- .id provides the element’s element_id value as a str. See Element ID.
In addition, the following methods are available:
- .convert_coordinates_to_new_system() converts the element’s location coordinates, if any, to a new coordinate system. See Element’s coordinates.
- .to_dict() gets the element’s content as a standard Python key-value dictionary (dict[str, Any]).
For example:
# ...
elements = partition_via_api(
    filename=input_filepath,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    api_url=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res"
)

for element in elements:
    # Do something with each element, for example:
    save_element_to_database(f"{element.id}")
    save_element_to_database(f"{element.text}")
    save_element_to_database(f"{element.metadata.filename}")
To convert this list of elements into a list of Python dictionaries, you can use the elements_to_dicts function. For example:
from unstructured.staging.base import elements_to_dicts

# ...
elements = partition_via_api(
    filename=input_filepath,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    api_url=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res"
)

elements_dicts = elements_to_dicts(elements)
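Once the elements are plain dictionaries, they can be indexed like any other Python data. The sketch below builds a lookup table keyed by element_id. It uses hand-written dictionaries in place of the elements_to_dicts output so that it runs on its own, and the helper name index_by_id is hypothetical.

```python
from typing import Any


def index_by_id(element_dicts: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Map each element's element_id to its dictionary."""
    return {d["element_id"]: d for d in element_dicts}


# Stand-in for the list that elements_to_dicts returns; values are made up.
elements_dicts = [
    {"type": "Title", "element_id": "a1", "text": "Report",
     "metadata": {"filename": "report.pdf"}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Body text.",
     "metadata": {"filename": "report.pdf"}},
]

by_id = index_by_id(elements_dicts)

# Look up one element directly instead of scanning the list.
print(by_id["b2"]["text"])
```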
To serialize this list as JSON, you can use the elements_to_json function to convert the list of elements (Iterable[Element]) into a JSON-formatted string and then print or save that string. For example:
from unstructured.staging.base import elements_to_json

# ...
elements = partition_via_api(
    filename=input_filepath,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    api_url=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res"
)

# Get the JSON as a string, for example to print it.
json_elements = elements_to_json(
    elements=elements,
    indent=2
)

# Or save the JSON to a file.
elements_to_json(
    elements=elements,
    indent=2,
    filename=output_filepath
)