diff --git a/ai-scripts/data-extractor.txt b/ai-scripts/data-extractor.txt
new file mode 100644
index 00000000..8056fa41
--- /dev/null
+++ b/ai-scripts/data-extractor.txt
@@ -0,0 +1,9 @@
+---
+Target article: /ui/data-extractor.md
+Target image: /img/ui/data-extractor/structured-data-extraction-conceptual-flow.png
+AI tool used: Google Gemini
+
+AI prompt:
+
+Generate an image of one filled-in medical form, with an arrow pointing from the form to a JSON representation of the form's content, then an arrow pointing from the JSON to a JSON file inside of cloud file storage, then an arrow pointing from cloud file storage to inserting the JSON file as a record inside of a database table. Make the arrows straight, and put sufficient padding between each of these elements.
+---
\ No newline at end of file
diff --git a/api-reference/workflow/workflows.mdx b/api-reference/workflow/workflows.mdx
index eab07e14..a3482d14 100644
--- a/api-reference/workflow/workflows.mdx
+++ b/api-reference/workflow/workflows.mdx
@@ -2234,6 +2234,85 @@ Allowed values for `subtype` and `model_name` include the following:
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-multimodal-3"`
+### Extract node
+
+An **Extract** node has a `type` of `structured_data_extractor` and a `subtype` of `llm`.
+
+
+
+ ```python
+  extractor_workflow_node = WorkflowNode(
+ name="Extractor",
+ subtype="llm",
+ type="structured_data_extractor",
+ settings={
+ "schema_to_extract": {
+ "json_schema": "",
+ "extraction_guidance": ""
+ },
+ "provider": "",
+ "model": ""
+ }
+ )
+ ```
+
+
+ ```json
+ {
+ "name": "Extractor",
+ "type": "structured_data_extractor",
+ "subtype": "llm",
+ "settings": {
+ "schema_to_extract": {
+ "json_schema": "",
+ "extraction_guidance": ""
+ },
+ "provider": "",
+ "model": ""
+ }
+ }
+ ```
+
+
+
+Fields for `settings` include:
+
+- `schema_to_extract`: _Required_. The schema or guidance for the structured data that you want to extract. One (and only one) of the following must also be specified:
+
+ - `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string.
+ - `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.
+
+- `provider` and `model`: Allowed values include the following:
+
+ - `"provider": "anthropic"`
+
+ - `"model": "claude-opus-4-5-20251101"`
+ - `"model": "claude-sonnet-4-5-20250929"`
+ - `"model": "claude-haiku-4-5-20251001"`
+ - `"model": "claude-3-7-sonnet-20250219"`
+ - `"model": "claude-sonnet-4-20250514"`
+
+ - `"provider": "azure_openai"`
+
+ - `"model": "gpt-5-mini"`
+ - `"model": "gpt-4o"`
+ - `"model": "gpt-4o-mini"`
+
+ - `"provider": "bedrock"`
+
+ - `"model": "us.anthropic.claude-opus-4-20250514-v1:0"`
+ - `"model": "us.anthropic.claude-sonnet-4-20250514-v1:0"`
+ - `"model": "us.anthropic.claude-3-7-sonnet-20250219-v1:0"`
+ - `"model": "us.anthropic.claude-sonnet-4-5-20250929-v1:0"`
+
+ - `"provider": "openai"`
+
+ - `"model": "gpt-4o"`
+ - `"model": "gpt-5-mini"`
+ - `"model": "gpt-4o-mini"`
+
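+For example, a filled-in Extract node might look like the following sketch. The schema shown here is illustrative, not required; note that `json_schema`
+takes the schema serialized as a single string, which you can produce with `json.dumps`:
+
+```python
+import json
+
+# An illustrative extraction schema that follows the OpenAI Structured Outputs guidelines.
+order_schema = {
+    "type": "object",
+    "properties": {
+        "customer_id": {"type": "string", "description": "The customer's ID"},
+        "order_total": {"type": "number", "description": "The order's total amount"}
+    },
+    "required": ["customer_id", "order_total"],
+    "additionalProperties": False
+}
+
+extractor_workflow_node = WorkflowNode(
+    name="Extractor",
+    subtype="llm",
+    type="structured_data_extractor",
+    settings={
+        "schema_to_extract": {
+            # Populate either json_schema or extraction_guidance, but not both.
+            "json_schema": json.dumps(order_schema),
+            "extraction_guidance": ""
+        },
+        "provider": "anthropic",
+        "model": "claude-sonnet-4-5-20250929"
+    }
+)
+```
+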
+[Learn more](/ui/data-extractor).
+
## List templates
To list templates, use the `UnstructuredClient` object's `templates.list_templates` function (for the Python SDK) or the `GET` method to call the `/templates` endpoint (for `curl` or Postman).
diff --git a/docs.json b/docs.json
index 6a432202..d4684fd2 100644
--- a/docs.json
+++ b/docs.json
@@ -120,6 +120,7 @@
"pages": [
"ui/document-elements",
"ui/partitioning",
+ "ui/data-extractor",
"ui/chunking",
{
"group": "Enriching",
diff --git a/img/ui/data-extractor/house-plant-care.png b/img/ui/data-extractor/house-plant-care.png
new file mode 100644
index 00000000..b23a8356
Binary files /dev/null and b/img/ui/data-extractor/house-plant-care.png differ
diff --git a/img/ui/data-extractor/medical-invoice.png b/img/ui/data-extractor/medical-invoice.png
new file mode 100644
index 00000000..b632da26
Binary files /dev/null and b/img/ui/data-extractor/medical-invoice.png differ
diff --git a/img/ui/data-extractor/real-estate-listing.png b/img/ui/data-extractor/real-estate-listing.png
new file mode 100644
index 00000000..df7fc475
Binary files /dev/null and b/img/ui/data-extractor/real-estate-listing.png differ
diff --git a/img/ui/data-extractor/schema-builder.png b/img/ui/data-extractor/schema-builder.png
new file mode 100644
index 00000000..d7296640
Binary files /dev/null and b/img/ui/data-extractor/schema-builder.png differ
diff --git a/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf b/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf
new file mode 100644
index 00000000..3655cce4
Binary files /dev/null and b/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf differ
diff --git a/img/ui/data-extractor/structured-data-extraction-conceptual-flow.png b/img/ui/data-extractor/structured-data-extraction-conceptual-flow.png
new file mode 100644
index 00000000..0d9447cc
Binary files /dev/null and b/img/ui/data-extractor/structured-data-extraction-conceptual-flow.png differ
diff --git a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
index 5a671c39..a8263279 100644
--- a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
+++ b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
@@ -462,9 +462,264 @@ embedding model that is provided by an embedding provider. For the best embeddin
6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
the workflow designer so that you can continue designing things later as you see fit.
+## Step 7: Experiment with structured data extraction
+
+In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process where Unstructured
+automatically extracts the data from your source documents into a format that you define up front. For example, in addition to Unstructured
+partitioning your source documents into elements with types such as `NarrativeText`, `UncategorizedText`, and so on, you can have Unstructured
+output key information from the source documents in a custom structured data format, appearing within a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, `email`, and so on.
+
+1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.
+
+ 
+
+2. In the node's settings pane, on the **Details** tab, under **Provider**, select **Anthropic**. Under **Model**, select **Claude Sonnet 4.5**. This is the model that Unstructured will use to do the structured data extraction.
+
+
+ The list of available models for structured data extraction is constantly being updated. Your list might also be different, depending on your Unstructured
+   account type. If **Anthropic** and **Claude Sonnet 4.5** are not available, choose another available model from the list.
+
+ If you have an Unstructured **Business** account and want to add more models to this list, contact your
+ Unstructured account administrator or Unstructured sales representative, or email Unstructured Support at
+ [support@unstructured.io](mailto:support@unstructured.io).
+
+
+3. Click **Upload JSON**.
+4. In the **JSON Schema** box, enter the following JSON schema, and then click **Use this Schema**:
+
+ ```json
+ {
+ "type": "object",
+ "properties": {
+ "title": {
+ "type": "string",
+ "description": "Full title of the research paper"
+ },
+ "authors": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "Author's full name"
+ },
+ "affiliation": {
+ "type": "string",
+ "description": "Author's institutional affiliation"
+ },
+ "email": {
+ "type": "string",
+ "description": "Author's email address"
+ }
+ },
+ "required": [
+ "name",
+ "affiliation",
+ "email"
+ ],
+ "additionalProperties": false
+ },
+ "description": "List of paper authors with their affiliations"
+ },
+ "abstract": {
+ "type": "string",
+ "description": "Paper abstract summarizing the research"
+ },
+ "introduction": {
+ "type": "string",
+ "description": "Introduction section describing the problem and motivation"
+ },
+ "methodology": {
+ "type": "object",
+ "properties": {
+ "approach_name": {
+ "type": "string",
+ "description": "Name of the proposed method (e.g., StrokeNet)"
+ },
+ "description": {
+ "type": "string",
+ "description": "Detailed description of the methodology"
+ },
+ "key_techniques": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of key techniques used in the approach"
+ }
+ },
+ "required": [
+ "approach_name",
+ "description",
+ "key_techniques"
+ ],
+ "additionalProperties": false
+ },
+ "experiments": {
+ "type": "object",
+ "properties": {
+ "datasets": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "Dataset name"
+ },
+ "description": {
+ "type": "string",
+ "description": "Dataset description"
+ },
+ "size": {
+ "type": "string",
+ "description": "Dataset size (e.g., number of sentence pairs)"
+ }
+ },
+ "required": [
+ "name",
+ "description",
+ "size"
+ ],
+ "additionalProperties": false
+ },
+ "description": "Datasets used for evaluation"
+ },
+ "baselines": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "Baseline methods compared against"
+ },
+ "evaluation_metrics": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "Metrics used for evaluation"
+ },
+ "experimental_setup": {
+ "type": "string",
+ "description": "Description of experimental configuration and hyperparameters"
+ }
+ },
+ "required": [
+ "datasets",
+ "baselines",
+ "evaluation_metrics",
+ "experimental_setup"
+ ],
+ "additionalProperties": false
+ },
+ "results": {
+ "type": "object",
+ "properties": {
+ "main_findings": {
+ "type": "string",
+ "description": "Summary of main experimental findings"
+ },
+ "performance_improvements": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "dataset": {
+ "type": "string",
+ "description": "Dataset name"
+ },
+ "metric": {
+ "type": "string",
+ "description": "Evaluation metric (e.g., BLEU)"
+ },
+ "baseline_score": {
+ "type": "number",
+ "description": "Baseline method score"
+ },
+ "proposed_score": {
+ "type": "number",
+ "description": "Proposed method score"
+ },
+ "improvement": {
+ "type": "number",
+ "description": "Improvement over baseline"
+ }
+ },
+ "required": [
+ "dataset",
+ "metric",
+ "baseline_score",
+ "proposed_score",
+ "improvement"
+ ],
+ "additionalProperties": false
+ },
+ "description": "Performance improvements over baselines"
+ },
+ "parameter_reduction": {
+ "type": "string",
+ "description": "Description of parameter reduction achieved"
+ }
+ },
+ "required": [
+ "main_findings",
+ "performance_improvements",
+ "parameter_reduction"
+ ],
+ "additionalProperties": false
+ },
+ "related_work": {
+ "type": "string",
+ "description": "Summary of related work and prior research"
+ },
+ "conclusion": {
+ "type": "string",
+ "description": "Conclusion section summarizing contributions and findings"
+ },
+ "limitations": {
+ "type": "string",
+ "description": "Limitations and challenges discussed in the paper"
+ },
+ "acknowledgments": {
+ "type": "string",
+ "description": "Acknowledgments section"
+ },
+ "references": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of cited references"
+ }
+ },
+ "additionalProperties": false,
+ "required": [
+ "title",
+ "authors",
+ "abstract",
+ "introduction",
+ "methodology",
+ "experiments",
+ "results",
+ "related_work",
+ "conclusion",
+ "limitations",
+ "acknowledgments",
+ "references"
+ ]
+ }
+ ```
+
+5. Immediately above the **Source** node, click **Test**.
+6. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears, which will show the output from the last node in the workflow.
+7. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks).
+8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+ the workflow designer so that you can continue designing things later as you see fit.
+
## Next steps
-Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing
+Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing
context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.
Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file.
diff --git a/snippets/general-shared-text/get-started-single-file-ui.mdx b/snippets/general-shared-text/get-started-single-file-ui.mdx
index 6ea82008..3c7c6f54 100644
--- a/snippets/general-shared-text/get-started-single-file-ui.mdx
+++ b/snippets/general-shared-text/get-started-single-file-ui.mdx
@@ -116,6 +116,7 @@ You can also do the following:
What's next?
-- [Learn how to add chunking, embeddings, and additional enrichments to your local file results](/ui/walkthrough-2).
+- [Learn how to extract structured data in a custom format from your local file](/ui/data-extractor#use-the-structured-data-extractor-from-the-start-page).
+- [Learn how to add chunking, embeddings, custom structured data extraction, and additional enrichments to your local file results](/ui/walkthrough-2).
- [Learn how to do large-scale batch processing of multiple files and semi-structured data that are stored in remote locations instead](/ui/quickstart#remote-quickstart).
- [Learn more about the Unstructured user interface](/ui/overview).
\ No newline at end of file
diff --git a/ui/data-extractor.mdx b/ui/data-extractor.mdx
new file mode 100644
index 00000000..313f970e
--- /dev/null
+++ b/ui/data-extractor.mdx
@@ -0,0 +1,777 @@
+---
+title: Structured data extraction
+---
+
+
+ To begin using the structured data extractor right away, skip ahead to the how-to [procedures](#using-the-structured-data-extractor).
+
+
+## Overview
+
+When Unstructured [partitions](/ui/partitioning) your source documents, the default result is a list of Unstructured
+[document elements](/ui/document-elements). These document elements are expressed in Unstructured's format, which includes elements such as
+`Title`, `NarrativeText`, `UncategorizedText`, `Table`, `Image`, `ListItem`, and so on. For example, you could have
+Unstructured ingest a stack of customer order forms in PDF format, where the PDF files' layout is identical, but the
+content differs per individual PDF by customer order number. For each PDF, Unstructured might output elements such as
+`ListItem` elements that contain details about the customer who placed the order, a `Table` element
+that contains the customer's order details, `NarrativeText` or `UncategorizedText` elements that contain special
+instructions for the order, and so on. You might then use custom logic that you write yourself to parse those elements further in an attempt to
+extract information that you're particularly interested in, such as customer IDs, item quantities, order totals, and so on.
+
+Unstructured's _structured data extractor_ simplifies this kind of scenario by allowing Unstructured to automatically extract the data from your source documents
+into a format that you define up front. For example, you could have Unstructured ingest that same stack of customer order form PDFs and
+then output a series of customer records, one record per order form. Each record could include data, with associated field labels, such as the customer's ID; a series of order line items with descriptions, quantities, and prices;
+the order's total amount; and any other available details that matter to you.
+This information is extracted in a consistent JSON format that is ready for you to use in your own applications.
+
+The following diagram provides a conceptual representation of structured data extraction, showing a flow of data from a patient information form into JSON output that is saved as a
+JSON file in some remote cloud file storage location. From there, you could, for example, run your own script to insert the JSON as a series of records into a database.
+
+
+
+
+
+To show how the structured data extractor works from a technical perspective, take a look at the following real estate listing PDF. This file is one of the
+sample files that are available directly from the **Start** page and the workflow editor's **Source** node in the Unstructured user interface (UI). The file's
+content is as follows:
+
+
+
+Without the structured data extractor, if you run a workflow that references this file, Unstructured extracts the listing's data in a default format similar to the following
+(note that the ellipses in this output indicate omitted fields for brevity):
+
+```json
+[
+ {
+ "type": "Title",
+ "element_id": "3f1ad705648037cf65e4d029d834a0de",
+ "text": "HOME FOR FUTURE",
+ "metadata": {
+ "...": "..."
+ }
+ },
+ {
+ "type": "NarrativeText",
+ "element_id": "320ca4f48e63d8bcfba56ec54c9be9af",
+ "text": "221 Queen Street, Melbourne VIC 3000",
+ "metadata": {
+ "...": "..."
+ }
+ },
+ {
+ "type": "NarrativeText",
+ "element_id": "05f648e815e73fe5140f203a62d8a3cc",
+ "text": "2,800 sq. ft living space",
+ "metadata": {
+ "...": "..."
+ }
+ },
+ {
+ "type": "NarrativeText",
+ "element_id": "27a9ded56b42f559999e48d1dcd76c9e",
+ "text": "Recently renovated kitchen",
+ "metadata": {
+ "...": "..."
+ }
+ },
+ {
+ "...": "..."
+ }
+]
+```
+
+In the preceding output, the `text` fields contain information about the listing, such as the street address,
+the square footage, one of the listing's features, and so on. However,
+you might want the information presented as `street_address`, `square_footage`, `features`, and so on.
+
+By using the structured data extractor in your Unstructured workflows, you could have Unstructured extract the listing's data in a custom-defined output format similar to the following (ellipses indicate omitted fields for brevity):
+
+```json
+[
+ {
+ "type": "DocumentData",
+ "element_id": "f2ee7334-c00a-4fc0-babc-2fcea28c1fb6",
+ "text": "",
+ "metadata": {
+ "...": "...",
+ "extracted_data": {
+ "street_address": "221 Queen Street, Melbourne VIC 3000",
+ "square_footage": 2800,
+ "price": 1000000,
+ "features": [
+ "Recently renovated kitchen",
+ "Smart home automation system",
+ "2-car garage with storage space",
+ "Spacious open-plan layout with natural lighting",
+ "Designer kitchen with quartz countertops and built-in appliances",
+ "Master suite with walk-in closet and en-suite bath",
+ "Covered patio and landscaped backyard garden"
+ ],
+ "agent_contact": {
+ "phone": "+01 555 123456"
+ }
+ }
+ }
+ },
+ {
+ "type": "Title",
+ "element_id": "3f1ad705648037cf65e4d029d834a0de",
+ "text": "HOME FOR FUTURE",
+ "metadata": {
+ "...": "..."
+ }
+ },
+ {
+ "...": "..."
+ }
+]
+```
+
+In the preceding output, the first document element, of type `DocumentData`, has an `extracted_data` field within `metadata`
+that contains a representation of the document's data in the custom output format that you specify. From the second document element onward,
+Unstructured also outputs the document's content as its usual series of document elements and metadata.
+
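+If you work with this output programmatically, the custom data is available as the first element's `metadata.extracted_data` object. The following
+is a minimal sketch in Python, assuming the output shown above has been saved locally as a hypothetical file named `realestate.pdf.json`:
+
+```python
+import json
+
+# Load the Unstructured JSON output (the filename is illustrative).
+with open("realestate.pdf.json", "r") as f:
+    elements = json.load(f)
+
+# The DocumentData element is the first element in the output, and the
+# extracted data appears under its metadata.
+extracted = elements[0]["metadata"]["extracted_data"]
+
+print(extracted["street_address"])  # 221 Queen Street, Melbourne VIC 3000
+print(extracted["square_footage"])  # 2800
+print(extracted["features"][0])     # Recently renovated kitchen
+```
+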
+To use the structured data extractor, you can provide Unstructured with an _extraction schema_, which defines the structure of the data for Unstructured to extract.
+Or you can specify an _extraction prompt_ that guides Unstructured on how to extract the data from the source documents, in the format that you want.
+
+An extraction prompt is like a prompt that you would give to a chatbot or AI agent. For this real estate listing example, the
+prompt might look like the following:
+
+```text
+Extract the following information from the listing, and present it in the following format:
+
+- street_address: The full street address of the property including street number, street name, city, state, and postal code.
+- square_footage: The total living space area of the property, in square feet.
+- price: The listed selling price of the property, in local currency.
+- features: A list of property features and highlights.
+- agent_contact: Contact information for the real estate agent.
+
+ - phone: The agent's contact phone number.
+```
+
+An extraction schema is a JSON-formatted schema that defines the structure of the data that Unstructured extracts. The schema must
+conform to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+which are a subset of the [JSON Schema](https://json-schema.org/docs) language.
+
+For this real estate listing example, the schema might look like the following:
+
+```json
+{
+ "type": "object",
+ "properties": {
+ "property_listing": {
+ "type": "object",
+ "properties": {
+ "street_address": {
+ "type": "string",
+ "description": "The full street address of the property including street number, street name, city, state, and postal code"
+ },
+ "square_footage": {
+ "type": "integer",
+ "description": "The total living space area of the property, in square feet"
+ },
+ "price": {
+ "type": "number",
+ "description": "The listed selling price of the property, in local currency"
+ },
+ "features": {
+ "type": "array",
+ "description": "A list of property features and highlights",
+ "items": {
+ "type": "string",
+ "description": "A single property feature or highlight"
+ }
+ },
+ "agent_contact": {
+ "type": "object",
+ "description": "Contact information for the real estate agent",
+ "properties": {
+ "phone": {
+ "type": "string",
+ "description": "The agent's contact phone number"
+ }
+ },
+ "required": ["phone"],
+ "additionalProperties": false
+ }
+ },
+ "required": ["street_address", "square_footage", "price", "features", "agent_contact"],
+ "additionalProperties": false
+ }
+ },
+ "required": ["property_listing"],
+ "additionalProperties": false
+}
+```
+
+You can also use a visual schema builder to define the schema, like this:
+
+
+
+## Using the structured data extractor
+
+There are two ways to use the [structured data extractor](#overview) in your Unstructured workflows:
+
+- From the **Start** page of your Unstructured account. This approach works
+ only with a single file that is stored on your local machine. [Learn how](#use-the-structured-data-extractor-from-the-start-page).
+- From the Unstructured workflow editor. This approach works with a single file that is stored on your local machine, or with any
+ number of files that are stored in remote locations. [Learn how](#use-the-structured-data-extractor-from-the-workflow-editor).
+
+### Use the structured data extractor from the Start page
+
+To have Unstructured [extract the data in a custom-defined format](#overview) for a single file that is stored on your local machine, do the following from the **Start** page:
+
+1. Sign in to your Unstructured account, if you are not already signed in.
+2. On the sidebar, click **Start**, if the **Start** page is not already showing.
+3. In the **Welcome, get started right away!** tile, do one of the following:
+
+ - To use a file on your local machine, click **Browse files** and then select the file, or drag and drop the file onto **Drop file to test**.
+
+
+ If you use a local file, the file must be 10 MB or less in size.
+
+
+   - To use a sample file provided by Unstructured, click one of the sample files that are shown, such as **realestate.pdf**.
+
+4. After Unstructured partitions the selected file into Unstructured's document element format, click **Update results** to
+ have Unstructured apply generative enrichments, such as [image descriptions](/ui/enriching/image-descriptions) and
+ [generative OCR](/ui/enriching/generative-ocr), to those document elements.
+5. In the title bar, next to **Transform**, click **Extract**.
+6. In the **Define Schema** pane, do one of the following to extract the data from the selected file by using a custom-defined format:
+
+   - To use a schema that Unstructured suggests after analyzing the selected file, click **Run Schema**.
+ - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+ click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; click **Use this Schema**; and then click **Run Schema**.
+ [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+   - To use a visual editor to define the schema, click the ellipsis (three dots) icon; click **Reset form**; enter your own custom schema objects and their properties;
+ and then click **Run Schema**. [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+   - To use a plain language prompt to guide Unstructured on how to extract the data, click **Suggest**; enter your prompt in the
+ dialog; click **Generate schema**; make any changes to the suggested schema as needed; and then click **Run Schema**.
+
+7. The extracted data appears in the **Extract results** pane. You can do one of the following:
+
+   - To view a human-readable representation of the extracted data, click **Formatted**.
+ - To view the JSON representation of the extracted data, click **JSON**.
+ - To download the JSON representation of the extracted data as a local JSON file, click the download icon next to **Formatted** and **JSON**.
+   - To change the schema and then re-run the extraction, click the back arrow next to **Extract Results**, and then return to step 6 of this procedure.
+
+### Use the structured data extractor from the workflow editor
+
+To have Unstructured [extract the data in a custom-defined format](#overview) for a single file that is stored on your local machine, or for any
+number of files that are stored in remote locations, do the following from the workflow editor:
+
+1. If you already have an Unstructured workflow that you want to use, open it to show the workflow editor. Otherwise, create a new
+ workflow as follows:
+
+ a. Sign in to your Unstructured account, if you are not already signed in.
+ b. On the sidebar, click **Workflows**.
+ c. Click **New Workflow +**.
+ d. With **Build it Myself** already selected, click **Continue**. The workflow editor appears.
+
+2. Add an **Extract** node to your existing Unstructured workflow. This node must be added right before the workflow's **Destination** node.
+ To add this node, in the workflow designer, click the **+** (add node) button immediately before the **Destination** node, and then click **Enrich > Extract**.
+3. Click the newly added **Extract** node to select it.
+4. In the node's settings pane, on the **Details** tab, under **Provider**, select the provider for the model that you want Unstructured to use to do the extraction. Then, under **Model**, select the model.
+5. To specify the custom schema for Unstructured to use to do the extraction, do one of the following:
+
+ - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+ click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; and then click **Use this Schema**.
+ [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+ - To use a visual editor to define the schema, enter your own custom schema objects and their properties. To clear the current schema and start over,
+     click the ellipsis (three dots) icon, and then click **Reset form**.
+ [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+
+6. Continue building your workflow as desired.
+7. To see the results of the structured data extractor, do one of the following:
+
+ - If you have already selected a local file as input to your workflow, click **Test** immediately above the **Source** node. The results will be displayed on-screen
+ in the **Test output** pane.
+ - If you are using source and destination connectors for your workflow, [run the workflow as a job](/ui/jobs#run-a-job),
+ [monitor the job](/ui/jobs#monitor-a-job), and then examine the job's results in your destination location.
+
+## Limitations
+
+The structured data extractor is not guaranteed to work with the [Pinecone destination connector](/ui/destinations/pinecone).
+This is because Pinecone has strict limits on the amount of metadata that it can manage, and the amount of metadata
+that the structured data extractor typically produces exceeds those limits.
+
+## Saving the extracted data separately
+
+Unstructured does not recommend that you save `DocumentData` elements as rows or entries within a traditional SQL-style destination database or vector store, for the following reasons:
+
+- Saving a mixture of `DocumentData` elements and default Unstructured elements such as `Title`, `NarrativeText`, and `Table` elements and
+ so on in the same table, collection, or index might cause unexpected performance issues or might return less useful search and query results.
+- The `DocumentData` elements' `extracted_data` contents can get quite large and complex, exceeding the column or field limits of some SQL-style databases or vector stores.
+
+Instead, you should save the JSON containing the `DocumentData` elements that Unstructured outputs into a blob storage,
+file storage, or NoSQL database destination location. From there, you could use an approach such as the following to extract the
+`extracted_data` contents from the JSON and save them into a SQL-style database or vector store.
+
+To save the contents of the `extracted_data` field separately from the rest of Unstructured's JSON output, you
+could, for example, use a Python script such as the following. This script works with one or more Unstructured JSON output files that you already have stored
+on the same machine as this script. Before you run this script, do the following:
+
+- To process all Unstructured JSON files within a directory, change `None` for `input_dir` to a string that contains the path to the directory. This can be a relative or absolute path.
+- To process specific Unstructured JSON files within a directory or across multiple directories, change `None` for `input_files` to a string that contains a comma-separated list of filepaths on your local machine, for example `"./input/2507.13305v1.pdf.json,./input2/table-multi-row-column-cells.pdf.json"`. These filepaths can be relative or absolute.
+
+
+  If `input_dir` and `input_files` are both set to something other than `None`, then the `input_dir` setting takes precedence, and the `input_files` setting is ignored.
+
+
+- For `output_dir`, specify a string that contains the path to the directory on your local machine where you want to save the `extracted_data` JSON. If the specified directory does not exist, the script creates it for you. This path can be relative or absolute.
+
+```python
+import asyncio
+import os
+import json
+
+async def process_file_and_save_result(input_filename, output_dir):
+ with open(input_filename, "r") as f:
+ input_data = json.load(f)
+
+ if input_data[0].get("type") == "DocumentData":
+ if "extracted_data" in input_data[0]["metadata"]:
+ extracted_data = input_data[0]["metadata"]["extracted_data"]
+
+ results_name = f"{os.path.basename(input_filename)}"
+ output_filename = os.path.join(output_dir, results_name)
+
+ try:
+ with open(output_filename, "w") as f:
+ json.dump(extracted_data, f)
+ print(f"Successfully wrote 'metadata.extracted_data' to '{output_filename}'.")
+ except Exception as e:
+ print(f"Error: Failed to write 'metadata.extracted_data' to '{output_filename}'.")
+ else:
+ print(f"Error: Cannot find 'metadata.extracted_data' field in '{input_filename}'.")
+ else:
+ print(f"Error: The first element in '{input_filename}' does not have 'type' set to 'DocumentData'.")
+
+
+def load_filenames_in_directory(input_dir):
+ filenames = []
+ for root, _, files in os.walk(input_dir):
+ for file in files:
+ if file.endswith('.json'):
+ filenames.append(os.path.join(root, file))
+ print(f"Found JSON file '{file}'.")
+ else:
+ print(f"Error: '{file}' is not a JSON file.")
+
+ return filenames
+
+async def process_files():
+ # Initialize with either a directory name, to process everything in the dir,
+ # or a comma-separated list of filepaths.
+ input_dir = None # "path/to/input/directory"
+ input_files = None # "path/to/file,path/to/file,path/to/file"
+
+ # Set to the directory for output json files. This dir
+ # will be created if needed.
+ output_dir = "./extracted_data/"
+
+ if input_dir:
+ filenames = load_filenames_in_directory(input_dir)
+ else:
+ filenames = input_files.split(",")
+
+ os.makedirs(output_dir, exist_ok=True)
+
+ tasks = []
+ for filename in filenames:
+ tasks.append(
+ process_file_and_save_result(filename, output_dir)
+ )
+
+ await asyncio.gather(*tasks)
+
+if __name__ == "__main__":
+ asyncio.run(process_files())
+```
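+
+From there, you could load each saved `extracted_data` file into your SQL-style database or vector store by using whatever client that destination provides.
+As one minimal sketch, the following Python code uses the standard library's `sqlite3` module and a hypothetical `listings` table, and it assumes that each
+saved JSON file uses the flat real estate layout shown in the overview's example output. Adjust the table and columns to match your own extraction schema.
+
+```python
+import json
+import os
+import sqlite3
+
+extracted_dir = "./extracted_data/"  # The output_dir used by the preceding script.
+
+# Hypothetical local SQLite database and table; swap in your own destination client.
+conn = sqlite3.connect("listings.db")
+conn.execute(
+    "CREATE TABLE IF NOT EXISTS listings ("
+    "street_address TEXT, square_footage INTEGER, price REAL, features TEXT)"
+)
+
+for filename in os.listdir(extracted_dir):
+    if not filename.endswith(".json"):
+        continue
+    with open(os.path.join(extracted_dir, filename), "r") as f:
+        data = json.load(f)
+    conn.execute(
+        "INSERT INTO listings VALUES (?, ?, ?, ?)",
+        (
+            data.get("street_address"),
+            data.get("square_footage"),
+            data.get("price"),
+            json.dumps(data.get("features", [])),  # Store the list as a JSON string.
+        ),
+    )
+
+conn.commit()
+conn.close()
+```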
+
+## Additional examples
+
+In addition to the preceding real estate listing example, here are some more examples that you can adapt for your own use.
+
+### Caring for houseplants
+
+Using the following image file ([download this file](https://raw.githubusercontent.com/Unstructured-IO/docs/main/img/ui/data-extractor/house-plant-care.png)):
+
+
+
+An extraction schema for this file might look like the following:
+
+```json
+{
+ "type": "object",
+ "properties": {
+ "plants": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "The name of the plant"
+ },
+ "sunlight": {
+ "type": "string",
+ "description": "The sunlight requirements for the plant (for example: 'Direct', 'Bright Indirect - Some direct')."
+ },
+ "water": {
+ "type": "string",
+ "description": "The watering instructions for the plant (for example: 'Let dry between thorough watering', 'Water when 50-60% dry')."
+ },
+ "humidity": {
+ "type": "string",
+            "description": "The humidity requirements for the plant (for example: 'Low', 'Medium', 'High')."
+ }
+ },
+ "required": ["name", "sunlight", "water", "humidity"],
+ "additionalProperties": false
+ }
+ }
+ },
+ "required": ["plants"],
+ "additionalProperties": false
+}
+```
+
+An extraction guidance prompt for this file might look like the following:
+
+
+ Providing an extraction guidance prompt is available only from the **Start** page.
+ The workflow editor does not offer an extraction guidance prompt—you must provide an
+ extraction schema instead.
+
+
+```text
+Extract the plant information for each of the plants in this document, and present it in the following format:
+
+- plants: A list of plants.
+
+ - name: The name of the plant.
+ - sunlight: The sunlight requirements for the plant (for example: 'Direct', 'Bright Indirect - Some direct').
+ - water: The watering instructions for the plant (for example: 'Let dry between thorough watering', 'Water when 50-60% dry').
+ - humidity: The humidity requirements for the plant (for example: 'Low', 'Medium', 'High').
+```
+
+And Unstructured's output would look like the following:
+
+```json
+[
+ {
+ "type": "DocumentData",
+ "element_id": "3be179f1-e1e5-4dde-a66b-9c370b6d23e8",
+ "text": "",
+ "metadata": {
+ "...": "...",
+ "extracted_data": {
+ "plants": [
+ {
+ "name": "Krimson Queen",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Let dry between thorough watering",
+ "humidity": "Low"
+ },
+ {
+ "name": "Chinese Money Plant",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Let dry between thorough watering",
+ "humidity": "Low - Medium"
+ },
+ {
+ "name": "String of Hearts",
+ "sunlight": "Direct - Bright Indirect",
+ "water": "Let dry between thorough watering",
+ "humidity": "Low"
+ },
+ {
+ "name": "Marble Queen",
+ "sunlight": "Low- High Indirect",
+ "water": "Water when 50 - 80% dry",
+ "humidity": "Low - Medium"
+ },
+ {
+ "name": "Sansevieria Whitney",
+ "sunlight": "Direct - Low Direct",
+ "water": "Let dry between thorough watering",
+ "humidity": "Low"
+ },
+ {
+ "name": "Prayer Plant",
+ "sunlight": "Medium - Bright Indirect",
+ "water": "Keep soil moist",
+ "humidity": "Medium - High"
+ },
+ {
+ "name": "Aloe Vera",
+ "sunlight": "Direct - Bright Indirect",
+ "water": "Water when dry",
+ "humidity": "Low"
+ },
+ {
+ "name": "Philodendron Brasil",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Water when 80% dry",
+ "humidity": "Low - Medium"
+ },
+ {
+ "name": "Pink Princess",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Water when 50 - 80% dry",
+ "humidity": "Medium"
+ },
+ {
+ "name": "Stromanthe Triostar",
+ "sunlight": "Bright Indirect",
+ "water": "Keep soil moist",
+ "humidity": "Medium - High"
+ },
+ {
+ "name": "Rubber Plant",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Let dry between thorough watering",
+ "humidity": "Low - Medium"
+ },
+ {
+ "name": "Monstera Deliciosa",
+ "sunlight": "Bright Indirect - Some direct",
+ "water": "Water when 80% dry",
+ "humidity": "Low - Medium"
+ }
+ ]
+ }
+ }
+ },
+ {
+ "...": "..."
+ }
+]
+```
+
+### Medical invoicing
+
+Using the following PDF file ([download this file](https://raw.githubusercontent.com/Unstructured-IO/docs/main/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf)):
+
+
+
+An extraction schema for this file might look like the following:
+
+```json
+{
+ "type": "object",
+ "properties": {
+ "patient": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "Full name of the patient."
+ },
+ "birth_date": {
+ "type": "string",
+ "description": "Patient's date of birth."
+ },
+ "sex": {
+ "type": "string",
+ "enum": ["M", "F", "Other"],
+ "description": "Patient's biological sex."
+ }
+ },
+ "required": ["name", "birth_date", "sex"],
+ "additionalProperties": false
+ },
+ "medical_summary": {
+ "type": "object",
+ "properties": {
+ "prior_procedures": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "procedure": {
+ "type": "string",
+ "description": "Name or type of the medical procedure."
+ },
+ "date": {
+ "type": "string",
+ "description": "Date when the procedure was performed."
+ },
+ "levels": {
+ "type": "string",
+ "description": "Anatomical levels or location of the procedure."
+ }
+ },
+ "required": ["procedure", "date", "levels"],
+ "additionalProperties": false
+ },
+ "description": "List of prior medical procedures."
+ },
+ "diagnoses": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of medical diagnoses."
+ },
+ "comorbidities": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of comorbid conditions."
+ }
+ },
+ "required": ["prior_procedures", "diagnoses", "comorbidities"],
+ "additionalProperties": false
+ }
+ },
+ "required": ["patient", "medical_summary"],
+ "additionalProperties": false
+}
+```
+
+An extraction guidance prompt for this file might look like the following:
+
+
+ Providing an extraction guidance prompt is available only from the **Start** page.
+ The workflow editor does not offer an extraction guidance prompt—you must provide an
+ extraction schema instead.
+
+
+```text
+Extract the medical information from this record, and present it in the following format:
+
+- patient
+
+ - name: Full name of the patient.
+ - birth_date: Patient's date of birth.
+ - sex: Patient's biological sex.
+
+- medical_summary
+
+ - prior_procedures
+
+ - procedure: Name or type of the medical procedure.
+ - date: Date when the procedure was performed.
+ - levels: Anatomical levels or location of the procedure.
+
+ - diagnoses: List of medical diagnoses.
+ - comorbidities: List of comorbid conditions.
+
+Additional extraction guidance:
+
+- name: Extract the full legal name as it appears in the document. Use proper capitalization (for example: "Marissa K. Donovan").
+- birth_date: Convert to the format "MM/DD/YYYY" (for example: "03/28/1976").
+
+  - Accept variations: MM/DD/YYYY, MM-DD-YYYY, YYYY-MM-DD, Month DD, YYYY.
+  - If only age is given, do not infer the birth date - mark it as null.
+
+- sex: Extract biological sex as "M" (Male), "F" (Female), or "Other".
+
+  - Map variations: Male/Man → "M", Female/Woman → "F", all others → "Other".
+
+- prior_procedures:
+
+ Extract all surgical and major medical procedures, including:
+
+ - procedure: Use standard medical terminology when possible.
+ - date: Format as "MM/DD/YYYY". If only year/month available, use "01" for missing day.
+ - levels: Include anatomical locations, vertebral levels, or affected areas.
+
+ - For spine procedures: Use format like "L4 to L5" or "L4-L5".
+ - Include laterality when specified (left, right, bilateral).
+
+- diagnoses:
+
+ Extract all current and historical diagnoses:
+
+ - Include both primary and secondary diagnoses.
+ - Preserve medical terminology and ICD-10 descriptions if provided.
+ - Include location/region specifications (for example: "radiculopathy — lumbar region").
+ - Do not include procedure names unless they represent a diagnostic condition.
+
+- comorbidities:
+
+ Extract all coexisting medical conditions that may impact treatment:
+
+ - Include chronic conditions (for example: "diabetes", "hypertension").
+ - Include relevant surgical history that affects current state (for example: Failed Fusion, Multi-Level Fusion).
+ - Include structural abnormalities (for example: Spondylolisthesis, Stenosis).
+ - Do not duplicate items already listed in primary diagnoses.
+
+Data quality rules:
+
+1. Completeness: Only include fields where data is explicitly stated or clearly indicated.
+2. No inference: Do not infer or assume information not present in the source.
+3. Preserve specificity: Maintain medical terminology and specificity from source.
+4. Handle missing data: Return empty arrays [] for sections with no data, never null.
+5. Date validation: Ensure all dates are realistic and properly formatted.
+6. Deduplication: Avoid listing the same condition in multiple sections.
+
+Common variations to handle:
+
+- Operative reports: Focus on procedure details, dates, and levels.
+- H&P (history & physical): Rich source for all sections.
+- Progress notes: May contain updates to diagnoses and new procedures.
+- Discharge summaries: Comprehensive source for all data points.
+- Consultation notes: Often contain detailed comorbidity lists.
+- Spinal levels: C1-C7 (Cervical), T1-T12 (Thoracic), L1-L5 (Lumbar), S1-S5 (Sacral).
+- Use "fusion surgery" not "fusion" alone when referring to procedures.
+- Preserve specificity: "Type 2 Diabetes" not just "Diabetes" when specified.
+- Multiple procedures on the same date: List as separate objects in the array.
+- Revised procedures: Include both original and revision as separate entries.
+- Bilateral procedures: Note as single procedure with "bilateral" in levels.
+- Uncertain dates: If a date is approximate (for example, "Spring 2023"), use "04/01/2023" for Spring, "07/01/2023" for Summer, and so on.
+- Name variations: Use the most complete version found in the document.
+- Conflicting information: Use the most recent or most authoritative source.
+
+Output validation:
+
+Before returning the extraction:
+
+1. Verify all required fields are present.
+2. Check date formats are consistent.
+3. Ensure no duplicate entries within arrays.
+4. Confirm sex field contains only "M", "F", or "Other".
+5. Validate that procedures have all three required fields.
+6. Ensure diagnoses and comorbidities are non-overlapping.
+```
+
+And Unstructured's output would look like the following:
+
+```json
+[
+ {
+ "type": "DocumentData",
+ "element_id": "e8f09cb1-1439-4e89-af18-b6285aef5d37",
+ "text": "",
+ "metadata": {
+ "...": "...",
+ "extracted_data": {
+ "patient": {
+ "name": "Ms. Daovan",
+ "birth_date": "01/01/1974",
+ "sex": "F"
+ },
+ "medical_summary": {
+ "prior_procedures": [],
+ "diagnoses": [
+ "Radiculopathy — lumbar region"
+ ],
+ "comorbidities": [
+ "Diabetes",
+ "Multi-Level Fusion",
+ "Failed Fusion",
+ "Spondylolisthesis"
+ ]
+ }
+ }
+ }
+ },
+ {
+ "...": "..."
+ }
+]
+```
\ No newline at end of file
diff --git a/ui/walkthrough.mdx b/ui/walkthrough.mdx
index 5a058735..c0186d8b 100644
--- a/ui/walkthrough.mdx
+++ b/ui/walkthrough.mdx
@@ -4,7 +4,7 @@ sidebarTitle: Walkthrough
---
This walkthrough provides you with deep, hands-on experience with the [Unstructured user interface (UI)](/ui/overview). As you follow along, you will learn how to use many of Unstructured's
-features for [partitioning](/ui/partitioning), [enriching](/ui/enriching/overview), [chunking](/ui/chunking), and [embedding](/ui/embedding). These features are optimized for turning
+features for [partitioning](/ui/partitioning), [enriching](/ui/enriching/overview), [chunking](/ui/chunking), [embedding](/ui/embedding), and [structured data extraction](/ui/data-extractor). These features are optimized for turning
your source documents and data into information that is well-tuned for
[retrieval-augmented generation (RAG)](https://unstructured.io/blog/rag-whitepaper),
[agentic AI](https://unstructured.io/problems-we-solve#powering-agentic-ai),
@@ -539,9 +539,264 @@ embedding model that is provided by an embedding provider. For the best embeddin
6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
the workflow designer so that you can continue designing things later as you see fit.
+## Step 7: Experiment with structured data extraction
+
+In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process where Unstructured
+automatically extracts the data from your source documents into a format that you define up front. For example, in addition to Unstructured
+partitioning your source documents into elements with types such as `NarrativeText`, `UncategorizedText`, and so on, you can have Unstructured
+output key information from the source documents in a custom structured data format, appearing within a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, `email`, and so on.
+
+1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.
+
+ 
+
+2. In the node's settings pane, on the **Details** tab, under **Provider**, select **Anthropic**. Under **Model**, select **Claude Sonnet 4.5**. This is the model that Unstructured will use to do the structured data extraction.
+
+
+ The list of available models for structured data extraction is constantly being updated. Your list might also be different, depending on your Unstructured
+   account type. If **Anthropic** and **Claude Sonnet 4.5** are not available, choose another available model from the list.
+
+ If you have an Unstructured **Business** account and want to add more models to this list, contact your
+ Unstructured account administrator or Unstructured sales representative, or email Unstructured Support at
+ [support@unstructured.io](mailto:support@unstructured.io).
+
+
+3. Click **Upload JSON**.
+4. In the **JSON Schema** box, enter the following JSON schema, and then click **Use this Schema**:
+
+ ```json
+ {
+ "type": "object",
+ "properties": {
+ "title": {
+ "type": "string",
+ "description": "Full title of the research paper"
+ },
+ "authors": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "Author's full name"
+ },
+ "affiliation": {
+ "type": "string",
+ "description": "Author's institutional affiliation"
+ },
+ "email": {
+ "type": "string",
+ "description": "Author's email address"
+ }
+ },
+ "required": [
+ "name",
+ "affiliation",
+ "email"
+ ],
+ "additionalProperties": false
+ },
+ "description": "List of paper authors with their affiliations"
+ },
+ "abstract": {
+ "type": "string",
+ "description": "Paper abstract summarizing the research"
+ },
+ "introduction": {
+ "type": "string",
+ "description": "Introduction section describing the problem and motivation"
+ },
+ "methodology": {
+ "type": "object",
+ "properties": {
+ "approach_name": {
+ "type": "string",
+ "description": "Name of the proposed method (e.g., StrokeNet)"
+ },
+ "description": {
+ "type": "string",
+ "description": "Detailed description of the methodology"
+ },
+ "key_techniques": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of key techniques used in the approach"
+ }
+ },
+ "required": [
+ "approach_name",
+ "description",
+ "key_techniques"
+ ],
+ "additionalProperties": false
+ },
+ "experiments": {
+ "type": "object",
+ "properties": {
+ "datasets": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "description": "Dataset name"
+ },
+ "description": {
+ "type": "string",
+ "description": "Dataset description"
+ },
+ "size": {
+ "type": "string",
+ "description": "Dataset size (e.g., number of sentence pairs)"
+ }
+ },
+ "required": [
+ "name",
+ "description",
+ "size"
+ ],
+ "additionalProperties": false
+ },
+ "description": "Datasets used for evaluation"
+ },
+ "baselines": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "Baseline methods compared against"
+ },
+ "evaluation_metrics": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "Metrics used for evaluation"
+ },
+ "experimental_setup": {
+ "type": "string",
+ "description": "Description of experimental configuration and hyperparameters"
+ }
+ },
+ "required": [
+ "datasets",
+ "baselines",
+ "evaluation_metrics",
+ "experimental_setup"
+ ],
+ "additionalProperties": false
+ },
+ "results": {
+ "type": "object",
+ "properties": {
+ "main_findings": {
+ "type": "string",
+ "description": "Summary of main experimental findings"
+ },
+ "performance_improvements": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "dataset": {
+ "type": "string",
+ "description": "Dataset name"
+ },
+ "metric": {
+ "type": "string",
+ "description": "Evaluation metric (e.g., BLEU)"
+ },
+ "baseline_score": {
+ "type": "number",
+ "description": "Baseline method score"
+ },
+ "proposed_score": {
+ "type": "number",
+ "description": "Proposed method score"
+ },
+ "improvement": {
+ "type": "number",
+ "description": "Improvement over baseline"
+ }
+ },
+ "required": [
+ "dataset",
+ "metric",
+ "baseline_score",
+ "proposed_score",
+ "improvement"
+ ],
+ "additionalProperties": false
+ },
+ "description": "Performance improvements over baselines"
+ },
+ "parameter_reduction": {
+ "type": "string",
+ "description": "Description of parameter reduction achieved"
+ }
+ },
+ "required": [
+ "main_findings",
+ "performance_improvements",
+ "parameter_reduction"
+ ],
+ "additionalProperties": false
+ },
+ "related_work": {
+ "type": "string",
+ "description": "Summary of related work and prior research"
+ },
+ "conclusion": {
+ "type": "string",
+ "description": "Conclusion section summarizing contributions and findings"
+ },
+ "limitations": {
+ "type": "string",
+ "description": "Limitations and challenges discussed in the paper"
+ },
+ "acknowledgments": {
+ "type": "string",
+ "description": "Acknowledgments section"
+ },
+ "references": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ },
+ "description": "List of cited references"
+ }
+ },
+ "additionalProperties": false,
+ "required": [
+ "title",
+ "authors",
+ "abstract",
+ "introduction",
+ "methodology",
+ "experiments",
+ "results",
+ "related_work",
+ "conclusion",
+ "limitations",
+ "acknowledgments",
+ "references"
+ ]
+ }
+ ```
+
+5. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**.
+6. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears, which will show the output from the last node in the workflow.
+7. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks).
+8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+ the workflow designer so that you can continue designing things later as you see fit.
+
## Next steps
-Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing
+Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing
context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.
Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file.
diff --git a/ui/workflows.mdx b/ui/workflows.mdx
index d91d53d6..1eb6f931 100644
--- a/ui/workflows.mdx
+++ b/ui/workflows.mdx
@@ -178,6 +178,26 @@ If you did not previously set the workflow to run on a schedule, you can [run th
flowchart LR
Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Destination
```
+ ```mermaid
+ flowchart LR
+ Source-->Partitioner-->Extract-->Destination
+ ```
+ ```mermaid
+ flowchart LR
+ Source-->Partitioner-->Chunker-->Extract-->Destination
+ ```
+ ```mermaid
+ flowchart LR
+ Source-->Partitioner-->Chunker-->Embedder-->Extract-->Destination
+ ```
+ ```mermaid
+ flowchart LR
+ Source-->Partitioner-->Enrichment-->Chunker-->Extract-->Destination
+ ```
+ ```mermaid
+ flowchart LR
+ Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Extract-->Destination
+ ```
For workflows that use **Chunker** and enrichment nodes together, the **Chunker** node should be placed after all enrichment nodes. Placing the
@@ -382,6 +402,18 @@ import DeprecatedModelsUI from '/snippets/general-shared-text/deprecated-models-
- [Embedding overview](/ui/embedding)
- [Understanding embedding models: make an informed choice for your RAG](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag).
+
+ Do one of the following to define the custom schema for the structured data that you want to extract:
+
+ - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+ click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; and then click **Use this Schema**.
+ [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+ - To use a visual editor to define the schema, enter your own custom schema objects and their properties. To clear the current schema and start over,
+    click the ellipsis (three dots) icon, and then click **Reset form**.
+ [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+
+ [Learn more](/ui/data-extractor).
+
## Edit, delete, or run a workflow