{"id":"01KFFC4ZBD52SY7E4BX6XV8623","cid":"bafkreigieo6qbzb43qh7qlq67ylmepzz6kzoiv65lxy3iccobt6pn3xeym","type":"agent","properties":{"_profile_version":"v1","actions_required":["entity:view","entity:update","file:view","file:update","file:create","relationship:create"],"description":"Extracts text and images from JPEG files using Mistral OCR","endpoint":"https://ocr-service.arke.institute","endpoint_verified_at":"2026-01-21T04:14:11.183Z","input_schema":{"properties":{"entity_id":{"description":"File entity ID (JPEG) to process","type":"string"},"options":{"description":"Agent-specific options","properties":{},"type":"object"}},"required":["entity_id"],"type":"object"},"label":"OCR Service","output_description":"The OCR service writes extracted text directly back onto the input JPEG file entity. After processing, the source entity's 'text' property contains the full OCR output as markdown. If the page contained embedded images (figures, charts, tables rendered as images), those are extracted as new JPEG file entities and uploaded with their binary content. The markdown text uses arke: URIs to reference these extracted images inline (e.g., '![img-0.jpeg](arke:II...)'), so the text and its images stay linked. The source entity also receives metadata properties: 'text_source' is set to 'ocr', 'text_extracted_at' records the timestamp, 'text_has_content' indicates whether any non-whitespace text was found, 'text_images_count' records how many embedded images were detected, and 'ocr_model' records the model used (mistral-ocr-latest). If the source entity already has text from born-digital extraction (text_source = 'born_digital'), OCR is skipped unless force_ocr is set. Each extracted image entity gets properties recording its extraction origin: 'extraction_source', 'source_bbox' with bounding box coordinates, 'extracted_by', and 'extracted_at'.","output_relationships":["source entity --[has_extracted]--> extracted image: follow 'has_extracted' from the input JPEG to find all images that were pulled out of it during OCR","extracted image --[extracted_from]--> source entity: follow 'extracted_from' from any extracted image back to the page it came from"],"output_tree_example":"source-page.jpeg  (input entity, updated in place)\n├── properties.text = \"# Chapter 1\\n\\nThe quick brown fox...\\n\\n![img-0.jpeg](arke:IIxyz123)\\n\\nMore text...\"\n├── properties.text_source = \"ocr\"\n├── properties.text_has_content = true\n├── properties.text_images_count = 1\n├── properties.ocr_model = \"mistral-ocr-latest\"\n│\n└── [has_extracted] ──► source-page_img-0.jpeg  (new file entity)\n                        ├── properties.extraction_source = \"ocr\"\n                        ├── properties.source_bbox = { x1, y1, x2, y2 }\n                        ├── properties.extracted_by = \"ocr-service\"\n                        └── [extracted_from] ──► source-page.jpeg","status":"active"},"relationships":[{"peer":"01KFF0H1KSR4SHHDQ7T2HXQEK6","peer_type":"collection","predicate":"collection"}],"ver":6,"created_at":"2026-01-21T04:14:06.885Z","ts":"2026-01-30T02:42:30.394Z","edited_by":{"method":"manual","user_id":"01KDZS52M5F9XS0ZPZQQXGPC9A"}}