{"id":"01KFFH6ETXGRVD10WPNP3007D6","cid":"bafkreihz4lifkm67ylk4j27g6x3x55ljhiexx7mbipojmn2im3p7cyviiq","type":"agent","properties":{"_profile_version":"v1","actions_required":["file:view","file:create","file:update","entity:view","entity:create","entity:update"],"description":"Processes PDFs: detects type (born-digital vs scanned), extracts text and images for born-digital, renders pages to JPEG for scanned","endpoint":"https://pdf-processor.arke.institute","endpoint_verified_at":"2026-01-21T05:42:22.654Z","input_schema":{"properties":{"entity_id":{"description":"Source PDF file entity to process","type":"string"},"options":{"description":"Processing options","properties":{"dpi":{"description":"Resolution in DPI (default: 300)","type":"number"},"extract_images":{"description":"Extract embedded images from born-digital PDFs (default: true)","type":"boolean"},"extraction_mode":{"description":"Processing mode (default: auto)","enum":["auto","born_digital","scanned"],"type":"string"},"image_min_size":{"description":"Minimum image dimension to extract (default: 100)","type":"number"},"quality":{"description":"JPEG quality 1-100 (default: 85)","type":"number"}},"type":"object"}},"required":["entity_id"],"type":"object"},"label":"PDF to JPEG Processor","output_description":"For every page in the source PDF, the processor creates a JPEG file entity representing that page. First, the PDF is classified as 'born_digital' or 'scanned' using a 3-tier detection system: producer/creator metadata, page structure analysis (full-page images vs vector text), and text rendering mode (invisible OCR layer detection). If detection is inconclusive, it defaults to 'scanned'. Every page is then rendered to JPEG via Ghostscript at the configured DPI (default 300) and quality (default 85), capped to a maximum dimension of 2400px. Each resulting JPEG is uploaded as a new file entity with properties including 'page_number', 'width', 'height', and 'pdf_type'. For born-digital PDFs, native text is extracted per page and stored directly on the page entity in a 'text' property, along with 'text_source' set to 'born_digital', 'text_extracted_at', 'text_extracted_by', and 'text_has_content'. Scanned pages have 'text_source' set to null, meaning downstream OCR is needed. For born-digital PDFs, embedded images (figures, diagrams, photos) are also extracted and uploaded as separate JPEG file entities, each with properties 'extraction_source', 'source_page_number', 'source_image_index', 'extracted_by', and 'extracted_at'. Small images below the minimum size threshold (default 100px) and full-page background images on text-heavy pages are filtered out.","output_relationships":["Each page JPEG entity has a 'derived_from' relationship pointing to the source PDF entity","The source PDF entity has 'has_derivative' relationships pointing to all page JPEG entities","Page entities are linked sequentially with 'prev' and 'next' relationships (page 1 -> next -> page 2, page 2 -> prev -> page 1, etc.)","For born-digital PDFs: each extracted image entity has an 'extracted_from' relationship pointing to its source page entity","For born-digital PDFs: each page entity has 'has_derivative' relationships pointing to any images extracted from that page","To traverse: start from the source PDF, follow 'has_derivative' to find all page entities, then read 'page_number' to order them. Follow 'next'/'prev' to walk the page sequence. For born-digital pages, follow 'has_derivative' from a page to find its extracted images."],"output_tree_example":"source_pdf 'research-paper.pdf' (5 pages, born_digital)\n├── page_jpeg 'research-paper_page_0001.jpg' (page_number: 1, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Title Page\\nAuthors...', text_source: 'born_digital')\n├── page_jpeg 'research-paper_page_0002.jpg' (page_number: 2, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Abstract\\nThis paper...', text_source: 'born_digital')\n│   └── extracted_image 'research-paper_image_p2_i1.jpg' (source_page_number: 2, source_image_index: 1, extraction_source: 'born_digital')\n├── page_jpeg 'research-paper_page_0003.jpg' (page_number: 3, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 1\\nIntroduction...', text_source: 'born_digital')\n├── page_jpeg 'research-paper_page_0004.jpg' (page_number: 4, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 2\\nMethods...', text_source: 'born_digital')\n│   ├── extracted_image 'research-paper_image_p4_i1.jpg' (source_page_number: 4, source_image_index: 1, extraction_source: 'born_digital')\n│   └── extracted_image 'research-paper_image_p4_i2.jpg' (source_page_number: 4, source_image_index: 2, extraction_source: 'born_digital')\n└── page_jpeg 'research-paper_page_0005.jpg' (page_number: 5, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'References\\n1. Smith...', text_source: 'born_digital')","status":"active"},"relationships":[{"peer":"01KFF0H1KSR4SHHDQ7T2HXQEK6","peer_type":"collection","predicate":"collection"}],"ver":7,"created_at":"2026-01-21T05:42:18.415Z","ts":"2026-01-30T02:42:31.021Z","edited_by":{"method":"manual","user_id":"01KDZS52M5F9XS0ZPZQQXGPC9A"}}