tesseract-006

Tests that cx:tesseract produces output in the wordstrbox format.

Test is expected to pass.

The pipeline

<p:declare-step xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:t="http://xproc.org/ns/testsuite/3.0" name="main" version="3.0">
   <p:import href="https://xmlcalabash.com/ext/library/pdf-steps.xpl"/>
   <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>
   <p:output port="result"/>
   <cx:pdf-to-images dpi="300">
      <p:with-input port="source"
                    href="../documents/example.pdf"/>
   </cx:pdf-to-images>
   <cx:tesseract language="eng"
                 output-format="wordstrbox" debug-output="/dev/null"/>
   <p:wrap-sequence wrapper="text"/>
</p:declare-step>

Result

<text xmlns:t="http://xproc.org/ns/testsuite/3.0">WordStr 200 3206 586 3270 0 #PDF Text 
	 587 3206 591 3270 0
WordStr 191 3076 842 3120 0 #This is a sample PDF document. 
	 843 3076 847 3120 0
WordStr 206 2451 672 2929 0 #  
	 673 2451 677 2929 0
WordStr 190 2250 495 2295 0 #With an image. 
	 496 2250 500 2295 0
</text>

Schematron checks

<s:schema xmlns:s="http://purl.oclc.org/dsdl/schematron"
          xmlns:t="http://xproc.org/ns/testsuite/3.0" queryBinding="xslt2">
   <s:pattern>
      <s:rule context="/">
         <s:assert test="text">Wrong document element</s:assert>
      </s:rule>
   </s:pattern>
   <s:pattern>
      <s:rule context="/text">
         <s:assert test="starts-with(., 'WordStr 200')">Wrong text</s:assert>
      </s:rule>
   </s:pattern>
</s:schema>
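
The check above only inspects the first few characters of the result. A stricter variant, shown here as a sketch and not as part of the test, could verify the shape of every WordStr line. It assumes, based only on the sample result above, that each such line carries five integer fields followed by a "#"-prefixed transcription. It deliberately does not assert any particular coordinates or text, since those may vary between OCR runs.

```xml
<s:schema xmlns:s="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
   <s:pattern>
      <s:rule context="/text">
         <!-- Assumption: a WordStr line is "WordStr", five integers, then
              " #" and the recognized text, as in the sample result. -->
         <s:assert test="every $line in tokenize(., '\n')[starts-with(., 'WordStr')]
                         satisfies matches($line, '^WordStr \d+ \d+ \d+ \d+ \d+ #')">Malformed WordStr line</s:assert>
      </s:rule>
   </s:pattern>
</s:schema>
```

Because the pattern is anchored only at the start of each line, it tolerates trailing whitespace and differing line endings in the OCR output.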

Revision history

12 Jun 2026, Norm Tovey-Walsh
Created test.