From: Digest <deadmail>
To: "OS/2GenAu Digest"<deadmail>
Date: Sun, 3 Aug 2008 00:00:40 +1000 (EST)
Subject: [os2genau_digest] No. 1686
Reply-To: <deadmail>
X-List-Unsubscribe: www.os2site.com/list/

**************************************************
Saturday 02 August 2008
 Number  1686
**************************************************

Subjects for today
 
1  Re:  Tesseract : "Voytek Eymont" <voytek at sbt dot net dot au>
2  Re:  Tesseract : "Voytek Eymont" <voytek at sbt dot net dot au>
3  Re:  Tesseract : Alan Duval <amoht at westnet dot com dot au>
4  Re:  Tesseract : Alan Duval <amoht at westnet dot com dot au>

**= Email   1 ==========================**

Date:  Fri, 1 Aug 2008 23:57:32 +1000 (EST)
From:  "Voytek Eymont" <voytek at sbt dot net dot au>
Subject:  Re:  Tesseract


<quote who="Peter Moylan">

> I haven't tried this (I don't even have Tesseract), but I think what you
> need to do is write a simple Rexx script that does three things in
> sequence:
> - parse its program argument to decompose the file name, so as to
>   construct the arguments for the next two steps
> - call Tesseract
> - call the word processor

that's what I'd do
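
a minimal sketch of those three steps might look like the script below. The
assumptions are mine, not Peter's: tesseract.exe is on the PATH and
TESSDATA_PREFIX is already set, E.EXE (the OS/2 system editor) stands in for
the word processor, the file name contains no spaces, and the names
"stripext" and "viewer" are made up for illustration.

```rexx
/* ocrview.cmd - sketch only: OCR one image file, then open the text  */
/* ASSUMPTIONS: tesseract.exe is on the PATH, TESSDATA_PREFIX is set,  */
/* and the file name has no spaces. 'viewer' is a stand-in.            */

viewer = 'e'                        /* hypothetical: use your word processor */

PARSE ARG infile                    /* step 1: take the file name argument */
infile = STRIP(infile)
IF infile = '' THEN DO
    SAY 'usage: ocrview image-file'
    EXIT 1
END
base = stripext(infile)             /* C:\OCR\page.tif -> C:\OCR\page */

'tesseract' infile base '-l eng'    /* step 2: tesseract writes base.txt */

/* check for the output file rather than trusting a return code */
IF STREAM(base'.txt', 'C', 'QUERY EXISTS') = '' THEN DO
    SAY 'no text file produced for' infile
    EXIT 2
END

'start' viewer base'.txt'           /* step 3: open the result */
EXIT 0

/* return the name without its last extension, keeping the path */
stripext: PROCEDURE
    PARSE ARG name
    dot   = LASTPOS('.', name)
    slash = LASTPOS('\', name)
    IF dot > slash THEN name = LEFT(name, dot - 1)
    RETURN name
```

saved as, say, ocrview.cmd, it would be run as "ocrview C:\OCR\page.tif",
which should leave C:\OCR\page.txt next to the image and open it.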

I just made a quick hack of a script I've used for something else
(archiving log files):

it looks through one or more predefined directories and OCRs any files
of the predefined types (like TIF and FAX)

---------
0[roman][F:\ute]ocr

0[roman][F:\ute]SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/
ocr processing ocr, logging to \logs\ocr.log
... processing directory \scanner
... ... processing for extension tif
... ... processing for extension fax

0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004164.FAX F:\scanner\FX004164 -l eng
Tesseract Open Source OCR Engine

0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004167.FAX F:\scanner\FX004167 -l eng
Tesseract Open Source OCR Engine
... processing directory \scanner\out
... ... processing for extension tif
... ... processing for extension fax

----------
ymmv

----
/* ocr.cmd  */

/* you MUST have the following:
tesseract ocr application

this does very little (none ?) in the way of error checking,
if another application holds a target file open, this will skip it and log the skip
if you don't know what it all means, stop now

example command line:
tesseract F:\FAXBANKSIA\FX006309.FAX test -l eng

*/


/* user defines below */

extlist= 'tif fax'			/* list all target extensions to process */
dirlist= '\scanner \scanner\out'        /* list all target dirs to process */
logdir= '\logs\'			/* target dir for logs, NEEDS trailing '\' */

arch= 'tesseract'			/* command to execute against targets, run as: tesseract image text -l eng */
/* arch= 'pause' */			/* UNCOMMENT this line for testing ? perhaps .. */
ocrdir= '\ute\tesseract\usr\bin'	/* where tesseract.exe lives */
'SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/' /* tesseract's data files are under there */

/* user defines end */


	IF RxFuncQuery('SysLoadFuncs') THEN		/* non-zero means RexxUtil is not yet registered */
	    DO
		CALL RxFuncAdd 'SysLoadFuncs', 'RexxUtil', 'SysLoadFuncs'
		CALL SysLoadFuncs
		SAY '... loading REXX Utilities libraries ...'
	    END
	call time 'E'			/* start the elapsed-time clock */
	call logfile
	say thiscmd 'processing ocr, logging to' log

	curdir = directory()		/* get where we are */
	wdirlist=dirlist		/* load rubber bullets */
	wextlist=extlist

DO WHILE wdirlist >''			/* loop for all target paths */
	PARSE VAR wdirlist target wdirlist
		say '... processing directory 'target

	  DO WHILE wextlist >'' 	/* loop for all extensions */
		PARSE VAR wextlist ext wextlist
		say '... ... processing for extension 'ext

		call SysFileTree target||'\*.'||ext, 'file', 'FO'	/* find all target files, full path names only */

			do i=1 to file.0	/* file.0 holds the number of matches, may be 0 */
			/* make sure NOT in use */
			if (stream(file.i,'C','OPEN WRITE') = 'READY:') then
			DO
			call stream file.i, 'C', 'CLOSE'
			  /* strip the extension at the LAST dot; the path is kept, */
			  /* so the .txt ends up next to the image */
			name = LEFT(file.i, LASTPOS('.', file.i) - 1)
			CALL directory(ocrdir)
			  arch file.i name '-l eng'
			END
			else call LINEOUT log, date() time() 'skipping 'file.i', file in use'
			end

	  end 				/* do while wextlist */
	wextlist=extlist		/* reload */
end 					/* do while wdirlist */

	call LINEOUT log, date() time() thiscmd' using 'arch' in 'dirlist' for extensions 'extlist' in 'TRUNC(time(e)) 'sec.'
	call LINEOUT log		/* let's log we done it */
	CALL directory(curdir)

EXIT


logfile:
	parse source . . thiscmd	/* let's see what we are running, and set log & par */
        at_char=LASTPOS('\' , thiscmd)
        thiscmd=SUBSTR(thiscmd , at_char + 1 )
        parse value thiscmd with thiscmd'.'ext
        log = logdir|| thiscmd || '.log'
return

/*
fully guaranteed never to overwrite any CD ROM
*/

----

-- 
Voytek

----------------------------------------------------------------------------------
 
**= Email   2 ==========================**

Date:  Sat, 2 Aug 2008 00:02:44 +1000 (EST)
From:  "Voytek Eymont" <voytek at sbt dot net dot au>
Subject:  Re:  Tesseract


<quote who="Alan Duval">

>
> I haven't used OCRing - didn't know it existed. Does it actually produce
> a text file and not a scanned copy? I know that you use PMfax a lot, so do
> you scan docs into PMfax and then use OCRing to convert them to text?  I
> want to scan articles and convert them to text formats that I can store or
> send to friends. That means that they have to be converted to *.doc files
> or *.pdf files. So far I have found that Tesseract does a good job but it

yes, I scan with CopyShop into PMfax; generally, I send out the PMfax
TIFF-F files, occasionally I make the TIFF into a PDF

I occasionally OCR scanned stuff to text, not very often

no, first you need a scanned copy, then you OCR part or all of it;
I generally OCR to clipboard (default in PMfax), then paste into whatever




-- 
Voytek

----------------------------------------------------------------------------------
 
**= Email   3 ==========================**

Date:  Sat, 02 Aug 2008 21:03:33 +1100
From:  Alan Duval <amoht at westnet dot com dot au>
Subject:  Re:  Tesseract

Peter Moylan wrote:
> Alan Duval wrote:
>> Dennis Nolan wrote:
>
>>> A better way is to create a Program object on your desktop. There is 
>>> a Program Object template in the Templates folder.
>>> Make your OCR program the Object. From memory you just need to drag 
>>> it to the Object when creating it.
>>> During the creation you need to specify the dropped file as the 
>>> input parameter. The Help file that you can access in the program 
>>> object explains how to do this.
>>>
>>> If it is set up correctly you only need to drag and drop your tif 
>>> files on the object for it to do its stuff.
>>
>> I can drag and drop my tif files on the program object that I created 
>> and it will process it and save it to C:\OCR.
>>
>>> There is a way to get it to open your word processor too, but it's 
>>> been too long for me to clearly remember how I used to do it.
>>
>> That's what I now want but can't see how to do it. I can drag and 
>> drop the txt file that has been created on to the word processor 
>> object and it opens in the word processor but I would like that to 
>> happen without doing this second drag and drop.
>
> I haven't tried this (I don't even have Tesseract), but I think what 
> you need to do is write a simple Rexx script that does three things in 
> sequence:
>   - parse its program argument to decompose the file name, so as to
>     construct the arguments for the next two steps
>   - call Tesseract
>   - call the word processor
>
> For someone who is not familiar with Rexx (I don't know whether you 
> are), the only hard part is the parsing of the file name, and even 
> that is easy once you look up the Rexx manual because Rexx has an 
> explicit PARSE command. The rest is just like writing a batch file.
>
> Suppose this script is called "script.cmd". Then you can create a 
> program object which has the program name specified as "CMD.EXE" 
> (without the quotes), and the parameter string "/C SCRIPT.CMD" (also 
> without the quotes). The working directory should be the directory 
> where script.cmd lives. Alternatively, you can give a full path 
> specification for script.cmd, and set the working directory to be 
> where you want your data files to live. That part is not particularly 
> important, because you can always include CD (i.e. change directory) 
> commands in your script, or use full path names for every file that 
> has to be mentioned.
>
> On further thought, it's possible that the parameter string in the 
> program object should be something like "/C SCRIPT.CMD %1", or 
> something similar, to ensure that the parameter is passed to the 
> script. I can't check that now because I don't have OS/2 at work.
>
Thanks Peter,

I don't know much about REXX. All I've done is to write the HELLO command in 
REXX.
I'm lost when I read the PARSE and CALL commands. I'd have to get a book 
and work steadily through it to know what I was doing.

Regards,

Alan
----------------------------------------------------------------------------------
 

**= Email   4 ==========================**

Date:  Sat, 02 Aug 2008 21:07:04 +1100
From:  Alan Duval <amoht at westnet dot com dot au>
Subject:  Re:  Tesseract

Voytek Eymont wrote:
> <quote who="Peter Moylan">
>
>   
>> I haven't tried this (I don't even have Tesseract), but I think what you
>> need to do is write a simple Rexx script that does three things in
>> sequence:
>> - parse its program argument to decompose the file name, so as to
>>   construct the arguments for the next two steps
>> - call Tesseract
>> - call the word processor
>>     
>
> that's what I'd do
>
> I just made a quick hack of a script I've used for something (like
> archiving log files):
>
> it looks through one or more of predefined directories, and, ocrs any
> predefined file types (like TIF and FAX)
>
>   
Thanks Voytek,

However, I'm afraid it's too complicated for me as I am not familiar with 
REXX.

Regards,

Alan
----------------------------------------------------------------------------------
 

