Robert,

You are absolutely correct: the document has been scanned and OCR'd, but as you point out, there always tend to be inaccuracies with automatic OCR. To ensure that everything is preserved, the scanned image is what gets displayed, while the selectable text is the OCR version.

If you look at the document properties you will see that it was generated by Adobe Acrobat Capture 3.0 on 7/12/2001 at 17:34:12 ;-)

Dave

-----Original Message-----
From: Robert Ussery [mailto:uavscience@frii.com]
Sent: 06 January 2005 03:45
To: Microcontroller discussion list - Public.
Subject: Re: [OT]{WOT} Al Qaeda training manual on US DOJ site?

Mike Singer wrote:

>Moreover, they wrote: "The manual was found in a computer file"
>and "The manual was translated into English" May I ask why they
>chose that kind of a font. The font inflates 30K text up to 1M pdf.
>I doubt the bandits would like very much using 1M files instead of
>30K like Microchip with their XMbyte pdfs.
>

It's pretty clear from the .pdf that it was generated from a scanned-in government translation. I'll bet the course of events went something like this:

1) Govt. captures Al-Qaeda file.
2) DoJ translates and prints it in a fairly standard Courier-like font.
3) DoJ then decides they want an electronic copy and, for some obscure bureaucratic DoJ reason, decides to scan the doc in (all 100ish pages of it) instead of hunting down the original translation.

The reason for the large file size is not the obscure font, but rather the fact that the .pdfs were generated from scanned images, and not pure text. The unevenness of the text, the splotches and imperfections on the pages, and the page borders around the edge all make it fairly clear that this is a scanned-in document.

One interesting thing to note about the .pdfs, however, is that they do contain selectable (probably OCR-derived) text.
I'd say the text has been OCR'd because of some formatting irregularities peculiar to OCR: for instance, random tabs and inconsistent styling (one or two letters of a word being bold).

My point with all of this is that there's nothing odd about the format of the .pdfs, other than the DoJ's choice to scan them in rather than create .pdfs from the original document. Even this, however, can be explained by the fact that the scanned original appears to have been used as evidence in some legal proceedings. Thus, the document was probably scanned not for publication on the DoJ site, but rather for DoJ records, and was subsequently published on their site.

TTYL
- Robert

--
http://www.piclist.com PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist