The report looks for PDF datastreams and uses pdfinfo to determine the number of pages. If an object has more than one PDF datastream (PDF-1, PDF-2, etc.), those are analyzed as well. Only the most recent version of a PDF is analyzed, so if PDF-1.0 and PDF-1.1 both exist, only PDF-1.1 is analyzed.
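
As a rough illustration of the pdfinfo step, here is a minimal Python sketch. It is not the actual report script; the function names `parse_page_count` and `count_pdf_pages` are made up for this example, and it assumes pdfinfo is on the PATH and prints its usual `Pages:` line.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Pull the page count out of pdfinfo's text output (hypothetical helper)."""
    # pdfinfo prints one "Key:   value" pair per line, including "Pages:   N".
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def count_pdf_pages(pdf_path: str) -> int:
    """Run pdfinfo on a file and return its page count (hypothetical helper)."""
    # check=True raises CalledProcessError if pdfinfo fails, e.g. on a damaged PDF.
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the parsing against saved pdfinfo output without having the PDFs at hand.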

When a PDF does not exist and mods:typeOfResource is Text, the number of JPEG datastreams is used as the page count; this assumes the page turner is being used. An example is a yearbook. Only the most recent version of each JPEG datastream is counted, so if JPEG-1.0 and JPEG-1.1 are found, that counts as one page, not two.
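
The "most recent version only" rule can be sketched as below. This is only an illustration, not the report script: `latest_versions` is a made-up name, and it assumes version IDs have the form `<datastream ID>.<version number>`, as in JPEG-1.0 and JPEG-1.1.

```python
import re

# Assumed ID shape: "<datastream ID>.<integer version>", e.g. "JPEG-1.1".
_VERSION_ID = re.compile(r"(.+)\.(\d+)", re.ASCII)


def latest_versions(version_ids):
    """Map each datastream ID to its highest-numbered version ID (hypothetical helper)."""
    latest = {}  # datastream ID -> (version number, version ID)
    for version_id in version_ids:
        match = _VERSION_ID.fullmatch(version_id)
        if match is None:
            continue  # skip IDs that do not follow the assumed pattern
        dsid, version = match.group(1), int(match.group(2))
        # Compare versions as integers so that .10 sorts after .9.
        if dsid not in latest or version > latest[dsid][0]:
            latest[dsid] = (version, version_id)
    return {dsid: version_id for dsid, (_, version_id) in latest.items()}
```

With this, the JPEG page count for an object would be the number of keys starting with "JPEG" in the returned mapping, so JPEG-1.0 and JPEG-1.1 together contribute one page.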

Video & audio objects with transcript PDFs are counted.

StillImages with PDFs are counted, with the thinking that even though it is a still image it might contain text that can be OCRed, e.g. a billboard.

If an object has no PDFs and does not qualify for the JPEG count (JPEGs present and mods:typeOfResource is Text), nothing is counted. As a result, resources that only have DjVu datastreams are not counted.
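
Putting the counting rules together, the per-object decision might look like the sketch below. It is not the actual script: `count_pages` and its parameters are made up for illustration, and the inputs are assumed to already reflect only the most recent version of each datastream.

```python
def count_pages(type_of_resource, pdf_page_counts, jpeg_count):
    """Return the page count for one object under the rules described above.

    type_of_resource: the object's mods:typeOfResource value
    pdf_page_counts:  page counts, one per PDF datastream (latest version only)
    jpeg_count:       number of JPEG datastreams (latest version only)
    All three names are hypothetical; they are not taken from the real script.
    """
    # Any object with PDFs is counted by its PDF pages, whatever its type
    # (text, video/audio transcripts, still images).
    if pdf_page_counts:
        return sum(pdf_page_counts)
    # JPEG fallback applies only to Text objects (page turner assumption).
    # Case-insensitive comparison is an assumption about how the value is stored.
    if type_of_resource.strip().lower() == "text":
        return jpeg_count
    # Everything else, including DjVu-only objects, counts as zero pages.
    return 0
```

For example, a Text object with no PDFs and 120 JPEG datastreams would be reported as 120 pages, while a StillImage with only JPEGs would be reported as 0.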

The list of objects to analyze is determined by the Fedora database objectPaths tables.

This isn't a perfect representation of the number of pages but I think it will provide a sufficient accounting for our report.



On March 30th I ran a test of the script on the dev and test systems. The test system has 365,354 pages. I am attaching the report in spreadsheet form.

Run time of the script on the test system was 14 minutes for 15,285 objects.


On April 9th the script was run on staging and production. Reports attached.

