Run report to determine total number of pages on all systems

Project: RUcore Jobs & Reports
Component: Report - test server
Category: task
Priority: normal
Assigned: chadmills
Status: closed
Description

The report looks for PDF datastreams and uses pdfinfo to determine the number of pages in each. If an object has more than one PDF datastream, those are analyzed as well, e.g. PDF-1, PDF-2, etc. Only the most recent version of a PDF is analyzed, so if both PDF-1.0 and PDF-1.1 exist, only PDF-1.1 is analyzed.
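As a rough illustration of the step above (not the actual report script), the version selection and the pdfinfo call could look something like the following. The function names are hypothetical, and the sketch assumes version IDs have the form shown above (datastream ID, a dot, then a version number) and that pdfinfo prints a "Pages:" line.

```python
import re
import subprocess


def latest_versions(version_ids, prefix="PDF"):
    """Keep only the highest-numbered version of each datastream.

    Assumes IDs such as "PDF-1.0" or "PDF-1.1", where the part before
    the last dot is the datastream ID and the part after is the version.
    """
    latest = {}
    for vid in version_ids:
        dsid, _, version = vid.rpartition(".")
        if not dsid.startswith(prefix) or not version.isdigit():
            continue  # not a datastream of the requested kind
        if dsid not in latest or int(version) > latest[dsid]:
            latest[dsid] = int(version)
    return sorted(f"{dsid}.{n}" for dsid, n in latest.items())


def parse_page_count(pdfinfo_output):
    """Pull the page count out of pdfinfo's text output (0 if absent)."""
    match = re.search(r"^Pages:[ \t]+(\d+)", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else 0


def count_pdf_pages(path):
    """Run pdfinfo on one file; an unreadable PDF counts as 0 pages here."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=False
    )
    return parse_page_count(result.stdout)
```

For example, latest_versions(["PDF-1.0", "PDF-1.1", "PDF-2.0"]) keeps PDF-1.1 and PDF-2.0, and each of those files would then be passed to count_pdf_pages.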

When no PDF exists and mods:typeOfResource is Text, the number of JPEG datastreams is used instead; this assumes the page turner is being used. An example is a yearbook. Only the most recent version of each JPEG datastream is counted, so if both JPEG-1.0 and JPEG-1.1 are found, that counts as one page, not two.
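The JPEG rule amounts to counting distinct JPEG datastreams and ignoring how many versions each one has. A minimal sketch, again with a hypothetical function name and assuming the same "datastream ID, dot, version" naming:

```python
def count_jpeg_pages(version_ids):
    """Count one page per JPEG datastream, regardless of its versions."""
    datastreams = set()
    for vid in version_ids:
        dsid, _, version = vid.rpartition(".")
        if dsid.startswith("JPEG") and version.isdigit():
            datastreams.add(dsid)  # JPEG-1.0 and JPEG-1.1 both map to JPEG-1
    return len(datastreams)
```

With JPEG-1.0, JPEG-1.1 and JPEG-2.0 present, this gives two pages.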

Video and audio objects with transcript PDFs are counted.

StillImages with PDFs are counted, on the thinking that even though the object is a still image it might contain text that can be OCRed, e.g. a billboard.

If the object has no PDFs and no JPEGs (JPEGs are counted only when mods:typeOfResource = Text), nothing is counted. As a result, resources that have only DjVu datastreams are not counted.
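Put together, the counting rules described so far reduce to a short decision per object. This is only a sketch of that logic under the assumptions stated in this description; the function and parameter names are made up for illustration, and the type comparison is case-insensitive as a guess at how the value is stored.

```python
def pages_for_object(type_of_resource, pdf_page_counts, jpeg_page_count):
    """Apply the counting rules to a single object.

    pdf_page_counts: page counts for the latest version of each PDF datastream.
    jpeg_page_count: number of distinct JPEG datastreams on the object.
    """
    if pdf_page_counts:
        # PDFs win for every resource type (text, video, audio, still image).
        return sum(pdf_page_counts)
    if type_of_resource.strip().lower() == "text":
        # No PDF: fall back to one page per JPEG (page turner objects).
        return jpeg_page_count
    # Anything else, including DjVu-only objects, contributes nothing.
    return 0
```

So a yearbook with no PDF and 120 JPEG datastreams counts as 120 pages, while a still image with JPEGs but no PDF counts as zero.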

The list of objects to analyze is determined by the Fedora database objectPaths tables.

This isn't a perfect representation of the number of pages, but I think it will provide a sufficient accounting for our report.

Comments

#1

On March 30th I ran a test of the script on the dev and test systems. The test system has 365,354 pages. I am attaching the report in spreadsheet form.

Run time of the script on the test system was 14 minutes for 15,285 objects.

#2

Status: active » closed

On April 9th the script was run on staging and production. Reports attached.
