Migrate legacy mods:typeOfResource = stillImage presentation format to pyramidal TIFF

Project:RUcore Jobs & Reports
Component:Job - production
Category:task
Priority:normal
Assigned:dhoover
Status:fixed
Description

Scan the repository and create pyramidal TIFFs for resources with mods:typeOfResource = "StillImage" that have either DARCH or ARCH datastreams with mime-type "image/tiff".

Comments

#1

I wrote a job and performed a run on the dev system. Logic is as follows:

1) Use the Fedora database to obtain a list of object IDs
2) Open descriptive metadata and process objects with mods:typeOfResource = "StillImage"
3) Only create pyramidal TIFFs for resources that do not already have them
4) Check whether one or more DARCH datastreams exist and are mime-type "image/tiff"
5) If no DARCH exists, check for an ARCH datastream
6) If one or more valid sources are found, iterate through the source TIFFs, creating pyramidal TIFF derivatives using the PECL/ImageMagick class
7) Before ingesting the pyramidal TIFF (PTIF) datastream, check whether a corresponding JPEG datastream exists and whether it has a useful, non-standard, user-friendly label. If it does, use that same label for the PTIF datastream
8) Ingest the datastream
9) Validate the checksum of the PTIF in Fedora and compare that checksum to the checksum of the derivative
10) Optionally cleanup derivative

If at any point there is a failure it is recorded.
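
Steps 4, 5 and 9 can be sketched roughly as follows. This is a minimal illustration in Python, not the code in job.ptif.php. The Datastream class, the function names and the choice of MD5 as the checksum algorithm are all assumptions made for the example. It reads step 5 literally: ARCH is only consulted when no DARCH datastream exists at all.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Datastream:
    """Hypothetical record for one datastream of a resource."""
    ds_id: str      # e.g. "DARCH1" or "ARCH2"
    mime_type: str  # e.g. "image/tiff"


def select_sources(datastreams):
    """Return the TIFF sources, preferring DARCH over ARCH (steps 4 and 5)."""
    for prefix in ("DARCH", "ARCH"):
        group = [d for d in datastreams if d.ds_id.startswith(prefix)]
        if group:
            # Each datastream is judged on its own mime-type.
            return [d for d in group if d.mime_type == "image/tiff"]
    return []


def file_checksum(path, algorithm="md5"):
    """Checksum a local derivative file in chunks (algorithm is assumed)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(local_path, repository_checksum, algorithm="md5"):
    """Compare the local derivative against the checksum the repository reports (step 9)."""
    return file_checksum(local_path, algorithm) == repository_checksum.lower()
```

In this sketch a resource that has both DARCH and ARCH datastreams only ever yields DARCH sources, and an empty list means there is nothing to convert.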

The commit run on the dev system yielded the following:

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 31
currentDerivativeMimeTypeError: 3
currentDerivativeOk: 0
derivativeCommitted: 31
derivativeCreated: 31
derivativeNotCommitted: 0
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 1
derivativeTotalSize: 53.68MB
errorReadingObjectDatastreamInfo: 0
fedoraError: 0
imageMagickConversionError: 0
modsError: 48
unsupportedResourceType: 512
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-05-19T09:10:49-04:00
Total time spent was 199.03 second(s) or 3.32 minute(s)
This process used 142 second(s) for its computations.
It spent 3 second(s) in system calls.
---------------------------------------------------------------

I am doing a dry run on test next.

#2

Component:Job - development» Job - test

Test system dry run results are as follows:

Converted/Totals
====================================
Resources analyzed: 16,144
Derivatives created: 7,078
Total derivative size: 21.2 Gigabytes(GB)
Total run time: 10 hours 56 minutes

Errors
====================================
Resource is mods:typeOfResource = 'stillImage' but archival datastream not a tiff/image: 818
ImageMagick conversion error: 4
MODS datastream error: 98
Derivative source file, archival datastream, not found: 56

Skipped
====================================
Resource not mods:typeOfResource = 'stillImage': 7878
Current derivative OK; i.e. pyramidal TIFF already exists: 20

#3

Rucore-test error explanation:

ImageMagick conversion error: 4
======================
1 - resource has PTIF but also has a PRES-SMAP. A very test package object.
3 - resources are test JOHP resources with restricted transcripts

Derivative source file, archival datastream, not found: 56
=====================================
These are resources that are mods:typeOfResource = "StillImage" but do not have an ARCH or DARCH. The list of these IDs was not captured during the dry run. Changes have been made to the logging to pick these up the next time the job is run.

Resource is mods:typeOfResource = 'stillImage' but archival datastream not a tiff/image: 818
=============================================================
This means that for 818 resources with mods:typeOfResource = "StillImage", either all of the ARCHs or all of the DARCHs are not mime-type 'image/tiff'.

MODS datastream error: 98
==================
These look to be test objects for some project like OOI where the manage API was used to create stub resources.

Attaching dry run logs to this comment.

#4

Assigned to:chadmills» dhoover

Dave,

Do you or Ashwin have any objections if I were to run and commit the new PTIFs to the test system? Impact would be 21GB of additional usage.

Thanks,
Chad

#5

Component:Job - test» Job - staging

The job was run on the test environment. While the numbers have changed, the summary below and the explanation in comment #3 still cover the errors section. Attached is the full log.

I think this is ready for a dry run on staging. I have packaged this and added it to /mellon/cvsroot on rep-test.

@rep-test/mellon/cvsroot/job.ptif.php
@rep-test/mellon/cvsroot/job.ptif.README

Converted/Totals
====================================
Resources analyzed: 16,160
Derivatives created: 7,506
Total derivative size: 22.98 Gigabytes(GB)
Total run time: 12 hours 43 minutes
- Computation time: 2 hours 26 minutes
- System call time: 1 hour 48 minutes

Errors
====================================
Resource is mods:typeOfResource = 'stillImage' but archival datastream not a tiff/image: 943
ImageMagick conversion error: 4
MODS datastream error: 111
Derivative source file, archival datastream, not found: 63

Skipped
====================================
Resource not mods:typeOfResource = 'stillImage': 8309
Current derivative OK; i.e. pyramidal TIFF already exists: 34

#6

Report for the trial run on rep-staging is attached, below
is the summary information. If all is deemed well I will rerun on
rep-staging to commit the datastreams.

Logging to screen...
---------------------------------------------------------------
Script started - 2016-08-13T23:25:45-04:00
---------------------------------------------------------------
Examining 894 resources...
.....

100% completed
**************************

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 0
currentDerivativeOk: 12
currentSourceMimeTypeError: 65
derivativeCommitted: 0
derivativeCreated: 203
derivativeNotCommitted: 203
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 1
derivativeTotalSize: 1.57GB
errorReadingObjectDatastreamInfo: 0
fedoraError: 0
imageMagickConversionError: 1
modsError: 30
unsupportedResourceType: 711
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-08-14T00:11:28-04:00
Total time spent was 2742.98 second(s) or 45.72 minute(s)
This process used 7097 second(s) for its computations.
It spent 727 second(s) in system calls.
---------------------------------------------------------------

#7

Looks good; the one error with ARCH1 of 201359 looks like a problem though. That datastream was misdiagnosed as having the mime-type image/tiff when in reality it is an x-audio/wav. The job had a looping issue where only the last ARCH in the series was checked for mime-type correctness rather than each individual ARCH. I fixed the issue and re-ran it on test. Please re-run on staging in report mode. The expectation is that the 'imageMagickConversionError' should not appear in a re-run. Thanks.
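
The looping issue can be illustrated with a small Python sketch. This is not the code from job.ptif.php. The function names are made up for the example, and the second datastream (ARCH2) is hypothetical, since only ARCH1 of 201359 is described above.

```python
def last_only_check(datastreams):
    """Buggy pattern: the flag is overwritten on every pass, so only the last ARCH decides."""
    is_tiff = False
    for _ds_id, mime_type in datastreams:
        is_tiff = (mime_type == "image/tiff")
    return is_tiff


def tiff_sources(datastreams):
    """Fixed pattern: each datastream is kept or dropped based on its own mime-type."""
    return [ds_id for ds_id, mime_type in datastreams if mime_type == "image/tiff"]


# ARCH1 mirrors the case reported above; ARCH2 is a hypothetical TIFF sibling.
series = [("ARCH1", "x-audio/wav"), ("ARCH2", "image/tiff")]

print(last_only_check(series))  # the whole series passes, so ARCH1 would be sent to conversion
print(tiff_sources(series))     # only ARCH2 is treated as a valid source
```

With the per-datastream check, the audio file never reaches ImageMagick. That matches the later re-run, where imageMagickConversionError drops to 0 and currentSourceMimeTypeError rises by one.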

@rep-test/mellon/cvsroot/job.ptif.php
@rep-test/mellon/cvsroot/job.ptif.README

#8

job.ptif.php was rerun on rep-staging

Script started - 2016-08-16T18:01:52-04:00
---------------------------------------------------------------
Examining 894 resources...
......
---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 0
currentDerivativeOk: 12
currentSourceMimeTypeError: 66
derivativeCommitted: 0
derivativeCreated: 196
derivativeNotCommitted: 196
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 1
derivativeTotalSize: 1.53GB
errorReadingObjectDatastreamInfo: 0
fedoraError: 0
imageMagickConversionError: 0
modsError: 30
unsupportedResourceType: 711
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-08-16T18:44:35-04:00
Total time spent was 2562.98 second(s) or 42.72 minute(s)
This process used 6957 second(s) for its computations.
It spent 781 second(s) in system calls.
---------------------------------------------------------------

#9

Thanks, looks great. Please run with commit turned on on staging.

#10

Any update on this?

#11

I must have forgotten to update this. Let me know if we are ready for
rep-prod dryrun.

Run on rep-staging with commit turned on.

Logging to screen...
---------------------------------------------------------------
Script started - 2016-08-17T16:29:51-04:00
...
...

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 193
currentDerivativeOk: 12
currentSourceMimeTypeError: 66
derivativeCommitted: 193
derivativeCreated: 196
derivativeNotCommitted: 0
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 1
derivativeTotalSize: 1.53GB
errorReadingObjectDatastreamInfo: 0
fedoraError: 3
imageMagickConversionError: 0
modsError: 30
unsupportedResourceType: 711
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-08-17T17:17:38-04:00
Total time spent was 2867.7 second(s) or 47.8 minute(s)
This process used 6925 second(s) for its computations.
It spent 777 second(s) in system calls.
---------------------------------------------------------------

#12

I started the dryrun of job.ptif.php on rep-prod last night running
it for small ranges. Reports are attached. From 1 thru 25000 there
were no files that met the criteria for creating PTIFs.

job-ptif_log_1-1000.txt
job-ptif_log_1000-5000.txt
job-ptif_log_5000-10000.txt
job-ptif_log_10000-15000.txt
job-ptif_log_15000-20000.txt
job-ptif_log_20000-25000.txt
job-ptif_log_25000-30000.txt

The last run 25000-30000 started creating files in rucore/tmp/ptif
but not knowing how much space it would use I canceled the job to
rerun with the cleanup option turned on.

It has left 202 files in /home/httpd/html/rucore/tmp/ptif, the first few
of which are listed below if you want to take a look at them.

rutgers-lib_39303_DARCH1_DARCH1.ptif
rutgers-lib_39303_DARCH2_DARCH2.ptif
rutgers-lib_39304_DARCH1_DARCH1.ptif
rutgers-lib_39304_DARCH2_DARCH2.ptif
rutgers-lib_39306_DARCH1_DARCH1.ptif
rutgers-lib_39306_DARCH2_DARCH2.ptif
rutgers-lib_39307_DARCH1_DARCH1.ptif
rutgers-lib_39307_DARCH2_DARCH2.ptif
rutgers-lib_39308_DARCH1_DARCH1.ptif
rutgers-lib_39308_DARCH2_DARCH2.ptif
rutgers-lib_39308_DARCH3_DARCH3.ptif
rutgers-lib_39309_DARCH1_DARCH1.ptif
rutgers-lib_39309_DARCH2_DARCH2.ptif
rutgers-lib_39309_DARCH3_DARCH3.ptif
rutgers-lib_39320_ARCH1_ARCH1.ptif
rutgers-lib_39321_ARCH1_ARCH1.ptif
rutgers-lib_39322_ARCH1_ARCH1.ptif
rutgers-lib_39323_ARCH1_ARCH1.ptif
rutgers-lib_39324_ARCH1_ARCH1.ptif
rutgers-lib_39325_ARCH1_ARCH1.ptif

#13

Thanks. Looks good. The three Fedora errors in the report were for rutgers-lib:200903 and the commits did work. I think this is related to the Fedora database table issue mentioned in the thumbjpeg job. The title for rutgers-lib:200903 has a special character in it, "Pamphlet_test_Geoff_2013_02_25_#2_"Ф"."
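
As a quick cross-check of the staging commit report, the gap between created and committed derivatives can be compared with the fedoraError count. The sketch below is in Python, parse_summary is a hypothetical helper, and treating the counters as related in this way is an assumption about how the job tallies them.

```python
def parse_summary(text):
    """Parse 'name: integer' lines from a summary report into a dict."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[key.strip()] = int(value.strip())
    return counts


# Counters copied from the rep-staging commit run.
report = """\
derivativeCommitted: 193
derivativeCreated: 196
derivativeNotCommitted: 0
fedoraError: 3
"""

counts = parse_summary(report)
unaccounted = (
    counts["derivativeCreated"]
    - counts["derivativeCommitted"]
    - counts["derivativeNotCommitted"]
)
print(unaccounted, counts["fedoraError"])  # both are 3
```

The three derivatives that are neither committed nor not-committed line up with the three Fedora errors, which fits the explanation that the errors were reported even though the commits went through.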

Please run in report/dryrun mode on production with cleanup turned on.

#14

Dave,

Thanks. I am downloading and looking at the production PTIFs that were generated now. I will be in touch soon.

-Chad

#15

Component:Job - staging» Job - production

Dave,

Looked at the 201 production samples and they all look great. I would say run the entire job in dryrun with cleanup turned on and let's see what the total count and size is.

Thanks!

#16

Full dryrun done on rep-prod 9/8/16 21:57 (full report is attached)

---------------------------------------------------------------
Script started - 2016-09-08T21:57:52-04:00
---------------------------------------------------------------
Examining 35786 resources...
......

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 0
currentDerivativeOk: 242
currentSourceMimeTypeError: 10610
derivativeCommitted: 0
derivativeCreated: 4404
derivativeNotCommitted: 4404
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 8
derivativeTotalSize: 27.44GB
errorReadingObjectDatastreamInfo: 0
fedoraError: 0
imageMagickConversionError: 0
modsError: 35
unsupportedResourceType: 22039
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-09-09T09:06:29-04:00
Total time spent was 40116.59 second(s) or 668.61 minute(s)
This process used 99979 second(s) for its computations.
It spent 7668 second(s) in system calls.
---------------------------------------------------------------

#17

That was a pretty successful run. The "derivativeSourceNotFound: 8" have been identified and reported to Isaiah as a separate issue. If space isn't a concern then I say run it with commit mode on production.

Thanks,
Chad

#18

Full run with commit on was run on rep-prod 2016-09-16 21:02 (full report is attached)

---------------------------------------------------------------
Script started - 2016-09-16T21:02:38-04:00
---------------------------------------------------------------
Examining 35835 resources...
....

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumInfoError: 0
checksumNotValid: 0
checksumValid: 4404
currentDerivativeOk: 242
currentSourceMimeTypeError: 10610
derivativeCommitted: 4404
derivativeCreated: 4404
derivativeNotCommitted: 0
derivativeNotGeneratedDueToAnError: 0
derivativeSourceFileNotFound: 0
derivativeSourceNotFound: 8
derivativeTotalSize: 27.44GB
errorReadingObjectDatastreamInfo: 0
fedoraError: 0
imageMagickConversionError: 0
modsError: 35
unsupportedResourceType: 22088
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2016-09-17T12:06:49-04:00
Total time spent was 54250.05 second(s) or 904.17 minute(s)
This process used 99837 second(s) for its computations.
It spent 9245 second(s) in system calls.
---------------------------------------------------------------

#19

Dave,

Looks great. Thanks. Now we wait for the untarring of our legacy ARCH tars to generate more.

-Chad

#20

Chad

Followup on size of ingested PTIFs

I was looking through the fedora ingested datastreams directory for PTIFs on
rep-prod that were added based on the running of job.ptif.php on 9/16 and 9/17.
It appears that every PTIF may have been duplicated. Attached are 2 reports
that show the PTIFs added on 9/16 and 9/17 sorted by date/time.

Also attached is the php file that shows the settings that were used during the run.

#21

Status:active» fixed

I found out why duplicate datastreams were ingested. On the follow-up checksumValidation call I was still using the POST method when interacting with Fedora; I should have been using the GET method. Unfortunately, the ingest and checksumValidation calls are identical except for the HTTP method used, so the second POST call performed another ingest.
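
The effect can be reproduced with a small in-memory stand-in, sketched in Python. FakeRepository is purely illustrative and is not a Fedora client. It only models the behaviour described above: a POST to a datastream adds a version, while a GET reads it back without changing anything.

```python
class FakeRepository:
    """In-memory stand-in for the repository (an assumption, not the Fedora API)."""

    def __init__(self):
        self.versions = {}

    def request(self, method, ds_id, content=None):
        if method == "POST":
            # Every POST stores another copy of the datastream.
            self.versions.setdefault(ds_id, []).append(content)
        elif method != "GET":
            raise ValueError("unsupported method: " + method)
        return {"versions": len(self.versions.get(ds_id, []))}


def ingest_and_validate(repo, ds_id, content, validate_method):
    """Ingest once, then make the follow-up validation call with the given method."""
    repo.request("POST", ds_id, content)
    return repo.request(validate_method, ds_id, content)["versions"]


buggy = ingest_and_validate(FakeRepository(), "PTIF1", b"tiff-bytes", "POST")
fixed = ingest_and_validate(FakeRepository(), "PTIF1", b"tiff-bytes", "GET")
print(buggy, fixed)  # 2 versions with the POST follow-up, 1 with GET
```

Because the two calls differ only in the method argument, the mistake is easy to make, and nothing fails visibly when it happens. The duplicate only shows up when the stored datastreams are inspected afterwards.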

I will update the original creation script on @rep-test/mellon/cvsroot with this change. I will start working on another task to purge the duplicate datastreams on all of the systems.

For now I am going to mark this fixed. After we untar the archivals we will need to revisit this job.

-Chad

#22

I have updated the @rep-test/mellon/cvsroot/job.ptif.php script to fix the bug that creates 2 versions of every generated ptif datastream.
