Retrospectively add cover page and DOI to every existing deposit (not release-related)

Project:RUcore Jobs & Reports
Component:Job - production
Category:task
Priority:critical
Assigned:yuyang
Status:Moved to JIRA
Description

Retrospectively add cover page and DOI to every existing deposit per the request from OAWG. See specs: https://rucore.libraries.rutgers.edu/collab/ref/spc_steering_r7_4_r7_5_v...

Comments

#1

Priority:normal» critical

This is top priority and must be done on the test server before we finish testing (end of July).

#2

Title: Retrospectively add cover page and DOI to every existing deposit (part of R7.4 implementation)» Retrospectively add cover page and DOI to every existing deposit (not release-related)

#3

Fri 9/12/14 email

All,

I put WMS_7.4.1 in place on the production server.

I copied the mycoversheet files over along with the
prodfaclist file.

I ran the script and it stopped before completion. The reports
are attached, the totals are below:

rep-prod:/home/RS_7.4.1 # wc -l 2014*
135 2014-09-11T23:33:17.000Z-errorlog.txt
22 2014-09-11T23:33:17.000Z-successlog.txt
157 total

All 135 in the error log were "No presentation PDF file exists".
The ones I spot-checked had only the object file, or had
the object file and only the SMAP datastream.

I wondered why the prodfaclist file even had these, as it is unclear
to me what selection method/criteria Jeffery used to select the
objects to be processed. There are 501 objects in the file, but
when I search Faculty Deposits online there are only 322.

The script stopped on this:
For 'rutgers-lib:43062'
JobInfo
JobSettings JobID="" UserJobID="" JobPriority="3" AdlibServer=""
JobFileMsg Count=""
JobFileList
JobFile Folder="" Filename=""
JobFileList
JobInfo

Following file/pages have exceeded maximum OCR size limit (28"x28") for PDF
server: pres43062-0.pdf, pg size [ 2736 x 2160 pts], 1 size [2736 x 2160 pts].

Earlier in the run, a screen-only message was displayed like this:

For 'rutgers-lib:39294'
Error: Bad annotation destination

I did not attempt a restart, as I thought these should be looked at
and my questions answered before proceeding.

Note: all logs are attached in a later comment.
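
As a quick check on the OCR size-limit message for rutgers-lib:43062 above, the page size in points can be converted to inches (72 points per inch) and compared against the 28" limit quoted in the message. This is only a sketch: the function name is made up for illustration, and treating the limit as "either dimension over 28 inches" is an assumption about how the PDF server applies it.

```python
POINTS_PER_INCH = 72      # standard PDF unit: 1 pt = 1/72 inch
OCR_LIMIT_INCHES = 28     # limit quoted in the error message

def exceeds_ocr_limit(width_pts, height_pts, limit_in=OCR_LIMIT_INCHES):
    """Return (width_in, height_in, too_big) for a page size given in points.

    Assumption: a page is too big if either dimension is over the limit.
    """
    width_in = width_pts / POINTS_PER_INCH
    height_in = height_pts / POINTS_PER_INCH
    too_big = width_in > limit_in or height_in > limit_in
    return width_in, height_in, too_big

if __name__ == "__main__":
    # Page size reported for pres43062-0.pdf
    print(exceeds_ocr_limit(2736, 2160))  # (38.0, 30.0, True)
    # An ordinary letter-size page for comparison
    print(exceeds_ocr_limit(612, 792))    # (8.5, 11.0, False)
```

So the 2736 x 2160 pt page works out to 38" x 30", which is over the limit in both dimensions.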

#4

Just to follow up, I'm working on analyzing the different sorts of errors and will have a detailed report soon.

#5

I downloaded and tested the following set of PDFs for objects that failed to generate coversheets so far. Most of them should work if redone. One turns out to have been a success in a later pass. Four did not pass my test on devel. One of these was locked for changes with "change:no". The other three were very large, book length documents that seem to have timed out. I've put the ones that can be redone in /mellon/cvsroot as "prodfaclist2".

rutgers-lib:21042 ## was able to work on devel - redo this one (Mardikian's)
rutgers-lib:21209 ## was able to work on devel - redo this one (Vazquez's)
rutgers-lib:21707 ## was able to work on devel - redo this one (Dent's)
rutgers-lib:21708 ## was able to work on devel - redo this one (Agnew powerpoint)
rutgers-lib:21768 ## was able to work on devel - redo this one (Niessen's)
rutgers-lib:24050 ## was able to work on devel - redo this one
*rutgers-lib:24053 ## this one was locked and failed with the following message: getCurrentJobStatus::Failure:JobID-29d99a46-f55b-451b-97cf-4fa77c10d9bf\760887 failed.
rutgers-lib:24057 ## was able to work on devel - redo this one
*rutgers-lib:24461 ## This unlocked but book length PDF (275 pages) timed out with: getCurrentJobStatus::Failure:JobID-29d99a46-f55b-451b-97cf-4fa77c10d9bf\760731 failed.
rutgers-lib:33277 ## was able to work on devel - redo this one
rutgers-lib:33278 ## was able to work on devel - redo this one
#rutgers-lib:33358 ## done already on mss3
rutgers-lib:33360 ## was able to work on devel - redo this one
*rutgers-lib:35873 ## unlocked but very large PDF 44.6 MBs typewritten dissertation
*rutgers-lib:36717 ## This unlocked but large book length PDF timed out with: getCurrentJobStatus::Failure:JobID-29d99a46-f55b-451b-97cf-4fa77c10d9bf\760579 failed.

#6

Ran the 10 objects Jeffery put in the prodfaclist2 file

This displayed to screen for one of them:

For 'rutgers-lib:33360'
Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array

Error file is 0 bytes:
rep-prod:/home/RS_7.4.1 # ls -al 2014-09-18T15\:08\:46.000Z-errorlog.txt
-rwxrwxrwx 1 root root 0 2014-09-18 15:08 2014-09-18T15:08:46.000Z-errorlog.txt
rep-prod:/home/RS_7.4.1 #

success file:
rep-prod:/home/RS_7.4.1 # cat 2014-09-18T15\:08\:46.000Z-successlog.txt
Success with rutgers-lib:21042
Success with rutgers-lib:21209
Success with rutgers-lib:21707
Success with rutgers-lib:21708
Success with rutgers-lib:21768
Success with rutgers-lib:24050
Success with rutgers-lib:24057
Success with rutgers-lib:33277
Success with rutgers-lib:33278
Success with rutgers-lib:33360

All 10 were successful.

If there are any more to rerun, please post a new list to rep-devel
/mellon/cvsroot

#7

I've copied down all the PDFs from the faculty portal (weeding out collection objects and things like Mitch's objects that do not have PDF datastreams) and extracted text to test for the existence of coversheets. All told there were 309 PDFs. After Dave's run, 286 of these have coversheets. 23 are problem PDFs for various reasons (size, length, or change:no restrictions). I have run pdfinfo on these and created four log files that I will attach here:
337 1142 9591 infolist.txt
286 286 5148 pdflistdone.txt
309 309 5562 pdflistfull.txt
23 23 414 pdflistnotdone.txt

#8

Out of the 23 in the notdone list I've identified 10 more likely candidates (ordinary looking PDFs without change:no, poster size, or book length) that will probably work. I've put a list of these in /mellon/cvsroot as "prodfaclist3". Dave, could you try running the script again on these? That would leave only 13 definite problem files.

#9

Note: prodfaclist3 was whittled down to 7, since 3 objects had XACML POLICY datastreams that would prevent changes in Fedora.

#10

Copied new prodfaclist3 from rep-devel:/mellon/cvsroot
and reran the script. The results are:

2014-09-19T11:28:53.000Z-errorlog.txt
::::::::::::::
Error with rutgers-lib:37048 - Error updating fedora presentation datastream: Authorization Denied Caused by: code=0
Error with rutgers-lib:43551 - getCurrentJobStatus::Failure:JobID-152b8fe8-d55f-4d50-8260-aa78d4acf857\102139 failed. | Missing metadata: date.
::::::::::::::
2014-09-19T11:28:53.000Z-successlog.txt
::::::::::::::
Success with rutgers-lib:35779
Success with rutgers-lib:44715
Success with rutgers-lib:44721
Success with rutgers-lib:44723
Success with rutgers-lib:44725
rep-prod:/home/RS_7.4.1 #
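
For tallying runs like this one, the success and error logs could be summarized with a small script. This is a rough sketch, not part of the job itself: it assumes the "Success with ..." and "Error with ... - message" line formats shown in the logs above, and it assumes each entry sits on one unwrapped line. The function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Line formats as they appear in the success and error logs.
LINE = re.compile(r"^(Success|Error) with (rutgers-lib:\d+)(?: - (.*))?$")

def summarize(lines):
    """Return (succeeded IDs, {ID: error message}, Counter of error messages)."""
    ok, failed = [], {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip separators, shell prompts, and anything unrecognized
        status, pid, message = m.groups()
        if status == "Success":
            ok.append(pid)
        else:
            failed[pid] = message or ""
    return ok, failed, Counter(failed.values())

if __name__ == "__main__":
    sample = [
        "::::::::::::::",
        "Error with rutgers-lib:37048 - Error updating fedora presentation "
        "datastream: Authorization Denied Caused by: code=0",
        "Success with rutgers-lib:35779",
        "Success with rutgers-lib:44715",
    ]
    ok, failed, reasons = summarize(sample)
    print(len(ok), "succeeded;", len(failed), "failed")
    for reason, count in reasons.most_common():
        print(count, reason)
```

Grouping by message makes it easy to see when a whole batch failed for one reason, as with the 135 "No presentation PDF file exists" entries in the first run.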

#11

Thanks Dave! I should have zapped rutgers-lib:37048, which has a POLICY.

I'm not sure what's wrong with rutgers-lib:43551. It says it can be changed, though it's the only one I saw that indicated this specifically. It has watermarks and does some weird dynamic updating - the top part of the first page becomes invisible after a fraction of a second when you first try to view it.

Here is its pdfinfo:

Subject:
Keywords:
Author: RajniT
Producer: Persits Software AspPDF - www.persits.com
CreationDate: Mon Jun 23 12:28:48 2014
ModDate: Mon Jun 23 12:28:48 2014
Tagged: no
Pages: 2
Encrypted: yes (print:yes copy:yes change:yes addNotes:yes)
Page size: 612 x 792 pts (letter)
File size: 249750 bytes
Optimized: yes
PDF version: 1.3
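
Since the problem PDFs so far fall into a few groups (encrypted or change:no, book length, very large), the pdfinfo output could be screened before a run. The sketch below is one possible approach: it parses the "Key: value" lines that pdfinfo prints and flags likely trouble. The function names and the page-count and file-size thresholds are guesses for illustration, not values taken from the PDF server.

```python
import re

def parse_pdfinfo(text):
    """Turn pdfinfo's "Key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info

def flag_problems(info, max_pages=200, max_bytes=40_000_000):
    """Return reasons a PDF might fail. Both thresholds are hypothetical."""
    flags = []
    encrypted = info.get("Encrypted", "no")
    if encrypted.startswith("yes"):
        flags.append("encrypted")
        # Permissions look like "(print:yes copy:yes change:yes addNotes:yes)"
        perms = dict(re.findall(r"(\w+):(yes|no)", encrypted))
        if perms.get("change") == "no":
            flags.append("change:no")
    if int(info.get("Pages", "0")) > max_pages:
        flags.append("book length")
    size = info.get("File size", "0 bytes").split()[0]
    if int(size) > max_bytes:
        flags.append("very large file")
    return flags

if __name__ == "__main__":
    sample = """Pages: 2
Encrypted: yes (print:yes copy:yes change:yes addNotes:yes)
Page size: 612 x 792 pts (letter)
File size: 249750 bytes"""
    print(flag_problems(parse_pdfinfo(sample)))  # ['encrypted']
```

Run against the values above, this flags rutgers-lib:43551 as encrypted even though change:yes is set, which may be worth checking as a cause of the failure.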

#12

I think we've now done all of these that we can do with the current version of the software and are at the point where we should discuss what to do about the obstinate corner cases. It shouldn't hold Dave up from other things.

#13

When passing the rutgers-lib:43551 PDF into the PDF server, the following error is returned:

Error: PDF contains security

Looking in Adobe Acrobat, the PDF does have security limitations.

#14

I think this is a new issue with this job that was uncovered at testing today. When the new PDF-1 datastream version was added, the ALT_IDS attribute was not carried over from the previous/original PDF-1.0 entry to the new PDF-1.1 entry. This means the original filename is not used when the user downloads the PDF.

Example on production is: rutgers-lib:41069

<foxml:datastream ID="PDF-1" FEDORA_URI="info:fedora/rutgers-lib:41069/PDF-1" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
  <foxml:datastreamVersion ID="PDF-1.0" LABEL="PDF-1" CREATED="2013-09-19T16:47:42.000Z" ALT_IDS="fname:Otto ARL survey for RUcore.pdf" MIMETYPE="application/pdf" SIZE="868188">
    <foxml:contentDigest TYPE="SHA-256" DIGEST="02e3c39369baa867cea0556010ef4ebd7a673148a5957bd579fbf34de462da16"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="http://128.6.218.102:8080/fedora/get/rutgers-lib:41069/PDF-1/2013-09-19T16:47:42.000Z"/>
  </foxml:datastreamVersion>
  <foxml:datastreamVersion ID="PDF-1.1" LABEL="" CREATED="2014-09-12T14:44:39.843Z" MIMETYPE="application/pdf" SIZE="920942">
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="http://128.6.218.102:8080/fedora/get/rutgers-lib:41069/PDF-1/2014-09-12T14:44:39.843Z"/>
  </foxml:datastreamVersion>
</foxml:datastream>
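
To find other objects affected the same way, the exported FOXML could be scanned for datastream versions that lost an attribute the previous version had. This is a sketch only: it assumes the FOXML namespace URI `info:fedora/fedora-system:def/foxml#` is declared in the export and that versions appear in chronological order. The function name `dropped_attributes` is made up for illustration, and the sample is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET

FOXML_URI = "info:fedora/fedora-system:def/foxml#"
FOXML_NS = {"foxml": FOXML_URI}

def dropped_attributes(foxml_text, names=("ALT_IDS", "LABEL")):
    """List (datastream ID, version ID, attribute) where a value present in
    one version is missing or empty in the version that follows it."""
    root = ET.fromstring(foxml_text)
    problems = []
    # iter() also yields the root, so a bare <foxml:datastream> works too
    for ds in root.iter("{%s}datastream" % FOXML_URI):
        versions = ds.findall("foxml:datastreamVersion", FOXML_NS)
        # Assumption: versions are listed oldest first
        for prev, cur in zip(versions, versions[1:]):
            for name in names:
                if prev.get(name) and not cur.get(name):
                    problems.append((ds.get("ID"), cur.get("ID"), name))
    return problems

if __name__ == "__main__":
    sample = """<foxml:datastream xmlns:foxml="info:fedora/fedora-system:def/foxml#" ID="PDF-1">
  <foxml:datastreamVersion ID="PDF-1.0" LABEL="PDF-1" ALT_IDS="fname:Otto ARL survey for RUcore.pdf"/>
  <foxml:datastreamVersion ID="PDF-1.1" LABEL=""/>
</foxml:datastream>"""
    print(dropped_attributes(sample))
    # [('PDF-1', 'PDF-1.1', 'ALT_IDS'), ('PDF-1', 'PDF-1.1', 'LABEL')]
```

The check treats an empty value the same as a missing one, so it catches both the absent ALT_IDS and the blank LABEL on PDF-1.1.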

#15

Assigned to:triggs» yuyang

I think this has to be handed to Yang who is doing the Fedora REST interaction for these. In dlr/EDIT add and change datastream, we handle altIDs as part of the process.

#16

Assigned to:yuyang» triggs

The LABEL attribute was also not preserved from the first version to the second.

#17

Assigned to:triggs» yuyang

Same problem as the altids. I'm not doing the interaction with Fedora in this.

#18

Component:Job - test server» Job - production

What's the current status of this? Still relevant?

#19

Status:active» Moved to JIRA
