Normalize labels in legacy objects

Project:RUcore Jobs & Reports
Component:Job - test
Category:task
Priority:critical
Assigned:ananthan
Status:test
Description

All objects created prior to R7.0 have "title:datastreamID" in the label attribute. The search portal would display these long labels, which would not look good. We therefore need to update all the legacy labels so that the label attribute contains only the datastream ID.

Comments

#1

There is a script in dlr/EDIT, backlabels.php, that is ready to be run on a given list of Fedora PIDs.

#2

Here is how it works.

Before:
1. ARCH1 application/x-tar Seabrook Farms: TAR 2006-01-09T13:55:17.000Z
2. DC (Version1.0)
2006-01-09T13:55:17.000Z
text/xml DC Metadata 2006-01-09T13:55:17.000Z
3. DIGIPROV1 (Version.0)
2006-01-09T13:55:17.000Z
text/xml Digiprov metadata 2006-01-09T13:55:17.000Z
4. DIGIPROV2 (Version.0)
2006-01-09T13:55:17.000Z
text/xml Digiprov metadata 2006-01-09T13:55:17.000Z
5. DJVU-1 (Version.0)
2006-01-09T13:55:17.000Z
image/x.djvu Seabrook Farms: DJVU 2006-01-09T13:55:17.000Z
6. JPEG-1 (Version.0)
2006-01-09T13:55:17.000Z
image/jpeg Seabrook Farms: JPEG 2006-01-09T13:55:17.000Z
7. MODS (Version1.0)
2006-01-09T13:55:17.000Z
text/xml MODS Metadata 2006-01-09T13:55:17.000Z
8. PDF-1 (Version.0)
2006-01-09T13:55:17.000Z
application/pdf Seabrook Farms: PDF 2006-01-09T13:55:17.000Z

During:
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> php -f backlabels.php pid=rutgers-lib:10019
Found a presentation 'JPEG-1' named 'Seabrook Farms: JPEG' that needs changing...
Success modifying the label and removing the old version of the JPEG-1 datastream of rutgers-lib:10019.
Found a presentation 'DJVU-1' named 'Seabrook Farms: DJVU' that needs changing...
Success modifying the label and removing the old version of the DJVU-1 datastream of rutgers-lib:10019.
Found a presentation 'PDF-1' named 'Seabrook Farms: PDF' that needs changing...
Success modifying the label and removing the old version of the PDF-1 datastream of rutgers-lib:10019.

After:
1. ARCH1 application/x-tar Seabrook Farms: TAR 2006-01-09T13:55:17.000Z
2. DC (Version1.0)
2006-01-09T13:55:17.000Z
text/xml DC Metadata 2006-01-09T13:55:17.000Z
3. DIGIPROV1 (Version.0)
2006-01-09T13:55:17.000Z
text/xml Digiprov metadata 2006-01-09T13:55:17.000Z
4. DIGIPROV2 (Version.0)
2006-01-09T13:55:17.000Z
text/xml Digiprov metadata 2006-01-09T13:55:17.000Z
5. DJVU-1 (Version.1)
2013-02-13T16:21:50.732Z
image/x.djvu DJVU-1 2013-02-13T16:21:50.732Z
6. JPEG-1 (Version.1)
2013-02-13T16:21:50.633Z
image/jpeg JPEG-1 2013-02-13T16:21:50.633Z
7. MODS (Version1.0)
2006-01-09T13:55:17.000Z
text/xml MODS Metadata 2006-01-09T13:55:17.000Z
8. PDF-1 (Version.1)
2013-02-13T16:21:50.824Z
application/pdf PDF-1 2013-02-13T16:21:50.824Z
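
The before/after listings suggest the rule being applied: presentation datastreams (JPEG-1, DJVU-1 and PDF-1 here) whose label is not already the datastream ID get the ID as their new label, while ARCH1, DC, DIGIPROV and MODS are left alone. A minimal shell sketch of that rule follows. It is not taken from backlabels.php; the prefix list and the function name needs_relabel are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: not the logic of backlabels.php itself.
# PRESENTATION_PREFIXES is an assumed list based on the example object above.
PRESENTATION_PREFIXES="JPEG DJVU PDF"

# needs_relabel DSID LABEL
# Returns 0 (true) when DSID looks like a presentation datastream and its
# label is not already equal to the datastream ID.
needs_relabel() {
    local dsid="$1" label="$2" prefix
    for prefix in $PRESENTATION_PREFIXES; do
        case "$dsid" in
            "$prefix"-[0-9]*)
                if [ "$label" != "$dsid" ]; then
                    return 0
                fi
                return 1
                ;;
        esac
    done
    return 1
}

if needs_relabel "JPEG-1" "Seabrook Farms: JPEG"; then
    echo "JPEG-1: label would become 'JPEG-1'"
fi
```

Datastreams such as MODS or ARCH1 never match a prefix, so the function reports that they need no change, which matches the unchanged rows in the After: listing.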

#3

This needs to be run on test and staging ASAP for ALL existing objects.

#4

I can run it on devel, but Dave will have to do it on staging.

#5

It's running now on devel. I started by generating a complete list of PIDs on devel, which I put into a file named
fulldevelpidlist
I then cd'ed to the EDIT directory and ran this script:
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> for pid in `cat ~/fulldevelpidlist`
> do
> php -f backlabels.php pid=$pid
> done > ~/backlabels-devel.log &
The progress can be checked by looking at the log file:
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> wc ~/backlabels-devel.log
3887 60236 420763 /home/triggs/backlabels-devel.log
or
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> tail -f ~/backlabels-devel.log | less
to watch it building.
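
A slightly more defensive variant of the loop above is sketched here. It reads the list line by line, skips blank lines, and marks any PID whose command exits non-zero. The name run_batch is hypothetical, and whether backlabels.php actually sets a non-zero exit status on failure is not known from this thread. The per-PID command is passed in as arguments so the sketch can be tried with echo first.

```shell
#!/usr/bin/env bash
# run_batch LISTFILE COMMAND [ARGS...]
# Reads one PID per line from LISTFILE, skips blank lines, and runs
# COMMAND ARGS... pid=<PID> for each one. Prints a marker line for any
# PID whose command exits non-zero; returns 0 only if none failed.
run_batch() {
    local list="$1" pid failures=0
    shift
    while IFS= read -r pid || [ -n "$pid" ]; do
        if [ -z "$pid" ]; then
            continue
        fi
        # stdin is redirected so the command cannot consume the PID list
        if ! "$@" "pid=$pid" < /dev/null; then
            echo "FAILED pid=$pid"
            failures=$((failures + 1))
        fi
    done < "$list"
    [ "$failures" -eq 0 ]
}

# Trial run that only prints the commands:
#   run_batch ~/fulldevelpidlist echo php -f backlabels.php
# Real run from dlr/EDIT, logging stdout and stderr together:
#   run_batch ~/fulldevelpidlist php -f backlabels.php > ~/backlabels-devel.log 2>&1 &
```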

#6

It took three hours to run on the 16468 test objects. From an initial look at the log, I believe it succeeded on all but some datastreams with POLICY restrictions (as expected) and on a few huge (i.e., multi-GB) presentation datastreams that seemed to have timed out. We may want to restrict the array of datastreams so as to avoid this in the real run - though such datastreams may not yet exist on prod. We may want to run it on a series of smaller object sets, or perhaps prioritized by collection.
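
One way to take that initial look at the log is to compare how many datastreams were reported as needing a change with how many were reported as changed. The sketch below assumes the log uses the two message forms shown in the During: output earlier in this ticket. The wording of any failure or timeout message is not shown here, so the result is only a count of unmatched lines, not a list of failures. The function name summarize_log is hypothetical.

```shell
#!/usr/bin/env bash
# summarize_log LOGFILE
# Counts "Found a presentation" lines, "Success modifying the label" lines,
# and the number of distinct PIDs that have at least one success line.
summarize_log() {
    local log="$1" found ok pids
    found=$(grep -c '^Found a presentation ' "$log" || true)
    ok=$(grep -c '^Success modifying the label ' "$log" || true)
    pids=$(grep '^Success modifying the label ' "$log" \
        | grep -o 'rutgers-lib:[0-9][0-9]*' | sort -u | wc -l)
    echo "found=$found success=$ok unmatched=$((found - ok)) pids=$((pids))"
}

# summarize_log ~/backlabels-devel.log
```

A non-zero unmatched count would be the cue to look for POLICY-restricted or very large datastreams among the objects processed.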

#7

In the development system, almost all the labels have been fixed. However, the first three records that come up in an empty ETD search have their long titles as the label: rutgers-lib:24349, rutgers-lib:25828, and rutgers-lib:24641.
Also, the first record in an empty Digital Collections search: rutgers-lib:25802.
And various records in an empty Scholarly Materials search: rutgers-lib:25174 , rutgers-lib:24641, rutgers-lib:24354, rutgers-lib:25088, rutgers-lib:24546, rutgers-lib:24925.
I did not look at every record so there might be more. Are these the records with huge datastreams that you mentioned?
Also -- it works fine on records with a policy. When it does not work (as with rutgers-lib:24641), the label says "Access to Triumphant Underdogs? The Haves Not Ahead in the First Decade of the WTO Dispute Settlement System: PDF-1 has been restricted at the author's request". When it does work (as with rutgers-lib:200909), it displays "Access to PDF-1 has been restricted at the author's request.".

Because it works with the vast majority of datastreams, my recommendation is to go ahead with it on staging with your own suggested modification: restrict the array of datastreams for the initial run, then pick up those excluded datastreams.

#8

I think most of the ones that did not work were objects that have either an active or an obsolete POLICY restricting the datastreams in question. For active POLICYs, this is perhaps the behavior we should expect; for obsolete POLICYs, it is a Fedora bug that we still have to work around (I'm not sure if it is fixed in the latest Fedora.) When there is a POLICY referencing a datastream, we cannot even get the datastream profile (including the needed UTC dates).

The objects that may have timed out are only the huge datasets in ZIP or GZIP files. I'm not even sure we have those yet on production. In any event, these represent only a few objects and are easy to isolate. I would like to drop these (GZIP and ZIP) from the array of IDs being considered and then, if need be, try these by hand within dlr/EDIT. This way, we can safely separate the processes of datastream modification and removal of the older version.
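
Dropping the ZIP and GZIP datastreams from the set being considered could look like the filter below. This is a sketch only: the array of IDs lives inside backlabels.php, whereas this reads IDs one per line on standard input, and the function name filter_dsids is an assumption.

```shell
#!/usr/bin/env bash
# filter_dsids
# Reads datastream IDs (one per line) on stdin and drops any of the form
# ZIP-n or GZIP-n, so those can be handled by hand later.
filter_dsids() {
    grep -v -E '^(ZIP|GZIP)-[0-9]+$' || true
}

printf '%s\n' PDF-1 ZIP-1 GZIP-2 JPEG-1 | filter_dsids
```

The excluded IDs can be collected with the inverse match (grep -E without -v) to build the list for the later pass by hand.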

#9

I await the appropriate readme files and scripts to be put
on rep-devel in /mellon/cvsroot that I can use for
both rep-staging and rep-prod

I will need a script that identifies the appropriate rutgers-lib PID's
and another that will process the PID's against the backlabels.php
script.

I am also curious if it can be run on production earlier than Release
7.0, or are we waiting on the delivery of code to get the backlabels.php
script and anything else it depends on?

#10

I can generate a list of PIDs you need for staging through a Solr search. The backlabels.php script does depend on the new INT subdirectory of dlr/EDIT delivered with 7.0, with its internal getfedorarest.php. This may have to be run as part of the general installation process once dlr/EDIT is in place. In the meantime, for staging, I'll put a list of PIDs, a readme, and a copy of backlabels.php in cvsroot for you.

#11

Hi Dave,

The following three files are ready for you in /mellon/cvsroot on rep-devel:
-rw-r--r-- 1 triggs developers 859 2013-02-26 15:50 staging-labels-readme
-rw-r--r-- 1 triggs developers 1632 2013-02-26 15:41 backlabels.php
-rw-r--r-- 1 triggs developers 7174 2013-02-26 15:40 staging-labels-list

#12

Status:active» test

According to Dave, this is done.

#13

Status:test» closed

#14

Status:closed» active

I turned on user-friendly labels on production. I quickly found a few objects with labels that are questionable and might not have been caught in the normalization run.

http://mss3.libraries.rutgers.edu/dlr/showfed.php?pid=rutgers-lib:30899

http://mss3.libraries.rutgers.edu/dlr/showfed.php?pid=rutgers-lib:34470

http://mss3.libraries.rutgers.edu/dlr/showfed.php?pid=rutgers-lib:38794

There might be more. Let me know if I should disable the display of the labels until this is debugged.

#15

These were intentionally left out so that they could be done later by hand. Rhonda and I wanted to move cautiously with objects like the ones you mention with very large datastreams. Rather than trust them to a program that by default would remove the old versions once the labels were changed, we thought that for these we could change the label while keeping the old datastream, and then remove that datastream once it was determined that the new one (with the new label) was safe.

#16

The third object cited is an ETD. There are more ETD's like this as well.

http://mss3.libraries.rutgers.edu/dlr/showfed.php?pid=rutgers-lib:37433

#17

The ETDs are the other known issue. If they have restrictive POLICY datastreams, Fedora (at least as we have it set) will not allow the labels to be changed. I originally thought this was a Fedora feature/bug, but have begun to wonder if it has something to do with a higher level, repository POLICY that we have in place.

#18

To clarify, Rhonda asks "Do you want RUresearch datastreams to carry the datastream ID, e.g., PDF-1, GZIP-1, GZIP-2, etc.?".

Ryan and I spoke - at this time, unless we can somehow present the original file name, there is no straightforward way to differentiate between multiple files of the same file type within a datastream. As an example, see http://mss3.libraries.rutgers.edu/dlr/showfed.php?pid=rutgers-lib:39192. While the description information below says what each of the two zip files is, there's nothing in the file title that differentiates the two.

It could be possible (but painful) to go back and revise all of the file names to identify them individually, but it seems that where we should be going is to present the original file name in this file display.

So unfortunately, right now, we need to carry the detailed datastream ID in the research data portal.

Should we discuss at SW_Arch?

#19

Aletia,

Have you considered re-ingesting or reworking the files for the primate teeth objects? Right now, as I see it, we offer a download of a ZIP file that is a TAR file and then contains the subsequent SUR or TSF files. Maybe ingesting them as a directory and using the "Download Object" interface might be the best route, because then the original file names will be visible and selectable by the end user. Just my 2 cents.

-Chad

#20

I spoke with Ryan and changed all of the Primate Teeth resources datastream labels from {title}: ZIP-1 and {title}: ZIP-2 to "SUR Files" and "TFS Files". Other research projects still need some attention as far as what the datastream labels need to be changed to.

#21

What is the status of this issue?

#22

I'm not sure, but I think Dave ran the script on all the targeted objects in late March, so it might be done now.

#23

For objects with XACML protected datastreams those datastream labels have not been normalized.

Example: http://rucore.libraries.rutgers.edu/rutgers-lib/38772/

I know there was an issue because of the XACML policy but a solution needs to be found.

#24

The Fedora XACML restriction is still an issue with 3.6. The workaround is to remove obsolete XACML policies and then change the labels or do other API operations. We may want to create a script to browse the repository periodically and remove obsolete POLICY datastreams.

#25

Unfortunately, we can't use a workaround that involves removing any POLICY datastreams. Especially for ETDs -- but not just those -- we must retain all POLICY versions, including expired ones. There are many potential solutions but this is not one of them.

#26

Project:RUcore dlr/EDIT» RUcore Jobs & Reports
Version:7.0» <none>
Component:Code» Job - test

#27

Is there an update on this? It's almost been a year since the last comment, and we know how to adjust expired XACML policies now to perform an update. It is marked critical as well.

#28

I tested one by hand just now and the label changing worked. Do we have a list of objects that had such policies? They might have been in the backlabels log. If there are not too many, it might be worth going through such a list by hand. If they still have old POLICY datastreams, these can be updated and then the labels changed by hand.

#29

#1
On February 13th, 2013 triggs says:

There is a script in dlr/EDIT, backlabels.php, that is ready to be run on a given list of Fedora PIDs.
---------------------------------------------
#23
On August 29th, 2013 chadmills says:

For objects with XACML protected datastreams those datastream labels have not been normalized.

Example: http://rucore.libraries.rutgers.edu/rutgers-lib/38772/

I know there was an issue because of the XACML policy but a solution needs to be found.
----------------------------------------------
#27
On July 19th, 2016 chadmills says:

Is there an update on this? It's almost been a year since the last comment, and we know how to adjust expired XACML policies now to perform an update. It is marked critical as well.
----------------------------------------------------
MY NEW COMMENT:
We have a script. We know how to run the script on objects with an expired XACML policy. I do not have a list of objects with an expired XACML Policy. Please generate a list of objects with an expired XACML policy. Apply that list of Fedora PIDs to the script, and place the script on [wherever it is that we put such things] for Dave to run on the production system. Please also post the list here, or link to it, so we can check the objects to let you know that the fix has been successful.

#30

I found 969 objects that have old-style POLICY datastreams with expiration dates up to 2015 and ingest dates of 052715 or earlier. These POLICY datastreams will have to be updated somehow before any labels can be changed. 359 of these objects already have labels in the style PDF-1 but have old-style POLICY datastreams preventing any edits to the PDF-1. 610 of these objects have old-style POLICY datastreams and PDF-1 datastreams with titles as part of the label. The full list is attached as filtered-policyreport.txt. The list of objects needing POLICY-only changes is attached as policy-only-changes.txt. The list of objects needing changes to POLICY and label is attached as policy-and-label-changes.txt.
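
Since 359 + 610 = 969, the two sub-lists should partition the full list. A sketch of that sanity check follows, assuming each attached file has the PID as the first whitespace-separated field on every line (the actual layout of the attachments is not shown in this ticket). check_partition is a hypothetical helper.

```shell
#!/usr/bin/env bash
# check_partition FULL PART_A PART_B
# Succeeds when PART_A and PART_B share no PIDs and together contain
# exactly the PIDs in FULL. Assumes the PID is the first field per line.
check_partition() {
    local full a b overlap
    full=$(awk 'NF {print $1}' "$1" | sort -u)
    a=$(awk 'NF {print $1}' "$2" | sort -u)
    b=$(awk 'NF {print $1}' "$3" | sort -u)
    # Each sub-list is already de-duplicated, so any repeated line in the
    # combined stream is a PID present in both sub-lists.
    overlap=$(printf '%s\n%s\n' "$a" "$b" | sort | uniq -d)
    if [ -n "$overlap" ]; then
        echo "overlap between sub-lists:"
        echo "$overlap"
        return 1
    fi
    if [ "$(printf '%s\n%s\n' "$a" "$b" | sort -u)" != "$full" ]; then
        echo "sub-lists do not add up to the full list"
        return 1
    fi
    echo "ok: $(printf '%s\n' "$full" | wc -l | tr -d ' ') PIDs"
}

# check_partition filtered-policyreport.txt policy-only-changes.txt policy-and-label-changes.txt
```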

#31

The attached spreadsheet shows the Fedora PID, the date/time of the last embargo version, and the datastreams to be embargoed for all the objects that will need to have new style POLICY versions created to unlock access to the datastreams and labels.

#32

I ran the policy replacement script on 156 objects on rep-test (it took one minute):
php dorunembargo.php rep-test-policy-dt-ds.txt realrun > rep-test-policy-dt-ds-020217.txt
I'm attaching the log so that we can test these objects.

#33

Here is an Excel spreadsheet with the results on rep-test to aid in spot-check testing. These objects should have new policies with the listed dates, and the listed datastreams should be unlocked and editable.

#34

Assigned to:triggs» dhoover

Hi Dave,

The dorunembargo.php script is ready to run on rep-staging. There are two files to download in /mellon/cvsroot:
-rw-r--r-- 1 triggs developers 20480 Feb 7 15:36 dorunembargo.tar
-rw-r--r-- 1 triggs developers 5180 Feb 7 15:29 rep-staging-policy-dt-ds.txt
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> tar tvf dorunembargo.tar
-rwxr-xr-x triggs/developers 1074 2017-02-02 11:49 dorunembargo.php
-rw-r--r-- triggs/developers 7747 2017-02-07 15:34 srunembargo.php
Unpack the tarfile in a directory of your choosing and type the following for the dryrun test, e.g.:
php -f dorunembargo.php rep-staging-policy-dt-ds.txt dryrun > rep-staging-policy-dt-ds-020717-dryrun.log
To run the real script type:
php -f dorunembargo.php rep-staging-policy-dt-ds.txt realrun > rep-staging-policy-dt-ds-020717-realrun.log

Note - this text is also available in /mellon/cvsroot/dorunembargo-readme.txt

#35

Note - once this is run and the XACML policies are unlocked, we will be able to run the change labels script again on the 610 objects that need labels changed to PDF-1. (See attachments to the comment above.)

#36

Looks good - e.g., for 200514 the audit now shows:
Audit 27 Description: dlr/EDIT user purged version . . . Purged datastream (ID=FLV-1), versions ranging from the beginning of time to the end of time. This resulted in the permanent removal of 1 datastream version(s) (2017-02-01T17:33:13.530Z) and all associated audit records.
Action: purgeDatastream
DSID: FLV-1
Date: 2017-02-14T21:12:00.020Z

#37

Run on rep-staging 2/17/17 14:46
php -f dorunembargo.php rep-staging-policy-dt-ds.txt dryrun > rep-staging-policy-dt-ds-020717-dryrun-log.txt

#38

This looks good. It should be ready for the realrun and then some testing of the objects as we did with rep-test.

#39

Run on rep-staging 2/17/17

2017-02-17 16:48:54 nohup php -f dorunembargo.php rep-staging-policy-dt-ds.txt realrun > rep-staging-policy-dt-ds-020717-realrun-log.txt &
The above did not work since it was run outside of dlr/EDIT.
Jeffery reworked the script so it could run from anywhere.

2017-02-17 17:15:13 nohup php -f dorunembargo.php rep-staging-policy-dt-ds.txt realrun > rep-staging-policy-dt-ds-021717-realrun-log.txt

Checking with find shows 84 objects updated on 2/17/17 at 17:15

rep-staging:/home/EMBARGO # find /repository/data/objects -type f -ls |grep "Feb 17" |wc -l = 84
rep-staging:/home/EMBARGO # wc -l rep-staging-policy-dt-ds.txt = 84
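
Grepping the listing for "Feb 17" can also match files last modified on 17 February of an earlier year. A sketch of a tighter count using the -newermt test follows. It assumes GNU findutils, and that each updated object corresponds to exactly one file under the objects directory, as the matching counts above suggest. count_modified is a hypothetical name.

```shell
#!/usr/bin/env bash
# count_modified DIR START END
# Counts regular files under DIR whose modification time is after START
# and no later than END (timestamps in any format GNU find accepts).
count_modified() {
    find "$1" -type f -newermt "$2" ! -newermt "$3" | wc -l | tr -d ' '
}

# count_modified /repository/data/objects '2017-02-17 17:00' '2017-02-17 18:00'
```

The result can then be compared with the line count of rep-staging-policy-dt-ds.txt in the same way as above.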

Report is attached.

#40

Thanks Dave! I've spot checked a few objects from the list and changed the label on one of them, and it seems to have worked just as we'd hoped... Perhaps someone else could check these too based on the PIDs in the output log - check to make sure the PDF is not locked and check the differing versions of the POLICY to make sure the dates are essentially the same.

#41

I spot checked three resources. Two of them had expired XACML policies and the third has a future expiration date. I was able to replace the PDF files successfully.

#42

Hi Dave,

Whenever you are ready, the dorunembargo.php script is ready to run on rep-prod. There are two files to download in /mellon/cvsroot:
-rw-r--r-- 1 triggs developers 20480 Feb 7 15:36 dorunembargo.tar
-rw-r--r-- 1 triggs developers 59200 Mar 28 13:48 rep-prod-policy-dt-ds.txt
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> tar tvf dorunembargo.tar
-rwxr-xr-x triggs/developers 1074 2017-02-02 11:49 dorunembargo.php
-rw-r--r-- triggs/developers 7747 2017-02-07 15:34 srunembargo.php
Unpack the tarfile in a directory of your choosing and type the following for the dryrun test, e.g.:
php -f dorunembargo.php rep-prod-policy-dt-ds.txt dryrun > rep-prod-policy-dt-ds-032817-dryrun.log
To run the real script type:
php -f dorunembargo.php rep-prod-policy-dt-ds.txt realrun > rep-prod-policy-dt-ds-032817-realrun.log

Note - this text is also available in /mellon/cvsroot/dorunembargo-prod-readme.txt

#43

Ran in both dryrun and realrun mode last night. Reports are attached.

#44

Thanks Dave! I did some spot checking through the whole range, and they all look good - the datastreams are now unlocked and we should be ready to run the change labels script on the subset of earlier objects that still have the titles as part of their DS label.

#45

Hi Dave,

Whenever you are ready, the donbacklabels.php script is ready to run on rep-prod. This script will change the labels of 610 objects that had the titles as part of the DS label and were previously locked by the old-style POLICY datastreams.

There are two files to download in /mellon/cvsroot:
-rw-r--r-- 1 triggs developers 10240 Apr 13 11:32 donbacklabels.tar
-rw-r--r-- 1 triggs developers 10980 Apr 13 11:32 rep-prod-backlabels.txt
triggs@rep-devel:~> tar tvf donbacklabels.tar
-rw-r--r-- triggs/developers 553 2017-04-13 11:10 donbacklabels.php
-rw-r--r-- triggs/developers 3137 2017-04-13 10:14 nbacklabels.php

Unpack the tarfile in a directory of your choosing and type the following for the dryrun test, e.g.:
php -f donbacklabels.php rep-prod-backlabels.txt dryrun > rep-prod-backlabels-041317-dryrun.log
To run the real script type:
php -f donbacklabels.php rep-prod-backlabels.txt realrun > rep-prod-backlabels-041317-realrun.log

Note - this text is also available in /mellon/cvsroot/donbacklabels-prod-readme.txt

#46

Attached is the report from the dryrun on rep-prod.

#47

Hi Dave,

The dry run looks good. We can go ahead with the realrun whenever you are ready.

Jeffery

#48

Attached is the report from the realrun on rep-prod

#49

Assigned to:dhoover» ananthan
Status:active» test

Thanks Dave! I spot checked the list and they all look good! I'll mark this test for Kalaivani (though others can test as well by checking PIDs in the log file)

Jeffery
