Converting techMD datastreams to "managed content"

Project:RUcore Jobs & Reports
Component:Report - development
Category:task
Priority:normal
Assigned:rmarker
Status:closed
Description

We have run into what seem to be permissions problems in testing the fedora-modify-control-group.sh script.

I consulted with Steve Durbin of Fedora Commons, who was able to run the script for control group change on his system, but couldn't reproduce our problem. He recommended we try turning off XACML and testing again. It will require a repository restart, and thus assistance from Dave Hoover.

We will need to set the ENFORCE-MODE parameter to "permit-all-requests" in fedora.fcfg (point 2.1 here: https://wiki.duraspace.org/display/FEDORA36/XACML+Policy+Enforcement).

If we still get a "forbidden" error with enforcement turned off, we can then likely rule out XACML policy enforcement as the cause.
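The config edit described above could be scripted roughly as follows. This is only a sketch: it assumes the parameter appears in fedora.fcfg on a single line as <param name="ENFORCE-MODE" value="..."/>, and the helper name set_enforce_mode and the example path are illustrative assumptions, not part of any Fedora tooling. It keeps a backup so enforcement can be restored after the test.

```shell
# Hypothetical helper: set the ENFORCE-MODE value in a fedora.fcfg-style file.
# Assumes the parameter is written as <param name="ENFORCE-MODE" value="..."/>
# on a single line; check the real file before relying on this.
set_enforce_mode() {
    cfg=$1
    mode=$2
    cp "$cfg" "$cfg.bak" || return 1   # keep a copy for restoring enforcement later
    sed "s|\(name=\"ENFORCE-MODE\"[[:space:]]*value=\"\)[^\"]*\"|\1$mode\"|" "$cfg.bak" > "$cfg"
}

# Example (the path is an assumption; Fedora has to be stopped before the
# edit and restarted afterwards):
# set_enforce_mode "$FEDORA_HOME/server/config/fedora.fcfg" permit-all-requests
```

Restoring the original setting is then a matter of copying the .bak file back and restarting Fedora again.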

Comments

#1

Note: This should be tested first on rep-dev.

#2

What do you mean by "It will require a repository restart"? Restart Fedora? Ashwin can do that.

#3

If Ashwin can do it, that would be fine too. It's an edit of the Fedora config file, a stop, and a restart of fedora. If the test is successful, we can get a better sense of how long something like this will take and we can plan on a way to move it up the chain as it were.

#4

Update this entry with instructions. I will ask Ashwin to take a look.

#5

I'll need to work pretty closely with Ashwin when we are both ready. I think we'll need to open up fedora as described by Steve Durbin, then I'll need to set up my command line environment to run the test, which could take several passes to get everything right. Once it's done, we'll need to document what worked or did not work and restrict fedora again.

#6

Component:Job - test server» Report - development

I separated the objects on rep-dev into those that already had managed content (M) TECHNICAL1 datastreams (38) and those that still had inline (X) TECHNICAL1 datastreams (415). I ran my conversion script on the latter set. A typical conversion time for an individual object was

real 0m0.765s
user 0m0.700s
sys 0m0.044s

The total conversion time was

real 5m32.384s
user 4m56.283s
sys 0m18.097s

[1]+ Done time for file in `cat techchange.list`;
do
./movecontrol.sh $file TECHNICAL1;
done > techchange.log

techchange.log shows success for all objects, including several with more than one version of TECHNICAL1.
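For planning purposes, the average time per object can be derived from the totals above. The helper below is a small sketch (the name per_object_seconds is made up here) that converts a "real" value from time into seconds and divides by the number of objects.

```shell
# Convert a `time`-style "real" value such as 5m32.384s to seconds and
# divide by the number of objects processed.
per_object_seconds() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        split(t, parts, "m")       # parts[1] = minutes, parts[2] = seconds plus trailing "s"
        sub(/s$/, "", parts[2])
        printf "%.3f\n", (parts[1] * 60 + parts[2]) / n
    }'
}

per_object_seconds 5m32.384s 415   # prints 0.801
```

The average of about 0.8 seconds per object is in line with the typical single-object time of 0.765 seconds reported above.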

#7

Below is a link to the XACML/embargo policy management issue that may help when constructing a system-wide policy that would allow datastream migration from inline to managed while still enforcing the resource/object-level XACML embargo policies.

https://software.libraries.rutgers.edu/node/2860

On 9/24/15 it was thought that making modifications like this to the system policy would be effective.

#8

I have found a way to run the modifyDatastreamControlGroup method as a web service and have created a new script that works exactly like the Fedora fedora-modify-control-group.sh script but without having to change any current XACML enforcement rules. The new script (movecontrol.php) runs the following curl command:
// Call Fedora's modifyDatastreamControlGroup action for one datastream.
$ch4 = curl_init();
$resturl = "http://$FEDORAHOST:$PORT/fedora/management/control?action=modifyDatastreamControlGroup&pid=$pid&dsID=$dsid&controlGroup=M";
curl_setopt($ch4, CURLOPT_HEADER, 0); // keep response headers out of the output
curl_setopt($ch4, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch4, CURLOPT_NOBODY, FALSE);
curl_setopt($ch4, CURLOPT_RETURNTRANSFER, TRUE); // return the response body as a string
curl_setopt($ch4, CURLOPT_FRESH_CONNECT, TRUE);
curl_setopt($ch4, CURLOPT_USERPWD, "$fedoraAdmin:$fedoraAdminPasswd"); // authenticate as the Fedora admin user
curl_setopt($ch4, CURLOPT_URL, $resturl);
$curlcontent3 = curl_exec($ch4); // response body: the <versions> list shown in the log
curl_close($ch4);
print "Changing the control group for the $dsid datastream of $pid to managed content.\n$curlcontent3\n";
and can be driven by a simple shell script, runmovecontrol.sh, that reads a list of PIDs, converts the TECHNICAL1 datastreams, and stores the return info in a log file. Note: the script converts all versions of the datastream and maintains all the original creation and modification times. It adds information about the control group move to the audit trail. As a test, I ran the script on the 1115 objects in the Hoboken Historical Photographs collection on rep-test. This gives an idea of the timing required:
triggs@rep-devel:~> time ./runmovecontrol.sh

real 1m9.785s
user 0m13.537s
sys 0m9.065s
triggs@rep-devel:~> ls -lt | head -1
total 2059048
-rw-r--r-- 1 triggs developers 243118 Oct 1 16:31 151001-163040-movecontrollog
The log file shows no errors and the typical return for a successful run, in this case on objects with two or three versions, e.g.:
Changing the control group for the TECHNICAL1 datastream of rutgers-lib:24678 to managed content.
<versions>
<version>Tue Jun 30 11:58:11 EDT 2009</version>
<version>Fri Jun 28 13:42:12 EDT 2013</version>
</versions>

Changing the control group for the TECHNICAL1 datastream of rutgers-lib:24677 to managed content.
<versions>
<version>Tue Jun 30 11:53:53 EDT 2009</version>
<version>Tue Jun 30 15:56:14 EDT 2009</version>
<version>Fri Jun 28 13:42:09 EDT 2013</version>
</versions>

Changing the control group for the TECHNICAL1 datastream of rutgers-lib:24676 to managed content.
<versions>
<version>Tue Jun 30 11:47:15 EDT 2009</version>
<version>Tue Jun 30 15:48:54 EDT 2009</version>
<version>Fri Jun 28 13:42:10 EDT 2013</version>
</versions>

Changing the control group for the TECHNICAL1 datastream of rutgers-lib:14329 to managed content.
<versions>
<version>Fri Jan 26 11:01:25 EST 2007</version>
<version>Fri Jun 28 13:44:36 EDT 2013</version>
</versions>
If the command is run on an object that does not have an inline TECHNICAL1 (because it is already managed), it simply returns an empty version set, e.g.:
triggs@rep-devel:~> php -f ~/movecontrol.php pid=rutgers-lib:10028
Changing the control group for the TECHNICAL1 datastream of rutgers-lib:10028 to managed content.
<versions>
</versions>
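A log in this format can be scanned for such empty version sets with a short filter. The sketch below assumes the record layout shown above (one "Changing the control group ..." line per object followed by its <versions> block); the function name is made up for illustration. It lists every record that has no <version> lines, whatever the reason, so the output still deserves a manual look.

```shell
# Print the PIDs whose log record contains no <version> entries.
# Assumes one "Changing the control group ..." line per object, followed
# by that object's <versions> block.
list_empty_version_sets() {
    awk '
        /^Changing the control group/ {
            if (pid != "" && count == 0) print pid
            pid = $(NF - 3)        # the PID is the fourth field from the end
            count = 0
            next
        }
        /<version>/ { count++ }    # matches <version> lines, not <versions>
        END { if (pid != "" && count == 0) print pid }
    ' "$1"
}

# Example: list_empty_version_sets 151001-163040-movecontrollog
```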
After the command has been run, the datastream versions appear on the file system with the current date tree:
triggs@rep-devel:~> find /repository/data -name "rutgers-lib_24678*"
/repository/data/datastreams/2013/0628/13/42/rutgers-lib_24678+SMAP-ARCH+SMAP-ARCH.0
/repository/data/datastreams/2013/0628/13/42/rutgers-lib_24678+JPEG-1+JPEG-1.2
/repository/data/datastreams/2013/0628/13/42/rutgers-lib_24678+SMAP1+SMAP1.1
/repository/data/datastreams/2013/0628/13/42/rutgers-lib_24678+ARCH1+ARCH1.2
/repository/data/datastreams/2015/1001/16/30/rutgers-lib_24678+TECHNICAL1+TECHNICAL1.1
/repository/data/datastreams/2015/1001/16/30/rutgers-lib_24678+TECHNICAL1+TECHNICAL1.2
/repository/data/objects/2009/0630/15/58/rutgers-lib_24678
but the datastream profile is the same as before except for <dsControlGroup>M</dsControlGroup>:
<datastreamProfile xsi:schemaLocation="http://www.fedora.info/definitions/1/0/management/ http://www.fedora.info/definitions/1/0/datastreamProfile.xsd" pid="rutgers-lib:24678" dsID="TECHNICAL1" dateTime="2013-06-28T17:42:12.374Z"><dsLabel>TECHNICAL1</dsLabel><dsVersionID>TECHNICAL1.2</dsVersionID><dsCreateDate>2013-06-28T17:42:12.374Z</dsCreateDate><dsState>A</dsState><dsMIME>text/xml</dsMIME><dsFormatURI/><dsControlGroup>M</dsControlGroup><dsSize>248</dsSize><dsVersionable>true</dsVersionable><dsInfoType/><dsLocation>rutgers-lib:24678+TECHNICAL1+TECHNICAL1.2</dsLocation><dsLocationType>INTERNAL_ID</dsLocationType><dsChecksumType>SHA-256</dsChecksumType><dsChecksum>3ac96670525d0d89a8b8f7a281e28c57e7cb8244ace51966d386a2ae7f3c6fd8</dsChecksum></datastreamProfile>
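To spot-check the result for a given object, the control group can be pulled out of a profile like the one above. This is a sketch only: the function name is made up, and the commented curl call assumes the profile is fetched through the Fedora 3 REST API with the same host, port, and credential variables used in movecontrol.php.

```shell
# Read a datastreamProfile document on stdin and print its control group.
control_group_of_profile() {
    sed -n 's:.*<dsControlGroup>\([^<]*\)</dsControlGroup>.*:\1:p'
}

# Example (endpoint and variables are assumptions to verify locally):
# curl -s -u "$fedoraAdmin:$fedoraAdminPasswd" \
#   "http://$FEDORAHOST:$PORT/fedora/objects/$pid/datastreams/TECHNICAL1?format=xml" \
#   | control_group_of_profile
```

For a converted datastream the expected value is M; an unconverted inline datastream would report X.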
The audit trail records the event as follows:
Audit 16 Description: Modified datastream control group for rutgers-lib:24678 TECHNICAL1 from X to M
Action: modifyDatastreamControlGroup
DSID: TECHNICAL1
Date: 2015-10-01T20:30:40.275Z
I propose running the script on the rest of the objects on rep-test and then turning it over to Dave to run on rep-staging and rep-prod.
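For anyone picking this up, a driver along the lines of runmovecontrol.sh might look like the sketch below. It is not the original script: the function names are made up, and the per-PID command simply mirrors the php -f ~/movecontrol.php pid=... invocation shown earlier. The log name follows the timestamp pattern of the log file listed above.

```shell
# Sketch of a driver loop; not the original runmovecontrol.sh.
# convert_one runs the conversion for a single PID and can be redefined.
convert_one() {
    # stdin is redirected so the command cannot consume the PID list
    php -f "$HOME/movecontrol.php" "pid=$1" < /dev/null
}

# Read PIDs from a list file, convert each one, and collect all output in
# a timestamped log in the current directory. Prints the log file name.
run_move_control() {
    list=$1
    log=$(date +%y%m%d-%H%M%S)-movecontrollog
    while IFS= read -r pid; do
        [ -n "$pid" ] || continue      # skip blank lines
        convert_one "$pid"
    done < "$list" > "$log"
    echo "$log"
}

# Example: time run_move_control movecontrollist.txt
```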

#9

Assigned to:triggs» dhoover

I ran runmovecontrol.sh on all the datastreams (almost 16,000) on rep-test.
triggs@rep-devel:~>
real 13m57.515s
user 3m12.668s
sys 2m8.244s

[1]+ Done time ./runmovecontrol.sh

The following file is in the /mellon/cvsroot directory on staging, with the files needed to run the script on staging and production:
triggs@rep-devel:~> ls -l /mellon/cvsroot/movecontrol.tar
-rw-r--r-- 1 triggs developers 624640 Oct 29 14:26 /mellon/cvsroot/movecontrol.tar

This is a readme for running movecontrol.php to move the control group of TECHNICAL1 datastreams from inline to managed.

The following tar file will be delivered:
triggs@rep-devel:~> tar tvf movecontrol.tar
-rwxr-xr-x triggs/developers 180 2015-10-01 15:06 runmovecontrol.sh ## the shell script for running the php script
-rwxr-xr-x triggs/developers 1540 2015-10-01 14:56 movecontrol.php ## the php script that moves the control group in Fedora
-rw-r--r-- triggs/developers 0 2015-10-29 14:07 movecontrollist.txt ## an empty PID list file (copy lists of pids to this file as desired before running the script, e.g.
head -10000 objlistprod-082615.txt > movecontrollist.txt
time ./runmovecontrol.sh
etc.)
-rw-r--r-- triggs/developers 13214 2015-10-29 14:05 objliststaging-102915.txt ### a list of pids on rep-staging
-rw-r--r-- triggs/developers 594004 2015-10-29 14:01 objlistprod-082615.txt ### a list of pids on production
-rw-r--r-- triggs/developers 1031 2015-10-29 14:25 movecontrol-readme.txt ### this readme file
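
If the production list is to be worked through in batches as the readme suggests, the list could be cut up in advance rather than with repeated head commands. A minimal sketch, assuming the standard split utility is available; the function name and the .batch. suffix are arbitrary choices for illustration.

```shell
# Split a PID list into batch files of at most `size` lines each and
# print the names of the batch files created.
make_batches() {
    list=$1
    size=$2
    split -l "$size" "$list" "$list.batch."
    ls "$list".batch.*
}

# Example: process the production list 10000 PIDs at a time.
# for batch in $(make_batches objlistprod-082615.txt 10000); do
#     cp "$batch" movecontrollist.txt
#     time ./runmovecontrol.sh
# done
```

Running one batch at a time also leaves a separate log per batch, which makes it easier to see where a problem occurred.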

#10

Assigned to:dhoover» rmarker
Status:active» test

I was able to identify 17 objects that did not have TECHNICAL1 datastreams and so failed the move control group script. I used the following command:
perl -pe 's/\n/ /;' 151201-062317-movecontrollog | perl -pe 's/Changing/\nChanging/g;' | egrep html
to extract a list of these, and then went through the objects in dlr/EDIT, extracting the inline TECHMD001 datastream, adding it back in as a managed content datastream with the ID TECHNICAL1, and then purging the old TECHMD001 datastream from the object. I worked on the following set of objects:
rutgers-lib:41210
rutgers-lib:41209
rutgers-lib:37749
rutgers-lib:37173
rutgers-lib:37080
rutgers-lib:36022
[rutgers-lib:35872]
rutgers-lib:30688
rutgers-lib:30687
rutgers-lib:27014
rutgers-lib:26687
rutgers-lib:26437
rutgers-lib:26381
rutgers-lib:26380
rutgers-lib:25813
rutgers-lib:25367
rutgers-lib:25114

I've bracketed rutgers-lib:35872 because this object seems to have been purged since the runlist was generated.
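
For reference, the perl/egrep pipeline above can also be written as a single awk filter. This is a sketch under the same assumption the pipeline makes, namely that a failed record contains the string "html" in its response; the function names are made up, and the PID is taken to be the tenth word of the "Changing ..." line.

```shell
# Print one line per log record that contains "html", with the record's
# lines joined by spaces (the same idea as the perl/egrep pipeline).
failed_records() {
    awk '
        /^Changing/ { if (rec ~ /html/) print rec; rec = $0; next }
        { rec = rec " " $0 }
        END { if (rec ~ /html/) print rec }
    ' "$1"
}

# Reduce the failed records to a bare list of PIDs.
failed_pids() {
    failed_records "$1" | awk '{ print $10 }'
}

# Example: failed_pids 151201-062317-movecontrollog
```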

#11

Status:test» closed
