Identify RUcore projects with split tar ARCH datastreams

Project:RUcore Jobs & Reports
Component:Report - production
Category:task
Priority:normal
Assigned:dhoover
Status:Moved to JIRA
Description

Early in RUcore development, we ingested a number of video objects with split TARs to handle large archival datastreams (>2GB). We need to identify those objects and reconstitute the files as part of our efforts to unroll TARed datastreams. I'm requesting a report that identifies:

- Objects with more than one ARCH datastream consisting of a TAR
- The first TAR file is between 1.6 and 2.0GB in size.

Comments

#1

Assigned to:triggs» dhoover

To do this, Isaiah only needs a list of likely candidates. We've looked at the output of the script on rep-test and determined that a suitable list can be generated by running the following command on rep-prod:
for file in `locate ARCH2`; do file $file; done | egrep -i tar | sed "s/: .*$//" | xargs ls -l > split-tar-outputlist.txt

The script does the following:
1) locate all ARCH2 datastreams
2) run file on each of these to determine its file type
3) keep only the files that file identifies as tar
4) strip the type description from the file output, leaving only the filenames
5) pipe these names to ls -l to get the file sizes in bytes, modification dates, and paths

With this list, Isaiah can use dlr/EDIT to look at the objects for files with earlier Fedora PIDs and large enough sizes.
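
As a rough aid for the size check, the ls -l listing could be narrowed by byte size before looking at the objects. The sketch below is an illustration only: the function name filter_by_size is hypothetical, it assumes the standard ls -l long format in which the 5th field is the size in bytes, and it leaves the bounds to the caller because this ticket does not say whether the 1.6 to 2.0GB range applies to the listed files or to a sibling datastream of the same object.

```shell
#!/bin/sh
# Sketch: keep only `ls -l` lines whose size (5th field, in bytes) lies
# within an inclusive range. filter_by_size is a hypothetical helper name.
# Usage: filter_by_size MIN_BYTES MAX_BYTES < listing
filter_by_size() {
    if [ "$#" -ne 2 ]; then
        echo "usage: filter_by_size MIN_BYTES MAX_BYTES" >&2
        return 2
    fi
    # Skip lines whose 5th field is not a plain integer (e.g. "total N").
    awk -v min="$1" -v max="$2" \
        '$5 ~ /^[0-9]+$/ && $5 + 0 >= min + 0 && $5 + 0 <= max + 0'
}
```

For example, filter_by_size 1600000000 2000000000 < LISTFILE would keep entries between 1.6 and 2.0GB, reading GB as decimal gigabytes (an assumption), where LISTFILE stands for the output list generated by the command above.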

Note: the script has low overhead, but it may take a few minutes to run because it needs to read the files.

#2

This may no longer be necessary. The dryrun untar script now identifies these same objects under the multiple tar ARCH error condition it flags. If we run the dryrun untar script on the set of videos, we should get the same list.

#3

Moved to Jira.

#4

Status:active» Moved to JIRA
