Delete duplicate version of a datastream

Project:RUcore Jobs & Reports
Component:Job - staging
Category:task
Priority:normal
Assigned:dhoover
Status:closed
Description

The create high-resolution pyramidal TIFF job created two versions of each image in Fedora. This job is designed to fix that by deleting the duplicate version within each datastream ID. It does this by comparing the file checksums of all versions of a datastream ID and deleting the duplicates.
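
For illustration, a minimal sketch of that checksum comparison in PHP, assuming the version files for one datastream ID are readable on disk; the function name, paths, and use of md5_file() are hypothetical and not taken from the actual job script:

<?php
// Illustrative sketch only: group every version of one datastream ID by
// checksum, keep the first occurrence, and report the rest as duplicates.
// Function and path names are hypothetical, not from job.duplicates.php.

function findDuplicateVersions(array $versionPaths)
{
    $seen = array();        // checksum => first version path with that content
    $duplicates = array();

    foreach ($versionPaths as $path) {
        $checksum = md5_file($path);
        if ($checksum === false) {
            continue;       // unreadable file; the real job would count this as an error
        }
        if (isset($seen[$checksum])) {
            $duplicates[] = $path;      // identical content to an earlier version
        } else {
            $seen[$checksum] = $path;   // treat the first occurrence as the original
        }
    }

    return $duplicates;
}

// Example call with two hypothetical version files of a PTIF datastream;
// a dry run would only report these, a commit run would purge them from Fedora.
print_r(findDuplicateVersions(array(
    '/path/to/datastreamStore/PTIF.0',
    '/path/to/datastreamStore/PTIF.1',
)));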

Comments

#1

Ran the job on dev and test with deletion of duplicates enabled. The report from the test system is attached. 23.14GB of space was reclaimed.

#2

Component:Job - test» Job - staging
Assigned to:chadmills» dhoover

The job script and README have been added to @rep-test/mellon/cvsroot/ and are ready for a "dryrun" on staging.

@rep-test/mellon/cvsroot/job.duplicates.php
@rep-test/mellon/cvsroot/job.duplicates.README
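
Since a commit run has to purge the duplicate version from the repository, here is a hedged sketch of what that call might look like against a Fedora 3.x REST API (purgeDatastream limited to one version with matching startDT/endDT). The host, credentials, PID, and timestamp are placeholders, and the actual mechanism used by job.duplicates.php may differ:

<?php
// Hedged sketch: purge a single datastream version via Fedora 3.x's REST
// purgeDatastream call, limited to one version with startDT == endDT.
// All values below are placeholders, not taken from this ticket.

$fedoraBase = 'http://localhost:8080/fedora';      // placeholder host
$pid        = 'rutgers-lib:00000';                 // placeholder object PID
$dsId       = 'PTIF';                              // datastream ID with duplicates
$versionDT  = '2017-01-01T00:00:00.000Z';          // created date of the duplicate version

$url = sprintf(
    '%s/objects/%s/datastreams/%s?startDT=%s&endDT=%s&logMessage=%s',
    $fedoraBase,
    rawurlencode($pid),
    rawurlencode($dsId),
    rawurlencode($versionDT),
    rawurlencode($versionDT),
    rawurlencode('Purging duplicate PTIF version')
);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'DELETE');           // purgeDatastream is an HTTP DELETE
curl_setopt($ch, CURLOPT_USERPWD, 'fedoraAdmin:changeme');   // placeholder credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$purged = curl_exec($ch);
curl_close($ch);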

#3

Any update on this?

#4

Ran in dryrun mode on rep-staging this morning.

Report log is attached.

#5

Thanks, this looks good on staging. Please run the commit job on staging.

#6

Ran in real mode on rep-staging on 2/6/17 to delete the duplicates.

Summary report start
---------------------------------------------------------------
checksumCalculated: 416
datastreamHistoryError: 0
datastreamHistoryMatchError: 0
datastreamHistoryMatchFound: 193
datastreamPathError: 0
duplicateFound: 193
duplicateTotalSize: 1.51GB
originalFound: 223
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2017-02-06T14:58:39-05:00
Total time spent was 87.96 second(s) or 1.47 minute(s)
This process used 36 second(s) for its computations.
It spent 2 second(s) in system calls.
---------------------------------------------------------------
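
(Sanity check on the counters above: duplicateFound + originalFound = 193 + 223 = 416, which matches checksumCalculated, so every checksummed version was classified as either an original or a duplicate.)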

#7

Great, thanks. I spot checked a few and they look good. Unless you are seeing something I am not, please run in dry run mode on production.

Thanks,
Chad

#8

Ran on production in dryrun mode; full report attached.

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumCalculated: 9350
datastreamHistoryError: 0
datastreamHistoryMatchError: 0
datastreamHistoryMatchFound: 0
datastreamPathError: 0
duplicateFound: 4396
duplicateTotalSize: 27.43GB
originalFound: 4954
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2017-02-14T12:14:03-05:00
Total time spent was 1085.3 second(s) or 18.09 minute(s)
This process used 648 second(s) for its computations.
It spent 37 second(s) in system calls.
---------------------------------------------------------------

#9

Looks good. I noticed the duplicate count is off by 8 when compared to the original create job. I localized that to the rutgers-lib:41121 resource. It looks like 8 PTIF datastreams were manually purged after the create job was run, which is why the duplicate job can't find them. They were presumably purged for a reason, which I am not doubting. Please run the job on production with commit enabled.

Thanks,
Chad

#10

The following was run on rep-prod to remove the duplicate PTIFs:

nohup php -f job.duplicates.php > job-duplicates-log-real.txt &
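
(nohup keeps the job running if the session disconnects, & backgrounds it, and stdout is redirected to job-duplicates-log-real.txt; note this command does not redirect stderr.)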

Full report is attached, summary below:

---------------------------------------------------------------
Summary report start
---------------------------------------------------------------
checksumCalculated: 9360
datastreamHistoryError: 0
datastreamHistoryMatchError: 0
datastreamHistoryMatchFound: 4396
datastreamPathError: 0
duplicateFound: 4396
duplicateTotalSize: 27.43GB
originalFound: 4964
---------------------------------------------------------------
Summary report end
---------------------------------------------------------------

---------------------------------------------------------------
Script ended - 2017-02-15T16:05:03-05:00
Total time spent was 1428.36 second(s) or 23.81 minute(s)
This process used 652 second(s) for its computations.
It spent 34 second(s) in system calls.
---------------------------------------------------------------


#11

Status:active» closed

Thanks, it looks like it worked as expected. Closing for now.
