portalcron.php script didn't have a way to reindex every item in Fedora

Project:RUcore dlr/EDIT
Version:6.1
Component:Code
Category:bug report
Priority:critical
Assigned:chadmills
Status:closed
Description

The portalcron.php script appeared to be driven entirely by the portal, which supplied the IDs to re-index. There will be a need, possibly nightly, to use Fedora instead of the portal to determine what needs indexing.

This is necessary to perform any cleanup or catch objects that have slipped through the cracks.

I have added the functionality to start with a list of tokens provided by the Fedora 'objectPaths' table. I then get a list of the object IDs known to be indexed in Solr. Comparing the two lists shows whether anything is in the Solr index that is no longer in Fedora.

The script then removes those stale entries from the Solr index. Finally, all of the IDs in the list of tokens provided by Fedora are indexed.
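The comparison described above can be sketched as a set difference. This is a minimal Python illustration, not the PHP in portalcron.php; the function name plan_reindex and the sample IDs are hypothetical, and the real script reads its lists from the Fedora 'objectPaths' table and from Solr.

```python
def plan_reindex(fedora_ids, solr_ids):
    """Return (stale, to_index) for a Fedora-driven refresh.

    stale    -- IDs present in the Solr index but no longer in Fedora
    to_index -- every ID Fedora currently knows about
    """
    fedora = set(fedora_ids)
    solr = set(solr_ids)
    stale = sorted(solr - fedora)   # index entries to remove from Solr
    to_index = sorted(fedora)       # everything in Fedora gets (re)indexed
    return stale, to_index


# Hypothetical sample data: "demo:3" was deleted from Fedora but is still indexed.
stale, to_index = plan_reindex(["demo:1", "demo:2"], ["demo:2", "demo:3"])
print(stale)
print(to_index)
```

Here stale holds only "demo:3", while both Fedora IDs are queued for indexing.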

To start a Fedora-driven run, pass the parameter refreshType=fedora

Example: php -f portalcron.php refreshType=fedora

To run an index with the portal driving what is re-indexed, either pass no parameter (the default) or pass refreshType=portal

Example: php -f portalcron.php refreshType=portal
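The key=value parameter handling with a portal default might look roughly like the following. This is a Python sketch under assumptions, not the script's actual PHP; parse_options is a hypothetical name, and rejecting unknown refreshType values is an assumption about desirable behavior rather than something the script is known to do.

```python
def parse_options(argv):
    """Parse key=value command-line arguments into a dict.

    refreshType defaults to 'portal' when it is not supplied.
    """
    options = {"refreshType": "portal"}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if sep:  # ignore anything that is not key=value
            options[key] = value
    if options["refreshType"] not in ("portal", "fedora"):
        # Assumption: fail loudly on an unknown mode.
        raise ValueError("unknown refreshType: " + options["refreshType"])
    return options


print(parse_options([]))
print(parse_options(["refreshType=fedora"]))
```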

After running this manually and reviewing the results, I need to adjust the cron job to run it nightly.

Comments

#1

To cut down on re-indexing time I have added a modification-time parameter, 'since'. When used, only objects modified since the submitted time are re-indexed. Times can be submitted in a natural-language form.

Example: php -f portalcron.php refreshType=fedora since='24 hours ago'

This will re-index all Fedora objects that have been modified in the last 24 hours.
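How the script parses the 'since' value is not stated here. As an illustration only, the Python sketch below turns a phrase such as '24 hours ago' into a cutoff time and keeps the objects modified at or after it. The names cutoff_from and modified_since, the small set of supported units, and the (ID, modified-time) input shape are all assumptions.

```python
import re
from datetime import datetime, timedelta

# Map the singular unit in the phrase to a timedelta keyword.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def cutoff_from(since, now):
    """Convert '<n> <unit>[s] ago' into an absolute cutoff datetime."""
    match = re.fullmatch(r"\s*(\d+)\s+(minute|hour|day|week)s?\s+ago\s*", since)
    if not match:
        raise ValueError("unsupported 'since' value: %r" % since)
    amount = int(match.group(1))
    unit = _UNITS[match.group(2)]
    return now - timedelta(**{unit: amount})


def modified_since(objects, cutoff):
    """Keep the IDs of (object_id, last_modified) pairs at or after cutoff."""
    return [oid for oid, modified in objects if modified >= cutoff]


now = datetime(2012, 1, 2, 12, 0)
cutoff = cutoff_from("24 hours ago", now)
objects = [("demo:1", datetime(2012, 1, 2, 0, 0)),
           ("demo:2", datetime(2011, 12, 31, 0, 0))]
print(modified_since(objects, cutoff))
```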

#2

Moved the code change into cron. Need to talk with the sys admin about a special once-daily run of the cron with the following parameters.

php -f portalcron.php refreshType=fedora since='25 hours ago'

#3

Status:active» closed

Placed in cron now. A cron is run multiple times daily to update objects based on portal association changes. Then, once a day, a cron is run re-indexing all Fedora objects that have been modified in the last 25 hours. Finally, once a week, the cron is run re-indexing all Fedora objects.

#4

Status:closed» active

When performing a clean run with no items in the Solr index, the index cron was indexing every item twice. This was seen while running the cron for the first time on the staging system. Looking at the portalcron script, it was discovered that the list of IDs to index is not unique by default, contrary to what was expected. The list of IDs needs to be made unique in a separate step.

Simulating this on the development system yields the following from the beginning of the script:

============================================
16062 Fedora tokens were found in the Fedora database.
16062 Fedora objects will be reindexed.
0 objects were found in the Solr index.
16062 tokens were found that are in Fedora but not in the Solr index.
Indexing all 32124 Fedora objects...
1 of 32124 0% complete

Note that "Indexing all 32124 Fedora objects..." is double what is expected. With the IDs correctly made unique, the following is seen when running the script:

============================================
16062 Fedora tokens were found in the Fedora database.
16062 Fedora objects will be reindexed.
0 objects were found in the Solr index.
16062 tokens were found that are in Fedora but not in the Solr index.
Indexing all 16062 Fedora objects...
1 of 16062 0.01% complete
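The effect of making the ID list unique can be illustrated with a short Python sketch. This is not the script's PHP; unique_ids is a hypothetical helper, and the doubled list below simply simulates every ID appearing twice, as in the first run.

```python
def unique_ids(ids):
    """Drop duplicate IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


# Simulate the faulty list: each of 16062 IDs appears twice.
fedora_ids = ["demo:%d" % n for n in range(16062)]
doubled = fedora_ids + fedora_ids

print(len(doubled))              # 32124
print(len(unique_ids(doubled)))  # 16062
```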

To completely test the solution the indexes will need to be blown away on development.

#5

Status:active» closed

The Solr indexes were removed and the cron was run. As expected, 16,062 objects were indexed, not double that number. Creating a unique set of IDs solved the problem.
