Provide more granular diagnosis of failed ingest to the WMS ingest log

Project:RUcore dlr/EDIT
Category:bug report

This is a joint assignment to YY and JAT. It begins in dlr/EDIT.

When an object fails ingest because of something that happens once it arrives in Fedora, the user (WMS user) does not get sufficient information to diagnose the problem. Fedora must provide more detailed information about what has happened. This information must be passed back to the WMS and reflected in the ingest log.

Typically, the error report is something like "Error add index to Solr/Lucene: ..." (see below). In the past, the real cause has turned out to be a file size problem, the EZID server being down, a mismatch between the file type and the file extension, the wrong smap (that will no longer be a problem, but I'm just using it as an example of what this generic error message disguises), technical metadata issues, and *sometimes* an actual indexing issue. Please pass back more information from Fedora so that we don't unnecessarily use up developer time for user support.

Ingesting ... OK.

Indexing Solr/Lucene search engine ...
Error add index to Solr/Lucene: Error with add action for rutgers-lib:201344: the requested object cannot be found.

Error purging ingested record rutgers-lib:201344: Soap fault while purging: 0-- Caused by: no path in db registry for [rutgers-lib:201344]



This will be a useful addition in my opinion.


Part of the solution to this will involve a change to the Solr indexer so that it parses the XML-1 files before attempting to include them. If an XML file fails the parse, it will be excluded and a warning given as part of the return. This will identify XML files that need attention but allow the WMS to preserve an otherwise successful ingest. I will create a separate issue for this.
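
For illustration only, the parse-then-skip step could look something like the sketch below. It is written in Python rather than in the indexer's own code, and the function name check_xml1 is hypothetical; the warning wording is borrowed from the indexer output quoted later in this thread.

```python
import xml.etree.ElementTree as ET


def check_xml1(xml_text):
    """Check one XML-1 datastream for well-formedness (hypothetical helper).

    Returns (ok, warning). When the text does not parse, ok is False and
    warning carries a message that can be passed back with the response.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, f"WARNING: an ERROR was found in the XML-1 Datastream: {exc}"
    return True, None
```

A caller would index the datastream only when ok is True, and otherwise append the warning to the return while letting the rest of the add action go ahead.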


Update on this issue.

The feature that parses XML-1 files and skips over any that fail, with a warning, is now ready on rep-dev. There are now three possible responses rather than two when indexing is attempted.
This is output for bad XML-1:
<responses><response actiontype="add"><status>WARNING</status><message>Success with add action for rutgers-lib:202138...WARNING: an ERROR was found in the XML-1 Datastream...</message></response><response actiontype="commit"><status>WARNING</status><message>Success with commit action for rutgers-lib:202138...WARNING: an ERROR was found in the XML-1 Datastream...</message></response></responses>
but this for normal objects:
<responses><response actiontype="add"><status>OK</status><message>Success with add action for rutgers-lib:202147...</message></response><response actiontype="commit"><status>OK</status><message>Success with commit action for rutgers-lib:202147...</message></response></responses>
and this for non-existent objects:
<responses><response actiontype="add"><status>Failed</status><message>Error with add action for rutgers-lib:2021478765: the requested object cannot be found.</message></response></responses>
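
On the receiving side, the three responses above could be reduced to a single status for the ingest log along the lines of this sketch. It is Python for illustration only, not the WMS code; the function names are hypothetical, and treating a response with no entries as a failure is an assumption.

```python
import xml.etree.ElementTree as ET


def summarize_index_response(response_xml):
    """Return a list of (actiontype, status, message) tuples (hypothetical helper)."""
    root = ET.fromstring(response_xml)
    return [
        (
            resp.get("actiontype"),
            resp.findtext("status", default=""),
            resp.findtext("message", default=""),
        )
        for resp in root.findall("response")
    ]


def overall_status(response_xml):
    """Collapse the per-action statuses into OK, WARNING, or Failed."""
    statuses = [status for _, status, _ in summarize_index_response(response_xml)]
    if not statuses or "Failed" in statuses:
        # Assumption: a response with no entries is treated as a failure.
        return "Failed"
    if "WARNING" in statuses:
        return "WARNING"
    return "OK"
```

With this shape the ingest log can show the message text for any action whose status is not OK, instead of one generic error line.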

The last message is typically generated when the indexer tries to run on the PID of an object that failed ingest for some reason. Our investigations suggest that these failures are usually caused by parsing errors such as duplicate datastream IDs. Such IDs escape detection when parsers do not validate against the FOXML schema but simply check for well-formedness. To validate FOXML reliably, the FOXML schema must be associated in advance in the header of the XML file, e.g.:
<foxml:digitalObject VERSION="1.1" PID="rutgers-lib:46902" FEDORA_URI="info:fedora/rutgers-lib:46902" xsi:schemaLocation="info:fedora/fedora-system:def/foxml#">
Note the external schema location that Fedora itself outputs when an object is exported.
The FOXML schema uses idType for a number of elements including foxml:datastream. It is defined as follows:
<xsd:simpleType name="idType"><xsd:restriction base="xsd:ID"><xsd:maxLength value="64"/></xsd:restriction></xsd:simpleType>
The lower-level xsd:ID type is described as follows:
"The constraint added by this datatype, beyond the xsd:NCName datatype from which it is derived, is that the values of all the attributes and elements that have an ID datatype in a document must be unique."

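Where full schema validation is not available, the uniqueness constraint quoted above can be approximated before ingest with a check like the following. This is a rough Python sketch, not part of Fedora or the WMS; the function name is hypothetical, and it assumes that every ID attribute on an element in the FOXML namespace is of idType.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FOXML_NS = "info:fedora/fedora-system:def/foxml#"


def duplicate_foxml_ids(foxml_text):
    """Return a sorted list of ID values used more than once (hypothetical helper).

    Assumption: every ID attribute on a FOXML-namespace element is an idType,
    so all of them must be unique within the document.
    """
    root = ET.fromstring(foxml_text)
    prefix = "{" + FOXML_NS + "}"
    ids = [
        el.get("ID")
        for el in root.iter()
        if el.tag.startswith(prefix) and el.get("ID") is not None
    ]
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)
```

A non-empty result could be reported in the ingest log by name, which is more useful to the WMS user than "the requested object cannot be found". This does not replace validating against the schema; it only catches the duplicate-ID case.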

Status:active» test

This is similar to the other issue where I attached a test XML-1 file. I'll attach the same file here just in case.



You may want to talk to Yang and make sure he understands what he has to do.



Status:test» fixed

There is a bug entry in the WMS project for this, and it was tested by Jie and Jeffery. They reported that it worked as expected.


Status:fixed» closed
