Normalize spacing, punctuation, and capitalization in searches

Project:RUcore SOLR Searching and Indexing
Version:8.x
Component:User interface
Category:task
Priority:normal
Assigned:mbweber
Status:postponed
Description

Change the search logic so that it does not matter whether I enter
"phd" or "ph.d." or "Ph.D." or "ph d" or "ph. d." or "Ph. D."
Any/all of those searches should return the same search result.

Comments

#1

This might not have been clear: this normalizing should apply to all searches, not just the one example given here.

#2

Version:7-x» 7.4

#3

Version:7.4» 7.5

This is definitely one that will require a VM where we have control over all aspects of the Solr server.

#4

Version:7.5» 7.6

Awaiting Solr VM.

#5

If you strongly believe that without a VM, you can't move forward with this, please bring this to sw_arch for discussion.

#6

There is a whole set of experimental Solr issues that need the ability to stop and restart the Solr server.

#7

I think we might try adding this line to the synonyms.txt file in solr/conf:
phd, ph.d., ph d, ph. d.
That would make Solr search for all of these if any one of them is typed in a user query. (The queries are already lowercased, so we don't need the capitalized variants.) I cannot test this yet, as I am unable to write to the synonyms.txt file or other Solr files on rep-dev.
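
To make the intended effect concrete, here is a minimal Python sketch of what one line of equivalent synonyms does to a query. It is only an illustration, not Solr's synonym filter: the helper names `parse_synonym_line` and `expand` are hypothetical, and the sketch matches the whole query string, whereas Solr applies synonyms to tokens after tokenization (so multi-word entries such as "ph d" may behave differently there).

```python
def parse_synonym_line(line):
    """Split one comma-separated line of equivalent terms into a list."""
    return [term.strip() for term in line.split(",") if term.strip()]


def expand(query, groups):
    """Return every equivalent term for a query, or the query itself if none match.

    The query is lowercased first, mirroring the note that queries are
    already lowercased before synonyms are applied.
    """
    q = query.strip().lower()
    for group in groups:
        if q in group:
            return list(group)
    return [q]


groups = [parse_synonym_line("phd, ph.d., ph d, ph. d.")]

for user_query in ["Ph.D.", "ph d", "masters"]:
    print(user_query, "->", expand(user_query, groups))
```

A query that appears in the group is searched as all four variants; anything else ("masters" here) passes through unchanged, so every new abbreviation would need its own line.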

#8

The point is not to make this particular search "normalized", but to normalize ALL searches in the repository. I don't want punctuation or spacing to matter for anything.
M. S. - M.S. - MS
M. A. - M.A. - MA
it's - its - I.T.S.
master's - masters
C.S. Lewis - cslewis - C. S. Lewis

Just as capitalization does not matter, neither should spacing or punctuation. This should be a "rule" -- not a synonym file.
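
As a rough illustration of such a rule, the Python sketch below reduces a search string to a comparison key by case-folding it and dropping every character that is not a letter or digit. The function name `normalize` is hypothetical, and this is the most literal reading of the request applied to a whole short query; it is not how Solr analyzes text, and a real index would have to decide how far to collapse spaces in longer multi-word queries.

```python
def normalize(query: str) -> str:
    """Case-fold the query and keep only letters and digits."""
    return "".join(ch for ch in query.casefold() if ch.isalnum())


examples = [
    ["M. S.", "M.S.", "MS"],
    ["M. A.", "M.A.", "MA"],
    ["it's", "its", "I.T.S."],
    ["master's", "masters"],
    ["C.S. Lewis", "cslewis", "C. S. Lewis"],
    ["phd", "ph.d.", "Ph.D.", "ph d", "ph. d.", "Ph. D."],
]

for variants in examples:
    keys = {normalize(v) for v in variants}
    print(variants, "->", keys)
```

Under this rule each group above collapses to a single key, which also means that "it's", "its", and "I.T.S." become indistinguishable.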

#9

The phd example was trickier than these others, which is why I thought of using a synonym. These new ones could be handled more easily by some tweaks to (some of) the tokenizers in the schema.xml file:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

- Tokenizes at whitespaces

- Stop words are removed

- Word delimiters are used to generate word tokens.

generateWordParts=1 => wi-fi will generate wi and fi

generateNumberParts = 1 => 3.5 will generate 3 and 5

catenateWords=1 => wi-fi will generate wi, fi and wifi

catenateNumbers = 1 => 3.5 will generate 3, 5 and 35

catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

splitOnCaseChange=1 => camelCase will generate camel and case.

- All text is changed to lower case.

- The Snowball porter stemmer will convert running to “run”

Both methods can be used in combination to handle the general cases as well as the oddball ones.
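
The option descriptions above can be checked against a small Python simulation. This is a simplified model written for illustration, not Solr's WordDelimiterFilter: the function `word_delimiter` and its parameter names are hypothetical, token positions are ignored, generateWordParts and generateNumberParts are treated as always on, and lowercasing is folded into the same step.

```python
import re
from itertools import groupby


def word_delimiter(token, catenate_words=True, catenate_numbers=True,
                   catenate_all=False, split_on_case_change=True):
    """Return the lowercased sub-tokens a word-delimiter style filter might emit."""
    if split_on_case_change:
        # Insert a break at each lower-to-upper transition (camelCase -> camel Case).
        token = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    # Word parts and number parts: runs of letters or runs of digits.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    tokens = list(parts)
    # Concatenate adjacent runs of word parts and of number parts.
    for is_number, run in groupby(parts, key=str.isdigit):
        run = list(run)
        if len(run) > 1 and (catenate_numbers if is_number else catenate_words):
            tokens.append("".join(run))
    if catenate_all and len(parts) > 1:
        tokens.append("".join(parts))
    # Lowercase and drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(t.lower() for t in tokens))


print(word_delimiter("wi-fi"))                        # wi, fi, wifi
print(word_delimiter("3.5"))                          # 3, 5, 35
print(word_delimiter("wi-fi-35", catenate_all=True))  # adds wifi35
print(word_delimiter("Ph.D."))                        # ph, d, phd
```

In this model "Ph.D." yields ph, d, and phd, so it shares tokens with both "phd" and "ph d". Whether the real analyzer chain behaves the same way depends on which field types carry the filter and whether it runs at both index and query time.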

#10

Looks like "catenateAll" might be useful in this context. See how it would work for periods with text, e.g., Ph.D. and Ph. D.

#11

Version:7.6» 7.7
Assigned to:triggs» dhoover

I would like to begin experimenting with these on rep-dev, but in order to do so I will need permission to edit the Solr schema.xml file and to restart the Solr server on the dev VM.

#12

Assigned to:dhoover» triggs

Jeffery

Please call me for the password of the user that runs solr
and owns all the files under /solr on rep-dev

To start it use:
su as the solr user
cd /solr
./solr_startup

The process will be -> /solr/java/bin/java -jar start.jar

To stop it
Become the solr user
kill the PID

#13

Version:7.7» 8.1

This is related to the other stemming issue that I moved to 8.1. They should be worked on together.

#14

Jeffery,

What is the status of this?

#15

Status:active» test

I think I have a good part of this working through changes to the schema.xml, though probably not every one of Rhonda's examples.

#16

Assigned to:triggs» rmarker

#17

Version:8.1» 8.x
Assigned to:rmarker» mbweber
Status:test» postponed

It appears that we will not be able to normalize spacing, punctuation, and capitalization to search by degree. Assigning this to the Metadata Working Group to complete these three tasks:

1. Determine the standard for recording degrees
2. Manually update ETD records to conform to that standard
3. Draft brief (one or two sentences) "help" language for the "Help" tab in RUetd to explain how to search on degree.

Searching in test:
phd = 0
PhD = 0
Ph. D. = 19
Ph D = 19
ph d = 19
Ph.D. = 585
Ph.D = 585
ph.d = 585
