PHP Fatal error: Allowed memory size :: class.string.cleaner.php

Project:RUcore SOLR Searching and Indexing
Version:8.1
Component:Code
Category:bug report
Priority:normal
Assigned:chadmills
Status:closed
Description

Reported from logs
------------------------------------------------------------
[10-May-2016 17:20:28 America/New_York] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 72 bytes) in /mellon/includes/classes/php/string_cleaner/class.string.cleaner.php on line 201

In php.ini I doubled the memory limit:

;memory_limit = 256M ; Maximum amount of memory a script may consume (128MB)
memory_limit = 512M ; Maximum amount of memory a script may consume (128MB)

and restarted Apache

Possible fix (from emails)
-----------------------------------------------------------------------------------------------------------------------------------------------------------

I did add string.cleaner to the code and the portalcron is now running on rep-test in the late afternoon each day. I think I'll try being a little more conservative with it if it uses so much memory. Thanks for letting me know about this.

I've changed my call so that string.cleaner only gets applied to a full datastream when it is a MODS section. We can wait to see if the problem persists when the new cron runs at 4 this afternoon, or I can run the cron now. I think I will - we can watch for it over the next hour or so as the cron runs.

Jeffery
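The guard described above (applying string.cleaner to a full datastream only when it is a MODS section) might look roughly like this. This is a hypothetical sketch: the real string_cleaner API isn't shown in this issue, so the cleaner is passed in as a callable stand-in.

```php
<?php
// Hypothetical sketch: only run the (memory-hungry) cleaner on full
// datastreams when the datastream is a MODS section. $clean stands in
// for the real string_cleaner call, whose API isn't quoted here.
function clean_if_mods(string $dsid, string $content, callable $clean): string
{
    return $dsid === 'MODS' ? $clean($content) : $content;
}
```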

Comments

#1

So if the code was changed to do something different and use less memory, should the memory limit be set back to 128M?

Has anyone ever used this PHP function to check the memory usage of a script?

memory_get_peak_usage
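For reference, a minimal way to use it from a script such as the indexing cron (memory_get_peak_usage, register_shutdown_function, and error_log are real PHP built-ins; the wrapper function and message format are just an example):

```php
<?php
// Format the script's peak memory usage for logging.
function format_peak_memory(): string
{
    // true = bytes actually allocated from the system for this script,
    // rather than the smaller figure reported by PHP's internal allocator.
    $peak = memory_get_peak_usage(true);
    return sprintf('Peak memory: %.1f MB', $peak / 1048576);
}

// Shutdown functions also run after many fatal errors, which makes this
// handy for seeing how close a cron run gets to memory_limit.
register_shutdown_function(function () {
    error_log(format_peak_memory());
});
```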

#2

Dave,

Yes please move it back to 128MB.

I have used "memory_get_peak_usage" during dev. I am waiting to see if Jeffery can tell me what he was throwing at the class that caused the error.

Thanks,
Chad

#3

Sorry, I misspoke: it was at 256M and I doubled it to 512M, so I reset it to 256M.

#4

I think the issue is still happening:
[10-May-2016 17:20:28 America/New_York] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 72 bytes) in /mellon/includes/classes/php/string_cleaner/class.string.cleaner.php on line 201
[11-May-2016 11:48:22 America/New_York] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 72 bytes) in /mellon/includes/classes/php/string_cleaner/class.string.cleaner.php on line 201

The first is yesterday's 4 o'clock afternoon cron and the second is the run I started by hand a bit after 10 this morning. In both cases it happens near the end of the run. I wonder if we should put the memory limit back up and see if the 4 o'clock cron runs without the problem.

#5

It would be nice to know what object(s) and datastream(s) are causing this. Without knowing that, it is hard to say whether a configuration change will suffice. This may be either a usage issue or a code issue with the class.

#6

I know, but it's hard to tell from the error log. I've tried grepping some of the string fragments nearby in the log to see if it might turn up a discrete id, but none so far.

#7

It may have been this object:
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> egrep -lr "abstract 1: type = abstract" /repository/data/objects/*
/repository/data/objects/2014/0402/16/55/rutgers-lib_201949
It does break the indexer when run separately.

#8

for 201949 I processed the following:

XML-1 datastream ========== Processed 1402 characters.
------------------------------------------------------
Cleaned 0 character(s) in this document.
Cleansed 0 control code(s) from this document.
Used smart character replacement? TRUE
------------------------------------------------------

MODS datastream =========== Processed 1415 characters.
------------------------------------------------------
Cleaned 10 character(s) in this document.
Cleansed 0 control code(s) from this document.
Used smart character replacement? TRUE
------------------------------------------------------

Both executed fine.

#9

Status:test» active

Check out resource 202889. Running the XML-1 datastream through the string cleaner throws the error.

faultCode: 0, faultString: Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 32 bytes) in /mellon/includes/classes/php/string_cleaner/class.string.cleaner.php on line 201

There are three XML-* datastreams. I have only looked at the first one and there looks to be an embedded PDF in the XML datastream! I think Yang needs to be looped in to decide if that happened via ABBYY or not.

XML-1 datastream is dated :: 2015-04-03T13:42:04.000Z
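Since an XML datastream carrying an embedded base64 PDF can be far larger than ordinary metadata, one way to keep such streams from exhausting memory would be a size guard ahead of the cleaning step. This is only an illustrative sketch, not part of the existing code; the function name and the 5 MB cap are assumptions.

```php
<?php
// Illustrative guard: refuse to clean any datastream over a size cap,
// since an XML stream with an embedded base64 PDF can run to many MB.
const MAX_CLEANABLE_BYTES = 5 * 1024 * 1024; // 5 MB, an arbitrary example

function is_cleanable(string $content): bool
{
    return strlen($content) <= MAX_CLEANABLE_BYTES;
}
```

A caller could then log and skip (or truncate) oversized datastreams instead of running the cleaner on them.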

#10

The suspect object works for reindexing. We'll see if the cron creates the memory error again.

#11

Jeffery,

We still need to figure out the origin of these XML-1 datastreams with the embedded base64 PDF. The header of the XML document says:

<!DOCTYPE article PUBLIC "-//PKP//OJS Articles and Issues XML//EN" "http://pkp.sfu.ca/ojs/dtds/2.3/native.dtd">

So I assume you may know something about this item.

#12

Hmmm. That looks like a journal archive datastream. It shouldn't behave like that, but then this is new software. In any event, it's a good target to look at. Thanks!

#13

Do you know what the PID is for this one? I suspect it could be an early test object for the Journal archive where we tried to do a full text of the journal export XML. That idea was tossed aside and files like that should no longer be indexed, though there might be a straggler XML-1 or two on rep-test still. If we can find those, we should just purge those datastreams.

#14

Comment #9.

Check out resource 202889. Running the XML-1 datastream through the string cleaner throws the error.

#15

Great! I'll check it and some others. It appears in this list of possible suspects (XML-1 files that contain the string OJS) that I just grepped:
/repository/data/datastreams/2009/0428/12/49/rutgers-lib_17618+XML-1+XML1.0
/repository/data/datastreams/2009/0428/13/01/rutgers-lib_18589+XML-1+XML1.0
/repository/data/datastreams/2009/0428/13/28/rutgers-lib_19526+XML-1+XML-1.0
/repository/data/datastreams/2009/0428/13/58/rutgers-lib_19983+XML-1+XML1.0
/repository/data/datastreams/2009/0428/14/01/rutgers-lib_20004+XML-1+XML1.0
/repository/data/datastreams/2009/0428/15/51/rutgers-lib_23985+XML-1+XML-1.0
/repository/data/datastreams/2009/0428/15/55/rutgers-lib_24231+XML-1+XML-1.0
/repository/data/datastreams/2011/0215/14/45/rutgers-lib_24432+XML-1+XML-1.0
/repository/data/datastreams/2011/0215/14/45/rutgers-lib_24434+XML-1+XML-1.0
/repository/data/datastreams/2011/0215/14/45/rutgers-lib_24435+XML-1+XML-1.0
/repository/data/datastreams/2011/0215/14/45/rutgers-lib_24483+XML-1+XML-1.0
/repository/data/datastreams/2011/0215/14/45/rutgers-lib_24484+XML-1+XML-1.0
/repository/data/datastreams/2015/0403/13/42/rutgers-lib_202889+XML-1+XML-1.0
/repository/data/datastreams/2015/1120/14/33/rutgers-lib_203611+XML-11+XML-11.0
/repository/data/datastreams/2015/1203/14/31/rutgers-lib_203704+XML-10+XML-10.0
/repository/data/datastreams/2015/1203/14/31/rutgers-lib_203704+XML-11+XML-11.0

#16

How did those XML-1 datastreams get in? Were they through a batch import into WMS then Fedora or directly into Fedora?

#17

I just cleaned out the XML-1 and indexed 202889 without a problem. The XML-1 was an early (April 2015) experiment from when we were modelling the archive. We put it in deliberately so that it would generate some searchable full text of the metadata for all the articles. We soon decided to have YuHung add the metadata into MODS and skip the full text, but a few of these were left behind. The perl program never had an issue with it, which is why it was so quiet until now. They never went into production like this (and don't occur on rep-dev).

#18

Assigned to:triggs» chadmills
Status:active» test

These were the trouble files:
/repository/data/datastreams/2015/0403/13/42/rutgers-lib_202889+XML-1+XML-1.0
/repository/data/datastreams/2015/1120/14/33/rutgers-lib_203611+XML-11+XML-11.0
/repository/data/datastreams/2015/1203/14/31/rutgers-lib_203704+XML-10+XML-10.0
/repository/data/datastreams/2015/1203/14/31/rutgers-lib_203704+XML-11+XML-11.0

I got rid of the one XML-1 and tightened the script's search pattern for XML\-1\+ so it ignores XML-11. I was able to reindex all of these. The next cron run at 4 will (hopefully) not have the memory issues.
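The tightened match works because the `+` after the datastream ID acts as a delimiter in these filenames, so anchoring on it distinguishes XML-1 from XML-11. Shown here with PHP's preg_match for illustration; the original script's exact syntax isn't quoted in this issue.

```php
<?php
// The '+' after 'XML-1' in the datastream filename is a delimiter, so
// requiring it keeps 'XML-1+' matches from also catching 'XML-11+'.
function is_xml1_datastream(string $path): bool
{
    return preg_match('/XML-1\+/', $path) === 1;
}
```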

#19

I haven't seen any new memory errors in the log today. We'll see what happens with today's full refresh starting about now.

#20

Status:test» fixed

#21

Status:fixed» closed
