How do I use hexpat's lazy parsing to process an XML document?

When I searched for a Haskell library that can handle large (300–1000 MB) XML files, I came across hexpat. There is an example on the Haskell Wiki with the comment:

-- Process document before handling error, so we get lazy processing.

For testing purposes, I redirected the output to /dev/null and fed it a 300 MB file. Memory consumption kept rising until I had to kill the process.

Then I removed the error handling from the process function:

import Text.XML.Expat.Tree
import Text.XML.Expat.Format
import qualified Data.ByteString.Lazy as L
import System.IO

process :: String -> IO ()
process filename = do
    inputText <- L.readFile filename
    let (xml, mErr) = parse defaultParseOptions inputText :: (UNode String, Maybe XMLParseError)

    hFile <- openFile "/dev/null" WriteMode
    L.hPutStr hFile $ format xml
    hClose hFile

    return ()

As a result, the function now runs in constant memory. Why does the error handling cause so much memory consumption?

As far as I know, xml and mErr are two separate unevaluated thunks after the call to parse. Does formatting xml evaluate it while mErr keeps the whole tree alive? If so, is there a way to handle errors while staying in constant memory?

http://www.haskell.org/haskellwiki/Hexpat/
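The retention behaviour the question asks about can be reproduced without hexpat at all. Below is a minimal sketch (lazyParse is a made-up stand-in, not hexpat's API) of a lazy parser that produces its output incrementally but can only decide the error value after the entire input has been examined: consuming the output is lazy, while inspecting the error forces everything.

```haskell
import Data.Char (isAscii, toUpper)

-- Toy stand-in for a lazy parser (hypothetical, not hexpat's API):
-- the output is produced incrementally, but the error can only be
-- decided once the entire input has been examined.
lazyParse :: String -> (String, Maybe String)
lazyParse s =
    (map toUpper s, if all isAscii s then Nothing else Just "non-ASCII input")

main :: IO ()
main = do
    -- Consuming a prefix of the output does not force the rest of the
    -- input; this works even though the tail of the input is undefined.
    let (out, _mErr) = lazyParse ("abc" ++ undefined)
    putStrLn (take 3 out)              -- prints "ABC"
    -- Pattern matching on the error, by contrast, must consume the
    -- whole input first; if the output has not been written out yet,
    -- everything evaluated along the way stays in memory.
    case snd (lazyParse "abc") of
        Nothing -> putStrLn "no error"
        Just e  -> putStrLn e
```

This mirrors the original wiki example: checking mErr before writing out the formatted document forces the whole parse while the document is still live, hence the growing heap.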

I can't speak for hexpat specifically, but in general, error handling forces you to read the entire file into memory: if you only want to produce output when there is no error anywhere in the input, then you have to read the entire input before generating any output.

As I said, I don't really know hexpat, but with xml-conduit you could do something like this:

try $ runResourceT $ parseFile def inputFile $$ renderBytes def =$ sinkFile outputFile

It will run in constant memory, and if there is any error during processing it will throw an exception (which try will catch). The downside is that the output file may be left corrupted. My recommendation would be to write to a temporary file instead; once the whole process completes, move the temporary file to the output file, and on any exception simply delete the temporary file.
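The temporary-file pattern described above can be sketched in plain Haskell. The writeViaTemp name is mine, and I assume the temporary file lives next to the final one; the writeFile call stands in for whatever streaming pipeline actually produces the output.

```haskell
import Control.Exception (onException)
import System.Directory (removeFile, renameFile)

-- Sketch of the suggested pattern: write to a temporary file and only
-- move it into place once the whole pipeline has finished. On any
-- exception the temporary file is deleted, so the real output path is
-- never left holding a half-written document.
writeViaTemp :: FilePath -> (FilePath -> IO ()) -> IO ()
writeViaTemp outPath produce = do
    let tmp = outPath ++ ".tmp"  -- assumption: same directory as the output
    -- (if produce failed before creating tmp, removeFile itself would
    -- throw; a robust version would ignore that secondary failure)
    produce tmp `onException` removeFile tmp
    renameFile tmp outPath

main :: IO ()
main = writeViaTemp "output.xml" $ \tmp ->
    -- stand-in for the real streaming pipeline, e.g. the xml-conduit
    -- line above with tmp in place of outputFile
    writeFile tmp "<doc/>"
```

On a POSIX filesystem the final renameFile within one directory is atomic, so readers of the output path only ever see either the old file or the complete new one.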

