Josef Baker: Analysis of Mathematical Formulae within PDF Documents
The exponential growth of the internet over recent years has given users access to a wealth of information, usually easily accessible through general search engines such as Google, or specialised services like IEEE Xplore. However, mathematical content is far more difficult to find. Because of the many different styles, formats and conventions used, accurate automatic indexing cannot currently be carried out, so search engines have to rely on keywords or other text within the document, which often leads to poor precision or recall. Even when the correct mathematics is located it cannot be manipulated, as it is often just a collection of images or unstructured symbols.
A solution to this would be to mark up mathematics with a language such as MathML. However, manually marking up years of legacy documents would be almost impossible, and automatic markup is very difficult, with current software lacking accuracy. This is because Optical Character Recognition software copes poorly when presented with mathematics (compared to text), as it has difficulty recognising the many different mathematical symbols and the often subtle but important changes of font style. As a result, parsing algorithms need large amounts of processing to determine the accuracy of their input. They also have to make use of heuristics which, if incorrect, lead to erroneous output.
I am developing a system to parse the popular and pervasive PDF format, and output mathematical formulae with appropriate markup. I first use optical analysis of a rasterized version of the file to obtain bounding box information, then access the PDF directly, extracting symbol and baseline information. This provides far more information to parsing algorithms than a traditional OCR approach, allowing them to rely on analytical rather than heuristic techniques. The additional information could also provide the opportunity for new techniques to be developed.
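To illustrate why baseline information is useful to a parser, the following is a minimal sketch in Python, not the system described here. It assumes each symbol has already been extracted with its bounding box, baseline and font size in PDF coordinates (y increasing upwards). The `Symbol` class, the `relation` and `to_linear` functions, and the tolerance of 0.15 are all hypothetical names and values chosen for illustration. Given that input, superscripts and subscripts can be identified by comparing baselines directly, rather than by guessing from glyph shapes and sizes as an OCR-based approach must.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """One extracted glyph (hypothetical structure, for illustration only)."""
    name: str        # symbol identity, e.g. taken from the PDF font encoding
    x0: float        # bounding box, left
    y0: float        # bounding box, bottom
    x1: float        # bounding box, right
    y1: float        # bounding box, top
    baseline: float  # y coordinate of the baseline the glyph sits on
    size: float      # font size in points


def relation(base: Symbol, cur: Symbol, tol: float = 0.15) -> str:
    """Classify cur relative to base from the baseline shift alone.

    The shift is normalised by the base symbol's font size; tol is an
    assumed tolerance, not a value taken from the original work.
    """
    shift = (cur.baseline - base.baseline) / base.size
    if shift > tol:
        return "super"
    if shift < -tol:
        return "sub"
    return "inline"


def to_linear(symbols: list[Symbol]) -> str:
    """Build a LaTeX-like string for a single line of symbols.

    Only one level of script attached to the most recent inline symbol
    is handled; nested scripts, fractions and similar are out of scope.
    """
    out: list[str] = []
    base = None
    for sym in sorted(symbols, key=lambda s: s.x0):
        if base is None:
            out.append(sym.name)
            base = sym
            continue
        rel = relation(base, sym)
        if rel == "super":
            out.append("^{" + sym.name + "}")
        elif rel == "sub":
            out.append("_{" + sym.name + "}")
        else:
            out.append(sym.name)
            base = sym  # a new inline symbol becomes the reference baseline
    return "".join(out)


if __name__ == "__main__":
    line = [
        Symbol("x", 10, 98, 16, 106, baseline=100, size=10),
        Symbol("2", 16, 103, 20, 109, baseline=104, size=7),
        Symbol("+", 22, 99, 28, 105, baseline=100, size=10),
        Symbol("y", 30, 97, 36, 105, baseline=100, size=10),
        Symbol("i", 36, 96, 39, 102, baseline=98, size=7),
    ]
    print(to_linear(line))
```

Because the decision depends on a measured baseline rather than an estimate derived from pixel data, the threshold test above is closer to an analytical rule than a heuristic, which is the distinction drawn in the preceding paragraph.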
Further work will involve using these new techniques to improve existing Formula Recognition techniques, and adding semantic information to the output to help improve the abilities of search algorithms.