Background

  • I recently worked on some features related to topic recognition. The OCR results (LaTeX) need to be presented to the user, who can modify them and submit them. In other words, we need a complete rich text formula editor that supports editing operations, copy, and paste.

Implementation scheme

  • In the early stage, we did secondary development on top of a closed-source editor. That editor cannot display LaTeX; it only supports MathML. So to integrate it we have to convert between the two formats, and most of the problems we ran into later were in this format conversion.
  • The main process: the OCR result (LaTeX) is converted to MathML, the editor displays and edits the MathML, and the edited MathML is converted back to LaTeX before it is submitted. The tools involved are:


  • LaTeX-to-MathML tool

www.mathjax.org/

  • MathML-to-LaTeX tool

Git.yy.com/webs/efox/m…

  • Formula rich text editor and renderer

www.wiris.com/editor/demo…

LaTeX

  • LaTeX is what the OCR step returns to the front end, and all subsequent display and editing is based on it.
  • LaTeX is used because it is cross-platform. MathML is less widely supported.

LaTeX is used worldwide for scientific documents, books, and many other forms of publishing. It not only produces beautifully typeset documents, but also lets users handle the more complex parts of typesetting very quickly, such as entering mathematical formulas, creating tables of contents, citing references and creating bibliographies, and keeping a consistent layout across all chapters. Given the number of open source packages available, the possibilities for LaTeX are endless. These packages allow users to do much more with LaTeX, such as adding footnotes, drawing charts, creating tables, and so on.

A LaTeX string and its rendered result:
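The snippet below is an illustrative stand-in, not the figure from the original post. It shows the kind of LaTeX source an OCR step might return; a LaTeX renderer displays it as the familiar quadratic formula.

```latex
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
```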

MathML

MathML stands for "Mathematical Markup Language", an XML-based markup language used to display mathematical formulas on web pages and in some software.

A MathML fragment and its rendered result:
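As an illustrative stand-in for the missing figure (not the author's original example), here is the MathML for the fraction a/b, which in LaTeX would be `\frac{a}{b}`. The small Python sketch parses it with the standard library and prints the element tree, which makes the leaf and non-leaf structure visible. The helper names `local_name` and `outline` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative MathML for the fraction a/b (LaTeX: \frac{a}{b}).
MATHML = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mfrac>
    <mi>a</mi>
    <mi>b</mi>
  </mfrac>
</math>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return one indented line per element: tag name plus its text, if any."""
    text = (element.text or "").strip()
    line = "  " * depth + local_name(element) + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(MATHML.strip()))
print("\n".join(tree_lines))
```

Here `mfrac` is a non-leaf node with two children, and each `mi` is a leaf that carries its own text.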

Implementing mathml2latex

  • Parse leaf nodes and non-leaf nodes separately.
  • Leaf nodes are relatively simple: return the corresponding LaTeX syntax directly, based on the tag name.
  • Non-leaf nodes pick a template according to the tag type and fill it with the parsed content of their children (parsing the children is recursive: each child is parsed the same way as its parent).

For example:
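The leaf and non-leaf split described above can be sketched as follows. This is a minimal illustration, not the author's mathml2latex tool: the names `LEAF_TAGS`, `TEMPLATES`, and `to_latex` are hypothetical, only a few tags are covered, and parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Leaf tags: in this sketch the element's text is used as the LaTeX output.
LEAF_TAGS = {"mi", "mn", "mo", "mtext"}

# Non-leaf tags: a template whose slots are filled with the converted children.
# Doubled braces are literal braces in str.format.
TEMPLATES = {
    "mfrac": r"\frac{{{0}}}{{{1}}}",
    "msup": "{0}^{{{1}}}",
    "msub": "{0}_{{{1}}}",
}


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def to_latex(element):
    tag = local_name(element)
    if tag in LEAF_TAGS:
        return (element.text or "").strip()
    # Recursion: every child is converted the same way as its parent.
    children = [to_latex(child) for child in element]
    if tag in TEMPLATES:
        # Raises IndexError if the element has fewer children than slots.
        return TEMPLATES[tag].format(*children)
    # Containers such as <math> and <mrow> (and any tag this sketch does
    # not know) simply concatenate their converted children.
    return "".join(children)


SAMPLE = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <mfrac><mi>a</mi><mi>b</mi></mfrac>
  </mrow>
</math>
"""

print(to_latex(ET.fromstring(SAMPLE.strip())))
```

A table of templates keeps the recursive function short. The cost is that every tag is assumed to map to exactly one template, which does not always hold in practice.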

What are the problems

  • The MathML parsing process above looks simple, but there are many problems to handle in practice, mainly because the data we finally submit has to stay consistent with the syntax the OCR produces.

For example, the following two syntaxes render identically:

  • The same tag cannot always be converted with the same template. For example:
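The original figure is not reproduced here. One pair that is certain to render identically is `x^2` and `x^{2}` (another is `\frac{a}{b}` and `{a \over b}`). A converter has to pick one spelling, so its output may differ textually from what the OCR returned even when the formula is unchanged. The sketch below shows one possible mitigation: normalizing both strings before comparing them. The function name `normalize` and the choice of the braced form as canonical are assumptions, and the regular expression is not a full LaTeX parser.

```python
import re

# A ^ or _ followed by a single unbraced letter or digit, e.g. x^2 or a_i.
_BARE_SCRIPT = re.compile(r"([\^_])([A-Za-z0-9])")


def normalize(latex):
    """Rewrite bare scripts such as x^2 into the braced form x^{2}."""
    return _BARE_SCRIPT.sub(r"\1{\2}", latex)


ocr_output = "x^2+a_i"            # hypothetical string returned by OCR
round_tripped = "x^{2}+a_{i}"     # hypothetical string after MathML -> LaTeX

print(normalize(ocr_output) == normalize(round_tripped))
```

Already braced scripts are left untouched, because `{` does not match the letter-or-digit class.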
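The original example is not reproduced here. One tag where this certainly happens is `<mover>`: depending on the accent in its second child, the matching LaTeX is `\hat{x}`, `\overline{x}`, or `\vec{x}`. The sketch below selects the template from the accent character. The names `OVER_TEMPLATES`, `FALLBACK`, and `mover_to_latex` are hypothetical, the accent table is deliberately tiny, and the base is read as plain text rather than converted recursively.

```python
import xml.etree.ElementTree as ET

# The second child of <mover> decides which LaTeX template applies.
OVER_TEMPLATES = {
    "^": r"\hat{{{0}}}",
    "\u00af": r"\overline{{{0}}}",  # macron
    "\u2192": r"\vec{{{0}}}",       # rightwards arrow
}
# Generic fallback (amsmath): \overset{above}{base}.
FALLBACK = r"\overset{{{1}}}{{{0}}}"


def mover_to_latex(element):
    """Convert a two-child <mover> element whose children are leaf nodes."""
    # Raises ValueError unless the element has exactly two children.
    base, over = ("".join(child.itertext()).strip() for child in element)
    template = OVER_TEMPLATES.get(over, FALLBACK)
    return template.format(base, over)


for accent in ("^", "\u00af", "\u2192"):
    node = ET.fromstring(f"<mover><mi>x</mi><mo>{accent}</mo></mover>")
    print(mover_to_latex(node))
```

The tag name alone is not enough to choose a template here, so the lookup has to use the content of a child as well.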

  • MathML produced by MathJax's LaTeX-to-MathML conversion is not always displayed correctly by the rich text editor. For now, we handle each special case individually.

Follow-up refactoring options

  • Use an open source text editor that supports LaTeX natively, and implement the operation panel ourselves. This avoids the problems caused by data format conversion.

  • A Yuque-style feature: the user types the LaTeX source, which is displayed directly as a formula, and copying is not supported. That does not fit our business scenario.

  • Use latex.js to parse LaTeX into HTML and implement the rich text formula editor ourselves, so there is no data conversion problem (but we would have to deal with the pitfalls of LaTeX-to-HTML conversion).

Github.com/michael-bra…

Conclusion

  • Data import (LaTeX to MathML), data rendering, and data export (MathML to LaTeX) each use a different tool. Relying on three different tools is the root cause of the problems.
  • Comments and alternative approaches are welcome 🐶 ~