On my website I have a form that takes some textual user input. Everything works fine for "normal" characters. However, when Unicode characters are entered... well, the plot thickens.
The user enters something like:
やっぱ死にかけてる
This arrives at the server as text containing XML entity references:
&#12420;&#12387;&#12401;&#27515;&#12395;&#12363;&#12369;&#12390;&#12427;?
Now, when I want to serve this back to the client in HTML, how do I do it?
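If I simply output the string as it is, there is a risk of a script-injection attack. If I try to encode it with scala.xml.Text, the ampersands get escaped and it is converted to:
&amp;#12420;&amp;#12387;&amp;#12401;&amp;#27515;&amp;#12395;&amp;#12363;&amp;#12369;&amp;#12390;&amp;#12427;?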
Is there a better ready-made solution in Scala which can detect entity refs and not escape them, yet escape XML tags?
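To make the problem concrete, here is a minimal sketch of the double-escaping effect that uses only the standard library. The object name DoubleEscapeDemo and the helper naiveEscape are hypothetical, written for illustration only, and are not part of scala.xml; they simply treat every ampersand as literal text, which is the behaviour described above.

```scala
object DoubleEscapeDemo {
  // Hypothetical helper (not a scala.xml API): escapes the XML special
  // characters and treats every '&' as literal text.
  def naiveEscape(s: String): String =
    s.flatMap {
      case '&'  => "&amp;"
      case '<'  => "&lt;"
      case '>'  => "&gt;"
      case '"'  => "&quot;"
      case '\'' => "&#39;"
      case c    => c.toString
    }

  def main(args: Array[String]): Unit = {
    // The tag is neutralised, but the character reference is broken too.
    println(naiveEscape("&#12420;<script>")) // prints &amp;#12420;&lt;script&gt;
  }
}
```

The tag is made harmless, but the browser now shows the literal text &#12420; instead of the character it stands for.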
Parse the string containing entity references as a fragment of XML, which decodes them into the actual characters. To safely output the Unicode characters in XML, you can be paranoid and emit XML entity references for them, as the escape function below does:
scala> import xml.parsing.ConstructingParser
import xml.parsing.ConstructingParser
scala> import io.Source
import io.Source
scala> val d = ConstructingParser.fromSource(Source.fromString("<dummy>&#12420;</dummy>"), true).document
d: scala.xml.Document = <dummy>や</dummy>
scala> val t = d(0).text
t: String = や
scala> import xml._
import xml._
scala> def escape(xmlText: String): NodeSeq = {
     |   def escapeChar(c: Char): xml.Node =
     |     if (c > 0x7F || Character.isISOControl(c))
     |       xml.EntityRef("#" + Integer.toString(c, 10))
     |     else
     |       xml.Text(c.toString)
     |
     |   new xml.Group(xmlText.map(escapeChar(_)))
     | }
escape: (xmlText: String)scala.xml.NodeSeq
scala> <foo>{escape(t)}</foo>
res3: scala.xml.Elem = <foo>&#12420;</foo>
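The same decode-then-escape idea can also be sketched without scala.xml, for cases where the input is a plain string rather than a well-formed XML fragment. This is only a sketch under stated assumptions: the names EntitySafeEscape, decodeNumericRefs, escapeForHtml and sanitize are hypothetical, and only numeric character references (decimal or hexadecimal) are decoded. Named entities such as &amp;nbsp; are not recognised here and would be escaped as plain text, whereas an XML parser handles the predefined ones. Unlike the Char-based escape above, this version walks code points, so characters outside the Basic Multilingual Plane come out as a single reference.

```scala
import scala.util.matching.Regex

object EntitySafeEscape {
  // Matches decimal (&#12420;) and hexadecimal (&#x3084;) character references.
  private val NumericRef: Regex =
    """&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));""".r

  // Step 1: replace numeric character references with the characters they name.
  // Out-of-range or surrogate values are left as-is, so step 2 escapes them as text.
  def decodeNumericRefs(s: String): String =
    NumericRef.replaceAllIn(s, m => {
      val cp =
        if (m.group(1) != null) Integer.parseInt(m.group(1), 16)
        else Integer.parseInt(m.group(2), 10)
      val valid = Character.isValidCodePoint(cp) && !(cp >= 0xD800 && cp <= 0xDFFF)
      val out = if (valid) new String(Character.toChars(cp)) else m.matched
      Regex.quoteReplacement(out)
    })

  // Step 2: escape markup characters, and emit non-ASCII and control characters
  // as decimal references (the same rule as the escape function above).
  def escapeForHtml(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      if (cp > 0x7F || Character.isISOControl(cp))
        sb.append("&#").append(cp).append(';')
      else cp.toChar match {
        case '&'  => sb.append("&amp;")
        case '<'  => sb.append("&lt;")
        case '>'  => sb.append("&gt;")
        case '"'  => sb.append("&quot;")
        case '\'' => sb.append("&#39;")
        case c    => sb.append(c)
      }
      i += Character.charCount(cp)
    }
    sb.toString
  }

  // Full pipeline: decode first, then escape everything uniformly.
  def sanitize(s: String): String = escapeForHtml(decodeNumericRefs(s))

  def main(args: Array[String]): Unit = {
    println(sanitize("&#12420;<script>")) // prints &#12420;&lt;script&gt;
    println(sanitize("&#60;b&#62;"))      // prints &lt;b&gt;
  }
}
```

Decoding before escaping means a reference such as &#60; cannot smuggle a literal < into the output: it becomes the character first and is then escaped like any other.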
http://stackoverflow.com/questions/2033833/how-do-i-handle-unicode-user-input-in-scala-safely-esp-xml-entities