python - splitting unicode string into words
I am trying to split a Unicode string into words (simplistically), like this:

    print re.findall(r'(?u)\w+', "раз два три")

What I expect to see is:

    ['раз', 'два', 'три']

but what I really get is:

    ['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']

Edit:
If I use u in front of the string:

    print re.findall(r'(?u)\w+', u"раз два три")

I get:

    [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']

Edit 2:
It looks like I should have read the docs first:

    print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')

will give me:

    раз

Just to make sure, does this sound like the proper way to approach it?
You are actually getting what you expect in the Unicode case. You only think you are not because of the odd-looking escapes, which appear because you are looking at the reprs of the strings rather than printing their unescaped values. (This is just how lists are displayed.)

    >>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
    >>> for w in words:
    ...     print w  # this uses the terminal encoding - _only_ use interactively
    ...
    раз
    два
    три
    >>> u'раз' == u'\u0440\u0430\u0437'
    True

Do not miss my comment about printing these Unicode strings. Normally, if you were going to send them to the screen, a file, over the wire, etc., you would have to encode them yourself into the correct encoding. When you use print, Python tries to use your terminal's encoding, but it can only do that if there is a terminal. Because you generally do not know whether there is one, you should rely on this only in the interactive interpreter, and always encode explicitly to the right encoding otherwise.

For this simple split-on-whitespace approach, you may not want to use a regex at all and can simply use the unicode.split method.

    >>> u"раз два три".split()
    [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']

Your top (bytestring) example does not work because re basically assumes that all bytestrings are ASCII for its semantics, but yours was not. Using Unicode strings lets you get the right semantics for your alphabet and locale. Text data should always be represented using unicode rather than str.
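The answer above is written for Python 2. As a rough sketch of the same points in Python 3 (an assumption on my part, since the thread itself never mentions Python 3), note that there str is already Unicode text and bytes plays the role of the old bytestring. The variable names below are purely illustrative.

```python
import re

# "раз два три" written with escapes so the source file stays pure ASCII.
text = "\u0440\u0430\u0437 \u0434\u0432\u0430 \u0442\u0440\u0438"

# With a str pattern, \w is Unicode-aware by default in Python 3,
# so the (?u) flag is no longer needed.
words = re.findall(r"\w+", text)

# With a bytes pattern, \w only covers ASCII [a-zA-Z0-9_], so none of
# the UTF-8 encoded Cyrillic bytes match at all.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

# Plain whitespace splitting needs no regex.
split_words = text.split()

# Encode explicitly at the output boundary rather than relying on
# whatever encoding the terminal happens to use.
encoded = words[0].encode("utf-8")

print(ascii(words))   # escaped view, comparable to the Python 2 list repr
print(byte_words)
print(encoded)
```

The escaped view and the readable text are the same strings; only the way they are displayed differs, which is the point the answer makes about reprs.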