python - splitting unicode string into words

- January 15, 2010

I am trying to split a Unicode string into words (simplified) like:

  print re.findall(r'(?u)\w+', "раз два три")

What I expect to see is:

  ['раз', 'два', 'три']

but what I really get is:

  ['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', ...]

Edit:

If I use u in front of the string:

  print re.findall(r'(?u)\w+', u"раз два три")

I get:

  [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']

Edit 2:

It looks like I should have read the docs first:

  print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')

will give me:

  раз

Just to make sure, though: does this sound like the proper way to approach it?
 
You are actually getting what you expect in the Unicode case. You only think you are not because of the odd-looking escapes, which appear because you are looking at the reprs of the strings rather than printing their unescaped values. (This is just how lists are displayed.)

  >>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
  >>> for w in words:
  ...     print w  # this uses the terminal encoding - _only_ use interactively
  ...
  раз
  два
  три
  >>> u'раз' == u'\u0440\u0430\u0437'
  True

Do not miss my comment about printing these Unicode strings. Normally, if you are going to send them to the screen, a file, over the wire, etc., you need to encode them yourself into the correct encoding. When you use print, Python tries to use your terminal's encoding, but it can only do that if there is a terminal. Because you do not generally know whether there is one, you should rely on this only in the interactive interpreter, and always encode to the right encoding explicitly otherwise.

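As a rough illustration of encoding explicitly instead of relying on the terminal, here is a minimal sketch. It assumes Python 3 (where str is the Unicode type and bytes is the bytestring type) rather than the Python 2 used in the question, and it uses an in-memory io.BytesIO buffer as a hypothetical stand-in for a file or socket.

```python
import io

# The three words from the question, written with escapes (Python 3 str is Unicode).
words = ["\u0440\u0430\u0437", "\u0434\u0432\u0430", "\u0442\u0440\u0438"]

# Encode explicitly; do not depend on whatever encoding a terminal happens to use.
encoded = [w.encode("utf-8") for w in words]

# A binary buffer stands in for the real destination (assumption for this sketch).
buffer = io.BytesIO()
for chunk in encoded:
    buffer.write(chunk + b"\n")

# Reading the bytes back requires decoding with the same encoding.
raw = buffer.getvalue()
decoded = raw.decode("utf-8").split()
```

Each Cyrillic letter here takes two bytes in UTF-8, which is why the encoded form looks so different from the text even though it round-trips cleanly.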
For this simple split-on-whitespace approach, you may not want to use a regex at all; you can simply use the unicode.split method.

  >>> u"раз два три".split()
  [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']

Your top (bytestring) example does not work because re basically assumes that all bytestrings are ASCII for its semantics, but yours was not. Using Unicode strings gives you the right semantics for your alphabet and locale. As far as possible, textual data should be represented using unicode rather than str.
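
The difference between bytestring and Unicode matching can be sketched as follows. This assumes Python 3 rather than the Python 2 of the question: there, bytes plays the role of the old str, str plays the role of the old unicode, and str patterns are Unicode-aware without the (?u) flag.

```python
import re

text = "раз два три"          # Unicode text (Python 3 str)
data = text.encode("utf-8")   # the same text as a UTF-8 bytestring

# Bytes patterns use ASCII-only semantics for \w, so none of the
# non-ASCII bytes that make up the Cyrillic letters can match.
byte_words = re.findall(rb"\w+", data)

# str patterns use Unicode semantics, so whole words are matched.
str_words = re.findall(r"\w+", text)

# For plain whitespace splitting, no regex is needed at all.
split_words = text.split()
```

Here byte_words comes back empty, while str_words and split_words both contain the three words. The exact bytestring result differs from the Python 2 output in the question, but the cause is the same: the regex engine is not treating the bytes as Cyrillic text.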

 




  


















