Encoding Unicode code points with Ruby

I'm retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.

For example, this is some text in the HTML as received (in ISO-8859-1):

\x95\x95 JOHNNY VENETTI \x95\x95

And when attempting to work with this text, it gets converted to this:

\u0095\u0095 JOHNNY VENETTI \u0095\u0095

So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I've tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.

1 solution

#1


First, note that this string is not actually ISO-8859-1 encoded: the file utility reports "Non-ISO extended-ASCII text", and the code page confirms it, since byte 0x95 is a C1 control character in ISO-8859-1 but the bullet (•) in Windows-1252. That may well be your problem; if so, the right encoding (probably Windows-1252 in this case) should be specified in the HTML document.
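
To make the difference between the two code pages concrete, here is a minimal sketch that decodes the same bytes both ways. The helper name decode_bytes is made up for this illustration, and String#b assumes Ruby 2.0 or later.

```ruby
# Hypothetical helper (not part of any library): interpret raw bytes under
# the given source encoding and return a new string transcoded to UTF-8.
def decode_bytes(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# The bytes from the question, held as binary data (String#b: Ruby 2.0+).
raw = "\x95\x95 JOHNNY VENETTI \x95\x95".b

latin1 = decode_bytes(raw, 'ISO-8859-1')   # 0x95 becomes a C1 control character
cp1252 = decode_bytes(raw, 'Windows-1252') # 0x95 becomes the bullet character

puts format('ISO-8859-1:   U+%04X', latin1.codepoints.first) # prints U+0095
puts format('Windows-1252: U+%04X', cp1252.codepoints.first) # prints U+2022
```

The \u0095 in the question is therefore what a correct ISO-8859-1 decode of byte 0x95 looks like; the declared charset is what is wrong, not the conversion.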

In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:

require 'nokogiri'

# Arguments: markup, base URL (nil here), and the encoding to assume.
Nokogiri::HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ... 
#       children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>

If you can't solve this cleanly as above, you can also do it the hard way: associate the string with its correct encoding yourself, then transcode it:

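s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT> (on Ruby 1.9 with the default US-ASCII source encoding)
s.force_encoding 'Windows-1252' # relabel the bytes; the bytes themselves do not change
s.encode! 'utf-8'               # transcode the bytes to UTF-8 in place
s # => "•• JOHNNY VENETTI ••"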

Note that this last piece of code requires Ruby 1.9 or later; Ruby 1.8 has no String#force_encoding or String#encode. If you want, you can read more about the encoding system introduced in Ruby 1.9.
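
If the conversion is needed in more than one place, the relabel-and-transcode steps could be wrapped in a small helper that leaves its argument untouched. This is only a sketch: the name cp1252_to_utf8 is invented here, it assumes the input bytes really are Windows-1252, and it chooses to substitute a replacement character for any byte that has no mapping instead of raising.

```ruby
# Hypothetical helper (the name is an assumption, not a library method).
# Returns a new UTF-8 string; the original string is not modified.
def cp1252_to_utf8(bytes)
  bytes.dup                           # work on a copy so the caller's string is untouched
       .force_encoding('Windows-1252') # relabel the copy; no bytes change
       .encode('UTF-8', invalid: :replace, undef: :replace) # transcode, replacing unmappable bytes
end

source = "\x95\x95 JOHNNY VENETTI \x95\x95".b # binary input (String#b: Ruby 2.0+)
puts cp1252_to_utf8(source)
```

Dropping the two :replace options makes the helper strict, so an unmappable byte raises an exception instead of being silently replaced; which behavior is preferable depends on how much the input can be trusted.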
