Searching Japanese (UTF-8) files with Apache Lucene on Mac OS X

First I built an index and ran a search in the usual way,
but the query text came out garbled and nothing was hit.
/Users/satoshi/src/lucene-3.0.2% rm -rf index/
/Users/satoshi/src/lucene-3.0.2% java -cp lucene-core-3.0.2.jar:lucene-demos-3.0.2.jar org.apache.lucene.demo.IndexFiles src
Indexing to directory 'index'...
adding src/demo/org/apache/lucene/demo/DeleteFiles.java
adding src/demo/org/apache/lucene/demo/FileDocument.java
adding src/demo/org/apache/lucene/demo/html/Entities.java
adding src/demo/org/apache/lucene/demo/html/HTMLParser.java
adding src/demo/org/apache/lucene/demo/html/HTMLParser.jj
adding src/demo/org/apache/lucene/demo/html/HTMLParserConstants.java
adding src/demo/org/apache/lucene/demo/html/HTMLParserTokenManager.java
adding src/demo/org/apache/lucene/demo/html/ParseException.java
adding src/demo/org/apache/lucene/demo/html/ParserThread.java
adding src/demo/org/apache/lucene/demo/html/SimpleCharStream.java
adding src/demo/org/apache/lucene/demo/html/Tags.java
adding src/demo/org/apache/lucene/demo/html/Test.java
adding src/demo/org/apache/lucene/demo/html/Token.java
adding src/demo/org/apache/lucene/demo/html/TokenMgrError.java
adding src/demo/org/apache/lucene/demo/HTMLDocument.java
adding src/demo/org/apache/lucene/demo/IndexFiles.java
adding src/demo/org/apache/lucene/demo/IndexHTML.java
adding src/demo/org/apache/lucene/demo/README
adding src/demo/org/apache/lucene/demo/SearchFiles.java
adding src/jsp/configuration.jsp
adding src/jsp/footer.jsp
adding src/jsp/header.jsp
adding src/jsp/index.jsp
adding src/jsp/README.txt
adding src/jsp/results.jsp
adding src/jsp/WEB-INF/web.xml
Optimizing...
621 total milliseconds
/Users/satoshi/src/lucene-3.0.2% java -cp /usr/libexec/lucene/3.0.2/lucene-core-3.0.2.jar:/usr/libexec/lucene/3.0.2/lucene-demos-3.0.2.jar org.apache.lucene.demo.SearchFiles
Enter query: 
あああ
Searching for: "�� �� ��"
0 total matching documents
Press (q)uit or enter number to jump to a page.
^C%
/Users/satoshi/src/lucene-3.0.2% cat src/demo/org/apache/lucene/demo/README
dragon ash
independent
日本語
あああ
東京%

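The cause seems to be that Java's default character encoding here is Shift_JIS.
Following the site below, I tried setting _JAVA_OPTIONS=-Dfile.encoding=UTF-8.
Mac OS X で Java SE 6 を使う : エンコーディングを UTF-8 で使いたい - Masaki Katakai's Weblog ("Using Java SE 6 on Mac OS X: I want to use UTF-8 as the encoding")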
export _JAVA_OPTIONS="-Dfile.encoding=UTF-8"
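To check whether the setting actually reaches the JVM, a small program can print the encoding Java falls back to when none is given explicitly. This is only a minimal sketch: the class name ShowDefaultEncoding and the method defaultCharsetName are made up for illustration and are not part of Lucene or its demos.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of Lucene.
public class ShowDefaultEncoding {
    // Returns the name of the charset the JVM uses when no encoding
    // is specified explicitly (for example by new FileReader(file)).
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property that _JAVA_OPTIONS=-Dfile.encoding=... sets.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // The charset the JVM resolved as its default.
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

Running it once before and once after the export should show whether the default changes to UTF-8 on a given machine.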
So I rebuilt the index and searched again...
/Users/satoshi/src/lucene-3.0.2% rm -rf index/
/Users/satoshi/src/lucene-3.0.2% java -cp lucene-core-3.0.2.jar:lucene-demos-3.0.2.jar org.apache.lucene.demo.IndexFiles src
Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
Indexing to directory 'index'...
adding src/demo/org/apache/lucene/demo/DeleteFiles.java
adding src/demo/org/apache/lucene/demo/FileDocument.java
adding src/demo/org/apache/lucene/demo/html/Entities.java
adding src/demo/org/apache/lucene/demo/html/HTMLParser.java
adding src/demo/org/apache/lucene/demo/html/HTMLParser.jj
adding src/demo/org/apache/lucene/demo/html/HTMLParserConstants.java
adding src/demo/org/apache/lucene/demo/html/HTMLParserTokenManager.java
adding src/demo/org/apache/lucene/demo/html/ParseException.java
adding src/demo/org/apache/lucene/demo/html/ParserThread.java
adding src/demo/org/apache/lucene/demo/html/SimpleCharStream.java
adding src/demo/org/apache/lucene/demo/html/Tags.java
adding src/demo/org/apache/lucene/demo/html/Test.java
adding src/demo/org/apache/lucene/demo/html/Token.java
adding src/demo/org/apache/lucene/demo/html/TokenMgrError.java
adding src/demo/org/apache/lucene/demo/HTMLDocument.java
adding src/demo/org/apache/lucene/demo/IndexFiles.java
adding src/demo/org/apache/lucene/demo/IndexHTML.java
adding src/demo/org/apache/lucene/demo/README
adding src/demo/org/apache/lucene/demo/SearchFiles.java
adding src/jsp/configuration.jsp
adding src/jsp/footer.jsp
adding src/jsp/header.jsp
adding src/jsp/index.jsp
adding src/jsp/README.txt
adding src/jsp/results.jsp
adding src/jsp/WEB-INF/web.xml
Optimizing...
682 total milliseconds
/Users/satoshi/src/lucene-3.0.2% java -cp /usr/libexec/lucene/3.0.2/lucene-core-3.0.2.jar:/usr/libexec/lucene/3.0.2/lucene-demos-3.0.2.jar org.apache.lucene.demo.SearchFiles
Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
Enter query: 
東京
Searching for: "東 京"
1 total matching documents
1. src/demo/org/apache/lucene/demo/README
Press (q)uit or enter number to jump to a page.
^C%

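And this time the search worked, with no garbled text.
Incidentally, on my Ubuntu machine, indexing and searching Japanese (UTF-8) files
succeeded without setting the encoding through _JAVA_OPTIONS at all, presumably because the default encoding there is already UTF-8.
I wonder why the Mac defaults to Shift_JIS...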
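Apart from changing the JVM-wide default, code that reads files can name the encoding itself, so the result no longer depends on the platform. The following is a rough sketch of that idea in plain Java. The class Utf8ReadSketch and its methods are hypothetical, and it does not show how the Lucene demo opens files, which I have not verified here.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical example class, not part of Lucene.
public class Utf8ReadSketch {
    // Reads the first line of a file, always decoding it as UTF-8
    // regardless of the JVM's default encoding.
    static String readFirstLineUtf8(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    // Writes text to a file as UTF-8 (used only to prepare the demo file).
    static void writeUtf8(File file, String text) throws IOException {
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("utf8-sketch", ".txt");
        tmp.deleteOnExit();
        // "\u6771\u4eac" is the string 東京, written with escapes so the
        // source compiles the same way under any javac -encoding setting.
        writeUtf8(tmp, "\u6771\u4eac\n");
        String line = readFirstLineUtf8(tmp);
        // Compare instead of printing the text, since console output
        // would itself depend on the default encoding.
        System.out.println(line.equals("\u6771\u4eac"));
    }
}
```

A reader built this way could be handed to whatever consumes the file contents, while -Dfile.encoding=UTF-8 stays the simpler option when the code cannot be changed.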

