Published 2008-03-09 18:37:00

It's funny how you can often end up solving pretty much the same problem twice: déjà vu for coding. Last week's challenge was a free-text search engine for email archives, including support for Chinese.

About 7 years ago, I remember hacking on mnogosearch to solve a pretty similar problem. After some research this time, I settled on Xapian. Some of the reasons:
- internal UTF-8 support
- nice bindings for PHP & C (via gcc/C++)
- a working set of command-line tools (omindex & quest)
- database independent ~ no MySQL dependencies
And the test that always seals the deal: after apt-get'ing the package, it just worked! Creating a working store and running queries is quite simple.

The only trouble was that although it says it supports UTF-8, actual support for Chinese is a bit more complex.

Unlike Western languages, each character needs to be treated like a word. Ideally Xapian would realize this; however, not wanting to hack the C++ code, I decided it would be quicker to create files based on the original email and pad the Chinese characters with spaces. This means that, in the short term, I can use omindex rather than binding the D code directly to the Xapian API.

The way this is done in the D index builder is:
- parse the email with callbacks for each MIME part. I have ported the MIME code from Binc IMAP for this, and called it dinc (silly name for the week)
- if the part is text or HTML, convert it to UTF-8 using iconv
- stream-read each line of the body (using memory streams)
- convert each line to UTF-32/dchars
- loop through the dchars and see if they are Chinese, Japanese or Korean (see this for the simple check for CJK characters). Pad the output line with spaces when one is found
- convert the resulting UTF-32 array to a UTF-8/char array, and write it to the output stream (file stream)
All this code should be quite simple to extend when I get round to direct access to the Xapian API from D. It should also be pretty memory efficient, and fast once I rewrite the iconv code to work like a stream filter...
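The padding step in the list above can be sketched roughly as follows. This is a Python illustration rather than the D code itself, and the code point ranges are an assumption (common Unicode CJK blocks), not the exact check the index builder uses. The names `CJK_RANGES`, `is_cjk` and `pad_cjk_line` are made up for the example.

```python
# Assumed CJK ranges: common Unicode blocks, not the original post's check.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(ch):
    """Return True if the single character falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def pad_cjk_line(line):
    """Surround every CJK character with spaces so each one indexes as a word."""
    out = []
    for ch in line:
        if is_cjk(ch):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Each CJK character ends up as its own whitespace-separated token.
    print(pad_cjk_line("hello 中文 world").split())
```

Python strings are already sequences of code points, so the explicit UTF-8 to UTF-32 round trip from the D version is not needed here; the per-character loop is the part that carries over.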


On the other end of this was making the ExtJS/PHP5 front end do the searching. Again, the end user would be expected to search in Chinese as a series of characters, e.g. type 'XXX' rather than 'X X X'. So in PHP, I needed to convert the search string into UTF-32, compare each block of 4 bytes against the previous list of CJK characters, padding with spaces, then convert back to UTF-8 prior to sending the query. This is all pretty simple using iconv or mbstring.
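The query-side conversion might look something like this. It is sketched in Python rather than PHP, walking the UTF-32 bytes in 4-byte blocks as described. The ranges are again an assumption (common Unicode CJK blocks), and `pad_cjk_query` is a hypothetical name. Collapsing repeated whitespace at the end is a choice made for this sketch, not something the original front end is known to do.

```python
import struct

# Assumed CJK ranges: common Unicode blocks, not the original post's check.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)

SPACE_UTF32 = " ".encode("utf-32-be")  # a space as one 4-byte block


def pad_cjk_query(query):
    """Pad CJK characters in a search string by walking its UTF-32 blocks."""
    raw = query.encode("utf-32-be")  # fixed 4 bytes per code point, no BOM
    out = []
    for i in range(0, len(raw), 4):
        block = raw[i:i + 4]
        (cp,) = struct.unpack(">I", block)  # big-endian 32-bit code point
        if any(lo <= cp <= hi for lo, hi in CJK_RANGES):
            out.append(SPACE_UTF32 + block + SPACE_UTF32)
        else:
            out.append(block)
    padded = b"".join(out).decode("utf-32-be")
    # Collapse the doubled spaces left between adjacent CJK characters.
    return " ".join(padded.split())


if __name__ == "__main__":
    # Encode back to UTF-8 before handing the string to the search backend.
    print(pad_cjk_query("xapian 搜索").encode("utf-8"))
```

The important point is that the query goes through the same padding rule as the indexed files, so a user typing an unbroken run of characters matches the space-separated terms in the index.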

All in all, not that difficult to do; however, actually finding and working out how to do it was quite a challenge.
