I am working on a full-text search engine for Django. It must be easy to install, with fast indexing, fast index updates, no blocking during indexing, and fast search.
After reading a lot of web pages, I compiled a short list: MySQL MyISAM full-text, djapian/python-xapian, and django-sphinx.
I did not choose Lucene because it seems complicated, and I passed on Haystack because it has fewer features than djapian/django-sphinx (such as field weighting).
Then I ran some benchmarks. For this, I collected many free books on the Internet to generate a database table containing 1,485,000 records (id, title, body), each record about 600 bytes long.
From the database I also generated a list of 100,000 existing words and shuffled them to create a search list. For each test I did 2 runs on my laptop (4 GB RAM, dual-core 2.0 GHz): the first just after a server restart to clear all caches, the second immediately afterwards to test how effective the caching is. Here are the “homemade” benchmark results:
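For reference, the timing loop behind numbers like these can be sketched as follows. This is a minimal Python sketch, not the author's actual harness: `fake_search` is a stand-in for the real backend query (e.g. a MySQL `MATCH ... AGAINST` statement), and the word list is illustrative.

```python
import random
import time
from datetime import timedelta

def run_benchmark(words, n_queries, search):
    """Time n_queries single-word searches, words drawn at random."""
    queries = random.sample(words, n_queries)
    start = time.monotonic()
    for word in queries:
        search(word)
    return timedelta(seconds=time.monotonic() - start)

# Stand-in for a real backend call, e.g.
# "SELECT id FROM books WHERE MATCH(title, body) AGAINST (%s)".
def fake_search(word):
    return []

words = ["python", "xapian", "sphinx", "mysql", "django"]
elapsed = run_benchmark(words, 3, fake_search)
print(elapsed)  # printed in the same H:MM:SS.micros format as the results below
```

Running the same loop twice back to back is what produces the "First run" / "next run" pairs: the second pass mostly measures the effect of the caches warmed by the first.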
1485000 records with Title (150 bytes) and body (450 bytes)
Mysql 5.0.75/Ubuntu 9.04 Fulltext :
=========================================================================
Full indexing: 7m14.146s
1 thread, 1000 searches with single word randomly taken from database:
First run: 0:01:11.553524
next run: 0:00:00.168508
Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
=========================================================================
Full indexing: 6m08.154s
1 thread, 1000 searches with single word randomly taken from database:
First run: 0:01:09.553524
next run: 0:00:20.316903
1 thread, 100000 searches with single word randomly taken from database:
First run: 9m09s
next run: 5m38s
1 thread, 10000 random strings (random strings should not be found in database):
just after the 100000 search test: 0:00:15.007353
1 thread, boolean search: 1000 x (+word1 +word2)
First run: 0:00:21.205404
next run: 0:00:00.145098
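The `(+word1 +word2)` pattern in the boolean test above is MySQL's boolean full-text mode, where a leading `+` makes a word mandatory. A small sketch of how such a query string could be built (the table and column names are made up for illustration):

```python
def boolean_query(table, columns, words):
    """Build a MATCH ... AGAINST (... IN BOOLEAN MODE) statement where
    every word is required (prefixed with '+')."""
    terms = " ".join("+" + w for w in words)
    cols = ", ".join(columns)
    return (
        f"SELECT id FROM {table} "
        f"WHERE MATCH({cols}) AGAINST ('{terms}' IN BOOLEAN MODE)"
    )

sql = boolean_query("books", ["title", "body"], ["word1", "word2"])
print(sql)
# SELECT id FROM books WHERE MATCH(title, body) AGAINST ('+word1 +word2' IN BOOLEAN MODE)
```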
Djapian Fulltext:
=========================================================================
Full indexing: 84m7.601s
1 thread, 1000 searches with single word randomly taken from database with prefetch:
First run: 0:02:28.085680
next run: 0:00:14.300236
python-xapian Fulltext :
=========================================================================
1 thread, 1000 searches with single word randomly taken from database:
First run: 0:01:26.402084
next run: 0:00:00.695092
django-sphinx Fulltext :
=========================================================================
Full indexing: 1m25.957s
1 thread, 1000 searches with single word randomly taken from database:
First run: 0:01:30.073001
next run: 0:00:05.203294
1 thread, 100000 searches with single word randomly taken from database:
First run: 12m48s
next run: 9m45s
1 thread, 10000 random strings (random strings should not be found in database):
just after the 100000 search test: 0:00:23.535319
1 thread, boolean search: 1000 x (word1 word2)
First run: 0:00:20.856486
next run: 0:00:03.005416
As you can see, MySQL is not that bad for full-text search. In addition, its query cache is very effective.
MySQL is a good choice in my opinion, because there is nothing to install (I just need to write a small script to synchronize the InnoDB production table to a MyISAM search table), and because I don’t really need advanced search features like stemming…
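A rough sketch of such a synchronization script, with illustrative table names only (`books` for the InnoDB production table, `books_search` for the MyISAM shadow table with the FULLTEXT index); the statements would be replayed through any DB-API connection (e.g. MySQLdb):

```python
# Sketch of the InnoDB -> MyISAM synchronization mentioned above.
# Table and column names are hypothetical.
SYNC_STATEMENTS = [
    # MyISAM shadow table carrying the FULLTEXT index used for searching
    "CREATE TABLE IF NOT EXISTS books_search ("
    " id INT PRIMARY KEY,"
    " title VARCHAR(255),"
    " body TEXT,"
    " FULLTEXT KEY ft_title_body (title, body)"
    ") ENGINE=MyISAM",
    # Refresh the search table from the InnoDB production table
    "TRUNCATE TABLE books_search",
    "INSERT INTO books_search (id, title, body) "
    "SELECT id, title, body FROM books",
]

def sync(connection):
    """Replay the statements on an open DB-API connection."""
    cur = connection.cursor()
    for stmt in SYNC_STATEMENTS:
        cur.execute(stmt)
    connection.commit()
```

A full `TRUNCATE` + `INSERT ... SELECT` is the simplest scheme; for a large table an incremental copy keyed on a modification timestamp would be cheaper, at the cost of tracking deletes.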
So here is my question: what do you think of MySQL’s full-text search engine vs Sphinx and Xapian?
I did not test Xapian, but last year I gave a presentation comparing full-text solutions:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
Sphinx has the fastest search speed, but it is difficult to index incrementally growing data, because adding data to the index is as expensive as rebuilding the entire index from scratch.
Therefore, some people maintain two Sphinx indexes: a large index with archived data, and a small index with the latest data. They regularly (e.g. weekly) merge the recent index into the archive index (merging two indexes is cheaper than rebuilding) and truncate the small index to start a new week. This works well for things like forums, but not so well for wikis.
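Under that scheme, the weekly maintenance could be scripted roughly like this. The index names `archive` and `delta` are hypothetical; `indexer --merge DSTINDEX SRCINDEX` and `--rotate` are standard Sphinx indexer options, and "truncating" the delta amounts to rebuilding it against a fresh date cutoff once its contents have been folded into the archive.

```python
import subprocess

# Hypothetical index names: "archive" (large, old data), "delta" (recent data).
MERGE_CMD = ["indexer", "--merge", "archive", "delta", "--rotate"]
REBUILD_DELTA_CMD = ["indexer", "delta", "--rotate"]

def weekly_merge(run=subprocess.run):
    """Fold the small delta index into the archive, then rebuild the
    delta from scratch -- cheap, since it only covers recent data."""
    run(MERGE_CMD, check=True)
    run(REBUILD_DELTA_CMD, check=True)
```

The `run` parameter is injectable only so the routine can be exercised without a Sphinx installation; in a cron job one would simply call `weekly_merge()`.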
You can also check out Apache Solr. It is a wrapper for Lucene that makes Lucene easier to use and more powerful. When I put together that presentation, I didn’t know about Solr.
The Washington Times is an example of a project that uses Solr with Django:
> http://www.screeley.com/djangosolr/
> http://www.chrisumbel.com/article/django_solr