Peter Vandenabeele
Performance measurements on ultrasphinx (a Rails plug-in for full text search)
Long post, summary and conclusions on top, experimental set-up and results below:
Conclusions:
I checked the influence of using a "stop word" file and "star extension" (only with a prefix, not infix) on the performance of ultrasphinx (a Rails plug-in for the sphinx full text search engine). My dataset was small (37,403 records in the main table, with a belongs_to to another table of 6,000 records, for a total of 29 MByte), but it is significantly larger than the initial dataset I expect to start my project with. The baseline had no stop word file (that is, a file of trivial words like 'a', 'and', 'en' that should not be matched) and no star feature (matching 'program*' with 'programmer'). I turned off morphology, since I will have a multi-language database (English + Dutch) and the language rules are quite different in Dutch. I am not aware of a Dutch language stemmer at this time.
In a first experiment, I used an 842-word stop word file (mixed English + Dutch). Against my initial expectation, this did not significantly change the distribution of search times (range 0 to approx. 80 ms, with 95% being less than approx. 20 ms). It did reduce the size of the index files by approx. 36%, from approx. 20 MByte to approx. 13 MByte (a reduction of 30% was mentioned elsewhere). A small decrease in indexing time was seen (from 5.0 seconds to 4.3 seconds), but this is probably not significant given the limited amount of data.
In a second experiment, I turned on the enable_star = 1 and min_prefix_len = 4 features (without a stop word file). This grew the index files to approx. 79 MByte and the indexing time to 15 seconds: a factor of nearly 4 growth in index file size and a factor of 3 growth in indexing time. (This explains to me why major public search engines on the web do not provide the star feature ...) However, in initial qualitative tests, I did not see an increase in search times with the star feature enabled.
My current plan, based on this data, is to use:
* no stop word file
* no morphology
* min_word_len = 2 (not tested yet)
* enable_star = 1
* min_prefix_len = 4
In addition, I will set my AJAX search to require at least 2 characters in each word (to avoid useless searches and internet traffic).
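The 2-character rule for the AJAX search could be sketched as follows. This is only an illustration in Python: the function name is_searchable is hypothetical, treating a trailing '*' as not counting towards the length is my assumption, and in practice the check would live in the browser-side script.

```python
def is_searchable(query: str, min_chars: int = 2) -> bool:
    """Return True if the query has at least one word and every word,
    ignoring a trailing '*', has at least min_chars characters."""
    words = query.split()
    if not words:
        return False
    return all(len(word.rstrip("*")) >= min_chars for word in words)


if __name__ == "__main__":
    # Sample queries in the style of the test searches in this post.
    for query in ["p", "ph", "nederlands i", "nederlands ic", "openof*", ""]:
        print(repr(query), is_searchable(query))
```

With this rule, a query such as "nederlands i" would not be sent to the server until the second word has grown to "ic".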
Of course, I expect significantly better results on the production system. I hope that the current performance will be sufficient to support a few users executing AJAX searches in parallel. If that fails, we may have to fall back to classic search with a "search" button.
I hope the star feature will partly make up for the fact that I do not use morphology or stemming. The extra size of the index should not be a problem for the initial data size I expect for the project. I will check the search times in the query.log file to see if particular queries (e.g. with a star) are expensive. Hopefully, I can report with real production data later on.
Experimental data and results below
After getting the basics of ultrasphinx to work in Rails 2.0.2 on Linux (Ubuntu Feisty, MySQL 5.0), it is now time to start optimizing the indexing and searching configuration.
The machine here had 512 MByte RAM (and part of that was used in parallel for X, Firefox running the client part of the tests, MySQL, Rhythmbox, ... so approx. 200 MByte was in swap; I could not determine exactly which programs had swapped memory). The processor is
vendor_id : AuthenticAMD
cpu family : 15
model : 47
model name : AMD Athlon(tm) 64 Processor 3200+
stepping : 2
cpu MHz : 2000.314
cache size : 512 KB
running a vanilla Ubuntu 7.04 32 bit kernel.
First off, I wanted to see the effect of stop words. I had assumed that using a stop words file was essential to avoid long/slow searches (between 0.010 and 0.080 seconds) on very short terms (e.g. just 1 letter etc.). The first list further below shows the 5% of searches with the longest search times (the 25 slowest searches out of a simple test of 500 manual searches).
The second list shows the 25 slowest searches out of 500 tests on the same data, but using a list of English and Dutch stop words (I compiled the Dutch list partly by reasoning and partly by translating the English list):
peterv@debian-new:~/data/a/design/sphinx$ wc -l stopwords.en
571 stopwords.en
peterv@debian-new:~/data/a/design/sphinx$ wc -l stopwords.nl
271 stopwords.nl
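The 842-word file mentioned in the conclusions corresponds to these two lists taken together (571 + 271). A minimal sketch of such a merge, assuming one word per line as the wc -l counts suggest; the function name merge_stopwords is hypothetical, and because it removes duplicates the result may contain fewer than 842 words if the two lists overlap.

```python
from typing import Iterable, List


def merge_stopwords(*word_lists: Iterable[str]) -> List[str]:
    """Merge several iterables of stop words (one word per item) into a
    single sorted list without duplicates or empty entries."""
    merged = set()
    for words in word_lists:
        for line in words:
            word = line.strip().lower()
            if word:
                merged.add(word)
    return sorted(merged)


if __name__ == "__main__":
    english = ["a", "and", "the", "of"]  # stand-in for the lines of stopwords.en
    dutch = ["en", "de", "het", "of"]    # stand-in for the lines of stopwords.nl
    combined = merge_stopwords(english, dutch)
    print(len(combined), combined)
```

Open file objects iterate over their lines, so the real files could be passed in directly, e.g. merge_stopwords(open("stopwords.en", encoding="utf-8"), open("stopwords.nl", encoding="utf-8")).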
The test database on which I searched was compiled by filling various fields in different tables from test files that were read in a few times over. These files consisted of 4 MByte of English Open Source documentation (flat text, UTF-8, mainly from LDP) and 1 MByte of Dutch language Open Source documentation (flat text, UTF-8, mainly from nl.openoffice.org).
The database at the lowest level had 37,403 records, linking to another table having 6,000 entries for an average of 1KByte per record spread out over different fields.
WITHOUT STOP WORDS FILE:
0.020 sec [ext/0/rel 5 (0,20)] [complete] phi
0.020 sec [ext/0/rel 5 (0,20)] [complete] phi
0.021 sec [ext/0/rel 14 (0,20)] [complete] openoffice danon
0.021 sec [ext/0/rel 319 (0,20)] [complete] p
0.021 sec [ext/0/rel 45 (0,20)] [complete] bg
0.022 sec [ext/0/rel 3 (0,20)] [complete] nederlands le
0.022 sec [ext/0/rel 5 (0,20)] [complete] phi
0.022 sec [ext/0/rel 519 (0,20)] [complete] d
0.023 sec [ext/0/rel 20385 (0,20)] [complete] a
0.023 sec [ext/0/rel 4892 (0,20)] [complete] bash
0.023 sec [ext/0/rel 5 (0,20)] [complete] 122
0.023 sec [ext/0/rel 576 (0,20)] [complete] off
0.024 sec [ext/0/rel 20268 (0,20)] [complete] of
0.024 sec [ext/0/rel 40 (0,20)] [complete] 113
0.025 sec [ext/0/rel 816 (0,20)] [complete] b
0.027 sec [ext/0/rel 0 (0,20)] [complete] nederlands i
0.028 sec [ext/0/rel 20 (0,20)] [complete] st
0.032 sec [ext/0/rel 0 (0,20)] [complete] nederlands ic
0.032 sec [ext/0/rel 52 (0,20)] [complete] sd
0.033 sec [ext/0/rel 941 (0,20)] [complete] scripting
0.038 sec [ext/0/rel 1872 (0,20)] [complete] n
0.039 sec [ext/0/rel 279 (0,20)] [complete] p
0.053 sec [ext/0/rel 279 (0,20)] [complete] p
0.066 sec [ext/0/rel 5217 (0,20)] [complete] s
0.088 sec [ext/0/rel 5 (0,20)] [complete] 122
WITH STOP WORDS FILE:
0.018 sec [ext/0/rel 5 (0,20)] [complete] softw
0.018 sec [ext/0/rel 6 (0,20)] [complete] tek
0.019 sec [ext/0/rel 10 (0,20)] [complete] display information domains use the host command
0.019 sec [ext/0/rel 21 (0,20)] [complete] command & bash
0.019 sec [ext/0/rel 25 (0,20)] [complete] developer linux
0.019 sec [ext/0/rel 397 (0,20)] [complete] 15
0.019 sec [ext/0/rel 55 (0,20)] [complete] waarmee
0.020 sec [ext/0/rel 11 (0,20)] [complete] command bash ~22
0.020 sec [ext/0/rel 347 (0,20)] [complete] present
0.020 sec [ext/0/rel 5 (0,20)] [complete] mu
0.021 sec [ext/0/rel 15 (0,20)] [complete] command && bash
0.021 sec [ext/0/rel 45 (0,20)] [complete] examples features
0.022 sec [ext/0/rel 0 (0,20)] [complete] process CPU now.
0.022 sec [ext/0/rel 121 (0,20)] [complete] host command
0.022 sec [ext/0/rel 40 (0,20)] [complete] los
0.022 sec [ext/0/rel 933 (0,20)] [complete] 10
0.024 sec [ext/0/rel 10 (0,20)] [complete] mit
0.024 sec [ext/0/rel 2 (0,20)] [complete] process CPU now priority PPID
0.024 sec [ext/0/rel 967 (0,20)] [complete] 1.
0.025 sec [ext/0/rel 0 (0,20)] [complete] command bash gi
0.026 sec [ext/0/rel 40 (0,20)] [complete] 77
0.028 sec [ext/0/rel 81 (0,20)] [complete] 99
0.032 sec [ext/0/rel 0 (0,20)] [complete] display information domain use the host command
0.032 sec [ext/0/rel 13 (0,20)] [complete] command bash ~21
0.075 sec [ext/0/rel 15 (0,20)] [complete] 120
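Lists like the two above can be pulled out of the query log with a small script. The following is a sketch only: it assumes log lines of the form shown in this post, optionally preceded by a bracketed timestamp (an assumption on my part, since the excerpts here have none), and the names Entry, parse_line and slowest are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    seconds: float
    found: int
    index: str
    query: str


# Optional "[timestamp] " prefix (assumed), then the fields as shown in the post:
# "<secs> sec [<mode> <found> (<offset>,<limit>)] [<index>] <query>"
LINE = re.compile(
    r"^(?:\[[^\]]*\] )?"
    r"(?P<secs>\d+\.\d+) sec "
    r"\[\S+ (?P<found>\d+) \(\d+,\d+\)\] "
    r"\[(?P<index>[^\]]+)\] ?(?P<query>.*)$"
)


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None if it does not match the format."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return Entry(float(m["secs"]), int(m["found"]), m["index"], m["query"])


def slowest(entries: List[Entry], percent: int = 5) -> List[Entry]:
    """Return the slowest `percent` percent of entries, slowest first."""
    count = (len(entries) * percent + 99) // 100  # ceiling division
    return sorted(entries, key=lambda e: e.seconds, reverse=True)[:count]


SAMPLE = """\
0.023 sec [ext/0/rel 20385 (0,20)] [complete] a
0.019 sec [ext/0/rel 21 (0,20)] [complete] command & bash
0.088 sec [ext/0/rel 5 (0,20)] [complete] 122
0.000 sec [ext/0/rel 35 (0,20)] [complete] co*
"""

if __name__ == "__main__":
    entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
    for e in slowest(entries, percent=50):
        print(f"{e.seconds:.3f}s  {e.found:>6} matches  {e.query!r}")
```

With 500 logged searches and percent=5, slowest returns 25 entries, which is how the lists above were cut.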
The duration of the indexing and the size of the index files evolved as follows:
WITHOUT STOP WORDS FILE:
peterv@debian-new:~/data/a$ rake ultrasphinx:index
...
mkdir -p /var/sphinx/
Indexer --config .../config/ultrasphinx/development.conf --rotate complete
Sphinx 0.9.8-dev (r1065)
Copyright (c) 2001-2008, Andrew Aksyonoff
using config file '.../config/ultrasphinx/development.conf'
...
indexing index 'complete'...
collected 37403 docs, 28.7 MB
sorted 2.8 Mhits, 100.0% done
total 37403 docs, 28725950 bytes
total 5.044 sec, 5695477.54 bytes/sec, 7415.87 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=24624).
Index rotated ok
peterv@debian-new:/var/sphinx$ ls -l
..
-rw------- 1 peterv peterv 299224 2008-01-27 22:30 sphinx_index_complete.spa
-rw------- 1 peterv peterv 8554586 2008-01-27 22:30 sphinx_index_complete.spd
-rw------- 1 peterv peterv 234 2008-01-27 22:30 sphinx_index_complete.sph
-rw------- 1 peterv peterv 222995 2008-01-27 22:30 sphinx_index_complete.spi
-rw------- 1 peterv peterv 0 2008-01-27 22:30 sphinx_index_complete.spl
-rw------- 1 peterv peterv 0 2008-01-27 22:30 sphinx_index_complete.spm
-rw------- 1 peterv peterv 11162147 2008-01-27 22:30 sphinx_index_complete.spp
WITH STOP WORDS FILE:
peterv@debian-new:~/data/a$ rake ultrasphinx:index
...
mkdir -p /var/sphinx/
Indexer --config .../config/ultrasphinx/development.conf --rotate complete
Sphinx 0.9.8-dev (r1065)
Copyright (c) 2001-2008, Andrew Aksyonoff
indexing index 'complete'...
collected 37403 docs, 28.7 MB
sorted 1.6 Mhits, 100.0% done
total 37403 docs, 28725950 bytes
total 4.317 sec, 6653540.96 bytes/sec, 8663.33 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=24624).
Index rotated ok
peterv@debian-new:/var/sphinx$ ls -l
..
-rw------- 1 peterv peterv 299224 2008-01-27 22:34 sphinx_index_complete.spa
-rw------- 1 peterv peterv 5604585 2008-01-27 22:34 sphinx_index_complete.spd
-rw------- 1 peterv peterv 234 2008-01-27 22:34 sphinx_index_complete.sph
-rw------- 1 peterv peterv 217227 2008-01-27 22:34 sphinx_index_complete.spi
-rw------- 1 peterv peterv 0 2008-01-27 22:34 sphinx_index_complete.spl
-rw------- 1 peterv peterv 0 2008-01-27 22:34 sphinx_index_complete.spm
-rw------- 1 peterv peterv 6753769 2008-01-27 22:34 sphinx_index_complete.spp
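To compare the two listings, the file sizes can simply be summed. The sketch below copies the byte counts from the ls -l output above; the helper names total_bytes and reduction_percent are hypothetical.

```python
# File sizes in bytes, copied from the two `ls -l` listings above
# (the empty .spl and .spm files are left out).
WITHOUT_STOPWORDS = {
    "spa": 299224, "spd": 8554586, "sph": 234, "spi": 222995, "spp": 11162147,
}
WITH_STOPWORDS = {
    "spa": 299224, "spd": 5604585, "sph": 234, "spi": 217227, "spp": 6753769,
}


def total_bytes(sizes: dict) -> int:
    """Total size of all index files of one run."""
    return sum(sizes.values())


def reduction_percent(before: int, after: int) -> float:
    """Relative size reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    before = total_bytes(WITHOUT_STOPWORDS)
    after = total_bytes(WITH_STOPWORDS)
    print(f"without stop words: {before / 1e6:.1f} MByte")
    print(f"with stop words:    {after / 1e6:.1f} MByte")
    print(f"reduction:          {reduction_percent(before, after):.1f} %")
```

The attribute file (.spa) and header (.sph) are identical in both runs; the whole difference sits in the .spd, .spi and .spp files.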
At this time, I turned off Rhythmbox and restarted the Firefox instance to have less memory in swap. This will probably have influenced the results.
The STAR feature:
Now turning on the "star" feature, which allows matching on partial words (e.g. "openof*" should match "openoffice").
I will use no stop words file to allow easy comparison.
With the following settings in the default.base config file:
..
# Enable these if you need wildcard searching. They will slow down indexing significantly.
# min_infix_len = 1
enable_star = 1
min_prefix_len = 4
..
(I only wanted prefix searching, not infix searching; that is,
"programm*" should match "programmer", but "*rogrammer" should not match "programmer". Infix matching is also supported, though ...)
peterv@debian-new:~/data/a$ rake ultrasphinx:index
mkdir -p /var/sphinx/
Indexer --config ../ultrasphinx/development.conf --rotate complete
Sphinx 0.9.8-dev (r1065)
Copyright (c) 2001-2008, Andrew Aksyonoff
..
indexing index 'complete'...
collected 37403 docs, 28.7 MB
sorted 10.8 Mhits, 100.0% done
total 37403 docs, 28725950 bytes
total 15.377 sec, 1868081.48 bytes/sec, 2432.36 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=24624).
Index rotated ok
The index files have grown to:
peterv@debian-new:/var/sphinx$ ls -l
..
-rw------- 1 peterv peterv 299224 2008-01-27 22:49 sphinx_index_complete.spa
-rw------- 1 peterv peterv 33687906 2008-01-27 22:49 sphinx_index_complete.spd
-rw------- 1 peterv peterv 234 2008-01-27 22:49 sphinx_index_complete.sph
-rw------- 1 peterv peterv 911680 2008-01-27 22:49 sphinx_index_complete.spi
-rw------- 1 peterv peterv 0 2008-01-27 22:49 sphinx_index_complete.spl
-rw------- 1 peterv peterv 0 2008-01-27 22:49 sphinx_index_complete.spm
-rw------- 1 peterv peterv 43843258 2008-01-27 22:49 sphinx_index_complete.spp
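The growth factors relative to the baseline run (no stop words, no star) can be worked out from the indexer output and the listings above. A small sketch, with all numbers copied from this post; the IndexRun class and the growth helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRun:
    label: str
    seconds: float    # from the "total ... sec" line of the indexer output
    mhits: float      # from the "sorted ... Mhits" line
    index_bytes: int  # sum of the sphinx_index_complete.* file sizes


BASELINE = IndexRun(
    "no stop words, no star", 5.044, 2.8,
    299224 + 8554586 + 234 + 222995 + 11162147,
)
STAR = IndexRun(
    "enable_star = 1, min_prefix_len = 4", 15.377, 10.8,
    299224 + 33687906 + 234 + 911680 + 43843258,
)


def growth(before: float, after: float) -> float:
    """Growth factor from `before` to `after`."""
    return after / before


if __name__ == "__main__":
    print(f"index size:    x{growth(BASELINE.index_bytes, STAR.index_bytes):.2f}")
    print(f"indexing time: x{growth(BASELINE.seconds, STAR.seconds):.2f}")
    print(f"hits sorted:   x{growth(BASELINE.mhits, STAR.mhits):.2f}")
```

The number of hits and the index size grow by about the same factor, while the indexing time grows somewhat less.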
Strangely enough, I did not see an increase in search times (only checked qualitatively). The largest search time I saw was 0.022 sec.
The fun thing is that now I can do this:
0.001 sec [ext/0/rel 1409 (0,20)] [complete] c
0.001 sec [ext/0/rel 1409 (0,20)] [complete] c*
0.000 sec [ext/0/rel 35 (0,20)] [complete] co
0.000 sec [ext/0/rel 35 (0,20)] [complete] co*
0.000 sec [ext/0/rel 5 (0,20)] [complete] com
0.000 sec [ext/0/rel 5 (0,20)] [complete] com*
0.000 sec [ext/0/rel 10 (0,20)] [complete] comp
0.003 sec [ext/0/rel 5343 (0,20)] [complete] comp*
0.000 sec [ext/0/rel 0 (0,20)] [complete] compl
0.000 sec [ext/0/rel 1473 (0,20)] [complete] compl*
0.000 sec [ext/0/rel 0 (0,20)] [complete] comple
0.000 sec [ext/0/rel 1214 (0,20)] [complete] comple*
0.000 sec [ext/0/rel 0 (0,20)] [complete] complet
0.000 sec [ext/0/rel 793 (0,20)] [complete] complet*
0.000 sec [ext/0/rel 458 (0,20)] [complete] complete
0.000 sec [ext/0/rel 642 (0,20)] [complete] complete*
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed*
0.000 sec [ext/0/rel 0 (0,20)] [complete] completel
0.000 sec [ext/0/rel 130 (0,20)] [complete] completel*
0.000 sec [ext/0/rel 10 (0,20)] [complete] completely.
What happens is that with fewer than 4 characters, the * has no effect, but from 4 characters on, the * expands to all words that start with the typed prefix (compare 'comp*' with 5343 matches to 'compl*' with 1473). And that is an interesting feature that the major public search engines do not offer. At this time, with the relatively small database I expect initially for our project (< 10 MByte or so), it should not be a problem to keep indices with star expansion after 4 letters in memory.
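The behaviour in the log above, and the growth of the index, can be mimicked with a toy model in which every word is stored together with all of its prefixes of at least min_prefix_len characters. This is a simplified sketch and not Sphinx's actual implementation; the function names and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs, min_prefix_len=4):
    """Map every word, and every prefix of it with at least min_prefix_len
    characters, to the set of ids of the documents containing it."""
    exact = defaultdict(set)
    prefixes = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            exact[word].add(doc_id)
            for end in range(min_prefix_len, len(word) + 1):
                prefixes[word[:end]].add(doc_id)
    return exact, prefixes


def search(term, exact, prefixes, min_prefix_len=4):
    """Exact match, unless the term ends in '*' and is long enough."""
    term = term.lower()
    if term.endswith("*"):
        stem = term.rstrip("*")
        if len(stem) >= min_prefix_len:
            return set(prefixes.get(stem, set()))
        term = stem  # below the minimum prefix length the star has no effect
    return set(exact.get(term, set()))


DOCS = {  # made-up sample documents
    1: "complete the compile",
    2: "completely done",
    3: "comp time",
    4: "c code",
}

if __name__ == "__main__":
    exact, prefixes = build_index(DOCS)
    for term in ["c", "c*", "comp", "comp*", "compl*"]:
        print(f"{term:8} -> {sorted(search(term, exact, prefixes))}")
    print(len(exact), "exact keys,", len(prefixes), "prefix keys")
```

In this model a word of n letters adds n - 3 prefix entries on top of the word itself, which gives a feel for why the number of hits and the index files grow so much once prefixes are indexed.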
An issue that I still have is that the final '.' of a sentence is indexed as part of the last word, so that word is not found unless I append a '.' or '*' to the search term.