These days, hardly anyone explores an online store by wandering through the categories or scrolling down long lists of products.

There are plenty of onsite search tools that can make internal site search fast, intuitive and tailored to customer needs.

In this series of articles we are going to review the functionality of the most popular eCommerce onsite search solutions. And the first search toolkit on the list is Sphinx.

What is Sphinx?

Sphinx is an open source search engine with fast full-text search capabilities.

High indexing speed, flexible search capabilities, integration with the most popular database management systems (e.g. MySQL, PostgreSQL) and API support for various programming languages (e.g. PHP, Python, Java, Perl, Ruby, .NET and C++): all of this makes the search engine popular with thousands of eCommerce developers and merchants.

This is what makes Sphinx stand out:

  • high indexing performance (up to 10-15 MB/s on one core)
  • rapid search performance (up to 150-250 queries per second on a core with 1,000,000 documents)
  • high scalability (the biggest known cluster is capable of indexing up to 3,000,000,000 documents and can handle more than 50 million queries per day)
  • support of the distributed real-time search
  • simultaneous support of several fields (up to 32 by default) for full-text document search
  • the ability to support a number of extra attributes for every document (e.g. groups, timestamps, etc.)
  • support of stop words
  • the ability to handle both single-byte encodings and UTF-8
  • support of morphological search
  • and dozens more

All in all, Sphinx has more than 50 different features (and this number is constantly growing). Follow this link for an overview of the search engine's functionality.

How Sphinx Works

The way the search engine works can be summed up in 2 key points:

  • using the source table, Sphinx creates its own index database
  • next, when you send an API query, Sphinx returns an array of IDs that correspond to those in the source table.
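The two points above can be illustrated with a toy sketch in Python. This is not how Sphinx is implemented internally; the table contents and the function names (build_index, search) are made up for illustration. It only shows the general idea: build a separate index from the source rows, then answer queries with row IDs.

```python
from collections import defaultdict

# Hypothetical "source table": row ID -> product name.
SOURCE_TABLE = {
    1: "Aviator sunglasses",
    2: "Leather aviator jacket",
    3: "Running shoes",
}

def build_index(rows):
    """Map every lowercase word to the set of row IDs that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index

def search(index, query):
    """Return the sorted IDs of the rows that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

INDEX = build_index(SOURCE_TABLE)
print(search(INDEX, "aviator"))
```

The application then uses the returned IDs to load the full rows from the source table.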

Installing Sphinx on a Server

The installation procedure is pretty easy. Follow the links below for step-by-step installation instructions on:

This is a particular example of installing the search engine on CentOS:

When the installation is complete, Sphinx will create its config file. In the standard scenario the path to it is:

/etc/sphinx/sphinx.conf

If you are going to use Sphinx for several projects at the same time, it’s generally advised to create a separate folder for the config file, index and logs of each project.

E.g.

Config path – /etc/sphinx/searchsuite.yasha.web.ra/
Index path  – /var/lib/sphinx/searchsuite.yasha.web.ra/
Logs path  – /var/log/sphinx/searchsuite.yasha.web.ra/

Configuring Sphinx.conf File

The Sphinx config file consists of 4 sections:

  • Data Source
  • Index
  • Indexer
  • Search Daemon

Here is how you can configure each of them:

1. Data Source

2. Index

And here is what some of the settings from the list above mean:

Prefixes: indexing prefixes allows you to run wildcard searches with ‘wordstart*’ patterns. If the minimum prefix length (the min_prefix_len option) is set to a value greater than 0, the Indexer will index all the possible keyword prefixes (or, as we call them, word beginnings) in addition to the main keyword.

For example, with a minimum prefix length of 3, in addition to the keyword ‘example’ itself, Sphinx will add the extra prefixes ‘exa’, ‘exam’, ‘examp’ and ‘exampl’ to its index.

Note that prefixes shorter than the minimum allowed length will not be indexed.
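To make the prefix rule concrete, here is a small Python sketch. The function name keyword_prefixes is hypothetical and the code only mimics the behaviour described above; it is not Sphinx's own indexing code.

```python
def keyword_prefixes(word, min_prefix_len):
    """List the prefixes of `word` that are at least `min_prefix_len`
    characters long, followed by the full word itself."""
    if min_prefix_len <= 0:
        return [word]  # prefix indexing is switched off
    prefixes = [word[:n] for n in range(min_prefix_len, len(word))]
    return prefixes + [word]

print(keyword_prefixes("example", 3))
```

With a minimum length of 3 this yields the same prefixes as in the example above, while ‘e’ and ‘ex’ are skipped as too short.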


Infixes: Sphinx can also include infixes (aka word parts) in its index. E.g. with a minimum infix length of 2, indexing the keyword “test” will add its parts “te”, “es”, “st”, “tes” and “est” in addition to the main word.
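The same idea for infixes, again as a rough Python sketch rather than Sphinx's actual code (keyword_infixes is a made-up name, and the minimum length of 2 is an assumption that matches the “test” example):

```python
def keyword_infixes(word, min_infix_len):
    """List every distinct part of `word` that is at least `min_infix_len`
    characters long, shortest first, followed by the full word itself."""
    if min_infix_len <= 0:
        return [word]  # infix indexing is switched off
    parts = []
    for length in range(min_infix_len, len(word)):
        for start in range(len(word) - length + 1):
            part = word[start:start + length]
            if part not in parts:
                parts.append(part)
    return parts + [word]

print(keyword_infixes("test", 2))
```

Note how quickly the number of parts grows with the word length, which is one reason infix indexing makes the index larger.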

IMPORTANT! It’s not possible to enable these 2 settings at the same time. If you do, you’ll get a fatal error during indexing.

Also, enabling either of these 2 settings can significantly slow down both indexing and search, especially when working with big data volumes.

3. Indexer

To configure the Indexer, you just need to set an appropriate limit on the memory that the Indexer is allowed to use.

4. Search Daemon

Here are the general Sphinx Search Daemon settings (supplied with the explanatory comments).

Morphology

After the text is split into separate words, the morphology preprocessors come into play.

These mechanisms can replace different forms of the same word with the basic, aka ‘normal’ one. This approach lets the search engine match the search query against all forms of the same word in the index.

When Sphinx morphology algorithms are enabled, the search engine returns the same search results for different forms of a word. E.g. the results may be identical for both ‘laptop’ and ‘laptops’.

Sphinx supports 3 types of morphology preprocessors:

  • Stemmer
  • Lemmatizer
  • Phonetic algorithms

1. Stemmer

It’s the easiest and fastest morphology preprocessor. It lets the search engine find the word’s stem (a part of a word that remains unchanged for all its forms) without using any extra morphological dictionaries.

Basically, the Stemmer removes or replaces certain word suffixes and/or endings.

This morphology preprocessor works fine for most search queries. However, there are some exceptions. For instance, with this method, ‘set’ and ‘setting’ will be treated as 2 separate queries.

Also, the preprocessor can treat words that have different meanings but the same stem as identical.
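The following toy stemmer in Python shows both the strength and the weakness of this approach. It is a deliberately naive suffix stripper written for illustration only; the real stemmers shipped with Sphinx use far more elaborate rules, and the suffix list and the name toy_stem are assumptions.

```python
# Suffixes the toy stemmer knows about, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("laptops", "walking", "news", "new", "setting"):
    print(w, "->", toy_stem(w))
```

Here ‘laptops’ and ‘laptop’ end up with the same stem, but so do ‘news’ and ‘new’, which mean different things. This toy version also turns ‘setting’ into ‘sett’, which does not match ‘set’.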

To enable the Stemmer, set the morphology option in the Index section, for example, for English:

morphology = stem_en

2. Lemmatizer

Unlike the Stemmer, this morphology preprocessor uses morphological dictionaries, which lets the search engine reduce a keyword to its lemma. The lemma is a proper, natural language root word.

E.g. the search query ‘settings’ will be reduced to its infinitive form ‘set’.

To use the Lemmatizer, you need to download the morphological dictionaries. You can do that on the official website at sphinxsearch.com.

In the Indexer block of the config file you can find the lemmatizer_base option. This option lets you specify the path to the folder where you store all the dictionaries.

When done, you need to set the morphology option to one of the built-in values, either lemmatize_en or lemmatize_en_all. In the latter case, Sphinx will apply the Lemmatizer and index all the possible root forms of a word.

3. Phonetic algorithms

At the moment, Sphinx supports 2 phonetic algorithms: Soundex and Metaphone.
Currently, they both work for the English language only.

Basically, these algorithms replace the words of the search query with specially crafted phonetic codes. This lets the search engine treat words that differ in spelling and meaning but sound alike as the same.

This way of search can be of great help when searching by a customer’s name/ surname.
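To show what such a phonetic code looks like, below is a sketch of the classic American Soundex algorithm in Python. Treat it as an illustration of the idea only: the implementation built into Sphinx may differ in details, and the function name soundex is simply a convenient label.

```python
# Consonant groups of classic American Soundex and their digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of `name` ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = digit
    return (code + "000")[:4]

for name in ("Smith", "Smyth", "Robert", "Rupert"):
    print(name, soundex(name))
```

Names that are spelled differently but sound alike, such as ‘Smith’ and ‘Smyth’, receive the same code and are therefore matched by the same query.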

To enable the phonetic algorithms, you need to set the morphology option to either soundex or metaphone, e.g.:

morphology = metaphone

Stop Words

The stopwords feature in Sphinx lets the search engine ignore certain keywords when building an index and running searches.

All you need is to make a file with all your stop words, upload it to the server and set a path for Sphinx to find it.

When creating a list of stop words, it’s generally recommended to include the keywords that occur in the text so frequently that they have no influence on search results. As a rule, these are articles, prepositions, conjunctions, etc.
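As a rough illustration of this advice, the Python sketch below picks the most frequent words from a few sample texts and then drops them from a query. The sample texts and the function names are invented for the example; in a real setup the stop words live in a plain text file that Sphinx reads.

```python
from collections import Counter

# Hypothetical product descriptions used as sample data.
DOCS = [
    "the red case for the phone",
    "a case for the tablet",
    "the charger for a laptop",
]

def most_frequent_words(texts, top_n):
    """Return the `top_n` most frequent words across all texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(top_n)]

def remove_stop_words(text, stop_words):
    """Split the text into words and drop every stop word."""
    return [word for word in text.lower().split() if word not in stop_words]

STOP_WORDS = set(most_frequent_words(DOCS, 2))
print(sorted(STOP_WORDS))
print(remove_stop_words("the case for the phone", STOP_WORDS))
```

Frequency alone is only a starting point: it is worth reviewing the candidates by hand so that a popular product keyword does not end up in the stop list.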

With the help of the Indexer it’s also possible to build a frequency dictionary, in which all the indexed keywords are sorted by how often they occur. The indexer provides the --buildstops option for that (add --buildfreqs to include the counts).

Word Forms

The wordforms feature in Sphinx enables the search engine to deliver the same search results no matter which word form of the search query is used. E.g. customers who are looking for ‘iphone 6’ or ‘i phone 6’ will get the same results.

This functionality comes in really useful if you need to define the normal word form in cases when the Stemmer can’t do it. Also, with a file that lists all the word forms, you can easily set up a dictionary of search synonyms.

These dictionaries are used to normalize words both during indexing and when running searches. Hence, to apply changes made to the wordforms file, you need to re-index.

An example of the file:

walks > walk
walked > walk
walking > walk



Note that starting with version 2.1.1 it’s possible to use “=>” instead of “>”. Starting with version 2.2.4 you can also use multiple destination tokens:

s02e02 => season 2 episode 2
s3 e3 => season 3 episode 3
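The effect of such a file can be sketched in Python as follows. This is a simplified model, not Sphinx's parser: it handles only single-word source forms (so a line like ‘s3 e3 => …’ is left out of the sample), and the names load_wordforms and normalize are hypothetical.

```python
# Sample wordforms content in the format shown above.
WORDFORMS = """\
walks > walk
walked > walk
walking > walk
s02e02 => season 2 episode 2
"""

def load_wordforms(text):
    """Parse 'source > target' and 'source => target' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        separator = "=>" if "=>" in line else ">"
        source, found, target = line.partition(separator)
        if found and source.strip() and target.strip():
            mapping[source.strip()] = target.strip()
    return mapping

def normalize(query, mapping):
    """Replace every query word that has a known word form."""
    words = []
    for word in query.lower().split():
        words.extend(mapping.get(word, word).split())
    return words

FORMS = load_wordforms(WORDFORMS)
print(normalize("walking s02e02", FORMS))
```

Because the same mapping has to be applied to the indexed text and to the query, a change in the file only takes effect after the data is indexed again.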

Main Sphinx Commands

And finally, below you can find the list of the commands used for different operations with the search engine:

1. Editing Sphinx config file:
vi /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf

2. Indexing data from the targeted config sources:
sudo -usphinx indexer --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --all --rotate

3. Launching the Search Daemon:
sudo -usphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf

4. Disabling the Search Daemon:
sudo -usphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --stop

5. Checking whether the search engine is functioning correctly (making a request to already created indexes):
sudo -usphinx search --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf aviator
(instead of ‘aviator’ you can use any other word).

Working with API

Bottom Line

In this tutorial, I’ve tried to outline the main aspects of setting up and configuring Sphinx.

As you can see, by using this search engine, you can easily add a custom search to your Magento website.

Questions?

Feel free to leave a comment and I’ll get back to you. 🙂