NOTE: This article was written in March 2017.

Status:

The current stable version of Sphinx is 2.2.11. Sphinx segments English and other alphabetic languages on spaces, so it does not support Chinese word segmentation out of the box; the official scheme only splits Chinese text character by character. Two Chinese word-segmentation solutions have been built on top of Sphinx in China. One is Sphinx-for-Chinese, which has no official website and limited documentation; the latest Sphinx version it supports appears to be 1.10. The other is Coreseek, which is officially still under maintenance but does not seem ready to release its latest version as open source. The last open-source stable release of Coreseek is 3.2.14, from mid-2010, based on Sphinx 0.9.9, which does not support string attributes (Sphinx 2.0.2, by contrast, already supports them). Coreseek has more documentation and wider adoption online, so I went with the Coreseek solution. For the time being I am using the stable Coreseek 3.2.14, though I later found the 4.1 beta more appropriate. Note: if you use Coreseek, mind its underlying Sphinx version, and read the Sphinx documentation for that version rather than the latest one.

Build:

CoreSeek: the official CoreSeek download links are dead, so you need to find a copy elsewhere on the Internet. The installation documentation provided by Coreseek is very detailed. Since we are not trying to replace MySQL's built-in full-text search, there is no need to install the Sphinx plugin for MySQL.

Installing the PHP Sphinx extension: the documentation for calling Sphinx from PHP is included directly in the official Sphinx documentation, which is quite convenient. The extension's installation procedure is not covered there, but it is not difficult and there are plenty of guides online, so I will not expand on it here. The extension compiles into two .so files (your paths may of course differ):

/usr/local/php/lib/php/extensions/no-debug-non-zts-20131226/sphinx.so
/usr/local/lib/libsphinxclient-0.0.1.so

You also need to enable the extension in php.ini:

extension=sphinx.so

Using MongoDB as the data source:

The most common combination with Sphinx is MySQL + PHP, so a non-MySQL data source has to solve the data import problem itself. With MongoDB there are two main issues to address: importing the data into the Sphinx index, and mapping the Mongo ObjectId to the Sphinx document ID.

The first problem is easy to solve, because besides MySQL, Sphinx also supports XML and Python data sources. Even so, I recommend using MySQL as a transit store for the Mongo data: the XML data source cannot fetch data in batched steps, so performance becomes a serious problem, and the Python data source requires compiling an extra project, which I failed to build after much effort and could find almost no documentation for, so I gave up on it.

The second problem stems from an important Sphinx restriction: every item in the index must have a "unique, non-zero, 32-bit integer" as its document ID. The Mongo ObjectId is a 24-character hexadecimal string; converted from hexadecimal to decimal, it is a huge number that cannot fit even in a 64-bit int. Since Sphinx 1.10, the Mongo ObjectId could be stored in the index as a string attribute. However, even in the latest version the official documentation notes that string attributes are kept in memory rather than in the index file, which is a performance concern for large data sets: if an int attribute will do, avoid string attributes. And Sphinx 0.9.9 does not support string attributes at all, only int, bigint, bool and so on. I am using Coreseek 3.2.14, i.e. Sphinx 0.9.9, so a workaround is definitely needed.

The eventual approach: split the 24-character hexadecimal ObjectId into four segments of six characters each. Each segment, converted to decimal, falls within the range of a 32-bit uint. The four ObjectId fragments are indexed by Sphinx as attributes, and after a query the results are reassembled into the Mongo ObjectId. The Sphinx document ID itself is a meaningless auto-incremented primary key.
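To make the split concrete, here is a minimal PHP sketch of the round trip; the helper names are mine, not the project's:

// Split a 24-character hex ObjectId into four integers, each from six hex
// characters (24 bits), so each fits comfortably in a 32-bit uint.
function splitObjectId($object_id) {
    return array_map('hexdec', str_split($object_id, 6));
}

// Reassemble the four integers into the original ObjectId string.
// dechex() drops leading zeros, so each segment is re-padded to six chars.
function joinObjectId(array $segments) {
    $hex = "";
    foreach ($segments as $segment) {
        $hex .= str_pad(dechex($segment), 6, "0", STR_PAD_LEFT);
    }
    return $hex;
}

$segments = splitObjectId("507f1f77bcf86cd799439011");
// => array(5275423, 7847160, 7133081, 4427793), each within uint32 range
assert(joinObjectId($segments) === "507f1f77bcf86cd799439011");

The re-padding in joinObjectId matters: a segment such as "0007f1" would otherwise come back as "7f1" and corrupt the ObjectId.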

Full-text retrieval as a system service:

Split the full-text search service into a separate project that exposes an IP and port. The following functions need to be implemented:

1. Adding or modifying an index, driven entirely by a single file (hereinafter the driver file), which covers:

* data source -> mysql: transfer data from mongo to mysql

* Generate Sphinx index conf: generate the Sphinx index configuration file

* mysql -> Sphinx (create index): generate the index from the MySQL data and the Sphinx configuration file

  2. A single bash script that implements index updates and index rebuilds, for crontab to call
  3. Queries automatically return the fields described in the driver file, including the database name and table name of the data in Mongo, for reverse lookup

The core difficulty lies in the driver file strategy.

Plan A:

The data flow is mongo -> mysql -> sphinx, which requires carrying two things across the three systems:

  • Field type conversion
  • Field value transfer

So the first idea is to abstract the notion of a field and let it bridge the three systems. A field abstract class provides interfaces that return the corresponding field type in Mongo, MySQL, and Sphinx respectively, plus an interface that maps field values among the three. Three field types were determined initially:

  • attr_field: maps a string field to a Sphinx full-text search field
  • attr_object_id: maps a Mongo ObjectId field to Sphinx int attributes
  • attr_int: maps int fields to Sphinx attributes (for sorting, filtering, grouping)

The driver file uses a common data format such as JSON or XML (JSON was chosen in the end). Since one index may have multiple data sources, the driver file must allow configuring multiple sources (a JSON array), like this:

{
  "name": "example_index",
  "source": [
    {
      "database": "db_name",
      "table": "table_name",
      "attrs": [
          { "mongo": "text1", "type": "field" },
          { "mongo": "text2", "type": "field" },
          { "mongo": "_id", "type": "objectId" },
          { "mongo": "type", "type": "int" },
          { "mongo": "someId", "type": "int" },
          { "mongo": "createTime", "type": "int" },
          { "mongo": "status", "type": "int" }
        ]
    }
  ]
}

Configure one JSON file in this format for each index, parse all the JSON files, and run the mongo -> mysql -> Sphinx process.
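For illustration, here is a rough sketch of how one "source" entry of such a driver file could be expanded into the MySQL transit table. The helper below is hypothetical, not the project's actual code; note how a single objectId entry becomes four int columns:

// Hypothetical helper: derive the transit table's columns from a driver
// file "source" entry. The id column doubles as the Sphinx document ID.
function buildMysqlColumns(array $source) {
    $columns = array("id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY");
    foreach ($source["attrs"] as $attr) {
        $name = $attr["mongo"];
        switch ($attr["type"]) {
            case "field":     // full-text field, indexed by Sphinx
                $columns[] = "`$name` VARCHAR(255)";
                break;
            case "objectId":  // one Mongo field expands to four int columns
                for ($i = 1; $i <= 4; $i++) {
                    $columns[] = "`{$name}{$i}` INT UNSIGNED";
                }
                break;
            case "int":       // plain int attribute for sorting/filtering
                $columns[] = "`$name` INT UNSIGNED";
                break;
        }
    }
    return $columns;
}

$driver = json_decode(file_get_contents("example_index.json"), true);
$cols   = buildMysqlColumns($driver["source"][0]);
$sql    = "CREATE TABLE `{$driver['name']}` (\n  ".implode(",\n  ", $cols)."\n)";

With "_id" of type objectId, this yields columns _id1 through _id4, the naming the MongoIdParser below expects.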

I finished coding the field abstraction and the mongo -> MySQL part. While writing it, and on further thought, I found this abstraction has the following drawbacks:

  • Complicated coding: the mapping rules for the int type are simple enough, but a field such as object_id maps one Mongo field to four MySQL fields, which forces the field abstraction to be defined uniformly as a one-to-many mapping and adds complexity. With the field as the basic unit, the code needs repeated, multi-level traversals, and complexity keeps growing.
  • Gaps in the field interface's common behavior: beyond forcing all fields into one-to-many mappings as above, I realized while working with a real Mongo table that even if you restrict one Mongo table to map to one Sphinx index, some full-text index fields live in other tables. For example, the tags of an activity table are stored as an array of IDs, so the tag table has to be queried during the dump. That can only be handled by writing a dedicated implementation class of the field interface, and that class is usable for the single tag field alone. This abstraction therefore breeds an overabundance of loosely related concrete classes.
  • Only the mongo -> mysql -> Sphinx pipeline is supported; other kinds of data sources cannot reuse this abstraction.

Given these defects, I decided to abandon this plan (after three days of work spent on it).

Plan B:

Thinking again about the application scenario, the model can be simplified as follows:

  • The third planned function, "queries automatically return the fields described in the driver file, including the database name and table name of the data in Mongo, for reverse lookup", was meant to make the service completely transparent to the caller: the caller would not need to know which fields are indexed, and could fetch the underlying data from Mongo based solely on the query results. But achieving a complete black box takes too much work, such as adding interfaces to the driver file that describe the data returned from a search, and interfaces that reverse-map certain fields (such as the Mongo ObjectId). Simplify it as follows:

    1. Generate a static help page (manual) for each index from the driver file, listing the indexed fields on that page. This is an acceptable implementation, and it sheds a lot of the driver file's responsibilities: it focuses only on index creation, not index lookup.

    2. Write an index query interface, and define a field transformation interface that converts the queried Sphinx attributes back into the desired data.
  • Since a reverse mapping to the data source no longer has to be created for every field, the field no longer needs to be the unit of abstraction. The driver file cares only about index creation, so the steps of creating an index can serve as the unit of abstraction instead. Compared with field-based abstraction, step-based abstraction has these disadvantages:

    – Driver files are no longer static: each driver file (one per index) must contain code, and new code must be written whenever a driver file is added.

    – Since index maintenance is separated from index querying, changing an attribute of an index means changing two files: the driver file and the query field mapping rules;

    – Low level of abstraction: driver files share fewer common parts.

    And these advantages:

    – Simple implementation (no over-design);

    – Flexible adaptation to other types of data sources.
  • Plan A introduced JSON arrays to support a Sphinx index whose data comes from multiple Mongo databases and tables. In Plan B, a Sphinx index always reads its data from a single MySQL table, so the pipeline is fixed to MySQL + Sphinx: the mongo -> MySQL implementation is given more freedom, and all the other steps can be implemented uniformly.

This plan breaks the project into two unrelated parts: a bash-script-driven indexing toolset (rebuilding the Sphinx conf file, updating indexes, importing data, etc.), and a RESTful API for index queries served by NGINX + Phalcon.

Indexing manipulation toolset:

In this plan, all driver files implement the following interface:

/**
 * @author lx
 * date: 2016-11-30
 *
 * This interface represents a Sphinx index item. It is used to complete
 * the following tasks:
 * data source => mysql
 * create sphinx searchd.conf
 * refresh sphinx index with searchd.conf
 * create manual (static web page)
 */
interface SphinxIndexItem {

    /**
     * The name of the index.
     */
    public function getIndexName();

    /**
     * The fields of the index, as an array of IndexField.
     */
    public function getIndexFields();

    /**
     * @param $last_refresh_time the time in seconds when the index was last rebuilt
     * @return true if the index needs to be rebuilt; false otherwise
     */
    public function shouldFreshIndex($last_refresh_time);

    /**
     * Fetches a batch of rows from the data source.
     * The second dimension is an array of key-value pairs that represent
     * the fields and their values for each row of data:
     * array(
     *     array("id" => "1", "type" => "404", "content" => "I'm not an example"),
     *     array("id" => "2", "type" => "500", "content" => "example sucks"),
     *     array("id" => "3", "type" => "502", "content" => "what's the point /_\"),
     * )
     */
    public function getValues($offset, $limit);

    /**
     * Generates the manual (static help page) for this index.
     */
    public function generateDocument();
}

Fields are represented by the following class:

/**
 * @author lx
 * date: 2016-11-30
 */
class IndexField {

    private $name;
    private $mysql_type;
    private $sphinx_type;

    /**
     * Creates an IndexField that is a Sphinx int attribute.
     */
    public static function createIntField($name) {
        return new IndexField($name, "int", "sql_attr_uint");
    }

    /**
     * Creates an IndexField that is a Sphinx full-text field.
     */
    public static function createField($name, $char_length = 255) {
        return new IndexField($name, "varchar($char_length)", null);
    }

    public function __construct($name, $mysql_type, $sphinx_type = null) {
        $this->name = $name;
        $this->mysql_type = $mysql_type;
        $this->sphinx_type = $sphinx_type;
    }

    public function getName() {
        return $this->name;
    }

    /**
     * Gets the type of this field in the MySQL database, as it appears in a
     * MySQL CREATE statement. Examples:
     * int
     * varchar(255)
     */
    public function getMysqlType() {
        return $this->mysql_type;
    }

    /**
     * Gets the type of this field in the Sphinx conf file.
     * If this is a full-text index field, this function returns null.
     * Example: sql_attr_uint
     */
    public function getSphinxType() {
        return $this->sphinx_type;
    }

    public function isSphinxField() {
        return empty($this->sphinx_type);
    }
}

All data sources that need indexing are abstracted into the driver files described above, and all driver files live in a single folder. A script scans the folder and, from the list of driver files, rebuilds the Sphinx index configuration file, updates the indexes (full and incremental), schedules crontab tasks, and so on. When a new data source needs indexing in the future, or an existing one is adjusted, only the driver file needs to change.

Indexing operations decompose into three classes:

  • MySQLTransmitter: imports data into MySQL
  • SphinxConfGenerator: rebuilds the Sphinx configuration file (it can only rebuild, not update in place)
  • DocumentGenerator: creates the manual page for each index
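As a sketch of what SphinxConfGenerator emits, a "source" block of the conf file can be rendered straight from an IndexField list. This is a simplified sketch with placeholder connection settings; the real generator also has to deal with incremental sources, searchd settings, and so on:

// Render one "source" block of sphinx.conf from a list of IndexField.
// Connection settings here are placeholders.
function renderSphinxSource($index_name, array $fields) {
    $attr_lines = "";
    $col_names  = array();
    foreach ($fields as $field) {
        $col_names[] = $field->getName();
        if (!$field->isSphinxField()) {
            // attribute declaration, e.g. "sql_attr_uint = type"
            $attr_lines .= "    ".$field->getSphinxType()." = ".$field->getName()."\n";
        }
    }
    $query = "SELECT id, ".implode(", ", $col_names)." FROM `$index_name`";
    return "source src_$index_name\n".
           "{\n".
           "    type      = mysql\n".
           "    sql_host  = localhost\n".
           "    sql_user  = sphinx\n".
           "    sql_pass  = ...\n".
           "    sql_db    = sphinx_transit\n".
           "    sql_query = $query\n".
           $attr_lines.
           "}\n";
}

Full-text fields need no declaration of their own: in a Sphinx source, any selected column not declared as an attribute is indexed as full text.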

Then a unified entry script calls the tool classes above, together with Sphinx's built-in tools (searchd, indexer, etc.), to perform all index-related operations. This part is fully implemented and currently works well.
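The entry script's control flow is roughly the following. This is a hypothetical condensation: the loader convention, the tool-class method names, and the last_refresh_time helper are made up for illustration:

// Hypothetical sketch of the unified entry script.
$drivers = array();
foreach (glob("/path/to/drivers/*.php") as $file) {
    // assume each driver file returns an instance of the index interface
    $drivers[] = require $file;
}

$transmitter = new MySQLTransmitter();
$generator   = new SphinxConfGenerator();
$documents   = new DocumentGenerator();

foreach ($drivers as $driver) {
    if ($driver->shouldFreshIndex(last_refresh_time($driver->getIndexName()))) {
        $transmitter->transmit($driver);   // data source => mysql
        $documents->generate($driver);     // regenerate the manual page
    }
}
$generator->rebuild($drivers);             // rebuild sphinx.conf from all drivers

// hand off to Sphinx's own tools: rebuild indexes and signal searchd
exec("indexer --all --rotate");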

Index query:

With Plan B adopted, a set of reverse mapping rules for indexed attributes needs to be developed.

For example, the Mongo ObjectId was broken into four int values when the data source was imported; now the four ints must be joined back into a usable ObjectId for the follow-up Mongo query. Or a field code may need to be zero-padded before it matches the corresponding field in Mongo.

This is a many-to-many mapping problem: converting several attributes returned by Sphinx into several attributes of another shape. So define the following interface:

/**
 * Converts one or more attributes of a Sphinx query result into one or more
 * attributes of the final result set.
 * @author lx
 */
interface FieldParser {

    /**
     * Declares the names of the Sphinx attributes to be converted.
     * The values of these attributes are passed as arguments to parseValues().
     * @return array of attribute names, e.g. array("id1", "id2", "id3")
     */
    function getRequiredKeys();

    /**
     * Converts the values of the selected attributes.
     * The result of the conversion is returned as an array of key-value pairs.
     * @param array $values e.g. array("id" => "123", "id_ext" => 456)
     */
    function parseValues(array $values);
}

Concrete implementations of this interface are appended to an array (a queue), which is traversed one parser at a time to transform the result set returned by Sphinx.
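For instance, the zero-padded code field mentioned above could get a small parser of its own. A hypothetical implementation for illustration:

// Hypothetical parser: restores a zero-padded string code from the int
// attribute that Sphinx returns.
class ZeroPadParser implements FieldParser {

    private $field_name;
    private $width;

    public function __construct($field_name, $width = 8) {
        $this->field_name = $field_name;
        $this->width = $width;
    }

    public function getRequiredKeys() {
        return array($this->field_name);
    }

    public function parseValues(array $values) {
        // e.g. 42 -> "00000042", matching the string form stored in Mongo
        $padded = str_pad($values[$this->field_name], $this->width, "0", STR_PAD_LEFT);
        return array($this->field_name => $padded);
    }
}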

Take Mongo’s ObjectId as an example, and its specific transformation class is implemented as follows:

class MongoIdParser implements FieldParser {

    private $field_name;
    private $required_fields;

    public function __construct($field_name) {
        $this->field_name = $field_name;
        $this->required_fields = array(
            $this->field_name."1", $this->field_name."2",
            $this->field_name."3", $this->field_name."4",
        );
    }

    /**
     * {@inheritDoc}
     * @see FieldParser::getRequiredKeys()
     */
    public function getRequiredKeys() {
        return $this->required_fields;
    }

    /**
     * {@inheritDoc}
     * @see FieldParser::parseValues()
     */
    public function parseValues(array $values) {
        $mongoId = $this->buildMongoId(
            $values[$this->field_name."1"],
            $values[$this->field_name."2"],
            $values[$this->field_name."3"],
            $values[$this->field_name."4"]);
        return array($this->field_name => $mongoId);
    }

    private function buildMongoId($_id1, $_id2, $_id3, $_id4) {
        $id = $this->toHex($_id1).$this->toHex($_id2).$this->toHex($_id3).$this->toHex($_id4);
        if (strlen($id) != 24) {
            return "";
        } else {
            return $id;
        }
    }

    private function toHex($_id) {
        $hex_str = dechex($_id);
        $count = strlen($hex_str);
        if ($count < 1 || $count > 6) {
            return "";
        }
        if ($count < 6) {
            for ($i = 0; $i < 6 - $count; $i ++) {
                $hex_str = "0".$hex_str;
            }
        }
        return $hex_str;
    }
}

With the above interfaces in place, define a class for querying Sphinx that is easy to call.

Because Sphinx's PHP support is already extremely friendly, there is really not much to encapsulate beyond the attribute-value conversion described above. But because we love fluent calls so much, the Sphinx calls are wrapped in a fluent (chained) interface, as follows:

/**
 * @author lx
 * date: 2016-11-25
 * Utility class for easy access to the sphinx search api.
 */
class EcoSearch {

    private $sphinx;
    private $query_index;
    private $field_parsers = array();

    /**
     * Construct with sphinx searchd ip and port.
     * @param string $ip sphinx searchd ip
     * @param int $port sphinx searchd port
     */
    public function __construct($ip, $port) {
        $this->sphinx = new SphinxClient();
        $this->sphinx->SetServer($ip, $port);
        $this->sphinx->SetMatchMode(SPH_MATCH_ANY);
    }

    /**
     * Static factory, the entry point for fluent calls.
     */
    public static function on($ip = "127.0.0.1", $port = 9312) {
        return new EcoSearch($ip, $port);
    }

    public function setMatchAll() {
        $this->sphinx->SetMatchMode(SPH_MATCH_ALL);
        return $this;
    }

    public function setMatchAny() {
        $this->sphinx->SetMatchMode(SPH_MATCH_ANY);
        return $this;
    }

    public function setSortBy($attr, $asc = true) {
        if (!empty($attr) && is_string($attr)) {
            $mode = $asc ? SPH_SORT_ATTR_ASC : SPH_SORT_ATTR_DESC;
            $this->sphinx->SetSortMode($mode, $attr);
        }
        return $this;
    }

    public function setMongoIdName($mongo_id_name) {
        return $this->addFieldParser(new MongoIdParser($mongo_id_name));
    }

    public function addQueryIndex($index) {
        if (!empty(trim($index))) {
            $this->query_index = $this->query_index." ".$index;
        }
        return $this;
    }

    public function addFilter($attr, $values, $exclude = false) {
        $this->sphinx->SetFilter($attr, $values, $exclude);
        return $this;
    }

    public function addFilterRange($attr, $min, $max, $exclude = false) {
        $this->sphinx->SetFilterRange($attr, $min, $max, $exclude);
        return $this;
    }

    public function setLimits($offset, $limit) {
        $this->sphinx->SetLimits($offset, $limit);
        return $this;
    }

    public function addFieldParser($field_parser) {
        if ($field_parser instanceof FieldParser) {
            $this->field_parsers[] = $field_parser;
        }
        return $this;
    }

    public function query($str) {
        if (empty(trim($this->query_index))) {
            $this->query_index = "*";
        }
        Logger::dd("search [$str] from index {$this->query_index}");
        $result_set = $this->sphinx->Query($str, $this->query_index);
        $error = $this->sphinx->GetLastError();
        if ($error) {
            Logger::ww("search [$str] from index {$this->query_index}, last error: $error");
        }
        $ret = array();
        if (is_array($result_set) && isset($result_set['matches'])) {
            foreach ($result_set['matches'] as $result) {
                $ret_values = array();
                $values = $result['attrs'];
                foreach ($this->field_parsers as $parser) {
                    $parsed_values = $this->getParsedValues($parser, $values);
                    $ret_values = array_merge($ret_values, $parsed_values);
                }
                // keep the attributes no parser has claimed
                $ret_values = array_merge($ret_values, $values);
                $ret[] = $ret_values;
            }
        }
        return $ret;
    }

    private function getParsedValues($parser, &$values) {
        $ret = array();
        $required_keys = $parser->getRequiredKeys();
        if (!empty($required_keys)) {
            $required_values = array();
            foreach ($required_keys as $key) {
                // collect the values the parser asked for
                $required_values[$key] = $values[$key];
                // drop the already-parsed keys from the raw attributes
                unset($values[$key]);
            }
            if (!empty($required_values)) {
                $ret = $parser->parseValues($required_values);
            }
        }
        return $ret;
    }
}

A full-text retrieval call might look like this:

$search = $_POST["search"];   // the query string
$offset = ($_POST["page"] - 1) * $_POST["pageSize"];
$limit  = $_POST["pageSize"];
$search_result = EcoSearch::on()
    ->addQueryIndex("index_name")
    ->setMatchAll()
    ->setSortBy("createTime", false)
    ->setLimits($offset, $limit)
    ->setMongoIdName("_id")
    ->query($search);
if (empty($search_result)) {
    // respond with an empty result set
} else {
    $result = array();
    foreach ($search_result as $r) {
        $result[] = query_mongo_by_id(new MongoDB\BSON\ObjectID($r['_id']));
    }
    // respond with the result set
}

Sphinx also provides weighting, grouping, AddQuery and so on, which this project does not currently use, so this query helper class is sufficient for now.

Postscript:

Following the ideas above, the overall framework of the project has been built; the concrete implementations of each interface class still need to be filled in. I have only written down the general idea, as it came to me (mostly on a plane...), so please treat this as rough notes and forgive me ~.

Reference:

Sphinx website
Coreseek website


Second postscript:

Originally, when the leader asked for Sphinx, he said supporting non-real-time indexes would be enough; then the whole thing changed, and a real-time index was required. A real-time index means the backend has to insert a copy into Sphinx whenever data enters the database, but the leader required that the main framework not be affected, so I tried to find the differing data asynchronously and insert it into Sphinx. But, but, but... PHP does not support asynchrony... What a pity...

After a few struggles, I decided to abandon the PHP code altogether and rewrote it in Python along the lines of the above, with several improvements:

  • The Mongo ObjectId is no longer split into four six-character segments, but into the four meaningful integer values defined by the ObjectId itself (timestamp, machine id, process id, counter).
  • Streaming generation/reading of XML documents is implemented in Python, so the MySQL transit is no longer needed.
  • The process is improved to be more automated.
  • An incremental indexing mechanism is introduced, avoiding long single rebuilds of the whole index.
  • SphinxQL is introduced to support real-time indexes.
  • An API server is built with Flask to decouple from the main framework.

I’ll write about this Python framework when I have time.

I have since started building with Elasticsearch, and Sphinx now feels less adequate for my needs: it has no plugin mechanism like Elasticsearch's, so extending it means changing the source code. But if you use both search frameworks, you will find that Sphinx uses far fewer resources than Elasticsearch. Uh... at least at my scale.