Sphinx overview

Sphinx is an open-source full-text search engine. Built on its own, it supports full-text search for English, so a standalone Sphinx build already gives you full-text indexes. Often, though, we need to index Chinese text. For that there is Coreseek, a Sphinx-based Chinese full-text search engine intended for enterprise use. In other words, the core of Coreseek is still Sphinx. How the versions of the two correspond is covered in the installation notes.

Sphinx can be set to unigram (single-character) segmentation mode to support Chinese search. In practice, Sphinx is faster than Coreseek for non-Chinese search, and with unigram mode enabled it is also faster than Coreseek when searching short Chinese strings. Coreseek's word segmentation only pays off when searching long Chinese text, where it is faster than Sphinx. Choose between them according to your application: if you are indexing very long Chinese text, use Coreseek.
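As a rough illustration of what unigram mode means in a plain Sphinx index, the sketch below prints a hypothetical index fragment. The index name example_unigram is made up for illustration, and the option names (charset_type, ngram_len, ngram_chars) are the ones documented for the Sphinx 0.9.9 and 2.0 series; check them against the manual for the version you actually build.

```shell
#!/bin/sh
# Print a hypothetical Sphinx index fragment that indexes CJK text
# one character at a time (unigram mode). "example_unigram" is an
# illustrative name, not something from this guide.
print_unigram_settings() {
    cat <<'EOF'
index example_unigram
{
    # source, path and the other required settings are omitted here
    charset_type = utf-8
    ngram_len    = 1
    ngram_chars  = U+3000..U+2FA1F
}
EOF
}

print_unigram_settings
```

The fragment would be merged by hand into a real index definition; it is not a complete configuration on its own.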

Installation notes

Coreseek has been released in several versions, including 3.2.14 and 4.1.

Version 3.2.14 was released in 2010 and is based on Sphinx 0.9.9. Version 4.1 was released in 2011 and is based on Sphinx 2.0.2.

System version:

    [root@localhost ~]# lsb_release -a
    LSB Version:    :core-3.1-ia32:core-3.1-noarch:graphics-3.1-ia32:graphics-3.1-noarch
    Distributor ID: CentOS
    Description:    CentOS release 5.5 (Final)
    Release:        5.5
    Codename:       Final

Set the Chinese locale

    export LANG=zh_CN.UTF-8
    export LC_ALL=zh_CN.UTF-8
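Before building, it can help to confirm that the session really is using a UTF-8 locale. The following is a minimal sketch; the helper name is_utf8_locale is hypothetical and only inspects the locale string, it does not verify that the locale is installed on the system.

```shell
#!/bin/sh
# is_utf8_locale NAME: succeed if NAME ends in ".utf8" or ".utf-8"
# (case-insensitive). Hypothetical helper, string check only.
is_utf8_locale() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *.utf-8|*.utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the current session: LC_ALL takes precedence over LANG.
if is_utf8_locale "${LC_ALL:-${LANG:-}}"; then
    echo "UTF-8 locale active: ${LC_ALL:-$LANG}"
else
    echo "warning: locale is not UTF-8; Chinese text may display incorrectly" >&2
fi
```

Running `locale -a | grep zh_CN` additionally shows whether a zh_CN locale is available at all.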

Upgrade dependency packages

    yum -y install glibc-common libtool autoconf automake mysql-devel expat-devel

    # Install autoconf from source
    tar -zxvf autoconf-2.69.tar.gz
    cd autoconf-2.69
    ./configure
    make
    make install

Install MMSEG (the dictionary used by Coreseek)

    tar -zxvf coreseek-3.2.13.tar.gz
    cd coreseek-3.2.13
    cd mmseg-3.2.13/
    ./bootstrap    # warnings at this step can be ignored
    ./configure --prefix=/usr/local/mmseg3
    make
    make install

    # Fix for "-bash: mmseg: command not found":
    # ln -s /usr/local/mmseg3/bin/mmseg /bin/mmseg

    # Test: running mmseg should print the following, which indicates a successful installation
    Coreseek COS(tm) MM Segment 1.0
    Copyright By Coreseek.com All Right Reserved.
    Usage: mmseg <option> <file>
    -u <unidict>     Unigram Dictionary
    -r               Combine with -u, used a plain text build Unigram Dictionary, default Off
    -b <Synonyms>    Synonyms Dictionary
    -t <thesaurus>   Thesaurus Dictionary
    -h               print this help and exit

Install CSFT – 3.2.14

    cd csft-3.2.13/
    sh buildconf.sh    # warnings at this step can be ignored
    ./configure \
      --prefix=/usr/local/sphinx \
      --with-unixodbc \
      --with-mmseg \
      --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ \
      --with-mmseg-libs=/usr/local/mmseg3/lib/ \
      --with-mysql=/usr/local/mysql/
    make
    make install

    # Alternative:
    # ./configure --prefix=/usr/local/sphinx/ --with-mysql=/usr/local/mysql/ --enable-id64
    # make
    # make install

Note: --enable-id64 is meant for 64-bit systems. With it enabled, the Sphinx storage module can support roughly 4-5G; without it, roughly 2G, but retrieval is more efficient, so leave it off if you do not need it.

Error 1: make fails in /data/software/coreseek-4.1-beta/csft-4.1/src with

    make: *** [all-recursive] Error 1

Fix: in csft-4.1/src/sphinxexpr.cpp, replace

    T val = ExprEval ( this->m_pArg, tMatch );

with

    T val = this->ExprEval ( this->m_pArg, tMatch );

Error 2: linking fails with "collect2: ld returned 1 exit status".

Fix: replace

    $LIBS="$LIBS -L/usr/local/lib"

with

    $LIBS="$LIBS -liconv -L/usr/local/lib"

Test MMSEG word segmentation and coreseek search

    cd testpack
    cat var/test/test.xml    # the Chinese text should display correctly at this point
    /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
    /usr/local/sphinx/bin/indexer -c etc/csft.conf --all
    /usr/local/sphinx/bin/search -c etc/csft.conf 网络搜索    # query: "web search"

The search should return output ending with:

    words:
    1. '网络': 1 documents, 1 hits
    2. '搜索': 2 documents, 5 hits

Configure MMSEG Chinese word segmentation

I did not need to do anything for this step; it had already been configured during installation. For reference, the steps are:

    cd /usr/local/mmseg3/
    # Build the dictionary; this generates unigram.txt.uni
    ./bin/mmseg -u /usr/local/mmseg3/etc/unigram.txt

    vim etc/mmseg.ini

    [mmseg]
    merge_number_and_ascii=0    ; merge letters and numbers, e.g. abc123/x
    number_and_ascii_joint=     ; characters allowed to join letters and numbers
    compress_space=1            ; not currently supported
    seperate_number_ascii=0     ; split letters and numbers apart

    # Copy the settings file to sphinx/dict
    cp etc/mmseg.ini /usr/local/sphinx/dict/mmseg.ini
    # Copy the generated dictionary to sphinx/dict as uni.lib
    cp etc/unigram.txt.uni /usr/local/sphinx/dict/uni.lib

Configure Sphinx

    # Create the configuration file
    mkdir -p /usr/local/sphinx/etc/conf.d/
    vim /usr/local/sphinx/etc/conf.d/sphinx.conf

Contents of the sphinx.conf configuration file

    source src_blog
    {
        type                = mysql
        sql_host            = 127.0.0.1
        sql_user            = root
        sql_pass            = woshishui
        sql_db              = yphp_tutiantian
        sql_query_pre       = SET NAMES utf8
        sql_query           = SELECT * FROM tu_pic
        sql_attr_uint       = pr
        sql_attr_uint       = big_id
        sql_attr_uint       = small_id
        sql_attr_uint       = is_del
        sql_ranged_throttle = 0
    }

    # Note: rt_field marks a searchable field; rt_attr_uint marks a returned attribute field
    index sphinx_blog
    {
        source           = src_blog
        path             = /usr/local/sphinx/var/data/sphinx_blog
        docinfo          = extern
        mlock            = 0
        stopwords        =
        min_prefix_len   = 0
        min_infix_len    = 0
        morphology       = none
        min_word_len     = 2
        charset_type     = zh_cn.utf-8
        charset_dictpath = /usr/local/mmseg3/etc/
        charset_table    = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, \
            A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6, \
            U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101, \
            U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109, \
            U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F, \
            U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117, \
            U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D, \
            U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135, \
            U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C, \
            U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144, \
            U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B, \
            U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153, \
            U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159, \
            U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161, \
            U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167, \
            U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F, \
            U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175, \
            U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C, \
            U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F, \
            U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9, \
            U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF, \
            U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F, \
            U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E, \
            U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39, \
            U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39, \
            U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60, \
            U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805, \
            U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5, \
            U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9, \
            U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5, \
            U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9, \
            U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1, \
            U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9, U+0E01..U+0E2E, \
            U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, U+A000..U+A48F, U+4E00..U+9FBF, \
            U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, U+2F800..U+2FA1F, U+2E80..U+2EFF, \
            U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF, \
            U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F, \
            U+A490..U+A4CF
        html_strip       = 0
    }

    indexer
    {
        mem_limit = 256M
    }

    searchd
    {
        port            = 9312
        read_timeout    = 5
        max_children    = 10
        pid_file        = /usr/local/sphinx/var/log/searchd.pid
        log             = /usr/local/sphinx/var/log/searchd.log
        max_matches     = 1000
        seamless_rotate = 1
    }

Build a Sphinx index

    # Stop any running searchd processes
    pkill -9 searchd

    # Build all indexes
    /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all

    # Start the search daemon
    /usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf

    # Fix for a startup error: copy the MySQL client library to a directory accessible to all users
    cp /usr/local/mysql/lib/libmysqlclient.so.18 /usr/lib/

    # On 64-bit systems, create a symbolic link instead
    ln -s /usr/local/mysql/lib/libmysqlclient.so.18 /usr/lib64
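The library fix in this step suggests that indexer or searchd could not load libmysqlclient.so.18. One way to confirm which shared libraries are unresolved is to filter the output of ldd. The sketch below assumes the usual glibc ldd output format ("name => not found"); the helper name list_missing_libs is hypothetical.

```shell
#!/bin/sh
# list_missing_libs: read `ldd` output on stdin and print the name of
# every library that the dynamic linker could not resolve.
# Assumes glibc-style lines such as "libfoo.so.1 => not found".
list_missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (binary path taken from the install prefix used in this guide):
# ldd /usr/local/sphinx/bin/searchd | list_missing_libs
```

An empty result means every library was resolved; a line such as libmysqlclient.so.18 points at the copy or symlink step.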

Install sphinxclient first

    # Enter the api directory of the Sphinx source package
    cd coreseek-3.2.13/csft-3.2.13/api/libsphinxclient
    ./configure --prefix=/usr/local/libsphinxclient
    make
    make install

    # If make fails on sock_close, change the declaration
    #     void sock_close ( int sock );
    # to
    #     static void sock_close ( int sock );

Install the Sphinx extension for PHP

    # For PHP versions before 7, use this extension
    wget http://pecl.php.net/get/sphinx-1.3.0.tgz

    # For PHP 7, use this extension
    wget "http://git.php.net/?p=pecl/search_engine/sphinx.git;a=snapshot;h=9a3d08c67af0cad216aa0d38d39be71362667738;sf=tgz"

    tar -zxvf sphinx-1.3.0.tgz
    cd sphinx-1.3.0
    /usr/local/php/bin/phpize
    ./configure --with-php-config=/usr/local/php/bin/php-config --with-sphinx=/usr/local/libsphinxclient/
    make && make install

After installation, the etc directory under the installation directory contains sample test data and configuration.

Configure PHP to support Sphinx

Edit the php.ini file

    vim /usr/local/php/etc/php.ini

    # Modify the following
    extension_dir = "/usr/local/php/lib/php/extensions/no-debug-non-zts-20100525/"
    # Load the extension built in the previous step
    extension = sphinx.so

    # Restart php-fpm
    /etc/init.d/php-fpm restart

Check to see if Sphinx was installed successfully

Create a new PHP file

    vim index.php

    # File contents:
    <?php
    phpinfo();

Open the page's address in a browser.

If a sphinx section appears in the output, the extension was installed successfully.
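The same check can be sketched from the command line by filtering the module list that `php -m` prints. The helper name has_sphinx_module is hypothetical, and the php binary path is assumed from the install prefix used in this guide. Note that the command-line PHP may read a different php.ini than php-fpm, so this complements the browser check rather than replacing it.

```shell
#!/bin/sh
# has_sphinx_module: read a PHP module list (one name per line, as
# printed by `php -m`) on stdin; succeed if a line is exactly "sphinx",
# ignoring case.
has_sphinx_module() {
    grep -qix 'sphinx'
}

# Usage:
# /usr/local/php/bin/php -m | has_sphinx_module && echo "sphinx extension loaded"
```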

For more examples, see "Sphinx + PHP Basic Use Examples".

Common errors

1. Unigram dictionary load Error: the dictionary is not set correctly and cannot be read. When using relative paths, pay particular attention to whether they resolve to the actual directory. To test, run dir (or ls) from the command line on the directory set in charset_dictpath and check that the dictionary exists.

2. Segmentation fault (or "segment error"): possible causes are an incorrect dictionary path, or a self-built dictionary with too much data (no more than 200,000 entries is recommended). Another cause is a max_matches value that is too large; keeping it within 10000 is recommended.

3. iniparser: cannot open ……mmseg.ini: mmseg.ini has not been set up. Create mmseg.ini in the location given in the message, then set the required parameters in it.

4. FATAL: failed to parse config file '……csft.conf', or FATAL: config file '……csft.conf' does not exist or is not readable: the configuration file is not set or its location is wrong. Take special care when using relative paths (for example, when running indexer directly from the bin directory). Pass the full path with "-c", in the form "-c <full path>/csft.conf".

5. WARNING: no such index '……', skipping.: the index name was not found in the configuration file. Check that the name given in the message is defined.

6. ERROR: unknown key name 'charset_dictpath' in ……: the program currently running does not support Chinese word segmentation, or it is not enabled. Follow the installation instructions to build it with Chinese word segmentation support.

7. FATAL: index '……': unknown charset type 'zh_cn.utf-8': charset_dictpath is not set, or there is no uni.lib in the path set by charset_dictpath, or you are not using Coreseek.
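Several of the errors above come down to files missing from the directory named in charset_dictpath. A minimal sketch of a pre-flight check follows; the helper name check_dictpath is hypothetical, and the two file names (uni.lib, mmseg.ini) are the ones mentioned in the error descriptions.

```shell
#!/bin/sh
# check_dictpath DIR: report whether uni.lib and mmseg.ini are readable
# in DIR. Returns 0 only if both are present.
check_dictpath() {
    dir=${1%/}          # drop one trailing slash, if any
    status=0
    for f in uni.lib mmseg.ini; do
        if [ -r "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing: $dir/$f"
            status=1
        fi
    done
    return "$status"
}

# Usage (directory taken from charset_dictpath in the configuration shown earlier):
# check_dictpath /usr/local/mmseg3/etc/
```

Passing an absolute path avoids the relative-path pitfalls the error descriptions warn about.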

Resources

Official manual: http://www.coreseek.cn/docs/c…
http://blog.sina.cn/dpool/blo…