Course
ngram Fulltext Parser
MySQL Tutorial
This SQL tutorial is structured for beginners to guide them from the foundational concepts to advanced data manipulation and querying techniques in SQL. By the end of this tutorial, you will have developed a robust understanding of SQL, equipping you with the knowledge to efficiently manage and analyze data across various database systems. This guide sets the stage for your growth into a skilled data professional, ready to tackle complex data challenges and contribute to the field of data analysis and database management.
ngram Full-Text Parser
Usually in Full-Text searching, the built-in MySQL Full-Text parser considers the white spaces between words as delimiters. This determines where the words actually begin and end, to make the search simpler. However, this is only simple for languages that use spaces to separate words.
Several ideographic languages like Chinese, Japanese and Korean languages do not use word delimiters. To support full-text searches in languages like these, an ngram parser is used. This parser is supported by both InnoDB and MyISAM storage engines.
The ngram Full-Text Parser
An ngram is a continuous sequence of 'n' characters from a given sequence of text. The ngram parser divides a sequence of text into tokens as a contiguous sequence of n characters.
For example, consider the text 'Tutorial' and observe how it is tokenized by the ngram parser −
n=1: 'T', 'u', 't', 'o', 'r', 'i', 'a', 'l'n=2: 'Tu', 'ut', 'to' 'or', 'ri', 'ia' 'al'n=3: 'Tut', 'uto', 'tor', 'ori', 'ria', 'ial'n=4: 'Tuto', 'utor', 'tori', 'oria', 'rial'n=5: 'Tutor', 'utori', 'toria', 'orial'n=6: 'Tutori', 'utoria', 'torial'n=7: 'Tutoria', 'utorial'n=8: 'Tutorial'
The ngram full-text parser is a built-in server plugin. As with other built-in server plug-ins, it is automatically loaded when the server is started.
Configuring ngram Token Size
To change the token size, from its default size 2, use the ngram_token_size configuration option. The range of ngram values is from 1 to 10. But to increase the speed of search queries, use smallers token sizes; as smaller token sizes allow faster searches with smaller full-text search indexes.
Because ngram_token_size is a read-only variable, you can only set its value using two options:
Setting the --ngram_token_size in startup string:
mysqld --ngram_token_size=1
Setting ngram_token_size in configuration file 'my.cnf':
[mysqld]
ngram_token_size=1
Creating FULLTEXT Index Using ngram Parser
A FULLTEXT index can be created on columns of a table using the FULLTEXT keyword. This is used with CREATE TABLE, ALTER TABLE or CREATE INDEX SQL statements; you just have to specify 'WITH PARSER ngram'. Following is the syntax −
CREATE TABLE table_name ( column_name1 datatype, column_name2 datatype, column_name3 datatype, ... FULLTEXT (column_name(s)) WITH PARSER NGRAM) ENGINE=INNODB CHARACTER SET UTF8mb4;
Example
In this example, we are creating a FULLTEXT index using the CREATE TABLE statement as follows −
CREATE TABLE blog ( ID INT AUTO_INCREMENT NOT NULL, TITLE VARCHAR(255), DESCRIPTION TEXT, FULLTEXT ( TITLE, DESCRIPTION ) WITH PARSER NGRAM, PRIMARY KEY(id)) ENGINE=INNODB CHARACTER SET UTF8MB4;
SET NAMES UTF8MB4;
Now, insert data (in any ideographic language) into this table created −
INSERT INTO BLOG VALUES (NULL, '教程', '教程是对一个概念的冗长研究'),(NULL, '文章', '文章是关于一个概念的基于事实的小信息');
To check how the text is tokenized, execute the following statements −
SET GLOBAL innodb_ft_aux_table = "customers/blog";
SELECT * FROMINFORMATION_SCHEMA.INNODB_FT_INDEX_CACHEORDER BY doc_id, position;
ngram Parser Space Handling
Any whitespace character is eliminated in the ngram parser when parsing. For instance, consider the following TEXT with token size 2 −
- "ab cd" is parsed to "ab", "cd"
- "a bc" is parsed to "bc"
ngram Parser Stop word Handling
Apart from the whitespace character, MySQL has a stop word list consisting of various that are considered to be stopwords. If the parser encounters any word in the text present in the stopword list, the word is excluded from the index.
ngram Parser Phrase Search
Normal Phrase searches are converted to ngram phrase searches. For example, The search phrase "abc" is converted to "ab bc", which returns documents containing "abc" and "ab bc"; and the search phrase "abc def" is converted to "ab bc de ef", which returns documents containing "abc def" and "ab bc de ef". A document that contains "abcdef" is not returned.
ngram Parser Term Search
For natural language mode search, the search term is converted to a union of ngram terms. For example, the string "abc" (assuming ngram_token_size=2) is converted to "ab bc". Given two documents, one containing "ab" and the other containing "abc", the search term "ab bc" matches both documents.
For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming ngram_token_size=2) is converted to '"ab bc"'. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase '"ab bc"' only matches the document containing 'abc'.
ngram Parser Wildcard Search
Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT search indexes:
- If the prefix term of a wildcard search is shorter than ngram token size, the query returns all indexed rows that contain ngram tokens starting with the prefix term. For example, assuming ngram_token_size=2, a search on "a*" returns all rows starting with "a".
- If the prefix term of a wildcard search is longer than ngram token size, the prefix term is converted to an ngram phrase and the wildcard operator is ignored. For example, assuming ngram_token_size=2, an "abc*" wildcard search is converted to "ab bc".
ngram Full-Text Parser Using a Client Program
We can also perform ngram full-text parser operation using the client program.
Syntax
PHPNodeJSJavaPython
To perform the ngram fulltext parser through a PHP programe, we need to execute the "Create" statement using the mysqli function query() as follows −
$sql = "CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), DESCRIPTION TEXT, FULLTEXT ( title, DESCRIPTION ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4";$mysqli->query($sql);
Example
Following are the programs −
PHPNodeJSJavaPython
$dbhost = "localhost";$dbuser = "root";$dbpass = "password";$dbname = "TUTORIALS";$mysqli = new mysqli($dbhost, $dbuser, $dbpass, $dbname);if ($mysqli->connect_errno) { printf("Connect failed: %s ", $mysqli->connect_error); exit();}// printf('Connected successfully.');/*CREATE Table*/$sql = "CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT, FULLTEXT ( title, description ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4"; $result = $mysqli->query($sql);if ($result) { printf("Table created successfully...!\n");}//insert data$q = "INSERT INTO blog (id, title, description) VALUES (NULL, '教程', '教程是对一个概念的冗长研究'), (NULL, '文章', '文章是关于一个概念的基于事实的小信息')";if ($res = $mysqli->query($q)) { printf("Data inserted successfully...!\n");}//we will use the below statement to see how the ngram tokenizes the data:$setglobal = "SET GLOBAL innodb_ft_aux_table = 'TUTORIALS/blog'";if ($mysqli->query($setglobal)) { echo "global innodb_ft_aux_table set...!";}$s = "SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position ";if ($r = $mysqli->query($s)) { print_r($r);}//display data (ngram parser phrase search);$query = "SELECT * FROM blog WHERE MATCH (title, description) AGAINST ('教程')";if ($r = $mysqli->query($query)) { printf("Table Records: \n"); while ($row = $r->fetch_assoc()) { printf( "ID: %d, Title: %s, Descriptions: %s", $row["id"], $row["title"], $row["description"] ); printf("\n"); }} else { printf("Failed");}$mysqli->close();
Output
The output obtained is as shown below −
global innodb_ft_aux_table set...!mysqli_result Object( [current_field] => 0 [field_count] => 6 [lengths] => [num_rows] => 62 [type] => 0)Table Records:ID: 1, Title: 教程, Descriptions: 教程是对一个概念的冗长研究ID: 3, Title: 教程, Descriptions: 教程是对一个概念的冗长研究