Fastest way to search through text records
Hi all,
Hoping to pick your collective minds.
I've got a MySQL table (pretty much just a couple of fields: an ID and a string) where I need to search the string column and get the ID(s) of any matching rows.
Using LIKE "%search%" works perfectly - but it requires a full table scan and there are 61 million rows.
I've tried a MATCH ... AGAINST() search, but this seems to match only against the FULL word, whereas I need to include results that match part of a word too, if that makes sense.
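For reference, the two query shapes I've been comparing look roughly like this (table and column names simplified):

SELECT id FROM records WHERE str_col LIKE '%search%';          -- works, but full table scan
SELECT id FROM records WHERE MATCH(str_col) AGAINST('search'); -- needs a FULLTEXT index; fast, but whole words only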
I've considered putting the dataset into something like Redis, but does anyone have any suggestions on the fastest way to do this? It doesn't have to use a database - I can put the data into any format I want; the only "requirement" is that I've got to be able to search it from PHP.
Thanks in advance!
Comments
I don't think there will be any magic bullet for 61 million rows short of a complete database redesign to separate them out. I would say your best bet is better caching so the table runs in RAM.
Perhaps someone like @joepie91 or @Francisco can give a better answer.
How about full-text search (FTS) indexing with ADD FULLTEXT?
Example:
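(A sketch - table and column names here are just placeholders.)

ALTER TABLE records ADD FULLTEXT INDEX ft_str (str_col);

SELECT id
FROM records
WHERE MATCH(str_col) AGAINST('search' IN NATURAL LANGUAGE MODE);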
Yeah I think that might be the way to go.
I've added a FULLTEXT index to the table already, hasn't really sped up the LIKE searches. Using MATCH/AGAINST is quicker but doesn't return the correct set of results.
Basically, what you're asking for is a search engine, and a Solr or Elasticsearch cluster might be better suited for this case. They're designed for text search.
If you want to keep using MySQL, see if there is a distributed DB module which will automatically shard the DB and parallelize the query. PostgreSQL has the Citus extension and Postgres-XL, but I'm not as familiar with the MySQL ecosystem as I am the postgres ecosystem.
Of course, you'll need to throw lots of medium boxes at the problem to get the best throughput.
Thanks! That's probably the best way of looking at it.
I might set up a Solr or Elasticsearch install and import the data to see how it fares speed-wise.
There's no reason for it to be kept in a database, if grepping a csv file is quickest I'll write a solution that does that lol.
You'll probably need a 3 node cluster to start seeing performance gains.
You can use a CSV as the backend for some database systems.
No matter how silly it sounds, these types of solutions get the job done most of the time.
Fuzzy search is already a PITA. You're never going to like what you get with it (at least I don't).
Text-search-oriented solutions like Elasticsearch/Solr will perform better, but they come with their own set of extra work.
Have you tried loading the whole dataset into the buffer (RAM)? Which DB engine did you use? If your data is mostly read-only, may I suggest the Aria engine (or Memory)? Of course, I don't know how many GB your 61 million rows actually are, but if all of your data is in the buffer, I don't see any problem with a FULL TABLE SCAN.
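A rough sketch of what I mean, assuming MariaDB (the table name is just a placeholder):

-- Convert the table to the Aria engine so it can be served from the Aria page cache
ALTER TABLE records ENGINE=Aria;

-- Then size the page cache so the whole table fits in RAM, e.g. in my.cnf:
-- aria_pagecache_buffer_size = 8G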
I haven't played with it much yet, but AFAIK, PostgreSQL's full-text search capabilities are a lot better (and probably more performant) than those of MySQL.
Given that PostgreSQL is generally nicer to work with (and faster) than MySQL, while its SQL dialect is similar, it may be worth considering a switch - after trying out your use case in PostgreSQL and verifying that it does work well there, of course. You can also use
EXPLAIN ANALYZE
(along with this tool) to understand exactly how a query was scheduled and executed, so that you can track down what exactly made it slow.
Elasticsearch is a possibility, but note that it is open-core (not open-source) and certainly not low-maintenance. I wouldn't bother with it unless you've verified that PostgreSQL really can't do what you need. As already mentioned elsewhere, it's probably not going to be faster in and of itself either; you'd need to actually add more servers for that.
But honestly, 61 million records should be peanuts for an RDBMS on a single server. It's not the sort of dataset size where you should be running into performance issues, so long as you get your indexes right and don't run especially unperformant queries. This should just work.
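For reference, a minimal sketch of what the PostgreSQL full-text route (plus EXPLAIN ANALYZE) might look like - table and column names are made up:

-- Expression index for full-text search
CREATE INDEX idx_records_fts ON records USING GIN (to_tsvector('simple', str_col));

-- Query it
SELECT id FROM records
WHERE to_tsvector('simple', str_col) @@ plainto_tsquery('simple', 'search');

-- Inspect how the query was planned and executed
EXPLAIN ANALYZE
SELECT id FROM records
WHERE to_tsvector('simple', str_col) @@ plainto_tsquery('simple', 'search');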
That is a crazy DB setup. You shouldn't have to scan strings at query time. Process each string when you insert it into the database, extracting any IDs from the string. Make a new table that combines the extracted IDs with the ID of the row you just added. Add suitable indexes and do appropriate JOIN operations to find the rows you want. Full-text searching doesn't seem like the right thing at all.
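Something like this, roughly (all names made up; the extraction itself would happen in PHP at insert time):

-- One row per token extracted from the original string
CREATE TABLE extracted_tokens (
  token  VARCHAR(64) NOT NULL,
  row_id BIGINT      NOT NULL,
  KEY idx_token (token)
);

-- Searching is then an index lookup plus a join
SELECT r.id
FROM extracted_tokens t
JOIN records r ON r.id = t.row_id
WHERE t.token = 'search';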
What is the application? If it's not an online op taking a high query load, maybe you can use a plain text file. 61M records is probably less than a GB of text and you can grep that much text in a second or so. Maybe that is ok for what you are doing.
Pretty sure the standard unix search stuff is optimised to death so I'd probably try to find a way to turn this into text files
The base functionality is still in Elasticsearch; the proprietary addons don't add much when it just needs to be a search engine, provided this isn't filled with sensitive information and on the Internet.
In any case, Amazon's Open Distro for Elasticsearch should be looked at. (https://opendistro.github.io/for-elasticsearch/) It's what Amazon uses for their hosted Elasticsearch instances.
Yes, it's like any other Java application.
I found Whoosh, which is a search engine library for Python (https://whoosh.readthedocs.io/en/latest/intro.html), and Xapian, a search engine library for lots of languages (https://xapian.org/). (I haven't used either of these; I just found them after getting curious about search engine libraries.)
Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.
The answer is always awk based somehow.
I thought of awk. I haven't used it enough to get a feel for its speed, or how well text searching works.
awk '/term/ {print $1}' file_with_penis_ascii.txt
61 millions lines of ASCII peens.
Would that be a linear search? I'm thinking part of the solution is getting as much parallelism as possible. Something, something O(log(n)).
All I know is that my log comes first.
..oh god did I just make a sexual math joke?
Thanks for all the info/tips guys.
I've tweaked some of the indexes and it's helped - doing a WHERE text="xx" search is now almost instant, but it's still a little slow for LIKE queries. I've started having a look at the "search engine" type options too, as these might be a better bet for searching/returning "close" matches as well as exact matches. I'm currently handling that in an acceptable way in PHP by generating "close" terms based on the search term and then querying the database for these, depending on the number of results.
I would offload search to Sphinx Search using SphinxSE with MariaDB.
https://mariadb.com/kb/en/about-sphinxse/
LIKE 'XX%' should also be fast, because of the index, and there is sometimes a way to make an index on the reverse spelling of the string so that LIKE '%XX' is also fast. The slow one is LIKE '%X%'. But really if the stuff you are parsing out of those strings is id numbers or something else you can identify when you import the strings in the first place, then as mentioned above it's best to extract that at import time and put it in its own indexed column.
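Roughly like this, assuming MySQL 5.7+/MariaDB generated columns (names made up; the OP said the strings are 9 chars max):

-- Keep an indexed reversed copy of the string alongside the original
ALTER TABLE records
  ADD COLUMN str_rev VARCHAR(9) AS (REVERSE(str_col)) STORED,
  ADD INDEX idx_str (str_col),
  ADD INDEX idx_str_rev (str_rev);

-- Prefix search (LIKE 'To%') uses idx_str:
SELECT id FROM records WHERE str_col LIKE 'To%';

-- Suffix search (LIKE '%om') becomes a prefix search on the reversed column,
-- with the term reversed by the application (PHP):
SELECT id FROM records WHERE str_rev LIKE 'mo%';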
The way you described the problem initially, it sounds a bit weird if you really have to search free text. If that is the case though, Postgres's text search stuff is supposed to be quite good (almost as good as Elastic/Solr) while MySQL's is terrible.
You can still achieve it using MySQL FULLTEXT (not really, see update below)
It supports a trailing wildcard (wildcarding the end of the word), so you could search like this: MATCH(column) AGAINST('term*' IN BOOLEAN MODE)
If you want to wildcard the beginning of the word as well, you can create another column that contains the reversed words of the original column. The final query would look something like this:
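(Roughly - table/column names are placeholders; the reversed term, e.g. 'mret' for 'term', would be built in PHP, and both columns need their own FULLTEXT index.)

SELECT id
FROM records
WHERE MATCH(str_col)     AGAINST('term*' IN BOOLEAN MODE)
   OR MATCH(str_col_rev) AGAINST('mret*' IN BOOLEAN MODE);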
You might need to update MySQL stopwords list with their reversed counterparts in this case.
UPDATE: however, that is still not gonna find the partial match inside the word.
Not 100% sure if this is a good option for your particular database, but to me it came highly recommended for editing WordPress ones:
https://interconnectit.com/products/search-and-replace-for-wordpress-databases/
DISCLAIMERS:
Make backups before trying anything, don't blindly follow advice of idiots like me from the Internet, eat healthy food and drink enough water...
Yeah, although the ID is the final required bit of information, it's user-inputted text which is being searched, in a way that needs to find "Mr_Tom" from the term "To". The text is only 9 characters max. If PostgreSQL is faster than MySQL, that will probably solve it - I've replicated it on a server of similar spec to the live one, and with the index tweaks it's a lot faster, but not quite there yet for the expected times for a website.
That's what I need ideally. LIKE %TERM% works perfectly, it's just not the fastest.
That looks a good option for what it does, but doesn't quite fit here.
Cheers
MySQL is not that good for that; I guess Elasticsearch would make more sense.
There is, however, a tool called Sphinx that can index your tables and then the search will be fast, but I don't know if it will fit your needs.
What is the actual application if you don't mind my asking, and what is the total amount of text to be searched? What size vm are you running in? I'm still perplexed that you have to find "Mr_Tom" from "to".
Maybe you can index trigrams or substrings from the tokens in this text? If the column length is 9 chars max, this is around 0.5GB of text, and if you index the 9-char strings and their substrings starting from the 2nd char, 3rd char, etc., you get around 500M total strings, around 2.5GB, which doesn't seem too bad. It will chew up some disk space but you can index them all. It doesn't seem necessary to have them in RAM, since the index lookups on SSD will be fast enough.
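A rough sketch of the suffix-table idea (names made up; generating the suffixes would happen at import time, e.g. in PHP):

-- One row per suffix of each string: 'Mr_Tom' yields 'Mr_Tom', 'r_Tom', '_Tom', 'Tom', 'om', 'm'
CREATE TABLE string_suffixes (
  suffix VARCHAR(9) NOT NULL,
  row_id BIGINT     NOT NULL,
  KEY idx_suffix (suffix)
);

-- A substring search ('%To%') then becomes an indexed prefix search over the suffixes,
-- which finds 'Mr_Tom' via its suffix 'Tom':
SELECT DISTINCT row_id FROM string_suffixes WHERE suffix LIKE 'To%';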
PostgreSQL's native full-text capabilities are probably enough.
databases are lame
If this is for a web service, 0.5 sec per query is really too slow. We still haven't heard anything about the application, though. Full-text engines will be of no help with this, since they search for tokens given prefixes; they won't look for substrings in the middle of tokens. All you can really do is index all the "tails" of each token, or (depending on the input distribution) do some hack like indexing trigrams.
Mr_Tom, if you're still around and want serious advice about this, we need to know the query load, the type of server, the software stack, the origin of the data (so we can make reasonable guesses about how the substrings are distributed), what kinds of lengths you're willing to go to in the software to make it fast, etc. A full-out in-process data structure with everything in RAM can probably handle millions of queries per second, but the implementation hassle would be considerable.
you are not wrong, but:
As you said yourself, it is not possible to index by substrings in the middle of tokens, so I'm not sure what kind of algorithm in a dynamic language is going to be faster.
Also, if it's a web service, it is going to serve queries in parallel, so 0.5s does not mean two clients per second.