Fastest way to search through text records

2»

Comments

  • williewillie OG
    edited March 2020

    @comi said: Also, if it's a web service, it is going to serve queries in parallel, so 0.5s does not mean two clients per second.

    You're going to throw a muticore dedi at this? Or at least something with multiple cores and enough cpu headroom to handle your query load? Better to do something sensible in the software.

    @comi said: As you said yourself it is not possible to index by substrings in the middle of tokens, so not sure what kind of algo in a dynamic language is going to be faster.

    If the token is EC1D58C44 you can separately index EC1D58C44, C1D58C44, 1D58C44, D58C44, 58C44, 8C44, C44, and 44. Then you can do 1 prefix lookup in the combined index instead of a huge linear search. You can also do stuff like hash trigraphs from the tokens so each hashed trigraph gives a list of tokens that contain the trigraph, and that cuts down the search space a lot.

    Thanked by (1)uptime
  • @willie said: You're going to throw a muticore dedi at this?

    That escalated quickly.

    @willie said: Better to do something sensible in the software.

    Always. If sensible option exists.

    @willie said: you can separately index

    That's a cool idea! But if you do that you will need to throw multi-gigabyte-of-ram server at it as 8 times the rows will eat ram 8 times faster. And you better make sure 8 indexed lookups are indeed faster, because overhead of simply making a query exists.

    Peace.

  • williewillie OG
    edited March 2020

    8x 61 million = 486 million and it doesn't have to be in ram, an index lookup on that many rows on ssd should be plenty fast enough. Even on hdd it should do much better than 0.5 sec per query. I mistakenly wrote 8 lookups before editing, so don't know if you saw that version.

  • Hi guys, apologies for the delayed response.

    The "search" is against a list of available private number (licence) plates (UK based).

    For example, I might enter TOM and X1TOM would be an expected response, as well as possibly something like DD61TOM (there are a few numbering systems in the UK). Something like T09MXY might also be in there.

    I've got a basic system (in PHP) which handles finding some of the alternatives by changing the queries slightly, and I've considered "splitting" the text into separate "section" fields. For example A1ABC could be put into a prefix,number,letters set of fields and then the letters="ABC" query would be faster, but I'd still need to cover the other formats.

    Server wise it's an E3-1270v6, 64gb RAM and is running NVMe drives - but this doesn't have to be the host. The database/searching can be farmed off to another dedi, or some smaller VPS. Current stack on the live server is PHP/MySQL (running Wordpress and Magento).

    I've been a bit busy on other projects too but I'm hoping to get postgesql setup today and data in, as well as having more of a play with solr.

    Thanked by (1)uptime
  • What is the query load? If you can afford 1/2 second to scan the big text file on every plate lookup, maybe that's the simplest approach. Alternatively go for indexing all those suffix strings like described above. Chuck it all in an SQL table with a btree index and search on LIKE 'XYZ%' and it should be very fast. It will burn a few GB of disk space but it sounds like you can afford that. The main tables don't have to fit in ram at all, since the first few layers of btree index will usually stay cached.

  • comicomi OG
    edited March 2020

    @Mr_Tom, since search results go directly to the user, there is another cheap hack you can utilise.
    You can pick backend that supports "streaming", meaning it should start giving away results as it finds matches, not when the query is complete. This way you can give user something right away making it perceptively faster.
    But of course, @willie's idea with suffixes sounds best so far, since you have just 9 charaters max.

    Thanked by (1)uptime
Sign In or Register to comment.

This Site is currently in maintenance mode.
Please check back here later.

→ Site Settings