performance - How to tag texts automatically while remaining efficient?


Say I have a set of one million tags and a text that needs to be parsed for these and possibly for new tags. The number of tags is only an example to illustrate the problem in my thinking: there are many ways to keep the tags in memory and so on, but they all come down to looping through a long list in a linear way.

Somehow I cannot think of any solution that has a low footprint (and is therefore fast). I know that one has to expect a trade-off, but I think there are some concepts I am not seeing.

This is especially interesting for intelligent tagging ("Michael Jackson" = "Artist", etc.), since the applicable tag may not be part of the text.

Apart from the usual blacklisting, caching of popular tags, and huge SQL queries, what would be the most effective way to go about this?

(Funnily enough, I have to tag this question myself :-))

Since I am limited in comment space, I will add some thoughts here:

  • I agree that using integer hashes improves speed, good idea.
  • Hashes will not solve the problem of looping through each hash/tag while checking a word or word combination against the tags.
  • To refine the problem: take the text "Hello World". There are 3 possible tags in this text ("Hello", "World" and "Hello World"). The tag list may contain only "hello", but after parsing, "world" or "hello world" could be added, which would mean that these tags are not applied to the text.
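
The "Hello World" case can be made concrete with a small Python sketch. It is only an illustration under assumptions: the function name `candidates`, the whitespace tokenization, and the one-entry tag list are all hypothetical, not part of the original question.

```python
def candidates(text, max_words=4):
    """Return every run of 1..max_words consecutive words, lowercased."""
    words = text.lower().split()  # naive tokenization (assumption)
    found = []
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            found.append(" ".join(words[start:start + size]))
    return found


known_tags = {"hello"}  # hypothetical existing tag list
cands = candidates("Hello World")
matched = [c for c in cands if c in known_tags]
unmatched = [c for c in cands if c not in known_tags]

print(cands)      # ['hello', 'world', 'hello world']
print(matched)    # ['hello']
print(unmatched)  # ['world', 'hello world']
```

The unmatched candidates are exactly the ones that would be missed if "world" or "hello world" were added to the tag list after the text had already been parsed.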

Problems:

  • When parsing the text, looping through all word combinations (like "nine inch nails", though we assume a combination limit of 4 words) and comparing each one with the tags in the database takes a long time, even when using integer hashes.
  • The tag list is potentially long, so storing or caching the tags may also slow things down.
  • Tag updates would mean additional full-text searches over the texts, which, depending on the number of texts and their length, could be a DB killer and not efficient at all.
  • How to tag "relevantly" in an automatic way? (Again, "Nine Inch Nails" comes to mind in an article about music, but "a new song released" would not be a good tag.) This is probably a question in its own right.
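
One possible way to avoid rescanning every text when a tag is added is an inverted index from word combinations to text ids. This technique is not proposed anywhere in the thread; the sketch below is an assumption-laden illustration, with the hypothetical names `ngrams` and `TagIndex`, whitespace tokenization, and the 4-word limit taken from the first problem above. It trades memory for speed, since every text contributes up to four index entries per word.

```python
from collections import defaultdict


def ngrams(text, max_words=4):
    """Yield every run of 1..max_words consecutive words, lowercased."""
    words = text.lower().split()  # naive tokenization (assumption)
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


class TagIndex:
    """Maps each word combination to the ids of the texts containing it."""

    def __init__(self, max_words=4):
        self.max_words = max_words
        self.postings = defaultdict(set)

    def add_text(self, text_id, text):
        for gram in ngrams(text, self.max_words):
            self.postings[gram].add(text_id)

    def texts_for_tag(self, tag):
        # A single dictionary lookup; the stored texts are not scanned again.
        return self.postings.get(tag.lower(), set())


index = TagIndex()
index.add_text(1, "Nine Inch Nails released a new song")
index.add_text(2, "A new song by another band")

print(sorted(index.texts_for_tag("nine inch nails")))  # [1]
print(sorted(index.texts_for_tag("new song")))         # [1, 2]
```

Note that the index happily returns texts for "new song" as well, so it does nothing for the relevance problem in the last point; it only addresses the cost of tag updates.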

Hash each word in the incoming text and use it to match against the hashes of the tags you want to match. You can use a database to store and look up the hash values so that you do not need to do this in memory.
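
A minimal sketch of this idea in Python, assuming SQLite as the database and `zlib.crc32` as the integer hash. Both are choices made for this illustration; the answer does not name a database or a hash function, and the table layout and function names are hypothetical.

```python
import sqlite3
import zlib


def tag_hash(phrase):
    # crc32 serves only as a stable 32-bit integer hash (an assumption).
    # Collisions are possible, so matches are confirmed by comparing text.
    return zlib.crc32(phrase.lower().encode("utf-8"))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (hash INTEGER, name TEXT)")
db.execute("CREATE INDEX idx_tags_hash ON tags (hash)")
for name in ["hello", "michael jackson", "nine inch nails"]:
    db.execute("INSERT INTO tags VALUES (?, ?)", (tag_hash(name), name))


def match_words(text):
    """Return the tags that equal a single word of the text."""
    matched = []
    for word in text.lower().split():
        rows = db.execute(
            "SELECT name FROM tags WHERE hash = ?", (tag_hash(word),)
        ).fetchall()
        # Guard against hash collisions by checking the stored tag text.
        matched.extend(name for (name,) in rows if name == word)
    return matched


print(match_words("Hello World"))  # ['hello']
```

Because only single words are hashed here, multi-word tags such as "nine inch nails" are never matched; the word combinations from the text would have to be hashed and looked up in the same way.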

