Algorithm for matching strings between two large files
I have a question about a search algorithm. I currently have two plain-text files, each with at least 10 million lines for now. Every line is a string, and I want to find every string in the first file that also appears in the second file. Is there a good way to do this efficiently? Any suggestion about the algorithm, or about a particular language being well suited to it, is appreciated.
If you don't know anything about the structure of the files (such as whether they're sorted), there are many different approaches you could take to the problem, depending on your constraints on memory and space usage.
If you have a lot of free RAM available, one option would be to build a hash table in memory to hold the strings. You could then load all of the strings from the first file into the hash table. Then, load the strings from the second file one at a time; for each string, check whether it's in the hash table, and if so, report a match. This approach uses O(m) memory (where m is the number of strings in the first file) and at least Ω(m + n) time, possibly more depending on how the hash function works. It's also (almost certainly) the most direct way to solve the problem.
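For example, here's a minimal Python sketch of that approach (the file names are placeholders, and stripping trailing newlines is an assumption about how you want lines compared):

```python
def matches(first_path, second_path):
    # Load every line of the first file into a hash-based set: O(m) memory.
    with open(first_path) as f:
        first_lines = {line.rstrip("\n") for line in f}
    # Stream the second file one line at a time; each membership
    # test is an expected O(1) hash lookup.
    with open(second_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line in first_lines:
                yield line

for match in matches("file1.txt", "file2.txt"):  # hypothetical file names
    print(match)
```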
If you don't have much RAM available but time isn't much of a constraint, you can use a modified version of this first algorithm. Choose some number of lines to load in from the first file, then load just those strings into a hash table. Once you've done this, scan the entire second file to find any matches. Then, evict all of the lines from the hash table, load in the next set of lines from the first file, and repeat. This has runtime Ω(mn/b), where b is the block size (since there are O(m/b) iterations, each a complete linear scan of all n bytes of the second file). Alternatively, if you know that one file is much smaller than the other, you might want to consider just loading that entire file into memory and then scanning the other file.
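A rough sketch of that blocked variant, in the same style; the default block size here is an arbitrary assumption you'd tune to your available RAM:

```python
from itertools import islice

def matches_blocked(first_path, second_path, block_size=1_000_000):
    with open(first_path) as first:
        while True:
            # Load the next block of lines from the first file.
            block = {line.rstrip("\n") for line in islice(first, block_size)}
            if not block:
                break
            # One full linear scan of the second file per block:
            # O(m/b) passes in total.
            with open(second_path) as second:
                for line in second:
                    if line.rstrip("\n") in block:
                        yield line.rstrip("\n")
```

Note that a line can be reported once per block it matches, so you may want to deduplicate the output.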
If you don't have much RAM available but do have the ability to use more disk space, one option is to use an external sorting algorithm to sort each of the two files (or, at least, to build an index listing the lines of each file in sorted order). Once you have the files in sorted order, you can scan across them in parallel, finding all matches. This uses the more general algorithm for finding common elements of two sorted lists, which works like this:
- Keep track of two indices, one into the first list and one into the second list, both starting at zero.
- While both lists still have items left:
  - If the items at the two current indices are equal, report a match, then advance both indices.
  - Otherwise, if the item in the first list is smaller than the item in the second list, increase the index into the first list.
  - Otherwise, increase the index into the second list.
This algorithm takes roughly O(n log n) time to sort the two files and then makes a total of O(n) comparisons as it walks the lists to find common items. However, since string comparisons don't necessarily run in O(1) time (in fact, they often take much longer), the actual runtime may be much greater. If we assume that each file contains n strings of length m, then sorting would take O(mn log n) time, because each comparison takes O(m) time; similarly, the comparison step might take O(mn) time, because each string comparison can take O(m) time as well. As a possible optimization, you might want to consider computing a small hash code (say, 16 or 32 bits) for each string. Assuming the hash code gives good uniformity, this can dramatically cut the time required to compare strings, since most strings that aren't equal will have different hash codes and can be rejected in O(1) time.
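Assuming the two files have already been put in sorted order (say, by an external sort such as the Unix `sort` utility), the parallel scan could look like this Python sketch; it assumes both files end with a newline so that raw lines compare consistently:

```python
def sorted_matches(first_sorted, second_sorted):
    with open(first_sorted) as f1, open(second_sorted) as f2:
        a, b = f1.readline(), f2.readline()
        while a and b:  # readline() returns "" at end of file
            if a == b:
                yield a.rstrip("\n")  # match: advance both indices
                a, b = f1.readline(), f2.readline()
            elif a < b:
                a = f1.readline()     # first item smaller: advance first list
            else:
                b = f2.readline()     # second item smaller: advance second list
```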
Finally, assuming that each line of the files is reasonably long (say, at least 8 characters), one option is to compute a 64-bit or larger hash value for each line of the files. You could then use any of the above techniques to check whether any hash codes are duplicated across the two files (holding everything in a hash table, using external sorting, etc.). Assuming you have enough bits in your hash code, the number of collisions should be low, and you should be able to find matches quickly and with much less memory usage.
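As a sketch of that idea in Python: blake2b with an 8-byte digest gives a 64-bit hash, and with tens of millions of lines the chance of a collision is small but nonzero, so matches found this way are only probable and could be re-verified against the actual lines if needed:

```python
import hashlib

def hash64(line):
    # 64-bit hash of a line; blake2b supports truncated digests directly.
    digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def probable_matches(first_path, second_path):
    # Store 8-byte hashes instead of whole strings: far less memory per line.
    with open(first_path) as f:
        first_hashes = {hash64(line.rstrip("\n")) for line in f}
    with open(second_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if hash64(line) in first_hashes:
                yield line  # probable match (a hash collision is possible)
```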
Hope this helps!
Wow! This is my 1000th answer on Stack Overflow! :-)