Hi, guys. Okay, I'm a grad student in mathematics, and I'm trying to determine the asymptotic growth of the interval profile of a suffix tree. I've written a simulation in C++, but it pretty much sucks since I suck at coding. So I'm hoping you guys can help me improve its efficiency.We're looking at a suffix tree with n leaves, and we want to know how many internal nodes there are on the kth level. k is an integer which will be in the neighborhood of log(n).We start with three parameters, n (a big fucking integer, on the order of 10^6 or larger.) p (a probability, i.e. real number between 0 and 1), alpha (a positive number around 1.) We then define the integer k=alpha*log(n) (rounded to the nearest integer.) To get our internal profile, then, we1. Construct a random binary string of X consisting of n+k-1 entries, with each entry taking value 1 with probability p, 0 with probability 1-p.2. Consider the n consecutive substrings of length k of X, i.e. X(1...k), X(2...k+1), ... , X(n...n+k-1), and count how many of these substrings appear at least twice. This is M_{n,k}, the internal profile, our variable of interest.3. Estimate the asymptotic variance of M_{n,k} by running this simulation a bunch of times for really fucking big n. My approach is to convert each substring to an integer on [0,2^k - 1], and then store it in a set. And if it's already in the set (i.e. it's appeared as an earlier substring) I put it in another set, the size of which is, at the end, my profile. But I'm sure I'm doing this super-inefficiently. I'm an amateur programmer at best. Here's the second of my code in which I build my n successive substrings and then count how many appear more than once by the placing of each in sets. but if you could suggest a better method, that would be awesome.Also, I don't know anything about pointers, so I can't build the fucking tree. And I'm not sure if that would be optimal.int n=1000000;int k=round(log(n));int p=.75;int profilesize=0;int curval=0;int wordmask=0;std::set myset;std::set profileset;// building the first k characters of our string, and converting them into an integer on [0, 2^k -1]for (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX));wordmask=wordmask+curval*pow(2,i);}// generating n strings from this base-string, each time lopping off the first character // and adding the last, then checking to see if this is a string we've already// encountered by examining our list of encountered stringsfor (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX));wordmask=wordmask/2 + curval*pow(2,k-1);if( myset.find(wordmask) != myset.end() ){profileset.insert(wordmask);}else{myset.insert(wordmask);}profilesize =profileset.size();The variable "profilesize" is our final variable; then we generate it a shitload of times for large values of n, and try to estimate its variance empirically. Any help that could be offered would be appreciated.
4/19/2014 11:01:32 PM
k
4/19/2014 11:02:49 PM
4/19/2014 11:11:53 PM
Why are you generating a word mask and looking for it in the set, rather than iterating through the set and counting substrings of the length you're looking for?
4/20/2014 12:19:34 AM
maybe I'm misunderstanding your question, but I'm converting substrings to wordmasks because it seems impractical to compare the jth substring to the previous j-1 substrings, letter by letter, just to see if it matchesin essence, by converting to wordmatch, I'm allowing myself to test whether two k-strings are identical simply by comparing two integers.maybe that's suboptimal, though.
4/20/2014 12:24:44 AM
Okay perhaps I don't understand your problem...But you have a string of size 1000000 You have substrings of length 6 (let's say) that is every 6-length substring in your stringAndYou want to know which 6-length substring is the most commonOrYou want to know how many times each 6-length substring occursOr neither of these?
4/20/2014 12:34:06 AM
4/20/2014 12:37:31 AM
Okay, it looks to me like you're randomly generating a substring:for (int i=0; icurval=1-floor((float)rand()/(p*RAND_MAX));wordmask=wordmask+curval*pow(2,i);}The looking for this random string in your set of existing substrings:if( myset.find(wordmask) != myset.end() ){profileset.insert(wordmask);}else{myset.insert(wordmask);}Is this correct?If you have a set of existing substrings (set myset) (or is this the full string? it doesn't matter either way), why not look through that set for occurrences of 2 or more of that substring, rather than generating a substring, then looking to see if it exists?It's like if your string is boobsvaginayou generate the random substring "cat" then see if it's in boobsvagina?basically, in the process of generating "myset" you can implicitly keep count of how many of each element has been added. then just read out the counts, discarding any that's <2.[Edited on April 20, 2014 at 12:58 AM. Reason : ]
4/20/2014 12:44:30 AM
http://en.cppreference.com/w/cpp/container/mapYou can modify your existing thing to use a map instead, where the "value" you're storing is the count.What you're doing with the 2 sets is not extraordinarily different, or more computationally intensive.It just doesn't make sense to generate a random substring, look for it in the set, when you have the full set, and you can just look for substrings that are duplicates of each other (if that's what you're doing).[Edited on April 20, 2014 at 12:56 AM. Reason : ]
4/20/2014 12:56:25 AM
4/20/2014 12:57:46 AM
so you have the super-long string stored some where, you know your substrings are length k:psuedocode:std::map substringKCountsMap;while(i=0,i+k<superLongString.length(),i++){ newMapKey=readSuperLongString(i,i+k); if (substringKCounts.contains(newMapKey)) substringKCountsMap[newMapKey]++; else substringKCountsMap.add({newMapKey,0})}foreach(countedSubstring in substringKCountsMap){ if (countedSubstring.value()>2) printf(countedSubstring.key);}Also, this problem is VERY parallelizable, if you wanted to make this threaded, it'd run a lot faster (probably).[Edited on April 20, 2014 at 1:10 AM. Reason : ]
4/20/2014 1:07:56 AM
the ideas of the pseudocode all make sense...I'll try to translate them into some sensible C tomorrow. thanks a lot.also, I have no idea in the world how I would exploit parallelizability (that's way beyond the scope of my C++ knowledge), but I know what parallizing is (I think) and that's cool that it's possible here. I'd love to bang out some results for n=10^10 or something. that would be the shit.
4/20/2014 1:17:52 AM
cooljust make sure to look at this: http://en.cppreference.com/w/cpp/container/mapIt's a key-value pair structure, it makes your life easier.map is the sister structure to std::set and is nearly drop-in replacement for using the std::set that you're already using.http://stackoverflow.com/questions/205945/why-does-the-c-stl-not-provide-any-tree-containersBoth are implemented with binary trees, so it doesn't matter either that you don't know pointers, C++ has this taken care of for you
4/20/2014 1:21:23 AM
I checked out "map" and tried for a bit to implement it, but got kinda stuck (didn't try for all that long, admittedly...my students have an exam tomorrow and bombared me with questions.)also, I'm not entirely clear how I can store my monstrous (ideally, > 10^7 entries) binary string, or in what format. but I'm still tinkering with it.just an update. thanks for the suggestion that I use map, I checked out the links and I'm sure it'll all become clear soonish.
4/21/2014 12:14:47 AM
Storing 10^7 bits is pretty trivial... That's like a megabyte of data.A map is just a dictionary, or a lookup table. Or is it the implementation you're more having issue with?
4/21/2014 1:12:41 AM
This sounds tractable analytically; are you simulating it to check your work?
4/21/2014 11:56:04 AM
^^yeah, but the thing is , I want to generate a random string of 10^7 bits of data. In fact, I want to generate dozens or hundreds of them, all within the same program, since I'm trying to simulate the statistical properties of a random variable depending on a random enormous string of data.I guess I'm not sure how/where to store the gigantic base-string. C++ flatly refuses to store it as an array, of course, and that's kinda the only structure I'm familiar with.^yeah, I think it's tractable, but nobody's done it yet. this is my thesis, actually. I'm not precisely checking my work; it's more, I want to get a feel for what's happening, how this variable behaves for certain ranges of k/log(n). The expectation is more or less understood; I'm looking at the variance.
4/21/2014 4:55:46 PM
Isn't there anyone in your department that can sit down and help you? These programming problems aren't very big ones to overcome, a senior computer engineering student, or computer science student can help you figure them out easily.
4/21/2014 5:30:07 PM
4/21/2014 7:33:10 PM
yeah, I'm sure there is. I just randomly decided to ask TWW first though, for some reason
4/21/2014 9:55:46 PM
seriously, like moron is saying, use a map. A tree is not what you want in this scenario. You want something with a fast lookup, and a binary tree has O(log2(n)) lookup time (normal case) while a map has O(1). Even better, you have no need for pointers with a map, especially one that's already made for you like what moron suggested.Also, I'd suggest going away from C++ for this for the time being, unless you have a background in it. You'd probably have better luck banging out the basics in java or C#, and those will keep you away from nastier bugs that are easy to write in C++ when you're just starting out. Both java and C# have a dictionary (or map) as part of the base libraries, too.Another word of advice would be not to build your string as you go along. Generate your huge random string first, and then pass it in to the algorithm that analyzes it. This will be much more parallelizable (and trivially so, at that), and it has the added benefit of allowing you to easily pass in your own strings for any "weird cases" that you might want to explore. It's also going to run faster, even with the worst possible implementations. At the end of the day, you want to analyze a random string, so it doesn't matter if the string is generated as you go or if it's generated all at once.
4/22/2014 12:19:46 AM
map is still implemented as a binary tree, to my knowledge, not a hash table.A hash table would be a great option for this too, assuming he could find a good hashing algorithm. If he couldn't, there would be memory issues if he tried to scale it up.This is easy in C#, but C# is so Microsofty...[Edited on April 22, 2014 at 12:32 AM. Reason : ]
4/22/2014 12:31:14 AM
javascript manjavascript
4/22/2014 2:20:12 AM