TQFrisk/HashItOut

# Thomas Frisk: HashSomethingOut (HW 4-2)

## Reflections

I have 6 iterations because I wasn't in class for this topic and tried learning it online. In the process I found a hashing method that I used and improved later; I wanted to be academically honest about that.

I also had a hospital trip tonight, so I rushed the merges a bit and forgot to include the full comments in some merge messages. Where a merge message is missing details, the information is in the commit comments.

Overall, the results were mostly what I expected. The bad iterations, 01 and 05, were abysmal: the bad linked list wasted 99% of its buckets, and the time on the bad linear probe was 19-20x worse.

The second iteration of my linked list is how I learned this, since I wasn't in class. I included it as an extra iteration for that reason, and I cited my source. I didn't know the ord() function and didn't know whether it was covered in class. It was a dramatic improvement: wasted space was about 33% better, and collisions dropped by orders of magnitude.

The third iteration for the linked list was weird. My thought process was that whatever sum I had for the characters, even with ord(), would bunch up in the same areas. So my sleep-deprived idea was to split around the midpoint, adding for some sums and subtracting for others. I also multiplied those values by the ord() of the first character (when one existed) for a better spread. I thought that would push the mean toward the center and then give a nice spread.

The stats were improved, but realistically that's not what's happening in the math. If all the odd values are added and the even ones subtracted, I'd also get a lot of even buckets sitting empty above the midpoint and odd buckets empty below it. That makes sense given that roughly half the buckets were wasted for title. Quote keys probably reached values high enough to wrap around with the modulus, which is why they worked better.

The fourth iteration was an improvement on the iteration that made the most sense and performed best. Since the third wasn't really one that made sense, I instead improved the method I found online. I changed it a bit to add +3 to each character's index, because when I tried offsets from +0 to +5, +3 scored the highest, so that became my improvement.

The fifth iteration was bad linear probing. I used the initial bad hash method for it, and as stated before, it was abysmal.

The sixth iteration was by far the best. It used the best method from the linked-list iterations and had phenomenal stats overall: only about double the time of the best linked list for title, wasted buckets down to 24% at worst, and far, far better results for quotes.


## Iteration 1: the Bad (branch: movieTitle Bad)

The branch name is a misnomer: I was planning on multiple commits for each iteration, but since I'm passing in the same information, it doesn't make sense to make one for title and one for quote.

This is a linked-list (chained) hash table. For iteration 1, both hash tables have 15000 buckets for the 15000 records in the .csv.
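The README doesn't show iteration 1's actual hash function, so the sketch below pairs the chained-table structure with a hypothetical weak hash (key length only) standing in for the "bad" one; all names here are mine, not the repo's.

```python
# Minimal separate-chaining hash table sketch.
# bad_hash is a stand-in assumption: the repo's real "bad" hash
# isn't shown in this README. A length-only hash illustrates why
# nearly all 15000 buckets end up wasted.

SIZE = 15000  # one bucket per record in the .csv, per the README

def bad_hash(key: str) -> int:
    # Hypothetical weak hash: depends only on key length, so most
    # keys collide into the handful of buckets for common lengths.
    return len(key) % SIZE

table = [[] for _ in range(SIZE)]  # each bucket is a chain of (key, value)

def insert(key: str, value) -> None:
    table[bad_hash(key)].append((key, value))

def lookup(key: str):
    # Walk the chain in the key's bucket.
    for k, v in table[bad_hash(key)]:
        if k == key:
            return v
    return None

insert("The Matrix", 1999)
print(lookup("The Matrix"))  # prints 1999
```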

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|-------|---------|---------|
| Title | 15000 | 14893 | 2297776 | 0.32596588134765625 |
| Quote | 15000 | 14892 | 9868809 | 1.1029706001281738 |


## Iteration 2: the Improved (branch: linkedListBetter)

Academic honesty: I was sick when hash functions were defined in class, so I looked online and found this function at https://www.w3schools.com/python/python_dsa_hashtables.asp

I did not know about things like ord() previously, and I'm not counting this one toward the grade; it's here merely for academic honesty about how I learned the functions I used in later hash functions.

Better linked list: sum ord(char)

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|-------|-------|---------|
| Title | 15000 | 11249 | 32188 | 0.03874492645263672 |
| Quote | 15000 | 10192 | 73260 | 0.06971907615661621 |
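The sum-of-ord() hash described above might look like this sketch (the function name is mine; the modulus wraps the sum into the 15000-bucket table):

```python
SIZE = 15000  # matches the table size used throughout the README

def sum_ord_hash(key: str) -> int:
    # Sum every character's Unicode code point, then wrap
    # into the table with the modulus.
    return sum(ord(ch) for ch in key) % SIZE

print(sum_ord_hash("ab"))  # 97 + 98 = 195
```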


## Iteration 3: evenBetterHashFunction

For this one I wanted more variance, so I started with the idea of multiplying by whatever the value of the first character was (if it existed).

Then I thought I'd vary it a bit more by multiplying by the ord() of the first character. I then thought I could vary it further by checking whether the sum is even or odd: if even, subtract it from the midpoint; if odd, add it. This allowed for a better spread.
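Reading that description literally gives something like the sketch below. The exact order of operations (scale by the first character first, then branch on even/odd around the midpoint) is my interpretation, not confirmed against the repo's code:

```python
SIZE = 15000
MID = SIZE // 2  # midpoint of the table

def midpoint_hash(key: str) -> int:
    # Sum of ord() values, scaled by the first character's ord()
    # when the key is non-empty (my reading of the description).
    total = sum(ord(ch) for ch in key)
    if key:
        total *= ord(key[0])
    # Even totals step down from the midpoint, odd totals step up,
    # then the modulus wraps the result back into the table.
    if total % 2 == 0:
        return (MID - total) % SIZE
    return (MID + total) % SIZE
```

Note that scaling by ord(key[0]) doesn't change the parity check's bias: whole ranges of buckets on one side of the midpoint only receive even totals and the other side only odd ones, which matches the wasted-bucket pattern described below.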

Iteration 03: Even Better linked list

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|------|-------|---------|
| Title | 15000 | 7329 | 9878  | 0.0472109317779541 |
| Quote | 15000 | 2726 | 18837 | 0.06874704360961914 |


## Iteration 4: final linked hash

This is really an improvement on iteration 2, because 2 was good and had room for improvement, while 3 didn't as much: 3 used a kind of weird method that doesn't really work with the % the way I was thinking. The idea was to start with each character's ord() multiplied by its index, but that was bad, so I kept adding one to the index. It got better until +3, then got worse at +4, so I left it at +3.
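That tuning process ends up at something like this sketch, where each character's code point is weighted by its position plus the +3 offset (function name mine):

```python
SIZE = 15000

def weighted_ord_hash(key: str) -> int:
    # Weight each character's code point by (index + 3); per the
    # README, offsets +0 through +5 were tried and +3 tested best.
    return sum((i + 3) * ord(ch) for i, ch in enumerate(key)) % SIZE

print(weighted_ord_hash("ab"))  # 3*97 + 4*98 = 683
```

Position-dependent weights break up the anagram problem of the plain sum: "star wars" and "wars star" now hash differently.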

The time is higher than the last iteration's, but wasted buckets and collisions are both down. This is the best of the linked-list iterations.

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|------|-------|---------|
| Title | 15000 | 7188 | 9169  | 0.059093475341796875 |
| Quote | 15000 | 2600 | 18669 | 0.08672595024108887 |


## Iteration 5: bad linear

The time was abysmal for this one, but it wasted very few buckets.

Iteration 05: bad linear probe

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|------|-----------|---------|
| Title | 15000 | 3691 | 78043942  | 16.025943517684937 |
| Quote | 15000 | 109  | 111142876 | 19.86230754852295 |


## Iteration 6: good linear

I used my best linked-list hash for this, and it was the best overall for both time and wasted buckets.
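A sketch of open addressing with linear probing, using iteration 4's weighted hash as the starting slot. The function names and probing details are mine, not the repo's actual code; it assumes the table is never completely full, as here with one slot per record:

```python
SIZE = 15000

def weighted_ord_hash(key: str) -> int:
    # Iteration 4's hash, reused as the probe's starting point.
    return sum((i + 3) * ord(ch) for i, ch in enumerate(key)) % SIZE

table = [None] * SIZE  # open addressing: one (key, value) per slot

def insert(key: str, value) -> None:
    idx = weighted_ord_hash(key)
    # Linear probe: step to the next slot until we find an empty
    # one (or this key, to overwrite). Each step past an occupied
    # slot is what the collision counts below are tallying.
    while table[idx] is not None and table[idx][0] != key:
        idx = (idx + 1) % SIZE
    table[idx] = (key, value)

def lookup(key: str):
    idx = weighted_ord_hash(key)
    while table[idx] is not None:
        if table[idx][0] == key:
            return table[idx][1]
        idx = (idx + 1) % SIZE
    return None

insert("The Matrix", 1999)
insert("Inception", 2010)
print(lookup("Inception"))  # prints 2010
```

With a good starting hash the probe chains stay short, which is why this beats the linked lists on wasted buckets while staying fast; with the bad hash (iteration 5) the chains pile into long runs and the collision count explodes.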

Iteration 06: good linear probe

| Key | Total Buckets | Wasted Buckets | Collisions | Table time (s) |
|-------|-------|------|---------|---------|
| Title | 15000 | 3691 | 126471  | 0.08267903327941895 |
| Quote | 15000 | 109  | 1349221 | 0.3242945671081543 |

About

Hw for 226
