On the futility of man and the endless pain of an incoherent existence
=======================================================================

(This blog is written in [Markdown](http://daringfireball.net/projects/markdown/syntax))

So, lately I've been doing some data analysis because I have literally nothing better to do with my time than pointless data analysis. My first target was the My Little Pony fan fiction archive at Equestria Daily, because people are probably interested in that and it's a sizable database full of Stuff.

You can find the end result here - there's the raw data, a search, and a visualiser based on verlet physics kindly donated by a friend who goes by the name of Knighty. But you guys probably aren't interested in how often Applejack and Rainbow Dash are written as lovers (it's about 50% of the time), or how often Trixie appears without Twilight (62% of the time), so I shan't elaborate on the results. The methodology is much more interesting anyway.

The What
--------

With sirXemic's advice and assistance, I used the Apriori data mining algorithm to find common sets of tags, and then did a more straightforward analysis on that much smaller dataset to work out each set's Support and Confidence - basically, how often the set occurs, and how often the whole set turns up among the stories that contain one of its subsets. You can find my implementation on GitHub, as well as the tool I used to gather the data in the first place.
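
To make Support and Confidence a bit more concrete, here's a minimal Python sketch of the Apriori idea. It is not my actual implementation: the function names (`frequent_itemsets`, `confidence`), the `min_support` threshold and the five toy stories are all made up for illustration, and the tags in them don't reflect the real archive.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the set.
        return sum(1 for t in transactions if itemset <= t) / total

    # Level 1: frequent single items.
    singles = {frozenset([item]) for t in transactions for item in t}
    scored = {s: support(s) for s in singles}
    current = {s: v for s, v in scored.items() if v >= min_support}

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        size += 1
        # Join step: merge frequent sets into candidates one item larger.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        scored = {c: support(c) for c in candidates}
        current = {c: v for c, v in scored.items() if v >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    antecedent = frozenset(antecedent)
    return frequent[antecedent | frozenset(consequent)] / frequent[antecedent]


# Hypothetical toy data: each story is just its set of tags.
stories = [
    {"Applejack", "Rainbow Dash", "Shipping"},
    {"Applejack", "Rainbow Dash"},
    {"Applejack", "Twilight"},
    {"Trixie", "Twilight"},
    {"Trixie"},
]

supports = frequent_itemsets(stories, min_support=0.4)
pair_conf = confidence(supports, {"Applejack"}, {"Rainbow Dash"})
```

The prune step is the whole trick: a set can only be frequent if all of its smaller subsets are, so most candidates get thrown away before anything has to be counted.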

So, 64D?
--------

Next, I decided to run it on 64D's blog database. At 2am last night I wrote a quick application to dump all the blog text to a file, and set my analyser on it.
Well, 5 hours later it ran out of memory because I'd compiled it for x86 and it'd just hit the 2GB cap, so… there's that. It's still running. I have a less fancy analysis here though, which is a much more straightforward list of the frequencies of words, as well as what individual users have used them and how often they use them.
For example, 97% of Scott_AW's 250 blogs include the word "the" (that's 244 blogs!). Cyrus is less thetastic, with only 92% of his blogs containing that word. The person who uses the word "Game" the most (out of the 3760 occurrences) is KaBob, with 54% of his blogs referring to it. Next up, with just one fewer, is Cesque.
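
Roughly, the per-user figure boils down to something like this Python sketch. It isn't the code that produced the numbers above: the names (`words_in`, `blog_share_by_user`) and the sample blogs are invented for illustration, and it lowercases everything and drops punctuation before counting, which is an assumption on my part about how the words ought to be split up.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def words_in(text):
    """Return the set of distinct lowercase words in a blog's text."""
    return set(WORD_RE.findall(text.lower()))


def blog_share_by_user(blogs, word):
    """Map each user to (blogs containing word, total blogs, fraction).

    blogs is an iterable of (user, text) pairs.
    """
    word = word.lower()
    totals = Counter()
    hits = Counter()
    for user, text in blogs:
        totals[user] += 1
        # Count the blog once, however many times the word appears in it.
        if word in words_in(text):
            hits[user] += 1
    return {user: (hits[user], totals[user], hits[user] / totals[user]) for user in totals}


# Hypothetical sample data, not taken from the real database.
sample_blogs = [
    ("alice", "The game is out!"),
    ("alice", "Nothing to see here."),
    ("bob", "Game, game, GAME."),
]

share = blog_share_by_user(sample_blogs, "game")
```

Note that this counts blogs containing the word, not total occurrences, so a blog that says "game" three times still only counts once towards the percentage.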
Will the more detailed analysis ever finish? Maybe. Will it tell us anything useful? Maybe. Will we be able to view it using the tools I currently have? Fuck no, it's 100 times larger than any other dataset I have.
You should have stripped symbols from all words etc, now it's a bit inaccurate :3
Yeah, I noticed that when I was looking through the data myself. I should have.
If you could run that against a program that can distinguish certain personality aspects based on the percentage of certain words, that would be interesting lol.
Yes, very.