For the initial markup of Game of Thrones we divided the book into its important structural elements: the table of contents and the text, which is divided into chapters, then further into paragraphs, and then into quotes and character references. The table of contents and the chapters within the text were straightforward to mark up, since the chapters are all listed in the table of contents. We then gave each paragraph within the chapters an individual xml:id. We tagged quotes by searching the text for any plain text surrounded by quotation marks and gave each quote a unique xml:id. For character tagging, we created a unique @ref attribute value for each character and tagged characters each time they occurred in the text, whether by proper name or by an alias. In a separate XML document, we created a cast list consisting of char elements carrying each character's ref value, with sub-elements categorizing the character by alias, title, house, and gender.
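To illustrate the cast-list idea, here is a minimal sketch of how such an entry could look and be queried; the element names, attribute values, and the lookup shown are illustrative, not our project's exact schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical cast-list entry; element and attribute names are
# illustrative, not the project's exact schema.
cast_xml = """
<castList>
  <char ref="ned">
    <name>Eddard Stark</name>
    <alias>Ned</alias>
    <title>Lord of Winterfell</title>
    <house>Stark</house>
    <gender>male</gender>
  </char>
</castList>
"""

root = ET.fromstring(cast_xml)
# Build a lookup from each character's @ref value to their house,
# the kind of join we rely on later when grouping speech by House.
house_by_ref = {c.get("ref"): c.findtext("house") for c in root.findall("char")}
print(house_by_ref)  # {'ned': 'Stark'}
```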
Once we had given the characters their unique character references, we were able to start associating each quote with the character who spoke it. We did this to determine whether characters' speech differed depending on their House, gender, region of origin, and other factors.
We began this process using XSLT to transform the document we were working with. We initially looked for the most basic scenario in which quotes occur, with the structure "character said, 'sentence.'" This was the easiest to locate because the paragraphs were so clearly signposted. We then moved on to tagging quotes that were more complex. This proved very difficult, since the two remaining scenarios were quotes attributed only by pronouns and quotes in a series of alternating dialog (which had no names or pronouns associated with them at all).
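The basic scenario can be sketched with a regular expression; this is an illustrative Python equivalent of the pattern our XSLT matched, not the stylesheet itself, and the example sentence is invented:

```python
import re

# Illustrative sketch (not the project's XSLT): find quotes in the
# simplest pattern, 'Name said, "sentence."', and attribute the quote
# to the name that precedes the speech verb.
SPEECH_RE = re.compile(r'([A-Z][a-z]+) said, "([^"]+)"')

paragraph = 'Ned said, "Winter is coming." The others nodded.'
for match in SPEECH_RE.finditer(paragraph):
    speaker, quote = match.groups()
    print(speaker, "->", quote)  # Ned -> Winter is coming.
```

Paragraphs with pronoun attribution ("he said") or unattributed alternating dialog fall outside this pattern, which is why they required hand tagging.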
We were not able to mark any of the pronoun instances using XSLT; the process proved too intricate, so we switched to tagging the speakers by hand. Although slower, hand tagging improved the accuracy of the attributions, so the setback came with a benefit.
We thought that identifying the most frequently used words by region, gender, and character could lead to interesting discoveries about the culture and peoples of each area. Using NLTK's frequency distribution function, we found the most frequently used words within characters' speech. After sifting out "plumbing" words such as the, a, and, to, of, and it, we were able to focus on meaningful words, then use the ref attribute attached to each speaker to determine his or her region of origin and compare the word to the speech of other characters from the same region.
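The counting-and-filtering step can be sketched as follows; this uses the standard library's collections.Counter in place of NLTK's FreqDist (the two behave the same for this purpose), and the plumbing-word list shown is an illustrative subset, not our full list:

```python
from collections import Counter

# Minimal sketch of the frequency step, with collections.Counter
# standing in for NLTK's FreqDist. The "plumbing" word set here is
# an illustrative subset, not the full list we filtered.
PLUMBING = {"the", "a", "and", "to", "of", "it"}

def top_words(tokens, n=3):
    """Count tokens, ignoring case and plumbing words."""
    meaningful = [t.lower() for t in tokens if t.lower() not in PLUMBING]
    return Counter(meaningful).most_common(n)

tokens = "The king of the North and the king of it all".split()
print(top_words(tokens))  # [('king', 2), ('north', 1), ('all', 1)]
```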
Using NLTK, we tokenized each word in the quotes we had tagged in the novel with the word_tokenize function, then tagged the part of speech of each word with the pos_tag function. From there we gathered frequency counts of which parts of speech were used most often by character, gender, House, and region. We then drew conclusions about how characters and genders are represented in the novel through their speech.
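The aggregation step can be sketched like this; the input mimics the (token, tag) pairs that nltk.pos_tag returns (Penn Treebank tags), and the speakers and quotes shown are illustrative:

```python
from collections import Counter

# Sketch of the per-speaker aggregation. The input imitates the
# (token, tag) pairs nltk.pos_tag returns (Penn Treebank tags);
# the speaker and quote data here is illustrative.
tagged_quotes = {
    "ned": [("Winter", "NNP"), ("is", "VBZ"), ("coming", "VBG")],
    "arya": [("Stick", "VB"), ("them", "PRP"), ("with", "IN"),
             ("the", "DT"), ("pointy", "JJ"), ("end", "NN")],
}

def pos_counts(tagged_quotes):
    """Count part-of-speech tags per speaker."""
    return {speaker: Counter(tag for _, tag in pairs)
            for speaker, pairs in tagged_quotes.items()}

counts = pos_counts(tagged_quotes)
print(counts["arya"]["NN"])  # 1
```

Summing these per-speaker counters by the speakers' House, gender, or region (via the cast list) gives the group-level distributions.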
Furthermore, it is possible that our Python code tagged some words as multiple parts of speech. If so, we will correct the code in future work to obtain more accurate results.
We chose to display the most frequently used parts of speech and words through graphs, which we created in SVG. To create the SVG for our graphs, we used XSLT to transform XML output files. This process was filled with educational trial and error; however, we ultimately prevailed and created appealing and functional SVG. With the graphs, we can clearly see how characters' speech differs depending on their most frequently used words and parts of speech.
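To give a sense of the output, here is a minimal Python sketch of the kind of SVG bar chart the transform produced; this is not our XSLT, and the data and dimensions are illustrative:

```python
# Minimal sketch of a bar chart like those our XSLT produced;
# the data, dimensions, and scaling here are illustrative.
def bar_chart_svg(counts, bar_width=40, gap=10, scale=5):
    """Render a {label: count} mapping as a simple SVG bar chart string."""
    height = max(counts.values()) * scale + 20
    bars = []
    for i, (label, count) in enumerate(counts.items()):
        x = i * (bar_width + gap)
        h = count * scale
        bars.append(f'<rect x="{x}" y="{height - h}" '
                    f'width="{bar_width}" height="{h}"/>')
        bars.append(f'<text x="{x}" y="{height + 15}">{label}</text>')
    width = len(counts) * (bar_width + gap)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height + 20}">'
            + "".join(bars) + "</svg>")

svg = bar_chart_svg({"NN": 12, "VB": 7})
```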