Once we had collected and cleaned our text files, we used regular expressions (regex) to convert them into XML. This was a crucial step for our project: we needed an XML structure that was machine-readable and easy to analyze further. We settled on a simple yet informative structure in which each section of the song is wrapped in a <section> element whose type attribute indicates the kind of section (verse, chorus, bridge, and so on). Each line of lyrics within a section is wrapped in an <l> element. Some sections also carry an n attribute giving the verse number. Below is an example of one of our XML lyric files. Click here to view the Markdown file showing, step by step, how we applied our regex.
<lyrics>
  <section type="intro">
    <l>(Ah)</l>
  </section>
  <section type="verse" n="1">
    <l>Well, good for you, I guess you moved on really easily</l>
    <l>You found a new girl and it only took a couple weeks</l>
    <l>Remember when you said that you wanted to give me the world?</l>
    <l>(World)</l>
    <l>And good for you, I guess that you've been workin' on yourself</l>
    <l>I guess that therapist I found for you, she really helped</l>
    <l>Now you can be a better man for your brand-new girl (Girl)</l>
  </section>
  <section type="chorus">
    <l>Well, good for you</l>
    <l>You look happy and healthy, not me</l>
    <l>If you ever cared to ask</l>
    <l>Good for you</l>
    <l>You're doin' great out there without me, baby</l>
    <l>God, I wish that I could do that</l>
    <l>I've lost my mind, I've spent the night</l>
    <l>Cryin' on the floor of my bathroom</l>
    <l>But you're so unaffected, I really don't get it</l>
    <l>But I guess good for you</l>
  </section>
  <section type="verse" n="2">
    <l>Well, good for you, I guess you're gettin' everything you want (Ah)</l>
    <l>You bought a new car and your career's really takin' off (Ah)</l>
    <l>It's like we never even happened</l>
    <l>Baby, what the fuck is up with that? (Ah)</l>
    <l>And good for you, it's like you never even met me</l>
    <l>Remember when you swore to God I was the only</l>
    <l>Person who ever got you? Well, screw that, and screw you</l>
    <l>You will never have to hurt the way you know that I do</l>
  </section>
  <section type="chorus">
    <l>Well, good for you</l>
    <l>You look happy and healthy, not me</l>
    <l>If you ever cared to ask</l>
    <l>Good for you</l>
    <l>You're doin' great out there without me, baby</l>
    <l>God, I wish that I could do that</l>
    <l>I've lost my mind, I've spent the night</l>
    <l>Cryin' on the floor of my bathroom</l>
    <l>But you're so unaffected, I really don't get it</l>
    <l>But I guess good for you</l>
  </section>
  <section type="break">
    <l>(Ah-ah-ah-ah)</l>
    <l>(Ah-ah-ah-ah)</l>
  </section>
  <section type="bridge">
    <l>Maybe I'm too emotional</l>
    <l>But your apathy's like a wound in salt</l>
    <l>Maybe I'm too emotional</l>
    <l>Or maybe you never cared at all</l>
    <l>Maybe I'm too emotional</l>
    <l>Your apathy is like a wound in salt</l>
    <l>Maybe I'm too emotional</l>
    <l>Or maybe you never cared at all</l>
  </section>
  <section type="chorus">
    <l>Well, good for you</l>
    <l>You look happy and healthy, not me</l>
    <l>If you ever cared to ask</l>
    <l>Good for you</l>
    <l>You're doin' great out there without me, baby</l>
    <l>Like a damn sociopath</l>
    <l>I've lost my mind, I've spent the night</l>
    <l>Cryin' on the floor of my bathroom</l>
    <l>But you're so unaffected, I really don't get it</l>
    <l>But I guess good for you</l>
  </section>
  <section type="outro">
    <l>Well, good for you, I guess you moved on really easily</l>
  </section>
</lyrics>
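To illustrate the kind of regex-driven conversion described above, here is a minimal Python sketch. It is not our actual workflow, which is documented in the linked Markdown file. It assumes a hypothetical input format in which each section opens with a bracketed header such as "[Verse 1]" and every other non-blank line is a lyric line; the names HEADER and lyrics_to_xml are invented for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumption: section headers look like "[Intro]" or "[Verse 1]".
# Group 1 is the section name, group 2 the optional number.
HEADER = re.compile(r"^\[([A-Za-z][A-Za-z -]*?)(?:\s+(\d+))?\]$")


def lyrics_to_xml(text):
    """Wrap header-delimited lyrics in <lyrics>/<section>/<l> elements."""
    out = ["<lyrics>"]
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections
        match = HEADER.match(line)
        if match:
            if in_section:
                out.append("  </section>")
            kind = match.group(1).lower()
            number = f' n="{match.group(2)}"' if match.group(2) else ""
            out.append(f'  <section type="{kind}"{number}>')
            in_section = True
        elif in_section:
            # escape() protects &, < and > so the output stays well-formed
            out.append(f"    <l>{escape(line)}</l>")
        # lines that appear before the first header are skipped
    if in_section:
        out.append("  </section>")
    out.append("</lyrics>")
    return "\n".join(out)


sample = """[Intro]
(Ah)

[Verse 1]
Well, good for you, I guess you moved on really easily
"""
print(lyrics_to_xml(sample))
```

A line-by-line pass like this keeps the header pattern in one place, so additional section names need no extra rules as long as they follow the same bracketed convention.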
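Our schema mirrors this XML structure: it is fairly simple, yet it captures all the information we need to analyze the data properly. A <lyrics> root holds any number of <section> elements; each section has a required type attribute drawn from a fixed list of values, optionally either an n or a subtype attribute, and one or more <l> lines.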
start = lyrics
lyrics = element lyrics { section* }
section = element section { type, (n | subtype)?, l+ }
type = attribute type { "verse" | "chorus" | "bridge" | "outro" | "pre-chorus" | "interlude" | "refrain" | "postchorus" | "breakdown" | "intro" | "post-chorus" | "break" }
n = attribute n { xsd:integer? }
subtype = attribute subtype { text }
l = element l { text }
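As a rough illustration of what the schema enforces, the following Python sketch re-implements its main rules by hand using only the standard library. It is a simplified approximation, not a substitute for validating against the schema itself (which appears to be written in Relax NG compact syntax); the names ALLOWED_TYPES and check_lyrics are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The values listed for the type attribute in the schema above.
ALLOWED_TYPES = {
    "verse", "chorus", "bridge", "outro", "pre-chorus", "interlude",
    "refrain", "postchorus", "breakdown", "intro", "post-chorus", "break",
}


def check_lyrics(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "lyrics":
        problems.append(f"root is <{root.tag}>, expected <lyrics>")
    for i, section in enumerate(root, start=1):
        if section.tag != "section":
            problems.append(f"child {i}: unexpected <{section.tag}>")
            continue
        if section.get("type") not in ALLOWED_TYPES:
            problems.append(f"section {i}: bad type {section.get('type')!r}")
        # (n | subtype)? allows at most one of the two attributes
        if "n" in section.attrib and "subtype" in section.attrib:
            problems.append(f"section {i}: has both n and subtype")
        # xsd:integer? allows an empty value or an integer
        n = section.get("n", "").strip()
        if n and not re.fullmatch(r"[+-]?[0-9]+", n):
            problems.append(f"section {i}: n={n!r} is not an integer")
        # l+ requires one or more <l> children and nothing else
        children = list(section)
        if not children or any(child.tag != "l" for child in children):
            problems.append(f"section {i}: needs one or more <l> only")
    return problems


good = '<lyrics><section type="verse" n="1"><l>line</l></section></lyrics>'
print(check_lyrics(good))
```

Collecting problems in a list, rather than raising on the first failure, makes it easier to review every issue in a lyric file in a single pass.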