The Olivia Rodrigo Project

Reflection

Cleaning Text Files

Scraping the lyrics from Genius gave us text files with a mostly clean and simple structure. However, the scrape also picked up unnecessary material that we needed to remove, including the ads on the page. Our first attempts to select only those parts produced errors, but we eventually found an effective way to match every ad and replace it with nothing, which deletes it.
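The find-and-replace-with-nothing approach can be sketched in Python as follows. This is only an illustration, not the exact expression we used: the ad pattern, the `strip_ads` helper, and the sample text are all assumptions made up for the example, and the real ad text would need its own pattern.

```python
import re

# Hypothetical ad pattern (an assumption for this sketch): a whole line that
# starts with "See", mentions "Live", and contains "Get tickets".
# The trailing \n? removes the line break too, so no blank line is left behind.
AD_PATTERN = re.compile(r"^See .* Live.*Get tickets.*$\n?", flags=re.MULTILINE)


def strip_ads(text: str) -> str:
    """Delete every line matching the ad pattern by replacing it with nothing."""
    return AD_PATTERN.sub("", text)


# Made-up sample text standing in for a scraped lyric file.
sample = (
    "[Verse 1]\n"
    "First lyric line\n"
    "See Some Artist LiveGet tickets as low as $50\n"
    "Second lyric line\n"
)

cleaned = strip_ads(sample)
print(cleaned)
```

Anchoring the pattern to whole lines (with `^`, `$`, and `re.MULTILINE`) keeps the match from spilling into the lyrics on either side of the ad.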

Regex

Coming up with the regex to convert our text files into XML proved challenging. Capturing the type of section (Verse, Chorus, Bridge) was quite simple, since those labels were already in square brackets. However, finding an effective way to turn each label into an attribute of a section element that wraps around the corresponding lyrics took a lot of trial and error. From this experience, we learned how important the order in which you run your regex steps is, as each step builds on the one before it.
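To show why the order matters, here is a minimal sketch of a three-step conversion in Python. It is not our actual sequence of expressions: the `lyrics_to_xml` helper, the `song` and `line` element names, and the sample text are assumptions for illustration. Only the idea of a section element with the bracketed label as an attribute comes from the project itself.

```python
import re
from xml.sax.saxutils import escape


def lyrics_to_xml(text: str) -> str:
    # Step 1: escape &, <, > and quotes BEFORE any tags exist.
    # Running this after step 2 would escape our own tags as well.
    text = escape(text.strip(), {'"': "&quot;"})

    # Step 2: wrap each "[Label]" header and the lyrics below it in a
    # section element, with the label as the type attribute.
    text = re.sub(
        r"^\[([^\]]+)\]$\n?(.*?)(?=\n*^\[|\Z)",
        r'<section type="\1">\n\2\n</section>',
        text,
        flags=re.MULTILINE | re.DOTALL,
    )

    # Step 3: wrap every remaining non-empty line that is not a tag in a
    # (hypothetical) line element. This relies on step 1: an escaped lyric
    # line can no longer start with "<".
    text = re.sub(r"^(?!<)(.+)$", r"<line>\1</line>", text, flags=re.MULTILINE)

    # The root element name "song" is also an assumption.
    return f"<song>\n{text}\n</song>"


# Made-up sample text standing in for a cleaned lyric file.
sample = "[Verse 1]\nFirst line & more\nSecond line\n\n[Chorus]\nHook line\n"
xml_text = lyrics_to_xml(sample)
print(xml_text)
```

Swapping any two steps breaks the result, which is the same lesson we ran into: each replacement assumes the text is in the state the previous one left it in.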

Created for the DIGIT 210: Text-analysis class at Penn State Behrend