Our source documents were collected from Genius. We web-scraped the lyrics for each artist and album using Python and the lyricsgenius library (see code below). This gave us a good basis for the project: structured text files that we then altered for our corpus analysis.
GENIUS_API_TOKEN='hmbNz6OGtNv2DNSTaZ0UMegsY9fUXxpoB9_el_rx59pgtbis-_Iydx1D_yB7nQm7'
from lyricsgenius import Genius
genius = Genius(GENIUS_API_TOKEN)
# artist = genius.search_artist("Olivia Rodrigo", max_songs=3, sort="title")
# print(artist.songs)
album = genius.search_album("Guts", "Olivia Rodrigo")
for track in album.tracks:
    # Save each track's lyrics to a text file named after the song title
    file_name = track.song.title + ".txt"
    print("Downloading file: " + file_name)
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(track.song.lyrics)
    #print(track.song.lyrics)
# album.save_lyrics()
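One caveat with the loop above is that it uses the song title directly as the file name, and some operating systems reject characters such as "?" or "/" in file names. The helper below is a minimal sketch of one way to guard against that. The name `safe_file_name` is hypothetical and was not part of our original script.

```python
import re


def safe_file_name(title: str) -> str:
    """Hypothetical helper: turn a song title into a usable .txt file name."""
    # Replace characters that Windows forbids in file names with an underscore.
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", title)
    # Trim surrounding whitespace and trailing dots, which Windows also rejects.
    return cleaned.strip().rstrip(".") + ".txt"
```

In the loop, `file_name = safe_file_name(track.song.title)` would then take the place of the plain string concatenation.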
We found that our collection of lyric text files contained unrelated information that needed to be removed. This included descriptions of advertisements shown on the Genius website, scattered within the song lyrics. We used Oxygen's Find and Replace in Files feature to capture and discard this information. Click here to view the markdown file showing the steps we took to clean up the files. The cleaned text files can be viewed on the Song List page!
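The clean-up itself was done in Oxygen. For readers who would rather script it, the following is only a sketch of how the same kind of find-and-replace could be done in Python. The patterns in `AD_PATTERNS` and the function names are hypothetical placeholders, not the expressions we actually used; the real patterns depend on what appears in the downloaded files.

```python
import re
from pathlib import Path

# Hypothetical patterns, for illustration only. Replace them with
# expressions that match the unwanted text found in your own files.
AD_PATTERNS = [
    re.compile(r"See .+? LiveGet tickets as low as \$\d+"),
    re.compile(r"You might also like"),
]


def remove_ads(text, patterns=AD_PATTERNS):
    """Delete every match of each pattern, like a find-and-replace with an empty replacement."""
    for pattern in patterns:
        text = pattern.sub("", text)
    return text


def clean_folder(folder):
    """Apply remove_ads to every .txt file in a folder, overwriting each file in place."""
    for path in Path(folder).glob("*.txt"):
        original = path.read_text(encoding="utf-8")
        path.write_text(remove_ads(original), encoding="utf-8")
```

Because `clean_folder` overwrites the files, it would be safest to run it on a copy of the downloaded lyrics and compare the result with the originals before relying on it.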