Originally published on https://selenachau.wordpress.com/2016/10/18/ruby-nokogiri-for-webscraping/ for the AAPB NDSR.
As part of the AAPB NDSR fellowship, we residents are afforded personal professional development time to identify and learn or practice skills that will help us in our careers. This is a benefit that I truly appreciate!
Ruby... the jewels theme continues from Perl
I had heard of the many programming and scripting languages like Perl, Python, Ruby, and bash but really didn't know of examples that showed their power, or applicability. In the past month, I've learned that knowing the basics of programming will be essential in data curation, digital humanities, and all around data transformation and management. Librarians and archivists are aware of the vast amount of data: why not utilize programming to automate data handling? Data transformation is useful for mapping and preparing data between different systems, or just getting data in a clearer, easier to read format than what you are presented with. At the Collections as Data Symposium, Harriett Green presented an update on the HTRC Digging Deeper, Reaching Further Project [presentation slides]. This project provides tools for librarians by creating a curriculum that includes programming in Python. This past spring, the workshops were piloted at Indiana University and University of Illinois--my alma mater :0). I can't wait until the curriculum becomes more widely available so more librarians and archivists know the power of programming!
But even before the symposium, my NDSR cohort had been chatting about the amounts of data we need to wrangle, manage, and clean up in our different host sites. Lorena and Ashley had a Twitter conversation on Ruby that I caught wind of. Because of my current project at KBOO, I was interested in webscraping to collect data presented on HTML pages. Webscraping can be achieved by both Python and Ruby. My arbitrary decision to learn Ruby over Python is probably based on the gentle, satisfying sound of the name. I was told that Python is the necessary language for natural language processing. But since my needs were focused on parsing html, and a separate interest in learning how Ruby on Rails functioned, I went with Ruby. Webscraping requires an understanding of the command line and HTML.
- I installed Ruby on my Mac with Homebrew.
- I installed Nokogiri.
- I followed a few tutorials before I realized I had to read more about the fundamentals of the language.
I started with this online tutorial, but there are many other readings to get started in Ruby. Learning the fundamentals of the Ruby language included hands-on programming and following basic examples. After learning the basics of the language, it was time for me to start thinking in the logic of Ruby to compose my script.
Pieces of data are considered objects in Ruby. These objects can be pushed into an array, or set of data.[/caption] As a newbie Ruby programmer, I learned that there is a lot I don't know, there are better and more sophisticated ways to program if I know more, but I can get results now while learning along the way. For example, another way data sets can be manipulated in Ruby is by creating a hash of values. I decided to keep going with the array in my example.
So, what did I want to do? There is a set of program data across multiple html pages that I would like to look at in one spreadsheet. The abstract of my Ruby script in colloquial terms is something like this:
- Go to http://kboo.fm/program and collect all the links to radio programs.
- Open each radio program html page from that collection and pull out the program name, program status, short description, summary, hosts, and topics.
- Export each radio program's data as a spreadsheet row next to its url, with each piece of information in a separate column with a column header.