The Nokogiri version of scraping data

I spent some time playing around with awk, sed, grep and others, trying to scrape HTML content. It works: all these CLI commands are really powerful, especially if you know regex and understand how to apply it to a particular challenge. But when you are in Ruby it is probably easier to use the 'nokogiri' gem.

Let's have a quick look at what it is capable of.

The very first thing we need is to install it.


$ gem install nokogiri

Now that it is installed, let's create a new .rb file, give it the executable permission (+x) and write some code inside.


$ touch blog.rb
$ chmod +x blog.rb
$ atom blog.rb

Note that I am on Mac OS X and my favorite editor is Atom, but you may want to use vi, vim, nano, TextMate, Sublime etc. Let's assume I have this blog and I need to pull the text of every tag in its tag cloud. The tags sit under the row > col > a path in the page structure. With this in mind, here is a snippet that scrapes these tags and prints them to my screen (puts).


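#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

site = 'http://blog.erudinsky.com'
# URI.open comes from open-uri; a bare open(url) no longer works on Ruby 3.0+
html = URI.open(site).read
parsed_site = Nokogiri::HTML(html)
# Tag cloud links: a.css1 directly inside div.col, directly inside div.row
css1_objects = parsed_site.css("div.row > div.col > a.css1")

# Print the text of each tag
css1_objects.each do |o|
  puts o.text
end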

Now, let's save it and run it from the CLI:


$ ./blog.rb
иллюзия
tree
random
спб
pickadate.js
9 мая
erudinsky
static site
CSS
web development
eu
sugar
locale issue
skype
windows8
lorem
оформление текстов
Supaplex
games
cloud
pumpkins
cloudfront
Embedded host client
вёрстка
responsive design
iOS9
баг
I18n
iso
Florida
Vacation
IKEA
стул
minimalkids
чемпионат
monday
timomaas
tweet
prawn
image
pdf
pbx
fail2ban
g729
nocomments
new year
gartner
route
js
devise
whitelisted
deployment
nested hypervisor
scp
linux
puma
hstore
ntfs
cyberduck
short urls
storage
reactjs
песчаные скульптуры
orchestration
macbook
vhd
vhdx
html
virtualisation
blog
materializecss
wysiwyg
svg
карта
jquery
glacier
nokogiri
новый интерфейс
аврора
tags
acts_as_toggable
pokemon
покемон
териберка
экспедиция
paperclip
containers
swap
fix
digitalocean
password
mixmonitor
rwa
mount bucket
ls
administration
html5
neverhood
sinatra
virtualbox
windows
90s
неверьвхудо
lol
giphy

Basically, the open-uri call fetches the page over HTTP (it uses Ruby's own Net::HTTP under the hood rather than cURL or wget), Nokogiri parses the HTML, and then we loop over css1_objects and simply puts the text of each one to the screen.


In short, this is about:
#nokogiri
#ruby
