Most of the posts here reflect what I am currently focused on and share work and personal experience. I put them here to recall things later or to help someone else with a similar challenge.
I spent some time playing around with awk, sed, grep and others, trying to scrape HTML content. That works: all these CLI commands are really powerful, especially if you know regex and understand how to apply it to a particular challenge. But when you are in Ruby it is probably easier to use the 'nokogiri' gem.
Let's have a quick look at what it is capable of.
The very first thing we need is to install it.
$ gem install nokogiri
Now that it is installed, let's create a new .rb file, make it executable (+x) and write some code inside.
$ touch blog.rb
$ chmod +x blog.rb
$ atom blog.rb
Note: I am on Mac OS X and my favorite editor is Atom, but you may prefer vi, vim, nano, TextMate, Sublime, etc. Let's assume I have this blog and I need to extract the text of every tag in its tag cloud. The tags sit under the row > col > a path in the page markup, which we can address with a CSS selector. With this in mind, the following snippet scrapes these tags and prints them to my screen (puts).
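#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

site = 'http://blog.erudinsky.com'
# URI.open comes from open-uri; a bare open(url) no longer works on Ruby 3.0+
html = URI.open(site).read
parsed_site = Nokogiri::HTML(html)

# Select the tag cloud links: div.row > div.col > a.css1
css1_objects = parsed_site.css('div.row > div.col > a.css1')
css1_objects.each do |o|
  puts o.text
end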
Now, let's save it and run it from the CLI:
$ ./blog.rb
иллюзия tree random спб pickadate.js 9 мая erudinsky static site CSS web development eu sugar locale issue skype windows8 lorem оформление текстов Supaplex games cloud pumpkins cloudfront Embedded host client вёрстка responsive design iOS9 баг I18n iso Florida Vacation IKEA стул minimalkids чемпионат monday timomaas tweet prawn image pdf pbx fail2ban g729 nocomments new year gartner route js devise whitelisted deployment nested hypervisor scp linux puma hstore ntfs cyberduck short urls storage reactjs песчаные скульптуры orchestration macbook vhd vhdx html virtualisation blog materializecss wysiwyg svg карта jquery glacier nokogiri новый интерфейс аврора tags acts_as_toggable pokemon покемон териберка экспедиция paperclip containers swap fix digitalocean password mixmonitor rwa mount bucket ls administration html5 neverhood sinatra virtualbox windows 90s неверьвхудо lol giphy
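Basically, the open-uri call reads my site over HTTP (open-uri wraps Ruby's own Net::HTTP rather than cURL or wget), Nokogiri::HTML parses the returned markup, and .css selects the matching links. Then we loop over css1_objects and simply puts the text of each one to the screen.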