
Ruby Regular Expressions

If you scrape web pages with Ruby and need regular expressions to do it, a great site for trying them out is Rubular.

One of the little tricks I discovered today was the use of ?: at the start of a group in a regular expression, which makes the group non-capturing.

I’m sure this is old news to most people, but since it was useful to me, maybe it is of use to someone else.

In Ruby, let’s say I’m scraping a page using the scan method. If there’s an <H2> I want to grab it, if there’s an <H3> I want to grab it, and I want to ignore anything inside a <p>. Well, I can make the H2 optional by placing it in parentheses and throwing a ? after it, like so:

/(<H2>.*<\/H2>)?/mi
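
To see what scan hands back for an optional capturing group like this, here is a minimal sketch. The sample markup and the names SAMPLE_WITH_TAGS and capture_with_tags are invented for illustration. Because the whole pattern is optional, scan also reports an empty match (nil) at every position where no H2 starts, so the sketch drops those with compact.

```ruby
# Hypothetical sample markup, invented for illustration.
SAMPLE_WITH_TAGS = "<p>intro</p><H2>Title</H2>"

# The parentheses wrap the whole element, so the captured string
# includes the <H2> and </H2> tags as well as the text.
def capture_with_tags(html)
  html.scan(/(<H2>.*<\/H2>)?/mi).flatten.compact
end

p capture_with_tags(SAMPLE_WITH_TAGS)
# => ["<H2>Title</H2>"]
```

The text is there, but still wrapped in its tags.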

However, now I have the problem of cleaning the H2 tags out of the gathered data, because the parentheses capture the tags along with the text. An easier way is this:

/(?:<h2>(.*?)<\/h2>)?(?:<H3>(.*?)<\/H3>)?(?:<p.*?>)?/mi
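
As a rough sketch of how a pattern of this shape behaves with scan, assuming headings without attributes: the sample page and the names SAMPLE_PAGE, HEADING_PATTERN and extract_headings are made up for the example. Every outer group is optional, so scan also returns a [nil, nil] entry wherever nothing matched, and the sketch filters those out.

```ruby
# Hypothetical sample page, invented for illustration.
SAMPLE_PAGE = "<h2>Title</h2><h3>Subtitle</h3><p class=\"x\">Body text</p>"

# (?: ... ) groups without capturing, so only the two inner ( ... )
# groups show up in the arrays that scan returns.
HEADING_PATTERN = /(?:<h2>(.*?)<\/h2>)?(?:<h3>(.*?)<\/h3>)?(?:<p.*?>)?/mi

# Returns [h2_text, h3_text] pairs, skipping positions where neither matched.
def extract_headings(html)
  html.scan(HEADING_PATTERN).reject { |pair| pair.all?(&:nil?) }
end

p extract_headings(SAMPLE_PAGE)
# => [["Title", "Subtitle"]]
```

The heading text comes back without its tags, and the opening <p> tag is matched but never captured.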