Blog
Crawling/Scraping with Apache Nutch on AWS
While waiting to start a new job in January, I took on a consulting gig to implement a scalable web crawling/scraping system for a company in Vancouver. My client insisted that it be built with Apache Nutch, which is a great tool for high-scale scraping.
Even though Nutch is first and foremost a web search engine, its “fetcher” component remains a best-in-class web crawler. In fact, Hadoop started out as a part of Nutch, and Nutch itself grew out of the Lucene project, which says something about its pedigree. The plugin-based architecture of Nutch allowed me to easily cherry-pick the crawling-related parts and plug in custom indexers for my client.
Another reasonable requirement from my client was to run the crawler on Amazon Web Services (AWS). Even without that requirement, I would have advised the same. AWS Elastic MapReduce (EMR) makes it incredibly easy and cheap to start a Hadoop cluster, and it’s seamlessly integrated with other relevant AWS services such as S3. Using EMR is slightly more expensive than setting up your own Hadoop cluster on EC2, and there are a few quirks (e.g. you cannot use the cheaper generation of small and medium instance types), but the reduction in labour costs and general headache more than justifies it.
Nutch’s plugin system conveniently provides a few extension points where you can hook in your plugins, and they’ll be called at the right stage with all the available information. I only needed to extend HtmlParseFilter to scrape data from HTML files, and IndexingFilter and IndexWriter to store the extracted data in my client’s existing database. HtmlParseFilter provides the HTML file’s source, and in this case, regular expressions were powerful enough for me. If proper HTML parsing is required, the highly resilient TagSoup ships with Nutch. You can even use the likes of Selenium to get past scraper-deterring AJAX calls.
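To illustrate the regular-expression approach outside of Nutch, here is a minimal Python sketch. The sample markup, the field names, and the `extract_product` helper are all hypothetical; an actual HtmlParseFilter plugin would be written in Java against Nutch’s plugin API, which is not shown here.

```python
import re

# Hypothetical product-page markup; real pages will differ.
SAMPLE_HTML = """
<div class="product">
  <h1 class="title">Blue Widget</h1>
  <span class="price">$19.99</span>
</div>
"""

# One pattern per field. re.S lets '.' span newlines inside the title tag.
TITLE_RE = re.compile(r'<h1 class="title">\s*(.*?)\s*</h1>', re.S)
PRICE_RE = re.compile(r'<span class="price">\s*\$([0-9]+(?:\.[0-9]{2})?)\s*</span>')


def extract_product(html):
    """Return a dict of scraped fields, or None if the page doesn't match."""
    title = TITLE_RE.search(html)
    price = PRICE_RE.search(html)
    if not (title and price):
        return None
    return {"title": title.group(1), "price": float(price.group(1))}


print(extract_product(SAMPLE_HTML))
```

Patterns like these are brittle against markup changes, which is why a real parser is the safer choice once the pages get more complicated.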
Once I finished writing and testing the plugins locally, it was time to deploy them on a Hadoop cluster. The process is seamless for the most part. There’s probably no software more native to Hadoop than its parent, Nutch. This Makefile is a good start for writing deployment scripts. However, it doesn’t work out of the box, because the AWS command line interface has changed since it was written. Another obstacle in deployment was a flaw in Nutch that left the plugin binaries missing on the cluster’s slave nodes, so I had to extract the JAR files and copy them to the shared filesystem (HDFS) as a bootstrap action.
After overcoming the configuration hurdles, the first scraping adventure went really well. It took tens of hours on a cluster of 5 smaller slaves to scrape a particular retailer’s website. Scraping is IO-bound in general and doesn’t need heavy-duty machines. Combine that with AWS spot instances and it ends up dirt cheap. I didn’t touch most of the default configurations, and the bottleneck seemed to be the built-in minimum wait between requests, which is there out of politeness to the target servers. Nutch’s crawl artifacts are also stored on HDFS and can easily be copied to S3 in Snappy format for later analysis.
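A rough back-of-the-envelope calculation shows why the politeness delay dominates. The sketch below is only an estimate under stated assumptions: the page count is made up, the 5-second delay is an assumed value in the spirit of Nutch’s `fetcher.server.delay` setting (check your own configuration), and `estimate_crawl_hours` is a hypothetical helper, not part of Nutch.

```python
def estimate_crawl_hours(pages, delay_seconds, hosts=1):
    """Lower bound on crawl time when requests to each host are spaced
    by a fixed delay and pages are spread evenly across hosts."""
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    pages_per_host = pages / hosts
    return pages_per_host * delay_seconds / 3600


# Assumed numbers: 20,000 pages on a single retailer's host, 5 s between requests.
print(round(estimate_crawl_hours(20_000, 5.0), 1))
```

Because the delay applies per target host, adding more machines does not shorten the crawl of a single website; it only helps when the pages are spread over many hosts.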
As a final word, Nutch provides all the conventional features for responsible crawling/scraping: robots.txt compliance, a custom user agent, and a minimum delay between calls. Have fun!
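For readers who want to see what those politeness checks amount to, here is a small sketch using Python’s standard `urllib.robotparser` module. It does not show how Nutch implements them; the robots.txt content, the user agent string, the URLs, and the `polite_delay` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default_delay=5.0):
    """Use the site's Crawl-delay if it declares one, else fall back to a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch(USER_AGENT, "https://shop.example.com/products/1"))
print(parser.can_fetch(USER_AGENT, "https://shop.example.com/checkout/cart"))
print(polite_delay(parser, USER_AGENT))
```

A crawler would call `can_fetch` before every request and sleep for `polite_delay` seconds between requests to the same host.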
Iceland Trip
Last month, I took an 11-day trip to Iceland, and spent most of it on a road trip around the country. A few unprocessed pictures taken with my phone don’t quite capture how amazing the experience was. This slideshow is my attempt at sharing some of it nevertheless.
If you liked this, you may also like my Chilkoot Trail post.
Toastmasters for Engineers
About a year ago, I joined a Toastmasters club, and today, I gave my 9th prepared speech out of the 10 you need for your first title (i.e. Competent Communicator). Toastmasters is a huge network of self-organizing clubs that help members improve their communication and leadership skills.
I realize that different Toastmasters clubs could have slightly different rules and different cultures, so your mileage may vary. However, the cult-like general guidelines of this huge non-profit organization ensure your experience will not be that club-specific.
If you’re the quintessential engineer with limited communication and soft skills like myself, I’d recommend giving it a shot. Each club has a specific schedule for its meetings, and each meeting may contain a few prepared speeches, impromptu talks, evaluations, and other supporting summaries (e.g. grammar usage summary). Almost every role in each meeting will get an evaluation at the end. There are one-off contests once in a while too.
Every club member works on two tracks in parallel: communication and leadership. I personally put more focus on my communication track, which at the beginner’s level consists of presenting 10 prepared speeches to the audience. Each speech has specific objectives, and they become more advanced as you progress. For instance, they range from merely not using notes to watching your body language and vocal variety.
The leadership track was never very attractive to me. The tasks in that track are obviously designed to help organize the club without any control from Toastmasters itself. Most of the administration is done by volunteers, and the contribution is gamified by awarding leadership badges and honours to those who donate their time. Even though it might have some educational value in leadership, I don’t think you can get anything more out of it than you do at your work.
Even though Toastmasters seems to be mostly about public speaking, it teaches engineers valuable lessons in soft skills. It helps with your speaking and, to some extent, writing skills. However, the two most valuable skills come from the general audience and the evaluation parts. The general audience makes you pick subjects that anyone can understand and is interested in, and the limited speech time (around 5-7 minutes) forces you to be concise. The evaluations teach you how to be tactful and diplomatic and still get your points across. They specifically encourage using the “[shit] sandwich technique”, where you enclose the rough parts of your feedback with pleasant fluff.
As usual, the time and effort spent on Toastmasters have diminishing returns. I feel like the optimal point of retiring for me would be after the first communication manual, but of course, your experience may be different. It’s taken about one hour of my time every week, plus a couple of hours for each speech. The membership fee is negligible compared to the value of your time. You can always attend their meetings as a guest and decide for yourself if it’s your cup of tea.
Competitive Strategy Online Course
Massive open online courses (MOOCs) are a revolutionary way to bring education to the masses, and make the collective human knowledge accessible. You might have heard of Coursera, which is a great for-profit MOOC provider, whose courses have been offered for free so far. They charge for certain certificate courses, but if you’d just like to listen to video lectures or read the notes, it only costs you your time.
I’ve been taking courses from Coursera for a while. The level of engagement is up to you. You can be as involved as a real student actually taking the course, attend study groups on the side, take the exams, and submit the assignments and project. You can also just skim the scripts, and (just like me) watch the video lectures as a passive way of consuming information. They have a broad catalog of courses from top universities around the world, so you’ll definitely find something that interests you.
A few days ago, I came across Competitive Strategy on Coursera. Tobias Kretschmer from a top German university is the instructor of this course, and judging by the first module of the course (out of six), he knows how to keep the audience interested and entertained. The course uses a few game-theoretic tools to analyze different strategies businesses can use to their advantage. As someone with a general interest in microeconomics and marketing, I quickly fell in love with it.
The first module introduced a few basic concepts from game theory, so if you already know about them, it may sound a bit boring. However, I’m looking forward to the upcoming modules to see how those abstract tools are employed in business strategy and marketing. If it piqued your interest, feel free to join; it’s free and open to new students at any time.
Is Adblocking Unethical?
I came across an admittedly biased interview on CNET, where an executive VP of the Interactive Advertising Bureau (an online advertising industry advocacy group) resorted to vitriolic ranting and name-calling against adblockers. It showed how much of a dent adblockers are leaving in their industry. It didn’t, however, argue why affecting their bottom line is unethical, and merely used the straw man of the collapse of the internet if the disruption continues.
I am personally an adblock user. It’s one of the first things I do when I set up a new machine: install Adblock and Ghostery. I usually whitelist the community-generated content providers that I frequent and know won’t survive without the ads. It just feels right, just like tipping my waiters because I know they’d have difficulty without tips in the status quo. Am I robbing other websites of the attention to the ads they feel entitled to? Is it ethical?
We can look at this dilemma from two different angles: whether or not users are stealing money from the individual websites, and whether adblocking is good for the internet as a whole. I disagree that refusing to look at adverts equals stealing. Embedding ads next to the actual content implies that by consuming the content, you are helping the provider make a profit. Let me use an analogy.
The hat next to a street musician has a very similar implication: please pay if you enjoyed it. You never feel obligated to spare change for anyone though (you might be guilt-tripped into it, but never forced). Some people like to help out the talented ones, and simply make them happy since the performance made them happy, but most don’t. If someone forced you to pay up during a street performance, blocked your view, or otherwise disrupted your experience, wouldn’t you be annoyed and just walk away? And that’s without even getting into the creepy trackers.
The obligation to pay for publicly available art or entertainment has never been easy to formulate. It’s usually more about the consumers’ willingness to pay than the providers’ enforcement. At times, publishers have had the technological means to force-feed ads to consumers in a very aggressive way (e.g. cable TV adverts), but it’s now the consumers who have the technological capability to strip the content of intrusive ads and trackers. If producers can’t earn enough money through intrusive advertising, they should find another way to monetize their content or spend their time and effort on something more lucrative. Sorry; the landscape is shifting and so should your business model.
Isn’t it bad though? Doesn’t it make the pie smaller for everyone? Maybe. CNET’s interview didn’t provide much relevant data, and it’s outside the scope of this blog to quantitatively analyze the market anyway. What I can predict is that if the advertising industry doesn’t find ways to make ads less intrusive and more interesting, and doesn’t dissuade people from actively blocking them, they will have a hard time. They may get into an arms race with adblockers, but that’s never a good idea for them. They may regulate or even criminalize adblocking (see piracy laws), but the current economy is not stable and is certainly going to change.
When advertising is not a viable business model any more, site owners may start putting their content behind paywalls, or come up with smarter monetization strategies. It means that publicly available content may be of lower quality, or websites will simply go out of business. You will need to be either a hobbyist or a large corporation to be sustainable in the new economic balance. We may need to pay for good content like in the good old days of newspapers, or struggle to find it among too much noise. It’s going to be uncomfortable to go back there, but without appropriate planning by the industry, it seems inevitable.
There have been proposals like this to find a middle ground for acceptable advertising. The demise of the ads-for-content economy can be prevented if producers and consumers can come up with and agree on standards of what will be tolerated. Until that day, I believe the consumers’ revolt against online advertising is only a reaction to the industry’s creepy, aggressive, and intrusive methods, and certainly not unethical. I’m afraid democratization of content creation and consumption might force businesses to actually listen to the consumers after all!