Do NOT Repeat Yourself
Another way to keep things simple is the DRY principle (“Don’t Repeat Yourself”), which states that “every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. Violating DRY hurts on many levels: when state is stored in multiple places, when a piece of code is copied and pasted, or when the same logic is handled by multiple components. As a result, the information stored by the program is likely to become incorrect, the code becomes prone to contagious bugs, and miscommunication between team members becomes easier.
Every time you find yourself copy/pasting a piece of code only to change it slightly, think about how a helper function could be extracted. If a small algorithm or piece of logic has to be repeated in more than one place, that’s another candidate for refactoring. Except for special cases in distributed computing, no piece of information should come from more than one source either. Even if the logic around it is obvious to you now, modifying it later will prove tricky for other team members or your future self.
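As a minimal, hypothetical sketch (the names here are invented for illustration), extracting a duplicated rule into one helper gives it a single authoritative home:

```python
# Hypothetical: the same discount rule used to be copy/pasted at two
# call sites; now both go through one authoritative implementation.
def apply_discount(price, rate):
    """Single source of truth for the discount rule."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return round(price * (1 - rate), 2)

invoice_total = apply_discount(200.00, 0.10)  # 180.0
cart_total = apply_discount(59.99, 0.10)      # 53.99
```

If the rule ever changes, it changes in exactly one place, and both call sites stay in agreement.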
Fundamentals: Data Structures and Algorithms
If you’re a hobbyist or only getting your feet wet, you may be excused, but there’s absolutely no excuse for a career programmer at any level not to learn computer science basics. I’ll skip the horror stories about huge mistakes made by people who hated math at school and assumed the “theoretical stuff” was only for academics. One of the most important areas is data structures and algorithms.
One significant benefit of learning the basics is awareness of time and memory complexity. How long might a piece of code take, and how much space will it need? What if it’s fast on my test data but becomes prohibitively expensive in production? Your textbooks provide simple tools to answer those questions, and even if you haven’t read them, you can at least get familiar with a few common data structures and algorithms and know how they scale as the data grows. That said, there will never be a single right answer to out-of-context design questions involving time and memory complexity, because design decisions almost always involve trade-offs between time, memory, development time, maintenance costs, and so on.
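For a concrete (and hypothetical) illustration: a membership test is O(n) on a Python list but O(1) on average for a set, and the difference only shows once the data grows past toy sizes:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case for the list: the element isn't there, so every item is scanned.
t_list = timeit.timeit(lambda: n in as_list, number=100)
t_set = timeit.timeit(lambda: n in as_set, number=100)
# t_list is typically orders of magnitude larger than t_set, even though
# both versions look equally fast on a 10-element test input.
```

This is exactly the “fast on my test data, prohibitive in production” trap: both containers pass a small test suite in the blink of an eye.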
Another benefit of getting familiar with the basics is being able to follow the flow of information. Where the information is stored, how it is stored or retrieved, and where and why it is needed are questions usually answered by knowing the underlying data structures and their algorithms. Beyond the theoretical aspects, there are also best practices that come from experience and other software engineering disciplines. For instance, it’s well established that the DOM should be used to represent data, not to store it (HTML5 data attributes aside).
You Aren’t Going To Need It
“You Aren’t Going To Need It”. Very well said! If a feature is not required by the business, and there are no concrete near-future plans to add it, you shouldn’t implement it. The same logic applies to lower-level design decisions. More often than not, people look too far ahead and over-design their systems for a rainy day, only for the dreaded future feature to arrive and be a poor fit for all that (over-)preparedness. The idea behind the YAGNI principle is that even if the future were foreseeable, you shouldn’t sacrifice the current code base for it.
Like many similar principles, it doesn’t give us a precise recipe, but it reminds us that, because of human nature, we’re more likely to over-design than under-design.
Idioms and Design Patterns
Standing on the shoulders of giants should feel good. A beaten path tells you that other people have gone the same way, and the same goes for time-tested patterns in programming. Design patterns and programming idioms specific to a tech stack are gifts from past engineers for dealing with recurring problems. The only downside is that it may not be obvious which patterns fit our current needs: a pattern that perfectly answers one situation may be too complicated for another. Design patterns should therefore always be applied with care.
Intermediate engineers, after learning a few design patterns, tend to over-use them. Holding the hammer of design patterns shouldn’t deceive you into seeing every programming problem as a nail. Be cautious, and be fine with rolling your own, but certainly add patterns to your toolbox, as they will often come in handy.
Keeping It Clean
Writing clean code is more than an obsession with aesthetics or mindless adherence to principles. Numerous papers discuss how a consistent coding style makes programmers more productive. Just like a well-written novel, clean code doesn’t require excessive cognitive effort to read and understand, saving more of your human RAM for important coding decisions.
A clean and consistent code base will save other people and your future self time in reading, debugging and modifying. I’m not going to expand on how to write clean code, but I felt it was important enough to deserve a mention. Here are a few bullet points on keeping the code clean and consistent:
Self-documenting code. The modules, the fields and methods, and even the variable names should make it obvious what they’re storing or doing.
Documentation. Not everything can be conveyed by self-documenting code. Sometimes a concise comment next to an inevitably complicated piece of code helps others understand it faster. In my opinion, external documentation should be saved for very specific cases like space shuttle or robotic surgery software; it’s often unnecessary for the rest.
Coding Style. Sloppy programmers tend to dismiss this one as a nuisance. The value of a familiar style is mostly psychological, and it has been shown to improve productivity. If your team doesn’t have a style guide, pick one of the common ones and pitch it.
Consistency. Humans are more efficient on auto-pilot, and the most important factor in achieving that is consistency. Consistency can be maintained on many levels from styling to naming to larger design patterns.
Simplicity? Sometimes making code more readable can work against keeping it simple; again, finding the right balance is an art. Adding an interim variable to put a name on a value in the middle of a calculation is justifiable. Adding a whole class to store that value is not.
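To make the first and last bullets concrete, here is a small hypothetical before/after; the names are invented for illustration:

```python
# Before: the reader has to reverse-engineer what d, t, and f mean.
def f(d, t):
    return d / t

# After: the names carry the documentation; no comment is needed.
def average_speed_kmh(distance_km, elapsed_hours):
    return distance_km / elapsed_hours

# An interim variable to name a mid-calculation value is justifiable;
# a whole class to hold that one value would not be.
trip_hours = 90 / 60
speed = average_speed_kmh(120, trip_hours)  # 80.0
```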
When I started programming circa 2000 as a hobby, all I cared about was how cool my programs and games were. It was mostly about using whatever tools I had at my disposal to make them happen. Participating in programming contests later turned me into a sloppy programmer who would take clever shortcuts and see every small problem as an opportunity to apply a complicated cocktail of data structures and algorithms I’d learned at school. That was sometimes the only way to solve those problems, but nothing at school or in those contests prepared me for the great lesson I would learn in real-world programming.
“I didn’t have time to write a short letter, so I wrote a long one instead.” - often attributed to Mark Twain, though the sentiment goes back to Blaise Pascal
Writing simple letters is not easy, and neither is writing simple programs. I learned that taking pride in the ability to build complicated contraptions for simple tasks is a sign of professional immaturity, and that an engineer who can keep things as simple as possible is the real programming artist. If I’m allowed another quote from another great author, no expression articulates it better than Antoine de Saint-Exupéry’s: “It seems that perfection is reached not when there is nothing left to add, but when there is nothing left to take away”.
This principle translates almost perfectly to engineering. The KISS principle (“Keep It Simple, Stupid!”) is a great example of putting simplicity first. As a more software-engineering-centric example, Agile principle #10 states that “simplicity, the art of maximizing the amount of work not done, is essential”. One might ask what simplicity means in software, and how it can be achieved. Let’s see if we can find the answer.
Simplicity in software
There’s been extensive academic research, and many metrics have been invented by academics and good old general line managers to measure software productivity, complexity, and value. The general consensus is that software is too complex to be quantified accurately, though oversimplified metrics can still tame some of that complexity. Lines of code is probably one of the first metrics used in the industry: it’s very easy to measure and correlates with complexity, but it’s not hard to write a condensed piece of software that no one can understand, and that can’t legitimately be called simple.
How to achieve simplicity
You must have realized by now that there’s no silver bullet for achieving simplicity. There are, however, tools and patterns that make it easier to eradicate complexity in software, one wrong design decision at a time. I’ll enumerate some of them, though it won’t be a comprehensive list by any means.
Understand the damned code first
Chances are you’re not creating everything from scratch. You’re working on something a few other poor souls put together, and if you don’t want to stick more incongruous mud on the pile, you’ll need to understand a thing or two about the code base. First and foremost, this keeps you from re-inventing the wheel: someone before you probably encountered a similar problem and spent time fixing it properly, so why not make use of that?
Another important measure, especially when fixing bugs, is to understand the root cause of the unwanted behaviour. If you don’t kill the bug where it lives, and just stick a band-aid on the symptoms, you’re picking an uphill battle with a bug you don’t even fully understand. In a well-designed code base, the root cause should ideally live in one place, so you can address it there instead of chasing it around and playing whack-a-mole with every symptom.
It always pays off to have at least a basic understanding of what every major component in the code does. This helps you make proper use of them: misusing or abusing other components makes the software more fragile, since other developers can’t always anticipate unorthodox uses of their components, and every change can potentially pull the rug from under your code.
The best way to achieve encapsulation is to define modules, clearly draw their boundaries, and separate their concerns. This is a very high-level recipe, and in practice things can get sticky and hard to separate. Still, when deciding where to put something, it should be obvious to a software engineer whether defining a boundary adds value or is turning into another anal-retentive obsession with how to distribute your possessions among the 100 drawers in the bedroom.
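As a tiny, hypothetical illustration of a boundary that earns its keep: the rest of the system only ever sees one operation, so the internal representation can change without rippling outward.

```python
class RateLimiter:
    """The public contract is a single method; everything else is internal."""

    def __init__(self, max_calls):
        self._max_calls = max_calls  # leading underscore: not part of the API
        self._count = 0

    def allow(self):
        """Return True while the caller is still under its quota."""
        if self._count < self._max_calls:
            self._count += 1
            return True
        return False

limiter = RateLimiter(2)
results = [limiter.allow(), limiter.allow(), limiter.allow()]
# → [True, True, False]
```

The counter could later become a sliding window or a token bucket, and no caller would need to change.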
Obligatory legalese: this post is about my own personal finance in Canada, not general financial advice. There’s always risk in investments, and they’re done at your own risk.
As I graduated from grad school and my paycheques started exceeding my expenses, I realized I had a new grown-up issue to deal with; not a bad issue to have! I wasn’t very comfortable buying invisible assets that constantly fluctuate in price. But once I cut through common human irrationality and realized that parking my surplus income in a savings account is a sure-fire way to let my assets decay, I started looking into buying shares in funds.
As a rather young software developer without many responsibilities, I can afford to be more risk-seeking. As tempting as gambling on individual stocks is, I decided against it; the risk is high, and so is the upkeep of staying current with financial news. Composite funds, with their broad range of risk profiles from aggressive leveraged funds to those tied to non-Greek government bonds, provide a low-maintenance, low-risk investment vehicle.
Mutual funds are typically very convenient to buy; they’re usually offered through employee benefits for RRSP matching or automated withdrawals. The downside is that they take a big chunk of your money every year: it’s common for them to rake in 2-3% of your savings, even in years when they lose 20-30% of it investing in volatile securities. Exchange-traded funds (ETFs), on the other hand, are a much cheaper option (usually less than 1% in management fees), have higher liquidity, and have been shown to be as effective as actively-managed funds. You’re basically betting on civilization as a whole staying in place and slowly moving forward, as it has so far.
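To see why that fee gap matters more than it looks, here’s a rough back-of-the-envelope sketch; the 6% gross return and the exact fee levels are assumptions for illustration only:

```python
def final_value(principal, gross_return, annual_fee, years):
    # Approximate the fee as a simple drag on the annual return.
    return principal * (1 + gross_return - annual_fee) ** years

# $10,000 over 25 years at an assumed 6% gross return:
mutual_fund = final_value(10_000, 0.06, 0.025, 25)  # ~$23,600 at a 2.5% MER
etf = final_value(10_000, 0.06, 0.005, 25)          # ~$38,100 at a 0.5% fee
```

A two-percentage-point fee, compounded over 25 years, swallows well over a third of the final balance.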
Canadian Couch Potato had a good model portfolio back then. I started putting my money into 3 ETFs: 40% trusting governments and companies to pay off their bonds (XBB.TO), 40% trusting the world to move forward (XWD.TO), and 20% betting that conservatives in Canada will make their social backwardness worth it (ZCN.TO). This portfolio has very low management fees and is traded on the Toronto Stock Exchange, so no money is lost to exchange rates.
I opened 3 accounts with Questrade: a TFSA, an RRSP, and a margin account that holds the rest. They don’t have well-lubricated processes, their customer support is subpar, and they won’t stop spamming you, but they offer something unique: buying ETFs with them is free. The bid-ask spread is very low and there are no hidden fees, so dealing with them is worth it.
As a side note, if you’re going to start saving with them, you can use their referral program to earn $25-250 in cash when opening a new account. They also give the referrer a $25-50 bonus, so if you don’t know anyone with a Questrade account, feel free to use my QPass key: 486304026379159
The first thing I did was max out my TFSA. The TFSA is a great Canadian program that lets you earn investment income tax-free. At this time, assuming an annual 10% gain from the ETFs, Canadians can save up to roughly $1,000 a year in taxes. The growing contribution limit and the magic of compound interest will make TFSAs an even bigger deal in the future. After maxing out the TFSA, I started contributing to my margin and retirement accounts. You can use up to $25K of your RRSP toward buying your first home, but anything beyond that is a question of how much you value your future self.
It’s important to keep your portfolio as close to its target allocation as possible (40-40-20 in my case). Selling ETFs on Questrade costs money ($5-10 per trade), so I avoid rebalancing by selling unless the gap grows too large for new contributions to close. I also try to move funds into the TFSA as soon as the limit grows (usually January 2nd), and into the RRSP once I’ve decided how much to save toward retirement this year. This keeps taxes low and sometimes helps with rebalancing. A spreadsheet is all I need to work out the rebalancing.
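The spreadsheet logic can be sketched in a few lines; this is a hypothetical simplification (it ignores that real purchases round to whole shares):

```python
def contribution_split(holdings, targets, contribution):
    """Direct new money toward underweight funds instead of selling."""
    total = sum(holdings.values()) + contribution
    # How far below target each fund sits once the new money is counted.
    deficits = {k: max(0.0, targets[k] * total - v) for k, v in holdings.items()}
    deficit_sum = sum(deficits.values())
    if deficit_sum == 0:  # already balanced: split by target weights
        return {k: round(targets[k] * contribution, 2) for k in holdings}
    return {k: round(d * contribution / deficit_sum, 2) for k, d in deficits.items()}

holdings = {"XBB.TO": 4500, "XWD.TO": 3800, "ZCN.TO": 2200}
targets = {"XBB.TO": 0.40, "XWD.TO": 0.40, "ZCN.TO": 0.20}
buys = contribution_split(holdings, targets, 1000)
# → {'XBB.TO': 100.0, 'XWD.TO': 800.0, 'ZCN.TO': 100.0}
```

In this made-up example the $1,000 contribution closes the gaps exactly; when it can’t, the split is proportional to how underweight each fund is.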
The housing market in Vancouver is overvalued according to unbiased experts. I don’t want to lock all of my past and future savings into a single asset that can dip in value at any time. Buying a house also eliminates some of your options and makes your unexpected expenses much more volatile. Real estate is therefore out of the question, and nothing else meets my criteria of a low-effort, low-risk investment.
Good luck with your investments!
For a long time, I dismissed suggestions to try out mindfulness meditation. It sounded like newly rediscovered eastern-religion nonsense, advocated by the new-age crowd or by people desperately trying to fill the spiritual void in their lives. I didn’t care that they advertised relaxation as a side effect; cutting through the mumbo jumbo to mine practical recipes for peace of mind didn’t seem worth it. Mindfulness remained an untouched realm of feel-good insanity until frequent remarks on Hacker News led me to Sam Harris’s new book.
Sam Harris has a reputation for fighting blind faith and irrationality. He’s also a renowned neuroscientist, which gives him all the authority to talk about faith-free mindfulness meditation. In his new book, Waking Up, he draws the attention of scientifically-minded folks to the baby of mindfulness in the bathwater of eastern religions. He’s careful not to make any irrational assumptions, and keeps apologizing for using conventionally unscientific words like spirituality. Sam doesn’t make claims about reducing our consciousness and sense of self to the physical world; he simply invites us to experience meditation for ourselves.
Besides enjoying the read on wonders of the human brain such as the split-brain experiments, I was able to connect what Sam calls “the illusion of self” with my psychedelic-induced experiences back in the day. The few times I was on magic mushrooms, I witnessed that illusion: my ego, my sense of self, faded away, and I was able to look at myself as a separate person. My psychedelic journey was, luckily, very positive, and I always suggest friends carefully try it at least once. I didn’t know psychedelics were a shortcut to the state of mindfulness. Wouldn’t it be great to get the same relaxation, general satisfaction with life, and compassion toward others without taking the biological and legal risks?
The point of this post is not to describe what mindfulness is or how one should meditate to reach that state. There are plenty of resources on the subject, and anyone who wants to look into them should find it easy to get started. The marginal benefit people get, even at the beginning of their quest, makes it easier to stay motivated and explore further. If you’re as skeptical as I used to be, I hope this convinces you to give mindfulness the benefit of the doubt. I did, and I’ve been enjoying it so far.
As I was waiting to start a new job in January, I took on a consulting gig to implement a scalable web crawling/scraping system for a company in Vancouver. My client insisted it had to be done using Apache Nutch, which is a great tool for high-scale scraping.
Even though Nutch is first and foremost a web search engine, its “fetcher” component remains a best-in-class web crawler. In fact, Hadoop was originally spun out of Nutch, if that says anything about Nutch’s quality. The plugin-based architecture of Nutch allowed me to easily cherry-pick the crawling-related parts and plug in custom indexers for my client.
Another reasonable requirement from my client was to run the crawler on Amazon Web Services (AWS). Even without that requirement, I would have advised the same. AWS Elastic MapReduce (EMR) makes it incredibly easy and cheap to start a Hadoop cluster, and it’s seamlessly integrated with other relevant AWS services such as S3. Using EMR is slightly more expensive than setting up your own Hadoop cluster on EC2, and there are a few quirks (e.g. you cannot use the cheaper generation of small and medium instance types), but the savings in labour costs and general headache more than justify it.
Nutch’s plugin system conveniently provides a number of extension points where you can hook up your plugins, and they’ll be called at the right stage with all the available information. I only needed to extend HtmlParseFilter to scrape data off HTML files, and IndexingFilter and IndexWriter to store the extracted data in my client’s existing database. HtmlParseFilter provides the HTML source, and in this case regular expressions were powerful enough for me. If real HTML parsing is required, Nutch ships with the highly resilient TagSoup parser. You can even use the likes of Selenium to get past scraper-deterring AJAX calls.
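To illustrate the kind of extraction regular expressions can handle, here’s a sketch in plain Python rather than inside a Nutch plugin; the markup and the patterns are invented for illustration and would differ for any real site:

```python
import re

# Hypothetical product-page snippet; real patterns depend on the target site.
html = '<h1 class="title">Blue Widget</h1><span class="price">$19.99</span>'

title_re = re.compile(r'class="title">([^<]+)<')
price_re = re.compile(r'class="price">\$([\d.]+)<')

record = {
    "title": title_re.search(html).group(1),
    "price": float(price_re.search(html).group(1)),
}
# → {'title': 'Blue Widget', 'price': 19.99}
```

Regexes like these break as soon as the markup shifts, which is exactly when a real parser like TagSoup pays off.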
Once I finished writing and testing the plugins locally, it was time to deploy them on a Hadoop cluster. The process is seamless for the most part; there’s probably no software more native to Hadoop than its parent, Nutch. This Makefile is a good starting point for deployment scripts, though it no longer works out of the box since the AWS command line interface has changed. Another obstacle was a flaw in Nutch where plugin binaries were not present on cluster slaves, so I had to extract the JAR files and copy them to the shared filesystem (HDFS) as a bootstrap action.
After overcoming the configuration hurdles, the first scraping adventure went really well. It took tens of hours on a cluster of 5 small slaves for a particular retailer’s website. Scraping is generally IO-bound and doesn’t need heavy-duty machines; combine that with AWS spot instances and it ends up dirt cheap. I left most of the default configuration untouched, and the bottleneck seemed to be the built-in minimum wait for politeness toward the target servers. Nutch’s crawl artifacts are also saved on HDFS and can easily be stored on S3 in Snappy format for later analysis.
As a final note, Nutch provides all the conventional features for responsible crawling/scraping, from robots.txt compliance to custom user agents to a minimum delay between calls. Have fun!