The main reason I decided to get into computer science was because my father used to be a programmer. Now, he has moved into the project management field, but still oversees different types of large scale computer science projects. He works from home a lot and has never seemed like he was very busy or overly stressed about work, so my hope is that getting into the computer science field will lead me down a similar path. Whenever I would complain about school programming projects, he would always tell me how much larger and more complex it gets in real world programming projects but I never really thought much of it.
After working on software for some time, I can now understand what he was talking about — my programming went from projects made up of a couple classes consisting of three or four functions, to a project made up of 50+ classes with I don’t even know how many functions as well as entities, endpoints, html files, css files, and a database with multiple related tables. To say it was a large increase in complexity would be a huge understatement. One thing I learned very quickly is that when you are working with large amounts of data, efficient programming is incredibly important and can be the difference between a webpage taking 10 seconds to load and the page loading almost instantly.
One of the most important aspects of efficient programming is the concept of Big O notation — a way to classify the speed at which your program will run and the memory it will take up. The smaller the Big O, the better. For example, if you have a loop running over a string of length n, your Big O notation will be order n, or O(n) in Big O notation — the loop needs to iterate n times to complete. However, if you can do whatever you need to do without a loop, you can save a lot of memory and time. This would not really matter in a loop of length 20, something you may see in a college project. However, if you are running a loop over an array of length 10,000, you will see a serious increase in the time it takes for your loop to complete as computers do not give instantaneous responses! The idea is to avoid the use of loops if you ever can, though this is not always possible.
A more common problem arises when using nested loops. If you want to count the letters in an array of strings, you need one loop to run over the array of strings, and another loop within that one to run over the string you are currently on. In other words, your loop needs to run n*m times — n strings with m letters in each string. For simplification, this is known as order n*n, or order n^2. Nested loops should be avoided if ever possible, as again the difference between a loop running 1000 times and 1000*1000 times is quite literally exponential. The more nested loops you add, the longer a program will take to run in a field where the difference between a 2 second load time and a 4 second load time is huge. With the example above of counting the number of letters in an array of strings, this can be done with a nested for loop:
However, with a little creativity, you can avoid using nested loops much of the time. For example, instead of pushing all the elements into an array, you can increase your charCounter2 variable by the length of each string as you add them:
This will eliminate the nested for loop, and greatly reduce your runtime in cases of large arrays. Each .length call runs at order n, thus giving you an order of n + n + n + n, simplified to be just order n. The nested for loop would run at order n^2 — if n = 1000 elements, the runtime difference would be 4000 vs 1,000,000. As n gets larger and larger, this difference becomes increases more and more while the load time of your webpage would reduce considerably.
Recently, I ran into a substantial Big O problem in my code. When trying to count the occurences of each word in an array of paragraphs, I had a loop within a loop within a loop to give the correct output. This seemed totally fine when testing with 4 or 5 small paragraphs and I was just happy to get it working. The first loop iterated over the array of paragraphs (denoted as Review), and the next iterated over each of those paragraphs (review.body). The third loop iterated over my variable storing the current word counts to see if that word previously occurred — if not then it was added, and if it was then that word’s count incremented by 1.
However, when I used it with the actual arrays of thousands of longer paragraphs, it took 10-15 seconds complete which was way too long. With some help from colleagues, I discovered associative arrays. In a normal array, you would have to loop through each element in the array to see if the element you are checking exists in that array. If it does not, then you must iterate over every element in the array to check. With associative arrays on the other hand, checking to see if an element is within the associative array is much simpler. When you add an element, a hash string is generated based on that element. Therefore, when you check to see if an element is in an associative array, your computer computes the same exact hash and knows exactly where to look to see if that hash already exists. Thus eliminating an entire for loop brought my function down from order n^3, to order n^2 and reduced the load time by 8-12 seconds. If n=1000, the amount of iterations would drop from 1,000,000,000 to 1,000,000!
The word cloud generator now only takes a few seconds to create beautiful word clouds as opposed to 10-15 seconds:
When it comes to web development, load time is very important. Many times if I try to click on something on my phone and it takes more than a couple seconds to load, I just immediately exit out due to impatience. When you create a website, you do not want users exiting out because your site takes a few seconds to load even with a solid internet connection. It is important to start practicing efficient programming early even when working with small amounts of data.
This will help you avoid situations similar to mine, where you have to figure out how to write efficient programs with data sets of thousands of elements and will save you from a few infinite loops that immediately crash your computer! Efficient programming is a major key throughout all of computer science, but is especially important when it comes to a user interface and a user’s experience!
Last week my girlfriend Diane was looking for some help scraping pollen data from a couple of sites. The code was simple enough to hammer out but how was I going to deliver it? Diane is fairly tech savy but even so asking her to install nodejs and run a command line app was going to be a bit much. After considering options like a Java Swing app or Qt+nodejs I decided to give Electron a shot. Just want to see the code? It’s available here, pollen-scraper.
One of the frustrating aspects of “enterprise” cross platform frameworks is that it takes a long time to even get something up on the screen. Between complex build systems and custom layout languages, it generally takes awhile to get something on the screen using something like Qt or Swing. With Electron getting started was as simple as cloning https://github.com/electron/electron-quick-start-typescript, firing off an “npm install”, and after a “npm start” I had a working cross platform UI on the screen. Additionally, since Electron leverages web technologies it was also straightforward to add Bootstrap and AngularJS to the project.
As mentioned above, Electron applications can use any nodejs library which makes the environment incredibly powerful right out of the box. For example, I was able to leverage turfjs along with Google’s geocoder to find the closest city to an arbitrary zip code in Japan. Being able to tap into the npm/nodejs ecosystem also makes it possible to deliver high value applications quickly since you’re able to focus on business differentiators not plumbing.
All in all, my first foray with Electron was a pretty positive experience. With a couple of beers and an afternoon of work I was able to deliver a cross platform application which saved my girlfriend’s team a dramatic amount of time.
Note: This post originally appeared on Codeburst.
In functional programming parlance “partial application” of a function involves reducing the number of arguments it accepts (it’s “arity”) by some N, returning a new function. Concretely, consider a function with the following signature:
Logger.log(level, dateFormat, msg)
With partial application we’d be able to do something like:
const info = partial(Logger.log, “info”, “ISO8601”);
And then subsequently be able to call our new “info” function like:
To output an “info” message with ISO8601 date formatting.
The classical functional programming approach would be to use the Function.arguments property to dynamically create a new partially applied function. Running with the example above, you’d end up with an implementation that looks like, (Run it on JSFiddle):
Pulling it apart, its straightforward. Save a reference to a list of the arguments that you want to “fill in”, create a new function for the partial, inside this new function combine the saved arguments with the arguments the partial is called with and execute the original function.
This works but is there a cleaner way?
Although it’s normally used for setting the “this” value a function will be invoked with, it’s possible to use bind() for partial application. If you check out the Function.bind docs you’ll notice that in addition to setting “this” it’s able to set the arguments for the function its operating on. By leveraging this along with Function.apply we’ll be able to cook of partial functions. The implementation ends up being something like (On JSFiddle):
Well that’s about it for partially applied functions. If you’re feeling adventurous and want to head down the functional programming check out the related topic, currying.
I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter, if I could scrape the data he’d buy me some single malt scotch which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to login to a web app and maybe sticky some cookies. And then I got started.
As it turned out this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit it, the blocks they were deploying were a mix of:
With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address that your normal Internet connection.
Ok, enough talk, show me some code!
I setup a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” and a “please solve this captcha”. The github repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies, a runnable Electron application. The money file is injected.js which looks like:
To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:
Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address is displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a conduit, so you’ll notice that I run “killall” after each request which closes the Tor conduit and forces it to reopen.
And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.
As always questions and comments are welcomed!