I currently work at HackerRank, where we collect data points from various sources on our platform, both enterprise and community (developers). We ingest all these data points through ETL pipelines written in Spark (PySpark).

We use Airflow to schedule the batch jobs, and we have Spark Streaming jobs that process data in near real time to generate data for our business use cases.

Problem

The streaming jobs, which did all the heavy lifting of extracting and cleaning up raw data to gather the processed data, and of assigning a unique identifier for each datum (a single data point) to the user it belongs to (using a custom micro-service…


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.


When a Bash project turns into a library, it can become difficult to add new functionality. Function names, variables and parameters usually need to be changed in every script that uses them. In scenarios like this, it helps to decouple the code and use an event-driven design pattern. In this pattern, an external script can subscribe to an event. When that event is triggered (published), the script's code that was registered with the event is executed.

pubsub.sh

#!/bin/bash
#
# Save the path to this script's directory in a global env variable
#
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &&…
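Since the listing above is cut off, here is a separate, minimal sketch of the subscribe/publish idea. It is not the original pubsub.sh: the names SUBSCRIBERS, subscribe, publish and on_user_created are hypothetical, and it assumes Bash 4 or newer for associative arrays.

```bash
#!/bin/bash
# Hypothetical minimal pub/sub registry; all names here are illustrative.

# Maps an event name to a newline-separated list of handler function names.
declare -A SUBSCRIBERS

# subscribe EVENT HANDLER: register HANDLER to run when EVENT is published.
subscribe() {
  local event="$1" handler="$2"
  SUBSCRIBERS["$event"]+="${handler}"$'\n'
}

# publish EVENT [ARGS...]: call every handler registered for EVENT,
# passing the remaining arguments through to each handler.
publish() {
  local event="$1" handler
  shift
  while IFS= read -r handler; do
    if [ -n "$handler" ]; then
      "$handler" "$@"
    fi
  done <<< "${SUBSCRIBERS[$event]:-}"
}

# Example handler, standing in for code that lives in an external script.
on_user_created() {
  echo "welcome, $1"
}

subscribe user_created on_user_created
publish user_created alice
```

The publisher only knows the event name, so handlers can be added or renamed without touching the code that calls publish.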

  • Whitespace when assigning variables
    Whitespace matters when assigning variables.
foo = 'bar' # incorrect
foo= 'bar' # incorrect
foo='bar' # correct

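The first two will result in errors (or worse, executing an incorrect command): Bash reads foo = 'bar' as a call to a command named foo with the arguments = and bar, and foo= 'bar' as running a command named bar with foo set to an empty string. The last example will correctly set the variable $foo to the text ‘bar’.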
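The difference can be observed with a short script. This is only a sketch: printenv stands in for the command that would follow foo= so that the temporary empty value is visible, and it assumes no command named foo exists on the PATH.

```bash
#!/bin/bash
# Correct: no whitespace around "=".
foo='bar'
echo "foo is: $foo"

# Spaces on both sides: Bash tries to run a command called "foo"
# with the arguments "=" and "bar", which normally fails.
foo = 'bar' 2>/dev/null || echo "foo = 'bar' failed with exit status $?"

# Space after "=": foo is set to an empty string only in the environment
# of the command that follows (printenv here, used for illustration).
foo= printenv foo

# The earlier assignment is untouched afterwards.
echo "foo is still: $foo"
```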

  • Failed commands do not stop script execution
    In most scripting languages, if a function call fails, it may throw an exception and stop the execution of the program. Bash commands do not have exceptions, but they do have exit codes. A non-zero exit code signals failure; however, a non-zero exit code will not…
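A small sketch of exit codes in practice, using the built-in false as a stand-in for any command that fails. The set +e line only makes Bash's default behavior explicit; the variable name status is arbitrary.

```bash
#!/bin/bash
# errexit is off by default; this line only makes that explicit.
set +e

# 'false' always exits with status 1, standing in for a failed command.
false
status=$?
echo "exit code of the failed command: $status"

# Execution reaches this line even though the command above failed.
echo "script is still running"

# Checking the result explicitly lets the script react to the failure.
if ! false; then
  echo "handled the failure"
fi
```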

.bash_profile, .bash_login, .bashrc and .profile all do pretty much the same thing: set up and define functions, variables and the like.

The main difference is that .bashrc is read when an interactive, non-login shell starts, while .bash_profile and the others are read by a login shell. Many people have their .bash_profile (or similar) source .bashrc anyway.
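A common way to do that is a guard like the following in ~/.bash_profile. Treat it as a sketch of the usual convention rather than a required setup; it assumes .bashrc lives in the home directory.

```bash
# ~/.bash_profile
# If ~/.bashrc exists and is readable, load it so that login shells get the
# same functions, aliases and variables as interactive non-login shells.
if [ -r "$HOME/.bashrc" ]; then
  . "$HOME/.bashrc"
fi
```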

.profile vs .bash_profile (and .bash_login)

.profile is read by most shells on startup, including bash. However, .bash_profile is used for configuration specific to bash. Put general initialization code in .profile. If it’s specific to bash, use .bash_profile.

.profile isn’t…


  • Find the current shell:
    There are a few ways to determine the current shell
echo $0
ps -p $$
echo $SHELL
  • List available shells:
    To list available login shells:
cat /etc/shells
  • Change the shell:
    To change the current shell to bash, run these commands
export SHELL=/bin/bash
exec /bin/bash

To change the shell that opens on startup, edit .profile and add the same two lines

Reference: https://books.goalkicker.com/BashBook/


Suppose you have a method which fetches the currently logged-in user in your Rails application.

class ApplicationController < ActionController::Base
  def current_user
    User.find(session[:user_id])
  end
end

The method current_user is called several times per request.

The problem with this is that every call sends a new query to the database to fetch the user’s information. The SQL equivalent of User.find(id) is

SELECT * FROM users WHERE (users.id = id) LIMIT 1

We can solve this problem by caching the user data in an instance variable.

Modifying the current_user method to return the cached instance variable:

def current_user
  @current_user ||= User.find(session[:user_id])
end

What the…

Vinay Badhan

Senior Software Engineer @HackerRank
