I currently work at HackerRank, where we collect data points from various sources across our platform, both enterprise and community (developers). We ingest all these data points through ETL pipelines written in Spark (PySpark).
We use Airflow to schedule the batch jobs, and we have Spark Streaming jobs that process data in near real time to serve our business use cases.
The streaming jobs do the heavy lifting: they extract and clean up the raw data, then assign each datum (single data point) a unique identifier for the user it belongs to (using a custom micro-service…
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
When a Bash project turns into a library, it can become difficult to add new functionality. Function names, variables and parameters usually need to be changed in every script that utilizes them. In scenarios like this, it is helpful to decouple the code and use an event-driven design pattern. In this pattern, an external script can subscribe to an event; when that event is triggered (published), the script executes the code that it registered with the event.
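One way to sketch this in Bash is with an associative array mapping event names to handler functions. The names subscribe, publish and on_deploy below are illustrative, not from any particular library:

```shell
#!/usr/bin/env bash
# Minimal publish/subscribe sketch (requires bash 4+ for associative arrays).

declare -A subscribers   # event name -> space-separated list of handler functions

subscribe() {            # subscribe <event> <function>
  subscribers["$1"]+="$2 "
}

publish() {              # publish <event> [args...] - invoke every registered handler
  local event="$1"; shift
  local handler
  for handler in ${subscribers["$event"]}; do
    "$handler" "$@"
  done
}

on_deploy() { echo "deploying $1"; }

subscribe deploy on_deploy
publish deploy myapp     # prints: deploying myapp
```

The publishing side never needs to know which scripts are listening; new behavior is added by registering another handler, not by editing the caller.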
foo = 'bar' # incorrect
foo= 'bar' # incorrect
foo='bar' # correct
The first two will not do what you want: because of the spaces, the shell treats foo (or 'bar') as a command to run rather than as an assignment, so you get a "command not found" error (or worse, execute an unintended command). The last example correctly sets the variable $foo to the text 'bar'.
.bash_profile, .bash_login, .bashrc and .profile all do pretty much the same thing: set up the environment and define functions, variables and the like.
The main difference is that .bashrc is read when an interactive non-login shell starts, while .bash_profile and the others are read by a login shell. Many people have their .bash_profile (or similar) source .bashrc anyway.
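That delegation is usually just a couple of lines in ~/.bash_profile, for example:

```shell
# In ~/.bash_profile: load ~/.bashrc for login shells as well.
# The -f guard makes a missing ~/.bashrc a no-op rather than an error.
if [ -f "$HOME/.bashrc" ]; then
  . "$HOME/.bashrc"
fi
```

With this in place, you can keep all interactive settings (aliases, prompt, functions) in .bashrc and they apply to login and non-login shells alike.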
.profile vs .bash_profile (and .bash_login)
.profile is read by most shells on startup, including bash. However, .bash_profile is used for configuration specific to bash. Put general initialization code in .profile; if it's specific to bash, use .bash_profile.
To check which shell you are currently running, print the process for the current shell's PID ($$ expands to that PID):
ps -p $$
To change the shell that opens on startup to bash, edit .profile and add these lines:
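The exact lines depend on where bash is installed; a common sketch (assuming bash lives at /bin/bash) is:

```shell
# Switch a non-bash login shell over to bash (path is an assumption; check with: which bash)
export SHELL=/bin/bash
exec /bin/bash --login
```

exec replaces the current shell process instead of starting a child, so bash takes over the login session directly.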
Suppose you have a method which fetches the currently logged-in user in your Rails application.
class ApplicationController < ActionController::Base
  def current_user
    User.find(session[:user_id])
  end
end
The method current_user is called several times per request.
The problem with this is that every call sends a new query to the database to fetch the user's information. The SQL equivalent of User.find(id) is
SELECT * FROM users WHERE users.id = id LIMIT 1
We can solve this problem by caching the user data in an instance variable.
Modifying the current_user method to return the cached instance variable:
def current_user
  @current_user ||= User.find(session[:user_id])
end
The ||= operator assigns only when @current_user is nil or false, so the database is queried at most once per request; subsequent calls reuse the instance variable.