Thursday, June 6, 2013

version control and git

Originally version control was developed for large software projects with many developers.  However, it can be useful for single users also.  A single user could use version control to keep track of history and changes to files including word files which will allow them to revert back to previous versions.  Its also useful way to keep many different versions of code well organized and keep work in sync on multiple computers. 

So how does version control work?

The system keeps track of differences from one version to the next by making a changeset which is a collection of differences from one commit.  Then the System can go back and reconstruct any version. Git is a distributed version control system (DVCS) that I will talk about here.

So what is a distributed version control system?  Its a version control system where every person has the complete history of the entire project.  DVCS requires no connection to a central repository, and commits are very fast which allows commits to be much more frequent. Also you can commit changes, revert to earlier versions, and examine history all without being connected to the server and without affecting anyone else's work. Also it isn't a problem if the server dies since every clone has a full history.  Once a developer has made one or more local commits, they can push those changes to another repository or other developer can pull them.  Most projects have a central repository but this is not necessary.  Use this link for branching diagrams.

Changeset all stored in one .git subdirectory in a top directory.  This will be a hidden file since it starts with a  dot.  Its best not to mess with this file however you can see this file in Unix with the ls -a command.

To make a clone use:

git clone (web address of  bitbucket) (name of what you would like to call the file)

git clone will make a complete copy of the repository at the particular directory you have navigated to in terminal.

To make a new git repository:
steps make a directory then Initialized git repository
    mkdir name of directory
    git init

Here is a list of useful git commands:

git add (file name) - adds file to git repository
git add -u - adds anything it the current directory which is already under version control (then do a git commit)
git clone - makes a complete copy of the repository at a particular directory you navigated to in terminal
git checkout (name of commit) - converts everything in repository back to that commit
git checkout (name of commit) -- (name of file to revert back) - converts just that file back to the commit
git checkout HEAD - gits you the most recent commit
git commit - commits to your clone's .git directory -m "string describing commit"
git commit -m "first version of script"
git fetch - pulls changesets from another clone by default: the one you cloned from
git init - Initialize git repository
git ls-files . - tells which files in this directory are under version control currently
git log - Shows most recent commits you did
git merge - applies changesets to your working copy
git pull - updating repository (navigate to the repository on your machine and perform git pull)
git push - sends your recent changesets to another clone by default: the one you clone from
git status - outputs all untracked files
git status | more - shows page by page untracked files
git status . - shows files to be committed in the directory you are in
Protocol to remove untracked files that you never put in the repository such as .pyc files
   create a special file called .gitignore
    example: $ vi directoryname/.gitignore
   then add the types of files you would like to ignore
    example: *.pyc (ignore any .pyc file)

github vs bitbucket

bitbucket is private and github is public
git was developed for the open source software for the developing the linux kernal
many repository for scientific computing projects use github for example IPython, NumPy, Scipy, and matplotlib








 

Wednesday, June 5, 2013

A simple introduction to MapReduce

I've grown extremely fond of the MapReduce programming model for processing large sets of data in parallel.

So what is MapReduce?

Well besides being the most wonderful model I've come across so far, it is a program that comprises of a Map() function and a Reduce() function. The Mapper filters and sorts and sends key value pairs to a shuffler and sorter. The shuffler and sorter pairs common keys together and then sends it to the Reducer that summaries the data.

MapReduce Via Cooking

Lets start by making my favorite South Indian Chutney; Chettinad Tomato Chutney.  Please note that the recipe below is just a fragment of the entire recipe.  The picture and full recipe is locate at this link if you would like to try it.



First we would have to slice all the veggies.  As shown below.  This is similar to the mapper function in MapReduce.  The human slicer will throw out all bad veggies and slice all fresh veggies.  The key would be the veggie name and the value would be individual slices.



This would be sent to the Shuffler and sorter but we will come back to that.  Now just imagine it is sent directly to the Reducer or Blender.  The blender will mush everything into one tomato chutney dish.  Looks good right?



Now imagine you own a South Indian Restaurant. This restaurant is extremely successful and your famous Chettinad Tomato Chutney is a top seller. Every day you sell at least 150 orders of it.

How would you accomplish this task?

First you will need a large supply of onions, tomatoes, red chilies, green chilies, and ginger
Next you will need to hire a bunch of human slicers
Then you will have to buy many blenders
You will also have to Shuffle and sort sliced similar veggies together from different slicers
Then blend appropriate quantities of each veggie together.

This is MapReduce! Processing a lot of things in parallel.

Below you will find an outline of Hadoop and MapReduce.  Hadoop adds to MapReduce a file saving system called Hadoop Distributed File System (HDFS).  I will not talk about HDFS any further here.  The data is stored on HDFS and then partitioned into smaller pieces and sent to the map functions on multiple cores or computers. Map sends key value pairs to the shuffler.  The shuffler will send specific keys to a specific sorter.  The sorter will pair up all keys and append all values to the same key and send it to the reducer where the data is combined together and summarized.  The data is then stored on HDFS.   



Saturday, June 1, 2013

Python NumPy Extension

NumPy extension is an open source python extension for use in multi-dimensional arrays and matrices. Numpy also includes a large library of high-level mathematical functions to operate on these arrays. Since python is an interpreted language it will often run slower than compiled programming languages and NumPy was written to address this problem providing multidimensional arrays and functions and operators that operate efficiently on arrays. In NumPy any algorithm that can be expressed primarily as operations on arrays and matrices can run almost as quickly as the equivalent C code because under the hood it calls complied C code.

NumPy square root




NumPy array class vs python list

The numpy array is very useful for performing mathematical operations across many numbers. Below you can see that the type list operates very differently than the type numpy.array. Below y is a list and x is a numpy array. Multiplication of lists just repeats the list and addition just concatenates the two lists. However, with the NumPy array component-wise operations are applied.



Unlike lists, all elements of a np.array have to be of the same type. On line 5 below np.array was entered as 1(type int), 2.2(type float), and 3(type int). However, on line 6 it was printed out and you can see all values were converted to type float.  Then on line 7 a more complicated component-wise operation was applied to x.



The type of the array can be specified by dtype. Here is a good link for more information on numpy dtype. Below on line 19 the dtype is converted to complex.  Then float and int types are used on line 20 and 21 respectively.




Matrices
Matrices are made by adding lists of lists. Below on line 35 is an example of a 3 by 2 matrix. Then the shape of the matrix is revealed on line 37. On line 38 the matrix is transposed. Then matrix multiplication is accomplished on line 40.



Below is an example of dot product.



NumPy has another class called matrix which works similar to matlab. However, I like uses nested lists and NumPy array class better. If you would like to use the NumPy matrix class go to this link for more information.

reshape



flatten


Rank





Linear algebra

How to solve for x in ax = b equation



Eigenvalue



numpy not a number

not a number (NaN) value is a special numerical value. NaN represents an undefined or
unrepresentable value mostly in floating-point calculations. Here is an example for a use
of NaN:

  While writing a square root calculator a return of NaN can be used for ensuring the precondition
  of x is greater than zero. This is how it could be implemented:



A little more experimenting on numpy nan and python nan concludes that both nans are floating point numbers.  Also  'nan' is of type str but float('nan') is type float.  However, as expected float('a') can not be converted to a float.  From the experiment below you can also see that all floating point numbers return true when you convert to bool type except 0.0.  



For more information visit NaN on Wikipedia