Thursday, July 25, 2013

How to enable finder to show all hidden files Mac OS X 10.8

Type the following commands in Terminal:

defaults write com.apple.Finder AppleShowAllFiles YES
killall -HUP Finder

The first command sets a hidden preference so the finder shows all files and the second command restarts the Finder so these preferences take effect. The kill all command tells Finder to quit while the -HUP flag asks the program to restart.

The files can be hidden by typing the following into Terminal:

defaults write com.apple.Finder AppleShowAllFiles NO
killall -HUP Finder

If these commands do not work try a lower case F in the word Finder in the following command:

defaults write com.apple.Finder AppleShowAllFiles

Tuesday, July 9, 2013

Divide and Conquer Inversion Counting Algorithm vs Brute Force

First I will generate a random array of size 100,000 for both algorithms and then proceed with counting the number of inversions in the generated array according to the corresponding algorithm. Here are both algoritms and I will discuss the run times later

1. Divide and Conquer Algorithm
from random import randint


def generate_array():
    length_of_array = 100000
    i = 0
    array = []
    while i < length_of_array:
        array.append(randint(0,length_of_array))
        i += 1

    return array

def sort_count(A,B):
    global inversions
    C=[]
    lenA = len(A)
    lenB = len(B)
    i, j = 0, 0
    while i < lenA and j < lenB:
        if A[i] <= B[j]:
            C.append(A[i])
            i=i+1
        else:
            inversions= inversions +len(A)-i #the maggic happens here
            C.append(B[j])
            j=j+1
    if i == lenA:#A get to the end
        C.extend(B[j:])
    else:
        C.extend(A[i:])
    return C # sends C back to divide_array which send that back to the divide_array
             # which called it as the right variable

def divide_array(L):
    N = len(L)
    if N > 1:
        left = divide_array(L[0:N/2])
        right = divide_array(L[N/2:])
        return sort_count(left,right)
    else:
        return L
inversions = 0
array = generate_array()
sorted_array = divide_array(array)
print inversions


2. Brute Force Algorithm

from random import randint


def generate_array():
    length_of_array = 100000
    i = 0
    array = []
    while i < length_of_array:
        array.append(randint(0,length_of_array))
        i += 1

    
    return array


array = generate_array()
i = 0
inversions = 0 
length = len(array)
while i < length:
    for n in array[i+1:]:
        if n < array[i]:
            inversions = inversions + 1
    i = i+1

print inversions


Analysis of Algorithms

So both of these algorithms can complete the job of counting number of inversions however, the Brute Force Algoritm is considerably slower. 

While Divide and Conquer completed the task in 1.325s the Brute Force Algorithm completed the task in 20m35.371s on the same computer.  This is a huge difference! Which shows the power of the Divide and Conquer Algorithm. 

Divide and Conquer runs on O(n log n) time while Brute Force runs on O(n^2) time. Where n is the length of the array. 

Graphic representation of big-Oh comparison for algorithms



Here's the code for making the above graph
from pylab import *

n = arange(1,100)
f2 = n* (log(n))
f = n**2

fig = figure()
ax = fig.add_subplot(111)

ax.plot(f2)
ax.plot(f)

legend(('Divide and Conquer', 'Brute Force'), 'upper left', shadow=True)

title('Divide and Conquer Inversion Counting Algorithm vs Brute Force') 

xlabel('Size of array (n)')
ylabel('Number of interations')

show()

Thursday, June 6, 2013

version control and git

Originally version control was developed for large software projects with many developers.  However, it can be useful for single users also.  A single user could use version control to keep track of history and changes to files including word files which will allow them to revert back to previous versions.  Its also useful way to keep many different versions of code well organized and keep work in sync on multiple computers. 

So how does version control work?

The system keeps track of differences from one version to the next by making a changeset which is a collection of differences from one commit.  Then the System can go back and reconstruct any version. Git is a distributed version control system (DVCS) that I will talk about here.

So what is a distributed version control system?  Its a version control system where every person has the complete history of the entire project.  DVCS requires no connection to a central repository, and commits are very fast which allows commits to be much more frequent. Also you can commit changes, revert to earlier versions, and examine history all without being connected to the server and without affecting anyone else's work. Also it isn't a problem if the server dies since every clone has a full history.  Once a developer has made one or more local commits, they can push those changes to another repository or other developer can pull them.  Most projects have a central repository but this is not necessary.  Use this link for branching diagrams.

Changeset all stored in one .git subdirectory in a top directory.  This will be a hidden file since it starts with a  dot.  Its best not to mess with this file however you can see this file in Unix with the ls -a command.

To make a clone use:

git clone (web address of  bitbucket) (name of what you would like to call the file)

git clone will make a complete copy of the repository at the particular directory you have navigated to in terminal.

To make a new git repository:
steps make a directory then Initialized git repository
    mkdir name of directory
    git init

Here is a list of useful git commands:

git add (file name) - adds file to git repository
git add -u - adds anything it the current directory which is already under version control (then do a git commit)
git clone - makes a complete copy of the repository at a particular directory you navigated to in terminal
git checkout (name of commit) - converts everything in repository back to that commit
git checkout (name of commit) -- (name of file to revert back) - converts just that file back to the commit
git checkout HEAD - gits you the most recent commit
git commit - commits to your clone's .git directory -m "string describing commit"
git commit -m "first version of script"
git fetch - pulls changesets from another clone by default: the one you cloned from
git init - Initialize git repository
git ls-files . - tells which files in this directory are under version control currently
git log - Shows most recent commits you did
git merge - applies changesets to your working copy
git pull - updating repository (navigate to the repository on your machine and perform git pull)
git push - sends your recent changesets to another clone by default: the one you clone from
git status - outputs all untracked files
git status | more - shows page by page untracked files
git status . - shows files to be committed in the directory you are in
Protocol to remove untracked files that you never put in the repository such as .pyc files
   create a special file called .gitignore
    example: $ vi directoryname/.gitignore
   then add the types of files you would like to ignore
    example: *.pyc (ignore any .pyc file)

github vs bitbucket

bitbucket is private and github is public
git was developed for the open source software for the developing the linux kernal
many repository for scientific computing projects use github for example IPython, NumPy, Scipy, and matplotlib








 

Wednesday, June 5, 2013

A simple introduction to MapReduce

I've grown extremely fond of the MapReduce programming model for processing large sets of data in parallel.

So what is MapReduce?

Well besides being the most wonderful model I've come across so far, it is a program that comprises of a Map() function and a Reduce() function. The Mapper filters and sorts and sends key value pairs to a shuffler and sorter. The shuffler and sorter pairs common keys together and then sends it to the Reducer that summaries the data.

MapReduce Via Cooking

Lets start by making my favorite South Indian Chutney; Chettinad Tomato Chutney.  Please note that the recipe below is just a fragment of the entire recipe.  The picture and full recipe is locate at this link if you would like to try it.



First we would have to slice all the veggies.  As shown below.  This is similar to the mapper function in MapReduce.  The human slicer will throw out all bad veggies and slice all fresh veggies.  The key would be the veggie name and the value would be individual slices.



This would be sent to the Shuffler and sorter but we will come back to that.  Now just imagine it is sent directly to the Reducer or Blender.  The blender will mush everything into one tomato chutney dish.  Looks good right?



Now imagine you own a South Indian Restaurant. This restaurant is extremely successful and your famous Chettinad Tomato Chutney is a top seller. Every day you sell at least 150 orders of it.

How would you accomplish this task?

First you will need a large supply of onions, tomatoes, red chilies, green chilies, and ginger
Next you will need to hire a bunch of human slicers
Then you will have to buy many blenders
You will also have to Shuffle and sort sliced similar veggies together from different slicers
Then blend appropriate quantities of each veggie together.

This is MapReduce! Processing a lot of things in parallel.

Below you will find an outline of Hadoop and MapReduce.  Hadoop adds to MapReduce a file saving system called Hadoop Distributed File System (HDFS).  I will not talk about HDFS any further here.  The data is stored on HDFS and then partitioned into smaller pieces and sent to the map functions on multiple cores or computers. Map sends key value pairs to the shuffler.  The shuffler will send specific keys to a specific sorter.  The sorter will pair up all keys and append all values to the same key and send it to the reducer where the data is combined together and summarized.  The data is then stored on HDFS.   



Saturday, June 1, 2013

Python NumPy Extension

NumPy extension is an open source python extension for use in multi-dimensional arrays and matrices. Numpy also includes a large library of high-level mathematical functions to operate on these arrays. Since python is an interpreted language it will often run slower than compiled programming languages and NumPy was written to address this problem providing multidimensional arrays and functions and operators that operate efficiently on arrays. In NumPy any algorithm that can be expressed primarily as operations on arrays and matrices can run almost as quickly as the equivalent C code because under the hood it calls complied C code.

NumPy square root




NumPy array class vs python list

The numpy array is very useful for performing mathematical operations across many numbers. Below you can see that the type list operates very differently than the type numpy.array. Below y is a list and x is a numpy array. Multiplication of lists just repeats the list and addition just concatenates the two lists. However, with the NumPy array component-wise operations are applied.



Unlike lists, all elements of a np.array have to be of the same type. On line 5 below np.array was entered as 1(type int), 2.2(type float), and 3(type int). However, on line 6 it was printed out and you can see all values were converted to type float.  Then on line 7 a more complicated component-wise operation was applied to x.



The type of the array can be specified by dtype. Here is a good link for more information on numpy dtype. Below on line 19 the dtype is converted to complex.  Then float and int types are used on line 20 and 21 respectively.




Matrices
Matrices are made by adding lists of lists. Below on line 35 is an example of a 3 by 2 matrix. Then the shape of the matrix is revealed on line 37. On line 38 the matrix is transposed. Then matrix multiplication is accomplished on line 40.



Below is an example of dot product.



NumPy has another class called matrix which works similar to matlab. However, I like uses nested lists and NumPy array class better. If you would like to use the NumPy matrix class go to this link for more information.

reshape



flatten


Rank





Linear algebra

How to solve for x in ax = b equation



Eigenvalue



numpy not a number

not a number (NaN) value is a special numerical value. NaN represents an undefined or
unrepresentable value mostly in floating-point calculations. Here is an example for a use
of NaN:

  While writing a square root calculator a return of NaN can be used for ensuring the precondition
  of x is greater than zero. This is how it could be implemented:



A little more experimenting on numpy nan and python nan concludes that both nans are floating point numbers.  Also  'nan' is of type str but float('nan') is type float.  However, as expected float('a') can not be converted to a float.  From the experiment below you can also see that all floating point numbers return true when you convert to bool type except 0.0.  



For more information visit NaN on Wikipedia

Saturday, May 25, 2013

Python tutorial Part 2 time and making folders

The python time module is very useful for finding current time. Below you will find a few examples of the module in use and how to extract year, month, day, hour, minute, second, and microsecond.


















To find elapsed time used the the code below.













To make a folder with the current date import sys, os, and datetime.  Use strftime to get the Year with century as a decimal number, get the month as a decimal number, and the day of the month as a decimal number.  Then test to see if the folder already exists and if not make the folder.  Below you will see the folder which was made by the script to the right of it.





Thursday, May 16, 2013

Python tutorial Part 1 variables int, float, str, and simple math

Variables refer to memory addresses that store values.  This means all variables are stored in memory.  Therefore for big data applications you will have to keep this in mind.  You can find out the maximum size of lists, strings, and dicts on your system by the code below:




How to Assign integers to variables and subtraction:

The equal sign (=) is used to assign variables in python.  The negative sign (-) is used for subtraction and type() is used to get the type of the variable.  The type of is int in the case below. For example:







How to Assign floating point numbers to variables and addition:

The positive sign (+) is used for addition







Examples of adding and subtracting float and int:











Division of int:

Floating point division uses (/) sign.  This type of division divides left hand operand by right hand operand and returns a floating point number.   Floor Division uses (//) sign.  This type of division is returns only the whole int part of division without any rounding. The modulus (%) operand returns the remainder. 











Division of float:

Division of float returns only type float and everything else is the same as above.





 



Multiply int and float:

The multiply operand is (*).   If the type is only int it returns int.  If either type is float it returns a float.









Exponent of int and float:

The exponent operand is ( **).  If the type is only int it returns int.  If either type is float it returns a float.












How to assign strings and concatenate strings:

To assign variables to strings inclose strings with (' ') or (" ").  As you can see below strings can be concatenated by the plus sign (+) and if you need a space between the words it needs to be added in.  The type of strings is str.










Variable name guide lines:

For python style use lower_case_with_underscores for variable names.  The words are separated by underscores to improve readability.





Variable names can not start with a number