Code Tinkering: March 2013

Sunday, March 31, 2013

How big is Google's index?

Since Google stopped showing the number of web documents it has indexed on its homepage, various web users have been running searches to try to estimate the size of Google's index.

Here is my method for estimating how big Google's index is.

A. Find out what the top ten languages on the internet are.

According to http://www.internetworldstats.com/stats7.htm:

1. English 536.6 million users

2. Chinese 444.9 million users

3. Spanish 153.3 million users

4. Japanese 99.1 million users

5. Portuguese 82.5 million users

6. German 75.2 million users

7. Arabic 65.4 million users

8. French 59.8 million users

9. Russian 59.7 million users

10. Korean 39.4 million users

11. all the rest of the languages 350.6 million users

B. Use the most common word of each language as a search term and note the number of results

1. search the word in resulted in 25,270,000,000

2. searched 啊 2,320,000,000

3. searched a partir de 1,110,000,000

4. searched for で 6,080,000,000

5. Did not know what to search for in Portuguese, since the words overlapped with Spanish

6. searched for das 3,680,000,000

7. searched for في 2,250,000,000

8. searched for Le 8,140,000,000

9. searched for и 4,500,000,000

10. searched for 에 2,420,000,000

C. Add them all up: 55,770,000,000
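
The addition in step C can be checked with a few lines of Python. This is only a sketch: the counts are copied from step B (Portuguese was skipped), and the names result_counts and estimate_index_size are made up for illustration.

```python
# Result counts copied from step B; Portuguese is left out because it was skipped.
result_counts = {
    "English": 25270000000,
    "Chinese": 2320000000,
    "Spanish": 1110000000,
    "Japanese": 6080000000,
    "German": 3680000000,
    "Arabic": 2250000000,
    "French": 8140000000,
    "Russian": 4500000000,
    "Korean": 2420000000,
}


def estimate_index_size(counts):
    """Add up the per-language result counts (hypothetical helper)."""
    return sum(counts.values())


total = estimate_index_size(result_counts)
print("{:,}".format(total))  # 55,770,000,000
```

Keep in mind that pages containing words from more than one language get counted more than once, so the total is a rough figure at best.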

Thursday, March 28, 2013

Two different ways to translate mRNA to Protein

from Bio.Seq import Seq
# Bio.Alphabet exists in Biopython 1.61 (used here); it was removed in Biopython 1.78
from Bio.Alphabet import IUPAC

def translate(mRNA):
    '''(str) -> str

    Return the protein string encoded by the mRNA string.

    >>> translate('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')
    'MAMAPRTEINSTRING'

    Precondition: mRNA has to be uppercase and include only G, A, U and C

    '''
    # IUPAC.unambiguous_rna includes only uppercase G, A, U and C
    messenger_rna = Seq(mRNA, IUPAC.unambiguous_rna)
    # translate using the codon table already stored in Biopython;
    # to_stop=True ends the protein at the first stop codon instead of adding '*'
    protein = messenger_rna.translate(to_stop=True)

    return str(protein)

--------------------------------------------

def translate_2(mRNA):
    '''(str) -> str

    Return the protein string encoded by the mRNA string.

    >>> translate_2('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')
    'MAMAPRTEINSTRING'

    '''
    # this string will be used to make a dict
    translation_table = '''UUU F      CUU L      AUU I      GUU V
    UUC F      CUC L      AUC I      GUC V
    UUA L      CUA L      AUA I      GUA V
    UUG L      CUG L      AUG M      GUG V
    UCU S      CCU P      ACU T      GCU A
    UCC S      CCC P      ACC T      GCC A
    UCA S      CCA P      ACA T      GCA A
    UCG S      CCG P      ACG T      GCG A
    UAU Y      CAU H      AAU N      GAU D
    UAC Y      CAC H      AAC N      GAC D
    UAA Stop   CAA Q      AAA K      GAA E
    UAG Stop   CAG Q      AAG K      GAG E
    UGU C      CGU R      AGU S      GGU G
    UGC C      CGC R      AGC S      GGC G
    UGA Stop   CGA R      AGA R      GGA G
    UGG W      CGG R      AGG R      GGG G'''

    # Split the string above on whitespace (spaces and newlines) to make a list
    translation_list = translation_table.split()
    # Make a dictionary from the list: codon -> amino acid
    translation_dict = dict(zip(translation_list[0::2], translation_list[1::2]))

    # Accumulator variable
    protein = ''
    # Read one codon (three bases) at a time; an incomplete trailing codon is ignored
    for i in range(0, len(mRNA) - 2, 3):
        amino_acid = translation_dict[mRNA[i:i+3]]
        # a stop codon ends the translation
        if amino_acid == 'Stop':
            break
        protein += amino_acid

    return protein

Friday, March 22, 2013

Installing biopython on Mac OS X 10.5 and Mac OS X 10.8

Installing on Mac OS X 10.5

1. Install Xcode
- download it from Apple: https://connect.apple.com/cgi-bin/WebObjects/MemberSite.woa/wa/getSoftware?bundleID=20414
- Just use the default install

2. Install NumPy
- download it from http://sourceforge.net/projects/numpy/files/
- double click on the file and it will unzip
- I then placed the unzipped folder on the desktop (shorter directory path)
- then open Terminal
- navigate to the folder
- example
cd /Users/computername/Desktop/numpy-1.7.0
- then type python setup.py build in Terminal (note: this takes a little while to complete and at some points it looks like it has stalled. Just wait.)
- then type sudo python setup.py install in Terminal

3. Install Biopython
- download the source zip file from http://biopython.org/wiki/Download
- double click on it and it will unzip
- I then placed the unzipped folder on the desktop (shorter directory path)
- then open Terminal
- navigate to the folder
- type python setup.py build
- then type python setup.py test
- then type sudo python setup.py install

4. Open IDLE and test that everything is installed

>>> import numpy
>>> print numpy.__version__
1.7.0

>>> import Bio

>>> print Bio.__version__
1.61

Installing on Mac OS X 10.8

1. Install Xcode
- Download Xcode from the Mac App Store and install it

2. Install the Xcode command line tools
- Open Xcode
- Go to Xcode on the menu bar -> Preferences -> Downloads -> click Install next to Command Line Tools
- Command Line Tools were already installed on my computer, so mine says Update

3. Install MacPorts
- Download Mountain Lion MacPorts "pkg" installer

4. Install NumPy
- download it from http://sourceforge.net/projects/numpy/files/
- double click on the file and it will unzip
- I then placed the unzipped folder on the desktop (shorter directory path)
- then open Terminal
- navigate to the folder
- example
cd /Users/computername/Desktop/numpy-1.7.0
- then type python setup.py build in Terminal (note: this takes a little while to complete and at some points it looks like it has stalled. Just wait.)
- then type sudo python setup.py install in Terminal

5. Install Biopython
- download the source zip file from http://biopython.org/wiki/Download
- double click on it and it will unzip
- I then placed the unzipped folder on the desktop (shorter directory path)
- then open Terminal
- navigate to the folder
- type python setup.py build
- then type python setup.py test
- then type sudo python setup.py install

6. Open IDLE and test that everything is installed

>>> import numpy
>>> print numpy.__version__
1.7.0

>>> import Bio

>>> print Bio.__version__
1.61

Wednesday, March 20, 2013

Translate DNA to RNA

def dna_to_rna(dna):
    '''(str) -> str

    Return dna with every T replaced by a U.

    >>> dna_to_rna('GTACTT')
    'GUACUU'

    >>> dna_to_rna('GGGGTTTTTCCCCTTTTTATCTGT')
    'GGGGUUUUUCCCCUUUUUAUCUGU'

    >>> dna_to_rna('TTTTTTTT')
    'UUUUUUUU'

    >>> dna_to_rna('TTAA')
    'UUAA'
    '''

    i = 0
    accum_s = ''  # accumulator; starts empty so the result has no leading space
    last_t_index = 0
    while i < len(dna):
        if dna[i] == 'T':
            # copy everything since the previous T, then add a U in place of this T
            accum_s = accum_s + dna[last_t_index:i] + 'U'
            last_t_index = i + 1
        i = i + 1
    # copy whatever follows the last T
    if last_t_index != len(dna):
        accum_s = accum_s + dna[last_t_index:len(dna)]
    return accum_s
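
For comparison, the same conversion can probably be done in one line with Python's built-in str.replace. The function name dna_to_rna_replace below is made up so it does not clash with the loop version above.

```python
def dna_to_rna_replace(dna):
    """Return dna with every 'T' replaced by 'U', using the built-in str.replace."""
    return dna.replace('T', 'U')


print(dna_to_rna_replace('GTACTT'))  # GUACUU
```

The loop version is still a good exercise in slicing and accumulators; the built-in is just shorter.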

Monday, March 18, 2013

Add Packages for importing to python

Add the package directly to a directory already in sys.path
1. Find the directory /Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages

2. If this folder is hidden in Finder and you would like to see it, just follow these steps:
a. open Finder
b. press Shift-Command-G
c. this brings up a "Go to Folder" dialog
d. type in the name of the directory
e. I then dragged this folder under Favorites in the left sidebar

3. I then created a test program called hello_world.py
4. I placed hello_world.py in /Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages

def add(A, B):
    print (A + B)

5. I opened IDLE and typed the commands below.

>>> import hello_world
>>> hello_world.add(5,3)
8
>>>

6. Now I can import hello_world and call hello_world.add() any time. I'm now on my way to installing any module.

Add a directory to sys.path
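
The post does not spell this second method out, so here is a minimal sketch of one way it could look. The helper add_to_sys_path and the my_modules folder are made up for illustration, and the change only lasts for the current Python session.

```python
import os
import sys


def add_to_sys_path(directory):
    """Append directory to sys.path for this session if it is not already listed."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.append(directory)
    return directory


# Hypothetical folder that would hold hello_world.py; replace it with your own path.
added = add_to_sys_path(os.path.join(os.path.expanduser('~'), 'my_modules'))
print(added in sys.path)  # True
```

After this, an import statement also looks in that folder, but only until Python is restarted.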

Convert a CPM .txt file to a new DPM .txt file

This little function will take a file full of CPM values and convert it to DPM values in a new file. This is very useful for wipe tests.

example data:
wipe test 3/15/13 results in cpm
number 24 is blank

    1 = 29
    2 = 34
    3 = 31
    4 = 24
    5 = 34
    6 = 27
    7 = 26
    8 = 35
    9 = 25
    10 = 31
    11 = 31
    12 = 25
    13 = 23
    14 = 34
    15 = 37
    16 = 32
    17 = 28
    18 = 35
    19 = 29
    20 = 29
    21 = 35
    22 = 27
    23 = 31
    24 = 32

Example output data:
converted DPM values

1 = 0
2 = 2.86
3 = 0
4 = 0
5 = 2.86
6 = 0
7 = 0
8 = 4.29
9 = 0
10 = 0
11 = 0
12 = 0
13 = 0
14 = 2.86
15 = 7.15
16 = 0
17 = 0
18 = 4.29
19 = 0
20 = 0
21 = 4.29
22 = 0
23 = 0
24 = 0

def CPM_to_DPM(CPM_file, DPM_file):
    # open the files; the with statement closes them even if an error occurs
    with open(CPM_file, 'r') as cpm_in, open(DPM_file, 'w') as dpm_out:

        # Skip over the header, which ends at the first blank line
        line = cpm_in.readline()
        while line != '\n' and line != '':
            line = cpm_in.readline()

        # slice out '1 = ' and record the rest as an int in cpm_data
        line = cpm_in.readline()
        cpm_data = []  # accumulator variable
        while line != "":
            value = line.strip()
            if value != '':  # ignore blank lines, e.g. at the end of the file
                cpm_data.append(int(value[value.rfind(' ') + 1:]))
            line = cpm_in.readline()

        # the last reading is the blank, so use it as the background
        background = cpm_data[-1]

        # readings above background go through the cpm to dpm equation;
        # readings at or below background are recorded as 0
        dpm_values = []
        for cpm in cpm_data:
            if cpm > background:
                dpm_values.append(round((cpm - background) * 1.43, 2))
            else:
                dpm_values.append(0)

        # write a header on the DPM file
        dpm_out.write('converted DPM values\n\n')

        # write each DPM value on its own line, numbered from 1
        for sample_num, dpm in enumerate(dpm_values, 1):
            dpm_out.write(str(sample_num) + ' = ' + str(dpm) + '\n')

    # print done so the user knows to go look at their DPM file
    print('done')
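
To see the arithmetic on its own, without any file handling, here is a small sketch. It assumes, as the function above does, that the last reading is the blank and that the conversion factor is 1.43. The name cpm_to_dpm and the factor parameter are made up for illustration.

```python
def cpm_to_dpm(cpm_values, factor=1.43):
    """Convert CPM readings to DPM, treating the last reading as the blank."""
    background = cpm_values[-1]
    return [round((cpm - background) * factor, 2) if cpm > background else 0
            for cpm in cpm_values]


# Samples 1 to 5 from the example data, followed by the blank (sample 24 = 32)
print(cpm_to_dpm([29, 34, 31, 24, 34, 32]))  # [0, 2.86, 0, 0, 2.86, 0]
```

A reading of 34 with a blank of 32 gives (34 - 32) * 1.43 = 2.86, which matches the example output.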

Friday, March 15, 2013

Pong with CodeSkulptor

I started taking a course on Coursera called An Introduction to Interactive Programming in Python. It teaches event-driven programming. Until this course I had only written linear, top-to-bottom programs. While event-driven programming is not the answer to every project, it is awesome for games, which are a huge interest of mine. At the end of week 4 the mini project was Pong. I changed the code a little; as you can see, it looks like a little hockey rink. The game only works in CodeSkulptor. Maybe later I will try to use Tkinter. Here's the code! Just copy and paste it into CodeSkulptor!

Implementation of classic arcade game Pong!

here's the code

Friday, March 8, 2013

python

This function, written in Python, shortens every other line of a .txt file and writes the result to a new file. This is very useful in bioinformatics. My problem was that I had a Phred file from Illumina sequencing where both ends of each sequence were N. If I mapped this using bowtie while allowing two low-quality reads per sequence, a lot of the data would be thrown out.

Simple example:
    +
    Shorten
    Keep
    Shorten
    Keep

    ----------------
Example output
    +
    horte
    Keep
    horte
    Keep
------------------
def every_other_line_shorten(starting_file, ending_file):
    # open the files; the with statement closes them when the block ends
    with open(starting_file, 'r') as from_file, open(ending_file, 'w') as to_file:

        line = from_file.readline()
        while line != "":

            # keeps line
            to_file.write(line)
            line = from_file.readline()

            # cuts the first and last character off the line,
            # unless the file has already ended
            if line != "":
                to_file.write(line.rstrip('\n')[1:-1] + '\n')
                line = from_file.readline()

    print('done')