-->

Etiquetas

An easy way to make SQL queries from Python to the SDSS database

An easy way to make SQL queries from Python to the SDSS database

Author: Eduardo Martín Calleja

In this entry we will see a very simple method of executing SQL queries on the Sloan Digital Sky Survey (SDSS) database. In this way we can get a lot of data about all kinds of celestial objects and load them into Python data structures, like Pandas dataframes, for later process or plotting.

This post has been written entirely using the IPython Notebook. I will also use the Python module "mechanize" to surf the web and run interactively HTML forms. I will explain in detail each of the steps, and at the end I will summarize to avoid being lost in the details and show the usability of the proposed method.

Imports and references

  • For an introduction to the execution of SQL queries on the SDSS database I can not think of a better resource than their own tutorial, together with its complement of examples: Sample SQL Queries

  • To view the tables and views that exist in the database, and what information is available in each of them, you can use: Schema Browser

  • And to run interactively a SQL on the SDSS database, you can access the page: SQL Search

  • The mechanize home page

  • A nice notebook using the method in this post to query the SDSS database: here

In [1]:
%matplotlib inline

from __future__ import division

import numpy as np
import pandas as pd
import mechanize
from StringIO import StringIO # To read a string like a file

# This IPython magic generates a table with version information
#https://github.com/jrjohansson/version_information
%load_ext version_information
%version_information numpy, pandas, StringIO
Out[1]:
SoftwareVersion
Python2.7.9 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython2.3.1
OSLinux 3.13.0 45 generic x86_64 with debian jessie sid
numpy1.9.1
pandas0.15.2
StringIOStringIO
Sat Feb 21 12:08:20 2015 CET
In [2]:
# URL to the SDSS SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"

SQL preparation

Before we begin to interact from Python with the web page of the SDSS that allows us to send a SQL statement to the database, I will prepare a test SQL query on a string variable. In this SQL we will find 10 objects (this is a test!) of type 6 = 'STAR', with clean photometry and blue color in a certain region of sky. But, be careful to avoid any comments (those preceded by a double dash -) when you create the string.

In [3]:
s = 'SELECT TOP 10                         \
    objID, ra, dec, modelMag_u, modelMag_g \
FROM                                       \
    PhotoPrimary                           \
WHERE                                      \
    ra BETWEEN 140 and 141                 \
AND dec BETWEEN 20 and 21                  \
AND type = 6                               \
AND clean = 1                              \
AND modelMag_u - modelMag_g < 0.5'

Web surfing with the Python mechanize module

The first step in using mechanize will be to create a Browser-like object to be able to navigate using its methods

In [4]:
br = mechanize.Browser()

Then we must open a session using the url defined above, pointing to the SDSS web page that allows us to make the SQL queries:

In [5]:
resp = br.open(url)
In [6]:
resp.info()
Out[6]:
<httplib.HTTPMessage instance at 0x7f4ba6cdd950>

When you want to interact with a web page, you will be interested to know the HTML forms contained in it. An HTML form is a section of the document between the tags: FORM and /FORM

A HTML form contains a series of special objects called controls such as checkboxes, radio buttons, menus, etc. and labels of these objects. The user interacts with the page by modifying a control, for example by selecting an option, introducing a text in a field, etc. and sending this modified form back to the server.

Each HTML form on the page has a name, although this can in some cases be empty. To get a list of the names of the forms in the page we can write:

In [7]:
for f in br.forms():
    print f.name
sql

That is, in this case there is only one form on the page, named "sql"

At the same time, each form has a list of controls that also have a name, which can also be left blank. To list the forms on the page, along with their controls, and each control type, we can do the following:

In [8]:
for f in br.forms():
    print f.name
    for c in f.controls:
        print '\t',c.name, '\t', c.type
sql
 clear  button
 cmd  textarea
 None  submit
 syntax  checkbox
 format  radio
 reset  reset

We will focus on the "cmd" control which is the text area in which we write our SQL, and the "format" control, which, as you can see on the web page, is used to control the type of output desired: HTML, XML, CSV, etc.. To access these controls you must previously select the form to which they belong:

In [9]:
br.select_form(name="sql")

Then we will modify the control 'cmd' to enter our SQL, and the 'format' control, to select the output in csv format.

In [10]:
br['cmd'] = s  # This is the string with the sql query
br['format']=['csv'] # data output format
response = br.submit()

We can get a string with the contents of the answer using the get_data() method:

In [11]:
print response.get_data()
#Table1
objID,ra,dec,modelMag_u,modelMag_g
1237667293189833237,140.000264887516,20.3528168302492,23.27699,23.07312
1237667293189833455,140.001621936922,20.4298007848266,23.79515,24.28631
1237667293189832757,140.003651871714,20.2546305174986,24.7036,24.82169
1237667430093553674,140.004909940259,20.1719995965921,24.04017,24.81654
1237667430093488938,140.009035667549,20.0174593142685,23.4923,23.47844
1237667430093488288,140.010686497758,20.0781480229235,19.89794,19.64356
1237667209974448712,140.011074894392,20.9763084088396,22.70029,22.85815
1237667430093554008,140.012379856521,20.1469377004989,23.12733,22.86493
1237667209974448933,140.013140656439,20.9086701836405,23.96772,24.16748
1237667430093488949,140.013238349062,20.0312906376057,24.73975,25.34514


But attention!, The submit() method closes the session, so, to send another SQL query you must first repeat the br.open() and br.select() calls.

Then, and in order to be able to process the data more easily, the most advisable could be to generate a Python Pandas dataframe. We can see that the first line should be discarded, while the second row contains the names of the columns, so we will keep it in the dataframe:

In [12]:
file_like = StringIO(response.get_data())
df =pd.read_csv(file_like, skiprows = 1) # skip the first row
df
Out[12]:
objID ra dec modelMag_u modelMag_g
0 1237667293189833237 140.000265 20.352817 23.27699 23.07312
1 1237667293189833455 140.001622 20.429801 23.79515 24.28631
2 1237667293189832757 140.003652 20.254631 24.70360 24.82169
3 1237667430093553674 140.004910 20.172000 24.04017 24.81654
4 1237667430093488938 140.009036 20.017459 23.49230 23.47844
5 1237667430093488288 140.010686 20.078148 19.89794 19.64356
6 1237667209974448712 140.011075 20.976308 22.70029 22.85815
7 1237667430093554008 140.012380 20.146938 23.12733 22.86493
8 1237667209974448933 140.013141 20.908670 23.96772 24.16748
9 1237667430093488949 140.013238 20.031291 24.73975 25.34514

From here we could do such things like rename the columns giving them names more to our liking, and calculate a new column as the difference of the columns u and g, which will indicate the color of the star (more on that in another post).

In [13]:
df.columns = ['objID','ra','dec','u','g']
df['u-g'] = df['u']-df['g']
df
Out[13]:
objID ra dec u g u-g
0 1237667293189833237 140.000265 20.352817 23.27699 23.07312 0.20387
1 1237667293189833455 140.001622 20.429801 23.79515 24.28631 -0.49116
2 1237667293189832757 140.003652 20.254631 24.70360 24.82169 -0.11809
3 1237667430093553674 140.004910 20.172000 24.04017 24.81654 -0.77637
4 1237667430093488938 140.009036 20.017459 23.49230 23.47844 0.01386
5 1237667430093488288 140.010686 20.078148 19.89794 19.64356 0.25438
6 1237667209974448712 140.011075 20.976308 22.70029 22.85815 -0.15786
7 1237667430093554008 140.012380 20.146938 23.12733 22.86493 0.26240
8 1237667209974448933 140.013141 20.908670 23.96772 24.16748 -0.19976
9 1237667430093488949 140.013238 20.031291 24.73975 25.34514 -0.60539

Summary

Once we have seen the rationale of the use of the mechanize module and the creation of a Pandas dataframe, we can define a function to streamline the process for new SQL queries:

In [14]:
def SDSS_select(sql):
    '''input: string with a valid SQL query
    output: a Pandas dataframe
    '''
    br.open(url)
    br.select_form(name="sql")
    br['cmd'] = sql
    br['format']=['csv']
    response = br.submit()
    file_like = StringIO(response.get_data())
    return pd.read_csv(file_like,  skiprows=1)

The steps are as follows:

  • The following instructions will be executed only the first time:
In [15]:
# URL a SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"
  • We prepare a string with our SQL (you can test it first on the web page, as there's no exception handling in the above function)
In [16]:
sql = 'SELECT TOP 10    \
objID, ra, dec, modelMag_u,modelMag_g,modelMag_r,modelMag_i,modelMag_z \
FROM  Star               \
WHERE ra BETWEEN 150 and 152 AND dec BETWEEN 30 and 31 AND clean = 1'
  • We make a call to the function, obtaining a Pandas dataframe in return
In [17]:
df = SDSS_select(sql)

And we already have our dataframe ready!

In [18]:
df
Out[18]:
objID ra dec modelMag_u modelMag_g modelMag_r modelMag_i modelMag_z
0 1237664869216289142 150.000355 30.732103 23.06350 22.79704 21.60797 21.22958 21.32821
1 1237665099003593049 150.000483 30.591703 22.71077 22.14142 21.63762 21.84505 21.44720
2 1237665098466656334 150.000711 30.105872 20.72021 18.12753 16.79675 16.08280 15.69484
3 1237665098466656330 150.000754 30.097463 22.66031 19.98154 18.61721 17.61466 17.07129
4 1237664869216289439 150.000985 30.748886 25.00283 23.08974 22.20838 21.99796 21.30983
5 1237664869216288905 150.001134 30.829336 21.41318 20.29182 19.80290 19.55658 19.49337
6 1237665129067840138 150.001698 30.483674 23.61431 22.05697 20.63076 20.13519 19.81597
7 1237665098466657196 150.002180 30.248386 25.05051 23.19651 21.70999 20.04427 19.07293
8 1237665129067840349 150.002215 30.429241 23.51003 24.00885 21.71184 21.25962 20.59406
9 1237665098466656342 150.002250 30.174147 20.84485 18.57332 17.56038 17.15493 16.93803

Here ends this post, but the access method to the SDSS database we have seen here, will be used systematically in future posts, to analyze with Python, based on real data, various properties of celestial objects.

No hay comentarios:

Publicar un comentario