Blog Post 1 - Interactive Graphics with NOAA Climate Data
Interesting Graphs with Plotly
This blog post begins with the simple task of replicating a chart as presented in the assignment specifications. Then I proceed to wander around and experiment with random graphing and mapping techniques until I find something decently presentable.
Normal Database Administrative Tasks
This plot consists of two functions. One of which queries our temperatures database to locate a country and provide a dataframe. The second finds the coefficients and presents them using Plotly Express.
As Always we start by importing lots of fun modules
import pandas as pd
import numpy as np
import sqlite3
from plotly import express as px
I won’t spend too much time explaining how to set up our database as that it fairly routine. To summarize though, we connect to data.db, iterate over our csv files to import them into the database, and do slight data manipulation as necessary.
conn = sqlite3.connect("data.db")
cursor = conn.cursor()
This prepare_df function is borrowed from the PIC16B Lecture content and is specific for our temps.csv.
def prepare_df(df):
df = df.set_index(keys=["ID", "Year"])
df = df.stack()
df = df.reset_index()
df = df.rename(columns = {"level_2" : "Month" , 0 : "Temp"})
df["Month"] = df["Month"].str[5:].astype(int)
df["Temp"] = df["Temp"] / 100
return(df)
I found myself restarting my kernel often when I tried to plot something terrible and so I made a quick check as to not duplicate the temperatures table.
try:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='temperatures'")
if cursor.fetchall():
print("table already exists", cursor.fetchall())
else:
raise Exception("Table does not Exist")
except:
temps = "temps.csv"
df_iter = pd.read_csv(temps, chunksize = 100000)
for df in df_iter:
df = prepare_df(df)
df.to_sql("temperatures", conn, if_exists = "append", index = False)
countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
stations.to_sql("stations", conn, if_exists = "replace", index = False)
Almost to the fun stuff. We need to see what data we can access to display.
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
"Year" INTEGER,
"Month" INTEGER,
"Temp" REAL
)
CREATE TABLE "countries" (
"FIPS 10-4" TEXT,
"ISO 3166" TEXT,
"Name" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL,
"STNELEV" REAL,
"NAME" TEXT
)
Our Temperature Coefficient Plot
We jump right in with a function definition to query our climate data.
def query_climate_database(country:str, year_begin:int, year_end:int, month:int) -> pd.DataFrame:
"""
country, a string giving the name of a country for which data should be returned.
year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
month, an integer giving the month of the year for which should be returned.
returns a Pandas dataframe of temperature readings for the specified country,
in the specified date range,
in the specified month of the year.
COLUMNS:
The station name.
The latitude of the station.
The longitude of the station.
The name of the country in which the station is located.
The year in which the reading was taken.
The month in which the reading was taken.
The average temperature at the specified station during the specified year and month.
"""
# We first obtain the FIPS code
cursor.execute(f"SELECT [FIPS 10-4] FROM countries WHERE Name = '{country}' LIMIT 1;")
fips = cursor.fetchall()[0][0]
cmd = \
f"""
SELECT S.NAME, S.LATITUDE, S.LONGITUDE, '{country}' as Country, T.Year, T.Month, T.Temp
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE s.id LIKE '{fips}%' AND T.year >= {year_begin} AND T.year <= {year_end} AND T.month = {month}
"""
df = pd.read_sql_query(cmd, conn)
return df
This function is nice because it is like 20 lines of comments and 10 lines of actual code (most of which is a single sql query). We first take the user’s country string and use it to retrieve the country’s FIPS code. Then using python’s fstring functionality we are able to fill in a sql query with the user’s provided variables.
The function then returns lots of useful columns in a DataFrame. Before we are able to map any points we need to find our Linear Regression coefficients. For that we will use sklearn.
from sklearn.linear_model import LinearRegression
def coef(data_group, xlabel="Year", ylabel="Temp"):
x = data_group[[xlabel]] # 2 brackets because X should be a df
y = data_group[ylabel] # 1 bracket because y should be a series
LR = LinearRegression()
LR.fit(x, y)
return LR.coef_[0]
This function quickly finds a line of best fit and returns the slope of the line from the fitted sklearn.linear_model.LinearRegression object. Now, we are finally able to use our friend Plotly.
from plotly.graph_objects import Figure
def temperature_coefficient_plot(country:str, year_begin:int, year_end:int, month:int, min_obs:int, **kwargs) -> Figure:
"""
country, a string giving the name of a country for which data should be returned.
year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
month, an integer giving the month of the year for which should be returned.
min_obs, the minimum required number of years of data for any given station.
**kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.
returns a plotly.graph_objects.Figure
"""
# We use our previous function to neatly format our dataframe
df = query_climate_database(country, year_begin, year_end, month)
# We use the groupby and transform functions to select only stations with more than *min_obs* readings
df = df[df.groupby(["NAME"])["Temp"].transform(len) >= min_obs]
# We calculate the linear regression of temperatures at each station in order to get their slopes
coefs = df.groupby(["NAME", "Month"]).apply(coef).reset_index()
coefs = coefs.rename(columns={0:"Estimated Yearly Increase (°C)"})
df = df.merge(coefs, on="NAME")
fig = px.scatter_mapbox(df,
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated Yearly Increase (°C)",
hover_data={"Estimated Yearly Increase (°C)":':.4f'},
color_continuous_midpoint=0,
title="Estimates for yearly increase in temperature in January for Stations in India, years 1980-2020",
**kwargs)
return fig
This function takes in variables to pass to our query_climate_database function, and then also a min_obs and any other keyword arguments. First, we do the obvious query and filter our DataFrame to Weather Stations which have more than the defined min_obs observations. We also apply our coef function to collect the slopes and merge them in with our station data.
Finally, with one last big plotly function call we define arguments, pass along the aforementioned keyword arguments and return the created figure.
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
Extreme Temperatures in Different Countries
My next plot compares extreme temperatures in different countries. This is interesting to me, because sometimes countries both share heatwaves/cold spells and sometimes their extreme temperatures are completely unrelated. It includes many sql queries adds the talbes together into an interesting series of charts.
Our query function looks similar, but we need to obtain both the max and min temperature for each station.
def query_climate_database_extrema(country:str, year_begin:int, year_end:int) -> pd.DataFrame:
"""
country, a string giving the name of a country for which data should be returned.
year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
returns a Pandas dataframe of temperature readings for the specified country,
in the specified date range,
in the specified month of the year.
COLUMNS:
The station name.
The station's extreme temp.
The year when the extreme temperature was achieved.
The country in which the reading was taken.
The type of extreme (max/min)
"""
# We first obtain the FIPS code
cursor.execute(f"SELECT [FIPS 10-4] FROM countries WHERE Name = '{country}' LIMIT 1;")
fips = cursor.fetchall()[0][0]
# Main query
cmd = \
f"""
SELECT S.NAME, MAX(T.TEMP) AS ExtremeTemp, T.YEAR AS Year, S.ID AS ID, '{country}' AS Country, 'MAX' as type
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE T.year >= {year_begin} AND T.year <= {year_end} AND s.id LIKE '{fips}%'
GROUP BY s.id
"""
df1 = pd.read_sql_query(cmd, conn)
cmd = \
f"""
SELECT S.NAME, MIN(T.TEMP) AS ExtremeTemp, T.YEAR AS Year, S.ID AS ID, '{country}' AS Country, 'MIN' as type
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE T.year >= {year_begin} AND T.year <= {year_end} AND s.id LIKE '{fips}%'
GROUP BY s.id
"""
df2 = pd.read_sql_query(cmd, conn)
df3 = pd.concat([df1, df2])
df3 = df3.reset_index(drop=True)
return df3
Then we also want to compare this data for different countries so we make a new function to take in a list of country names. We also accept a range of years and keyword arguments here.
from typing import List
def facet_plot_country_extrema(countries:List[str], year_begin:int, year_end:int, **kwargs) -> Figure:
"""
countries, a list of strings of countries to compare
year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
**kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.
returns a plotly.graph_objects.Figure
"""
frames = [query_climate_database_extrema(country, year_begin, year_end) for country in countries]
fig = px.scatter(pd.concat(frames),
x="Year",
y="ExtremeTemp",
labels={
"value": "Temps",
"variable": "Extrema"
},
color="type",
facet_row="Country",
facet_col="type",
hover_data={"NAME"},
title="Countries and Their Extreme Temperatures",
**kwargs)
return fig
Now with this one single line,
fig = facet_plot_country_extrema(["Russia", "Germany"], 2000, 2020)
fig.show()
We get an faceted plot of temperature & country comparisons.
Interesting 3D Mapping
I had the great idea to map the given Latitude and Longitude points to a sphere and to display these data in plotly’s scatter_3d. I decided mapping the average temperature was already pretty interesting but I fully intend to come back when I know how to animate in order to make this even more visually appealing!
# This code was found on stackexchange and modified slightly to fit python syntax
# https://stackoverflow.com/questions/36369734/how-to-map-latitude-and-longitude-to-a-3d-sphere
import math
from typing import Tuple
def map_lat_long_to_sphere(lat:float, lon:float) -> Tuple[float, float, float]:
phi = (90-lat)*(math.pi/180)
theta = (lon+180)*(math.pi/180)
x = -(6371 * math.sin(phi)*math.cos(theta))
z = (6371 * math.sin(phi)*math.sin(theta))
y = (6371 * math.cos(phi))
return (x,y,z)
Unfortunately this is actually a very simple plot because it displays every country at once and doesn’t allow for much filtering. However I gave the user the option to pick the month at least!
def avg_temperatures_of_earth(month:int, **kwargs) -> Figure:
"""
month is an int determing the month (1,12) from which to retrieve temps
**kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.
"""
# Straight to SQL
cmd = \
f"""
SELECT S.LATITUDE, S.LONGITUDE, AVG(T.Temp) AS AvgTemp, SUBSTRING(S.ID, 1, 2) AS FIPS
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE T.year >= 2000 AND T.year <= 2020 AND T.month = {month}
GROUP BY s.id
"""
df1 = pd.read_sql_query(cmd, conn)
# Map From Lat/Long to X/Y/Z
data = df1.apply(lambda x: map_lat_long_to_sphere(x.LATITUDE, x.LONGITUDE), axis=1)
new_df = pd.DataFrame(list(data), columns=['x', 'y', 'z'])
df1 =pd.concat([df1, new_df], axis=1)
month_dict = {
1: "January",
2: "February",
3: "March",
4: "April",
5: "May",
6: "June",
7: "July",
8: "August",
9: "September",
10: "October",
11: "November",
12: "December"
}
# Display
fig = px.scatter_3d(df1,
x="x",
y="y",
z="z",
color="AvgTemp",
hover_name="FIPS",
title="Temperatures of Earth in " + month_dict[month],
hover_data={"AvgTemp":':.4f'},
**kwargs)
return fig
Now using the line,
fig = avg_temperatures_of_earth(month=1)
fig.show()
I think this is a pretty cool visualization, but it has a lot of ways where I can improve it. The interactivity is pretty hard to use, the labels are a bit hard to decipher, it would be nice to have a surface to put the point onto, and it also would have been cool to use IDW interpolation to get a smooth surface of temperatures to cloak the Earth in. You could really spend forever perfecting graphs. Overall I learned a lot from this assignment and I hope you learned a bit too.