Data Analytics Automation – 50+ Python-Ready Scripts

Python-Ready Scripts – Data Science

While these scripts may well run against live production schemas, that is not their intended use. My purpose is to motivate others and to offer a good starting point for structuring certain workflows in the data science domain.

Script names are deliberately expressive, in keeping with the verbose nature of the Python language (a characteristic that I unconditionally love).

A few significant cautions on using this repo:

All scripts were written on either Linux or Windows, using the Anaconda IDE, gedit and, more recently, Geany. Most of these are written in Python 3 (some are in Python 2.7 – written with gedit on Linux) and I will try to flag those differences. They have all been written for my exact situations as above and for the data science field. I would welcome feedback on how they work for people and whether they find them useful. The Anaconda IDE provides quite a lot of support for debugging, and I endeavour to do as much of that as I can.

As far as using these scripts goes, you will need to know how to make them work for your specific use case – ASSUMING that you know what you are doing. And if you really want to learn the underlying methods, the courses on DataCamp are as good as any other out there. If necessary, I can certainly try to explain an idea based on my understanding and possibly some application.

Please note that the data sets are not available here. Use your own data. You will, however, find datasets online if you google them – there are quite a few out there – nothing beats a bit of legwork. Beware of those troublesome rabbit holes though – it is very easy to get lost when you are having fun... debugging...

These scripts have proven useful in their flexibility for other projects I am working on, but for posterity, this page has been created.

You may find some functions or portions of this code elsewhere on the web. My commercial programming experience is still ongoing and, just like everyone else, I tend to look up how to do a specific thing and sometimes borrow it. Please do not quote me if your application doesn't work. #justsaying

Having said all that, we all know that once in a while you find something that's written tremendously well (such as on StackOverflow or other blogs), and rightly so – there's no use reinventing the wheel.

My hope is that this repo can help make your building/dev work a lot easier. See what you think.
I'll attempt to credit anything of the sort as I post it, and apologies if anyone is missed – if you see such an omission, please let me know and I will rectify it asap.

Happy Automation

For all Free Event Updates :

For all automation Updates :

500 Hours Cloud/DevOps Free Trainings :

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?t=this+is+spinal+tap'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)
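
# OMDb responses are JSON, so decoding to a dict is usually more useful
# than the raw text. A small follow-on sketch ('Title', 'Year' and
# 'imdbRating' are standard OMDb response fields):
json_data = r.json()
for key in ['Title', 'Year', 'imdbRating']:
    print(key + ':', json_data.get(key))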

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import numpy as np

def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes."""
    # Initialize number of successes: n_success
    n_success = 0

    # Perform trials
    for i in range(n):
        # Choose random number between zero and one: random_number
        random_number = np.random.random()

        # If less than p, it's a success so add one to n_success
        if random_number < p:
            n_success += 1

    return n_success
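
# Quick sanity check of the helper: with p = 0.05 we expect about
# 5 successes out of 100 trials on average.
np.random.seed(42)
print(perform_bernoulli_trials(100, 0.05))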


# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import numpy as np
import matplotlib.pyplot as plt

from Bernoulli_Trial import perform_bernoulli_trials
from ecdf_func import ecdf

# Seed random number generator
np.random.seed(42)

# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size=10000)

# Compute CDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with axis labels
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.002)
plt.xlabel('Defaults out of 100 loans')
plt.ylabel('ECDF')

# Show the plot
plt.show()

# ################################################################## #

# Seed random number generator
np.random.seed(42)

# Initialize the number of defaults: n_defaults
n_defaults = np.empty(1000)

# Compute the number of defaults
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100, 0.05)

# Plot the histogram with default number of bins; label your axes
# (density=True replaces the old `normed` argument removed in newer matplotlib)
_ = plt.hist(n_defaults, density=True)
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

# Compute bin edges centred on integer default counts: bins
bins = np.arange(0, max(n_defaults) + 1.5) - 0.5

# Generate histogram
_ = plt.hist(n_defaults, density=True, bins=bins)

# Set margins
plt.margins(0.02)

# Label axes
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('Binomial PMF')

# #################################################################### #

# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = np.random.poisson(10, size=10000)

# Print the mean and standard deviation
print('Poisson: ', np.mean(samples_poisson),
      np.std(samples_poisson))

# Specify values of n and p to consider for Binomial: n, p
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]

# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
    samples_binomial = np.random.binomial(n[i], p[i], size=10000)

    # Print results
    print('n =', n[i], 'Binom:', np.mean(samples_binomial),
          np.std(samples_binomial))

# ##################################################################### #

# Plotting the Normal PDFs

# Draw 100000 samples from Normal distribution with stds of interest:
# samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

# Make histograms
_ = plt.hist(samples_std1, density=True, histtype='step', bins=100)
_ = plt.hist(samples_std3, density=True, histtype='step', bins=100)
_ = plt.hist(samples_std10, density=True, histtype='step', bins=100)

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()

# ######################################## #

# Plotting the Normal CDF/ECDF

# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)

# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker='.', linestyle='none')
_ = plt.plot(x_std3, y_std3, marker='.', linestyle='none')
_ = plt.plot(x_std10, y_std10, marker='.', linestyle='none')

# Make 2% margin
plt.margins(0.02)

# Make a legend and show the plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.var, size=10000)

# Put the variance in units of square centimeters
bs_replicates /= 100

# Make a histogram of the results
# (density=True replaces the old `normed` argument)
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('variance of annual rainfall (sq. cm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# Draw bootstrap replicates of the mean no-hitter time (equal to tau):
# bs_replicates
bs_replicates = draw_bs_reps(nohitter_times, np.mean, size=10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates, [2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps
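
# Typical usage of the pairs bootstrap, assuming `illiteracy` and
# `fertility` arrays like the ones plotted in the next script:
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(
    illiteracy, fertility, size=1000)
print('slope 95% CI:', np.percentile(bs_slope_reps, [2.5, 97.5]))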

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import numpy as np
import matplotlib.pyplot as plt

from ecdf_func import ecdf

# (assumes a `rainfall` data array, as in the DataCamp exercises)
for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()

# # COMPUTE MEAN & SEM OF BOOTSTRAP REPLICATES #### #

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)

# Compute and print SEM
print(np.std(rainfall) / np.sqrt(len(rainfall)))

# Compute and print standard deviation of bootstrap replicates
print(np.std(bs_replicates))

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# ######### PLOTTING BOOTSTRAP REGRESSIONS ###### #

# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i] * x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import numpy as np

# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55

# bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Calc.fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000

# ##### Two-sample bootstrap hypothesis test for difference of means ##### #
# Compute mean of all forces: mean_force
mean_force = np.mean(forces_concat)

# Generate shifted arrays
force_a_shifted = force_a - np.mean(force_a) + mean_force
force_b_shifted = force_b - np.mean(force_b) + mean_force

# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, 10000)
bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, 10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_a - bs_replicates_b

# Compute and print p-value: p
p = np.sum(bs_replicates >= empirical_diff_means) / 10000
print('p-value =', p)

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

from sqlalchemy import select, desc

# Build query to return state names by population difference from 2008 to 2000:
# stmt
stmt = select([census.columns.state,
               (census.columns.pop2008 - census.columns.pop2000).label('pop_change')])

# Append group by for the state: stmt
stmt = stmt.group_by(census.columns.state)

# Append an order by for pop_change in descending order: stmt
stmt = stmt.order_by(desc('pop_change'))

# Return only 5 results: stmt
stmt = stmt.limit(5)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}-{}'.format(result.state, result.pop_change))

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Import create_engine function
from sqlalchemy import create_engine

# Create an engine to the census database
engine = create_engine('postgresql+psycopg2://student:datacamp'
                       '@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com'
                       ':5432/census')

# Use the 'table_names()' method on the engine to print the table names
print(engine.table_names())

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Import necessary module
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)

"""
Open the engine connection as con using the method connect() on the engine.
Execute the query that selects ALL columns from the Album table. Store the
results in rs.
Store all of your query results in the DataFrame df by applying the
fetchall() method to the results rs.
Close the connection!
"""

# 'Retrieve column of table called Album in the chinook database'

# Open engine connection: con
con = engine.connect()

# Perform query: rs
rs = con.execute('SELECT * FROM Album')

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close the connection
con.close()

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'fixations.csv'
file2 = 'gaze_positions.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)
df2 = pd.read_csv(file2)

# View the head of the DataFrame

print(df.head())
print(df2.head())

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import os
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from mayavi import mlab  # required for the 3D density plot below
import multiprocessing
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.graph_objs import Surface

path = r'C:\Users\Shabaka\Desktop\Test2 DJI_Corretti\100\TIM'
# path = r'C:\DRO\DCL_rawdata_files'
allFiles = glob.glob(path + "/*.csv")
# frame = pd.DataFrame()
list_TIM = []
for file_ in allFiles:
    df_TIM = pd.read_csv(file_, index_col=None, header=0)
    list_TIM.append(df_TIM)
frame = pd.concat(list_TIM)  # ignore_index=True

print(frame.head())

# sns.heatmap(frame.head())

plt.show()

temp = pd.read_csv('C:\\Users\\Shabaka\\Desktop\\Temperatura_Media.csv')
# Plot the aapl time series in blue
print(temp.head())
plt.plot(temp, color='blue', label='Temp_Median..(yr)')

plt.show()

# Plot the pairwise joint distributions grouped by 'origin' along with
# regression lines
# sns.pairplot(temp, kind='reg', hue='Temp_Med')
# plt.show()

# urb_pop_reader = pd.read_csv(filename, chunksize=1000)

"""
files = glob("*.txt")
fig, ax = plt.subplots()

for f in files:
    print("Current file is " + f)
    # your csv loading into data
    data.plot('time', 'temp', ax=ax)

#outside of the for loop
plt.savefig("myplots.png")

"""

# '''''''''''' 3D Density Map Plot ''''''''''#

def calc_kde(data):
    # evaluate the gaussian_kde object defined below at module level
    return kde(data.T)

mu, sigma = 0, 0.1
x = 10*np.random.normal(mu, sigma, 5000)
y = 10*np.random.normal(mu, sigma, 5000)
z = 10*np.random.normal(mu, sigma, 5000)

xyz = np.vstack([x, y, z])
kde = stats.gaussian_kde(xyz)

# Evaluate kde on a grid
xmin, ymin, zmin = x.min(), y.min(), z.min()
xmax, ymax, zmax = x.max(), y.max(), z.max()
xi, yi, zi = np.mgrid[xmin:xmax:30j, ymin:ymax:30j, zmin:zmax:30j]
coords = np.vstack([item.ravel() for item in [xi, yi, zi]])

# Multiprocessing
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)
results = pool.map(calc_kde, np.array_split(coords.T, 2))
density = np.concatenate(results).reshape(xi.shape)

# Plot scatter with mayavi
figure = mlab.figure('DensityPlot')

grid = mlab.pipeline.scalar_field(xi, yi, zi, density)
dmin = density.min()
dmax = density.max()
mlab.pipeline.volume(grid, vmin=dmin, vmax=dmin + .5*(dmax-dmin))

mlab.axes()
mlab.show()

# '''''''' Alternative Route '''''''''''''#
filename = 'C:\\Users\\Shabaka\\Desktop\\Temperatura_Media.csv'
raw_data = open(filename, 'rt')
tempdata = pd.read_csv(raw_data, header=0)
print(tempdata.shape)

print(tempdata.head())

plt.plot(tempdata, color='blue', label='Temp_Med')

plt.show()

sns.pairplot(tempdata, kind='reg') # hue='Temp_Med')
plt.show()

surfdata = [go.Surface(z=tempdata.values)]  # .values replaces the removed .as_matrix()

layout = go.Layout(
    title='Temp_Data Elevation',
    autosize=False,
    width=500,
    height=500,
    margin=dict(l=65, r=50, b=65, t=90)
)
fig = go.Figure(data=surfdata, layout=layout)
py.iplot(fig, filename='elevations-3d-surface', type='surface')

plt.show()

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

# Import pandas and numpy
import pandas as pd
import numpy as np

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame

print(df.head())

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = np.array(data.values)

# Print the datatype of data_array to the shell
print(type(data_array))

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import os
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# path = r'C:\DRO\DCL_rawdata_files'

path = r'C:\Users\Shabaka\Desktop\Test2 DJI_Corretti\100\TIM'
allfiles = glob.glob(os.path.join(path, "*.csv"))  # glob the pattern, not the bare string
frame2 = pd.DataFrame()
list2 = []
for file_ in allfiles:
    df = pd.read_csv(file_, index_col=None, header=None)
    list2.append(df)
frame = pd.concat(list2, ignore_index=True)

print(frame.head())

df = pd.concat((pd.read_csv(file) for file in allfiles))

print(df.head())

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import os
import glob
import pandas as pd

def concatenate(indir='', outfile='', colnames=None):
    """Concatenate every CSV in indir into a single CSV at outfile."""
    os.chdir(indir)
    fileList = glob.glob('*.csv')
    dfList = []

    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)

    concatDF = pd.concat(dfList, axis=0)
    if colnames is not None:
        concatDF.columns = colnames
    concatDF.to_csv(outfile, index=False)
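
# Hypothetical usage of concatenate() (the paths and column names are
# placeholders, not files shipped with this repo):
concatenate(indir=r'C:\Users\Shabaka\Desktop\csv_in',
            outfile=r'C:\Users\Shabaka\Desktop\merged.csv',
            colnames=['col_a', 'col_b', 'col_c'])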

# -*- coding: utf-8 -*-
"""

Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan

"""

import pandas as pd

df_offers = pd.read_excel("http://blog.yhathq.com/static/misc/data/WineKMC.xlsx", sheet_name=0)  # sheet_name replaces the old sheetname argument
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()

df_transactions = pd.read_excel("http://blog.yhathq.com/static/misc/data/WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()

# join the offers and transactions table
df = pd.merge(df_offers, df_transactions)
# create a "pivot table" which will give us the number of times each customer responded to a given offer
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'], values='n')
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
# save a list of the 0/1 columns. we'll use these a bit later
x_cols = matrix.columns[1:]

from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=5)
# slice matrix so we only include the 0/1 indicator columns in the clustering
matrix['cluster'] = cluster.fit_predict(matrix[matrix.columns[2:]])
matrix.cluster.value_counts()

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
xy = pca.fit_transform(matrix[x_cols])  # fit once and reuse both components
matrix['x'] = xy[:, 0]
matrix['y'] = xy[:, 1]
matrix = matrix.reset_index()

customer_clusters = matrix[['customer_name', 'cluster', 'x', 'y']]
customer_clusters.head()

df = pd.merge(df_transactions, customer_clusters)
df = pd.merge(df_offers, df)

from ggplot import *
"""
import matplotlib.pyplot as plt
plt.figure()
plt.plot(rigs2)
plt.plot(customer_clusters)
plt.ion()
plt.show()
"""
ggplot(df, aes(x='x', y='y', color='cluster')) + \
    geom_point(size=75) + \
    ggtitle("Customers Grouped by Cluster")

cluster_centers = pca.transform(cluster.cluster_centers_)
cluster_centers = pd.DataFrame(cluster_centers, columns=['x', 'y'])
cluster_centers['cluster'] = range(0, len(cluster_centers))

ggplot(df, aes(x='x', y='y', color='cluster')) + \
    geom_point(size=75) + \
    geom_point(cluster_centers, size=500) + \
    ggtitle("Customers Grouped by Cluster")

df['is_4'] = df.cluster==4
df.groupby("is_4").varietal.value_counts()

df.groupby("is_4")[['min_qty', 'discount']].mean()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

"""

Open the engine connection as con using the method connect() on the engine.
Execute the query that selects ALL columns from the Album table. Store the
results in rs.
Store all of your query results in the DataFrame df by applying the
fetchall() method to the results rs.
Close the connection! - In Query Script
"""

# 'This script allows us to perform the following things:'

# Select specified columns from a table;
# Select a specified number of rows;
# Import column names from the database table.


from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
print(df.head())

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

# ''Load and View Data ''''''''''#

# Import pandas
import pandas as pd
import matplotlib.pyplot as plt


# Read the file into a DataFrame: df
# df = pd.read_csv('your_file.csv')  # the argument can also be a full file path

df = pd.read_csv('fixations.csv')
df2 = pd.read_csv('flightdata.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

print('AERO DATA OUTPUT')


print(df2.head())

print(df2.tail())

# Print the shape of df
print(df.shape)

print(df2.shape)

# Print the columns of df
print(df.columns)

print(df2.columns)

# Print the head and tail of df_subset
# print(df.subset.head())
# print(df.subset.tail())

# Print the info of df
print(df.info())

print(df2.info())

# Print the info of df_subset
# print(df.subset.info())


# '''''''' Frequency counts for Categorical Data
# Note that the dataframe columns used here actually hold
# continuous data; they are simply placeholders.

# Print the value counts for your category (i.e. column) of interest
print(df['duration'].value_counts(dropna=False))

print(df['duration'].shape)

# Print the value_counts for 'next_category'
print(df['confidence'].value_counts(dropna=False))

print(df['confidence'].shape)

# Print the value counts for 'and_another'
print(df['avg_pupil_size'].value_counts(dropna=False))


# ''''''''''' Single Variable Histogram plot ''''''''#

# Plot the histogram
df['duration'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

# ''''' Multi Variable Box Plot Visualisation '''''''#

# The necessary modules are imported at the top of the script;
# they don't strictly have to be, but Spyder likes it that way
# and it reads well too.

# Create the boxplot
df.boxplot(column='duration', by='avg_pupil_size', rot=90)

# Display the plot
plt.show()

# ''''''''''' Multiple variable scatter plot visualisation''''#

# Import necessary modules -moved to top
# import pandas as pd - at top
# import matplotlib.pyplot as plt - at top

# Create and display the first scatter plot (these column names come
# from the original DataCamp exercise; swap in columns from your data)
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

# Create and display the second scatter plot (assumes a df_subset DataFrame)
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import os
import glob
import pandas as pd
import mayavi
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from mayavi import mlab
import multiprocessing
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.graph_objs import Surface


path = r'C:\Users\vijay\Desktop\Test2 DJI_Corretti'
all_files = glob.glob(os.path.join(path, "*Temperatura_Media.csv"))

# Read each matched file and concatenate into one DataFrame
df_from_each_file = (pd.read_csv(f) for f in all_files)
conc_df = pd.concat(df_from_each_file, ignore_index=True)

print(conc_df.head())

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

# Code for the correlate function is copied from pupil labs git @
# https://github.com/pupil-labs/pupil/wiki/Data-Format


def correlate_data(data, timestamps):
    '''
    data: list of data:
        each datum is a dict with at least:
            timestamp: float

    timestamps: timestamps list to correlate data to

    this takes a data list and a timestamps list and makes a new list
    with the length of the number of timestamps.
    Each slot contains a list that will have 0, 1 or more associated
    data points.

    Finally we add an index field to the datum with the associated index
    '''
    timestamps = list(timestamps)
    data_by_frame = [[] for i in timestamps]

    frame_idx = 0
    data_index = 0

    data.sort(key=lambda d: d['timestamp'])

    while True:
        try:
            datum = data[data_index]
            # we can take the midpoint between two frames in time:
            # More appropriate for SW timestamps
            ts = (timestamps[frame_idx] + timestamps[frame_idx+1]) / 2.
            # or the time of the next frame:
            # More appropriate for Start of Exposure timestamps (HW timestamps).
            # ts = timestamps[frame_idx+1]
        except IndexError:
            # we might lose a data point at the end but we don't care
            break

        if datum['timestamp'] <= ts:
            datum['index'] = frame_idx
            data_by_frame[frame_idx].append(datum)
            data_index += 1
        else:
            frame_idx += 1

    return data_by_frame
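
# A small, self-contained check of the pairing logic; the timestamps
# and gaze datums here are made up for illustration:
frame_timestamps = [0.0, 1.0, 2.0]
gaze_data = [{'timestamp': 0.1}, {'timestamp': 0.4},
             {'timestamp': 1.2}, {'timestamp': 1.9}]

by_frame = correlate_data(gaze_data, frame_timestamps)
for idx, datums in enumerate(by_frame):
    print('frame', idx, '->', len(datums), 'datum(s)')
# The last datum (1.9) is dropped, as the docstring notes.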

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import pandas as pd
import matplotlib.pyplot as plt

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty dataframe: data
    data = pd.DataFrame()

    # Iterate over each dataframe chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip dataframe columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new
        # dataframe column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = \
            [int(tup[0] * tup[1]) for tup in pops_list]

        # Append dataframe chunk to data: data
        # (pd.concat replaces DataFrame.append, removed in newer pandas)
        data = pd.concat([data, df_pop_ceb])

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()
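
# Usage, mirroring the second copy of this script further down
# (assumes an 'ind_pop_data.csv' file with CountryCode, Total Population
# and 'Urban population (% of total)' columns):
plot_pop('ind_pop_data.csv', 'CEB')
plot_pop('ind_pop_data.csv', 'ARB')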

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load csv files OR scan a file directory for a specific file name


# ''''''' Basic EDA instructions to verify the data


# ''''''' Carry out some basic visualisations ''''''''''''#

# Import matplotlib.pyplot
# import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

# Think about the question at hand ''''#

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1


# ''''''''''' Assemble the Data '''''''''''''#

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())


# ''''Reshape the data to aid easier analysis ( if required)''''#

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars='Life expectancy')

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())

# '''''''''''Check the data types in the dataset ''''''''#

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder['year'])

# Test if country is of type object (np.object was removed in newer NumPy)
assert gapminder.country.dtypes == object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

# '''''''''''''''''Ex. Country Spellings to CHeck for Correctness ''''#

# Create the series of countries: countries
countries = gapminder['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)

# '''''''' More Cleaning Ex.''''''''''#

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna(how='any')

# Print the shape of gapminder
print(gapminder.shape)

# Add first subplot
plt.subplot(2, 1, 1)

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)


# ''''''''' Wrap up with visualisation of cleaned data set'''' Eg.'''#
# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""
import pandas as pd
import matplotlib.pyplot as plt
import glob

# ''''Combining Rows of Data ''''''''''''#

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

# '''''''''''' Combining Columns of Data '''''''''''#

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())


# '''Find Files that match a PAttern '''''''' #

# Import necessary modules

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())

# '''''''''''Iterate and Concatenate all Matches ''''''#

# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:

    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())


# ''''''One to - One Data Merge '#

# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)

# '''''''MAny to One Data MErge ''''#

# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print m2o
print(m2o)

# ''''''''''Many To Many Data Merge ''''''''''''#

# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

# Select retweets from the Twitter dataframe: result
result = filter(lambda x:x[0:2] == 'RT', tweets_df['text'])

# Create list from filter object result: res_list
res_list = list(result)

# Print all retweets in res_list
for tweet in res_list:
    print(tweet)

# -*- coding: utf-8 -*-
"""
@author: Vijayabalan
"""

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())

# ''''Customise melted Data - Change var name & Val'''#

# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'],
                          var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())

# ''' Pivoting Data from melt '''''''''#

# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())

#''''''''''''''''Reset data frame index''''''''''''#

# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the new index of airquality_pivot
print(airquality_pivot.index)

# Print the head of airquality_pivot
print(airquality_pivot.head())

# ''''''' Pivoting Duplicate Values ''''''''''#

# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'],
                                              columns='measurement',
                                              values='reading',
                                              aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())

# ''''''''' Split column info using str '''''#

# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# '''''' Split a column with .split() and .get()

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())


# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re


# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

# '''''Working with Numeric Data - Wrong data types ''''#

# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())


# '''' String Parsing with regular expression '''#

# Import the regular expression module

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result = prog.match('1123-456-7890')
print(bool(result))

# ''''''' Find numerics in a string '''''''' #

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe requires 10 strawberries and 1 banana')

# Print the matches
print(matches)


# ''''' Pattern Matching '''''##

# Write the first pattern
print(bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890')))

# Write the second pattern
print(bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45')))

# Write the third pattern
print(bool(re.match(pattern=r'[A-Z]\w*', string='Australia')))

# '''''''''######## ''''''''''''''''' ##########'''''''''''''''''''#

# '''''Custom Fxn to clean data in column ( dataframe)''''''''#

# Define recode_sex()


def recode_sex(sex_value):

    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1

    # Return 0 if sex_value is 'Female'
    elif sex_value == 'Female':
        return 0

    # Return np.nan otherwise
    else:
        return np.nan


# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)


#''' Lambda Functions ''''''#

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x))

# Print the head of tips
print(tips.head())

# ''''''' Dropping Duplicate Data '''''''''''''#

# Create the new DataFrame: tracks
tracks = billboard[['year', 'artist', 'track', 'time']]

# Print info of tracks
print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
print(tracks_no_duplicates.info())

# '''''''''''''''' Fill in Missing Data ''''''''' #

# Calculate the mean of the Ozone column: oz_mean
oz_mean = np.mean(airquality.Ozone)

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality['Ozone'].fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

# ''''''''''''''' Data Test with Assert Statements ''''''#

# Assert that there are no missing values
assert pd.notnull(ebola).all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()

# assert pd.notnull(ebola >= 0).all().all()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

# ''Load and View Data ''''''''''#

# Import pandas
import pandas as pd
import matplotlib.pyplot as plt


# Read the file into a DataFrame: df
# df = pd.read_csv('dob_job_application_filings_subset.csv')

df = pd.read_csv('fixations.csv')
df2 = pd.read_csv('aerodata.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

print('AERO DATA OUTPUT')


print(df2.head())

print(df2.tail())

# Print the shape of df
print(df.shape)

print(df2.shape)

# Print the columns of df
print(df.columns)

print(df2.columns)

# Print the head and tail of df_subset
# print(df.subset.head())
# print(df.subset.tail())

# Print the info of df
print(df.info())

print(df2.info())

# Print the info of df_subset
# print(df.subset.info())


# '''''''' Frequency counts for Categorical Data

# Print the value counts for 'duration'
print(df['duration'].value_counts(dropna=False))

print(df['duration'].shape)

# Print the value_counts for 'confidence'
print(df['confidence'].value_counts(dropna=False))

print(df['confidence'].shape)

# Print the value counts for 'avg_pupil_size'
print(df['avg_pupil_size'].value_counts(dropna=False))

# ''''''''''' Single Variable Histogram plot ''''''''#

# Plot the histogram
df['duration'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

# ''''' Multi Variable Box Plot Visualisation '''''''#

# Import necessary modules

# Create the boxplot
df.boxplot(column='duration', by='avg_pupil_size', rot=90)

# Display the plot
plt.show()

# ''''''''''' Multiple variable scatter plot visualisation''''#

# Import necessary modules
# import pandas as pd
# import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind='scatter', x='duration', y='avg_pupil_size', rot=70)
plt.show()

# Create and display the second scatter plot (assumes a df_subset DataFrame)
df_subset.plot(kind='scatter', x='duration', y='confidence', rot=70)
plt.show()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import pandas as pd
import matplotlib.pyplot as plt

# Define plot_pop()


def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty dataframe: data
    data = pd.DataFrame()

    # Iterate over each dataframe chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip dataframe columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comp to create new dataframe column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = \
            [int(tup[0] * tup[1]) for tup in pops_list]

        # Append dataframe chunk to data: data
        # (pd.concat replaces DataFrame.append, removed in newer pandas)
        data = pd.concat([data, df_pop_ceb])

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop('ind_pop_data.csv', 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop('ind_pop_data.csv', 'ARB')

# -*- coding: utf-8 -*-
"""
@author: Vijayabalan
"""

import pandas as pd
import matplotlib.pyplot as plt

with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

# ''''''''''''''''' Write Generator to Load Data Chunks ''''''' #

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data


# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))


# ''''''''''''''' Load Data in Chunks with Generator ''''''''''' '#
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print
print(counts_dict)

# ''''' Iterator to load data in chunks ''''''''''' #

# Import the pandas package

# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

# ''''''''''''' Iterator to Load Data in Chunks '''''''''''#

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first dataframe chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the dataframe
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
           df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)


# Use list comp to create new dataframe column 'Total Urban Population'

df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]

# Plot urban population data

df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""
import numpy as np
import pandas as pd

from bokeh.plotting import figure
from bokeh.io import output_file, show
from bokeh.plotting import ColumnDataSource
from bokeh.models import HoverTool

# Create the figure: p
p = figure(x_axis_label='fertility (children per woman)',
           y_axis_label='female_literacy (% population)')

# Add a circle glyph to the figure p
p.circle(fertility, female_literacy)

# Call the output_file() function and specify the name of the file
output_file('fert_lit.html')

# Display the plot
show(p)

# ''''''''''''' Multiple Data plots ''''''#

# Create the figure: p
p = figure(x_axis_label='fertility',
           y_axis_label='female_literacy (% population)')

# Add a circle glyph to the figure p
_ = p.circle(fertility_latinamerica, female_literacy_latinamerica)

# Add an x glyph to the figure p
_ = p.x(fertility_africa, female_literacy_africa)

# Specify the name of the file
output_file('fert_lit_separate.html')

# Display the plot
show(p)

# '''''Scatter Plot Customisation '''''''#

# Create the figure: p
p = figure(x_axis_label='fertility (children per woman)',
           y_axis_label='female_literacy (% population)')

# Add a blue circle glyph to the figure p
p.circle(fertility_latinamerica, female_literacy_latinamerica,
         color='blue', size=10, alpha=0.8)

# Add a red circle glyph to the figure p
p.circle(fertility_africa, female_literacy_africa,
         color='red', size=10, alpha=0.8)

# Specify the name of the file
output_file('fert_lit_separate_colors.html')

# Display the plot
show(p)


# '''' Bokeh Line Plot '''''''''''#

# Import figure from bokeh.plotting - to p of file

# Create a figure with x_axis_type="datetime": p
p = figure(x_axis_type='datetime',
           x_axis_label='Date', y_axis_label='US Dollars')

# Plot date along the x axis and price along the y axis
p.line(date, price, line_width=3)

# Specify the name of the output file and show the result
output_file('line.html')
show(p)

# '''''Line and Marker Plot ''''''''#

# Import figure from bokeh.plotting - top of file

# Create a figure with x_axis_type='datetime': p
p = figure(x_axis_type='datetime', x_axis_label='Date',
           y_axis_label='US Dollars')

# Plot date along the x-axis and price along the y-axis
p.line(date, price)

# With date on the x-axis and price on the y-axis,
# add a white circle glyph of size 4
p.circle(date, price, fill_color='white', size=4)

# Specify the name of the output file and show the result
output_file('line.html')
show(p)

# ''''''Bokeh Patch Plots 'Maps' ''#

# Create a list of az_lons, co_lons, nm_lons and ut_lons: x
x = [az_lons, co_lons, nm_lons, ut_lons]

# Create a list of az_lats, co_lats, nm_lats and ut_lats: y
y = [az_lats, co_lats, nm_lats, ut_lats]

# Add patches to figure p with line_color=white for x and y
p.patches(x, y, line_color='white')

# Specify the name of the output file and show the result
output_file('four_corners.html')
show(p)


# ''''''''' Plotting from a numpy array ''''''#

# Import numpy as np - at top of file

# Create array using np.linspace: x
x = np.linspace(0, 5, 100)

# Create array using np.cos: y
y = np.cos(x)

# Add circles at x and y
p.circle(x, y)

# Specify the name of the output file and show the result
output_file('numpy.html')
show(p)

# '''''''' Plotting from Pandas Dataframe ''''''''#

# Import pandas as pd - top of file

# Read in the CSV file: df
df = pd.read_csv('auto.csv')

# Import figure from bokeh.plotting - top of file

# Create the figure: p
p = figure(x_axis_label='HP', y_axis_label='MPG')

# Plot mpg vs hp by color
p.circle(df['hp'], df['mpg'], color=df['color'], size=10)

# Specify the name of the output file and show the result
output_file('auto-df.html')
show(p)

# '''''''' Plot from ColumnData Source ''''''''#

# Import the ColumnDataSource class from bokeh.plotting

# Create a ColumnDataSource from df: source
source = ColumnDataSource(df)

# Add circle glyphs to the figure p
p.circle('Year', 'Time', source=source, color='color', size=8)

# Specify the name of the output file and show the result
output_file('sprint.html')
show(p)

# '''''''Selection and non-Selection Glyph Specification ''''#

# Create a figure with the "box_select" tool: p
p = figure(x_axis_label='Year', y_axis_label='Time', tools='box_select')

# Add circle glyphs to the figure p with the selected
# and non-selected properties

p.circle('Year', 'Time', source=source,
         selection_color='red', nonselection_alpha=0.1)

# Specify the name of the output file and show the result
output_file('selection_glyph.html')
show(p)

# '''''' Making Hover Glyphs '''''''#

# import the HoverTool - at top of file

# Add circle glyphs to figure p
p.circle(x, y, size=10,
         fill_color='grey', alpha=0.1, line_color=None,
         hover_fill_color='firebrick', hover_alpha=0.5,
         hover_line_color='white')

# Create a HoverTool: hover
hover = HoverTool(tooltips=None, mode='vline')

# Add the hover tool to the figure p
p.add_tools(hover)

# Specify the name of the output file and show the result
output_file('hover_glyph.html')
show(p)

# ''''''''' Color Mapping '''''''''''#

# Import CategoricalColorMapper from bokeh.models
from bokeh.models import CategoricalColorMapper

# Convert df to a ColumnDataSource: source
source = ColumnDataSource(df)

# Make a CategoricalColorMapper object: color_mapper
color_mapper = CategoricalColorMapper(factors=['Europe', 'Asia', 'US'],
                                      palette=['red', 'green', 'blue'])

# Add a circle glyph to the figure p
p.circle('weight', 'mpg', source=source,
         color=dict(field='origin', transform=color_mapper),
         legend='origin')

# Specify the name of the output file and show the result
output_file('colormap.html')
show(p)

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import pandas as pd
import numpy as np

from bokeh.io import output_file, show

from bokeh.plotting import figure
from bokeh.plotting import ColumnDataSource

from bokeh.layouts import gridplot
from bokeh.layouts import row, column
from bokeh.layouts import widgetbox

from bokeh.charts import BoxPlot
from bokeh.charts import Scatter

from bokeh.palettes import Spectral6

from bokeh.models import Select
from bokeh.models import Slider
from bokeh.models import Button
from bokeh.models import HoverTool
from bokeh.models import CategoricalColorMapper
from bokeh.models import CheckboxGroup, RadioGroup, Toggle

from bokeh.models.widgets import Panel
from bokeh.models.widgets import Tabs

# Perform necessary imports
from bokeh.io import curdoc

# Load the data (fill in your own CSV file name)
data = pd.read_csv('___.csv')
_ = data.head()
_ = data.describe()
_ = data.info()
_ = data.shape

# '''' Basic EDA Plot of Gapminder Data set ''''''''#

# Make the ColumnDataSource: source
source = ColumnDataSource(data={
    'x': data.loc[1970].fertility,
    'y': data.loc[1970].life,
    'country': data.loc[1970].Country,
})

# Create the figure: p
p = figure(title='1970', x_axis_label='Fertility (children per woman)',
           y_axis_label='Life Expectancy (years)',
           plot_height=400, plot_width=700,
           tools=[HoverTool(tooltips='@country')])

# Add a circle glyph to the figure p
p.circle(x='x', y='y', source=source)

# Output the file and show the figure
output_file('gapminder.html')
show(p)

# ''' Basic Data Plot '''''''#

# Make the ColumnDataSource: source
source = ColumnDataSource(data={
    'x': data.loc[1970].fertility,
    'y': data.loc[1970].life,
    'country': data.loc[1970].Country,
    'pop': (data.loc[1970].population / 20000000) + 2,
    'region': data.loc[1970].region,
})

# Save the minimum and maximum values of the fertility column: xmin, xmax
xmin, xmax = min(data.fertility), max(data.fertility)

# Save the minimum and maximum values of the life expectancy column: ymin, ymax
ymin, ymax = min(data.life), max(data.life)

# Create the figure: plot
plot = figure(title='Gapminder Data for 1970', plot_height=400, plot_width=700,
              x_range=(xmin, xmax), y_range=(ymin, ymax))

# Add circle glyphs to the plot
plot.circle(x='x', y='y', fill_alpha=0.8, source=source)

# Set the x-axis label
plot.xaxis.axis_label = 'Fertility (children per woman)'

# Set the y-axis label
plot.yaxis.axis_label = 'Life Expectancy (years)'

# Add the plot to the current document and add a title
curdoc().add_root(plot)
curdoc().title = 'Gapminder'


# ''''' Enhancing the list with some colours ''''#

# Make a list of the unique values from the region column: regions_list
regions_list = data.region.unique().tolist()

# Import CategoricalColorMapper from bokeh.models and
# the Spectral6 palette from bokeh.palettes

# Make a color mapper: color_mapper
color_mapper = CategoricalColorMapper(factors=regions_list, palette=Spectral6)

# Add the color mapper to the circle glyph
plot.circle(x='x', y='y', fill_alpha=0.8, source=source,
            color=dict(field='region', transform=color_mapper),
            legend='region')

# Set the legend.location attribute of the plot to 'top_right'
plot.legend.location = 'top_right'

# Add the plot to the current document and add the title
curdoc().add_root(plot)
curdoc().title = 'Gapminder'


# '''''' Adding a Slider to vary the year ''''''#

# Define the callback function: update_plot
def update_plot(attr, old, new):
    # Set yr to the slider's current value and assign new_data to source.data
    yr = slider.value
    new_data = {
        'x': data.loc[yr].fertility,
        'y': data.loc[yr].life,
        'country': data.loc[yr].Country,
        'pop': (data.loc[yr].population / 20000000) + 2,
        'region': data.loc[yr].region,
    }
    source.data = new_data


# Make a slider object: slider
slider = Slider(start=1970, end=2010, step=1, value=1970, title='Year')

# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Make a row layout of widgetbox(slider) and plot
# and add it to the current document
layout = row(widgetbox(slider), plot)
curdoc().add_root(layout)

# ''''' Customised Plot API from user input '''#

# Define the callback function: update_plot
def update_plot(attr, old, new):
    # Assign the value of the slider: yr
    yr = slider.value
    # Set new_data
    new_data = {
        'x': data.loc[yr].fertility,
        'y': data.loc[yr].life,
        'country': data.loc[yr].Country,
        'pop': (data.loc[yr].population / 20000000) + 2,
        'region': data.loc[yr].region,
    }
    # Assign new_data to: source.data
    source.data = new_data

    # Update the figure title: plot.title.text
    plot.title.text = 'Gapminder data for %d' % yr

# Make a slider object: slider
slider = Slider(start=1970, end=2010, step=1, value=1970, title='Year')

# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Make a row layout of widgetbox(slider) and
# plot and add it to the current document
layout = row(widgetbox(slider), plot)
curdoc().add_root(layout)

# '''' Add Hover info_tool to the API '''''''#

# Create a HoverTool: hover
hover = HoverTool(tooltips=[('Country', '@country')])

# Add the HoverTool to the plot
plot.add_tools(hover)
# Create layout: layout
layout = row(widgetbox(slider), plot)

# Add layout to current document
curdoc().add_root(layout)

# '''''''Adding drop-down menu to the App ''''''''''#

# Define the callback: update_plot
def update_plot(attr, old, new):
    # Read the current values of the slider and the two dropdowns: yr, x, y
    yr = slider.value
    x = x_select.value
    y = y_select.value
    # Label axes of plot
    plot.xaxis.axis_label = x
    plot.yaxis.axis_label = y
    # Set new_data
    new_data = {
        'x': data.loc[yr][x],
        'y': data.loc[yr][y],
        'country': data.loc[yr].Country,
        'pop': (data.loc[yr].population / 20000000) + 2,
        'region': data.loc[yr].region,
    }
    # Assign new_data to source.data
    source.data = new_data

    # Set the range of all axes
    plot.x_range.start = min(data[x])
    plot.x_range.end = max(data[x])
    plot.y_range.start = min(data[y])
    plot.y_range.end = max(data[y])

    # Update the plot title
    plot.title.text = 'Gapminder data for %d' % yr

# Create a slider widget: slider
slider = Slider(start=1970, end=2010, step=1, value=1970, title='Year')

# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Create a dropdown Select widget for the x data: x_select
x_select = Select(
    options=['fertility', 'life', 'child_mortality', 'gdp'],
    value='fertility',
    title='x-axis data'
)

# Attach the update_plot callback to the
# 'value' property of x_select
x_select.on_change('value', update_plot)

# Create a dropdown Select widget for the y data: y_select
y_select = Select(
    options=['fertility', 'life', 'child_mortality', 'gdp'],
    value='life',
    title='y-axis data'
)

# Attach the update_plot callback to
# the 'value' property of y_select
y_select.on_change('value', update_plot)

# Create layout and add to current document
layout = row(widgetbox(slider, x_select, y_select), plot)
curdoc().add_root(layout)
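# NOTE: apps built on curdoc() like the one above are meant to be run
# with the Bokeh server rather than output_file()/show(); from a terminal:
#   bokeh serve --show gapminder_app.py
# (gapminder_app.py is a placeholder name for this script.)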

 

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""
import pandas as pd
import numpy as np

from bokeh.plotting import figure
from bokeh.io import output_file, show
from bokeh.plotting import ColumnDataSource
from bokeh.models import HoverTool
from bokeh.layouts import gridplot
from bokeh.models.widgets import Panel
from bokeh.models.widgets import Tabs
from bokeh.layouts import row, column
from bokeh.charts import BoxPlot
from bokeh.charts import Scatter

# Import Histogram, output_file, and show from bokeh.charts
from bokeh.charts import Histogram


# ''''' Basic bokeh Histogram ''''''''#
df = pd.read_csv('fixations.csv')

df.head()

# Create a ColumnDataSource from df: source
source = ColumnDataSource(df)

# Make a Histogram: p
p = Histogram(df, 'duration', title='Gaze_Time', bins=50)

# Set the x axis label
p.xaxis.axis_label = 'Gaze_Duration'

# Set the y axis label
p.yaxis.axis_label = 'Pupil Dia'
# Specify the name of the output_file and show the result
output_file('histogram.html')
show(p)
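# NOTE: bokeh.charts (used above) was deprecated and later removed from
# Bokeh. A minimal sketch of the same histogram with the stable
# bokeh.plotting API, assuming the same 'duration' column:
hist, edges = np.histogram(df['duration'].dropna(), bins=50)
p = figure(title='Gaze_Time', x_axis_label='Gaze_Duration',
           y_axis_label='Count')
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])
output_file('histogram_quad.html')
show(p)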

"""
# Make a Histogram: p
p = Histogram(df, 'female_literacy', title='Female Literacy',
bins=40)

# Set the x axis label
p.xaxis.axis_label = 'Female Literacy'

# Set the y axis label
p.yaxis.axis_label = 'Fertility'
# Specify the name of the output_file and show the result
output_file('histogram.html')
show(p)

"""
# '''''' Multiple Histograms ''''''''#

# Make a Histogram (df is assumed here to hold the female-literacy
# dataset rather than fixations.csv): p
p = Histogram(df, 'female_literacy', title='Female Literacy',
              color='Continent', legend='top_left')

# Set axis labels
p.xaxis.axis_label = 'Female Literacy (% population)'
p.yaxis.axis_label = 'Number of Countries'

# Specify the name of the output_file and show the result
output_file('hist_bins.html')

"""
# '''''' Basic BoxPlot '''''''''#

# Make a box plot: p
p = BoxPlot(df, values='duration', label='confidence',
title='Gaze Duration (grouped by Avg_Pupil_Size)',
legend='bottom_right')

# Set the y axis label
p.yaxis.axis_label = 'Fixations (% Tot_Gaze_Pop)'

# Specify the name of the output_file and show the result
output_file('boxplot.html')
show(p)
"""

# ''''''''''''''' ################ '''''''''''''' #
# Make a box plot: p
p = BoxPlot(df, values='female_literacy', label='Continent',
            title='Female Literacy (grouped by Continent)',
            legend='bottom_right')

# Set the y axis label
p.yaxis.axis_label = 'Female literacy (% population)'

# Specify the name of the output_file and show the result
output_file('boxplot.html')
show(p)

# ''''''''''' Multicoloured Boxplots ''''''#

# Make a box plot: p
p = BoxPlot(df, values='female_literacy',
            label='Continent', color='Continent',
            title='Female Literacy (grouped by Continent)',
            legend='bottom_right')

# Set y-axis label
p.yaxis.axis_label = 'Female literacy (% population)'

# Specify the name of the output_file and show the result
output_file('boxplot.html')
show(p)

# ''''''''' Basic Bokeh Scatter PLot ''''''#

# Make a scatter plot: p
p = Scatter(df, x='population', y='female_literacy',
            title='Female Literacy vs Population')

# Set the x-axis label
p.xaxis.axis_label = 'Population'

# Set the y-axis label
p.yaxis.axis_label = 'Female Literacy'
# Specify the name of the output_file and show the result
output_file('scatterplot.html')
show(p)

# ''''' scatter plot grouping by colour ''''#

# Make a scatter plot such that each circle
# is colored by its continent: p
p = Scatter(df, x='population', y='female_literacy',
            color='Continent',
            title='Female Literacy vs Population')

# Set x-axis and y-axis labels
p.xaxis.axis_label = 'Population (millions)'
p.yaxis.axis_label = 'Female literacy (% population)'

# Specify the name of the output_file and show the result
output_file('scatterplot.html')
show(p)

# ''''' Scatter plot shape(marker) grouping '''''#

# Make a scatter plot such that each continent has a different marker type: p
p = Scatter(df, x='population', y='female_literacy',
            color='Continent',
            marker='Continent',
            title='Female Literacy vs Population')

# Set x-axis and y-axis labels
p.xaxis.axis_label = 'Population (millions)'
p.yaxis.axis_label = 'Female literacy (% population)'

# Specify the name of the output_file and show the result
output_file('scatterplot.html')
show(p)

 

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import pandas as pd
import numpy as np

from bokeh.io import output_file, show

from bokeh.plotting import figure
from bokeh.plotting import ColumnDataSource

from bokeh.layouts import gridplot
from bokeh.layouts import row, column
from bokeh.layouts import widgetbox

from bokeh.charts import BoxPlot
from bokeh.charts import Scatter

from bokeh.models import Select
from bokeh.models import Slider
from bokeh.models import Button
from bokeh.models import HoverTool
from bokeh.models import CheckboxGroup, RadioGroup, Toggle

from bokeh.models.widgets import Panel
from bokeh.models.widgets import Tabs

# Perform necessary imports
from bokeh.io import curdoc

# Create a new plot: plot
plot = figure()

# Add a line to the plot
plot.line(x=[1, 2, 3, 4, 5], y=[2, 5, 4, 6, 7])

# Add the plot to the current document
curdoc().add_root(plot)

# ''''''''' Add a slider ''''''#

# Create a slider: slider
slider = Slider(title='my slider', start=0, end=10, step=0.1, value=2)

# Create a widgetbox layout: layout
layout = widgetbox(slider)

# Add the layout to the current document
curdoc().add_root(layout)

# '''''''' Multiple Sliders ''''''''#

# Create first slider: slider1
slider1 = Slider(title='slider1', start=0, end=10, step=0.1, value=2)

# Create second slider: slider2
slider2 = Slider(title='slider2', start=10, end=100, step=1, value=20)

# Add slider1 and slider2 to a widgetbox
layout = widgetbox(slider1, slider2)

# Add the layout to the current document
curdoc().add_root(layout)


# '''' Combining bokeh models into a layout ''''#
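# NOTE: x and y are assumed to be predefined arrays in this section; a
# minimal pair consistent with the sin(scale/x) callback further below:
x = np.linspace(0.3, 10, 300)
y = np.sin(1 / x)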

# Create ColumnDataSource: source
source = ColumnDataSource(data={'x': x, 'y': y})

# Add a line to the plot
plot.line('x', 'y', source=source)

# Create a column layout: layout
layout = column(widgetbox(slider), plot)

# Add the layout to the current document
curdoc().add_root(layout)

# '' Basic callback on widget ''''''#

# Define a callback function: callback
def callback(attr, old, new):
    # Read the current value of the slider: scale
    scale = slider.value

    # Compute the updated y using np.sin(scale/x): new_y
    new_y = np.sin(scale / x)

    # Update source with the new data values
    source.data = {'x': x, 'y': new_y}

# Attach the callback to the 'value' property of slider
slider.on_change('value', callback)

# Create layout and add to current document
layout = column(widgetbox(slider), plot)
curdoc().add_root(layout)

# ''''Updating data sources - Drop down in callback '''#

# Create ColumnDataSource (fertility and female_literacy are assumed to
# be predefined arrays): source
source = ColumnDataSource(data={
    'x': fertility,
    'y': female_literacy
})

# Create a new plot: plot
plot = figure()

# Add circles to the plot
plot.circle('x', 'y', source=source)

# Define a callback function: update_plot
def update_plot(attr, old, new):
    # If the new selection is 'female_literacy', update 'y' to female_literacy
    if new == 'female_literacy':
        source.data = {
            'x': fertility,
            'y': female_literacy
        }
    # Else, update 'y' to population
    else:
        source.data = {
            'x': fertility,
            'y': population
        }

# Create a dropdown Select widget: select
select = Select(title="distribution",
                options=['female_literacy', 'population'],
                value='female_literacy')

# Attach the update_plot callback to the 'value' property of select
select.on_change('value', update_plot)

# Create layout and add to current document
layout = row(select, plot)
curdoc().add_root(layout)

# ''''''''' Synchronise two dropdowns '''''''''''#

# Create two dropdown Select widgets: select1, select2

select1 = Select(title='First', options=['A', 'B'], value='A')
select2 = Select(title='Second', options=['1', '2', '3'], value='1')

# Define a callback function: callback
def callback(attr, old, new):
    # If select1 is 'A'
    if select1.value == 'A':
        # Set select2 options to ['1', '2', '3']
        select2.options = ['1', '2', '3']

        # Set select2 value to '1'
        select2.value = '1'
    else:
        # Set select2 options to ['100', '200', '300']
        select2.options = ['100', '200', '300']

        # Set select2 value to '100'
        select2.value = '100'

# Attach the callback to the 'value' property of select1
select1.on_change('value', callback)

# Create layout and add to current document
layout = widgetbox(select1, select2)
curdoc().add_root(layout)


# ''''''''''Basic button widget '''''''''#

# Create a Button with label 'Update Data'
button = Button(label='Update Data')

# Define an update callback with no arguments: update
def update():
    # Compute new y values (x and N are assumed predefined): y
    y = np.sin(x) + np.random.random(N)

    # Update the ColumnDataSource data dictionary
    source.data = {'x': x, 'y': y}

# Add the update callback to the button
button.on_click(update)

# Create layout and add to current document
layout = column(widgetbox(button), plot)
curdoc().add_root(layout)


# ''''''' Button Styles '''''''#

# Import CheckboxGroup, RadioGroup, Toggle from bokeh.models

# Add a Toggle: toggle
toggle = Toggle(button_type='success', label='Toggle button')

# Add a CheckboxGroup: checkbox
checkbox = CheckboxGroup(labels=['Option 1', 'Option 2', 'Option 3'])

# Add a RadioGroup: radio
radio = RadioGroup(labels=['Option 1', 'Option 2', 'Option 3'])

# Add widgetbox(toggle, checkbox, radio) to the current document
curdoc().add_root(widgetbox(toggle, checkbox, radio))

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

from bokeh.plotting import figure
from bokeh.io import output_file, show
from bokeh.plotting import ColumnDataSource
from bokeh.models import HoverTool
from bokeh.layouts import gridplot
from bokeh.models.widgets import Panel
from bokeh.models.widgets import Tabs
from bokeh.layouts import row, column

# Create a ColumnDataSource from df (df is assumed to be loaded already): source
source = ColumnDataSource(df)

# '''''' Creating Rows of Plots

# Create the first figure: p1
p1 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='female_literacy (% population)')

# Add a circle glyph to p1
p1.circle('fertility', 'female_literacy', source=source)

# Create the second figure: p2
p2 = figure(x_axis_label='population',
            y_axis_label='female_literacy (% population)')

# Add a circle glyph to p2
p2.circle('population', 'female_literacy', source=source)

# Put p1 and p2 into a horizontal row: layout
layout = row(p1, p2)

# Specify the name of the output_file and show the result
output_file('fert_row.html')
show(layout)

# '''''''''''''' Column Plots in Bokeh ''''''#

# Create a blank figure: p1
p1 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='female_literacy (% population)')

# Add circle scatter to the figure p1
p1.circle('fertility', 'female_literacy', source=source)

# Create a new blank figure: p2
p2 = figure(x_axis_label='population',
            y_axis_label='female_literacy (% population)')

# Add circle scatter to the figure p2
p2.circle('population', 'female_literacy', source=source)

# Put plots p1 and p2 in a column: layout
layout = column(p1, p2)

# Specify the name of the output_file and show the result
output_file('fert_column.html')
show(layout)

# ''''''' Nesting Rows & Columns of Plots '''''''#

# Make a column layout that will be used as the second row (mpg_hp,
# mpg_weight and avg_mpg are assumed to be existing figures): row2
row2 = column([mpg_hp, mpg_weight], sizing_mode='scale_width')

# Make a row layout that includes the above column layout: layout
layout = row([avg_mpg, row2], sizing_mode='scale_width')

# Specify the name of the output_file and show the result
output_file('layout_custom.html')
show(layout)

# '''''Gridded Layouts ''''''''#

# Create a list containing plots p1 and p2: row1
row1 = [p1, p2]

# Create a list containing plots p3 and p4 (assumed defined like p1, p2): row2
row2 = [p3, p4]

# Create a gridplot using row1 and row2: layout
layout = gridplot([row1, row2])

# Specify the name of the output_file and show the result
output_file('grid.html')
show(layout)

# ''''''Start Tabbed Layouts ''''#1 Create Panels

# Create tab1 from plot p1: tab1
tab1 = Panel(child=p1, title='Latin America')

# Create tab2 from plot p2: tab2
tab2 = Panel(child=p2, title='Africa')

# Create tab3 from plot p3: tab3
tab3 = Panel(child=p3, title='Asia')

# Create tab4 from plot p4: tab4
tab4 = Panel(child=p4, title='Europe')


# ''''''''''''' Display the tabbed layouts '''''''''''#

# Create a Tabs layout: layout
layout = Tabs(tabs=[tab1, tab2, tab3, tab4])

# Specify the name of the output_file and show the result
output_file('tabs.html')
show(layout)

# '''''''' Linked Axes Plots '''''''#

# Link the x_range of p2 to p1: p2.x_range
p2.x_range = p1.x_range

# Link the y_range of p2 to p1: p2.y_range
p2.y_range = p1.y_range

# Link the x_range of p3 to p1: p3.x_range
p3.x_range = p1.x_range

# Link the y_range of p4 to p1: p4.y_range
p4.y_range = p1.y_range

# Specify the name of the output_file and show the result
output_file('linked_range.html')
show(layout)

# ' Linked brushed data - brushing ''''''''''''''#

# Create ColumnDataSource: source
source = ColumnDataSource(data)

# Create the first figure: p1
p1 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='female literacy (% population)',
            tools='box_select,lasso_select')

# Add a circle glyph to p1
_ = p1.circle('fertility', 'female literacy', source=source)

# Create the second figure: p2
p2 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='population (millions)',
            tools='box_select,lasso_select')

# Add a circle glyph to p2
_ = p2.circle('fertility', 'population', source=source)

# Create row layout of figures p1 and p2: layout
layout = row(p1, p2)

# Specify the name of the output_file and show the result
output_file('linked_brush.html')
show(layout)

# ''''''' Creating Legends '''''''''#

# Add the first circle glyph to the figure p (latin_america and africa
# are assumed to be ColumnDataSources for those country groups)
p.circle('fertility', 'female_literacy',
         source=latin_america, size=10,
         color='red', legend='Latin America')

# Add the second circle glyph to the figure p
p.circle('fertility', 'female_literacy',
         source=africa, size=10,
         color='blue', legend='Africa')

# Specify the name of the output_file and show the result
output_file('fert_lit_groups.html')
show(p)

# '''Legend Position and Style '''''''#

# Assign the legend to the bottom left: p.legend.location
p.legend.location = 'bottom_left'

# Fill the legend background with the color 'lightgray':
# p.legend.background_fill_color
p.legend.background_fill_color = 'lightgray'

# Specify the name of the output_file and show the result
output_file('fert_lit_groups.html')
show(p)

# ''''' Add hover tooltip to plot '''''''#

# Create a HoverTool object: hover
hover = HoverTool(tooltips=[('Country','@Country')])

# Add the HoverTool object to figure p
p.add_tools(hover)

# Specify the name of the output_file and show the result
output_file('hover.html')
show(p)

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import matplotlib.pyplot as plt

# ''''''''''' Specifying Label ''''''''''''''''#

# Specify the label 'Computer Science' (year, computer_science and the
# other subject series are assumed to be predefined arrays)
plt.plot(year, computer_science, color='red', label='Computer Science')

# Specify the label 'Physical Sciences'
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')

# Add a legend at the lower center
plt.legend(loc='lower center')

# Add axis labels and title
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

# '''''''' ############# ;''''''''''''' ################## #

# ''''''''''''''''' Using Annotate ''''''''''''''#

# Plot with legend as before
plt.plot(year, computer_science, color='red', label='Computer Science')
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')
plt.legend(loc='lower right')

# Compute the maximum enrollment of women in Computer Science: cs_max
cs_max = computer_science.max()

# Calculate the year in which there was maximum enrollment
# of women in Computer Science: yr_max
yr_max = year[computer_science.argmax()]

# Add a black arrow annotation
plt.annotate('Maximum', xy=(yr_max, cs_max), xytext=(yr_max+5, cs_max+5),
             arrowprops=dict(facecolor='black'))

# Add axis labels and title
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

# '''''''''''''''''' Modifying Plots '''''''''#

# Import matplotlib.pyplot

# Set the style to 'ggplot'
plt.style.use('ggplot')
print(plt.style.available)
# Create a figure with 2x2 subplot layout
plt.subplot(2, 2, 1)

# Plot the enrollment % of women in the Physical Sciences
plt.plot(year, physical_sciences, color='blue')
plt.title('Physical Sciences')

# Plot the enrollment % of women in Computer Science
plt.subplot(2, 2, 2)
plt.plot(year, computer_science, color='red')
plt.title('Computer Science')

# Add annotation

cs_max = computer_science.max()
yr_max = year[computer_science.argmax()]
plt.annotate('Maximum', xy=(yr_max, cs_max), xytext=(yr_max-1, cs_max-10),
             arrowprops=dict(facecolor='black'))

# Plot the enrollment % of women in Health professions
plt.subplot(2, 2, 3)
plt.plot(year, health, color='green')
plt.title('Health Professions')

# Plot the enrollment % of women in Education
plt.subplot(2, 2, 4)
plt.plot(year, education, color='yellow')
plt.title('Education')

# Improve spacing between subplots and display them
plt.tight_layout()
plt.show()

# '''''''''''''''' Creating a Meshed Fig ''''''''''' #

# Import numpy and matplotlib.pyplot

# Generate two 1-D arrays: u, v
u = np.linspace(-2, 2, 41)
v = np.linspace(-1, 1, 21)

# Generate 2-D arrays from u and v: X, Y
X, Y = np.meshgrid(u, v)

# Compute Z based on X and Y
Z = np.sin(3*np.sqrt(X**2 + Y**2))

# Display the resulting image with pcolor()
plt.pcolor(Z)

# Save the figure to 'sine_mesh.png' before calling show(); saving after
# show() typically writes an empty figure in non-interactive use
plt.savefig('sine_mesh.png')
plt.show()


# '''''''''''' Visualising Bivariate Functions ''''''#

# ''''''''''' Contours and Filled Contours '''#

# Generate a default contour map of the array Z
plt.subplot(2, 2, 1)
plt.contour(X, Y, Z)

# Generate a contour map with 20 contours
plt.subplot(2, 2, 2)
plt.contour(X, Y, Z, 20)

# Generate a default filled contour map of the array Z
plt.subplot(2, 2, 3)
plt.contourf(X, Y, Z)

# Generate a filled contour map with 20 contours
plt.subplot(2, 2, 4)
plt.contourf(X, Y, Z, 20)

# Improve the spacing between subplots
plt.tight_layout()

# Display the figure
plt.show()

# ''''''########''########''''''''''##################### #

# ''''' Colour Map Modifier ''''''''''#

# Create a filled contour plot with a color map of 'viridis'
plt.subplot(2, 2, 1)
plt.contourf(X, Y, Z, 20, cmap='viridis')
plt.colorbar()
plt.title('Viridis')

# Create a filled contour plot with a color map of 'gray'
plt.subplot(2, 2, 2)
plt.contourf(X, Y, Z, 20, cmap='gray')
plt.colorbar()
plt.title('Gray')

# Create a filled contour plot with a color map of 'autumn'
plt.subplot(2, 2, 3)
plt.contourf(X, Y, Z, 20, cmap='autumn')
plt.colorbar()
plt.title('Autumn')

# Create a filled contour plot with a color map of 'winter'
plt.subplot(2, 2, 4)
plt.contourf(X, Y, Z, 20, cmap='winter')
plt.colorbar()
plt.title('Winter')

# Improve the spacing between subplots and display them
plt.tight_layout()
plt.show()

# '''''''' Using hist2d() '''''''''' #

# Generate a 2-D histogram (hp and mpg are assumed to be arrays from
# the auto-mpg dataset)
_ = plt.hist2d(hp, mpg, bins=(20, 20), range=((40, 235), (8, 48)))

# Add a color bar to the histogram
_ = plt.colorbar()

# Add labels, title, and display the plot
plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hist2d() plot')
plt.show()


# ''''''''''''' Plotting with hexbin() '''''''''''#

# Generate a 2-D histogram with hexagonal bins

_ = plt.hexbin(hp, mpg, gridsize=(15, 12), extent=((40, 235, 8, 48)))

# Add a color bar to the histogram
_ = plt.colorbar()

# Add labels, title, and display the plot
plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hexbin() plot')
plt.show()


# ''''''''''''''' Loading and Viewing Images '''''''' #

# Load the image into an array: img

img = plt.imread('480px-Astronaut-EVA.jpg')

# Print the shape of the image
print(img.shape)

# Display the image
plt.imshow(img)

# Hide the axes
plt.axis('off')
plt.show()

# ''''''''''' Pseudocolor Plot from Image Data ''#

# Load the image into an array: img
img = plt.imread('480px-Astronaut-EVA.jpg')

# Print the shape of the image
print(img.shape)

# Compute the sum of the red, green and blue channels: intensity
intensity = img.sum(axis=2)

# Print the shape of the intensity
print(intensity.shape)

# Display the intensity with a colormap of 'gray'
plt.imshow(intensity, cmap='gray')

# Add a colorbar
plt.colorbar()

# Hide the axes and show the figure
plt.axis('off')
plt.show()

# # '''''''''''''Specifying Extents and Aspect Ratio '''''#

# Load the image into an array: img
img = plt.imread('480px-Astronaut-EVA.jpg')

# Specify the extent and aspect ratio of the top left subplot
plt.subplot(2, 2, 1)
plt.title('extent=(-1,1,-1,1),\naspect=0.5')
plt.xticks([-1, 0, 1])
plt.yticks([-1, 0, 1])
plt.imshow(img, extent=(-1, 1, -1, 1), aspect=0.5)

# Specify the extent and aspect ratio of the top right subplot
plt.subplot(2, 2, 2)
plt.title('extent=(-1,1,-1,1),\naspect=1')
plt.xticks([-1, 0, 1])
plt.yticks([-1, 0, 1])
plt.imshow(img, extent=(-1, 1, -1, 1), aspect=1)

# Specify the extent and aspect ratio of the bottom left subplot
plt.subplot(2, 2, 3)
plt.title('extent=(-1,1,-1,1),\naspect=2')
plt.xticks([-1, 0, 1])
plt.yticks([-1, 0, 1])
plt.imshow(img, extent=(-1, 1, -1, 1), aspect=2)

# Specify the extent and aspect ratio of the bottom right subplot
plt.subplot(2, 2, 4)
plt.title('extent=(-2,2,-1,1),\naspect=2')
plt.xticks([-2, -1, 0, 1, 2])
plt.yticks([-1, 0, 1])
plt.imshow(img, extent=(-2, 2, -1, 1), aspect=2)

# Improve spacing and display the figure
plt.tight_layout()
plt.show()


# '''''' Rescale Pixel Intensities '''''''''''''#

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Extract minimum and maximum values from the image: pmin, pmax
pmin, pmax = image.min(), image.max()
print("The smallest & largest pixel intensities are %d & %d." % (pmin, pmax))

# Rescale the pixels: rescaled_image
rescaled_image = 256*(image - pmin) / (pmax - pmin)
print("The rescaled smallest & largest pixel intensities are %.1f & %.1f." %
(rescaled_image.min(), rescaled_image.max()))

# Display the original image in the top subplot
plt.subplot(2, 1, 1)
plt.title('original image')
plt.axis('off')
plt.imshow(image, extent=(-2, 2, -1, 1), aspect=2)

# Display the rescaled image in the bottom subplot
plt.subplot(2, 1, 2)
plt.title('rescaled image')
plt.axis('off')
plt.imshow(rescaled_image, extent=(-2, 2, -1, 1), aspect=2)

plt.show()

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

# '''''''''''' Pandas SQL Query ''''''''''''#
# Import packages
import sqlite3
from sqlalchemy import create_engine
from sqlalchemy import update
import pandas as pd
# Import insert and select from sqlalchemy
from sqlalchemy import insert, select
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
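# NOTE: the statements below use `connection` and reflected tables such
# as `data`, `census` and `state_fact`, none of which are set up in this
# script. A minimal sketch of that setup (table names are assumptions):
from sqlalchemy import MetaData, Table
connection = engine.connect()
metadata = MetaData()
data = Table('data', metadata, autoload=True, autoload_with=engine)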

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)

# Print head of DataFrame

print(df.head())

# Open engine in context manager
# Perform query and save results to DataFrame: df1

with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same result: does df = df1 ?

print(df.equals(df1))

# ''''''''''''#######'''''''##################''''''''#

# Build an insert statement to insert a record into the data table: stmt

stmt = insert(data).values(name='Anna', count=1, amount=1000.00, valid=True)

# Execute the statement via the connection: results

results = connection.execute(stmt)

# Print result rowcount

print(results.rowcount)

# Build a select statement to validate the insert

stmt = select([data]).where(data.columns.name == 'Anna')

# Print the result of executing the query.

print(connection.execute(stmt).first())

# '''''''''###########'''''''''''''''' #
# ''''''####'''''''''''''##########'''''''''#

# Create an insert statement for census: stmt

stmt = insert(census)

# Create an empty list and zeroed row count: values_list, total_rowcount

values_list = []
total_rowcount = 0

# Enumerate the rows of csv_reader (an assumed csv.reader over the
# census file)
for idx, row in enumerate(csv_reader):
    # Create data and append to values_list
    data = {'state': row[0], 'sex': row[1], 'age': row[2], 'pop2000': row[3],
            'pop2008': row[4]}
    values_list.append(data)

    # Every 51 rows, execute the accumulated inserts and reset the list
    if idx % 51 == 0:
        results = connection.execute(stmt, values_list)
        total_rowcount += results.rowcount
        values_list = []

# Build a select statement: select_stmt
select_stmt = select([state_fact]).where(state_fact.columns.name == 'New York')

# Print the results of executing the select_stmt
print(connection.execute(select_stmt).fetchall())

# Build a statement to update the fips_state to 36: stmt
stmt = update(state_fact).values(fips_state=36)

# Append a where clause to limit it to records for New York state
stmt = stmt.where(state_fact.columns.name == 'New York')

# Execute the statement: results
results = connection.execute(stmt)

# Print rowcount
print(results.rowcount)

# Execute the select_stmt again to view the changes
print(connection.execute(select_stmt).fetchall())


# ''''''''''''' Update Multiple Records ''''''#

# Build a statement to update the notes to 'The Wild West': stmt
stmt = update(state_fact).values(notes='The Wild West')

# Append a where clause to match the West census region records
stmt = stmt.where(state_fact.columns.census_region_name == 'West')

# Execute the statement: results
results = connection.execute(stmt)

# Print rowcount
print(results.rowcount)

# ''''''''''' Making Correlated Updates ''' ########

# Build a statement to select name from state_fact: stmt
fips_stmt = select([state_fact.columns.name])

# Append a where clause to Match the fips_state to flat_census fips_code
fips_stmt = fips_stmt.where(
    state_fact.columns.fips_state == flat_census.columns.fips_code)

# Build an update statement to set the name to fips_stmt: update_stmt
update_stmt = update(flat_census).values(state_name=fips_stmt)

# Execute update_stmt: results
results = connection.execute(update_stmt)

# Print rowcount
print(results.rowcount)

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# ''''''''''Coding the Forward Propagation (FP) Algorithm ''''''''#

weights = {'node_1': np.array([4, -5]), 'node_0': np.array([2, 4]),
           'output': np.array([2, 7])}

input_data = np.array([3, 5])

# Calculate node 0 value: node_0_value
node_0_value = (input_data * weights['node_0']).sum()

# Calculate node 1 value: node_1_value
node_1_value = (input_data * weights['node_1']).sum()

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_value, node_1_value])

# Calculate output: output
output = (hidden_layer_outputs * weights['output']).sum()

# Print output
print(output, 'is the basic FP output from model')
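# Worked by hand for these weights: node_0 = 3*2 + 5*4 = 26,
# node_1 = 3*4 + 5*(-5) = -13, output = 26*2 + (-13)*7 = -39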

# ''''''' Apply the Rectified Linear Activation Function '''''''''''''#

# NOTE: The activation function is very useful for tuning model weights ''#


def relu(input):
    '''Rectified linear activation: return input if positive, else 0.'''
    # Calculate the value for the output of the relu function: output
    output = max(input, 0)

    # Return the value just calculated
    return output

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output, 'is the FP_ReLU predicted quantity of transactions')
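# Worked by hand with ReLU: relu(26) = 26, relu(-13) = 0,
# so model_output = 26*2 + 0*7 = 52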


# ''''''''''' Apply Network to many observations/rows of data '''''''#

# Define predict_with_network()
def predict_with_network(input_data_row, weights):
    # Calculate node 0 value
    node_0_input = (input_data_row * weights['node_0']).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights['node_1']).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])

    # Calculate model output
    input_to_final_layer = (weights['output'] * hidden_layer_outputs).sum()
    model_output = relu(input_to_final_layer)

    # Return model output
    return model_output


# Create empty list to store prediction results (input_data is assumed
# here to be a list of row arrays, one per observation)
results = []
for input_data_row in input_data:
    # Append prediction to results
    results.append(predict_with_network(input_data_row, weights))

# Print results
print(results)


# ''''''''''''' Behaviour of a Multi Layer Neural Network ''''''''#

# (weights is assumed here to hold two-layer keys: 'node_0_0',
# 'node_0_1', 'node_1_0', 'node_1_1' and 'output')
def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])

    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (weights['output'] * hidden_1_outputs).sum()

    # Return model_output
    return model_output

output = predict_with_network(input_data)
print(output)


# ''' Calculating Model Errors - Consideration of weight effects''''###

# '''''''' Test Case - Bank Transactions Predictions '''''''##

# ''''''' Coding how weight changes affect accuracy '''''''###

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from keras.layers import Dense
from keras.models import Sequential

# predictors = np.loadtxt('predictors_data.csv', delimiter=',')

predictors = np.loadtxt('aerodata.csv', delimiter=',')
# target is assumed to be a 1-D array of outcomes aligned with the rows
# of predictors; the scalar below is only a placeholder and will not fit
target = 3
# Import necessary modules

# Save the number of columns in predictors: n_cols
# n_cols = predictors.shape[1]

# Set up the model: model
# model = Sequential()

# Add the first layer
# model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second layer
# model.add(Dense(32, activation='relu'))

# Add the output layer
# model.add(Dense(1))

# ''''''''' Compile the Model ''''''''''#

# Specify the model
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Verify that model contains information from compiling
print("Loss function: " + model.loss)

model.fit(predictors, target)

# ''''''''''Define Classification Model - Titanic dataset example '''#

# Import to_categorical for one-hot encoding the target
from keras.utils import to_categorical

# Convert the target to categorical (df is the assumed Titanic
# DataFrame): target
target = to_categorical(df.survived)

# Set up the model
model = Sequential()

# Add the first layer
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))

# Add the output layer
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
model.fit(predictors, target)


# '''''''''''' Making predictions ''''''''''#

# Calculate predictions (pred_data is an assumed array of new
# passengers' predictors): predictions
predictions = model.predict(pred_data)

# Calculate predicted probability of survival: predicted_prob_true
predicted_prob_true = predictions[:, 1]

# print predicted_prob_true
print(predicted_prob_true)



# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import matplotlib.pyplot as plt
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping

# Import the SGD optimizer
from keras.optimizers import SGD

# Create list of learning rates: lr_to_test
lr_to_test = [.000001, 0.01, 1]
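# NOTE: get_new_model() is not defined in this repo. A minimal sketch of
# what it is assumed to return - a fresh, uncompiled network matching
# the classification models below (input_shape assumed defined):
def get_new_model():
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=input_shape))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return model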

# Loop over learning rates
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n' % lr)

    # Build new model to test, unaffected by previous models
    model = get_new_model()

    # Create SGD optimizer with specified learning rate: my_optimizer
    my_optimizer = SGD(lr=lr)

    # Compile the model
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')

    # Fit the model
    model.fit(predictors, target)


# ''''''Evaluate model accuracy on validation dataset ''''''#

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
hist = model.fit(predictors, target, validation_split=0.3)


# '''''' Early Stopping - Optimising the optimisation ''''''''''#

# Import EarlyStopping - already done above

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(predictors, target, epochs=30, validation_split=0.3,
          callbacks=[early_stopping_monitor])


# ''''''''''''' Experimenting with a wider network ''''''#

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Create the new model: model_2
model_2 = Sequential()

# Add the first and second layers
model_2.add(Dense(100, activation='relu', input_shape=input_shape))
model_2.add(Dense(100, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy',
                metrics=['accuracy'])

# Fit model_1 (model_1 is assumed to be a previously built, smaller
# baseline network)
model_1_training = model_1.fit(predictors, target, epochs=15,
                               validation_split=0.2,
                               callbacks=[early_stopping_monitor],
                               verbose=False)

# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15,
                               validation_split=0.2,
                               callbacks=[early_stopping_monitor],
                               verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r',
         model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()


# ''''''''' Adding layers to the model ''''''''' #

# The input shape to use in the first hidden layer
input_shape = (n_cols,)

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50, activation='relu', input_shape=input_shape))
model_2.add(Dense(50, activation='relu'))
model_2.add(Dense(50, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy',
                metrics=['accuracy'])

# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=20,
                               validation_split=0.4,
                               callbacks=[early_stopping_monitor],
                               verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20,
                               validation_split=0.4,
                               callbacks=[early_stopping_monitor],
                               verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r',
         model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()


# '''''' Digit Recognition Model '''''''#

# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
model.fit(X, y, validation_split=0.3)
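# NOTE: X and y are assumed above to be flattened 28x28 digit images and
# one-hot labels. A minimal sketch of loading them, assuming the MNIST
# data shipped with Keras:
from keras.datasets import mnist
from keras.utils import to_categorical

(X_train, y_train), _ = mnist.load_data()
X = X_train.reshape(-1, 784).astype('float32') / 255
y = to_categorical(y_train, num_classes=10)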

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error


# ''''' Rectified Lin Activa Func. ''''''''''' ##


def relu(input):
    '''Rectified linear activation: return input if positive, else 0.'''
    # Calculate the value for the output of the relu function: output
    output = max(input, 0)

    # Return the value just calculated
    return output


# .............###

weights = {'node_1': np.array([4, -5]), 'node_0': np.array([2, 4]),
           'output': np.array([2, 7])}

# '''''' Part 1 End ''''''''''' ###

input_data = np.array([3, 5])
# ''''''''''''' Behaviour of a Multi Layer Neural Network ''''''''#


# (weights is assumed here to hold two-layer keys: 'node_0_0',
# 'node_0_1', 'node_1_0', 'node_1_1' and 'output')
def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])

    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (weights['output'] * hidden_1_outputs).sum()

    # Return model_output
    return model_output

output = predict_with_network(input_data)
print(output)
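# NOTE: the error-calculation examples below call
# predict_with_network(row, weights) with two arguments; the earlier
# single-hidden-layer version from this repo is assumed, repeated here:
def predict_with_network(input_data_row, weights):
    # Hidden layer
    node_0_output = relu((input_data_row * weights['node_0']).sum())
    node_1_output = relu((input_data_row * weights['node_1']).sum())
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    # Output layer (ReLU applied, as in the earlier definition)
    return relu((hidden_layer_outputs * weights['output']).sum())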

# ''''''''''''''''''''' Deep Learning - Part 2 ''''''''''' ##


# ''' Calculating Model Errors - Consideration of weight effects''''###

# '''''''' Test Case - Bank Transactions Predictions '''''''##

# ''''''' Coding how weight changes affects accuracy ''''#'''''###

# The data point you will make a prediction for

input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 1]
             }

# The actual target value, used to calculate the error
target_actual = 3

# Target value used later in the slope examples below
target = 2
# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make perfect prediction (3):
# weights_1
weights_1 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 0]
             }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)


# '''''''''' Scaling up - Multiple Data Points ''''''''''''#

# Create model_output_0
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data (assumed here to be a list of observation rows)
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))

    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0 (target_actuals is
# an assumed array of true values, one per row): mse_0
mse_0 = mean_squared_error(model_output_0, target_actuals)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(model_output_1, target_actuals)

# Print mse_0 and mse_1
print("Mean squared error with weights_0 : %f" % mse_0)
print("Mean squared error with weights_1 : %f" % mse_1)

# ''''''''Calculating Slopes '''''#

# (weights is assumed here to be a 1-D weight array matching
# input_data, not the dict defined earlier)

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = target - preds

# Calculate the slope: slope
slope = 2 * input_data * error

# Print the slope
print(slope)

# '''''''''' Improving the model weights '''''''' #

# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = target - preds

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights + (learning_rate * slope)

# Get updated predictions: preds_updated
preds_updated = (weights_updated * input_data).sum()

# Calculate updated error: error_updated
error_updated = target - preds_updated

# Print the original error
print(error)

# Print the updated error
print(error_updated)

# ''''''' Making multiple updates to weights ''''''' #

n_updates = 20
mse_hist = []
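# NOTE: get_slope() and get_mse() are not defined in this repo. Minimal
# sketches consistent with the slope formula and update sign used above:
def get_slope(input_data, target, weights):
    error = target - (weights * input_data).sum()
    return 2 * input_data * error

def get_mse(input_data, target, weights):
    error = target - (weights * input_data).sum()
    return error ** 2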

# Iterate over the number of updates
for i in range(n_updates):
    # Calculate the slope: slope
    slope = get_slope(input_data, target, weights)

    # Update the weights: weights
    weights = weights + 0.01 * slope

    # Calculate mse with new weights: mse
    mse = get_mse(input_data, target, weights)

    # Append the mse to mse_hist
    mse_hist.append(mse)

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()

 

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 1 15:03:52 2022

@author: Vijayabalan
"""

import numpy as np
import matplotlib.pyplot as plt
# import pandas as pd

from ecdf_func import ecdf
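# NOTE: ecdf_func is the author's own helper module; a standard ECDF
# definition it is assumed to provide:
def ecdf(data):
    """Compute x and y values for an empirical CDF."""
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y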

# Compute mean and standard deviation (belmont_no_outliers is an assumed
# array of Belmont Stakes winning times): mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)


# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, size=10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()


# Take a million samples out of the Normal distribution: samples
samples = np.random.normal(mu, sigma, size=1000000)

# Compute the fraction that are faster than 144 seconds: prob
prob = np.sum(samples <= 144)/len(samples)

# Print the result
print('Probability of besting Secretariat:', prob)

# #################################### #

# Determine the successive Poisson relationship - i.e. the total waiting
# time across two Poisson processes


def successive_poisson(tau1, tau2, size=1):
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)

    return t1 + t2


# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)

# Make the histogram (density=True replaces the deprecated normed=True
# argument)
_ = plt.hist(waiting_times, density=True, histtype='step', bins=100)


# Label axes
plt.xlabel('waiting_times')
plt.ylabel('successive_poisson')


# Show the plot
plt.show()
