Operations with Frovedis DataFrame

Nov 1, 2021, Shoichiro Yokotani, Application Development Expert, AI Platform division

By using Python, you can select the data format such as List, Dictionary, Tuple in the standard library as the data destination. In addition, the Python libraries pandas and Numpy provide flexible operation functions.

This article will focus specifically on pandas. pandas provides data analysts with functions for formating labeled data structures.

After merging, aggregating, and slicing multiple tables using pandas, you can check the statistical information of your data and perform analysis using data analysis algorithms. In order to handle data stored in List or Dictionary format, it is necessary to create processing code. With pandas, various processes can be realized using the functions of DataFrame. It is faster to use the pandas method than to make changes to the List or Dictionary format data by loop processing.
Also, the larger the data, the greater the difference in processing speed.

When handling relatively small data on SX-Aurora TSUBASA, you can execute analysis on the Vector Engine by getting the data with pandas DataFrame and converting it to Frovedis DataFrame. At this time, the data is transferred from the x86 CPU main memory to the Vector Engine memory.

The functionality provided by the Frovedis DataFrame is equivalent to a subset of the pandas version. You can use functions for exchanging data with pandas DataFrame as well as basic operation functions such as Select, Join, Sort, and Groupby. Data shaping can be made flexible by linking with mathematical arithmetic processing using Numpy's multidimensional array object, handling of time series data of pandas, and data input / output function.

Let's take a look at some examples of working with Frovedis DataFrame while using Jupyter notebook. First, perform data operations such as select, sort, and group by with small data.


Frovedis DataFrame operation example using sample data


In [1]:
import os
import numpy as np
import pandas as pd
from frovedis.exrpc.server import FrovedisServer
from frovedis.dataframe.df import FrovedisDataframe

Starting Frovedis server


In [2]:
FrovedisServer.initialize("mpirun -np 2 {}".format(os.environ['FROVEDIS_SERVER']))

Out[2]:
'[ID: 1] FrovedisServer (Hostname: handson02, Port: 42383) has been initialized with 2 MPI processes.'

Two types of sample data are prepared in dictionary format. Converting to Frovedis DataFrame via pandas DataFrame. Data is transferred to the Vector Engine's memory as it is converted from pandas to Frovedis DataFrame.


In [3]:
peopleDF ={
      'Ename': ['Michael', 'Andy', 'Tanaka', 'Raul', 'Yuta'],
      'Age': [29, 30, 27, 19, 31],
      'Country': ['USA', 'England', 'Japan', 'France', 'Japan'],
      }

In [4]:
countryDF ={
      'Ccode': [1, 2, 3, 4],
      'Country': ['USA', 'England', 'Japan', 'France'],
      }

In [5]:
pdf1 = pd.DataFrame(peopleDF)
pdf2 = pd.DataFrame(countryDF)
fdf1 = FrovedisDataframe(pdf1)
fdf2 = FrovedisDataframe(pdf2)

Showing Frovedis Data Frame.


In [6]:
# display created frovedis dataframes
print ("* print Frovedis DataFrame")
print (fdf1.to_pandas_dataframe())
print (fdf2.to_pandas_dataframe())

* print Frovedis DataFrame
  Ename Age Country
Index      
0 Michael 29 USA
1 Andy 30 England
2 Tanaka 27 Japan
3 Raul 19 France
4 Yuta 31 Japan
  Ename Country
Index    
0 1 USA
1 2 England
2 3 Japan
3 4 France

Column is selected by specifying the column name.


In [7]:
# select demo
print ("* select Ename and Age")
print (fdf1[["Ename","Age"]].to_pandas_dataframe())

* select Ename and Age
  Ename Age
0 Michael 29
1 Andy 30
2 Tanaka 27
3 Raul 19
4 Yuta 31

Filtering by specifying "age" and "country name".


In [8]:
# filter demo
print ("* filter by Age > 19 and Contry == 'Japan'")
print (fdf1[fdf1.Age > 19 and fd1.Country == 'Japan'].to_pandas_dataframe())

* filter by Age > 19 and Country == 'Japan'
  Ename Age Country
0 Tanaka 27 Japan
1 Yuta 31 Japan

Sorting by "age".


In [9]:
# sort demo
print ("* sort by Age (descending order)")
print (fdf1.sort("Age",ascending = False).to_pandas_dataframe()) # single column, descending

* sort by Age (descending order)
  Ename Age Country
0 Yuta 31 Japan
1 Andy 30 England
2 Michael 29 USA
3 Tanaka 27 Japan
4 Raul 19 France

Sorting by specifying multiple columns.


In [10]:
print ("* sort by Country and Age")
print (fdf1.sort(["Country", "Age"]).to_pandas_dataframe()) # multiple column

* sort by Country and Age
  Ename Age Country
0 Andy 30 England
1 Raul 19 France
2 Tanaka 27 Japan
3 Yuta 31 Japan
4 Michael 29 USA

Grouping by "country name", and totaling the maximum, minimum, and average ages in each group.


In [11]:
# groupby demo
print ("* groupby Country and max/min/mean of Age and count of Ename")
out = fdf1.groupby('Country')
out = out.agg({'Age': ['max','min','mean']})
out2 = out[["Country","max_Age","min_Age","mean_Age"]]
print(out2.to_pandas_dataframe())

* groupby Country and max/min/mean of Age and count of Ename
  Country max_Age min_Age mean_Age
0 England 30 30 30.0
1 Japan 31 27 29.0
2 France 19 19 19.0
3 USA 29 29 29.0

Renaming the column.


In [12]:
# renaming demo
print ("* rename Country to Cname")
fdf3 = fdf2.rename({'Country' : 'Cname'})
print(fdf3.to_pandas_dataframe())

* rename Country to Cname
  Ccode Cname
index    
0 1 USA
1 2 England
2 3 Japan
3 4 France

Two DataFrames are integrated by specifying the column name.


In [13]:
# join after column renaming
print ("* merge (join) two tables")
out = fdf1.merge(fdf3, left_on="Country", right_on="Cname") # with defaults
print(out.to_pandas_dataframe())

* merge (join) two tables
  Ename Age Country Ccode Cname
index          
0 Michael 29 USA 1 USA
1 Andy 30 England 2 England
2 Tanaka 27 Japan 3 Japan
3 Raul 19 France 4 France
4 Yuta 31 Japan 3 Japan

Integration and sorting of two DataFrames. Displaying the selected result.


In [14]:
# operation chaining: join -> sort -> select -> show
print ("* chaining: merge two tables, sort by Age, and select Age, Ename and Country")
out = fdf1.merge(fdf3, left_on="Country", right_on="Cname") \ .sort("Age")[["Age", "Ename", "Country"]]
print(out.to_pandas_dataframe())

* chaining: merge two tables, sort by Age, and select Age, Ename and Country
  Age Ename Country
0 19 Raul France
1 27 Tanaka Japan
2 29 Michael USA
3 30 Andy England
4 31 Yuta Japan

Displaying statistical information of data in DataFrame.


In [15]:
# column statistics
print ("describe: ")
print (fdf1.describe())
print ("\n")

describe:
  Age
count 5.000000
mean 27.200000
std 4.816638
sum 136.000000
min 19.000000
max 31.000000

Creating a new Frovedis DataFrame "joined" by integrating pandas DataFrame and Frovedis DataFrame.


In [16]:
# merging with panda dataframe
print ("* merge with pandas table")
pdf2.rename(columns={'Country' : 'Cname'},inplace=True)
joined = fdf1.merge(pdf2, left_on='Country', right_on='Cname')
print(joined.to_pandas_dataframe())

* merge with pandas table
  Ename Age Country Ccode Cname
Index          
0 Michael 29 USA 1 USA
1 Andy 30 England 2 England
2 Tanaka 27 Japan 3 Japan
3 Raul 19 France 4 France
4 Yuta 31 Japan 3 Japan

Converting Frovedis DataFrame to pandas DataFrame.


In [17]:
# conversion
print ("* convert Frovedis DataFrame to Pandas DataFrame")
print (df1.to_pandas_dataframe())
print ("\n")

* convert Frovedis DataFrame to Pandas DataFrame
  Ename Age Country
Index      
0 Michael 29 USA
1 Andy 30 England
2 Tanaka 27 Japan
3 Raul 19 France
4 Yuta 31 Japan

In [18]:
# conversion
FrovedisServer.shut_down()

Next, let's look at an operation example using Kaggle's Covid-19 vaccine data.


Processing with the Kaggle COVID-19 World Vaccination Progress dataset

Frovedis DataFrame Operation Example using Covid-19 Vaccine Data Set

https://www.kaggle.com/gpreda/covid-world-vaccination-progress?select=country_vaccinations_by_manufacturer.csv

in [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import time, os
import itertools
import matplotlib.pyplot as plt
from frovedis.exrpc.server import FrovedisServer
from frovedis.dataframe.df import FrovedisDataframe as fd

Starting Frovedis Server.


In [2]:
FrovedisServer.initialize("mpirun -np 2 {}".format(os.environ['FROVEDIS_SERVER']))

Out[2]:
'[ID: 1] FrovedisServer (Hostname: handson02, Port: 39739) has been initialized with 2 MPI processes.'

Loading Vaccine Data Set into pandas DataFrame.


In [3]:
manufacturers = pd.read_table('../data/country_vaccinations_by_manufacturer.csv', sep=',', engine='python').dropna()

In [4]:
manufacturers.head()

Out[4]:
  location date vaccine total_vaccinations
0 Chile 2020-12-24 Pfizer/BioNTech 420
1 Chile 2020-12-25 Pfizer/BioNTech 5198
2 2 Chile 2020-12-26 Pfizer/BioNTech 8338
3 Chile 2020-12-27 Pfizer/BioNTech 8649
4 Chile 2020-12-28 Pfizer/BioNTech 8649

Converting from pandas to Frovedis DataFrame. At this time, data is transferred to the memory of the vector engine.


In [5]:
fd_manufacturers = fd(manufacturers)

Inoculating countries included in the Frovedis Data Frame are being extracted. After that, the extracted data is converted to NumPy, converted to a list, and displayed.


In [6]:
location = fd_manufacturers['location')
location = location.to_pandas_dataframe().to_numpy()
x = set(list(itertools.chain.from_iterable(location)))
print(x)
{'Latvia', 'Germany', 'Iceland', 'Czechia', 'Chile', 'Lithuania', 'Romania', 'Italy', 'France', 'United States'}

Displaying Frovedis DataFrame stats.


In [7]:
fd_manufacturers.describe()

Out[7]:
  total_vaccinations
count 3.491000e+03
mean 4.885988e+06
std 2 1.635967e+07
sum 1.705698e+10
min 0.000000e+00
max 1.341169e+08

Getting Data by country from Frovedis DataFrame.


In [8]:
fd_US = fd_manufacturers[fd_manufacturers.location=='United States']
fd_Germany = fd_manufacturers[fd_manufacturers.location=='Germany']
fd_Chile = fd_manufacturers[fd_manufacturers.location=='Chile']
fd_Chile.to_pandas_dataframe().head()
Out[8]:
  location date vaccine total_vaccinations
0 Chile 2020-12-24 Pfizer/BioNTech 420
1 Chile 2020-12-25 Pfizer/BioNTech 5198
2 2 Chile 2020-12-26 Pfizer/BioNTech 8338
3 Chile 2020-12-27 Pfizer/BioNTech 8649
4 Chile 2020-12-28 Pfizer/BioNTech 8649

The total number of vaccinations is being calculated for each type of vaccine. Using Frovedis groupby and agg, the total is calculated and displayed for each vaccine. Here, the aggregated results of U.S., Germany, and Chile are displayed.


In [9]:
fdUS_out = fd_US.groupby('vaccine'.agg({'total_vaccinations': ['sum']})
fdGermany_out = fd_Germany.groupby('vaccine'.agg({'total_vaccinations': ['sum']})
fdChile_out = fd_Chile.groupby('vaccine'.agg({'total_vaccinations': ['sum']})

In [10]:
fdUS_out.sort("sum_total_vaccinations",ascending=False).to_pandas_dataframe().head()
Out[10]:
  vaccine sum_total_vaccinations
0 Pfizer/BioNTech 6572931045
1 Moderna 5789377678
2 Johnson&Johnson 298233690

In [11]:
fdGermany_out.sort("sum_total_vaccinations",ascending=False).to_pandas_dataframe().head()
Out[11]:
  vaccine sum_total_vaccinations
0 Pfizer/BioNTech 1006759304
1 Oxford/AstraZeneca 227902088
2 Moderna 63056380
3 Johnson&Johnson 68781

In [12]:
fdChile_out.sort("sum_total_vaccinations",ascending=False).to_pandas_dataframe().head()
Out[12]:
  vaccine sum_total_vaccinations
0 Sinovac 656309318
1 Pfizer/BioNTech 82842894

Extracting the number of vaccinations made by Pfizer for U.S. and Germany.


In [13]:
US_Pfizer = fd_US[fd_US.vaccine=='Pfizer/BioNTech']
US_Pfizer.to_pandas_dataframe().head()
Out[13]:
  location date vaccine total_vaccinations
0 United States 2021-01-12 Pfizer/BioNTech 5488697
1 United States 2021-01-13 Pfizer/BioNTech 6025872
2 United States 2021-01-15 Pfizer/BioNTech 7153268
3 United States 2021-01-19 Pfizer/BioNTech 8874811
4 United States 2021-01-20 Pfizer/BioNTech 9281063

In [14]:
Germany_Pfizer = fd_Germany[fd_Germany.vaccine=='Pfizer/BioNTech']
Germany_Pfizer.to_pandas_dataframe().head()
Out[14]:
  location date vaccine total_vaccinations
0 Germany 2020-12-27 Pfizer/BioNTech 24473
1 Germany 2020-12-28 Pfizer/BioNTech 42813
2 Germany 2020-12-29 Pfizer/BioNTech 92363
3 Germany 2020-12-30 Pfizer/BioNTech 154903
4 Germany 2020-12-31 Pfizer/BioNTech 204951

Converting time series data of Pfizer vaccine inoculation in U.S. and Germany to Pandas DataFrame. The graph of the converted data is shown below.


In [15]:
US_data = US_Pfizer.to_pandas_dataframe()
Germany_data = Germany_Pfizer.to_pandas_dataframe()
fig = plt.figure(figsize=(16,8))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)

x1 = range(0,len(US_data['date']),25)
ax1.plot(US_data['date']),US_data['total_vaccinations'])
ax1.set_xticks(x1)
ax1.set_title('United States: Pfizer', size=25)
ax1.set_xlabel('Date')
ax1.set_ylabel(''Total vaccinations x100000000')

ax2.plot(Germany_data['date'], Germany_data['total_vaccinations'])
x2 = range(0,len(Germany_data['date']),25)
ax2.plot(Germany_data['date']),Germany_data['total_vaccinations'])
ax2.set_xticks(x2)
ax2.set_title('Germany: Pfizer', size=25)
ax2.set_xlabel('Date')
ax2.set_ylabel(''Total vaccinations x100000000')

plt.show()

This concludes the explanation of data processing using Frovedis DataFrame.

Related Links