Use Pandas to Analyze and Plot Tomcat Logs

Sometimes we need to analyze Tomcat’s logs by date, RAM, CPU, and so on. pandas is a very useful and powerful data science tool for this job.

import pandas as pd
import matplotlib.pyplot as plt
import re
import glob

# use glob to list all the log files in the current directory
log_files = glob.glob('*.log')
csv_str = ''
for filename in log_files:
    # keep only the lines that contain ' used'; these hold the cpu, gpu, and memory history
    with open(filename) as log_file:
        used = [line for line in log_file if ' used' in line]

    # use regular expressions to extract the RAM usage percentage and the timestamp
    for line in used:
        ram = re.search(r'GB\((.*)%\)', line)
        timestamp = re.search(r'(.*) INFO', line)
        # skip lines that do not match both patterns
        if ram and timestamp:
            csv_str += ram.group(1) + ',' + timestamp.group(1)[:20] + '\n'

# save the usage data to a csv file (you could also build the pandas DataFrame directly)
with open('test.csv', 'w') as myfile:
    myfile.write(csv_str)

# load the csv into a pandas DataFrame
df = pd.read_csv('test.csv', delimiter=',', header=None)
# change the DataFrame column names
df.columns = ['ram', 'ts']
# convert the timestamp column to datetime dtype
df['ts'] = pd.to_datetime(df['ts'])

# show the summary
print(df.head())
print(df.describe())
df.info()  # info() prints its report itself and returns None

# save the DataFrame to a new csv file (with the index column)
df.to_csv('df_csv.csv')

# plot line chart
df.plot(x='ts', y='ram')
plt.xlabel('ts')
plt.ylabel('ram')
plt.show()
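
The comment in the script mentions that the usage data could go into a DataFrame directly instead of through a csv file. A minimal sketch of that variant follows. The sample log lines, the `parse_usage` helper, and the pattern names are all assumptions made for illustration; a real Tomcat log will probably be laid out differently, so the regular expressions and the timestamp slice may need adjusting.

```python
import re

import pandas as pd

# Hypothetical log lines, invented for this sketch; the real layout may differ.
SAMPLE_LINES = [
    "2016-03-01 10:00:00 INFO  monitor - memory 1.2GB(30.5%) used",
    "2016-03-01 10:05:00 INFO  monitor - memory 1.6GB(41.0%) used",
    "2016-03-01 10:06:00 INFO  monitor - server started",
]

# Same idea as the patterns in the script, but non-greedy and precompiled.
RAM_RE = re.compile(r'GB\((.*?)%\)')
TS_RE = re.compile(r'(.*?) INFO')


def parse_usage(lines):
    """Build a DataFrame with 'ts' and 'ram' columns from an iterable of log lines."""
    rows = []
    for line in lines:
        if ' used' not in line:
            continue
        ram = RAM_RE.search(line)
        ts = TS_RE.search(line)
        # Skip lines that do not carry both a percentage and a timestamp.
        if ram is None or ts is None:
            continue
        rows.append({'ts': ts.group(1)[:20].strip(), 'ram': float(ram.group(1))})
    df = pd.DataFrame(rows, columns=['ts', 'ram'])
    df['ts'] = pd.to_datetime(df['ts'])
    return df


df = parse_usage(SAMPLE_LINES)
print(df)
```

To run it on real files, the lines of each log can be passed in place of `SAMPLE_LINES`, for example `parse_usage(open(filename))` inside the `glob` loop, and the resulting frames combined with `pd.concat`.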