A column of a CSV file contains many possible values and I want to find the top 5 most frequent ones. A few python codes can help me get it done.
Suppose the CSV file is called "test_data.csv", and its content is:
"date","url","method","status_code","bytes"
"Jun 18, 2022","/posts/find_biggest_files",GET,200,"2.7KB"
"Jun 18, 2022","/posts/gunicorn_concurrency",GET,200,3KB
"Jun 18, 2022","/posts/gunicorn_concurrency",GET,301,169B
"Jun 18, 2022","/posts/auto_sqlalchemy_models",GET,200,"2.2KB"
"Jun 18, 2022","/posts/python_memcached",GET,200,"2.4KB"
"Jun 18, 2022","/posts/python_memcached",GET,200,"2.4KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/python_sys_path",GET,200,"2.3KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/gunicorn_concurrency",GET,200,3KB
"Jun 17, 2022","/posts/gunicorn_concurrency",GET,301,169B
"Jun 17, 2022","/posts/vscode_flake8_line_width",GET,200,"2.3KB"
"Jun 17, 2022","/posts/vscode_flake8_line_width",GET,301,169B
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/aws_content_type",GET,200,"2.2KB"
"Jun 17, 2022","/posts/curl_upload_flask",GET,200,"2.7KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/es_less_mem",GET,200,"2.2KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
In fact, it's a part of easydevguide.com
's log file. I want to find the top 5 most frequent urls in this file. url
is the second column of this CSV file.
import csv
from collections import defaultdict
reader = csv.reader(open('test_data.csv'))
counter = defaultdict(int)
for i, row in enumerate(reader):
# ignore the first row (column headers)
if i == 0:
continue
# suppose we want to handle the second column, row[1] is the second column's value.
col_val = row[1]
# everytime we meet a column value, we increase the counter of it.
counter[col_val] += 1
# convert the counter from dict to list, so that we can sort it later.
flat_counter = [(col_val, cnt) for col_val, cnt in counter.items()]
# key argument tells "sorted" to sort on "cnt" of flat_counter's tuples
rs = sorted(flat_counter, key=lambda x: x[1], reverse=True)
# print the top 5
for col_val, cnt in rs[:5]:
print(col_val, cnt)
Output:
/posts/dynamic_table 5
/posts/gunicorn_concurrency 4
/posts/python_memcached 2
/posts/vscode_flake8_line_width 2
/posts/find_biggest_files 1