Learning Machine Learning from Scratch - Preparing and Visualizing Data

First, I'd like to recommend a very useful learning resource: /columns

Data preparation-cleaning

For the first step of machine learning, preparing the data, I have downloaded the required file in advance for convenience:

/files/guoxiaoyu/?t=1726642760&download=true

We rarely get a dataset that fully meets our needs out of the box, so the first step is usually to clean the data. Taking today's data as an example, let's open it up and look at its exact format.

This data is far from ideal. It contains a lot of information, so today we will use the month as the main dimension and compute the average pumpkin price per month. That lets us drop most of the other fields.

Start parsing

Our goal is the average pumpkin price per month, so the fields we care about are month and price. Removing the unnecessary fields by hand and then parsing the rest in plain Python would be cumbersome and inefficient. Instead, we will use a very handy toolkit, Pandas, which simplifies this process.

Pandas learning resource: /

import pandas as pd

# Load the pumpkin CSV into a DataFrame
pumpkins = pd.read_csv('../data/')
print(pumpkins.head())  # first 5 rows
print(pumpkins.tail())  # last 5 rows

head() prints the first 5 rows and tail() prints the last 5; try running them yourself.

There are a lot of columns here. We need to delete the unnecessary ones and keep only the data required for month and price.

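new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
# Drop every column that is not in the keep-list above
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
# Count the missing values that remain in each column
print(pumpkins.isnull().sum())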
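
To see the column filter on a small scale, here is a minimal sketch on made-up rows. The 'Origin' column and all values are assumptions chosen for illustration; only the keep-list mirrors the one above.

```python
import pandas as pd

# Made-up rows; 'Origin' stands in for a column we do not need
df = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', '24 inch bins'],
    'Date': ['9/24/16', '10/1/16', '10/8/16'],
    'Low Price': [15.0, 18.0, None],
    'High Price': [15.0, 18.0, 270.0],
    'Origin': ['MARYLAND', None, 'DELAWARE'],
})

keep = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
# 'Month' does not exist in df yet, so it simply matches nothing here
df = df.drop([c for c in df.columns if c not in keep], axis=1)

print(list(df.columns))   # the columns that survived the filter
print(df.isnull().sum())  # missing values per remaining column
```

Listing a column in the keep-list that does not exist yet is harmless, because the comprehension only iterates over columns that are actually present.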

Note that our file does not have a "Month" column; we will derive it from "Date" later, because it is an important piece of data for this analysis. There is also a "Package" field, which indicates how the produce was packaged or weighed, since different vegetables may be sold in different units.

Usually, we buy things priced by weight in kilograms (kg). However, vendors sometimes sell whole pumpkins for promotional purposes, and this kind of inconsistency in units is common. We need to make sure that we keep only records that use a uniform unit.

Field parsing

Let's start with the simpler field, the date. We extract only the month and ignore the year. This may make the final numbers less accurate, since many factors can cause prices to fluctuate greatly from year to year, but let's set those complications aside for now and deal with the simplest case first.

# Parse the Date column and keep only the month number (1-12)
month = pd.DatetimeIndex(pumpkins['Date']).month
print(month)
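
As a small illustration of what is lost when the year is ignored, here is a sketch on made-up date strings. It uses pd.to_datetime with an explicit format instead of pd.DatetimeIndex; the M/D/YY format is an assumption about how the dates are written, not something the dataset is confirmed to guarantee.

```python
import pandas as pd

# Made-up date strings in an assumed M/D/YY layout
dates = pd.Series(['9/24/16', '10/1/16', '10/7/17'])

parsed = pd.to_datetime(dates, format='%m/%d/%y')
months = parsed.dt.month
years = parsed.dt.year

print(list(months))  # month numbers only
print(list(years))   # the information we are choosing to ignore
```

The two October rows come from different years but share the same month number, so they would later be averaged together.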

Next, we deal with prices. We take only the lowest and highest price of each record and calculate their average.

price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
print(price)
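
A quick sketch of the same midpoint calculation on made-up numbers, including one row with a missing low price. The values are assumptions for illustration only.

```python
import pandas as pd

# Made-up prices; the last row is missing its low price
prices = pd.DataFrame({
    'Low Price': [15.0, 18.0, None],
    'High Price': [17.0, 18.0, 21.0],
})

# Element-wise midpoint of the two columns
price = (prices['Low Price'] + prices['High Price']) / 2
print(price)
```

When either side is missing, the result for that row is NaN, which is one reason to check the missing-value counts before averaging.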

Now let's deal with the weighing method. For the U.S. packaging used in this data, we don't need to dwell on the details; we can simply apply the conversion formulas that are already established. For domestic data, you would need to filter and adjust according to the characteristics of that data.

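# Keep only bushel-based packages, rebuild month and price for those rows, then convert to a price per bushel
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
month = pd.DatetimeIndex(pumpkins['Date']).month
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Price': price})
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
print(new_pumpkins)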

The effect is as follows:
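
To make the conversion concrete, here is a minimal sketch on made-up rows. The package labels and prices are assumptions for illustration; the idea is only that a 1 1/9 bushel package is divided by 1 + 1/9 and a 1/2 bushel package by 1/2, so every price ends up per bushel.

```python
import pandas as pd

# Made-up rows: three bushel-based packages and one that is not
rows = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons',
                'bushel baskets', '24 inch bins'],
    'Low Price': [15.0, 9.0, 20.0, 160.0],
    'High Price': [15.0, 9.0, 20.0, 160.0],
})

# Keep only bushel-based packages
rows = rows[rows['Package'].str.contains('bushel', case=True, regex=True)].copy()
rows['Price'] = (rows['Low Price'] + rows['High Price']) / 2

# Convert each package price to a price per single bushel
big = rows['Package'].str.contains('1 1/9')
half = rows['Package'].str.contains('1/2')
rows.loc[big, 'Price'] = rows.loc[big, 'Price'] / (1 + 1/9)
rows.loc[half, 'Price'] = rows.loc[half, 'Price'] / (1/2)

print(rows[['Package', 'Price']])
```

The bin row is dropped by the filter, the larger carton becomes cheaper per bushel, and the half-bushel carton doubles.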

Data visualization

We will present the results of our analysis with Matplotlib, a powerful data visualization library that lets us create many types of charts to show trends and relationships in the data more intuitively.

Matplotlib introductory learning resource: /

import matplotlib.pyplot as plt

price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(month, price)  # one point per record: month on x, price on y
plt.show()

Here, we have simply plotted month on the x-axis and price on the y-axis as a scatter plot, with no particularly complex graphical design.
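
For readers without the dataset at hand, the same kind of scatter plot can be sketched on made-up numbers. The values, the axis labels, and the output file name are assumptions; the Agg backend is used only so the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Made-up month numbers and per-bushel prices
months = [9, 9, 10, 10, 11]
prices = [15.0, 18.0, 17.0, 20.0, 16.5]

fig, ax = plt.subplots()
points = ax.scatter(months, prices)  # one point per record
ax.set_xlabel("Month")
ax.set_ylabel("Price")

# Write the figure to a temporary file instead of showing it
out_path = os.path.join(tempfile.gettempdir(), "pumpkin_scatter.png")
fig.savefig(out_path)
```

A scatter plot like this shows every record, so several prices stack up vertically over the same month.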

Let's improve the code so that it plots the average price per month:

new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
plt.show()

Let me explain. The groupby method groups the data by the Month column, so all rows with the same month end up in the same group.

Next, ['Price'].mean() calculates the average of the Price column within each group. This gives us the average pumpkin price for each month.

Finally, .plot(kind='bar') draws the calculated averages as a bar chart. Here kind='bar' sets the plot type, so each month gets one bar showing its average price.
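
The grouping step can be checked without drawing anything. Below is a minimal sketch on made-up months and prices; the numbers are assumptions chosen so the averages are easy to verify by hand.

```python
import pandas as pd

# Made-up data: two September rows and three October rows
toy = pd.DataFrame({
    'Month': [9, 9, 10, 10, 10],
    'Price': [14.0, 16.0, 18.0, 20.0, 22.0],
})

# One average per month; the result is a Series indexed by Month
monthly = toy.groupby(['Month'])['Price'].mean()
print(monthly)
```

September averages (14 + 16) / 2 = 15 and October averages (18 + 20 + 22) / 3 = 20, so the bar chart would have two bars at those heights.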

Of course, data visualization is not limited to Matplotlib. There are many other libraries available, so you can pick the tool that suits your preferences and needs.

Summary

At this point, our data processing is essentially complete.

However, as I mentioned earlier in the article, this approach does not adequately explain what is behind the numbers. We are only calculating prices under idealized conditions, without taking into account factors such as year, weather, and weighing method. Nevertheless, we have worked through a general process for data preparation.

What remains is for you to decide how to maintain this process and keep the data clear and accurate.

I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.

💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.

🌟 Welcome to the effortless drizzle! 🌟