Sentiment Analysis of Genuine Customer Feedback in Amazon Book Reviews: EDA (Part 1)
Amazon, the popular American multinational corporation, has completely changed the way we shop online. It has millions of customers and offers a wide variety of products. This makes it a treasure trove of data that we can use to gain valuable insights.
In this article, I take on an exciting data analysis project: we’ll explore sentiment analysis and delve into Amazon product reviews. Using Python, we will uncover insights about what customers really think and feel about the books they purchased.
- Sentiment Analysis
- Understanding the Amazon Books Reviews data
- Exploratory Data Analysis
- Conclusion
- References
Get ready to dive in! Let’s get started on this thrilling journey! 🤠
1. Sentiment Analysis
Sentiment analysis helps us understand the feelings and emotions conveyed in written or spoken online communication. It’s a tool that uses natural language processing to mine text and extract people’s opinions from it. By doing so, we can draw conclusions about what others think and feel about a particular topic.
Well, it seems like sentiment analysis has turned me into a sentiment-retrieving machine! I’m now on my fourth sentiment analysis case study and I’m starting to wonder if I’m becoming more machine than human. Maybe I should check my circuits for any signs of sentiment overload!! 😄
1.1 How Does Sentiment Analysis Work?
You might wonder, “How can a computer understand feelings?” Well, there are many clever methods and algorithms created to analyze sentiments online using machine learning. Artificial intelligence (AI) has become quite common nowadays, considering the big technological advancements we’ve experienced. It’s no surprise that AI can now help us analyze and understand emotions expressed in text on the internet.
Still wondering how it actually works? Let’s break it down in simple terms:
a. Data Collection: First, we gather the text data we want to analyze. This could be social media posts, customer reviews, online comments, or any other form of written communication. Here I have chosen Amazon Books reviews.
b. Text Preprocessing: Before analyzing the sentiments, we need to clean and preprocess the text. This involves removing any irrelevant information like punctuation, special characters, and stop words (common words like “the,” “and,” etc.) that don’t carry much sentiment value.
c. Feature Extraction: Next, we extract meaningful features from the text. These features could be individual words, phrases, or even more complex linguistic structures like n-grams. These features will serve as the input for the sentiment analysis algorithm.
d. Sentiment Classification: This is where the magic happens. We employ various techniques, including machine learning algorithms or lexicon-based approaches, to classify the sentiment of each text snippet. The sentiment can be categorized as positive, negative, neutral, or even more nuanced emotions like joy, anger, or sadness.
e. Model Training and Evaluation: If we’re using machine learning, we train the sentiment analysis model using labeled data, where sentiments are already known. We then evaluate the model’s performance to ensure its accuracy and effectiveness.
f. Sentiment Analysis Output: Once the sentiment analysis model is ready, we apply it to new, unseen text data. The model assigns sentiment labels to each piece of text, allowing us to understand the overall sentiment distribution and trends within the dataset.
g. Interpretation and Insights: Finally, we interpret the results and draw meaningful insights. We can analyze sentiments at different levels, such as document-level sentiment, sentence-level sentiment, or even aspect-level sentiment (specific sentiments related to different aspects of a product or service).
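To make steps b to d a little more concrete, here is a minimal sketch of a lexicon-based classifier in plain Python. The stop-word list, the positive and negative word sets, and the sample reviews are tiny made-up examples for illustration only; a real project would use a proper lexicon or a trained model, and this is not the method used later in this series.

```python
import re

# Tiny illustrative word lists (assumptions for this sketch, not a real lexicon).
STOP_WORDS = {"the", "and", "a", "an", "is", "it", "this", "was", "of", "to", "i"}
POSITIVE = {"good", "great", "love", "loved", "wonderful", "excellent"}
NEGATIVE = {"bad", "boring", "hate", "hated", "terrible", "poor"}


def preprocess(text):
    """Step b: lowercase the text, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def classify(text):
    """Steps c and d: count lexicon hits and map the score to a label."""
    tokens = preprocess(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up sample reviews
reviews = [
    "I loved this book, the characters are wonderful!",
    "Boring plot and terrible pacing.",
    "It arrived on Tuesday.",
]
labels = [classify(r) for r in reviews]
print(labels)
```

A word-counting approach like this is easy to follow but misses negation ("not good") and sarcasm, which is one reason machine learning models are often preferred.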
1.2 Why is sentiment analysis of Amazon product reviews important?
Product reviews play a crucial role in the purchasing decisions of Amazon customers. According to Trustpilot research, nearly 90% of shoppers rely on reviews before making a purchase. With reviews being a determining factor for many customers, it becomes essential to have an effective system in place to monitor and evaluate them. Ensuring the quality and authenticity of product reviews is vital for businesses to build trust, attract potential buyers, and ultimately drive sales. It also helps in enhancing products, addressing specific issues, and meeting customer needs.
Additionally, sentiment analysis helps manage brand reputation by identifying positive and negative sentiments. It enables competitive analysis, market research, and aids customers in making informed decisions. By leveraging sentiment analysis, businesses can gain valuable insights to improve products, stay competitive, and cater to customer preferences, ultimately leading to higher customer satisfaction.
2. Understanding the Amazon Books Reviews data
For this analysis, I collected data from two different websites and merged them together. By combining data from multiple sources, I hoped to gain a more comprehensive view of customer sentiments and opinions, capture a broader range of feedback, and work with a more representative sample. The data links are given in the References.
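A merge like this can be sketched with pandas. The two small DataFrames below are made-up stand-ins for the review data and the book metadata, and joining on the book title (`product_title` against a hypothetical `Title` column) is my assumption for illustration; check the actual column names in your own files.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are made up for illustration).
reviews = pd.DataFrame({
    "product_title": ["The Alchemist", "The Alchemist", "Unknown Book"],
    "star_rating": [5, 4, 3],
    "review_body": ["Loved it", "A good read", "It was fine"],
})
books = pd.DataFrame({
    "Title": ["The Alchemist", "Another Book"],
    "authors": ["Paulo Coelho", "Someone Else"],
    "categories": ["Fiction", "Fiction"],
})

# Assumption: the book title is the join key. An inner join keeps only
# reviews whose title also appears in the book metadata.
merged = reviews.merge(books, left_on="product_title", right_on="Title", how="inner")
merged = merged.drop(columns="Title")  # the key is now duplicated, keep one copy
print(merged.shape)
```

With an inner join, reviews for books missing from the metadata are dropped; `how="left"` would keep them with empty metadata columns instead.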
Here are the data columns for our analysis:
a. Marketplace: Represents the two-letter country code of the marketplace.
b. Customer ID: Contains a random identifier that can be used to aggregate reviews written by a single author.
c. Review ID: Each review is assigned a unique ID.
d. Product ID: Contains the unique ID of the product that the review pertains to. In the multilingual dataset, reviews for the same product in different countries can be grouped by the same product ID.
e. Product Parent: Provides a random identifier that can be used to aggregate reviews for the same product.
f. Product Title: The title of the product (Book).
g. Product Category: Represents the broad product category (Books).
h. Star Rating: The star rating of the review, ranging from 1 to 5.
i. Helpful Votes: Number of helpful votes received by the review.
j. Total Votes: The total number of votes that the review received.
k. Vine: If the review was written as part of the Vine program, it is indicated in this column.
l. Verified Purchase: Denotes whether the review is on a verified purchase or not.
m. Review Headline: The title of the review.
n. Review Body: The main text of the review.
o. Review Date: The date when the review was written.
p. Description: Description of the book.
q. Authors: Names of the authors.
r. Image: The link to the image of the book cover.
s. PreviewLink: The link to access the book on the Google store.
t. Publisher: Name of the publisher.
u. PublishedDate: The date when the book was published.
v. Infolink: The link to get more information about the book.
w. Categories: Genres of books.
x. RatingsCount: Average rating of the book.
Have a look at our data!!

We are going to focus mostly on the review headlines and review bodies.
If you’re still with me and enjoying this blog, I have a fun proposition for you. How about supporting my journey by buying me a book? It’s a small gesture that goes a long way in keeping me inspired and motivated to bring you more valuable content.
3. Exploratory Data Analysis
I am so excited that I can practically hear the data dancing with joy as we are about to embark on our exploratory data analysis adventure!
Let’s get the fun ride started !!
First, I checked for missing values.
Ohh!!! This is a lot…
The dataset is huge, so I simply dropped the rows with null values.
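Here is a small sketch of what this check and cleanup can look like in pandas. The DataFrame is a made-up three-row example; in the real analysis the same two calls would run on the full merged data.

```python
import pandas as pd

# Made-up sample with a few gaps in it
df = pd.DataFrame({
    "review_headline": ["Great", None, "Meh"],
    "review_body": ["Loved it", "Too long", None],
    "star_rating": [5, 2, 3],
})

# Count missing values per column
missing = df.isnull().sum()
print(missing)

# Drop every row that has at least one missing value
clean = df.dropna().reset_index(drop=True)
print(len(df), "->", len(clean))
```

Dropping rows is the simplest option and works when there is plenty of data left; on a smaller dataset it may be worth dropping only rows that are missing the columns you actually need, using the `subset` parameter of `dropna`.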
The next step, and my favorite one, is to visualize the data.
1. Book Categories
I selected the top 15 book categories and visualized them in a pie chart.
The pie chart above shows the top 15 categories with their percentage shares, and the count plot below shows their frequency counts out of the total data values. We can see that the Fiction category is the most popular one.
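The numbers behind charts like these can be computed in a couple of lines. This is a sketch on a made-up `categories` column, not the exact code behind the plots.

```python
import pandas as pd

# Made-up sample of the categories column
df = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "Fiction", "Religion", "Religion", "History"],
})

# Frequency of each category, most common first, limited to the top 15
top = df["categories"].value_counts().head(15)

# Percentage share of each category among the selected ones (what a pie chart shows)
share = (top / top.sum() * 100).round(1)
print(share)
```

From here, `top.plot(kind='pie')` or `top.plot(kind='bar')` would draw the pie chart and the count plot.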
2. Vine
Amazon Vine is an invitation-only program that selects the most insightful reviewers in the Amazon store to serve as Vine Voices. Let’s see whether the reviews in our categories come from the Vine program or not.
import matplotlib.pyplot as plt

# Percentage of Vine (Y) vs. non-Vine (N) reviews within each category
g = amazon_books_data.groupby(['categories'])['vine'].value_counts(normalize=True).sort_values(ascending=False).unstack()\
.mul(100).drop_duplicates().head(15)
g.plot(kind='bar', figsize=(15,5))
plt.title('Top 15 Categories with Vine program [Y, N]', fontsize=15)
plt.grid()
plt.show()

Here we can see that very few categories include Vine reviews, such as ‘Abnormalities, human’, ‘Biography & Autobiography’, ‘Business & Economics’, ‘Education’, ‘Estate Planning’, ‘Fathers’, etc.
3. Books that Reign with the Most Reviews
# Keep rows with more than 4,000 words, longest first
# (word_count is assumed to be a precomputed column)
review_df = amazon_books_data[amazon_books_data['word_count'] > 4000][['product_title', 'word_count']].sort_values(by='word_count', ascending=False)
fig, ax = plt.subplots(figsize=(20, 10))
# Horizontal bar plot
ax.barh(review_df['product_title'], review_df['word_count'])
plt.title('Books most reviewed', fontsize=20)
plt.savefig("most.jpg")
plt.show()

“And the dead shall rise” by Steve Oney comes out on top in this plot, which ranks books by the `word_count` of their reviews. For a better look at this plot you can check my GitHub profile; the link is provided below.
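The `word_count` column used in the plot is not part of the raw data, so it has to be derived first. Here is one way it could be computed, assuming it holds the number of words in each review body (that definition is my assumption). For comparison, the sketch also counts how many reviews each book received, which is a different measure.

```python
import pandas as pd

# Made-up sample reviews
df = pd.DataFrame({
    "product_title": ["Book A", "Book A", "Book B"],
    "review_body": ["Short and sweet", "A much longer review of this book", "Fine"],
})

# Words per review: split on whitespace and count the pieces
df["word_count"] = df["review_body"].str.split().str.len()

# Longest reviews first
longest = df.sort_values(by="word_count", ascending=False)

# Number of reviews per book, a separate notion of "most reviewed"
reviews_per_book = df["product_title"].value_counts()
print(longest[["product_title", "word_count"]])
print(reviews_per_book)
```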
4. Books with the highest rating count
The plot above lists the 15 books with the highest rating counts, and my favorite, Paulo Coelho’s ‘The Alchemist’, has the highest rating count of all.
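A ranking like this could be produced as follows. This is a sketch on made-up numbers, and it assumes the rating count is stored once per book in a `ratingsCount` column and repeated on every review row after the merge, which is why duplicates are dropped first.

```python
import pandas as pd

# Made-up sample: one row per review, with the book-level count repeated
df = pd.DataFrame({
    "product_title": ["The Alchemist", "The Alchemist", "Book B", "Book C"],
    "ratingsCount": [4000.0, 4000.0, 120.0, 35.0],
})

# Keep one row per book, then take the 15 largest rating counts
top_rated = (
    df.drop_duplicates(subset="product_title")
      .nlargest(15, "ratingsCount")
      .set_index("product_title")["ratingsCount"]
)
print(top_rated)
```

Calling `top_rated.plot(kind='barh')` would then give a horizontal bar chart of the result.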
5. Books with ratingCount and star_rating
6. Books with helpful votes and total votes
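# Average helpful and total votes per book (first 10 titles)
# Selecting the numeric columns before mean() avoids errors on text columns
books_votes = amazon_books_data.groupby("product_title")[["helpful_votes", "total_votes"]].mean().head(10)
books_votes.plot(kind='bar', figsize=(15,5))
plt.title('Books with helpful and total votes', fontsize=15)
plt.grid()
plt.show()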

‘ — or Not to be: A Collection of Suicide Notes’ by Marc Etkind stands alone with the highest helpful and total votes. In second place is ‘night, Mother : A play (mermaid Dramabook)’ by Marsha Norman.
7. Book Categories with Highest and Lowest Star Rating
The plots above show the book categories with the highest and lowest star ratings.
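Here is a sketch of how such a ranking can be computed, using a made-up sample and assuming each category is ranked by the mean of its `star_rating` values.

```python
import pandas as pd

# Made-up sample of categories and star ratings
df = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "History", "History", "Cooking"],
    "star_rating": [5, 4, 3, 2, 5],
})

# Mean star rating per category
category_means = df.groupby("categories")["star_rating"].mean()

# Ten best and ten worst rated categories
highest = category_means.nlargest(10)
lowest = category_means.nsmallest(10)
print(highest)
print(lowest)
```

One thing to keep in mind: a category with only one or two reviews can easily land at either extreme, so filtering out categories with very few reviews first may give a fairer picture.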
8. Authors with Highest & Lowest star ratings
# Average star rating per author, highest first
# Selecting star_rating before mean() avoids errors on text columns
author_ratings = amazon_books_data.groupby('authors')['star_rating'].mean().sort_values(ascending=False)
plt.figure(figsize=(4,4))
author_ratings.head(10).plot(kind='barh', color='pink')
plt.title('Authors with Highest Star Rating', fontsize=15)
plt.show()
plt.figure(figsize=(4,4))
author_ratings.tail(10).plot(kind='barh', color='gray')
plt.title('Authors with Lowest Star Rating', fontsize=15)
plt.show()
I have found the authors with the highest and lowest star ratings. But, you know, personally I don’t think of any author as having the lowest star rating. I believe every author is unique and has their own style of writing. They have the incredible ability to express their thoughts and ideas through their books, helping us understand different perspectives and experiences. So, let’s appreciate the diversity of authors and their talent for conveying meaningful messages through their work.🫰
4. Conclusion
In conclusion, our exploratory data analysis of Amazon book reviews has provided valuable insights into the sentiments expressed by customers. By analyzing various data features, we gained a deeper understanding of customer feedback and opinions about the books available on Amazon. Also, while doing the EDA, I selected at least 10 books to add to my reading list. Have you selected yours?🤓
Now it’s time for me to listen to the song “Magic Shop” by BTS.
We will meet at the same place, different time — see you here for part 2!
You can find the Python code on GitHub.
You can reach me on LinkedIn.
Stay tuned!
5. References
- Data: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
- Data: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv
- https://www.repustate.com/blog/amazon-review-analysis/
- https://www.kaggle.com/code/rishabhvyas/reviews-eda-and-sentiment-ananlysis
- https://brandmentions.com/blog/sentiment-analysis/
- https://www.analyticsvidhya.com/blog/2022/04/a-comprehensive-overview-of-sentiment-analysis/