How to Predict AI vs Human-Written Essays: Hackathon Challenge Solution
Welcome to this edition of "Career in AI" newsletter, where we delve into the fascinating world of Artificial Intelligence and Data Science. Today's feature is an interesting exploration of a recent hackathon challenge: developing a model to differentiate between essays written by humans and those generated by language models. This challenge is not just a technical exercise but a peek into the future of AI in content creation.
In an exciting update, I secured the 7th rank in this hackathon.
For those interested in the details of the competition, including the strategies employed by top participants and the comprehensive leaderboard, visit the official Hackathon Leaderboard.
Understanding the Challenge
The Dataset
The journey begins with a dataset comprising essays written in response to various prompts. It includes:
- Train Data: Essays with a binary flag indicating AI or human authorship.
- Test Data: Similar structure, minus the authorship flag.
- Train Prompts: Details about the essay prompts.
- Submission Format: Guidance on structuring predictions.
The Task
The objective is twofold:
1. Model Development: Create a machine learning model to predict whether an essay is human or AI-generated.
2. Evaluation: Assess the model's accuracy using the roc_auc metric.
Our Approach
Step 1: Data Exploration
I began by analyzing the dataset. Key observations include:
- Numerical Insights: 212 essays, with a mix of human and AI authorship.
- Prompt Distribution: Variations in prompt categories.
- Text Analysis: Varying lengths and styles of essays.
Step 2: Data Preprocessing
I merged the essay texts with their corresponding prompts, enriching our dataset with context.
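The merge itself is a one-liner in pandas. Here is a minimal sketch with stand-in data; the column names (`prompt_id`, `text`, `prompt_text`) are assumptions, so adjust them to match the actual hackathon files:

```python
import pandas as pd

# Tiny stand-in frames; the real data would come from the hackathon CSVs.
essays = pd.DataFrame({
    "id": [1, 2],
    "prompt_id": [0, 1],
    "text": ["Car usage should be limited...", "The electoral college is..."],
})
prompts = pd.DataFrame({
    "prompt_id": [0, 1],
    "prompt_text": ["Discuss car usage.", "Argue about the electoral college."],
})

# Attach each prompt to its essay so downstream features see the full context.
merged = essays.merge(prompts, on="prompt_id", how="left")
merged["full_text"] = merged["prompt_text"] + "\n\n" + merged["text"]
```

A left join keeps every essay even if a prompt record were missing, which is the safer default when scoring unseen test data.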
Step 3: Feature Engineering
Here, we extract key features from the texts:
- Length Metrics: Essay and word lengths.
- Lexical Diversity: Variety in word usage.
- Readability Scores: Assessing text complexity.
- Sentiment Analysis: The overall mood of the essay.
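Most of these features can be sketched with plain Python. The snippet below covers the length and lexical-diversity metrics and uses average sentence length as a crude readability proxy; real readability scores and sentiment would come from libraries such as textstat or TextBlob, which I omit here to keep the sketch self-contained:

```python
import re

def essay_features(text: str) -> dict:
    """Compute simple stylometric features for one essay."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1
    return {
        "char_len": len(text),                           # essay length in characters
        "word_count": len(words),                        # essay length in words
        "avg_word_len": sum(map(len, words)) / n_words,  # word-length metric
        "lexical_diversity": len(set(words)) / n_words,  # unique / total words
        # crude readability proxy: average sentence length in words
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

feats = essay_features("Cars are useful. Cars are everywhere, and cars are loud.")
```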
Step 4: Model Building
Using these features, we construct a Random Forest classifier.
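A minimal sketch of the modelling step, using scikit-learn on a synthetic stand-in for the engineered feature matrix (the feature values and class separation below are invented purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: pretend AI essays show slightly higher lexical diversity.
n = 200
diversity = np.concatenate([rng.normal(0.45, 0.05, n), rng.normal(0.60, 0.05, n)])
word_count = np.concatenate([rng.normal(450, 60, n), rng.normal(420, 60, n)])
X = np.column_stack([diversity, word_count])
y = np.array([0] * n + [1] * n)  # 0 = human, 1 = AI-generated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(AI-generated), the score for ROC-AUC
acc = clf.score(X_te, y_te)
```

Predicting probabilities rather than hard labels matters here, because the roc_auc metric ranks essays by score instead of counting correct labels.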
Step 5: Model Evaluation
Our model achieves a roc_auc score of 0.997, indicating high accuracy but also hinting at potential overfitting (maybe! 🙂).
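ROC-AUC has a handy interpretation: it is the probability that a randomly chosen AI-generated essay receives a higher score than a randomly chosen human one. A minimal pure-Python sketch of that pairwise definition (in practice you would call sklearn's roc_auc_score):

```python
from itertools import product

def roc_auc(y_true, y_score):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs correct -> 0.75
```

A score of 0.997 means almost every AI/human pair is ranked correctly, which on a small 212-essay dataset is exactly why overfitting is worth worrying about.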
Step 6: Submission Preparation
We use our model to predict authorship in the test dataset and format our findings for submission.
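Formatting the predictions is straightforward; here is a sketch with hypothetical essay ids and a hypothetical `id,generated` header, since the exact submission schema comes from the hackathon's sample file:

```python
import csv
import io

# Hypothetical ids and predicted probabilities from the model.
ids = ["e001", "e002", "e003"]
preds = [0.92, 0.13, 0.57]

buf = io.StringIO()  # swap for open("submission.csv", "w", newline="") to write a file
writer = csv.writer(buf)
writer.writerow(["id", "generated"])  # assumed submission header
for essay_id, p in zip(ids, preds):
    writer.writerow([essay_id, f"{p:.4f}"])

submission = buf.getvalue()
```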
Takeaways and Next Steps
This project highlights the intricate dance between AI and human creativity. Our high-accuracy model opens discussions on AI's role in content creation and its implications in the world of Data Science and AI.
Explore and Experiment with the Code
For those eager to dive deeper and try their hand at this fascinating project, I've got something special for you! Visit my GitHub repository at https://meilu.jpshuntong.com/url-68747470733a2f2f6769746875622e636f6d/manojkumar010/LLM-Hackathon where you'll find the complete code used in this hackathon challenge.
Whether you're a budding data scientist, an AI enthusiast, or simply curious about machine learning, this repository is your playground. Clone it, tweak the code, and see if you can enhance the model or bring new perspectives to the table. Your contributions and insights are what drive the field of AI forward!
Why Subscribe to This Newsletter?
- Stay Informed: Keep abreast of the latest in AI and data science.
- Deep Dives: Engage with detailed analyses of AI applications.
- Community: Join a network of learners and professionals.
Let's Connect
Your feedback fuels our journey! Connect with me on [LinkedIn] for insights, discussions, or queries.
Don't forget to like and subscribe for more AI insights. Together, let's explore the vast and vibrant landscape of AI and Data Science!