Prompt Engineering for Data Analysis: SQL, Python & Analytics

Master AI-powered data analysis with production-ready prompts for SQL queries, Python code generation, pandas, NumPy, visualization, EDA, and stakeholder reporting.

AI can now write SQL, generate Python analytics code, and explain complex insights in plain language. Prompt engineering is what makes those results reliable enough to review and put into production.

1. Why Prompt Engineering Matters for Data Work

As a data professional, you already know how to write SQL and Python. But AI can generate boilerplate code in seconds, suggest better query structures, help you explore data faster with EDA prompts, and turn complex analysis into stakeholder-friendly summaries.

The Catch: Vague Prompts Fail

If your prompts are too vague, you'll get:

  • Generic, non-actionable advice
  • Incorrect or unsafe SQL/Python code
  • Misleading statistical interpretations
  • Outputs that don't match your format or brand

Prompt Engineering Fixes This

Give the AI:

  • Clear context: dataset, columns, business goal
  • Specific instructions: what to do, how to do it
  • Constraints: tools, format, tone, safety rules

2. Core Principles of Data Analysis Prompting

Principle 1: Be Specific, Not Vague

❌ Weak Prompt

"Analyze this dataset and tell me what's interesting."

✅ Strong Prompt

"I have a sales dataset with columns: order_id, customer_id, order_date, product_category, quantity, revenue. Identify: Top 5 product categories by revenue, Monthly revenue trends over the last 12 months, Customer segments with the highest average order value. Present insights in a structured format with numbers."

Principle 2: Provide Context and Goals

Always include:

  • Dataset description: size, columns, time range
  • Business goal: predict sales, explain churn, optimize marketing
  • Constraints: class imbalance, missing values, compute limits, compliance rules

Example:

"I'm analyzing a housing price dataset with 50,000 records (2015–2024). Columns: price, square_footage, num_bedrooms, num_bathrooms, year_built, neighborhood, property_type. Goal: Build a predictive model for property values to help real estate investors identify undervalued properties. The model must be interpretable for non-technical stakeholders."

Principle 3: Use Special Characters for Clarity

  • Double quotes " " for exact phrasing: "Explain why this SQL query causes the error 'column reference is ambiguous'."
  • Backticks ` for code elements: "Create a function calculating correlation between `price` and `square_footage` columns."
  • Triple backticks ``` for code blocks
  • Triple quotes """ for multi-line text (error messages, feedback)

3. Prompting for SQL: Queries, Joins & Optimization

SQL is one of the most powerful use cases for AI in data work. See our complete SQL prompts guide for more templates.

3.1 Basic Data Retrieval

Prompt:

"Write an SQL query to retrieve the names and email addresses of all customers who made a purchase in the last month. Tables: customers (customer_id, name, email) and orders (order_id, customer_id, order_date, amount)."

3.2 Joins and Aggregations

Prompt:

"I need to join three tables: customers, orders, and products. Customers have multiple orders. Orders contain multiple products. I must calculate average spend per customer per product category. Help me write optimized SQL that avoids duplicate counting and handles NULL values correctly. Explain your reasoning for each design decision."

3.3 Query Optimization

Prompt:

"Optimize this Snowflake query for a fact table with 100M+ rows. Table is clustered by date and region. Query filters by date range and aggregates by product_id. Use QUALIFY clauses and result-set caching where appropriate. Rewrite the query to minimize compute cost and micro-partition scans."

3.4 Handling Edge Cases

Prompt:

"My SQL query is returning duplicate rows when joining orders and order_items. Each order can have multiple items. I only want one row per order with total quantity and total amount. How should I structure the query to avoid duplicates? Show the corrected SQL and explain why the original was duplicating."

4. Prompting for Python: Pandas, NumPy & Visualization

Python is where AI generates the most value for data cleaning, analysis, and visualization. Learn more in our code generation guide.

4.1 Data Cleaning and Preprocessing

Prompt:

"I have a customer DataFrame with these issues: Missing values in income and age columns (15% and 8% respectively), Duplicate customer records based on email, Inconsistent formatting in city names (mixed case, abbreviations), Outlier ages above 120 that are data entry errors. Generate pandas code to: Remove duplicate records keeping the most recent entry, Impute missing age with median and missing income with mean by customer_segment, Standardize city names to title case and expand common abbreviations, Cap age outliers at 100, Create a data quality report showing before/after statistics. Include explanatory comments for each step."

4.2 Exploratory Data Analysis (EDA)

Prompt:

"I have an e-commerce dataset with columns: customer_id, order_date, product_category, order_value, payment_method, shipping_region. I want to investigate: Seasonal purchasing patterns across product categories, Products frequently purchased together, Customer segments based on purchasing behavior and value, Regional differences in product preferences. For each investigation: Suggest specific analytical approaches, Recommend appropriate visualizations, Identify relevant summary statistics, Highlight potential pitfalls or considerations. Prioritize insights with clear business implications."

4.3 Feature Engineering

Prompt:

"Given a customer dataset with columns: age, signup_date, last_purchase_date, region, total_spent, order_count, average_order_value, suggest 5 new features that could improve purchase prediction accuracy. For each feature: Provide the feature name and description, Explain the intuition for why it helps prediction, Write pandas code to create it, Note any assumptions or limitations."

4.4 Data Visualization

Prompt:

"Create a bar chart for monthly sales data with: Month on the x-axis, revenue on the y-axis, Different colors for each product category, A secondary axis showing order count as a line, A title and clear legend. Output the code in a clean, well-commented format using seaborn or matplotlib."

5. Analytics: Insights, Reports & Stakeholder Communication

5.1 Executive Summaries

Prompt:

"Summarize these model results for executives with no technical background: Model: Random Forest for customer churn prediction, Accuracy: 84%, Most important features: (1) days_since_last_order, (2) customer_tenure, (3) support_tickets_count, Business impact: Identifying high-risk customers 2 weeks earlier enables targeted retention campaigns. Create a 3–4 sentence summary highlighting business value and key findings. Avoid statistical jargon."

5.2 Documentation and Data Dictionaries

Prompt:

"Generate clear descriptions for these variables in a data dictionary: cust_ltv (numeric, range 0–50,000), churn_risk_score (numeric, range 0–1), engagement_level (categorical: low, medium, high), pref_channel (categorical: email, phone, chat, none). Follow this format: Variable name and type, Business definition in plain language, Valid values or range, Calculation method or source, Usage notes or caveats."

5.3 A/B Test and Experiment Design

Prompt:

"I'm designing an A/B test for a SaaS product where we're testing a new checkout flow. Control group: 5,000 users, current checkout flow. Treatment group: 5,000 users, new checkout flow. Duration: 2 weeks. Primary metric: Conversion rate. Secondary metrics: Average order value, time to purchase. Review this experimental design and identify potential issues (e.g., statistical power, bias sources, business context). Suggest improvements to the design."

6. Advanced Prompting Techniques for Data Work

6.1 Chain of Thought for Complex Analysis

Break complex tasks into logical steps:

Prompt:

"I need to analyze customer churn patterns in subscription data. Before providing recommendations: First, clarify what key metrics and variables would be most relevant for churn analysis. Then, confirm which analytical approaches would be appropriate given these variables. Finally, provide a structured analysis plan including data preparation, modeling approach, and evaluation criteria. Walk through your reasoning at each step."

6.2 Clarify → Confirm → Complete

Align the AI's understanding before executing. Learn more in our few-shot prompting guide.

Prompt:

"I want to build a function detecting outliers in financial transaction data. Before writing code: Clarify what outlier detection methods suit financial data. Confirm the advantages and disadvantages of each approach and recommend one. Complete by writing a Python function implementing the recommended method. Address each step before moving to the next."

6.3 Prompt Chaining for Multi-Step Workflows

Treat your analysis as a pipeline:

  1. Data cleaning: "Clean this raw sales data by handling missing values, removing duplicates, and standardizing formats. Return the cleaned dataset."
  2. Feature creation: "Using the cleaned sales data, create these features: monthly_trend, customer_segment, product_affinity_score. Show the enhanced dataset."
  3. Analysis: "With the feature-enhanced dataset, identify the top 3 factors driving sales differences across customer segments. Provide statistical support for each finding."
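The pipeline idea can be sketched in a few lines. Here `call_model` is a hypothetical stand-in for whatever client you use to send a prompt and get text back; `fake_model` exists only so the example runs without any API.

```python
STEP_TEMPLATES = [
    "Clean this raw sales data by handling missing values, removing duplicates, "
    "and standardizing formats. Return the cleaned dataset.\n\n{previous}",
    "Using the cleaned sales data below, create these features: monthly_trend, "
    "customer_segment, product_affinity_score. Show the enhanced dataset.\n\n{previous}",
    "With the feature-enhanced dataset below, identify the top 3 factors driving "
    "sales differences across customer segments. Provide statistical support "
    "for each finding.\n\n{previous}",
]


def run_chain(templates, first_input, call_model):
    """Feed each step's output into the next step's prompt."""
    outputs = []
    previous = first_input
    for template in templates:
        # str.replace rather than str.format, so braces in the data are left alone.
        prompt = template.replace("{previous}", previous)
        previous = call_model(prompt)
        outputs.append(previous)
    return outputs


seen_prompts = []


def fake_model(prompt):
    """Hypothetical model stub: records the prompt and returns a placeholder."""
    seen_prompts.append(prompt)
    return f"<output of step {len(seen_prompts)}>"


outputs = run_chain(STEP_TEMPLATES, "order_id,revenue\n1,10\n1,10", fake_model)
print(outputs)
```

Keeping every intermediate output, rather than only the last one, makes it possible to inspect where a chain went wrong.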

7. Ready-to-Use Prompt Templates

SQL Query Template

<Role>
You are a senior data analyst specializing in SQL.
</Role>

<Action>
Write an SQL query to [specific task, e.g., "retrieve customers who made a purchase in the last month"].
</Action>

<Context>
- Tables: [table names and key columns]  
- Business goal: [e.g., "identify high-value customers for a retention campaign"]  
- Constraints: [e.g., "avoid duplicate counting, handle NULLs correctly"]  
</Context>

<Expectations>
- Use [database type, e.g., PostgreSQL, Snowflake] syntax  
- Include comments explaining key parts of the query  
- Avoid overly complex subqueries unless necessary  
</Expectations>
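If you reuse the template often, filling it programmatically keeps a placeholder from slipping through unfilled. A minimal sketch, assuming the field names `task`, `tables`, `goal`, `constraints`, and `dialect` (these names are invented here, not part of the template):

```python
SQL_TEMPLATE = """<Role>
You are a senior data analyst specializing in SQL.
</Role>

<Action>
Write an SQL query to {task}.
</Action>

<Context>
- Tables: {tables}
- Business goal: {goal}
- Constraints: {constraints}
</Context>

<Expectations>
- Use {dialect} syntax
- Include comments explaining key parts of the query
- Avoid overly complex subqueries unless necessary
</Expectations>"""

REQUIRED_FIELDS = ("task", "tables", "goal", "constraints", "dialect")


def build_sql_prompt(**fields):
    """Fill the template, refusing to build a prompt with empty fields."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"Missing template fields: {', '.join(missing)}")
    return SQL_TEMPLATE.format(**fields)


prompt = build_sql_prompt(
    task="retrieve customers who made a purchase in the last month",
    tables="customers (customer_id, name, email); "
           "orders (order_id, customer_id, order_date, amount)",
    goal="identify high-value customers for a retention campaign",
    constraints="avoid duplicate counting, handle NULLs correctly",
    dialect="PostgreSQL",
)
print(prompt)
```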

Python Data Analysis Template

<Role>
You are a data scientist with expertise in pandas and visualization.
</Role>

<Action>
Generate Python code to [specific task, e.g., "clean a sales dataset and create a monthly revenue trend chart"].
</Action>

<Context>
- Dataset: [description, columns, size, time range]  
- Goal: [e.g., "identify seasonal patterns and top-performing products"]  
- Tools: [e.g., pandas, seaborn, matplotlib]  
</Context>

<Expectations>
- Use clear, well-commented code  
- Handle missing values, duplicates, and outliers appropriately  
- Create a visualization with appropriate labels and styling  
- Output only the code, no explanations  
</Expectations>

8. Frequently Asked Questions

What makes a good data analysis prompt?

A good data analysis prompt includes dataset structure (columns, types, size), business goal, specific questions to answer, desired output format, and any constraints (database type, libraries, compliance rules).

How do I prompt AI to write optimized SQL queries?

Specify the database type (PostgreSQL, Snowflake, etc.), table structures, relationships, expected data volume, and performance requirements. Ask for explanations of design decisions and optimization strategies.

Can AI generate production-ready Python data analysis code?

Yes, but always validate the output. Provide clear context about libraries (pandas, numpy, seaborn), data types, edge cases (nulls, duplicates), and coding standards. Request comments explaining each step.

How do I get AI to explain statistical results for non-technical stakeholders?

Use role-based prompting: 'Act as a data analyst explaining results to executives with no technical background.' Ask for plain language summaries, business implications, and actionable recommendations.

What's the best way to prompt for EDA (Exploratory Data Analysis)?

Provide dataset description, key variables, and business questions. Ask for specific analytical approaches, appropriate visualizations, relevant summary statistics, and potential pitfalls to watch for.

How do I avoid AI hallucinating incorrect SQL or Python code?

Be specific about table schemas, column names, and data types. Use backticks for code elements. Ask the AI to explain its reasoning and flag any assumptions it's making.
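One way to stay specific is to paste the schema straight from the database instead of retyping it. The sketch below does this for SQLite; `describe_schema` is a hypothetical helper name, and other databases expose the same information through their own catalogs.

```python
import sqlite3


def describe_schema(conn):
    """Return one line per table, with column names in backticks and declared types."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        cols = ", ".join(f"`{row[1]}` {row[2]}" for row in columns)
        lines.append(f"`{table}` ({cols})")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT, email TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     order_date TEXT, amount REAL);
""")

schema_text = describe_schema(conn)
prompt = (
    "Use only the tables and columns listed below. "
    "If something you need is missing, say so instead of guessing.\n\n"
    + schema_text
)
print(prompt)
```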

Can AI help with A/B test design and analysis?

Yes. Provide test parameters (groups, duration, metrics), then ask for statistical power calculations, bias identification, and suggested improvements. Request interpretation of results with confidence intervals.

What's prompt chaining for data workflows?

Break complex analysis into sequential prompts: (1) data cleaning, (2) feature engineering, (3) analysis, (4) visualization, (5) reporting. Each prompt builds on the previous output.

How do I prompt for data visualization code?

Specify library (matplotlib, seaborn, plotly), chart type, data structure, color palette, figure size, labels, and any interactivity requirements. Include accessibility considerations.

Can AI help create data dictionaries and documentation?

Yes. Provide variable names, types, and sample values. Ask for business definitions, valid ranges, calculation methods, and usage notes in a structured format.