moeny-matt 60ab832583 Initial commit

2024-11-20 18:27:41 -05:00

Project overview

Your goal is to build a Next.js app that lets users upload PDF files, uses OpenAI's structured output feature to extract information from them, and converts the result to an Excel file. You will be using Next.js 14, shadcn, Tailwind, and Lucide icons.

Core functionality

1. File Upload & Schema Definition

Users should be able to upload one or more PDF files
Users should be able to define data points they want to extract:
- Individual fields (single value extractions)
- Groups (array of objects with consistent structure)
- For each group, users can define multiple fields inside, or even other groups
There should be a 'Start Extraction' button
Set a default schema to showcase how this works:
- company: name of company
- address: address of company
- total sum: total amount we purchased
- items (group):
  - item: name of item
  - unit price: unit price of item
  - quantity: quantity we purchased
  - sum: total amount we purchased
Server-side file processing
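
The field and group definitions above can be modelled as a small recursive type. This is a minimal sketch, not a prescribed implementation: the names FieldDefinition, GroupDefinition, SchemaNode, defaultSchema, and countLeafFields are hypothetical and chosen only for illustration.

```typescript
// Hypothetical types for the user-defined extraction schema (names are illustrative).
interface FieldDefinition {
  kind: "field";
  name: string;
  description: string;
}

interface GroupDefinition {
  kind: "group";
  name: string;
  // A group holds fields, or further groups, with a consistent structure.
  children: SchemaNode[];
}

type SchemaNode = FieldDefinition | GroupDefinition;

// Default schema copied from the list above.
const defaultSchema: SchemaNode[] = [
  { kind: "field", name: "company", description: "name of company" },
  { kind: "field", name: "address", description: "address of company" },
  { kind: "field", name: "total sum", description: "total amount we purchased" },
  {
    kind: "group",
    name: "items",
    children: [
      { kind: "field", name: "item", description: "name of item" },
      { kind: "field", name: "unit price", description: "unit price of item" },
      { kind: "field", name: "quantity", description: "quantity we purchased" },
      { kind: "field", name: "sum", description: "total amount we purchased" },
    ],
  },
];

// Counts single-value fields at any nesting depth, e.g. for a summary in the UI.
function countLeafFields(nodes: SchemaNode[]): number {
  let count = 0;
  for (const node of nodes) {
    count += node.kind === "field" ? 1 : countLeafFields(node.children);
  }
  return count;
}
```

The kind discriminator lets the UI and the server walk the same structure recursively, which is what nested groups require.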

2. Text Extraction

Use LlamaParser for PDF text extraction (server-side)
For each file, combine all document chunks into the complete text. Make sure to return the full text of all documents, not just the first one (documents[0])
The LlamaParser text extraction should happen immediately after the user uploads files in the UI, without waiting for a button click
Strictly follow the '1. LlamaParser Documentation' section below as the code implementation example
After each file is uploaded, it should be displayed as an item on the page, showing the file name and a button that previews the full extracted text
Users can keep adding new files to the list; previously uploaded files should remain displayed
Server-side processing only
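
Combining all chunks of a file, rather than reading only documents[0], can be sketched as below. It relies only on the text property that appears in the sample output in the LlamaParser documentation section. The ParsedDocument type, the combineDocumentText name, and the blank-line separator are assumptions made for this example.

```typescript
// Minimal shape of a parsed chunk; only the `text` property is used here.
interface ParsedDocument {
  text: string;
}

// Joins the text of every chunk, in order, so nothing after documents[0] is lost.
// The blank-line separator is an assumption; any separator would work.
function combineDocumentText(documents: ParsedDocument[]): string {
  return documents.map((document) => document.text).join("\n\n");
}
```

The result of reader.loadData(path) could be passed to this function on the server, and the combined string returned to the UI for the preview.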

3. Data Processing

After clicking on 'Start Extraction', the data should be sent to OpenAI for processing across all files
Use OpenAI structured output for information extraction
Strictly follow the '2. OpenAI Documentation' section below as the code implementation example
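
To process every file, one request per file can be prepared from its extracted text. The sketch below only builds the messages array in the shape used by the OpenAI documentation example. ParsedFile, ChatMessage, and buildExtractionMessages are hypothetical names, and the system prompt is adapted from that example rather than prescribed.

```typescript
// Hypothetical shape of one uploaded file after text extraction.
interface ParsedFile {
  fileName: string;
  text: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds the messages for one file. The system prompt is adapted from the
// research-paper example in the OpenAI documentation section.
function buildExtractionMessages(file: ParsedFile): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "You are an expert at structured data extraction. You will be given " +
        "unstructured text from a PDF document and should convert it into the given structure.",
    },
    { role: "user", content: "File: " + file.fileName + "\n\n" + file.text },
  ];
}
```

Each resulting array would be passed as the messages option of the parse call, together with the response format built from the user-defined schema.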

4. File Download

Combine data processed from multiple PDFs into one excel file
When there are nested structures like {'company': xxx, 'items': [{'item': xxx, 'unit price': xxx, 'quantity': xxx, 'sum': xxx}]}, they should be flattened into one row per nested entry when generating the excel file
Implement proper error handling and type safety
Enable excel file download
Implement temporary file cleanup
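
One way to flatten the nested structure is to emit one row per group entry and repeat the parent fields on each row. The following is a sketch under that assumption. The type names, the flattenRecord function, and the dotted column names such as items.item are illustrative choices, not requirements.

```typescript
type Scalar = string | number | boolean | null;

// An extracted record: single values plus groups (arrays of records).
interface ExtractedRecord {
  [key: string]: Scalar | ExtractedRecord[];
}

type FlatRow = Record<string, Scalar>;

// Flattens one record into spreadsheet rows. Each group entry becomes its own row,
// with the parent's single-value fields repeated. Sibling groups produce a cross product.
function flattenRecord(record: ExtractedRecord, prefix: string = ""): FlatRow[] {
  let rows: FlatRow[] = [{}];
  for (const key of Object.keys(record)) {
    const value = record[key];
    const column = prefix === "" ? key : prefix + "." + key;
    if (Array.isArray(value)) {
      const childRows: FlatRow[] = [];
      for (const child of value) {
        for (const childRow of flattenRecord(child, column)) {
          childRows.push(childRow);
        }
      }
      if (childRows.length === 0) continue; // empty group: keep the parent rows unchanged
      const expanded: FlatRow[] = [];
      for (const row of rows) {
        for (const childRow of childRows) {
          expanded.push({ ...row, ...childRow });
        }
      }
      rows = expanded;
    } else {
      rows = rows.map((row) => ({ ...row, [column]: value }));
    }
  }
  return rows;
}
```

Rows from several PDFs can be concatenated into one array before being written to a single sheet by whichever spreadsheet library the project adopts.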

Doc

1. LlamaParser Documentation

First, get an api key. We recommend putting your key in a file called .env that looks like this:

LLAMA_CLOUD_API_KEY=llx-xxxxxx

Set up a new TypeScript project in a new folder, we use this:

npm init
npm install -D typescript @types/node

LlamaParse support is built-in to LlamaIndex for TypeScript, so you'll need to install LlamaIndex.TS:

npm install llamaindex dotenv

Let's create a parse.ts file and put our dependencies in it:

import {
  LlamaParseReader,
  // we'll add more here later
} from "llamaindex";
import 'dotenv/config'

Now let's create our main function, which will load in fun facts about Canada and parse them:

async function main() {
  // save the file linked above as canada.pdf, or change this to match
  const path = "./canada.pdf";

  // set up the llamaparse reader
  const reader = new LlamaParseReader({ resultType: "markdown" });

  // parse the document
  const documents = await reader.loadData(path);

  // print the parsed document
  console.log(documents)
}

main().catch(console.error);

Now run the file:

npx tsx parse.ts

Congratulations! You've parsed the file, and should see output that looks like this:

[
  Document {
    id_: '02f5e252-9dca-47fa-80b2-abdd902b911a',
    embedding: undefined,
    metadata: { file_path: './canada.pdf' },
    excludedEmbedMetadataKeys: [],
    excludedLlmMetadataKeys: [],
    relationships: {},
    text: '# Fun Facts About Canada\n' +
      '\n' +
      'We may be known as the Great White North, but ...etc...

2. OpenAI Documentation

Make sure to use the gpt-4o model and zod for defining data structures.

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

const ResearchPaperExtraction = z.object({
  title: z.string(),
  authors: z.array(z.string()),
  abstract: z.string(),
  keywords: z.array(z.string()),
});

const completion = await openai.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [
    { role: "system", content: "You are an expert at structured data extraction. You will be given unstructured text from a research paper and should convert it into the given structure." },
    { role: "user", content: "..." },
  ],
  response_format: zodResponseFormat(ResearchPaperExtraction, "research_paper_extraction"),
});

const research_paper = completion.choices[0].message.parsed;

Important Implementation Notes

0. Adding logs

Always add server side logs to your code so we can debug any potential issues
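
A small helper keeps server-side log lines consistent and easy to search. This is only a sketch: formatLog, serverLog, and the line format are hypothetical, and any logging approach that records the same information would do.

```typescript
type LogLevel = "info" | "warn" | "error";

// Builds one log line: level, scope (e.g. the route or step), message, optional details.
function formatLog(
  level: LogLevel,
  scope: string,
  message: string,
  details?: Record<string, unknown>
): string {
  const suffix = details ? " " + JSON.stringify(details) : "";
  return "[" + level.toUpperCase() + "] [" + scope + "] " + message + suffix;
}

// Writes the line to the server console; errors go to stderr.
function serverLog(
  level: LogLevel,
  scope: string,
  message: string,
  details?: Record<string, unknown>
): void {
  const line = formatLog(level, scope, message, details);
  if (level === "error") {
    console.error(line);
  } else {
    console.log(line);
  }
}
```

For example, serverLog("info", "parse", "Parsed file", { fileName: "a.pdf" }) could be called right after text extraction finishes.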

1. Project setup

All new components should go in /components at the root (not in the app folder) and be named like example-component.tsx unless otherwise specified
All new pages go in /app
Use the Next.js 14 app router
All data fetching should be done in a server component and pass the data down as props
Client components (useState, hooks, etc) require that 'use client' is set at the top of the file

2. Server-Side API Calls:

All interactions with external APIs (e.g., LlamaParse, OpenAI) should be performed server-side.
Create dedicated API routes as route handlers in the app/api directory (the app router equivalent of pages/api) for each external API interaction.
Client-side components should fetch data through these API routes, not directly from external APIs.

3. Environment Variables

Store all sensitive information (API keys, credentials) in environment variables.
Use a .env.local file for local development and ensure it's listed in .gitignore.
For production, set environment variables in the deployment platform (e.g., Vercel).
Access environment variables only in server-side code or API routes.
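
Failing fast on a missing key makes configuration problems obvious in the server logs. The helper below is a sketch. requireEnv is a hypothetical name, and the environment is passed in as a parameter so the function stays easy to test.

```typescript
type Env = Record<string, string | undefined>;

// Returns the variable's value, or throws with a clear message if it is missing or blank.
// Intended for server-side code only.
function requireEnv(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error("Missing required environment variable: " + name);
  }
  return value;
}
```

In a route handler this would be called as requireEnv("LLAMA_CLOUD_API_KEY", process.env) before the reader is created.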

4. Error Handling and Logging

Implement comprehensive error handling in both client-side components and server-side API routes.
Log errors on the server-side for debugging purposes.
Display user-friendly error messages on the client-side.
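
The split between server-side detail and client-side wording can be sketched as follows. ErrorResponse, toErrorResponse, and the message text are hypothetical. The point is only that the technical detail is logged on the server while the client receives a generic message.

```typescript
// Hypothetical error payload returned to the client.
interface ErrorResponse {
  ok: false;
  message: string;
}

// Logs the technical detail on the server and returns a message that is safe to display.
function toErrorResponse(scope: string, error: unknown): ErrorResponse {
  const detail = error instanceof Error ? error.message : String(error);
  console.error("[ERROR] [" + scope + "] " + detail);
  return {
    ok: false,
    message: "Something went wrong while processing your files. Please try again.",
  };
}
```

A route handler could wrap its work in try/catch and return this object with a 500 status when the catch branch runs.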

5. Type Safety

Use TypeScript interfaces for all data structures, especially API responses.
Avoid using the any type; instead, define proper types for all variables and function parameters.
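
Data arriving over the network can be typed as unknown and narrowed with a type guard instead of being cast to any. A minimal sketch, assuming a hypothetical ParsedFileResponse shape for the text-extraction route:

```typescript
// Hypothetical response shape for the text-extraction API route.
interface ParsedFileResponse {
  fileName: string;
  text: string;
}

// Type guard: narrows an unknown value (e.g. a parsed JSON body) to ParsedFileResponse.
function isParsedFileResponse(value: unknown): value is ParsedFileResponse {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return typeof candidate.fileName === "string" && typeof candidate.text === "string";
}
```

After the guard returns true, TypeScript treats the value as ParsedFileResponse, so fileName and text can be used without further casts.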

6. API Client Initialization

Initialize API clients (e.g., LlamaParseReader, OpenAI) in server-side code only.
Implement checks to ensure API clients are properly initialized before use.
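
One possible pattern for the initialization check is a lazy getter that creates the client on first use and throws if creation yields nothing. The sketch below is generic, and lazyClient is a hypothetical helper rather than part of any SDK.

```typescript
// Wraps a factory so the client is created once, on first use, and verified before use.
function lazyClient<T>(name: string, create: () => T | null | undefined): () => T {
  let instance: T | null = null;
  let created = false;
  return () => {
    if (!created) {
      const candidate = create();
      if (candidate === null || candidate === undefined) {
        throw new Error(name + " client failed to initialize");
      }
      instance = candidate;
      created = true;
    }
    return instance as T;
  };
}
```

Server-side code could then define a getter such as lazyClient("OpenAI", () => new OpenAI()) and call it inside each route handler.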

7. Data Fetching in Components

Where a client-side component must fetch data itself, use React hooks (e.g., useEffect) and call the app's API routes rather than external APIs.
Implement loading states and error handling for all data fetching operations.

8. Next.js Configuration

Utilize next.config.mjs for environment-specific configurations.
Use the env property in next.config.mjs only for non-sensitive values: anything set there is inlined into the JavaScript bundle, so API keys must not be placed in it.

9. CORS and API Routes

Use Next.js API routes to avoid CORS issues when interacting with external APIs.
Implement proper request validation in API routes.
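
Request validation can be kept as a pure function that the route handler calls before doing any work. The sketch below assumes a hypothetical request body of the form { texts: string[] } for the extraction route. The type and function names are illustrative.

```typescript
// Hypothetical request body for the extraction route.
interface ExtractionRequest {
  texts: string[];
}

type ValidationResult =
  | { valid: true; request: ExtractionRequest }
  | { valid: false; problems: string[] };

// Checks an unknown request body and reports every problem found.
function validateExtractionRequest(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) {
    return { valid: false, problems: ["Request body must be a JSON object."] };
  }
  const texts = (body as Record<string, unknown>).texts;
  if (!Array.isArray(texts) || texts.length === 0) {
    return { valid: false, problems: ["texts must be a non-empty array."] };
  }
  const cleaned: string[] = [];
  const problems: string[] = [];
  texts.forEach((entry, index) => {
    if (typeof entry === "string" && entry.trim() !== "") {
      cleaned.push(entry);
    } else {
      problems.push("texts[" + index + "] must be a non-empty string.");
    }
  });
  if (problems.length > 0) {
    return { valid: false, problems: problems };
  }
  return { valid: true, request: { texts: cleaned } };
}
```

When valid is false, the handler could respond with status 400 and the problems list. Otherwise it continues with the typed request.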

10. Component Structure

Separate concerns between client and server components.
Use server components for initial data fetching and pass data as props to client components.

11. Security

Never expose API keys or sensitive credentials on the client-side.
Implement proper authentication and authorization for API routes if needed.

12. Special Syntax

When using shadcn, use npx shadcn@latest add xxx instead of npx shadcn-ui@latest add xxx; the shadcn-ui package is deprecated
