LLM สมองบิน EP.6: Prompt Engineering with Evals, or how to write a prompt so it can be measured

Now that EP.5 has covered how to measure a prompt (a dense episode; I'm the one who got overwhelmed, the more I wrote the longer it got, lol), let's look at how to write a prompt for an eval.

First, let's connect to the API. I tweaked the instructor's code just a tiny bit because I'm running this in VS Code. (See EP.3 for a detailed walkthrough of the API connection; I spent a good while writing that one, lol.)

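import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import OpenAI

# Load variables from .env, overriding any already set in the environment
load_dotenv(override=True)
api_key = os.getenv("OPENAI_API_KEY")

# OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI client works
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key,
)

def gen_answer(prompt, model='google/gemini-2.5-flash-lite'):
    # Send one user message; temperature=0 keeps the output as repeatable as possible
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return response.choices[0].message.content

gen_answer('Hello')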

The block above just wraps the boilerplate lines in a function; I've explained it before.

Next is a sample dataset: I asked an LLM to come up with the cases, then copied and pasted them in as test cases.

test_cases = [
    {
        'dietary_restrictions': 'vegetarian',
        'cooking_theme': 'Italian',
        'ingredients': ['tomatoes', 'pasta'],
        'time': '30 minutes'
    },
    {
        'dietary_restrictions': 'none',
        'cooking_theme': 'Thai',
        'ingredients': ['chicken', 'rice'],
        'time': '15 minutes'
    },
    {
        'dietary_restrictions': 'none',
        'cooking_theme': 'Italian',
        'ingredients': ['shrimp', 'garlic', 'olive oil'],
        'time': '15 minutes'
    },
    {
        'dietary_restrictions': 'vegan',
        'cooking_theme': 'French pastry',
        'ingredients': ['flour', 'coconut oil', 'maple syrup'],
        'time': '20 minutes'
    },
    {
        'dietary_restrictions': 'paleo',
        'cooking_theme': 'Japanese breakfast',
        'ingredients': ['eggs', 'mushrooms', 'spinach', 'avocado'],
        'time': '25 minutes'
    }
]
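
Since these cases are pasted in from an LLM, a quick sanity check before running anything can save a wasted eval run: the fill_prompt function used later in this post leaves a placeholder untouched when its key is missing, so a forgotten 'time' would silently stay as {time} in the prompt. Below is a minimal sketch of such a check. check_test_cases, REQUIRED_KEYS, and sample_cases are my own hypothetical names, not from the instructor's code.

```python
# Hypothetical helper (not from the instructor's notebook): flag test cases
# that are missing one of the four keys the prompt template expects.
REQUIRED_KEYS = {'dietary_restrictions', 'cooking_theme', 'ingredients', 'time'}

def check_test_cases(cases):
    """Return a list of (index, missing_keys) for cases lacking a required key."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

# Two illustrative cases; the second one leaves out 'time' on purpose.
sample_cases = [
    {'dietary_restrictions': 'vegetarian', 'cooking_theme': 'Italian',
     'ingredients': ['tomatoes', 'pasta'], 'time': '30 minutes'},
    {'dietary_restrictions': 'none', 'cooking_theme': 'Thai',
     'ingredients': ['chicken', 'rice']},
]

print(check_test_cases(sample_cases))  # [(1, ['time'])]
```

An empty list back means every case has all four keys.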

Next up is the LLM-as-a-judge prompt, and it is loooong, lol.

eval_prompt = """Your task is to evaluate the following AI-generated recipe with EXTREME RIGOR.

Original task description:
Generate a single-serving recipe based on the given dietary restrictions, cooking theme, ingredients, and time constraint. The recipe must follow a specific format with sections for Title, Time, Ingredients, and Instructions.

Original task inputs:
{prompt_inputs}

Solution to Evaluate:
{output}

Criteria you should use to evaluate the solution:
MANDATORY REQUIREMENTS (violation = score 3 or lower):
1. Section compliance: Must have exactly these sections in order: Title, TIME, INGREDIENTS, INSTRUCTIONS.
2. Ingredient usage: ALL specified ingredients from the input must be used in the recipe
3. Dietary compliance: Must strictly follow dietary restrictions (no meat in vegetarian, no gluten ingredients in gluten-free)
SECONDARY CRITERIA:
4. Time feasibility: The recipe can realistically be completed in the stated time by an average home cook
5. Theme authenticity: The recipe genuinely reflects the specified cuisine theme (Italian/Asian/comfort food)
6. Instruction clarity: Each step starts with an action verb, includes specific details (time/temperature/visual cues), and contains one clear action. No vague terms like "cook until done"
7. Format rules:
- Title should be plain text (no special characters or markers)
- TIME: must be followed by the number and "minutes"
- INGREDIENTS: must use bullet points (•) with "ingredient - amount" format
- INSTRUCTIONS: must use numbered list (1. 2. 3. etc)
- NO other sections allowed
8. Length control: Maximum 8 ingredients, maximum 6 instruction steps, no extra sections/stories/tips

Quick Asian Garlic Chicken Stir-Fry
TIME: 15 minutes
INGREDIENTS:
- Chicken breast - 1 lb, cubed
- White rice - 2 cups, cooked
- Garlic - 4 cloves, minced
- Soy sauce - 3 tablespoons
- Vegetable oil - 2 tablespoons
- Green onions - 2 stalks, sliced
- Sesame oil - 1 teaspoon
INSTRUCTIONS:
1. Heat vegetable oil in wok over high heat until shimmering (about 1 minute).
2. Add cubed chicken and cook for 3-4 minutes until golden brown on all sides.
3. Push chicken to sides and add minced garlic to center, stir-fry for 30 seconds until fragrant.
4. Pour soy sauce over chicken and toss everything for 1 minute until well-coated.
5. Add cooked rice and stir-fry for 2 minutes until heated through.
6. Drizzle sesame oil and garnish with sliced green onions before serving.

Scoring Guidelines:
* Score 1-3: Solution fails to meet one or more MANDATORY requirements
* Score 4-6: Solution meets all mandatory requirements but has significant deficiencies in secondary criteria
* Score 7-8: Solution meets all mandatory requirements and most secondary criteria, with minor issues
* Score 9-10: Solution meets all mandatory and secondary criteria
IMPORTANT SCORING INSTRUCTIONS:
* Grade the output based ONLY on the listed criteria. Do not add your own extra requirements.
* If a solution meets all of the mandatory and secondary criteria give it a 10
* Don't complain that the solution "only" meets the mandatory and secondary criteria. Solutions shouldn't go above and beyond - they should meet the exact listed criteria.
* ANY violation of a mandatory requirement MUST result in a score of 3 or lower
* The full 1-10 scale should be utilized - don't hesitate to give low scores when warranted
Output Format
Provide your evaluation as a structured JSON object with the following fields, in this specific order:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement
- "reasoning": A concise explanation of your overall assessment
- "score": A number between 1-10
Respond with JSON. Keep your response concise and direct.
Example response shape:
{
"strengths": string[],
"weaknesses": string[],
"reasoning": string,
"score": number
}
"""

This eval_prompt contains the task description, {prompt_inputs}, {output}, the evaluation criteria, the scoring guidelines, and the output format.
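
Because the prompt asks the judge for four JSON fields and a score from 1 to 10, it can be worth checking that a parsed reply really has that shape before trusting the number. Here is a rough sketch of such a check. check_judge_reply is a hypothetical helper of mine, and both sample replies are made up for illustration, not real model output.

```python
# Hypothetical helper: verify a parsed judge reply matches the shape that
# eval_prompt asks for (four fields, score between 1 and 10).
def check_judge_reply(reply):
    """Return a list of problems found in a parsed judge reply (empty = OK)."""
    problems = []
    for field in ('strengths', 'weaknesses', 'reasoning', 'score'):
        if field not in reply:
            problems.append(f"missing field: {field}")
    if 'score' in reply:
        score = reply['score']
        # bool is a subclass of int in Python, so exclude it explicitly
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 10:
            problems.append(f"score out of range: {score!r}")
    return problems

# Made-up replies for illustration only.
good_reply = {'strengths': ['uses all ingredients'], 'weaknesses': [],
              'reasoning': 'Meets every criterion.', 'score': 9}
bad_reply = {'strengths': [], 'weaknesses': ['too long'], 'score': 11}

print(check_judge_reply(good_reply))  # []
print(check_judge_reply(bad_reply))   # ['missing field: reasoning', 'score out of range: 11']
```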

Next is a function that fills in the values for those curly-brace placeholders ({prompt_inputs}, {output}). Don't worry, I'll explain it. Yay!

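import re

def fill_prompt(template_string, variables):
    # Collect every name wrapped in single braces, e.g. "{time}" -> "time"
    placeholders = re.findall(r"{([^{}]+)}", template_string)
    result = template_string
    for placeholder in placeholders:
        # Names that are not keys in variables are left in the text untouched
        if placeholder in variables:
            result = result.replace(
                "{" + placeholder + "}", str(variables[placeholder])
            )
    # Collapse escaped braces: "{{" -> "{" and "}}" -> "}"
    return result.replace("{{", "{").replace("}}", "}")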

Let's go through it line by line.

placeholders = re.findall(r"{([^{}]+)}", template_string)

This line looks for anything of the form {name} (one or more non-brace characters between a pair of curly braces) and stores the names found inside in placeholders.

result = template_string

Start result off as the whole template string.

for placeholder in placeholders:
    if placeholder in variables:
        result = result.replace(
            "{" + placeholder + "}", str(variables[placeholder])
        )

For each placeholder, if its name is a key in variables, pull the value out of variables, convert it with str(), and swap it in for {name} in result. Placeholders with no matching key are left as they are.

For example, Hello {name} with variables = {'name': 'Chawanee'} → Hello Chawanee

return result.replace("{{", "{").replace("}}", "}")

Return result, collapsing any doubled braces into single ones, e.g. {{apples}} → {apples} (as long as apples is not a key in variables).
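
To see all three behaviours at once (a filled placeholder, a placeholder with no matching key, and doubled braces), here is a small run of the same function. The template text and the servings placeholder are made up for illustration. Note that a list value goes through str(), so it shows up in the prompt in Python list notation.

```python
import re

def fill_prompt(template_string, variables):
    # Same logic as the function in this post, repeated so the example runs alone
    placeholders = re.findall(r"{([^{}]+)}", template_string)
    result = template_string
    for placeholder in placeholders:
        if placeholder in variables:
            result = result.replace(
                "{" + placeholder + "}", str(variables[placeholder])
            )
    return result.replace("{{", "{").replace("}}", "}")

# Illustrative template: {servings} has no matching key, {{json}} is escaped
template = ("Cook a {cooking_theme} dish with {ingredients} in {time}. "
            "Reply as {{json}}. Serves {servings}.")
variables = {
    'cooking_theme': 'Thai',
    'ingredients': ['chicken', 'rice'],
    'time': '15 minutes',
}

filled = fill_prompt(template, variables)
print(filled)
# Cook a Thai dish with ['chicken', 'rice'] in 15 minutes. Reply as {json}. Serves {servings}.
```

If the list notation bothers you, joining the ingredients into a plain string before filling (for example with ', '.join(...)) is one option.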

..time for one deeeep breath..

Now for the function that actually runs the eval.

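import json

import pandas as pd

def gen_eval(prompt, testset, eval_prompt):
    results = []
    for test in testset:
        # Fill the prompt with one test case and get the model's answer
        filled_prompt = fill_prompt(prompt, test)
        response = gen_answer(filled_prompt)
        temp_variables = {
            'prompt_inputs': filled_prompt,
            'output': response
        }
        # Ask the stronger judge model to grade the answer
        filled_eval_prompt = fill_prompt(eval_prompt, temp_variables)
        eval_result = gen_answer(filled_eval_prompt, 'google/gemini-2.5-pro')
        # Strip the Markdown code-fence markers around the JSON before parsing
        eval_result = json.loads(eval_result.replace('```json', '').replace('```', ''))
        # Merge inputs and judge fields into one row (dict | dict needs Python 3.9+)
        result = temp_variables | eval_result
        results.append(result)

    return pd.DataFrame(results)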

Let's walk through the code. First create an empty list called results, then loop over each item in the test cases.

results = []
for test in testset:
    filled_prompt = fill_prompt(prompt, test)
    response = gen_answer(filled_prompt)

For example:

{
 'dietary_restrictions': 'vegetarian',
 'cooking_theme': 'Italian',
 'ingredients': ['tomatoes', 'pasta'],
 'time': '30 minutes'
 }

Each test case like this goes into the fill_prompt function I just explained, and the filled-in text is stored in filled_prompt.

That then goes into gen_answer (the function that calls the API), and the reply is stored in response.

With both results in hand (filled_prompt and response), store them in temp_variables as a dictionary for now.

for test in testset:
    filled_prompt = fill_prompt(prompt, test)
    response = gen_answer(filled_prompt)
    temp_variables = {
        'prompt_inputs': filled_prompt,
        'output': response
    }

Then take eval_prompt (the long LLM-as-a-judge prompt above) and the temp_variables from just now, run them through fill_prompt again, and send the result to gen_answer (this time switching to a stronger model for the eval).

filled_eval_prompt = fill_prompt(eval_prompt, temp_variables)
eval_result = gen_answer(filled_eval_prompt, 'google/gemini-2.5-pro')

Then parse the reply as JSON, after stripping the Markdown code-fence markers the model tends to wrap around it.

Then merge the temp_variables dict with this parsed result using "|" (the dict merge operator, available from Python 3.9), store it in result, and append it to the results list created earlier.

eval_result = json.loads(eval_result.replace('```json', '').replace('```', ''))
result = temp_variables | eval_result
results.append(result)
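
Stripping the fence markers with replace works when the judge returns nothing but the fenced JSON. If it adds a sentence before or after, json.loads will fail. One alternative, which is my own variation and not the instructor's approach, is to parse only the text between the first "{" and the last "}". A minimal sketch follows; parse_judge_json is a hypothetical name and raw_reply is a made-up reply for illustration.

```python
import json

FENCE = "`" * 3  # three backticks, built here so the example is easy to paste

def parse_judge_json(text):
    """Parse the JSON object between the first '{' and the last '}' in text."""
    start, end = text.find('{'), text.rfind('}')
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge reply")
    return json.loads(text[start:end + 1])

# Made-up judge reply: some chatter, then fenced JSON
raw_reply = (
    "Here is my evaluation:\n" + FENCE + "json\n"
    '{"strengths": ["clear steps"], "weaknesses": ["9 ingredients"], '
    '"reasoning": "Too many ingredients.", "score": 6}\n' + FENCE
)

eval_result = parse_judge_json(raw_reply)
temp_variables = {'prompt_inputs': 'filled prompt text', 'output': 'recipe text'}
result = temp_variables | eval_result  # dict merge, Python 3.9+

print(result['score'])  # 6
print(list(result))
# ['prompt_inputs', 'output', 'strengths', 'weaknesses', 'reasoning', 'score']
```

The merged dict keeps the two input keys first and the judge's fields after them, which is the column order you then get in the DataFrame.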

To run a prompt eval, just call the gen_eval function we've just walked through.

prompt = """ Put your prompt here {ingredients} {dietary_restrictions}
{cooking_theme} {time}
"""

results = gen_eval(prompt, test_cases, eval_prompt)
results.to_csv('prompt.csv')
print(f"mean score = {results['score'].mean()}")
results
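
The point of the mean score is to compare prompt versions on the same test cases. Here is a tiny sketch of that comparison in plain Python. The prompt names and every score below are invented purely for illustration; real numbers would come from the score column of each gen_eval run. rank_prompts is a hypothetical helper.

```python
from statistics import mean

# Made-up scores for illustration only; real ones come from results['score'].
scores_by_prompt = {
    'v1_bare': [3, 4, 2, 5, 3],
    'v2_with_format_rules': [7, 8, 6, 9, 7],
}

def rank_prompts(scores):
    """Return (name, mean score) pairs, best first."""
    pairs = ((name, mean(values)) for name, values in scores.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for name, avg in rank_prompts(scores_by_prompt):
    print(f"{name}: mean score = {avg}")
```

With only five test cases a small gap in the mean does not say much, so it also helps to read the judge's reasoning column for the lowest-scoring rows.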

For examples of different prompts, have a look at the instructor's code. As for how to write a good prompt, I covered that in EP.4, so go back and read it first; EP.7 will go into more detail, so follow along there tooo.

As for the eval prompt, the example above works as a reference. From it we can see that a good prompt is:

Clear and direct. Write in the imperative (a command that starts with an action verb) with no preamble, e.g. "Write a paragraph about Hyrox."
Specific. Spell out what you want as numbered points: 1, 2, 3, "shark comes ashore", lol. If you get that one, you're old!!
Explicit about the output. For each of those points, say what you want back: long or short, which format, any words that must appear in the answer, the tone. Or, when you need the model to think, analyze, or untangle a complex problem, ask it to reason it through, but lay out the steps one at a time.
Chunked with XML tags, e.g. an example chunk as <example> ... </example>. Markdown #, ##, ### works too if you prefer it, and so do triple quotes """ ... """. Mixing them is fine.
Backed by examples: edge cases, the output format (e.g. JSON), the style you want, and how to handle messy, malformed input.
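
To make those tips concrete, here is a rough sketch of a recipe prompt that applies them: an imperative opening, numbered rules, XML tags to separate chunks, and the same four placeholders as the test cases. It is my own illustrative prompt, not one from the instructor's notebook, and the rules are simply borrowed from the eval criteria above. The last lines check which placeholders the template expects.

```python
import re

# Hypothetical prompt for illustration; not taken from the instructor's notebook.
prompt = """Write a single-serving recipe.

<inputs>
Dietary restrictions: {dietary_restrictions}
Cooking theme: {cooking_theme}
Ingredients to use: {ingredients}
Time limit: {time}
</inputs>

<rules>
1. Use every ingredient listed in the inputs.
2. Follow the dietary restrictions strictly.
3. Use at most 8 ingredients and at most 6 instruction steps.
4. Output only these sections, in this order: Title, TIME, INGREDIENTS, INSTRUCTIONS.
</rules>
"""

# Same pattern fill_prompt uses, to list the placeholders this template needs
placeholders = re.findall(r"{([^{}]+)}", prompt)
print(placeholders)
# ['dietary_restrictions', 'cooking_theme', 'ingredients', 'time']
```

A prompt like this can be passed straight to gen_eval in place of the stub prompt, since its placeholders match the test-case keys.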

That's it for today. I first planned to cover the advanced prompting moves in this same episode, but on second thought I'll split them out, because it would get too long. Today's ultra-condensed summary wouldn't exist without the lectures from อาจารย์นัท, so thank you, as always. Knowledge is everywhere, and good, free knowledge is out there toooo.

I have to go train for Hyrox again, lol. I'll write more once practice is done.

Bye-byeee! Thanks for following this small woman with dense muscle mass and 21% body fat. My fitness IG is @chaofitchick, lol.