🔧 Common Issues & Solutions

Common problems with Keiko Personal Assistant and how to resolve them.

🚀 Startup Issues

Application does not start

Problem: The Keiko service fails to start or crashes immediately.

Symptoms:

systemctl status keiko-api
● keiko-api.service - Keiko Personal Assistant API
   Loaded: loaded (/etc/systemd/system/keiko-api.service; enabled)
   Active: failed (Result: exit-code)

Diagnosis:

# Follow the service logs
journalctl -u keiko-api -f

# Follow the application logs
tail -f /var/log/keiko/app.log

# Validate the configuration
./scripts/validate-config.sh

Common causes and solutions:

Missing dependencies

# Check Python dependencies for conflicts
pip check

# Install missing packages
pip install -r requirements.txt

# Install system dependencies
sudo apt-get install python3-dev postgresql-client redis-tools

Database connection errors

# Test the database connection
psql -h localhost -U keiko_user -d keiko_db -c "SELECT 1;"

# Check the connection string (the output may contain the password)
echo "$DATABASE_URL"

# Start the database service
sudo systemctl start postgresql
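
When the connection string looks wrong, it can help to split it into its parts without printing the password. The following is a minimal sketch using only the standard library; the helper name `describe_database_url` and the example URL (including the `postgresql+asyncpg` scheme) are assumptions for illustration, not values taken from Keiko.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a connection URL into its parts, masking the password."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": "***" if parts.password else None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


# Hypothetical URL; in practice pass os.environ["DATABASE_URL"].
info = describe_database_url(
    "postgresql+asyncpg://keiko_user:s3cret@localhost:5432/keiko_db"
)
print(info)
```

Comparing the printed host, port, user, and database with the arguments of the `psql` test usually shows quickly whether the application and the manual test target the same database.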

Port already in use

# Find the process listening on the port
sudo netstat -tlnp | grep :8000

# Stop the process (fall back to kill -9 only if it ignores SIGTERM)
sudo kill <PID>

# Or configure an alternative port
export KEIKO_PORT=8001
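
The same check can be scripted, for example in a pre-start hook. This is a small sketch that only tests whether something accepts TCP connections on a port; the function name `port_in_use` is hypothetical, and ports 8000 and 8001 are simply the values used above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (8000, 8001):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"Port {candidate}: {state}")
```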

Permission problems

# Log directory permissions
sudo chown -R keiko:keiko /var/log/keiko
sudo chmod 755 /var/log/keiko

# Configuration permissions
sudo chown -R keiko:keiko /opt/keiko/config
sudo chmod 600 /opt/keiko/config/*.yml

Slow startup

Problem: The application starts very slowly (>60 seconds).

Diagnosis:

# Measure startup time
time systemctl start keiko-api

# Enable startup profiling
export KEIKO_PROFILE_STARTUP=true

Solutions:

Tune the database connection pool

# config/database.yml
database:
  default:
    pool_size: 5  # Reduce for a faster startup
    max_overflow: 10
    pool_timeout: 10

Enable lazy loading

# config/app.py
LAZY_LOADING = True
PRELOAD_AGENTS = False

Increase the health-check timeout

# config/config.yml
monitoring:
  health_check_timeout: 30
  startup_timeout: 120
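
To see how a startup timeout interacts with repeated health checks, the polling loop can be sketched as follows. This is an illustration under assumptions, not Keiko's implementation: `wait_until_ready` and `keiko_is_healthy` are hypothetical names, and the 2-second interval is arbitrary. Only the `/health` URL and the two timeout values come from this page.

```python
import time
from typing import Callable
from urllib.request import urlopen


def wait_until_ready(
    check: Callable[[], bool],
    startup_timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `startup_timeout` elapses."""
    deadline = time.monotonic() + startup_timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection refused or reset while the service is still booting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def keiko_is_healthy() -> bool:
    """Single health probe with a 30-second request timeout."""
    with urlopen("http://localhost:8000/health", timeout=30) as response:
        return response.status == 200
```

Calling `wait_until_ready(keiko_is_healthy)` after `systemctl start keiko-api` returns `True` as soon as the endpoint answers with 200 and `False` if it never does within the startup timeout.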

🗄️ Database Issues

Connection pool exhaustion

Problem: "QueuePool limit of size X overflow Y reached"

Symptoms:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 30 reached

Diagnosis:

# Count active connections
psql -d keiko_db -c "SELECT count(*) FROM pg_stat_activity WHERE datname='keiko_db';"

# Connection pool status
curl http://localhost:8000/debug/pool-status

Solutions:

Increase the pool size

# config/database.yml
database:
  default:
    pool_size: 30
    max_overflow: 50
    pool_timeout: 60
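
Before raising the pool size, it is worth checking the result against the server-side connection limit, because every worker process keeps its own pool. A quick back-of-the-envelope calculation, assuming 4 worker processes (an example value, not taken from Keiko) and PostgreSQL's default `max_connections` of 100:

```python
def max_db_connections(pool_size: int, max_overflow: int, workers: int) -> int:
    """Upper bound on connections opened by all worker processes together."""
    return (pool_size + max_overflow) * workers


# pool_size and max_overflow as configured above; 4 workers is an assumption.
needed = max_db_connections(pool_size=30, max_overflow=50, workers=4)
postgres_default_limit = 100  # PostgreSQL's default max_connections

print(f"Peak demand: {needed}, server limit: {postgres_default_limit}")
if needed > postgres_default_limit:
    print("Raise max_connections or put a pooler such as PgBouncer in front.")
```

If the peak demand exceeds the server limit, a larger pool only moves the failure from the application to the database.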

Find connection leaks

# debug/connection_tracker.py
import logging
from sqlalchemy import event
from sqlalchemy.engine import Engine

logger = logging.getLogger(__name__)

# A connection that is checked out but never checked in again is a leak.
@event.listens_for(Engine, "checkout")
def log_checkout(dbapi_connection, connection_record, connection_proxy):
    logger.info("Checked out connection: %s", id(dbapi_connection))

@event.listens_for(Engine, "checkin")
def log_checkin(dbapi_connection, connection_record):
    logger.info("Returned connection: %s", id(dbapi_connection))

Configure connection recycling

database:
  default:
    pool_recycle: 3600  # 1 hour
    pool_pre_ping: true

Slow queries

Problem: Database queries are slow (>1 second).

Diagnosis:

-- Identify slow queries (requires the pg_stat_statements extension; times are in ms)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC;

-- Find long-running active queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND (now() - pg_stat_activity.query_start) > interval '5 minutes';

Solutions:

Add missing indexes

-- Commonly needed indexes
CREATE INDEX CONCURRENTLY idx_tasks_user_status ON tasks(user_id, status);
CREATE INDEX CONCURRENTLY idx_agents_type_active ON agents(type) WHERE status = 'active';
CREATE INDEX CONCURRENTLY idx_audit_logs_created_at ON audit_logs(created_at);

Query optimization

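# Before: N+1 query problem (one extra query per agent)
agents = (await session.execute(select(Agent))).scalars().all()
for agent in agents:
    tasks = await session.execute(select(Task).where(Task.agent_id == agent.id))

# After: eager loading fetches all tasks in one additional query
from sqlalchemy.orm import selectinload

result = await session.execute(
    select(Agent).options(selectinload(Agent.tasks))
)
agents = result.scalars().all()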
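
The difference in round trips can be reproduced without the ORM. The sketch below uses an in-memory SQLite database and a hypothetical two-table schema (not Keiko's real one) to count queries: the loop issues one query per agent, while the batched variant issues a single `IN` query for all agents, which is the strategy `selectinload` applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, agent_id INTEGER, title TEXT);
    INSERT INTO agents VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tasks VALUES (1, 1, 't1'), (2, 1, 't2'), (3, 3, 't3');
""")

query_count = 0


def run(sql: str, params: tuple = ()) -> list:
    """Execute a query and count the round trip."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one() -> dict:
    """One query for the agents, then one more per agent."""
    agents = run("SELECT id FROM agents")
    return {a: run("SELECT title FROM tasks WHERE agent_id = ?", (a,)) for (a,) in agents}


def load_batched() -> dict:
    """One query for the agents, one IN query for all their tasks."""
    agent_ids = [a for (a,) in run("SELECT id FROM agents")]
    marks = ",".join("?" * len(agent_ids))
    rows = run(f"SELECT agent_id, title FROM tasks WHERE agent_id IN ({marks})", tuple(agent_ids))
    grouped = {a: [] for a in agent_ids}
    for agent_id, title in rows:
        grouped[agent_id].append(title)
    return grouped


query_count = 0
load_n_plus_one()
n_plus_one_queries = query_count  # 1 + 3 agents = 4

query_count = 0
load_batched()
batched_queries = query_count  # 2, independent of the number of agents

print(n_plus_one_queries, batched_queries)
```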

Database tuning

-- Tune the PostgreSQL configuration
ALTER SYSTEM SET shared_buffers = '256MB';  -- takes effect only after a server restart
ALTER SYSTEM SET effective_cache_size = '1GB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();  -- applies the reloadable settings

🔄 Redis Issues

Redis connection errors

Problem: "Connection refused" or "Redis server went away"

Diagnosis:

# Check whether Redis responds
redis-cli ping

# Follow the Redis logs
tail -f /var/log/redis/redis-server.log

# Test the connection
redis-cli -h localhost -p 6379 info

Solutions:

Start the Redis service

sudo systemctl start redis-server
sudo systemctl enable redis-server

Check the Redis configuration

# Edit the Redis config
sudo nano /etc/redis/redis.conf

# Important settings:
# bind 127.0.0.1
# port 6379
# maxmemory 512mb
# maxmemory-policy allkeys-lru

Configure the connection pool

# config/redis.py
REDIS_CONFIG = {
    'host': 'localhost',
    'port': 6379,
    'db': 0,
    'max_connections': 20,
    'retry_on_timeout': True,
    'socket_timeout': 5,
    'socket_connect_timeout': 5
}

Memory problems

Problem: Redis runs out of memory.

Diagnosis:

# Redis memory usage
redis-cli info memory

# Largest keys per data type (use --memkeys to rank by memory usage)
redis-cli --bigkeys

# Keyspace hit/miss statistics
redis-cli info stats | grep keyspace

Solutions:

Configure the memory policy

# redis.conf
maxmemory 512mb
maxmemory-policy allkeys-lru

Set key expiration

# Cache with a TTL
await redis_client.setex("cache_key", 3600, value)  # 1 hour

# Batch expiration
for key in large_keys:
    await redis_client.expire(key, 1800)  # 30 minutes

Memory monitoring

# memory_monitor.py
async def monitor_redis_memory():
    info = await redis_client.info('memory')
    used_memory = info['used_memory']
    max_memory = info['maxmemory']

    # maxmemory is 0 when no limit is configured; skip the check in that case
    if max_memory and used_memory > max_memory * 0.9:
        logger.warning(f"Redis memory usage high: {used_memory}/{max_memory}")

🤖 Agent Issues

Agent does not start

Problem: An agent cannot be started or activated.

Diagnosis:

# Check the agent status
curl http://localhost:8000/api/v1/agents/{agent_id}/status

# Check the agent logs
grep "agent_id:{agent_id}" /var/log/keiko/app.log

# Validate the agent configuration
./scripts/validate-agent-config.sh {agent_id}

Solutions:

Fix configuration errors

# agents/example-agent.yml
name: "Example Agent"
type: "specialist"
capabilities:
  - "text_processing"
configuration:
  timeout_seconds: 300
  max_concurrent_tasks: 1

Check dependencies

# Validate agent dependencies
async def validate_agent_dependencies(agent_config):
    for capability in agent_config.capabilities:
        if capability == "text_processing":
            try:
                import transformers
            except ImportError:
                raise AgentDependencyError("transformers package required")

Check resource limits

# Memory limits
ulimit -v

# File descriptor limits
ulimit -n

# Process limits
ulimit -u

Task execution errors

Problem: Tasks fail or hang.

Diagnosis:

# Failed tasks (quote the URL so the shell does not interpret the "?")
curl "http://localhost:8000/api/v1/tasks?status=failed"

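# Hanging tasks (assumes created_at is an ISO 8601 UTC timestamp such as 2024-01-01T12:00:00Z)
curl "http://localhost:8000/api/v1/tasks?status=running" | jq '.[] | select((.created_at | fromdateiso8601) < (now - 3600))'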

# Task logs
grep "task_id:{task_id}" /var/log/keiko/app.log
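
If `jq` is not available, the same filter for hanging tasks can be written in Python. This is a sketch that assumes each task is a JSON object with `id`, `status`, and an ISO 8601 `created_at` field; the `id` field and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_tasks(tasks: list, max_age: timedelta, now: datetime) -> list:
    """Return IDs of running tasks created more than `max_age` before `now`."""
    stuck = []
    for task in tasks:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        created = datetime.fromisoformat(task["created_at"].replace("Z", "+00:00"))
        if task["status"] == "running" and now - created > max_age:
            stuck.append(task["id"])
    return stuck


sample = [
    {"id": "t1", "status": "running", "created_at": "2024-01-01T08:00:00Z"},
    {"id": "t2", "status": "running", "created_at": "2024-01-01T11:30:00Z"},
    {"id": "t3", "status": "failed", "created_at": "2024-01-01T07:00:00Z"},
]
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(find_stuck_tasks(sample, timedelta(hours=1), reference))  # ['t1']
```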

Solutions:

Timeout configuration

# Increase the task timeout
task_config = {
    "timeout_seconds": 600,  # 10 minutes
    "retry_policy": {
        "max_retries": 3,
        "retry_delay": 5.0
    }
}
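
How such a configuration might be applied is sketched below: each attempt gets its own timeout, and a failed attempt is retried after `retry_delay` seconds. The helper `run_with_retry` is hypothetical and only mirrors the three settings shown above; Keiko's actual task runner may behave differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_retry(
    make_call: Callable[[], Awaitable[T]],
    timeout_seconds: float,
    max_retries: int,
    retry_delay: float,
) -> T:
    """Await make_call() with a per-attempt timeout, retrying on failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_seconds)
        except Exception as exc:  # includes asyncio.TimeoutError
            last_error = exc
            if attempt < max_retries:
                await asyncio.sleep(retry_delay)
    raise RuntimeError(f"gave up after {max_retries + 1} attempts") from last_error
```

Note that `make_call` is a factory rather than a coroutine object: a coroutine can be awaited only once, so every attempt needs a fresh one.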

Improve error handling

async def execute_task_with_error_handling(task):
    try:
        result = await agent.execute_task(task)
        return result
    except asyncio.TimeoutError:  # the built-in TimeoutError only from Python 3.11 on
        logger.error(f"Task {task.id} timed out")
        return TaskResult.failure("Task timed out")
    except Exception as e:
        logger.error(f"Task {task.id} failed: {e}", exc_info=True)
        return TaskResult.failure(str(e))

Resource monitoring

# Task resource monitoring (measures the current process, not a single task)
import asyncio
import psutil

async def monitor_task_resources(task_id):
    process = psutil.Process()

    while task_is_running(task_id):
        memory_usage = process.memory_info().rss / 1024 / 1024  # MB
        cpu_usage = process.cpu_percent()  # the first call always returns 0.0

        if memory_usage > 1000:  # 1 GB
            logger.warning(f"Task {task_id} high memory usage: {memory_usage:.0f}MB")

        if cpu_usage > 90:
            logger.warning(f"Task {task_id} high CPU usage: {cpu_usage}%")

        await asyncio.sleep(10)

🌐 API Issues

500 Internal Server Error

Problem: API endpoints return 500 errors.

Diagnosis:

# Follow the error logs
tail -f /var/log/keiko/error.log

# API health check
curl http://localhost:8000/health

# Test a specific endpoint
curl -v http://localhost:8000/api/v1/agents

Solutions:

Check exception handling

# Catch otherwise unhandled exceptions
@app.exception_handler(Exception)
async def general_exception_handler(request: Request, exc: Exception):
    request_id = str(uuid.uuid4())  # log the same ID that is returned to the client
    logger.error(f"Unhandled exception [{request_id}]: {exc}", exc_info=True)
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error", "request_id": request_id}
    )

Dependency injection problems

# Validate the DI container
async def validate_dependencies():
    try:
        container = get_container()
        container.check_dependencies()
    except Exception as e:
        logger.error(f"DI validation failed: {e}")

Rate limiting problems

Problem: "Too Many Requests" (429) errors.

Diagnosis:

# Check the rate-limit response headers
curl -I http://localhost:8000/api/v1/agents

# Inspect rate-limit keys in Redis (SCAN does not block the server the way KEYS does)
redis-cli --scan --pattern "rate_limit:*"

Solutions:

Adjust rate limits

# config/rate_limits.py
RATE_LIMITS = {
    "default": {"requests": 100, "window": 60},
    "auth": {"requests": 10, "window": 60},
    "tasks": {"requests": 50, "window": 60}
}
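
To make the meaning of `requests` and `window` concrete, here is a minimal in-memory fixed-window limiter. It is an illustration only: the class `FixedWindowLimiter` is hypothetical, Keiko's limiter appears to keep its counters in Redis, and the real algorithm may be a sliding window or token bucket instead.

```python
import time
from typing import Callable, Dict, Tuple

RATE_LIMITS = {
    "default": {"requests": 100, "window": 60},
    "auth": {"requests": 10, "window": 60},
}


class FixedWindowLimiter:
    """Counts requests per client and scope inside fixed time windows."""

    def __init__(self, limits: dict, clock: Callable[[], float] = time.time):
        self.limits = limits
        self.clock = clock
        # Old windows are never evicted here; a real limiter would expire them.
        self.counters: Dict[Tuple[str, str, int], int] = {}

    def allow(self, client: str, scope: str = "default") -> bool:
        rule = self.limits.get(scope, self.limits["default"])
        window_index = int(self.clock() // rule["window"])
        key = (client, scope, window_index)
        count = self.counters.get(key, 0)
        if count >= rule["requests"]:
            return False  # the caller would answer with 429 here
        self.counters[key] = count + 1
        return True


limiter = FixedWindowLimiter(RATE_LIMITS)
print(limiter.allow("client-a", "auth"))
```

With the `auth` rule, the eleventh request from the same client inside one 60-second window is rejected, and the counter starts over in the next window.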

Client-side retry logic

import asyncio
from aiohttp import ClientSession

async def api_request_with_retry(url, max_retries=3):
    # Reuse one session for all attempts instead of opening a new one each time
    async with ClientSession() as session:
        for attempt in range(max_retries):
            async with session.get(url) as response:
                if response.status == 429:
                    # Assumes Retry-After carries seconds, not an HTTP date
                    retry_after = int(response.headers.get('Retry-After', 60))
                    await asyncio.sleep(retry_after)
                    continue
                return await response.json()

    raise RuntimeError("Max retries exceeded")
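
The `Retry-After` header may contain either a number of seconds or an HTTP date, and calling `int()` on a date raises `ValueError`. A more tolerant parser could look like the following sketch; the helper name `retry_after_seconds` and the 60-second default are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], default: float = 60.0,
                        now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Neither a number nor a parseable HTTP date.
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

In the retry loop, `retry_after_seconds(response.headers.get('Retry-After'))` would then replace the `int(...)` conversion.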

📊 Performance Issues

High response times

Problem: API responses are slow (>2 seconds).

Diagnosis:

# Measure response time (curl-format.txt holds the --write-out format)
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/api/v1/agents

# Check APM metrics
curl http://localhost:8000/metrics | grep http_request_duration

Solutions:

Implement caching

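# functools.lru_cache must not wrap a coroutine function: it would cache the
# coroutine object, which can be awaited only once.
_agent_config_cache: dict = {}

async def get_agent_config(agent_id: str):
    if agent_id not in _agent_config_cache:
        # Expensive operation, performed once per agent_id
        _agent_config_cache[agent_id] = await load_agent_config(agent_id)
    return _agent_config_cache[agent_id]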
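
Cached configuration can go stale, so an expiry time is often useful. Because `functools.lru_cache` stores the coroutine object rather than its awaited result, it is not suitable for `async def` functions; the sketch below caches the result instead. The class `AsyncTTLCache`, the 300-second TTL, and the stand-in loader are assumptions for illustration.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Tuple


class AsyncTTLCache:
    """Cache awaited results per key for `ttl` seconds."""

    def __init__(self, loader: Callable[[str], Awaitable[Any]], ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.loader = loader
        self.ttl = ttl
        self.clock = clock
        self.entries: Dict[str, Tuple[float, Any]] = {}

    async def get(self, key: str) -> Any:
        hit = self.entries.get(key)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]
        value = await self.loader(key)  # store the result, not the coroutine
        self.entries[key] = (self.clock(), value)
        return value


async def load_agent_config(agent_id: str) -> dict:
    """Hypothetical stand-in for the expensive configuration lookup."""
    await asyncio.sleep(0)
    return {"agent_id": agent_id}


config_cache = AsyncTTLCache(load_agent_config, ttl=300)
print(asyncio.run(config_cache.get("agent-1")))
```

Two concurrent misses for the same key will both call the loader; if that matters, a per-key `asyncio.Lock` can be added.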

Database query optimization

# Batch loading
async def get_agents_with_stats(agent_ids: List[str]):
    query = (
        select(Agent, func.count(Task.id))
        .outerjoin(Task)
        .where(Agent.id.in_(agent_ids))
        .group_by(Agent.id)
    )
    return await session.execute(query)

Async optimization

# Parallel processing
async def process_multiple_tasks(tasks: List[Task]):
    results = await asyncio.gather(*[
        process_single_task(task) for task in tasks
    ])
    return results
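
An unbounded `asyncio.gather` starts every task at once, which can exhaust the database or Redis connection pool when the list is long. A common refinement is to cap concurrency with a semaphore. The helper `gather_bounded` and the limit of 10 below are illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 10,
) -> List[R]:
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def double(n: int) -> int:
    """Toy worker standing in for real task processing."""
    await asyncio.sleep(0)
    return n * 2


print(asyncio.run(gather_bounded(range(5), double, limit=2)))
```

A limit at or below the connection pool size keeps parallel tasks from queueing on pool checkouts.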

Debugging tools

Use the built-in debugging endpoints:
- /debug/health - detailed health information
- /debug/metrics - performance metrics
- /debug/config - current configuration
- /debug/logs - recent log entries

Support channels

For persistent problems:
- GitHub Issues: https://github.com/oscharko/keiko-personal-assistant/issues
- Community forum: https://community.keiko.ai
- Enterprise support: support@keiko.ai