refactor project structure and add documents.

2026-05-14 02:00:09 +09:00
parent a0c0231613
commit f4a73099a0
963 changed files with 957378 additions and 1366 deletions
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/best-practices.md
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/best-practices.md
@@ -0,0 +1,115 @@
+# Best Practices and API Selection
+
+## Which job type to use
+
+| If you need | Use |
+|---|---|
+| Run one piece of work once | `IJob` |
+| Run the same operation across many independent elements | `IJobParallelFor` |
+| Run a parallel operation with per-batch setup overhead | `IJobParallel` |
+| Full control over execution and cleanup, or dynamic dispatch | `ICustomJob<TSelf>` |
+| Debug or test a job without threading overhead | `Run` / `RunRef` |
+
+## IJob
+
+Use `IJob` for any unit of work that can't be broken into smaller parallel pieces. Examples:
+
+- Apply velocity to a single entity
+- Compute a sum, product, or aggregate over data that's already been processed
+- Trigger an action after dependencies complete
+
+`IJob` runs once on one worker thread. If you find yourself scheduling many `IJob` instances that do the same operation, consider batching them into an `IJobParallelFor`.
+
+## IJobParallelFor
+
+Use `IJobParallelFor` when you need to apply the same transformation to every element of an array or buffer. The system distributes indices across worker threads in batches.
+
+**Choose the right batch size:**
+
+- Small batches (1-16): Best load balancing, more stealing overhead. Use when work per element varies.
+- Medium batches (32-128): Good balance. A reasonable default for most workloads.
+- Large batches (256+): Less overhead, but can cause uneven distribution. Use when work per element is uniform.
+
+A good starting point is `batchSize = 64`. Profile and adjust from there.
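+To see how batch size trades off against scheduling overhead, the following sketch counts the batches produced for 10,000 indices. It assumes the index range is split into contiguous, fixed-size batches. `MakeBatches` is a hypothetical helper, and the library's actual partitioning may differ.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical helper: split [0, count) into contiguous batches of at most batchSize indices.
+List<(int Start, int End)> MakeBatches(int count, int batchSize)
+{
+    var list = new List<(int Start, int End)>();
+    for (int start = 0; start < count; start += batchSize)
+        list.Add((start, Math.Min(start + batchSize, count)));
+    return list;
+}
+
+foreach (int size in new[] { 8, 64, 256 })
+{
+    var batches = MakeBatches(10000, size);
+    (int start, int end) = batches[^1];
+    Console.WriteLine($"batchSize {size}: {batches.Count} batches, last batch has {end - start} indices");
+}
+// batchSize 8: 1250 batches, last batch has 8 indices
+// batchSize 64: 157 batches, last batch has 16 indices
+// batchSize 256: 40 batches, last batch has 16 indices
+```
+
+Fewer, larger batches mean fewer claims and steals. More, smaller batches give idle workers more opportunities to pick up the remaining work.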
+
+**Avoid writing to overlapping indices.** Each index should be independent. If two indices write to the same location, you have a race condition.
+
+## IJobParallel
+
+Use `IJobParallel` when each batch of work has setup cost that you want to amortize. For example:
+
+- Processing chunks of data where each chunk requires preparing local state
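+- Operations where processing a whole range at once is cheaper than handling each index in isolation, such as accumulating into a local variable and writing a single result per range
+
+Scheduling works the same way as `IJobParallelFor` (call `ScheduleParallel` instead of `ScheduleParallelFor`), but `Execute` receives `(startIndex, endIndex)` instead of a single `index`. This lets you write loops with local accumulators or per-batch initialization.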
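+A sum shows the difference between a per-index body and a per-range body most clearly. The sketch below runs the batches sequentially in plain C#, without the job system, to show the range-based pattern: one local accumulator per batch and a single write at the end. In the library, each `(start, end)` pair would be handed to a worker. The sequential loop that drives the batches here is an assumption made for illustration.
+
+```csharp
+using System;
+using System.Linq;
+
+float[] data = Enumerable.Range(1, 10).Select(i => (float)i).ToArray(); // 1..10
+const int batchSize = 4;
+int batchCount = (data.Length + batchSize - 1) / batchSize;
+float[] partialSums = new float[batchCount];
+
+// Range-based body: accumulate locally, then write once to this batch's own slot.
+void ExecuteRange(int start, int end)
+{
+    float sum = 0;
+    for (int i = start; i < end; i++)
+        sum += data[i];
+    partialSums[start / batchSize] = sum; // no two batches share a slot
+}
+
+for (int b = 0; b < batchCount; b++)
+{
+    int start = b * batchSize;
+    ExecuteRange(start, Math.Min(start + batchSize, data.Length));
+}
+
+Console.WriteLine(partialSums.Sum()); // 55
+```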
+
+## ICustomJob
+
+Use `ICustomJob<TSelf>` when you need:
+
+- A job type that isn't known at compile time (dynamic dispatch via function pointers)
+- Custom cleanup logic that runs after the job completes
+- To control `JobRanges` directly for non-standard iteration patterns
+
+The overhead is slightly higher than the standard interfaces due to the function pointer indirection. Only use it when the standard interfaces don't fit.
+
+## Scheduler configuration
+
+**ThreadCount:** Set to `Environment.ProcessorCount` for general use. The scheduler caps the worker count at the number of logical processors. If the job system shares cores with rendering or other systems, consider leaving one or two cores free.
+
+**DependencyChainCapacity:** This is the total number of dependency edges the scheduler can track at once. Set it to cover your peak number of outstanding dependencies. If you run out, jobs still run, but dependency enforcement may be incomplete. When in doubt, set it higher. The edge pool is pre-allocated at creation, so unused capacity costs only memory.
+
+**ThreadPriority:** Use `Normal` for most cases. Use `AboveNormal` if the job system is the primary consumer of CPU time and you want to prioritize it over other system threads.
+
+## Memory and allocation
+
+- **Scheduling doesn't allocate.** The scheduler allocates all internal structures (queues, edge pool, slot maps) at creation. No per-job GC allocations occur during scheduling or execution.
+- **Job data is copied.** When you schedule a struct job, its data is copied into an internal pool that lives until the job completes. Pointers and references inside the job are copied as-is, so the memory they point to must stay valid for the job's lifetime.
+- **Managed payloads work.** Unlike many job systems, this library supports class-based jobs and jobs holding managed types (`List`, `string`, arrays). The same zero-allocation guarantees apply.
+- **Free custom resources in `ICustomJob.Free`.** If your custom job allocates unmanaged memory, the `Free` callback is the right place to release it.
+
+## Schedule and complete timing
+
+It's best practice to call `Schedule` on a job as soon as you have the data it needs, and to call `Wait` on it only when you need the results.
+
+You can schedule less important jobs in a part of the frame where they aren't competing with more important jobs.
+
+For example, suppose there is a period between the end of one frame and the beginning of the next where no jobs are running, and a one-frame latency is acceptable. In that case, you can schedule the job towards the end of a frame and use its results in the following frame. Alternatively, your application might saturate that changeover period with other jobs while leaving another part of the frame underutilized. If so, it's more efficient to schedule your job in the underutilized period instead.
+
+## Dependencies
+
+- **Prefer multiple dependencies over deep chains.** When the work is independent, a job that waits on 10 handles directly is better than a chain of 10 jobs that each wait on one. This gives the scheduler more freedom to parallelize independent work.
+- **Use `CombineDependencies` for large dependency sets.** If a job depends on more than a handful of other jobs, combine them to reduce scheduling overhead.
+
+## Avoid long running jobs
+
+Unlike threads, jobs don't yield execution. Once a job starts, the worker thread running it commits to completing that job before it runs any other job. It's therefore best practice to break up long-running jobs into smaller jobs that depend on one another. Avoid submitting jobs that take a long time to complete relative to other jobs in the system.
+
+The job system usually runs multiple chains of job dependencies. If you break up long-running tasks into multiple pieces, multiple job chains have a chance to progress. If instead the job system is filled with long-running jobs, they might consume all worker threads and block independent jobs from executing. This can delay the completion of important jobs that the main thread explicitly waits for, causing main-thread stalls that otherwise wouldn't happen.
+
+In particular, long-running `IJobParallelFor` jobs have a negative impact on the job system. This job type intentionally tries to run on as many worker threads as the batch size allows. If you can't break up a long parallel job, consider increasing its batch size when you schedule it to limit how many workers pick it up.
+
+## Priorities
+
+- **Reserve High for critical-path work.** Jobs on the critical path (the chain that the main thread is waiting on) benefit most from High priority.
+- **Use Low for background tasks.** Deferred work like cleanup, analytics, or pre-computation that isn't needed this frame should use Low priority.
+- **Most jobs should be Normal.** Overusing High priority dilutes its effectiveness.
+
+## Inline execution
+
+By default, `Wait` helps execute the job inline while waiting. This reduces latency because the calling thread contributes CPU time to the work it needs. Leave this enabled unless:
+
+- The calling thread has other work to do while waiting (use async variants instead)
+- You're relying on thread-local storage and can't have an external thread execute jobs
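+Helping while waiting can be modeled as a wait loop that runs queued work instead of sleeping. The single-threaded sketch below uses hypothetical names and standard .NET types, with no worker threads. It queues three filler jobs ahead of the job being waited on. Because nothing else drains the queue, the calling thread runs all four jobs itself.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Threading;
+
+var queue = new ConcurrentQueue<Action>();
+using var targetDone = new ManualResetEventSlim(false);
+int ranOnCaller = 0;
+
+for (int i = 0; i < 3; i++)
+    queue.Enqueue(() => Thread.SpinWait(1000)); // filler work
+queue.Enqueue(() => targetDone.Set());          // the job being waited on
+
+// Instead of blocking immediately, the waiting thread runs whatever is queued.
+void WaitWithInlineExecution(ManualResetEventSlim done)
+{
+    while (!done.IsSet)
+    {
+        if (queue.TryDequeue(out var job)) { job(); ranOnCaller++; }
+        else done.Wait(); // nothing runnable: block until another thread completes it
+    }
+}
+
+WaitWithInlineExecution(targetDone);
+Console.WriteLine($"jobs run by the caller: {ranOnCaller}"); // jobs run by the caller: 4
+```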
+
+## Thread safety
+
+- **No two threads should write to the same memory.** Use dependencies to serialize writes.
+- **Multiple readers are safe.** Any number of jobs can read the same data at the same time, as long as no job writes to it concurrently. In `IJobParallelFor` jobs, make each index write only to its own location.
+- **Don't access mutable static data from jobs.** The job system can't protect against race conditions on static fields.
+
+## Additional resources
+
+- [Creating and Scheduling Jobs](creating-jobs.md)
+- [Job Dependencies and Coordination](job-dependencies.md)
+- [Threading Fundamentals](threading-fundamentals.md)
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/creating-jobs.md
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/creating-jobs.md
@@ -0,0 +1,277 @@
+# Creating and Scheduling Jobs
+
+To create and run a job, you must:
+
+1. Define a struct or class that implements one of the job interfaces.
+2. Schedule the job on a `JobScheduler` instance.
+3. Wait for the job to complete before accessing its results.
+
+## Job types
+
+| Type | Description |
+|---|---|
+| `IJob` | A single job that runs once on one worker thread |
+| `IJobParallelFor` | A parallel job that runs `Execute` once per index |
+| `IJobParallel` | A parallel job that runs `Execute` once per range segment |
+| `ICustomJob<TSelf>` | A job with user-defined execution and cleanup function pointers |
+
+## Managed and unmanaged jobs
+
+Jobs can hold both **unmanaged** data (pointers, primitive types, blittable structs) and **managed** data (arrays, `List<T>`, strings, class references). The job scheduler copies the job data into an internal pool at schedule time and frees it after completion.
+
+```csharp
+// Managed job: holds an array reference
+public unsafe struct ArraySumJob : IJob
+{
+    public int[] data;     // managed array
+    public int* result;    // pointer to output
+
+    public void Execute(ref readonly JobExecutionContext ctx)
+    {
+        int sum = 0;
+        for (int i = 0; i < data.Length; i++)
+            sum += data[i];
+        *result = sum;
+    }
+}
+```
+
+This differs from many job systems that restrict payloads to blittable types only. However, standard threading rules still apply to managed references inside jobs. Multiple jobs must not write to the same managed object simultaneously without proper synchronization.
+
+## Create a scheduler
+
+Before you can schedule jobs, create a `JobScheduler` with a description that defines the thread count and dependency capacity.
+
+```csharp
+using Misaki.HighPerformance.Jobs;
+
+JobSchedulerDesc desc = new JobSchedulerDesc
+{
+    ThreadCount = Environment.ProcessorCount,
+    ThreadPriority = ThreadPriority.Normal,
+    DependencyChainCapacity = 64,
+};
+
+JobScheduler scheduler = new JobScheduler(in desc);
+```
+
+`ThreadCount` controls how many worker threads the scheduler spawns. `DependencyChainCapacity` is the maximum number of dependency edges the scheduler can track simultaneously. Set this to a value that covers the peak number of outstanding dependencies your workload needs.
+
+The scheduler also reserves one helper thread slot for external threads. Use `scheduler.ThreadLocalCount` when allocating thread-local storage to ensure every possible executor has a valid slot.
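+A common use of `ThreadLocalCount` is a per-executor accumulator array. The sketch below assumes each executor knows its own slot index in `[0, ThreadLocalCount)`. How a job obtains that index isn't covered here, so plain threads with explicit indices stand in for executors. `threadLocalCount` is a hypothetical stand-in for `scheduler.ThreadLocalCount`. Slots are spaced 64 bytes apart so that two executors never write to the same cache line.
+
+```csharp
+using System;
+using System.Threading;
+
+int threadLocalCount = 5;  // hypothetical stand-in for scheduler.ThreadLocalCount
+const int Stride = 8;      // 8 longs = 64 bytes between slots
+long[] slots = new long[threadLocalCount * Stride];
+
+var threads = new Thread[threadLocalCount];
+for (int t = 0; t < threadLocalCount; t++)
+{
+    int slot = t; // each executor owns exactly one slot
+    threads[t] = new Thread(() =>
+    {
+        for (int i = 0; i < 1000; i++)
+            slots[slot * Stride]++; // no other thread writes this element
+    });
+    threads[t].Start();
+}
+foreach (var thread in threads) thread.Join();
+
+long total = 0;
+for (int t = 0; t < threadLocalCount; t++)
+    total += slots[t * Stride];
+Console.WriteLine(total); // 5000
+```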
+
+## IJob
+
+`IJob` runs a single unit of work once on one worker thread. Implement the `Execute` method with the work you want to perform.
+
+```csharp
+public unsafe struct ApplyVelocityJob : IJob
+{
+    public Vector3* position;
+    public Vector3 velocity;
+    public float deltaTime;
+
+    public void Execute(ref readonly JobExecutionContext ctx)
+    {
+        *position += velocity * deltaTime;
+    }
+}
+```
+
+To schedule the job, call `Schedule` on the scheduler. This returns a `JobHandle` that you can use to track completion.
+
+```csharp
+Vector3 pos = new Vector3(0, 0, 0);
+Vector3 vel = new Vector3(10, 0, 0);
+
+ApplyVelocityJob job = new ApplyVelocityJob
+{
+    position = &pos,
+    velocity = vel,
+    deltaTime = 0.016f,
+};
+
+JobHandle handle = scheduler.Schedule(ref job);
+scheduler.Wait(handle);
+
+// Result: pos == (0.16, 0, 0)
+```
+
+## IJobParallelFor
+
+`IJobParallelFor` runs the same operation across a range of indices in parallel. Each worker thread claims batches of indices and processes them. When a worker runs out of batches, it steals remaining batches from other workers.
+
+This is useful for updating arrays of entities, processing particle data, or any operation where each element is independent.
+
+```csharp
+public unsafe struct UpdatePositionJob : IJobParallelFor
+{
+    public Vector3* positions;
+    public Vector3 velocity;
+    public float deltaTime;
+
+    public void Execute(int index, ref readonly JobExecutionContext ctx)
+    {
+        positions[index] += velocity * deltaTime;
+    }
+}
+```
+
+Schedule a parallel-for job with the total iteration count and the batch size. The batch size controls how many indices each worker claims at once. Smaller batches give better load balancing. Larger batches reduce stealing overhead.
+
+```csharp
+const int entityCount = 10000;
+
+UpdatePositionJob job = new UpdatePositionJob
+{
+    positions = positionsPtr, // Vector3* to at least entityCount elements
+    velocity = new Vector3(10, 0, 0),
+    deltaTime = 0.016f,
+};
+
+JobHandle handle = scheduler.ScheduleParallelFor(ref job, entityCount, 64);
+scheduler.Wait(handle);
+```
+
+## IJobParallel
+
+`IJobParallel` is similar to `IJobParallelFor`, but receives a start and end index instead of a single index. This is useful when the work per batch has setup overhead that you want to amortize across multiple elements.
+
+```csharp
+public unsafe struct ProcessChunkJob : IJobParallel
+{
+    public float* data;
+    public float* output; // at least as long as data; one sum per chunk
+
+    public void Execute(int startIndex, int endIndex, ref readonly JobExecutionContext ctx)
+    {
+        float sum = 0;
+        for (int i = startIndex; i < endIndex; i++)
+        {
+            sum += data[i];
+        }
+        // Store the per-chunk sum at the chunk's start index
+        output[startIndex] = sum;
+    }
+}
+```
+
+Schedule it the same way as a parallel-for job:
+
+```csharp
+JobHandle handle = scheduler.ScheduleParallel(ref job, totalLength, batchSize);
+scheduler.Wait(handle);
+```
+
+## ICustomJob
+
+`ICustomJob<TSelf>` gives you full control over execution and cleanup by letting you provide function pointers. This is useful when you need custom resource management or when the job's execution logic isn't known until runtime.
+
+```csharp
+public unsafe struct MyCustomJob : ICustomJob<MyCustomJob>
+{
+    public int* value;
+
+    public static void Execute(ref MyCustomJob job, ref JobRanges jobRanges, ref readonly JobExecutionContext ctx)
+    {
+        *job.value += 1;
+    }
+
+    public static void Free(ref MyCustomJob job)
+    {
+        // Clean up any unmanaged resources here
+    }
+}
+```
+
+Schedule it using `ScheduleCustom` with a `CustomJobDesc`:
+
+```csharp
+int value = 0;
+
+MyCustomJob customJob = new MyCustomJob { value = &value };
+
+CustomJobDesc<MyCustomJob> desc = new CustomJobDesc<MyCustomJob>
+{
+    data = ref customJob,
+    pExecutionFunc = &MyCustomJob.Execute,
+    pFreeFunc = &MyCustomJob.Free,
+    jobRanges = JobRanges.Single,
+    priority = JobPriority.Normal,
+};
+
+JobHandle handle = scheduler.ScheduleCustom(ref desc);
+scheduler.Wait(handle);
+```
+
+## Run inline
+
+You can also run a job immediately on the calling thread. This is useful for debugging or when the work is too small to justify threading overhead.
+
+```csharp
+// IJob
+job.Run(default);
+
+// IJobParallelFor
+job.Run(totalIterations, default);
+
+// IJobParallel
+job.Run(totalIterations, default);
+```
+
+For struct jobs, use `RunRef` to avoid a copy:
+
+```csharp
+ref MyJob jobRef = ref someJob;
+jobRef.RunRef(default);
+```
+
+## Priority
+
+You can assign a priority when scheduling a job.
+
+```csharp
+JobHandle handle = scheduler.Schedule(ref job, JobPriority.High);
+```
+
+For more information on how priorities affect scheduling, see [Threading Fundamentals](threading-fundamentals.md).
+
+## preferLocal
+
+When you schedule with `preferLocal: true`, the scheduler pushes the job onto the calling thread's local queue first. This keeps the job's data hot in the CPU cache for that thread.
+
+```csharp
+JobHandle handle = scheduler.Schedule(ref job, preferLocal: true);
+```
+
+Use this when the calling thread is likely to be the one that executes the job, such as when scheduling from a dedicated system thread.
+
+## Wait for completion
+
+After scheduling, call one of the wait methods to block until the job finishes:
+
+```csharp
+// Block until a single job completes
+scheduler.Wait(handle);
+
+// Block until all specified jobs complete
+scheduler.WaitAll(handle1, handle2);
+
+// Block until any of the specified jobs completes
+JobHandle completed = scheduler.WaitAny(handle1, handle2);
+```
+
+By default, `Wait` helps execute the job inline while waiting. Pass `inlineExecution: false` to disable this.
+
+## Dispose
+
+When you no longer need the scheduler, call `Dispose` to stop all worker threads and release resources:
+
+```csharp
+scheduler.Dispose();
+```
+
+## Additional resources
+
+- [Threading Fundamentals](threading-fundamentals.md)
+- [Job Dependencies and Coordination](job-dependencies.md)
+- [Best Practices and API Selection](best-practices.md)
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/introduction.md
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/introduction.md
@@ -0,0 +1,94 @@
+# Introduction
+
+The job system lets you write safe, multithreaded code so your application can use all available CPU cores efficiently. It provides a zero-allocation, lock-free scheduling layer designed for game engines, simulations, and any high-throughput runtime.
+
+## Why a dedicated job system?
+
+Standard .NET primitives weren't designed for fine-grained game workloads:
+
+- `Task` allocates on the GC heap for each invocation, and dependencies are expressed through continuations (`ContinueWith`, `Task.WhenAll`) that allocate as well.
+- `Parallel.For` allocates per call and offers no dependency or priority control.
+- `ThreadPool` isn't built for low-latency job dispatch or batch-aware scheduling.
+
+This library solves these problems with pre-allocated memory pools, lock-free scheduling, full DAG-based dependency tracking, and automatic work stealing across all CPU cores.
+
+## Feature highlights
+
+| Feature | Description |
+|---|---|
+| Zero allocation | All memory for scheduling and execution is pre-allocated at scheduler creation |
+| Lock-free scheduling | No `lock` statements or `Monitor` enters on the hot path |
+| Job dependencies | Full directed-acyclic-graph chain per job, bounded only by the global capacity set at scheduler creation |
+| Work stealing | Idle workers pull work from busy workers, naturally balancing load across P-cores and E-cores |
+| Priority scheduling | High (50%), Normal (37.5%), and Low (12.5%) dispatch ratios |
+| Managed + unmanaged | Supports both struct jobs and class-based jobs |
+| Three job contracts | `IJob`, `IJobParallelFor`, `IJobParallel`, plus `ICustomJob<TSelf>` for custom execution logic |
+| Async wait | `WaitAsync`, `WaitAllAsync`, `WaitAnyAsync` for non-blocking coordination |
+| Inline execution | Calling threads can help execute the job they're waiting on, reducing latency |
+
+## Basic usage
+
+```csharp
+using Misaki.HighPerformance.Jobs;
+
+public unsafe struct AddJob : IJob
+{
+    public int* pA;
+    public int* pB;
+    public int* pResult;
+
+    public void Execute(ref readonly JobExecutionContext ctx)
+    {
+        *pResult = *pA + *pB;
+    }
+}
+
+JobSchedulerDesc desc = new JobSchedulerDesc
+{
+    ThreadCount = Environment.ProcessorCount,
+    ThreadPriority = ThreadPriority.Normal,
+    DependencyChainCapacity = 64,
+};
+
+JobScheduler jobScheduler = new JobScheduler(in desc);
+
+int a = 5;
+int b = 10;
+int result = 0;
+
+AddJob job = new AddJob
+{
+    pA = &a,
+    pB = &b,
+    pResult = &result
+};
+
+JobHandle handle = jobScheduler.Schedule(ref job);
+jobScheduler.Wait(handle);
+
+Console.WriteLine($"Result: {result}"); // Output: Result: 15
+```
+
+## Who this is for
+
+- Custom game engine developers who need a scheduling backbone without GC pauses
+- Simulation and batch-processing authors who need predictable parallelism
+- .NET developers who have hit the limits of `Task`-based approaches in tight loops
+
+## Requirements
+
+- .NET 10.0 or later
+- `unsafe` code enabled
+
+## Install
+
+```bash
+dotnet add package Misaki.HighPerformance.Jobs
+```
+
+## Additional resources
+
+- [Threading Fundamentals](threading-fundamentals.md)
+- [Creating and Scheduling Jobs](creating-jobs.md)
+- [Job Dependencies and Coordination](job-dependencies.md)
+- [Best Practices and API Selection](best-practices.md)
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/job-dependencies.md
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/job-dependencies.md
@@ -0,0 +1,178 @@
+# Job Dependencies and Coordination
+
+Often, one job depends on the results of another job. For example, job A might write velocity data that job B reads to update positions. You must tell the scheduler about such a dependency when you schedule the dependent job. The scheduler won't run the dependent job until all jobs it depends on have finished.
+
+A job can depend on any number of other jobs. You can also create chains of jobs where each job depends on the previous one. However, dependencies delay job execution, so you should design your dependency graph to allow independent chains to run in parallel.
+
+## Dependencies on completed jobs
+
+If the job you're depending on has already completed by the time you schedule the dependent job, the scheduler detects this and skips the wait. The dependent job becomes eligible to run immediately. There's no penalty for passing handles that are already complete, which makes it safe to pass handles whose completion timing isn't guaranteed.
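+One way to picture this is a per-job counter of unfinished dependencies. The single-threaded sketch below uses hypothetical names and is not the library's implementation. When a job is registered, the sketch skips any dependency that is already complete. In the example, job `C` lists both `A` and `B`, but it waits only on `B`. A job becomes ready when its counter reaches zero.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical job graph: each job lists the jobs it depends on.
+var dependsOn = new Dictionary<string, string[]>
+{
+    ["A"] = Array.Empty<string>(),
+    ["B"] = Array.Empty<string>(),
+    ["C"] = new[] { "A", "B" },
+    ["D"] = new[] { "C" },
+};
+var completed = new HashSet<string>();
+var pending = new Dictionary<string, int>();             // unfinished dependencies per job
+var dependents = new Dictionary<string, List<string>>(); // job -> jobs waiting on it
+var ready = new Queue<string>();
+var order = new List<string>();
+
+void Register(string job)
+{
+    int count = 0;
+    foreach (string dep in dependsOn[job])
+    {
+        if (completed.Contains(dep)) continue; // already done: no edge, no wait
+        count++;
+        if (!dependents.TryGetValue(dep, out var waiting))
+            dependents[dep] = waiting = new List<string>();
+        waiting.Add(job);
+    }
+    pending[job] = count;
+    if (count == 0) ready.Enqueue(job);
+}
+
+void Complete(string job)
+{
+    completed.Add(job);
+    order.Add(job);
+    if (dependents.TryGetValue(job, out var waiting))
+        foreach (string next in waiting)
+            if (--pending[next] == 0) ready.Enqueue(next);
+}
+
+Register("A");
+Complete(ready.Dequeue()); // A finishes before C is scheduled
+foreach (string job in new[] { "B", "C", "D" }) Register(job);
+while (ready.Count > 0) Complete(ready.Dequeue());
+
+Console.WriteLine(string.Join(" -> ", order)); // A -> B -> C -> D
+```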
+
+## Single dependency
+
+Pass a `JobHandle` from one job's schedule call as a dependency to the next.
+
+```csharp
+using Misaki.HighPerformance.Jobs;
+
+public unsafe struct AddJob : IJob
+{
+    public float* result;
+
+    public void Execute(ref readonly JobExecutionContext ctx)
+    {
+        *result += 1;
+    }
+}
+
+float result = 0;
+
+AddJob jobA = new AddJob { result = &result };
+JobHandle handleA = scheduler.Schedule(ref jobA);
+
+AddJob jobB = new AddJob { result = &result };
+JobHandle handleB = scheduler.Schedule(ref jobB, handleA);
+
+scheduler.Wait(handleB);
+// result == 2
+```
+
+Job B won't start until job A completes. Because both jobs write to the same data, the dependency ensures there is no race condition.
+
+## Multiple dependencies
+
+A job can wait on several jobs at once. Pass multiple handles to `Schedule`.
+
+```csharp
+JobHandle handle1 = scheduler.Schedule(ref job1);
+JobHandle handle2 = scheduler.Schedule(ref job2);
+
+// Job 3 waits for both job1 and job2 to finish
+JobHandle handle3 = scheduler.Schedule(ref job3, handle1, handle2);
+scheduler.Wait(handle3);
+```
+
+## Combined dependencies
+
+For a large number of dependencies, use `CombineDependencies` to create a single handle that represents all of them. This avoids deep dependency chains and reduces scheduling overhead.
+
+```csharp
+// Collect handles from many scheduled jobs
+JobHandle handle1 = scheduler.Schedule(ref job1);
+JobHandle handle2 = scheduler.Schedule(ref job2);
+JobHandle handle3 = scheduler.Schedule(ref job3);
+
+// Combine into one handle, then pass as a single dependency
+JobHandle combined = scheduler.CombineDependencies(handle1, handle2, handle3);
+JobHandle finalHandle = scheduler.Schedule(ref finalJob, combined);
+scheduler.Wait(finalHandle);
+```
+
+## Full example
+
+The following example chains three jobs together: add a value to each element of an array, multiply each element, then compute the sum.
+
+```csharp
+using Misaki.HighPerformance.Jobs;
+
+public unsafe struct ParallelAddJob : IJobParallel
+{
+    public float value;
+    public float* inout;
+
+    public void Execute(int startIndex, int endIndex, ref readonly JobExecutionContext ctx)
+    {
+        for (int i = startIndex; i < endIndex; i++)
+            inout[i] += value;
+    }
+}
+
+public unsafe struct ParallelMultiplyJob : IJobParallel
+{
+    public float multiplier;
+    public float* inout;
+
+    public void Execute(int startIndex, int endIndex, ref readonly JobExecutionContext ctx)
+    {
+        for (int i = startIndex; i < endIndex; i++)
+            inout[i] *= multiplier;
+    }
+}
+
+public unsafe struct SumJob : IJob
+{
+    public float* input;
+    public int length;
+    public float* output;
+
+    public void Execute(ref readonly JobExecutionContext ctx)
+    {
+        float sum = 0;
+        for (int i = 0; i < length; i++)
+            sum += input[i];
+        *output = sum;
+    }
+}
+
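+const int arraySize = 10000;
+float* data = stackalloc float[arraySize];
+new Span<float>(data, arraySize).Clear(); // start from all zeros
+float result = 0;
+
+// Chain: add -> multiply -> sum.
+// `ref` needs a variable, so each job is stored in a local before scheduling.
+ParallelAddJob addJob = new ParallelAddJob { value = 10f, inout = data };
+ParallelMultiplyJob multiplyJob = new ParallelMultiplyJob { multiplier = 2f, inout = data };
+SumJob sumJob = new SumJob { input = data, length = arraySize, output = &result };
+
+JobHandle handle1 = scheduler.ScheduleParallel(ref addJob, arraySize, 64);
+JobHandle handle2 = scheduler.ScheduleParallel(ref multiplyJob, arraySize, 64, handle1);
+JobHandle handle3 = scheduler.Schedule(ref sumJob, handle2);
+
+scheduler.Wait(handle3);
+// Each element becomes (0 + 10) * 2 = 20, so result == 200000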
+```
+
+## Async wait
+
+The scheduler provides async variants that offload the wait to the thread pool. This lets the calling thread continue other work while waiting.
+
+```csharp
+// Wait asynchronously for a single job
+await scheduler.WaitAsync(handle);
+
+// Wait asynchronously for all jobs
+await scheduler.WaitAllAsync(new Memory<JobHandle>(new[] { handle1, handle2 }));
+
+// Wait asynchronously for any job to complete
+JobHandle completed = await scheduler.WaitAnyAsync(new ReadOnlyMemory<JobHandle>(new[] { handle1, handle2 }));
+```
+
+Unlike synchronous `Wait`, the async variants do **not** execute the job inline on the calling thread. The wait is fully offloaded to the thread pool, so the calling thread can continue other work without contributing CPU time to the job's completion.
+
+Each async method accepts an optional `CancellationToken` to cancel the wait.
+
+```csharp
+using var cts = new CancellationTokenSource();
+await scheduler.WaitAsync(handle, cts.Token);
+```
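+Conceptually, offloading a wait means a thread-pool thread performs the blocking wait while the caller only awaits a `Task`. The sketch below illustrates that idea with standard .NET types. `ManualResetEventSlim` stands in for a job's completion signal, and `OffloadedWaitAsync` is a hypothetical helper. This is not how the library implements `WaitAsync`.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Stand-in for a job's completion signal (hypothetical; not the library's type).
+using var jobDone = new ManualResetEventSlim(false);
+
+// A pool thread does the blocking wait; the caller only awaits the returned Task.
+Task OffloadedWaitAsync(ManualResetEventSlim done, CancellationToken ct) =>
+    Task.Run(() => done.Wait(ct), ct);
+
+// Simulate a job finishing on another thread after a short delay.
+_ = Task.Run(async () => { await Task.Delay(50); jobDone.Set(); });
+await OffloadedWaitAsync(jobDone, CancellationToken.None);
+Console.WriteLine($"completed: {jobDone.IsSet}"); // completed: True
+
+// A cancelled token ends the wait with an OperationCanceledException.
+using var neverDone = new ManualResetEventSlim(false);
+using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
+bool cancelled = false;
+try { await OffloadedWaitAsync(neverDone, cts.Token); }
+catch (OperationCanceledException) { cancelled = true; }
+Console.WriteLine($"cancelled: {cancelled}"); // cancelled: True
+```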
+
+## WaitAll and WaitAny
+
+The synchronous variants reorder the collection in-place, moving completed handles to the front. This allows you to efficiently check which handles are still pending.
+
+```csharp
+JobHandle[] handles = { handle1, handle2, handle3 };
+
+// WaitAll reorders the handles in place, moving completed handles to the front
+scheduler.WaitAll(handles);
+
+// WaitAny returns the first handle that completed
+JobHandle firstCompleted = scheduler.WaitAny(handle1, handle2);
+```
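+The in-place reordering amounts to a partition. Completed handles are swapped toward the front, and the returned count marks where the pending ones start. The sketch below models handles as plain `int` IDs and uses a hypothetical `isCompleted` predicate. The library's actual bookkeeping may differ.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Move completed IDs to the front of the span; return how many completed.
+int PartitionCompleted(Span<int> handles, Func<int, bool> isCompleted)
+{
+    int front = 0;
+    for (int i = 0; i < handles.Length; i++)
+    {
+        if (!isCompleted(handles[i])) continue;
+        (handles[front], handles[i]) = (handles[i], handles[front]);
+        front++;
+    }
+    return front;
+}
+
+int[] ids = { 10, 11, 12, 13, 14 };
+var done = new HashSet<int> { 11, 14 };
+int completedCount = PartitionCompleted(ids, done.Contains);
+
+Console.WriteLine(string.Join(", ", ids)); // 11, 14, 12, 13, 10
+Console.WriteLine(completedCount);         // 2
+```
+
+Everything from index `completedCount` onward is still pending, so a caller can keep waiting on just that tail.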
+
+## Get job status
+
+You can check a job's current state without waiting.
+
+```csharp
+JobState state = scheduler.GetJobStatus(handle);
+// Returns Created, Scheduled, Running, Completed, or Invalid
+```
+
+## Additional resources
+
+- [Creating and Scheduling Jobs](creating-jobs.md)
+- [Threading Fundamentals](threading-fundamentals.md)
+- [Best Practices and API Selection](best-practices.md)
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/threading-fundamentals.md
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/threading-fundamentals.md
@@ -0,0 +1,50 @@
+# Threading Fundamentals
+
+The job system uses multiple worker threads to execute your code across all available CPU cores. Each worker thread picks up jobs, executes them, and coordinates with other workers through a lock-free scheduling layer.
+
+## Multithreading
+
+When you use the job system, your code executes on worker threads running in parallel across multiple CPU cores. Instead of running one after another on the main thread, jobs run simultaneously on separate cores. The calling thread receives their results once they complete.
+
+The job system never runs more worker threads than the CPU has logical processors. This means you can schedule as many jobs as you need without knowing how many CPU cores are available.
+
+## Worker threads
+
+When you create a `JobScheduler`, it spawns a configurable number of worker threads. These threads form the backbone of the system. Each worker thread runs a continuous loop:
+
+1. Attempt to find a job to execute.
+2. If no job is immediately available, spin-wait briefly.
+3. If still no work, wait for a signal that a new job has been scheduled.
+4. Execute the found job, then repeat.
+
+The scheduler also reserves one **helper thread** slot for external threads that call `Wait()` with inline execution enabled. The `WorkerCount` property reports the number of managed worker threads, while `ThreadLocalCount` returns the total (workers + helper). Use `ThreadLocalCount` when allocating thread-local storage to ensure every possible executor has a valid slot.
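+The four-step loop above can be sketched with standard .NET primitives. This is a simplified model, not the library's implementation. A single `ConcurrentQueue<Action>` stands in for all queues, and a `SemaphoreSlim` provides the "new job" signal. The spin count of 20 is an arbitrary assumption.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Threading;
+
+var queue = new ConcurrentQueue<Action>();
+using var signal = new SemaphoreSlim(0); // one release per scheduled job
+bool stopping = false;
+int executed = 0;
+
+bool TryRunOne()
+{
+    if (!queue.TryDequeue(out var job)) return false;
+    job();                               // 4. execute the found job
+    return true;
+}
+
+void WorkerLoop()
+{
+    while (true)
+    {
+        if (TryRunOne()) continue;       // 1. try to find a job
+        var spinner = new SpinWait();
+        bool found = false;
+        for (int i = 0; i < 20 && !found; i++)
+        {
+            spinner.SpinOnce();          // 2. spin-wait briefly
+            found = TryRunOne();
+        }
+        if (found) continue;
+        if (Volatile.Read(ref stopping))
+        {
+            if (queue.IsEmpty) return;   // every scheduled job has been taken
+            continue;
+        }
+        signal.Wait();                   // 3. sleep until a job is scheduled
+    }
+}
+
+var workers = new Thread[4];
+for (int w = 0; w < workers.Length; w++) { workers[w] = new Thread(WorkerLoop); workers[w].Start(); }
+
+for (int j = 0; j < 1000; j++)
+{
+    queue.Enqueue(() => Interlocked.Increment(ref executed));
+    signal.Release();
+}
+
+Volatile.Write(ref stopping, true);
+signal.Release(workers.Length);          // wake sleeping workers so they can exit
+foreach (var worker in workers) worker.Join();
+Console.WriteLine(executed); // 1000
+```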
+
+## Thread-local queues
+
+Each worker thread has its own set of thread-local queues. When a job is scheduled with `preferLocal: true`, the scheduler pushes the job onto the calling thread's local queue first. Workers pop from their local queue in last-in-first-out (LIFO) order, which keeps the most recently scheduled job hot in the CPU cache.
+
+If a worker's local queues are empty, it looks to the global queues or steals work from other workers.
+
+## Work stealing
+
+The job system uses work stealing to balance work across worker threads. Some worker threads might process jobs faster than others. Once a worker thread has finished all of its own jobs, it looks at the other worker threads' queues and takes jobs assigned to them.
+
+On CPUs with a mix of performance cores and efficiency cores, faster cores naturally end up stealing more work, which means the overall workload stays balanced without manual partitioning.
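+Local LIFO popping and work stealing can be modeled with one double-ended queue per worker. This single-threaded sketch uses `LinkedList<string>` for clarity and has no synchronization, so it only illustrates the order in which jobs are taken. A real work-stealing deque needs lock-free or locked access. The sketch also assumes that thieves take the oldest job from the front of the victim's deque. That is a common design, not documented behavior of this library.
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+
+// One deque per worker (hypothetical single-threaded model).
+var deques = new[] { new LinkedList<string>(), new LinkedList<string>() };
+
+void Push(int worker, string job) => deques[worker].AddLast(job);
+
+// The owner takes its newest job (LIFO), which is most likely still in cache.
+string? PopLocal(int worker)
+{
+    var d = deques[worker];
+    if (d.Count == 0) return null;
+    string job = d.Last!.Value;
+    d.RemoveLast();
+    return job;
+}
+
+// A thief takes the oldest job from the first non-empty victim.
+string? Steal(int thief)
+{
+    for (int v = 0; v < deques.Length; v++)
+    {
+        if (v == thief || deques[v].Count == 0) continue;
+        string job = deques[v].First!.Value;
+        deques[v].RemoveFirst();
+        return job;
+    }
+    return null;
+}
+
+Push(0, "j1"); Push(0, "j2"); Push(0, "j3");
+string? a = PopLocal(0);             // worker 0 takes "j3"
+string? b = PopLocal(1) ?? Steal(1); // worker 1 is empty, so it steals "j1"
+Console.WriteLine($"{a} {b}");       // j3 j1
+```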
+
+## Priority scheduling
+
+Jobs can be assigned one of three priority levels. The scheduler divides dispatch slots to give each priority an appropriate share of execution time:
+
+| Priority | Share | Use case |
+|---|---|---|
+| High | 50% | Critical-path work that must complete quickly |
+| Normal | 37.5% | Default priority for most jobs |
+| Low | 12.5% | Background tasks with no immediate deadline |
+
+The scheduler probes queues in a cascade pattern that respects these ratios. Within each priority tier, the worker checks its local queue first, then the global queue, then attempts to steal from other workers before moving to the next tier.
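+The 50% / 37.5% / 12.5% split corresponds to 4, 3, and 1 slots out of every 8. The sketch below shows one hypothetical way to produce those ratios: a fixed 8-slot cycle that falls back to the first non-empty tier when the preferred queue is empty. It illustrates the ratios only and is not the library's actual cascade.
+
+```csharp
+#nullable enable
+using System;
+using System.Linq;
+
+// Hypothetical 8-slot cycle: 4 High, 3 Normal, 1 Low.
+string[] cycle = { "High", "Normal", "High", "Normal", "High", "Normal", "High", "Low" };
+string[] fallbackOrder = { "High", "Normal", "Low" };
+
+string PickTier(int dispatchIndex) => cycle[dispatchIndex % cycle.Length];
+
+// Use the preferred tier if it has work; otherwise take the first tier that does.
+string? Dispatch(int dispatchIndex, Func<string, bool> hasWork)
+{
+    string preferred = PickTier(dispatchIndex);
+    if (hasWork(preferred)) return preferred;
+    return fallbackOrder.FirstOrDefault(hasWork);
+}
+
+var counts = Enumerable.Range(0, 800)
+    .GroupBy(i => PickTier(i))
+    .ToDictionary(g => g.Key, g => g.Count());
+Console.WriteLine($"High {counts["High"]}, Normal {counts["Normal"]}, Low {counts["Low"]}");
+// High 400, Normal 300, Low 100
+
+// Slot 7 prefers Low; with the Low queue empty, the dispatcher falls back to High.
+Console.WriteLine(Dispatch(7, tier => tier != "Low")); // High
+```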
+
+## Lock-free scheduling
+
+All scheduling operations — state transitions, dependency registration, job dispatch — use lock-free techniques such as compare-and-swap (CAS) and interlocked operations. There are no `lock` statements or `Monitor` enters on the hot path.
+
+This design keeps overhead minimal. Thousands of jobs can be scheduled, executed, and completed per frame without kernel transitions, heap allocations, or garbage collection pauses.
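+A compare-and-swap state transition lets several threads race for the same job without a lock. Only the thread whose `Interlocked.CompareExchange` observes the expected old state wins. The sketch below uses hypothetical integer state codes, not the library's `JobState` values.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+const int Scheduled = 1, Running = 2; // hypothetical state codes
+int state = Scheduled;
+int winners = 0;
+
+// Succeeds only for the caller that sees Scheduled and swaps in Running.
+bool TryClaim(ref int jobState) =>
+    Interlocked.CompareExchange(ref jobState, Running, Scheduled) == Scheduled;
+
+Parallel.For(0, 16, _ =>
+{
+    if (TryClaim(ref state)) Interlocked.Increment(ref winners);
+});
+
+Console.WriteLine($"winners: {winners}, state: {state}"); // winners: 1, state: 2
+```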
--- a/docs/documents/articles/Misaki.HighPerformance.Jobs/toc.yml
+++ b/docs/documents/articles/Misaki.HighPerformance.Jobs/toc.yml
@@ -0,0 +1,10 @@
+- name: Introduction
+  href: introduction.md
+- name: Threading Fundamentals
+  href: threading-fundamentals.md
+- name: Creating and Scheduling Jobs
+  href: creating-jobs.md
+- name: Job Dependencies and Coordination
+  href: job-dependencies.md
+- name: Best Practices and API Selection
+  href: best-practices.md