Understanding the ArchiMate Motivation Diagram

In the field of enterprise architecture, communicating complex concepts and plans in a clear, structured way is essential. ArchiMate, an open and vendor-independent modeling language, achieves this by providing tools to describe, analyze, and visualize the relationships among business domains. One of ArchiMate's core components is the motivation diagram, which helps explain the reasoning behind architectural change and evolution. In this blog post, we will explore what an ArchiMate motivation diagram is, what its components are, and how to use it effectively in enterprise architecture.

What Is an ArchiMate Motivation Diagram?

An ArchiMate motivation diagram focuses on the "why" of an architecture. It captures the factors that influence architectural design, including drivers, goals, and stakeholders. Its main purpose is to illustrate the motivations that shape the architecture and to align them with the organization's strategic objectives.

Key Components of an ArchiMate Motivation Diagram
  1. Stakeholder

     Definition: An individual or group with an interest in the outcome of the architecture.
     Examples: CIO, CEO, business unit managers, customers.

  2. Driver

     Definition: A factor, internal or external to the enterprise, that creates the need for change.
     Examples: Market trends, regulatory changes, technological advances.

  3. Assessment

     Definition: An evaluation of how a driver affects the organization.
     Examples: Risk assessments, SWOT analysis.

  4. Goal

     Definition: A high-level objective the enterprise wants to achieve.
     Examples: Increase market share, improve customer satisfaction, raise operational efficiency.

  5. Outcome

     Definition: The result achieved by realizing a goal.
     Examples: Higher revenue, lower costs, better compliance.

  6. Requirement

     Definition: A specific need that must be satisfied to reach a goal.
     Examples: Implement a new customer relationship management system, ensure data privacy complies with regulations.

  7. Principle

     Definition: A general rule or guideline that shapes architectural design and implementation.
     Examples: Maintain data integrity, prioritize user experience.

  8. Constraint

     Definition: A restriction or limitation that affects how the architecture is realized.
     Examples: Budget limits, regulatory requirements.

  9. Value

     Definition: A belief or standard that stakeholders consider important.
     Examples: Customer focus, innovation, sustainability.
Creating an ArchiMate Motivation Diagram

To create an effective ArchiMate motivation diagram, follow these steps:

  1. Identify Stakeholders and Drivers

     Begin by listing all relevant stakeholders and understanding the drivers that make architectural change necessary. Engage with stakeholders to gather their perspectives and expectations.

  2. Define Goals and Outcomes

     Establish clear goals that align with the organization's strategic vision. Determine the outcomes required to reach those goals.

  3. Specify Requirements and Principles

     Identify the specific requirements that must be satisfied to achieve the goals. Establish guiding principles that will shape the architecture and keep it aligned with the organization's values.

  4. Assess Constraints

     Identify any constraints that could affect how the architecture is realized. These may be financial, regulatory, technical, or resource-related.

  5. Visualize the Relationships

     Use ArchiMate notation to map the relationships among stakeholders, drivers, goals, outcomes, requirements, principles, and constraints. This visual representation helps clarify how each element influences and interacts with the others.
Example of an ArchiMate Motivation Diagram

Consider an organization that wants to improve its digital customer experience. The elements might be visualized as follows:

  • Stakeholders: CIO, marketing manager, customers.
  • Driver: Growing customer expectations for digital services.
  • Assessment: The current digital platform lacks personalization features.
  • Goal: Improve customer satisfaction with digital interactions.
  • Outcome: Higher customer retention.
  • Requirement: Develop a personalized recommendation engine.
  • Principle: Focus on user-centered design.
  • Constraint: Limited budget for IT projects.
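
The chain of influence in this example can also be captured programmatically. The sketch below is a hypothetical, minimal Python representation (not part of the ArchiMate standard or any tool's API): it models the motivation elements as nodes in a directed graph and checks whether one element ultimately influences another.

```python
# Minimal sketch: motivation elements from the example above as a directed graph.
# Element names and "influences" edges are illustrative only.
from collections import defaultdict

elements = {
    "Driver": "Growing customer expectations for digital services",
    "Assessment": "Current platform lacks personalization",
    "Goal": "Improve satisfaction with digital interactions",
    "Outcome": "Higher customer retention",
    "Requirement": "Personalized recommendation engine",
    "Constraint": "Limited IT budget",
}

# Directed influence/realization edges between element types.
edges = [
    ("Driver", "Assessment"),
    ("Assessment", "Goal"),
    ("Goal", "Outcome"),
    ("Requirement", "Goal"),
    ("Constraint", "Requirement"),
]

graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

def influences(start, target, graph):
    """Return True if `start` can reach `target` via directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

print(influences("Driver", "Outcome", graph))      # the driver shapes the outcome
print(influences("Outcome", "Driver", graph))      # influence does not flow backward
```

A traversal like this mirrors what a modeling tool does when it traces how a constraint or driver propagates through to outcomes.
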
Benefits of Using ArchiMate Motivation Diagrams
  1. Clarity and Alignment

     Helps align architectural initiatives with strategic business goals, ensuring that all efforts contribute to the organization's overall vision.

  2. Stakeholder Engagement

     Improves communication with stakeholders by providing a clear, structured representation of motivations and goals.

  3. Strategic Decision-Making

     Supports informed decisions by highlighting the relationships among motivation elements and their impact on the architecture.

  4. Change Management

     Aids change management by making explicit the reasons behind architectural changes and the outcomes expected from them.
Conclusion

The ArchiMate motivation diagram is a powerful tool for enterprise architects, offering a clear, structured way to express the motivations behind architectural decisions. By understanding and applying this diagram, architects can ensure their designs align with the organization's strategic goals, engage stakeholders effectively, and manage change with confidence. Whether you are new to ArchiMate or looking to sharpen your current practice, the motivation diagram is an essential part of your architecture toolkit.

Embracing Digital Twins Technology - Key Considerations, Challenges, and Critical Enablers

Digital Twins technology has emerged as a transformative force in various industries, providing a virtual representation of physical systems that uses real-time data to simulate performance, behavior, and interactions. This blog post delves into the considerations for adopting Digital Twins technology, the challenges associated with its implementation, and the critical enablers that drive its success.

Considerations for Adopting Digital Twins Technology

  1. Define a High-Value Use Case

     Identify the specific problems you aim to solve using Digital Twins, such as predictive maintenance, operational efficiency, and enhanced product quality. Clearly defining the use case ensures focused efforts and maximizes the benefits of the technology.

  2. Ensure High-Quality Data

     The accuracy and reliability of Digital Twins depend heavily on high-quality data. It is crucial to collect accurate, real-time data from various sources and assess the availability, quality, and accessibility of this data.

  3. Analyze Return on Investment (ROI)

     Conduct a comprehensive cost-benefit analysis to determine the financial viability of adopting Digital Twins technology. This analysis helps in understanding the potential return on investment and justifying the expenditure.

  4. Develop Robust IT Infrastructure

     Consider the scalability of your IT infrastructure to support extensive data processing and storage requirements. A robust infrastructure is essential for the seamless operation of Digital Twins.

  5. Implement Security & Privacy

     Protect sensitive data and ensure compliance with privacy regulations. Implementing strong security measures is critical to safeguard against cyber threats and maintain data integrity.

  6. Design with Flexibility in Mind

     Anticipate future needs for expanding to new assets, processes, or applications. Choose modular technologies that can evolve with business requirements, ensuring long-term flexibility and adaptability.
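
To make the predictive-maintenance use case concrete, here is a toy, self-contained Python sketch of a digital twin. All names, sensor types, and thresholds are invented for illustration; a real twin would be fed by streaming telemetry and backed by physics or ML models.

```python
# Toy digital-twin sketch for a predictive-maintenance use case.
# Sensor names and thresholds are illustrative, not from any real system.
from dataclasses import dataclass, field

@dataclass
class PumpTwin:
    """Virtual counterpart of a physical pump, fed by real-time readings."""
    vibration_limit: float = 7.0   # mm/s, hypothetical alert threshold
    temp_limit: float = 80.0       # Celsius, hypothetical alert threshold
    readings: list = field(default_factory=list)

    def ingest(self, vibration: float, temperature: float) -> None:
        """Record one sensor reading from the physical asset."""
        self.readings.append((vibration, temperature))

    def needs_maintenance(self) -> bool:
        """Flag the asset when the latest reading breaches a limit."""
        if not self.readings:
            return False
        vibration, temperature = self.readings[-1]
        return vibration > self.vibration_limit or temperature > self.temp_limit

twin = PumpTwin()
twin.ingest(vibration=3.2, temperature=65.0)   # healthy reading
print(twin.needs_maintenance())                # False
twin.ingest(vibration=8.9, temperature=66.0)   # vibration spike
print(twin.needs_maintenance())                # True
```

Even this toy version shows the pattern behind the considerations above: the twin is only as good as the data it ingests, and the decision logic is where the high-value use case lives.
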

Challenges & Processes of Adopting Digital Twins Technology

  1. Data Integration and Quality

     Integrating data from different systems while ensuring accuracy and maintaining quality is a significant challenge. Effective data integration platforms and robust management practices are essential.

  2. Technical Complexity

     Digital Twins technology requires specialized knowledge and skills. The complexity of the technology can be a barrier to adoption, necessitating investment in training and development.

  3. Security and Privacy Concerns

     Addressing cyber threats and ensuring compliance with privacy regulations is a major concern. Organizations must implement stringent security measures to protect sensitive data.

  4. Cost and Resource Allocation

     The initial setup and ongoing maintenance of Digital Twins can be expensive. Careful resource allocation and cost management are crucial to sustain the technology in the long term.

Critical Enablers of Digital Twins Technology

  1. Data Availability

     Data integration platforms and robust data management practices are essential for handling the vast amounts of data involved. Ensuring data availability is the foundation of successful Digital Twins implementation.

  2. Advanced Analytics

     AI and ML algorithms play a vital role in analyzing data, identifying patterns, making predictions, and enabling autonomous decision-making. Advanced analytics is a key driver of Digital Twins technology.

  3. Connectivity

     Technologies like the Internet of Things (IoT), industrial communication protocols, and APIs facilitate real-time data exchange and synchronization. Connectivity is crucial for the seamless operation of Digital Twins.

  4. Skilled Workforce

     Investing in the training and development of personnel proficient in data science, engineering, and IT is essential. An effective change management strategy ensures the workforce is equipped to handle the complexities of Digital Twins technology.

Key Takeaways

  • Digital Twins improve operational efficiency, reduce downtime, and enhance product quality across industries.
  • They are utilized for urban planning, optimizing infrastructures, and improving sustainability in smart cities.
  • Airports like Changi use Digital Twins to manage passenger flow and optimize resources.
  • Combining Digital Twins with AI enables advanced simulations and predictive analytics.
  • Digital Twins are widely adopted in manufacturing, healthcare, and urban planning for innovation and competitive edge.

Conclusion

Adopting Digital Twins technology offers significant benefits, from improving operational efficiency to enabling advanced analytics. By considering the key factors, addressing the challenges, and leveraging the critical enablers, organizations can successfully implement Digital Twins technology and drive transformative change across their operations.

Minimizing GPU RAM and Scaling Model Training Horizontally with Quantization and Distributed Training

Training multibillion-parameter models in machine learning poses significant challenges, particularly concerning GPU memory limitations. A single NVIDIA A100 or H100 GPU, with its 80 GB of GPU RAM, often falls short when handling 32-bit full-precision models. This blog post will delve into two powerful techniques to overcome these challenges: quantization and distributed training.

Quantization: Reducing Precision to Conserve Memory

Quantization is a process that reduces the precision of model weights, thereby decreasing the memory required to load and train the model. This technique projects higher-precision floating-point numbers into a lower-precision target set, significantly cutting down the memory footprint.

How Quantization Works

Quantization involves the following steps:

  1. Scaling Factor Calculation: Determine a scaling factor based on the range of source (high-precision) and target (low-precision) numbers.
  2. Projection: Map the high-precision numbers to the lower-precision set using the scaling factor.
  3. Storage: Store the projected numbers in the reduced precision format.
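
The three steps above can be sketched in a few lines of NumPy. This is a simplified symmetric scheme for illustration; production quantizers also handle zero-points, per-channel scales, and calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: scale, project, store."""
    # 1. Scaling factor: map the largest absolute weight onto the int8 range.
    scale = np.abs(weights).max() / 127.0
    # 2. Projection: divide by the scale and round to the nearest integer.
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    # 3. Storage: keep the int8 tensor plus the scale needed to dequantize.
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 2.4], dtype=np.float32)
q, scale = quantize_int8(w)
print(q.dtype, q.nbytes)   # int8, 4 bytes instead of 16 for the fp32 original
```

The rounding in step 2 is where the precision loss occurs: each weight moves by at most half a quantization step, which is why the reconstruction error stays small relative to the weight range.
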

For instance, converting model parameters from 32-bit precision (fp32) to 16-bit precision (fp16 or bfloat16) or even 8-bit (int8) or 4-bit precision can drastically reduce memory usage. Quantizing a 1-billion-parameter model from 32-bit to 16-bit precision can reduce the memory requirement by 50%, down to approximately 2 GB. Further reduction to 8-bit precision can lower this to just 1 GB, a 75% reduction.

Choosing the Right Data Type

The choice of data type for quantization depends on the specific needs of your application:

  • fp32: Offers the highest accuracy but is memory-intensive and may exceed GPU RAM limits for large models.
  • fp16 and bfloat16: These halve the memory footprint compared to fp32. bfloat16 is preferred over fp16 due to its ability to maintain the same dynamic range as fp32, reducing the risk of overflow.
  • fp8: An emerging data type that further reduces memory and compute requirements, showing promise as hardware and framework support increases.
  • int8: Commonly used for inference optimization, significantly reducing memory usage.
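
The memory arithmetic behind these choices is straightforward. The snippet below reproduces the 1-billion-parameter figures quoted earlier (weights only; optimizer state and activations add more on top):

```python
# Bytes per parameter for common training/inference data types.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bfloat16": 2, "fp8": 1, "int8": 1}

def model_memory_gb(num_params: int, dtype: str) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

params = 1_000_000_000
for dtype in ("fp32", "bfloat16", "int8"):
    print(f"{dtype}: {model_memory_gb(params, dtype):.0f} GB")
# fp32: 4 GB, bfloat16: 2 GB, int8: 1 GB -- the 50% and 75% reductions above
```
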

Distributed Training: Scaling Horizontally Across GPUs

When a single GPU's memory is insufficient, distributing the training process across multiple GPUs is necessary. Distributed training allows for scaling the model horizontally, leveraging the combined memory and computational power of multiple GPUs.

Approaches to Distributed Training
  1. Data Parallelism: Each GPU holds a complete copy of the model but processes different mini-batches of data. Gradients from each GPU are averaged and synchronized at each training step.

Pros: Simple to implement, suitable for models that fit within a single GPU’s memory.

Cons: Limited by the size of the model that can fit into a single GPU.

  2. Model Parallelism: The model is partitioned across multiple GPUs. Each GPU processes a portion of the model, handling the corresponding part of the input data.

Pros: Effective for extremely large models that cannot fit into a single GPU’s memory.

Cons: More complex to implement, communication overhead can be significant.

  3. Pipeline Parallelism: Combines aspects of data and model parallelism. The model is divided into stages, with each stage assigned to different GPUs. Data flows through these stages sequentially.

Pros: Balances the benefits of data and model parallelism, suitable for very deep models.

Cons: Introduces pipeline bubbles and can be complex to manage.
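
To make the data-parallel idea concrete without multi-GPU hardware, the toy NumPy simulation below mimics two workers computing gradients on different mini-batches and averaging them, which is the essence of what frameworks like PyTorch's DistributedDataParallel do during the all-reduce step. This is an illustrative sketch, not the real API.

```python
import numpy as np

# Toy model: y = w * x with squared-error loss; the true weight is 3.0.
X = np.linspace(-1.0, 1.0, 8)
y = 3.0 * X

def local_gradient(w, x_batch, y_batch):
    """Gradient of mean squared error on one worker's mini-batch."""
    return np.mean(2.0 * (w * x_batch - y_batch) * x_batch)

# Data parallelism: each "GPU" holds the full model but a different data shard.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]

w = 0.5
for step in range(200):
    grads = [local_gradient(w, xb, yb) for xb, yb in shards]
    avg_grad = np.mean(grads)   # stand-in for the all-reduce gradient sync
    w -= 0.1 * avg_grad         # every replica applies the identical update

print(round(w, 3))  # → 3.0
```

Because every replica applies the same averaged gradient, the model copies stay bit-identical across workers, which is exactly the invariant real data-parallel training maintains.
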

Implementing Distributed Training

To implement distributed training effectively:

  1. Framework Support: Utilize frameworks like TensorFlow, PyTorch, or MXNet, which offer built-in support for distributed training.
  2. Efficient Communication: Ensure efficient communication between GPUs using technologies like NCCL (NVIDIA Collective Communications Library).
  3. Load Balancing: Balance the workload across GPUs to prevent bottlenecks.
  4. Checkpointing: Regularly save model checkpoints to mitigate the risk of data loss during training.
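
Checkpointing in particular is cheap insurance. A framework-agnostic sketch is shown below; the file name, interval, and JSON format are arbitrary choices for illustration (real training jobs typically save framework-native state dicts instead).

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write training state atomically so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

ckpt_path = os.path.join(tempfile.gettempdir(), "train_ckpt.json")
for step in range(1, 101):
    # ... forward/backward/optimizer step would go here ...
    if step % 25 == 0:  # checkpoint every 25 steps (arbitrary interval)
        save_checkpoint({"step": step, "loss": 1.0 / step}, ckpt_path)

print(load_checkpoint(ckpt_path)["step"])  # → 100
```

The write-then-rename pattern matters: a job killed mid-write leaves the previous checkpoint intact rather than a half-written file.
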

Conclusion

Combining quantization and distributed training offers a robust solution for training large-scale models within the constraints of available GPU memory. Quantization significantly reduces memory requirements, while distributed training leverages multiple GPUs to handle models that exceed the capacity of a single GPU. By effectively applying these techniques, you can optimize GPU usage, reduce training costs, and achieve scalable performance for your machine learning models.

Types of Transformer-Based Foundation Models

Transformer-based foundation models have revolutionized natural language processing (NLP) and are categorized into three primary types: encoder-only, decoder-only, and encoder-decoder models. Each type is trained using a specific objective function and is suited for different types of generative tasks. Let’s dive deeper into each variant and understand their unique characteristics and applications.

Encoder-Only Models (Autoencoders)

Training Objective: Masked Language Modeling (MLM)

Encoder-only models, commonly referred to as autoencoders, are pretrained using masked language modeling. This technique involves randomly masking input tokens and training the model to predict these masked tokens. By doing so, the model learns to understand the context of a token based on both its preceding and succeeding tokens, which is often called a denoising objective.
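
A toy illustration of the masking objective, with whitespace tokenization and hand-picked mask positions standing in for a real tokenizer and random masking:

```python
# Toy masked-language-modeling setup: hide tokens, keep them as labels.
tokens = "the cat sat on the mat".split()
mask_positions = [1, 4]  # chosen by hand for illustration

masked_input, labels = [], {}
for i, tok in enumerate(tokens):
    if i in mask_positions:
        masked_input.append("[MASK]")
        labels[i] = tok  # the model must recover these from context
    else:
        masked_input.append(tok)

print(masked_input)  # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(labels)        # {1: 'cat', 4: 'the'}
```

Note that when predicting the token at position 1, the model can read tokens on both sides of the mask, which is precisely what makes the learned representations bidirectional.
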

Characteristics

  • Bidirectional Representations: Encoder-only models leverage bidirectional representations, enabling them to understand the full context of a token within a sentence.
  • Embedding Utilization: The embeddings generated by these models are highly effective for tasks that require understanding of text semantics.

Applications

  • Text Classification: These models are particularly useful for text classification tasks where understanding the context and semantics of the text is crucial.
  • Semantic Similarity Search: Encoder-only models can power advanced document-search algorithms that go beyond simple keyword matching, providing more accurate and relevant search results.

Example: BERT

A well-known example of an encoder-only model is BERT (Bidirectional Encoder Representations from Transformers). BERT's ability to capture contextual information has made it a powerful tool for various NLP tasks, including sentiment analysis and named entity recognition.

Decoder-Only Models (Autoregressive Models)

Training Objective: Causal Language Modeling (CLM)

Decoder-only models, or autoregressive models, are pretrained using unidirectional causal language modeling. In this approach, the model predicts the next token in a sequence using only the preceding tokens, ensuring that each prediction is based solely on the information available up to that point.
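
The causal objective can be illustrated with plain Python: the targets are simply the input sequence shifted by one position, and a causal mask ensures position i attends only to positions up to i (a toy example with whitespace tokens):

```python
tokens = "the cat sat on the mat".split()

# Next-token prediction: the target at position i is token i+1.
inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"given ...{x!r} -> predict {y!r}")

# Causal attention mask: True where attention is allowed (j <= i).
n = len(inputs)
causal_mask = [[j <= i for j in range(n)] for i in range(n)]
print(causal_mask[0])  # the first position sees only itself
```

The lower-triangular mask is what enforces the "no peeking at future tokens" rule during training, so the same model can generate text left to right at inference time.
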

Characteristics

  • Unidirectional Representations: These models generate text by predicting one token at a time, using previously generated tokens as context.
  • Generative Capabilities: They are well-suited for generative tasks, producing coherent and contextually relevant text outputs.

Applications

  • Text Generation: Autoregressive models are the standard for tasks requiring text generation, such as chatbots and content creation.
  • Question-Answering: These models excel in generating accurate and contextually appropriate answers to questions based on given prompts.

Examples: GPT-3, Falcon, LLaMA

Prominent examples of decoder-only models include GPT-3, Falcon, and LLaMA. These models have gained widespread recognition for their ability to generate human-like text and perform a variety of NLP tasks with high proficiency.

Encoder-Decoder Models (Sequence-to-Sequence Models)

Training Objective: Span Corruption

Encoder-decoder models, often called sequence-to-sequence models, utilize both the encoder and decoder components of the Transformer architecture. A common pretraining objective for these models is span corruption, where consecutive spans of tokens are masked and the model is trained to reconstruct the original sequence.
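
Span corruption can be sketched in a few lines: a contiguous span is replaced by a sentinel token in the input, and the target asks the model to emit the sentinel followed by the dropped tokens. The sentinel naming follows the T5 convention, but this is a simplified single-span illustration with a hand-picked span:

```python
tokens = "the cat sat on the mat".split()
span_start, span_len = 2, 2  # corrupt "sat on" (chosen by hand)

sentinel = "<extra_id_0>"
corrupted = tokens[:span_start] + [sentinel] + tokens[span_start + span_len:]
target = [sentinel] + tokens[span_start:span_start + span_len] + ["<extra_id_1>"]

print(corrupted)  # ['the', 'cat', '<extra_id_0>', 'the', 'mat']
print(target)     # ['<extra_id_0>', 'sat', 'on', '<extra_id_1>']
```

The encoder reads the corrupted sequence with full bidirectional context, while the decoder reconstructs the missing span autoregressively, which is why this objective exercises both halves of the architecture.
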

Characteristics

  • Dual Components: These models use an encoder to process the input sequence and a decoder to generate the output sequence, making them highly versatile.
  • Contextual Understanding: By leveraging both encoder and decoder, these models can effectively translate, summarize, and generate text.

Applications

  • Translation: Originally designed for translation tasks, sequence-to-sequence models excel in converting text from one language to another while preserving meaning and context.
  • Text Summarization: These models are also highly effective in summarizing long texts into concise and informative summaries.

Examples: T5, FLAN-T5

The T5 (Text-to-Text Transfer Transformer) model and its fine-tuned version, FLAN-T5, are well-known examples of encoder-decoder models. These models have been successfully applied to a wide range of generative language tasks, including translation, summarization, and question-answering.

Summary

In conclusion, transformer-based foundation models are categorized into three distinct types, each with unique training objectives and applications:

  1. Encoder-Only Models (Autoencoding): Best suited for tasks like text classification and semantic similarity search, with BERT being a prime example.
  2. Decoder-Only Models (Autoregressive): Ideal for generative tasks such as text generation and question-answering, with examples including GPT-3, Falcon, and LLaMA.
  3. Encoder-Decoder Models (Sequence-to-Sequence): Versatile models excelling in translation and summarization tasks, represented by models like T5 and FLAN-T5.

Understanding the strengths and applications of each variant helps in selecting the appropriate model for specific NLP tasks, leveraging the full potential of transformer-based architectures.

Singapore Airlines' Digital Transformation Story

Singapore Airlines (SIA) has embarked on a comprehensive digital transformation journey to maintain its competitive edge and meet the evolving needs of its customers. This transformation focuses on enhancing operational efficiency, improving customer experiences, and fostering innovation. Below are some of the key initiatives and successes from SIA's digital transformation journey.

Vision for the Future

SIA's vision is to provide a seamless and personalized customer experience by improving customer service and engagement and adopting intelligent and intuitive digital solutions. The airline is committed to launching digital innovation blueprints, investing heavily in enhancing digital capabilities, doubling down on digital technology, and embracing digitalization across all operations. The establishment of KrisLab, SIA's internal innovation lab, further underscores its commitment to fostering a culture of continuous improvement and innovation.

Key Initiatives and Successes

1. iCargo Platform

As part of its ongoing digital transformation, SIA implemented iCargo, a digital platform for air cargo management. This platform enables the airline to scale its online distribution and integrate seamlessly with partners, such as distribution channels and marketplaces. By leveraging iCargo, SIA has significantly improved its cargo operations, making them more efficient and customer-centric.

2. Digital Enhancements and Automation by Scoot

Scoot, SIA's low-cost subsidiary, also continued to invest in digital enhancements and automation to drive greater self-service capabilities and efficiencies. These efforts aimed to improve the customer experience by providing a rearchitected website that supports hyper-personalization, reinstating self-help check-in facilities, and offering home-printed boarding passes. These innovations have contributed to a smoother and more convenient travel experience for Scoot's customers.

3. Comprehensive Upskilling Programme

Upgrading the skills of its workforce has been a key priority for SIA, especially since the onset of the pandemic. The airline launched a comprehensive upskilling programme to equip employees with future-ready skills, focusing on areas such as Change Management, Digital Innovation, and Design Thinking. This initiative ensures that SIA's workforce remains resilient and capable of driving the airline's digital transformation forward.

Conclusion

Singapore Airlines' digital transformation journey exemplifies how a leading airline can leverage digital technologies to enhance its operations, improve customer experiences, and stay ahead in a competitive industry. By investing in platforms like iCargo, enhancing digital capabilities at Scoot, and upskilling its workforce, SIA has positioned itself as a forward-thinking airline ready to meet the challenges of the future.

First Principle Thinking - A Path to Innovative Problem-Solving

In the realm of problem-solving and innovation, one approach stands out for its ability to strip down complex issues and build solutions from the ground up: first principle thinking. Rooted in the ideas of Aristotle and championed by modern innovators like Elon Musk, first principle thinking encourages us to challenge assumptions and break down problems into their fundamental truths. In this blog post, we'll explore what first principle thinking is, how it works, and how you can apply it to your personal and professional challenges.

Understanding First Principle Thinking

First principle thinking is a method of problem-solving that involves breaking down a complex problem into its most basic, fundamental elements. Instead of reasoning by analogy—where solutions are based on past experiences or conventional wisdom—first principle thinking dives deeper to uncover the core principles that are universally true.

Example of First Principle Thinking

To illustrate first principle thinking, let’s consider Elon Musk’s approach to reducing the cost of space travel. Traditionally, space rockets were extremely expensive, primarily because they were designed to be single-use. Most aerospace companies accepted this as an inevitable cost. However, Musk questioned this assumption and broke the problem down to first principles:

  1. What are the fundamental materials needed to build a rocket?
  2. How much do these materials cost in the open market?
  3. How can we design a rocket that maximizes reusability?

By breaking down the problem and rethinking the design of rockets, Musk’s company, SpaceX, developed reusable rockets, drastically reducing the cost of space travel.

Steps to Apply First Principle Thinking

1. Identify and Define the Problem

The first step is to clearly identify and define the problem you are trying to solve. Be specific about what you want to achieve and the obstacles you face.

2. Break Down the Problem

Dissect the problem into its fundamental components. Ask questions like:

  • What are the basic principles or elements involved?
  • What do we know for sure about this problem?

3. Challenge Assumptions

Critically analyze each component and challenge existing assumptions. Why are things done this way? Are there alternative ways to look at this component? This step requires you to be skeptical and open-minded.

4. Rebuild from the Ground Up

Using the insights gained from the previous steps, start rebuilding your solution from the ground up. Focus on the fundamental truths you have identified and use them to guide your thinking. This approach often leads to innovative solutions that might not have been apparent using conventional methods.

Benefits of First Principle Thinking

1. Innovation

By questioning assumptions and breaking down problems to their core elements, first principle thinking often leads to groundbreaking innovations. It allows you to see problems from a fresh perspective and find solutions that others might overlook.

2. Clarity and Focus

This approach helps you gain a deeper understanding of the problem at hand. It forces you to focus on what truly matters, eliminating noise and distractions.

3. Improved Problem-Solving Skills

Practicing first principle thinking enhances your problem-solving skills. It trains you to think critically, question assumptions, and develop a structured approach to tackling complex issues.

Applying First Principle Thinking in Different Contexts

In Business

Businesses can use first principle thinking to innovate and stay competitive. For example, instead of following industry norms, companies can analyze their processes and products from the ground up to find cost-effective and efficient solutions.

In Personal Development

On a personal level, first principle thinking can help in setting and achieving goals. By understanding the fundamental reasons behind your goals and the obstacles you face, you can create a more effective plan for personal growth.

In Technology

The tech industry is ripe for first principle thinking. From software development to hardware engineering, questioning established norms and breaking down problems can lead to significant advancements and new technologies.

Conclusion

First principle thinking is a powerful tool for anyone looking to solve complex problems and drive innovation. By breaking down issues to their fundamental truths and challenging existing assumptions, you can uncover new insights and develop solutions that are both effective and groundbreaking. Whether in business, personal development, or technology, adopting a first principles approach can transform the way you think and lead to remarkable results. Start practicing first principle thinking today, and unlock the potential for innovation and excellence in every aspect of your life.