The graphics processing unit (GPU) performs graphics computation, and its architecture has evolved from a fixed-function pipeline to a programmable unified pipeline. A unified architecture enables dynamic load balancing and sustains the highly parallel computation of the GPU. This paper presents the design and implementation of a unified-architecture GPU. The unified shader is based on the SIMD and SIMT architectures. At the thread level, SIMT keeps the unified shader fully loaded through thread management and scheduling. At the instruction level, SIMD controls the execution of the unified shader hardware units. We have completed the algorithm design, the architecture design, and the Verilog RTL implementation. Verification results on an FPGA show that the proposed GPU works correctly and that its vertex and fragment processing throughput reaches one unit per clock cycle.
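
The two levels described above can be illustrated with a small behavioral sketch. It is not the RTL design presented in this paper; it is a minimal software model under stated assumptions. The lane width `LANES`, the number of shader units `num_units`, the stand-in shader programs, and all function names are hypothetical and chosen only for illustration. The scheduler groups vertex and fragment threads into warps and issues them to whichever unit is free (thread level, SIMT), and each unit applies one operation to all lanes of a warp (instruction level, SIMD).

```python
from collections import deque
from dataclasses import dataclass

LANES = 4  # hypothetical SIMD width; the source does not state one


@dataclass
class Warp:
    kind: str   # "vertex" or "fragment"
    data: list  # one input value per thread, at most LANES entries


def simd_execute(op, operands):
    """Instruction level: apply one operation to every lane of a warp."""
    return [op(x) for x in operands]


def run_unified(vertex_work, fragment_work, num_units=2):
    """Thread level: pack threads into warps and issue them to any free unit."""
    # Stand-in shader programs (assumptions, not the paper's shaders).
    programs = {"vertex": lambda x: x * 2, "fragment": lambda x: x + 1}

    queue = deque()
    for kind, work in (("vertex", vertex_work), ("fragment", fragment_work)):
        for i in range(0, len(work), LANES):
            queue.append(Warp(kind, work[i:i + LANES]))

    results = {"vertex": [], "fragment": []}
    issued_per_cycle = []
    while queue:
        # Each unit takes the next ready warp, whatever its shader type,
        # so no unit idles while work of the other type is waiting.
        batch = [queue.popleft() for _ in range(min(num_units, len(queue)))]
        issued_per_cycle.append([w.kind for w in batch])
        for warp in batch:
            program = programs[warp.kind]
            results[warp.kind].extend(simd_execute(program, warp.data))
    return results, issued_per_cycle
```

For example, `run_unified([1, 2], [10, 20, 30, 40, 50])` issues one vertex warp and one fragment warp in the first modeled cycle and the remaining fragment warp in the second, which is the load-balancing behavior a unified shader pool is meant to provide. A fixed split into dedicated vertex and fragment units would leave the vertex unit idle in the second cycle.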