Jihun Park

I'm an M.S.-Ph.D. integrated course student in the Department of Artificial Intelligence at DGIST, South Korea, working under the supervision of Prof. Sunghoon Im at the DGIST Computer Vision Lab.

I earned my B.S. degree in Mechanical Engineering from Zhejiang University in 2022, and most recently worked as a generative model research intern at Baidu Global Business Unit in Shenzhen, China. My research interests lie in computer vision and deep learning, with a focus on Image/Video Generation (Autoregressive, Diffusion), Style Transfer, and Vision-Language Tasks.

Publications

* indicates equal contribution. † indicates the corresponding author.

SPL 2026
CascadeOcc
CascadeOcc: Rethinking 3D Occupancy World Models with Cascaded VQ Representations
Kyumin Hwang*, Wonhyeok Choi*, Jaeyeul Kim, Jihun Park, Daehee Park†, Sunghoon Im†
IEEE Signal Processing Letters (SPL, IF: 3.9), 2026
This letter proposes CascadeOcc, a novel occupancy world model that prioritizes intrinsic structural hierarchy over extrinsic auxiliary modalities for autonomous driving. Occupancy world models—forecasting the future driving environment and planning the driving trajectory—effectively bridge perception and planning, but current approaches often heavily rely on external modalities or large language models, failing to fully exploit the inherent structural potential of occupancy representations themselves. To enhance representational capacity for complex 3D scenes, we integrate a cascaded Vector Quantized (VQ) mechanism into an autoregressive framework. Following a coarse-to-fine principle, CascadeOcc progressively refines fine-grained details from global structures through a multi-scale architecture. Additionally, we incorporate a TimeMixer to capture multi-scale temporal dependencies, establishing a dual-hierarchy mechanism in both space and time. Experimental results on 4D occupancy forecasting and motion planning benchmarks demonstrate that CascadeOcc achieves superior performance among vision-centric approaches, validating that optimizing inherent representations is a powerful alternative to relying on external foundation models.
CVPR 2026
TaskForce
TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
Wonhyeok Choi, Kyumin Hwang, Jihun Park, Kyoungmin Lee, Seunghun Lee, Jaeyeul Kim, Minwoo Choi, Sunghoon Im†
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Multi-task learning (MTL) involves the simultaneous optimization of multiple task-specific losses, often leading to gradient conflicts and scale imbalances that result in negative transfer. While existing multi-task optimization methods attempt to mitigate these challenges, they either lack the stochasticity needed to escape poor local minima or fail to explicitly address conflicts at the gradient level. In this work, we propose TaskForce, a novel multi-task optimization framework incorporating cooperative multi-agent reinforcement learning (MARL), where agents learn to find an effective joint optimization strategy based on their respective task gradients and losses. To keep the optimization process compact yet informative, agents observe a summary of the training dynamics that consists of the gradient Gram matrix (capturing both gradient magnitudes and pairwise alignments) and task loss values. Each agent then predicts the balancing parameters that determine the weight of its contribution to the final gradient update. Crucially, we design a hybrid reward function that incorporates both gradient-based signals and loss improvement dynamics, enabling agents to effectively resolve gradient conflicts and avoid poor convergence by considering both direct gradient information and the resulting impact on loss reduction. TaskForce achieves consistent improvements over state-of-the-art MTL baselines on NYU-v2, Cityscapes, and QM9, demonstrating the promise of cooperative MARL in complex multi-task scenarios.
CVPR 2026
Style Personalization
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Kyoungmin Lee*, Jihun Park*, Jongmin Gim*, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im†
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Extensive experiments demonstrate that, without any additional training, our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
CVPR-F 2026
FREESTYLE
FREESTYLE: An Anchor-Free Mechanism for Training-Free Style-Aligned Image Generation
Minseok Oh*, Jihun Park*, Jongmin Gim, Minwoo Choi, Kyoungmin Lee, Ferdinando Fioretto†, Sunghoon Im†
IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPR Findings), 2026
Text-to-Image (T2I) generation models have become central to creative workflows, where producing style-consistent image sets is crucial for applications such as visual identity design, illustration, and asset creation. While recent training-free methods have enabled efficient style-aligned synthesis, they commonly rely on anchor-dependent propagation, where style features from a single reference image are shared across the batch. This dependency often leads to inconsistent results: when the anchor fails to capture the intended reference style, its unintended artifacts are propagated to all samples, and excessive feature sharing further causes content leakage and semantic distortion. To overcome these limitations, we propose FREESTYLE, a training-free and anchor-free framework for robust style alignment within a scale-wise autoregressive generation paradigm. Our method integrates three core components—Majority Voting (MV), which aggregates dominant style cues across the batch to form a representative style feature; Majority Style Injection (MSI), which adaptively injects these aggregated features to enforce global style coherence; and Set-Based Projection (SBP), which refines local object regions by projecting them onto a shared style manifold for context-aware adaptation. Extensive experiments demonstrate that, without any retraining or parameter updates, FREESTYLE achieves state-of-the-art performance in both style consistency and content preservation, while maintaining real-time inference efficiency.
ICEIC 2026
BriGeS
Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation
Sanggyun Ma*, Wonjoon Choi*, Jihun Park*, Jaeyeul Kim, Seunghun Lee, Jiwan Seo, Sunghoon Im†
The 25th International Conference on Electronics, Information, and Communication (ICEIC), 2026
We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pretrained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
AAAI 2026
Infinite-Story
Infinite-Story: A Training-Free Consistent Text-to-Image Generation
Jihun Park*, Kyoungmin Lee*, Jongmin Gim*, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, Sunghoon Im†
The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026
Oral Paper · Encouragement Prize, 32nd HumanTech Paper Award
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
ICCVw 2025
Semantic Depth
Semantic-Enhanced Monocular Depth Estimation via Fusion and Distillation of Foundation Models
Sanggyun Ma*, Wonjoon Choi*, Jihun Park, Jaeyeul Kim, Sunghoon Im†
International Conference on Computer Vision Workshop (ICCVw), 2025 / ICEIC 2026
CVPR 2025
Style-Editor
Style-Editor: Text-driven Object-centric Style Editing
Jihun Park*, Jongmin Gim*, Kyoungmin Lee*, Seunghun Lee, Sunghoon Im†
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Highlight Paper (Top 3.7%) · Encouragement Prize, 30th HumanTech Paper Award
We present Style-Editor, a text-driven object-centric style editing model that guides style editing at the object level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, designed for precise object-centric editing that is closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for an even CLIP embedding distribution across object regions, ensuring seamless and harmonious style editing across those regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules, which identify object locations via text and eliminate the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss, applied to dynamically identified background areas, to maintain the original style and structural essence of the image's background. Extensive experiments underline the effectiveness of our approach in producing visually coherent and textually aligned style edits.
ACCV 2024
ACCV
Content-Adaptive Style Transfer: A Training-Free Approach with VQ Autoencoders
Jongmin Gim*, Jihun Park*, Kyoungmin Lee*, Sunghoon Im†
17th Asian Conference on Computer Vision (ACCV), 2024
We introduce Content-Adaptive Style Transfer (CAST), a novel training-free approach for arbitrary style transfer that enhances visual fidelity using a pretrained vector-quantized autoencoder. Our method systematically applies coherent stylization to corresponding content regions. It starts by capturing the global structure of images through vector quantization, then refines local details using our style-injected decoder. CAST consists of three main components: a content-consistent style injection module, which tailors stylization to unique image regions; an adaptive style refinement module, which fine-tunes stylization intensity; and a content refinement module, which ensures content integrity through interpolation and feature distribution maintenance. Experimental results indicate that CAST outperforms existing generative and traditional style transfer models in both quantitative and qualitative measures.

Education & Experience

Generative Model Research Intern (Advisor: Yan Zhang)

Baidu Global Business Unit, Shenzhen, China

Dec. 2025 – Mar. 2026

M.S.-Ph.D. Integrated Course (Advisor: Sunghoon Im)

Department of Artificial Intelligence, DGIST, South Korea

Mar. 2023 – Present

Bachelor's Degree

Department of Mechanical Engineering, Zhejiang University, China

Sep. 2018 – Jul. 2022

Chungnam Samsung Academy, South Korea

Mar. 2015 – Feb. 2018
